Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-33] SSD example not working with mkl-dnn #10021

Merged
merged 11 commits into from
Apr 24, 2018

Conversation

ashokei
Copy link
Contributor

@ashokei ashokei commented Mar 7, 2018

Description

SSD VGG16-reduced example is not working if pooling convention is "full" with Max pool. This PR resolves the this issue by defaulting to CPU implementation when pooling convention is "full".

Checklist

Essentials

  • [ ✅ ] Passed code style checking (make lint)
  • [ ✅ ] Changes are complete (i.e. I finished coding on this PR)

Changes

  • disable mkl-dnn if pooling convention is "full"

Comments

  • This PR is to resolve the crash seen. A new PR may be needed to implement "full" pooling convention with mkl-dnn.

@ashokei
Copy link
Contributor Author

ashokei commented Mar 7, 2018

@zheng-da @TaoLv please review. I notice we do not have proper "full" pooling convention support, this PR disables mkl-dnn pooling for "full" convention;

We will probably need a new PR to implement the "full" convention feature with mkl-dnn for both Max and Avg Pooling cases.

@TaoLv
Copy link
Member

TaoLv commented Mar 7, 2018

There is a test case for max pooling with full convention: https://github.com/apache/incubator-mxnet/blob/master/tests/python/gpu/test_operator_gpu.py#L841
But it isn't observed that this case failed before. If the test case doesn't cover the corner case you mentioned, maybe we need add new test case for it.

@@ -92,6 +92,8 @@ inline bool SupportMKLDNNPooling(const PoolingParam &param,

if (param.pooling_convention == pool_enum::kValid)
return true;
else
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes the later code unreachable. Are you sure this is what you want?

Copy link
Contributor Author

@ashokei ashokei Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piiswrong the later code is being disabled for now until we support all "full" pooling convention cases. As @TaoLv mentioned, i will add a unit-test for this failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use "#if 0" to comment the checks below?

@ashokei ashokei changed the title [MXNET-33] Mkldnn pooling convention crash [WIP] [MXNET-33] Mkldnn pooling convention crash Mar 7, 2018
Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test that produces the crash

@ashokei
Copy link
Contributor Author

ashokei commented Mar 8, 2018

@marcoabreu @TaoLv @zheng-da Unit-tests currently do not seem to test backward/training mode for pooling very well. Below is the test for reproducing this crash.

ln -s $(pwd)/example/ssd/symbol/vgg16_reduced.py $(pwd)/example/image-classification/symbols/vgg16_reduced.py
python example/image-classification/train_cifar10.py --network vgg16_reduced

This PR will fix the above crash.

@ashokei ashokei changed the title [WIP] [MXNET-33] Mkldnn pooling convention crash [MXNET-33] Mkldnn pooling convention crash Mar 8, 2018
@marcoabreu
Copy link
Contributor

Feel free to develop and add an appropriate unit test yourself. Please don't hesitate to reach out to us in case you need assistance.

@ashokei ashokei changed the title [MXNET-33] Mkldnn pooling convention crash [MXNET-33] SSD example not working with mkl-dnn Mar 14, 2018
@ashokei
Copy link
Contributor Author

ashokei commented Mar 14, 2018

@marcoabreu this issue only happens inside a particular model (SSD VGG16-reduced), i clarified in title/description. I'm not sure a special unittest just for this specific model is appropriate. I am not able to reproduce this issue on separately. please let me know if any other suggestions/feedback, thanks!.

@marcoabreu
Copy link
Contributor

That's not an issue, we're already testing a variety of models and the bigger diversity, the better. We can re-visit this decision later on and reduce the test size if required. For now, it would be good to get a test since we would like to re-enable pooling for MKL-DNN at a later point in time - this test would give us a tool to verify that it's actually working again.

"""

import mxnet as mx
model = os.path.join(os.path.dirname(os.path.realpath(__file__)), "test_model.json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move this file to a subdirectory like tests/python/cpu/data/test_mkldnn_ test_mkldnn_model_model1.json? (format is %FILENAME%_%TESTNAME%_model%NUMBER%.json)?

"variable settings"


def test_mkldnn_model():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain that this test has no asserts because it is only supposed to run successfully without failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marcoabreu i added assert for sanity check, and renamed file as you mentioned. any other suggestions , please let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!


// need to support pooling convention full
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a link to the issue as part of the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean , link to this PR ? or JIRA ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To jira

"variable settings"


def test_mkldnn_model():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ashokei ashokei force-pushed the mkldnn_pooling_convention_valid branch from 773a5af to 6ca171a Compare March 16, 2018 22:52
@zheng-da zheng-da mentioned this pull request Mar 19, 2018
5 tasks
@pengzhao-intel
Copy link
Contributor

#10092
@marcoabreu could you help review again and merge?

@zheng-da
Copy link
Contributor

@ashokei @pengzhao-intel do you know when the Intel MKLDNN team will fix the bug in MKLDNN?

@@ -0,0 +1,94 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests in tests/python/cpu are currently not run in CI. Please add it to the MKLDNN jobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marcoabreu This PR only adds to existing test, no new test file is created. Can you be specific which CI jobs you are referring to ? If there is a gap in CI, a new JIRA item can be created to address it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://github.com/apache/incubator-mxnet/blob/master/Jenkinsfile#L439, which in the end points to https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L387, we're not running the tests located in tests/python/cpu. Please rename tests/python/cpu to tests/python/mkl and add another function in the same style as https://github.com/apache/incubator-mxnet/blob/master/Jenkinsfile#L458.

@ashokei
Copy link
Contributor Author

ashokei commented Mar 26, 2018

@zheng-da This is not a bug in MKL-DNN, it is an issue with current MKL-DNN mxnet integration. This PR will use CPU fallback.

@zheng-da
Copy link
Contributor

zheng-da commented Mar 27, 2018

what i mean is that mkldnn doesn't support "full" pooling convention. will mkldnn eventually support it? if it does, when will it support it?

@ashokei
Copy link
Contributor Author

ashokei commented Mar 27, 2018

@zheng-da sorry for the confusion, mkl-dnn has more general pooling support, it allows us to implement any arbitrary convention we like. So it is not bug in mkl-dnn, it is just that we havent integrated this specific use-case in our mxnet mkldnn pooling implementation yet.

@zheng-da
Copy link
Contributor

oh, i see. do you mind implementing it? fallback is fine, but it'll be better if it can invoke the one in MKLDNN.

@ashokei
Copy link
Contributor Author

ashokei commented Mar 27, 2018

yes, we plan to add new features for pooling implementation with mkl-dnn in future PR.
this PR just fixes the crash in SSD, and this specific case (maxpool with fullconvention) is only used in 1 layer, and not common.

@ashokei
Copy link
Contributor Author

ashokei commented Mar 27, 2018

@marcoabreu does this look ok, thanks.

@ashokei ashokei force-pushed the mkldnn_pooling_convention_valid branch from 7e63aa1 to ed0f909 Compare March 27, 2018 23:30
@ashokei
Copy link
Contributor Author

ashokei commented Mar 27, 2018

@marcoabreu fixed now

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, thanks for fixing the example :)

@ashokei ashokei force-pushed the mkldnn_pooling_convention_valid branch from ed0f909 to 8d4171d Compare April 2, 2018 23:36
@anirudh2290
Copy link
Member

@ashokei can you please check if the CI failure is related to your change and retrigger it if not ?

@ashokei ashokei force-pushed the mkldnn_pooling_convention_valid branch from 8d4171d to f6cbd76 Compare April 10, 2018 00:18
@ashokei
Copy link
Contributor Author

ashokei commented Apr 10, 2018

@anirudh2290 done...thanks.

@anirudh2290
Copy link
Member

@piiswrong is this good to merge ?

@zheng-da
Copy link
Contributor

@piiswrong this PR is to fix a bug in the MKLDNN integration. MKLDNN pooling can't handle full pooling convention correctly.
do you have any comments for this PR? if not, can you merge it?

@ashokei ashokei force-pushed the mkldnn_pooling_convention_valid branch from f6cbd76 to 079fba2 Compare April 18, 2018 00:19
@ashokei ashokei force-pushed the mkldnn_pooling_convention_valid branch from 079fba2 to 4fbee8d Compare April 23, 2018 18:48
@ashokei
Copy link
Contributor Author

ashokei commented Apr 23, 2018

@piiswrong can you please merge this, this PR also has mkldnn jenkins/unittest as requested by @marcoabreu

@ashokei ashokei mentioned this pull request Apr 24, 2018
7 tasks
@piiswrong piiswrong merged commit fd55a2e into apache:master Apr 24, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* use mkl-dnn for 'valid' pooling_convention only

* pooling convention full not supported by current mkl-dnn impl

* disable unreachable code

* add sample model test for mkldnn

* fix review feedback

* add jira link to comment

* fix lint issue

* rename python test for mkl

* enable python tests for mkldnn in CI

* use vgg16 with convention full

* fix unittest
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 8, 2018
* use mkl-dnn for 'valid' pooling_convention only

* pooling convention full not supported by current mkl-dnn impl

* disable unreachable code

* add sample model test for mkldnn

* fix review feedback

* add jira link to comment

* fix lint issue

* rename python test for mkl

* enable python tests for mkldnn in CI

* use vgg16 with convention full

* fix unittest
anirudh2290 pushed a commit that referenced this pull request Jun 13, 2018
* use mkl-dnn for 'valid' pooling_convention only

* pooling convention full not supported by current mkl-dnn impl

* disable unreachable code

* add sample model test for mkldnn

* fix review feedback

* add jira link to comment

* fix lint issue

* rename python test for mkl

* enable python tests for mkldnn in CI

* use vgg16 with convention full

* fix unittest
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* use mkl-dnn for 'valid' pooling_convention only

* pooling convention full not supported by current mkl-dnn impl

* disable unreachable code

* add sample model test for mkldnn

* fix review feedback

* add jira link to comment

* fix lint issue

* rename python test for mkl

* enable python tests for mkldnn in CI

* use vgg16 with convention full

* fix unittest
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants