-
Notifications
You must be signed in to change notification settings - Fork 22.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PyTorch] AOTI: add minimal arrayref interface #112800
Conversation
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112800
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 796da43 with merge base de4b2e5 (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! ghstack-source-id: 206327561 Pull Request resolved: #112800
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! [ghstack-poisoned]
Pull Request resolved: #112800 This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. ghstack-source-id: 206414952 @exported-using-ghexport Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!
@@ -293,5 +291,137 @@ struct ThreadLocalCachedOutputTensor<ArrayRefTensor<T>> { | |||
RAIIAtenTensorHandle tensor_; | |||
}; | |||
|
|||
template <typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we should move all the codes below to another place, e.g. torch/csrc/inductor/aoti_torch/tensor_converter.cpp. interface.cpp
is supposed to contain the functionalities used by the user of the generated DSO. Although we don't export the structs and functions, it still looks a bit strange to keep them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need a place that is inside the DSO. AIUI we don't actually support compiling more than one .cpp file into the DSO right now. I suppose we can add another .cpp file that is copy/pasted into the generated file like interface.cpp if you like.
torch/_inductor/codegen/wrapper.py
Outdated
dtype = may_get_constant_buffer_dtype(input) | ||
assert ( | ||
dtype is not None | ||
), f"Failed to get the dtype of sympy.Expr: {graph_input}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps s/graph_input/input/ ?
|
||
self.suffix.splice( | ||
""" | ||
extern "C" AOTIRuntimeError AOTInductorModelRunMinimalArrayrefInterface( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if it would be possible to put the interface API into interface.h/interface.cpp? Because these are user-facing API, it seems to be better to me if we could make them explicit in the actual interface file instead of relying on codegen. We could codegen macro to disable the these MinimalArrayrefInterface based on config.use_minimal_arrayref_interface
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this implementation needs to come after the definition of AOTInductorModel{Inputs,Outputs}, which is (and must be) in the generated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just #include
a file with the needed boilerplate? We do that in a few other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose so, but it's like 10 lines that would be entirely out of context, and IMO that would be a net decrease in readability and maintainability.
@@ -276,6 +276,29 @@ AOTI_TORCH_EXPORT AOTITorchError aoti_torch_proxy_executor_call_function( | |||
|
|||
#ifdef __cplusplus | |||
} // extern "C" | |||
|
|||
template <typename T> | |||
int32_t aoti_torch_dtype(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be a little strange to put templates in a C interface. Could we use C-style function like:
inline int32_t aoti_torch_dtype_float() {
return aoti_torch_dtype_float32;
}
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have those functions. I need a template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(the usage is in the internal diff)
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
Pull Request resolved: #112800 This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. ghstack-source-id: 206636549 @exported-using-ghexport Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!
convert_handles_to_inputs(input_handles, inputs); | ||
auto outputs = run_impl_minimal_arrayref_interface<AOTInductorModelInputs, AOTInductorModelOutputs>( | ||
inputs, stream, proxy_executor); | ||
// NOTE: outputs is full of ArrayRef to thread_local storage. If in the future we need this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be different from the semantics of outputs
in the regular path, where we transfer ownership of output tensors to the user code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is different. I do not think that that is a problem; changing the way the inputs and outputs are handled is the point of changing the interface.
""" | ||
extern "C" AOTIRuntimeError AOTInductorModelRunMinimalArrayrefInterface( | ||
AOTInductorModelHandle model_handle, | ||
const AOTInductorModelInputs& inputs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AOTInductorModelInputs
is defined to be a std::tuple
. Is it fine to pass a C++ construct across the C interface boundary? In particular, because the compiler for compiling the interface code may not be the same as the one used for compiling the user code that invokes the interface, would this impose any potential C++ ABI issue like different calling conventions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the compiler for compiling the interface code may not be the same as the one used for compiling the user code that invokes the interface, would this impose any potential C++ ABI issue like different calling conventions?
the compiler probably doesn't need to match, but the C++ standard library version certainly does. I wouldn't expect gcc vs clang problems unless somebody had gcc configured to use libstdc++ and clang configured to use libc++, but using different major versions of MSVC would be an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tuple enables efficient straightforward implementation of convert_outputs_to_handles
and convert_handles_to_inputs
. We could address this theoretical ABI compatibility concern by copying into a struct at the interface boundary, but I'm not excited about spending additional nanoseconds doing so unless you can provide an example of a realistic setup where we need to worry about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
different major versions of MSVC would be an issue
apparently hasn't been an issue for 8 years: https://learn.microsoft.com/en-us/cpp/porting/binary-compat-2015-2017?view=msvc-170
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
Pull Request resolved: #112800 This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. ghstack-source-id: 206745906 @exported-using-ghexport Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!
functorch failure doesn't repro internally either on the base of this diff or on this diff, so I'm assuming it's unrelated noise coming from whatever rebasing happened during export |
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged) |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
) We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface. Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/) Pull Request resolved: #113577 Approved by: https://github.com/chenyang78, https://github.com/jansel ghstack dependencies: #112800
Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway. Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/) Pull Request resolved: #113578 Approved by: https://github.com/khabinov, https://github.com/Neilblaze ghstack dependencies: #112800, #113577
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! Pull Request resolved: pytorch#112800 Approved by: https://github.com/chenyang78
…rch#113577) We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface. Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/) Pull Request resolved: pytorch#113577 Approved by: https://github.com/chenyang78, https://github.com/jansel ghstack dependencies: pytorch#112800
Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway. Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/) Pull Request resolved: pytorch#113578 Approved by: https://github.com/khabinov, https://github.com/Neilblaze ghstack dependencies: pytorch#112800, pytorch#113577
This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! Pull Request resolved: pytorch#112800 Approved by: https://github.com/chenyang78
…rch#113577) We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface. Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/) Pull Request resolved: pytorch#113577 Approved by: https://github.com/chenyang78, https://github.com/jansel ghstack dependencies: pytorch#112800
Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway. Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/) Pull Request resolved: pytorch#113578 Approved by: https://github.com/khabinov, https://github.com/Neilblaze ghstack dependencies: pytorch#112800, pytorch#113577
Stack from ghstack (oldest at bottom):
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.
This took a while to get right (e.g., I initially required 1-D
MiniArrayRef for the inputs, but found that multi-dimensional
ArrayRefTensor ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.
Differential Revision: D50699890
NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler