Add the option to disable deep copying of large MemoryDataSet objects #1258
Hi @jstammers, thank you for the detailed issue and solution. My initial reaction when I saw your title was to make sure you were aware of the `copy_mode` option. You are able to do this as well:

```python
catalog = DataCatalog({"input": MemoryDataSet(copy_mode="assign", data=data)})
```

I'd be interested to see if other users in the community think this would be useful as a native runner. Or, as an alternative, we could include an example in the docs like we do for the `DryRunner`.
Hi @datajoely, thanks for the quick response. In my use-case, the object is created as an output from one node and used as an input to a subsequent one, so I didn't bother to explicitly define it in my catalog. If I do include it in the catalog:

```python
catalog = DataCatalog({..., "model": MemoryDataSet(copy_mode="assign", data=None)})
```

then it does indeed avoid the deep copy, although this might be quite verbose for pipelines that have a lot of intermediate nodes whose data remains in memory.
Ah understood - you can also use that.
Hi @jstammers, this behaviour is currently possible if you use the `copy_mode="assign"` option discussed above.
Description
I have a suite of unit tests for a pipeline which test the functionality of each node in isolation. This includes a final test that runs the entire pipeline on a small set of input data. I've noticed that this final test runs much slower than the others, which I found to be the result of some `deepcopy` operations. In my pipeline, I think this is due to a `spacy` model that gets loaded in one node, with additional nodes that add components to the model. Looking into the code for `SequentialRunner`, I can see that the default behaviour for datasets not in the catalog is to create `MemoryDataSet` objects using their default parameters (kedro/kedro/runner/sequential_runner.py, lines 30 to 41 in 035f463).

In the case of objects that are pandas DataFrames or numpy arrays, this performs a deep copy of the object. It would be useful to have an option to override this default in cases where run-time or memory usage need to be considered.
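To make the overhead concrete, here is a minimal stand-in (pure Python, no kedro; `SpacyModelStub` is a hypothetical class for illustration) that counts how often `copy.deepcopy` touches an object, which is what happens on every save and load of a default `MemoryDataSet`:

```python
import copy

class SpacyModelStub:
    """Stand-in for a large model object; counts how often it is deep-copied."""
    copies = 0

    def __deepcopy__(self, memo):
        # copy.deepcopy calls this hook instead of copying attribute-by-attribute
        SpacyModelStub.copies += 1
        return SpacyModelStub()

model = SpacyModelStub()
# Each save/load with the default copy mode triggers a deep copy like this:
_ = copy.deepcopy(model)  # e.g. on save
_ = copy.deepcopy(model)  # e.g. on load
assert SpacyModelStub.copies == 2
```

For a genuinely large object (a loaded spacy pipeline, a big DataFrame), each of these copies costs real time and memory, which is why the overhead multiplies with the number of intermediate datasets.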
Context
Copying `MemoryDataSet` objects is useful in cases where two nodes receive the same input and each outputs a mutated version of it. But in cases where a large Python object is being passed between nodes, copying can result in a large overhead in terms of runtime and memory.
Possible Implementation
As a work-around for my use-case, I've implemented a `TestRunner` class that modifies the default behaviour of `MemoryDataSet`. This has reduced the run-time of my slow test from 17s to 1s.
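The `TestRunner` code itself is not included in the issue; the sketch below shows the idea with self-contained stand-ins rather than kedro itself (the class and method names mirror kedro's runner API at the time, but this is illustrative, not kedro's actual implementation). The hook that creates default datasets is overridden to return a dataset that stores objects by reference:

```python
import copy

class MemoryDataSet:
    """Minimal stand-in: deep-copies on save/load unless copy_mode='assign'."""
    def __init__(self, data=None, copy_mode="deepcopy"):
        self._data = data
        self._copy_mode = copy_mode

    def save(self, data):
        self._data = data if self._copy_mode == "assign" else copy.deepcopy(data)

    def load(self):
        return self._data if self._copy_mode == "assign" else copy.deepcopy(self._data)


class SequentialRunner:
    # The runner calls this for any dataset missing from the catalog
    def create_default_data_set(self, ds_name):
        return MemoryDataSet()  # default: deep copies on save and load


class TestRunner(SequentialRunner):
    # Override the default so intermediate results are passed by reference
    def create_default_data_set(self, ds_name):
        return MemoryDataSet(copy_mode="assign")
```

With this override, every intermediate dataset the runner creates skips the deep copy, which is what produces the 17s-to-1s speedup for a pipeline dominated by copying a large model.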