[Feature Request] add support for `.lmdb` #261

S-aiueo32 · 2021-12-17T11:39:00Z

I would like to add .lmdb to the file formats supported by TaskOnKart.make_target(). .lmdb is the format used by several popular datasets, and is actually suitable for handling large datasets (especially images).

For those who are not familiar with .lmdb, here is a brief explanation of its usage. .lmdb is a Key-Value store whose values can be retrieved via the lmdb.Environment object:

import lmdb

# open environment
env = lmdb.open('foo')

# put item
key: bytes = ...
value: bytes = ...
with env.begin(write=True) as txn:
    txn.put(key, value)

# get item
with env.begin() as txn:
    _value = txn.put(key)
    assert value == _value

The following changes are necessary to support .lmdb:

Add LmdbFileProcessor:

class LmdbFileProcessor(FileProcessor):

    def __init__(self, file_path: str) -> None:
        self.__file_path = file_path
        super(LmdbFileProcessor, self).__init__()

    @property
    def env(self) -> lmdb.Environment:
        # To avoid creating `env` in the constructor for paths other than .lmdb. 
        if not hasattr(self, '__env'):
            self.__env = lmdb.open(self.__file_path, subdir=False, lock=True)
        return self.__env

    def load(self, file):
        with self.env.begin(write=False) as txn:
            for k, v in txn.cursor():
                yield k, v

    def dump(self, obj):
        key, value = obj
        with self.env.begin(write=True) as txn:
            txn.put(key, value)

...

def make_file_processor(file_path: str, store_index_in_feather: bool) -> FileProcessor:
    extension2processor = {
        ...
        '.lmdb': LmdbFileProcessor(file_path)
    }

Prevent SingleFileTarget from opening files before dumping.

    def _dump(self, obj) -> None:
        if self._processor.__class__.__name__ == 'LmdbFileProcessor':
            self._processor.dump(obj)
        else:
            with self._target.open('w') as f:
                self._processor.dump(obj, f)

Write tasks!

class DownloadImages(GokartTask):
    def run(self):
        image_ids = self.load()
        for image_id in image_ids:
            image: bytes = imread(image_id)
            self.dump((image_id.encode(), image))

    def output(self):
        return self.make_target(f'{self.__class__.__name__}.lmdb')

class LoadImages(GokartTask):
    def requires(self):
        return DownloadImages()

    def run(self):
        for key, value in self.load():  # load keys and values of previous db
            # do something with the images
            ...

They work perfectly in my environment, but there are a few untested concerns:

It may violate the design philosophy of gokart and luigi to call dump() only once at the end of run().
It may cause problems with parallelization.

I would like to commit what I have sorted out here as a Pull Request, and I would like to discuss it in this issue.

The text was updated successfully, but these errors were encountered:

Hi-king · 2021-12-17T15:31:48Z

Cool feature :)

Mostly I agree to this proposal.

It may violate the design philosophy of gokart and luigi to call dump() only once at the end of run().

I think this is not a problem. With current implementation, we can dump multiple times with named targets as follows

class Task(GokartTask):
    def requires(self):
        return dict(
           a=self.make_target("a.pkl"),
           b=self.make_target("b.pkl"),
       )

    def run(self):
      self.dump(1, "a.pkl")
      self.dump(2, "b.pkl")

Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)

S-aiueo32 · 2021-12-21T04:13:13Z

Thank you for your reply.

For now, I have implemented a way to use SingleFileTarget to support lmdb, and passed the test locally.
https://github.com/S-aiueo32/gokart/tree/4d65e5f359a44d4113945dd82fc2c3171fa973da
In LocalTargetTest, there was an error around the locking of lmdb, so I made sure not to lock it when opening lmdb.environment (probably not a problem since it is locked in other parts of gokart).

Another discussion is whether lmdb should be implemented in SingleFileTarget or create another LmdbTarget :)

I see. It is indeed awkward to have a conditional branch depending on the processor type in SingleFileTarget. It would be better to create a new LmdbTarget that inherits from SingleFileTarget, and rewrite load and dump.

If we create a new LmdbTarget, do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target?

mski-iksm · 2021-12-21T14:39:02Z

@S-aiueo32
Thank you for proposing great feature!

It would be better to create a new LmdbTarget that inherits from SingleFileTarget

I think LmdbTarget should inherite TargetOnKart just like SingleFileTarget or ModelTarget, not to make overcomplicating inheritance dependencies.

do you have an opinion on whether it is preferable to branch in make_target or create a new function like make_lmdb_target

I think creating new function make_lmdb_target is better, just like existing make_model_target.

hirosassa · 2022-04-08T00:18:30Z

@S-aiueo32 Hi,
This is just a friendly reminder (not urgent). What's the status of this issue?

hirosassa added the freature label Dec 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] add support for `.lmdb` #261

[Feature Request] add support for `.lmdb` #261

S-aiueo32 commented Dec 17, 2021

Hi-king commented Dec 17, 2021

S-aiueo32 commented Dec 21, 2021 •

edited

Loading

mski-iksm commented Dec 21, 2021

hirosassa commented Apr 8, 2022

[Feature Request] add support for .lmdb #261

[Feature Request] add support for .lmdb #261

Comments

S-aiueo32 commented Dec 17, 2021

Hi-king commented Dec 17, 2021

S-aiueo32 commented Dec 21, 2021 • edited Loading

mski-iksm commented Dec 21, 2021

hirosassa commented Apr 8, 2022

[Feature Request] add support for `.lmdb` #261

[Feature Request] add support for `.lmdb` #261

S-aiueo32 commented Dec 21, 2021 •

edited

Loading