Skip to content
This repository has been archived by the owner on Sep 11, 2020. It is now read-only.

plumbing/format/idxfile: add new Index and MemoryIndex #896

Merged

Conversation

erizocosmico
Copy link
Contributor

Closes #893

This PR adds the Index and EntryIter interfaces to idxfile and renames Idxfile to MemoryIndex as described in the design document.

The new MemoryIndex now works like the JGit index reader, it loads everything as byte slices. The only change I did that diverges from the JGit implementation is adding a mapping between the fanout index and the slice index to drastically reduce the memory usage.

In JGit they use an array of 256 byte arrays for the name, offset and crc32 tables. This is such an overkill when you don't have a gigantic packfile that uses all slots of the array. By doing this, we only use as much memory as we need.

According to the benchmarks, the speed remains the same decoding the idx file, and the memory consumption is slightly increased. Access to the entries should be way faster now due to how the accesses are done.

Benchmarks have been added for all operations on idxfiles.

Warning

This will leave the repo in a broken state (which is why it's being PR'd to a feature branch), since previously there was a packfile.Index that was writable and now there is not. This is a work in progress that will probably need some of the work being done by @jfontan right now to be fully working.

Everything inside idxfile package is expected to be working correctly, though.

Caveats

As mentioned in #893, some way of retrieving hashes by offset may be needed for the packfile decoder. After some discussion with @jfontan we agreed that it's better to wait until all the pieces are ready to see how they play together before adding such a change. That is only needed when the source is not seekable, so another solution might be to just build a map from offset to hash for those cases.

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
@ajnavarro
Copy link
Contributor

If we are using the Decoder to fill a MemoryIndex struct, the Index interface is useless in my opinion.

We should use the Decoder inside the implementation, and depending on that implementation, use the decoder in different ways (loading everything at the beginning, loading each part on demand, and so on).

So, instead of doing that, we should refactor the decoder to be able to obtain the data, instead of sending a struct to fill it.

If that doesn't make sense, Index interface should be deleted and MemoryIndex should be Index. Doing that, the implementation of new Indexes will be way more difficult.

We can go even further; Instead of having a Decoder, we can have some decode helpers. Using the helpers (basically parsers) we can obtain the data on idx file in different ways, depending on our use case or index implementation.

WDYT?

@erizocosmico
Copy link
Contributor Author

Yeah, right now Index is kind of useless. What is the use case of having other index implementations? Because if we can't think of any case for implementing other indexes maybe there is no point in having that interface right now and we could just have the MemoryIndex be the index.

If there is such a case and we keep the Index interface I think, as you say, we should refactor the decoder so it can be used by any other index instead of passing the MemoryIndex struct. What I don't know is what shape the new Decoder or helpers should have, because the decoding right now is very coupled with how the MemoryIndex does things (fanout and fanout mapping).

@erizocosmico erizocosmico merged commit a8ff3e5 into src-d:perf/packfile-reads Jul 26, 2018
@erizocosmico erizocosmico deleted the feature/new-index-decoder branch July 26, 2018 12:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants