Download link (20MB xz-compressed).
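Since the download is xz-compressed, it has to be unpacked before it can be opened as a database. Below is a minimal sketch using only the Python standard library; the function name and the file names in the usage comment are assumptions for illustration, not the actual archive name.

```python
import lzma
import shutil


def decompress_xz(src_path, dst_path):
    """Stream-decompress an .xz file to dst_path without loading it all into memory."""
    with lzma.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Hypothetical usage (file names are assumptions):
# decompress_xz("duplicates.db.xz", "duplicates.db")
```

The command-line equivalent is `xz -d` on the downloaded file.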
The Duplicates dataset consists of two parts:
- 1989 labeled pairs of Java files.
- 633 labeled pairs of Java functions.
These pairs were labeled by several source{d} employees as "identical", "similar" or "different" in February 2018. We used the src-d/code-annotation web application to perform the labeling. The dataset was created to tune the hyperparameters of src-d/apollo, which was the proof-of-concept for src-d/gemini.
Code similarity is quite subjective, and human labelers may contradict each other in some cases. We've set 3 categories instead of 2 to make the choice easier.
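Because labelers may contradict each other, it can help to reduce the labels for each pair to a majority vote and flag the pairs that were not unanimous. The sketch below assumes the labels have already been loaded as hypothetical `(pair_id, user_id, label)` tuples; this layout and the function name are illustrative and are not the actual database schema.

```python
from collections import Counter, defaultdict


def summarize_labels(assignments):
    """Summarize (pair_id, user_id, label) tuples per pair.

    Returns {pair_id: (majority_label, unanimous)}. On a tie, the label
    that was seen first for that pair wins.
    """
    by_pair = defaultdict(list)
    for pair_id, _user_id, label in assignments:
        by_pair[pair_id].append(label)

    summary = {}
    for pair_id, labels in by_pair.items():
        majority_label, _count = Counter(labels).most_common(1)[0]
        summary[pair_id] = (majority_label, len(set(labels)) == 1)
    return summary
```

Pairs where `unanimous` is false are the ones worth inspecting by hand or excluding when a clean ground truth is needed.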
The dataset is a SQLite 3 database; the schema is shown below.
There are 4 tables:
- experiments - the labeling sessions. There are only two - files and functions.
- users - the people who labeled the pairs of files and functions.
- pairs - the data for each pair, including the code strings and UASTv1-s.
- assignments - the labels per person per experiment.
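The tables listed above can also be inspected directly with Python's built-in `sqlite3` module, without the bundled helper. The sketch below makes no assumption about column names: it reads them from the database itself. The function name and the path in the usage comment are illustrative.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


# Hypothetical usage (the path is an assumption):
# conn = sqlite3.connect("duplicates.db")
# for table, columns in describe_schema(conn).items():
#     print(table, columns)
```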
You need Python 3 with the dependencies installed via pip3 install -r requirements.txt.
from duplicates import DuplicatesDataset
ds = DuplicatesDataset("/Users/sourced/Desktop/duplicates.db")
print(ds.experiments)
print(ds.users)
print(len(ds.assignments))
print(len(ds.pairs))
The procedure for choosing the files was developed in the included notebooks.
There were ~4 active human reviewers who did the labeling; they were from the same company and talked to each other. Hence the labels may be biased. Code duplication is subjective anyway.
Code: MIT. Labels: Open Data Commons Open Database License (ODbL). Actual file contents © their authors.