Duplicates

Download link (20MB xz-compressed).
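Since the download is xz-compressed, it has to be unpacked before SQLite can open it. The following is a minimal sketch using only the Python standard library. It assumes the archive is a single xz-compressed database file (not a tarball); the helper name `decompress_xz` and any file names passed to it are hypothetical.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src: str, dst: str) -> Path:
    """Stream-decompress the .xz file at `src` into `dst` and return the output path.

    Assumption: `src` is a plain xz stream holding one file, not a .tar.xz archive.
    """
    # Copy in chunks so the whole database never has to fit in memory.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return Path(dst)
```

For example, `decompress_xz("duplicates.db.xz", "duplicates.db")` would write the database next to the archive, provided the downloaded file is actually named that way.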

The Duplicates dataset consists of two parts:

1989 labeled pairs of Java files.
633 labeled pairs of Java functions.

Several source{d} employees labeled these pairs as "identical", "similar" or "different" in February 2018, using the src-d/code-annotation web application. The dataset was created to tune the hyperparameters of src-d/apollo, which was the proof of concept for src-d/gemini.

Code similarity is quite subjective, and human labelers may contradict each other in some cases. We've set 3 categories instead of 2 to make the choice easier.
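Because labelers can disagree on the same pair, it may help to collapse several labels into one and record how strong the agreement was. The sketch below is one possible way to do that with a simple majority vote. It is not part of the dataset tooling, and the function name `summarize_labels` is hypothetical.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize the labels that different people gave to one pair.

    Returns the majority label (None on a tie), the share of labelers who
    chose the most common label, and whether everyone agreed.
    """
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    # A tie means at least two labels share the highest count.
    tied = sum(1 for c in counts.values() if c == top_count) > 1
    return {
        "majority": None if tied else top_label,
        "agreement": top_count / len(labels),
        "unanimous": len(counts) == 1,
    }
```

With the labels `["similar", "similar", "different"]` the majority is "similar" with two of three labelers agreeing; with `["identical", "similar"]` there is no majority and the pair could be set aside for review.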

Format

The dataset is a SQLite 3 database; its schema is shown in schema.png.

There are 4 tables:

experiments - the labeling sessions. There are only two - files and functions.
users - the people who labeled the pairs of files and functions.
pairs - the data for each pair, including the code strings and UASTv1-s.
assignments - the labels per person per experiment.
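The tables listed above can also be inspected directly with Python's built-in `sqlite3` module, without `duplicates.py`. The sketch below makes no assumption about column names: it reads them from the database itself. The helper name `describe_tables` is hypothetical.

```python
import sqlite3


def describe_tables(conn):
    """Return {table name: {"columns": [...], "rows": n}} for every user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "  # skip SQLite's internal tables
            "ORDER BY name"
        )
    ]
    summary = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        summary[table] = {"columns": columns, "rows": count}
    return summary
```

Calling `describe_tables(sqlite3.connect("duplicates.db"))` (the path is an assumption about where the decompressed file lives) should list the four tables described above along with their columns and row counts.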

Sample code

You need Python 3 with the dependencies installed via pip3 install -r requirements.txt.

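from duplicates import DuplicatesDataset

# Point this at your local copy of the decompressed SQLite database.
ds = DuplicatesDataset("/Users/sourced/Desktop/duplicates.db")
print(ds.experiments)       # the labeling sessions
print(ds.users)             # the people who labeled the pairs
print(len(ds.assignments))  # number of assignment records
print(len(ds.pairs))        # number of pairs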

Origin

The procedure for selecting the files was designed in the included notebooks (see the notebooks directory).

Limitations

About 4 active human reviewers did the labeling. They all worked at the same company and talked to each other, so the labels may be biased. Code duplication is subjective in any case.
