🤗 Datasets
MigrationBench provides three datasets on Hugging Face.
All repositories are available on GitHub under MIT or Apache-2.0 licenses.
Available Datasets
| Dataset | Size | Description |
|---|---|---|
| MigrationBench-Full | 5,102 | Full dataset. Every repository has unit tests. |
| MigrationBench-Selected | 300 | Curated, challenging subset of Full. |
| MigrationBench-UTG | 4,814 | Unit test generation dataset. No repository has unit tests. Disjoint from java-full. |
Metadata
- `repo` (str): The original repo URL without the `https://github.com/` prefix
- `base_commit` (str): Base commit id
  - At this commit, with `java 8` and `maven 3.9.6`, the repository is able to (1) compile and (2) pass existing unit tests and integration tests, if any
  - It is the starting point for code migration from `java 8` to LTS versions
- `num_java_files` (int): Number of `*.java` files in the repository at `base_commit`; all other `num_*` columns are likewise measured at `base_commit`
- `num_loc` (int): Lines of code in the repository
- `num_pom_xml` (int): Number of modules (`pom.xml` files) in the repository
- `num_src_test_java_files` (int): Number of `*.java` files in the dedicated `src/test/` directory
- `num_test_cases` (int): Number of test cases, based on running the `mvn -f test .` command in the root directory
  - Non-negative values indicate that the number of test cases was parsed correctly from the output
  - Negative values mean the output could not be parsed: the `[INFO] Results:` (-2) or `[INFO] Tests run:` (-1) regex is missing
- `license` (str): The license of the repository, either `MIT` or `Apache2.0` for the whole dataset
Loading Datasets
Install the Hugging Face Datasets library (`pip install datasets`).
Load and use the datasets:
```python
from datasets import load_dataset

# Load the java-selected dataset
dataset = load_dataset("AmazonScience/migration-bench-java-selected")

# Iterate through repositories
for item in dataset['test']:
    print(f"Repository: {item['repo']}")
```
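Once rows are loaded, a common next step is to narrow them down by metadata. The sketch below is one possible way to keep only repositories whose test count was parsed and meets a minimum. The function `select_repos`, its `min_test_cases` parameter, and the sample rows are hypothetical; the function works on any iterable of dict-like rows, so it can be tried without downloading anything.

```python
def select_repos(rows, min_test_cases=1):
    """Yield rows with at least `min_test_cases` parsed test cases.

    Negative `num_test_cases` values (parse failures) are always
    excluded when `min_test_cases` is non-negative.
    """
    for row in rows:
        if row["num_test_cases"] >= min_test_cases:
            yield row


# Made-up rows for illustration; real rows come from the loaded dataset.
sample_rows = [
    {"repo": "example-org/alpha", "num_test_cases": 12, "num_loc": 3400},
    {"repo": "example-org/beta", "num_test_cases": -1, "num_loc": 900},
    {"repo": "example-org/gamma", "num_test_cases": 0, "num_loc": 150},
]

selected = list(select_repos(sample_rows))
print([row["repo"] for row in selected])  # ['example-org/alpha']
```

With a loaded dataset, the same call would take `dataset['test']` in place of `sample_rows`. The library's own `Dataset.filter` method is an alternative that returns a new dataset rather than a generator.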
Stay tuned for updates!