
Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

  • Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
  • A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)

The repository also contains code for resplitting the Python150k and JavaScript150k datasets (splitting by repository, removing duplicates, and keeping only the redistributable version of Py150k).
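
Splitting by repository matters because files from the same repository often share code; if they land on both sides of a split, test metrics are inflated. Below is a minimal sketch of such a resplit with exact-duplicate removal. The function and field names are illustrative only, not the ones used in data_utils; see that directory for the actual scripts.

```python
import hashlib
import random
from collections import defaultdict

def resplit(files, val_frac=0.1, test_frac=0.1, seed=0):
    """Split a list of (repo_name, source_code) pairs by repository,
    dropping exact-duplicate files first. Illustrative sketch only."""
    # 1. Remove exact duplicates by hashing file contents.
    seen, deduped = set(), []
    for repo, code in files:
        h = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        deduped.append((repo, code))

    # 2. Group files by repository so each repo lands in exactly one split.
    by_repo = defaultdict(list)
    for repo, code in deduped:
        by_repo[repo].append(code)

    # 3. Shuffle repositories (not files) and carve out val/test/train.
    repos = sorted(by_repo)
    random.Random(seed).shuffle(repos)
    n_val = int(len(repos) * val_frac)
    n_test = int(len(repos) * test_frac)
    split_repos = {
        "val": repos[:n_val],
        "test": repos[n_val:n_val + n_test],
        "train": repos[n_val + n_test:],
    }
    return {name: [(r, c) for r in rs for c in by_repo[r]]
            for name, rs in split_repos.items()}
```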

Repository structure

  • data_utils: scripts for downloading the Python150k and JavaScript150k datasets and obtaining new train / val / test splits (splitting by repository, removing duplicates, and keeping only the redistributable version of Py150k)
  • vm_fn: code for the Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training, etc.)
  • cc: code for the Code Completion (CC) task (additional preprocessing, models, training, etc.)

See README in each directory for details.

Run

The code was tested on a system running Linux (kernel 3.10.0). Experiments were run on a Tesla V100 GPU. Required libraries are listed in requirements.txt in the vm_fn and cc directories. The implementation is based on PyTorch>=1.5.
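
Before launching experiments, it can help to confirm the environment meets these requirements. The snippet below is a generic sanity check, not a script from this repository:

```python
import torch

# Check the PyTorch>=1.5 requirement by comparing (major, minor) tuples,
# which avoids pitfalls of lexicographic string comparison ("1.10" < "1.5").
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (1, 5), f"PyTorch>=1.5 required, found {torch.__version__}"

# Experiments were run on a GPU (Tesla V100), so check CUDA availability.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```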

Running experiments:

  1. Download and resplit the data (see data_utils for details);
  2. Preprocess the data for the task you are interested in (VM, FN, or CC; see vm_fn or cc for details);
  3. Run the experiment you are interested in (see vm_fn or cc for details).

Attribution

Parts of this code are based on the following repositories:

Citation

If you find this code useful, please cite our papers:

@inproceedings{10.1145/3468264.3468611,
    author = {Chirkova, Nadezhda and Troshin, Sergey},
    title = {Empirical Study of Transformers for Source Code},
    year = {2021},
    isbn = {9781450385626},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3468264.3468611},
    doi = {10.1145/3468264.3468611},
    booktitle = {Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
    pages = {703--715},
    numpages = {13},
    keywords = {code completion, neural networks, transformer, function naming, variable misuse detection},
    location = {Athens, Greece},
    series = {ESEC/FSE 2021}
}
@inproceedings{chirkova-troshin-2021-simple,
    title = "A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code",
    author = "Chirkova, Nadezhda and Troshin, Sergey",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.26",
    doi = "10.18653/v1/2021.naacl-main.26",
    pages = "278--288",
}