A toolkit for pre-processing large source code corpora
This is a tool for preprocessing source code corpora according to a specified vocabulary modeling choice.
Supported modeling choices are:
Make sure you have python >= 3.6 installed in your system; pip, setuptools and wheel are up to date.
python --version
python -m pip install --upgrade pip setuptools wheel
Install codeprep lib:
pip install codeprep
In order to run the ronin algorithm, you will have to additionally install Spiral module (https://github.com/casics/spiral/):
pip install git+https://github.com/casics/spiral.git
The tool can be used as a python library as well as a standalone module runnable with a CLI.
You can pass the path to the dataset or the text itself to be preprocessed. When using Python API for the former option
you need to import methods from codeprep.api.text
module, for the latter - from codeprep.api.corpus
.
Below you can see the general patterns of usage.
Python API
>>> import codeprep.api.text as cp
>>> cp.<commmand>('Some code to be split')
>>> import codeprep.api.corpus as cp
>>> cp.<commmand>('/path/to/the/dataset')
CLI
codeprep <commmand> "Some code to be split"
codeprep <commmand> --path /path/to/the/dataset
Hereafter we will demonstrate the usage as a python library. The CLI is analogous to the python API. You can find the documentation about how to use it here.
Tokenization + CamelCase- and snake_case- splitting:
>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
... if (eps >= 0.345e+4) { // FIXME
... printWord(" ... Überraschung");
... }
... }'''
>>> cp.basic(input_code)
['void', '<w>', 'test', '_', 'Word', 'Ueberraschung', 'Printer', '</w>', '(', ')', '{', '\n',
'\t', 'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', 'FIXME', '\n',
'\t', '\t', '<w>', 'print', 'Word', '</w>', '(', '"', '\t', '.', '.', '.', '\t', 'Überraschung', '"', ')', ';', '\n',
'\t', '}', '\n',
'}']
>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
... if (eps >= 0.345e+4) { // FIXME
... printWord(" ... Überraschung");
... }
... }'''
>>> cp.nosplit(input_code)
['void', 'test_WordUeberraschungPrinter', '(', ')', '{', '\n',
'\t', 'if', '(', 'eps', '>', '=', '0', '.', '345e', '+', '4', ')', '{', '/', '/', 'FIXME', '\n',
'\t', '\t', 'printWord', '(', '"', '\t', '.', '.', '.', '\t', 'Überraschung', '"', ')', ';', '\n',
'\t', '}', '\n',
'}']
The following code does camelCase- and snake_case- splitting and applies bpe with 10k merges on top:
>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
... if (eps >= 0.345e+4) { // FIXME
... printWord(" ... Überraschung");
... }
... }'''
>>> cp.bpe(input_code, bpe_codes_id='10k')
['v', 'oid</t>', 'test_', 'Word', 'U', 'eb', 'err', 'as', 'ch', 'un', 'g', 'Print', 'er</t>', '(</t>', ')</t>', '{</t>', '\n',
'\t', 'i', 'f</t>', '(</t>', 'e', 'ps</t>', '></t>', '=</t>', '0</t>', '.</t>', '34', '5', 'e</t>', '+</t>', '4</t>', ')</t>', '{</t>', '/</t>', '/</t>', 'FIX', 'M', 'E</t>', '\n',
'\t', '\t', 'print', 'Word</t>', '(</t>', '"</t>', '\t', '.</t>', '.</t>', '.</t>', '\t', 'Ü', 'b', 'err', 'as', 'ch', 'un', 'g</t>', '"</t>', ')</t>', ';</t>', '\n',
'\t', '}</t>', '\n',
'}</t>']
codeprep by default does BPE using bpe codes leaned on the Github Java Corpus. The argument bpe_codes_id='10k'
tells the codeprep tool to use 10,000 bpe merges.
Other possible values are 1k
and 5k
(1,000 and 5,000 merges respectively). Please refer to section Learning custom BPE codes to train custom bpe codes.
For other commands and options like chars
, --split-numbers
, --ronin
, --stem
, please refer to the docs.
Set calc_vocab
param to True
when calling a preprocessing method to calculate the vocabulary of the preprocessed corpus, e.g.:
>>> import codeprep.api.corpus as cp
>>> cp.basic('/path/to/train/on', calc_vocab=True)
...
Vocab is available at /path/to/vocab
If you don't want to use, pre-trained BPE codes, it's possible to train custom ones. For example, to train 10,000 merges on the corpus located at the path /path/to/train/on
, the following command should be run (only CLI):
codeprep learn-bpe 10000 -p /path/to/train/on --id custom-bpe-codes
Now it is possible to do bpe splitting by running the bpe command with the number of merges from 0 to 10,000 (for example with 3500 merges):
codeprep bpe custom-bpe-codes-3500 -p /path/to/preprocess
Before bpe codes are trained, the basic preprocessing is done, which can also be tuned with arguments described in section Tweaking preprocessing.
You can pass the following parameters with a True
value (default values for all of them are False), to tweak the way the imput is preprocessed:
no_str
- replace strings with no_com
- replace comments with no_spaces
- remove newlines and tabs.no_unicode
- replace words containing non-ascii characters with no_case
- lowercase words and encode information about case in <Cap> <CAP> tokens.>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
... if (eps >= 0.345e+4) { // FIXME
... printWord(" ... Überraschung");
... }
... }'''
>>> cp.basic(input_code, no_spaces=True, no_unicode=True, no_case=True, no_com=True, no_str=True)
['void', '<w>', 'test', '_', '<Cap>', 'word', '<Cap>', 'ueberraschung', '<Cap>', 'printer', '</w>', '(', ')', '{',
'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', '<CAPS>', 'fixme',
'<w>', 'print', '<Cap>', 'word', '</w>', '(', '"', '.', '.', '.', '<Cap>', '<non-en>', '"', ')', ';',
'}',
'}']
Similar params can be specified as switches --no-str
, --no-com
, --no-spaces
, --no-unicode
, --no-case
in CLI commands.
Unless explicitely specified, codeprep will assume the language is java. To make sure the input is preprocessed as intended, it is always highly recommended to specify it:
import codeprep.api.text as cp
>>> cp.bpe("volatile", '1k')
['volatile']
>>> cp.bpe("volatile", '1k', extension="py")
['v', 'ol', 'a', 'ti', 'le</t>']
# Since 'volatile' is a keyword in java, it is represented as one token unlike in python
# where it is pretty rare when used as an identifier and therefore represented as multiple subtokens.
When preprocessing a corpus, codeprep
identifies the language based on the file extension. If you want only files with (a) certain extension(s) to be preprocessed, you can specify --ext param
codeprep basic --path /path/to/be/preprocessed --ext "java"
# or if you want to pre-process multiple types of files:
codeprep basic --path /path/to/be/preprocessed --ext "java|c|py|js"
You can specify the path to where the preprocessed corpus will be written:
codeprep basic --path /path/to/preprocess --output-path /path/to/output
To print logs with log level DEBUG and higher to stdout:
codeprep basic --path /path/to/preprocess --verbose
To get help on commands and options:
codeprep --help
This library was build to run experiments for our paper accepted at ICSE 2020: Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
If you you the library or the results, please cite the paper:
@article{karampatsis2020big,
title={Big Code!= Big Vocabulary: Open-Vocabulary Models for Source Code},
author={Karampatsis, Rafael-Michael and Babii, Hlib and Robbes, Romain and Sutton, Charles and Janes, Andrea},
journal={arXiv preprint arXiv:2003.07914},
year={2020}
}
When preprocessing a dataset, codeprep first parses source code and converts it into internal representation, which is after that converted to a preprocessed dataset depending on provided parameters. The intermediate representation is cached, so that when the same dataset is pre-processed again with different parameters, codeprep (providing no changes have been made to the dataset) would use the cache rather than parsing the source code again.
To store the cache, codeprep uses a directory speecified by $XDG_CACHE_HOME/codeprep/<codeprep_version>
variable if its value is set,
$HOME/.cache/codeprep/<codeprep_version>
otherwise.
Removing the cache will not change the final result, however, will result in slower pre-processing.
get_corpus_size()
method to PreprocessedCorpus
classcodeprep.__version__
--full-strings
and --max-str-length
optionsronin
method/command wit--ronin
option and apply ronin algorithm on word level instead of full identifier levelsplit_numbers
option is set to True
, split numbers not only in code but also in strings and commentsInitial PyPI release