This tool automatically generates grammatically valid synthetic code-mixed data from parallel data in two languages, utilizing linguistic theories of code-mixing such as the Equivalence Constraint theory and the Matrix Language theory.
Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data. We describe a tool that can automatically generate code-mixed data given parallel data in two languages. We implement two linguistic theories of code-mixing, the Equivalence Constraint theory and the Matrix Language theory, to generate all possible code-mixed sentences in the language pair, followed by sampling of the generated data to produce natural code-mixed sentences.
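As a rough illustration of the idea (this is a toy sketch, not the toolkit's actual implementation), the Equivalence Constraint theory permits a language switch only at points where the word alignment between the two sentences does not cross, so the word orders stay compatible. The Hindi words and the alignment below are made up for the example:

```python
# Toy sketch (NOT the toolkit's implementation): generate code-mixed
# variants of an aligned sentence pair by switching languages only at
# boundaries where the word alignment is monotone (crossing-free),
# in the spirit of the Equivalence Constraint theory.

def monotone_switch_points(alignment):
    """Return cut indices i in lang1 such that all lang1 words before i
    align strictly before all lang1 words from i onward in lang2."""
    points = []
    n = max(a for a, _ in alignment) + 1
    for i in range(1, n):
        left = {b for a, b in alignment if a < i}
        right = {b for a, b in alignment if a >= i}
        if left and right and max(left) < min(right):
            points.append(i)
    return points

def generate_cm(sent1, sent2, alignment):
    """Yield code-mixed sentences: a lang1 prefix followed by the
    corresponding lang2 suffix, at each valid switch point."""
    for i in monotone_switch_points(alignment):
        cut2 = min(b for a, b in alignment if a >= i)
        yield sent1[:i] + sent2[cut2:]

hi = ["mujhe", "kitaab", "chahiye"]   # toy Hindi: "I want the book"
en = ["I", "want", "the", "book"]
align = [(0, 0), (1, 3), (2, 1)]      # toy (hi index, en index) pairs
print(list(generate_cm(hi, en, align)))  # → [['mujhe', 'want', 'the', 'book']]
```

The real pipeline works on full parse trees rather than flat switch points, which is why it needs the Stanford and Berkeley parsers installed below.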
The toolkit provides three modes: a batch mode, an interactive library mode and a web interface, to address the needs of researchers, linguists and language experts.
The toolkit can be used to generate unlabeled text data for pre-trained models, as well as visualize linguistic theories of code-mixing.
Please read the associated papers in conjunction with the documentation to understand how this tool works, its assumptions, corner cases and the concepts behind it.
This project has the following structure:
CodeMixed-Text-Generator/
├── alignment_generator/
├── cm_text_generator/
├── stanford_parser/
├── web/
├── library/
├── utils/
├── aligner.py
├── pre_gcm.py
├── gcm.py
├── config.ini
├── sequence_run.py
├── parallel_run.py
└── requirements.txt
Here's more info about each component:
Note: This project is developed and tested on Ubuntu and supports only that OS. If you want to run it on Windows, you can use the Windows Subsystem for Linux.
This project has the following dependencies as prerequisites:
Java JRE 8: https://www.oracle.com/java/technologies/javase-jre8-downloads.html
CMake: sudo apt install cmake
Miniconda: https://docs.conda.io/en/latest/miniconda.html
librsvg: sudo apt install librsvg2-bin
git clone https://github.com/microsoft/CodeMixed-Text-Generator.git
conda create -n gcm python==3.7.7
Activate the environment:
conda activate gcm
cd CodeMixed-Text-Generator/CodeMixed-Text-Generator/
pip install -r requirements.txt
The above might take some time based on the speed of your internet connection.
Note: We've tested this toolkit only with Tensorflow==1.15.4.
cd alignment_generator/
python fast_align_install.py
cd ../stanford_parser/
python stanford_parser_install.py
cd ../
mkdir data
The above step will take some time depending on your internet speed, as the Stanford Parser is a heavy package.
a. Open a Python terminal and run the following code:
import benepar
benepar.download('benepar_en2')
If neither of the languages you're working with is English, you'll have to download the benepar model for one of the two languages. You can find the available models here:
https://pypi.org/project/benepar/#usage
Once you have finished the above steps, you are all set.
The toolkit provides three modes: a batch mode, an interactive library mode and a web interface, to address the needs of researchers, linguists and language experts.
Here's how to use the three modes:
The library mode is a nifty little Python interface that we have built around the GCM tool, which can be used for quick-prototyping and experimentation.
Simply go to the library directory:
cd library
You can start jupyter notebook here:
jupyter notebook
Select the GCM Library Mode Demo notebook, which has examples of how to run the GCM tool:
You can then import the gcm module and play around with all the sub-modules.
For more information on using the Library Mode, here's the documentation.
This is a Flask-based web app through which you can access the GCM tool in the browser. You can not only generate CM sentences but also visualize the parse trees of the generated code-mixed sentences.
Simply go to the web directory:
cd web
Run the web server:
flask run
You can access the GCM tool in the browser at the following address:
http://localhost:5000/
For more information on using the Web UI Mode, here's the detailed documentation on Web UI Mode.
This is the mode designed for large-scale generation. Its interface is the config.ini file, which has tuning knobs for the different components of the GCM system.
Feel free to have a look at the Config File.
Each of the options is self-explanatory and has a default value already present. You can refer to the GCM paper for more information.
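For orientation, a config fragment might look like the sketch below. Only the stages_to_run property is documented in this README; the section name and the other key names here are assumptions for illustration, so consult the shipped config.ini for the real options:

```ini
; Illustrative sketch only -- the section name and the lang1/lang2 keys
; are assumptions; stages_to_run is a real property described below.
[GENERAL]
lang1 = hi
lang2 = en
stages_to_run = aligner, pregcm, gcm
```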
Once you have set all the config options, you need to write the input files in the data directory.
Suppose you want to generate code-mixed text with Hindi as your source language and English as your target language. GCM expects the following four files as input in the data directory:
a. Source Language Text - A file containing sentences of the source language (Hindi in this case). The default name of this file that GCM expects is hi-to-en-input_lang1, where hi and en are the first two letters of the languages to be mixed, picked up from the config file. You can change this setting in the config file.
b. Target Language Text - A file containing sentences of the target language (English in this case). The default name of this file that GCM expects is hi-to-en-input_lang2. You can change this setting in the config file.
c. Word-level Alignments - A file containing word-level alignments between the sentences of the source and target languages. The default name of this file that GCM expects is hi-to-en-input_parallel_alignments. You can change this setting in the config file.
Note: You can use GCM's fast_align to automatically generate the word-level alignments. You simply have to add aligner, pregcm, gcm to the stages_to_run property in the config file, like this:
stages_to_run = aligner, pregcm, gcm
d. PFMS Scores - A file containing the PFMS scores (refer to this paper). The default name of this file that GCM expects is hi-to-en_pfms.txt. You can change this setting in the config file.
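fast_align emits word alignments in the standard Pharaoh format: one line per sentence pair, with space-separated i-j tokens meaning that source word i aligns to target word j. A minimal sketch of a reader for such a file (not part of the toolkit itself) could look like:

```python
# Minimal sketch: parse Pharaoh-format word alignments ("0-0 1-2 ..."),
# the format emitted by fast_align, into lists of (src, tgt) index pairs.

def parse_alignment_line(line):
    """Turn '0-0 1-2 2-1' into [(0, 0), (1, 2), (2, 1)]."""
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

def read_alignments(path):
    """Read one alignment per line from a Pharaoh-format file."""
    with open(path, encoding="utf-8") as f:
        return [parse_alignment_line(line) for line in f]

print(parse_alignment_line("0-0 1-2 2-1"))  # → [(0, 0), (1, 2), (2, 1)]
```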
Once you are ready with the above set of input files, you have two options of running the batch mode:
a. Sequential Data Generation: This is the straightforward way of running the GCM process. Once you have finished setting up the config and the input data files, go to the code directory and run this script:
python sequence_run.py
The generated code-mixed text will be in data/hi-to-en-gcm/out-cm-hi-en.txt.
Again, please note that the names of all of the above files will change based on the first two letters of the languages being mixed, which the system picks up from the config file. Alternatively, you can skip the naming convention and provide absolute paths to the files in the config file.
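The naming convention described above can be summarized in a small sketch (file names taken from the hi/en example in this README; the helper function itself is not part of the toolkit):

```python
# Sketch of the default file naming convention for a language pair,
# mirroring the hi (Hindi) / en (English) example in this README.

def default_filenames(lang1, lang2):
    """Build the default input/output file names for a language pair."""
    prefix = f"{lang1}-to-{lang2}"
    return {
        "source": f"{prefix}-input_lang1",
        "target": f"{prefix}-input_lang2",
        "alignments": f"{prefix}-input_parallel_alignments",
        "pfms": f"{prefix}_pfms.txt",
        "output": f"{prefix}-gcm/out-cm-{lang1}-{lang2}.txt",
    }

print(default_filenames("hi", "en")["output"])  # → hi-to-en-gcm/out-cm-hi-en.txt
```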
b. Parallel Data Generation: This is an experimental version that runs the entire GCM process asynchronously using multiple processes, so the overall data generation time is comparatively lower. You can start this mode by running the following command:
python parallel_run.py --data_dir data --num_procs 2 --output_dir temp_data
Where:
--data_dir is the input data directory.
--num_procs is the number of processes to run. This can be the number of cores in your setup.
--output_dir is a directory to store intermediate files.
The generated code-mixed text will be in data/hi-to-en-gcm/out-cm-hi-en.txt.
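Conceptually, the parallel mode splits the input across worker processes and merges their outputs. The sketch below illustrates that pattern with a dummy per-chunk step; it is only an illustration of the idea, not the actual logic of parallel_run.py:

```python
# Toy sketch of parallel generation: split input sentences across
# worker processes, run a (dummy) generation step on each chunk, and
# merge the results. parallel_run.py's real pipeline differs in detail.
from multiprocessing import Pool

def generate_chunk(sentences):
    # Stand-in for one GCM pipeline run over a chunk of the input.
    return [s.upper() for s in sentences]

def parallel_generate(sentences, num_procs=2):
    """Round-robin the sentences into num_procs chunks, process each
    chunk in its own worker process, and flatten the results."""
    chunks = [sentences[i::num_procs] for i in range(num_procs)]
    with Pool(num_procs) as pool:
        results = pool.map(generate_chunk, chunks)
    return [s for chunk in results for s in chunk]

if __name__ == "__main__":
    print(parallel_generate(["ek", "do", "teen", "chaar"], num_procs=2))
```

Note that with round-robin chunking the merged output is grouped by worker rather than kept in the original sentence order.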
Note: The parallel_run.py script is experimental, has not been extensively tested and might change in the future.
For more information on using the Batch Mode, here's the detailed documentation on Batch Mode.
If you use this work, please cite the following paper:
@inproceedings{rizvi-etal-2021-gcm,
title = "{GCM}: A Toolkit for Generating Synthetic Code-mixed Text",
author = "Rizvi, Mohd Sanad Zaki and
Srinivasan, Anirudh and
Ganu, Tanuja and
Choudhury, Monojit and
Sitaram, Sunayana",
booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.eacl-demos.24",
pages = "205--211",
abstract = "Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data. We describe a tool that can automatically generate code-mixed data given parallel data in two languages. We implement two linguistic theories of code-mixing, the Equivalence Constraint theory and the Matrix Language theory to generate all possible code-mixed sentences in the language-pair, followed by sampling of the generated data to generate natural code-mixed sentences. The toolkit provides three modes: a batch mode, an interactive library mode and a web-interface to address the needs of researchers, linguists and language experts. The toolkit can be used to generate unlabeled text data for pre-trained models, as well as visualize linguistic theories of code-mixing. We plan to release the toolkit as open source and extend it by adding more implementations of linguistic theories, visualization techniques and better sampling techniques. We expect that the release of this toolkit will help facilitate more research in code-mixing in diverse language pairs.",
}
This project is an outcome of collaborative efforts of the following people:
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.