A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
[Chinese Homepage] | [Docs] | [API] | [DJ-SORA]
Data-Juicer is a one-stop multimodal data processing system to make data higher-quality, juicier, and more digestible for LLMs.
Data-Juicer (including DJ-SORA) is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. We welcome you to join us in promoting LLM data development and research!
We provide a Playground with a managed JupyterLab. Try Data-Juicer straight away in your browser!
If you find Data-Juicer useful for your research or development, please kindly cite our work. Issues and PRs are welcome, and you can join our Slack channel or DingDing group for discussion!
[2024-03-07] We release Data-Juicer v0.2.0 now! In this new version, we support more features for multimodal data (including video now), and introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models.
[2024-02-20] We are actively maintaining an awesome list of LLM-Data. You are welcome to visit and contribute!
[2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
[2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's official website for more information.
[2024-01-05] We release Data-Juicer v0.1.3 now! In this new version, we support more Python versions (3.8-3.10) and multimodal dataset converting/processing (including text, images, and audio; more modalities will be supported in the future). Our paper is also updated to v3.
[2023-10-13] Our first data-centric LLM competition begins! Please visit the competition's official websites, FT-Data Ranker (1B Track, 7B Track), for more information.
[2023-10-8] We update our paper to the 2nd version and release the corresponding version 0.1.2 of Data-Juicer!
Systematic & Reusable: Empowering users with a systematic library of 80+ core OPs, 20+ reusable config recipes, and 20+ feature-rich dedicated toolkits, designed to function independently of specific LLM datasets and processing pipelines.
Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
Comprehensive Data Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, fine-tuning, English (en), Chinese (zh), and more scenarios. Validated on reference LLaMA and LLaVA models.
Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory and CPU usage, optimized for maximum productivity.
Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.
User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration: simply add or remove OPs from existing configs.
Install the data_juicer version from source in editable mode:
cd <path_to_data_juicer>
pip install -v -e .
cd <path_to_data_juicer>
pip install -v -e . # install minimal dependencies, which support the basic functions
pip install -v -e .[tools] # install a subset of tools dependencies
The dependency options are listed below:
| Tag | Description |
|---|---|
| . or .[mini] | Install minimal dependencies for basic Data-Juicer. |
| .[all] | Install all optional dependencies (including minimal dependencies and all of the following). |
| .[sci] | Install all dependencies for all OPs. |
| .[dist] | Install dependencies for distributed data processing. (Experimental) |
| .[dev] | Install dependencies for developing the package as contributors. |
| .[tools] | Install dependencies for dedicated tools, such as quality classifiers. |
Install the latest released data_juicer using pip:
pip install py-data-juicer
Note: only the basic data_juicer APIs and two basic tools (data processing and analysis) are available this way. If you want customizable and complete functions, we recommend you install data_juicer from source.
You can either pull our pre-built image from DockerHub:
docker pull datajuicer/data-juicer:<version_tag>
or run the following command to build the docker image including the latest data-juicer with the provided Dockerfile:
docker build -t datajuicer/data-juicer:<version_tag> .
Verify the installation by importing the package and printing its version:
import data_juicer as dj
print(dj.__version__)
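The installed distribution can also be queried without importing the package. The following is a minimal sketch, assuming the package is registered under the distribution name py-data-juicer (the name used in the pip command above); the helper installed_version is hypothetical and not part of Data-Juicer.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "py-data-juicer") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "py-data-juicer is not installed")
```

Returning None instead of raising lets a setup script decide whether to install, upgrade, or stop.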
Before using video-related operators, FFmpeg should be installed and accessible via the $PATH environment variable.
You can install FFmpeg using a package manager (e.g. sudo apt install ffmpeg on Debian/Ubuntu, brew install ffmpeg on macOS) or visit the official FFmpeg link.
Check if your environment path is set correctly by running the ffmpeg command from the terminal.
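The same check can be scripted before launching a video pipeline. This is a small sketch using only the Python standard library; the helper name find_executable is an illustrative choice, not a Data-Juicer API.

```python
import shutil
from typing import Optional


def find_executable(name: str = "ffmpeg") -> Optional[str]:
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("ffmpeg was not found on PATH; video-related operators may fail")
    else:
        print(f"ffmpeg found at {path}")
```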
Run the process_data.py tool or the dj-process command line tool with your config as the argument to process your dataset.
# only for installation from source
python tools/process_data.py --config configs/demo/process.yaml
# use command line tool
dj-process --config configs/demo/process.yaml
Note: Some operators involve third-party models or resources that are not stored locally on your computer. The first run of these OPs may be slow because they need to download the corresponding resources into a directory first.
The default download cache directory is ~/.cache/data_juicer. To change the cache location, set the shell environment variable DATA_JUICER_CACHE_HOME to another directory. DATA_JUICER_MODELS_CACHE and DATA_JUICER_ASSETS_CACHE can be changed in the same way, as in the export commands below.
Note: When using operators with third-party models, you need to declare the corresponding mem_required in the configuration file (you can refer to the settings in the config_all.yaml file). At runtime, Data-Juicer controls the number of processes based on available memory and the memory requirements of the operator models to achieve better data processing efficiency. When running in a CUDA environment, an incorrectly declared mem_required for an operator can lead to a CUDA Out of Memory error.
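To make the role of mem_required concrete, here is a minimal sketch of how a process count could be capped by memory. It is an illustrative rule under stated assumptions, not Data-Juicer's actual scheduling logic: the function cap_num_proc, its parameters, and the use of gigabytes as the unit are all hypothetical.

```python
def cap_num_proc(requested: int, available_mem_gb: float, mem_required_gb: float) -> int:
    """Limit the worker count so that the workers' combined model memory fits.

    Illustrative rule only; not Data-Juicer's actual scheduling logic.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    if mem_required_gb <= 0:
        # No declared model memory: nothing to cap.
        return requested
    fits = int(available_mem_gb // mem_required_gb)
    # Always keep at least one process so the job can still start.
    return max(1, min(requested, fits))


if __name__ == "__main__":
    print(cap_num_proc(requested=8, available_mem_gb=16, mem_required_gb=4))
```

Under this rule, declaring a mem_required that is too small makes too many workers appear to fit, which is consistent with the out-of-memory risk described in the note.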
# cache home
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
# cache models
export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
# cache assets
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
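When the tools are launched from a Python script rather than a shell, the same three variables can be set on a copied environment. This is a sketch: the helper build_cache_env is hypothetical, and placing models and assets in subdirectories of the cache home is an assumption that simply mirrors the example exports above.

```python
import os
from typing import Dict, Mapping, Optional


def build_cache_env(cache_home: str,
                    base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the three cache variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DATA_JUICER_CACHE_HOME"] = cache_home
    # Assumption: models and assets live in subdirectories of the cache home,
    # mirroring the example export commands.
    env["DATA_JUICER_MODELS_CACHE"] = os.path.join(cache_home, "models")
    env["DATA_JUICER_ASSETS_CACHE"] = os.path.join(cache_home, "assets")
    return env


if __name__ == "__main__":
    env = build_cache_env("/path/to/another/directory")
    for key in sorted(k for k in env if k.startswith("DATA_JUICER_")):
        print(f"{key}={env[key]}")
```

The returned mapping can be passed as the env argument of subprocess.run when starting dj-process, which leaves the parent shell's environment untouched.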
We have now implemented multi-machine distributed data processing based on RAY. The corresponding demos can be run using the following commands:
# Run text data processing
python tools/process_data.py --config ./demos/process_on_ray/configs/demo.yaml
# Run video data processing
python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo.yaml
Some operators have dedicated RAY-mode versions whose names are prefixed with ray, e.g. ray_video_deduplicator and ray_document_deduplicator. Those operators also rely on a Redis instance. So in addition to starting the RAY cluster, you also need to set up your Redis instance in advance and provide the host and port of your Redis instance in the configuration.
Users can also opt not to use RAY and instead split the dataset to run on a cluster with Slurm / Aliyun PAI-DLC. In this case, please use the default Data-Juicer without RAY.
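For the route without RAY, the dataset has to be split before each cluster job processes its part. Below is a minimal sketch of one way to do that, assuming a JSON Lines input where each line is one sample. The function name and the round-robin strategy are illustrative choices, not something Data-Juicer prescribes.

```python
from typing import Iterable, List


def split_round_robin(lines: Iterable[str], num_shards: int) -> List[List[str]]:
    """Distribute lines over num_shards lists, one line at a time."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    for index, line in enumerate(lines):
        shards[index % num_shards].append(line)
    return shards


if __name__ == "__main__":
    samples = ['{"text": "a"}', '{"text": "b"}', '{"text": "c"}']
    for shard_id, shard in enumerate(split_round_robin(samples, 2)):
        print(shard_id, shard)
```

Each shard can then be written to its own file and processed by a separate job with its own config. Operators that compare samples with each other, such as deduplicators, would then only see the samples inside one shard.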
Run the analyze_data.py tool or the dj-analyze command line tool with your config as the argument to analyze your dataset.
# only for installation from source
python tools/analyze_data.py --config configs/demo/analyser.yaml
# use command line tool
dj-analyze --config configs/demo/analyser.yaml
Run the app.py tool to visualize your dataset in your browser.
streamlit run app.py
You can build your own config by modifying config_all.yaml, which includes all OPs and default arguments. You just need to remove OPs that you won't use and refine some arguments of the remaining OPs.
For reference, see config_all.yaml, the OP documents, and the advanced Build-Up Guide for developers.
An OP argument can also be passed on the command line together with the config file:
python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang=en
[Figure: the basic config format and definition]
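The dotted argument in the command above, language_id_score_filter.lang=en, addresses one argument of one OP. The sketch below shows how such a dotted key can be applied to a nested mapping. It is a simplified illustration: a plain dict stands in for the loaded config, the helper apply_override and the min_score entry are only for demonstration, and the way Data-Juicer itself resolves these arguments is not shown here.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> Dict[str, Any]:
    """Apply one 'dotted.key=value' override to a nested dict, in place."""
    dotted_key, sep, value = override.lstrip("-").partition("=")
    if not sep or not dotted_key:
        raise ValueError(f"expected KEY=VALUE, got {override!r}")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate levels that the config does not define yet.
        node = node.setdefault(name, {})
    node[leaf] = value  # values stay strings in this sketch
    return config


if __name__ == "__main__":
    cfg = {"language_id_score_filter": {"lang": "zh", "min_score": 0.8}}
    apply_override(cfg, "--language_id_score_filter.lang=en")
    print(cfg)
```

Only the addressed key changes; sibling arguments of the same OP keep the values they had in the config.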
We provide some common preprocessing tools in tools/preprocess for you to preprocess your raw data.
If you have built or pulled the docker image of data-juicer, you can run the commands or tools mentioned above using this docker image.
# run the data processing directly
# --rm: remove the container after the processing
# --name: name of the container
# first -v: mount the data or config directory into the container
# second -v: mount the cache directory into the container to reuse caches and models (recommended)
# then the image to run, followed by the data processing command
docker run --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  datajuicer/data-juicer:<version_tag> \
  dj-process --config /path/to/config.yaml
# start the container (-dit runs the container in the background)
docker run -dit \
  --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  datajuicer/data-juicer:latest /bin/bash
# enter into this container and then you can use data-juicer in editable mode
docker exec -it <container_id> bash
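When the docker invocation is generated from a script, building it as an argument list avoids shell line-continuation and quoting pitfalls. The following is a sketch only: docker_run_command is a hypothetical helper, and the /data paths and the latest tag in the demo are placeholders. It uses just the docker run options already shown above (--rm, --name, -v).

```python
import shlex
from typing import List, Sequence, Tuple


def docker_run_command(image: str,
                       command: Sequence[str],
                       mounts: Sequence[Tuple[str, str]] = (),
                       name: str = "dj") -> List[str]:
    """Build the argv for a one-off `docker run` with bind mounts."""
    argv = ["docker", "run", "--rm", "--name", name]
    for host_path, container_path in mounts:
        argv += ["-v", f"{host_path}:{container_path}"]
    argv.append(image)
    argv.extend(command)
    return argv


if __name__ == "__main__":
    argv = docker_run_command(
        image="datajuicer/data-juicer:latest",
        command=["dj-process", "--config", "/data/config.yaml"],
        mounts=[("/data", "/data")],  # placeholder host and container paths
    )
    # Print a shell-safe rendering; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```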
Data-Juicer is released under Apache License 2.0.
We are in a rapidly developing field and greatly welcome contributions of new features, bug fixes and better documentation. Please refer to the How-to Guide for Developers.
If you have any questions, please join our discussion groups.
Data-Juicer is used across various LLM products and research initiatives, including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for financial analysis and Zhiwen for reading assistance, as well as Alibaba Cloud's Platform for AI (PAI). We look forward to hearing more of your experience, suggestions and ideas for collaboration!
Data-Juicer thanks and refers to several community projects, such as Huggingface-Datasets, Bloom, RedPajama, Pile, Alpaca-Cot, Megatron-LM, DeepSpeed, Arrow, Ray, Beam, LM-Harness, HELM, and others.
If you find our work useful for your research or development, please kindly cite the following paper.
@inproceedings{chen2024datajuicer,
title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
booktitle={International Conference on Management of Data},
year={2024}
}