Data-Juicer Versions

A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

v0.2.0

2 months ago

New Features

  • 🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
  • 🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
  • 💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
  • 💥 "BetterMixture": our second data-centric LLM competition has kicked off and will end soon. #174

New OPs

Multimodal

  • video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specified range. #227
  • video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
  • video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
  • video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
  • video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
  • image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
  • image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
  • image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200

Video

Filter

  • video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
  • video_aspect_ratio_filter: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227
  • video_resolution_filter: filters samples according to the resolution of videos in them. #227
  • video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
  • video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
  • video_motion_score_filter: keeps samples with video motion scores within a specific range. #227
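
Most of the filters above share one pattern: compute a per-video statistic, then keep the sample if that statistic falls within a configured range. The sketch below illustrates the pattern for the aspect ratio r=w/h. It is not Data-Juicer's implementation: the function names, the `any_or_all` switch, and the choice to keep samples without videos are assumptions made for illustration.

```python
from fractions import Fraction


def aspect_ratio(width: int, height: int) -> Fraction:
    """Aspect ratio r = w / h, kept exact as a fraction."""
    return Fraction(width, height)


def keep_sample(video_sizes, min_ratio, max_ratio, any_or_all="any"):
    """Decide whether a sample passes a range filter on aspect ratio.

    video_sizes: list of (width, height) pairs, one per video in the sample.
    any_or_all:  hypothetical switch; "any" keeps the sample if at least one
                 video is in range, "all" requires every video to be in range.
    """
    flags = [min_ratio <= aspect_ratio(w, h) <= max_ratio
             for w, h in video_sizes]
    if not flags:
        # Assumption: a sample without videos is not filtered out.
        return True
    return any(flags) if any_or_all == "any" else all(flags)
```

For example, `keep_sample([(1920, 1080), (640, 640)], Fraction(3, 2), 2, "all")` rejects the sample because the square video has r=1, which is outside [3/2, 2].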

Mapper

  • video_split_by_scene_mapper: splits videos into scene clips. #227
  • video_split_by_duration_mapper: splits videos by specified duration interval. #227
  • video_split_by_key_frame_mapper: splits videos by their keyframes. #227
  • video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227
  • video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
  • video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227
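
To make the idea of splitting by a duration interval concrete, here is a minimal sketch that only computes the clip boundaries. The actual cutting (for example with ffmpeg) is left out, and the function name and the handling of the shorter last clip are assumptions, not the behavior of video_split_by_duration_mapper.

```python
def split_intervals(total_seconds, clip_seconds):
    """Return (start, end) boundaries that cover [0, total_seconds].

    Each clip is clip_seconds long except possibly the last one, which is
    kept as a shorter remainder (an assumption of this sketch).
    """
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    intervals = []
    start = 0
    while start < total_seconds:
        end = min(start + clip_seconds, total_seconds)
        intervals.append((start, end))
        start = end
    return intervals
```

A 25-second video split every 10 seconds yields three clips: `[(0, 10), (10, 20), (20, 25)]`.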

Deduplicator

  • video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227
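
Exact-match deduplication is commonly done by hashing the raw bytes and keeping only the first sample for each hash. The sketch below shows that technique on in-memory byte strings. The sample layout (`"videos"` as a list of bytes) and the use of SHA-256 are assumptions for illustration and may differ from what video_deduplicator does internally.

```python
import hashlib


def sample_key(sample) -> str:
    """Combine the content hashes of all videos in a sample into one key."""
    combined = hashlib.sha256()
    for blob in sample["videos"]:
        # Hash each video separately so that byte boundaries between
        # videos cannot produce accidental collisions.
        combined.update(hashlib.sha256(blob).digest())
    return combined.hexdigest()


def deduplicate(samples):
    """Keep the first sample for every distinct set of video bytes."""
    seen = set()
    kept = []
    for sample in samples:
        key = sample_key(sample)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept
```

Because only identical bytes hash to the same key, re-encoded or trimmed copies of a video are not detected by this approach.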

Audio

  • audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
  • audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
  • audio_nmf_snr_filter: keeps samples whose audios' signal-to-noise ratios (SNRs, computed with the Non-Negative Matrix Factorization algorithm) are within a specified range. #189
  • audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227

Image

  • image_blur_mapper: adds random noises to images to blur them. #180
  • image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227

Document Updates

  • "Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what they look like.
  • Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
  • Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
  • OP Insight Visualization Demo code: adds a demo to visualize how each OP works.

Bugs Fixed

  • Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
  • Fix the bug that some images will be lost when converting their paths to absolute paths. #178
  • Fix the dependency problems of OPs that depend on other OPs. #181
  • Fix the bug that the predict.py tool gets stuck on the help page. #183
  • Fix face_area_filter: constrains the detection coordinates within the image. #202
  • Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
  • Fix or update invalid links in Data-Juicer. #201 #219

Others

  • Optimize the model management module. #196 #227
  • Optimize the unit test actions. #195 #196 #216 #227
  • Optimize the multiprocessing strategy; model inference efficiency can also be improved thanks to GPU support. #203 #217 #222 #227
  • Update the docker image with JDK. #208
  • Support more multimodal (video) dataset conversion tools: #227
    • InternVid: 234M video-caption data
    • Youku-mPLUG: 36TB video-caption data
    • Video-ChatGPT: 100k video-instruction data
  • Optimize the generated multimodal data storage. #227
  • Support running data-juicer process jobs on Aliyun PAI-DLC. #227
  • Better support for multi-machine distributed data processing in Ray mode. #227

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

  • @liuyanyi helps to fix a bug in quality classifier tools. #183
  • @co63oc helps to fix some typos. #215
  • @liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208
  • @zhenqincn helps to add more papers to the Awesome LLM Data doc. #226

v0.1.3

4 months ago

New Features

  • Data-Juicer now supports Python 3.7-3.10!
    • We released a pybind version of simhash-py library named simhash-pybind to solve the Python version limitation problem.
    • We tested several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validated their availability on different Python versions.
  • Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
    • A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
    • Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, ...
    • Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
  • Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. #65 #140
  • Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP replace_content_mapper. #143
  • Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. #160
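
As an illustration of the 3-sigma rule mentioned above, a filter range can be derived from the distribution of a statistic: values farther than three standard deviations from the mean are treated as outliers. This is a generic sketch of the rule, not the Auto-HPO tool itself, and the function name is hypothetical.

```python
import statistics


def three_sigma_range(values):
    """Return (low, high) = (mean - 3*std, mean + 3*std) for the values."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return mean - 3 * std, mean + 3 * std


# Example: per-sample text lengths with one extreme outlier.
lengths = [10] * 20 + [1000]
low, high = three_sigma_range(lengths)
kept = [v for v in lengths if low <= v <= high]
```

Here the 20 typical values stay inside the range while the value 1000 falls outside it, so `low` and `high` could serve as the minimum and maximum thresholds of a length-based filter.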

New OPs

Text

  • chinese_convert_mapper: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc) #51
  • remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
  • text_action_filter: keeps samples containing action verbs in their texts. #122
  • text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
  • replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
  • remove_repeat_sentences_mapper: removes repeated sentences in the text. #149
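
The behavior described for replace_content_mapper corresponds to a plain regular-expression substitution. The sketch below shows the idea with Python's standard `re` module; the function name, the simplified email pattern, and the `<EMAIL>` placeholder are assumptions for illustration and are not taken from Data-Juicer.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+\.[\w.-]+"


def replace_content(text: str, pattern: str, repl: str = "") -> str:
    """Replace every match of pattern in text with repl.

    With the default empty repl, matches are simply removed.
    """
    return re.sub(pattern, repl, text)
```

For instance, `replace_content("contact a.b@example.com now", EMAIL_PATTERN, "<EMAIL>")` gives `"contact <EMAIL> now"`, while omitting the replacement string deletes the address.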

Image

  • image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
  • image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
  • image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
  • face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
  • image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72

Multimodal

  • image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
  • image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
  • phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139
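
The similarity-based filters compare an image embedding with a text embedding. The sketch below shows only the cosine-similarity and range-check steps on plain Python lists. Obtaining the embeddings from a CLIP model is out of scope here, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # assumption: treat zero vectors as dissimilar
    return dot / norm


def keep_pair(image_vec, text_vec, min_score, max_score) -> bool:
    """Keep the sample if the similarity lies within [min_score, max_score]."""
    return min_score <= cosine_similarity(image_vec, text_vec) <= max_score
```

An upper bound can be as useful as a lower one, for example to drop image-text pairs that are suspiciously close to identical.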

Bugs Fixed

  • Pin pandas==2.0.0 and fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
  • Fix the bug where OPs nlpaug_en_mapper and nlpcda_zh_mapper generate indefinite numbers of augmented samples. #76
  • Fix the bug where maximum_line_length_filter might generate stats of inconsistent types (int vs. float), which leads to an error when processing datasets. #147
  • Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. #155 #157
  • Fix command-line argument parsing errors in some cases. #108 #165
  • Store simhash value as string type to avoid errors from PyArrow. #168 #170

Others

  • Dependency importing optimization: only require and import some dependencies when they are actually used. #35 #82
  • Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
  • Optimize the cache directory selection logic. #43
  • Support limiting the number of samples when mixing datasets. #86
  • Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
  • OP language_id_score_filter supports keeping samples in multiple languages now. #125 #151

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

  • @JONGSKY helps to remove some unnecessary code. #85
  • @xuruidong helps to fix several broken links in the README doc. #142

v0.1.2

7 months ago

New OPs

  • nlpaug_en_mapper: simple data augmentation using nlpaug library for English corpus. #17
  • nlpcda_zh_mapper: simple data augmentation using nlpcda library for Chinese corpus. #17
  • token_num_filter: filters samples by the number of tokens in them. HF tokenizers are supported. #24
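
A token-count filter reduces to tokenizing the text and checking the count against a range. The sketch below uses whitespace splitting as a stand-in tokenizer so that it runs without extra dependencies; any callable that returns a list of tokens could be passed instead. The function and parameter names are assumptions, not the OP's actual signature.

```python
def keep_by_token_num(text, min_num, max_num, tokenize=str.split):
    """Keep the text if its token count lies within [min_num, max_num].

    tokenize: any callable mapping a string to a list of tokens;
              whitespace splitting is only a placeholder here.
    """
    return min_num <= len(tokenize(text)) <= max_num
```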

New Features

  • OP Fusion #14
    • Filters that share the same contextual variables can now be fused into one OP, saving up to 25% of the time when processing datasets.
  • Cache management #19
    • Cache management now works in Data-Juicer thanks to a new serialization method.
    • Cache compression is supported: caches are automatically compressed when they are not in use and decompressed when needed, which saves up to 50% of disk space.
  • Distributed data processing with Ray is supported now. #21
  • Config sys optimization:
    • Only keep text_keys and remove previous misleading arg text_key(s)_to_process/load. #13
    • A new argument export_in_parallel is added to control whether to export the result datasets in parallel. #17
    • Display the config table after config parsing is ready. #17
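
The OP Fusion idea above can be pictured as several filters reading one shared intermediate result instead of each recomputing it. The following sketch shows two word-based filters that share a per-sample context, so the text is split into words only once. All names here are hypothetical, and the real fusion logic in Data-Juicer may be organized differently.

```python
calls = {"split": 0}  # counts how often the text is actually split


def get_words(sample, context):
    """Compute the word list once per sample and cache it in the context."""
    if "words" not in context:
        calls["split"] += 1
        context["words"] = sample["text"].split()
    return context["words"]


def word_num_ok(sample, context, min_num=2, max_num=100):
    return min_num <= len(get_words(sample, context)) <= max_num


def avg_word_len_ok(sample, context, min_len=2, max_len=10):
    words = get_words(sample, context)
    if not words:
        return False
    avg = sum(len(w) for w in words) / len(words)
    return min_len <= avg <= max_len


def fuse(filters):
    """Build one filter that runs all given filters with a shared context."""
    def fused(sample):
        context = {}  # fresh context for each sample
        return all(f(sample, context) for f in filters)
    return fused
```

Calling `fuse([word_num_ok, avg_word_len_ok])` on a sample splits its text a single time even though both filters need the word list, which is where the time saving comes from.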

Others

  • Replace original string constants with constant enums. #13
  • Expand the checkpoint protection range to cover the exporting process. #14
  • Remove extra intermediate variables storage in document_simhash_deduplicator to save more memory. #14
  • Docs updates. #15 #16
  • The PyPI package is available. You can now install data-juicer with pip install py-data-juicer. #23
  • Docker building is available now. The official docker image for Docker Hub is in progress. #23
  • Deploy the unit tests for Data-Juicer. #29

v0.1.0

9 months ago

Summarization - Table of Contents

  • Data-Juicer: A Data-Centric Text Processing System for Large Language Models
  • Table of Contents
    • Features
    • Prerequisites
    • Installation
    • Quick Start
      • Data Processing
      • Data Analysis
      • Data Visualization
      • Build Up Config Files
      • Preprocess raw data (Optional)
    • Documentation
    • Data Recipes
    • Demos
    • License
    • Contributing
    • References

Features

  • Broad Range of Operators: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.

  • Specialized Toolkits: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.

  • Systematic & Reusable: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.

  • Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.

  • Comprehensive Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, SFT, en, zh, and more scenarios.

  • User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.

  • Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.

  • Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.