A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs!
video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200
video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
video_aspect_ratio_filter: filters samples according to the aspect ratios of the videos in them (width divided by height, r=w/h). #227
video_resolution_filter: filters samples according to the resolution of videos in them. #227
video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
video_motion_score_filter: keeps samples with video motion scores within a specific range. #227
video_split_by_scene_mapper: splits videos into scene clips. #227
video_split_by_duration_mapper: splits videos by a specified duration interval. #227
video_split_by_key_frame_mapper: splits videos by their keyframes. #227
video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (width divided by height, r=w/h) to a specified range. #227
video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227
video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227
audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
audio_nmf_snr_filter: keeps samples whose audios' signal-to-noise ratios (computed with the Non-Negative Matrix Factorization algorithm) are within a specified range. #189
audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227
image_blur_mapper: adds random noise to images to blur them. #180
image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227
predict.py tool gets stuck on the help page. #183
face_area_filter: constrains the detection coordinates within the image. #202

Here we thank public contributors for their PRs to make Data-Juicer better!
simhash-pybind to solve the Python version limitation problem.
replace_content_mapper. #143
chinese_convert_mapper: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc). #51
remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
text_action_filter: keeps samples containing action verbs in their texts. #122
text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
remove_repeat_sentences_mapper: removes repeated sentences in the text. #149
image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72
image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139
pandas==2.0.0 fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
nlpaug_en_mapper and nlpcda_zh_mapper generate indefinite numbers of augmented samples. #76
maximum_line_length_filter might generate unaligned types of stats (int vs. float), which leads to an error when processing datasets. #147
language_id_score_filter supports keeping samples in multiple languages now. #125 #151

Here we thank public contributors for their PRs to make Data-Juicer better!
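
Several of the filters listed above keep a sample only when a computed statistic falls inside a configured range. As a rough illustration of that pattern (not Data-Juicer's actual implementation), the sketch below applies it to the aspect ratio r = w/h. The function names, the sample layout, and the thresholds are all assumptions made for this example.

```python
from typing import Any, Dict, List

Sample = Dict[str, Any]


def aspect_ratio(width: int, height: int) -> float:
    """Return the aspect ratio r = w / h."""
    if height <= 0:
        raise ValueError("height must be positive")
    return width / height


def keep_in_range(samples: List[Sample], min_ratio: float,
                  max_ratio: float) -> List[Sample]:
    """Keep samples whose aspect ratio lies in [min_ratio, max_ratio]."""
    kept = []
    for sample in samples:
        ratio = aspect_ratio(sample["width"], sample["height"])
        if min_ratio <= ratio <= max_ratio:
            kept.append(sample)
    return kept


# Hypothetical samples: only the image dimensions matter here.
samples = [
    {"id": "a", "width": 1920, "height": 1080},  # r is about 1.78
    {"id": "b", "width": 300, "height": 1200},   # r = 0.25
    {"id": "c", "width": 800, "height": 800},    # r = 1.0
]

kept = keep_in_range(samples, min_ratio=0.5, max_ratio=2.0)
print([s["id"] for s in kept])  # ['a', 'c']
```

The same shape (compute a per-sample statistic, then compare it against a minimum and maximum) would apply to sizes, durations, or similarity scores; only the statistic changes.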
nlpaug_en_mapper: simple data augmentation using the nlpaug library for English corpus. #17
nlpcda_zh_mapper: simple data augmentation using the nlpcda library for Chinese corpus. #17
token_num_filter: filters out samples by the number of tokens in them. HF tokenizers are supported. #24
text_keys and remove previous misleading arg text_key(s)_to_process/load. #13
export_in_parallel is added to control whether to export the result datasets in parallel. #17
document_simhash_deduplicator to save more memory. #14
pip install py-data-juicer now. #23

Broad Range of Operators: Equipped with 50+ core operators (OPs), including Formatters, Mappers, Filters, Deduplicators, and beyond.
Specialized Toolkits: Feature-rich specialized toolkits such as Text Quality Classifier, Dataset Splitter, Analysers, Evaluators, and more that elevate your dataset handling capabilities.
Systematic & Reusable: Empowering users with a systematic library of reusable config recipes and OPs, designed to function independently of specific datasets, models, or tasks.
Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with real-time multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.
Comprehensive Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, SFT, English (en), Chinese (zh), and other scenarios.
User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.
Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.
Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.
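
To make the idea of combining OPs through configuration more concrete, here is a minimal sketch of a config-driven pipeline. It is not Data-Juicer's API: the toy OPs `lowercase_mapper` and `min_length_filter`, the registries, and the config layout are hypothetical and exist only to show how adding or removing an entry in a config changes the processing.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]


# Hypothetical toy OPs; names and signatures are illustrative only.
def lowercase_mapper(sample: Sample) -> Sample:
    """Mapper: return a copy of the sample with lowercased text."""
    return {**sample, "text": sample["text"].lower()}


def min_length_filter(sample: Sample, min_len: int = 1) -> bool:
    """Filter: keep the sample if its text has at least min_len characters."""
    return len(sample["text"]) >= min_len


MAPPERS: Dict[str, Callable[..., Sample]] = {
    "lowercase_mapper": lowercase_mapper,
}
FILTERS: Dict[str, Callable[..., bool]] = {
    "min_length_filter": min_length_filter,
}


def run_pipeline(samples: List[Sample],
                 process: List[Dict[str, Dict[str, Any]]]) -> List[Sample]:
    """Apply each configured OP in order to the list of samples."""
    for op in process:
        # Each config entry is a single-key dict: {op_name: kwargs}.
        (name, kwargs), = op.items()
        if name in MAPPERS:
            samples = [MAPPERS[name](s, **kwargs) for s in samples]
        elif name in FILTERS:
            samples = [s for s in samples if FILTERS[name](s, **kwargs)]
        else:
            raise KeyError(f"unknown OP: {name}")
    return samples


# Adding or removing an entry here changes the pipeline without code edits.
config = [
    {"lowercase_mapper": {}},
    {"min_length_filter": {"min_len": 5}},
]

data = [{"text": "Hello World"}, {"text": "Hi"}]
result = run_pipeline(data, config)
print(result)  # [{'text': 'hello world'}]
```

Keeping OPs behind a name-to-function registry is one simple way to let a config select and order them; a custom OP would be added by registering one more function.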