Official implementation of the DPFM @ ICLR 2024 paper "AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts" (https://arxiv.org/abs/2402.07625).
Homepage: https://auto-data-selection.github.io
Featured in Huggingface Daily Papers: https://huggingface.co/papers/2402.07625
AutoMathText is an extensive and carefully curated dataset of around 200 GB of mathematical texts. It is compiled from a diverse range of sources, including various websites, arXiv, and GitHub (OpenWebMath, RedPajama, Algebraic Stack). The content has been autonomously selected (labeled) by the state-of-the-art open-source language model Qwen-72B. Each piece of content in the dataset is assigned a score, lm_q1q2_score, within the range [0, 1], reflecting its relevance, quality, and educational value in the context of mathematical intelligence.
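As a rough illustration of how a score in [0, 1] can be derived from a language model's YES/NO answers, the sketch below applies a softmax over the logits of the two answer tokens and multiplies the results for two questions. The function names are hypothetical, and the assumption that lm_q1q2_score is the product of two per-question scores is inferred from its name, not stated in this README.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax restricted to the YES and NO tokens; the result lies in [0, 1]."""
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def q1q2_score(q1_logits, q2_logits) -> float:
    """Combine two per-question scores by multiplication (an assumption)."""
    # Each argument is a (logit_yes, logit_no) pair for one question.
    return yes_probability(*q1_logits) * yes_probability(*q2_logits)
```

Because each factor lies in [0, 1], the product does too, and it is high only when the model leans towards YES on both questions.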
Huggingface dataset: https://huggingface.co/datasets/math-ai/AutoMathText
The primary aim of the AutoMathText dataset is to provide a comprehensive and reliable resource for a wide array of users, from academic researchers and educators to AI practitioners and mathematics enthusiasts.
"""<system>
You are ChatGPT, the most capable large language model equipped with extensive expertise in
mathematics and coding, particularly skilled in complex reasoning and problem-solving.
In the following interaction, I will provide you with a text excerpt from a website.
Your task is to evaluate whether this text contains elements of mathematical intelligence
and if it is suitable for educational purposes for YOURSELF in the field of mathematics.
Please respond with only YES or NO
<\system>
User: {
"url": "{url}",
"text": "{text}"
}
1. Does the text contain elements of mathematical intelligence? Reply with only YES or NO
2. Is the text suitable for educational purposes for YOURSELF in the field of mathematics? Reply with only YES or NO
Assistant: 1."""
"""<system>
You are ChatGPT, the most capable large language model equipped with extensive expertise in
mathematics and coding, particularly skilled in complex reasoning and problem-solving.
In the following interaction, I will provide you with a text excerpt from the arXiv website.
Your task is to evaluate whether this text contains elements of mathematical intelligence
and if it is suitable for educational purposes for YOURSELF in the field of mathematics.
Please respond with only YES or NO
<\system>
User: {
"Title": "{title}",
"Abstract": "{abstract}",
"Text": "{text}"
}
1. Does the text contain elements of mathematical intelligence? Reply with only YES or NO
2. Is the text suitable for educational purposes for YOURSELF in the field of mathematics? Reply with only YES or NO
Assistant: 1."""
"""<system>
You are ChatGPT, the most capable large language model equipped with extensive expertise in
mathematics and coding, particularly skilled in complex reasoning and problem-solving.
In the following interaction, I will provide you with a code excerpt from a website.
Your task is to evaluate whether this code contains elements of mathematical intelligence
and if it is suitable for educational purposes for YOURSELF in the field of mathematics.
Please respond with only YES or NO
<\system>
User: {
"Repository": "{repo_name}",
"File Path": "{file_url}",
"Code Excerpt": "{text}"
}
1. Does the code contain elements of mathematical intelligence? Reply with only YES or NO
2. Is the code suitable for educational purposes for YOURSELF in the field of mathematics? Reply with only YES or NO
Assistant: 1."""
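The three prompts above mix named placeholders such as {url} and {text} with literal braces around the JSON-like user message, so Python's str.format would misread the literal braces as replacement fields. The following is a minimal sketch of one way to fill such a template. The helper name fill_prompt and the abbreviated template are illustrative assumptions, not the authors' code.

```python
import re

# Abbreviated, illustrative copy of the user portion of the web prompt.
WEB_PROMPT_TAIL = '''User: {
"url": "{url}",
"text": "{text}"
}
Assistant: 1.'''

# Matches only {name} placeholders, not a lone "{" or "}" on its own.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_prompt(template: str, **fields: str) -> str:
    """Substitute known {name} placeholders in one pass, leaving other braces as-is."""
    def substitute(match):
        # Unknown placeholders are left unchanged instead of raising.
        return fields.get(match.group(1), match.group(0))

    return _PLACEHOLDER.sub(substitute, template)


prompt = fill_prompt(WEB_PROMPT_TAIL, url="https://example.com", text="2 + 2 = 4")
```

Substituting in a single pass means that a document whose text happens to contain the string {url} is not rewritten a second time.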
configs:
  - config_name: web-0.50-to-1.00
    default: true
  - config_name: web-0.60-to-1.00
  - config_name: web-0.70-to-1.00
  - config_name: web-0.80-to-1.00
  - config_name: web-full
  - config_name: arxiv-0.50-to-1.00
  - config_name: arxiv-0.60-to-1.00
  - config_name: arxiv-0.70-to-1.00
  - config_name: arxiv-0.80-to-1.00
  - config_name: arxiv-full
  - config_name: code-0.50-to-1.00
  - config_name: code-python-0.50-to-1.00
  - config_name: code-python-0.80-to-1.00
  - config_name: code-full
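The config names appear to encode a subset (web, arxiv, code, code-python) followed by either a score range or the suffix full. The sketch below splits a name into those parts. The helper parse_config_name is hypothetical, and reading the two numbers as lower and upper bounds on lm_q1q2_score is an inference from the naming pattern.

```python
import re
from typing import NamedTuple, Optional


class ConfigInfo(NamedTuple):
    subset: str
    low: Optional[float]   # assumed lower score bound, None for "-full"
    high: Optional[float]  # assumed upper score bound, None for "-full"


# e.g. "code-python-0.80-to-1.00" -> subset "code-python", low 0.80, high 1.00
_RANGE = re.compile(r"^(?P<subset>.+)-(?P<low>\d\.\d+)-to-(?P<high>\d\.\d+)$")


def parse_config_name(name: str) -> ConfigInfo:
    """Split a config name into its subset and (assumed) score range."""
    match = _RANGE.match(name)
    if match:
        return ConfigInfo(match["subset"], float(match["low"]), float(match["high"]))
    if name.endswith("-full"):
        return ConfigInfo(name[: -len("-full")], None, None)
    raise ValueError(f"unrecognized config name: {name!r}")
```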
How to load data:
from datasets import load_dataset
ds = load_dataset("math-ai/AutoMathText", "web-0.50-to-1.00") # or any valid config_name
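After loading a config, it may be useful to apply a stricter score threshold than the one built into the config name. The sketch below filters any iterable of records by lm_q1q2_score. The helper names are hypothetical, and because this README does not say where the field sits in each record, the code assumes it is either a top-level key or nested under a "meta" mapping.

```python
from typing import Any, Iterable, Iterator, Mapping


def get_score(record: Mapping[str, Any]) -> float:
    """Read lm_q1q2_score from the top level or from record["meta"] (both assumed)."""
    if "lm_q1q2_score" in record:
        return float(record["lm_q1q2_score"])
    return float(record["meta"]["lm_q1q2_score"])


def keep_high_scoring(
    records: Iterable[Mapping[str, Any]], min_score: float
) -> Iterator[Mapping[str, Any]]:
    """Lazily yield only the records whose score is at least min_score."""
    for record in records:
        if get_score(record) >= min_score:
            yield record
```

With the dataset loaded as above, this could be called as keep_high_scoring(ds["train"], 0.9), assuming the config exposes a split named train. Because the function is a generator, it also works on a streamed dataset without holding all records in memory.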
We appreciate your use of AutoMathText in your work. If you find this repository helpful, please consider citing it and starring this repo. Feel free to contact [email protected] or open an issue if you have any questions.
@article{zhang2024automathtext,
title={AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts},
author={Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew Chi-Chih},
journal={arXiv preprint arXiv:2402.07625},
year={2024}
}