FastTextClassification is all you need for text classification! Fast text classification for everyone, enjoy your NLP journey! This may be the most comprehensive open-source text classification project to date, supporting both Chinese and English, multiple models, and multiple tasks.


FastTextClassification


Fast text classification for you. Start your NLP journey!

Open source / Simple / Comprehensive / Practical

All features are free and the code is open source, so use it with confidence. Contributions are welcome!

Latest News

  • 2024/05/06:
    • ERNIE now supports multi-GPU and Slurm distributed training
    • The number of MLP layers, the pretrained model, and more can be changed via the config file
    • Binary, multi-class, and multi-label classification are supported
  • 2024/04/30: Cloned the original project for continued development
  • 2023/03/23: FastTextClassification V0.0.1 officially open-sourced. Release features:
    • Text classification for both Chinese and English
    • Multiple model families: traditional shallow machine-learning models, deep-learning models, and transformer models
    • Multi-label text classification
    • Multiple embedding options: inner/outer/random

Development Plan

The goal of this project is to build the most comprehensive and practical text classification project and tutorial available, and, if possible, to eventually turn it into an out-of-the-box text classification tool. Text classification is in an odd position: it is usually considered simple and fundamental, yet genuinely general-purpose tools are hard to find, and models are almost always trained and deployed for one specific task. In an era when NLP is converging toward unified models, this is inelegant and wasteful. *Fast text classification for you. Start your NLP journey!*

A brief roadmap

  1. [P3] Chinese and English text classification: 100% done; support for other languages is also welcome
  2. [P0] Multiple text classification models: mostly done; additions welcome
    1. Shallow text classification models: done
    2. [P1] DNN-family models: common models supported
    3. [P0] Transformer-family models: BERT/ERNIE, etc.
    4. [P0] Prompt learning for text classification: TODO
    5. [P0] ChatGPT for text classification: TODO
  3. [P1] Multi-label text classification:
    1. Multiple multi-label classification losses: done; if anything is missing, contributions are welcome
    2. Complex multi-label settings (e.g., hierarchical classification): TODO
  4. [P0] Support for different text classification datasets/tasks: such tasks are numerous and scattered, which is both a blessing and a curse; reports of results on various datasets using this project are welcome
  5. [P4] A simple, easy-to-use text classification API: the ultimate goal is a sufficiently general and powerful model behind a natural-language interface, text_cls(text, candidate_labels) -> label, which takes a text and candidate labels (with defaults) and returns the text's label, while generalizing to specific domains at little or no cost (see the sketch after this list)
  6. Multi-GPU and cluster training: TODO
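
To make item 5 concrete, here is a minimal sketch of what such an interface could look like today, approximated with Hugging Face's zero-shot-classification pipeline. The model choice and the helper function are illustrative assumptions, not part of this project:

```python
# Illustrative only: approximates the envisioned text_cls(text, candidate_labels)
# API with an off-the-shelf zero-shot model; NOT this project's implementation.
from transformers import pipeline

# facebook/bart-large-mnli is a commonly used zero-shot backbone (an assumption
# here; any NLI-finetuned model would work).
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def text_cls(text: str, candidate_labels: list[str]) -> str:
    """Return the most likely label for `text` among `candidate_labels`."""
    result = classifier(text, candidate_labels)
    return result["labels"][0]  # labels come back sorted by descending score

print(text_cls("The match ended 3-1 after extra time.",
               ["sports", "finance", "politics"]))  # -> "sports"
```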

Usage

1. Clone this repository

git clone https://github.com/fast-llm/FastTextClassification.git

2. Download and preprocess the datasets

Please download the datasets yourself and place them under the data directory. All data should be normalized to a text+label format, separated by \t or a comma. An automated script may be added later; until then, please process the data yourself or refer to preprocessing.py.

It is best to keep all data under the data directory, e.g., data/dbpedia, with three subdirectories: input for the raw dataset (as downloaded), data for the preprocessed, formatted dataset (text-label format), and saved_dict for training outputs (models, logs, etc.).
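
As a concrete illustration of this layout, here is a minimal preprocessing sketch. It is not the project's preprocessing.py; the raw file name and its CSV columns are hypothetical:

```python
# A minimal preprocessing sketch (NOT the project's preprocessing.py). It assumes
# a hypothetical raw CSV with "text" and "label" columns under data/<name>/input
# and writes the text<TAB>label files described above into data/<name>/data.
import csv
import os
import random

def preprocess(name: str, raw_file: str, valid_ratio: float = 0.1) -> None:
    in_path = os.path.join("data", name, "input", raw_file)
    out_dir = os.path.join("data", name, "data")
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(os.path.join("data", name, "saved_dict"), exist_ok=True)

    with open(in_path, newline="", encoding="utf-8") as f:
        # Tabs or newlines inside the text would break the format, so strip them.
        rows = [(r["text"].replace("\t", " ").replace("\n", " "), r["label"])
                for r in csv.DictReader(f)]
    random.shuffle(rows)

    n_valid = int(len(rows) * valid_ratio)
    for fname, split in {"valid.txt": rows[:n_valid],
                         "train.txt": rows[n_valid:]}.items():
        with open(os.path.join(out_dir, fname), "w", encoding="utf-8") as f:
            f.writelines(f"{text}\t{label}\n" for text, label in split)

preprocess("dbpedia", "raw.csv")  # file name is hypothetical
```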

3. Run the demos

The development environment below has been tested and is provided for reference; similar environments should also work.

  • python:3.10
  • torch:2.3.0
  • transformers:4.39.1

conda create -n fasttext python=3.10
conda activate fasttext

pip install poetry
poetry install

Pick the module you need and run it; see the next section for details.

python run.py

Examples

1. Run DNN/transformer models for text classification

python run.py

2. Run traditional shallow machine-learning models for text classification

python run_shallow.py

3. Run DNN/transformer models for multi-label text classification

python run_multi_label.py

The table below gives reference results from running the demos directly:

Environment: Python 3.6 + T4 GPU

| Demo | Dataset | Example model | Acc | Runtime | Notes |
|------|---------|---------------|-----|---------|-------|
| run.py | THUCNews/cn | TextCNN | 89.94% | ~2 min | |
| run_multi_label.py | rcv1/en | bert | 61.04% | ~40 min | see the run output for other metrics |
| run_shallow.py | THUCNews/cn | NB | 89.44% | 105.34 ms | |

Results (continuously updated)

Detailed experiments, from shallow models to deep models to multi-label classification, are provided below for reference. Due to limited time and compute, many experiments may not be fully tuned; please keep this in mind. Contributions of additional experiments, code, and new models are very welcome. Let's build FastTextClassification together.

For now only a partial summary of the results is provided; the detailed results and parameters will be added later, as there is a lot to organize.

1. Traditional shallow text classification models

| Data | Model | Tokenizer | Min word len | Min_df | ngram | binary | Use_idf | Test acc | Notes |
|------|-------|-----------|--------------|--------|-------|--------|---------|----------|-------|
| THUCNews/cn | LR | lcut | 1 | 2 | (1,1) | False | True | 90.61% | C=1.0, max_iter=1000; vocab 61549; train 94.22%, valid 89.84%, test 90.61%; training time 175070.97 ms |
| THUCNews/cn | MultinomialNB(alpha=0.3) | lcut | 1 | 2 | (1,1) | False | True | 89.86% | vocab 61549; training time 94.18 ms |
| THUCNews/cn | ComplementNB(alpha=0.8) | lcut | 1 | 2 | (1,1) | False | True | 89.88% | vocab 61549; training time 98.31 ms |
| THUCNews/cn | SVC(C=1.0) | lcut | 1 | 2 | (1,1) | False | True | 81.49% | vocab 61549; dim 200; training time 7351155.59 ms; train 85.95%, valid 80.07%, test 81.49% |
| THUCNews/cn | DT | lcut | 1 | 2 | (1,1) | False | True | 71.19% | max_depth=None; training time 149216.53 ms; train 99.97%, valid 70.57%, test 71.19% |
| THUCNews/cn | xgboost | lcut | 1 | 2 | (1,1) | False | True | 90.08% | XGBClassifier(n_estimators=2000, eta=0.3, gamma=0.1, max_depth=6, subsample=1, colsample_bytree=0.8, nthread=10); training time 1551260.28 ms; train 99.00%, valid 89.34%, test 90.08% |
| THUCNews/cn | KNN | lcut | 1 | 2 | (1,1) | False | True | 85.17% | k=10; training time 21.24 ms; train 89.05%, valid 84.53%, test 85.17% |
| dbpedia/en | LR | None | 2 | 2 | (1,1) | False | True | 98.26% | C=1.0, max_iter=100; vocab 237777; training time 220177.59 ms; train 98.85%, valid 98.19%, test 98.26% |
| dbpedia/en | MultinomialNB(alpha=1.0) | None | 2 | 2 | (1,1) | False | True | 95.35% | training time 786.24 ms; train 96.36%, valid 95.34%, test 95.35% |
| dbpedia/en | ComplementNB(alpha=1.0) | None | 2 | 2 | (1,1) | False | True | 93.73% | training time 805.69 ms; train 95.30%, valid 93.79%, test 93.73% |
| dbpedia/en | SVC(C=1.0) | None | 2 | 2 | (1,1) | False | True | 94.67% | dim 200; max_iter=100; training time 144163.81 ms; train 94.75%, valid 94.59%, test 94.67%; note: SVM compute and memory cost grow with the square of the sample count |
| dbpedia/en | DT | None | 2 | 2 | (1,1) | False | True | 92.41% | max_depth=100, min_samples_leaf=5; training time 639744.56 ms; train 95.79%, valid 92.43%, test 92.41% |
| dbpedia/en | xgboost | None | 2 | 2 | (1,1) | False | True | 97.99% | XGBClassifier(n_estimators=200, eta=0.3, gamma=0.1, max_depth=6, subsample=1, colsample_bytree=0.8, nthread=10, reg_alpha=0, reg_lambda=1); training time 1838434.42 ms; train 99.35%, valid 97.96%, test 97.99% |
| dbpedia/en | KNN | None | 2 | 2 | (1,1) | False | True | 80.05% | k=10; training time 137.72 ms; train 84.66%, valid 80.20%, test 80.05% |
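
As a reference for reading the table, here is a sketch of how such a shallow baseline is typically assembled with scikit-learn; the project's run_shallow.py may differ in details. The hyperparameters mirror the THUCNews MultinomialNB row, and the toy training data is made up:

```python
# Sketch of a shallow TF-IDF + Naive Bayes baseline matching the table columns:
# jieba's lcut tokenizer, min_df=2, unigrams, binary=False, use_idf=True, alpha=0.3.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut,   # "Tokenizer" column; None for English
                    token_pattern=None,     # silence warning when tokenizer is set
                    min_df=2,               # "Min_df" column
                    ngram_range=(1, 1),     # "ngram" column
                    binary=False,           # "binary" column
                    use_idf=True),          # "Use_idf" column
    MultinomialNB(alpha=0.3),
)

train_texts = ["今天股市大涨", "股市行情低迷", "球队赢得比赛", "球队比赛失利"]  # toy data
train_labels = ["finance", "finance", "sports", "sports"]
model.fit(train_texts, train_labels)
print(model.predict(["明天有比赛"]))  # -> ['sports']
```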

2. Deep learning text classification models

| Data | Model | Embed | Bz | Lr | Epochs | Acc | Notes |
|------|-------|-------|----|----|--------|-----|-------|
| THUCNews/cn | TextCNN | outer | 128 | 1e-3 | 3/20 | 90.45% | |
| THUCNews/cn | TextRNN | - | - | 1e-3 | 5/10 | 90.38% | |
| THUCNews/cn | TextRNN_Att | - | - | 1e-3 | 2/10 | 90.55% | |
| THUCNews/cn | TextRCNN | - | - | 1e-3 | 3/10 | 91.01% | |
| THUCNews/cn | DPCNN | - | - | 1e-3 | 3/20 | 90.12% | |
| THUCNews/cn | FastText | - | - | 1e-3 | 5/20 | 90.48% | |
| THUCNews/cn | bert | inner | - | 5e-5 | 2/3 | 94.10% | bert-base-chinese |
| THUCNews/cn | ERNIE | inner | - | 5e-5 | 3/3 | 94.58% | ernie-3.0-base-zh |
| THUCNews/cn | bert_CNN | - | - | - | 3/3 | 94.14% | |
| THUCNews/cn | bert_RNN | - | - | - | 3/3 | 93.92% | |
| THUCNews/cn | bert_RNN | - | - | - | 3/3 | 94.45% | |
| THUCNews/cn | bert_RCNN | - | - | - | 3/3 | 94.32% | |
| THUCNews/cn | bert_DPCNN | - | - | - | 3/3 | 94.17% | |
| dbpedia/en | TextCNN | outer | 128 | 5e-5 | 9/20 | 98.35% | glove |
| dbpedia/en | TextRNN | - | - | - | 6/10 | 97.97% | |
| dbpedia/en | TextRNN_Att | - | - | - | 4/10 | 97.80% | |
| dbpedia/en | TextRCNN | - | - | - | 3/10 | 97.71% | |
| dbpedia/en | DPCNN | - | - | - | 3/20 | 97.86% | |
| dbpedia/en | FastText | - | - | - | 10/20 | 97.84% | |
| dbpedia/en | bert | inner | - | 5e-5 | 2/3 | 97.78% | bert-base-uncased |
| dbpedia/en | ERNIE | - | - | - | 2/10 | 97.75% | ernie-2.0-base-en |
| dbpedia/en | bert_CNN | - | - | - | 2/3 | 97.91% | |
| dbpedia/en | bert_RNN | - | - | - | 2/3 | 97.87% | |
| dbpedia/en | bert_RCNN | - | - | - | 2/3 | 98.04% | |
| dbpedia/en | bert_DPCNN | - | - | - | 2/3 | 97.95% | |
| dbpedia/en | gpt | - | - | - | 3/3 | 97.03% | |
| dbpedia/en | gpt2 | - | - | - | 3/3 | 97.00% | |
| dbpedia/en | T5 | - | - | - | 3/3 | 96.57% | |
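
For orientation, a minimal TextCNN of the kind benchmarked above (Kim, 2014) can be sketched as follows; dimensions are illustrative and the project's actual model code may differ:

```python
# Minimal TextCNN sketch: embed tokens, apply parallel 1-D convolutions with
# several kernel sizes, max-pool over time, and classify the concatenation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, num_classes=10,
                 kernel_sizes=(2, 3, 4), num_filters=256, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                          # x: (batch, seq_len) token ids
        emb = self.embedding(x).transpose(1, 2)    # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))  # logits

logits = TextCNN(vocab_size=5000)(torch.randint(1, 5000, (8, 32)))
print(logits.shape)  # torch.Size([8, 10])
```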

3. Multi-label text classification

| Data | Model | Hierarchical | Samples | Embed | Loss | Bz | Lr | Epochs | Test acc (exact match) | Micro-F1 | Macro-F1 | Notes |
|------|-------|--------------|---------|-------|------|----|----|--------|------------------------|----------|----------|-------|
| Rcv1/en | TextCNN | - | all | outer | multi_label_circle_loss | 128 | 1e-3 | 9/20 | 51.02% | 0.7904 | 0.4515 | eval_activate = None, cls_threshold = 0 |
| Rcv1/en | TextRNN | - | - | - | - | - | - | 13/20 | 54.00% | 0.7950 | 0.4358 | |
| Rcv1/en | TextRNN_Att | - | - | - | - | - | - | 11/20 | 53.97% | 0.8011 | 0.4538 | |
| Rcv1/en | TextRCNN | - | - | - | - | - | - | 10/20 | 53.62% | 0.8111 | 0.4900 | |
| Rcv1/en | DPCNN | - | - | - | - | - | - | 10/20 | 51.66% | 0.7890 | 0.4111 | |
| Rcv1/en | FastText | - | - | - | - | - | - | 12/20 | 51.31% | 0.7936 | 0.4728 | |
| Rcv1/en | bert | - | all | inner | - | 128 | 2e-5 | 20/20 | 61.04% | 0.8454 | 0.5729 | bert-base-cased |
| Rcv1/en | ERNIE | - | all | inner | - | 128 | 2e-5 | 20/20 | 61.67% | 0.8486 | 0.5861 | ernie-2.0-base-en |
| Rcv1/en | Bert_CNN | - | all | inner | - | 128 | 2e-5 | 12/20 | 58.31% | 0.8364 | 0.5736 | same config as bert |
| Rcv1/en | Bert_RNN | - | all | inner | - | 128 | 2e-5 | 17/20 | 60.48% | 0.8371 | 0.5640 | |
| Rcv1/en | Bert_RCNN | - | all | inner | - | 128 | 2e-5 | 15/20 | 60.54% | 0.8457 | 0.5969 | |
| Rcv1/en | Bert_DPCNN | - | all | inner | - | 128 | 2e-5 | 13/20 | 56.52% | 0.8082 | 0.4273 | |
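
The multi_label_circle_loss in the first row is commonly rendered in PyTorch as Su Jianlin's multilabel categorical cross-entropy (2020); a sketch follows (the project's exact implementation may differ). With this loss, decoding needs no activation and uses a threshold of 0, which matches eval_activate = None and cls_threshold = 0 in the notes:

```python
# loss = log(1 + sum_{i in neg} e^{s_i}) + log(1 + sum_{j in pos} e^{-s_j})
import torch

def multi_label_circle_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """y_pred: raw logits (batch, num_labels); y_true: 0/1 targets."""
    y_pred = (1 - 2 * y_true) * y_pred          # flip the sign of positive logits
    y_pred_neg = y_pred - y_true * 1e12         # mask positives out of the neg term
    y_pred_pos = y_pred - (1 - y_true) * 1e12   # mask negatives out of the pos term
    zeros = torch.zeros_like(y_pred[..., :1])   # the e^0 = 1 inside each log
    neg_loss = torch.logsumexp(torch.cat([y_pred_neg, zeros], dim=-1), dim=-1)
    pos_loss = torch.logsumexp(torch.cat([y_pred_pos, zeros], dim=-1), dim=-1)
    return (neg_loss + pos_loss).mean()

logits = torch.randn(4, 103)                    # e.g., RCV1 has 103 topic labels
targets = (torch.rand(4, 103) > 0.9).float()
print(multi_label_circle_loss(logits, targets))
print((logits > 0).float())                     # decoding at cls_threshold = 0
```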

Common Errors

References & Acknowledgements

A Survey on Text Classification: From Shallow to Deep Learning: https://arxiv.org/pdf/2008.00364.pdf

Deep Learning-Based Text Classification: A Comprehensive Review: https://arxiv.org/pdf/2004.03705.pdf

https://github.com/649453932/Chinese-Text-Classification-Pytorch

https://github.com/649453932/Bert-Chinese-Text-Classification-Pytorch

https://github.com/facebookresearch/fastText

https://github.com/brightmart/text_classification

https://github.com/kk7nc/Text_Classification

https://github.com/Tencent/NeuralNLP-NeuralClassifier

https://github.com/vandit15/Class-balanced-loss-pytorch

https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

Sponsor Us

Starchart


Contributors
