Arabic News Article Classification

Automatic document categorization consists of assigning a category to a text based on the information it contains. This project compares several supervised machine learning approaches to the task.

Based on: Building TALAA, a Free General and Categorized Arabic Corpus

University of Science and Technology Houari Boumediene, Algiers, Algeria


Corpus

"The TALAA corpus is a voluminous general Arabic corpus, built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens contained in 57,827 different articles." [1]

Description of the TALAA corpus [1]:

| Features | Corpus |
|---|---|
| Nb. of articles | 57,827 |
| Nb. of categories | 8 |
| Nb. of words | 14,068,407 |
| Nb. of types | 582,531 |
| Nb. of tokens | 15,891,729 |

The corpus is distributed across 8 categories [1]:

| Category | Nb. of articles |
|---|---|
| Culture | 5322 |
| Economic | 8768 |
| Politics | 9620 |
| Religion | 4526 |
| Society | 9744 |
| Sports | 9103 |
| World | 6344 |
| Other | 4400 |
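The per-category counts can be checked against the corpus total, and their shares give a quick view of class imbalance. A minimal sketch in Python; the name `articles_per_category` is an illustrative choice, not part of the TALAA distribution:

```python
# Article counts per category, copied from the table above.
articles_per_category = {
    "Culture": 5322,
    "Economic": 8768,
    "Politics": 9620,
    "Religion": 4526,
    "Society": 9744,
    "Sports": 9103,
    "World": 6344,
    "Other": 4400,
}

total = sum(articles_per_category.values())

# Share of each category, largest first.
shares = {
    name: count / total
    for name, count in sorted(
        articles_per_category.items(), key=lambda item: item[1], reverse=True
    )
}

for name, share in shares.items():
    print(f"{name:<10}{share:6.1%}")
```

The counts sum to 57,827, matching the reported number of articles. The largest class (Society) is roughly 2.2 times the size of the smallest (Other).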

Pre-processing

The following data pre-processing steps have been performed:

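0. Example

أمرت السلطات القطرية الأسواق و المراكز التجارية في البلاد برفع و إزالة السلع الواردة من السعودية و البحرين و الإمارات و مصر في الذكرى الأولى لإعلان هذه الدول الحصار عليها.

(English: "The Qatari authorities ordered markets and shopping centers in the country to pull and remove goods imported from Saudi Arabia, Bahrain, the UAE, and Egypt, on the first anniversary of these countries announcing the blockade against it.")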

1. Tokenization

Each collected article was segmented into tokens using NLTK.

[ أمرت, السلطات, القطرية, الأسواق, و, المراكز, التجارية, في, البلاد, ب, رفع, و, إزالة, السلع, الواردة, من, السعودية, و, البحرين, و, الإمارات, و, مصر, في, الذكرى, الأولى, ل, إعلان, هذه, الدول, الحصار, عليها, . ]
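As a rough illustration of this step, the sketch below splits the start of the example sentence into word and punctuation tokens with a regular expression from Python's standard library. It is a stand-in, similar in spirit to NLTK's `wordpunct_tokenize`, and not the project's actual NLTK call; the function name `simple_tokenize` is hypothetical. Unlike the token list above, it does not detach clitic prefixes such as ب or ل, which requires a morphological segmenter.

```python
import re

# Runs of word characters, or runs of punctuation; whitespace is dropped.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


# First clause of the example sentence, closed with a full stop.
sentence = "أمرت السلطات القطرية الأسواق و المراكز التجارية في البلاد برفع و إزالة السلع."
tokens = simple_tokenize(sentence)
print(tokens)
```

Here برفع stays a single token, whereas the project's output shows it split into ب and رفع.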

2. Removing stop words

Stop words were then removed from the tokenized text, using a complete, reviewed list of 750 Arabic stop words.

[ أمرت, السلطات, القطرية, الأسواق, المراكز, التجارية, البلاد, رفع, إزالة, السلع, الواردة, السعودية, البحرين, الإمارات, مصر, الذكرى, الأولى, إعلان, الدول, الحصار ]
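Stop-word filtering amounts to a set-membership test per token. In the sketch below, `STOP_WORDS` is a small illustrative subset chosen for this example, not the 750-word list the project uses, and `remove_stop_words` is a hypothetical helper name. Punctuation is dropped in the same pass, as in the output above.

```python
import string

# Illustrative subset only; the project relies on a reviewed list of 750 entries.
STOP_WORDS = {"و", "في", "من", "هذه", "عليها", "ب", "ل"}

# ASCII punctuation plus a few common Arabic punctuation marks.
PUNCTUATION = set(string.punctuation) | {"،", "؛", "؟"}


def remove_stop_words(tokens):
    """Keep tokens that are neither stop words nor punctuation."""
    return [t for t in tokens if t not in STOP_WORDS and t not in PUNCTUATION]


tokens = ["أمرت", "السلطات", "القطرية", "الأسواق", "و", "المراكز", "التجارية", "في", "البلاد", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```

Using a `set` keeps each lookup constant-time, which matters when filtering a corpus of millions of tokens.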

3. Stemming

Each word was stemmed using the Farasa Arabic text processing toolkit.

[ أمر, سلطة, قطر, سوق, مركز, تجاري, بلد, رفع, إزالة, سلعة, وارد, سعودية, بحرين, إمارات, مصر, ذكرى, أول, إعلان, دولة, حصار ]


Dataset

Categories = {الجزائر : Algeria, الثقافة : entertainment, الدين : religion, المجتمع : society, الرياضة : sport, العالم : world}

[Figure: TALAA categories]

Machine Learning Models

Several machine learning algorithms were evaluated:

| Algorithm | Precision | Recall | F-measure |
|---|---|---|---|
| Decision Tree | 0.82 | 0.84 | 0.83 |
| SVM (SGD) | 0.94 | 0.94 | 0.94 |
| Naive Bayes | 0.89 | 0.87 | 0.88 |
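The F-measure is the harmonic mean of precision and recall, F = 2 * P * R / (P + R). The short check below recomputes it from the precision and recall columns of the table. If the reported scores are averages over classes, agreement is not guaranteed in general, so this is a sanity check and not a description of how the author computed them; `f_measure` is a hypothetical helper name.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) from the table above.
reported = {
    "Decision Tree": (0.82, 0.84, 0.83),
    "SVM (SGD)": (0.94, 0.94, 0.94),
    "Naive Bayes": (0.89, 0.87, 0.88),
}

recomputed = {name: round(f_measure(p, r), 2) for name, (p, r, _) in reported.items()}

for name, (p, r, f) in reported.items():
    print(f"{name:<14} reported={f:.2f} recomputed={recomputed[name]:.2f}")
```

For all three models the recomputed value agrees with the reported one to two decimals.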

Evaluation (Confusion matrix)

Confusion matrix for the best model, an SVM trained with stochastic gradient descent:

[Figure: confusion matrix]
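A confusion matrix counts, for every pair of true and predicted labels, how many test articles fall in that cell; per-class precision and recall can then be read off its columns and rows. The sketch below builds one in plain Python. `y_true` and `y_pred` are toy labels invented for illustration, not the project's predictions, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_scores(matrix, labels):
    """Return {label: (precision, recall)} from a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


labels = ["sport", "world", "religion"]
# Toy labels, invented for this example.
y_true = ["sport", "sport", "world", "world", "religion", "religion"]
y_pred = ["sport", "sport", "world", "sport", "religion", "world"]

matrix = confusion_matrix(y_true, y_pred, labels)
scores = per_class_scores(matrix, labels)

for label, row in zip(labels, matrix):
    print(f"{label:<9}{row}")
```

Off-diagonal cells show which categories get confused with each other, which a single precision or recall figure hides.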

TODO


Contributing


Credits

Open Source Agenda is not affiliated with "Arabic News Article Classification" Project. README Source: saidziani/Arabic-News-Article-Classification
