
################################################
Text Classification Algorithms: A Survey
################################################

|DOI| |Best| |medium| |mendeley| |contributions-welcome| |arXiv| |ansicolortags| |contributors| |twitter|

.. figure:: docs/pic/WordArt.png

Referenced paper: `Text Classification Algorithms: A Survey <https://arxiv.org/abs/1904.08067>`__

|BPW|

##################
Table of Contents
##################

.. contents::
   :local:
   :depth: 4

============
Introduction
============

.. figure:: docs/pic/OverviewTextClassification.png

====================================
Text and Document Feature Extraction
====================================


Text feature extraction and pre-processing are significant steps for classification algorithms. In this section, we start with text cleaning, since most documents contain a great deal of noise. We then discuss the two primary methods of text feature extraction: weighted words and word embeddings.

Text Cleaning and Pre-processing
--------------------------------

In Natural Language Processing (NLP), most text and documents contain many words that are redundant for text classification, such as stopwords, misspellings, and slang. In this section, we briefly explain some techniques and methods for cleaning and pre-processing text documents. In many algorithms, such as statistical and probabilistic learning methods, noise and unnecessary features can negatively affect overall performance, so eliminating these features is extremely important.


Tokenization
~~~~~~~~~~~~

Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The main goal of this step is to extract the individual words in a sentence. Along with text classification, text mining requires a parser in the pipeline to perform the tokenization of documents; for example:

sentence:

.. code::

    After sleeping for four hours, he decided to sleep for another four

In this case, the tokens are as follows:

.. code::

    ['After', 'sleeping', 'for', 'four', 'hours', ',', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four']

Here is Python code for tokenization:

.. code:: python

    from nltk.tokenize import word_tokenize

    text = "After sleeping for four hours, he decided to sleep for another four"
    tokens = word_tokenize(text)
    print(tokens)


Stop words
~~~~~~~~~~

Text and document classification on social media, such as Twitter and Facebook, is usually affected by the noisy nature (abbreviations, irregular forms) of the text corpora.

Here is an example from `GeeksforGeeks <https://www.geeksforgeeks.org/removing-stop-words-nltk-python/>`__:

.. code:: python

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    example_sent = "This is a sample sentence, showing off the stop words filtration."

    stop_words = set(stopwords.words('english'))

    word_tokens = word_tokenize(example_sent)

    # keep only the tokens that are not stop words
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    print(word_tokens)
    print(filtered_sentence)

Output:

.. code:: python

    ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
    ['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


Capitalization
~~~~~~~~~~~~~~

Sentences can contain a mixture of uppercase and lowercase letters, and multiple sentences make up a text document. To reduce the problem space, the most common approach is to convert everything to lower case. This brings all words in a document into the same space, but it often changes the meaning of some words, such as "US" (the United States of America) versus "us" (a pronoun). To handle such cases, slang and abbreviation converters can be applied.

.. code:: python

    text = "The United States of America (USA) or America, is a federal republic composed of 50 states"
    print(text)
    print(text.lower())

Output:

.. code:: python

    The United States of America (USA) or America, is a federal republic composed of 50 states
    the united states of america (usa) or america, is a federal republic composed of 50 states


Slangs and Abbreviations
~~~~~~~~~~~~~~~~~~~~~~~~

Slang and abbreviations can cause problems during the pre-processing steps. An abbreviation is a shortened form of a word or phrase, such as "SVM" standing for "Support Vector Machine". Slang is informal language, in conversation or text, whose meaning differs from the literal one; "lost the plot", for instance, essentially means "they've gone mad". A common method of dealing with these words is converting them to formal language.
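One simple way to do this conversion is a dictionary lookup that maps informal tokens to their formal equivalents. The mapping below is a small, hypothetical sample rather than a complete resource:

.. code:: python

    # hypothetical slang/abbreviation dictionary; real systems use much larger lists
    ABBREVIATIONS = {
        "btw": "by the way",
        "imo": "in my opinion",
        "u": "you",
        "svm": "support vector machine",
    }

    def expand_abbreviations(text):
        # replace each known abbreviation with its formal form (case-insensitive)
        return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())

    print(expand_abbreviations("imo the SVM model is better"))
    # in my opinion the support vector machine model is better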


Noise Removal
~~~~~~~~~~~~~

Another part of text cleaning as a pre-processing step is noise removal. Text documents generally contain characters such as punctuation and special characters, which are not necessary for text mining or classification purposes. Although punctuation is critical for understanding the meaning of a sentence, it can affect classification algorithms negatively.

Here is a simple piece of code to remove standard noise from text:

.. code:: python

    import re

    def text_cleaner(text):
        rules = [
            {r'>\s+': u'>'},              # remove spaces after a tag opens or closes
            {r'\s+': u' '},               # replace consecutive spaces
            {r'\s*<br\s*/?>\s*': u'\n'},  # newline after a <br>
            {r'</(div)\s*>\s*': u'\n'},   # newline after </div>
            {r'[ \t]*<[^<]*?/?>': u''},   # remove remaining tags
            {r'^\s+': u''}                # remove spaces at the beginning
        ]
        for rule in rules:
            for (k, v) in rule.items():
                regex = re.compile(k)
                text = regex.sub(v, text)
        return text.rstrip().lower()
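For instance, applied to a small HTML fragment (the sample string here is ours, for illustration):

.. code:: python

    sample = "<div>Hello  <br/> World</div>"
    print(text_cleaner(sample))
    # hello
    # world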