Text Mining Resources

Resources for learning about Text Mining and Natural Language Processing

Uncle Steve's Big List of Text Analytics and NLP Resources

 ____ ____ ____ ____ _________ ____ ____ ____ ____ ____ ____ 
||t |||e |||x |||t |||       |||m |||i |||n |||i |||n |||g ||
||__|||__|||__|||__|||_______|||__|||__|||__|||__|||__|||__||
|/__\|/__\|/__\|/__\|/_______\|/__\|/__\|/__\|/__\|/__\|/__\|

A curated list of resources for learning about natural language processing, text analytics, and unstructured data.

Books
- R
- Python
- General
Blogs
Blog articles, Papers, Case Studies
Major NLP Conferences
Benchmarks
Online Courses
APIs and Libraries
Products
Online Demos and Tools
Datasets
Misc
Other Curated Lists

Books

R

Python

General

Taming Text: How to Find, Organize, and Manipulate It. A hands-on guide to learn innovative tools and techniques for finding, organizing, and manipulating unstructured text.
Speech and Language Processing
Foundations of Statistical Natural Language Processing
Language Processing with Perl and Prolog: Theories, Implementation, and Application (Cognitive Technologies)
Introduction to Information Retrieval
Handbook of Natural Language Processing
Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications
Fundamentals of Predictive Text Mining
Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More
Neural Network Methods for Natural Language Processing
Text Mining: A Guidebook for the Social Sciences
Practical Text Analytics: Interpreting Text and Unstructured Data for Business Intelligence
Neural Network Methods in Natural Language Processing
Machine Learning for Text (2018)
Natural Language Processing in Spanish
Foundations of Computational Linguistics: Human-Computer Communication in Natural Language. Provides insights on how to build talking robots.
Statistical Methods for Speech Recognition. Highlights important research and statistical methods for speech recognition.
How To Label Data. Extended guide on managing large text annotation projects.

Blogs

Blog Articles, Papers, Case Studies

General

NLP in healthcare. How NLP can be used by healthcare payers and providers.
AI Harvard Business Review. The impact of improvement in NLP on human interaction with machines.
Why Accuracy in Natural Language Processing is Crucial to the Future of AI in Retail
Natural Language Processing is Fun! How computers understand Human Language. 2018.
WEF Live Campaign - Twitter fed Global News Topics & Sentiment Tracker - Live Jan 2019
Modern Deep Learning Techniques Applied to Natural Language Processing
The Definitive Guide to Natural Language Processing. MonkeyLearn. Non technical overview.
From Natural Language to Calendar Entries, with Clojure. March 2015. NLP, Clojure
Ask HN: How Can I Get into NLP (Natural Language Processing)?
Ask HN: What are the best tools for analyzing large bodies of text?
Quora: How do I learn Natural Language Processing? Good intro for beginners, with a time-estimate breakdown and links to Stanford CS courses.
Quora Topic: Natural Language Processing
The Definitive Guide to Natural Language Processing October 2015.
Futures of text Feb 2015. A survey of all the current innovation in text as a medium.
R or Python on Text Mining Aug 2015. Comparison of efficiency between R and Python in the field of Text Mining.
Where to start in Text Mining Aug 2012.
Text Mining in R and Python: 8 Tips To Get Started. Oct 2016
An introduction to text analysis with Python, Part 1 April 2012. A beginner’s walkthrough of the basic ideas of sentiment analysis in Python.
Mining Twitter Data with Python (Part 1: Collecting Data)
Why Text Mining May Be The Next Big Thing. March 2012.
SAS CEO offers analytics over BI, reveals use cases for text analytics June 2011.
Value and benefits of text mining. Sep 2015.
Text Mining South Park Feb 2016 - A text mining blog that covers a variety of topics.
Natural Language Processing: An Introduction
Natural Language Processing Tutorial. June 2013.
Natural Language Processing blog.
An Introduction to Text Mining using Twitter Streaming API and Python
- GitHub repo with code: https://github.com/adilmoujahid/Twitter_Analytics
How To Get Into Natural Language Processing. Basic non-technical intro to NLP.
Betty: a friendly English-like interface for your command line.
Creating machine learning models to analyze startup news - Part1. Part 2. Part 3.
Comparison of the Most Useful Text Processing APIs
100 Must-Read NLP Papers
Python Guide for dealing with Text Data
Crowdsourcing Ground Truth for Medical Relation Extraction
Natural language based financial forecasting: a survey. An article that clarifies the scope of Natural Language Financial Forecasting.
5 Heroic Tools for Natural Language Processing
Natural Language Processing unlocks hidden data to transform healthcare efficiency, quality and cost
Extracting medical problems from electronic clinical documents
Natural Language Processing (NLP) for Machine Learning. Includes basic, easy-to-understand preprocessing and compares a few ML classification models in Python.
How to Write a Spelling Corrector - by Peter Norvig
Using AI to unleash the power of unstructured government data: (W. Eggers, N. Malik, & M. Gracie, January 2019). “Think of unstructured text as being ‘trapped’ in physical and virtual file cabinets. The promise is clear: Governments could improve effectiveness and prevent many catastrophes by improving their ability to ‘connect the dots’ and identify patterns in available data.” This Deloitte article is an accessible primer on NLP and its applications to unstructured government text data. It includes many examples of how the US government currently deploys NLP across domains (e.g., analyzing public feedback through sentiment analysis and topic modelling, improving forensic investigations, and supporting policy-making and regulatory compliance). The key point is that applying different NLP techniques can uncover insights hidden in government text.
Extracting Features of Entertainment Products: A Guided Latent Dirichlet Allocation Approach Informed by the Psychology of Media Consumption: (O. Toubia, G. Iyengar, R. Bunnell, & A. Lemaire, February 2019). “We rely on the NLP literature to develop a method for tagging entertainment products in an automated and scalable manner. In the context of movies, we first show that the proposed features improve our ability to predict consumption at the individual level… We also show that guided LDA features have the potential to improve the performance of models that predict aggregate performance outcomes rather than individual-level consumption.” This academic article provides both a framework and managerial implications that suggest the application of LDA and NLP for feature extraction in entertainment products that can aid in traditional content-based consumer behavior models, and relevant marketing models applied to the media and entertainment industry.
Lessons learned building natural language processing systems in health care
How Algorithms Know What You’ll Type Next

Biases in NLP

AI bias: It is the responsibility of humans to ensure fairness
Venturebeat Blogpost - Gender biases in datasets - Based on UCLA research paper "Learning Gender Neutral Word Embeddings" Aug 2018.
Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. 2018
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.

Scraping

Scraping HTML using Scrapy. Tutorial on using the Python module Scrapy for easy data extraction from messy HTML websites.
Extract text from any document; no muss, no fuss. July 2014.
Using Scrapy to Build your Own Dataset. Sep 2017.

Cleaning

How to solve 90% of NLP problems: a step-by-step guide. Jan 2018. A step-by-step guide on data cleaning and exploration for successful NLP model building.
Text Preprocessing in Python: Steps, Tools, and Examples. Oct 2018
How to Clean Text for Machine Learning with Python. October 2017. Step-by-step guide to text data pre-processing.
Feature Extraction, Basic Pre-processing, and Advanced Processing

Stop Words

Stemming

Article: Text Stemming: Approaches, Applications, and Challenges. Dec 2016.
What is the Difference Between Stemming and Lemmatization? Feb 2018. Differences and examples of using stemming and lemmatization in different languages.
Stemming and Lemmatization in Python. Oct 2018. Comparison of stemming and lemmatization, covering the underlying algorithms, results, pros and cons, when to use each, and code syntax.
Sentiment Symposium Tutorial: Stemming
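
To make the stemming vs. lemmatization distinction discussed in the articles above concrete, here is a minimal sketch in plain Python. The suffix list and the tiny lemma dictionary are hypothetical stand-ins chosen for illustration; real stemmers (such as Porter's) and real lemmatizers use far richer rules and vocabularies.

```python
# Toy illustration only: SUFFIXES and LEMMAS are made-up, minimal examples.
SUFFIXES = ("ing", "ies", "ed", "es", "s")  # checked in this order

LEMMAS = {  # hypothetical lookup table standing in for a real vocabulary
    "studies": "study",
    "better": "good",
    "ran": "run",
}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ["studies", "running", "better", "ran"]:
        print(f"{w}: stem={naive_stem(w)}, lemma={naive_lemmatize(w)}")
```

The stemmer chops characters and can produce non-words (for example "stud"), while the lemmatizer returns dictionary forms but only for words it knows about.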

Dimensionality Reduction

Sarcasm Detection

Automatic Sarcasm Detection: A Survey. ACM Computing Surveys, Sep 2017.
CASCADE: Contextual Sarcasm Detection in Online Discussion Forums. 27th International Conference on Computational Linguistics, Aug 2018.
A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks. International Journal of Advanced Research in Computer Engineering & Technology, Volume 6, Issue 1, Jan 2017.
Detecting Sarcasm with Deep Convolutional Neural Networks. Apr 30, 2018. Contextual learning using CNNs for effective detection of sarcasm.

Document Classification

Naive Bayes and Text Classification, 2014. An in-depth overview of both the Naive Bayes algorithm and how it can be used in the document classification process.
Bag of Tricks for Efficient Text Classification, 2016. A paper from Facebook researchers that introduces fastText, a fast and effective document classification algorithm.
Text Classifier Algorithms in Machine Learning, 2017. A blog article that shows how to apply several deep learning algorithms to document classification problems.
Classifying Documents in the Reuters-21578 R8 Dataset, 2016. A nice tutorial in R that shows how to classify news articles using three different ML algorithms.
Tidy Text Mining Beer Reviews, 2018. Uses the KNN algorithm to classify reviews of craft beer products into styles of beer (e.g., "pilsner", "IPA", or "Belgian").
Using fastText and Comet.ml to classify relationships in Knowledge Graphs
Multi-Class Text Classification with Scikit-Learn, 2018. An article that shows how to deal with multi-class problems, such as classifying consumer complaints into one of 12 categories.
Machine Learning with Text in scikit-learn (PyCon 2016), 2016. A nice video tutorial that discusses how to use scikit-learn in the document classification process.
Ultimate guide to deal with Text Data (using Python) – for Data Scientists & Engineers, 2018. The title says it all.
Text Classification in Python with scikit-learn and nltk, 2017. Another tutorial showing how to perform text classification using scikit-learn.
Introducing state of the art text classification with universal language models, 2019. Introduces a groundbreaking transfer learning method for document classification.
Learning Document Embeddings by Predicting N-grams for Sentiment Classification of Long Movie Reviews - paper with code on Github
Towards Explainable NLP: A Generative Explanation Framework for Text Classification, 2019. A paper that describes a new approach for explaining the inner workings of text classification models.
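
As a companion to the Naive Bayes overview listed above, the following is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written with the standard library only. The training sentences and the labels "finance" and "sports" are invented for illustration, and whitespace tokenization is a simplifying assumption; the linked tutorials use their own data and libraries such as scikit-learn.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (text, label) pairs. Returns counts needed for prediction."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen during training
            # add-one smoothing avoids zero probabilities
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set
TRAIN = [
    ("stocks fell as markets closed", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sports"),
    ("the striker scored a late goal", "sports"),
]

if __name__ == "__main__":
    model = train_naive_bayes(TRAIN)
    print(predict(model, "markets and interest rates"))
    print(predict(model, "the team scored a goal"))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents are longer than a few words.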

Entity and Information Extraction

Entity Extraction and Network Analysis. Python, StanfordCoreNLP
Natural Language Processing for Information Extraction
NLP Techniques for Extracting Information. In-depth exploration of the seven steps framework of NLP data mining tools and techniques.

Document Clustering and Document Similarity

Text Clustering: Get quick insights from Unstructured Data. July 2017.
Document Clustering. MSc Thesis.
Document Clustering: A Detailed Review. Shah and Mahajan. IJAIS 2012.
Document Clustering with Python. A GitHub repository that clusters IMDB movie descriptions. Based on this original tutorial, whose GitHub repo is here.
Text mining and sentiment analysis on video game user reviews using SAS® Enterprise Miner
Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity
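
A common baseline for document similarity is the cosine similarity between bag-of-words count vectors. The sketch below shows that baseline with the standard library only; it is an illustrative assumption rather than the method used in any of the resources above, which typically add TF-IDF weighting, stop-word removal, and better tokenization.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between whitespace-tokenized word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # dot product over the words the two documents share
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("text mining", "text analytics"))
    print(cosine_similarity("text mining", "deep learning"))
```

Scores range from 0 (no shared words) to 1 (identical word distributions), so pairwise scores can feed directly into a clustering algorithm.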

Concept Analysis/Topic Modeling

Sentiment Analysis

Methods

CACM: Techniques and Applications for Sentiment Analysis, 2013. A nice overview of sentiment analysis from the Communications of the ACM journal.
Unsupervised Sentiment Analysis with Signed Social Networks, 2017. A conference paper that describes the challenges of applying sentiment analysis to social networks and presents a new unsupervised method.
Lexicon-Based Methods for Sentiment Analysis, 2010. Uses SO-CAL (Semantic Orientation CALculator), a measure of subjectivity and opinion for sentiment analysis.
That Sentimental Feeling, 2015. Compares the results of R's Syuzhet package with human labels on a series of novels. A 2016 update.
Unsupervised Sentiment Neuron, 2017. OpenAI's team developed a new way of using deep NNs to perform sentiment analysis, on much less data than usual.
Current State of Text Sentiment Analysis from Opinion to Emotion Mining, 2017. A journal article that surveys the current state of sentiment analysis research and tools.
Sentiment Analysis Tools Overview, Part 1. Positive and Negative Words Databases, 2017. A blog article that outlines some lexicon databases.
Sentiment analysis, Concept analysis and Applications, 2018. An overview of sentiment analysis, with an analysis of tweets about Uber.
Breakthrough Research Papers and Models for Sentiment Analysis, 2018. A blog that compares the performance of simple to advanced methods for sentiment analysis.
Twitter sentiment analysis using combined LSTM-CNN models, 2018. A blog article that describes a new method for sentiment analysis that uses deep learning.
VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text, 2014. A conference paper that presents VADER, a simple rule-based model of sentiment analysis.
A comparison of Lexicon-based approaches for Sentiment Analysis of microblog posts, 2014. A conference paper that presents a new lexicon-based approach for sentiment analysis of Twitter posts, based on lexical resources such as SentiWordNet.
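
Several entries above describe lexicon-based and rule-based sentiment methods. The sketch below illustrates the general idea only: sum word polarities from a lexicon and flip the sign after a negator. The four-word lexicon, the negator set, and the rule that a negator affects only the next sentiment word are all simplifying assumptions; this is not the algorithm used by VADER, SO-CAL, or SentiWordNet-based systems.

```python
# Hypothetical mini-lexicon; real lexicons contain thousands of scored entries.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATORS = {"not", "never", "no"}


def lexicon_score(text: str) -> float:
    """Sum word polarities, flipping the sign of a sentiment word after a negator."""
    score = 0.0
    negate = False
    for raw in text.lower().split():
        token = raw.strip(".,!?")  # drop surrounding punctuation
        if token in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(token, 0.0)
        if value != 0.0:
            score += -value if negate else value
            negate = False  # a negator affects only the next sentiment word
    return score


if __name__ == "__main__":
    for sentence in ["The food was great", "The food was not good"]:
        print(sentence, lexicon_score(sentence))
```

Even this toy version shows why negation appears among the challenges of sentiment analysis: without the negation rule, "not good" would score as positive.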

Challenges

On the negativity of negation, 2011. A conference paper that discusses the challenges of dealing with negativity in text, with a case study on IMDB movie reviews.
Challenges in Sentiment Analysis, 2015. A practical guide from the National Research Council of Canada that describes some of the main challenges of sentiment analysis.
A survey on sentiment analysis challenges, 2016. A journal article that discusses and compares sentiment analysis challenges among forty-seven papers.

Politics

Sentiment analysis on Trump's tweets using Python, 2017. Sentiment analysis on Trump's tweets using tweepy and textblob for NLP processing.
Donald Trump vs Hillary Clinton: sentiment analysis on Twitter mentions, 2016. Compares the sentiment of Trump's tweets vs. Hillary's tweets leading up to the 2016 US presidential election.
Does sentiment analysis work? A tidy analysis of Yelp reviews, 2016. Combined prediction results and individual words in reviews to show that sentiment analysis worked well on Yelp reviews.
From tweets to polls: Linking text sentiment to public opinion time series, 2010. A conference paper that describes how sentiment analysis on Twitter is connected to public opinion polls.

Stock Market

Twitter mood predicts the stock market, 2010. A journal article that measures the "mood" of daily Twitter feeds and shows that the moods can predict the DJIA.
A nonlinear impact: evidences of causal effects of social media on market prices, 2016. A journal article that shows that social media's relationship with the DJIA is nonlinear.
Forbes: How Quant Traders Use Sentiment To Get An Edge On The Market, 2015. An article that shows how quant traders can use sentiment analysis.
Sentdex: Quantifying the Qualitative. An online tool that measures the overall sentiment of different stocks.
Trump2Cash: A stock trading bot powered by Trump tweets. A bot that watches Donald Trump's Twitter account and waits for him to mention any publicly-traded companies. A related blog article describes a bot that turns Trump's tweets into Planned Parenthood donations.

Applications

Lost at Sea: How Social Media is Helping Cruise Lines Attract Millennials, 2016. A whitepaper describing how cruise lines can attract a different audience.
Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R, 2015. A technical article showing how to apply sentiment analysis to the text of the Harry Potter series.
Data Science 101: Sentiment Analysis in R Tutorial, 2017. A technical article describing how to use the Tidytext package in R to analyze US presidential speeches.
Cannes Lions 2017: Hungerithm, Mars Chocolate Australia (Clemenger BBDO, Melbourne), 2017. A video that shows how Snickers developed a tool to change the price of Snickers bar based on the mood of the internet.
Sentiment analysis: 10 applications and 4 services, 2018. A brief introduction to sentiment analysis, its business implications, and four cloud sentiment analysis services, including ones from Google, Amazon, and Microsoft.
What Your Boss Could Learn by Reading the Whole Company’s Emails, 2018. “The lesson: Figure out the truth about how the workforce is feeling not by eavesdropping on the substance of what employees say, but by examining how they are saying it.” This article covers applying sentiment analysis to large internal unstructured text datasets (e.g., employee e-mails). Text analytics and NLP have become an increasingly popular way to look for clues about the level of employee engagement in the workplace and for potential ‘red flags’ that deserve an organization’s attention. The article also touches on the ethical implications.
Aspect Based Sentiment Analysis of Amazon Product Reviews, 2018. An article showing how to apply sentiment analysis on different aspects of a product review on Amazon.
Sentiment Analysis of 2.2 million tweets from Super Bowl 51, 2017. An article showing how to apply sentiment analysis to tweets about the Super Bowl.
Emotion and Sentiment Analysis: A Practitioner’s Guide to NLP, 2018. An overview of sentiment analysis, applied to news articles.

Tools and Technology

Streaming Analytics Tutorial on Azure.
How to Analyze sentiment in Azure.
How to Perform Sentiment Analysis Using Python: Tutorial.
Twitter Sentiment Analysis Overview, 2016. Overview of sentiment analysis, and a step-by-step walkthrough on how to perform sentiment analysis using TextBlob.
ELMo embeddings in Keras using TensorFlow Hub, 2018. A guide to using ELMo embeddings from TensorFlow Hub in your Keras model.
Twitter Sentiment Analysis in Python using TextBlob, 2018.

Text Summarization

Text Summarization with Gensim
Unsupervised Text Summarization using Sentence Embeddings
Improving Abstraction in Text Summarization. Proposes two techniques for improvement.
Text Summarization and Categorization for Scientific and Health-Related Data -Text summarization with TensorFlow. 2016. A basic study on text summarization.

Machine Translation

Blog Post: Found in translation: More accurate, fluent sentences in Google Translate Nov 2016
NYTimes: The Great A.I. Awakening Dec 2016. How Google used artificial intelligence to transform Google Translate, one of its more popular services — and how machine learning is poised to reinvent computing itself.
Machine Learning Translation and the Google Translate Algorithm
Neural Machine Translation (seq2seq) Tutorial
Paper Dissected: “Attention is All You Need” Explained. Explanation of the influential 2017 paper that introduced the Transformer, an architecture built entirely on attention mechanisms.
The Annotated Transformer. A line-by-line implementation of "Attention Is All You Need".
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. A new language representation model published in 2018. Implementation code. PyTorch port.
Phrase-Based & Neural Unsupervised Machine Translation. Proposes two model variants, one neural and one phrase-based. Received the Best Paper Award at EMNLP 2018. Implementation code.

ChatGPT

...in Education

ChatGPT User Experience: Implications for Education. Xiaoming Zhai (University of Georgia). December 2022.
New Modes of Learning Enabled by AI Chatbots: Three Methods and Assignments. Mollick and Mollick (University of Pennsylvania). December 2022.
Educators Battle Plagiarism As 89% Of Students Admit To Using OpenAI’s ChatGPT For Homework. Forbes, January 2023
ChatGPT: Educational friend or foe? Hirsh-Pasek and Blinkoff (Temple University). January 2023.
Don’t Ban ChatGPT in Schools. Teach With It. New York Times (January 2023).
ChatGPT and the Future of Business Education. Feb 2023.
Udemy course (January 2023). ChatGPT for Teachers in Education.

Deep Learning

Keras LSTM tutorial – How to easily build a powerful deep learning language model.
- First half of the article describes RNNs, the anatomy of an LSTM cell, LSTM networks. Second half is a walkthrough of features in Keras for LSTM implementation using generators for data input.
Deep Learning for Natural Language Processing: Tutorials with Jupyter Notebooks.
- A short article containing links and descriptions to further video tutorials for DL approaches to NLP problems. Five lessons total including preprocessing, word representations, and LSTM, among other topics.
A Survey of the Usages of Deep Learning in Natural Language Processing.
- A 35-page academic literature review of DL in NLP (University of Colorado, July 2018). Detailed description of neural network architectures followed by a comprehensive set of applications.
Sequence Classification with Human Attention: Using human attention derived from eye-tracking corpora to regularize attention in recurrent neural networks (RNN). Implementation code.
Tutorial on Text Classification (NLP) using ULMFiT and fastai Library in Python
Multi-Task Deep Neural Networks for Natural Language Understanding. Academic article detailing Microsoft's MT-DNN algorithm, which as of February 2019 had outperformed BERT, ELMo, and BiLSTM on the GLUE benchmark.
Natural Language Processing Tutorial for Deep Learning Researchers: A 2019 NLP tutorial repository using TensorFlow and PyTorch.
Deep Learning for Sentiment Analysis : A Survey
Neural Reading Comprehension and Beyond. Stanford, December 2018. Reading comprehension models built on top of deep neural networks.
Microsoft: Multi-Task Deep Neural Network (MT-DNN): Microsoft's improvement on Google's BERT with focus on natural language understanding. Code to be released. January 31, 2019.
A Structured Self-Attentive Sentence Embedding

Capsule Networks

Investigating Capsule Networks with Dynamic Routing for Text Classification. 2018.
Attention-Based Capsule Networks with Dynamic Routing for Relation Extraction. 2018.
Twitter Sentiment Analysis Using Capsule Nets and GRU. 2018.
Identifying Aggression and Toxicity in Comments using Capsule Network. 2018.
It is early days for Capsule Networks, which were introduced by Geoffrey Hinton et al. in 2017 as an attempt at a neural network architecture superior to classical CNNs. The idea is to capture hierarchical relationships in the input through dynamic routing between "capsules" of neurons. Likely because NLP also deals with hierarchical structure, extending the idea to NLP has since been a subject of active research, as in the papers listed above.
Dynamic Routing Between Capsules. 2017.
Matrix Capsules with EM Routing. 2018.

Knowledge Graphs

Using fastText and Comet.ml to classify relationships in Knowledge Graphs
WTF is a knowledge graph?
A survey of graphs in natural language processing. Nastase et al, 2015.

Major NLP Conferences

Benchmarks

SQuAD leaderboard. A list of the strongest-performing NLP models on the Stanford Question Answering Dataset (SQuAD).
- SQuAD 1.0 paper (Last updated October 2016). SQuAD v1.1 includes over 100,000 question and answer pairs based on Wikipedia articles.
- SQuAD 2.0 paper (October 2018). The second generation of SQuAD adds unanswerable questions, which the model must recognize as unanswerable from the given passage rather than guess an answer.
GLUE leaderboard.
- GLUE paper (September 2018). A collection of nine NLP tasks including single-sentence tasks (e.g. check if grammar is correct, sentiment analysis), similarity and paraphrase tasks (e.g. determine if two questions are equivalent), and inference tasks (e.g. determine whether a premise contradicts a hypothesis).

Online Courses

Udemy

Stanford

Coursera

DataCamp

Others

Deep Learning Drizzle : Drench yourself in Deep Learning, Reinforcement Learning, Machine Learning, Computer Vision, and NLP from this curated list of exciting lectures!
Natural Language Processing | Dan Jurafsky, Christopher Manning
Deep Learning for NLP. DeepMind and University of Oxford Department of Computer Science.
CMU CS 11-747: Neural Networks for NLP
YSDA NLP course. Yandex School of Data Analysis.
CMU Language and Statistics II: (More) Empirical Methods in Natural Language Processing
UT CS 388: Natural Language Processing
Columbia: COMS W4705: Natural Language Processing
Columbia: COMS E6998: Machine Learning for Natural Language Processing (Spring 2012)
Machine Translation: Spring 2016
Commonlounge: Learn Natural Language Processing: From Beginner to Expert
Big Data University: Advanced Text Analytics – Getting Results with SystemT
Udacity: Natural Language Processing Nanodegree
edX: Natural Language Processing: An introduction to NLP, taught by Microsoft researchers

APIs and Libraries

R packages
- tm: Text Mining.
- lsa: Latent Semantic Analysis.
- lda: Collapsed Gibbs Sampling Methods for Topic Models.
- textir: Inverse Regression for Text Analysis.
- corpora: Statistics and data sets for corpus frequency data.
- tau: Text Analysis Utilities.
- tidytext: Text mining using dplyr, ggplot2, and other tidy tools.
- Sentiment140: Sentiment text analysis
- sentimentr: Lexicon-based sentiment analysis.
- cleanNLP: A tidy data model for natural language processing (tokenization, lemmatization, part-of-speech tagging, and more).
- RSentiment: Lexicon-based sentiment analysis. Contains support for negation detection and sarcasm.
- text2vec: Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities.
- fastTextR: Interface to the fastText library.
- LDAvis: Interactive visualization of topic models.
- keras: Interface to Keras, a high-level neural networks 'API'. (RStudio Blog: TensorFlow for R)
- rtweet: Client for accessing Twitter’s REST and stream APIs. (21 Recipes for Mining Twitter Data with rtweet)
- topicmodels: Interface to the C code for Latent Dirichlet Allocation (LDA).
- textmineR: Aid for text mining in R, with a syntax that should be familiar to experienced R users.
- wordVectors: Creating and exploring word2vec and other word embedding models.
- gtrendsR: Interface for retrieving and displaying the information returned online by Google Trends.
  - Analyzing Google Trends Data in R
- textstem: Tools that stem and lemmatize text.
- NLPutils: Utilities for Natural Language Processing.
- udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing using UDPipe.
Python modules
- NLTK: Natural Language Toolkit.
  - Video: NLTK with Python 3 for Natural Language Processing
- scikit-learn: Machine Learning in Python
  - Tutorial
- Spark NLP: Open source text processing library for Python, Java, and Scala. It provides production-grade, scalable, and trainable versions of the latest research in natural language processing.
- spaCy: Industrial-Strength Natural Language Processing in Python.
- textblob: Simplified Text processing.
  - Natural Language Basics with TextBlob
- Gensim: Topic Modeling for humans.
- Pattern.en: A fast part-of-speech tagger for English, sentiment analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface.
- textmining: Python Text Mining utilities.
- Scrapy: Open source and collaborative framework for extracting the data you need from websites.
- lda2vec: Tools for interpreting natural language.
- PyText: A deep-learning based NLP modeling framework built on PyTorch.
- sent2vec: General purpose unsupervised sentence representations.
- flair: A very simple framework for state-of-the-art Natural Language Processing (NLP)
- word_forms: Accurately generate all possible forms of an English word, e.g., "election" --> "elect", "electoral", "electorate", etc.
- AllenNLP: Open-source NLP research library, built on PyTorch.
- Beautiful Soup: Parse HTML and XML documents. Useful for webscraping.
- BigARTM: Fast topic modeling platform.
- Scattertext: Beautiful visualizations of how language differs among document types.
- embeddings: Pretrained word embeddings in Python.
- fastText: Library for efficient learning of word representations and sentence classification.
- Google Seq2Seq: A general-purpose encoder-decoder framework for Tensorflow that can be used for Machine Translation, Text Summarization, Conversational Modeling, Image Captioning, and more.
- polyglot: A natural language pipeline that supports multilingual applications.
- textacy: NLP, before and after spaCy
- Glove-Python: A “toy” implementation of GloVe in Python. Includes a paragraph embedder.
- Bert As A Service: Client/Server package for sentence encoding, i.e. mapping a variable-length sentence to a fixed-length vector. Design intent to provide a scalable production ready service, also allowing researchers to apply BERT quickly.
- Keras-BERT: A Keras Implementation of BERT
- Paragraph embedding scripts and Pre-trained models: Scripts for training and testing paragraph vectors, with links to some pre-trained Doc2Vec and Word2Vec models
- Texthero: Text preprocessing, representation and visualization from zero to hero.
Apache Tika: a content analysis toolkit.
Apache Spark: a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
- MLlib: Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. For NLP, it offers methods for LDA, Word2Vec, and TF-IDF.
  - LDA: latent Dirichlet allocation.
  - Word2Vec: an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document.
  - TF-IDF: term frequency-inverse document frequency.
HDF5: an open source file format that supports large, complex, heterogeneous data. Requires no configuration.
- h5py: Python HDF5 package
Stanford CoreNLP: a suite of core NLP tools
- Also check out http://corenlp.run for a hosted version of the CoreNLP server.
- Introduction to StanfordNLP: An Incredible State-of-the-Art NLP Library for 53 Languages (with Python code)
Stanford Parser: A probabilistic natural language parser.
Stanford POS Tagger: A Parts-of-Speech tagger.
Stanford Named Entity Recognizer: Recognizes proper nouns (things, places, organizations) and labels them as such.
Stanford Classifier: A softmax classifier.
Stanford OpenIE: Extracts relationships between words in a sentence (e.g. Mark Zuckerberg; founded; Facebook).
Stanford Topic Modeling Toolbox
MALLET: MAchine Learning for LanguagE Toolkit
- Github: https://github.com/mimno/Mallet
Apache OpenNLP: Machine learning based toolkit for text NLP.
Streamcrab: Real-time Twitter sentiment analyzer engine. http://www.streamcrab.com
TextRazor API: Extract Meaning from your Text.
fastText. Library for fast text representation and classification. Facebook.
Comparison of Top 6 Python NLP Libraries.
PyCaret's NLP Module. PyCaret is an open source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. PyCaret's founder, Moez Ali, is a Smith alumnus (MMA 2020).

Products

Systran - Enterprise Translation Products
SAS Text Miner (Part of SAS Enterprise Miner)
SAS Sentiment Analysis
STATISTICA
- Text Mining (Big Data, Unstructured Data)
KNIME
RapidMiner
Gate
IBM Watson
- Video: How IBM Watson learns (3 minutes)
- Video: IBM Watson on Jeopardy! (10 minutes)
- Video: IBM Watson: The Science Behind an Answer (7 minutes)
Crimson Hexagon
Stocktwits: Tap into the Pulse of Markets
Meltwater
CrowdFlower: AI for your business.
Lexalytics Semantria: API and Excel plugin.
Rosette Text Analytics: AI for Human Language
Alchemy API
Monkey Learn
LightTag Annotation Tool. Hosted annotation tool for teams.
UBIAI. Easy-to-use text annotation tool for teams with comprehensive auto-annotation features. Supports NER, relation extraction, and document classification, as well as OCR annotation for invoice labeling.
Anafora: Free and open source web-based raw text annotation tool
brat: Rapid annotation tool.
Google's Colab: Ready-to-go Notebook environment that makes it easy to get up and running.
Lyrebird.ai: “Ultra-Realistic Voice Cloning and Text-to-Speech” platform. This Canadian start-up's product combines voice cloning with text-to-speech: it learns the intonation and voice patterns from audio recordings, then renders input text as speech in the selected voice.
Ask Data by Tableau Software Inc.: Released in February 2019, Ask Data is an NLP add-on feature that helps existing Tableau users get data visualizations quickly. Its interface resembles a search engine: it applies NLP to the user's text input, extracts keywords, and uses them to find analytics and business insights on the Tableau platform.
Dialogflow: Google's Natural Language Platform, used to integrate conversational user interfaces into mobile apps, web applications, bots, VRUs, etc.
Weka: Easy-to-use, graphical Machine Learning Workbench including NLP capabilities.
Annotation Lab - Free End-to-End No-Code platform for text annotation and DL model training/tuning. Out-of-the-box support for Named Entity Recognition, Classification, Relation extraction and Assertion Status Spark NLP models. Unlimited support for users, teams, projects, documents.

Cloud

Microsoft Azure Text Analytics
Amazon Lex: A service for building conversational interfaces into any application using voice and text.
Amazon Comprehend
Google Cloud Natural Language
IBM Watson

Getting Data out of PDFs

Online Demos and Tools

MIT OpenNMT: neural machine translation and neural sequence modeling.
Stanford Parser
Stanford CoreNLP
word2vec demo
Another word2vec demo
sense2vec: Semantic Analysis of the Reddit Hivemind
RegexPal: Great tool for testing out regular expressions.
AllenNLP Demo: Great demo using AllenNLP of everything from Named Entity Recognition to Textual Entailment.
Cognitive Computation Group - Part of Speech Tagging Demo: Demos of part-of-speech tagging, information extraction, and related tasks.

Datasets

UCI's Text Datasets. A collection of databases, domain theories, and data generators used by the machine learning community.
data.world's Text Datasets
Awesome Public Datasets' Natural Language
Insight Resources Datasets
Bing Sentiment Analysis
Consumer Complaint Database. From the Consumer Financial Protection Bureau.
Sentiment Labelled Sentences Data Set. Contains sentences labelled as "positive" or "negative", from imdb.com, amazon.com, and yelp.com.
Amazon product data
Data is Plural
FiveThirtyEight's datasets
r/datasets
Awesome public datasets
R's datasets package
200,000 Russian Troll Tweets - Released by Congress from Twitter suspended accounts and removed from public view.
Wikipedia: List of datasets for ML research
Google Dataset Search
Kaggle: UMICH SI650 - Sentiment Classification
Lee's Similarity Data Sets
Corpus of Presidential Speeches (CoPS) and a Clinton/Trump Corpus
15 Best Chatbot Datasets for Machine Learning
A Survey of Available Corpora for Building Data-Driven Dialogue Systems
nlp-datasets
Hate-speech-and-offensive-language
First Quora Dataset Release: Question Pairs
The Best 25 Datasets for Natural Language Processing
SWAG: A large-scale dataset created for Natural Language Inference (NLI) with common-sense reasoning.
MIMIC: an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~40,000 critical care patients.
Clinical NLP Dataset Repository: A curated list of publicly-available clinical datasets for use in NLP research.
The Multi-Genre NLI Corpus
Twitter US Airline Sentiment
Million Song Lyrics: Dataset of song lyrics in Bag-Of-Words (BOW) format.
DuoRC – 186K unique question-answer pairs with evaluation script for Paraphrased Reading Comprehension
EDGAR Financial Statements: Reporting engine for financial and regulatory filings for companies worldwide. A huge repository of financial and company data for text mining.
American National Corpus Download
Santa Barbara Corpus of Spoken American English
Leipzig Corpora Collection: Corpora in English, Arabic, French, Russian, German
Awesome Twitter
The Big Bad NLP Database
CBC News Coronavirus articles
Huggingface

Lexicons for Sentiment Analysis

Misc

AskReddit: People with a mother tongue that isn't English, what are the most annoying things about the English language when you are trying to learn it?
Funny Video: Emotional Spell Check
How to win Kaggle competition based on NLP task, if you are not an NLP expert
Detecting Gang-Involved Escalation on Social Media Using Context: Detecting Aggression and Loss in social media using CNN
Reasoning about Actions and State Changes by Injecting Commonsense Knowledge: Incorporating global, commonsense constraints & biasing reading with preferences from large-scale corp
The Language of Hip Hop: A 2017 analysis by Matt Daniels of Pudding determining the popularity of various words in hip hop music and across artists.
Using Natural Language Processing for Automatic Detection of Plagiarism
Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing
Human Emotion: How to determine confidence level for manually labeled sentiment data?
A Complete Exploratory Data Analysis and Visualization for Text Data

Other Curated Lists

awesome-nlp: A curated list of resources dedicated to Natural Language Processing (NLP)
awesome-machine-learning
Awesome Deep Learning for Natural Language Processing (NLP)
Papers with Code: A fantastic list of recent machine learning papers on ArXiv, with links to code.
Chinese NLP Tools. 2019. List of tools for NLP in Chinese Language.
Association for Computational Linguistics Papers Anthology: The ACL Anthology currently hosts almost 50,000 papers on the study of computational linguistics and natural language processing. Includes all papers from recent conferences.
Over 150 of the Best Machine Learning, NLP, and Python Tutorials I’ve Found

Contribute

Contributions are more than welcome! Please read the contribution guidelines first.

License

To the extent possible under law, @stepthom has waived all copyright and related or neighboring rights to this work.

Open Source Agenda is not affiliated with "Text Mining Resources" Project. README Source: stepthom/text_mining_resources
