Machine Learning Research
Wikipedia has a list of machine learning text datasets, tabulated with useful information such as dataset size
Datahub hosts many datasets, though not all of them are focused on machine learning
Microsoft Research has a collection of datasets (look under the ‘Dataset directory’ tab)
SOTA NLP:
A small list of well-known standard datasets for common NLP tasks:
An alphabetical list of free or public domain text datasets:
Datasets for machine translation:
Syntactic corpora for many languages:
A script to search arXiv papers for a keyword, and extract important information such as performance metrics on a task:
StanfordNLP (since renamed Stanza), a Python library providing tokenization, tagging, parsing, and other capabilities.
Software from the Stanford NLP Group
NLTK (Natural Language Toolkit), a lightweight Python package for text processing.
spaCy, another Python package that handles preprocessing and also includes neural models (e.g. language models).
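
To make the arXiv keyword-search item in the list above more concrete, here is a minimal sketch of how such a search could work. It is not the linked script. It assumes the public arXiv API endpoint at export.arxiv.org, which returns an Atom feed, and it uses a deliberately crude heuristic for spotting performance numbers: keep any abstract sentence that mentions a metric name together with a number. The function names (build_query_url, parse_feed, metric_sentences, search_arxiv) and the list of metric words are illustrative assumptions.

```python
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint: the public arXiv API, which answers with an Atom feed.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical heuristic: a short list of metric names, plus a number that is
# not glued to a letter (so the "1" in "F1" does not count as a value).
METRIC_WORDS = re.compile(r"\b(accuracy|F1|BLEU|perplexity|error rate)\b", re.IGNORECASE)
NUMBER = re.compile(r"(?<![A-Za-z])\d+(?:\.\d+)?")


def build_query_url(keyword, max_results=10):
    """Build a query URL that searches all arXiv fields for `keyword`."""
    params = {
        "search_query": f"all:{keyword}",
        "start": 0,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(xml_text):
    """Turn an Atom feed (str or bytes) into a list of title/summary dicts."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        # Collapse the line breaks and indentation that feeds put inside text.
        papers.append({
            "title": " ".join(title.split()),
            "summary": " ".join(summary.split()),
        })
    return papers


def metric_sentences(summary):
    """Return the sentences that mention both a metric name and a number."""
    sentences = re.split(r"(?<=[.!?])\s+", summary)
    return [s for s in sentences if METRIC_WORDS.search(s) and NUMBER.search(s)]


def search_arxiv(keyword, max_results=10):
    """Fetch matching papers from arXiv (requires network access)."""
    with urllib.request.urlopen(build_query_url(keyword, max_results)) as resp:
        return parse_feed(resp.read())


if __name__ == "__main__":
    for paper in search_arxiv("machine translation", max_results=5):
        print(paper["title"])
        for sentence in metric_sentences(paper["summary"]):
            print("   ", sentence)
```

Keeping URL construction, feed parsing, and metric spotting in separate functions means the last two can be tried on saved feed text without touching the network. A real tool would likely need a richer metric vocabulary and better sentence splitting than this sketch provides.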