A very simple framework for state-of-the-art Natural Language Processing (NLP)
This is an update over release 0.3.1 with some critical bug fixes, a few new features and a lot more pre-packaged embeddings.
We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'). Load them using the two-letter language code, like this:
french_embedding = WordEmbeddings('fr')
Thanks to a contribution by @stefan-it, we added CharLMEmbeddings for Bulgarian and Slovenian. Load them like this:
flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')
Added an explanation of how to use your own custom word embeddings. Simply convert them to gensim.KeyedVectors and point the embedding class to the saved file:
custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')
DocumentPoolEmbeddings (#191): Added a new embedding class for document-level embeddings. You can now choose between different pooling options, e.g. min, max and average. Create the new embeddings like this:
word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
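To illustrate what the pooling options compute, here is a minimal pure-Python sketch. It is not flair's implementation: the `pool` helper, its mode names and the toy vectors are assumptions made for illustration only.

```python
from typing import List


def pool(token_vectors: List[List[float]], mode: str = "mean") -> List[float]:
    """Combine per-token vectors into one document vector, dimension by dimension.

    Hypothetical helper for illustration; not part of flair.
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if mode == "min":
        return [min(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# three toy 2-dimensional token embeddings (made-up numbers)
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
doc_min = pool(tokens, "min")
doc_max = pool(tokens, "max")
doc_mean = pool(tokens, "mean")
```

Whatever the number of tokens, the pooled vector has the same dimensionality as a single token embedding, which is what makes it usable as a fixed-size document representation.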
generate_text() (#167): The LanguageModel class now has a built-in generate_text() method to sample the LM. Run code like this:
# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')
# generate 20000 characters
text = model.generate_text(20000)
print(text)
Metric class (#164): Refactored the Metric class to provide class-based metrics, as well as micro- and macro-averaged F1 scores.
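The difference between the two averages can be sketched as follows. This is a generic illustration of micro- and macro-averaged F1 computed from per-class counts, not the code of the Metric class; the function names and the counts are hypothetical.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one class
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """Pool the counts over all classes first, then compute a single F1."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Compute F1 per class first, then take the unweighted mean."""
    scores = [f1(*c) for c in per_class.values()]
    return sum(scores) / len(scores) if scores else 0.0


# made-up counts: a frequent, well-predicted class and a rare, poorly predicted one
counts = {"PER": (8, 2, 2), "LOC": (1, 1, 3)}
micro = micro_f1(counts)
macro = macro_f1(counts)
```

Micro averaging is dominated by frequent classes, while macro averaging gives every class the same weight, so the rare class pulls the macro score down more strongly in this example.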
On these setups, we got errors when serializing or loading large models. We've put in place a workaround that limits model size so it works on those systems. An added bonus is that models are smaller now.
Fixed a potentially big issue in which dropout was frozen in the first epoch in embeddings produced from the character LM, meaning that the same dimensions stayed dropped throughout training.
Previously, the language model was never applied to test data during training. A final testing step has been added back in.
Since autograd is unused here, this gives us minor speedups.
This is a stability update over release 0.3.0 with small optimizations, refactorings and bug fixes. For a list of new features, refer to 0.3.0.
Allow for faster training of text classifier on large datasets by keeping token embeddings in memory.
After prediction, remove embeddings from memory to avoid filling up memory.
Align signatures and features of the two training classes to make it easier to understand training options.
Remove unused flag and code from DocumentLSTMEmbeddings
Some dependencies are no longer required.
Label class with confidence score (https://github.com/zalandoresearch/flair/issues/38): A tag prediction is no longer a simple string but a Label, which holds a value and a confidence score. To obtain the tag name, call tag.value. To get the score, call tag.score. This can help you build applications in which you only want to use predictions that lie above a specific confidence threshold.
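Filtering by confidence could look roughly like the sketch below. The `Label` dataclass here is a hypothetical stand-in with the same two fields (value and score), not flair's own class, and the scores are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    """Hypothetical stand-in for a predicted label: a value plus a confidence score."""
    value: str
    score: float


def above_threshold(labels: List[Label], threshold: float) -> List[Label]:
    """Keep only the predictions whose score lies strictly above the threshold."""
    return [label for label in labels if label.score > threshold]


# made-up predictions
predictions = [Label("PER", 0.98), Label("LOC", 0.41), Label("ORG", 0.75)]
kept = above_threshold(predictions, threshold=0.7)
kept_values = [label.value for label in kept]
```

With real predictions, the same filter would read the tag name from tag.value and the confidence from tag.score, as described above.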
LockedDropout moved to the new flair.nn module (https://github.com/zalandoresearch/flair/issues/48).
Entities can now be wrapped into multi-token spans (type: Span). This is helpful for entities that span multiple words, such as "George Washington". A Span contains the position of the entity in the original text, the tag, a confidence score, and its text. You can get spans from a sentence by using the get_spans() method, like so:
from flair.data import Sentence
from flair.models import SequenceTagger
# make a sentence
sentence = Sentence('George Washington went to Washington .')
# load and run NER
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)
# get span entities, together with tag and confidence score
for entity in sentence.get_spans('ner'):
    print('{} {} {}'.format(entity.text, entity.tag, entity.score))
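As a rough picture of how per-token tags can be merged into multi-token spans, here is a small self-contained sketch. It assumes BIO-style tags ("B-PER", "I-PER", "O"); the tagging scheme flair uses internally may differ, and the `Span` tuple and `bio_to_spans` function are hypothetical names, not flair's API.

```python
from typing import List, NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical stand-in for a multi-token span; not flair's Span class."""
    text: str
    tag: str
    start: int  # index of the first token
    end: int    # index of the last token (inclusive)


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Group consecutive B-X / I-X tokens into one span per entity."""
    spans: List[Span] = []
    start: Optional[int] = None  # where the currently open span began
    open_tag = ""
    # a trailing sentinel "O" closes a span that runs to the end of the sentence
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and start is not None and tag[2:] == open_tag:
            continue  # the open span simply continues
        if start is not None:  # anything else closes the open span
            spans.append(Span(" ".join(tokens[start:i]), open_tag, start, i - 1))
            start = None
        if tag != "O":  # "B-X" (or a stray "I-X") opens a new span
            start, open_tag = i, tag[2:]
    return spans


tokens = ["George", "Washington", "went", "to", "Washington", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
spans = bio_to_spans(tokens, tags)
```

The two-token person name ends up as a single span, while the later single-token location gets its own span.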
Predicted tags are no longer simple strings, but objects of type Label that contain a value and a confidence score. These scores are extracted during prediction from the sequence tagger or text classifier and indicate how confident the model is of a prediction. Print confidence scores of tags like this:
from flair.data import Sentence
from flair.models import SequenceTagger
# make a sentence
sentence = Sentence('George Washington went to Washington .')
# load the POS tagger
tagger = SequenceTagger.load('pos')
# run POS over sentence
tagger.predict(sentence)
# print token, predicted POS tag and confidence score
for token in sentence:
    print('{} {} {}'.format(token.text, token.get_tag('pos').value, token.get_tag('pos').score))
flair now includes visualizations for plotting training curves and weights when training a sequence tagger or text classifier. We also added visualization routines for plotting embeddings and highlighting tags in a sentence. For instance, to visualize contextual string embeddings, do this:
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import CharLMEmbeddings
from flair.visual import Visualizer
# get a list of Sentence objects
corpus = NLPTaskDataFetcher.fetch_data(NLPTask.CONLL_03).downsample(0.1)
sentences = corpus.train + corpus.test + corpus.dev
# init embeddings (can also be a StackedEmbedding)
embeddings = CharLMEmbeddings('news-forward-fast')
# embed corpus batch-wise
batches = [sentences[x:x + 8] for x in range(0, len(sentences), 8)]
for batch in batches:
    embeddings.embed(batch)
# visualize
visualizer = Visualizer()
visualizer.visualize_word_emeddings(embeddings, sentences, 'data/visual/embeddings.html')
Different dropout possibilities (Locked Dropout and Word Dropout) were added and can be used during training.
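The two dropout variants can be pictured with the following pure-Python sketch. It follows the usual definitions (locked dropout reuses one mask over dimensions for the whole sequence; word dropout zeroes entire token vectors) and is an assumption about the general technique, not a copy of flair's modules. The function names are hypothetical.

```python
import random
from typing import List

Sequence = List[List[float]]  # one embedding vector per token


def locked_dropout(seq: Sequence, p: float, rng: random.Random) -> Sequence:
    """Sample ONE mask over dimensions and reuse it for every token (requires p < 1).

    Kept dimensions are rescaled by 1 / (1 - p), as in standard inverted dropout.
    """
    dim = len(seq[0])
    mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in range(dim)]
    return [[x * m for x, m in zip(vec, mask)] for vec in seq]


def word_dropout(seq: Sequence, p: float, rng: random.Random) -> Sequence:
    """Zero out whole token vectors, each independently with probability p.

    No rescaling is applied here; that choice is an assumption of this sketch.
    """
    return [[0.0] * len(vec) if rng.random() < p else list(vec) for vec in seq]


rng = random.Random(0)
sequence = [[1.0] * 6 for _ in range(4)]  # 4 tokens, 6 dimensions, all ones
locked = locked_dropout(sequence, p=0.5, rng=rng)
dropped = word_dropout(sequence, p=0.5, rng=rng)
```

Because the toy input is all ones, every row of `locked` is identical (the same dimensions are dropped at every position), while every row of `dropped` is either untouched or entirely zero.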
flair now stores contextual string embeddings on disk to speed up training and allow for training on larger datasets.
Added pre-trained language models for Polish, donated by Borchmann et al. (2018). Load the Polish embeddings like this:
flm_embeddings = CharLMEmbeddings('polish-forward')
blm_embeddings = CharLMEmbeddings('polish-backward')
The script eval.pl for the sequence tagger contained bugs. flair now uses its own evaluation methods.
Fixed bugs in single label training and out-of-memory errors during evaluation.
Logging output for sequence tagger and text classifier is improved and standardized.
flair now uses torch version 0.4.1
Expanded documentation and tutorials.
There are now two packages: flair.models and flair.trainers for the models and model trainers respectively.
flair.models contains 3 model classes: SequenceTagger, TextClassifier and LanguageModel.
flair.trainers contains 3 model trainer classes: SequenceTaggerTrainer, TextClassifierTrainer and LanguageModelTrainer.
You call these classes directly from the packages. For instance, the SequenceTagger is now instantiated as:
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')
Clear distinction between token-level and document-level embeddings by adding two classes, namely TokenEmbeddings and DocumentEmbeddings, from which respective embeddings need to inherit.
Added LanguageModelTrainer class to train your own LM embeddings.
Added experimental TextClassifier model for document-level text classification. Also added corresponding model trainer class, i.e. TextClassifierTrainer.
Added batching into prediction method for faster sequence tagging
Added pre-trained models with smaller LM embeddings for faster CPU-inference speed
You can load them by adding '-fast' to the model name. Only for English at present.
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner-fast')
Added learning rate schedulers to all trainer classes for improved learning rate annealing functionality and control.
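One common annealing strategy is to shrink the learning rate once a monitored loss stops improving. The sketch below illustrates that idea in plain Python; whether it matches the scheduler the trainer classes actually use is an assumption, and `PlateauAnnealer` is a hypothetical name.

```python
from typing import List


class PlateauAnnealer:
    """Hypothetical helper: multiply the learning rate by `factor` after more than
    `patience` consecutive epochs without a new best loss."""

    def __init__(self, lr: float, factor: float = 0.5, patience: int = 2) -> None:
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss: float) -> float:
        """Record one epoch's loss and return the learning rate for the next epoch."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# made-up loss curve: improves, stalls for three epochs, then improves again
annealer = PlateauAnnealer(lr=0.1, factor=0.5, patience=2)
losses: List[float] = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
history = [annealer.step(loss) for loss in losses]
```

In this run the rate stays at 0.1 while the loss is improving or has stalled for at most two epochs, and is halved on the third stalled epoch.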
All model classes now automatically spawn on GPUs if available. The separate .cuda() call is no longer necessary.
Fixed error that occurred when using multiple pre-trained taggers on the same sentence.
Fixed error that caused data fetchers to sometimes create empty sentences.
Added a large set of automated unit tests for better stability.
Expanded documentation and tutorials. Also expanded descriptions of APIs.
A number of code simplifications all around, hopefully making the code easier to understand.
First release of Flair Framework
Static word embeddings:
Contextual string embeddings:
Text embeddings:
Sequence labeling: