Flair Versions

A very simple framework for state-of-the-art Natural Language Processing (NLP)

v0.3.2

5 years ago

This is an update over release 0.3.1 with some critical bug fixes, a few new features and a lot more pre-packaged embeddings.

New Features

Embeddings

More word embeddings (#194)

We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'). Load them using the two-letter language code, like this:

french_embedding = WordEmbeddings('fr')

More character LM embeddings (#204, #187)

Thanks to a contribution by @stefan-it, we added CharLMEmbeddings for Bulgarian and Slovenian. Load them like this:

flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')

Custom embeddings (#170)

Added an explanation of how to use your own custom word embeddings. Simply convert them to gensim.KeyedVectors and point the embedding class to the saved file:

custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')

New embeddings type: DocumentPoolEmbeddings (#191)

Added a new embedding class for document-level embeddings. You can now choose between different pooling options, e.g. min, max and average. Create the new embeddings like this:

word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
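
To make the pooling options concrete, here is a minimal pure-Python sketch of what min, max and average pooling over word vectors compute. It is not Flair's implementation: the pool() helper, its mode names and the toy vectors are hypothetical and chosen for illustration only.

```python
from statistics import fmean


def pool(word_vectors, mode="mean"):
    """Collapse a list of equal-length word vectors into one document vector.

    Hypothetical helper for illustration; not part of Flair.
    """
    if not word_vectors:
        raise ValueError("need at least one word vector")
    reducers = {"min": min, "max": max, "mean": fmean}
    if mode not in reducers:
        raise ValueError(f"unknown pooling mode: {mode}")
    reduce_dim = reducers[mode]
    # zip(*...) groups the values of each dimension across all words
    return [reduce_dim(dim_values) for dim_values in zip(*word_vectors)]


# toy 3-dimensional "embeddings" for a three-word document
vectors = [
    [1.0, 4.0, -2.0],
    [3.0, 0.0, 2.0],
    [2.0, 2.0, 0.0],
]

doc_min = pool(vectors, "min")
doc_max = pool(vectors, "max")
doc_mean = pool(vectors, "mean")
print(doc_min, doc_max, doc_mean)
```

Whatever the mode, the document vector has the same dimensionality as the word vectors, independent of document length.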

Language model

New method: generate_text() (#167)

The LanguageModel class now has a built-in generate_text() method to sample text from the LM. Run code like this:

# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')

# generate 20000 characters
text = model.generate_text(20000)
print(text)

Metrics

Class-based metrics in Metric class (#164)

Refactored Metric class to provide class-based metrics, as well as micro and macro averaged F1 scores.
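
As a rough illustration of how per-class, micro-averaged and macro-averaged F1 differ, the following sketch computes all three for a single-label classification task. The f1() and f1_report() helpers and the toy labels are hypothetical; this is not the code of Flair's Metric class, and it makes no claim about how that class counts matches for sequence tagging.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision and recall are both 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_report(gold, predicted):
    """Return (per-class F1 dict, micro F1, macro F1) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p where it was wrong
            fn[g] += 1  # missed the gold class g
    classes = sorted(set(gold) | set(predicted))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    # macro: average the per-class scores, so every class weighs the same
    macro = sum(per_class.values()) / len(classes) if classes else 0.0
    # micro: pool the counts first, so frequent classes dominate
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, micro, macro


gold = ["PER", "PER", "LOC", "ORG", "LOC", "PER"]
pred = ["PER", "LOC", "LOC", "ORG", "LOC", "PER"]
per_class, micro_f1, macro_f1 = f1_report(gold, pred)
print(per_class, micro_f1, macro_f1)
```

Macro averaging treats rare and frequent classes alike, while micro averaging is dominated by the frequent ones, which is why reporting both is useful on imbalanced data.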

Bug Fixes

Fix serialization error for MacOS and Windows (#174)

On these setups, we got errors when serializing or loading large models. We've put in place a workaround that limits model size so it works on those systems. An added bonus is that models are smaller now.

"Frozen" dropout (#184)

Fixed a potentially serious issue in which the dropout applied to embeddings produced by the character LM was frozen in the first epoch, so that the same dimensions stayed dropped throughout training.
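
The effect of such a frozen mask can be illustrated with a small pure-Python simulation. This is only a sketch of the symptom: the helper names, dimensions and probabilities are hypothetical, and it does not reproduce the actual cause or fix in Flair, which these notes do not describe.

```python
import random


def sample_mask(dim, drop_prob, rng):
    """Return a 0/1 keep-mask; 0 marks a dropped dimension."""
    return [0 if rng.random() < drop_prob else 1 for _ in range(dim)]


def masks_over_epochs(dim, drop_prob, epochs, frozen, seed=0):
    """Collect the dropout mask used in each epoch."""
    rng = random.Random(seed)
    first = sample_mask(dim, drop_prob, rng)
    masks = [first]
    for _ in range(epochs - 1):
        # frozen: reuse the first epoch's mask; otherwise draw a fresh one
        masks.append(first if frozen else sample_mask(dim, drop_prob, rng))
    return masks


def always_dropped(masks):
    """Count dimensions that are dropped in every single epoch."""
    return sum(all(m[i] == 0 for m in masks) for i in range(len(masks[0])))


frozen_masks = masks_over_epochs(dim=64, drop_prob=0.5, epochs=5, frozen=True)
fresh_masks = masks_over_epochs(dim=64, drop_prob=0.5, epochs=5, frozen=False)

always_dropped_frozen = always_dropped(frozen_masks)
always_dropped_fresh = always_dropped(fresh_masks)
print(always_dropped_frozen, always_dropped_fresh)
```

With a frozen mask, every dimension dropped in the first epoch is never seen by the downstream model. With resampling, far fewer dimensions stay dropped across all epochs.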

Testing step in language model trainer (#178)

Previously, the language model was never applied to test data during training. A final testing step has been added back in.

Testing

Distinguish between unit and integration tests (#183)

Instructions on how to run tests with pipenv (#161)

Optimizations

Disable autograd during testing and prediction (#175)

Since autograd is unused here, this gives us minor speedups.

v0.3.1

5 years ago

This is a stability update over release 0.3.0 with small optimizations, refactorings and bug fixes. For a list of new features, refer to 0.3.0.

Optimizations

Retain Token embeddings in memory by default (#146)

Allow for faster training of text classifier on large datasets by keeping token embeddings in memory.

Always clear embeddings after prediction (#149)

After prediction, remove embeddings from memory to avoid filling up memory.

Refactorings

Aligned TextClassifierTrainer and SequenceTaggerTrainer (#148)

Align signatures and features of the two training classes to make it easier to understand training options.

Updated DocumentLSTMEmbeddings (#150)

Removed an unused flag and unused code from DocumentLSTMEmbeddings.

Removed unneeded AWS and Jinja2 dependencies (#158)

Some dependencies are no longer required.

Bug Fixes

Fixed error when predicting over empty sentences. (#157)

Serialization: reset cache settings when saving a model. (#153)

v0.3.0

5 years ago

Breaking Changes

New Label class with confidence score (https://github.com/zalandoresearch/flair/issues/38)

A tag prediction is not a simple string anymore but a Label, which holds a value and a confidence score. To obtain the tag name you need to call tag.value. To get the score call tag.score. This can help you build applications in which you only want to use predictions that lie above a specific confidence threshold.
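
A minimal sketch of such threshold filtering is shown below. The Label dataclass here is a hypothetical stand-in that only mirrors the value and score attributes described above; it is not Flair's class, and the confident() helper and the scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """Hypothetical stand-in for a predicted tag: a value plus a confidence score."""
    value: str
    score: float


def confident(labels, threshold):
    """Keep only the labels whose score lies above the threshold."""
    return [label for label in labels if label.score > threshold]


predictions = [Label("PER", 0.98), Label("LOC", 0.41), Label("ORG", 0.87)]
kept = confident(predictions, threshold=0.8)
kept_values = [label.value for label in kept]
print(kept_values)
```

The right threshold depends on the application: a higher value trades recall for precision.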

LockedDropout moved to the new flair.nn module (https://github.com/zalandoresearch/flair/issues/48)

New Features

Multi-token spans (https://github.com/zalandoresearch/flair/issues/54, https://github.com/zalandoresearch/flair/issues/97)

Entities can now be wrapped into multi-token spans (type: Span). This is helpful for entities that span multiple words, such as "George Washington". A Span contains the position of the entity in the original text, the tag, a confidence score, and its text. You can get spans from a sentence by using the get_spans() method, like so:

from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('George Washington went to Washington .')

# load and run NER
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

# get span entities, together with tag and confidence score
for entity in sentence.get_spans('ner'):
    print('{} {} {}'.format(entity.text, entity.tag, entity.score))
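
For intuition, the sketch below shows one way per-token tags could be grouped into multi-token spans. It rests on several assumptions not stated in these notes: a BIO tagging scheme, a span score computed as the mean of the token scores, and a hypothetical Span dataclass and bio_to_spans() helper. It is not how get_spans() is implemented.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span record; not Flair's Span class."""
    text: str
    tag: str
    score: float
    start: int  # index of the first token
    end: int    # index of the last token (inclusive)


def bio_to_spans(tokens, tags, scores):
    """Group BIO-tagged tokens into spans (assumed scheme: B-X, I-X, O)."""
    spans, current = [], []  # current holds the token indexes of the open span

    def close():
        if current:
            spans.append(Span(
                text=" ".join(tokens[i] for i in current),
                tag=tags[current[0]][2:],
                # assumption: span confidence is the mean token confidence
                score=sum(scores[i] for i in current) / len(current),
                start=current[0],
                end=current[-1],
            ))
            current.clear()

    for i, tag in enumerate(tags):
        continues = (
            tag.startswith("I-")
            and bool(current)
            and tags[current[0]][2:] == tag[2:]
        )
        if not continues:
            close()  # a B- tag, an O tag or a type change ends the open span
        if tag != "O":
            current.append(i)
    close()
    return spans


tokens = ["George", "Washington", "went", "to", "Washington", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
scores = [0.9, 0.8, 1.0, 1.0, 0.7, 1.0]
spans = bio_to_spans(tokens, tags, scores)
for span in spans:
    print(span.text, span.tag, round(span.score, 2))
```

Here the two tokens of "George Washington" end up in one PER span, while the later "Washington" forms its own LOC span.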

Predictions with confidence score (https://github.com/zalandoresearch/flair/issues/38)

Predicted tags are no longer simple strings, but objects of type Label that contain a value and a confidence score. These scores are extracted during prediction from the sequence tagger or text classifier and indicate how confident the model is of a prediction. Print confidence scores of tags like this:

from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('George Washington went to Washington .')

# load the POS tagger
tagger = SequenceTagger.load('pos')

# run POS over sentence
tagger.predict(sentence)

# print token, predicted POS tag and confidence score
for token in sentence:
    print('{} {} {}'.format(token.text, token.get_tag('pos').value, token.get_tag('pos').score))

Visualization routines (https://github.com/zalandoresearch/flair/issues/61)

flair now includes visualizations for plotting training curves and weights when training a sequence tagger or text classifier. We also added visualization routines for plotting embeddings and highlighting tags in a sentence. For instance, to visualize contextual string embeddings, do this:

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import CharLMEmbeddings
from flair.visual import Visualizer

# get a list of Sentence objects
corpus = NLPTaskDataFetcher.fetch_data(NLPTask.CONLL_03).downsample(0.1)
sentences = corpus.train + corpus.test + corpus.dev

# init embeddings (can also be a StackedEmbedding)
embeddings = CharLMEmbeddings('news-forward-fast')

# embed corpus batch-wise
batches = [sentences[x:x + 8] for x in range(0, len(sentences), 8)]
for batch in batches:
    embeddings.embed(batch)

# visualize
visualizer = Visualizer()
visualizer.visualize_word_emeddings(embeddings, sentences, 'data/visual/embeddings.html')

Implementation of different dropouts (https://github.com/zalandoresearch/flair/issues/48)

Different dropout possibilities (Locked Dropout and Word Dropout) were added and can be used during training.
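
The difference between the two variants, as the terms are commonly used, can be sketched in pure Python: locked dropout samples one mask per sequence and reuses it at every time step, while word dropout zeroes out entire token vectors. The functions below are hypothetical illustrations under that assumption and are not the classes in flair.nn.

```python
import random


def locked_dropout(sequence, drop_prob, rng):
    """Sample one mask per sequence and apply it at every time step."""
    dim = len(sequence[0])
    # kept dimensions are rescaled so the expected activation is unchanged
    keep = [
        0.0 if rng.random() < drop_prob else 1.0 / (1.0 - drop_prob)
        for _ in range(dim)
    ]
    return [[x * k for x, k in zip(vec, keep)] for vec in sequence]


def word_dropout(sequence, drop_prob, rng):
    """Zero out whole token vectors, each with probability drop_prob."""
    return [
        [0.0] * len(vec) if rng.random() < drop_prob else list(vec)
        for vec in sequence
    ]


rng = random.Random(1)
sequence = [[1.0] * 8 for _ in range(5)]  # 5 tokens, 8 dimensions each

locked = locked_dropout(sequence, drop_prob=0.5, rng=rng)
dropped_words = word_dropout(sequence, drop_prob=0.3, rng=rng)
print(locked[0])
print([sum(vec) for vec in dropped_words])
```

Because all toy tokens are identical, every row of the locked output is the same, which makes the shared mask easy to see.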

Memory management for training on large data sets (https://github.com/zalandoresearch/flair/issues/137)

flair now stores contextual string embeddings on disk to speed up training and allow for training on larger datasets.

Pre-trained language models for Polish

Added pre-trained language models for Polish, donated by Borchmann et al. (2018). Load the Polish embeddings like this:

flm_embeddings = CharLMEmbeddings('polish-forward')
blm_embeddings = CharLMEmbeddings('polish-backward')

Bug Fixes

Fix evaluation of sequence tagger (https://github.com/zalandoresearch/flair/issues/79, https://github.com/zalandoresearch/flair/issues/75)

The script eval.pl for sequence tagger contained bugs. flair now uses its own evaluation methods.

Fix bugs in text classifier (https://github.com/zalandoresearch/flair/issues/108)

Fixed bugs in single label training and out-of-memory errors during evaluation.

Others

Standardize logging output (https://github.com/zalandoresearch/flair/issues/16)

Logging output for sequence tagger and text classifier has been improved and standardized.

Update torch version (https://github.com/zalandoresearch/flair/issues/34, https://github.com/zalandoresearch/flair/issues/106)

flair now uses torch version 0.4.1.

Updated documentation (https://github.com/zalandoresearch/flair/issues/138, https://github.com/zalandoresearch/flair/issues/89)

Expanded documentation and tutorials.

v0.2.0

5 years ago

Breaking Changes

Reorganized package structure #12

There are now two packages: flair.models and flair.trainers for the models and model trainers respectively.

Models package

flair.models contains 3 model classes: SequenceTagger, TextClassifier and LanguageModel.

Trainers package

flair.trainers contains 3 model trainer classes: SequenceTaggerTrainer, TextClassifierTrainer and LanguageModelTrainer.

Direct import from package

You can now import these classes directly from the packages. For instance, the SequenceTagger is now instantiated as:

from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')

Reorganized embeddings #12

There is now a clear distinction between token-level and document-level embeddings: we added two classes, TokenEmbeddings and DocumentEmbeddings, from which the respective embeddings need to inherit.

New Features

LanguageModelTrainer #24 #17

Added LanguageModelTrainer class to train your own LM embeddings.

Document Classification #10

Added experimental TextClassifier model for document-level text classification. Also added corresponding model trainer class, i.e. TextClassifierTrainer.

Batch prediction #7

Added batching to the prediction method for faster sequence tagging.

CPU-friendly pre-trained models #29

Added pre-trained models with smaller LM embeddings for faster CPU inference.

You can load them by adding '-fast' to the model name. They are currently available for English only.

from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner-fast')

Learning Rate Scheduling #19

Added learning rate schedulers to all trainer classes for improved learning rate annealing functionality and control.
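
These notes do not say which scheduling strategy the trainers use. As one common example of learning rate annealing, the sketch below halves the rate whenever the dev loss has stopped improving for several epochs. The anneal_on_plateau() helper, its parameters and the loss values are all hypothetical.

```python
def anneal_on_plateau(dev_losses, initial_lr=0.1, factor=0.5, patience=2):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` once the dev loss has failed to
    improve for more than `patience` consecutive epochs.
    """
    lr, best, bad_epochs = initial_lr, float("inf"), 0
    schedule = []
    for loss in dev_losses:
        schedule.append(lr)  # rate used for the epoch that produced `loss`
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                lr *= factor
                bad_epochs = 0
    return schedule


# toy dev losses: improvement stalls after the second epoch, then resumes
losses = [1.0, 0.8, 0.81, 0.82, 0.83, 0.7, 0.71]
lrs = anneal_on_plateau(losses)
print(lrs)
```

A larger patience makes the schedule more tolerant of noisy dev scores, at the cost of spending more epochs at a rate that may be too high.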

Auto-spawn on GPUs #19

All model classes now automatically spawn on GPUs if available. The separate .cuda() call is no longer necessary.

Bug Fixes

Retagging error #23

Fixed error that occurred when using multiple pre-trained taggers on the same sentence.

Empty sentence error #33

Fixed error that caused data fetchers to sometimes create empty sentences.

Other

Unit Tests #15

Added a large set of automated unit tests for better stability.

Documentation #15

Expanded documentation and tutorials. Also expanded descriptions of APIs.

Code Simplifications in sequence tagger #19

A number of code simplifications all around, hopefully making the code easier to understand.

v0.1.0

5 years ago

First release of Flair Framework

Static word embeddings:

  • includes prepared word embeddings from GloVe, FastText, Numberbatch and Extvec
  • includes prepared word embeddings for English, German and Swedish

Contextual string embeddings:

  • includes pre-trained models for English and German

Text embeddings:

  • Two experimental methods for full-text embeddings (LSTM and Mean)

Sequence labeling:

  • pre-trained models for English (PoS-tagging, chunking and NER)
  • pre-trained models for German (PoS-tagging and NER)
  • experimental semantic frame detector for English