A very simple framework for state-of-the-art Natural Language Processing (NLP)
Release 0.6.1 is a bugfix release that fixes the issues caused by moving the server that originally hosted the Flair models. Additionally, this release adds a ton of new NER datasets, including the XTREME corpus for 40 languages, and a new model for NER on German-language legal text.
Adds a legal NER model for German, trained on the German legal NER dataset (available here), which can be loaded in Flair with the LER_GERMAN corpus object.
The model uses German Flair and FastText embeddings and achieves an F1 score of 96.35.
Use like this:
from flair.data import Sentence
from flair.models import SequenceTagger

# load German LER tagger
tagger = SequenceTagger.load('de-ler')
# example text
text = "vom 6. August 2020. Alle Beschwerdeführer befinden sich derzeit gemeinsam im Urlaub auf der Insel Mallorca , die vom Robert-Koch-Institut als Risikogebiet eingestuft wird. Sie wollen am 29. August 2020 wieder nach Deutschland einreisen, ohne sich gemäß § 1 Abs. 1 bis Abs. 3 der Verordnung zur Testpflicht von Einreisenden aus Risikogebieten auf das SARS-CoV-2-Virus testen zu lassen. Die Verordnung sei wegen eines Verstoßes der ihr zugrunde liegenden gesetzlichen Ermächtigungsgrundlage, des § 36 Abs. 7 IfSG , gegen Art. 80 Abs. 1 Satz 1 GG verfassungswidrig."
sentence = Sentence(text)
# predict and print entities
tagger.predict(sentence)
for entity in sentence.get_spans('ner'):
    print(entity)
These huge corpora provide training data for NER in 176 languages. You can either load the language-specific parts of it by supplying a language code:
# load German Xtreme
german_corpus = XTREME('de')
print(german_corpus)
# load French Xtreme
french_corpus = XTREME('fr')
print(french_corpus)
Or you can load the default 40 languages at once into one huge MultiCorpus by not providing a language ID:
# load Xtreme MultiCorpus for all
multi_corpus = XTREME()
print(multi_corpus)
Dataset of tweets annotated with NER tags. Load with:
# load twitter dataset
corpus = TWITTER_NER()
# print example tweet
print(corpus.test[0])
Dataset of German-language speeches in the European Parliament annotated with standard NER tags like person and location. Load with:
# load corpus
corpus = EUROPARL_NER_GERMAN()
print(corpus)
# print a test sentence
print(corpus.test[1])
Dataset of English restaurant reviews annotated with entities like "dish", "location" and "rating". Load with:
# load restaurant dataset
corpus = MIT_RESTAURANTS()
# print example sentence
print(corpus.test[0])
As a first step towards supporting the Universal Proposition Banks, we add the first two UP datasets to Flair. Load with:
# load German UP
corpus = UP_GERMAN()
print(corpus)
# print example sentence
print(corpus.dev[1])
Adds the Kyoto dataset for Chinese. Load with:
# load Chinese UD dataset
corpus = UD_CHINESE_KYOTO()
# print example sentence
print(corpus.test[0])
Release 0.6 is a major biomedical NLP upgrade for Flair, adding state-of-the-art models for biomedical NER, support for 31 biomedical NER corpora, clinical POS tagging, speculation and negation detection in biomedical literature, and many other features such as multi-tagging and one-cycle learning.
Most of the biomedical models and datasets were developed together with the Knowledge Management in Bioinformatics group at the HU Berlin, in particular @leonweber and @mariosaenger. This page gives an overview of the new models and datasets, and example tutorials. Some highlights:
Flair now has pre-trained models for biomedical NER, trained over unified versions of 31 different biomedical corpora. Because they are trained on so many different datasets, the models have been shown to be very robust on new datasets, outperforming all previously available off-the-shelf biomedical NER tools. If you want to load a model to detect "diseases" in text, for instance, do:
# make a sentence
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")
# load disease tagger and predict
tagger = SequenceTagger.load("hunflair-disease")
tagger.predict(sentence)
Done! Let's print the diseases found by the tagger:
for entity in sentence.get_spans():
    print(entity)
This should print:
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
You can also get one model that finds 5 biomedical entity types (diseases, genes, species, chemicals and cell lines), like this:
# load bio-NER tagger and predict
tagger = MultiTagger.load("hunflair")
tagger.predict(sentence)
This should print:
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
Span [5]: "Fmr1" [− Labels: Gene (0.838)]
Span [7]: "Mouse" [− Labels: Species (0.9979)]
So it now also finds genes and species. As explained here, these models work best if you use them together with a biomedical tokenizer.
Flair now supports 31 biomedical NER datasets out of the box, both in their standard versions and in the "HUNER" splits for reproducibility of experiments. For a full list of datasets, refer to this page.
You can load a dataset like this:
# load one of the bioinformatics corpora
corpus = JNLPBA()
# print statistics and one sentence
print(corpus)
print(corpus.train[0])
We also include "huner" corpora that combine many different biomedical datasets into a single corpus. For instance, if you execute the following line:
# load combined chemicals corpus
corpus = HUNER_CHEMICAL()
This loads a combination of 6 different corpora that contain annotations of chemicals into a single corpus. This allows you to train stronger cross-corpus models, since you can now combine training data from many sources. See more info here.
Thanks to @LucasFerroHAILab, we now include a model for part-of-speech tagging in Portuguese clinical text. Run this model like this:
# load your tagger
tagger = SequenceTagger.load('pt-pos-clinical')
# example sentence
sentence = Sentence('O vírus Covid causa fortes dores .')
tagger.predict(sentence)
print(sentence)
You can find more details in their paper here.
Using the BioScope corpus, we trained a model to recognize negation and speculation in biomedical literature. Use it like this:
sentence = Sentence("The picture most likely reflects airways disease")
tagger = SequenceTagger.load("negation-speculation")
tagger.predict(sentence)
for entity in sentence.get_spans():
    print(entity)
This should print:
Span [4,5,6,7]: "likely reflects airways disease" [− Labels: SPECULATION (0.9992)]
This indicates that this portion of the sentence is speculation.
We added support for tagging text with multiple models at the same time. This can save memory usage and increase tagging speed.
For instance, if you want to run fine-grained and universal POS tagging, chunking, NER and frame detection on your text at the same time, do:
# load tagger for POS, chunking, NER and frame detection
tagger = MultiTagger.load(['pos', 'upos', 'chunk', 'ner', 'frame'])
# example sentence
sentence = Sentence("George Washington was born in Washington")
# predict
tagger.predict(sentence)
print(sentence)
This will give you a sentence annotated with 5 different layers of annotation.
Flair now includes convenience methods for sentence splitting. For instance, to use segtok to split and tokenize a text into sentences, use the following code:
from flair.tokenization import SegtokSentenceSplitter
# example text with many sentences
text = "This is a sentence. This is another sentence. I love Berlin."
# initialize sentence splitter
splitter = SegtokSentenceSplitter()
# use splitter to split text into list of sentences
sentences = splitter.split(text)
We also ship other splitters, such as the SpacySentenceSplitter (requires spaCy to be installed).
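If you want to try the spaCy-based splitter instead, a minimal sketch could look like this (it assumes spaCy and an English model such as en_core_web_sm are installed; the model name is only an example):
from flair.tokenization import SpacySentenceSplitter

# initialize the spaCy-based sentence splitter (model name is an example)
splitter = SpacySentenceSplitter('en_core_web_sm')

# use splitter to split text into list of sentences
sentences = splitter.split("This is a sentence. This is another sentence.")
print(sentences)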
Thanks to @himkt we now have expanded support for Japanese tokenization in Flair. For instance, use the following code to tokenize a Japanese sentence without installing extra libraries:
from flair.data import Sentence
from flair.tokenization import JapaneseTokenizer
# init japanese tokenizer
tokenizer = JapaneseTokenizer("janome")
# make sentence (and tokenize)
sentence = Sentence("私はベルリンが好き", use_tokenizer=tokenizer)
# output tokenized sentence
print(sentence)
Thanks to @lucaventurini2, Flair now supports one-cycle learning, which may give quicker convergence. For instance, train a model in 20 epochs using the code below:
from torch.optim.lr_scheduler import OneCycleLR

# train as always
trainer = ModelTrainer(tagger, corpus)
# set one-cycle LR as scheduler
trainer.train('onecycle_ner',
              scheduler=OneCycleLR,
              max_epochs=20)
The Sentence object (#1806) now executes tokenization (use_tokenizer=True) by default:
# Tokenizes by default
sentence = Sentence("I love Berlin.")
print(sentence)
# i.e. this is equivalent to
sentence = Sentence("I love Berlin.", use_tokenizer=True)
print(sentence)
# i.e. if you don't want to use tokenization, set it to False
sentence = Sentence("I love Berlin.", use_tokenizer=False)
print(sentence)
TransformerWordEmbeddings now handle long documents by default. Previously, you had to set allow_long_sentences=True to enable handling of long sequences (greater than 512 subtokens) in TransformerWordEmbeddings. This is no longer necessary, as this value is now set to True by default.
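As a small sketch (the model name is just an example), embedding a long text now works out of the box:
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# allow_long_sentences now defaults to True, so no extra argument is needed
embeddings = TransformerWordEmbeddings('bert-base-uncased')

# texts longer than 512 subtokens are split and embedded transparently
embeddings.embed(Sentence("A very long document with more than 512 subtokens ..."))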
BytePairEmbeddings (#1802)
ELMoEmbeddings (#1803)
TextClassifier if no label_type is passed (#1748)
Release 0.5.1 with new features, datasets and models, including support for sentence transformers, transformer embeddings for arbitrary length sentences, new Dutch NER models, new tasks and more refactorings of evaluation and training routines to better organize the code!
Adds a heuristic as a workaround to the max sequence length of some transformer embeddings, making it possible to now embed sequences of arbitrary length if you set allow_long_sentences=True
, like so:
TransformerWordEmbeddings(
allow_long_sentences=True, # set allow_long_sentences to True to enable this feature
),
It is now possible to set seeds when loading and downsampling corpora, so that the sample is always the same:
# set a random seed
import random
random.seed(4)
# load and downsample corpus
corpus = SENTEVAL_MR(filter_if_longer_than=50).downsample(0.1)
# print first sentence of dev and test
print(corpus.dev[0])
print(corpus.test[0])
Makes the reprojection layer in the SequenceTagger optional. You can control this behavior through the reproject_embeddings parameter. If you set it to True, embeddings are reprojected via a linear map onto vectors of the same size. If set to False, no reprojection happens. If you set this parameter to an integer, the linear map projects embedding vectors onto vectors of that size.
# tagger with standard reprojection
tagger = SequenceTagger(
hidden_size=256,
[...]
reproject_embeddings=True,
)
# tagger without reprojection
tagger = SequenceTagger(
hidden_size=256,
[...]
reproject_embeddings=False,
)
# reprojection to vectors of length 128
tagger = SequenceTagger(
hidden_size=256,
[...]
reproject_embeddings=128,
)
You can now optionally specify the "label name" of the predicted label. This may be useful if you want to for instance run two different NER models on the same sentence:
sentence = Sentence('I love Berlin')
# load two NER taggers
tagger_1 = SequenceTagger.load('ner')
tagger_2 = SequenceTagger.load('ontonotes-ner')
# specify label name of tagger_1 to be 'conll03_ner'
tagger_1.predict(sentence, label_name='conll03_ner')
# specify label name of tagger_2 to be 'onto_ner'
tagger_2.predict(sentence, label_name='onto_ner')
print(sentence)
This is useful if you have multiple NER taggers and wish to tag the same sentence with them, since you can then distinguish the tags by tagger. Note also that it is no longer possible to pass a string to the predict method - you must now pass a Sentence.
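You can then read out each annotation layer separately via its label name, for example (a small sketch building on the get_spans calls shown elsewhere in these notes):
# print the spans predicted by each tagger separately
for span in sentence.get_spans('conll03_ner'):
    print(span)
for span in sentence.get_spans('onto_ner'):
    print(span)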
Adds the SentenceTransformerDocumentEmbeddings class so you can get embeddings from the sentence-transformers library. Use as follows:
from flair.data import Sentence
from flair.embeddings import SentenceTransformerDocumentEmbeddings
# init embedding
embedding = SentenceTransformerDocumentEmbeddings('bert-base-nli-mean-tokens')
# create a sentence
sentence = Sentence('The grass is green .')
# embed the sentence
embedding.embed(sentence)
You can find a full list of their pretrained models here.
New NER models for Dutch: the new default model is BERT-based and has the highest accuracy:
from flair.data import Sentence
from flair.models import SequenceTagger
# load the default BERT-based model
tagger = SequenceTagger.load('nl-ner')
# tag sentence
sentence = Sentence('Ik hou van Amsterdam')
tagger.predict(sentence)
You can also load a Flair-based RNN model (might be faster on some setups):
# load the faster RNN-based model
tagger = SequenceTagger.load('nl-ner-rnn')
Adds a corpus of communicative functions in scientific literature, described in this LREC paper and available here. Load with:
corpus = COMMUNICATIVE_FUNCTIONS()
print(corpus)
We also ship a pre-trained model on this corpus, which you can load with:
# load communicative function tagger
tagger = TextClassifier.load('communicative-functions')
# example sentence
sentence = Sentence("However, previous approaches are limited in scalability .")
# predict and print labels
tagger.predict(sentence)
print(sentence.labels)
Added 3 datasets available for keyphrase extraction via sequence labeling: Inspec, SemEval-2017 and Processed SemEval-2010
Load like this:
inspec_corpus = INSPEC()
semeval_2010_corpus = SEMEVAL2010()
semeval_2017 = SEMEVAL2017()
We also ship a pre-trained keyphrase model, which you can load with:
# load keyphrase tagger
tagger = SequenceTagger.load('keyphrase')
# example sentence
sentence = Sentence("Here, we describe the engineering of a new class of ECHs through the "
"functionalization of non-conductive polymers with a conductive choline-based "
"bio-ionic liquid (Bio-IL).", use_tokenizer=True)
# predict and print labels
tagger.predict(sentence)
print(sentence)
Adds a corpus for Swedish NER using the dataset at https://github.com/klintan/swedish-ner-corpus/. Load with:
corpus = NER_SWEDISH()
print(corpus)
Adds corpus of legal named entities for German. Load with:
corpus = LER_GERMAN()
print(corpus)
We made a number of refactorings to the evaluation routines in Flair. In short: whenever possible, we now use the evaluation methods of sklearn (instead of our own implementations, which kept causing issues). This applies to text classification and (most) sequence tagging.
A notable exception is "span-F1" which is used to evaluate NER because there is no good way of counting true negatives. After this PR, our implementation should now exactly mirror the original conlleval
script of the CoNLL-02 challenge. In addition to using our reimplementation, an output file is now automatically generated that can be directly used with the conlleval
script.
In more detail, this PR makes the following changes:
Span is now a list of Token and can now be iterated like a sentence
flair.DataLoader is now used throughout
The evaluate() interface in the Model base class is changed so that it no longer requires a data loader, but can run either over a list of Sentence or over a Dataset
SequenceTagger.evaluate() now explicitly distinguishes between F1 and Span-F1. In the latter case, no TN are counted (#1663) and a non-sklearn implementation is used.
In the evaluate() method of the SequenceTagger and TextClassifier, we now explicitly call the .predict() method.
SequenceTagger (#1659)
DocumentPoolEmbeddings (#1671)
Release 0.5 with tons of new models, embeddings and datasets, support for fine-tuning transformers, greatly improved sentiment analysis models for English, tons of new features and big internal refactorings to better organize the code!
Flair 0.5 adds support for transformers and fine-tuning with two new embeddings classes: TransformerWordEmbeddings and TransformerDocumentEmbeddings, for word- and document-level transformer embeddings respectively. Both classes can be initialized with a model name that indicates what type of transformer (BERT, XLNet, RoBERTa, etc.) you wish to use (check the full list here).
If you want to embed the words in a sentence with transformers, do it like this:
from flair.embeddings import TransformerWordEmbeddings
# init embedding
embedding = TransformerWordEmbeddings('bert-base-uncased')
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
If instead you want to use RoBERTa, do:
from flair.embeddings import TransformerWordEmbeddings
# init embedding
embedding = TransformerWordEmbeddings('roberta-base')
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
To get a single embedding for the whole document with BERT, do:
from flair.embeddings import TransformerDocumentEmbeddings
# init embedding
embedding = TransformerDocumentEmbeddings('bert-base-uncased')
# create a sentence
sentence = Sentence('The grass is green .')
# embed the sentence
embedding.embed(sentence)
If instead you want to use RoBERTa, do:
from flair.embeddings import TransformerDocumentEmbeddings
# init embedding
embedding = TransformerDocumentEmbeddings('roberta-base')
# create a sentence
sentence = Sentence('The grass is green .')
# embed the sentence
embedding.embed(sentence)
Importantly, you can now fine-tune transformers to get state-of-the-art accuracies in text classification tasks.
Use TransformerDocumentEmbeddings
for this and set fine_tune=True
. Then, use the following example code:
from torch.optim.adam import Adam
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
# 1. get the corpus
corpus: Corpus = TREC_6()
# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()
# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)
# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)
# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)
# 6. start the training
trainer.train('resources/taggers/trec',
learning_rate=3e-5, # use very small learning rate
mini_batch_size=16,
mini_batch_chunk_size=4, # optionally set this if transformer is too much for your machine
max_epochs=5, # terminate after 5 epochs
)
Flair 0.5 adds a ton of new taggers, embeddings and datasets.
We added new sentiment models for English. The new models are trained over a combined corpus of sentiment datasets, including Amazon product reviews, so they should be applicable to more domains than the old sentiment models that were only trained with movie reviews.
There are two new models, a transformer-based model you can load like this:
# load tagger
classifier = TextClassifier.load('sentiment')
# predict for example sentence
sentence = Sentence("enormously entertaining for moviegoers of any age .")
classifier.predict(sentence)
# check prediction
print(sentence)
And a faster, slightly less accurate model based on RNNs you can load like this:
classifier = TextClassifier.load('sentiment-fast')
Adds fine-grained POS models for English, so you now have the option between 'pos' and 'upos' models for fine-grained and universal POS tags respectively. Load like this:
# Fine-grained POS model
tagger = SequenceTagger.load('pos')
# Fine-grained POS model (fast variant)
tagger = SequenceTagger.load('pos-fast')
# Universal POS model
tagger = SequenceTagger.load('upos')
# Universal POS model (fast variant)
tagger = SequenceTagger.load('upos-fast')
Load the language models with:
embeddings_forward = FlairEmbeddings('de-historic-rw-forward')
embeddings_backward = FlairEmbeddings('de-historic-rw-backward')
embeddings_forward = FlairEmbeddings('ml-forward')
embeddings_backward = FlairEmbeddings('ml-backward')
Adds the recently trained Flair embeddings on historic newspapers for German/English/French provided by the CLEF HIPE shared task.
You can now load a Finnish NER corpus with
ner_finnish = flair.datasets.NER_FINNISH()
You can now load a Danish NER corpus with
dane = flair.datasets.DANE()
Adds 6 SentEval classification datasets to Flair:
senteval_corpus_1 = flair.datasets.SENTEVAL_CR()
senteval_corpus_2 = flair.datasets.SENTEVAL_MR()
senteval_corpus_3 = flair.datasets.SENTEVAL_SUBJ()
senteval_corpus_4 = flair.datasets.SENTEVAL_MPQA()
senteval_corpus_5 = flair.datasets.SENTEVAL_SST_BINARY()
senteval_corpus_6 = flair.datasets.SENTEVAL_SST_GRANULAR()
Adds two new sentiment datasets to Flair, namely AMAZON_REVIEWS, a very large corpus of Amazon reviews with sentiment labels, and SENTIMENT_140, a corpus of tweets labeled with sentiment.
amazon_reviews = flair.datasets.AMAZON_REVIEWS()
sentiment_140 = flair.datasets.SENTIMENT_140()
biofid = flair.datasets.BIOFID()
Refactored the DataPoint class and the classes that inherit from it (Token, Sentence, Image, Span, etc.) so that all have the same methods for adding and accessing labels. The DataPoint base class now defines the labeling methods (closes #1449). Labels can no longer be passed to the Sentence constructor, so instead of:
sentence_1 = Sentence("this is great", labels=[Label("POSITIVE")])
you should now do:
sentence_1 = Sentence("this is great")
sentence_1.add_label('sentiment', 'POSITIVE')
or:
sentence_1 = Sentence("this is great").add_label('sentiment', 'POSITIVE')
Note that Sentence labels now have a label_type
(in the example that's 'sentiment').
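To read labels of a particular type back out, something like this should work (a small sketch using the get_labels method of the new labeling interface):
# access all labels of type 'sentiment'
for label in sentence_1.get_labels('sentiment'):
    print(label.value, label.score)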
The Corpus method _get_class_to_count is renamed to _count_sentence_labels
The Corpus method _get_tag_to_count is renamed to _count_token_labels
Span is now a DataPoint (so it has an embedding and labels)
Split the previously huge embeddings.py into several submodules organized in an embeddings/ folder. The submodules are:
token.py for all TokenEmbeddings classes
document.py for all DocumentEmbeddings classes
image.py for all ImageEmbeddings classes
legacy.py for embeddings that are now deprecated
base.py for remaining basic classes
All embeddings are still exposed through the embeddings package, so the command to load them doesn't change, e.g.:
from flair.embeddings import FlairEmbeddings
embeddings = FlairEmbeddings('news-forward')
so specifying the submodule is not needed.
Split the previously huge datasets.py into several submodules organized in a datasets/ folder. The submodules are:
sequence_labeling.py for all sequence labeling datasets
document_classification.py for all document classification datasets
treebanks.py for all dependency parsed corpora (UD treebanks)
text_text.py for all bi-text datasets (currently only parallel corpora)
text_image.py for all paired text-image datasets (currently only Feidegger)
base.py for remaining basic classes
All datasets are still exposed through the datasets package, so it is still possible to load corpora with
from flair.datasets import TREC_6
without specifying the submodule.
Small refactorings on flair.datasets
for easier code legibility and fewer redundancies, removing about 100 lines of code: (1) Moved the default sampling logic from all corpora classes to the parent Corpus
class. You can now instantiate a Corpus
only with a train file which will trigger the sampling. (2) Moved the default logic for identifying train, dev and test files into a dedicated method to avoid duplicates in code.
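For instance, a ColumnCorpus can now be created from a train file alone, with dev and test splits sampled automatically (a sketch; paths and column format are placeholders):
corpus = ColumnCorpus(
    'path/to/data_folder',
    column_format={0: 'text', 1: 'ner'},
    train_file='train.txt',  # dev and test splits are sampled from this file
)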
You now have the option of specifying a document_delimiter when training a LanguageModel. Say, you have a corpus of textual lists and use "[SEP]" to mark boundaries between two lists, like this:
Colors:
- blue
- green
- red
[SEP]
Cities:
- Berlin
- Munich
[SEP]
...
Then you can now train a language model by setting the document_delimiter
in the TextCorpus
and LanguageModel
objects. This will make sure only documents as a whole will get shuffled during training (i.e. the lists in the above example):
# your document delimiter
delimiter = '[SEP]'
# set it when you load the corpus
corpus = TextCorpus(
"data/corpora/conala-corpus/",
dictionary,
is_forward_lm,
character_level=True,
document_delimiter=delimiter,
)
# set it when you init the language model
language_model = LanguageModel(
dictionary,
is_forward_lm=True,
hidden_size=512,
nlayers=1,
document_delimiter=delimiter
)
# train your language model as always
trainer = LanguageModelTrainer(language_model, corpus)
Added the possibility to set a different column delimiter for ColumnCorpus, i.e.
corpus = ColumnCorpus(
Path("/path/to/corpus/"),
column_format={0: 'text', 1: 'ner'},
column_delimiter='\t', # set a different delimiter
)
if you want to read a tab-separated column corpus.
There are a number of improvements for the ClassificationCorpus and ClassificationDataset classes.
Adds a new scheduler that uses the dev score as the main metric to anneal against, but additionally uses the dev loss in case two epochs have the same dev score.
Adds the option to choose which hidden state to use in FlairEmbeddings: either the state at the end of each word, or the state at the whitespace after. Default is the state at the whitespace after.
You can change the default like this:
embeddings = FlairEmbeddings('news-forward', with_whitespace=False)
This configuration seems to be better for syntactic tasks. For POS tagging, it seems that you should set with_whitespace=False
. For instance, on UD_ENGLISH POS-tagging, we get 96.56 +- 0.03 with whitespace and 96.72 +- 0.04 without, averaged over three runs.
See the discussion in #1362 for more details.
Added the option of passing different tokenizers when loading classification datasets (#1579)
Added option for true whitespaces in ColumnCorpus #1583
Configurable cache_root from environment variable (#507)
Improve performance for loading not-in-memory corpus (#1413)
A new lmdb based alternative backend for word embeddings (#1515 #1536)
Slim down requirements (#1419)
Fix issue where flair was crashing for cpu only version of pytorch (#1393 #1418)
Fix GPU memory error in PooledFlairEmbeddings (#1417)
Various small fixes (#1402 #1533 #1511 #1560 #1616)
Improve documentation (#1446 #1447 #1520 #1525 #1556)
Fix various issues in classification datasets (#1499)
This is an enhancement release that slims down Flair for quicker/easier installation and smaller library size. It also makes Flair compatible with torch 1.4.0 and adds enhancements that reduce model size and improve runtime speed for some embeddings. New features include the ability to steer the precision/recall tradeoff during training of models and support for CamemBERT embeddings.
We want to keep the list of dependencies of Flair generally small to avoid errors like #1245 and to keep the library small and quick to set up. So we removed dependencies that were each only used for one particular feature, namely:
ipython and ipython-genutils, only used for visualization settings in iPython notebooks
tiny_tokenizer, used for Japanese tokenization (replaced with instructions for how to install for all users who want to use Japanese tokenizers)
pymongo, used for MongoDB datasets (replaced with instructions for how to install for all users who want to use MongoDB datasets)
torchvision, now only loaded when needed
We also relaxed version requirements for easier installation on Google CoLab (#1335 #1336)
@shoarora optimized the BERTEmbeddings implementation by removing redundant calls. This was shown to lead to dramatic speed improvements.
@timnon added a method to replace the word embeddings in a trained model with an sqlite database to dramatically reduce memory usage. This creates the class WordEmbeddingsStore, which can be used to replace a WordEmbeddings instance in a Flair model via duck-typing. By using this, @timnon was able to reduce our NER server's memory consumption from 6GB to 600MB (a 10x decrease) by adding a few lines of code. It can be tested using the following lines (also in the docstring). First create a headless version of a model without word embeddings:
from flair.inference_utils import WordEmbeddingsStore
from flair.models import SequenceTagger
import pickle
tagger = SequenceTagger.load("multi-ner-fast")
WordEmbeddingsStore.create_stores(tagger)
pickle.dump(tagger, open("multi-ner-fast-headless.pickle", "wb"))
and then to run the stored headless model without word embeddings, use:
from flair.data import Sentence
tagger = pickle.load(open("multi-ner-fast-headless.pickle", "rb"))
WordEmbeddingsStore.load_stores(tagger)
text = "Schade um den Ameisenbären. Lukas Bärfuss veröffentlicht Erzählungen aus zwanzig Jahren."
sentence = Sentence(text)
tagger.predict(sentence)
@klasocki added ways to steer the precision/recall tradeoff during training of models, as well as prioritize certain classes. This option was added to the SequenceTagger
and the TextClassifier
.
You can steer the precision/recall tradeoff by adding the beta parameter, which indicates how many times more important recall is than precision. So if you set beta=0.5, precision becomes twice as important as recall. If you set beta=2, recall becomes twice as important as precision. Do it like this:
tagger = SequenceTagger(
hidden_size=256,
embeddings=embeddings,
tag_dictionary=tag_dictionary,
tag_type=tag_type,
beta=0.5)
If you want to prioritize certain classes, you can pass a loss_weights dictionary to the model classes. For instance, to prioritize learning the NEGATIVE class in a sentiment tagger, do:
tagger = TextClassifier(
document_embeddings=embeddings,
label_dictionary=tag_dictionary,
loss_weights={'NEGATIVE': 10.})
which will increase the importance of class NEGATIVE by a factor of 10.
@stefan-it added support for the recently proposed French language model: CamemBERT.
Thanks to the awesome 🤗/Transformers library, CamemBERT can be used in Flair like in this example:
from flair.data import Sentence
from flair.embeddings import CamembertEmbeddings
embedding = CamembertEmbeddings()
sentence = Sentence("J'aime le camembert !")
embedding.embed(sentence)
for token in sentence.tokens:
    print(token.embedding)
Release 0.4.4 introduces dramatic improvements in inference speed for taggers (thanks to many contributions by @pommedeterresautee), Flair embeddings in 300 languages (thanks @stefan-it), modular tokenization and many new features and refactorings.
Many refactorings by @pommedeterresautee to improve inference speed of sequence tagger (#1038 #1053 #1068 #1093 #1130), Flair embeddings (#1074 #1095 #1107 #1132 #1145), word embeddings (#1084), embeddings memory management (#1082 #1117), general optimizations (#1112) and classification (#1187).
The combined improvements increase inference speed by a factor of 2-3!
You can now pass custom tokenizers to Sentence
objects and Dataset
loaders to use different tokenizers than the included segtok
library by implementing a tokenizer method. Currently, in-built support exists for whitespace tokenization, segtok tokenization and Japanese tokenization with mecab (requires mecab to be installed). In the future, we expect support for additional external tokenizers to be added.
For instance, if you wish to use Japanese tokenization performed by mecab, you can instantiate the Sentence object like this:
from flair.data import build_japanese_tokenizer
from flair.data import Sentence
# instantiate Japanese tokenizer
japanese_tokenizer = build_japanese_tokenizer()
# init sentence and pass this tokenizer
sentence = Sentence("私はベルリンが好きです。", use_tokenizer=japanese_tokenizer)
print(sentence)
Thanks to @stefan-it, there is now a massively multilingual Flair embeddings model that covers 300 languages. See #1099 for more info on these embeddings and this repo for more details.
This replaces the old multilingual Flair embeddings that were trained for 6 languages. Load them with:
embeddings_fw = FlairEmbeddings('multi-forward')
embeddings_bw = FlairEmbeddings('multi-backward')
Adds two multilingual character dictionaries computed by @stefan-it.
Load with
dictionary = Dictionary.load('chars-large')
print(len(dictionary.idx2item))
dictionary = Dictionary.load('chars-xl')
print(len(dictionary.idx2item))
The paper Don't Decay the Learning Rate, Increase the Batch Size makes the case for increasing the batch size over time instead of annealing the learning rate.
This version adds the possibility to have arbitrarily large mini-batch sizes with an accumulating gradient strategy. It introduces the parameter mini_batch_chunk_size
that you can set to break down large mini-batches into smaller chunks for processing purposes.
So let's say you want to have a mini-batch size of 128, but your memory cannot handle more than 32 samples at a time. Then you can train like this:
trainer = ModelTrainer(tagger, corpus)
trainer.train(
"path/to/experiment/folder",
# set large mini-batch size
mini_batch_size=128,
# set chunk size to lower memory requirements
mini_batch_chunk_size=32,
)
Because we can now arbitrarily raise the mini-batch size, we can execute the annealing strategy in the above paper. Do it like this:
trainer = ModelTrainer(tagger, corpus)
trainer.train(
"path/to/experiment/folder",
# set initial mini-batch size
mini_batch_size=32,
# choose batch growth annealing
batch_growth_annealing=True,
)
Introduces the option for reading entire documents into one Sentence object for sequence labeling. This option is now supported for CONLL_03
, CONLL_03_GERMAN
and CONLL_03_DUTCH
datasets which indicate document boundaries.
Here's how to train a model on CoNLL-03 on the document level:
# read CoNLL-03 with document_as_sequence=True
corpus = CONLL_03(in_memory=True, document_as_sequence=True)
# what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# init simple tagger with GloVe embeddings
tagger: SequenceTagger = SequenceTagger(
hidden_size=256,
embeddings=WordEmbeddings('glove'),
tag_dictionary=tag_dictionary,
tag_type=tag_type,
)
# initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
# start training
trainer.train(
'path/to/your/experiment',
# set a much smaller mini-batch size because documents are huge
mini_batch_size=2,
)
Previously, the ModelTrainer
only allowed monitoring of dev and test splits during training. Now, you can also monitor the train split to better check if your method is overfitting.
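A minimal sketch, assuming the monitor_train flag of ModelTrainer.train():
trainer = ModelTrainer(tagger, corpus)
trainer.train(
    'path/to/experiment/folder',
    monitor_train=True,  # also evaluate on the training split during training
    monitor_test=True,
)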
Adds support for Danish POS and NER thanks to @AmaliePauli!
Use like this:
from flair.data import Sentence
from flair.models import SequenceTagger
# example sentence
sentence = Sentence("København er en fantastisk by .")
# load Danish NER model and predict
ner_tagger = SequenceTagger.load('da-ner')
ner_tagger.predict(sentence)
# print annotations (NER)
print(sentence.to_tagged_string())
# load Danish POS model and predict
pos_tagger = SequenceTagger.load('da-pos')
pos_tagger.predict(sentence)
# print annotations (NER + POS)
print(sentence.to_tagged_string())
You can use DistilBERT embeddings like this:
from flair.data import Sentence
from flair.embeddings import BertEmbeddings
embeddings = BertEmbeddings("distilbert-base-uncased")
s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)
for token in s.tokens:
    print(token.embedding)
    print(token.embedding.shape)
Adds the option of reading data from MongoDB. See this documentation on how to use this feature.
Adds a dataset downloader for the Feidegger corpus consisting of text-image pairs. Instantiate the corpus like this:
from flair.datasets import FeideggerCorpus
# instantiate Feidegger corpus
corpus = FeideggerCorpus()
# print a text-image pair
print(corpus.train[0])
Refactored the checkpointing mechanism and slimmed down interfaces / code required to load checkpoints.
In detail:
save_checkpoint and load_checkpoint are no longer part of the flair.nn.Model interface. Instead, saving and restoring checkpoints is now (fully) performed by the ModelTrainer.
Checkpoint-related parameters were also removed from the ModelTrainer constructor since they are no longer required there.
# 1. initialize trainer as always with a model and a corpus
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(model, corpus)
# 2. train your model for 2 epochs
trainer.train(
'experiment/folder',
max_epochs=2,
# example checkpointing
checkpoint=True,
)
# 3. load last checkpoint with one line of code
trainer = ModelTrainer.load_checkpoint('experiment/folder/checkpoint.pt', corpus)
# 4. continue training for 2 extra epochs
trainer.train('experiment/folder_2', max_epochs=4)
Adds a FlairSampler
interface to better enable passing custom samplers to the ModelTrainer
.
For instance, if you want to always shuffle your dataset in chunks of 5 to 10 sentences, you provide a sampler like this:
# your trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
# execute training run
trainer.train('path/to/experiment/folder',
max_epochs=150,
# sample data in chunks of 5 to 10
sampler=ChunkSampler(block_size=5, plus_window=5)
)
Switch everything to batch first mode (#1077)
Refactor classification to be more consistent with SequenceTagger (#1151)
PyTorch-Transformers -> Transformers #1163
In-place transpose of tensors (#1047)
rnn_type used in SequenceTagger (#1113)
Example usage:
# init tagger
tagger= SequenceTagger.load('ner')
# predict over list of strings
sentences = tagger.predict(
[
'George Washington went to Berlin .',
'George Berlin lived in Washington .'
]
)
# output predictions
for sentence in sentences:
    print(sentence.to_tagged_string())
CSVClassificationDataset (#1055)
CharacterEmbeddings (#1088)
StackedEmbeddings always has the same embedding order (#1114)
cache_root (#1134)
Release 0.4.3 includes a host of new features including transformer-based embeddings (RoBERTa, XLNet, XLM, etc.), fine-tuneable FlairEmbeddings, crosslingual MUSE embeddings, new data loading/sampling methods, speed/memory optimizations, bug fixes and enhancements. It also begins a refactoring of interfaces that prepares more general applicability of Flair to other types of downstream tasks.
Updates the old pytorch-pretrained-BERT
library to the latest version of pytorch-transformers
to support various new Transformer-based architectures for embeddings.
A total of 7 (new/updated) transformer-based embeddings can be used in Flair now:
from flair.embeddings import (
BertEmbeddings,
OpenAIGPTEmbeddings,
OpenAIGPT2Embeddings,
TransformerXLEmbeddings,
XLNetEmbeddings,
XLMEmbeddings,
RoBERTaEmbeddings,
)
bert_embeddings = BertEmbeddings()
gpt1_embeddings = OpenAIGPTEmbeddings()
gpt2_embeddings = OpenAIGPT2Embeddings()
txl_embeddings = TransformerXLEmbeddings()
xlnet_embeddings = XLNetEmbeddings()
xlm_embeddings = XLMEmbeddings()
roberta_embeddings = RoBERTaEmbeddings()
Detailed benchmarks on the downsampled CoNLL-2003 NER dataset for English can be found in #873 .
Use the new MuseCrosslingualEmbeddings class to embed any sentence in one of 30 languages into the same embedding space. Behind the scenes, the class first does language detection on the sentence to be embedded, and then embeds it with the appropriate language embeddings. If you train a classifier or sequence labeler with (only) this class, it will automatically work across all 30 languages, though quality may vary widely.
Here's how to embed:
# initialize embeddings
embeddings = MuseCrosslingualEmbeddings()
# two sentences in different languages
sentence_1 = Sentence("This red shoe is new .")
sentence_2 = Sentence("Dieser rote Schuh ist rot .")
# language code is auto-detected
print(sentence_1.get_language_code())
print(sentence_2.get_language_code())
# embed sentences
embeddings.embed([sentence_1, sentence_2])
# print similarities
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
for token_1, token_2 in zip(sentence_1, sentence_2):
    print(f"'{token_1.text}' and '{token_2.text}' similarity: {cos(token_1.embedding, token_2.embedding)}")
Adds FastTextEmbeddings capable of handling OOV (out-of-vocabulary) words. Be warned though that these embeddings are huge. BytePairEmbeddings are much smaller and reportedly of similar quality, so it is probably advisable to use those instead.
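A minimal sketch for loading them (the file name is just a placeholder for a local fastText binary):
from flair.embeddings import FastTextEmbeddings

# load a fastText model that can also produce embeddings for out-of-vocabulary words
embeddings = FastTextEmbeddings('path/to/cc.en.300.bin')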
You can now fine-tune FlairEmbeddings on downstream tasks. You can fine-tune an existing LM by simply passing the fine_tune
parameter in the FlairEmbeddings
constructor, like this:
embeddings = FlairEmbeddings('news-forward', fine_tune=True)
You can also use this option to task-train a wholly new language model by passing an empty LanguageModel
to the FlairEmbeddings
constructor and the fine_tune
parameter, like this:
# make an empty language model
language_model = LanguageModel(
Dictionary.load('chars'),
is_forward_lm=True,
hidden_size=256,
nlayers=1)
# init FlairEmbeddings to task-train this model
embeddings = FlairEmbeddings(language_model, fine_tune=True)
Mixed precision training can significantly speed up training. It can now be enabled by setting use_amp=True
in the trainer classes. For instance for training language models you can do:
# train your language model
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model',
sequence_length=256,
mini_batch_size=256,
max_epochs=10,
use_amp=True)
In our experiments, we saw 3x speedup of training large language models though results vary depending on your model size and experimental setup.
This release introduces the embeddings_storage_mode
parameter to the ModelTrainer
class and predict()
methods. This parameter can be one of 'none', 'cpu' and 'gpu' and allows you to control the tradeoff between memory usage and speed during training:
To use this option during training, simply set the parameter:
# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
"path/to/your/model",
embeddings_storage_mode='gpu',
)
This release also removes the FlairEmbeddings
-specific disk-caching mechanism. In the future, a more general caching mechanism applicable to all embedding types may potentially be added as a fourth memory management option.
A new DataLoader
abstract base class used in Flair will speed up data loading for in-memory datasets.
This release also slims down interfaces of flair.nn.Model
and adds a new DataPoint
interface that is currently implemented by the Token
and Sentence
classes. The idea is to widen the applicability of Flair to other data types and other tasks. In the future, the DataPoint
interface will for example also be implemented by an Image
object and new downstream tasks added to Flair.
The release also slims down the evaluate() method in the flair.nn.Model interface to take a DataLoader instead of a group of parameters, and refactors the logging header logic. Both refactorings prepare adding new downstream tasks to Flair in the near future.
Adds the CSVClassificationCorpus so you can train classifiers directly from CSVs instead of first having to convert to FastText format. To load a CSV, you need to pass a column_name_map (like in ColumnCorpus), which indicates which columns in the CSV hold the text and which hold the label(s):
corpus = CSVClassificationCorpus(
# path to the data folder containing train / test / dev files
data_folder='path/to/data',
# indicates which columns are text and labels
column_name_map={4: "text", 1: "label_topic", 2: "label_subtopic"},
# if CSV has a header, you can skip it
skip_header=True)
We added the first (of many) data samplers that can be passed to the ModelTrainer
to influence training. The ImbalancedClassificationDatasetSampler
for instance will upsample rare classes and downsample common classes in a classification dataset. It may potentially help with imbalanced datasets. Call like this:
# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
'path/to/folder',
learning_rate=0.1,
mini_batch_size=32,
sampler=ImbalancedClassificationDatasetSampler,
)
There are two experimental chunk samplers (ChunkSampler and ExpandingChunkSampler) that split a dataset into chunks and shuffle them. This preserves some of the ordering of the original data while also randomizing the data.
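Usage mirrors the ChunkSampler example shown further above, e.g.:
# pass one of the experimental chunk samplers to the trainer
trainer.train(
    'path/to/folder',
    sampler=ChunkSampler(block_size=5, plus_window=5),
)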
from flair.visual.ner_html import render_ner_html
tagger = SequenceTagger.load('ner')
sentence = Sentence(
"Thibaut Pinot's challenge ended on Friday due to injury, and then Julian Alaphilippe saw "
"his lead fall away. The BBC's Hugh Schofield in Paris reflects on 34 years of hurt."
)
tagger.predict(sentence)
html = render_ner_html(sentence)
with open("sentence.html", "w") as writer:
writer.write(html)
CharacterEmbeddings now let you specify number of hidden states and embedding size (#834): embedding = CharacterEmbedding(char_embedding_dim=64, hidden_size_char=64)
num_workers is a parameter of LanguageModelTrainer (#962)
DocumentRNNEmbeddings (#793)
ELMoEmbeddings now use flair.device param (#825)
ColumnCorpus in which words that begin with hashtags were skipped as comments (#956)
max_tokens_per_doc param in ClassificationCorpus (#991)
ColumnCorpus (#990)
ELMoEmbeddings (#1019)
SequenceTagger (#899)
SequenceTagger now optionally returns a distribution of tag probabilities over all classes (#782 #949 #1016)
bad_epochs in training logs and no longer evaluates on test data at each epoch by default (#818)
Release 0.4.2 includes new features such as streaming data loading (allowing training over very large datasets), support for OpenAI GPT Embeddings, pre-trained Flair embeddings for many new languages, better classification baselines using one-hot embeddings and fine-tuneable document pool embeddings, and text regression as a third task next to sequence labeling and text classification.
The data loading part has been completely refactored to enable streaming data loading from disk using PyTorch's DataLoaders. I.e. training no longer requires the full dataset to be kept in memory, allowing us to train models over much larger datasets. This version also changes the syntax of how to load datasets.
Old way (now deprecated):
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
New way:
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()
To use streaming loading, i.e. to not load into memory, you can pass the in_memory
parameter:
import flair.datasets
corpus = flair.datasets.UD_ENGLISH(in_memory=False)
This release brings Flair embeddings to 11 new languages (thanks @stefan-it!): Arabic (ar), Danish (da), Persian (fa), Finnish (fi), Hebrew (he), Hindi (hi), Croatian (hr), Indonesian (id), Italian (it), Norwegian (no) and Swedish (sv). It also improves support for Bulgarian (bg), Czech, Basque (eu), Dutch (nl) and Slovenian (sl), and adds special language models for historical German. Load with language code, i.e.
# load Flair embeddings for Italian
embeddings = FlairEmbeddings('it-forward')
Some classification baselines work astonishingly well with simple learnable word embeddings. To support testing these baselines, we've added learnable word embeddings that start from a one-hot encoding of words. To initialize, you need to pass a corpus to initialize the vocabulary.
# load corpus
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()
# init learnable word embeddings with corpus
embeddings = OneHotEmbeddings(corpus)
DocumentPoolEmbeddings (#747): We now allow users to specify a fine-tuning option that is applied before the pooling operation in document pool embeddings. Options are 'none' (no fine-tuning), 'linear' (linear remapping of word embeddings) and 'nonlinear' (nonlinear remapping of word embeddings). 'nonlinear' should be used together with WordEmbeddings, while 'none' should be used with OneHotEmbeddings (fine-tuning is not necessary there since they are already learnt on data). So, to replicate FastText classification you can either do:
# instantiate one-hot encoded word embeddings
embeddings = OneHotEmbeddings(corpus)
# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')
or
# instantiate pre-trained word embeddings
embeddings = WordEmbeddings('glove')
# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')
We now support embeddings from the OpenAI GPT model. We use the excellent pytorch-pretrained-BERT library to download the GPT model, tokenize the input and extract embeddings from the subtokens.
Initialize with:
embeddings = OpenAIGPTEmbeddings()
Previously, we had the SequenceTagger
and TextClassifier
as the two downstream tasks supported by Flair. The ModelTrainer
had specific methods to train these two models, making it difficult for users to add new types of tasks (such as text regression) to Flair.
This release refactors the flair.nn.Model
and ModelTrainer
functionality to make it uniform across tagging models and enable users to add new tasks to Flair. Now, by implementing the 5 methods in the flair.nn.Model
interface, a custom model immediately becomes trainable with the ModelTrainer
. Now, three types of downstream tasks implement this interface:
SequenceTagger,
TextClassifier and
TextRegressor.
The code refactor removes a lot of code redundancies and slims down the interfaces of the downstream task classes. As the sole breaking change, it removes the load_from_file() methods, which are now part of the load() method, i.e. if previously you loaded a self-trained model like this:
tagger = SequenceTagger.load_from_file('/path/to/model.pt')
You now do it like this:
tagger = SequenceTagger.load('/path/to/model.pt')
# corpus
corpus = TREC_6()
# make label_dictionary
label_dictionary = corpus.make_label_dictionary()
# init text classifier
classifier = TextClassifier(document_embeddings, label_dictionary)
flair.datasets (#749)
flair.datasets (#749)
flair.datasets (NEWSGROUPS object)
Release 0.4.1 with lots of new features, new embeddings (RNN, Transformer and BytePair embeddings), new languages (Japanese, Spanish, Basque), new datasets, bug fixes and speed improvements (2x training speed for language models).
Added the first embeddings trained over PubMed data, namely Flair and ELMo embeddings.
Load these for instance with:
# Flair embeddings PubMed
flair_embedding_forward = FlairEmbeddings('pubmed-forward')
flair_embedding_backward = FlairEmbeddings('pubmed-backward')
# ELMo embeddings PubMed
elmo_embeddings = ELMoEmbeddings('pubmed')
Added the byte pair embeddings library by @bheinzerling. Support for 275 languages. Very useful if you want to train small models. Load these for instance with:
# initialize embeddings
embeddings = BytePairEmbeddings(language='en')
Transformer-XL embeddings added by @stefan-it. Load with:
# initialize embeddings
embeddings = TransformerXLEmbeddings()
Experimental transformer version of ELMo embeddings added by @stefan-it.
The new DocumentRNNEmbeddings class replaces the now-deprecated DocumentLSTMEmbeddings. This class allows you to choose which type of RNN you want to use. By default, it uses a GRU.
Initialize like this:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
glove_embedding = WordEmbeddings('glove')
document_lstm_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')
Added FlairEmbeddings for Japanese trained by @frtacoa and @minh-agent:
# forward and backward embedding
embeddings_fw = FlairEmbeddings('japanese-forward')
embeddings_bw = FlairEmbeddings('japanese-backward')
Added pre-computed FlairEmbeddings for Spanish. Embeddings were computed over Wikipedia by @iamyihwa (see #80). To load Spanish FlairEmbeddings, simply do:
# default forward and backward embedding
embeddings_fw = FlairEmbeddings('spanish-forward')
embeddings_bw = FlairEmbeddings('spanish-backward')
# CPU-friendly forward and backward embedding
embeddings_fw_fast = FlairEmbeddings('spanish-forward-fast')
embeddings_bw_fast = FlairEmbeddings('spanish-backward-fast')
FlairEmbeddings for Basque, which we now include. Load with:
forward_lm_embeddings = FlairEmbeddings('basque-forward')
backward_lm_embeddings = FlairEmbeddings('basque-backward')
wikipedia_embeddings = WordEmbeddings('eu-wiki')
crawl_embeddings = WordEmbeddings('eu-crawl')
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.TREC_6)
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_BASQUE)
corpus_ner = NLPTaskDataFetcher.load_corpus(NLPTask.NER_BASQUE)
FlairEmbeddings can now be generated for arbitrarily long strings without causing out-of-memory errors. See #444
The language model's calculate_perplexity method can be used like this:
from flair.embeddings import FlairEmbeddings
# get language model
language_model = FlairEmbeddings('news-forward-fast').lm
# calculate perplexity for grammatical sentence
grammatical = 'The company made a profit'
perplexity_grammatical_sentence = language_model.calculate_perplexity(grammatical)
# calculate perplexity for ungrammatical sentence
ungrammatical = 'Nook negh qapla!'
perplexity_ungrammatical_sentence = language_model.calculate_perplexity(ungrammatical)
# print both
print(f'"{grammatical}" - perplexity is {perplexity_grammatical_sentence}')
print(f'"{ungrammatical}" - perplexity is {perplexity_ungrammatical_sentence}')
Release 0.4 with new models, lots of new languages, experimental multilingual models, hyperparameter selection methods, BERT and ELMo embeddings, etc.
We now include new language models for many new languages, in addition to English and German. You can load FlairEmbeddings for Dutch for instance with:
flair_embeddings = FlairEmbeddings('dutch-forward')
We now include pre-trained FastText Embeddings for 30 languages: English, German, Dutch, Italian, French, Spanish, Swedish, Danish, Norwegian, Czech, Polish, Finnish, Bulgarian, Portuguese, Slovenian, Slovakian, Romanian, Serbian, Croatian, Catalan, Russian, Hindi, Arabic, Chinese, Japanese, Korean, Hebrew, Turkish, Persian, Indonesian.
Each language has embeddings trained over Wikipedia, or Web crawls. So instantiate with:
# German embeddings computed over Wikipedia
german_wikipedia_embeddings = WordEmbeddings('de-wiki')
# German embeddings computed over web crawls
german_crawl_embeddings = WordEmbeddings('de-crawl')
Thanks to the Flair community, we now include NER models for more languages, next to the previous models for English and German.
Thanks to the Flair community, we now include PoS models for more languages.
As a major new feature, we now include models that can tag text in various languages.
We include a PoS model trained over 12 different languages (English, German, Dutch, Italian, French, Spanish, Portuguese, Swedish, Norwegian, Danish, Finnish, Polish, Czech).
# load model
tagger = SequenceTagger.load('pos-multi')
# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort kaufte er einen Hut .')
# predict PoS tags
tagger.predict(sentence)
# print sentence with predicted tags
print(sentence.to_tagged_string())
We include a NER model trained over 4 different languages (English, German, Dutch, Spanish).
# load model
tagger = SequenceTagger.load('ner-multi')
# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort traf er Thomas Jefferson .')
# predict NER tags
tagger.predict(sentence)
# print sentence with predicted tags
print(sentence.to_tagged_string())
This model also kind of works on other languages, such as French.
Flair now also includes two pre-trained classification models. Simply load the TextClassifier using the preferred model, such as:
TextClassifier.load('en-sentiment')
We added both BERT and ELMo embeddings so you can try them out, and mix and match them with Flair embeddings or any other embedding types. We hope this will enable the research community to better compare and combine approaches.
We added BERT embeddings to Flair. We are using the implementation of huggingface. The embeddings can be used as any other embedding type in Flair:
from flair.embeddings import BertEmbeddings
# init embedding
embedding = BertEmbeddings()
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
Flair now also includes ELMo embeddings. We use the implementation of AllenNLP. As this implementation comes with a lot of sub-dependencies, you need to first install the library via pip install allennlp
before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:
from flair.embeddings import ELMoEmbeddings
# init embedding
embedding = ELMoEmbeddings()
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
You can now train a model on multiple datasets with the MultiCorpus object. We use this to train our multilingual models.
Just create multiple corpora and put them into MultiCorpus
:
english_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
german_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_GERMAN)
dutch_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_DUTCH)
multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
The multi_corpus
can now be used for training, just as any other corpus before. Check the tutorial for more details.
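For example (a small sketch reusing the trainer setup shown elsewhere in these notes):
# train a multilingual tagger on the combined corpus, just like on a single corpus
trainer = ModelTrainer(tagger, multi_corpus)
trainer.train('resources/taggers/multilingual-model')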
We built a wrapper around hyperopt to allow you to search for the best hyperparameters for your downstream task.
Define your search space and start training using several different parameter settings. The results are written to a specific file called param_selection.txt
in the result directory. Check the tutorial for more details.
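A minimal sketch of how such a search could be set up (class and parameter names follow the Flair optimization tutorial; treat the exact values as an illustration):
from hyperopt import hp
from flair.embeddings import WordEmbeddings
from flair.hyperparameter.param_selection import SearchSpace, Parameter, SequenceTaggerParamSelector

# define your search space
search_space = SearchSpace()
search_space.add(Parameter.EMBEDDINGS, hp.choice, options=[[WordEmbeddings('glove')]])
search_space.add(Parameter.HIDDEN_SIZE, hp.choice, options=[128, 256])
search_space.add(Parameter.LEARNING_RATE, hp.choice, options=[0.05, 0.1])

# create the parameter selector and run the optimization
param_selector = SequenceTaggerParamSelector(corpus, 'ner', 'resources/results_param_selection', max_epochs=10)
param_selector.optimize(search_space, max_evals=10)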
To make it as easy as possible to start training models, we have a new feature for automatically downloading publicly available NLP datasets. For instance, by running this code:
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
you download the Universal Dependencies corpus for English and can immediately start training models. The list of available datasets can be found in the tutorial.
We added various other features to model training.
The training log output will from now on be automatically saved in the result directory you provide for training.
The log will be saved in training.log
.
It is now possible to stop training at any point in time and to resume it later by training with checkpoint
set to True
. Check the tutorial for more details.
You can now choose other optimizers besides SGD, i.e. any PyTorch optimizer, plus our own modified implementations of SGD and Adam, namely SGDW and AdamW.
A new helper method to assist you in finding a good learning rate for model training.
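A small sketch of how this could be used, assuming the find_learning_rate method of the ModelTrainer:
trainer = ModelTrainer(tagger, corpus)

# run a short sweep over increasing learning rates and write the losses to a tsv file
learning_rate_tsv = trainer.find_learning_rate('resources/learning_rate_search')
print(learning_rate_tsv)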
This release introduces breaking changes. The most important are:
Instead of maintaining two separate trainer classes for sequence labeling and text classification, we now have one model training class, namely ModelTrainer
. This replaces the earlier classes SequenceTaggerTrainer
and TextClassifierTrainer
.
Downstream task models now implement the new flair.nn.Model
interface. So, both the SequenceTagger
and TextClassifier
now inherit from flair.nn.Model
. This allows both models to be trained with the ModelTrainer
, like this:
# Training sequence labeling model
tagger = SequenceTagger(512, embeddings, tag_dictionary, 'ner')
trainer = ModelTrainer(tagger, corpus)
trainer.train('results')
# Training text classifier
classifier = TextClassifier(document_embedding, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)
trainer.train('results')
The advantage is that all training parameters and training procedures are now the same for sequence labeling and text classification, which reduces redundancy and hopefully makes it easier to understand.
The metric class is now refactored to compute micro and macro averages for F1 and accuracy. There is also a new enum EvaluationMetric
which you can pass to the ModelTrainer to tell it what to use for evaluation.
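For example (a sketch assuming the enum lives in flair.training_utils):
from flair.training_utils import EvaluationMetric

trainer = ModelTrainer(tagger, corpus)
trainer.train('results', evaluation_metric=EvaluationMetric.MICRO_F1_SCORE)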
Flair now builds on torch 1.0.
Flair now uses Path wherever possible to allow easier operations on files/directories. However, our interfaces still allow you to pass a string, which will then be transformed into a Path by Flair.