Leveraging BERT and c-TF-IDF to create easily interpretable topics.
A number of fixes, documentation updates, and small features:

* Use `BERTopic(diversity=0.1)` to change how diverse the words in a topic representation are (ranges from 0 to 1)
* Improved the `transform` function (#356)
* Fix #282, #285, and #288
* A tutorial for the `transform` function that can be found in Google Colab
* Fixed the `pyyaml` `YAMLLoadWarning` thrown each time BERTopic is imported:

```python
import yaml
yaml._warnings_enabled["YAMLLoadWarning"] = False
```
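The `diversity` parameter trades off a word's relevance to a topic against its similarity to words already chosen. This kind of control is commonly implemented with Maximal Marginal Relevance (MMR); the sketch below illustrates that idea under that assumption — it is not BERTopic's internal code, and the function name, words, and embeddings are made up:

```python
import numpy as np

def mmr(topic_embedding, word_embeddings, words, diversity=0.1, top_n=5):
    """Select top_n words, balancing relevance to the topic embedding
    against redundancy with already-selected words."""
    def cos(a, b):
        # Pairwise cosine similarity between rows of a and rows of b.
        return a @ b.T / (np.linalg.norm(a, axis=-1, keepdims=True)
                          * np.linalg.norm(b, axis=-1, keepdims=True).T)

    word_topic_sim = cos(word_embeddings, topic_embedding[None, :]).ravel()
    word_word_sim = cos(word_embeddings, word_embeddings)

    # Start with the single most relevant word.
    selected = [int(np.argmax(word_topic_sim))]
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while len(selected) < top_n and candidates:
        relevance = word_topic_sim[candidates]
        redundancy = word_word_sim[np.ix_(candidates, selected)].max(axis=1)
        # diversity=0 keeps only the most relevant words;
        # diversity=1 keeps only the least redundant ones.
        scores = (1 - diversity) * relevance - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]

# Toy example: "stocks" is nearly a duplicate of "finance", so a high
# diversity setting skips it in favour of the dissimilar "weather".
top_words = mmr(np.array([1.0, 0.0]),
                np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]),
                ["finance", "stocks", "weather"], diversity=0.7, top_n=2)
```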
A release focused on algorithmic optimization and fixing several issues:

Highlights:

* Guided BERTopic: nudge topic creation with a `seed_topic_list` (described below)
* Get representative documents per topic: `topic_model.get_representative_docs(topic=1)`
* Added the `normalize_frequency` parameter to `visualize_topics_per_class` and `visualize_topics_over_time` in order to better compare the relative topic frequencies between topics

Fixes:

* Improved `.visualize_term_rank()` (#253)
* Fixed an issue that occurred when `calculate_probabilities` is True

Guided BERTopic works in two ways:
First, we create embeddings for each seeded topic by joining its seed words and passing them through the document embedder. These embeddings are compared with the existing document embeddings through cosine similarity, and each document is assigned a label. If a document is most similar to a seeded topic, it gets that topic's label; if it is most similar to the average document embedding, it gets the -1 label. These labels are then passed through UMAP to create a semi-supervised approach that should nudge the topic creation towards the seeded topics.
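This label-assignment step can be sketched as follows. It is an illustrative re-implementation, not BERTopic's internal code, and all names are made up:

```python
import numpy as np

def seed_labels(doc_embeddings, seed_topic_embeddings):
    """Assign each document the index of its most similar seeded topic,
    or -1 when it is closer to the average document embedding."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    docs = normalize(doc_embeddings)               # (n_docs, dim)
    seeds = normalize(seed_topic_embeddings)       # (n_seeds, dim)
    avg = normalize(doc_embeddings.mean(axis=0))   # average document embedding

    # Cosine similarity of each document to each seed topic and to the average.
    seed_sim = docs @ seeds.T                      # (n_docs, n_seeds)
    avg_sim = docs @ avg                           # (n_docs,)

    labels = seed_sim.argmax(axis=1)
    labels[avg_sim >= seed_sim.max(axis=1)] = -1   # closer to "no seeded topic"
    return labels

# Two seeded topics along the axes; the third document sits between them
# and is closest to the average embedding, so it gets -1.
labels = seed_labels(np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]),
                     np.array([[1.0, 0.0], [0.0, 1.0]]))
```

These labels would then serve as the `y` input to UMAP's semi-supervised mode.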
Second, we take all words in `seed_topic_list` and assign them a multiplier larger than 1.
Those multipliers will be used to increase the IDF values of the words across all topics thereby increasing
the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance of an
irrelevant topic having unrelated words. In practice, this should not be an issue since the IDF value is likely to
remain low regardless of the multiplier. The multiplier is now a fixed value but may change to something more elegant,
like taking the distribution of IDF values and its position into account when defining the multiplier.
```python
seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
                   ["acquisition", "procurement", "merge"],
                   ["exchange", "currency", "trading", "rate", "euro"],
                   ["grain", "wheat", "corn"],
                   ["coffee", "cocoa"],
                   ["natural", "gas", "oil", "fuel", "products", "petrol"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)
```
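The IDF boost from the second mechanism can be sketched as below. This is illustrative only: the vocabulary, IDF values, and the multiplier value are all made up, and this is not BERTopic's internal code:

```python
import numpy as np

# Boost the IDF values of seed words so they are more likely to surface
# in a topic representation; non-seed words are left untouched.
vocabulary = ["company", "earnings", "weather", "grain"]
idf = np.array([0.5, 0.8, 2.1, 1.2])   # made-up IDF values

seed_words = {"company", "earnings", "grain"}
multiplier = 1.2                        # fixed value, as described above

boost = np.array([multiplier if w in seed_words else 1.0 for w in vocabulary])
boosted_idf = idf * boost
```

Because a genuinely rare-and-irrelevant word starts from a low IDF value, multiplying it by a small constant keeps it low, which is why the irrelevant-word risk mentioned above stays limited in practice.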
Highlights:

* New default models:
  * `"paraphrase-MiniLM-L6-v2"`
  * `"paraphrase-multilingual-MiniLM-L12-v2"`
* Documentation of the `plotting` API

For even better performance, please use the following models:

* `"paraphrase-mpnet-base-v2"`
* `"paraphrase-multilingual-mpnet-base-v2"`
Fixes:
Mainly a visualization update to improve understanding of the topic model:

* `topic_model.visualize_hierarchy()`
* `topic_model.visualize_heatmap()`
* `topic_model.visualize_barchart()`
* `topic_model.visualize_term_rank()`
* Added the `bertopic.plotting` library to easily extend visualizations

The two main features are (semi-)supervised topic modeling and several backends to use instead of Flair and SentenceTransformers!
Highlights:

* (Semi-)supervised topic modeling: `model.fit(docs, y=target_classes)`
* Create your own backend with `bertopic.backend.BaseEmbedder`
* Calculate and visualize topics per class:

```python
topics_per_class = topic_model.topics_per_class(docs, topics, classes)
topic_model.visualize_topics_per_class(topics_per_class)
```

Fixes:

* `pip install bertopic[visualization]` becomes `pip install bertopic`
* Pass a custom embedding model: `model = BERTopic(embedding_model=my_embedding_model)`
* Fit with precomputed embeddings: `model.fit(docs, my_precomputed_embeddings)`
* Search for topics: `model.find_topics(search_term)`
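A custom backend is created by subclassing `bertopic.backend.BaseEmbedder` and implementing an `embed` method. In the sketch below the base class is stubbed so the snippet runs standalone (the real import would be `from bertopic.backend import BaseEmbedder`), and a dummy encoder stands in for any model exposing an `encode`-style interface:

```python
import numpy as np

# Stand-in for bertopic.backend.BaseEmbedder so this sketch is
# self-contained; the real base class exposes the same `embed` hook.
class BaseEmbedder:
    def embed(self, documents, verbose=False):
        raise NotImplementedError

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        # Delegate to any model with an encode(list[str]) -> array API.
        return self.embedding_model.encode(documents)

# Dummy encoder: embeds each document as (character count, word count).
class DummyModel:
    def encode(self, documents):
        return np.array([[len(d), len(d.split())] for d in documents],
                        dtype=float)

embedder = CustomEmbedder(DummyModel())
vectors = embedder.embed(["hello world", "bertopic"])
```

The resulting embedder could then be passed to `BERTopic(embedding_model=...)` like any other backend.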
Highlights:

* Dynamic topic modeling (topics over time):
  * `model.topics_over_time(docs, timestamps, global_tuning=True)`
  * `model.topics_over_time(docs, timestamps, evolution_tuning=True)`
  * `model.topics_over_time(docs, timestamps, nr_bins=10)`
* Visualize topics over time: `model.visualize_topics_over_time(topics_over_time)`
* `get_topic_info()` for an overview of all topics
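The `nr_bins` argument limits how many time slots topic representations have to be computed for, by grouping many unique timestamps into a few bins. A rough sketch of the binning idea, with made-up data and not BERTopic's internal code:

```python
import numpy as np

# Reduce eight unique timestamps to nr_bins equal-width time slots, so
# topics over time are computed per bin rather than per timestamp.
timestamps = np.array([1, 3, 4, 10, 12, 19, 25, 30])
nr_bins = 3

edges = np.linspace(timestamps.min(), timestamps.max(), nr_bins + 1)
# digitize returns 1-based bin indices; shift to 0-based and clip the
# maximum value, which falls exactly on the last edge.
bins = np.clip(np.digitize(timestamps, edges) - 1, 0, nr_bins - 1)
```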
Fixes:
* Added `Flair` to allow for more (custom) token/document embeddings
* Added a `low_memory` parameter to reduce memory during computation
* Updated `sentence-transformers`, as it speeds up encoding significantly
* Updates to `visualize_topics()` and `get_params()`
* Option to not save the `embedding_model`, which should reduce BERTopic size significantly
* `stop_words` and `n_neighbors` were removed. These can still be used when a custom UMAP or CountVectorizer is used.
* Set `calculate_probabilities` to False as a default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage. It is better to skip calculating probabilities, or to only allow it by manually turning it on.