BERTopic Versions

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

v0.9.4

2 years ago

A number of fixes, documentation updates, and small features:

Highlights:

  • Expose diversity parameter
    • Use BERTopic(diversity=0.1) to change how diverse the words in a topic representation are (ranges from 0 to 1)
  • Improve stability of topic reduction by only computing the cosine similarity within c-TF-IDF and not the topic embeddings
  • Added property to c-TF-IDF that all IDF values should be positive (#351)
  • Major documentation overhaul (MkDocs, tutorials, FAQ, images, etc.) (#330)
  • Additional logging for .transform (#356)

Fixes:

  • Drop Python 3.6 support (#333)
  • Relax plotly dependency (#88)
  • Improve stability of .visualize_barchart() and .visualize_hierarchy()

v0.9.3

2 years ago

Fix #282, #285, and #288.

Fixes

  • #282
    • The old implementation of topic mapping was still being used in the transform function
  • #285
    • Fix getting all representative docs
  • #288
    • Fix a recent issue with the pyyaml package that surfaced in Google Colab
    • Remove the YAMLLoadWarning each time BERTopic is imported:
import yaml
yaml._warnings_enabled["YAMLLoadWarning"] = False

v0.9.2

2 years ago

A release focused on algorithmic optimization and fixing several issues:

Highlights:

  • Update the non-multilingual paraphrase-* models to the all-* models due to improved performance
  • Reduce necessary RAM in c-TF-IDF top 30 word extraction

Fixes:

  • Fix topic mapping
    • When reducing the number of topics, they need to be mapped to the correct input/output, which had some issues in the previous version
    • A new class was created to track these mappings regardless of how many times they were executed
    • In other words, you can iteratively reduce the number of topics after training the model without needing to retrain it (see the sketch after this list)
  • Fix typo in embeddings page (#200)
  • Fix link in README (#233)
  • Fix documentation .visualize_term_rank() (#253)
  • Fix getting correct representative docs (#258)
  • Update memory FAQ with HDBSCAN PR
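
A minimal sketch of the iterative topic reduction referenced above, assuming the v0.9-era signature in which reduce_topics takes the documents together with the previously returned topics and probabilities (docs is assumed to be a list of strings):

from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Reduce to 60 topics, then further to 30, without retraining;
# the new mapping class keeps the topic ids consistent across both reductions
new_topics, new_probs = topic_model.reduce_topics(docs, topics, probs, nr_topics=60)
new_topics, new_probs = topic_model.reduce_topics(docs, new_topics, new_probs, nr_topics=30)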

v0.9.1

2 years ago

Fixes:

  • Fix TypeError when auto-reducing topics (#210)
  • Fix mapping representative docs when reducing topics (#208)
  • Fix visualization issues with probabilities (#205)
  • Fix missing normalize_frequency param in plots (#213)

v0.9.0

2 years ago

Highlights

  • Implemented Guided BERTopic: use seed topics to steer the topic modeling
  • Get the most representative documents per topic: topic_model.get_representative_docs(topic=1)
    • This allows users to see which documents are good representations of a topic and better understand the topics that were created
  • Added normalize_frequency parameter to visualize_topics_per_class and visualize_topics_over_time in order to better compare the relative topic frequencies between topics
  • Return flat probabilities by default; the probabilities of all topics per document are only calculated if calculate_probabilities is True (see the sketch below)
  • Added several FAQs
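
A minimal sketch of these highlights (docs is assumed to be a list of strings):

from bertopic import BERTopic

# With calculate_probabilities=True, probs holds the probability of every
# topic per document; the default (False) returns a flat array instead
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Inspect the most representative documents of topic 1
representative_docs = topic_model.get_representative_docs(topic=1)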

Fixes

  • Fix loading pre-trained BERTopic model
  • Fix mapping of probabilities
  • Fix #190

Guided BERTopic

Guided BERTopic works in two ways:

First, we create embeddings for each seeded topic by joining its words and passing them through the document embedder. These embeddings are compared with the existing document embeddings through cosine similarity, and each document is assigned a label. If a document is most similar to a seeded topic, it gets that topic's label; if it is most similar to the average document embedding, it gets the -1 label. These labels are then passed through UMAP to create a semi-supervised approach that should nudge topic creation toward the seeded topics.

Second, we take all words in seed_topic_list and assign them a multiplier larger than 1. Those multipliers are used to increase the IDF values of the words across all topics, thereby increasing the likelihood that a seeded topic word appears in a topic. This does, however, also increase the chance of an irrelevant topic containing unrelated words. In practice, this should not be an issue since the IDF value is likely to remain low regardless of the multiplier. The multiplier is currently a fixed value but may change to something more elegant, like taking the distribution of IDF values and its position into account when defining the multiplier.

from bertopic import BERTopic

seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
                   ["acquisition", "procurement", "merge"],
                   ["exchange", "currency", "trading", "rate", "euro"],
                   ["grain", "wheat", "corn"],
                   ["coffee", "cocoa"],
                   ["natural", "gas", "oil", "fuel", "products", "petrol"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)

v0.8.1

2 years ago

Highlights:

  • Improved models:
    • For English documents the default is now: "paraphrase-MiniLM-L6-v2"
    • For Non-English or multi-lingual documents the default is now: "paraphrase-multilingual-MiniLM-L12-v2"
    • Both models not only show great performance but are also much faster!
  • Add interactive visualizations to the plotting API documentation

For even better performance, please use the following models:

  • English: "paraphrase-mpnet-base-v2"
  • Non-English or multi-lingual: "paraphrase-multilingual-mpnet-base-v2"

Fixes:

  • Improved unit testing for more stability
  • Pin the transformers version for Flair

v0.8.0

3 years ago

Mainly a visualization update to improve understanding of the topic model.

Features

  • Additional visualizations:
    • Topic Hierarchy: topic_model.visualize_hierarchy()
    • Topic Similarity Heatmap: topic_model.visualize_heatmap()
    • Topic Representation Barchart: topic_model.visualize_barchart()
    • Term Score Decline: topic_model.visualize_term_rank()

Improvements

  • Created bertopic.plotting library to easily extend visualizations
  • Improved automatic topic reduction by using HDBSCAN to detect similar topics
  • Sort topic ids by their frequency: -1 is the outlier class and typically contains the most documents; after that, 0 is the largest topic, 1 the second largest, etc.
  • Update MkDocs documentation with the new visualizations


v0.7.0

3 years ago

The two main features are (semi-)supervised topic modeling and several backends to use instead of Flair and SentenceTransformers!

Highlights:

  • (semi-)supervised topic modeling by leveraging the supervised options in UMAP (see the sketch after this list)
    • model.fit(docs, y=target_classes)
  • Backends:
    • Added Spacy, Gensim, USE (TFHub)
    • Use a different backend for document embeddings and word embeddings
    • Create your own backends with bertopic.backend.BaseEmbedder
    • See the documentation for an overview of all new backends
  • Calculate and visualize topics per class
    • Calculate: topics_per_class = topic_model.topics_per_class(docs, topics, classes)
    • Visualize: topic_model.visualize_topics_per_class(topics_per_class)
  • Several tutorials were updated and added (each available as a Google Colab notebook):
    • Topic Modeling with BERTopic
    • (Custom) Embedding Models in BERTopic
    • Advanced Customization in BERTopic
    • (semi-)Supervised Topic Modeling with BERTopic
    • Dynamic Topic Modeling with Trump's Tweets
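
A minimal sketch of the (semi-)supervised option and the per-class calls referenced above; docs is assumed to be a list of strings, and using -1 for unlabeled documents follows UMAP's semi-supervised convention:

from bertopic import BERTopic

# One label per document; -1 marks documents without a known class so
# that UMAP treats them as unlabeled (semi-supervised)
y = [1, 0, -1, 1, -1]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, y=y)

# Calculate and visualize topics per class
topics_per_class = topic_model.topics_per_class(docs, topics, classes=y)
topic_model.visualize_topics_per_class(topics_per_class)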

Fixes:

  • Fixed issues with the Torch requirement
  • Prevent saving term frequency matrix in CTFIDF class
  • Fixed DTM not working when reducing topics (#96)
  • Moved visualization dependencies to base BERTopic
    • pip install bertopic[visualization] becomes pip install bertopic
  • Allow precomputed embeddings in bertopic.find_topics() (#79):
from bertopic import BERTopic

# Fit with precomputed embeddings, then search topics by term
model = BERTopic(embedding_model=my_embedding_model)
model.fit(docs, my_precomputed_embeddings)
model.find_topics(search_term)

v0.6.0

3 years ago

Highlights:

  • DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation (a combined sketch follows this list)
    • model.topics_over_time(docs, timestamps, global_tuning=True)
  • DTM: Option to tune topics based on the c-TF-IDF representation at t-1, which results in topics that evolve over time
    • Only uses topics at t-1 and skips evolution if there is a gap
    • model.topics_over_time(docs, timestamps, evolution_tuning=True)
  • DTM: Function to visualize topics over time
    • model.visualize_topics_over_time(topics_over_time)
  • DTM: Add binning of timestamps
    • model.topics_over_time(docs, timestamps, nr_bins=10)
  • Add a function to get general information about topics (id, frequency, name, etc.)
    • get_topic_info()
  • Improved stability of c-TF-IDF by taking the average number of words across all topics instead of the number of documents
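
A minimal sketch combining the DTM calls above; docs and timestamps are assumed inputs, and note that, depending on the version, topics_over_time may also expect the list of topic assignments as an argument:

from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Bin the timestamps into 10 periods and tune each period's c-TF-IDF
# against the global representation and the representation at t-1
topics_over_time = topic_model.topics_over_time(docs, timestamps,
                                                nr_bins=10,
                                                global_tuning=True,
                                                evolution_tuning=True)

topic_model.visualize_topics_over_time(topics_over_time)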

Fixes:

  • _map_probabilities() did not take into account that the outlier class has no probability, and the probabilities were mutated instead of copied (#63, #64)

v0.5.0

3 years ago

Features

  • Add Flair to allow for more (custom) token/document embeddings
  • Option to use a custom UMAP, HDBSCAN, and CountVectorizer (see the sketch after this list)
  • Added low_memory parameter to reduce memory during computation
  • Improved verbosity (shows progress bar)
  • Improved testing
  • Use the newest version of sentence-transformers as it speeds up encoding significantly
  • Return the figure of visualize_topics()
  • Expose all parameters with a single function: get_params()
  • Option to disable saving the embedding_model, which should reduce BERTopic's size significantly
  • Add FAQ page
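
A minimal sketch of the custom sub-model options above, assuming the keyword arguments umap_model, hdbscan_model, and vectorizer_model used by later BERTopic releases:

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Swap in custom sub-models instead of the defaults
umap_model = UMAP(n_neighbors=15, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model,
                       low_memory=True)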

Fixes

  • To simplify the API, the parameters stop_words and n_neighbors were removed. They can still be set through a custom CountVectorizer or UMAP model.
  • Set calculate_probabilities to False by default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage, so it is better left off unless explicitly enabled.