v0.14 by MaartenGr · Pull Request #977 · MaartenGr/BERTopic

MaartenGr · 2023-02-02T13:26:36Z

Add an optional layer on top of BERTopic that allows for fine-tuning topic representations. You can use GPT-3, T5, KeyBERT, MMR, POS, and many other models!


KeyBERTInspired

The algorithm follows some principles of KeyBERT but does some optimization in order to speed up inference. Usage is straightforward:

from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

# Create your representation model
representation_model = KeyBERTInspired()

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
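The core KeyBERT idea can be illustrated without any model at all. The sketch below is not BERTopic's implementation; it assumes we already have an embedding for the topic (for example, an average over representative documents) and one embedding per candidate keyword, and it simply ranks keywords by cosine similarity to the topic. The function names and the toy 2-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_keywords(topic_embedding, keyword_embeddings, top_n=3):
    """Return the top_n (keyword, similarity) pairs, most similar first."""
    scored = [
        (word, cosine(topic_embedding, vector))
        for word, vector in keyword_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy embeddings (assumed values, for illustration only).
topic_embedding = [1.0, 0.0]
keyword_embeddings = {
    "engine": [0.9, 0.1],
    "wheel": [0.6, 0.8],
    "banana": [0.0, 1.0],
}

ranked = rank_keywords(topic_embedding, keyword_embeddings, top_n=2)
print(ranked)
```

In practice the embeddings would come from a sentence-embedding model rather than being written by hand.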

PartOfSpeech

Our candidate topics, as extracted with c-TF-IDF, do not take a keyword's part of speech into account, because extracting noun phrases from all documents can be computationally expensive. Instead, we can use c-TF-IDF to select a subset of keywords and documents that best represent a topic and perform part-of-speech tagging only on that subset.

from bertopic.representation import PartOfSpeech
from bertopic import BERTopic

# Create your representation model
representation_model = PartOfSpeech("en_core_web_sm")

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

MaximalMarginalRelevance

When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
essentially represent the same information and are often redundant. We can use MaximalMarginalRelevance to improve the diversity of our candidate topics:

from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic

# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
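To make the effect of the diversity parameter concrete, here is a minimal, dependency-free sketch of the generic maximal marginal relevance selection rule. It is not BERTopic's code: the helper names and the similarity values are assumptions chosen for illustration, and in practice the similarities would be computed from embeddings.

```python
def mmr(words, topic_sim, word_sim, diversity=0.3, top_n=2):
    """Greedy MMR: trade off relevance to the topic against redundancy.

    topic_sim maps word -> similarity to the topic.
    word_sim maps frozenset({w1, w2}) -> similarity between two words.
    """
    # Start with the single most relevant keyword.
    selected = [max(words, key=lambda w: topic_sim[w])]
    candidates = [w for w in words if w not in selected]

    while candidates and len(selected) < top_n:
        def score(word):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(word_sim[frozenset((word, s))] for s in selected)
            return (1 - diversity) * topic_sim[word] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical similarity values.
words = ["car", "cars", "engine"]
topic_sim = {"car": 0.9, "cars": 0.88, "engine": 0.7}
word_sim = {
    frozenset(("car", "cars")): 0.95,
    frozenset(("car", "engine")): 0.4,
    frozenset(("cars", "engine")): 0.4,
}

diverse = mmr(words, topic_sim, word_sim, diversity=0.3)
greedy = mmr(words, topic_sim, word_sim, diversity=0.0)
print(diverse, greedy)
```

With diversity set to 0 the near-duplicate "cars" is kept because only relevance counts; with diversity 0.3 the redundancy penalty lets "engine" take its place.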

Zero-Shot Classification

To perform zero-shot classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels. If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.

We use it in BERTopic as follows:

from bertopic.representation import ZeroShotClassification
from bertopic import BERTopic

# Create your representation model
candidate_topics = ["space and nasa", "bicycles", "sports"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
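The assignment rule described above ("assign a label if it is similar enough, otherwise keep the keywords") can be sketched as follows. This is a simplified illustration, not BERTopic's implementation: the `assign_label` helper, the `min_prob` threshold name and value, and the scores are assumptions, and in a real pipeline the scores would come from the zero-shot classifier.

```python
def assign_label(keywords, label_scores, min_prob=0.8):
    """Return the best candidate label if it scores high enough,
    otherwise fall back to the original c-TF-IDF keywords."""
    best_label, best_score = max(label_scores.items(), key=lambda item: item[1])
    if best_score >= min_prob:
        return [best_label]
    return keywords


# Hypothetical classifier scores for two topics.
space_topic = assign_label(
    ["nasa", "orbit", "shuttle"],
    {"space and nasa": 0.93, "bicycles": 0.02, "sports": 0.05},
)
unclear_topic = assign_label(
    ["windows", "drive", "card"],
    {"space and nasa": 0.30, "bicycles": 0.25, "sports": 0.45},
)
print(space_topic, unclear_topic)
```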

Text Generation: 🤗 Transformers

Nearly every week, new and improved models are released on the 🤗 Model Hub that, with some creativity, allow for
further fine-tuning of our c-TF-IDF based topics. These models range from text generation to zero-shot classification. In BERTopic, wrappers around these
methods are created as a way to support whatever might be released in the future.

Using a GPT-like model from the huggingface hub is rather straightforward:

from bertopic.representation import TextGeneration
from bertopic import BERTopic

# Create your representation model
representation_model = TextGeneration('gpt2')

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Text Generation: Cohere

Instead of using a language model from 🤗 transformers, we can use an external API that
does the work for us. Here, we use Cohere to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install cohere first:

pip install cohere

Then, get yourself an API key and use Cohere's API as follows:

import cohere
from bertopic.representation import Cohere
from bertopic import BERTopic

# Create your representation model
co = cohere.Client(MY_API_KEY)  # MY_API_KEY: your own Cohere API key
representation_model = Cohere(co)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Text Generation: OpenAI

Instead of using a language model from 🤗 transformers, we can use an external API that
does the work for us. Here, we use OpenAI to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install openai first:

pip install openai

Then, get yourself an API key and use OpenAI's API as follows:

import openai
from bertopic.representation import OpenAI
from bertopic import BERTopic

# Create your representation model
openai.api_key = MY_API_KEY
representation_model = OpenAI()

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Text Generation: LangChain

LangChain is a package that helps users chain large language models.
In BERTopic, we can leverage this package to combine external knowledge more efficiently. Here, this
external knowledge is the set of most representative documents in each topic.

To use langchain, you will need to install the langchain package first. Additionally, you will need an underlying LLM to support langchain,
like openai:

pip install langchain openai

Then, you can create your chain as follows:

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=MY_API_KEY), chain_type="stuff")

Finally, you can pass the chain to BERTopic as follows:

from bertopic.representation import LangChain
from bertopic import BERTopic

# Create your representation model
representation_model = LangChain(chain)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Features

Improved the topic reduction technique when using nr_topics=int
Added title parameters for all graphs (Feature request: title parameter for visualizations #800)

Fixes

Documentation

Added word cloud example in tips and tricks section


maarten_iknl and others added 15 commits February 2, 2023 14:19

Add representation models and several document updates/changes

2f5f7ad

Account for different scikit-learn versions

8f596dd

Fix topic selection when extracting repr docs

22f1db4

Fix indexing

a5d028f

Improved testing

b804254

Improve documentation, #769 and #954

5f61f56

Improve documentation, #912

6f08a6a

Added examples of chaining representation models and how to create a …

41aab54

…custom model

Add wordcloud example, open up top_n_words to any value

11de16d

Add wordcloud example, fix #903, #911, #965

d2b7deb

Add title param for each graphs (#800), update testing

3855234

Improved nr_topics procedure, update outlier reduction documentation

046e454

Fix reduction

f62caa4

Update testing

84c07a7

Update testing, fix #952, add #976

6bce89a

MaartenGr mentioned this pull request Feb 10, 2023

words similarity #913

Closed

maarten_iknl added 2 commits February 13, 2023 09:14

Fix missing imports

cabbb58

Add example output, update videos

89e57d8

MaartenGr mentioned this pull request Feb 13, 2023

keyBERTInspired #998

Closed

Prepare changelog for release

52b15f3

MaartenGr merged commit 7142ce7 into master Feb 14, 2023

MaartenGr deleted the v0.14 branch December 8, 2023 11:04

MaartenGr merged 18 commits into master from v0.14
