Conversation
Closed
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add an optional layer on top of BERTopic that allows for fine-tuning topic representations. You can use GPT-3, T5, KeyBERT, MMR, POS, and many other models!
Fourteen.mp4
KeyBERTInspired
The algorithm follows some principles of KeyBERT but does some optimization in order to speed up inference. Usage is straightforward:
PartOfSpeech
Our candidate topics, as extracted with c-TF-IDF, do not take into account a keyword's part of speech as extracting noun-phrases from all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part of speech on a subset of keywords and documents that best represent a topic.
MaximalMarginalRelevance
When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
essentially represent the same information and often redundant. We can use
MaximalMarginalRelevanceto improve diversity of our candidate topics:Zero-Shot Classification
To perform zero-shot classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels. If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.
We use it in BERTopic as follows:
Text Generation: 🤗 Transformers
Nearly every week, there are new and improved models released on the 🤗 Model Hub that, with some creativity, allow for
further fine-tuning of our c-TF-IDF based topics. These models range from text generation to zero-classification. In BERTopic, wrappers around these
methods are created as a way to support whatever might be released in the future.
Using a GPT-like model from the huggingface hub is rather straightforward:
Text Generation: Cohere
Instead of using a language model from 🤗 transformers, we can use external APIs instead that
do the work for you. Here, we can use Cohere to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install cohere first:
Then, get yourself an API key and use Cohere's API as follows:
Text Generation: OpenAI
Instead of using a language model from 🤗 transformers, we can use external APIs instead that
do the work for you. Here, we can use OpenAI to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install openai first:
pip install openaiThen, get yourself an API key and use OpenAI's API as follows:
Text Generation: LangChain
Langchain is a package that helps users with chaining large language models.
In BERTopic, we can leverage this package in order to more efficiently combine external knowledge. Here, this
external knowledge are the most representative documents in each topic.
To use langchain, you will need to install the langchain package first. Additionally, you will need an underlying LLM to support langchain,
like openai:
Then, you can create your chain as follows:
Finally, you can pass the chain to BERTopic as follows:
Features
nr_topics=inttitleparameters for all graphs (Feature request: title parameter for visualizations #800)Fixes
Documentation