
v0.12 #668

Merged
MaartenGr merged 30 commits into master from v0.12
Sep 11, 2022
Conversation

Owner

@MaartenGr MaartenGr commented Aug 10, 2022

Highlights:

  • Online/incremental topic modeling with .partial_fit
  • Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
    • Several parameters were added to potentially improve representations:
      • bm25_weighting
      • reduce_frequent_words
  • Expose attributes for easier access to internal data
  • Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
  • Added an example of combining BERTopic with KeyBERT
  • Added many tests with the intention of making development a bit more stable

Online/Incremental topic modeling:

from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic

# Prepare documents
all_docs = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))["data"]
doc_chunks = [all_docs[i:i+1000] for i in range(0, len(all_docs), 1000)]

# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Incrementally fit the topic model by training on 1000 documents at a time
for docs in doc_chunks:
    topic_model.partial_fit(docs)

Only the topics for the most recent batch of documents are tracked. If you want to use online topic modeling not for a streaming setting but merely for low-memory use cases, it is advised to also update the .topics_ attribute yourself, as variations such as hierarchical topic modeling will not work afterward otherwise:

# Incrementally fit the topic model by training on 1000 documents at a time
# and track the topics in each iteration
topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

topic_model.topics_ = topics
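
The decay parameter passed to OnlineCountVectorizer above down-weights word counts from older batches at each update. A minimal numpy sketch of that idea (a toy illustration, not BERTopic's implementation; the function name is hypothetical):

```python
import numpy as np

# Toy sketch of the decay idea: before adding the counts of a new batch,
# the running bag-of-words counts are multiplied by (1 - decay), so words
# from older batches gradually lose weight.
def update_counts(running, new_batch, decay=0.01):
    """Decay old counts, then add the counts of the new batch."""
    return running * (1 - decay) + new_batch

# Three batches, each dominated by a different word (columns = words)
running = np.zeros(3)
batches = [np.array([10.0, 0.0, 0.0]),
           np.array([0.0, 10.0, 0.0]),
           np.array([0.0, 0.0, 10.0])]
for batch in batches:
    running = update_counts(running, batch, decay=0.5)

# The first batch has been decayed twice, the second once, the last not
# at all: running is now [2.5, 5.0, 10.0]
```

With a small decay such as .01, older vocabulary fades slowly; a larger decay makes the topic representations track the most recent batches more aggressively.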

c-TF-IDF model:

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
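
To give an intuition for what bm25_weighting changes, here is a rough numpy sketch of class-based TF-IDF weighting. This is an illustration of the general idea, not BERTopic's exact implementation; the function name and the exact BM25-style formula are assumptions:

```python
import numpy as np

def class_tfidf(tf, bm25_weighting=False):
    """Weight a topic-term count matrix (rows = topics, columns = words).

    The class-based idf uses the average number of words per class divided
    by each word's total frequency across classes.
    """
    f = tf.sum(axis=0)          # frequency of each word across all classes
    A = tf.sum() / tf.shape[0]  # average number of words per class
    if bm25_weighting:
        # BM25-inspired variant (assumed form), intended to be more robust
        # for small datasets with stop words remaining
        idf = np.log(1 + (A - f + 0.5) / (f + 0.5))
    else:
        idf = np.log(1 + A / f)
    return tf * idf

# Two topics, three words; the middle word occurs often in both topics
tf = np.array([[5.0, 4.0, 0.0],
               [0.0, 4.0, 5.0]])
weights = class_tfidf(tf)
# The shared, frequent middle word receives a lower idf than the
# topic-specific words, so it contributes less to each representation
```

The reduce_frequent_words parameter serves a similar purpose: it dampens words that are frequent across all topics so that they are less likely to dominate the topic representations.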

Attributes:

Attribute                Description
.topics_                 The topics that are generated for each document after training or updating the topic model.
.probabilities_          The probabilities that are generated for each document if HDBSCAN is used.
.topic_sizes_            The size of each topic.
.topic_mapper_           A class for tracking topics and their mappings anytime they are merged or reduced.
.topic_representations_  The top n terms per topic and their respective c-TF-IDF values.
.c_tf_idf_               The topic-term matrix as calculated through c-TF-IDF.
.topic_labels_           The default labels for each topic.
.custom_labels_          Custom labels for each topic.
.topic_embeddings_       The embeddings for each topic.
.representative_docs_    The representative documents for each topic.
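
To illustrate the kind of bookkeeping an attribute like .topic_mapper_ performs, here is a toy mapper that keeps old topic ids resolving to their newest id after merges. This is a hypothetical class for illustration only, not BERTopic's implementation:

```python
# Toy illustration: when topics are merged or reduced, every old topic id
# must keep resolving to whichever id it was eventually folded into.
class ToyTopicMapper:
    def __init__(self, topics):
        # Initially, every topic maps to itself
        self.mapping = {t: t for t in topics}

    def merge(self, source, target):
        """Redirect `source`, and everything pointing at it, to `target`."""
        for old, current in self.mapping.items():
            if current == source:
                self.mapping[old] = target

    def resolve(self, topic):
        return self.mapping[topic]

mapper = ToyTopicMapper([0, 1, 2])
mapper.merge(2, 1)  # topic 2 is merged into topic 1
mapper.merge(1, 0)  # topic 1 (now also holding old 2) is merged into topic 0
mapper.resolve(2)   # returns 0: the original topic 2 now lives in topic 0
```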

Fixes:

@MaartenGr MaartenGr mentioned this pull request Sep 8, 2022
@MaartenGr MaartenGr merged commit 09c1732 into master Sep 11, 2022
@MaartenGr MaartenGr deleted the v0.12 branch May 4, 2023 07:18
