---
license: apache-2.0
datasets:
- AyoubChLin/CNN_News_Articles_2011-2022
language:
- en
tags:
- topic modeling
- BERT
- CNN news articles
---

# BERTopic Model for CNN News Articles

This is a BERTopic model trained on CNN news articles. It uses the sentence-transformer model `all-MiniLM-L6-v2` to embed documents and UMAP for dimensionality reduction.

## Usage

First, install the required packages:

```console
pip install sentence-transformers umap-learn bertopic
```

Then, load the model and encode your documents:

```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from bertopic import BERTopic

# Load the sentence transformer model
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

# Set the random state in the UMAP model to prevent stochastic behavior
# (not used by the loaded model below; pass it to BERTopic when fitting a new model)
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42)

# Load the BERTopic model
my_model = BERTopic.load("from/path/model.bin")

# Encode your documents (placeholder texts; replace with your own)
documents = ["First news article text.", "Second news article text."]
document_embeddings = sentence_model.encode(documents)
```

## Predict

To assign a topic to a new sentence, encode it and pass the embedding to `transform`:

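```python
sentence = "my sentence"

# Encode the sentence, then predict its topic from the precomputed embedding
embeddings = sentence_model.encode([sentence])

# transform returns one topic ID per input document; topics[0] belongs to `sentence`
topics, _ = my_model.transform([sentence], embeddings)
```
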
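The topic IDs returned by `transform` are plain integers, so it can help to pair them with each topic's top words. The sketch below is one possible way to do that, not part of the original model card: `summarize_predictions` is a hypothetical helper, and it only assumes that the model object exposes `transform(documents, embeddings)` and `get_topic(topic_id)` the way a fitted BERTopic model does.

```python
from collections import defaultdict


def summarize_predictions(model, documents, embeddings, top_n=5):
    """Group documents by predicted topic and list each topic's top words.

    `model` is assumed to behave like a fitted BERTopic model: it exposes
    `transform(documents, embeddings)` and `get_topic(topic_id)`.
    """
    topics, _ = model.transform(documents, embeddings)

    # Collect the documents assigned to each topic ID.
    grouped = defaultdict(list)
    for doc, topic_id in zip(documents, topics):
        grouped[int(topic_id)].append(doc)

    summary = {}
    for topic_id, docs in grouped.items():
        # get_topic returns (word, score) pairs, or False for an unknown topic.
        words = model.get_topic(topic_id) or []
        summary[topic_id] = {
            "top_words": [word for word, _ in words[:top_n]],
            "documents": docs,
        }
    return summary
```

With the objects from the Usage section, this could be called as `summarize_predictions(my_model, documents, document_embeddings)`. In BERTopic, topic `-1` collects documents treated as outliers.
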
For more information on how to use the BERTopic model, see the [BERTopic documentation](https://maartengr.github.io/BERTopic/index.html).