Add BERTopic model
- README.md +181 -0
- config.json +16 -0
- ctfidf.safetensors +3 -0
- ctfidf_config.json +0 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,181 @@
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# MARTINI_enrich_BERTopic_BritainFirst

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("AIDA-UPM/MARTINI_enrich_BERTopic_BritainFirst")

topic_model.get_topic_info()
```

## Topic overview

* Number of topics: 112
* Number of training documents: 13176

<details>
<summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | brexit - islamist - twitter - arrested - wakefield | 20 | -1_brexit_islamist_twitter_arrested |
| 0 | novotel - migrants - accommodation - leeds - gatwick | 7332 | 0_novotel_migrants_accommodation_leeds |
| 1 | trump - ballots - clinton - impeachment - michigan | 178 | 1_trump_ballots_clinton_impeachment |
| 2 | calais - refugees - dunkirk - tents - sudanese | 168 | 2_calais_refugees_dunkirk_tents |
| 3 | marseille - macron - zemmour - jean - president | 165 | 3_marseille_macron_zemmour_jean |
| 4 | lockdowns - omicron - vaccinated - nhs - morons | 162 | 4_lockdowns_omicron_vaccinated_nhs |
| 5 | antifa - rioters - oregon - molotov - flames | 141 | 5_antifa_rioters_oregon_molotov |
| 6 | poland - migrants - border - minsk - guards | 120 | 6_poland_migrants_border_minsk |
| 7 | membership - british - fightback - nationalist - steadfast | 117 | 7_membership_british_fightback_nationalist |
| 8 | taliban - afghans - kandahar - soldiers - suhail | 108 | 8_taliban_afghans_kandahar_soldiers |
| 9 | salford - dartford - swanscombe - elections - cambridgeshire | 107 | 9_salford_dartford_swanscombe_elections |
| 10 | meghan - harry - royal - oprah - hypocrites | 106 | 10_meghan_harry_royal_oprah |
| 11 | migrants - crossing - yesterday - boat - 2021 | 93 | 11_migrants_crossing_yesterday_boat |
| 12 | barracks - migrant - bexhill - pembrokeshire - dambusters | 93 | 12_barracks_migrant_bexhill_pembrokeshire |
| 13 | londonistan - stabbings - khan - mayor - lawless | 86 | 13_londonistan_stabbings_khan_mayor |
| 14 | aegean - cyprus - erdogan - smugglers - nikos | 86 | 14_aegean_cyprus_erdogan_smugglers |
| 15 | dover - harbour - immigrants - grimsby - disembarked | 86 | 15_dover_harbour_immigrants_grimsby |
| 16 | biden - delaware - forgets - donnell - alzheimers | 85 | 16_biden_delaware_forgets_donnell |
| 17 | dublin - mullingar - kilkenny - ulster - crimewaves | 80 | 17_dublin_mullingar_kilkenny_ulster |
| 18 | traitors - extremist - hope - exposing - tommy | 80 | 18_traitors_extremist_hope_exposing |
| 19 | rioters - blacks - minneapolis - victimhood - damond | 80 | 19_rioters_blacks_minneapolis_victimhood |
| 20 | england - luton - multiculturalism - cockney - saxons | 78 | 20_england_luton_multiculturalism_cockney |
| 21 | psalm - philippians - supplication - worry - angels | 77 | 21_psalm_philippians_supplication_worry |
| 22 | reparations - colston - statues - enslaved - vandalised | 76 | 22_reparations_colston_statues_enslaved |
| 23 | constables - politically - sacked - newham - robberies | 74 | 23_constables_politically_sacked_newham |
| 24 | golding - speeches - paul - regional - pontefract | 70 | 24_golding_speeches_paul_regional |
| 25 | border - texas - rio - smugglers - venezuela | 68 | 25_border_texas_rio_smugglers |
| 26 | deportations - accommodation - holidaymakers - 8million - taxpayers | 68 | 26_deportations_accommodation_holidaymakers_8million |
| 27 | corbyn - antisemitism - starmer - sympathisers - gaza | 67 | 27_corbyn_antisemitism_starmer_sympathisers |
| 28 | transgender - women - menstruate - olympic - weightlifter | 66 | 28_transgender_women_menstruate_olympic |
| 29 | lampedusa - migration - disembarked - malta - libya | 65 | 29_lampedusa_migration_disembarked_malta |
| 30 | censorship - totalitarianism - brainwashed - demonise - moronic | 63 | 30_censorship_totalitarianism_brainwashed_demonise |
| 31 | bletchley - battles - trafalgar - anzac - 1944 | 62 | 31_bletchley_battles_trafalgar_anzac |
| 32 | lifeboat - rnli - smuggling - beachgoers - bangladesh | 61 | 32_lifeboat_rnli_smuggling_beachgoers |
| 33 | ballymena - golding - bail - terrorism - solicitor | 61 | 33_ballymena_golding_bail_terrorism |
| 34 | borders - mexico - mcallen - democrats - doomed | 60 | 34_borders_mexico_mcallen_democrats |
| 35 | gaza - israelis - ashkelon - airstrikes - islamists | 60 | 35_gaza_israelis_ashkelon_airstrikes |
| 36 | lockdown - arrested - protest - peckham - policeman | 56 | 36_lockdown_arrested_protest_peckham |
| 37 | lampedusa - borders - invading - disembarking - armada | 54 | 37_lampedusa_borders_invading_disembarking |
| 38 | minibuses - campaigning - scotland - embarked - churchill | 54 | 38_minibuses_campaigning_scotland_embarked |
| 39 | conference - tickets - nuneaton - 2021 - southampton | 54 | 39_conference_tickets_nuneaton_2021 |
| 40 | paul - persecution - golding - spearheaded - paperback | 53 | 40_paul_persecution_golding_spearheaded |
| 41 | tommy - assaulting - paedophile - magistrates - trumped | 52 | 41_tommy_assaulting_paedophile_magistrates |
| 42 | melilla - migrants - africa - gibraltar - guardsmen | 51 | 42_melilla_migrants_africa_gibraltar |
| 43 | mosques - ramadan - praying - adhan - call | 51 | 43_mosques_ramadan_praying_adhan |
| 44 | beheaded - aoussaoui - gunman - radicalized - mogadishu | 50 | 44_beheaded_aoussaoui_gunman_radicalized |
| 45 | великобритания - россия - голдинг - партии - могиле | 50 | 45_великобритания_россия_голдинг_партии |
| 46 | patriotmerchandise - hoodie - shirts - merchandise - products | 50 | 46_patriotmerchandise_hoodie_shirts_merchandise |
| 47 | shamima - jihadist - returning - shariah - bride | 49 | 47_shamima_jihadist_returning_shariah |
| 48 | ulster - ballymena - unionist - newtownards - bfni | 48 | 48_ulster_ballymena_unionist_newtownards |
| 49 | popes - benedict - sacraments - archbishopric - persecuted | 46 | 49_popes_benedict_sacraments_archbishopric |
| 50 | helmand - corporal - died - battalion - stephen | 45 | 50_helmand_corporal_died_battalion |
| 51 | jesus - blessed - whosoever - persecuted - everlasting | 45 | 51_jesus_blessed_whosoever_persecuted |
| 52 | racism - brainwash - curriculum - white - decolonise | 45 | 52_racism_brainwash_curriculum_white |
| 53 | whitephobia - whiter - silence - systemically - consensual | 42 | 53_whitephobia_whiter_silence_systemically |
| 54 | refugees - smugglers - genuine - freeloading - ukrainian | 42 | 54_refugees_smugglers_genuine_freeloading |
| 55 | rotherham - rape - scandals - islamist - nawaz | 41 | 55_rotherham_rape_scandals_islamist |
| 56 | mexicans - caravan - tapachula - cartels - marches | 40 | 56_mexicans_caravan_tapachula_cartels |
| 57 | britbox - broadcaster - stormzy - gary - complain | 39 | 57_britbox_broadcaster_stormzy_gary |
| 58 | referendum - deadline - approved - september - 24th | 39 | 58_referendum_deadline_approved_september |
| 59 | lithuania - belarusian - rudninkai - trespassers - sanctions | 39 | 59_lithuania_belarusian_rudninkai_trespassers |
| 60 | deported - offenders - jamaicans - degenerates - partygate | 38 | 60_deported_offenders_jamaicans_degenerates |
| 61 | britannia - wigan - protests - deported - schoolgirls | 38 | 61_britannia_wigan_protests_deported |
| 62 | share - telegram - videos - everywhere - retweet | 38 | 62_share_telegram_videos_everywhere |
| 63 | raped - convicted - rashid - hussain - kirklees | 37 | 63_raped_convicted_rashid_hussain |
| 64 | charlenedownes - blackpool - murdered - saddleworth - vicky | 37 | 64_charlenedownes_blackpool_murdered_saddleworth |
| 65 | crimea - russophobia - nordstream - katyusha - vladimir | 37 | 65_crimea_russophobia_nordstream_katyusha |
| 66 | buckinghamshire - middlesbrough - rochdale - telford - perpetrators | 37 | 66_buckinghamshire_middlesbrough_rochdale_telford |
| 67 | libya - deporting - amnesty - eritrean - escaped | 36 | 67_libya_deporting_amnesty_eritrean |
| 68 | leicestershire - birmingham - activists - harassed - gestapo | 34 | 68_leicestershire_birmingham_activists_harassed |
| 69 | terrorist - qaeda - murdering - knifeman - sharif | 34 | 69_terrorist_qaeda_murdering_knifeman |
| 70 | germany - rioting - rotterdam - baghdad - allensbach | 34 | 70_germany_rioting_rotterdam_baghdad |
| 71 | sunday - blessings - christian - worship - supporters | 33 | 71_sunday_blessings_christian_worship |
| 72 | canvassing - north - wigan - altrincham - winning | 33 | 72_canvassing_north_wigan_altrincham |
| 73 | activist - meetings - welcomed - tonight - sussex | 33 | 73_activist_meetings_welcomed_tonight |
| 74 | dlive - paul - golding - 7pm - bitchute | 33 | 74_dlive_paul_golding_7pm |
| 75 | apartheid - johannesburg - zimbabwe - boers - looters | 33 | 75_apartheid_johannesburg_zimbabwe_boers |
| 76 | twitter - musk - shadowban - msnbc - shareholders | 32 | 76_twitter_musk_shadowban_msnbc |
| 77 | canaryislands - spaniards - morocco - asylum - arguineguin | 32 | 77_canaryislands_spaniards_morocco_asylum |
| 78 | hungary - referendum - prohibits - lgbtq - viktor | 32 | 78_hungary_referendum_prohibits_lgbtq |
| 79 | twitter - trumpbook - lindell - censorship - influencers | 31 | 79_twitter_trumpbook_lindell_censorship |
| 80 | islamist - jihadists - radicalisation - mi5 - fear | 31 | 80_islamist_jihadists_radicalisation_mi5 |
| 81 | britannia - glorious - silbury - stonehenge - fought | 31 | 81_britannia_glorious_silbury_stonehenge |
| 82 | universities - whiteness - decolonise - walkout - lipscomb | 31 | 82_universities_whiteness_decolonise_walkout |
| 83 | sweden - gangs - somalis - gottsunda - mikael | 30 | 83_sweden_gangs_somalis_gottsunda |
| 84 | paedophiles - lgbtqoajsrgoauib - sexualisation - brainwashing - pansexual | 30 | 84_paedophiles_lgbtqoajsrgoauib_sexualisation_brainwashing |
| 85 | eurosceptics - juncker - undemocratic - voted - remain | 30 | 85_eurosceptics_juncker_undemocratic_voted |
| 86 | europeans - globalist - migration - islamophobe - invaded | 29 | 86_europeans_globalist_migration_islamophobe |
| 87 | capitol - rioters - pelosi - msnbc - deputies | 29 | 87_capitol_rioters_pelosi_msnbc |
| 88 | armagh - killed - bombings - fusiliers - 1974 | 29 | 88_armagh_killed_bombings_fusiliers |
| 89 | bfd - bodyguards - training - aikido - krav | 28 | 89_bfd_bodyguards_training_aikido |
| 90 | newspapers - subscribed - patriot - worldwide - mainstream | 28 | 90_newspapers_subscribed_patriot_worldwide |
| 91 | queen - funeral - westminster - dignitaries - carriage | 27 | 91_queen_funeral_westminster_dignitaries |
| 92 | transwomen - lesbian - stonewall - matchmaker - elliot | 26 | 92_transwomen_lesbian_stonewall_matchmaker |
| 93 | christianity - crusade - englishman - blesses - rediscovering | 26 | 93_christianity_crusade_englishman_blesses |
| 94 | pray - almighty - victory - nationalists - strengthen | 25 | 94_pray_almighty_victory_nationalists |
| 95 | dartford - campaigning - essex - canvassers - east | 25 | 95_dartford_campaigning_essex_canvassers |
| 96 | livestream - thelionsradio - liamwalsall - tomorrow - 6pm | 24 | 96_livestream_thelionsradio_liamwalsall_tomorrow |
| 97 | cathedrals - firebombed - flames - saint - france | 23 | 97_cathedrals_firebombed_flames_saint |
| 98 | 4climate - antarctica - coldest - thunberg - alarmists | 23 | 98_4climate_antarctica_coldest_thunberg |
| 99 | parler - followers - alternatives - ashleasimonnews - censored | 23 | 99_parler_followers_alternatives_ashleasimonnews |
| 100 | schoolteacher - mohammed - blasphemy - batley - cartoon | 22 | 100_schoolteacher_mohammed_blasphemy_batley |
| 101 | postal - stamp - wellingborough - votes - urgently | 22 | 101_postal_stamp_wellingborough_votes |
| 102 | britannia - camp - august - activist - 2022 | 22 | 102_britannia_camp_august_activist |
| 103 | muslims - birmingham - temple - riots - smethwick | 22 | 103_muslims_birmingham_temple_riots |
| 104 | islamophobia - taqiyya - hijab - holiest - shitholes | 21 | 104_islamophobia_taqiyya_hijab_holiest |
| 105 | asylum - hostel - residents - hull - flintshire | 21 | 105_asylum_hostel_residents_hull |
| 106 | asylum - murdered - rapist - assaulting - grandmother | 21 | 106_asylum_murdered_rapist_assaulting |
| 107 | parliament - remainers - prorogue - amendments - undemocratic | 21 | 107_parliament_remainers_prorogue_amendments |
| 108 | ukprotests - trending - leftists - blmlondon - followers | 20 | 108_ukprotests_trending_leftists_blmlondon |
| 109 | belfastlive - defrauding - jolene - verdict - commissioner | 20 | 109_belfastlive_defrauding_jolene_verdict |
| 110 | wakefield - campaigning - palestinians - councillors - thursday | 20 | 110_wakefield_campaigning_palestinians_councillors |

</details>
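
The `Label` column in the table above appears to be the topic ID followed by the first four keywords, joined with underscores, and topic `-1` is BERTopic's outlier bucket. The following is a minimal sketch, not part of the published model, that splits such a label back into its parts and expresses a topic's frequency as a share of the 13176 training documents. The helper names `parse_label` and `share` are hypothetical, and the parsing assumes that keywords never contain an underscore themselves.

```python
TOTAL_DOCS = 13176  # "Number of training documents" from the overview above


def parse_label(label: str) -> tuple[int, list[str]]:
    """Split a label such as '0_novotel_migrants' into (topic_id, keywords).

    Assumes the part before the first underscore is the integer topic ID.
    """
    topic_id, _, rest = label.partition("_")
    return int(topic_id), rest.split("_")


def share(frequency: int, total: int = TOTAL_DOCS) -> float:
    """Fraction of the training documents assigned to one topic."""
    return frequency / total


# Three rows copied from the table: (label, topic frequency).
rows = [
    ("-1_brexit_islamist_twitter_arrested", 20),
    ("0_novotel_migrants_accommodation_leeds", 7332),
    ("1_trump_ballots_clinton_impeachment", 178),
]

for label, frequency in rows:
    topic_id, keywords = parse_label(label)
    kind = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(f"{kind}: {', '.join(keywords)} ({share(frequency):.1%})")
```

By this arithmetic topic 0 alone covers a little over half of the training documents (7332 of 13176), which is worth keeping in mind when reading the much smaller topics that follow it.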

## Training hyperparameters

* calculate_probabilities: True
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: False
* zeroshot_min_similarity: 0.7
* zeroshot_topic_list: None
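
The same hyperparameters are stored in `config.json`, where `n_gram_range` is written as the JSON list `[1, 1]` rather than the tuple `(1, 1)` shown here, because JSON has no tuple type. The sketch below shows one way to read that file's contents back into Python values; the function name `config_to_kwargs` is hypothetical. It assumes the keys match `BERTopic` constructor arguments of the same names, and it would not recreate the trained model, since the embedding, dimensionality-reduction, and clustering sub-models are not described by these settings.

```python
import json

# Contents of config.json from this commit, inlined so the sketch is self-contained.
CONFIG_JSON = """
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null
}
"""


def config_to_kwargs(text: str) -> dict:
    """Parse the JSON config and restore Python-native types."""
    config = json.loads(text)
    # JSON arrays decode to lists; convert back to the tuple form used in the README.
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config


kwargs = config_to_kwargs(CONFIG_JSON)
for name, value in kwargs.items():
    print(f"{name}: {value!r}")
```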

## Framework versions

* Numpy: 1.26.4
* HDBSCAN: 0.8.40
* UMAP: 0.5.7
* Pandas: 2.2.3
* Scikit-Learn: 1.5.2
* Sentence-transformers: 3.3.1
* Transformers: 4.46.3
* Numba: 0.60.0
* Plotly: 5.24.1
* Python: 3.10.12
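
When a saved model behaves differently in another environment, comparing installed package versions against this list is a reasonable first check. The following sketch uses the standard library's `importlib.metadata` for that. The mapping from the names above to PyPI distribution names (for example `umap-learn` for UMAP) is an assumption, and `compare_versions` is a hypothetical helper, not part of BERTopic.

```python
from importlib import metadata

# Versions from the list above, keyed by the assumed PyPI distribution name.
PINNED = {
    "numpy": "1.26.4",
    "hdbscan": "0.8.40",
    "umap-learn": "0.5.7",
    "pandas": "2.2.3",
    "scikit-learn": "1.5.2",
    "sentence-transformers": "3.3.1",
    "transformers": "4.46.3",
    "numba": "0.60.0",
    "plotly": "5.24.1",
}


def compare_versions(pinned, lookup=metadata.version):
    """Return {name: (pinned_version, installed_version_or_None)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed in this environment
        report[name] = (wanted, found)
    return report


for name, (wanted, found) in compare_versions(PINNED).items():
    status = "ok" if found == wanted else "differs"
    print(f"{name}: pinned {wanted}, installed {found} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the report only shows where the environment departs from the one used for training.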
config.json
ADDED
@@ -0,0 +1,16 @@
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [
    1,
    1
  ],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null
}
ctfidf.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e513801b329617979ce8aad9542008aede946439b56105fa635577567cae2cdc
size 1001684
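
The three lines above are a Git LFS pointer: the repository stores the SHA-256 digest and byte size of `ctfidf.safetensors` instead of the binary itself. As a rough sketch, a downloaded copy could be checked against the pointer as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the check reads the whole payload into memory, which is acceptable for a file of about 1 MB.

```python
import hashlib

# Pointer text for ctfidf.safetensors, copied from this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:e513801b329617979ce8aad9542008aede946439b56105fa635577567cae2cdc
size 1001684
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of an LFS pointer into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that the bytes have the size and SHA-256 digest the pointer records."""
    if pointer["algorithm"] != "sha256":
        raise ValueError(f"unsupported hash: {pointer['algorithm']}")
    same_size = len(data) == pointer["size"]
    same_hash = hashlib.sha256(data).hexdigest() == pointer["digest"]
    return same_size and same_hash


pointer = parse_lfs_pointer(POINTER)
print(pointer["algorithm"], pointer["size"])
```

With a local copy of the file, `matches_pointer(open(path, "rb").read(), pointer)` would report whether the download is intact, where `path` is wherever the file was saved.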
ctfidf_config.json
ADDED
The diff for this file is too large to render.
See raw diff
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5befa4521853e42cabd1bfda6e607da2db6b236a0dd8a836b5afd5ed93763ae7
size 458848
topics.json
ADDED
The diff for this file is too large to render.
See raw diff