AngelPanizo committed on
Commit dcbe40f · verified · 1 Parent(s): 362186f

Add BERTopic model

README.md ADDED
@@ -0,0 +1,181 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # MARTINI_enrich_BERTopic_BritainFirst
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("AIDA-UPM/MARTINI_enrich_BERTopic_BritainFirst")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 112
+ * Number of training documents: 13176
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | brexit - islamist - twitter - arrested - wakefield | 20 | -1_brexit_islamist_twitter_arrested |
+ | 0 | novotel - migrants - accommodation - leeds - gatwick | 7332 | 0_novotel_migrants_accommodation_leeds |
+ | 1 | trump - ballots - clinton - impeachment - michigan | 178 | 1_trump_ballots_clinton_impeachment |
+ | 2 | calais - refugees - dunkirk - tents - sudanese | 168 | 2_calais_refugees_dunkirk_tents |
+ | 3 | marseille - macron - zemmour - jean - president | 165 | 3_marseille_macron_zemmour_jean |
+ | 4 | lockdowns - omicron - vaccinated - nhs - morons | 162 | 4_lockdowns_omicron_vaccinated_nhs |
+ | 5 | antifa - rioters - oregon - molotov - flames | 141 | 5_antifa_rioters_oregon_molotov |
+ | 6 | poland - migrants - border - minsk - guards | 120 | 6_poland_migrants_border_minsk |
+ | 7 | membership - british - fightback - nationalist - steadfast | 117 | 7_membership_british_fightback_nationalist |
+ | 8 | taliban - afghans - kandahar - soldiers - suhail | 108 | 8_taliban_afghans_kandahar_soldiers |
+ | 9 | salford - dartford - swanscombe - elections - cambridgeshire | 107 | 9_salford_dartford_swanscombe_elections |
+ | 10 | meghan - harry - royal - oprah - hypocrites | 106 | 10_meghan_harry_royal_oprah |
+ | 11 | migrants - crossing - yesterday - boat - 2021 | 93 | 11_migrants_crossing_yesterday_boat |
+ | 12 | barracks - migrant - bexhill - pembrokeshire - dambusters | 93 | 12_barracks_migrant_bexhill_pembrokeshire |
+ | 13 | londonistan - stabbings - khan - mayor - lawless | 86 | 13_londonistan_stabbings_khan_mayor |
+ | 14 | aegean - cyprus - erdogan - smugglers - nikos | 86 | 14_aegean_cyprus_erdogan_smugglers |
+ | 15 | dover - harbour - immigrants - grimsby - disembarked | 86 | 15_dover_harbour_immigrants_grimsby |
+ | 16 | biden - delaware - forgets - donnell - alzheimers | 85 | 16_biden_delaware_forgets_donnell |
+ | 17 | dublin - mullingar - kilkenny - ulster - crimewaves | 80 | 17_dublin_mullingar_kilkenny_ulster |
+ | 18 | traitors - extremist - hope - exposing - tommy | 80 | 18_traitors_extremist_hope_exposing |
+ | 19 | rioters - blacks - minneapolis - victimhood - damond | 80 | 19_rioters_blacks_minneapolis_victimhood |
+ | 20 | england - luton - multiculturalism - cockney - saxons | 78 | 20_england_luton_multiculturalism_cockney |
+ | 21 | psalm - philippians - supplication - worry - angels | 77 | 21_psalm_philippians_supplication_worry |
+ | 22 | reparations - colston - statues - enslaved - vandalised | 76 | 22_reparations_colston_statues_enslaved |
+ | 23 | constables - politically - sacked - newham - robberies | 74 | 23_constables_politically_sacked_newham |
+ | 24 | golding - speeches - paul - regional - pontefract | 70 | 24_golding_speeches_paul_regional |
+ | 25 | border - texas - rio - smugglers - venezuela | 68 | 25_border_texas_rio_smugglers |
+ | 26 | deportations - accommodation - holidaymakers - 8million - taxpayers | 68 | 26_deportations_accommodation_holidaymakers_8million |
+ | 27 | corbyn - antisemitism - starmer - sympathisers - gaza | 67 | 27_corbyn_antisemitism_starmer_sympathisers |
+ | 28 | transgender - women - menstruate - olympic - weightlifter | 66 | 28_transgender_women_menstruate_olympic |
+ | 29 | lampedusa - migration - disembarked - malta - libya | 65 | 29_lampedusa_migration_disembarked_malta |
+ | 30 | censorship - totalitarianism - brainwashed - demonise - moronic | 63 | 30_censorship_totalitarianism_brainwashed_demonise |
+ | 31 | bletchley - battles - trafalgar - anzac - 1944 | 62 | 31_bletchley_battles_trafalgar_anzac |
+ | 32 | lifeboat - rnli - smuggling - beachgoers - bangladesh | 61 | 32_lifeboat_rnli_smuggling_beachgoers |
+ | 33 | ballymena - golding - bail - terrorism - solicitor | 61 | 33_ballymena_golding_bail_terrorism |
+ | 34 | borders - mexico - mcallen - democrats - doomed | 60 | 34_borders_mexico_mcallen_democrats |
+ | 35 | gaza - israelis - ashkelon - airstrikes - islamists | 60 | 35_gaza_israelis_ashkelon_airstrikes |
+ | 36 | lockdown - arrested - protest - peckham - policeman | 56 | 36_lockdown_arrested_protest_peckham |
+ | 37 | lampedusa - borders - invading - disembarking - armada | 54 | 37_lampedusa_borders_invading_disembarking |
+ | 38 | minibuses - campaigning - scotland - embarked - churchill | 54 | 38_minibuses_campaigning_scotland_embarked |
+ | 39 | conference - tickets - nuneaton - 2021 - southampton | 54 | 39_conference_tickets_nuneaton_2021 |
+ | 40 | paul - persecution - golding - spearheaded - paperback | 53 | 40_paul_persecution_golding_spearheaded |
+ | 41 | tommy - assaulting - paedophile - magistrates - trumped | 52 | 41_tommy_assaulting_paedophile_magistrates |
+ | 42 | melilla - migrants - africa - gibraltar - guardsmen | 51 | 42_melilla_migrants_africa_gibraltar |
+ | 43 | mosques - ramadan - praying - adhan - call | 51 | 43_mosques_ramadan_praying_adhan |
+ | 44 | beheaded - aoussaoui - gunman - radicalized - mogadishu | 50 | 44_beheaded_aoussaoui_gunman_radicalized |
+ | 45 | великобритания - россия - голдинг - партии - могиле | 50 | 45_великобритания_россия_голдинг_партии |
+ | 46 | patriotmerchandise - hoodie - shirts - merchandise - products | 50 | 46_patriotmerchandise_hoodie_shirts_merchandise |
+ | 47 | shamima - jihadist - returning - shariah - bride | 49 | 47_shamima_jihadist_returning_shariah |
+ | 48 | ulster - ballymena - unionist - newtownards - bfni | 48 | 48_ulster_ballymena_unionist_newtownards |
+ | 49 | popes - benedict - sacraments - archbishopric - persecuted | 46 | 49_popes_benedict_sacraments_archbishopric |
+ | 50 | helmand - corporal - died - battalion - stephen | 45 | 50_helmand_corporal_died_battalion |
+ | 51 | jesus - blessed - whosoever - persecuted - everlasting | 45 | 51_jesus_blessed_whosoever_persecuted |
+ | 52 | racism - brainwash - curriculum - white - decolonise | 45 | 52_racism_brainwash_curriculum_white |
+ | 53 | whitephobia - whiter - silence - systemically - consensual | 42 | 53_whitephobia_whiter_silence_systemically |
+ | 54 | refugees - smugglers - genuine - freeloading - ukrainian | 42 | 54_refugees_smugglers_genuine_freeloading |
+ | 55 | rotherham - rape - scandals - islamist - nawaz | 41 | 55_rotherham_rape_scandals_islamist |
+ | 56 | mexicans - caravan - tapachula - cartels - marches | 40 | 56_mexicans_caravan_tapachula_cartels |
+ | 57 | britbox - broadcaster - stormzy - gary - complain | 39 | 57_britbox_broadcaster_stormzy_gary |
+ | 58 | referendum - deadline - approved - september - 24th | 39 | 58_referendum_deadline_approved_september |
+ | 59 | lithuania - belarusian - rudninkai - trespassers - sanctions | 39 | 59_lithuania_belarusian_rudninkai_trespassers |
+ | 60 | deported - offenders - jamaicans - degenerates - partygate | 38 | 60_deported_offenders_jamaicans_degenerates |
+ | 61 | britannia - wigan - protests - deported - schoolgirls | 38 | 61_britannia_wigan_protests_deported |
+ | 62 | share - telegram - videos - everywhere - retweet | 38 | 62_share_telegram_videos_everywhere |
+ | 63 | raped - convicted - rashid - hussain - kirklees | 37 | 63_raped_convicted_rashid_hussain |
+ | 64 | charlenedownes - blackpool - murdered - saddleworth - vicky | 37 | 64_charlenedownes_blackpool_murdered_saddleworth |
+ | 65 | crimea - russophobia - nordstream - katyusha - vladimir | 37 | 65_crimea_russophobia_nordstream_katyusha |
+ | 66 | buckinghamshire - middlesbrough - rochdale - telford - perpetrators | 37 | 66_buckinghamshire_middlesbrough_rochdale_telford |
+ | 67 | libya - deporting - amnesty - eritrean - escaped | 36 | 67_libya_deporting_amnesty_eritrean |
+ | 68 | leicestershire - birmingham - activists - harassed - gestapo | 34 | 68_leicestershire_birmingham_activists_harassed |
+ | 69 | terrorist - qaeda - murdering - knifeman - sharif | 34 | 69_terrorist_qaeda_murdering_knifeman |
+ | 70 | germany - rioting - rotterdam - baghdad - allensbach | 34 | 70_germany_rioting_rotterdam_baghdad |
+ | 71 | sunday - blessings - christian - worship - supporters | 33 | 71_sunday_blessings_christian_worship |
+ | 72 | canvassing - north - wigan - altrincham - winning | 33 | 72_canvassing_north_wigan_altrincham |
+ | 73 | activist - meetings - welcomed - tonight - sussex | 33 | 73_activist_meetings_welcomed_tonight |
+ | 74 | dlive - paul - golding - 7pm - bitchute | 33 | 74_dlive_paul_golding_7pm |
+ | 75 | apartheid - johannesburg - zimbabwe - boers - looters | 33 | 75_apartheid_johannesburg_zimbabwe_boers |
+ | 76 | twitter - musk - shadowban - msnbc - shareholders | 32 | 76_twitter_musk_shadowban_msnbc |
+ | 77 | canaryislands - spaniards - morocco - asylum - arguineguin | 32 | 77_canaryislands_spaniards_morocco_asylum |
+ | 78 | hungary - referendum - prohibits - lgbtq - viktor | 32 | 78_hungary_referendum_prohibits_lgbtq |
+ | 79 | twitter - trumpbook - lindell - censorship - influencers | 31 | 79_twitter_trumpbook_lindell_censorship |
+ | 80 | islamist - jihadists - radicalisation - mi5 - fear | 31 | 80_islamist_jihadists_radicalisation_mi5 |
+ | 81 | britannia - glorious - silbury - stonehenge - fought | 31 | 81_britannia_glorious_silbury_stonehenge |
+ | 82 | universities - whiteness - decolonise - walkout - lipscomb | 31 | 82_universities_whiteness_decolonise_walkout |
+ | 83 | sweden - gangs - somalis - gottsunda - mikael | 30 | 83_sweden_gangs_somalis_gottsunda |
+ | 84 | paedophiles - lgbtqoajsrgoauib - sexualisation - brainwashing - pansexual | 30 | 84_paedophiles_lgbtqoajsrgoauib_sexualisation_brainwashing |
+ | 85 | eurosceptics - juncker - undemocratic - voted - remain | 30 | 85_eurosceptics_juncker_undemocratic_voted |
+ | 86 | europeans - globalist - migration - islamophobe - invaded | 29 | 86_europeans_globalist_migration_islamophobe |
+ | 87 | capitol - rioters - pelosi - msnbc - deputies | 29 | 87_capitol_rioters_pelosi_msnbc |
+ | 88 | armagh - killed - bombings - fusiliers - 1974 | 29 | 88_armagh_killed_bombings_fusiliers |
+ | 89 | bfd - bodyguards - training - aikido - krav | 28 | 89_bfd_bodyguards_training_aikido |
+ | 90 | newspapers - subscribed - patriot - worldwide - mainstream | 28 | 90_newspapers_subscribed_patriot_worldwide |
+ | 91 | queen - funeral - westminster - dignitaries - carriage | 27 | 91_queen_funeral_westminster_dignitaries |
+ | 92 | transwomen - lesbian - stonewall - matchmaker - elliot | 26 | 92_transwomen_lesbian_stonewall_matchmaker |
+ | 93 | christianity - crusade - englishman - blesses - rediscovering | 26 | 93_christianity_crusade_englishman_blesses |
+ | 94 | pray - almighty - victory - nationalists - strengthen | 25 | 94_pray_almighty_victory_nationalists |
+ | 95 | dartford - campaigning - essex - canvassers - east | 25 | 95_dartford_campaigning_essex_canvassers |
+ | 96 | livestream - thelionsradio - liamwalsall - tomorrow - 6pm | 24 | 96_livestream_thelionsradio_liamwalsall_tomorrow |
+ | 97 | cathedrals - firebombed - flames - saint - france | 23 | 97_cathedrals_firebombed_flames_saint |
+ | 98 | 4climate - antarctica - coldest - thunberg - alarmists | 23 | 98_4climate_antarctica_coldest_thunberg |
+ | 99 | parler - followers - alternatives - ashleasimonnews - censored | 23 | 99_parler_followers_alternatives_ashleasimonnews |
+ | 100 | schoolteacher - mohammed - blasphemy - batley - cartoon | 22 | 100_schoolteacher_mohammed_blasphemy_batley |
+ | 101 | postal - stamp - wellingborough - votes - urgently | 22 | 101_postal_stamp_wellingborough_votes |
+ | 102 | britannia - camp - august - activist - 2022 | 22 | 102_britannia_camp_august_activist |
+ | 103 | muslims - birmingham - temple - riots - smethwick | 22 | 103_muslims_birmingham_temple_riots |
+ | 104 | islamophobia - taqiyya - hijab - holiest - shitholes | 21 | 104_islamophobia_taqiyya_hijab_holiest |
+ | 105 | asylum - hostel - residents - hull - flintshire | 21 | 105_asylum_hostel_residents_hull |
+ | 106 | asylum - murdered - rapist - assaulting - grandmother | 21 | 106_asylum_murdered_rapist_assaulting |
+ | 107 | parliament - remainers - prorogue - amendments - undemocratic | 21 | 107_parliament_remainers_prorogue_amendments |
+ | 108 | ukprotests - trending - leftists - blmlondon - followers | 20 | 108_ukprotests_trending_leftists_blmlondon |
+ | 109 | belfastlive - defrauding - jolene - verdict - commissioner | 20 | 109_belfastlive_defrauding_jolene_verdict |
+ | 110 | wakefield - campaigning - palestinians - councillors - thursday | 20 | 110_wakefield_campaigning_palestinians_councillors |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: True
+ * language: None
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: False
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 1.26.4
+ * HDBSCAN: 0.8.40
+ * UMAP: 0.5.7
+ * Pandas: 2.2.3
+ * Scikit-Learn: 1.5.2
+ * Sentence-transformers: 3.3.1
+ * Transformers: 4.46.3
+ * Numba: 0.60.0
+ * Plotly: 5.24.1
+ * Python: 3.10.12
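
The topic overview table in the README above has a regular row layout (ID, keywords joined by ` - `, frequency, label), so it can be read back without loading the model. The sketch below is a minimal illustration under that assumption. The names `parse_topic_rows`, `SAMPLE_TABLE`, and `TRAINING_DOCUMENTS` are hypothetical and not part of BERTopic; only the three sample rows and the document count are taken from the README. Topic `-1` is the bucket BERTopic uses for documents it treats as outliers.

```python
# Hypothetical helper: parses rows shaped like the README's topic table.
# Only the sample rows and the document count below come from the README.
SAMPLE_TABLE = """\
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | brexit - islamist - twitter - arrested - wakefield | 20 | -1_brexit_islamist_twitter_arrested |
+ | 0 | novotel - migrants - accommodation - leeds - gatwick | 7332 | 0_novotel_migrants_accommodation_leeds |
+ | 1 | trump - ballots - clinton - impeachment - michigan | 178 | 1_trump_ballots_clinton_impeachment |
"""

TRAINING_DOCUMENTS = 13176  # "Number of training documents" in the README


def parse_topic_rows(table_text: str) -> list[dict]:
    """Return one dict per data row; header and separator rows are skipped."""
    rows = []
    for raw_line in table_text.splitlines():
        # Drop the diff marker ("+") if present, then the outer pipes.
        line = raw_line.strip().lstrip("+").strip()
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 4:
            continue
        try:
            topic_id = int(cells[0])
            frequency = int(cells[2])
        except ValueError:
            continue  # header or separator row
        rows.append(
            {
                "topic_id": topic_id,
                "keywords": cells[1].split(" - "),
                "frequency": frequency,
                "label": cells[3],
            }
        )
    return rows


rows = parse_topic_rows(SAMPLE_TABLE)
largest = max(rows, key=lambda row: row["frequency"])
share = largest["frequency"] / TRAINING_DOCUMENTS
print(f"Topic {largest['topic_id']} holds {share:.1%} of the training documents")
```

Run against the full table, the same calculation shows how heavily the corpus concentrates in topic 0, which is worth knowing before using the model as a classifier.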
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "calculate_probabilities": true,
+   "language": null,
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": false,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null
+ }
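
The keys in `config.json` mirror the hyperparameters listed in the README, with one representational difference: JSON has no tuple type, so `n_gram_range` is stored as a two-element list. The sketch below shows one way to read the file back into Python keyword arguments. `bertopic_kwargs` is a hypothetical helper name. Passing the result to `BERTopic(**kwargs)` is an assumption: the key names appear to match the constructor parameters in recent BERTopic releases, but this commit does not record which BERTopic version produced the file.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_JSON = """
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null
}
"""


def bertopic_kwargs(config_text: str) -> dict:
    """Hypothetical helper: turn the JSON config into Python keyword arguments."""
    config = json.loads(config_text)
    # JSON stores the n-gram range as a list; the README reports it as (1, 1).
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config


kwargs = bertopic_kwargs(CONFIG_JSON)
for name, value in sorted(kwargs.items()):
    print(f"{name} = {value!r}")
```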
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e513801b329617979ce8aad9542008aede946439b56105fa635577567cae2cdc
+ size 1001684
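
The three lines above are a Git LFS pointer, not the tensor data itself: they record the spec version, a SHA-256 object ID, and the size in bytes of the real file. A downloaded copy can be checked against the pointer, as sketched below. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the sketch assumes the whole file fits in memory.

```python
import hashlib

# Pointer text for ctfidf.safetensors as it appears in this commit.
POINTER_TEXT = """\
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e513801b329617979ce8aad9542008aede946439b56105fa635577567cae2cdc
+ size 1001684
"""


def parse_lfs_pointer(text: str) -> dict:
    """Hypothetical helper: split 'key value' pointer lines into a dict."""
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.lstrip("+ ").strip()  # tolerate diff markers
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that file contents agree with the pointer's size and SHA-256 digest."""
    if pointer["hash_algorithm"] != "sha256":
        raise ValueError(f"unsupported hash: {pointer['hash_algorithm']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algorithm"], pointer["size"])
```

To verify a local download, read the file in binary mode and pass its bytes together with `pointer` to `matches_pointer`. The same check applies to the `topic_embeddings.safetensors` pointer further down.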
ctfidf_config.json ADDED
The diff for this file is too large to render. See raw diff
 
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5befa4521853e42cabd1bfda6e607da2db6b236a0dd8a836b5afd5ed93763ae7
+ size 458848
topics.json ADDED
The diff for this file is too large to render. See raw diff