AngelPanizo committed on
Commit 73700ab · verified · 1 Parent(s): 784eb84

Add BERTopic model

README.md ADDED
@@ -0,0 +1,187 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # MARTINI_enrich_BERTopic_qanonplus
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("AIDA-UPM/MARTINI_enrich_BERTopic_qanonplus")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 118
+ * Number of training documents: 16268
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | trump - fbi - epstein - news - states | 20 | -1_trump_fbi_epstein_news |
+ | 0 | illegals - deported - cbp - matamoros - smuggled | 9495 | 0_illegals_deported_cbp_matamoros |
+ | 1 | ukraine - donetsk - mariupol - russians - missiles | 303 | 1_ukraine_donetsk_mariupol_russians |
+ | 2 | stabbed - robbers - bronx - gunpoint - homicide | 280 | 2_stabbed_robbers_bronx_gunpoint |
+ | 3 | biden - kamala - jill - reporters - vp | 213 | 3_biden_kamala_jill_reporters |
+ | 4 | fauci - coronaviruses - darpa - bioweapon - institutes | 182 | 4_fauci_coronaviruses_darpa_bioweapon |
+ | 5 | comey - counterintelligence - dossier - fisa - danchenko | 181 | 5_comey_counterintelligence_dossier_fisa |
+ | 6 | gaza - airstrikes - hezbollah - israeli - qassam | 180 | 6_gaza_airstrikes_hezbollah_israeli |
+ | 7 | musk - twittergate - dorsey - shareholders - takeover | 157 | 7_musk_twittergate_dorsey_shareholders |
+ | 8 | bailouts - fdic - depositors - bancorp - yellen | 149 | 8_bailouts_fdic_depositors_bancorp |
+ | 9 | bidens - bribery - laundering - oligarchs - vp | 131 | 9_bidens_bribery_laundering_oligarchs |
+ | 10 | trudeau - truckersforfreedom - ontario - protestors - crackdown | 125 | 10_trudeau_truckersforfreedom_ontario_protestors |
+ | 11 | pfizer - mercola - virologist - delisted - freedommed | 122 | 11_pfizer_mercola_virologist_delisted |
+ | 12 | trannies - ftm - estradiol - female - castrate | 120 | 12_trannies_ftm_estradiol_female |
+ | 13 | trump - judge - indicted - prosecutors - watergate | 119 | 13_trump_judge_indicted_prosecutors |
+ | 14 | davos - globalists - agenda - documentary - cbdc | 119 | 14_davos_globalists_agenda_documentary |
+ | 15 | patriotbusiness - cryptocoins - chatroom - reposted - 39k | 116 | 15_patriotbusiness_cryptocoins_chatroom_reposted |
+ | 16 | explosions - refinery - firefighters - flames - evacuations | 116 | 16_explosions_refinery_firefighters_flames |
+ | 17 | taliban - kandahar - milley - zabihullah - helicopters | 109 | 17_taliban_kandahar_milley_zabihullah |
+ | 18 | yangtze - qinghai - headwaters - floods - henan | 106 | 18_yangtze_qinghai_headwaters_floods |
+ | 19 | taiwan - warfare - invade - destroyers - pacific | 102 | 19_taiwan_warfare_invade_destroyers |
+ | 20 | migrants - lampedusa - islamist - stabbed - perpetrator | 101 | 20_migrants_lampedusa_islamist_stabbed |
+ | 21 | petroleum - bidenflation - prices - saudis - gallon | 97 | 21_petroleum_bidenflation_prices_saudis |
+ | 22 | patriot - believing - sheeple - hold - hardships | 92 | 22_patriot_believing_sheeple_hold |
+ | 23 | france - riots - bastille - marseille - liberte | 92 | 23_france_riots_bastille_marseille |
+ | 24 | bolsonaro - argentineans - venezuela - electoral - janeiro | 89 | 24_bolsonaro_argentineans_venezuela_electoral |
+ | 25 | fbi - capitol - prisoner - rioters - january | 85 | 25_fbi_capitol_prisoner_rioters |
+ | 26 | aspartame - sweeteners - gummies - synthetic - pepsi | 81 | 26_aspartame_sweeteners_gummies_synthetic |
+ | 27 | ivermectin - hydroxychloroquine - ritonavir - fenbendazole - 5000iu | 77 | 27_ivermectin_hydroxychloroquine_ritonavir_fenbendazole |
+ | 28 | cabal - cnn - soros - plandemic - jfkjr | 68 | 28_cabal_cnn_soros_plandemic |
+ | 29 | riots - counterprotesters - minnesota - officer - floyd | 65 | 29_riots_counterprotesters_minnesota_officer |
+ | 30 | ballots - pennsylvania - defendflorida - affidavits - precincts | 64 | 30_ballots_pennsylvania_defendflorida_affidavits |
+ | 31 | patriots - chatroom - bots - 11k - blackpillers | 60 | 31_patriots_chatroom_bots_11k |
+ | 32 | donaldjtrump - rallies - newsmax - watchlive - mcmaster | 59 | 32_donaldjtrump_rallies_newsmax_watchlive |
+ | 33 | biden - laptop - blackmailers - rudy - reveals | 58 | 33_biden_laptop_blackmailers_rudy |
+ | 34 | pompeo - 27th - fawkes - date - q1314 | 58 | 34_pompeo_27th_fawkes_date |
+ | 35 | kilauea - soufriere - eruptionis - palma - cataclysm | 57 | 35_kilauea_soufriere_eruptionis_palma |
+ | 36 | whatsapp - deleting - appstore - uninstall - protonmail | 55 | 36_whatsapp_deleting_appstore_uninstall |
+ | 37 | donald - federalizing - proclamation - military - 2020 | 54 | 37_donald_federalizing_proclamation_military |
+ | 38 | ghislaine - epstein - alleged - jurors - multimillionaire | 53 | 38_ghislaine_epstein_alleged_jurors |
+ | 39 | vaccine - perimyocarditis - mrna - inflammation - cardiologist | 52 | 39_vaccine_perimyocarditis_mrna_inflammation |
+ | 40 | duckduckgo - pichai - yahoo_youtube - blacklisted - presearch | 51 | 40_duckduckgo_pichai_yahoo_youtube_blacklisted |
+ | 41 | maui - hawaiians - lahaina - wildfire - landowners | 50 | 41_maui_hawaiians_lahaina_wildfire |
+ | 42 | ukraine - biolaboratories - pentagon - anthrax - proliferation | 47 | 42_ukraine_biolaboratories_pentagon_anthrax |
+ | 43 | zelensky - million - pinchuk - mironyuk - pensions | 44 | 43_zelensky_million_pinchuk_mironyuk |
+ | 44 | pompeo - dunford - genflynn - nsa - loyalists | 44 | 44_pompeo_dunford_genflynn_nsa |
+ | 45 | chinese - jingwei - zhengdong - espionage - defector | 43 | 45_chinese_jingwei_zhengdong_espionage |
+ | 46 | scotus - gorsuch - docket - dismissed - brett | 41 | 46_scotus_gorsuch_docket_dismissed |
+ | 47 | zuckerberg - fb - meta - censorship - lnstagram | 40 | 47_zuckerberg_fb_meta_censorship |
+ | 48 | qanon - posts - shill - surrender - disseminate | 40 | 48_qanon_posts_shill_surrender |
+ | 49 | fauci - _vaccine - news - mandates - falsified | 40 | 49_fauci__vaccine_news_mandates |
+ | 50 | thepatriotlight - tyrants - 21 - pray - everywhere | 40 | 50_thepatriotlight_tyrants_21_pray |
+ | 51 | rupaul - lgbtqxyz - paedophiles - christmas - parading | 39 | 51_rupaul_lgbtqxyz_paedophiles_christmas |
+ | 52 | vaers - deaths - janssen - ad26 - pericarditis | 39 | 52_vaers_deaths_janssen_ad26 |
+ | 53 | representatives - impeach - murkowski - voted - 5251 | 39 | 53_representatives_impeach_murkowski_voted |
+ | 54 | derailments - hazmat - metra - dakota - spills | 38 | 54_derailments_hazmat_metra_dakota |
+ | 55 | vaxxers - vacunados - shingles - smerconish - needle | 37 | 55_vaxxers_vacunados_shingles_smerconish |
+ | 56 | covidvaccinevictims - astrazeneca - 2021 - stephanie - coma | 37 | 56_covidvaccinevictims_astrazeneca_2021_stephanie |
+ | 57 | obamas - drowned - martha - paddleboarder - accidental | 37 | 57_obamas_drowned_martha_paddleboarder |
+ | 58 | stimulus - billion - obamacare - americorp - fy2019 | 37 | 58_stimulus_billion_obamacare_americorp |
+ | 59 | australia - nsw - lockdown - protesters - busselton | 36 | 59_australia_nsw_lockdown_protesters |
+ | 60 | royal - coronation - buckingham - meghan - sceptre | 36 | 60_royal_coronation_buckingham_meghan |
+ | 61 | pelosi - nancy - paulie - husband - drunken | 36 | 61_pelosi_nancy_paulie_husband |
+ | 62 | monkeypox - smallpox - vaccinate - transmissibility - nonproliferation | 36 | 62_monkeypox_smallpox_vaccinate_transmissibility |
+ | 63 | humantraffickinghotline - arrested - prostituted - kidnapping - ohio | 35 | 63_humantraffickinghotline_arrested_prostituted_kidnapping |
+ | 64 | squadrons - awacs - stratotanker - rc135s - dornier | 35 | 64_squadrons_awacs_stratotanker_rc135s |
+ | 65 | sextortion - molestation - convicted - imprisonment - victims | 35 | 65_sextortion_molestation_convicted_imprisonment |
+ | 66 | blockfi - winklevoss - laundered - scams - 32billion | 35 | 66_blockfi_winklevoss_laundered_scams |
+ | 67 | cryptocrash - coinmarketcap - litecoin - hashrate - xlm | 34 | 67_cryptocrash_coinmarketcap_litecoin_hashrate |
+ | 68 | iraq - airstrikes - khorramabad - dhahran - kurdish | 33 | 68_iraq_airstrikes_khorramabad_dhahran |
+ | 69 | lockdown - guangzhou - sichuan - tibet - xishuangbanna | 33 | 69_lockdown_guangzhou_sichuan_tibet |
+ | 70 | biden - delaware - folders - mishandled - chung | 33 | 70_biden_delaware_folders_mishandled |
+ | 71 | earthquakes - aftershocks - tsunamis - usgs - panama | 32 | 71_earthquakes_aftershocks_tsunamis_usgs |
+ | 72 | netherlands - rotterdam - rotterdammers - joris - ministers | 32 | 72_netherlands_rotterdam_rotterdammers_joris |
+ | 73 | epidemics - nipah - h5n1 - dengue - deadliest | 32 | 73_epidemics_nipah_h5n1_dengue |
+ | 74 | covidvaxexposed - vaccinate - injections - children - mandating | 32 | 74_covidvaxexposed_vaccinate_injections_children |
+ | 75 | unvaccinated - covid - hospitalizations - breakthrough - fully | 31 | 75_unvaccinated_covid_hospitalizations_breakthrough |
+ | 76 | eurosceptic - netherland - coalition - polls - geert | 31 | 76_eurosceptic_netherland_coalition_polls |
+ | 77 | electionguard - hackable - computerized - techcrunch - audited | 30 | 77_electionguard_hackable_computerized_techcrunch |
+ | 78 | sunspot - auroras - comet - magnetometers - nasa | 30 | 78_sunspot_auroras_comet_magnetometers |
+ | 79 | maricopa - auditors - subpoena - truthhammer - bennett | 30 | 79_maricopa_auditors_subpoena_truthhammer |
+ | 80 | trumps - products - memorabilia - patriot - donate | 30 | 80_trumps_products_memorabilia_patriot |
+ | 81 | papal - archbishopric - scandals - nuns - defrocked | 30 | 81_papal_archbishopric_scandals_nuns |
+ | 82 | protests - lockdown - milan - boomrome - denmark | 30 | 82_protests_lockdown_milan_boomrome |
+ | 83 | ep - trap - ds - 2943b - treason | 30 | 83_ep_trap_ds_2943b |
+ | 84 | impeachment - vindman - liar - julie - swalwell | 30 | 84_impeachment_vindman_liar_julie |
+ | 85 | pilots - jetblue - unvaccinated - boarded - mandates | 29 | 85_pilots_jetblue_unvaccinated_boarded |
+ | 86 | miscarriages - pfizer - immunized - placenta - trimester | 28 | 86_miscarriages_pfizer_immunized_placenta |
+ | 87 | vaccinated - mandates - fda - tyson - workers | 28 | 87_vaccinated_mandates_fda_tyson |
+ | 88 | maersk - dockworkers - longshore - containers - backlogged | 27 | 88_maersk_dockworkers_longshore_containers |
+ | 89 | jury - courttv - kenosha - dismissed - youngkin | 27 | 89_jury_courttv_kenosha_dismissed |
+ | 90 | mortality - vaxx - dowd - policyholders - 2022 | 27 | 90_mortality_vaxx_dowd_policyholders |
+ | 91 | capitol - inauguration - guardsmen - dunford - _mil | 27 | 91_capitol_inauguration_guardsmen_dunford |
+ | 92 | ballots - votesecure - maricopa - counted - ineligible | 27 | 92_ballots_votesecure_maricopa_counted |
+ | 93 | download - members - signal - cellphones - virginia | 26 | 93_download_members_signal_cellphones |
+ | 94 | jpmorgan - epstein - pedophile - magnate - trustees | 26 | 94_jpmorgan_epstein_pedophile_magnate |
+ | 95 | omicron - mutated - marburg - botswana - clade | 26 | 95_omicron_mutated_marburg_botswana |
+ | 96 | poweroutage - generators - texas - supplies - temperatures | 26 | 96_poweroutage_generators_texas_supplies |
+ | 97 | extradited - snowden - courts - juilan - ecuador | 25 | 97_extradited_snowden_courts_juilan |
+ | 98 | fentanyl - sinaloa - naloxone - oxycodone - overdose | 25 | 98_fentanyl_sinaloa_naloxone_oxycodone |
+ | 99 | australians - quarantine - victoria - deported - djokovic | 25 | 99_australians_quarantine_victoria_deported |
+ | 100 | shootings - parkland - aiden - classroom - superintendent | 24 | 100_shootings_parkland_aiden_classroom |
+ | 101 | inflation - skyrocketed - unaffordable - costco - 40 | 24 | 101_inflation_skyrocketed_unaffordable_costco |
+ | 102 | shortages - rationed - soybeans - growers - tesco | 24 | 102_shortages_rationed_soybeans_growers |
+ | 103 | pcr - piandemic - false - influenza - detected | 23 | 103_pcr_piandemic_false_influenza |
+ | 104 | tunnelsb - submersible - worldriotsbegin - implosions - maglev | 23 | 104_tunnelsb_submersible_worldriotsbegin_implosions |
+ | 105 | trump - bullllshit - naysayers - victory - loosing | 23 | 105_trump_bullllshit_naysayers_victory |
+ | 106 | praying - demonic - attacks - 21 - awake | 22 | 106_praying_demonic_attacks_21 |
+ | 107 | trafficked - underage - bondage - bastrop - rangers | 22 | 107_trafficked_underage_bondage_bastrop |
+ | 108 | michaeljlindell - frankspeech - tonight - kimmel - streaming | 22 | 108_michaeljlindell_frankspeech_tonight_kimmel |
+ | 109 | footballer - athletes - myocarditis - collapsed - sergio | 22 | 109_footballer_athletes_myocarditis_collapsed |
+ | 110 | desantis - governors - wfla - fda - maga | 22 | 110_desantis_governors_wfla_fda |
+ | 111 | cuomo - allegations - albany - andrew - groped | 21 | 111_cuomo_allegations_albany_andrew |
+ | 112 | webull - gamestonk - robinhood - brokers - delisting | 21 | 112_webull_gamestonk_robinhood_brokers |
+ | 113 | mcafee - killswitch - 04b0400009d0400007a0c0000f90d0000af1000005e1200000414000068170000 - decode - john | 21 | 113_mcafee_killswitch_04b0400009d0400007a0c0000f90d0000af1000005e1200000414000068170000_decode |
+ | 114 | fbi - extremism - biased - redactions - infiltrated | 21 | 114_fbi_extremism_biased_redactions |
+ | 115 | whistleblower - bidens - irs - investigation - transcript | 21 | 115_whistleblower_bidens_irs_investigation |
+ | 116 | gauteng - riots - looted - jabulani - zuma | 21 | 116_gauteng_riots_looted_jabulani |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: True
+ * language: None
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: False
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 1.26.4
+ * HDBSCAN: 0.8.40
+ * UMAP: 0.5.7
+ * Pandas: 2.2.3
+ * Scikit-Learn: 1.5.2
+ * Sentence-transformers: 3.3.1
+ * Transformers: 4.46.3
+ * Numba: 0.60.0
+ * Plotly: 5.24.1
+ * Python: 3.10.12
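
The labels in the README's topic table appear to follow a simple pattern: the topic ID followed by the first four keywords, joined with underscores. The sketch below illustrates that pattern on one row copied from the table. `parse_row` and `make_label` are hypothetical helpers written for this illustration and are not part of the BERTopic API.

```python
def parse_row(row: str) -> dict:
    """Split one markdown row of the topic table into its four cells."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    topic_id, keywords, frequency, label = cells
    return {
        "topic_id": int(topic_id),
        "keywords": keywords.split(" - "),
        "frequency": int(frequency),
        "label": label,
    }


def make_label(topic_id: int, keywords: list[str], n_words: int = 4) -> str:
    """Assumed pattern: topic ID plus the first n_words keywords, joined by '_'."""
    return "_".join([str(topic_id)] + keywords[:n_words])


# One row copied verbatim from the table.
row = "| 1 | ukraine - donetsk - mariupol - russians - missiles | 303 | 1_ukraine_donetsk_mariupol_russians |"
record = parse_row(row)
derived = make_label(record["topic_id"], record["keywords"])
print(derived == record["label"])
```

Because keywords can themselves contain underscores (for example `yahoo_youtube` in topic 40), a label cannot be split back into its keywords reliably; the keyword column is the safer source.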
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "calculate_probabilities": true,
+   "language": null,
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": false,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null
+ }
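
The README lists `n_gram_range` as `(1, 1)`, while `config.json` stores it as a two-element array because JSON has no tuple type. A minimal sketch of reading the configuration back into Python values follows. The config text is inlined so the example runs on its own, and `load_hyperparameters` is a hypothetical helper, not a BERTopic function.

```python
import json

# Contents of config.json as added in this commit, inlined for a self-contained example.
CONFIG_TEXT = """
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": false,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null
}
"""


def load_hyperparameters(text: str) -> dict:
    """Parse the JSON config and restore the tuple form used in the README."""
    params = json.loads(text)
    # JSON arrays load as lists; convert back to the (min_n, max_n) tuple.
    params["n_gram_range"] = tuple(params["n_gram_range"])
    return params


params = load_hyperparameters(CONFIG_TEXT)
print(params["n_gram_range"], params["min_topic_size"])
```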
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e7fed64ac71b20542dc2fbf0d4dbbf0b4df372629fc0bf76c520c71429ca4d3
+ size 1790420
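
The three lines above are a Git LFS pointer, not the tensor data itself: each line is a key and a value separated by a single space. As a rough sketch, the pointer can be parsed like this (`parse_lfs_pointer` is a hypothetical helper for illustration):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer for ctfidf.safetensors, copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9e7fed64ac71b20542dc2fbf0d4dbbf0b4df372629fc0bf76c520c71429ca4d3
size 1790420
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algo"], pointer["size"])
```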
ctfidf_config.json ADDED
The diff for this file is too large to render. See raw diff
 
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:26114189d0af0ce0bf5479ac53b9505269e18da371dac182e799eb9db9d6b3ba
+ size 483424
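
Once the real file has been downloaded, its size and SHA-256 digest should match the pointer. The sketch below shows one way to check this, assuming the file contents are already in memory. `matches_pointer` is a hypothetical helper, and stand-in bytes are used in place of the real tensor file.

```python
import hashlib


def matches_pointer(data: bytes, expected_digest: str, expected_size: int) -> bool:
    """Return True if data has the size and SHA-256 digest recorded in an LFS pointer."""
    if len(data) != expected_size:
        return False
    return hashlib.sha256(data).hexdigest() == expected_digest


# Stand-in payload; the real check would read topic_embeddings.safetensors from disk.
payload = b"stand-in bytes for illustration"
payload_digest = hashlib.sha256(payload).hexdigest()

# The stand-in matches its own digest and size.
ok = matches_pointer(payload, payload_digest, len(payload))
# It does not match the values recorded for topic_embeddings.safetensors.
mismatch = matches_pointer(
    payload,
    "26114189d0af0ce0bf5479ac53b9505269e18da371dac182e799eb9db9d6b3ba",
    483424,
)
print(ok, mismatch)
```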
topics.json ADDED
The diff for this file is too large to render. See raw diff