Mars203020 committed
Commit 54d4f91 · verified · 1 Parent(s): e954ef8

Upload 9 files

Social Media Topic Modeling System.md ADDED
@@ -0,0 +1,99 @@
+ # Social Media Topic Modeling System
+ 
+ A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini-based diversity analysis, and topic evolution analysis.
+ 
+ ## Features
+ 
+ - **📊 Topic Modeling**: Uses BERTopic for state-of-the-art topic modeling.
+ - **⚙️ Flexible Configuration**:
+   - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
+   - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
+ - **🌍 Multilingual Support**: Handles English and 50+ other languages.
+ - **📈 Gini Analysis**: Calculates topic diversity per user and user diversity per topic.
+ - **⏰ Topic Evolution**: Tracks how topics change over time.
+ - **🎯 Interactive Visualizations**: Built-in charts and data tables using Plotly.
+ - **📱 Responsive Interface**: Clean, modern Streamlit interface with a control panel for all settings.
+ 
+ ## Requirements
+ 
+ ### CSV File Format
+ 
+ Your CSV file must contain columns that can be mapped to the following roles:
+ - **User ID**: A column with unique identifiers for each user (string).
+ - **Post Content**: A column with the text content of the social media post (string).
+ - **Timestamp**: A column with the date and time of the post (e.g., "2023-01-15 14:30:00").
+ 
+ The application will prompt you to select the correct column for each role after you upload your file.
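+ 
+ For reference, a minimal CSV (the column names here are hypothetical; any names work, since you map them in the UI) might look like this:
+ 
+ ```csv
+ author,text,created_at
+ u_001,"Loving the new camera on this phone!",2023-01-15 14:30:00
+ u_002,"The movie's plot twist was wild.",2023-01-15 18:05:00
+ u_001,"Battery life could be better though.",2023-01-16 09:12:00
+ ```
+ 
+ Here `author`, `text`, and `created_at` would be mapped to User ID, Post Content, and Timestamp.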
+ 
+ ### Dependencies
+ 
+ See `requirements.txt` for a full list of dependencies.
+ 
+ ## Installation
+ 
+ ### Option 1: Local Installation
+ 
+ 1. **Clone or download the project files.**
+ 2. **Install dependencies:**
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 
+ ### Option 2: Docker Installation (Recommended)
+ 
+ 1. **Using Docker Compose (easiest):**
+    ```bash
+    docker-compose up --build
+    ```
+ 2. **Access the application:**
+    ```
+    http://localhost:8501
+    ```
+ 
+ ## Usage
+ 
+ 1. **Start the Streamlit application:**
+    ```bash
+    streamlit run app.py
+    ```
+ 2. **Open your browser** and navigate to `http://localhost:8501`.
+ 3. **Follow the steps in the application:**
+    - **1. Upload CSV File**: Click "Browse files" to upload your dataset.
+    - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
+    - **3. Configure Analysis**:
+      - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
+      - **Number of Topics**: Enter a specific number of topics to find, or use `-1` to let the model decide automatically.
+      - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
+    - **4. Run Analysis**: Click the "🚀 Run Full Analysis" button.
+ 
+ 4. **Explore the results** in the five interactive tabs in the main panel.
+ 
+ ### Using the Interface
+ 
+ The application provides five main tabs:
+ 
+ #### 📋 Overview
+ - Key metrics, dataset preview, and average Gini score.
+ 
+ #### 🎯 Topics
+ - Topic information table and topic distribution bar chart.
+ 
+ #### 📊 Gini Analysis
+ - Analysis of topic diversity for each user and user concentration for each topic.
+ 
+ #### 📈 Topic Evolution
+ - Timelines showing how topic popularity changes over time, for all users and for individual users.
+ 
+ #### 📄 Documents
+ - A detailed view of your original data with assigned topics and probabilities.
+ 
+ ## Understanding the Results
+ 
+ ### Gini Impurity Index
+ This application reports the **Gini Impurity Index**, a measure of diversity.
+ - **Range**: 0 to 1
+ - **User Gini**: Measures how diverse a user's topics are. **0** = perfectly specialized (posts on one topic), **1** = perfectly diverse (posts spread evenly across many topics).
+ - **Topic Gini**: Measures how broadly a topic is discussed. **0** = dominated by a single user, **1** = widely and evenly discussed by many users.
+ 
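+ For intuition: if a user's 10 posts split 7/2/1 across three topics, the unnormalized impurity is 1 - ((7/10)² + (2/10)² + (1/10)²) = 1 - 0.54 = 0.46, a moderately focused profile.
+ 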
+ ---
+ 
+ **Built with ❤️ using Streamlit and BERTopic**
app.py ADDED
@@ -0,0 +1,534 @@
+ import streamlit as st
+ import pandas as pd
+ import numpy as np
+ 
+ import plotly.express as px
+ from wordcloud import WordCloud
+ import matplotlib.pyplot as plt
+ 
+ # Import custom modules
+ from text_preprocessor import MultilingualPreprocessor
+ from topic_modeling import perform_topic_modeling
+ from gini_calculator import calculate_gini_per_user, calculate_gini_per_topic
+ from topic_evolution import analyze_general_topic_evolution
+ from narrative_similarity import calculate_narrative_similarity
+ 
+ # --- Page Configuration ---
+ st.set_page_config(
+     page_title="Social Media Topic Modeling System",
+     page_icon="📊",
+     layout="wide",
+ )
+ 
+ # --- Custom CSS ---
+ st.markdown("""
+ <style>
+     .main-header { font-size: 2.5rem; color: #1f77b4; text-align: center; margin-bottom: 1rem; }
+     .sub-header { font-size: 1.75rem; color: #2c3e50; border-bottom: 2px solid #f0f2f6; padding-bottom: 0.3rem; margin-top: 2rem; margin-bottom: 1rem;}
+ </style>
+ """, unsafe_allow_html=True)
+ 
+ # --- Session State Initialization ---
+ if 'results' not in st.session_state:
+     st.session_state.results = None
+ if 'df_raw' not in st.session_state:
+     st.session_state.df_raw = None
+ if 'custom_stopwords_text' not in st.session_state:
+     st.session_state.custom_stopwords_text = ""
+ if "topics_info_for_sync" not in st.session_state:
+     st.session_state.topics_info_for_sync = []
+ 
+ 
+ # --- Helper Functions ---
+ @st.cache_data
+ def create_word_cloud(_topic_model, topic_id):
+     word_freq = _topic_model.get_topic(topic_id)
+     if not word_freq: return None
+     wc = WordCloud(width=800, height=400, background_color="white", colormap="viridis", max_words=50).generate_from_frequencies(dict(word_freq))
+     fig, ax = plt.subplots(figsize=(10, 5))
+     ax.imshow(wc, interpolation='bilinear')
+     ax.axis("off")
+     plt.close(fig)
+     return fig
+ 
+ 
+ def interpret_gini(gini_score):
+     # Higher Gini Impurity means a more diverse topic mix
+     if gini_score >= 0.6: return "🌐 Diverse Interests"
+     elif gini_score >= 0.3: return "🎯 Moderately Focused"
+     else: return "🔥 Highly Specialized"
+ 
+ # --- Centralized stopword-sync callback ---
+ def sync_stopwords():
+     """
+     Single source of truth for updating stopwords.
+     Called whenever any related widget changes.
+     """
+     # 1. Get words from all multiselect lists
+     selected_from_lists = set()
+     for topic_id in st.session_state.topics_info_for_sync:
+         key = f"multiselect_topic_{topic_id}"
+         if key in st.session_state:
+             selected_from_lists.update([s.split(' ')[0] for s in st.session_state[key]])
+ 
+     # 2. Get words from the text area
+     # (the text area's widget key is the master state variable itself)
+     typed_stopwords = set([s.strip() for s in st.session_state.custom_stopwords_text.split(',') if s])
+ 
+     # 3. Combine them and update the master state variable
+     combined_stopwords = typed_stopwords.union(selected_from_lists)
+     st.session_state.custom_stopwords_text = ", ".join(sorted(list(combined_stopwords)))
+ 
+ 
+ # --- Main Page Layout ---
+ st.title("🌍 Multilingual Topic Modeling Dashboard")
+ st.markdown("Analyze textual data in multiple languages to discover topics and user trends.")
+ 
+ # Use a key so the file uploader keeps its state across reruns
+ uploaded_file = st.file_uploader("Upload your CSV data", type="csv", key="csv_uploader")
+ 
+ # Check whether a new file has been uploaded (or a file exists on first run)
+ if uploaded_file is not None and uploaded_file != st.session_state.get('last_uploaded_file', None):
+     try:
+         st.session_state.df_raw = pd.read_csv(uploaded_file)
+         st.session_state.results = None  # Reset results if a new file is uploaded
+         st.session_state.custom_stopwords_text = ""
+         st.session_state.last_uploaded_file = uploaded_file  # Remember the uploaded file
+         st.success("CSV file loaded successfully!")
+     except Exception as e:
+         st.error(f"Could not read CSV file. Error: {e}")
+         st.session_state.df_raw = None
+         st.session_state.last_uploaded_file = None
+ 
+ if st.session_state.df_raw is not None:
+     df_raw = st.session_state.df_raw
+     col1, col2, col3 = st.columns(3)
+ 
+     with col1: user_id_col = st.selectbox("User ID Column", df_raw.columns, index=0, key="user_id_col")
+     with col2: post_content_col = st.selectbox("Post Content Column", df_raw.columns, index=1, key="post_content_col")
+     with col3: timestamp_col = st.selectbox("Timestamp Column", df_raw.columns, index=2, key="timestamp_col")
+ 
+     st.subheader("Topic Modeling Settings")
+     lang_col, topics_col = st.columns(2)
+     with lang_col: language = st.selectbox("Language Model", ["english", "multilingual"], key="language_model")
+     with topics_col: num_topics = st.number_input("Number of Topics", -1, help="Use -1 for automatic detection", key="num_topics")
+ 
+     with st.expander("Advanced: Text Cleaning & Preprocessing Options", expanded=False):
+         c1, c2 = st.columns(2)
+         with c1:
+             opts = {
+                 'lowercase': st.checkbox("Convert to Lowercase", True, key="opt_lowercase"),
+                 'lemmatize': st.checkbox("Lemmatize words", False, key="opt_lemmatize"),
+                 'remove_urls': st.checkbox("Remove URLs", False, key="opt_remove_urls"),
+                 'remove_html': st.checkbox("Remove HTML Tags", False, key="opt_remove_html")
+             }
+         with c2:
+             opts.update({
+                 'remove_special_chars': st.checkbox("Remove Special Characters", False, key="opt_remove_special_chars"),
+                 'remove_punctuation': st.checkbox("Remove Punctuation", False, key="opt_remove_punctuation"),
+                 'remove_numbers': st.checkbox("Remove Numbers", False, key="opt_remove_numbers")
+             })
+         st.markdown("---")
+         c1_emoji, c2_hashtag, c3_mention = st.columns(3)
+         with c1_emoji: opts['handle_emojis'] = st.radio("Emoji Handling", ["Keep Emojis", "Remove Emojis", "Convert Emojis to Text"], index=0, key="opt_handle_emojis")
+         with c2_hashtag: opts['handle_hashtags'] = st.radio("Hashtag (#) Handling", ["Keep Hashtags", "Remove Hashtags", "Extract Hashtags"], index=0, key="opt_handle_hashtags")
+         with c3_mention: opts['handle_mentions'] = st.radio("Mention (@) Handling", ["Keep Mentions", "Remove Mentions", "Extract Mentions"], index=0, key="opt_handle_mentions")
+         st.markdown("---")
+         opts['remove_stopwords'] = st.checkbox("Remove Stopwords", True, key="opt_remove_stopwords")
+ 
+         st.text_area(
+             "Custom Stopwords (comma-separated)",
+             key="custom_stopwords_text",  # Widget key doubles as the master state variable
+             on_change=sync_stopwords
+         )
+         opts['custom_stopwords'] = [s.strip().lower() for s in st.session_state.custom_stopwords_text.split(',') if s]
+ 
+     st.divider()
+     process_button = st.button("🚀 Run Full Analysis", type="primary", use_container_width=True)
+ else:
+     process_button = False
+ 
+ st.divider()
+ 
+ # --- Main Processing Logic ---
+ if process_button:
+     st.session_state.results = None
+     with st.spinner("Processing your data... This may take a few minutes."):
+         try:
+             df = df_raw[[user_id_col, post_content_col, timestamp_col]].copy()
+             df.columns = ['user_id', 'post_content', 'timestamp']
+             df.dropna(subset=['user_id', 'post_content', 'timestamp'], inplace=True)
+             df['timestamp'] = pd.to_datetime(df['timestamp'])
+             if opts['handle_hashtags'] == 'Extract Hashtags': df['hashtags'] = df['post_content'].str.findall(r'#\w+')
+             if opts['handle_mentions'] == 'Extract Mentions': df['mentions'] = df['post_content'].str.findall(r'@\w+')
+ 
+             # 1. Capture the user's actual choice about stopwords
+             user_wants_stopwords_removed = opts.get("remove_stopwords", False)
+             custom_stopwords_list = opts.get("custom_stopwords", [])
+ 
+             # 2. Tell the preprocessor to KEEP stopwords in the text, so the
+             #    embedding model sees full sentences; BERTopic removes them later.
+             opts_for_preprocessor = opts.copy()
+             opts_for_preprocessor['remove_stopwords'] = False
+ 
+             st.info("⚙️ Initializing preprocessor and cleaning text (keeping stopwords for now)...")
+             preprocessor = MultilingualPreprocessor(language=language)
+             df['processed_content'] = preprocessor.preprocess_series(
+                 df['post_content'],
+                 opts_for_preprocessor,
+                 n_process_spacy=1  # Single process for stability
+             )
+ 
+             st.info("🔍 Performing topic modeling...")
+             # BERTopic counts the outlier topic (-1) towards nr_topics, so
+             # request one extra to end up with the number the user asked for.
+             if num_topics > 0:
+                 bertopic_nr_topics = num_topics + 1
+             else:
+                 bertopic_nr_topics = "auto"
+ 
+             docs_series = df['processed_content'].fillna('').astype(str)
+             docs_to_model = docs_series[docs_series.str.len() > 0].tolist()
+             df_with_content = df[docs_series.str.len() > 0].copy()
+ 
+             if not docs_to_model:
+                 st.error("❌ After preprocessing, no documents were left to analyze. Please adjust your cleaning options.")
+                 st.stop()
+ 
+             # 3. Pass the user's choice and stopword list to BERTopic
+             topic_model, topics, probs, coherence_score = perform_topic_modeling(
+                 docs=docs_to_model,
+                 language=language,
+                 nr_topics=bertopic_nr_topics,
+                 remove_stopwords_bertopic=user_wants_stopwords_removed,
+                 custom_stopwords=custom_stopwords_list
+             )
+ 
+             df_with_content['topic_id'] = topics
+             df_with_content['probability'] = probs
+             df = pd.merge(df, df_with_content[['topic_id', 'probability']], left_index=True, right_index=True, how='left')
+             df['topic_id'] = df['topic_id'].fillna(-1).astype(int)
+ 
+             st.info("📊 Calculating user engagement metrics...")
+             all_unique_topics = sorted(df[df['topic_id'] != -1]['topic_id'].unique().tolist())
+             all_unique_users = sorted(df['user_id'].unique().tolist())
+ 
+             gini_per_user = calculate_gini_per_user(df[['user_id', 'topic_id']], all_topics=all_unique_topics)
+             gini_per_topic = calculate_gini_per_topic(df[['user_id', 'topic_id']], all_users=all_unique_users)
+ 
+             st.info("📈 Analyzing topic evolution...")
+             general_evolution = analyze_general_topic_evolution(topic_model, docs_to_model, df_with_content['timestamp'].tolist())
+ 
+             st.session_state.results = {
+                 'topic_model': topic_model,
+                 'topic_info': topic_model.get_topic_info(),
+                 'df': df,
+                 'gini_per_user': gini_per_user,
+                 'gini_per_topic': gini_per_topic,
+                 'general_evolution': general_evolution,
+                 'coherence_score': coherence_score
+             }
+ 
+             st.success("✅ Analysis complete!")
+         except OSError:
+             st.error("spaCy Model Error: Could not load model. Please run `python -m spacy download en_core_web_sm` and `python -m spacy download xx_ent_wiki_sm` from your terminal.")
+         except Exception as e:
+             st.error(f"❌ An error occurred during processing: {e}")
+             st.exception(e)
+ 
+ # --- Display Results ---
+ if st.session_state.results:
+     results = st.session_state.results
+     df = results['df']
+     topic_model = results['topic_model']
+     topic_info = results['topic_info']
+ 
+     st.markdown('<h2 class="sub-header">📋 Overview & Preprocessing</h2>', unsafe_allow_html=True)
+     score_text = f"{results['coherence_score']:.3f}" if results['coherence_score'] is not None else "N/A"
+     num_users = df['user_id'].nunique()
+     avg_posts = len(df) / num_users if num_users > 0 else 0
+     start_date, end_date = df['timestamp'].min(), df['timestamp'].max()
+     # Compact date format: only show the year once when both dates share it
+     if start_date.year == end_date.year:
+         time_range_str = f"{start_date.strftime('%b %d')} - {end_date.strftime('%b %d, %Y')}"
+     else:
+         time_range_str = f"{start_date.strftime('%b %d, %Y')} - {end_date.strftime('%b %d, %Y')}"
+     col1, col2, col3, col4, col5 = st.columns(5)
+     col1.metric("Total Posts", len(df))
+     col2.metric("Unique Users", num_users)
+     col3.metric("Avg Posts / User", f"{avg_posts:.1f}")
+     col4.metric("Time Range", time_range_str)
+     col5.metric("Topic Coherence", score_text)
+     st.markdown("#### Preprocessing Results (Sample)")
+     st.dataframe(df[['post_content', 'processed_content']].head())
+ 
+     with st.expander("📊 Topic Model Evaluation Metrics"):
+         st.write("""
+         ### 🔹 Coherence Score
+         Measures how well the discovered topics make sense:
+         - **> 0.6**: Excellent - Topics are very distinct and meaningful
+         - **0.5 - 0.6**: Good - Topics are generally clear and interpretable
+         - **0.4 - 0.5**: Fair - Topics are somewhat meaningful but may overlap
+         - **< 0.4**: Poor - Topics may be unclear or too similar
+ 
+         💡 **Tip**: If coherence is low, try adjusting the number of topics or cleaning options.
+         """)
+ 
+     st.markdown('<h2 class="sub-header">🎯 Topic Visualization & Refinement</h2>', unsafe_allow_html=True)
+     topic_options = topic_info[topic_info.Topic != -1].sort_values('Count', ascending=False)
+ 
+     view1, view2 = st.tabs(["Word Clouds", "Interactive Word Lists & Refinement"])
+ 
+     with view1:
+         st.info("Visual representation of the most important words for each topic.")
+         topics_to_show = topic_options.head(9)
+         num_cols = 3
+         cols = st.columns(num_cols)
+         for i, row in enumerate(topics_to_show.itertuples()):
+             with cols[i % num_cols]:
+                 st.markdown(f"##### Topic {row.Topic}: {row.Name}")
+                 fig = create_word_cloud(topic_model, row.Topic)
+                 if fig: st.pyplot(fig, use_container_width=True)
+ 
+     with view2:
+         st.info("Select or deselect words from the lists below to instantly update the custom stopwords list in the configuration section above.")
+         topics_to_show = topic_options.head(9)
+         # Store the topic IDs we are showing so the callback can find the right widgets
+         st.session_state.topics_info_for_sync = [row.Topic for row in topics_to_show.itertuples()]
+ 
+         num_cols = 3
+         cols = st.columns(num_cols)
+ 
+         # Words that should be pre-selected in the multiselects
+         current_stopwords_set = set([s.strip() for s in st.session_state.custom_stopwords_text.split(',') if s])
+ 
+         for i, row in enumerate(topics_to_show.itertuples()):
+             with cols[i % num_cols]:
+                 st.markdown(f"##### Topic {row.Topic}")
+                 topic_words = topic_model.get_topic(row.Topic)
+ 
+                 # Multiselect options, e.g. ["word1 (0.123)", "word2 (0.122)"]
+                 formatted_options = [f"{word} ({score:.3f})" for word, score in topic_words[:15]]
+ 
+                 # Pre-select words that are already in the stopword list
+                 default_selection = []
+                 for formatted_word in formatted_options:
+                     word_part = formatted_word.split(' ')[0]
+                     if word_part in current_stopwords_set:
+                         default_selection.append(formatted_word)
+ 
+                 st.multiselect(
+                     f"Select words from Topic {row.Topic}",
+                     options=formatted_options,
+                     default=default_selection,
+                     key=f"multiselect_topic_{row.Topic}",
+                     on_change=sync_stopwords,  # The callback synchronizes everything
+                     label_visibility="collapsed"
+                 )
+ 
+     st.markdown('<h2 class="sub-header">📈 Topic Evolution</h2>', unsafe_allow_html=True)
+     if not results['general_evolution'].empty:
+         evo = results['general_evolution']
+ 
+         # 1. Filter out the outlier topic (-1) and ensure Timestamp is a datetime object
+         evo_filtered = evo[evo.Topic != -1].copy()
+         evo_filtered['Timestamp'] = pd.to_datetime(evo_filtered['Timestamp'])
+ 
+         if not evo_filtered.empty:
+             # 2. Pivot the data to get topics as columns and aggregate frequencies
+             evo_pivot = evo_filtered.pivot_table(
+                 index='Timestamp',
+                 columns='Topic',
+                 values='Frequency',
+                 aggfunc='sum'
+             ).fillna(0)
+ 
+             # 3. Dynamically choose a resampling frequency (hourly, daily, or weekly)
+             time_delta = evo_pivot.index.max() - evo_pivot.index.min()
+             if time_delta.days > 60:
+                 resample_freq, freq_label = 'W', 'Weekly'
+             elif time_delta.days > 5:
+                 resample_freq, freq_label = 'D', 'Daily'
+             else:
+                 resample_freq, freq_label = 'H', 'Hourly'
+ 
+             # Resample the data into the chosen time bins by summing the frequencies
+             evo_resampled = evo_pivot.resample(resample_freq).sum()
+ 
+             # 4. Create the line chart
+             fig_evo = px.line(
+                 evo_resampled,
+                 x=evo_resampled.index,
+                 y=evo_resampled.columns,
+                 title=f"Topic Frequency Over Time ({freq_label} Line Chart)",
+                 labels={'value': 'Total Frequency', 'variable': 'Topic ID', 'index': 'Time'},
+                 height=500
+             )
+             # Make the topic IDs in the legend categorical for better color mapping
+             fig_evo.for_each_trace(lambda t: t.update(name=str(t.name)))
+             fig_evo.update_layout(legend_title_text='Topic')
+ 
+             st.plotly_chart(fig_evo, use_container_width=True)
+         else:
+             st.info("No topic evolution data available to display (all posts may have been outliers).")
+     else:
+         st.warning("Could not compute topic evolution (requires more data points over time).")
+ 
+     st.markdown('<h2 class="sub-header">🧑‍🤝‍🧑 User Engagement Profile</h2>', unsafe_allow_html=True)
+ 
+     # 1. Keep ONLY posts from meaningful (non-outlier) topics.
+     df_meaningful = df[df['topic_id'] != -1].copy()
+ 
+     # 2. Get post counts based on this meaningful data.
+     meaningful_post_counts = df_meaningful.groupby('user_id').size().reset_index(name='post_count')
+ 
+     # 3. Merge with the Gini results (already calculated on meaningful topics).
+     #    An 'inner' merge keeps only users with at least one meaningful post.
+     user_metrics_df = pd.merge(
+         meaningful_post_counts,
+         results['gini_per_user'],
+         on='user_id',
+         how='inner'
+     )
+ 
+     # 4. Filter to include only users with more than one meaningful post.
+     metrics_to_plot = user_metrics_df[user_metrics_df['post_count'] > 1].copy()
+ 
+     total_meaningful_users = len(user_metrics_df)
+     st.info(f"Displaying engagement profile for {len(metrics_to_plot)} users out of {total_meaningful_users} who contributed to meaningful topics.")
+ 
+     # 5. Add jitter so overlapping points remain visible
+     jitter_strength = 0.02
+     metrics_to_plot['gini_jittered'] = metrics_to_plot['gini_coefficient'] + \
+         np.random.uniform(-jitter_strength, jitter_strength, size=len(metrics_to_plot))
+ 
+     # 6. Create the scatter plot from the filtered data.
+     fig = px.scatter(
+         metrics_to_plot,
+         x='post_count',
+         y='gini_jittered',
+         title='User Engagement Profile (based on posts in meaningful topics)',
+         labels={
+             'post_count': 'Number of Posts in Meaningful Topics',
+             'gini_jittered': 'Gini Index (Topic Diversity)'
+         },
+         custom_data=['user_id', 'gini_coefficient']
+     )
+     fig.update_traces(
+         marker=dict(opacity=0.5),
+         hovertemplate="<b>User</b>: %{customdata[0]}<br><b>Meaningful Posts</b>: %{x}<br><b>Gini (Original)</b>: %{customdata[1]:.3f}<extra></extra>"
+     )
+     fig.update_yaxes(range=[-0.05, 1.05])
+     st.plotly_chart(fig, use_container_width=True)
+ 
+     st.markdown('<h2 class="sub-header">👤 User Deep Dive</h2>', unsafe_allow_html=True)
+     selected_user = st.selectbox("Select a User to Analyze", options=sorted(df['user_id'].unique()), key="selected_user_dropdown")
+ 
+     if selected_user:
+         user_df = df[df['user_id'] == selected_user]
+         # Guard: users whose posts were all outliers have no entry in user_metrics_df
+         user_matches = user_metrics_df[user_metrics_df['user_id'] == selected_user]
+         user_gini_info = user_matches.iloc[0] if not user_matches.empty else pd.Series({'gini_coefficient': 0.0})
+ 
+         # Display the top-level metrics for the user first
+         c1, c2 = st.columns(2)
+         with c1: st.metric("Total Posts by User", len(user_df))
+         with c2: st.metric("Topic Diversity (Gini)", f"{user_gini_info['gini_coefficient']:.3f}", help=interpret_gini(user_gini_info['gini_coefficient']))
+ 
+         st.markdown("---")  # Visual separator
+ 
+         # Two-column layout for the charts
+         col1, col2 = st.columns(2)
+ 
+         with col1:
+             # --- Chart 1: Topic Distribution Pie Chart ---
+             user_topic_counts = user_df['topic_id'].value_counts().reset_index()
+             user_topic_counts.columns = ['topic_id', 'count']
+ 
+             fig_pie = px.pie(
+                 user_topic_counts[user_topic_counts.topic_id != -1],
+                 names='topic_id',
+                 values='count',
+                 title=f"Overall Topic Distribution for {selected_user}",
+                 hole=0.4
+             )
+             fig_pie.update_layout(margin=dict(l=0, r=0, t=40, b=0))
+             st.plotly_chart(fig_pie, use_container_width=True)
+ 
+         with col2:
+             # --- Chart 2: Topic Evolution for User ---
+             if len(user_df) > 1:
+                 user_evo_df = user_df[user_df['topic_id'] != -1].copy()
+                 user_evo_df['timestamp'] = pd.to_datetime(user_evo_df['timestamp'])
+ 
+                 if not user_evo_df.empty and user_evo_df['timestamp'].nunique() > 1:
+                     user_pivot = user_evo_df.pivot_table(index='timestamp', columns='topic_id', aggfunc='size', fill_value=0)
+ 
+                     time_delta = user_pivot.index.max() - user_pivot.index.min()
+                     if time_delta.days > 30: resample_freq = 'D'
+                     elif time_delta.days > 2: resample_freq = 'H'
+                     else: resample_freq = 'T'  # minute-level bins
+ 
+                     user_resampled = user_pivot.resample(resample_freq).sum()
+                     row_sums = user_resampled.sum(axis=1)
+                     user_proportions = user_resampled.div(row_sums, axis=0).fillna(0)
+ 
+                     topic_name_map = topic_info.set_index('Topic')['Name'].to_dict()
+                     user_proportions.rename(columns=topic_name_map, inplace=True)
+ 
+                     fig_user_evo = px.area(
+                         user_proportions,
+                         x=user_proportions.index,
+                         y=user_proportions.columns,
+                         title=f"Topic Proportion Over Time for {selected_user}",
+                         labels={'value': 'Topic Proportion', 'variable': 'Topic', 'index': 'Time'},
+                     )
+                     fig_user_evo.update_layout(margin=dict(l=0, r=0, t=40, b=0))
+                     st.plotly_chart(fig_user_evo, use_container_width=True)
+                 else:
+                     st.info("This user has no posts in meaningful topics or all posts occurred at the same time.")
+             else:
+                 st.info("Topic evolution requires more than one post to display.")
+ 
+         st.markdown("#### User's Most Recent Posts")
+         user_posts_table = user_df[['post_content', 'timestamp', 'topic_id']] \
+             .sort_values(by='timestamp', ascending=False) \
+             .head(100)
+         user_posts_table.columns = ['Post Content', 'Timestamp', 'Assigned Topic']
+         st.dataframe(user_posts_table, use_container_width=True)
+ 
+     with st.expander("Show User Distribution by Post Count"):
+         # 'user_metrics_df' is based on meaningful posts only
+         post_distribution = user_metrics_df['post_count'].value_counts().reset_index()
+         post_distribution.columns = ['Number of Posts', 'Number of Users']
+         post_distribution = post_distribution.sort_values(by='Number of Posts')
+ 
+         # Bar chart of the distribution
+         fig_dist = px.bar(
+             post_distribution,
+             x='Number of Posts',
+             y='Number of Users',
+             title='User Distribution by Number of Meaningful Posts'
+         )
+         st.plotly_chart(fig_dist, use_container_width=True)
+ 
+         # Raw data table
+         st.write("Data Table: User Distribution")
+         st.dataframe(post_distribution, use_container_width=True)
gini_calculator.py ADDED
@@ -0,0 +1,213 @@
+ # gini_calculator.py
+ 
+ import math
+ from typing import List
+ 
+ import pandas as pd
+ 
+ def calculate_gini(counts, *, min_posts=None, normalize=False):
+     """
+     Compute 1 - sum(p_i^2) where p_i are category probabilities (Gini Impurity).
+     Handles: list/tuple of counts, dict {cat: count}, numpy array, pandas Series.
+ 
+     Edge cases:
+     - total == 0 -> return float('nan')
+     - total == 1 -> return 0.0
+     - min_posts set and total < min_posts -> return float('nan')
+     - normalize=True -> divide by (1 - 1/k_nonzero) when k_nonzero > 1
+ 
+     Parameters
+     ----------
+     counts : Iterable[int] | dict | pandas.Series | numpy.ndarray
+         Nonnegative counts per category.
+     min_posts : int | None
+         If provided and total posts < min_posts, returns NaN.
+     normalize : bool
+         If True, returns Gini / (1 - 1/k_nonzero) for k_nonzero > 1.
+ 
+     Returns
+     -------
+     float
+     """
+     # Convert to a flat list of counts
+     if counts is None:
+         return float('nan')
+ 
+     if isinstance(counts, dict):
+         vals = list(counts.values())
+     else:
+         # Works for list/tuple/np.array/Series
+         try:
+             vals = list(counts)
+         except TypeError:
+             return float('nan')
+ 
+     # Validate & clean
+     vals = [float(v) for v in vals if v is not None and not math.isnan(v)]
+     if any(v < 0 for v in vals):
+         raise ValueError("Counts must be nonnegative.")
+     total = sum(vals)
+ 
+     # Edge cases
+     if total == 0:
+         return float('nan')
+     if min_posts is not None and total < min_posts:
+         return float('nan')
+     if total == 1:
+         base = 0.0
+     else:
+         # Compute 1 - sum p_i^2
+         s2 = sum((v / total) ** 2 for v in vals)
+         base = 1.0 - s2
+ 
+     if not normalize:
+         return base
+ 
+     # Normalize by the maximum possible diversity for the observed nonzero categories
+     k_nonzero = sum(1 for v in vals if v > 0)
+     if k_nonzero <= 1:
+         # Only one category has posts: diversity is 0 and normalization isn't defined
+         return 0.0
+     denom = 1.0 - 1.0 / k_nonzero
+     # Guard against tiny floating-point negatives
+     return max(0.0, min(1.0, base / denom))
+ 
+ 
+ def calculate_gini_per_user(df: pd.DataFrame, all_topics: List[int]):
+     """
+     Calculates the Gini Impurity for topic distribution per user.
+     A high value indicates high topic diversity.
+     """
+     user_gini = []
+     for user_id in df["user_id"].unique():
+         user_posts = df[df["user_id"] == user_id]
+         existing_topic_counts = user_posts["topic_id"].value_counts()
+         full_topic_counts = pd.Series(0, index=all_topics)
+         full_topic_counts.update(existing_topic_counts)
+         # normalize=True makes scores comparable across users
+         gini = calculate_gini(full_topic_counts.values, normalize=True)
+         user_gini.append({"user_id": user_id, "gini_coefficient": gini})
+     # calculate_gini returns NaN for users with zero counts, so fill with 0
+     return pd.DataFrame(user_gini).fillna(0)
+ 
+ 
+ def calculate_gini_per_topic(df: pd.DataFrame, all_users: List[str]):
+     """
+     Calculates the Gini Impurity for user distribution per topic.
+     A high value indicates the topic is discussed by a diverse set of users.
+     """
+     topic_gini = []
+     for topic_id in df["topic_id"].unique():
+         topic_posts = df[df["topic_id"] == topic_id]
+         existing_user_counts = topic_posts["user_id"].value_counts()
+         full_user_counts = pd.Series(0, index=all_users)
+         full_user_counts.update(existing_user_counts)
+         gini = calculate_gini(full_user_counts.values, normalize=True)
+         topic_gini.append({"topic_id": topic_id, "gini_coefficient": gini})
+     return pd.DataFrame(topic_gini).fillna(0)
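+ 
+ 
+ if __name__ == "__main__":
+     # Minimal usage sketch with illustrative data (not part of the app flow),
+     # mirroring the example that shipped with earlier revisions of this module.
+     demo = pd.DataFrame({
+         "user_id": ["userA", "userA", "userA", "userB", "userB", "userC", "userC", "userC", "userC", "userD"],
+         "topic_id": [1, 1, 2, 1, 3, 2, 2, 3, 4, 1],
+     })
+     topics = sorted(demo["topic_id"].unique().tolist())
+     users = sorted(demo["user_id"].unique().tolist())
+     print("Gini per user:")
+     print(calculate_gini_per_user(demo, all_topics=topics))
+     print("\nGini per topic:")
+     print(calculate_gini_per_topic(demo, all_users=users))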
narrative_similarity.py ADDED
@@ -0,0 +1,30 @@
+ # narrative_similarity.py
+ 
+ import pandas as pd
+ from sklearn.metrics.pairwise import cosine_similarity
+ 
+ def calculate_narrative_similarity(df: pd.DataFrame):
+     """
+     Calculates the narrative overlap between users based on their topic distributions.
+ 
+     Args:
+         df (pd.DataFrame): DataFrame containing 'user_id' and 'topic_id' columns.
+ 
+     Returns:
+         pd.DataFrame: A square DataFrame where rows and columns are user_ids
+                       and values are the cosine similarity of their topic distributions.
+     """
+     # 1. Filter out outlier posts for a more meaningful similarity score.
+     #    Note: users whose posts are all outliers are dropped from the result.
+     df_meaningful = df[df['topic_id'] != -1]
+ 
+     # 2. Create the "narrative vector" for each user
+     #    Rows: user_id, Columns: topic_id, Values: count of posts
+     user_topic_matrix = pd.crosstab(df_meaningful['user_id'], df_meaningful['topic_id'])
+ 
+     # 3. Calculate pairwise cosine similarity between all users
+     similarity_matrix = cosine_similarity(user_topic_matrix)
+ 
+     # 4. Convert the result back to a DataFrame with user_ids as labels
+     similarity_df = pd.DataFrame(similarity_matrix, index=user_topic_matrix.index, columns=user_topic_matrix.index)
+ 
+     return similarity_df
readme.md ADDED
@@ -0,0 +1,138 @@
+ # Social Media Topic Modeling System
+ 
+ A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini-based diversity analysis, topic evolution tracking, and semantic narrative overlap detection.
+ 
+ ## Features
+ 
+ - **📊 Topic Modeling**: Uses BERTopic for state-of-the-art, transformer-based topic modeling.
+ - **⚙️ Flexible Configuration**:
+   - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
+   - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
+ - **🌍 Multilingual Support**: Handles English and 50+ other languages using appropriate language models.
+ - **📈 Gini Index Analysis**: Calculates topic and user diversity.
+ - **⏰ Topic Evolution**: Tracks how topic popularity and user interests change over time with interactive charts.
+ - **🤝 Narrative Overlap Analysis**: Identifies users with semantically similar posting patterns (shared narratives), even when their wording differs.
+ - **✍️ Interactive Topic Refinement**: Fine-tune topic quality by adding words to a custom stopword list directly from the dashboard.
+ - **🎯 Interactive Visualizations**: A rich dashboard with built-in charts and data tables using Plotly.
+ - **📱 Responsive Interface**: Clean, modern Streamlit interface with a control panel for all settings.
+ 
+ ## Requirements
+ 
+ ### CSV File Format
+ 
+ Your CSV file must contain columns that can be mapped to the following roles:
+ - **User ID**: A column with unique identifiers for each user (string).
+ - **Post Content**: A column with the text content of the social media post (string).
+ - **Timestamp**: A column with the date and time of the post.
+ 
+ The application will prompt you to select the correct column for each role after you upload your file.
+ 
+ #### A Note on Timestamp Formatting
+ 
+ The application can automatically parse many common date and time formats thanks to Pandas. To ensure correct parsing and avoid errors, please follow these guidelines for your timestamp column:
+ 
+ * **Best Practice (Recommended):** Use a standard, unambiguous format like ISO 8601.
+   - `YYYY-MM-DD HH:MM:SS` (e.g., `2023-10-27 15:30:00`)
+   - `YYYY-MM-DDTHH:MM:SS` (e.g., `2023-10-27T15:30:00`)
+ 
+ * **Supported Formats:** Most common formats will work, including:
+   - `MM/DD/YYYY HH:MM` (e.g., `10/27/2023 15:30`)
+   - `DD/MM/YYYY HH:MM` (e.g., `27/10/2023 15:30`)
+   - `Month D, YYYY` (e.g., `October 27, 2023`)
+ 
+ * **Potential Issues to Avoid:**
+   - **Ambiguous formats:** A date like `01/02/2023` can be read as either Jan 2nd or Feb 1st. Using a `YYYY-MM-DD` format avoids this; you can also verify parsing with the sketch below.
+   - **Mixed formats in one column:** Ensure all timestamps in your column follow the same format for reliable parsing.
+   - **Timezone information:** Formats with timezone offsets (e.g., `2023-10-27 15:30:00+05:30`) are supported.
+ 
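+ As a quick sanity check before uploading, you can verify how your timestamps parse with plain pandas (a minimal sketch, independent of this app):
+ 
+ ```python
+ import pandas as pd
+ 
+ # Unambiguous ISO 8601 timestamps parse directly
+ s = pd.Series(["2023-10-27 15:30:00", "2023-10-28 09:00:00"])
+ print(pd.to_datetime(s))
+ 
+ # For ambiguous day/month formats, pass an explicit format string
+ s2 = pd.Series(["01/02/2023 15:30"])
+ print(pd.to_datetime(s2, format="%d/%m/%Y %H:%M"))  # parsed as 1 Feb 2023
+ ```
+ 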
+ ### Dependencies
+ 
+ See `requirements.txt` for a full list of dependencies.
+ 
+ ## Installation
+ 
+ ### Option 1: Local Installation
+ 
+ 1. **Clone or download the project files.**
+ 2. **Install dependencies:**
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 3. **Download spaCy models:**
+    ```bash
+    python -m spacy download en_core_web_sm
+    python -m spacy download xx_ent_wiki_sm
+    ```
+ 
+ ### Option 2: Docker Installation (Recommended)
+ 
+ 1. **Using Docker Compose (easiest):**
+    ```bash
+    docker-compose up --build
+    ```
+ 2. **Access the application:**
+    Open your browser and go to `http://localhost:8501`.
+ 
+ ## Usage
+ 
+ 1. **Start the Streamlit application:**
+    ```bash
+    streamlit run app.py
+    ```
+ 2. **Open your browser** and navigate to the local URL provided by Streamlit (usually `http://localhost:8501`).
+ 3. **Follow the steps in the application:**
+    - **1. Upload CSV File**: Click "Browse files" to upload your dataset.
+    - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
+    - **3. Configure Analysis**:
+      - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
+      - **Number of Topics**: Enter a specific number of meaningful topics to find, or use `-1` to let the model decide automatically.
+      - **Text Preprocessing**: Expand the advanced options to select cleaning steps like lowercasing, punctuation removal, and more.
+      - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
+    - **4. Run Analysis**: Click the "🚀 Run Full Analysis" button.
+ 
+ 4. **Explore the results** in the interactive sections of the main panel.
+ 
+ ### Exploring the Interface
+ 
+ The application provides a series of detailed sections:
+ 
+ #### 📋 Overview & Preprocessing
+ - Key metrics (total posts, unique users), dataset time range, and a topic coherence score.
+ - A sample of your data showing the original and processed text.
+ 
+ #### 🎯 Topic Visualization & Refinement
+ - **Word Clouds**: Visual representation of the most important words for top topics.
+ - **Interactive Word Lists**: Interactively select words from topic lists to add them to your custom stopwords for re-analysis.
+ 
+ #### 📈 Topic Evolution
+ - An interactive line chart showing how topic frequencies change over the entire dataset's timespan.
+ 
+ #### 🧑‍🤝‍🧑 User Engagement Profile
+ - A scatter plot visualizing the relationship between the number of posts a user makes and the diversity of their topics.
+ - An expandable section showing the distribution of users by their post count.
+ 
+ #### 👤 User Deep Dive
+ - Select a specific user to analyze.
+ - View their key metrics, overall topic distribution pie chart, and their personal topic evolution over time.
+ - See detailed tables of their topic breakdown and their most recent posts.
+ 
+ #### 🤝 Narrative Overlap Analysis
+ - Select a user to find other users who discuss a similar mix of topics.
+ - Use the slider to adjust the similarity threshold.
+ - The results table shows the overlap score and post count of similar users, providing context on both narrative alignment and engagement level.
+ 
+ ## Understanding the Results
+ 
+ ### Gini Impurity Index
+ This application uses the **Gini Impurity Index**, a measure of diversity.
+ - **Range**: 0 to 1
+ - **User Gini (Topic Diversity)**: Measures how diverse a user's topics are. **0** = perfectly specialized (posts on only one topic), **1** = perfectly diverse (posts spread evenly across all topics).
+ - **Topic Gini (User Diversity)**: Measures how concentrated a topic is among users. **0** = dominated by a single user, **1** = widely and evenly discussed by many users.
+ 
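+ The score mirrors `calculate_gini` in `gini_calculator.py`; here is a stripped-down sketch of the core formula (without that function's edge-case handling):
+ 
+ ```python
+ def gini_impurity(counts):
+     """1 - sum(p_i**2): 0 = all posts in one category, higher = more evenly spread."""
+     total = sum(counts)
+     return 1.0 - sum((c / total) ** 2 for c in counts)
+ 
+ print(gini_impurity([10, 0, 0]))    # 0.0   (perfectly specialized)
+ print(gini_impurity([5, 5]))        # 0.5
+ print(gini_impurity([1, 1, 1, 1]))  # 0.75  (even spread over 4 topics)
+ ```
+ 
+ The app additionally normalizes by the maximum impurity achievable with the observed number of non-empty categories, so an even spread scores 1.0 regardless of how many topics are involved.
+ 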
+ ### Narrative Overlap Score
+ - **Range**: 0 to 1
+ - This score measures the **cosine similarity** between the topic distributions of two users.
+ - A score of **1.0** means the two users have an identical proportional interest in topics (e.g., both are 100% focused on Topic 3).
+ - A score of **0.0** means their topic interests are completely different.
+ - This helps identify users with similar narrative focus, regardless of their total post count. The sketch below shows the computation.
+ 
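+ A minimal sketch of how the score is computed, matching `calculate_narrative_similarity` in `narrative_similarity.py` (the data here is illustrative):
+ 
+ ```python
+ import pandas as pd
+ from sklearn.metrics.pairwise import cosine_similarity
+ 
+ df = pd.DataFrame({
+     "user_id":  ["a", "a", "a", "b", "b", "c"],
+     "topic_id": [0,   0,   1,   0,   1,   2],
+ })
+ matrix = pd.crosstab(df["user_id"], df["topic_id"])  # users x topics post counts
+ sim = cosine_similarity(matrix)
+ print(pd.DataFrame(sim, index=matrix.index, columns=matrix.index).round(2))
+ # "a" (a 2:1 mix of topics 0 and 1) scores ~0.95 against "b" (a 1:1 mix),
+ # while "c" (only topic 2) has zero overlap with both.
+ ```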
requirements.txt CHANGED
@@ -1,3 +1,19 @@
- altair
- pandas
- streamlit
+ streamlit>=1.17.0
+ bertopic[all]>=0.16.0
+ pandas>=2.0.0
+ numpy>=1.20.0
+ plotly>=5.0.0
+ transformers>=4.21.0
+ sentence-transformers>=2.2.0
+ scikit-learn>=1.0.0
+ hdbscan>=0.8.29
+ umap-learn>=0.5.0
+ torch>=1.11.0
+ matplotlib>=3.5.0
+ seaborn>=0.11.0
+ gensim>=4.3.0
+ nltk>=3.8.0
+ wordcloud>=1.9.0
+ emoji>=2.2.0
+ spacy>=3.4.0
+ pyinstaller
text_preprocessor.py ADDED
@@ -0,0 +1,123 @@
+ import re
+ import pandas as pd
+ import spacy
+ import emoji
+ from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
+ from spacy.util import compile_infix_regex
+ from pathlib import Path
+ 
+ from resource_path import resource_path
+ 
+ 
+ class MultilingualPreprocessor:
+     """
+     A robust text preprocessor using spaCy for multilingual support.
+     """
+     def __init__(self, language: str):
+         """
+         Initializes the preprocessor and loads the appropriate spaCy model.
+ 
+         Args:
+             language (str): 'english' or 'multilingual'.
+         """
+         model_map = {
+             'english': resource_path('en_core_web_sm'),
+             'multilingual': resource_path('xx_ent_wiki_sm')
+         }
+         self.model_name = model_map.get(language, resource_path('xx_ent_wiki_sm'))  # string path to the model
+ 
+         try:
+             # spaCy's loader expects a Path object, not a plain string
+             model_path_obj = Path(self.model_name)
+             self.nlp = spacy.util.load_model_from_path(model_path_obj)
+         except OSError:
+             # If this fails, the bundled spaCy model directories are missing or broken
+             print(f"spaCy Model Error: Could not load model from path: {self.model_name}")
+             print("This indicates a problem with the PyInstaller bundling of the spaCy models.")
+             raise  # Re-raise so the Streamlit app can surface the error
+ 
+         # Customize the tokenizer: omit the default hyphen infix rules so
+         # hyphenated words are kept together (CONCAT_QUOTES must be wrapped in a list)
+         infixes = LIST_ELLIPSES + LIST_ICONS + [CONCAT_QUOTES]
+         infix_regex = compile_infix_regex(infixes)
+         self.nlp.tokenizer.infix_finditer = infix_regex.finditer
+ 
+     def preprocess_series(self, text_series: pd.Series, options: dict, n_process_spacy: int = -1) -> pd.Series:
+         """
+         Applies a series of cleaning steps to a pandas Series of text.
+ 
+         Args:
+             text_series (pd.Series): The text to be cleaned.
+             options (dict): A dictionary of preprocessing options.
+             n_process_spacy (int): Number of processes for spaCy's nlp.pipe (-1 = all cores).
+ 
+         Returns:
+             pd.Series: The cleaned text Series.
+         """
+         # --- Stage 1: Fast, regex-based cleaning ---
+         processed_text = text_series.copy().astype(str)
+         if options.get("remove_html"):
+             processed_text = processed_text.str.replace(r"<.*?>", "", regex=True)
+         if options.get("remove_urls"):
+             processed_text = processed_text.str.replace(r"http\S+|www\.\S+", "", regex=True)
+ 
+         emoji_option = options.get("handle_emojis", "Keep Emojis")
+         if emoji_option == "Remove Emojis":
+             processed_text = processed_text.apply(lambda s: emoji.replace_emoji(s, replace=''))
+         elif emoji_option == "Convert Emojis to Text":
+             processed_text = processed_text.apply(emoji.demojize)
+ 
+         if options.get("handle_hashtags") == "Remove Hashtags":
+             processed_text = processed_text.str.replace(r"#\w+", "", regex=True)
+         if options.get("handle_mentions") == "Remove Mentions":
+             processed_text = processed_text.str.replace(r"@\w+", "", regex=True)
+ 
+         # --- Stage 2: spaCy-based advanced processing ---
+         # nlp.pipe streams the Series efficiently in batches
+         cleaned_docs = []
+         docs = self.nlp.pipe(processed_text, n_process=n_process_spacy, batch_size=500)
+ 
+         # Custom stopwords as a lowercase set for fast lookups
+         custom_stopwords = set(options.get("custom_stopwords", []))
+ 
+         for doc in docs:
+             tokens = []
+             for token in doc:
+                 # Punctuation and number handling
+                 if options.get("remove_punctuation") and token.is_punct:
+                     continue
+                 if options.get("remove_numbers") and (token.is_digit or token.like_num):
+                     continue
+ 
+                 # Stopword handling (including custom stopwords)
+                 is_stopword = token.is_stop or token.text.lower() in custom_stopwords
+                 if options.get("remove_stopwords") and is_stopword:
+                     continue
+ 
+                 # Use the lemma if lemmatization is on, otherwise the original text
+                 token_text = token.lemma_ if options.get("lemmatize") else token.text
+ 
+                 # Lowercasing
+                 if options.get("lowercase"):
+                     token_text = token_text.lower()
+ 
+                 # Remove any leftover special characters
+                 if options.get("remove_special_chars"):
+                     token_text = re.sub(r'[^\w\s-]', '', token_text)
+ 
+                 if token_text.strip():
+                     tokens.append(token_text.strip())
+ 
+             cleaned_docs.append(" ".join(tokens))
+ 
+         return pd.Series(cleaned_docs, index=text_series.index)
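+ 
+ 
+ if __name__ == "__main__":
+     # Minimal usage sketch (illustrative): it requires the spaCy model
+     # directories that resource_path() resolves, e.g. a downloaded en_core_web_sm.
+     pre = MultilingualPreprocessor(language="english")
+     sample = pd.Series(["Check out https://example.com #AI @friend 🙂 It is great!"])
+     opts = {
+         "remove_urls": True,
+         "handle_hashtags": "Remove Hashtags",
+         "handle_mentions": "Remove Mentions",
+         "handle_emojis": "Remove Emojis",
+         "lowercase": True,
+         "remove_punctuation": True,
+     }
+     print(pre.preprocess_series(sample, opts, n_process_spacy=1).iloc[0])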
topic_evolution.py ADDED
@@ -0,0 +1,100 @@
+ import pandas as pd
+ 
+ 
+ def analyze_general_topic_evolution(topic_model, docs, timestamps):
+     """
+     Analyzes general topic evolution over time.
+ 
+     Args:
+         topic_model: Trained BERTopic model.
+         docs (list): List of documents.
+         timestamps (list): List of timestamps corresponding to the documents.
+ 
+     Returns:
+         pd.DataFrame: DataFrame with topic evolution information.
+     """
+     try:
+         topics_over_time = topic_model.topics_over_time(docs, timestamps, global_tuning=True)
+         return topics_over_time
+     except Exception:
+         # Fallback for small datasets or cases where evolution can't be computed
+         return pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
+ 
+ 
+ def analyze_user_topic_evolution(df: pd.DataFrame, topic_model):
+     """
+     Analyzes topic evolution per user.
+ 
+     Args:
+         df (pd.DataFrame): DataFrame with "user_id", "post_content", "timestamp", and "topic_id" columns.
+         topic_model: Trained BERTopic model.
+ 
+     Returns:
+         dict: A dictionary where keys are user_ids and values are DataFrames of topic evolution for that user.
+     """
+     user_topic_evolution = {}
+     for user_id in df["user_id"].unique():
+         user_df = df[df["user_id"] == user_id].copy()
+         if not user_df.empty and len(user_df) > 1:
+             try:
+                 # Ensure timestamps are sorted for topics_over_time
+                 user_df = user_df.sort_values(by="timestamp")
+                 docs = user_df["post_content"].tolist()
+                 timestamps = user_df["timestamp"].tolist()
+                 selected_topics = user_df["topic_id"].tolist()  # topic_ids for the user's posts
+                 topics_over_time = topic_model.topics_over_time(docs, timestamps, topics=selected_topics, global_tuning=True)
+                 user_topic_evolution[user_id] = topics_over_time
+             except Exception:
+                 user_topic_evolution[user_id] = pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
+         else:
+             user_topic_evolution[user_id] = pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
+     return user_topic_evolution
+ 
+ 
+ if __name__ == "__main__":
+     from topic_modeling import perform_topic_modeling
+ 
+     # Example usage:
+     data = {
+         "user_id": ["user1", "user2", "user1", "user3", "user2", "user1", "user4", "user3", "user2", "user1", "user5", "user4", "user3", "user2", "user1"],
+         "post_content": [
+             "This is a great movie, I loved the acting and the plot. It was truly captivating.",
+             "The new phone has an amazing camera and long battery life. Highly recommend it.",
+             "I enjoyed the film, especially the special effects and the soundtrack. A must-watch.",
+             "Learning about AI and machine learning is fascinating. The future is here.",
+             "My old phone is so slow, I need an upgrade soon. Thinking about the latest model.",
+             "The best part of the movie was the soundtrack and the stunning visuals. Very immersive.",
+             "Exploring the vastness of space is a lifelong dream. Astronomy is amazing.",
+             "Data science is revolutionizing industries. Predictive analytics is key.",
+             "I need a new laptop for work. Something powerful and portable.",
+             "Just finished reading a fantastic book on quantum physics. Mind-blowing concepts.",
+             "Cooking new recipes is my passion. Today, I tried a spicy Thai curry.",
+             "The universe is full of mysteries. Black holes and dark matter are intriguing.",
+             "Deep learning models are becoming incredibly sophisticated. Image recognition is impressive.",
+             "My current laptop is crashing frequently. Time for an upgrade.",
+             "Science fiction movies always make me think about the future of humanity."
+         ],
+         "timestamp": [
+             "2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-02 10:30:00",
+             "2023-01-02 14:00:00", "2023-01-03 09:00:00", "2023-01-03 16:00:00",
+             "2023-01-04 08:00:00", "2023-01-04 12:00:00", "2023-01-05 10:00:00",
+             "2023-01-05 15:00:00", "2023-01-06 09:30:00", "2023-01-06 13:00:00",
+             "2023-01-07 11:00:00", "2023-01-07 14:30:00", "2023-01-08 10:00:00"
+         ]
+     }
+     df = pd.DataFrame(data)
+     df["timestamp"] = pd.to_datetime(df["timestamp"])
+ 
+     print("Performing topic modeling (English)...")
+     model_en, topics_en, probs_en, coherence_en = perform_topic_modeling(df["post_content"].tolist(), language="english")
+     df["topic_id"] = topics_en
+ 
+     print("\nAnalyzing general topic evolution...")
+     general_evolution_df = analyze_general_topic_evolution(model_en, df["post_content"].tolist(), df["timestamp"].tolist())
+     print(general_evolution_df.head())
+ 
+     print("\nAnalyzing per user topic evolution...")
+     user_evolution_dict = analyze_user_topic_evolution(df, model_en)
+     for user_id, evolution_df in user_evolution_dict.items():
+         print(f"\nTopic evolution for {user_id}:")
+         print(evolution_df.head())
topic_modeling.py ADDED
@@ -0,0 +1,81 @@
+ # topic_modeling.py
+ 
+ import nltk
+ from bertopic import BERTopic
+ from gensim.corpora import Dictionary
+ from gensim.models import CoherenceModel
+ from nltk.tokenize import word_tokenize
+ from typing import List
+ from sklearn.feature_extraction.text import CountVectorizer
+ 
+ # word_tokenize needs NLTK tokenizer data; fetch it quietly if missing
+ for _pkg in ('punkt', 'punkt_tab'):
+     try:
+         nltk.data.find(f'tokenizers/{_pkg}')
+     except LookupError:
+         nltk.download(_pkg, quiet=True)
+ 
+ def perform_topic_modeling(
+     docs: List[str],
+     language: str = "english",
+     nr_topics=None,
+     remove_stopwords_bertopic: bool = False,  # If True, stopwords are removed inside BERTopic
+     custom_stopwords: List[str] = None
+ ):
+     """
+     Performs topic modeling on a list of documents.
+ 
+     Args:
+         docs (List[str]): A list of documents. Stopwords should be INCLUDED for best results.
+         language (str): Language for the BERTopic model ('english', 'multilingual').
+         nr_topics: The number of topics to find ("auto" or an int).
+         remove_stopwords_bertopic (bool): If True, stopwords will be removed internally by BERTopic.
+         custom_stopwords (List[str]): A list of custom stopwords to use.
+ 
+     Returns:
+         tuple: BERTopic model, topics, probabilities, and coherence score.
+     """
+     vectorizer_model = None  # Default to no custom vectorizer
+ 
+     if remove_stopwords_bertopic:
+         stop_words_list = []
+         if language == "english":
+             # Start with the built-in English stopword list from scikit-learn
+             from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
+             stop_words_list = list(ENGLISH_STOP_WORDS)
+ 
+         # Add any custom stopwords provided by the user
+         if custom_stopwords:
+             stop_words_list.extend(custom_stopwords)
+ 
+         # Only create a vectorizer if there are stopwords to apply
+         if stop_words_list:
+             vectorizer_model = CountVectorizer(stop_words=stop_words_list)
+ 
+     # Instantiate BERTopic; 'english' and 'multilingual' are both valid language settings
+     topic_model = BERTopic(language=language, nr_topics=nr_topics, vectorizer_model=vectorizer_model)
+ 
+     # The docs passed here should still contain stopwords so the embedding model sees full sentences
+     topics, probs = topic_model.fit_transform(docs)
+ 
+     # --- Calculate Coherence Score (c_v, via gensim) ---
+     tokenized_docs = [word_tokenize(doc) for doc in docs]
+     dictionary = Dictionary(tokenized_docs)
+     topic_words = topic_model.get_topics()
+     topics_for_coherence = []
+     for topic_id in sorted(topic_words.keys()):
+         if topic_id != -1:
+             words = [word for word, _ in topic_model.get_topic(topic_id)]
+             topics_for_coherence.append(words)
+     coherence_score = None
+     if topics_for_coherence and tokenized_docs:
+         try:
+             coherence_model = CoherenceModel(
+                 topics=topics_for_coherence,
+                 texts=tokenized_docs,
+                 dictionary=dictionary,
+                 coherence='c_v'
+             )
+             coherence_score = coherence_model.get_coherence()
+         except Exception as e:
+             print(f"Could not calculate coherence score: {e}")
+ 
+     return topic_model, topics, probs, coherence_score