Mars203020 committed
Commit 54d4f91 · verified · 1 Parent(s): e954ef8

Upload 9 files

Social Media Topic Modeling System.md ADDED
@@ -0,0 +1,99 @@
+ # Social Media Topic Modeling System
+ 
+ A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini-based diversity analysis, and topic evolution analysis.
+ 
+ ## Features
+ 
+ - **📊 Topic Modeling**: Uses BERTopic for state-of-the-art topic modeling.
+ - **⚙️ Flexible Configuration**:
+   - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
+   - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
+ - **🌍 Multilingual Support**: Handles English and 50+ other languages.
+ - **📈 Gini Analysis**: Calculates topic diversity per user and user diversity per topic.
+ - **⏰ Topic Evolution**: Tracks how topics change over time.
+ - **🎯 Interactive Visualizations**: Built-in charts and data tables using Plotly.
+ - **📱 Responsive Interface**: Clean, modern Streamlit interface with a control panel for all settings.
+ 
+ ## Requirements
+ 
+ ### CSV File Format
+ 
+ Your CSV file must contain columns that can be mapped to the following roles:
+ - **User ID**: A column with unique identifiers for each user (string).
+ - **Post Content**: A column with the text content of the social media post (string).
+ - **Timestamp**: A column with the date and time of the post (e.g., "2023-01-15 14:30:00").
+ 
+ The application will prompt you to select the correct column for each role after you upload your file.
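+ 
+ For reference, a minimal CSV (the column names here are hypothetical; any names work, since you map them in the UI) might look like this:
+ 
+ ```csv
+ author,text,created_at
+ u_001,"Loving the new camera on this phone!",2023-01-15 14:30:00
+ u_002,"The movie's plot twist was wild.",2023-01-15 18:05:00
+ u_001,"Battery life could be better though.",2023-01-16 09:12:00
+ ```
+ 
+ Here `author`, `text`, and `created_at` would be mapped to User ID, Post Content, and Timestamp.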
+ 
+ ### Dependencies
+ 
+ See `requirements.txt` for a full list of dependencies.
+ 
+ ## Installation
+ 
+ ### Option 1: Local Installation
+ 
+ 1. **Clone or download the project files.**
+ 2. **Install dependencies:**
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 
+ ### Option 2: Docker Installation (Recommended)
+ 
+ 1. **Using Docker Compose (easiest):**
+    ```bash
+    docker-compose up --build
+    ```
+ 2. **Access the application:**
+    ```
+    http://localhost:8501
+    ```
+ 
+ ## Usage
+ 
+ 1. **Start the Streamlit application:**
+    ```bash
+    streamlit run app.py
+    ```
+ 2. **Open your browser** and navigate to `http://localhost:8501`.
+ 3. **Follow the steps in the application:**
+    - **1. Upload CSV File**: Click "Browse files" to upload your dataset.
+    - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
+    - **3. Configure Analysis**:
+      - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
+      - **Number of Topics**: Enter a specific number of topics to find, or use `-1` to let the model decide automatically.
+      - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
+    - **4. Run Analysis**: Click the "🚀 Run Full Analysis" button.
+ 
+ 4. **Explore the results** in the five interactive tabs in the main panel.
+ 
+ ### Using the Interface
+ 
+ The application provides five main tabs:
+ 
+ #### 📋 Overview
+ - Key metrics, dataset preview, and average Gini score.
+ 
+ #### 🎯 Topics
+ - Topic information table and topic distribution bar chart.
+ 
+ #### 📊 Gini Analysis
+ - Analysis of topic diversity for each user and user concentration for each topic.
+ 
+ #### 📈 Topic Evolution
+ - Timelines showing how topic popularity changes over time, for all users and for individual users.
+ 
+ #### 📄 Documents
+ - A detailed view of your original data with assigned topics and probabilities.
+ 
+ ## Understanding the Results
+ 
+ ### Gini Impurity Index
+ This application reports the **Gini Impurity Index**, a measure of diversity.
+ - **Range**: 0 to 1
+ - **User Gini**: Measures how diverse a user's topics are. **0** = perfectly specialized (posts on one topic), **1** = perfectly diverse (posts spread evenly across many topics).
+ - **Topic Gini**: Measures how broadly a topic is discussed. **0** = dominated by a single user, **1** = widely and evenly discussed by many users.
+ 
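+ For intuition: if a user's 10 posts split 7/2/1 across three topics, the unnormalized impurity is 1 - ((7/10)² + (2/10)² + (1/10)²) = 1 - 0.54 = 0.46, a moderately focused profile.
+ 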
+ ---
+ 
+ **Built with ❤️ using Streamlit and BERTopic**
app.py ADDED
@@ -0,0 +1,534 @@
+ import streamlit as st
+ import pandas as pd
+ import numpy as np
+ 
+ import plotly.express as px
+ from wordcloud import WordCloud
+ import matplotlib.pyplot as plt
+ 
+ # Import custom modules
+ from text_preprocessor import MultilingualPreprocessor
+ from topic_modeling import perform_topic_modeling
+ from gini_calculator import calculate_gini_per_user, calculate_gini_per_topic
+ from topic_evolution import analyze_general_topic_evolution
+ from narrative_similarity import calculate_narrative_similarity
+ 
+ # --- Page Configuration ---
+ st.set_page_config(
+     page_title="Social Media Topic Modeling System",
+     page_icon="📊",
+     layout="wide",
+ )
+ 
+ # --- Custom CSS ---
+ st.markdown("""
+ <style>
+     .main-header { font-size: 2.5rem; color: #1f77b4; text-align: center; margin-bottom: 1rem; }
+     .sub-header { font-size: 1.75rem; color: #2c3e50; border-bottom: 2px solid #f0f2f6; padding-bottom: 0.3rem; margin-top: 2rem; margin-bottom: 1rem;}
+ </style>
+ """, unsafe_allow_html=True)
+ 
+ # --- Session State Initialization ---
+ if 'results' not in st.session_state:
+     st.session_state.results = None
+ if 'df_raw' not in st.session_state:
+     st.session_state.df_raw = None
+ if 'custom_stopwords_text' not in st.session_state:
+     st.session_state.custom_stopwords_text = ""
+ if "topics_info_for_sync" not in st.session_state:
+     st.session_state.topics_info_for_sync = []
+ 
+ 
+ # --- Helper Functions ---
+ @st.cache_data
+ def create_word_cloud(_topic_model, topic_id):
+     word_freq = _topic_model.get_topic(topic_id)
+     if not word_freq: return None
+     wc = WordCloud(width=800, height=400, background_color="white", colormap="viridis", max_words=50).generate_from_frequencies(dict(word_freq))
+     fig, ax = plt.subplots(figsize=(10, 5))
+     ax.imshow(wc, interpolation='bilinear')
+     ax.axis("off")
+     plt.close(fig)
+     return fig
+ 
+ 
+ def interpret_gini(gini_score):
+     # Higher Gini Impurity means a more diverse topic mix
+     if gini_score >= 0.6: return "🌐 Diverse Interests"
+     elif gini_score >= 0.3: return "🎯 Moderately Focused"
+     else: return "🔥 Highly Specialized"
+ 
+ # --- Centralized stopword-sync callback ---
+ def sync_stopwords():
+     """
+     Single source of truth for updating stopwords.
+     Called whenever any related widget changes.
+     """
+     # 1. Get words from all multiselect lists
+     selected_from_lists = set()
+     for topic_id in st.session_state.topics_info_for_sync:
+         key = f"multiselect_topic_{topic_id}"
+         if key in st.session_state:
+             selected_from_lists.update([s.split(' ')[0] for s in st.session_state[key]])
+ 
+     # 2. Get words from the text area
+     # (the text area's widget key is the master state variable itself)
+     typed_stopwords = set([s.strip() for s in st.session_state.custom_stopwords_text.split(',') if s])
+ 
+     # 3. Combine them and update the master state variable
+     combined_stopwords = typed_stopwords.union(selected_from_lists)
+     st.session_state.custom_stopwords_text = ", ".join(sorted(list(combined_stopwords)))
+ 
+ 
+ # --- Main Page Layout ---
+ st.title("🌍 Multilingual Topic Modeling Dashboard")
+ st.markdown("Analyze textual data in multiple languages to discover topics and user trends.")
+ 
+ # Use a key so the file uploader keeps its state across reruns
+ uploaded_file = st.file_uploader("Upload your CSV data", type="csv", key="csv_uploader")
+ 
+ # Check whether a new file has been uploaded (or a file exists on first run)
+ if uploaded_file is not None and uploaded_file != st.session_state.get('last_uploaded_file', None):
+     try:
+         st.session_state.df_raw = pd.read_csv(uploaded_file)
+         st.session_state.results = None  # Reset results if a new file is uploaded
+         st.session_state.custom_stopwords_text = ""
+         st.session_state.last_uploaded_file = uploaded_file  # Remember the uploaded file
+         st.success("CSV file loaded successfully!")
+     except Exception as e:
+         st.error(f"Could not read CSV file. Error: {e}")
+         st.session_state.df_raw = None
+         st.session_state.last_uploaded_file = None
+ 
+ if st.session_state.df_raw is not None:
+     df_raw = st.session_state.df_raw
+     col1, col2, col3 = st.columns(3)
+ 
+     with col1: user_id_col = st.selectbox("User ID Column", df_raw.columns, index=0, key="user_id_col")
+     with col2: post_content_col = st.selectbox("Post Content Column", df_raw.columns, index=1, key="post_content_col")
+     with col3: timestamp_col = st.selectbox("Timestamp Column", df_raw.columns, index=2, key="timestamp_col")
+ 
+     st.subheader("Topic Modeling Settings")
+     lang_col, topics_col = st.columns(2)
+     with lang_col: language = st.selectbox("Language Model", ["english", "multilingual"], key="language_model")
+     with topics_col: num_topics = st.number_input("Number of Topics", -1, help="Use -1 for automatic detection", key="num_topics")
+ 
+     with st.expander("Advanced: Text Cleaning & Preprocessing Options", expanded=False):
+         c1, c2 = st.columns(2)
+         with c1:
+             opts = {
+                 'lowercase': st.checkbox("Convert to Lowercase", True, key="opt_lowercase"),
+                 'lemmatize': st.checkbox("Lemmatize words", False, key="opt_lemmatize"),
+                 'remove_urls': st.checkbox("Remove URLs", False, key="opt_remove_urls"),
+                 'remove_html': st.checkbox("Remove HTML Tags", False, key="opt_remove_html")
+             }
+         with c2:
+             opts.update({
+                 'remove_special_chars': st.checkbox("Remove Special Characters", False, key="opt_remove_special_chars"),
+                 'remove_punctuation': st.checkbox("Remove Punctuation", False, key="opt_remove_punctuation"),
+                 'remove_numbers': st.checkbox("Remove Numbers", False, key="opt_remove_numbers")
+             })
+         st.markdown("---")
+         c1_emoji, c2_hashtag, c3_mention = st.columns(3)
+         with c1_emoji: opts['handle_emojis'] = st.radio("Emoji Handling", ["Keep Emojis", "Remove Emojis", "Convert Emojis to Text"], index=0, key="opt_handle_emojis")
+         with c2_hashtag: opts['handle_hashtags'] = st.radio("Hashtag (#) Handling", ["Keep Hashtags", "Remove Hashtags", "Extract Hashtags"], index=0, key="opt_handle_hashtags")
+         with c3_mention: opts['handle_mentions'] = st.radio("Mention (@) Handling", ["Keep Mentions", "Remove Mentions", "Extract Mentions"], index=0, key="opt_handle_mentions")
+         st.markdown("---")
+         opts['remove_stopwords'] = st.checkbox("Remove Stopwords", True, key="opt_remove_stopwords")
+ 
+         st.text_area(
+             "Custom Stopwords (comma-separated)",
+             key="custom_stopwords_text",  # Widget key doubles as the master state variable
+             on_change=sync_stopwords
+         )
+         opts['custom_stopwords'] = [s.strip().lower() for s in st.session_state.custom_stopwords_text.split(',') if s]
+ 
+     st.divider()
+     process_button = st.button("🚀 Run Full Analysis", type="primary", use_container_width=True)
+ else:
+     process_button = False
+ 
+ st.divider()
+ 
+ # --- Main Processing Logic ---
+ if process_button:
+     st.session_state.results = None
+     with st.spinner("Processing your data... This may take a few minutes."):
+         try:
+             df = df_raw[[user_id_col, post_content_col, timestamp_col]].copy()
+             df.columns = ['user_id', 'post_content', 'timestamp']
+             df.dropna(subset=['user_id', 'post_content', 'timestamp'], inplace=True)
+             df['timestamp'] = pd.to_datetime(df['timestamp'])
+             if opts['handle_hashtags'] == 'Extract Hashtags': df['hashtags'] = df['post_content'].str.findall(r'#\w+')
+             if opts['handle_mentions'] == 'Extract Mentions': df['mentions'] = df['post_content'].str.findall(r'@\w+')
+ 
+             # 1. Capture the user's actual choice about stopwords
+             user_wants_stopwords_removed = opts.get("remove_stopwords", False)
+             custom_stopwords_list = opts.get("custom_stopwords", [])
+ 
+             # 2. Tell the preprocessor to KEEP stopwords in the text, so the
+             #    embedding model sees full sentences; BERTopic removes them later.
+             opts_for_preprocessor = opts.copy()
+             opts_for_preprocessor['remove_stopwords'] = False
+ 
+             st.info("⚙️ Initializing preprocessor and cleaning text (keeping stopwords for now)...")
+             preprocessor = MultilingualPreprocessor(language=language)
+             df['processed_content'] = preprocessor.preprocess_series(
+                 df['post_content'],
+                 opts_for_preprocessor,
+                 n_process_spacy=1  # Single process for stability
+             )
+ 
+             st.info("🔍 Performing topic modeling...")
+             # BERTopic counts the outlier topic (-1) towards nr_topics, so
+             # request one extra to end up with the number the user asked for.
+             if num_topics > 0:
+                 bertopic_nr_topics = num_topics + 1
+             else:
+                 bertopic_nr_topics = "auto"
+ 
+             docs_series = df['processed_content'].fillna('').astype(str)
+             docs_to_model = docs_series[docs_series.str.len() > 0].tolist()
+             df_with_content = df[docs_series.str.len() > 0].copy()
+ 
+             if not docs_to_model:
+                 st.error("❌ After preprocessing, no documents were left to analyze. Please adjust your cleaning options.")
+                 st.stop()
+ 
+             # 3. Pass the user's choice and stopword list to BERTopic
+             topic_model, topics, probs, coherence_score = perform_topic_modeling(
+                 docs=docs_to_model,
+                 language=language,
+                 nr_topics=bertopic_nr_topics,
+                 remove_stopwords_bertopic=user_wants_stopwords_removed,
+                 custom_stopwords=custom_stopwords_list
+             )
+ 
+             df_with_content['topic_id'] = topics
+             df_with_content['probability'] = probs
+             df = pd.merge(df, df_with_content[['topic_id', 'probability']], left_index=True, right_index=True, how='left')
+             df['topic_id'] = df['topic_id'].fillna(-1).astype(int)
+ 
+             st.info("📊 Calculating user engagement metrics...")
+             all_unique_topics = sorted(df[df['topic_id'] != -1]['topic_id'].unique().tolist())
+             all_unique_users = sorted(df['user_id'].unique().tolist())
+ 
+             gini_per_user = calculate_gini_per_user(df[['user_id', 'topic_id']], all_topics=all_unique_topics)
+             gini_per_topic = calculate_gini_per_topic(df[['user_id', 'topic_id']], all_users=all_unique_users)
+ 
+             st.info("📈 Analyzing topic evolution...")
+             general_evolution = analyze_general_topic_evolution(topic_model, docs_to_model, df_with_content['timestamp'].tolist())
+ 
+             st.session_state.results = {
+                 'topic_model': topic_model,
+                 'topic_info': topic_model.get_topic_info(),
+                 'df': df,
+                 'gini_per_user': gini_per_user,
+                 'gini_per_topic': gini_per_topic,
+                 'general_evolution': general_evolution,
+                 'coherence_score': coherence_score
+             }
+ 
+             st.success("✅ Analysis complete!")
+         except OSError:
+             st.error("spaCy Model Error: Could not load model. Please run `python -m spacy download en_core_web_sm` and `python -m spacy download xx_ent_wiki_sm` from your terminal.")
+         except Exception as e:
+             st.error(f"❌ An error occurred during processing: {e}")
+             st.exception(e)
+ 
+ # --- Display Results ---
+ if st.session_state.results:
+     results = st.session_state.results
+     df = results['df']
+     topic_model = results['topic_model']
+     topic_info = results['topic_info']
+ 
+     st.markdown('<h2 class="sub-header">📋 Overview & Preprocessing</h2>', unsafe_allow_html=True)
+     score_text = f"{results['coherence_score']:.3f}" if results['coherence_score'] is not None else "N/A"
+     num_users = df['user_id'].nunique()
+     avg_posts = len(df) / num_users if num_users > 0 else 0
+     start_date, end_date = df['timestamp'].min(), df['timestamp'].max()
+     # Compact date format: only show the year once when both dates share it
+     if start_date.year == end_date.year:
+         time_range_str = f"{start_date.strftime('%b %d')} - {end_date.strftime('%b %d, %Y')}"
+     else:
+         time_range_str = f"{start_date.strftime('%b %d, %Y')} - {end_date.strftime('%b %d, %Y')}"
+     col1, col2, col3, col4, col5 = st.columns(5)
+     col1.metric("Total Posts", len(df))
+     col2.metric("Unique Users", num_users)
+     col3.metric("Avg Posts / User", f"{avg_posts:.1f}")
+     col4.metric("Time Range", time_range_str)
+     col5.metric("Topic Coherence", score_text)
+     st.markdown("#### Preprocessing Results (Sample)")
+     st.dataframe(df[['post_content', 'processed_content']].head())
+ 
+     with st.expander("📊 Topic Model Evaluation Metrics"):
+         st.write("""
+         ### 🔹 Coherence Score
+         Measures how well the discovered topics make sense:
+         - **> 0.6**: Excellent - Topics are very distinct and meaningful
+         - **0.5 - 0.6**: Good - Topics are generally clear and interpretable
+         - **0.4 - 0.5**: Fair - Topics are somewhat meaningful but may overlap
+         - **< 0.4**: Poor - Topics may be unclear or too similar
+ 
+         💡 **Tip**: If coherence is low, try adjusting the number of topics or cleaning options.
+         """)
+ 
+     st.markdown('<h2 class="sub-header">🎯 Topic Visualization & Refinement</h2>', unsafe_allow_html=True)
+     topic_options = topic_info[topic_info.Topic != -1].sort_values('Count', ascending=False)
+ 
+     view1, view2 = st.tabs(["Word Clouds", "Interactive Word Lists & Refinement"])
+ 
+     with view1:
+         st.info("Visual representation of the most important words for each topic.")
+         topics_to_show = topic_options.head(9)
+         num_cols = 3
+         cols = st.columns(num_cols)
+         for i, row in enumerate(topics_to_show.itertuples()):
+             with cols[i % num_cols]:
+                 st.markdown(f"##### Topic {row.Topic}: {row.Name}")
+                 fig = create_word_cloud(topic_model, row.Topic)
+                 if fig: st.pyplot(fig, use_container_width=True)
+ 
+     with view2:
+         st.info("Select or deselect words from the lists below to instantly update the custom stopwords list in the configuration section above.")
+         topics_to_show = topic_options.head(9)
+         # Store the topic IDs we are showing so the callback can find the right widgets
+         st.session_state.topics_info_for_sync = [row.Topic for row in topics_to_show.itertuples()]
+ 
+         num_cols = 3
+         cols = st.columns(num_cols)
+ 
+         # Words that should be pre-selected in the multiselects
+         current_stopwords_set = set([s.strip() for s in st.session_state.custom_stopwords_text.split(',') if s])
+ 
+         for i, row in enumerate(topics_to_show.itertuples()):
+             with cols[i % num_cols]:
+                 st.markdown(f"##### Topic {row.Topic}")
+                 topic_words = topic_model.get_topic(row.Topic)
+ 
+                 # Multiselect options, e.g. ["word1 (0.123)", "word2 (0.122)"]
+                 formatted_options = [f"{word} ({score:.3f})" for word, score in topic_words[:15]]
+ 
+                 # Pre-select words that are already in the stopword list
+                 default_selection = []
+                 for formatted_word in formatted_options:
+                     word_part = formatted_word.split(' ')[0]
+                     if word_part in current_stopwords_set:
+                         default_selection.append(formatted_word)
+ 
+                 st.multiselect(
+                     f"Select words from Topic {row.Topic}",
+                     options=formatted_options,
+                     default=default_selection,
+                     key=f"multiselect_topic_{row.Topic}",
+                     on_change=sync_stopwords,  # The callback synchronizes everything
+                     label_visibility="collapsed"
+                 )
+ 
+     st.markdown('<h2 class="sub-header">📈 Topic Evolution</h2>', unsafe_allow_html=True)
+     if not results['general_evolution'].empty:
+         evo = results['general_evolution']
+ 
+         # 1. Filter out the outlier topic (-1) and ensure Timestamp is a datetime object
+         evo_filtered = evo[evo.Topic != -1].copy()
+         evo_filtered['Timestamp'] = pd.to_datetime(evo_filtered['Timestamp'])
+ 
+         if not evo_filtered.empty:
+             # 2. Pivot the data to get topics as columns and aggregate frequencies
+             evo_pivot = evo_filtered.pivot_table(
+                 index='Timestamp',
+                 columns='Topic',
+                 values='Frequency',
+                 aggfunc='sum'
+             ).fillna(0)
+ 
+             # 3. Dynamically choose a resampling frequency (hourly, daily, or weekly)
+             time_delta = evo_pivot.index.max() - evo_pivot.index.min()
+             if time_delta.days > 60:
+                 resample_freq, freq_label = 'W', 'Weekly'
+             elif time_delta.days > 5:
+                 resample_freq, freq_label = 'D', 'Daily'
+             else:
+                 resample_freq, freq_label = 'H', 'Hourly'
+ 
+             # Resample the data into the chosen time bins by summing the frequencies
+             evo_resampled = evo_pivot.resample(resample_freq).sum()
+ 
+             # 4. Create the line chart
+             fig_evo = px.line(
+                 evo_resampled,
+                 x=evo_resampled.index,
+                 y=evo_resampled.columns,
+                 title=f"Topic Frequency Over Time ({freq_label} Line Chart)",
+                 labels={'value': 'Total Frequency', 'variable': 'Topic ID', 'index': 'Time'},
+                 height=500
+             )
+             # Make the topic IDs in the legend categorical for better color mapping
+             fig_evo.for_each_trace(lambda t: t.update(name=str(t.name)))
+             fig_evo.update_layout(legend_title_text='Topic')
+ 
+             st.plotly_chart(fig_evo, use_container_width=True)
+         else:
+             st.info("No topic evolution data available to display (all posts may have been outliers).")
+     else:
+         st.warning("Could not compute topic evolution (requires more data points over time).")
+ 
+     st.markdown('<h2 class="sub-header">🧑‍🤝‍🧑 User Engagement Profile</h2>', unsafe_allow_html=True)
+ 
+     # 1. Keep ONLY posts from meaningful (non-outlier) topics.
+     df_meaningful = df[df['topic_id'] != -1].copy()
+ 
+     # 2. Get post counts based on this meaningful data.
+     meaningful_post_counts = df_meaningful.groupby('user_id').size().reset_index(name='post_count')
+ 
+     # 3. Merge with the Gini results (already calculated on meaningful topics).
+     #    An 'inner' merge keeps only users with at least one meaningful post.
+     user_metrics_df = pd.merge(
+         meaningful_post_counts,
+         results['gini_per_user'],
+         on='user_id',
+         how='inner'
+     )
+ 
+     # 4. Filter to include only users with more than one meaningful post.
+     metrics_to_plot = user_metrics_df[user_metrics_df['post_count'] > 1].copy()
+ 
+     total_meaningful_users = len(user_metrics_df)
+     st.info(f"Displaying engagement profile for {len(metrics_to_plot)} users out of {total_meaningful_users} who contributed to meaningful topics.")
+ 
+     # 5. Add jitter so overlapping points remain visible
+     jitter_strength = 0.02
+     metrics_to_plot['gini_jittered'] = metrics_to_plot['gini_coefficient'] + \
+         np.random.uniform(-jitter_strength, jitter_strength, size=len(metrics_to_plot))
+ 
+     # 6. Create the scatter plot from the filtered data.
+     fig = px.scatter(
+         metrics_to_plot,
+         x='post_count',
+         y='gini_jittered',
+         title='User Engagement Profile (based on posts in meaningful topics)',
+         labels={
+             'post_count': 'Number of Posts in Meaningful Topics',
+             'gini_jittered': 'Gini Index (Topic Diversity)'
+         },
+         custom_data=['user_id', 'gini_coefficient']
+     )
+     fig.update_traces(
+         marker=dict(opacity=0.5),
+         hovertemplate="<b>User</b>: %{customdata[0]}<br><b>Meaningful Posts</b>: %{x}<br><b>Gini (Original)</b>: %{customdata[1]:.3f}<extra></extra>"
+     )
+     fig.update_yaxes(range=[-0.05, 1.05])
+     st.plotly_chart(fig, use_container_width=True)
+ 
+     st.markdown('<h2 class="sub-header">👤 User Deep Dive</h2>', unsafe_allow_html=True)
+     selected_user = st.selectbox("Select a User to Analyze", options=sorted(df['user_id'].unique()), key="selected_user_dropdown")
+ 
+     if selected_user:
+         user_df = df[df['user_id'] == selected_user]
+         # Guard: users whose posts were all outliers have no entry in user_metrics_df
+         user_matches = user_metrics_df[user_metrics_df['user_id'] == selected_user]
+         user_gini_info = user_matches.iloc[0] if not user_matches.empty else pd.Series({'gini_coefficient': 0.0})
+ 
+         # Display the top-level metrics for the user first
+         c1, c2 = st.columns(2)
+         with c1: st.metric("Total Posts by User", len(user_df))
+         with c2: st.metric("Topic Diversity (Gini)", f"{user_gini_info['gini_coefficient']:.3f}", help=interpret_gini(user_gini_info['gini_coefficient']))
+ 
+         st.markdown("---")  # Visual separator
+ 
+         # Two-column layout for the charts
+         col1, col2 = st.columns(2)
+ 
+         with col1:
+             # --- Chart 1: Topic Distribution Pie Chart ---
+             user_topic_counts = user_df['topic_id'].value_counts().reset_index()
+             user_topic_counts.columns = ['topic_id', 'count']
+ 
+             fig_pie = px.pie(
+                 user_topic_counts[user_topic_counts.topic_id != -1],
+                 names='topic_id',
+                 values='count',
+                 title=f"Overall Topic Distribution for {selected_user}",
+                 hole=0.4
+             )
+             fig_pie.update_layout(margin=dict(l=0, r=0, t=40, b=0))
+             st.plotly_chart(fig_pie, use_container_width=True)
+ 
+         with col2:
+             # --- Chart 2: Topic Evolution for User ---
+             if len(user_df) > 1:
+                 user_evo_df = user_df[user_df['topic_id'] != -1].copy()
+                 user_evo_df['timestamp'] = pd.to_datetime(user_evo_df['timestamp'])
+ 
+                 if not user_evo_df.empty and user_evo_df['timestamp'].nunique() > 1:
+                     user_pivot = user_evo_df.pivot_table(index='timestamp', columns='topic_id', aggfunc='size', fill_value=0)
+ 
+                     time_delta = user_pivot.index.max() - user_pivot.index.min()
+                     if time_delta.days > 30: resample_freq = 'D'
+                     elif time_delta.days > 2: resample_freq = 'H'
+                     else: resample_freq = 'T'  # minute-level bins
+ 
+                     user_resampled = user_pivot.resample(resample_freq).sum()
+                     row_sums = user_resampled.sum(axis=1)
+                     user_proportions = user_resampled.div(row_sums, axis=0).fillna(0)
+ 
+                     topic_name_map = topic_info.set_index('Topic')['Name'].to_dict()
+                     user_proportions.rename(columns=topic_name_map, inplace=True)
+ 
+                     fig_user_evo = px.area(
+                         user_proportions,
+                         x=user_proportions.index,
+                         y=user_proportions.columns,
+                         title=f"Topic Proportion Over Time for {selected_user}",
+                         labels={'value': 'Topic Proportion', 'variable': 'Topic', 'index': 'Time'},
+                     )
+                     fig_user_evo.update_layout(margin=dict(l=0, r=0, t=40, b=0))
+                     st.plotly_chart(fig_user_evo, use_container_width=True)
+                 else:
+                     st.info("This user has no posts in meaningful topics or all posts occurred at the same time.")
+             else:
+                 st.info("Topic evolution requires more than one post to display.")
+ 
+         st.markdown("#### User's Most Recent Posts")
+         user_posts_table = user_df[['post_content', 'timestamp', 'topic_id']] \
+             .sort_values(by='timestamp', ascending=False) \
+             .head(100)
+         user_posts_table.columns = ['Post Content', 'Timestamp', 'Assigned Topic']
+         st.dataframe(user_posts_table, use_container_width=True)
+ 
+     with st.expander("Show User Distribution by Post Count"):
+         # 'user_metrics_df' is based on meaningful posts only
+         post_distribution = user_metrics_df['post_count'].value_counts().reset_index()
+         post_distribution.columns = ['Number of Posts', 'Number of Users']
+         post_distribution = post_distribution.sort_values(by='Number of Posts')
+ 
+         # Bar chart of the distribution
+         fig_dist = px.bar(
+             post_distribution,
+             x='Number of Posts',
+             y='Number of Users',
+             title='User Distribution by Number of Meaningful Posts'
+         )
+         st.plotly_chart(fig_dist, use_container_width=True)
+ 
+         # Raw data table
+         st.write("Data Table: User Distribution")
+         st.dataframe(post_distribution, use_container_width=True)
gini_calculator.py ADDED
@@ -0,0 +1,213 @@
+ # gini_calculator.py
+ 
+ import math
+ from typing import List
+ 
+ import pandas as pd
+ 
+ def calculate_gini(counts, *, min_posts=None, normalize=False):
+     """
+     Compute 1 - sum(p_i^2) where p_i are category probabilities (Gini Impurity).
+     Handles: list/tuple of counts, dict {cat: count}, numpy array, pandas Series.
+ 
+     Edge cases:
+     - total == 0 -> return float('nan')
+     - total == 1 -> return 0.0
+     - min_posts set and total < min_posts -> return float('nan')
+     - normalize=True -> divide by (1 - 1/k_nonzero) when k_nonzero > 1
+ 
+     Parameters
+     ----------
+     counts : Iterable[int] | dict | pandas.Series | numpy.ndarray
+         Nonnegative counts per category.
+     min_posts : int | None
+         If provided and total posts < min_posts, returns NaN.
+     normalize : bool
+         If True, returns Gini / (1 - 1/k_nonzero) for k_nonzero > 1.
+ 
+     Returns
+     -------
+     float
+     """
+     # Convert to a flat list of counts
+     if counts is None:
+         return float('nan')
+ 
+     if isinstance(counts, dict):
+         vals = list(counts.values())
+     else:
+         # Works for list/tuple/np.array/Series
+         try:
+             vals = list(counts)
+         except TypeError:
+             return float('nan')
+ 
+     # Validate & clean
+     vals = [float(v) for v in vals if v is not None and not math.isnan(v)]
+     if any(v < 0 for v in vals):
+         raise ValueError("Counts must be nonnegative.")
+     total = sum(vals)
+ 
+     # Edge cases
+     if total == 0:
+         return float('nan')
+     if min_posts is not None and total < min_posts:
+         return float('nan')
+     if total == 1:
+         base = 0.0
+     else:
+         # Compute 1 - sum p_i^2
+         s2 = sum((v / total) ** 2 for v in vals)
+         base = 1.0 - s2
+ 
+     if not normalize:
+         return base
+ 
+     # Normalize by the maximum possible diversity for the observed nonzero categories
+     k_nonzero = sum(1 for v in vals if v > 0)
+     if k_nonzero <= 1:
+         # Only one category has posts: diversity is 0 and normalization isn't defined
+         return 0.0
+     denom = 1.0 - 1.0 / k_nonzero
+     # Guard against tiny floating-point negatives
+     return max(0.0, min(1.0, base / denom))
+ 
+ 
+ def calculate_gini_per_user(df: pd.DataFrame, all_topics: List[int]):
+     """
+     Calculates the Gini Impurity for topic distribution per user.
+     A high value indicates high topic diversity.
+     """
+     user_gini = []
+     for user_id in df["user_id"].unique():
+         user_posts = df[df["user_id"] == user_id]
+         existing_topic_counts = user_posts["topic_id"].value_counts()
+         full_topic_counts = pd.Series(0, index=all_topics)
+         full_topic_counts.update(existing_topic_counts)
+         # normalize=True makes scores comparable across users
+         gini = calculate_gini(full_topic_counts.values, normalize=True)
+         user_gini.append({"user_id": user_id, "gini_coefficient": gini})
+     # calculate_gini returns NaN for users with zero counts, so fill with 0
+     return pd.DataFrame(user_gini).fillna(0)
+ 
+ 
+ def calculate_gini_per_topic(df: pd.DataFrame, all_users: List[str]):
+     """
+     Calculates the Gini Impurity for user distribution per topic.
+     A high value indicates the topic is discussed by a diverse set of users.
+     """
+     topic_gini = []
+     for topic_id in df["topic_id"].unique():
+         topic_posts = df[df["topic_id"] == topic_id]
+         existing_user_counts = topic_posts["user_id"].value_counts()
+         full_user_counts = pd.Series(0, index=all_users)
+         full_user_counts.update(existing_user_counts)
+         gini = calculate_gini(full_user_counts.values, normalize=True)
+         topic_gini.append({"topic_id": topic_id, "gini_coefficient": gini})
+     return pd.DataFrame(topic_gini).fillna(0)
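+ 
+ 
+ if __name__ == "__main__":
+     # Minimal usage sketch with illustrative data (not part of the app flow),
+     # mirroring the example that shipped with earlier revisions of this module.
+     demo = pd.DataFrame({
+         "user_id": ["userA", "userA", "userA", "userB", "userB", "userC", "userC", "userC", "userC", "userD"],
+         "topic_id": [1, 1, 2, 1, 3, 2, 2, 3, 4, 1],
+     })
+     topics = sorted(demo["topic_id"].unique().tolist())
+     users = sorted(demo["user_id"].unique().tolist())
+     print("Gini per user:")
+     print(calculate_gini_per_user(demo, all_topics=topics))
+     print("\nGini per topic:")
+     print(calculate_gini_per_topic(demo, all_users=users))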
narrative_similarity.py ADDED
@@ -0,0 +1,30 @@
+ # narrative_similarity.py
+ 
+ import pandas as pd
+ from sklearn.metrics.pairwise import cosine_similarity
+ 
+ def calculate_narrative_similarity(df: pd.DataFrame):
+     """
+     Calculates the narrative overlap between users based on their topic distributions.
+ 
+     Args:
+         df (pd.DataFrame): DataFrame containing 'user_id' and 'topic_id' columns.
+ 
+     Returns:
+         pd.DataFrame: A square DataFrame where rows and columns are user_ids
+                       and values are the cosine similarity of their topic distributions.
+     """
+     # 1. Filter out outlier posts for a more meaningful similarity score.
+     #    Note: users whose posts are all outliers are dropped from the result.
+     df_meaningful = df[df['topic_id'] != -1]
+ 
+     # 2. Create the "narrative vector" for each user
+     #    Rows: user_id, Columns: topic_id, Values: count of posts
+     user_topic_matrix = pd.crosstab(df_meaningful['user_id'], df_meaningful['topic_id'])
+ 
+     # 3. Calculate pairwise cosine similarity between all users
+     similarity_matrix = cosine_similarity(user_topic_matrix)
+ 
+     # 4. Convert the result back to a DataFrame with user_ids as labels
+     similarity_df = pd.DataFrame(similarity_matrix, index=user_topic_matrix.index, columns=user_topic_matrix.index)
+ 
+     return similarity_df
readme.md ADDED
@@ -0,0 +1,138 @@
+ # Social Media Topic Modeling System
+ 
+ A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini-based diversity analysis, topic evolution tracking, and semantic narrative overlap detection.
+ 
+ ## Features
+ 
+ - **📊 Topic Modeling**: Uses BERTopic for state-of-the-art, transformer-based topic modeling.
+ - **⚙️ Flexible Configuration**:
+   - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
+   - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
+ - **🌍 Multilingual Support**: Handles English and 50+ other languages using appropriate language models.
+ - **📈 Gini Index Analysis**: Calculates topic and user diversity.
+ - **⏰ Topic Evolution**: Tracks how topic popularity and user interests change over time with interactive charts.
+ - **🤝 Narrative Overlap Analysis**: Identifies users with semantically similar posting patterns (shared narratives), even when their wording differs.
+ - **✍️ Interactive Topic Refinement**: Fine-tune topic quality by adding words to a custom stopword list directly from the dashboard.
+ - **🎯 Interactive Visualizations**: A rich dashboard with built-in charts and data tables using Plotly.
+ - **📱 Responsive Interface**: Clean, modern Streamlit interface with a control panel for all settings.
+ 
+ ## Requirements
+ 
+ ### CSV File Format
+ 
+ Your CSV file must contain columns that can be mapped to the following roles:
+ - **User ID**: A column with unique identifiers for each user (string).
+ - **Post Content**: A column with the text content of the social media post (string).
+ - **Timestamp**: A column with the date and time of the post.
+ 
+ The application will prompt you to select the correct column for each role after you upload your file.
+ 
+ #### A Note on Timestamp Formatting
+ 
+ The application can automatically parse many common date and time formats thanks to Pandas. To ensure correct parsing and avoid errors, please follow these guidelines for your timestamp column:
+ 
+ * **Best Practice (Recommended):** Use a standard, unambiguous format like ISO 8601.
+   - `YYYY-MM-DD HH:MM:SS` (e.g., `2023-10-27 15:30:00`)
+   - `YYYY-MM-DDTHH:MM:SS` (e.g., `2023-10-27T15:30:00`)
+ 
+ * **Supported Formats:** Most common formats will work, including:
+   - `MM/DD/YYYY HH:MM` (e.g., `10/27/2023 15:30`)
+   - `DD/MM/YYYY HH:MM` (e.g., `27/10/2023 15:30`)
+   - `Month D, YYYY` (e.g., `October 27, 2023`)
+ 
+ * **Potential Issues to Avoid:**
+   - **Ambiguous formats:** A date like `01/02/2023` can be read as either Jan 2nd or Feb 1st. Using a `YYYY-MM-DD` format avoids this; you can also verify parsing with the sketch below.
+   - **Mixed formats in one column:** Ensure all timestamps in your column follow the same format for reliable parsing.
+   - **Timezone information:** Formats with timezone offsets (e.g., `2023-10-27 15:30:00+05:30`) are supported.
+ 
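+ As a quick sanity check before uploading, you can verify how your timestamps parse with plain pandas (a minimal sketch, independent of this app):
+ 
+ ```python
+ import pandas as pd
+ 
+ # Unambiguous ISO 8601 timestamps parse directly
+ s = pd.Series(["2023-10-27 15:30:00", "2023-10-28 09:00:00"])
+ print(pd.to_datetime(s))
+ 
+ # For ambiguous day/month formats, pass an explicit format string
+ s2 = pd.Series(["01/02/2023 15:30"])
+ print(pd.to_datetime(s2, format="%d/%m/%Y %H:%M"))  # parsed as 1 Feb 2023
+ ```
+ 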
+ ### Dependencies
+ 
+ See `requirements.txt` for a full list of dependencies.
+ 
+ ## Installation
+ 
+ ### Option 1: Local Installation
+ 
+ 1. **Clone or download the project files.**
+ 2. **Install dependencies:**
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 3. **Download spaCy models:**
+    ```bash
+    python -m spacy download en_core_web_sm
+    python -m spacy download xx_ent_wiki_sm
+    ```
+ 
+ ### Option 2: Docker Installation (Recommended)
+ 
+ 1. **Using Docker Compose (easiest):**
+    ```bash
+    docker-compose up --build
+    ```
+ 2. **Access the application:**
+    Open your browser and go to `http://localhost:8501`.
+ 
+ ## Usage
+ 
+ 1. **Start the Streamlit application:**
+    ```bash
+    streamlit run app.py
+    ```
+ 2. **Open your browser** and navigate to the local URL provided by Streamlit (usually `http://localhost:8501`).
+ 3. **Follow the steps in the application:**
+    - **1. Upload CSV File**: Click "Browse files" to upload your dataset.
+    - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
+    - **3. Configure Analysis**:
+      - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
+      - **Number of Topics**: Enter a specific number of meaningful topics to find, or use `-1` to let the model decide automatically.
+      - **Text Preprocessing**: Expand the advanced options to select cleaning steps like lowercasing, punctuation removal, and more.
+      - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
+    - **4. Run Analysis**: Click the "🚀 Run Full Analysis" button.
+ 
+ 4. **Explore the results** in the interactive sections of the main panel.
+ 
+ ### Exploring the Interface
+ 
+ The application provides a series of detailed sections:
+ 
+ #### 📋 Overview & Preprocessing
+ - Key metrics (total posts, unique users), dataset time range, and a topic coherence score.
+ - A sample of your data showing the original and processed text.
+ 
+ #### 🎯 Topic Visualization & Refinement
+ - **Word Clouds**: Visual representation of the most important words for top topics.
+ - **Interactive Word Lists**: Interactively select words from topic lists to add them to your custom stopwords for re-analysis.
+ 
+ #### 📈 Topic Evolution
+ - An interactive line chart showing how topic frequencies change over the entire dataset's timespan.
+ 
+ #### 🧑‍🤝‍🧑 User Engagement Profile
+ - A scatter plot visualizing the relationship between the number of posts a user makes and the diversity of their topics.
+ - An expandable section showing the distribution of users by their post count.
+ 
+ #### 👤 User Deep Dive
+ - Select a specific user to analyze.
+ - View their key metrics, overall topic distribution pie chart, and their personal topic evolution over time.
+ - See detailed tables of their topic breakdown and their most recent posts.
+ 
+ #### 🤝 Narrative Overlap Analysis
+ - Select a user to find other users who discuss a similar mix of topics.
+ - Use the slider to adjust the similarity threshold.
+ - The results table shows the overlap score and post count of similar users, providing context on both narrative alignment and engagement level.
+ 
+ ## Understanding the Results
+ 
+ ### Gini Impurity Index
+ This application uses the **Gini Impurity Index**, a measure of diversity.
+ - **Range**: 0 to 1
+ - **User Gini (Topic Diversity)**: Measures how diverse a user's topics are. **0** = perfectly specialized (posts on only one topic), **1** = perfectly diverse (posts spread evenly across all topics).
+ - **Topic Gini (User Diversity)**: Measures how concentrated a topic is among users. **0** = dominated by a single user, **1** = widely and evenly discussed by many users.
+ 
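+ The score mirrors `calculate_gini` in `gini_calculator.py`; here is a stripped-down sketch of the core formula (without that function's edge-case handling):
+ 
+ ```python
+ def gini_impurity(counts):
+     """1 - sum(p_i**2): 0 = all posts in one category, higher = more evenly spread."""
+     total = sum(counts)
+     return 1.0 - sum((c / total) ** 2 for c in counts)
+ 
+ print(gini_impurity([10, 0, 0]))    # 0.0   (perfectly specialized)
+ print(gini_impurity([5, 5]))        # 0.5
+ print(gini_impurity([1, 1, 1, 1]))  # 0.75  (even spread over 4 topics)
+ ```
+ 
+ The app additionally normalizes by the maximum impurity achievable with the observed number of non-empty categories, so an even spread scores 1.0 regardless of how many topics are involved.
+ 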
+ ### Narrative Overlap Score
+ - **Range**: 0 to 1
+ - This score measures the **cosine similarity** between the topic distributions of two users.
+ - A score of **1.0** means the two users have an identical proportional interest in topics (e.g., both are 100% focused on Topic 3).
+ - A score of **0.0** means their topic interests are completely different.
+ - This helps identify users with similar narrative focus, regardless of their total post count. The sketch below shows the computation.
+ 
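+ A minimal sketch of how the score is computed, matching `calculate_narrative_similarity` in `narrative_similarity.py` (the data here is illustrative):
+ 
+ ```python
+ import pandas as pd
+ from sklearn.metrics.pairwise import cosine_similarity
+ 
+ df = pd.DataFrame({
+     "user_id":  ["a", "a", "a", "b", "b", "c"],
+     "topic_id": [0,   0,   1,   0,   1,   2],
+ })
+ matrix = pd.crosstab(df["user_id"], df["topic_id"])  # users x topics post counts
+ sim = cosine_similarity(matrix)
+ print(pd.DataFrame(sim, index=matrix.index, columns=matrix.index).round(2))
+ # "a" (a 2:1 mix of topics 0 and 1) scores ~0.95 against "b" (a 1:1 mix),
+ # while "c" (only topic 2) has zero overlap with both.
+ ```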
requirements.txt CHANGED
@@ -1,3 +1,19 @@
- altair
- pandas
- streamlit
+ streamlit>=1.17.0
+ bertopic[all]>=0.16.0
+ pandas>=2.0.0
+ numpy>=1.20.0
+ plotly>=5.0.0
+ transformers>=4.21.0
+ sentence-transformers>=2.2.0
+ scikit-learn>=1.0.0
+ hdbscan>=0.8.29
+ umap-learn>=0.5.0
+ torch>=1.11.0
+ matplotlib>=3.5.0
+ seaborn>=0.11.0
+ gensim>=4.3.0
+ nltk>=3.8.0
+ wordcloud>=1.9.0
+ emoji>=2.2.0
+ spacy>=3.4.0
+ pyinstaller
text_preprocessor.py ADDED
@@ -0,0 +1,123 @@
+ import re
+ import pandas as pd
+ import spacy
+ import emoji
+ from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
+ from spacy.util import compile_infix_regex
+ from pathlib import Path
+ 
+ from resource_path import resource_path
+ 
+ 
+ class MultilingualPreprocessor:
+     """
+     A robust text preprocessor using spaCy for multilingual support.
+     """
+     def __init__(self, language: str):
+         """
+         Initializes the preprocessor and loads the appropriate spaCy model.
+ 
+         Args:
+             language (str): 'english' or 'multilingual'.
+         """
+         model_map = {
+             'english': resource_path('en_core_web_sm'),
+             'multilingual': resource_path('xx_ent_wiki_sm')
+         }
+         self.model_name = model_map.get(language, resource_path('xx_ent_wiki_sm'))  # string path to the model
+ 
+         try:
+             # spaCy's loader expects a Path object, not a plain string
+             model_path_obj = Path(self.model_name)
+             self.nlp = spacy.util.load_model_from_path(model_path_obj)
+         except OSError:
+             # If this fails, the bundled spaCy model directories are missing or broken
+             print(f"spaCy Model Error: Could not load model from path: {self.model_name}")
+             print("This indicates a problem with the PyInstaller bundling of the spaCy models.")
+             raise  # Re-raise so the Streamlit app can surface the error
+ 
+         # Customize the tokenizer: omit the default hyphen infix rules so
+         # hyphenated words are kept together (CONCAT_QUOTES must be wrapped in a list)
+         infixes = LIST_ELLIPSES + LIST_ICONS + [CONCAT_QUOTES]
+         infix_regex = compile_infix_regex(infixes)
+         self.nlp.tokenizer.infix_finditer = infix_regex.finditer
+ 
+     def preprocess_series(self, text_series: pd.Series, options: dict, n_process_spacy: int = -1) -> pd.Series:
+         """
+         Applies a series of cleaning steps to a pandas Series of text.
+ 
+         Args:
+             text_series (pd.Series): The text to be cleaned.
+             options (dict): A dictionary of preprocessing options.
+             n_process_spacy (int): Number of processes for spaCy's nlp.pipe (-1 = all cores).
+ 
+         Returns:
+             pd.Series: The cleaned text Series.
+         """
+         # --- Stage 1: Fast, regex-based cleaning ---
+         processed_text = text_series.copy().astype(str)
+         if options.get("remove_html"):
+             processed_text = processed_text.str.replace(r"<.*?>", "", regex=True)
+         if options.get("remove_urls"):
+             processed_text = processed_text.str.replace(r"http\S+|www\.\S+", "", regex=True)
+ 
+         emoji_option = options.get("handle_emojis", "Keep Emojis")
+         if emoji_option == "Remove Emojis":
+             processed_text = processed_text.apply(lambda s: emoji.replace_emoji(s, replace=''))
+         elif emoji_option == "Convert Emojis to Text":
+             processed_text = processed_text.apply(emoji.demojize)
+ 
+         if options.get("handle_hashtags") == "Remove Hashtags":
+             processed_text = processed_text.str.replace(r"#\w+", "", regex=True)
+         if options.get("handle_mentions") == "Remove Mentions":
+             processed_text = processed_text.str.replace(r"@\w+", "", regex=True)
+ 
+         # --- Stage 2: spaCy-based advanced processing ---
+         # nlp.pipe streams the Series efficiently in batches
+         cleaned_docs = []
+         docs = self.nlp.pipe(processed_text, n_process=n_process_spacy, batch_size=500)
+ 
+         # Custom stopwords as a lowercase set for fast lookups
+         custom_stopwords = set(options.get("custom_stopwords", []))
+ 
+         for doc in docs:
+             tokens = []
+             for token in doc:
+                 # Punctuation and number handling
+                 if options.get("remove_punctuation") and token.is_punct:
+                     continue
+                 if options.get("remove_numbers") and (token.is_digit or token.like_num):
+                     continue
+ 
+                 # Stopword handling (including custom stopwords)
+                 is_stopword = token.is_stop or token.text.lower() in custom_stopwords
+                 if options.get("remove_stopwords") and is_stopword:
+                     continue
+ 
+                 # Use the lemma if lemmatization is on, otherwise the original text
+                 token_text = token.lemma_ if options.get("lemmatize") else token.text
+ 
+                 # Lowercasing
+                 if options.get("lowercase"):
+                     token_text = token_text.lower()
+ 
+                 # Remove any leftover special characters
+                 if options.get("remove_special_chars"):
+                     token_text = re.sub(r'[^\w\s-]', '', token_text)
+ 
+                 if token_text.strip():
+                     tokens.append(token_text.strip())
+ 
+             cleaned_docs.append(" ".join(tokens))
+ 
+         return pd.Series(cleaned_docs, index=text_series.index)
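+ 
+ 
+ if __name__ == "__main__":
+     # Minimal usage sketch (illustrative): it requires the spaCy model
+     # directories that resource_path() resolves, e.g. a downloaded en_core_web_sm.
+     pre = MultilingualPreprocessor(language="english")
+     sample = pd.Series(["Check out https://example.com #AI @friend 🙂 It is great!"])
+     opts = {
+         "remove_urls": True,
+         "handle_hashtags": "Remove Hashtags",
+         "handle_mentions": "Remove Mentions",
+         "handle_emojis": "Remove Emojis",
+         "lowercase": True,
+         "remove_punctuation": True,
+     }
+     print(pre.preprocess_series(sample, opts, n_process_spacy=1).iloc[0])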
topic_evolution.py ADDED
@@ -0,0 +1,100 @@
+ import pandas as pd
+ 
+ 
+ def analyze_general_topic_evolution(topic_model, docs, timestamps):
+     """
+     Analyzes general topic evolution over time.
+ 
+     Args:
+         topic_model: Trained BERTopic model.
+         docs (list): List of documents.
+         timestamps (list): List of timestamps corresponding to the documents.
+ 
+     Returns:
+         pd.DataFrame: DataFrame with topic evolution information.
+     """
+     try:
+         topics_over_time = topic_model.topics_over_time(docs, timestamps, global_tuning=True)
+         return topics_over_time
+     except Exception:
+         # Fallback for small datasets or cases where evolution can't be computed
+         return pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
+ 
+ 
+ def analyze_user_topic_evolution(df: pd.DataFrame, topic_model):
+     """
+     Analyzes topic evolution per user.
+ 
+     Args:
+         df (pd.DataFrame): DataFrame with "user_id", "post_content", "timestamp", and "topic_id" columns.
+         topic_model: Trained BERTopic model.
+ 
+     Returns:
+         dict: A dictionary where keys are user_ids and values are DataFrames of topic evolution for that user.
+     """
+     user_topic_evolution = {}
+     for user_id in df["user_id"].unique():
+         user_df = df[df["user_id"] == user_id].copy()
+         if not user_df.empty and len(user_df) > 1:
+             try:
+                 # Ensure timestamps are sorted for topics_over_time
+                 user_df = user_df.sort_values(by="timestamp")
+                 docs = user_df["post_content"].tolist()
+                 timestamps = user_df["timestamp"].tolist()
+                 selected_topics = user_df["topic_id"].tolist()  # topic_ids for the user's posts
+                 topics_over_time = topic_model.topics_over_time(docs, timestamps, topics=selected_topics, global_tuning=True)
+                 user_topic_evolution[user_id] = topics_over_time
+             except Exception:
+                 user_topic_evolution[user_id] = pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
+         else:
+             user_topic_evolution[user_id] = pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
+     return user_topic_evolution
+ 
+ 
+ if __name__ == "__main__":
+     from topic_modeling import perform_topic_modeling
+ 
+     # Example usage:
+     data = {
+         "user_id": ["user1", "user2", "user1", "user3", "user2", "user1", "user4", "user3", "user2", "user1", "user5", "user4", "user3", "user2", "user1"],
+         "post_content": [
+             "This is a great movie, I loved the acting and the plot. It was truly captivating.",
+             "The new phone has an amazing camera and long battery life. Highly recommend it.",
+             "I enjoyed the film, especially the special effects and the soundtrack. A must-watch.",
+             "Learning about AI and machine learning is fascinating. The future is here.",
+             "My old phone is so slow, I need an upgrade soon. Thinking about the latest model.",
+             "The best part of the movie was the soundtrack and the stunning visuals. Very immersive.",
+             "Exploring the vastness of space is a lifelong dream. Astronomy is amazing.",
+             "Data science is revolutionizing industries. Predictive analytics is key.",
+             "I need a new laptop for work. Something powerful and portable.",
+             "Just finished reading a fantastic book on quantum physics. Mind-blowing concepts.",
+             "Cooking new recipes is my passion. Today, I tried a spicy Thai curry.",
+             "The universe is full of mysteries. Black holes and dark matter are intriguing.",
+             "Deep learning models are becoming incredibly sophisticated. Image recognition is impressive.",
+             "My current laptop is crashing frequently. Time for an upgrade.",
+             "Science fiction movies always make me think about the future of humanity."
+         ],
+         "timestamp": [
+             "2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-02 10:30:00",
+             "2023-01-02 14:00:00", "2023-01-03 09:00:00", "2023-01-03 16:00:00",
+             "2023-01-04 08:00:00", "2023-01-04 12:00:00", "2023-01-05 10:00:00",
+             "2023-01-05 15:00:00", "2023-01-06 09:30:00", "2023-01-06 13:00:00",
+             "2023-01-07 11:00:00", "2023-01-07 14:30:00", "2023-01-08 10:00:00"
+         ]
+     }
+     df = pd.DataFrame(data)
+     df["timestamp"] = pd.to_datetime(df["timestamp"])
+ 
+     print("Performing topic modeling (English)...")
+     model_en, topics_en, probs_en, coherence_en = perform_topic_modeling(df["post_content"].tolist(), language="english")
+     df["topic_id"] = topics_en
+ 
+     print("\nAnalyzing general topic evolution...")
+     general_evolution_df = analyze_general_topic_evolution(model_en, df["post_content"].tolist(), df["timestamp"].tolist())
+     print(general_evolution_df.head())
+ 
+     print("\nAnalyzing per user topic evolution...")
+     user_evolution_dict = analyze_user_topic_evolution(df, model_en)
+     for user_id, evolution_df in user_evolution_dict.items():
+         print(f"\nTopic evolution for {user_id}:")
+         print(evolution_df.head())
topic_modeling.py ADDED
@@ -0,0 +1,81 @@
+ # topic_modeling.py
+ 
+ import nltk
+ from bertopic import BERTopic
+ from gensim.corpora import Dictionary
+ from gensim.models import CoherenceModel
+ from nltk.tokenize import word_tokenize
+ from typing import List
+ from sklearn.feature_extraction.text import CountVectorizer
+ 
+ # word_tokenize needs NLTK tokenizer data; fetch it quietly if missing
+ for _pkg in ('punkt', 'punkt_tab'):
+     try:
+         nltk.data.find(f'tokenizers/{_pkg}')
+     except LookupError:
+         nltk.download(_pkg, quiet=True)
+ 
+ def perform_topic_modeling(
+     docs: List[str],
+     language: str = "english",
+     nr_topics=None,
+     remove_stopwords_bertopic: bool = False,  # If True, stopwords are removed inside BERTopic
+     custom_stopwords: List[str] = None
+ ):
+     """
+     Performs topic modeling on a list of documents.
+ 
+     Args:
+         docs (List[str]): A list of documents. Stopwords should be INCLUDED for best results.
+         language (str): Language for the BERTopic model ('english', 'multilingual').
+         nr_topics: The number of topics to find ("auto" or an int).
+         remove_stopwords_bertopic (bool): If True, stopwords will be removed internally by BERTopic.
+         custom_stopwords (List[str]): A list of custom stopwords to use.
+ 
+     Returns:
+         tuple: BERTopic model, topics, probabilities, and coherence score.
+     """
+     vectorizer_model = None  # Default to no custom vectorizer
+ 
+     if remove_stopwords_bertopic:
+         stop_words_list = []
+         if language == "english":
+             # Start with the built-in English stopword list from scikit-learn
+             from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
+             stop_words_list = list(ENGLISH_STOP_WORDS)
+ 
+         # Add any custom stopwords provided by the user
+         if custom_stopwords:
+             stop_words_list.extend(custom_stopwords)
+ 
+         # Only create a vectorizer if there are stopwords to apply
+         if stop_words_list:
+             vectorizer_model = CountVectorizer(stop_words=stop_words_list)
+ 
+     # Instantiate BERTopic; 'english' and 'multilingual' are both valid language settings
+     topic_model = BERTopic(language=language, nr_topics=nr_topics, vectorizer_model=vectorizer_model)
+ 
+     # The docs passed here should still contain stopwords so the embedding model sees full sentences
+     topics, probs = topic_model.fit_transform(docs)
+ 
+     # --- Calculate Coherence Score (c_v, via gensim) ---
+     tokenized_docs = [word_tokenize(doc) for doc in docs]
+     dictionary = Dictionary(tokenized_docs)
+     topic_words = topic_model.get_topics()
+     topics_for_coherence = []
+     for topic_id in sorted(topic_words.keys()):
+         if topic_id != -1:
+             words = [word for word, _ in topic_model.get_topic(topic_id)]
+             topics_for_coherence.append(words)
+     coherence_score = None
+     if topics_for_coherence and tokenized_docs:
+         try:
+             coherence_model = CoherenceModel(
+                 topics=topics_for_coherence,
+                 texts=tokenized_docs,
+                 dictionary=dictionary,
+                 coherence='c_v'
+             )
+             coherence_score = coherence_model.get_coherence()
+         except Exception as e:
+             print(f"Could not calculate coherence score: {e}")
+ 
+     return topic_model, topics, probs, coherence_score