Social Media Topic Modeling System
A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini impurity calculation for diversity analysis, topic evolution tracking, and semantic narrative overlap detection.
Features
- Topic Modeling: Uses BERTopic for state-of-the-art, transformer-based topic modeling.
- Flexible Configuration:
  - Custom Column Mapping: Use any CSV file by mapping your columns to user_id, post_content, and timestamp.
  - Topic Number Control: Let the model find topics automatically or specify the exact number you need.
- Multilingual Support: Handles English and 50+ other languages using appropriate language models.
- Gini Index Analysis: Calculates topic and user diversity.
- Topic Evolution: Tracks how topic popularity and user interests change over time with interactive charts.
- Narrative Overlap Analysis: Identifies users with semantically similar posting patterns (shared narratives), even when their wording differs.
- Interactive Topic Refinement: Fine-tune topic quality by adding words to a custom stopword list directly from the dashboard.
- Interactive Visualizations: A rich dashboard with built-in charts and data tables using Plotly.
- Responsive Interface: Clean, modern Streamlit interface with a control panel for all settings.
Requirements
CSV File Format
Your CSV file must contain columns that can be mapped to the following roles:
- User ID: A column with unique identifiers for each user (string).
- Post Content: A column with the text content of the social media post (string).
- Timestamp: A column with the date and time of the post.
The application will prompt you to select the correct column for each role after you upload your file.
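The mapping step can be pictured with a small sketch that reads a CSV with arbitrary header names and re-keys each row by role. The header names (`author`, `text`, `created_at`) and the `map_columns` helper are hypothetical and only illustrate the idea; the app's own implementation may differ.

```python
import csv
import io

# Hypothetical mapping from the app's roles to the headers in your file.
COLUMN_MAP = {"user_id": "author", "post_content": "text", "timestamp": "created_at"}

def map_columns(rows, column_map):
    """Return rows keyed by role name, keeping only the mapped columns."""
    mapped = []
    for row in rows:
        missing = [src for src in column_map.values() if src not in row]
        if missing:
            raise KeyError(f"CSV is missing mapped column(s): {missing}")
        mapped.append({role: row[src] for role, src in column_map.items()})
    return mapped

# In-memory stand-in for an uploaded file; the extra "likes" column is ignored.
sample_csv = io.StringIO(
    "author,text,created_at,likes\n"
    "u1,Hello world,2023-10-27 15:30:00,3\n"
    "u2,Another post,2023-10-28 09:00:00,5\n"
)
rows = map_columns(csv.DictReader(sample_csv), COLUMN_MAP)
print(rows[0])
```

Columns that are not mapped to a role are simply dropped in this sketch.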
A Note on Timestamp Formatting
The application uses Pandas to parse timestamps and recognizes many common date and time formats automatically. To avoid parsing errors, follow these guidelines for your timestamp column:
Best Practice (Recommended): Use a standard, unambiguous format like ISO 8601.
- YYYY-MM-DD HH:MM:SS (e.g., 2023-10-27 15:30:00)
- YYYY-MM-DDTHH:MM:SS (e.g., 2023-10-27T15:30:00)
Supported Formats: Most common formats will work, including:
- MM/DD/YYYY HH:MM (e.g., 10/27/2023 15:30)
- DD/MM/YYYY HH:MM (e.g., 27/10/2023 15:30)
- Month D, YYYY (e.g., October 27, 2023)
Potential Issues to Avoid:
- Ambiguous formats: A date like 01/02/2023 can be interpreted as either Jan 2nd or Feb 1st. Using a YYYY-MM-DD format avoids this.
- Mixed formats in one column: Ensure all timestamps in your column follow the same format for best performance and reliability.
- Timezone information: Formats with timezone offsets (e.g., 2023-10-27 15:30:00+05:30) are fully supported.
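To see why ambiguous formats matter, the sketch below parses the same string under both day/month orders and compares it with ISO 8601 input. It uses only Python's standard `datetime` module to illustrate the problem; it is not the app's parsing code, which relies on Pandas.

```python
from datetime import datetime

ambiguous = "01/02/2023"

# The same string yields different dates depending on the assumed order.
month_first = datetime.strptime(ambiguous, "%m/%d/%Y")  # read as January 2nd
day_first = datetime.strptime(ambiguous, "%d/%m/%Y")    # read as February 1st

# ISO 8601 has a single reading, with or without a timezone offset.
iso_naive = datetime.fromisoformat("2023-10-27T15:30:00")
iso_offset = datetime.fromisoformat("2023-10-27 15:30:00+05:30")

print(month_first.date(), day_first.date())
print(iso_naive, iso_offset.utcoffset())
```

If your export tool lets you choose the format, picking an ISO 8601 variant removes the need to guess the order at all.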
Dependencies
See requirements.txt for a full list of dependencies.
Installation
Option 1: Local Installation
- Clone or download the project files.
- Install dependencies:
    pip install -r requirements.txt
- Download spaCy models:
    python -m spacy download en_core_web_sm
    python -m spacy download xx_ent_wiki_sm
Option 2: Docker Installation (Recommended)
- Using Docker Compose (easiest):
    docker-compose up --build
- Access the application: Open your browser and go to http://localhost:8501.
Usage
Start the Streamlit application:
    streamlit run app.py
Open your browser and navigate to the local URL provided by Streamlit (usually http://localhost:8501).
Follow the steps in the application:
- 1. Upload CSV File: Click "Browse files" to upload your dataset.
- 2. Map Data Columns: Once uploaded, select which of your columns correspond to User ID, Post Content, and Timestamp.
- 3. Configure Analysis:
  - Language Model: Choose english for English-only data or multilingual for other languages.
  - Number of Topics: Enter a specific number of meaningful topics to find, or use -1 to let the model decide automatically.
  - Text Preprocessing: Expand the advanced options to select cleaning steps like lowercasing, punctuation removal, and more.
  - Custom Stopwords: (Optional) Enter comma-separated words to exclude from analysis.
- 4. Run Analysis: Click the "Run Full Analysis" button.
Explore the results in the interactive sections of the main panel.
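For readers curious about what the preprocessing and custom stopword options amount to, here is a minimal sketch. The `parse_stopwords` and `preprocess` helpers are hypothetical, and the two cleaning flags stand in for the app's advanced options, whose exact behavior is not documented here.

```python
import string

def parse_stopwords(raw):
    """Split a comma-separated string into a set of lowercase stopwords."""
    return {word.strip().lower() for word in raw.split(",") if word.strip()}

def preprocess(text, stopwords, lowercase=True, strip_punctuation=True):
    """Apply the optional cleaning steps, then drop stopwords."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # Remove every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)

# Extra spaces and a trailing comma in the input are tolerated.
stopwords = parse_stopwords("the, RT , via,")
print(preprocess("RT: The new policy, via @news!", stopwords))
```

With both cleaning steps enabled, the sample post is reduced to the words that carry topical meaning.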
Exploring the Interface
The application provides a series of detailed sections:
Overview & Preprocessing
- Key metrics (total posts, unique users), dataset time range, and a topic coherence score.
- A sample of your data showing the original and processed text.
Topic Visualization & Refinement
- Word Clouds: Visual representation of the most important words for top topics.
- Interactive Word Lists: Interactively select words from topic lists to add them to your custom stopwords for re-analysis.
Topic Evolution
- An interactive line chart showing how topic frequencies change over the entire dataset's timespan.
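The data behind such a chart is a count of posts per topic in each time bin. The sketch below assumes monthly bins and a hypothetical `topic_counts_by_month` helper; the app may bin time differently.

```python
from collections import Counter, defaultdict
from datetime import datetime

def topic_counts_by_month(posts):
    """Count posts per topic for each calendar month.

    `posts` is an iterable of (iso_timestamp, topic_id) pairs.
    """
    counts = defaultdict(Counter)
    for timestamp, topic in posts:
        month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        counts[month][topic] += 1
    # Sort by month so the result is ready to plot left to right.
    return {month: dict(c) for month, c in sorted(counts.items())}

posts = [
    ("2023-10-27 15:30:00", 0),
    ("2023-10-28 09:00:00", 1),
    ("2023-10-30 12:00:00", 0),
    ("2023-11-02 08:15:00", 1),
]
evolution = topic_counts_by_month(posts)
print(evolution)
```

Each month maps to a dictionary of topic counts, which corresponds to one point per topic line in the chart.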
User Engagement Profile
- A scatter plot visualizing the relationship between the number of posts a user makes and the diversity of their topics.
- An expandable section showing the distribution of users by their post count.
User Deep Dive
- Select a specific user to analyze.
- View their key metrics, overall topic distribution pie chart, and their personal topic evolution over time.
- See detailed tables of their topic breakdown and their most recent posts.
Narrative Overlap Analysis
- Select a user to find other users who discuss a similar mix of topics.
- Use the slider to adjust the similarity threshold.
- The results table shows the overlap score and post count of similar users, providing context on both narrative alignment and engagement level.
Understanding the Results
Gini Impurity Index
This application uses the Gini Impurity Index, a measure of diversity.
- Range: 0 to 1
- User Gini (Topic Diversity): Measures how diverse a user's topics are. 0 = perfectly specialized (posts on only one topic); values approaching 1 = highly diverse (posts spread evenly across many topics).
- Topic Gini (User Diversity): Measures how concentrated a topic is among users. 0 = dominated by a single user; values approaching 1 = widely and evenly discussed by many users.
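As a reference point, the standard Gini impurity of a set of labels is 1 minus the sum of squared label proportions. The sketch below implements that textbook definition; whether the app applies any additional normalization is an assumption this README does not settle. Under this definition, K equally frequent labels give 1 - 1/K, so the score approaches 1 only as the number of topics (or users) grows.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# User Gini: the labels are the topic ids of one user's posts.
specialist = [3, 3, 3, 3]   # every post on topic 3
generalist = [0, 1, 2, 3]   # one post on each of four topics

print(gini_impurity(specialist))
print(gini_impurity(generalist))
```

For Topic Gini, the same function can be applied to the user ids of the posts assigned to a single topic.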
Narrative Overlap Score
- Range: 0 to 1
- This score measures the cosine similarity between the topic distributions of two users.
- A score of 1.0 means the two users have an identical proportional interest in topics (e.g., both are 100% focused on Topic 3).
- A score of 0.0 means their topic interests are completely different.
- This helps identify users with similar narrative focus, regardless of their total post count.
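The score can be reproduced with a few lines of code. The sketch below computes cosine similarity between per-topic post counts and filters users by a threshold, mirroring the slider in the dashboard. The user names, the `similar_users` helper, and the use of raw counts as topic distributions are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_users(target, profiles, threshold):
    """Return (user, score) pairs at or above the threshold, best first."""
    scores = [
        (user, cosine_similarity(profiles[target], vec))
        for user, vec in profiles.items()
        if user != target
    ]
    return sorted((s for s in scores if s[1] >= threshold), key=lambda s: -s[1])

# Posts per topic (topics 0, 1, 2) for three hypothetical users.
profiles = {"alice": [10, 0, 0], "bob": [2, 0, 0], "carol": [0, 5, 5]}
matches = similar_users("alice", profiles, threshold=0.5)
print(matches)
```

Here alice and bob match perfectly despite very different post counts, because only the proportions across topics matter, while carol shares no topics with alice and is filtered out.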