Spaces: Running on Zero

Commit · 9a0231a
Parent(s): 5ed844b
Enhanced user interface and documentation for topic extraction tool. Updated file input labels for clarity, improved README instructions, and refined progress tracking in API calls. Added time tracking for LLM calls in output files.
Files changed:

- .github/workflows/sync_to_hf.yml +6 -4
- README.md +1 -1
- app.py +4 -4
- tools/combine_sheets_into_xlsx.py +13 -4
- tools/llm_api_call.py +5 -4
- tools/prompts.py +1 -1
.github/workflows/sync_to_hf.yml
CHANGED

```diff
@@ -3,18 +3,20 @@ on:
   push:
     branches: [main]
 
-
-
+permissions:
+  contents: read
 
 jobs:
   sync-to-hub:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@
+      - uses: actions/checkout@v4
         with:
           fetch-depth: 0
           lfs: true
       - name: Push to hub
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
-
+          HF_USERNAME: ${{ secrets.HF_USERNAME }}
+          HF_REPO_ID: ${{ secrets.HF_REPO_ID }}
+        run: git push https://$HF_USERNAME:[email protected]/spaces/$HF_USERNAME/$HF_REPO_ID main
```
README.md
CHANGED

```diff
@@ -11,7 +11,7 @@ license: agpl-3.0
 
 # Large language model topic modelling
 
-Extract topics and summarise outputs using Large Language Models (LLMs, Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini, Azure, or AWS Bedrock models (e.g. Claude, Nova models). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and a topic summary. Instructions on use can be found in the README.md file. You can try out examples by clicking on one of the example datasets
+Extract topics and summarise outputs using Large Language Models (LLMs, Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini, Azure, or AWS Bedrock models (e.g. Claude, Nova models). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and a topic summary. Instructions on use can be found in the README.md file. You can try out examples by clicking on one of the example datasets on the main app page, which will show you example outputs from a local model run. API keys for AWS, Azure, and Gemini services can be entered on the settings page (note that Gemini has a free public API).
 
 NOTE: Large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.
```
app.py
CHANGED

```diff
@@ -59,7 +59,7 @@ else: default_model_choice = "gemini-2.5-flash"
 in_data_files = gr.File(height=FILE_INPUT_HEIGHT, label="Choose Excel or csv files", file_count="multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
 in_colnames = gr.Dropdown(choices=[""], multiselect=False, label="Select the open text column of interest. In an Excel file, this shows columns across all sheets.", allow_custom_value=True, interactive=True)
 context_textbox = gr.Textbox(label="Write up to one sentence giving context to the large language model for your task (e.g. 'Consultation for the construction of flats on Main Street')")
-topic_extraction_output_files_xlsx = gr.File(label="Overall summary xlsx file", scale=1, interactive=False, file_count="multiple")
+topic_extraction_output_files_xlsx = gr.File(label="Overall summary xlsx file. CSV outputs are available on the 'Advanced' tab.", scale=1, interactive=False, file_count="multiple")
 display_topic_table_markdown = gr.Markdown(value="", show_copy_button=True)
 output_messages_textbox = gr.Textbox(value="", label="Output messages", scale=1, interactive=False, lines=4)
 candidate_topics = gr.File(height=FILE_INPUT_HEIGHT, label="Input topics from file (csv). File should have at least one column with a header, and all topic names below this. Using the headers 'General topic' and/or 'Subtopic' will allow for these columns to be suggested to the model. If a third column is present, it will be assumed to be a topic description.", file_count="single")
@@ -162,7 +162,7 @@ with app:
 gr.Markdown("""# Large language model topic modelling
 
-Extract topics and summarise outputs using Large Language Models (LLMs, Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini, Azure, or AWS Bedrock models (e.g. Claude, Nova models). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and a topic summary. Instructions on use can be found in the README.md file. You can try out examples by clicking on one of the example datasets
+Extract topics and summarise outputs using Large Language Models (LLMs, Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini, Azure, or AWS Bedrock models (e.g. Claude, Nova models). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and a topic summary. Instructions on use can be found in the README.md file. You can try out examples by clicking on one of the example datasets below. API keys for AWS, Azure, and Gemini services can be entered on the settings page (note that Gemini has a free public API).
 
 NOTE: Large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.""")
@@ -190,7 +190,7 @@ with app:
 example_labels=["Main Street construction consultation", "Case notes for young people", "Main Street construction consultation with suggested topics", "Case notes grouped by person with suggested topics", "Case notes structured summary with suggested topics"],
-label="Try topic extraction and summarisation with an example dataset",
+label="Try topic extraction and summarisation with an example dataset. Example outputs are displayed. Click the 'Extract topics...' button below to rerun the analysis.",
 fn=show_info_box_on_click,
 run_on_click=True,
@@ -206,7 +206,7 @@ with app:
 in_excel_sheets = gr.Dropdown(multiselect=False, label="Select the Excel sheet of interest.", visible=False, allow_custom_value=True)
 in_colnames.render()
 
-with gr.Accordion("Group analysis by values in another column", open=False):
+with gr.Accordion("Group analysis by values in another column", open=False):
 in_group_col.render()
 
 with gr.Accordion("Provide list of suggested topics", open=False):
```
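The `candidate_topics` file label above spells out an expected CSV layout: a header row, topic names beneath it, and an optional third column treated as a topic description. As a minimal illustrative sketch of that layout (the parsing below is not the app's actual loading code, and the data is invented):

```python
import io

import pandas as pd

# Illustrative candidate-topics CSV, using the headers named in the label above.
candidate_csv = io.StringIO(
    "General topic,Subtopic,Description\n"
    "Transport,Parking,Concerns about parking availability\n"
    "Transport,Congestion,Concerns about traffic levels\n"
)

topics_df = pd.read_csv(candidate_csv)

# Per the label text: topic names sit under the headers, and a third column,
# if present, is assumed to be a topic description.
has_description = topics_df.shape[1] >= 3
print(topics_df["General topic"].tolist())  # ['Transport', 'Transport']
print(has_description)                      # True
```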
tools/combine_sheets_into_xlsx.py
CHANGED

```diff
@@ -21,6 +21,7 @@ def add_cover_sheet(
     llm_call_number:int,
     input_tokens:int,
     output_tokens:int,
+    time_taken:float,
     file_name:str,
     column_name:str,
     number_of_responses_with_topic_assignment:int,
@@ -57,6 +58,7 @@ def add_cover_sheet(
     "Number of LLM calls": llm_call_number,
     "Total number of input tokens from LLM calls": input_tokens,
     "Total number of output tokens from LLM calls": output_tokens,
+    "Total time taken for all LLM calls (seconds)": time_taken,
     }
 
     for i, (label, value) in enumerate(metadata.items()):
@@ -84,6 +86,7 @@ def csvs_to_excel(
     llm_call_number:int=0,
     input_tokens:int=0,
     output_tokens:int=0,
+    time_taken:float=0,
     number_of_responses:int=0,
     number_of_responses_with_text:int=0,
     number_of_responses_with_text_five_plus_words:int=0,
@@ -152,6 +155,7 @@ def csvs_to_excel(
     llm_call_number = llm_call_number,
     input_tokens = input_tokens,
     output_tokens = output_tokens,
+    time_taken = time_taken,
     file_name=file_name,
     column_name=column_name,
     number_of_responses_with_topic_assignment=number_of_responses_with_topic_assignment
@@ -166,7 +170,7 @@ def csvs_to_excel(
 ###
 # Run the functions
 ###
-def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:list[str], reference_data_file_name_textbox:str, in_group_col:str, model_choice:str, master_reference_df_state:pd.DataFrame, master_unique_topics_df_state:pd.DataFrame, summarised_output_df:pd.DataFrame, missing_df_state:pd.DataFrame, excel_sheets:str, usage_logs_location:str="", model_name_map:dict=
+def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:list[str], reference_data_file_name_textbox:str, in_group_col:str, model_choice:str, master_reference_df_state:pd.DataFrame, master_unique_topics_df_state:pd.DataFrame, summarised_output_df:pd.DataFrame, missing_df_state:pd.DataFrame, excel_sheets:str="", usage_logs_location:str="", model_name_map:dict=dict(), output_folder:str=OUTPUT_FOLDER, structured_summaries:str="No"):
     '''
     Collect together output CSVs from various output boxes and combine them into a single output Excel file.
 
@@ -226,7 +230,6 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     # Create structured summary from master_unique_topics_df_state
     structured_summary_data = list()
 
-    print("master_unique_topics_df_state:", master_unique_topics_df_state)
     # Group by 'Group' column
     for group_name, group_df in master_unique_topics_df_state.groupby('Group'):
         group_summary = f"## {group_name}\n\n"
@@ -267,7 +270,7 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     else:
         # Use original summarised_output_df
         structured_summary_df = summarised_output_df
-    structured_summary_df.to_csv(overall_summary_csv_path, index = None)
+    structured_summary_df.to_csv(overall_summary_csv_path, index = None)
 
     if not structured_summary_df.empty:
         csv_files.append(overall_summary_csv_path)
@@ -410,22 +413,27 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     if usage_logs_location:
         try:
             usage_logs = pd.read_csv(usage_logs_location)
-
+
+            relevant_logs = usage_logs.loc[(usage_logs["Reference data file name"] == reference_data_file_name_textbox) & (usage_logs["Large language model for topic extraction and summarisation"]==model_choice) & (usage_logs["Select the open text column of interest. In an Excel file, this shows columns across all sheets."]==chosen_cols),:]
+
             llm_call_number = sum(relevant_logs["Total LLM calls"].astype(int))
             input_tokens = sum(relevant_logs["Total input tokens"].astype(int))
             output_tokens = sum(relevant_logs["Total output tokens"].astype(int))
+            time_taken = sum(relevant_logs["Estimated time taken (seconds)"].astype(float))
         except Exception as e:
             print("Could not obtain usage logs due to:", e)
             usage_logs = pd.DataFrame()
             llm_call_number = 0
             input_tokens = 0
             output_tokens = 0
+            time_taken = 0
     else:
         print("LLM call logs location not provided")
         usage_logs = pd.DataFrame()
         llm_call_number = 0
         input_tokens = 0
         output_tokens = 0
+        time_taken = 0
 
     # Create short filename:
     model_choice_clean_short = clean_column_name(model_name_map[model_choice]["short_name"], max_length=20, front_characters=False)
@@ -449,6 +457,7 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     llm_call_number = llm_call_number,
     input_tokens = input_tokens,
     output_tokens = output_tokens,
+    time_taken = time_taken,
     number_of_responses = number_of_responses,
     number_of_responses_with_text = number_of_responses_with_text,
     number_of_responses_with_text_five_plus_words = number_of_responses_with_text_five_plus_words,
```
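The usage-log change above filters the log CSV down to rows matching the current run (`relevant_logs`), then sums each usage column, now including the new time column. A small self-contained sketch of the same pandas filter-and-sum pattern, using made-up data and just one of the filter conditions:

```python
import pandas as pd

# Toy usage-log table with the column names the diff above sums over;
# the rows themselves are invented for illustration.
usage_logs = pd.DataFrame({
    "Reference data file name": ["survey.csv", "survey.csv", "other.csv"],
    "Total LLM calls": [3, 2, 5],
    "Total input tokens": [1200, 800, 999],
    "Total output tokens": [400, 300, 111],
    "Estimated time taken (seconds)": [12.5, 7.5, 99.0],
})

# Boolean-mask filter to one input file, mirroring the relevant_logs
# selection, then sum each usage column for the cover sheet.
relevant_logs = usage_logs.loc[usage_logs["Reference data file name"] == "survey.csv", :]
llm_call_number = sum(relevant_logs["Total LLM calls"].astype(int))
input_tokens = sum(relevant_logs["Total input tokens"].astype(int))
output_tokens = sum(relevant_logs["Total output tokens"].astype(int))
time_taken = sum(relevant_logs["Estimated time taken (seconds)"].astype(float))

print(llm_call_number, input_tokens, output_tokens, time_taken)  # 5 2000 700 20.0
```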
tools/llm_api_call.py
CHANGED

```diff
@@ -1488,12 +1488,13 @@ def wrapper_extract_topics_per_column_value(
     wrapper_first_loop = initial_first_loop_state
 
     if len(unique_values) == 1:
-
+        # If only one unique value, no need for progress bar, iterate directly
+        loop_object = unique_values
     else:
-
+        # If multiple unique values, use tqdm progress bar
+        loop_object = progress.tqdm(unique_values, desc=f"Analysing group", total=len(unique_values), unit="groups")
 
-
-    for i, group_value in loop_object:
+    for i, group_value in enumerate(loop_object):
         print(f"\nProcessing group: {grouping_col} = {group_value} ({i+1}/{len(unique_values)})")
 
         filtered_file_data = file_data.copy()
```
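The fix above replaces `for i, group_value in loop_object:` (which tries to unpack each group value as a pair) with `enumerate(loop_object)`, so the same loop body works whether `loop_object` is the plain list or the progress-wrapped iterable. The pattern can be sketched without Gradio; `wrap` below is a hypothetical stand-in for `progress.tqdm`:

```python
def iterate_groups(unique_values, wrap=None):
    """Pick a plain or progress-wrapped iterable, then enumerate it uniformly.

    `wrap` stands in for a progress-bar wrapper such as progress.tqdm
    (illustrative only); with a single group there is nothing worth tracking.
    """
    if len(unique_values) == 1:
        loop_object = unique_values  # one group: no progress bar needed
    else:
        loop_object = wrap(unique_values) if wrap else unique_values
    # enumerate() supplies the index that the old
    # `for i, group_value in loop_object` form could not provide.
    for i, group_value in enumerate(loop_object):
        yield i, group_value

print(list(iterate_groups(["North", "South"])))  # [(0, 'North'), (1, 'South')]
```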
tools/prompts.py
CHANGED

```diff
@@ -4,7 +4,7 @@
 
 generic_system_prompt = """You are a researcher analysing responses from an open text dataset. You are analysing a single column from this dataset."""
 
-system_prompt = """You are a researcher analysing responses from an open text dataset. You are analysing a single column from this dataset called '{column_name}'. {consultation_context}
+system_prompt = """You are a researcher analysing responses from an open text dataset. You are analysing a single column from this dataset called '{column_name}'. {consultation_context}"""
 
 markdown_additional_prompt = """ You will be given a request for a markdown table. You must respond with ONLY the markdown table. Do not include any introduction, explanation, or concluding text."""
```
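The change above adds the closing `"""` that the old `system_prompt` string was missing, which would otherwise have left the string literal unterminated at that point. With the fix, the template expands as a normal `str.format` string; a quick illustrative expansion (the fill values are invented):

```python
system_prompt = """You are a researcher analysing responses from an open text dataset. You are analysing a single column from this dataset called '{column_name}'. {consultation_context}"""

# Fill the two placeholders with example values (invented for illustration).
filled = system_prompt.format(
    column_name="Response",
    consultation_context="Consultation for the construction of flats on Main Street.",
)
print(filled)
```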