seanpedrickcase committed
Commit 9a0231a · 1 Parent(s): 5ed844b

Enhanced user interface and documentation for topic extraction tool. Updated file input labels for clarity, improved README instructions, and refined progress tracking in API calls. Added time tracking for LLM calls in output files.
.github/workflows/sync_to_hf.yml CHANGED
@@ -3,18 +3,20 @@ on:
   push:
     branches: [main]
 
-  # to run this workflow manually from the Actions tab
-  workflow_dispatch:
+permissions:
+  contents: read
 
 jobs:
   sync-to-hub:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
         with:
           fetch-depth: 0
           lfs: true
       - name: Push to hub
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
-        run: git push https://seanpedrickcase:$HF_TOKEN@huggingface.co/spaces/seanpedrickcase/llm_topic_modelling main
+          HF_USERNAME: ${{ secrets.HF_USERNAME }}
+          HF_REPO_ID: ${{ secrets.HF_REPO_ID }}
+        run: git push https://$HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/$HF_USERNAME/$HF_REPO_ID main
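The workflow change above moves the hardcoded username and Space name into the `HF_USERNAME` and `HF_REPO_ID` secrets. As a rough Python sketch of the push URL the `run` step expands to (all values below are placeholders, not real credentials):

```python
# Placeholder values standing in for the GitHub Actions secrets
# HF_USERNAME, HF_TOKEN and HF_REPO_ID (hypothetical, for illustration only).
env = {
    "HF_USERNAME": "example-user",
    "HF_TOKEN": "hf_example_token",
    "HF_REPO_ID": "example-space",
}

# The workflow's `run` step pushes to a remote of this shape:
remote = (
    f"https://{env['HF_USERNAME']}:{env['HF_TOKEN']}"
    f"@huggingface.co/spaces/{env['HF_USERNAME']}/{env['HF_REPO_ID']}"
)
print(remote)
```

Because the token is embedded in the URL, it is only ever read from a secret and never committed to the repository.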
README.md CHANGED
@@ -11,7 +11,7 @@ license: agpl-3.0
 
 # Large language model topic modelling
 
-Extract topics and summarise outputs using Large Language Models (LLMs, Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini, Azure, or AWS Bedrock models (e.g. Claude, Nova models). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and a topic summary. Instructions on use can be found in the README.md file. You can try out examples by clicking on one of the example datasets under 'Test with an example dataset' below, which will show you example outputs from a local model run. API keys for AWS, Azure, and Gemini services can be entered on the settings page (note that Gemini has a free public API).
+Extract topics and summarise outputs using Large Language Models (LLMs, Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini, Azure, or AWS Bedrock models (e.g. Claude, Nova models). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and a topic summary. Instructions on use can be found in the README.md file. You can try out examples by clicking on one of the example datasets on the main app page, which will show you example outputs from a local model run. API keys for AWS, Azure, and Gemini services can be entered on the settings page (note that Gemini has a free public API).
 
 NOTE: Large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.
app.py CHANGED
@@ -59,7 +59,7 @@ else: default_model_choice = "gemini-2.5-flash"
         in_data_files = gr.File(height=FILE_INPUT_HEIGHT, label="Choose Excel or csv files", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
         in_colnames = gr.Dropdown(choices=[""], multiselect = False, label="Select the open text column of interest. In an Excel file, this shows columns across all sheets.", allow_custom_value=True, interactive=True)
         context_textbox = gr.Textbox(label="Write up to one sentence giving context to the large language model for your task (e.g. 'Consultation for the construction of flats on Main Street')")
-        topic_extraction_output_files_xlsx = gr.File(label="Overall summary xlsx file", scale=1, interactive=False, file_count="multiple")
+        topic_extraction_output_files_xlsx = gr.File(label="Overall summary xlsx file. CSV outputs are available on the 'Advanced' tab.", scale=1, interactive=False, file_count="multiple")
         display_topic_table_markdown = gr.Markdown(value="", show_copy_button=True)
         output_messages_textbox = gr.Textbox(value="", label="Output messages", scale=1, interactive=False, lines=4)
         candidate_topics = gr.File(height=FILE_INPUT_HEIGHT, label="Input topics from file (csv). File should have at least one column with a header, and all topic names below this. Using the headers 'General topic' and/or 'Subtopic' will allow for these columns to be suggested to the model. If a third column is present, it will be assumed to be a topic description.", file_count="single")
@@ -162,7 +162,7 @@ with app:
 
     gr.Markdown("""# Large language model topic modelling
 
-    Extract topics and summarise outputs using Large Language Models (LLMs, Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini, Azure, or AWS Bedrock models (e.g. Claude, Nova models). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and a topic summary. Instructions on use can be found in the README.md file. You can try out examples by clicking on one of the example datasets under 'Test with an example dataset' below, which will show you example outputs from a local model run. API keys for AWS, Azure, and Gemini services can be entered on the settings page (note that Gemini has a free public API).
+    Extract topics and summarise outputs using Large Language Models (LLMs, Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini, Azure, or AWS Bedrock models (e.g. Claude, Nova models). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and a topic summary. Instructions on use can be found in the README.md file. You can try out examples by clicking on one of the example datasets below. API keys for AWS, Azure, and Gemini services can be entered on the settings page (note that Gemini has a free public API).
 
     NOTE: Large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.""")
 
@@ -190,7 +190,7 @@ with app:
 
         example_labels=["Main Street construction consultation", "Case notes for young people", "Main Street construction consultation with suggested topics", "Case notes grouped by person with suggested topics", "Case notes structured summary with suggested topics"],
 
-        label="Try topic extraction and summarisation with an example dataset",
+        label="Try topic extraction and summarisation with an example dataset. Example outputs are displayed. Click the 'Extract topics...' button below to rerun the analysis.",
 
         fn=show_info_box_on_click,
         run_on_click=True,
@@ -206,7 +206,7 @@ with app:
     in_excel_sheets = gr.Dropdown(multiselect = False, label="Select the Excel sheet of interest.", visible=False, allow_custom_value=True)
     in_colnames.render()
 
-    with gr.Accordion("Group analysis by values in another column", open=False):
+    with gr.Accordion("Group analysis by values in another column", open=False):
         in_group_col.render()
 
     with gr.Accordion("Provide list of suggested topics", open = False):
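The `candidate_topics` label above describes the expected shape of a suggested-topics CSV: at least one header column, with the headers 'General topic' and/or 'Subtopic' recognised, and an optional third column treated as a topic description. A minimal sketch of building such a file (the row values are invented; only the column names follow the label text):

```python
import io
import pandas as pd

# 'General topic' and 'Subtopic' headers are recognised by the app per the
# label text; a third column is assumed to be a topic description.
suggested_topics = pd.DataFrame({
    "General topic": ["Transport", "Housing"],
    "Subtopic": ["Road safety", "Affordable housing"],
    "Description": [
        "Concerns about traffic and pedestrian safety",
        "Availability and cost of new flats",
    ],
})

# Serialise to CSV (an in-memory buffer here; in practice this would be a file)
buffer = io.StringIO()
suggested_topics.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```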
tools/combine_sheets_into_xlsx.py CHANGED
@@ -21,6 +21,7 @@ def add_cover_sheet(
     llm_call_number:int,
     input_tokens:int,
     output_tokens:int,
+    time_taken:float,
     file_name:str,
     column_name:str,
     number_of_responses_with_topic_assignment:int,
@@ -57,6 +58,7 @@ def add_cover_sheet(
         "Number of LLM calls": llm_call_number,
         "Total number of input tokens from LLM calls": input_tokens,
         "Total number of output tokens from LLM calls": output_tokens,
+        "Total time taken for all LLM calls (seconds)": time_taken,
     }
 
     for i, (label, value) in enumerate(metadata.items()):
@@ -84,6 +86,7 @@ def csvs_to_excel(
     llm_call_number:int=0,
     input_tokens:int=0,
     output_tokens:int=0,
+    time_taken:float=0,
     number_of_responses:int=0,
     number_of_responses_with_text:int=0,
     number_of_responses_with_text_five_plus_words:int=0,
@@ -152,6 +155,7 @@ def csvs_to_excel(
         llm_call_number = llm_call_number,
         input_tokens = input_tokens,
         output_tokens = output_tokens,
+        time_taken = time_taken,
         file_name=file_name,
         column_name=column_name,
         number_of_responses_with_topic_assignment=number_of_responses_with_topic_assignment
@@ -166,7 +170,7 @@
 ###
 # Run the functions
 ###
-def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:list[str], reference_data_file_name_textbox:str, in_group_col:str, model_choice:str, master_reference_df_state:pd.DataFrame, master_unique_topics_df_state:pd.DataFrame, summarised_output_df:pd.DataFrame, missing_df_state:pd.DataFrame, excel_sheets:str, usage_logs_location:str="", model_name_map:dict={}, output_folder:str=OUTPUT_FOLDER, structured_summaries:str="No"):
+def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:list[str], reference_data_file_name_textbox:str, in_group_col:str, model_choice:str, master_reference_df_state:pd.DataFrame, master_unique_topics_df_state:pd.DataFrame, summarised_output_df:pd.DataFrame, missing_df_state:pd.DataFrame, excel_sheets:str="", usage_logs_location:str="", model_name_map:dict=dict(), output_folder:str=OUTPUT_FOLDER, structured_summaries:str="No"):
     '''
     Collect together output CSVs from various output boxes and combine them into a single output Excel file.
 
@@ -226,7 +230,6 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     # Create structured summary from master_unique_topics_df_state
     structured_summary_data = list()
 
-    print("master_unique_topics_df_state:", master_unique_topics_df_state)
     # Group by 'Group' column
     for group_name, group_df in master_unique_topics_df_state.groupby('Group'):
         group_summary = f"## {group_name}\n\n"
@@ -267,7 +270,7 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     else:
         # Use original summarised_output_df
         structured_summary_df = summarised_output_df
-        structured_summary_df.to_csv(overall_summary_csv_path, index = None)
+    structured_summary_df.to_csv(overall_summary_csv_path, index = None)
 
     if not structured_summary_df.empty:
         csv_files.append(overall_summary_csv_path)
@@ -410,22 +413,27 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     if usage_logs_location:
         try:
             usage_logs = pd.read_csv(usage_logs_location)
-            relevant_logs = usage_logs.loc[(usage_logs["Reference data file name"] == reference_data_file_name_textbox) & (usage_logs["LLM model"]==model_choice) & (usage_logs["Select the open text column of interest. In an Excel file, this shows columns across all sheets."]==chosen_cols),:]
+
+            relevant_logs = usage_logs.loc[(usage_logs["Reference data file name"] == reference_data_file_name_textbox) & (usage_logs["Large language model for topic extraction and summarisation"]==model_choice) & (usage_logs["Select the open text column of interest. In an Excel file, this shows columns across all sheets."]==chosen_cols),:]
+
             llm_call_number = sum(relevant_logs["Total LLM calls"].astype(int))
             input_tokens = sum(relevant_logs["Total input tokens"].astype(int))
             output_tokens = sum(relevant_logs["Total output tokens"].astype(int))
+            time_taken = sum(relevant_logs["Estimated time taken (seconds)"].astype(float))
         except Exception as e:
             print("Could not obtain usage logs due to:", e)
             usage_logs = pd.DataFrame()
             llm_call_number = 0
             input_tokens = 0
             output_tokens = 0
+            time_taken = 0
     else:
         print("LLM call logs location not provided")
         usage_logs = pd.DataFrame()
         llm_call_number = 0
         input_tokens = 0
         output_tokens = 0
+        time_taken = 0
 
     # Create short filename:
     model_choice_clean_short = clean_column_name(model_name_map[model_choice]["short_name"], max_length=20, front_characters=False)
@@ -449,6 +457,7 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
         llm_call_number = llm_call_number,
         input_tokens = input_tokens,
         output_tokens = output_tokens,
+        time_taken = time_taken,
         number_of_responses = number_of_responses,
         number_of_responses_with_text = number_of_responses_with_text,
         number_of_responses_with_text_five_plus_words = number_of_responses_with_text_five_plus_words,
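The usage-log aggregation added above filters the log CSV down to the current file and model, then sums the call, token, and (newly) timing columns. A self-contained sketch of the same filter-and-sum pattern on an in-memory frame (column names follow the diff; the row values are invented):

```python
import pandas as pd

# Invented log rows; column names mirror those read in the diff above.
usage_logs = pd.DataFrame({
    "Reference data file name": ["consult.csv", "consult.csv", "other.csv"],
    "Large language model for topic extraction and summarisation": ["gemini-2.5-flash"] * 3,
    "Total LLM calls": [3, 2, 5],
    "Total input tokens": [1200, 800, 4000],
    "Total output tokens": [300, 200, 900],
    "Estimated time taken (seconds)": [12.5, 7.5, 60.0],
})

# Keep only rows matching the current run, mirroring the boolean-mask filter
mask = (usage_logs["Reference data file name"] == "consult.csv") & (
    usage_logs["Large language model for topic extraction and summarisation"] == "gemini-2.5-flash"
)
relevant_logs = usage_logs.loc[mask, :]

# Sum the metrics across the matching rows
llm_call_number = int(relevant_logs["Total LLM calls"].astype(int).sum())
input_tokens = int(relevant_logs["Total input tokens"].astype(int).sum())
output_tokens = int(relevant_logs["Total output tokens"].astype(int).sum())
time_taken = float(relevant_logs["Estimated time taken (seconds)"].astype(float).sum())
print(llm_call_number, input_tokens, output_tokens, time_taken)
```

Wrapping the lookup in `try`/`except` with zero fallbacks, as the diff does, keeps the Excel export working even when the log file is missing or its columns change.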
tools/llm_api_call.py CHANGED
@@ -1488,12 +1488,13 @@ def wrapper_extract_topics_per_column_value(
     wrapper_first_loop = initial_first_loop_state
 
     if len(unique_values) == 1:
-        loop_object = enumerate(unique_values)
+        # If only one unique value, no need for progress bar, iterate directly
+        loop_object = unique_values
     else:
-        loop_object = tqdm(enumerate(unique_values), desc=f"Analysing group", total=len(unique_values), unit="groups")
+        # If multiple unique values, use tqdm progress bar
+        loop_object = progress.tqdm(unique_values, desc=f"Analysing group", total=len(unique_values), unit="groups")
 
-    for i, group_value in loop_object:
+    for i, group_value in enumerate(loop_object):
         print(f"\nProcessing group: {grouping_col} = {group_value} ({i+1}/{len(unique_values)})")
 
         filtered_file_data = file_data.copy()
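The fix above wraps the progress bar around the raw values rather than around `enumerate(...)`, and moves `enumerate` to the loop statement, so the loop body unpacks `(i, group_value)` correctly in both branches (the old code yielded `(index, value)` tuples *into* the bar, so `group_value` in the bar's description was a tuple). A dependency-free sketch of the corrected shape, with a hypothetical stand-in for `progress.tqdm`:

```python
def fake_progress_bar(iterable, **kwargs):
    # Stand-in for tqdm/progress.tqdm: any wrapper that yields the
    # underlying items unchanged while reporting progress.
    yield from iterable

unique_values = ["Group A", "Group B"]

if len(unique_values) == 1:
    # Single group: iterate directly, no progress bar needed
    loop_object = unique_values
else:
    # Multiple groups: wrap the values (not enumerate) in the progress bar
    loop_object = fake_progress_bar(unique_values, desc="Analysing group",
                                    total=len(unique_values), unit="groups")

processed = []
for i, group_value in enumerate(loop_object):
    # group_value is now always a plain value, never an (index, value) tuple
    processed.append((i, group_value))

print(processed)
```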
tools/prompts.py CHANGED
@@ -4,7 +4,7 @@
 
 generic_system_prompt = """You are a researcher analysing responses from an open text dataset. You are analysing a single column from this dataset."""
 
-system_prompt = """You are a researcher analysing responses from an open text dataset. You are analysing a single column from this dataset called '{column_name}'. {consultation_context}."""
+system_prompt = """You are a researcher analysing responses from an open text dataset. You are analysing a single column from this dataset called '{column_name}'. {consultation_context}"""
 
 markdown_additional_prompt = """ You will be given a request for a markdown table. You must respond with ONLY the markdown table. Do not include any introduction, explanation, or concluding text."""
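The one-character change above drops the trailing full stop after `{consultation_context}`, presumably so the rendered prompt does not end in a stray or doubled full stop when the substituted context carries its own punctuation or is empty. A sketch of the substitution (the context value is invented):

```python
# Template as in the new version of tools/prompts.py
system_prompt = """You are a researcher analysing responses from an open text dataset. You are analysing a single column from this dataset called '{column_name}'. {consultation_context}"""

# Hypothetical values for illustration; the context string supplies its own
# terminating punctuation, so the template no longer adds one.
rendered = system_prompt.format(
    column_name="Response",
    consultation_context="Consultation for the construction of flats on Main Street.",
)
print(rendered)
```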