seanpedrickcase committed on
Commit 6f3d42c · 1 Parent(s): 9a0231a

Added deduplication with LLM functionality. Minor package updates. Updated installation documentation.
README.md CHANGED
@@ -26,7 +26,7 @@ Basic use:
 
 # Installation guide
 
-Here is a step-by-step guide to clone the repository, create a virtual environment, and install dependencies from a relevant `requirements` file. This guide assumes you have **Git** and **Python 3.11** installed.
+Here is a step-by-step guide to clone the repository, create a virtual environment, and install dependencies from the relevant `requirements` file. This guide assumes you have **Git** and **Python 3.11** installed.
 
 -----
 
@@ -37,11 +37,9 @@ First, you need to copy the project files to your local machine. Navigate to the
 1. **Clone the repo:**
 
 ```bash
-git clone https://github.com/example-user/example-repo.git
+git clone https://github.com/seanpedrick-case/llm_topic_modelling.git
 ```
 
-*Replace the URL with your repository's URL.*
-
 2. **Navigate into the new project folder:**
 
 ```bash
@@ -91,21 +89,27 @@ Now that your virtual environment is active, you can install all the required pa
 
 1. **Choose the relevant requirements file**
 
+**NOTE:** To start, I advise installing using the **requirements_no_local.txt** file, which installs the app without local model inference. This approach is much simpler as a first step, and avoids issues with the potentially complicated llama-cpp-python installation and GPU management described below.
+
 Llama-cpp-python version 3.16 is compatible with Gemma 3 and GPT-OSS models, but does not at the time of writing have relevant wheels for CPU inference or for Windows. A sister repository contains [llama-cpp-python 3.16 wheels for Python version 3.11/10](https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/tag/v0.1.0) so that users can avoid having to build the package from source. If you prefer to build from source, then please refer to the llama-cpp-python documentation [here](https://github.com/abetlen/llama-cpp-python). I also have a guide to building the package on a Windows system [here](https://github.com/seanpedrick-case/llm_topic_modelling/blob/main/windows_install_llama-cpp-python.txt).
 
 The repo provides several requirements files that are relevant for different situations. I would advise using requirements_gpu.txt for GPU environments, and requirements_cpu.txt for CPU environments:
 
-- **requirements_no_local**: Can be used to install the app without local model inference for a more lightweight installation.
+- **requirements_no_local.txt**: Can be used to install the app without local model inference for a more lightweight installation.
 - **requirements_gpu.txt**: Used for Python 3.11 GPU-enabled environments. Uncomment the requirements under 'Windows' for Windows compatibility (CUDA 12.4).
 - **requirements_cpu.txt**: Used for Python 3.11 CPU-only environments. Uncomment the requirements under 'Windows' for Windows compatibility. Make sure you have [Openblas](https://github.com/OpenMathLib/OpenBLAS) installed!
 - **requirements.txt**: Used for the Python 3.10 GPU-enabled environment on Hugging Face spaces (CUDA 12.4).
 
+The instructions below will guide you through installing the GPU-enabled version of the app for local inference.
+
 2. **Install packages from the requirements file:**
 ```bash
 pip install -r requirements_gpu.txt
 ```
 *This command reads every package name listed in the file and installs it into your `.venv` environment.*
 
+NOTE: If the default llama-cpp-python installation from the above does not work, go into the requirements_gpu.txt file and uncomment the lines that install a wheel for llama-cpp-python 0.3.16 relevant to your system.
+
 You're all set! ✅ Your project is cloned, and all dependencies are installed in an isolated environment.
 
 When you are finished working, you can leave the virtual environment by simply typing:
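The decision between the requirements files described above can also be sketched in code. A minimal illustrative helper (the `pick_requirements_file` function and the `nvidia-smi` heuristic are my own assumptions, not part of the repository):

```python
import shutil

def pick_requirements_file(local_inference: bool = True) -> str:
    """Suggest one of the repo's requirements files (illustrative only).

    Heuristic: finding `nvidia-smi` on PATH is taken as a rough sign of a
    usable NVIDIA GPU; it is not a guarantee of a working CUDA setup.
    """
    if not local_inference:
        # Lightweight install without local model inference (simplest first step)
        return "requirements_no_local.txt"
    if shutil.which("nvidia-smi"):
        return "requirements_gpu.txt"
    return "requirements_cpu.txt"

print(pick_requirements_file(local_inference=False))  # requirements_no_local.txt
```

The boolean parameter mirrors the README's advice: start without local inference, then move to the GPU or CPU file once the basic install works.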
app.py CHANGED
@@ -6,7 +6,7 @@ from datetime import datetime
 from tools.helper_functions import put_columns_in_df, get_connection_params, view_table, empty_output_vars_extract_topics, empty_output_vars_summarise, load_in_previous_reference_file, join_cols_onto_reference_df, load_in_previous_data_files, load_in_data_file, load_in_default_cost_codes, reset_base_dataframe, update_cost_code_dataframe_from_dropdown_select, df_select_callback_cost, enforce_cost_codes, _get_env_list, move_overall_summary_output_files_to_front_page
 from tools.aws_functions import upload_file_to_s3, download_file_from_s3
 from tools.llm_api_call import modify_existing_output_tables, wrapper_extract_topics_per_column_value, all_in_one_pipeline
-from tools.dedup_summaries import sample_reference_table_summaries, summarise_output_topics, deduplicate_topics, overall_summary
+from tools.dedup_summaries import sample_reference_table_summaries, summarise_output_topics, deduplicate_topics, deduplicate_topics_llm, overall_summary
 from tools.combine_sheets_into_xlsx import collect_output_csvs_and_create_excel_output
 from tools.custom_csvlogger import CSVLogger_custom
 from tools.auth import authenticate_user
@@ -171,7 +171,7 @@ with app:
 in_data_files, in_colnames, context_textbox, original_data_file_name_textbox, topic_extraction_output_files_xlsx, display_topic_table_markdown, output_messages_textbox, candidate_topics, produce_structured_summary_radio, in_group_col, batch_size_number,
 ):
 gr.Info(
-"Example data loaded. Now click on the 'All in one...' button below to run the full suite of topic extraction, deduplication, and summarisation."
+"Example data loaded. Now click on the 'Extract topics...' button below to run the full suite of topic extraction, deduplication, and summarisation."
 )
 
 examples = gr.Examples(examples=\
@@ -251,24 +251,23 @@ with app:
 
 with gr.Tab(label="Advanced - Step by step topic extraction and summarisation"):
 
-with gr.Accordion("1. Extract topics - go to first tab for file upload, model choice, and other settings before clicking this button", open = True):
+with gr.Accordion("1. Extract topics - go to first tab for file upload, model choice, and other settings before clicking this button", open = False):
 context_textbox.render()
 extract_topics_btn = gr.Button("1. Extract topics", variant="secondary")
-topic_extraction_output_files = gr.File(label="Extract topics output files", scale=1, interactive=False)
+topic_extraction_output_files = gr.File(label="Extract topics output files", scale=1, interactive=False, height=FILE_INPUT_HEIGHT)
 
 with gr.Accordion("2. Modify topics from topic extraction", open = False):
 gr.Markdown("""Load in previously completed Extract Topics output files ('reference_table', and 'unique_topics' files) to modify topics, deduplicate topics, or summarise the outputs. If you want pivot table outputs, please load in the original data file along with the selected open text column on the first tab before deduplicating or summarising.""")
 
-
-modification_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload files to modify topics", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
+modification_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload reference and unique topic files to modify topics", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
 
 modifiable_unique_topics_df_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=(4, "fixed"), row_count = (1, "fixed"), visible=True, type="pandas")
 
 save_modified_files_button = gr.Button(value="Save modified topic names")
 
-with gr.Accordion("3. Deduplicate topics - upload reference data file and unique data files", open = False):
+with gr.Accordion("3. Deduplicate topics using fuzzy matching or LLMs", open = False):
 ### DEDUPLICATION
-deduplication_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload files to deduplicate topics", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
+deduplication_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload reference and unique topic files to deduplicate topics. Optionally upload suggested topics on the first tab to match to these where possible with LLM deduplication", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
 deduplication_input_files_status = gr.Textbox(value = "", label="Previous file input", visible=False)
 
 with gr.Row():
@@ -276,11 +275,13 @@ with app:
 merge_sentiment_drop = gr.Dropdown(label="Merge sentiment values together for duplicate subtopics.", value="No", choices=["Yes", "No"])
 deduplicate_score_threshold = gr.Number(label="Similarity threshold with which to determine duplicates.", value = 90, minimum=5, maximum=100, precision=0)
 
-deduplicate_previous_data_btn = gr.Button("3. Deduplicate topics", variant="primary")
+with gr.Row():
+deduplicate_previous_data_btn = gr.Button("3. Deduplicate topics (Fuzzy matching)", variant="primary")
+deduplicate_llm_previous_data_btn = gr.Button("3b. Deduplicate topics (LLM semantic)", variant="secondary")
 
 with gr.Accordion("4. Summarise topics", open = False):
 ### SUMMARISATION
-summarisation_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload files to summarise", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
+summarisation_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload reference and unique topic files to summarise", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
 
 summarise_format_radio = gr.Radio(label="Choose summary type", value=two_para_summary_format_prompt, choices=[two_para_summary_format_prompt, single_para_summary_format_prompt])
 
@@ -476,6 +477,15 @@ with app:
 deduplicate_previous_data_btn.click(load_in_previous_data_files, inputs=[deduplication_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
 success(deduplicate_topics, inputs=[master_reference_df_state, master_unique_topics_df_state, working_data_file_name_textbox, unique_topics_table_file_name_textbox, in_excel_sheets, merge_sentiment_drop, merge_general_topics_drop, deduplicate_score_threshold, in_data_files, in_colnames, output_folder_state], outputs=[master_reference_df_state, master_unique_topics_df_state, summarisation_input_files, log_files_output, summarised_output_markdown], scroll_to_output=True, api_name="deduplicate_topics")
 
+# When LLM deduplication button pressed, deduplicate data using LLM
+def deduplicate_topics_llm_wrapper(reference_df, topic_summary_df, reference_table_file_name, unique_topics_table_file_name, model_choice, in_api_key, temperature, in_excel_sheets, merge_sentiment, merge_general_topics, in_data_files, chosen_cols, output_folder, candidate_topics=None):
+    model_source = model_name_map[model_choice]["source"]
+    return deduplicate_topics_llm(reference_df, topic_summary_df, reference_table_file_name, unique_topics_table_file_name, model_choice, in_api_key, temperature, model_source, None, None, None, None, in_excel_sheets, merge_sentiment, merge_general_topics, in_data_files, chosen_cols, output_folder, candidate_topics)
+
+deduplicate_llm_previous_data_btn.click(load_in_previous_data_files, inputs=[deduplication_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
+success(deduplicate_topics_llm_wrapper, inputs=[master_reference_df_state, master_unique_topics_df_state, working_data_file_name_textbox, unique_topics_table_file_name_textbox, model_choice, google_api_key_textbox, temperature_slide, in_excel_sheets, merge_sentiment_drop, merge_general_topics_drop, in_data_files, in_colnames, output_folder_state, candidate_topics], outputs=[master_reference_df_state, master_unique_topics_df_state, summarisation_input_files, log_files_output, summarised_output_markdown, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number], scroll_to_output=True, api_name="deduplicate_topics_llm").\
+success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False, api_name="usage_logs_llm_dedup")
+
 # When button pressed, summarise previous data
 summarise_previous_data_btn.click(empty_output_vars_summarise, inputs=None, outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, overall_summarisation_input_files]).\
 success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
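The fuzzy-matching button above passes `deduplicate_score_threshold` (default 90 on a 0–100 scale) into `deduplicate_topics`. The app itself uses rapidfuzz for this; a rough standalone sketch of the underlying idea using only stdlib `difflib`, with made-up topic names (this is not the app's actual matching logic):

```python
from difflib import SequenceMatcher

def merge_similar_topics(topics, threshold=90):
    """Map each topic to the first earlier topic scoring >= threshold (0-100)."""
    canonical: dict[str, str] = {}
    kept: list[str] = []
    for topic in topics:
        match = next(
            (k for k in kept
             if SequenceMatcher(None, topic.lower(), k.lower()).ratio() * 100 >= threshold),
            None,
        )
        canonical[topic] = match if match is not None else topic
        if match is None:
            kept.append(topic)
    return canonical

mapping = merge_similar_topics(["Road maintenance", "Road maintainance", "Parks"])
print(mapping["Road maintainance"])  # Road maintenance
```

The misspelled "Road maintainance" scores above 90 against "Road maintenance" and is folded into it, while "Parks" survives unchanged; the new LLM path is meant to catch duplicates that are semantically the same but score low on this kind of string similarity.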
requirements.txt CHANGED
@@ -1,9 +1,9 @@
-# Note that this requirements file is optimised for Hugging Face spaces / Python 3.10. Please use requirements_cpu.txt for CPU instances and requirements_gpu.txt for GPU instances using Python 3.11
+# Note that this requirements file is optimised for Hugging Face spaces / Python 3.10. Please use requirements_no_local.txt for installation without local model inference (simplest approach to get going). Please use requirements_cpu.txt for CPU instances and requirements_gpu.txt for GPU instances using Python 3.11
-pandas==2.3.2
+pandas==2.3.3
-gradio==5.48.0
+gradio==5.49.1
 transformers==4.56.0
-spaces==0.40.1
+spaces==0.42.1
-boto3==1.40.22
+boto3==1.40.48
 pyarrow==21.0.0
 openpyxl==3.1.5
 markdown==3.7
@@ -23,11 +23,7 @@ torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124
 unsloth[cu124-torch260]==2025.9.4
 unsloth_zoo==2025.9.5
 timm==1.0.19
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
-
-# CPU only (for e.g. Hugging Face CPU instances)
-#torch==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
-# For Hugging Face, need a python 3.10 compatible wheel for llama-cpp-python to avoid build timeouts
-#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
+# llama-cpp-python direct wheel link for GPU compatible version 3.16 for use with Python 3.10 and Hugging Face
+llama-cpp-python==0.3.16 @ https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
 
 
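The wheel filenames pinned above encode their compatibility as standard tags (`cp310` for CPython 3.10, `linux_x86_64` or `win_amd64` for the platform), which is why each Python version and OS needs its own link. A quick sketch of reading those fields (a naive split that ignores optional build tags):

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its tag fields (no build-tag handling)."""
    stem = filename.removesuffix(".whl")
    name, version, py_tag, abi_tag, platform = stem.split("-")
    return {"name": name, "version": version, "python": py_tag,
            "abi": abi_tag, "platform": platform}

tags = parse_wheel_filename("llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl")
print(tags["python"], tags["platform"])  # cp310 linux_x86_64
```

So the Hugging Face requirements file must point at a `cp310` Linux wheel, while the local-install files use `cp311` wheels for Linux or Windows.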
requirements_cpu.txt CHANGED
@@ -1,8 +1,8 @@
-pandas==2.3.2
+pandas==2.3.3
-gradio==5.48.0
+gradio==5.49.1
 transformers==4.56.0
-spaces==0.40.1
+spaces==0.42.1
-boto3==1.40.22
+boto3==1.40.48
 pyarrow==21.0.0
 openpyxl==3.1.5
 markdown==3.7
@@ -16,8 +16,10 @@ beautifulsoup4==4.12.3
 rapidfuzz==3.13.0
 python-dotenv==1.1.0
 torch==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
-# Linux, Python 3.11 compatible wheel available:
-#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
-# Windows, Python 3.11 compatible wheel available:
-https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64_cpu_openblas.whl
+llama-cpp-python==0.3.16 -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
+# Direct wheel links if above doesn't work
+# I have created CPU Linux, Python 3.11 compatible wheels:
+# llama-cpp-python==0.3.16 @ https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
+# Windows, Python 3.11 compatible CPU wheels available:
+# llama-cpp-python==0.3.16 @ https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64_cpu_openblas.whl
 # If above doesn't work for Windows, try looking at 'windows_install_llama-cpp-python.txt' for instructions on how to build from source
requirements_gpu.txt CHANGED
@@ -1,9 +1,8 @@
-pandas==2.3.2
+pandas==2.3.3
-gradio==5.48.0
+gradio==5.49.1
-huggingface_hub[hf_xet]==0.34.4
 transformers==4.56.0
-spaces==0.40.1
+spaces==0.42.1
-boto3==1.40.22
+boto3==1.40.48
 pyarrow==21.0.0
 openpyxl==3.1.5
 markdown==3.7
@@ -25,7 +24,11 @@ unsloth_zoo==2025.9.5
 #triton-windows<3.3
 timm==1.0.19
 # Llama CPP Python
+llama-cpp-python==0.3.16 -C cmake.args="-DGGML_CUDA=on"
+# If the above doesn't work, try specific wheels for your system:
 # For Linux:
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
-# For Windows:
-#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64.whl
+# See files in https://github.com/abetlen/llama-cpp-python/releases/tag/v0.3.16-cu124 for different python versions
+# Python 3.11 compatible wheel:
+# llama-cpp-python==0.3.16 @ https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
+# For Windows, not available at above link. I have made a GPU Windows wheel for Python 3.11:
+# llama-cpp-python==0.3.16 @ https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64.whl
requirements_no_local.txt CHANGED
@@ -1,9 +1,9 @@
 # This requirements file is optimised for AWS ECS using Python 3.11 alongside the Dockerfile, without local torch and llama-cpp-python. For AWS ECS, torch and llama-cpp-python are optionally installed in the main Dockerfile
-pandas==2.3.2
+pandas==2.3.3
-gradio==5.48.0
+gradio==5.49.1
 transformers==4.56.0
-spaces==0.40.1
+spaces==0.42.1
-boto3==1.40.22
+boto3==1.40.48
 pyarrow==21.0.0
 openpyxl==3.1.5
 markdown==3.7
tools/config.py CHANGED
@@ -380,7 +380,7 @@ LLM_THREADS = int(get_or_create_env_var('LLM_THREADS', '-1'))
 LLM_BATCH_SIZE = int(get_or_create_env_var('LLM_BATCH_SIZE', '512'))
 LLM_CONTEXT_LENGTH = int(get_or_create_env_var('LLM_CONTEXT_LENGTH', '32768'))
 LLM_SAMPLE = get_or_create_env_var('LLM_SAMPLE', 'True')
-LLM_STOP_STRINGS = get_or_create_env_var('LLM_STOP_STRINGS', r"[' ','\n\n\n\n','---------------------------------------------']")
+LLM_STOP_STRINGS = get_or_create_env_var('LLM_STOP_STRINGS', r"['\n\n\n\n\n\n']")
 MULTIMODAL_PROMPT_FORMAT = get_or_create_env_var('MULTIMODAL_PROMPT_FORMAT', 'False')
 SPECULATIVE_DECODING = get_or_create_env_var('SPECULATIVE_DECODING', 'False')
 NUM_PRED_TOKENS = int(get_or_create_env_var('NUM_PRED_TOKENS', '2'))
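Settings like `LLM_STOP_STRINGS` above are stored as a Python list literal inside an environment-variable string, so the default of six newlines round-trips as the raw text `['\n\n\n\n\n\n']`. The repo has its own `_get_env_list` helper in `tools/helper_functions`; this standalone version is an illustrative sketch of the same idea:

```python
import ast
import os

def parse_env_list(var_name: str, default: str) -> list:
    """Parse an env var holding a Python list literal into a real list."""
    raw = os.environ.get(var_name, default)
    value = ast.literal_eval(raw)  # safely evaluates literals only, not arbitrary code
    if not isinstance(value, list):
        raise ValueError(f"{var_name} must contain a list literal, got {type(value).__name__}")
    return value

# The raw string contains literal backslash-n pairs; literal_eval turns them into newlines
stop_strings = parse_env_list('LLM_STOP_STRINGS', r"['\n\n\n\n\n\n']")
print(len(stop_strings))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing safe even though the value comes from the environment.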
tools/dedup_summaries.py CHANGED
@@ -9,12 +9,12 @@ import markdown
9
  import boto3
10
  from tqdm import tqdm
11
  import os
12
-
13
- from tools.prompts import summarise_topic_descriptions_prompt, summarise_topic_descriptions_system_prompt, system_prompt, summarise_everything_prompt, comprehensive_summary_format_prompt, summarise_everything_system_prompt, comprehensive_summary_format_prompt_by_group, summary_assistant_prefill
14
- from tools.llm_funcs import construct_gemini_generative_model, process_requests, ResponseObject, load_model, calculate_tokens_from_metadata, construct_azure_client, get_model, get_tokenizer, get_assistant_model
15
- from tools.helper_functions import create_topic_summary_df_from_reference_table, load_in_data_file, get_basic_response_data, convert_reference_table_to_pivot_table, wrap_text, clean_column_name, get_file_name_no_ext, create_batch_file_path_details
16
  from tools.aws_functions import connect_to_bedrock_runtime
17
- from tools.config import OUTPUT_FOLDER, RUN_LOCAL_MODEL, MAX_COMMENT_CHARS, LLM_MAX_NEW_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, REASONING_SUFFIX, AZURE_INFERENCE_ENDPOINT, MAX_SPACES_GPU_RUN_TIME, OUTPUT_DEBUG_FILES
18
 
19
  max_tokens = LLM_MAX_NEW_TOKENS
20
  timeout_wait = TIMEOUT_WAIT
@@ -156,17 +156,16 @@ def deduplicate_topics(reference_df:pd.DataFrame,
156
  file_data = pd.DataFrame()
157
  deduplicated_unique_table_markdown = ""
158
 
 
159
  if (len(reference_df["Response References"].unique()) == 1) | (len(topic_summary_df["Topic_number"].unique()) == 1):
160
  print("Data file outputs are too short for deduplicating. Returning original data.")
161
 
162
- reference_file_out_path = output_folder + reference_table_file_name
163
  unique_topics_file_out_path = output_folder + unique_topics_table_file_name
164
 
165
  output_files.append(reference_file_out_path)
166
  output_files.append(unique_topics_file_out_path)
167
- return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown
168
-
169
-
170
 
171
  # For checking that data is not lost during the process
172
  initial_unique_references = len(reference_df["Response References"].unique())
@@ -376,6 +375,341 @@ def deduplicate_topics(reference_df:pd.DataFrame,
376
 
377
  return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown
378
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
379
  def sample_reference_table_summaries(reference_df:pd.DataFrame,
380
  random_seed:int,
381
  no_of_sampled_summaries:int=150):
 
import boto3
from tqdm import tqdm
import os
+ from tools.llm_api_call import generate_zero_shot_topics_df
+ from tools.prompts import summarise_topic_descriptions_prompt, summarise_topic_descriptions_system_prompt, system_prompt, summarise_everything_prompt, comprehensive_summary_format_prompt, summarise_everything_system_prompt, comprehensive_summary_format_prompt_by_group, summary_assistant_prefill, llm_deduplication_system_prompt, llm_deduplication_prompt, llm_deduplication_prompt_with_candidates
+ from tools.llm_funcs import construct_gemini_generative_model, process_requests, ResponseObject, load_model, calculate_tokens_from_metadata, construct_azure_client, get_model, get_tokenizer, get_assistant_model, send_request, call_llm_with_markdown_table_checks
+ from tools.helper_functions import create_topic_summary_df_from_reference_table, load_in_data_file, get_basic_response_data, convert_reference_table_to_pivot_table, wrap_text, clean_column_name, get_file_name_no_ext, create_batch_file_path_details, read_file
from tools.aws_functions import connect_to_bedrock_runtime
+ from tools.config import OUTPUT_FOLDER, RUN_LOCAL_MODEL, MAX_COMMENT_CHARS, LLM_MAX_NEW_TOKENS, LLM_SEED, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, REASONING_SUFFIX, AZURE_INFERENCE_ENDPOINT, MAX_SPACES_GPU_RUN_TIME, OUTPUT_DEBUG_FILES

max_tokens = LLM_MAX_NEW_TOKENS
timeout_wait = TIMEOUT_WAIT
    file_data = pd.DataFrame()
    deduplicated_unique_table_markdown = ""

+
    if (len(reference_df["Response References"].unique()) == 1) | (len(topic_summary_df["Topic_number"].unique()) == 1):
        print("Data file outputs are too short for deduplicating. Returning original data.")

+         reference_file_out_path = output_folder + reference_table_file_name
        unique_topics_file_out_path = output_folder + unique_topics_table_file_name

        output_files.append(reference_file_out_path)
        output_files.append(unique_topics_file_out_path)
+         return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown

    # For checking that data is not lost during the process
    initial_unique_references = len(reference_df["Response References"].unique())
 

    return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown

+ def deduplicate_topics_llm(reference_df:pd.DataFrame,
+                            topic_summary_df:pd.DataFrame,
+                            reference_table_file_name:str,
+                            unique_topics_table_file_name:str,
+                            model_choice:str,
+                            in_api_key:str,
+                            temperature:float,
+                            model_source:str,
+                            bedrock_runtime=None,
+                            local_model=None,
+                            tokenizer=None,
+                            assistant_model=None,
+                            in_excel_sheets:str="",
+                            merge_sentiment:str="No",
+                            merge_general_topics:str="No",
+                            in_data_files:List[str]=list(),
+                            chosen_cols:List[str]="",
+                            output_folder:str=OUTPUT_FOLDER,
+                            candidate_topics=None
+                            ):
+     '''
+     Deduplicate topics using LLM semantic understanding to identify and merge similar topics.
+ 
+     Args:
+         reference_df (pd.DataFrame): DataFrame containing reference data with topics.
+         topic_summary_df (pd.DataFrame): DataFrame summarizing unique topics.
+         reference_table_file_name (str): Base file name for the output reference table.
+         unique_topics_table_file_name (str): Base file name for the output unique topics table.
+         model_choice (str): The LLM model to use for deduplication.
+         in_api_key (str): API key for the LLM service.
+         temperature (float): Temperature setting for the LLM.
+         model_source (str): Source of the model (AWS, Gemini, Local, etc.).
+         bedrock_runtime: AWS Bedrock runtime client (if using AWS).
+         local_model: Local model instance (if using local model).
+         tokenizer: Tokenizer for local model.
+         assistant_model: Assistant model for speculative decoding.
+         in_excel_sheets (str, optional): Comma-separated list of Excel sheet names to load. Defaults to "".
+         merge_sentiment (str, optional): Whether to merge topics regardless of sentiment ("Yes" or "No"). Defaults to "No".
+         merge_general_topics (str, optional): Whether to merge topics across different general topics ("Yes" or "No"). Defaults to "No".
+         in_data_files (List[str], optional): List of input data file paths. Defaults to [].
+         chosen_cols (List[str], optional): List of chosen columns from the input data files. Defaults to "".
+         output_folder (str, optional): Folder path to save output files. Defaults to OUTPUT_FOLDER.
+         candidate_topics (optional): Candidate topics file for zero-shot guidance. Defaults to None.
+ 
+     Returns:
+         tuple: (reference_df, topic_summary_df, output_files, log_output_files,
+             deduplicated_unique_table_markdown, total_input_tokens, total_output_tokens,
+             number_of_calls, estimated_time_taken).
+     '''
+ 
+     output_files = list()
+     log_output_files = list()
+     file_data = pd.DataFrame()
+     deduplicated_unique_table_markdown = ""
+ 
+     # Check if data is too short for deduplication
+     if (len(reference_df["Response References"].unique()) == 1) | (len(topic_summary_df["Topic_number"].unique()) == 1):
+         print("Data file outputs are too short for deduplicating. Returning original data.")
+ 
+         reference_file_out_path = output_folder + reference_table_file_name
+         unique_topics_file_out_path = output_folder + unique_topics_table_file_name
+ 
+         output_files.append(reference_file_out_path)
+         output_files.append(unique_topics_file_out_path)
+         # Return zeroed token/call counts so this early exit matches the main return signature below
+         return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown, 0, 0, 0, 0.0
+ 
+     # For checking that data is not lost during the process
+     initial_unique_references = len(reference_df["Response References"].unique())
+ 
+     # Create topic summary if it doesn't exist
+     if topic_summary_df.empty:
+         topic_summary_df = create_topic_summary_df_from_reference_table(reference_df)
+ 
+     # Merge topic numbers back to the original dataframe
+     reference_df = reference_df.merge(
+         topic_summary_df[['General topic', 'Subtopic', 'Sentiment', 'Topic_number']],
+         on=['General topic', 'Subtopic', 'Sentiment'],
+         how='left'
+     )
+ 
+     # Load data files if provided
+     if in_data_files and chosen_cols:
+         file_data, data_file_names_textbox, total_number_of_batches = load_in_data_file(in_data_files, chosen_cols, 1, in_excel_sheets)
+     else:
+         out_message = "No file data found, pivot table output will not be created."
+         print(out_message)
+ 
+     # Process candidate topics if provided
+     candidate_topics_table = ""
+     if candidate_topics is not None:
+         try:
+             # Read and process candidate topics
+             candidate_topics_df = read_file(candidate_topics.name)
+             candidate_topics_df = candidate_topics_df.fillna("")
+             candidate_topics_df = candidate_topics_df.astype(str)
+ 
+             # Generate zero-shot topics DataFrame
+             zero_shot_topics_df = generate_zero_shot_topics_df(candidate_topics_df, "No", False)
+ 
+             if not zero_shot_topics_df.empty:
+                 candidate_topics_table = zero_shot_topics_df[['General topic', 'Subtopic']].to_markdown(index=False)
+                 print(f"Found {len(zero_shot_topics_df)} candidate topics to consider during deduplication")
+         except Exception as e:
+             print(f"Error processing candidate topics: {e}")
+             candidate_topics_table = ""
+ 
+     # Prepare topics table for LLM analysis
+     topics_table = topic_summary_df[['General topic', 'Subtopic', 'Sentiment', 'Number of responses']].to_markdown(index=False)
+ 
+     # Format the prompt with candidate topics if available
+     if candidate_topics_table:
+         formatted_prompt = llm_deduplication_prompt_with_candidates.format(
+             topics_table=topics_table,
+             candidate_topics_table=candidate_topics_table
+         )
+     else:
+         formatted_prompt = llm_deduplication_prompt.format(topics_table=topics_table)
+ 
+     # Initialise conversation history
+     conversation_history = []
+     whole_conversation = []
+     whole_conversation_metadata = []
+ 
+     # Set up model clients based on model source
+     if "Gemini" in model_source:
+         google_client, config = construct_gemini_generative_model(
+             in_api_key, temperature, model_choice, llm_deduplication_system_prompt,
+             max_tokens, LLM_SEED
+         )
+         bedrock_runtime = None
+     elif "AWS" in model_source:
+         if not bedrock_runtime:
+             bedrock_runtime = boto3.client('bedrock-runtime')
+         google_client = None
+         config = None
+     elif "Azure" in model_source:
+         google_client, config = construct_azure_client(in_api_key, "")
+         bedrock_runtime = None
+     elif "Local" in model_source:
+         google_client = None
+         config = None
+         bedrock_runtime = None
+     else:
+         raise ValueError(f"Unsupported model source: {model_source}")
+ 
+     # Call LLM to get deduplication suggestions
+     print("Calling LLM for topic deduplication analysis...")
+ 
+     # Use the existing call_llm_with_markdown_table_checks function
+     responses, conversation_history, whole_conversation, whole_conversation_metadata, response_text = call_llm_with_markdown_table_checks(
+         batch_prompts=[formatted_prompt],
+         system_prompt=llm_deduplication_system_prompt,
+         conversation_history=conversation_history,
+         whole_conversation=whole_conversation,
+         whole_conversation_metadata=whole_conversation_metadata,
+         google_client=google_client,
+         google_config=config,
+         model_choice=model_choice,
+         temperature=temperature,
+         reported_batch_no=1,
+         local_model=local_model,
+         tokenizer=tokenizer,
+         bedrock_runtime=bedrock_runtime,
+         model_source=model_source,
+         MAX_OUTPUT_VALIDATION_ATTEMPTS=3,
+         assistant_prefill="",
+         master=False,
+         CHOSEN_LOCAL_MODEL_TYPE=CHOSEN_LOCAL_MODEL_TYPE,
+         random_seed=LLM_SEED
+     )
+ 
+     # Generate debug files if enabled
+     if OUTPUT_DEBUG_FILES == "True":
+         try:
+             # Create batch file path details for debug files
+             batch_file_path_details = get_file_name_no_ext(reference_table_file_name) + "_llm_dedup"
+             model_choice_clean_short = model_choice.replace("/", "_").replace(":", "_").replace(".", "_")
+ 
+             # Create full prompt for debug output
+             full_prompt = llm_deduplication_system_prompt + "\n" + formatted_prompt
+ 
+             # Write debug files
+             current_prompt_content_logged, current_summary_content_logged, current_conversation_content_logged, current_metadata_content_logged = process_debug_output_iteration(
+                 OUTPUT_DEBUG_FILES,
+                 output_folder,
+                 batch_file_path_details,
+                 model_choice_clean_short,
+                 full_prompt,
+                 response_text,
+                 whole_conversation,
+                 whole_conversation_metadata,
+                 log_output_files,
+                 task_type="llm_deduplication"
+             )
+ 
+             print("Debug files written for LLM deduplication analysis")
+ 
+         except Exception as e:
+             print(f"Error writing debug files for LLM deduplication: {e}")
+ 
+     # Parse the LLM response to extract merge suggestions
+     merge_suggestions_df = pd.DataFrame()  # Initialise empty DataFrame for analysis results
+     num_merges_applied = 0
+ 
+     try:
+         # Extract the markdown table from the response
+         table_match = re.search(r'\|.*\|.*\n\|.*\|.*\n(\|.*\|.*\n)*', response_text, re.MULTILINE)
+         if table_match:
+             table_text = table_match.group(0)
+ 
+             # Convert markdown table to DataFrame
+             from io import StringIO
+             merge_suggestions_df = pd.read_csv(StringIO(table_text), sep='|', skipinitialspace=True)
+ 
+             # Clean up the DataFrame
+             merge_suggestions_df = merge_suggestions_df.dropna(axis=1, how='all')  # Remove empty columns
+             merge_suggestions_df.columns = merge_suggestions_df.columns.str.strip()
+ 
+             # Drop the markdown header separator row (e.g. '---') and rows where all values are NaN
+             if not merge_suggestions_df.empty:
+                 merge_suggestions_df = merge_suggestions_df[~merge_suggestions_df.iloc[:, 0].astype(str).str.fullmatch(r'\s*:?-+:?\s*')]
+             merge_suggestions_df = merge_suggestions_df.dropna(how='all')
+             merge_suggestions_df = merge_suggestions_df.fillna("")
+ 
+             if not merge_suggestions_df.empty:
+                 print(f"LLM identified {len(merge_suggestions_df)} potential topic merges")
+ 
+                 # Apply the merges to the reference_df
+                 for _, row in merge_suggestions_df.iterrows():
+                     original_general = row.get('Original General topic', '').strip()
+                     original_subtopic = row.get('Original Subtopic', '').strip()
+                     original_sentiment = row.get('Original Sentiment', '').strip()
+                     merged_general = row.get('Merged General topic', '').strip()
+                     merged_subtopic = row.get('Merged Subtopic', '').strip()
+                     merged_sentiment = row.get('Merged Sentiment', '').strip()
+ 
+                     if all([original_general, original_subtopic, original_sentiment,
+                             merged_general, merged_subtopic, merged_sentiment]):
+ 
+                         # Find matching rows in reference_df
+                         mask = (
+                             (reference_df['General topic'] == original_general) &
+                             (reference_df['Subtopic'] == original_subtopic) &
+                             (reference_df['Sentiment'] == original_sentiment)
+                         )
+ 
+                         if mask.any():
+                             # Update the matching rows
+                             reference_df.loc[mask, 'General topic'] = merged_general
+                             reference_df.loc[mask, 'Subtopic'] = merged_subtopic
+                             reference_df.loc[mask, 'Sentiment'] = merged_sentiment
+                             num_merges_applied += 1
+                             print(f"Merged: {original_general} | {original_subtopic} | {original_sentiment} -> {merged_general} | {merged_subtopic} | {merged_sentiment}")
+             else:
+                 print("No merge suggestions found in LLM response")
+         else:
+             print("No markdown table found in LLM response")
+ 
+     except Exception as e:
+         print(f"Error parsing LLM response: {e}")
+         print("Continuing with original data...")
+ 
+     # Update reference summary column with all summaries
+     reference_df["Summary"] = reference_df.groupby(
+         ["Response References", "General topic", "Subtopic", "Sentiment"]
+     )["Summary"].transform(' <br> '.join)
+ 
+     # Check that we have not inadvertently removed some data during the process
+     end_unique_references = len(reference_df["Response References"].unique())
+ 
+     if initial_unique_references != end_unique_references:
+         raise Exception(f"Number of unique references changed during processing: Initial={initial_unique_references}, Final={end_unique_references}")
+ 
+     # Drop duplicates in the reference table
+     reference_df.drop_duplicates(['Response References', 'General topic', 'Subtopic', 'Sentiment'], inplace=True)
+ 
+     # Remake topic_summary_df based on new reference_df
+     topic_summary_df = create_topic_summary_df_from_reference_table(reference_df)
+ 
+     # Merge the topic numbers back to the original dataframe
+     reference_df = reference_df.merge(
+         topic_summary_df[['General topic', 'Subtopic', 'Sentiment', 'Group', 'Topic_number']],
+         on=['General topic', 'Subtopic', 'Sentiment', 'Group'],
+         how='left'
+     )
+ 
+     # Create pivot table if file data is available
+     if not file_data.empty:
+         basic_response_data = get_basic_response_data(file_data, chosen_cols)
+         reference_df_pivot = convert_reference_table_to_pivot_table(reference_df, basic_response_data)
+ 
+         reference_pivot_file_path = output_folder + get_file_name_no_ext(reference_table_file_name) + "_pivot_dedup.csv"
+         reference_df_pivot.to_csv(reference_pivot_file_path, index=None, encoding='utf-8-sig')
+         log_output_files.append(reference_pivot_file_path)
+ 
+     # Save analysis results CSV if merge suggestions were found
+     if not merge_suggestions_df.empty:
+         analysis_results_file_path = output_folder + get_file_name_no_ext(reference_table_file_name) + "_llm_analysis_results.csv"
+         merge_suggestions_df.to_csv(analysis_results_file_path, index=None, encoding='utf-8-sig')
+         log_output_files.append(analysis_results_file_path)
+         print(f"Analysis results saved to: {analysis_results_file_path}")
+ 
+     # Save output files
+     reference_file_out_path = output_folder + get_file_name_no_ext(reference_table_file_name) + "_dedup.csv"
+     unique_topics_file_out_path = output_folder + get_file_name_no_ext(unique_topics_table_file_name) + "_dedup.csv"
+     reference_df.to_csv(reference_file_out_path, index=None, encoding='utf-8-sig')
+     topic_summary_df.to_csv(unique_topics_file_out_path, index=None, encoding='utf-8-sig')
+ 
+     output_files.append(reference_file_out_path)
+     output_files.append(unique_topics_file_out_path)
+ 
+     # Outputs for markdown table output
+     topic_summary_df_revised_display = topic_summary_df.apply(lambda col: col.map(lambda x: wrap_text(x, max_text_length=500)))
+     deduplicated_unique_table_markdown = topic_summary_df_revised_display.to_markdown(index=False)
+ 
+     # Calculate token usage and timing information for logging
+     total_input_tokens = 0
+     total_output_tokens = 0
+     number_of_calls = 1  # Single LLM call for deduplication
+ 
+     # Extract token usage from conversation metadata
+     if whole_conversation_metadata:
+         for metadata in whole_conversation_metadata:
+             if "input_tokens:" in metadata and "output_tokens:" in metadata:
+                 try:
+                     input_tokens = int(metadata.split("input_tokens: ")[1].split(" ")[0])
+                     output_tokens = int(metadata.split("output_tokens: ")[1].split(" ")[0])
+                     total_input_tokens += input_tokens
+                     total_output_tokens += output_tokens
+                 except (ValueError, IndexError):
+                     pass
+ 
+     # Calculate estimated time taken (rough estimate based on token usage)
+     estimated_time_taken = (total_input_tokens + total_output_tokens) / 1000  # Rough estimate in seconds
+ 
+     return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown, total_input_tokens, total_output_tokens, number_of_calls, estimated_time_taken  #, num_merges_applied
+ 
def sample_reference_table_summaries(reference_df:pd.DataFrame,
                                     random_seed:int,
                                     no_of_sampled_summaries:int=150):
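The core of the new `deduplicate_topics_llm` function above is extracting a markdown table of merge suggestions from free-form LLM output. A minimal standard-library sketch of that parsing step (no pandas; the regex mirrors the one in the diff, while `parse_markdown_table` and the sample response are invented for illustration):

```python
import re

def parse_markdown_table(response_text: str) -> list[dict]:
    """Extract the first markdown table from LLM output as a list of row dicts."""
    match = re.search(r'\|.*\|.*\n\|.*\|.*\n(\|.*\|.*\n)*', response_text, re.MULTILINE)
    if not match:
        return []
    lines = [line.strip() for line in match.group(0).strip().splitlines()]
    header = [cell.strip() for cell in lines[0].strip('|').split('|')]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.strip('|').split('|')]
        # Skip the header separator row (e.g. |---|---|)
        if all(re.fullmatch(r':?-+:?', cell) for cell in cells if cell):
            continue
        rows.append(dict(zip(header, cells)))
    return rows

sample = """Some analysis text.
| Original Subtopic | Merged Subtopic |
|---|---|
| Bus delays | Public transport delays |
"""
print(parse_markdown_table(sample))
```

Note that the separator row (`|---|---|`) must be filtered out explicitly; the pandas-based version in the diff handles the same concern before iterating over suggestions.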
tools/helper_functions.py CHANGED
@@ -7,7 +7,7 @@ import numpy as np
from typing import List
import math
from botocore.exceptions import ClientError
- from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, SESSION_OUTPUT_FOLDER, CUSTOM_HEADER, CUSTOM_HEADER_VALUE, AWS_USER_POOL_ID

def empty_output_vars_extract_topics():
    # Empty output objects before processing a new file
@@ -786,4 +786,116 @@ def create_batch_file_path_details(reference_data_file_name: str) -> str:


def move_overall_summary_output_files_to_front_page(overall_summary_output_files_xlsx:List[str]):
-     return overall_summary_output_files_xlsx
from typing import List
import math
from botocore.exceptions import ClientError
+ from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, SESSION_OUTPUT_FOLDER, CUSTOM_HEADER, CUSTOM_HEADER_VALUE, AWS_USER_POOL_ID, MAXIMUM_ZERO_SHOT_TOPICS

def empty_output_vars_extract_topics():
    # Empty output objects before processing a new file


def move_overall_summary_output_files_to_front_page(overall_summary_output_files_xlsx:List[str]):
+     return overall_summary_output_files_xlsx
+ 
+ def generate_zero_shot_topics_df(zero_shot_topics:pd.DataFrame,
+                                  force_zero_shot_radio:str="No",
+                                  create_revised_general_topics:bool=False,
+                                  max_topic_no:int=MAXIMUM_ZERO_SHOT_TOPICS):
+     """
+     Preprocesses a DataFrame of zero-shot topics, cleaning and formatting them
+     for use with a large language model. It handles different column configurations
+     (e.g., only subtopics, general topics and subtopics, or subtopics with descriptions)
+     and enforces a maximum number of topics.
+ 
+     Args:
+         zero_shot_topics (pd.DataFrame): A DataFrame containing the initial zero-shot topics.
+             Expected columns can vary, but typically include
+             "General topic", "Subtopic", and/or "Description".
+         force_zero_shot_radio (str, optional): Whether responses are being forced into
+             the zero-shot topics; if "Yes", a "No relevant topic" option is appended.
+             Defaults to "No".
+         create_revised_general_topics (bool, optional): A boolean indicating whether to
+             create revised general topics. Defaults to False.
+             (Currently not used in the function logic, but kept for signature consistency).
+         max_topic_no (int, optional): The maximum number of topics allowed to fit within
+             LLM context limits. If `zero_shot_topics` has more rows than this, an
+             exception is raised. Defaults to MAXIMUM_ZERO_SHOT_TOPICS.
+ 
+     Returns:
+         pd.DataFrame: A DataFrame with cleaned "General topic", "Subtopic" and
+             "Description" columns, ready for use in prompts.
+     """
+ 
+     zero_shot_topics_gen_topics_list = list()
+     zero_shot_topics_subtopics_list = list()
+     zero_shot_topics_description_list = list()
+ 
+     # Enforce the configured maximum number of zero-shot topics
+     if zero_shot_topics.shape[0] > max_topic_no:
+         out_message = "Maximum " + str(max_topic_no) + " zero-shot topics allowed according to application configuration."
+         print(out_message)
+         raise Exception(out_message)
+ 
+     # Forward slashes in the topic names seem to confuse the model
+     if zero_shot_topics.shape[1] >= 1:  # Check if there is at least one column
+         for x in zero_shot_topics.columns:
+             if not zero_shot_topics[x].isnull().all():
+                 zero_shot_topics[x] = zero_shot_topics[x].apply(initial_clean)
+ 
+                 zero_shot_topics.loc[:, x] = (
+                     zero_shot_topics.loc[:, x]
+                     .str.strip()
+                     .str.replace('\n', ' ')
+                     .str.replace('\r', ' ')
+                     .str.replace('/', ' or ')
+                     .str.lower()
+                     .str.capitalize())
+ 
+     # If number of columns is 1, keep only subtopics
+     if zero_shot_topics.shape[1] == 1 and "General topic" not in zero_shot_topics.columns:
+         print("Found only Subtopic in zero shot topics")
+         zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
+         zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 0])
+     # Allow for possibility that the user only wants to set general topics and not subtopics
+     elif zero_shot_topics.shape[1] == 1 and "General topic" in zero_shot_topics.columns:
+         print("Found only General topic in zero shot topics")
+         zero_shot_topics_gen_topics_list = list(zero_shot_topics["General topic"])
+         zero_shot_topics_subtopics_list = [""] * zero_shot_topics.shape[0]
+     # If general topic and subtopic are specified
+     elif set(["General topic", "Subtopic"]).issubset(zero_shot_topics.columns):
+         print("Found General topic and Subtopic in zero shot topics")
+         zero_shot_topics_gen_topics_list = list(zero_shot_topics["General topic"])
+         zero_shot_topics_subtopics_list = list(zero_shot_topics["Subtopic"])
+     # If subtopic and description are specified
+     elif set(["Subtopic", "Description"]).issubset(zero_shot_topics.columns):
+         print("Found Subtopic and Description in zero shot topics")
+         zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
+         zero_shot_topics_subtopics_list = list(zero_shot_topics["Subtopic"])
+         zero_shot_topics_description_list = list(zero_shot_topics["Description"])
+     # If number of columns is at least 2, keep general topics and subtopics
+     elif zero_shot_topics.shape[1] >= 2 and "Description" not in zero_shot_topics.columns:
+         zero_shot_topics_gen_topics_list = list(zero_shot_topics.iloc[:, 0])
+         zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 1])
+     else:
+         # If there are more columns, just assume that the first column was meant to be a subtopic
+         zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
+         zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 0])
+ 
+     # Add a description if a Description column is present and not already captured
+     if not zero_shot_topics_description_list:
+         if "Description" in zero_shot_topics.columns:
+             zero_shot_topics_description_list = list(zero_shot_topics["Description"])
+         elif zero_shot_topics.shape[1] >= 3:
+             zero_shot_topics_description_list = list(zero_shot_topics.iloc[:, 2])  # Assume the third column is description
+         else:
+             zero_shot_topics_description_list = [""] * zero_shot_topics.shape[0]
+ 
+     # If the responses are being forced into zero shot topics, allow an option for nothing relevant
+     if force_zero_shot_radio == "Yes":
+         zero_shot_topics_gen_topics_list.append("")
+         zero_shot_topics_subtopics_list.append("No relevant topic")
+         zero_shot_topics_description_list.append("")
+ 
+     zero_shot_topics_df = pd.DataFrame(data={
+         "General topic": zero_shot_topics_gen_topics_list,
+         "Subtopic": zero_shot_topics_subtopics_list,
+         "Description": zero_shot_topics_description_list
+     })
+ 
+     return zero_shot_topics_df
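The per-column cleaning chain in `generate_zero_shot_topics_df` above can be illustrated on a single string without pandas. A plain-Python sketch (`clean_topic_name` is a hypothetical stand-in for the `.str` accessor chain in the diff, not part of the repo's API):

```python
def clean_topic_name(name: str) -> str:
    """Mirror the pandas chain: strip, drop newlines, replace '/', then sentence-case."""
    name = name.strip()
    name = name.replace('\n', ' ').replace('\r', ' ')
    name = name.replace('/', ' or ')  # forward slashes seem to confuse the model
    return name.lower().capitalize()

print(clean_topic_name("  Roads/Pavements\nMaintenance "))  # -> "Roads or pavements maintenance"
```

`str.capitalize()` lowercases everything after the first character, which is why the whole chain ends in sentence case regardless of the input's original casing.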
tools/llm_api_call.py CHANGED
@@ -16,7 +16,7 @@ from io import StringIO
16
  GradioFileData = gr.FileData
17
 
18
  from tools.prompts import initial_table_prompt, initial_table_system_prompt, add_existing_topics_system_prompt, add_existing_topics_prompt, force_existing_topics_prompt, allow_new_topics_prompt, force_single_topic_prompt, add_existing_topics_assistant_prefill, initial_table_assistant_prefill, structured_summary_prompt, default_response_reference_format, negative_neutral_positive_sentiment_prompt, negative_or_positive_sentiment_prompt, default_sentiment_prompt
19
- from tools.helper_functions import read_file, put_columns_in_df, wrap_text, initial_clean, load_in_data_file, load_in_file, create_topic_summary_df_from_reference_table, convert_reference_table_to_pivot_table, get_basic_response_data, clean_column_name, load_in_previous_data_files, create_batch_file_path_details, move_overall_summary_output_files_to_front_page
20
  from tools.llm_funcs import ResponseObject, construct_gemini_generative_model, call_llm_with_markdown_table_checks, create_missing_references_df, calculate_tokens_from_metadata, construct_azure_client, get_model, get_tokenizer, get_assistant_model
21
  from tools.config import RUN_LOCAL_MODEL, AWS_REGION, MAX_COMMENT_CHARS, MAX_OUTPUT_VALIDATION_ATTEMPTS, LLM_MAX_NEW_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, OUTPUT_FOLDER, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, LLM_SEED, MAX_GROUPS, REASONING_SUFFIX, AZURE_INFERENCE_ENDPOINT, MAX_ROWS, MAXIMUM_ZERO_SHOT_TOPICS, MAX_SPACES_GPU_RUN_TIME, OUTPUT_DEBUG_FILES
22
  from tools.aws_functions import connect_to_bedrock_runtime
@@ -578,117 +578,7 @@ def write_llm_output_and_logs(response_text: str,
578
 
579
  return topic_table_out_path, reference_table_out_path, topic_summary_df_out_path, topic_with_response_df, out_reference_df, out_topic_summary_df, batch_file_path_details, is_error
580
 
581
- def generate_zero_shot_topics_df(zero_shot_topics:pd.DataFrame,
582
- force_zero_shot_radio:str="No",
583
- create_revised_general_topics:bool=False,
584
- max_topic_no:int=maximum_zero_shot_topics):
585
- """
586
- Preprocesses a DataFrame of zero-shot topics, cleaning and formatting them
587
- for use with a large language model. It handles different column configurations
588
- (e.g., only subtopics, general topics and subtopics, or subtopics with descriptions)
589
- and enforces a maximum number of topics.
590
-
591
- Args:
592
- zero_shot_topics (pd.DataFrame): A DataFrame containing the initial zero-shot topics.
593
- Expected columns can vary, but typically include
594
- "General topic", "Subtopic", and/or "Description".
595
- force_zero_shot_radio (str, optional): A string indicating whether to force
596
- the use of zero-shot topics. Defaults to "No".
597
- (Currently not used in the function logic, but kept for signature consistency).
598
- create_revised_general_topics (bool, optional): A boolean indicating whether to
599
- create revised general topics. Defaults to False.
600
- (Currently not used in the function logic, but kept for signature consistency).
601
- max_topic_no (int, optional): The maximum number of topics allowed to fit within
602
- LLM context limits. If `zero_shot_topics` exceeds this,
603
- it will be truncated. Defaults to 120.
604
-
605
- Returns:
606
- tuple: A tuple containing:
607
- - zero_shot_topics_gen_topics_list (list): A list of cleaned general topics.
608
- - zero_shot_topics_subtopics_list (list): A list of cleaned subtopics.
609
- - zero_shot_topics_description_list (list): A list of cleaned topic descriptions.
610
- """
611
-
612
- zero_shot_topics_gen_topics_list = list()
613
- zero_shot_topics_subtopics_list = list()
614
- zero_shot_topics_description_list = list()
615
 
616
- # Max 120 topics allowed
617
- if zero_shot_topics.shape[0] > max_topic_no:
618
- out_message = "Maximum " + str(max_topic_no) + " zero-shot topics allowed according to application configuration."
619
- print(out_message)
620
- raise Exception(out_message)
621
-
622
- # Forward slashes in the topic names seems to confuse the model
623
- if zero_shot_topics.shape[1] >= 1: # Check if there is at least one column
624
- for x in zero_shot_topics.columns:
625
- if not zero_shot_topics[x].isnull().all():
626
- zero_shot_topics[x] = zero_shot_topics[x].apply(initial_clean)
627
-
628
- zero_shot_topics.loc[:, x] = (
629
- zero_shot_topics.loc[:, x]
630
- .str.strip()
631
- .str.replace('\n', ' ')
632
- .str.replace('\r', ' ')
633
- .str.replace('/', ' or ')
634
- .str.lower()
635
- .str.capitalize())
636
-
637
- # If number of columns is 1, keep only subtopics
638
- if zero_shot_topics.shape[1] == 1 and "General topic" not in zero_shot_topics.columns:
639
- print("Found only Subtopic in zero shot topics")
640
- zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
641
- zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 0])
642
- # Allow for possibility that the user only wants to set general topics and not subtopics
643
-    elif zero_shot_topics.shape[1] == 1 and "General topic" in zero_shot_topics.columns:
-        print("Found only General topic in zero shot topics")
-        zero_shot_topics_gen_topics_list = list(zero_shot_topics["General topic"])
-        zero_shot_topics_subtopics_list = [""] * zero_shot_topics.shape[0]
-    # If general topic and subtopic are specified
-    elif set(["General topic", "Subtopic"]).issubset(zero_shot_topics.columns):
-        print("Found General topic and Subtopic in zero shot topics")
-        zero_shot_topics_gen_topics_list = list(zero_shot_topics["General topic"])
-        zero_shot_topics_subtopics_list = list(zero_shot_topics["Subtopic"])
-    # If subtopic and description are specified
-    elif set(["Subtopic", "Description"]).issubset(zero_shot_topics.columns):
-        print("Found Subtopic and Description in zero shot topics")
-        zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
-        zero_shot_topics_subtopics_list = list(zero_shot_topics["Subtopic"])
-        zero_shot_topics_description_list = list(zero_shot_topics["Description"])
-
-    # If number of columns is at least 2, keep general topics and subtopics
-    elif zero_shot_topics.shape[1] >= 2 and "Description" not in zero_shot_topics.columns:
-        zero_shot_topics_gen_topics_list = list(zero_shot_topics.iloc[:, 0])
-        zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 1])
-    else:
-        # If there are more columns, just assume that the first column was meant to be a subtopic
-        zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
-        zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 0])
-
-    # Add a description if a column is present
-    if not zero_shot_topics_description_list:
-        if "Description" in zero_shot_topics.columns:
-            zero_shot_topics_description_list = list(zero_shot_topics["Description"])
-        elif zero_shot_topics.shape[1] >= 3:
-            zero_shot_topics_description_list = list(zero_shot_topics.iloc[:, 2])  # Assume the third column is description
-        else:
-            zero_shot_topics_description_list = [""] * zero_shot_topics.shape[0]
-
-    # If the responses are being forced into zero shot topics, allow an option for nothing relevant
-    if force_zero_shot_radio == "Yes":
-        zero_shot_topics_gen_topics_list.append("")
-        zero_shot_topics_subtopics_list.append("No relevant topic")
-        zero_shot_topics_description_list.append("")
-
-    # Build the consolidated zero-shot topics dataframe
-    zero_shot_topics_df = pd.DataFrame(data={
-        "General topic": zero_shot_topics_gen_topics_list,
-        "Subtopic": zero_shot_topics_subtopics_list,
-        "Description": zero_shot_topics_description_list
-    })
-
-    return zero_shot_topics_df
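The column-detection logic removed above was consolidated into the `generate_zero_shot_topics_df` helper imported from `tools.helper_functions` elsewhere in this commit. A minimal sketch of that logic, assuming the same column conventions; this is an illustrative reimplementation, not the actual helper from the repository:

```python
import pandas as pd

def generate_zero_shot_topics_df(zero_shot_topics: pd.DataFrame,
                                 force_zero_shot: bool = False) -> pd.DataFrame:
    """Normalise a user-supplied topics file into General topic / Subtopic /
    Description columns, mirroring the column-detection rules in the diff."""
    n = zero_shot_topics.shape[0]
    general, sub, desc = [""] * n, [""] * n, []
    cols = zero_shot_topics.columns

    if zero_shot_topics.shape[1] == 1 and "General topic" in cols:
        general = list(zero_shot_topics["General topic"])
    elif {"General topic", "Subtopic"}.issubset(cols):
        general = list(zero_shot_topics["General topic"])
        sub = list(zero_shot_topics["Subtopic"])
    elif {"Subtopic", "Description"}.issubset(cols):
        sub = list(zero_shot_topics["Subtopic"])
        desc = list(zero_shot_topics["Description"])
    elif zero_shot_topics.shape[1] >= 2 and "Description" not in cols:
        general = list(zero_shot_topics.iloc[:, 0])
        sub = list(zero_shot_topics.iloc[:, 1])
    else:
        # Otherwise assume the first column was meant to be a subtopic
        sub = list(zero_shot_topics.iloc[:, 0])

    # Fill in descriptions if a usable column exists
    if not desc:
        if "Description" in cols:
            desc = list(zero_shot_topics["Description"])
        elif zero_shot_topics.shape[1] >= 3:
            desc = list(zero_shot_topics.iloc[:, 2])
        else:
            desc = [""] * n

    # When responses are forced into zero-shot topics, add a catch-all row
    if force_zero_shot:
        general.append("")
        sub.append("No relevant topic")
        desc.append("")

    return pd.DataFrame({"General topic": general,
                         "Subtopic": sub,
                         "Description": desc})
```

The sketch returns the same three-column frame the removed block built, so callers like `extract_topics` can consume one consistent shape regardless of the input file's layout.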
 
  def extract_topics(in_data_file: GradioFileData,
 
 GradioFileData = gr.FileData

 from tools.prompts import initial_table_prompt, initial_table_system_prompt, add_existing_topics_system_prompt, add_existing_topics_prompt, force_existing_topics_prompt, allow_new_topics_prompt, force_single_topic_prompt, add_existing_topics_assistant_prefill, initial_table_assistant_prefill, structured_summary_prompt, default_response_reference_format, negative_neutral_positive_sentiment_prompt, negative_or_positive_sentiment_prompt, default_sentiment_prompt
+ from tools.helper_functions import read_file, put_columns_in_df, wrap_text, initial_clean, load_in_data_file, load_in_file, create_topic_summary_df_from_reference_table, convert_reference_table_to_pivot_table, get_basic_response_data, clean_column_name, load_in_previous_data_files, create_batch_file_path_details, move_overall_summary_output_files_to_front_page, generate_zero_shot_topics_df
 from tools.llm_funcs import ResponseObject, construct_gemini_generative_model, call_llm_with_markdown_table_checks, create_missing_references_df, calculate_tokens_from_metadata, construct_azure_client, get_model, get_tokenizer, get_assistant_model
 from tools.config import RUN_LOCAL_MODEL, AWS_REGION, MAX_COMMENT_CHARS, MAX_OUTPUT_VALIDATION_ATTEMPTS, LLM_MAX_NEW_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, OUTPUT_FOLDER, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, LLM_SEED, MAX_GROUPS, REASONING_SUFFIX, AZURE_INFERENCE_ENDPOINT, MAX_ROWS, MAXIMUM_ZERO_SHOT_TOPICS, MAX_SPACES_GPU_RUN_TIME, OUTPUT_DEBUG_FILES
 from tools.aws_functions import connect_to_bedrock_runtime
 
     return topic_table_out_path, reference_table_out_path, topic_summary_df_out_path, topic_with_response_df, out_reference_df, out_topic_summary_df, batch_file_path_details, is_error
tools/prompts.py CHANGED
@@ -182,4 +182,63 @@ New Topics table:"""
 # Categorise the following text into only one of the following categories that seems most relevant: 'cat1', 'cat2', 'cat3', 'cat4'. Answer only with the choice of category. Do not add any other text. Do not explain your choice.
 # Text: {text}<end_of_turn>
 # <start_of_turn>model
- # Category:"""
+ # Category:"""
+
+ ###
+ # LLM-BASED TOPIC DEDUPLICATION PROMPTS
+ ###
+
+ llm_deduplication_system_prompt = """You are an expert at analysing and consolidating topic categories. Your task is to identify semantically similar topics that should be merged together, even if they use different wording or synonyms."""
+
+ llm_deduplication_prompt = """You are given a table of topics with their General topics, Subtopics, and Sentiment classifications. Your task is to identify topics that are semantically similar and should be merged together. Only merge topics that are almost identical in terms of meaning - if in doubt, do not merge.
+
+ Analyse the following topics table and identify groups of topics that describe essentially the same concept but may use different words or phrases. For example:
+ - "Transportation issues" and "Public transport problems"
+ - "Housing costs" and "Rent prices"
+ - "Environmental concerns" and "Green issues"
+
+ Create a markdown table with the following columns:
+ 1. 'Original General topic' - The current general topic name
+ 2. 'Original Subtopic' - The current subtopic name
+ 3. 'Original Sentiment' - The current sentiment
+ 4. 'Merged General topic' - The consolidated general topic name (use the most descriptive)
+ 5. 'Merged Subtopic' - The consolidated subtopic name (use the most descriptive)
+ 6. 'Merged Sentiment' - The consolidated sentiment (use 'Mixed' if sentiments differ)
+ 7. 'Merge Reason' - Brief explanation of why these topics should be merged
+
+ Only include rows where topics should actually be merged. If a topic has no semantic duplicates, do not include it in the table.
+
+ Topics to analyse:
+ {topics_table}
+
+ Merged topics table:"""
+
+ llm_deduplication_prompt_with_candidates = """You are given a table of topics with their General topics, Subtopics, and Sentiment classifications. Your task is to identify topics that are semantically similar and should be merged together, even if they use different wording.
+
+ Additionally, you have been provided with a list of candidate topics that represent preferred topic categories. When merging topics, prioritise fitting similar topics into these existing candidate categories rather than creating new ones. Only merge topics that are almost identical in terms of meaning - if in doubt, do not merge.
+
+ Analyse the following topics table and identify groups of topics that describe essentially the same concept but may use different words or phrases. For example:
+ - "Transportation issues" and "Public transport problems"
+ - "Housing costs" and "Rent prices"
+ - "Environmental concerns" and "Green issues"
+
+ When merging topics, consider the candidate topics provided below and try to map similar topics to these preferred categories when possible.
+
+ Create a markdown table with the following columns:
+ 1. 'Original General topic' - The current general topic name
+ 2. 'Original Subtopic' - The current subtopic name
+ 3. 'Original Sentiment' - The current sentiment
+ 4. 'Merged General topic' - The consolidated general topic name (prefer candidate topics when similar)
+ 5. 'Merged Subtopic' - The consolidated subtopic name (prefer candidate topics when similar)
+ 6. 'Merged Sentiment' - The consolidated sentiment (use 'Mixed' if sentiments differ)
+ 7. 'Merge Reason' - Brief explanation of why these topics should be merged
+
+ Only include rows where topics should actually be merged. If a topic has no semantic duplicates, do not include it in the table.
+
+ Topics to analyse:
+ {topics_table}
+
+ Candidate topics to consider for mapping:
+ {candidate_topics_table}
+
+ Merged topics table:"""
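Both deduplication prompts ask the model to reply with a markdown table, which must be parsed before any merges can be applied to the reference data. A minimal sketch of such a parser; `parse_merge_table` is a hypothetical helper for illustration, not part of this commit:

```python
def parse_merge_table(llm_response: str) -> list[dict]:
    """Parse the markdown 'Merged topics table' returned for the
    deduplication prompts into a list of {column: value} merge mappings."""
    # Keep only markdown table rows (lines starting with a pipe)
    rows = [line.strip() for line in llm_response.splitlines()
            if line.strip().startswith("|")]
    # Split each row into stripped cells, dropping the outer pipes
    cells = [[c.strip() for c in row.strip("|").split("|")] for row in rows]
    header = cells[0]
    # Skip the |---|---| separator row, map the rest onto the header
    return [dict(zip(header, row)) for row in cells[1:]
            if not all(set(c) <= set("-: ") for c in row)]
```

Downstream code could then walk these mappings and rewrite each `(Original General topic, Original Subtopic, Original Sentiment)` triple in the reference table to its merged counterpart.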