seanpedrickcase committed on
Commit 6f3d42c · 1 Parent(s): 9a0231a

Added deduplication with LLM functionality. Minor package updates. Updated installation documentation.
README.md CHANGED
@@ -26,7 +26,7 @@ Basic use:
 
 # Installation guide
 
-Here is a step-by-step guide to clone the repository, create a virtual environment, and install dependencies from a relevant `requirements` file. This guide assumes you have **Git** and **Python 3.11** installed.
+Here is a step-by-step guide to clone the repository, create a virtual environment, and install dependencies from the relevant `requirements` file. This guide assumes you have **Git** and **Python 3.11** installed.
 
 -----
 
@@ -37,11 +37,9 @@ First, you need to copy the project files to your local machine. Navigate to the
 1. **Clone the repo:**
 
 ```bash
-git clone https://github.com/example-user/example-repo.git
+git clone https://github.com/seanpedrick-case/llm_topic_modelling.git
 ```
 
-*Replace the URL with your repository's URL.*
-
 2. **Navigate into the new project folder:**
 
 ```bash
@@ -91,21 +89,27 @@ Now that your virtual environment is active, you can install all the required pa
 
 1. **Choose the relevant requirements file**
 
+**NOTE:** To start, I advise installing using the **requirements_no_local.txt** file, which installs the app without local model inference. This approach is much simpler as a first step, and avoids issues with the potentially complicated llama-cpp-python installation and GPU management described below.
+
 Llama-cpp-python version 3.16 is compatible with Gemma 3 and GPT-OSS models, but does not at the time of writing have relevant wheels for CPU inference or for Windows. A sister repository contains [llama-cpp-python 3.16 wheels for Python version 3.11/10](https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/tag/v0.1.0) so that users can avoid having to build the package from source. If you prefer to build from source, then please refer to the llama-cpp-python documentation [here](https://github.com/abetlen/llama-cpp-python). I also have a guide to building the package on a Windows system [here](https://github.com/seanpedrick-case/llm_topic_modelling/blob/main/windows_install_llama-cpp-python.txt).
 
 The repo provides several requirements files that are relevant for different situations. I would advise using requirements_gpu.txt for GPU environments, and requirements_cpu.txt for CPU environments:
 
-- **requirements_no_local**: Can be used to install the app without local model inference for a more lightweight installation.
+- **requirements_no_local.txt**: Can be used to install the app without local model inference for a more lightweight installation.
 - **requirements_gpu.txt**: Used for Python 3.11 GPU-enabled environments. Uncomment the requirements under 'Windows' for Windows compatibility (CUDA 12.4).
 - **requirements_cpu.txt**: Used for Python 3.11 CPU-only environments. Uncomment the requirements under 'Windows' for Windows compatibility. Make sure you have [Openblas](https://github.com/OpenMathLib/OpenBLAS) installed!
 - **requirements.txt**: Used for the Python 3.10 GPU-enabled environment on Hugging Face spaces (CUDA 12.4).
 
+The instructions below will guide you through installing the GPU-enabled version of the app for local inference.
+
 2. **Install packages from the requirements file:**
 ```bash
 pip install -r requirements_gpu.txt
 ```
 *This command reads every package name listed in the file and installs it into your `.venv` environment.*
 
+NOTE: If the default llama-cpp-python installation from the above does not work, go into the requirements_gpu.txt file and uncomment the lines that install a wheel for llama-cpp-python 0.3.16 relevant to your system.
+
 You're all set! ✅ Your project is cloned, and all dependencies are installed in an isolated environment.
 
 When you are finished working, you can leave the virtual environment by simply typing:
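The decision between the requirements files described above can also be sketched in code. A minimal illustrative helper (the `pick_requirements_file` function and the `nvidia-smi` heuristic are my own assumptions, not part of the repository):

```python
import shutil

def pick_requirements_file(local_inference: bool = True) -> str:
    """Suggest one of the repo's requirements files (illustrative only).

    Heuristic: finding `nvidia-smi` on PATH is taken as a rough sign of a
    usable NVIDIA GPU; it is not a guarantee of a working CUDA setup.
    """
    if not local_inference:
        # Lightweight install without local model inference (simplest first step)
        return "requirements_no_local.txt"
    if shutil.which("nvidia-smi"):
        return "requirements_gpu.txt"
    return "requirements_cpu.txt"

print(pick_requirements_file(local_inference=False))  # requirements_no_local.txt
```

The boolean parameter mirrors the README's advice: start without local inference, then move to the GPU or CPU file once the basic install works.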
app.py CHANGED
@@ -6,7 +6,7 @@ from datetime import datetime
 from tools.helper_functions import put_columns_in_df, get_connection_params, view_table, empty_output_vars_extract_topics, empty_output_vars_summarise, load_in_previous_reference_file, join_cols_onto_reference_df, load_in_previous_data_files, load_in_data_file, load_in_default_cost_codes, reset_base_dataframe, update_cost_code_dataframe_from_dropdown_select, df_select_callback_cost, enforce_cost_codes, _get_env_list, move_overall_summary_output_files_to_front_page
 from tools.aws_functions import upload_file_to_s3, download_file_from_s3
 from tools.llm_api_call import modify_existing_output_tables, wrapper_extract_topics_per_column_value, all_in_one_pipeline
-from tools.dedup_summaries import sample_reference_table_summaries, summarise_output_topics, deduplicate_topics, overall_summary
+from tools.dedup_summaries import sample_reference_table_summaries, summarise_output_topics, deduplicate_topics, deduplicate_topics_llm, overall_summary
 from tools.combine_sheets_into_xlsx import collect_output_csvs_and_create_excel_output
 from tools.custom_csvlogger import CSVLogger_custom
 from tools.auth import authenticate_user
@@ -171,7 +171,7 @@ with app:
 in_data_files, in_colnames, context_textbox, original_data_file_name_textbox, topic_extraction_output_files_xlsx, display_topic_table_markdown, output_messages_textbox, candidate_topics, produce_structured_summary_radio, in_group_col, batch_size_number,
 ):
 gr.Info(
-"Example data loaded. Now click on the 'All in one...' button below to run the full suite of topic extraction, deduplication, and summarisation."
+"Example data loaded. Now click on the 'Extract topics...' button below to run the full suite of topic extraction, deduplication, and summarisation."
 )
 
 examples = gr.Examples(examples=\
@@ -251,24 +251,23 @@ with app:
 
 with gr.Tab(label="Advanced - Step by step topic extraction and summarisation"):
 
-with gr.Accordion("1. Extract topics - go to first tab for file upload, model choice, and other settings before clicking this button", open = True):
+with gr.Accordion("1. Extract topics - go to first tab for file upload, model choice, and other settings before clicking this button", open = False):
 context_textbox.render()
 extract_topics_btn = gr.Button("1. Extract topics", variant="secondary")
-topic_extraction_output_files = gr.File(label="Extract topics output files", scale=1, interactive=False)
+topic_extraction_output_files = gr.File(label="Extract topics output files", scale=1, interactive=False, height=FILE_INPUT_HEIGHT)
 
 with gr.Accordion("2. Modify topics from topic extraction", open = False):
 gr.Markdown("""Load in previously completed Extract Topics output files ('reference_table', and 'unique_topics' files) to modify topics, deduplicate topics, or summarise the outputs. If you want pivot table outputs, please load in the original data file along with the selected open text column on the first tab before deduplicating or summarising.""")
 
-
-modification_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload files to modify topics", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
+modification_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload reference and unique topic files to modify topics", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
 
 modifiable_unique_topics_df_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=(4, "fixed"), row_count = (1, "fixed"), visible=True, type="pandas")
 
 save_modified_files_button = gr.Button(value="Save modified topic names")
 
-with gr.Accordion("3. Deduplicate topics - upload reference data file and unique data files", open = False):
+with gr.Accordion("3. Deduplicate topics using fuzzy matching or LLMs", open = False):
 ### DEDUPLICATION
-deduplication_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload files to deduplicate topics", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
+deduplication_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload reference and unique topic files to deduplicate topics. Optionally upload suggested topics on the first tab to match to these where possible with LLM deduplication", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
 deduplication_input_files_status = gr.Textbox(value = "", label="Previous file input", visible=False)
 
 with gr.Row():
@@ -276,11 +275,13 @@ with app:
 merge_sentiment_drop = gr.Dropdown(label="Merge sentiment values together for duplicate subtopics.", value="No", choices=["Yes", "No"])
 deduplicate_score_threshold = gr.Number(label="Similarity threshold with which to determine duplicates.", value = 90, minimum=5, maximum=100, precision=0)
 
-deduplicate_previous_data_btn = gr.Button("3. Deduplicate topics", variant="primary")
+with gr.Row():
+deduplicate_previous_data_btn = gr.Button("3. Deduplicate topics (Fuzzy matching)", variant="primary")
+deduplicate_llm_previous_data_btn = gr.Button("3b. Deduplicate topics (LLM semantic)", variant="secondary")
 
 with gr.Accordion("4. Summarise topics", open = False):
 ### SUMMARISATION
-summarisation_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload files to summarise", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
+summarisation_input_files = gr.File(height=FILE_INPUT_HEIGHT, label="Upload reference and unique topic files to summarise", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet'])
 
 summarise_format_radio = gr.Radio(label="Choose summary type", value=two_para_summary_format_prompt, choices=[two_para_summary_format_prompt, single_para_summary_format_prompt])
 
@@ -476,6 +477,15 @@ with app:
 deduplicate_previous_data_btn.click(load_in_previous_data_files, inputs=[deduplication_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
 success(deduplicate_topics, inputs=[master_reference_df_state, master_unique_topics_df_state, working_data_file_name_textbox, unique_topics_table_file_name_textbox, in_excel_sheets, merge_sentiment_drop, merge_general_topics_drop, deduplicate_score_threshold, in_data_files, in_colnames, output_folder_state], outputs=[master_reference_df_state, master_unique_topics_df_state, summarisation_input_files, log_files_output, summarised_output_markdown], scroll_to_output=True, api_name="deduplicate_topics")
 
+# When LLM deduplication button pressed, deduplicate data using LLM
+def deduplicate_topics_llm_wrapper(reference_df, topic_summary_df, reference_table_file_name, unique_topics_table_file_name, model_choice, in_api_key, temperature, in_excel_sheets, merge_sentiment, merge_general_topics, in_data_files, chosen_cols, output_folder, candidate_topics=None):
+    model_source = model_name_map[model_choice]["source"]
+    return deduplicate_topics_llm(reference_df, topic_summary_df, reference_table_file_name, unique_topics_table_file_name, model_choice, in_api_key, temperature, model_source, None, None, None, None, in_excel_sheets, merge_sentiment, merge_general_topics, in_data_files, chosen_cols, output_folder, candidate_topics)
+
+deduplicate_llm_previous_data_btn.click(load_in_previous_data_files, inputs=[deduplication_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
+success(deduplicate_topics_llm_wrapper, inputs=[master_reference_df_state, master_unique_topics_df_state, working_data_file_name_textbox, unique_topics_table_file_name_textbox, model_choice, google_api_key_textbox, temperature_slide, in_excel_sheets, merge_sentiment_drop, merge_general_topics_drop, in_data_files, in_colnames, output_folder_state, candidate_topics], outputs=[master_reference_df_state, master_unique_topics_df_state, summarisation_input_files, log_files_output, summarised_output_markdown, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number], scroll_to_output=True, api_name="deduplicate_topics_llm").\
+success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False, api_name="usage_logs_llm_dedup")
+
 # When button pressed, summarise previous data
 summarise_previous_data_btn.click(empty_output_vars_summarise, inputs=None, outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, overall_summarisation_input_files]).\
 success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
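The fuzzy-matching button above passes `deduplicate_score_threshold` (default 90 on a 0–100 scale) into `deduplicate_topics`. The app itself uses rapidfuzz for this; a rough standalone sketch of the underlying idea using only stdlib `difflib`, with made-up topic names (this is not the app's actual matching logic):

```python
from difflib import SequenceMatcher

def merge_similar_topics(topics, threshold=90):
    """Map each topic to the first earlier topic scoring >= threshold (0-100)."""
    canonical: dict[str, str] = {}
    kept: list[str] = []
    for topic in topics:
        match = next(
            (k for k in kept
             if SequenceMatcher(None, topic.lower(), k.lower()).ratio() * 100 >= threshold),
            None,
        )
        canonical[topic] = match if match is not None else topic
        if match is None:
            kept.append(topic)
    return canonical

mapping = merge_similar_topics(["Road maintenance", "Road maintainance", "Parks"])
print(mapping["Road maintainance"])  # Road maintenance
```

The misspelled "Road maintainance" scores above 90 against "Road maintenance" and is folded into it, while "Parks" survives unchanged; the new LLM path is meant to catch duplicates that are semantically the same but score low on this kind of string similarity.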
requirements.txt CHANGED
@@ -1,9 +1,9 @@
-# Note that this requirements file is optimised for Hugging Face spaces / Python 3.10. Please use requirements_cpu.txt for CPU instances and requirements_gpu.txt for GPU instances using Python 3.11
+# Note that this requirements file is optimised for Hugging Face spaces / Python 3.10. Please use requirements_no_local.txt for installation without local model inference (simplest approach to get going). Please use requirements_cpu.txt for CPU instances and requirements_gpu.txt for GPU instances using Python 3.11
-pandas==2.3.2
+pandas==2.3.3
-gradio==5.48.0
+gradio==5.49.1
 transformers==4.56.0
-spaces==0.40.1
+spaces==0.42.1
-boto3==1.40.22
+boto3==1.40.48
 pyarrow==21.0.0
 openpyxl==3.1.5
 markdown==3.7
@@ -23,11 +23,7 @@ torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124
 unsloth[cu124-torch260]==2025.9.4
 unsloth_zoo==2025.9.5
 timm==1.0.19
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
-
-# CPU only (for e.g. Hugging Face CPU instances)
-#torch==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
-# For Hugging Face, need a python 3.10 compatible wheel for llama-cpp-python to avoid build timeouts
-#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
+# llama-cpp-python direct wheel link for GPU compatible version 3.16 for use with Python 3.10 and Hugging Face
+llama-cpp-python==0.3.16 @ https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
 
 
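The wheel filenames pinned above encode their compatibility as standard tags (`cp310` for CPython 3.10, `linux_x86_64` or `win_amd64` for the platform), which is why each Python version and OS needs its own link. A quick sketch of reading those fields (a naive split that ignores optional build tags):

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its tag fields (no build-tag handling)."""
    stem = filename.removesuffix(".whl")
    name, version, py_tag, abi_tag, platform = stem.split("-")
    return {"name": name, "version": version, "python": py_tag,
            "abi": abi_tag, "platform": platform}

tags = parse_wheel_filename("llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl")
print(tags["python"], tags["platform"])  # cp310 linux_x86_64
```

So the Hugging Face requirements file must point at a `cp310` Linux wheel, while the local-install files use `cp311` wheels for Linux or Windows.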
requirements_cpu.txt CHANGED
@@ -1,8 +1,8 @@
-pandas==2.3.2
+pandas==2.3.3
-gradio==5.48.0
+gradio==5.49.1
 transformers==4.56.0
-spaces==0.40.1
+spaces==0.42.1
-boto3==1.40.22
+boto3==1.40.48
 pyarrow==21.0.0
 openpyxl==3.1.5
 markdown==3.7
@@ -16,8 +16,10 @@ beautifulsoup4==4.12.3
 rapidfuzz==3.13.0
 python-dotenv==1.1.0
 torch==2.8.0 --extra-index-url https://download.pytorch.org/whl/cpu
-# Linux, Python 3.11 compatible wheel available:
-#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
-# Windows, Python 3.11 compatible wheel available:
-https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64_cpu_openblas.whl
+llama-cpp-python==0.3.16 -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
+# Direct wheel links if above doesn't work
+# I have created CPU Linux, Python 3.11 compatible wheels:
+# llama-cpp-python==0.3.16 @ https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
+# Windows, Python 3.11 compatible CPU wheels available:
+# llama-cpp-python==0.3.16 @ https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64_cpu_openblas.whl
 # If above doesn't work for Windows, try looking at 'windows_install_llama-cpp-python.txt' for instructions on how to build from source
requirements_gpu.txt CHANGED
@@ -1,9 +1,8 @@
-pandas==2.3.2
+pandas==2.3.3
-gradio==5.48.0
+gradio==5.49.1
-huggingface_hub[hf_xet]==0.34.4
 transformers==4.56.0
-spaces==0.40.1
+spaces==0.42.1
-boto3==1.40.22
+boto3==1.40.48
 pyarrow==21.0.0
 openpyxl==3.1.5
 markdown==3.7
@@ -25,7 +24,11 @@ unsloth_zoo==2025.9.5
 #triton-windows<3.3
 timm==1.0.19
 # Llama CPP Python
+llama-cpp-python==0.3.16 -C cmake.args="-DGGML_CUDA=on"
+# If the above doesn't work, try specific wheels for your system:
 # For Linux:
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
-# For Windows:
-#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64.whl
+# See files in https://github.com/abetlen/llama-cpp-python/releases/tag/v0.3.16-cu124 for different python versions
+# Python 3.11 compatible wheel:
+# llama-cpp-python==0.3.16 @ https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
+# For Windows, not available at above link. I have made a GPU Windows wheel for Python 3.11:
+# llama-cpp-python==0.3.16 @ https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64.whl
requirements_no_local.txt CHANGED
@@ -1,9 +1,9 @@
 # This requirements file is optimised for AWS ECS using Python 3.11 alongside the Dockerfile, without local torch and llama-cpp-python. For AWS ECS, torch and llama-cpp-python are optionally installed in the main Dockerfile
-pandas==2.3.2
+pandas==2.3.3
-gradio==5.48.0
+gradio==5.49.1
 transformers==4.56.0
-spaces==0.40.1
+spaces==0.42.1
-boto3==1.40.22
+boto3==1.40.48
 pyarrow==21.0.0
 openpyxl==3.1.5
 markdown==3.7
tools/config.py CHANGED
@@ -380,7 +380,7 @@ LLM_THREADS = int(get_or_create_env_var('LLM_THREADS', '-1'))
 LLM_BATCH_SIZE = int(get_or_create_env_var('LLM_BATCH_SIZE', '512'))
 LLM_CONTEXT_LENGTH = int(get_or_create_env_var('LLM_CONTEXT_LENGTH', '32768'))
 LLM_SAMPLE = get_or_create_env_var('LLM_SAMPLE', 'True')
-LLM_STOP_STRINGS = get_or_create_env_var('LLM_STOP_STRINGS', r"[' ','\n\n\n\n','---------------------------------------------']")
+LLM_STOP_STRINGS = get_or_create_env_var('LLM_STOP_STRINGS', r"['\n\n\n\n\n\n']")
 MULTIMODAL_PROMPT_FORMAT = get_or_create_env_var('MULTIMODAL_PROMPT_FORMAT', 'False')
 SPECULATIVE_DECODING = get_or_create_env_var('SPECULATIVE_DECODING', 'False')
 NUM_PRED_TOKENS = int(get_or_create_env_var('NUM_PRED_TOKENS', '2'))
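Settings like `LLM_STOP_STRINGS` above are stored as a Python list literal inside an environment-variable string, so the default of six newlines round-trips as the raw text `['\n\n\n\n\n\n']`. The repo has its own `_get_env_list` helper in `tools/helper_functions`; this standalone version is an illustrative sketch of the same idea:

```python
import ast
import os

def parse_env_list(var_name: str, default: str) -> list:
    """Parse an env var holding a Python list literal into a real list."""
    raw = os.environ.get(var_name, default)
    value = ast.literal_eval(raw)  # safely evaluates literals only, not arbitrary code
    if not isinstance(value, list):
        raise ValueError(f"{var_name} must contain a list literal, got {type(value).__name__}")
    return value

# The raw string contains literal backslash-n pairs; literal_eval turns them into newlines
stop_strings = parse_env_list('LLM_STOP_STRINGS', r"['\n\n\n\n\n\n']")
print(len(stop_strings))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing safe even though the value comes from the environment.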
tools/dedup_summaries.py CHANGED
@@ -9,12 +9,12 @@ import markdown
9
  import boto3
10
  from tqdm import tqdm
11
  import os
12
-
13
- from tools.prompts import summarise_topic_descriptions_prompt, summarise_topic_descriptions_system_prompt, system_prompt, summarise_everything_prompt, comprehensive_summary_format_prompt, summarise_everything_system_prompt, comprehensive_summary_format_prompt_by_group, summary_assistant_prefill
14
- from tools.llm_funcs import construct_gemini_generative_model, process_requests, ResponseObject, load_model, calculate_tokens_from_metadata, construct_azure_client, get_model, get_tokenizer, get_assistant_model
15
- from tools.helper_functions import create_topic_summary_df_from_reference_table, load_in_data_file, get_basic_response_data, convert_reference_table_to_pivot_table, wrap_text, clean_column_name, get_file_name_no_ext, create_batch_file_path_details
16
  from tools.aws_functions import connect_to_bedrock_runtime
17
- from tools.config import OUTPUT_FOLDER, RUN_LOCAL_MODEL, MAX_COMMENT_CHARS, LLM_MAX_NEW_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, REASONING_SUFFIX, AZURE_INFERENCE_ENDPOINT, MAX_SPACES_GPU_RUN_TIME, OUTPUT_DEBUG_FILES
18
 
19
  max_tokens = LLM_MAX_NEW_TOKENS
20
  timeout_wait = TIMEOUT_WAIT
@@ -156,17 +156,16 @@ def deduplicate_topics(reference_df:pd.DataFrame,
156
  file_data = pd.DataFrame()
157
  deduplicated_unique_table_markdown = ""
158
 
 
159
  if (len(reference_df["Response References"].unique()) == 1) | (len(topic_summary_df["Topic_number"].unique()) == 1):
160
  print("Data file outputs are too short for deduplicating. Returning original data.")
161
 
162
- reference_file_out_path = output_folder + reference_table_file_name
163
  unique_topics_file_out_path = output_folder + unique_topics_table_file_name
164
 
165
  output_files.append(reference_file_out_path)
166
  output_files.append(unique_topics_file_out_path)
167
- return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown
168
-
169
-
170
 
171
  # For checking that data is not lost during the process
172
  initial_unique_references = len(reference_df["Response References"].unique())
@@ -376,6 +375,341 @@ def deduplicate_topics(reference_df:pd.DataFrame,
376
 
377
  return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown
378
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
379
  def sample_reference_table_summaries(reference_df:pd.DataFrame,
380
  random_seed:int,
381
  no_of_sampled_summaries:int=150):
 
import boto3
from tqdm import tqdm
import os
+ from tools.llm_api_call import generate_zero_shot_topics_df
+ from tools.prompts import summarise_topic_descriptions_prompt, summarise_topic_descriptions_system_prompt, system_prompt, summarise_everything_prompt, comprehensive_summary_format_prompt, summarise_everything_system_prompt, comprehensive_summary_format_prompt_by_group, summary_assistant_prefill, llm_deduplication_system_prompt, llm_deduplication_prompt, llm_deduplication_prompt_with_candidates
+ from tools.llm_funcs import construct_gemini_generative_model, process_requests, ResponseObject, load_model, calculate_tokens_from_metadata, construct_azure_client, get_model, get_tokenizer, get_assistant_model, send_request, call_llm_with_markdown_table_checks
+ from tools.helper_functions import create_topic_summary_df_from_reference_table, load_in_data_file, get_basic_response_data, convert_reference_table_to_pivot_table, wrap_text, clean_column_name, get_file_name_no_ext, create_batch_file_path_details, read_file
from tools.aws_functions import connect_to_bedrock_runtime
+ from tools.config import OUTPUT_FOLDER, RUN_LOCAL_MODEL, MAX_COMMENT_CHARS, LLM_MAX_NEW_TOKENS, LLM_SEED, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, REASONING_SUFFIX, AZURE_INFERENCE_ENDPOINT, MAX_SPACES_GPU_RUN_TIME, OUTPUT_DEBUG_FILES

max_tokens = LLM_MAX_NEW_TOKENS
timeout_wait = TIMEOUT_WAIT
    file_data = pd.DataFrame()
    deduplicated_unique_table_markdown = ""

+
    if (len(reference_df["Response References"].unique()) == 1) | (len(topic_summary_df["Topic_number"].unique()) == 1):
        print("Data file outputs are too short for deduplicating. Returning original data.")

+         reference_file_out_path = output_folder + reference_table_file_name
        unique_topics_file_out_path = output_folder + unique_topics_table_file_name

        output_files.append(reference_file_out_path)
        output_files.append(unique_topics_file_out_path)
+         return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown

    # For checking that data is not lost during the process
    initial_unique_references = len(reference_df["Response References"].unique())
 

    return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown

+ def deduplicate_topics_llm(reference_df:pd.DataFrame,
+                            topic_summary_df:pd.DataFrame,
+                            reference_table_file_name:str,
+                            unique_topics_table_file_name:str,
+                            model_choice:str,
+                            in_api_key:str,
+                            temperature:float,
+                            model_source:str,
+                            bedrock_runtime=None,
+                            local_model=None,
+                            tokenizer=None,
+                            assistant_model=None,
+                            in_excel_sheets:str="",
+                            merge_sentiment:str="No",
+                            merge_general_topics:str="No",
+                            in_data_files:List[str]=list(),
+                            chosen_cols:List[str]="",
+                            output_folder:str=OUTPUT_FOLDER,
+                            candidate_topics=None
+                            ):
+     '''
+     Deduplicate topics using LLM semantic understanding to identify and merge similar topics.
+ 
+     Args:
+         reference_df (pd.DataFrame): DataFrame containing reference data with topics.
+         topic_summary_df (pd.DataFrame): DataFrame summarizing unique topics.
+         reference_table_file_name (str): Base file name for the output reference table.
+         unique_topics_table_file_name (str): Base file name for the output unique topics table.
+         model_choice (str): The LLM model to use for deduplication.
+         in_api_key (str): API key for the LLM service.
+         temperature (float): Temperature setting for the LLM.
+         model_source (str): Source of the model (AWS, Gemini, Local, etc.).
+         bedrock_runtime: AWS Bedrock runtime client (if using AWS).
+         local_model: Local model instance (if using local model).
+         tokenizer: Tokenizer for local model.
+         assistant_model: Assistant model for speculative decoding.
+         in_excel_sheets (str, optional): Comma-separated list of Excel sheet names to load. Defaults to "".
+         merge_sentiment (str, optional): Whether to merge topics regardless of sentiment ("Yes" or "No"). Defaults to "No".
+         merge_general_topics (str, optional): Whether to merge topics across different general topics ("Yes" or "No"). Defaults to "No".
+         in_data_files (List[str], optional): List of input data file paths. Defaults to [].
+         chosen_cols (List[str], optional): List of chosen columns from the input data files. Defaults to "".
+         output_folder (str, optional): Folder path to save output files. Defaults to OUTPUT_FOLDER.
+         candidate_topics (optional): Candidate topics file for zero-shot guidance. Defaults to None.
+ 
+     Returns:
+         tuple: (reference_df, topic_summary_df, output_files, log_output_files,
+             deduplicated_unique_table_markdown, total_input_tokens, total_output_tokens,
+             number_of_calls, estimated_time_taken).
+     '''
+ 
+     output_files = list()
+     log_output_files = list()
+     file_data = pd.DataFrame()
+     deduplicated_unique_table_markdown = ""
+ 
+     # Check if data is too short for deduplication
+     if (len(reference_df["Response References"].unique()) == 1) | (len(topic_summary_df["Topic_number"].unique()) == 1):
+         print("Data file outputs are too short for deduplicating. Returning original data.")
+ 
+         reference_file_out_path = output_folder + reference_table_file_name
+         unique_topics_file_out_path = output_folder + unique_topics_table_file_name
+ 
+         output_files.append(reference_file_out_path)
+         output_files.append(unique_topics_file_out_path)
+         # Return zeroed token/call counts so this early exit matches the main return signature below
+         return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown, 0, 0, 0, 0.0
+ 
+     # For checking that data is not lost during the process
+     initial_unique_references = len(reference_df["Response References"].unique())
+ 
+     # Create topic summary if it doesn't exist
+     if topic_summary_df.empty:
+         topic_summary_df = create_topic_summary_df_from_reference_table(reference_df)
+ 
+     # Merge topic numbers back to the original dataframe
+     reference_df = reference_df.merge(
+         topic_summary_df[['General topic', 'Subtopic', 'Sentiment', 'Topic_number']],
+         on=['General topic', 'Subtopic', 'Sentiment'],
+         how='left'
+     )
+ 
+     # Load data files if provided
+     if in_data_files and chosen_cols:
+         file_data, data_file_names_textbox, total_number_of_batches = load_in_data_file(in_data_files, chosen_cols, 1, in_excel_sheets)
+     else:
+         out_message = "No file data found, pivot table output will not be created."
+         print(out_message)
+ 
+     # Process candidate topics if provided
+     candidate_topics_table = ""
+     if candidate_topics is not None:
+         try:
+             # Read and process candidate topics
+             candidate_topics_df = read_file(candidate_topics.name)
+             candidate_topics_df = candidate_topics_df.fillna("")
+             candidate_topics_df = candidate_topics_df.astype(str)
+ 
+             # Generate zero-shot topics DataFrame
+             zero_shot_topics_df = generate_zero_shot_topics_df(candidate_topics_df, "No", False)
+ 
+             if not zero_shot_topics_df.empty:
+                 candidate_topics_table = zero_shot_topics_df[['General topic', 'Subtopic']].to_markdown(index=False)
+                 print(f"Found {len(zero_shot_topics_df)} candidate topics to consider during deduplication")
+         except Exception as e:
+             print(f"Error processing candidate topics: {e}")
+             candidate_topics_table = ""
+ 
+     # Prepare topics table for LLM analysis
+     topics_table = topic_summary_df[['General topic', 'Subtopic', 'Sentiment', 'Number of responses']].to_markdown(index=False)
+ 
+     # Format the prompt with candidate topics if available
+     if candidate_topics_table:
+         formatted_prompt = llm_deduplication_prompt_with_candidates.format(
+             topics_table=topics_table,
+             candidate_topics_table=candidate_topics_table
+         )
+     else:
+         formatted_prompt = llm_deduplication_prompt.format(topics_table=topics_table)
+ 
+     # Initialise conversation history
+     conversation_history = []
+     whole_conversation = []
+     whole_conversation_metadata = []
+ 
+     # Set up model clients based on model source
+     if "Gemini" in model_source:
+         google_client, config = construct_gemini_generative_model(
+             in_api_key, temperature, model_choice, llm_deduplication_system_prompt,
+             max_tokens, LLM_SEED
+         )
+         bedrock_runtime = None
+     elif "AWS" in model_source:
+         if not bedrock_runtime:
+             bedrock_runtime = boto3.client('bedrock-runtime')
+         google_client = None
+         config = None
+     elif "Azure" in model_source:
+         google_client, config = construct_azure_client(in_api_key, "")
+         bedrock_runtime = None
+     elif "Local" in model_source:
+         google_client = None
+         config = None
+         bedrock_runtime = None
+     else:
+         raise ValueError(f"Unsupported model source: {model_source}")
+ 
+     # Call LLM to get deduplication suggestions
+     print("Calling LLM for topic deduplication analysis...")
+ 
+     # Use the existing call_llm_with_markdown_table_checks function
+     responses, conversation_history, whole_conversation, whole_conversation_metadata, response_text = call_llm_with_markdown_table_checks(
+         batch_prompts=[formatted_prompt],
+         system_prompt=llm_deduplication_system_prompt,
+         conversation_history=conversation_history,
+         whole_conversation=whole_conversation,
+         whole_conversation_metadata=whole_conversation_metadata,
+         google_client=google_client,
+         google_config=config,
+         model_choice=model_choice,
+         temperature=temperature,
+         reported_batch_no=1,
+         local_model=local_model,
+         tokenizer=tokenizer,
+         bedrock_runtime=bedrock_runtime,
+         model_source=model_source,
+         MAX_OUTPUT_VALIDATION_ATTEMPTS=3,
+         assistant_prefill="",
+         master=False,
+         CHOSEN_LOCAL_MODEL_TYPE=CHOSEN_LOCAL_MODEL_TYPE,
+         random_seed=LLM_SEED
+     )
+ 
+     # Generate debug files if enabled
+     if OUTPUT_DEBUG_FILES == "True":
+         try:
+             # Create batch file path details for debug files
+             batch_file_path_details = get_file_name_no_ext(reference_table_file_name) + "_llm_dedup"
+             model_choice_clean_short = model_choice.replace("/", "_").replace(":", "_").replace(".", "_")
+ 
+             # Create full prompt for debug output
+             full_prompt = llm_deduplication_system_prompt + "\n" + formatted_prompt
+ 
+             # Write debug files
+             current_prompt_content_logged, current_summary_content_logged, current_conversation_content_logged, current_metadata_content_logged = process_debug_output_iteration(
+                 OUTPUT_DEBUG_FILES,
+                 output_folder,
+                 batch_file_path_details,
+                 model_choice_clean_short,
+                 full_prompt,
+                 response_text,
+                 whole_conversation,
+                 whole_conversation_metadata,
+                 log_output_files,
+                 task_type="llm_deduplication"
+             )
+ 
+             print("Debug files written for LLM deduplication analysis")
+ 
+         except Exception as e:
+             print(f"Error writing debug files for LLM deduplication: {e}")
+ 
+     # Parse the LLM response to extract merge suggestions
+     merge_suggestions_df = pd.DataFrame()  # Initialise empty DataFrame for analysis results
+     num_merges_applied = 0
+ 
+     try:
+         # Extract the markdown table from the response
+         table_match = re.search(r'\|.*\|.*\n\|.*\|.*\n(\|.*\|.*\n)*', response_text, re.MULTILINE)
+         if table_match:
+             table_text = table_match.group(0)
+ 
+             # Convert markdown table to DataFrame
+             from io import StringIO
+             merge_suggestions_df = pd.read_csv(StringIO(table_text), sep='|', skipinitialspace=True)
+ 
+             # Clean up the DataFrame
+             merge_suggestions_df = merge_suggestions_df.dropna(axis=1, how='all')  # Remove empty columns
+             merge_suggestions_df.columns = merge_suggestions_df.columns.str.strip()
+ 
+             # Drop the markdown header separator row (e.g. '---') and rows where all values are NaN
+             if not merge_suggestions_df.empty:
+                 merge_suggestions_df = merge_suggestions_df[~merge_suggestions_df.iloc[:, 0].astype(str).str.fullmatch(r'\s*:?-+:?\s*')]
+             merge_suggestions_df = merge_suggestions_df.dropna(how='all')
+             merge_suggestions_df = merge_suggestions_df.fillna("")
+ 
+             if not merge_suggestions_df.empty:
+                 print(f"LLM identified {len(merge_suggestions_df)} potential topic merges")
+ 
+                 # Apply the merges to the reference_df
+                 for _, row in merge_suggestions_df.iterrows():
+                     original_general = row.get('Original General topic', '').strip()
+                     original_subtopic = row.get('Original Subtopic', '').strip()
+                     original_sentiment = row.get('Original Sentiment', '').strip()
+                     merged_general = row.get('Merged General topic', '').strip()
+                     merged_subtopic = row.get('Merged Subtopic', '').strip()
+                     merged_sentiment = row.get('Merged Sentiment', '').strip()
+ 
+                     if all([original_general, original_subtopic, original_sentiment,
+                             merged_general, merged_subtopic, merged_sentiment]):
+ 
+                         # Find matching rows in reference_df
+                         mask = (
+                             (reference_df['General topic'] == original_general) &
+                             (reference_df['Subtopic'] == original_subtopic) &
+                             (reference_df['Sentiment'] == original_sentiment)
+                         )
+ 
+                         if mask.any():
+                             # Update the matching rows
+                             reference_df.loc[mask, 'General topic'] = merged_general
+                             reference_df.loc[mask, 'Subtopic'] = merged_subtopic
+                             reference_df.loc[mask, 'Sentiment'] = merged_sentiment
+                             num_merges_applied += 1
+                             print(f"Merged: {original_general} | {original_subtopic} | {original_sentiment} -> {merged_general} | {merged_subtopic} | {merged_sentiment}")
+             else:
+                 print("No merge suggestions found in LLM response")
+         else:
+             print("No markdown table found in LLM response")
+ 
+     except Exception as e:
+         print(f"Error parsing LLM response: {e}")
+         print("Continuing with original data...")
+ 
+     # Update reference summary column with all summaries
+     reference_df["Summary"] = reference_df.groupby(
+         ["Response References", "General topic", "Subtopic", "Sentiment"]
+     )["Summary"].transform(' <br> '.join)
+ 
+     # Check that we have not inadvertently removed some data during the process
+     end_unique_references = len(reference_df["Response References"].unique())
+ 
+     if initial_unique_references != end_unique_references:
+         raise Exception(f"Number of unique references changed during processing: Initial={initial_unique_references}, Final={end_unique_references}")
+ 
+     # Drop duplicates in the reference table
+     reference_df.drop_duplicates(['Response References', 'General topic', 'Subtopic', 'Sentiment'], inplace=True)
+ 
+     # Remake topic_summary_df based on new reference_df
+     topic_summary_df = create_topic_summary_df_from_reference_table(reference_df)
+ 
+     # Merge the topic numbers back to the original dataframe
+     reference_df = reference_df.merge(
+         topic_summary_df[['General topic', 'Subtopic', 'Sentiment', 'Group', 'Topic_number']],
+         on=['General topic', 'Subtopic', 'Sentiment', 'Group'],
+         how='left'
+     )
+ 
+     # Create pivot table if file data is available
+     if not file_data.empty:
+         basic_response_data = get_basic_response_data(file_data, chosen_cols)
+         reference_df_pivot = convert_reference_table_to_pivot_table(reference_df, basic_response_data)
+ 
+         reference_pivot_file_path = output_folder + get_file_name_no_ext(reference_table_file_name) + "_pivot_dedup.csv"
+         reference_df_pivot.to_csv(reference_pivot_file_path, index=None, encoding='utf-8-sig')
+         log_output_files.append(reference_pivot_file_path)
+ 
+     # Save analysis results CSV if merge suggestions were found
+     if not merge_suggestions_df.empty:
+         analysis_results_file_path = output_folder + get_file_name_no_ext(reference_table_file_name) + "_llm_analysis_results.csv"
+         merge_suggestions_df.to_csv(analysis_results_file_path, index=None, encoding='utf-8-sig')
+         log_output_files.append(analysis_results_file_path)
+         print(f"Analysis results saved to: {analysis_results_file_path}")
+ 
+     # Save output files
+     reference_file_out_path = output_folder + get_file_name_no_ext(reference_table_file_name) + "_dedup.csv"
+     unique_topics_file_out_path = output_folder + get_file_name_no_ext(unique_topics_table_file_name) + "_dedup.csv"
+     reference_df.to_csv(reference_file_out_path, index=None, encoding='utf-8-sig')
+     topic_summary_df.to_csv(unique_topics_file_out_path, index=None, encoding='utf-8-sig')
+ 
+     output_files.append(reference_file_out_path)
+     output_files.append(unique_topics_file_out_path)
+ 
+     # Outputs for markdown table output
+     topic_summary_df_revised_display = topic_summary_df.apply(lambda col: col.map(lambda x: wrap_text(x, max_text_length=500)))
+     deduplicated_unique_table_markdown = topic_summary_df_revised_display.to_markdown(index=False)
+ 
+     # Calculate token usage and timing information for logging
+     total_input_tokens = 0
+     total_output_tokens = 0
+     number_of_calls = 1  # Single LLM call for deduplication
+ 
+     # Extract token usage from conversation metadata
+     if whole_conversation_metadata:
+         for metadata in whole_conversation_metadata:
+             if "input_tokens:" in metadata and "output_tokens:" in metadata:
+                 try:
+                     input_tokens = int(metadata.split("input_tokens: ")[1].split(" ")[0])
+                     output_tokens = int(metadata.split("output_tokens: ")[1].split(" ")[0])
+                     total_input_tokens += input_tokens
+                     total_output_tokens += output_tokens
+                 except (ValueError, IndexError):
+                     pass
+ 
+     # Calculate estimated time taken (rough estimate based on token usage)
+     estimated_time_taken = (total_input_tokens + total_output_tokens) / 1000  # Rough estimate in seconds
+ 
+     return reference_df, topic_summary_df, output_files, log_output_files, deduplicated_unique_table_markdown, total_input_tokens, total_output_tokens, number_of_calls, estimated_time_taken  #, num_merges_applied
+ 
def sample_reference_table_summaries(reference_df:pd.DataFrame,
                                     random_seed:int,
                                     no_of_sampled_summaries:int=150):
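The core of the new `deduplicate_topics_llm` function above is extracting a markdown table of merge suggestions from free-form LLM output. A minimal standard-library sketch of that parsing step (no pandas; the regex mirrors the one in the diff, while `parse_markdown_table` and the sample response are invented for illustration):

```python
import re

def parse_markdown_table(response_text: str) -> list[dict]:
    """Extract the first markdown table from LLM output as a list of row dicts."""
    match = re.search(r'\|.*\|.*\n\|.*\|.*\n(\|.*\|.*\n)*', response_text, re.MULTILINE)
    if not match:
        return []
    lines = [line.strip() for line in match.group(0).strip().splitlines()]
    header = [cell.strip() for cell in lines[0].strip('|').split('|')]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.strip('|').split('|')]
        # Skip the header separator row (e.g. |---|---|)
        if all(re.fullmatch(r':?-+:?', cell) for cell in cells if cell):
            continue
        rows.append(dict(zip(header, cells)))
    return rows

sample = """Some analysis text.
| Original Subtopic | Merged Subtopic |
|---|---|
| Bus delays | Public transport delays |
"""
print(parse_markdown_table(sample))
```

Note that the separator row (`|---|---|`) must be filtered out explicitly; the pandas-based version in the diff handles the same concern before iterating over suggestions.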
tools/helper_functions.py CHANGED
@@ -7,7 +7,7 @@ import numpy as np
from typing import List
import math
from botocore.exceptions import ClientError
- from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, SESSION_OUTPUT_FOLDER, CUSTOM_HEADER, CUSTOM_HEADER_VALUE, AWS_USER_POOL_ID

def empty_output_vars_extract_topics():
    # Empty output objects before processing a new file
@@ -786,4 +786,116 @@ def create_batch_file_path_details(reference_data_file_name: str) -> str:


def move_overall_summary_output_files_to_front_page(overall_summary_output_files_xlsx:List[str]):
-     return overall_summary_output_files_xlsx
from typing import List
import math
from botocore.exceptions import ClientError
+ from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, SESSION_OUTPUT_FOLDER, CUSTOM_HEADER, CUSTOM_HEADER_VALUE, AWS_USER_POOL_ID, MAXIMUM_ZERO_SHOT_TOPICS

def empty_output_vars_extract_topics():
    # Empty output objects before processing a new file


def move_overall_summary_output_files_to_front_page(overall_summary_output_files_xlsx:List[str]):
+     return overall_summary_output_files_xlsx
+ 
+ def generate_zero_shot_topics_df(zero_shot_topics:pd.DataFrame,
+                                  force_zero_shot_radio:str="No",
+                                  create_revised_general_topics:bool=False,
+                                  max_topic_no:int=MAXIMUM_ZERO_SHOT_TOPICS):
+     """
+     Preprocesses a DataFrame of zero-shot topics, cleaning and formatting them
+     for use with a large language model. It handles different column configurations
+     (e.g., only subtopics, general topics and subtopics, or subtopics with descriptions)
+     and enforces a maximum number of topics.
+ 
+     Args:
+         zero_shot_topics (pd.DataFrame): A DataFrame containing the initial zero-shot topics.
+             Expected columns can vary, but typically include
+             "General topic", "Subtopic", and/or "Description".
+         force_zero_shot_radio (str, optional): Whether responses are being forced into
+             the zero-shot topics; if "Yes", a "No relevant topic" option is appended.
+             Defaults to "No".
+         create_revised_general_topics (bool, optional): A boolean indicating whether to
+             create revised general topics. Defaults to False.
+             (Currently not used in the function logic, but kept for signature consistency).
+         max_topic_no (int, optional): The maximum number of topics allowed to fit within
+             LLM context limits. If `zero_shot_topics` has more rows than this, an
+             exception is raised. Defaults to MAXIMUM_ZERO_SHOT_TOPICS.
+ 
+     Returns:
+         pd.DataFrame: A DataFrame with cleaned "General topic", "Subtopic" and
+             "Description" columns, ready for use in prompts.
+     """
+ 
+     zero_shot_topics_gen_topics_list = list()
+     zero_shot_topics_subtopics_list = list()
+     zero_shot_topics_description_list = list()
+ 
+     # Enforce the configured maximum number of zero-shot topics
+     if zero_shot_topics.shape[0] > max_topic_no:
+         out_message = "Maximum " + str(max_topic_no) + " zero-shot topics allowed according to application configuration."
+         print(out_message)
+         raise Exception(out_message)
+ 
+     # Forward slashes in the topic names seem to confuse the model
+     if zero_shot_topics.shape[1] >= 1:  # Check if there is at least one column
+         for x in zero_shot_topics.columns:
+             if not zero_shot_topics[x].isnull().all():
+                 zero_shot_topics[x] = zero_shot_topics[x].apply(initial_clean)
+ 
+                 zero_shot_topics.loc[:, x] = (
+                     zero_shot_topics.loc[:, x]
+                     .str.strip()
+                     .str.replace('\n', ' ')
+                     .str.replace('\r', ' ')
+                     .str.replace('/', ' or ')
+                     .str.lower()
+                     .str.capitalize())
+ 
+     # If number of columns is 1, keep only subtopics
+     if zero_shot_topics.shape[1] == 1 and "General topic" not in zero_shot_topics.columns:
+         print("Found only Subtopic in zero shot topics")
+         zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
+         zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 0])
+     # Allow for possibility that the user only wants to set general topics and not subtopics
+     elif zero_shot_topics.shape[1] == 1 and "General topic" in zero_shot_topics.columns:
+         print("Found only General topic in zero shot topics")
+         zero_shot_topics_gen_topics_list = list(zero_shot_topics["General topic"])
+         zero_shot_topics_subtopics_list = [""] * zero_shot_topics.shape[0]
+     # If general topic and subtopic are specified
+     elif set(["General topic", "Subtopic"]).issubset(zero_shot_topics.columns):
+         print("Found General topic and Subtopic in zero shot topics")
+         zero_shot_topics_gen_topics_list = list(zero_shot_topics["General topic"])
+         zero_shot_topics_subtopics_list = list(zero_shot_topics["Subtopic"])
+     # If subtopic and description are specified
+     elif set(["Subtopic", "Description"]).issubset(zero_shot_topics.columns):
+         print("Found Subtopic and Description in zero shot topics")
+         zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
+         zero_shot_topics_subtopics_list = list(zero_shot_topics["Subtopic"])
+         zero_shot_topics_description_list = list(zero_shot_topics["Description"])
+     # If number of columns is at least 2, keep general topics and subtopics
+     elif zero_shot_topics.shape[1] >= 2 and "Description" not in zero_shot_topics.columns:
+         zero_shot_topics_gen_topics_list = list(zero_shot_topics.iloc[:, 0])
+         zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 1])
+     else:
+         # If there are more columns, just assume that the first column was meant to be a subtopic
+         zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
+         zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 0])
+ 
+     # Add a description if a Description column is present and not already captured
+     if not zero_shot_topics_description_list:
+         if "Description" in zero_shot_topics.columns:
+             zero_shot_topics_description_list = list(zero_shot_topics["Description"])
+         elif zero_shot_topics.shape[1] >= 3:
+             zero_shot_topics_description_list = list(zero_shot_topics.iloc[:, 2])  # Assume the third column is description
+         else:
+             zero_shot_topics_description_list = [""] * zero_shot_topics.shape[0]
+ 
+     # If the responses are being forced into zero shot topics, allow an option for nothing relevant
+     if force_zero_shot_radio == "Yes":
+         zero_shot_topics_gen_topics_list.append("")
+         zero_shot_topics_subtopics_list.append("No relevant topic")
+         zero_shot_topics_description_list.append("")
+ 
+     zero_shot_topics_df = pd.DataFrame(data={
+         "General topic": zero_shot_topics_gen_topics_list,
+         "Subtopic": zero_shot_topics_subtopics_list,
+         "Description": zero_shot_topics_description_list
+     })
+ 
+     return zero_shot_topics_df
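The per-column cleaning chain in `generate_zero_shot_topics_df` above can be illustrated on a single string without pandas. A plain-Python sketch (`clean_topic_name` is a hypothetical stand-in for the `.str` accessor chain in the diff, not part of the repo's API):

```python
def clean_topic_name(name: str) -> str:
    """Mirror the pandas chain: strip, drop newlines, replace '/', then sentence-case."""
    name = name.strip()
    name = name.replace('\n', ' ').replace('\r', ' ')
    name = name.replace('/', ' or ')  # forward slashes seem to confuse the model
    return name.lower().capitalize()

print(clean_topic_name("  Roads/Pavements\nMaintenance "))  # -> "Roads or pavements maintenance"
```

`str.capitalize()` lowercases everything after the first character, which is why the whole chain ends in sentence case regardless of the input's original casing.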
tools/llm_api_call.py CHANGED
@@ -16,7 +16,7 @@ from io import StringIO
16
  GradioFileData = gr.FileData
17
 
18
  from tools.prompts import initial_table_prompt, initial_table_system_prompt, add_existing_topics_system_prompt, add_existing_topics_prompt, force_existing_topics_prompt, allow_new_topics_prompt, force_single_topic_prompt, add_existing_topics_assistant_prefill, initial_table_assistant_prefill, structured_summary_prompt, default_response_reference_format, negative_neutral_positive_sentiment_prompt, negative_or_positive_sentiment_prompt, default_sentiment_prompt
19
- from tools.helper_functions import read_file, put_columns_in_df, wrap_text, initial_clean, load_in_data_file, load_in_file, create_topic_summary_df_from_reference_table, convert_reference_table_to_pivot_table, get_basic_response_data, clean_column_name, load_in_previous_data_files, create_batch_file_path_details, move_overall_summary_output_files_to_front_page
20
  from tools.llm_funcs import ResponseObject, construct_gemini_generative_model, call_llm_with_markdown_table_checks, create_missing_references_df, calculate_tokens_from_metadata, construct_azure_client, get_model, get_tokenizer, get_assistant_model
21
  from tools.config import RUN_LOCAL_MODEL, AWS_REGION, MAX_COMMENT_CHARS, MAX_OUTPUT_VALIDATION_ATTEMPTS, LLM_MAX_NEW_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, OUTPUT_FOLDER, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, LLM_SEED, MAX_GROUPS, REASONING_SUFFIX, AZURE_INFERENCE_ENDPOINT, MAX_ROWS, MAXIMUM_ZERO_SHOT_TOPICS, MAX_SPACES_GPU_RUN_TIME, OUTPUT_DEBUG_FILES
22
  from tools.aws_functions import connect_to_bedrock_runtime
@@ -578,117 +578,7 @@ def write_llm_output_and_logs(response_text: str,
578
 
579
  return topic_table_out_path, reference_table_out_path, topic_summary_df_out_path, topic_with_response_df, out_reference_df, out_topic_summary_df, batch_file_path_details, is_error
580
 
581
- def generate_zero_shot_topics_df(zero_shot_topics:pd.DataFrame,
582
- force_zero_shot_radio:str="No",
583
- create_revised_general_topics:bool=False,
584
- max_topic_no:int=maximum_zero_shot_topics):
585
- """
586
- Preprocesses a DataFrame of zero-shot topics, cleaning and formatting them
587
- for use with a large language model. It handles different column configurations
588
- (e.g., only subtopics, general topics and subtopics, or subtopics with descriptions)
589
- and enforces a maximum number of topics.
590
-
591
- Args:
592
- zero_shot_topics (pd.DataFrame): A DataFrame containing the initial zero-shot topics.
593
- Expected columns can vary, but typically include
594
- "General topic", "Subtopic", and/or "Description".
595
- force_zero_shot_radio (str, optional): A string indicating whether to force
596
- the use of zero-shot topics. Defaults to "No".
597
- (Currently not used in the function logic, but kept for signature consistency).
598
- create_revised_general_topics (bool, optional): A boolean indicating whether to
599
- create revised general topics. Defaults to False.
600
- (Currently not used in the function logic, but kept for signature consistency).
601
- max_topic_no (int, optional): The maximum number of topics allowed to fit within
602
- LLM context limits. If `zero_shot_topics` exceeds this,
603
- it will be truncated. Defaults to 120.
604
-
605
- Returns:
606
- tuple: A tuple containing:
607
- - zero_shot_topics_gen_topics_list (list): A list of cleaned general topics.
608
- - zero_shot_topics_subtopics_list (list): A list of cleaned subtopics.
609
- - zero_shot_topics_description_list (list): A list of cleaned topic descriptions.
610
- """
611
-
612
- zero_shot_topics_gen_topics_list = list()
613
- zero_shot_topics_subtopics_list = list()
614
- zero_shot_topics_description_list = list()
615
 
616
- # Max 120 topics allowed
617
- if zero_shot_topics.shape[0] > max_topic_no:
618
- out_message = "Maximum " + str(max_topic_no) + " zero-shot topics allowed according to application configuration."
619
- print(out_message)
620
- raise Exception(out_message)
621
-
622
- # Forward slashes in the topic names seems to confuse the model
623
- if zero_shot_topics.shape[1] >= 1: # Check if there is at least one column
624
- for x in zero_shot_topics.columns:
625
- if not zero_shot_topics[x].isnull().all():
626
- zero_shot_topics[x] = zero_shot_topics[x].apply(initial_clean)
627
-
628
- zero_shot_topics.loc[:, x] = (
629
- zero_shot_topics.loc[:, x]
630
- .str.strip()
631
- .str.replace('\n', ' ')
632
- .str.replace('\r', ' ')
633
- .str.replace('/', ' or ')
634
- .str.lower()
635
- .str.capitalize())
636
-
637
- # If number of columns is 1, keep only subtopics
638
- if zero_shot_topics.shape[1] == 1 and "General topic" not in zero_shot_topics.columns:
639
- print("Found only Subtopic in zero shot topics")
640
- zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
641
- zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 0])
642
- # Allow for possibility that the user only wants to set general topics and not subtopics
643
-    elif zero_shot_topics.shape[1] == 1 and "General topic" in zero_shot_topics.columns:
-        print("Found only General topic in zero shot topics")
-        zero_shot_topics_gen_topics_list = list(zero_shot_topics["General topic"])
-        zero_shot_topics_subtopics_list = [""] * zero_shot_topics.shape[0]
-    # If general topic and subtopic are specified
-    elif set(["General topic", "Subtopic"]).issubset(zero_shot_topics.columns):
-        print("Found General topic and Subtopic in zero shot topics")
-        zero_shot_topics_gen_topics_list = list(zero_shot_topics["General topic"])
-        zero_shot_topics_subtopics_list = list(zero_shot_topics["Subtopic"])
-    # If subtopic and description are specified
-    elif set(["Subtopic", "Description"]).issubset(zero_shot_topics.columns):
-        print("Found Subtopic and Description in zero shot topics")
-        zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
-        zero_shot_topics_subtopics_list = list(zero_shot_topics["Subtopic"])
-        zero_shot_topics_description_list = list(zero_shot_topics["Description"])
-
-    # If number of columns is at least 2, keep general topics and subtopics
-    elif zero_shot_topics.shape[1] >= 2 and "Description" not in zero_shot_topics.columns:
-        zero_shot_topics_gen_topics_list = list(zero_shot_topics.iloc[:, 0])
-        zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 1])
-    else:
-        # If there are more columns, just assume that the first column was meant to be a subtopic
-        zero_shot_topics_gen_topics_list = [""] * zero_shot_topics.shape[0]
-        zero_shot_topics_subtopics_list = list(zero_shot_topics.iloc[:, 0])
-
-    # Add a description if a column is present
-    if not zero_shot_topics_description_list:
-        if "Description" in zero_shot_topics.columns:
-            zero_shot_topics_description_list = list(zero_shot_topics["Description"])
-        elif zero_shot_topics.shape[1] >= 3:
-            zero_shot_topics_description_list = list(zero_shot_topics.iloc[:, 2])  # Assume the third column is description
-        else:
-            zero_shot_topics_description_list = [""] * zero_shot_topics.shape[0]
-
-    # If the responses are being forced into zero shot topics, allow an option for nothing relevant
-    if force_zero_shot_radio == "Yes":
-        zero_shot_topics_gen_topics_list.append("")
-        zero_shot_topics_subtopics_list.append("No relevant topic")
-        zero_shot_topics_description_list.append("")
-
-    # Build the consolidated zero-shot topics dataframe
-    zero_shot_topics_df = pd.DataFrame(data={
-        "General topic": zero_shot_topics_gen_topics_list,
-        "Subtopic": zero_shot_topics_subtopics_list,
-        "Description": zero_shot_topics_description_list
-    })
-
-    return zero_shot_topics_df
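The column-detection logic removed above was consolidated into the `generate_zero_shot_topics_df` helper imported from `tools.helper_functions` elsewhere in this commit. A minimal sketch of that logic, assuming the same column conventions; this is an illustrative reimplementation, not the actual helper from the repository:

```python
import pandas as pd

def generate_zero_shot_topics_df(zero_shot_topics: pd.DataFrame,
                                 force_zero_shot: bool = False) -> pd.DataFrame:
    """Normalise a user-supplied topics file into General topic / Subtopic /
    Description columns, mirroring the column-detection rules in the diff."""
    n = zero_shot_topics.shape[0]
    general, sub, desc = [""] * n, [""] * n, []
    cols = zero_shot_topics.columns

    if zero_shot_topics.shape[1] == 1 and "General topic" in cols:
        general = list(zero_shot_topics["General topic"])
    elif {"General topic", "Subtopic"}.issubset(cols):
        general = list(zero_shot_topics["General topic"])
        sub = list(zero_shot_topics["Subtopic"])
    elif {"Subtopic", "Description"}.issubset(cols):
        sub = list(zero_shot_topics["Subtopic"])
        desc = list(zero_shot_topics["Description"])
    elif zero_shot_topics.shape[1] >= 2 and "Description" not in cols:
        general = list(zero_shot_topics.iloc[:, 0])
        sub = list(zero_shot_topics.iloc[:, 1])
    else:
        # Otherwise assume the first column was meant to be a subtopic
        sub = list(zero_shot_topics.iloc[:, 0])

    # Fill in descriptions if a usable column exists
    if not desc:
        if "Description" in cols:
            desc = list(zero_shot_topics["Description"])
        elif zero_shot_topics.shape[1] >= 3:
            desc = list(zero_shot_topics.iloc[:, 2])
        else:
            desc = [""] * n

    # When responses are forced into zero-shot topics, add a catch-all row
    if force_zero_shot:
        general.append("")
        sub.append("No relevant topic")
        desc.append("")

    return pd.DataFrame({"General topic": general,
                         "Subtopic": sub,
                         "Description": desc})
```

The sketch returns the same three-column frame the removed block built, so callers like `extract_topics` can consume one consistent shape regardless of the input file's layout.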
 
  def extract_topics(in_data_file: GradioFileData,
 
 GradioFileData = gr.FileData

 from tools.prompts import initial_table_prompt, initial_table_system_prompt, add_existing_topics_system_prompt, add_existing_topics_prompt, force_existing_topics_prompt, allow_new_topics_prompt, force_single_topic_prompt, add_existing_topics_assistant_prefill, initial_table_assistant_prefill, structured_summary_prompt, default_response_reference_format, negative_neutral_positive_sentiment_prompt, negative_or_positive_sentiment_prompt, default_sentiment_prompt
+ from tools.helper_functions import read_file, put_columns_in_df, wrap_text, initial_clean, load_in_data_file, load_in_file, create_topic_summary_df_from_reference_table, convert_reference_table_to_pivot_table, get_basic_response_data, clean_column_name, load_in_previous_data_files, create_batch_file_path_details, move_overall_summary_output_files_to_front_page, generate_zero_shot_topics_df
 from tools.llm_funcs import ResponseObject, construct_gemini_generative_model, call_llm_with_markdown_table_checks, create_missing_references_df, calculate_tokens_from_metadata, construct_azure_client, get_model, get_tokenizer, get_assistant_model
 from tools.config import RUN_LOCAL_MODEL, AWS_REGION, MAX_COMMENT_CHARS, MAX_OUTPUT_VALIDATION_ATTEMPTS, LLM_MAX_NEW_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, OUTPUT_FOLDER, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, LLM_SEED, MAX_GROUPS, REASONING_SUFFIX, AZURE_INFERENCE_ENDPOINT, MAX_ROWS, MAXIMUM_ZERO_SHOT_TOPICS, MAX_SPACES_GPU_RUN_TIME, OUTPUT_DEBUG_FILES
 from tools.aws_functions import connect_to_bedrock_runtime
 
     return topic_table_out_path, reference_table_out_path, topic_summary_df_out_path, topic_with_response_df, out_reference_df, out_topic_summary_df, batch_file_path_details, is_error
tools/prompts.py CHANGED
@@ -182,4 +182,63 @@ New Topics table:"""
 # Categorise the following text into only one of the following categories that seems most relevant: 'cat1', 'cat2', 'cat3', 'cat4'. Answer only with the choice of category. Do not add any other text. Do not explain your choice.
 # Text: {text}<end_of_turn>
 # <start_of_turn>model
- # Category:"""
+ # Category:"""
+
+ ###
+ # LLM-BASED TOPIC DEDUPLICATION PROMPTS
+ ###
+
+ llm_deduplication_system_prompt = """You are an expert at analysing and consolidating topic categories. Your task is to identify semantically similar topics that should be merged together, even if they use different wording or synonyms."""
+
+ llm_deduplication_prompt = """You are given a table of topics with their General topics, Subtopics, and Sentiment classifications. Your task is to identify topics that are semantically similar and should be merged together. Only merge topics that are almost identical in terms of meaning - if in doubt, do not merge.
+
+ Analyse the following topics table and identify groups of topics that describe essentially the same concept but may use different words or phrases. For example:
+ - "Transportation issues" and "Public transport problems"
+ - "Housing costs" and "Rent prices"
+ - "Environmental concerns" and "Green issues"
+
+ Create a markdown table with the following columns:
+ 1. 'Original General topic' - The current general topic name
+ 2. 'Original Subtopic' - The current subtopic name
+ 3. 'Original Sentiment' - The current sentiment
+ 4. 'Merged General topic' - The consolidated general topic name (use the most descriptive)
+ 5. 'Merged Subtopic' - The consolidated subtopic name (use the most descriptive)
+ 6. 'Merged Sentiment' - The consolidated sentiment (use 'Mixed' if sentiments differ)
+ 7. 'Merge Reason' - Brief explanation of why these topics should be merged
+
+ Only include rows where topics should actually be merged. If a topic has no semantic duplicates, do not include it in the table.
+
+ Topics to analyse:
+ {topics_table}
+
+ Merged topics table:"""
+
+ llm_deduplication_prompt_with_candidates = """You are given a table of topics with their General topics, Subtopics, and Sentiment classifications. Your task is to identify topics that are semantically similar and should be merged together, even if they use different wording.
+
+ Additionally, you have been provided with a list of candidate topics that represent preferred topic categories. When merging topics, prioritise fitting similar topics into these existing candidate categories rather than creating new ones. Only merge topics that are almost identical in terms of meaning - if in doubt, do not merge.
+
+ Analyse the following topics table and identify groups of topics that describe essentially the same concept but may use different words or phrases. For example:
+ - "Transportation issues" and "Public transport problems"
+ - "Housing costs" and "Rent prices"
+ - "Environmental concerns" and "Green issues"
+
+ When merging topics, consider the candidate topics provided below and try to map similar topics to these preferred categories when possible.
+
+ Create a markdown table with the following columns:
+ 1. 'Original General topic' - The current general topic name
+ 2. 'Original Subtopic' - The current subtopic name
+ 3. 'Original Sentiment' - The current sentiment
+ 4. 'Merged General topic' - The consolidated general topic name (prefer candidate topics when similar)
+ 5. 'Merged Subtopic' - The consolidated subtopic name (prefer candidate topics when similar)
+ 6. 'Merged Sentiment' - The consolidated sentiment (use 'Mixed' if sentiments differ)
+ 7. 'Merge Reason' - Brief explanation of why these topics should be merged
+
+ Only include rows where topics should actually be merged. If a topic has no semantic duplicates, do not include it in the table.
+
+ Topics to analyse:
+ {topics_table}
+
+ Candidate topics to consider for mapping:
+ {candidate_topics_table}
+
+ Merged topics table:"""
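Both deduplication prompts ask the model to reply with a markdown table, which must be parsed before any merges can be applied to the reference data. A minimal sketch of such a parser; `parse_merge_table` is a hypothetical helper for illustration, not part of this commit:

```python
def parse_merge_table(llm_response: str) -> list[dict]:
    """Parse the markdown 'Merged topics table' returned for the
    deduplication prompts into a list of {column: value} merge mappings."""
    # Keep only markdown table rows (lines starting with a pipe)
    rows = [line.strip() for line in llm_response.splitlines()
            if line.strip().startswith("|")]
    # Split each row into stripped cells, dropping the outer pipes
    cells = [[c.strip() for c in row.strip("|").split("|")] for row in rows]
    header = cells[0]
    # Skip the |---|---| separator row, map the rest onto the header
    return [dict(zip(header, row)) for row in cells[1:]
            if not all(set(c) <= set("-: ") for c in row)]
```

Downstream code could then walk these mappings and rewrite each `(Original General topic, Original Subtopic, Original Sentiment)` triple in the reference table to its merged counterpart.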