Commit · ba1a951 (1 parent: aa08197)
Enhanced app functionality by adding new UI elements for summary file management, adding a Bedrock model toggle, and refining logging messages. Updated the Dockerfile and requirements files for better compatibility, and added an installation guide to the README. Removed deprecated code and unnecessary comments.
Files changed:
- Dockerfile +5 -4
- README.md +118 -4
- app.py +30 -25
- requirements.txt +5 -6
- requirements_aws.txt +2 -3
- requirements_cpu.txt +23 -0
- requirements_gpu.txt +4 -2
- tools/aws_functions.py +0 -62
- tools/combine_sheets_into_xlsx.py +3 -7
- tools/config.py +2 -1
- tools/dedup_summaries.py +16 -8
- tools/llm_api_call.py +16 -6
- tools/llm_funcs.py +1 -3
- windows_install_llama-cpp-python.txt +34 -44
Dockerfile
CHANGED
```diff
@@ -1,3 +1,4 @@
+# This Dockerfile is optimised for AWS ECS using Python 3.11, and assumes CPU inference with OpenBLAS for local models.
 # Stage 1: Build dependencies and download models
 FROM public.ecr.aws/docker/library/python:3.11.13-slim-bookworm AS builder
 
@@ -7,7 +8,7 @@ RUN apt-get update && apt-get install -y \
     gcc \
     g++ \
     cmake \
-    libopenblas-dev \
+    #libopenblas-dev \
     pkg-config \
     python3-dev \
     libffi-dev \
@@ -18,9 +19,9 @@ WORKDIR /src
 
 COPY requirements_aws.txt .
 
-# Set environment variables for OpenBLAS
-ENV OPENBLAS_VERBOSE=1
-ENV CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
+# Set environment variables for OpenBLAS - not necessary if not building from source
+# ENV OPENBLAS_VERBOSE=1
+# ENV CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
 
 RUN pip install --no-cache-dir --target=/install torch==2.7.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu \
     && pip install --no-cache-dir --target=/install https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl \
```
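With the builder stage now pulling prebuilt CPU wheels rather than compiling against OpenBLAS, a local build is quick to test. A minimal sketch (the image name and host port are illustrative, not from the repo; the container port depends on GRADIO_SERVER_PORT in tools/config.py):

```bash
# Build the CPU-only image from the repository root
docker build -t llm-topic-modelling .

# Run it and expose the Gradio port (7860 is Gradio's usual default)
docker run --rm -p 7860:7860 llm-topic-modelling
```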
README.md
CHANGED
```diff
@@ -11,7 +11,7 @@ license: agpl-3.0
 
 # Large language model topic modelling
 
-Extract topics and summarise outputs using Large Language Models (LLMs: Gemma 3 4b/GPT-OSS 20b if local (see config.py), Gemini 2.5, or Bedrock models, e.g. Claude 3 Haiku/Claude 3.7 Sonnet). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
+Extract topics and summarise outputs using Large Language Models (LLMs: Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini 2.5, or Bedrock models, e.g. Claude 3 Haiku/Claude 3.7 Sonnet). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
 
 Instructions on use can be found in the README.md file. Try it out with this [dummy development consultation dataset](https://huggingface.co/datasets/seanpedrickcase/dummy_development_consultation/tree/main), which you can also try with [zero-shot topics](https://huggingface.co/datasets/seanpedrickcase/dummy_development_consultation/tree/main). Try also this [dummy case notes dataset](https://huggingface.co/datasets/seanpedrickcase/dummy_case_notes/tree/main).
 
@@ -25,6 +25,120 @@ Basic use:
 2. Select the relevant open text column from the dropdown.
 3. If you have your own suggested (zero shot) topics, upload this (see examples folder for an example file)
 4. Write a one sentence description of the consultation/context of the open text.
-5. Extract topics.
-6.
-
+5. Click 'All in one - Extract topics, deduplicate, and summarise'. This will run through the whole analysis process from topic extraction, to topic deduplication, to topic-level and overall summaries.
+6. A summary xlsx workbook will be created on the front page in the box 'Overall summary xlsx file'. This will combine all the results from the different processes into one workbook.
+
```

(The remaining added lines are the new installation guide, rendered in full below.)
# Installation guide

Here is a step-by-step guide to clone the repository, create a virtual environment, and install dependencies from the relevant `requirements` file. This guide assumes you have **Git** and **Python 3.11** installed.

-----

### Step 1: Clone the Git Repository

First, copy the project files to your local machine. Navigate to the directory where you want to store the project using the `cd` (change directory) command, then use `git clone` with the repository's URL.

1. **Clone the repo:**

   ```bash
   git clone https://github.com/example-user/example-repo.git
   ```

   *Replace the URL with your repository's URL.*

2. **Navigate into the new project folder:**

   ```bash
   cd example-repo
   ```

-----

### Step 2: Create and Activate a Virtual Environment

A virtual environment is a self-contained directory that holds a specific Python interpreter and its own set of installed packages. This is crucial for isolating your project's dependencies.

1. **Create the virtual environment:** We'll use Python's built-in `venv` module. It's common practice to name the environment folder `.venv`.

   ```bash
   python -m venv .venv
   ```

   *This command tells Python to create a new virtual environment in a folder named `.venv`.*

2. **Activate the environment:** You must "activate" the environment to start using it. The command differs based on your operating system and shell.

   * **On macOS / Linux (bash/zsh):**

     ```bash
     source .venv/bin/activate
     ```

   * **On Windows (Command Prompt):**

     ```bash
     .\.venv\Scripts\activate
     ```

   * **On Windows (PowerShell):**

     ```powershell
     .\.venv\Scripts\Activate.ps1
     ```

   You'll know it's active because your command prompt will be prefixed with `(.venv)`.

-----

### Step 3: Install Dependencies

Now that your virtual environment is active, you can install all the required packages listed in the relevant project `requirements` file using `pip`.

1. **Choose the relevant requirements file**

   llama-cpp-python version 0.3.16 is compatible with Gemma 3 and GPT-OSS models, but at the time of writing does not have official wheels for CPU inference or for Windows. A sister repository contains [llama-cpp-python 0.3.16 wheels for Python 3.11/3.10](https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/tag/v0.1.0) so that users can avoid having to build the package from source (a direct wheel-install example is shown after this step). If you prefer to build from source, please refer to the llama-cpp-python documentation [here](https://github.com/abetlen/llama-cpp-python). I also have a guide to building the package on a Windows system [here](https://github.com/seanpedrick-case/llm_topic_modelling/blob/main/windows_install_llama-cpp-python.txt).

   The repo provides several requirements files for different situations. I would advise using requirements_gpu.txt for GPU environments, and requirements_cpu.txt for CPU environments:

   - **requirements_gpu.txt**: For Python 3.11 GPU-enabled environments. Uncomment the last requirement under 'Windows' for Windows compatibility (CUDA 12.4).
   - **requirements_cpu.txt**: For Python 3.11 CPU-only environments. Uncomment the last requirement under 'Windows' for Windows compatibility.
   - **requirements.txt**: For the Python 3.10 GPU-enabled environment on Hugging Face spaces (CUDA 12.4).
   - **requirements_aws.txt**: Used in conjunction with the Dockerfile for Python 3.11, CPU-only environments.

2. **Install packages from the requirements file:**

   ```bash
   pip install -r requirements_gpu.txt
   ```

   *This command reads every package name listed in the file and installs it into your `.venv` environment.*
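To use one of the prebuilt wheels directly rather than uncommenting it in a requirements file, pip can install straight from the release URL. For example, with the Linux Python 3.11 wheel from the sister repository (swap in the Windows wheel URL on a Windows machine):

```bash
# Install the prebuilt llama-cpp-python wheel, avoiding a source build
pip install https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
```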
You're all set! Your project is cloned, and all dependencies are installed in an isolated environment.

When you are finished working, you can leave the virtual environment by simply typing:

```bash
deactivate
```

### Step 4: Verify CUDA compatibility (if using a GPU environment)

Install the relevant toolkit for CUDA 12.4 from here: https://developer.nvidia.com/cuda-12-4-0-download-archive

Restart your computer.

Ensure you have the latest drivers for your NVIDIA GPU; check your current driver version and memory availability by running `nvidia-smi`. On the command line, the installed CUDA toolkit version can be checked by running `nvcc --version`.
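Both checks together look like this (standard NVIDIA tools; if `nvcc` is not found, the CUDA toolkit's `bin` directory is likely missing from your PATH):

```bash
# Confirm the CUDA toolkit version nvcc was built against
# (should report 12.4 if the Step 4 installer was used)
nvcc --version

# Confirm the driver version and available GPU memory
nvidia-smi
```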
### Step 5: Ensure you have compatible NVIDIA drivers

Make sure you have the latest NVIDIA drivers installed on your system for your GPU (be particularly careful, if using WSL, that your drivers are compatible with it). Official drivers can be found here: https://www.nvidia.com/en-us/drivers

Current driver details can be found by running `nvidia-smi` on the command line.

### Step 6: Run the app

Go to the app project directory and run `python app.py`.
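A minimal sketch, assuming the repository folder from Step 1 (`example-repo`) and an activated virtual environment:

```bash
cd example-repo
python app.py
```

The Gradio interface should then be served on the port set by GRADIO_SERVER_PORT in tools/config.py.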
### Step 7: (optional) change default configuration

A number of configuration options can be seen in the tools/config.py file. You can either pass these variables in as environment variables, or create a file at config/app_config.env that is read into the app on initialisation.
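As a minimal sketch, such an env file might look like the following. The variable names come from the tools/config.py import list in app.py, but every value here is purely illustrative:

```bash
# config/app_config.env - illustrative values only; see tools/config.py for the full set
GEMINI_API_KEY=your-gemini-api-key
BATCH_SIZE_DEFAULT=5
OUTPUT_FOLDER=output/
RUN_LOCAL_MODEL=1
```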
app.py
CHANGED
```diff
@@ -12,7 +12,7 @@ from tools.custom_csvlogger import CSVLogger_custom
 from tools.auth import authenticate_user
 from tools.prompts import initial_table_prompt, prompt2, prompt3, system_prompt, add_existing_topics_system_prompt, add_existing_topics_prompt, verify_titles_prompt, verify_titles_system_prompt, two_para_summary_format_prompt, single_para_summary_format_prompt
 from tools.verify_titles import verify_titles
-from tools.config import RUN_AWS_FUNCTIONS, HOST_NAME, ACCESS_LOGS_FOLDER, FEEDBACK_LOGS_FOLDER, USAGE_LOGS_FOLDER, RUN_LOCAL_MODEL, FILE_INPUT_HEIGHT, GEMINI_API_KEY, model_full_names, BATCH_SIZE_DEFAULT, CHOSEN_LOCAL_MODEL_TYPE, LLM_SEED, COGNITO_AUTH, MAX_QUEUE_SIZE, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, INPUT_FOLDER, OUTPUT_FOLDER, S3_LOG_BUCKET, CONFIG_FOLDER, GRADIO_TEMP_DIR, MPLCONFIGDIR, model_name_map, GET_COST_CODES, ENFORCE_COST_CODES, DEFAULT_COST_CODE, COST_CODES_PATH, S3_COST_CODES_PATH, OUTPUT_COST_CODES_PATH, SHOW_COSTS, SAVE_LOGS_TO_CSV, SAVE_LOGS_TO_DYNAMODB, ACCESS_LOG_DYNAMODB_TABLE_NAME, USAGE_LOG_DYNAMODB_TABLE_NAME, FEEDBACK_LOG_DYNAMODB_TABLE_NAME, LOG_FILE_NAME, FEEDBACK_LOG_FILE_NAME, USAGE_LOG_FILE_NAME, CSV_ACCESS_LOG_HEADERS, CSV_FEEDBACK_LOG_HEADERS, CSV_USAGE_LOG_HEADERS, DYNAMODB_ACCESS_LOG_HEADERS, DYNAMODB_FEEDBACK_LOG_HEADERS, DYNAMODB_USAGE_LOG_HEADERS, S3_ACCESS_LOGS_FOLDER, S3_FEEDBACK_LOGS_FOLDER, S3_USAGE_LOGS_FOLDER
+from tools.config import RUN_AWS_FUNCTIONS, HOST_NAME, ACCESS_LOGS_FOLDER, FEEDBACK_LOGS_FOLDER, USAGE_LOGS_FOLDER, RUN_LOCAL_MODEL, FILE_INPUT_HEIGHT, GEMINI_API_KEY, model_full_names, BATCH_SIZE_DEFAULT, CHOSEN_LOCAL_MODEL_TYPE, LLM_SEED, COGNITO_AUTH, MAX_QUEUE_SIZE, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, INPUT_FOLDER, OUTPUT_FOLDER, S3_LOG_BUCKET, CONFIG_FOLDER, GRADIO_TEMP_DIR, MPLCONFIGDIR, model_name_map, GET_COST_CODES, ENFORCE_COST_CODES, DEFAULT_COST_CODE, COST_CODES_PATH, S3_COST_CODES_PATH, OUTPUT_COST_CODES_PATH, SHOW_COSTS, SAVE_LOGS_TO_CSV, SAVE_LOGS_TO_DYNAMODB, ACCESS_LOG_DYNAMODB_TABLE_NAME, USAGE_LOG_DYNAMODB_TABLE_NAME, FEEDBACK_LOG_DYNAMODB_TABLE_NAME, LOG_FILE_NAME, FEEDBACK_LOG_FILE_NAME, USAGE_LOG_FILE_NAME, CSV_ACCESS_LOG_HEADERS, CSV_FEEDBACK_LOG_HEADERS, CSV_USAGE_LOG_HEADERS, DYNAMODB_ACCESS_LOG_HEADERS, DYNAMODB_FEEDBACK_LOG_HEADERS, DYNAMODB_USAGE_LOG_HEADERS, S3_ACCESS_LOGS_FOLDER, S3_FEEDBACK_LOGS_FOLDER, S3_USAGE_LOGS_FOLDER, AWS_ACCESS_KEY, AWS_SECRET_KEY
 
 def ensure_folder_exists(output_folder:str):
     """Checks if the specified folder exists, creates it if not."""
@@ -115,6 +115,8 @@ with app:
     summarised_outputs_list = gr.Dropdown(value= list(), choices= list(), visible=False, label="List of summarised outputs", allow_custom_value=True)
     latest_summary_completed_num = gr.Number(0, visible=False)
 
+    summary_xlsx_output_files_list = gr.Dropdown(value= list(), choices= list(), visible=False, label="List of xlsx summary output files", allow_custom_value=True)
+
     original_data_file_name_textbox = gr.Textbox(label = "Reference data file name", value="", visible=False)
     working_data_file_name_textbox = gr.Textbox(label = "Working data file name", value="", visible=False)
     unique_topics_table_file_name_textbox = gr.Textbox(label="Unique topics data file name textbox", visible=False)
@@ -132,13 +134,17 @@ with app:
     cost_code_dataframe = gr.Dataframe(value=pd.DataFrame(), type="pandas", visible=False, wrap=True)
     cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis. Please contact Finance if you can't find your cost code in the given list.", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=False)
 
+    latest_batch_completed = gr.Number(value=0, label="Number of files prepared", interactive=False, visible=False)
+    # Duplicate version of the above variable for when you don't want to initiate the summarisation loop
+    latest_batch_completed_no_loop = gr.Number(value=0, label="Number of files prepared", interactive=False, visible=False)
+
     ###
     # UI LAYOUT
     ###
 
     gr.Markdown("""# Large language model topic modelling
 
-    Extract topics and summarise outputs using Large Language Models (LLMs: Gemma 3 4b/GPT-OSS 20b if local (see config.py), Gemini 2.5, or Bedrock models, e.g. Claude 3 Haiku/Claude 3.7 Sonnet). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
+    Extract topics and summarise outputs using Large Language Models (LLMs: Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini 2.5, or Bedrock models, e.g. Claude 3 Haiku/Claude 3.7 Sonnet). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
 
     Instructions on use can be found in the README.md file. Try it out with this [dummy development consultation dataset](https://huggingface.co/datasets/seanpedrickcase/dummy_development_consultation/tree/main), which you can also try with [zero-shot topics](https://huggingface.co/datasets/seanpedrickcase/dummy_development_consultation/tree/main). Try also this [dummy case notes dataset](https://huggingface.co/datasets/seanpedrickcase/dummy_case_notes/tree/main).
 
@@ -183,14 +189,12 @@ with app:
     all_in_one_btn = gr.Button("All in one - Extract topics, deduplicate, and summarise", variant="primary")
     extract_topics_btn = gr.Button("1. Extract topics", variant="secondary")
 
-    with gr.Row():
+    with gr.Row(equal_height=True):
+        output_messages_textbox = gr.Textbox(value="", label="Output messages", scale=1, interactive=False)
+        topic_extraction_output_files = gr.File(label="Extract topics output files", scale=1, interactive=False)
+        topic_extraction_output_files_xlsx = gr.File(label="Overall summary xlsx file", scale=1, interactive=False)
 
     display_topic_table_markdown = gr.Markdown(value="### Language model response will appear here", show_copy_button=True)
-    latest_batch_completed = gr.Number(value=0, label="Number of files prepared", interactive=False, visible=False)
-    # Duplicate version of the above variable for when you don't want to initiate the summarisation loop
-    latest_batch_completed_no_loop = gr.Number(value=0, label="Number of files prepared", interactive=False, visible=False)
 
     data_feedback_title = gr.Markdown(value="## Please give feedback", visible=False)
     data_feedback_radio = gr.Radio(label="Please give some feedback about the results of the topic extraction.",
@@ -287,14 +291,15 @@ with app:
     with gr.Tab(label="LLM and topic extraction settings"):
         gr.Markdown("""Define settings that affect large language model output.""")
         with gr.Accordion("Settings for LLM generation", open = True):
-            temperature_slide = gr.Slider(minimum=0.1, maximum=1.0, value=0.1, label="Choose LLM temperature setting")
+            temperature_slide = gr.Slider(minimum=0.1, maximum=1.0, value=0.1, label="Choose LLM temperature setting", precision=1)
             batch_size_number = gr.Number(label = "Number of responses to submit in a single LLM query", value = BATCH_SIZE_DEFAULT, precision=0, minimum=1, maximum=100)
             random_seed = gr.Number(value=LLM_SEED, label="Random seed for LLM generation", visible=False)
 
         with gr.Accordion("AWS API keys", open = False):
+            gr.Markdown("""Querying Bedrock models with API keys requires a role with IAM permissions for the bedrock:InvokeModel action.""")
             with gr.Row():
-                aws_access_key_textbox = gr.Textbox(label="AWS access key",
-                aws_secret_key_textbox = gr.Textbox(label="AWS secret key",
+                aws_access_key_textbox = gr.Textbox(value=AWS_ACCESS_KEY, label="AWS access key", lines=1, type="password")
+                aws_secret_key_textbox = gr.Textbox(value=AWS_SECRET_KEY, label="AWS secret key", lines=1, type="password")
 
         with gr.Accordion("Gemini API keys", open = False):
             google_api_key_textbox = gr.Textbox(value = GEMINI_API_KEY, label="Enter Gemini API key (only if using Google API models)", lines=1, type="password")
@@ -413,10 +418,11 @@ with app:
     missing_df_state,
     input_tokens_num,
     output_tokens_num,
-    number_of_calls_num
-
+    number_of_calls_num,
+    output_messages_textbox],
+    api_name="extract_topics", show_progress_on=output_messages_textbox).\
     success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False, api_name="usage_logs").\
-    success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[topic_extraction_output_files_xlsx])
+    success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[topic_extraction_output_files_xlsx, summary_xlsx_output_files_list])
 
     ###
     # DEDUPLICATION AND SUMMARISATION FUNCTIONS
@@ -436,16 +442,16 @@ with app:
     success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
     success(load_in_previous_data_files, inputs=[summarisation_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
     success(sample_reference_table_summaries, inputs=[master_reference_df_state, random_seed], outputs=[summary_reference_table_sample_state, summarised_references_markdown], api_name="sample_summaries").\
-    success(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, in_data_files, in_excel_sheets, in_colnames, log_files_output_list_state, summarise_format_radio, output_folder_state, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, summarised_output_markdown, log_files_output, overall_summarisation_input_files, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number], api_name="summarise_topics").\
+    success(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, in_data_files, in_excel_sheets, in_colnames, log_files_output_list_state, summarise_format_radio, output_folder_state, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, summarised_output_markdown, log_files_output, overall_summarisation_input_files, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, output_messages_textbox], api_name="summarise_topics", show_progress_on=[output_messages_textbox, summary_output_files]).\
     success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
-    success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_revised_summaries_state, master_unique_topics_df_revised_summaries_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[summary_output_files_xlsx])
+    success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_revised_summaries_state, master_unique_topics_df_revised_summaries_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[summary_output_files_xlsx, summary_xlsx_output_files_list])
 
     # SUMMARISE WHOLE TABLE PAGE
     overall_summarise_previous_data_btn.click(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
     success(load_in_previous_data_files, inputs=[overall_summarisation_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
-    success(overall_summary, inputs=[master_unique_topics_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, output_folder_state, in_colnames, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[overall_summary_output_files, overall_summarised_output_markdown, summarised_output_df, conversation_metadata_textbox, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number], scroll_to_output=True, api_name="overall_summary").\
+    success(overall_summary, inputs=[master_unique_topics_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, output_folder_state, in_colnames, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[overall_summary_output_files, overall_summarised_output_markdown, summarised_output_df, conversation_metadata_textbox, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, output_messages_textbox], scroll_to_output=True, api_name="overall_summary", show_progress_on=[output_messages_textbox, overall_summary_output_files]).\
     success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
-    success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[overall_summary_output_files_xlsx])
+    success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[overall_summary_output_files_xlsx, summary_xlsx_output_files_list])
 
 
     # All in one button
@@ -509,21 +515,20 @@ with app:
     missing_df_state,
    input_tokens_num,
     output_tokens_num,
-    number_of_calls_num
+    number_of_calls_num,
+    output_messages_textbox], show_progress_on=output_messages_textbox).\
     success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
     success(load_in_previous_data_files, inputs=[deduplication_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
     success(deduplicate_topics, inputs=[master_reference_df_state, master_unique_topics_df_state, working_data_file_name_textbox, unique_topics_table_file_name_textbox, in_excel_sheets, merge_sentiment_drop, merge_general_topics_drop, deduplicate_score_threshold, in_data_files, in_colnames, output_folder_state], outputs=[master_reference_df_state, master_unique_topics_df_state, summarisation_input_files, log_files_output, summarised_output_markdown]).\
     success(load_in_previous_data_files, inputs=[summarisation_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
     success(sample_reference_table_summaries, inputs=[master_reference_df_state, random_seed], outputs=[summary_reference_table_sample_state, summarised_references_markdown]).\
-    success(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, in_data_files, in_excel_sheets, in_colnames, log_files_output_list_state, summarise_format_radio, output_folder_state, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox,
+    success(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, in_data_files, in_excel_sheets, in_colnames, log_files_output_list_state, summarise_format_radio, output_folder_state, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, display_topic_table_markdown, log_files_output, overall_summarisation_input_files, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, output_messages_textbox], show_progress_on=[output_messages_textbox, summary_output_files]).\
     success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
     success(load_in_previous_data_files, inputs=[overall_summarisation_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
-    success(overall_summary, inputs=[master_unique_topics_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, output_folder_state, in_colnames, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[overall_summary_output_files, overall_summarised_output_markdown, summarised_output_df, conversation_metadata_textbox, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number]).\
+    success(overall_summary, inputs=[master_unique_topics_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, output_folder_state, in_colnames, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[overall_summary_output_files, overall_summarised_output_markdown, summarised_output_df, conversation_metadata_textbox, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, output_messages_textbox], show_progress_on=[output_messages_textbox, overall_summary_output_files]).\
     success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
-    success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[overall_summary_output_files_xlsx]).\
-    success(move_overall_summary_output_files_to_front_page, inputs=[
-
-
+    success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[overall_summary_output_files_xlsx, summary_xlsx_output_files_list]).\
+    success(move_overall_summary_output_files_to_front_page, inputs=[summary_xlsx_output_files_list], outputs=[topic_extraction_output_files_xlsx])
 
     ###
     # CONTINUE PREVIOUS TOPIC EXTRACTION PAGE
```
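The long `.success()` chains above are how the app sequences its pipeline: each step runs only if the previous one finished without raising, and the new `show_progress_on` argument pins the progress indicator to the output messages textbox instead of the whole page. A minimal runnable sketch of the pattern, assuming Gradio 5.x (the component and function names here are illustrative, not the app's):

```python
import gradio as gr

def extract(text):
    # Stand-in for the app's topic-extraction step
    return f"Topics extracted from: {text!r}"

def summarise(status):
    # Stand-in for the summarisation step; only runs if extract() succeeded
    return status + " | summarised"

with gr.Blocks() as demo:
    inp = gr.Textbox(label="Input")
    output_messages = gr.Textbox(label="Output messages", interactive=False)
    run_btn = gr.Button("Run")

    # .success() chains the next step only after the previous one completes
    # without error; show_progress_on shows the spinner on the messages box.
    run_btn.click(extract, inputs=inp, outputs=output_messages,
                  show_progress_on=output_messages).success(
        summarise, inputs=output_messages, outputs=output_messages)

demo.launch()
```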
requirements.txt
CHANGED
```diff
@@ -1,3 +1,4 @@
+# Note that this requirements file is optimised for Hugging Face spaces / Python 3.10. Please use requirements_cpu.txt for CPU instances and requirements_gpu.txt for GPU instances using Python 3.11
 pandas==2.3.2
 gradio==5.44.1
 transformers==4.56.0
@@ -13,15 +14,13 @@ html5lib==1.1
 beautifulsoup4==4.12.3
 rapidfuzz==3.13.0
 python-dotenv==1.1.0
-# Torch and
+# Torch and llama-cpp-python
 # GPU
 torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124 # Latest compatible with CUDA 12.4
-https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
-#
-# CPU only (for e.g. Hugging Face CPU instances):
+https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
+# CPU only (for e.g. Hugging Face CPU instances)
 #torch==2.7.1 --extra-index-url https://download.pytorch.org/whl/cpu
-# For Hugging Face, need a python 3.10 compatible wheel for llama-cpp-python to avoid build timeouts
+# For Hugging Face, need a python 3.10 compatible wheel for llama-cpp-python to avoid build timeouts
 #https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
-#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
 
 
```
requirements_aws.txt
CHANGED
```diff
@@ -1,3 +1,4 @@
+# This requirements file is optimised for AWS ECS using Python 3.11 alongside the Dockerfile, and assumes a python 3.11 compatible llama-cpp-python wheel is available (see Dockerfile). torch and llama-cpp-python are not present here, as they are installed in the main Dockerfile
 pandas==2.3.2
 gradio==5.44.1
 transformers==4.56.0
@@ -12,6 +13,4 @@ google-genai==1.32.0
 html5lib==1.1
 beautifulsoup4==4.12.3
 rapidfuzz==3.13.0
-python-dotenv==1.1.0
-# torch==2.7.1 --extra-index-url https://download.pytorch.org/whl/cpu # Commented out as Dockerfile should install torch
-# llama-cpp-python==0.3.16 # Commented out as Dockerfile should install llama-cpp-python
+python-dotenv==1.1.0
```
|
requirements_cpu.txt
ADDED
```diff
@@ -0,0 +1,23 @@
+pandas==2.3.2
+gradio==5.44.1
+transformers==4.56.0
+spaces==0.40.1
+boto3==1.40.22
+pyarrow==21.0.0
+openpyxl==3.1.5
+markdown==3.7
+tabulate==0.9.0
+lxml==5.3.0
+google-genai==1.32.0
+html5lib==1.1
+beautifulsoup4==4.12.3
+rapidfuzz==3.13.0
+python-dotenv==1.1.0
+torch==2.7.1 --extra-index-url https://download.pytorch.org/whl/cpu
+# Linux, Python 3.11 compatible wheel available:
+#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
+# Windows, Python 3.11 compatible wheel available:
+#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64_cpu_openblas.whl
+# If the above doesn't work for Windows, try looking at 'windows_install_llama-cpp-python.txt' for instructions on how to build from source
+# Alternatively, try installing the package from source below
+# llama-cpp-python==0.3.16
```
|
requirements_gpu.txt
CHANGED
```diff
@@ -19,6 +19,8 @@ torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124 # Latest c
 # For Linux:
 https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
 # For Windows:
-#llama-cpp-python
-# If above doesn't work for Windows, try looking at 'windows_install_llama-cpp-python.txt'
+#https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64.whl
+# If the above doesn't work for Windows, try looking at 'windows_install_llama-cpp-python.txt' for instructions on how to build from source
+# If none of the above work for you, try the following:
+# llama-cpp-python==0.3.16 -C cmake.args="-DGGML_CUDA=on -DGGML_CUBLAS=on"
 
```
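If the commented source-build line is needed, the environment-variable form documented by llama-cpp-python is a hedged equivalent of the pip `-C cmake.args` form above; it requires the CUDA toolkit (`nvcc`) and a C++ compiler to be on the PATH:

```bash
# Build llama-cpp-python 0.3.16 from source with CUDA acceleration
CMAKE_ARGS="-DGGML_CUDA=on" pip install --no-cache-dir llama-cpp-python==0.3.16
```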
tools/aws_functions.py
CHANGED
@@ -62,68 +62,6 @@ def connect_to_s3_client(aws_access_key_textbox:str="", aws_secret_key_textbox:s
 
     return s3_client
 
-# def connect_to_sts_client(aws_access_key_textbox:str="", aws_secret_key_textbox:str="", sts_endpoint:str=""):
-#     # If running an anthropic model, assume that running an AWS sts model, load in sts
-#     sts_client = []
-
-#     if aws_access_key_textbox and aws_secret_key_textbox:
-#         print("Connecting to sts using AWS access key and secret keys from user input.")
-#         sts_client = boto3.client('sts',
-#                         aws_access_key_id=aws_access_key_textbox,
-#                         aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
-#     elif RUN_AWS_FUNCTIONS == "1" and PRIORITISE_SSO_OVER_AWS_ENV_ACCESS_KEYS == "1":
-#         print("Connecting to sts via existing SSO connection")
-#         sts_client = boto3.client('sts', region_name=AWS_REGION)
-#     elif AWS_ACCESS_KEY and AWS_SECRET_KEY:
-#         print("Getting sts credentials from environment variables")
-#         sts_client = boto3.client('sts',
-#                         aws_access_key_id=AWS_ACCESS_KEY,
-#                         aws_secret_access_key=AWS_SECRET_KEY,
-#                         region_name=AWS_REGION)
-#     else:
-#         sts_client = ""
-#         out_message = "Cannot connect to sts service. Please provide access keys under LLM settings, or choose another model type."
-#         print(out_message)
-#         raise Exception(out_message)
-
-#     return sts_client
-
-# def get_assumed_role_info(aws_access_key_textbox, aws_secret_key_textbox, sts_endpoint):
-#     sts_endpoint = 'https://sts.' + AWS_REGION + '.amazonaws.com'
-#     sts = connect_to_sts_client(aws_access_key_textbox, aws_secret_key_textbox, endpoint_url=sts_endpoint)
-
-#     #boto3.client('sts', region_name=AWS_REGION, endpoint_url=sts_endpoint)
-#     response = sts.get_caller_identity()
-
-#     # Extract ARN of the assumed role
-#     assumed_role_arn = response['Arn']
-
-#     # Extract the name of the assumed role from the ARN
-#     assumed_role_name = assumed_role_arn.split('/')[-1]
-
-#     return assumed_role_arn, assumed_role_name
-
-# if RUN_AWS_FUNCTIONS == "1":
-#     try:
-#         bucket_name = S3_LOG_BUCKET
-#         #session = boto3.Session() # profile_name="default"
-#     except Exception as e:
-#         print(e)
-
-#     try:
-#         assumed_role_arn, assumed_role_name = get_assumed_role_info(aws_access_key_textbox, aws_secret_key_textbox, sts_endpoint)
-
-#         #print("Assumed Role ARN:", assumed_role_arn)
-#         #print("Assumed Role Name:", assumed_role_name)
-
-#         print("Successfully assumed role with AWS STS")
-
-#     except Exception as e:
-#         print("Could not connect to AWS STS due to:", e)
-
 # Download direct from S3 - requires login credentials
 def download_file_from_s3(bucket_name:str, key:str, local_file_path:str, aws_access_key_textbox:str="", aws_secret_key_textbox:str="", RUN_AWS_FUNCTIONS=RUN_AWS_FUNCTIONS):
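The deleted block was commented-out scaffolding for inspecting an assumed role via AWS STS. For reference, what it boiled down to is a couple of boto3 calls that remain available directly; a minimal sketch (the region value is illustrative):

import boto3

# Ask STS who the current caller is; works with SSO, env keys, or a profile.
sts = boto3.client("sts", region_name="eu-west-2")  # region is illustrative
identity = sts.get_caller_identity()
print("Caller ARN:", identity["Arn"])
print("Assumed role name:", identity["Arn"].split("/")[-1])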
tools/combine_sheets_into_xlsx.py
CHANGED
@@ -102,7 +102,6 @@ def csvs_to_excel(
     for idx, csv_path in enumerate(csv_files):
         # Use provided sheet name or derive from file name
        sheet_name = sheet_names[idx] if sheet_names and idx < len(sheet_names) else os.path.splitext(os.path.basename(csv_path))[0]
-        print("csv_path:", csv_path)
         df = pd.read_csv(csv_path)
 
         if sheet_name == "Original data":
@@ -160,7 +159,7 @@ def csvs_to_excel(
 
     wb.save(output_filename)
 
-    print(f"
+    print(f"Output xlsx summary saved as '{output_filename}'")
 
     return output_filename
 
@@ -243,10 +242,9 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     else:
         raise Exception("Could not find unique topic files to put into Excel format")
     if reference_table_csv_path:
-        #reference_table_csv_path = reference_table_csv_path[0]
         csv_files.append(reference_table_csv_path)
         sheet_names.append("Response level data")
-        column_widths["Response level data"] = {"A": 15, "B": 30, "C": 40, "
+        column_widths["Response level data"] = {"A": 15, "B": 30, "C": 40, "H": 100}
         wrap_text_columns["Response level data"] = ["C", "G"]
     else:
         raise Exception("Could not find any reference files to put into Excel format")
@@ -308,8 +306,6 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     sheet_names.append("Original data")
     column_widths["Original data"] = {"A": 20, "B": 20, "C": 20}
     wrap_text_columns["Original data"] = ["C"]
-
-    print("Creating intro page and text")
 
     # Intro page text
     intro_text = [
@@ -381,7 +377,7 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
 
     xlsx_output_filenames = [xlsx_output_filename]
 
-    return xlsx_output_filenames
+    return xlsx_output_filenames, xlsx_output_filenames
 
 
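A note on the doubled return value at the end: returning the same list twice is the usual way to feed one result into two Gradio output slots, for example a visible file component plus a state variable that later steps can read, which fits the commit's new summary-file-management UI elements. A hypothetical wiring sketch (component names are illustrative, not the app's actual ones):

import gradio as gr

xlsx_files = gr.File(label="Output xlsx files", file_count="multiple")
xlsx_files_state = gr.State()

# One callback, two outputs fed by the two returned copies:
# create_xlsx_btn.click(collect_output_csvs_and_create_excel_output,
#                       inputs=[...], outputs=[xlsx_files, xlsx_files_state])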
tools/config.py
CHANGED
@@ -203,6 +203,7 @@ MAX_COMMENT_CHARS = int(get_or_create_env_var('MAX_COMMENT_CHARS', '14000'))
 
 RUN_LOCAL_MODEL = get_or_create_env_var("RUN_LOCAL_MODEL", "1")
 RUN_GEMINI_MODELS = get_or_create_env_var("RUN_GEMINI_MODELS", "1")
+RUN_AWS_BEDROCK_MODELS = get_or_create_env_var("RUN_AWS_BEDROCK_MODELS", "1")
 GEMINI_API_KEY = get_or_create_env_var('GEMINI_API_KEY', '')
 
 # Build up options for models
@@ -218,7 +219,7 @@ if RUN_LOCAL_MODEL == "1" and CHOSEN_LOCAL_MODEL_TYPE:
     model_short_names.append(CHOSEN_LOCAL_MODEL_TYPE)
     model_source.append("Local")
 
-if
+if RUN_AWS_BEDROCK_MODELS == "1":
     model_full_names.extend(["anthropic.claude-3-haiku-20240307-v1:0", "anthropic.claude-3-7-sonnet-20250219-v1:0"])
     model_short_names.extend(["haiku", "sonnet"])
     model_source.extend(["AWS", "AWS"])
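Since tools/config.py resolves each flag through get_or_create_env_var at import time, the new Bedrock toggle can be flipped before the app starts. A minimal sketch, assuming the variable is set before tools.config is first imported:

import os

# "1" (the default) shows the Bedrock Claude options; "0" hides them.
os.environ["RUN_AWS_BEDROCK_MODELS"] = "0"

from tools import config  # must come after the env var is set
print(config.model_full_names)  # the anthropic.claude-* entries should now be absent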
tools/dedup_summaries.py
CHANGED
@@ -529,6 +529,7 @@ def summarise_output_topics(sampled_reference_table_df:pd.DataFrame,
     acc_number_of_calls = 0
     time_taken = 0
     out_metadata_str = "" # Output metadata is currently replaced on starting a summarisation task
+    out_message = list()
 
     tic = time.perf_counter()
 
@@ -573,8 +574,8 @@ def summarise_output_topics(sampled_reference_table_df:pd.DataFrame,
     progress(0.1, f"Loading in local model: {CHOSEN_LOCAL_MODEL_TYPE}")
     local_model, tokenizer = load_model(local_model_type=CHOSEN_LOCAL_MODEL_TYPE, repo_id=LOCAL_REPO_ID, model_filename=LOCAL_MODEL_FILE, model_dir=LOCAL_MODEL_FOLDER)
 
-    summary_loop_description = "
-    summary_loop = tqdm(range(latest_summary_completed, length_all_summaries), desc="
+    summary_loop_description = "Revising topic-level summaries. " + str(latest_summary_completed) + " summaries completed so far."
+    summary_loop = tqdm(range(latest_summary_completed, length_all_summaries), desc="Revising topic-level summaries", unit="summaries")
 
     if do_summaries == "Yes":
 
@@ -675,9 +676,13 @@ def summarise_output_topics(sampled_reference_table_df:pd.DataFrame,
     acc_input_tokens, acc_output_tokens, acc_number_of_calls = calculate_tokens_from_metadata(out_metadata_str, model_choice, model_name_map)
 
     toc = time.perf_counter()
-    time_taken = toc - tic
+    time_taken = toc - tic
+
+    out_message = '\n'.join(out_message)
+    out_message = out_message + " " + f"Topic summarisation finished processing. Total time: {time_taken:.2f}s"
+    print(out_message)
 
-    return sampled_reference_table_df, topic_summary_df_revised, reference_table_df_revised, output_files, summarised_outputs, latest_summary_completed, out_metadata_str, summarised_output_markdown, log_output_files, output_files, acc_input_tokens, acc_output_tokens, acc_number_of_calls, time_taken
+    return sampled_reference_table_df, topic_summary_df_revised, reference_table_df_revised, output_files, summarised_outputs, latest_summary_completed, out_metadata_str, summarised_output_markdown, log_output_files, output_files, acc_input_tokens, acc_output_tokens, acc_number_of_calls, time_taken, out_message
 
 @spaces.GPU(duration=120)
 def overall_summary(topic_summary_df:pd.DataFrame,
@@ -747,6 +752,7 @@ def overall_summary(topic_summary_df:pd.DataFrame,
     output_tokens_num = 0
     number_of_calls_num = 0
     time_taken = 0
+    out_message = list()
 
     tic = time.perf_counter()
 
@@ -792,7 +798,7 @@ def overall_summary(topic_summary_df:pd.DataFrame,
     local_model, tokenizer = load_model(local_model_type=CHOSEN_LOCAL_MODEL_TYPE, repo_id=LOCAL_REPO_ID, model_filename=LOCAL_MODEL_FILE, model_dir=LOCAL_MODEL_FOLDER)
     #print("Local model loaded:", local_model)
 
-    summary_loop = tqdm(unique_groups, desc="Creating
+    summary_loop = tqdm(unique_groups, desc="Creating overall summary for groups", unit="groups")
 
     if do_summaries == "Yes":
         model_source = model_name_map[model_choice]["source"]
@@ -800,7 +806,7 @@ def overall_summary(topic_summary_df:pd.DataFrame,
 
     for summary_group in summary_loop:
 
-        print("Creating
+        print("Creating overall summary for group:", summary_group)
 
         summary_text = topic_summary_df.loc[topic_summary_df["Group"]==summary_group].to_markdown(index=False)
 
@@ -879,6 +885,8 @@ def overall_summary(topic_summary_df:pd.DataFrame,
     toc = time.perf_counter()
     time_taken = toc - tic
 
+    out_message = '\n'.join(out_message)
+    out_message = out_message + " " + f"Overall summary finished processing. Total time: {time_taken:.2f}s"
+    print(out_message)
 
-    return output_files, html_output_table, summarised_outputs_df, out_metadata_str, input_tokens_num, output_tokens_num, number_of_calls_num, time_taken
+    return output_files, html_output_table, summarised_outputs_df, out_metadata_str, input_tokens_num, output_tokens_num, number_of_calls_num, time_taken, out_message
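Both functions now build their status text the same way: out_message starts as a list, per-step strings get appended elsewhere in the function bodies, and a single join at the end produces one string for the new UI message output. A stripped-down sketch of the pattern (group names and messages are illustrative):

out_message = list()

for group in ["Group A", "Group B"]:              # illustrative groups
    # ... summarise the group ...
    out_message.append(f"Finished summarising: {group}")

out_message = '\n'.join(out_message)
out_message = out_message + " " + "All summaries finished."
print(out_message)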
tools/llm_api_call.py
CHANGED
@@ -17,7 +17,7 @@ GradioFileData = gr.FileData
 
 from tools.prompts import initial_table_prompt, prompt2, prompt3, initial_table_system_prompt, add_existing_topics_system_prompt, add_existing_topics_prompt, force_existing_topics_prompt, allow_new_topics_prompt, force_single_topic_prompt, add_existing_topics_assistant_prefill, initial_table_assistant_prefill, structured_summary_prompt
 from tools.helper_functions import read_file, put_columns_in_df, wrap_text, initial_clean, load_in_data_file, load_in_file, create_topic_summary_df_from_reference_table, convert_reference_table_to_pivot_table, get_basic_response_data, clean_column_name, load_in_previous_data_files, create_batch_file_path_details
 from tools.llm_funcs import ResponseObject, construct_gemini_generative_model, call_llm_with_markdown_table_checks, create_missing_references_df, calculate_tokens_from_metadata
-from tools.config import RUN_LOCAL_MODEL, AWS_REGION, MAX_COMMENT_CHARS, MAX_OUTPUT_VALIDATION_ATTEMPTS, MAX_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD,
+from tools.config import RUN_LOCAL_MODEL, AWS_REGION, MAX_COMMENT_CHARS, MAX_OUTPUT_VALIDATION_ATTEMPTS, MAX_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, OUTPUT_FOLDER, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, LLM_SEED, MAX_GROUPS, REASONING_SUFFIX
 from tools.aws_functions import connect_to_bedrock_runtime
 
 if RUN_LOCAL_MODEL == "1":
@@ -1283,6 +1283,7 @@ def wrapper_extract_topics_per_column_value(
     acc_input_tokens = 0
     acc_output_tokens = 0
     acc_number_of_calls = 0
+    out_message = list()
 
     if grouping_col is None:
         print("No grouping column found")
@@ -1321,8 +1322,14 @@ def wrapper_extract_topics_per_column_value(
 
     wrapper_first_loop = initial_first_loop_state
 
-
-
+    if len(unique_values) == 1:
+        loop_object = enumerate(unique_values)
+    else:
+        loop_object = tqdm(enumerate(unique_values), desc=f"Analysing group", total=len(unique_values), unit="groups")
+
+
+    for i, group_value in loop_object:
+        print(f"\nProcessing group: {grouping_col} = {group_value} ({i+1}/{len(unique_values)})")
 
         filtered_file_data = file_data.copy()
 
@@ -1440,7 +1447,7 @@ def wrapper_extract_topics_per_column_value(
             acc_total_time_taken += float(seg_time_taken)
             acc_gradio_df = seg_gradio_df # Keep the latest Gradio DF
 
-            print(f"
+            print(f"Group {grouping_col} = {group_value} processed. Time: {seg_time_taken:.2f}s")
 
         except Exception as e:
             print(f"Error processing segment {grouping_col} = {group_value}: {e}")
@@ -1481,7 +1488,9 @@ def wrapper_extract_topics_per_column_value(
 
     acc_input_tokens, acc_output_tokens, acc_number_of_calls = calculate_tokens_from_metadata(acc_whole_conversation_metadata, model_choice, model_name_map)
 
-
+    out_message = '\n'.join(out_message)
+    out_message = out_message + " " + f"Topic extraction finished processing all groups. Total time: {acc_total_time_taken:.2f}s"
+    print(out_message)
 
     # The return signature should match extract_topics.
     # The aggregated lists will be returned in the multiple slots.
@@ -1505,7 +1514,8 @@ def wrapper_extract_topics_per_column_value(
         acc_missing_df,
         acc_input_tokens,
         acc_output_tokens,
-        acc_number_of_calls
+        acc_number_of_calls,
+        out_message
     )
 
 
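The group loop above conditionally wraps its iterator so that a single-group run does not render a pointless progress bar. The same idea as a standalone sketch (the function name is illustrative):

from tqdm import tqdm

def iter_with_optional_progress(values):
    # Only show a progress bar when there is more than one item to loop over.
    if len(values) == 1:
        return enumerate(values)
    return tqdm(enumerate(values), desc="Analysing group", total=len(values), unit="groups")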
tools/llm_funcs.py
CHANGED
@@ -18,7 +18,7 @@ full_text = "" # Define dummy source text (full text) just to enable highlight f
 
 model = list() # Define empty list for model functions to run
 tokenizer = list() #[] # Define empty list for model functions to run
 
-from tools.config import
+from tools.config import AWS_REGION, LLM_TEMPERATURE, LLM_TOP_K, LLM_MIN_P, LLM_TOP_P, LLM_REPETITION_PENALTY, LLM_LAST_N_TOKENS, LLM_MAX_NEW_TOKENS, LLM_SEED, LLM_RESET, LLM_STREAM, LLM_THREADS, LLM_BATCH_SIZE, LLM_CONTEXT_LENGTH, LLM_SAMPLE, MAX_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, MAX_COMMENT_CHARS, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, HF_TOKEN, LLM_SEED, LLM_MAX_GPU_LAYERS, SPECULATIVE_DECODING, NUM_PRED_TOKENS
 from tools.prompts import initial_table_assistant_prefill
 
 if SPECULATIVE_DECODING == "True": SPECULATIVE_DECODING = True
@@ -220,8 +220,6 @@ def load_model(local_model_type:str=CHOSEN_LOCAL_MODEL_TYPE,
     # Verify the device and cuda settings
     # Check if CUDA is enabled
     import torch
-    #if RUN_LOCAL_MODEL == "1":
-    #print("Running local model - importing llama-cpp-python")
     from llama_cpp import Llama
     from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
 
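For readers unfamiliar with the speculative-decoding import at the end of that hunk, this is roughly how LlamaPromptLookupDecoding plugs into a Llama instance in llama-cpp-python. The path and parameter values below are illustrative; the app reads its real equivalents (such as NUM_PRED_TOKENS and LLM_MAX_GPU_LAYERS) from tools.config:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="models/gemma-3-4b.gguf",   # hypothetical path
    n_gpu_layers=-1,                        # offload all layers if the build has GPU support
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=2),  # prompt-lookup draft model
)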
windows_install_llama-cpp-python.txt
CHANGED
@@ -1,26 +1,11 @@
 ---
 
-
-
 
-
----
-
-#
-
-pip install llama-cpp-python==0.3.16 --force-reinstall --no-cache-dir --verbose -C cmake.args="-DGGML_CUDA=on"
-
-
----
-
-
-How to Make it Work: Step-by-Step Guide
-To successfully run your command, you need to set up a proper C++ development environment.
-
-Step 1: Install the C++ Compiler
-Go to the Visual Studio downloads page.
-
-Scroll down to "Tools for Visual Studio" and download the "Build Tools for Visual Studio". This is a standalone installer that gives you the C++ compiler and libraries without installing the full Visual Studio IDE.
+# How to build llama-cpp-python on Windows: Step-by-Step Guide
+
+First, you need to set up a proper C++ development environment.
+
+# Step 1: Install the C++ Compiler
+Go to the Visual Studio downloads page, scroll past the main programs to "Tools for Visual Studio", and download the "Build Tools for Visual Studio". This is a standalone installer that gives you the C++ compiler and libraries without installing the full Visual Studio IDE.
 
 Run the installer. In the "Workloads" tab, check the box for "Desktop development with C++".
 
@@ -34,18 +19,17 @@ Windows 10 SDK (10.0.20348.0)
 
 Proceed with the installation.
 
-
-
-Step 2: Install CMake
-Go to the CMake download page.
+You will need to use the 'x64 Native Tools Command Prompt for VS 2022' (run as administrator) for the install commands below.
+
+# Step 2: Install CMake
+Go to the CMake download page: https://cmake.org/download
 
 Download the latest Windows installer (e.g., cmake-x.xx.x-windows-x86_64.msi).
 
 Run the installer. Crucially, when prompted, select the option to "Add CMake to the system PATH for all users" or "for the current user." This allows you to run cmake from any command prompt.
 
 
-Step 3: Download and Place OpenBLAS
+# Step 3: (FOR CPU INFERENCE ONLY) Download and Place OpenBLAS
 This is often the trickiest part.
 
 Go to the OpenBLAS releases on GitHub.
@@ -56,14 +40,12 @@ Create a folder somewhere easily accessible, for example, C:\libs\.
 
 Extract the contents of the OpenBLAS zip file into that folder. Your final directory structure should look something like this:
 
-Generated code
 C:\libs\OpenBLAS\
 ├── bin\
 ├── include\
 └── lib\
-Use code with caution.
 
-3.b. Install Chocolatey
+## 3.b. Install Chocolatey
 https://chocolatey.org/install
 
 Step 1: Install Chocolatey (if you don't already have it)
@@ -71,25 +53,20 @@ Open PowerShell as an Administrator. (Right-click the Start Menu -> "Windows Pow
 
 Run the following command to install Chocolatey. It's a single, long line:
 
-Generated powershell
 Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
-
-
-Wait for it to finish. Once it's done, close the Administrator PowerShell window.
+
+Once it's done, close the Administrator PowerShell window.
 
 Step 2: Install pkg-config-lite using Chocolatey
-IMPORTANT: Open a NEW command prompt or PowerShell window (as a regular user is fine). This is necessary so it
+IMPORTANT: Open a NEW command prompt or PowerShell window (as a regular user is fine). This is necessary so it recognises the new choco command.
 
-Run the following command to install a lightweight version of pkg-config:
+Run the following command in the console to install a lightweight version of pkg-config:
 
-Generated cmd
 choco install pkgconfiglite
-Use code with caution.
-Cmd
-Approve the installation by typing Y or A if prompted.
+
+Approve the installation by typing Y or A if prompted.
 
-Step 4: Run the Installation Command
+# Step 4: Run the Installation Command
 Now you have all the pieces. The final step is to run the command in a terminal that is aware of your new build environment.
 
 Open the "Developer Command Prompt for VS" from your Start Menu. This is important! This special command prompt automatically configures all the necessary paths for the C++ compiler.
@@ -98,22 +75,35 @@ Open the "Developer Command Prompt for VS" from your Start Menu. This is importa
 
 set PKG_CONFIG_PATH=C:\<path-to-openblas>\OpenBLAS\lib\pkgconfig # Set this in environment variables
 
-
 pip install llama-cpp-python==0.3.16 --force-reinstall --verbose --no-cache-dir -Ccmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS;-DBLAS_INCLUDE_DIRS=C:/<path-to-openblas>/OpenBLAS/include;-DBLAS_LIBRARIES=C:/<path-to-openblas>/OpenBLAS/lib/libopenblas.lib"
 
+or, to build a wheel instead (note that it is pip wheel, not pip install, that accepts --wheel-dir):
+
+pip wheel llama-cpp-python==0.3.16 --wheel-dir dist --verbose --no-cache-dir -Ccmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS;-DBLAS_INCLUDE_DIRS=C:/<path-to-openblas>/OpenBLAS/include;-DBLAS_LIBRARIES=C:/<path-to-openblas>/OpenBLAS/lib/libopenblas.lib"
+
+## With CUDA (NVIDIA GPUs only)
+
+Make sure that you have the CUDA 12.4 toolkit for Windows installed: https://developer.nvidia.com/cuda-12-4-0-download-archive
+
+### Make sure you are using the x64 version of the Developer command tools for the below, e.g. 'x64 Native Tools Command Prompt for VS 2022' ###
 
 Use NVIDIA GPU (cuBLAS): If you have an NVIDIA GPU, using cuBLAS is often easier because the CUDA Toolkit installer handles most of the setup.
 
 Install the NVIDIA CUDA Toolkit.
 
-Run the install command specifying cuBLAS:
+Run the install command specifying cuBLAS (for faster inference):
 
-pip install llama-cpp-python==0.3.16 --force-reinstall --no-cache-dir --verbose -C cmake.args="-DGGML_CUDA=on"
+pip install llama-cpp-python==0.3.16 --force-reinstall --verbose -C cmake.args="-DGGML_CUDA=on -DGGML_CUBLAS=on"
+
+If you want to create a new wheel to help with future installs, first cd to a folder that you have write access to, then run:
+
+pip wheel llama-cpp-python==0.3.16 --wheel-dir dist --verbose -C cmake.args="-DGGML_CUDA=on -DGGML_CUBLAS=on"
 
 
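After a successful build, a quick smoke test along these lines confirms that the wheel imports and can run a model; the model path is hypothetical and any small GGUF file will do:

from llama_cpp import Llama

# Load a local GGUF model; n_gpu_layers=-1 offloads all layers if the
# build has CUDA support, and is ignored by CPU-only builds.
llm = Llama(model_path=r"C:\models\gemma-3-4b.gguf", n_gpu_layers=-1)
out = llm("Say hello in one short sentence.", max_tokens=32)
print(out["choices"][0]["text"])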