seanpedrickcase committed on
Commit ba1a951 · 1 Parent(s): aa08197

Enhanced app functionality by adding new UI elements for summary file management, introducing a Bedrock model toggle, and refining logging messages. Updated the Dockerfile and requirements files for better compatibility, and added an installation guide to the README. Removed deprecated code and unnecessary comments.

Dockerfile CHANGED
@@ -1,3 +1,4 @@
+ # This Dockerfile is optimised for AWS ECS using Python 3.11, and assumes CPU inference with OpenBLAS for local models.
 # Stage 1: Build dependencies and download models
 FROM public.ecr.aws/docker/library/python:3.11.13-slim-bookworm AS builder
 
@@ -7,7 +8,7 @@ RUN apt-get update && apt-get install -y \
     gcc \
     g++ \
     cmake \
-     libopenblas-dev \
+     #libopenblas-dev \
     pkg-config \
    python3-dev \
    libffi-dev \
@@ -18,9 +19,9 @@ WORKDIR /src
 COPY requirements_aws.txt .
 
- # Set environment variables for OpenBLAS
- ENV OPENBLAS_VERBOSE=1
- ENV CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
+ # Set environment variables for OpenBLAS - not necessary if not building from source
+ # ENV OPENBLAS_VERBOSE=1
+ # ENV CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
 
 RUN pip install --no-cache-dir --target=/install torch==2.7.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu \
     && pip install --no-cache-dir --target=/install https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl \
README.md CHANGED
@@ -11,7 +11,7 @@ license: agpl-3.0
 
 # Large language model topic modelling
 
- Extract topics and summarise outputs using Large Language Models (LLMs: Gemma 3 4b/GPT-OSS 20b if local (see config.py), Gemini 2.5, or Bedrock models, e.g. Claude 3 Haiku/Claude 3.7 Sonnet). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
+ Extract topics and summarise outputs using Large Language Models (LLMs: Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini 2.5, or Bedrock models, e.g. Claude 3 Haiku/Claude 3.7 Sonnet). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
 
 Instructions on use can be found in the README.md file. Try it out with this [dummy development consultation dataset](https://huggingface.co/datasets/seanpedrickcase/dummy_development_consultation/tree/main), which you can also try with [zero-shot topics](https://huggingface.co/datasets/seanpedrickcase/dummy_development_consultation/tree/main). Try also this [dummy case notes dataset](https://huggingface.co/datasets/seanpedrickcase/dummy_case_notes/tree/main).
 
@@ -25,6 +25,120 @@ Basic use:
 2. Select the relevant open text column from the dropdown.
 3. If you have your own suggested (zero shot) topics, upload this file (see the examples folder for an example file).
 4. Write a one-sentence description of the consultation/context of the open text.
- 5. Extract topics.
- 6. If topic extraction fails part way through, you can upload the latest 'reference_table' and 'unique_topics_table' csv outputs on the 'Continue previous topic extraction' tab to continue from where you left off.
- 7. Summaries will be produced for each topic for each 'batch' of responses. If you want consolidated summaries, go to the tab 'Summarise topic outputs', upload your output reference_table and unique_topics csv files, and press summarise.
+ 5. Click 'All in one - Extract topics, deduplicate, and summarise'. This will run through the whole analysis process from topic extraction, to topic deduplication, to topic-level and overall summaries.
+ 6. A summary xlsx workbook will be created on the front page in the box 'Overall summary xlsx file'. This combines all the results from the different processes into one workbook.
+
+ # Installation guide
+
+ Here is a step-by-step guide to clone the repository, create a virtual environment, and install dependencies from the relevant `requirements` file. This guide assumes you have **Git** and **Python 3.11** installed.
+
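A quick way to confirm both prerequisites, assuming `git` and `python` are on your PATH (on Windows, the `py -3.11 --version` launcher also works):

```bash
git --version     # any recent Git is fine
python --version  # should report Python 3.11.x
```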
+ -----
+
+ ### Step 1: Clone the Git Repository
+
+ First, copy the project files to your local machine. Navigate to the directory where you want to store the project using the `cd` (change directory) command, then use `git clone` with the repository's URL.
+
+ 1. **Clone the repo:**
+
+ ```bash
+ git clone https://github.com/example-user/example-repo.git
+ ```
+
+ *Replace the URL with your repository's URL.*
+
+ 2. **Navigate into the new project folder:**
+
+ ```bash
+ cd example-repo
+ ```
+ -----
+
+ ### Step 2: Create and Activate a Virtual Environment
+
+ A virtual environment is a self-contained directory that holds a specific Python interpreter and its own set of installed packages. This is crucial for isolating your project's dependencies.
+
+ 1. **Create the virtual environment:** We'll use Python's built-in `venv` module. It's common practice to name the environment folder `.venv`.
+
+ ```bash
+ python -m venv .venv
+ ```
+
+ *This command tells Python to create a new virtual environment in a folder named `.venv`.*
+
+ 2. **Activate the environment:** You must "activate" the environment to start using it. The command differs based on your operating system and shell.
+
+ * **On macOS / Linux (bash/zsh):**
+
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ * **On Windows (Command Prompt):**
+
+ ```bash
+ .\.venv\Scripts\activate
+ ```
+
+ * **On Windows (PowerShell):**
+
+ ```powershell
+ .\.venv\Scripts\Activate.ps1
+ ```
+
+ You'll know it's active because your command prompt will be prefixed with `(.venv)`.
+
+ -----
+
+ ### Step 3: Install Dependencies
+
+ Now that your virtual environment is active, you can install all the required packages listed in the relevant project `requirements` file using `pip`.
+
+ 1. **Choose the relevant requirements file**
+
+ llama-cpp-python version 0.3.16 is compatible with Gemma 3 and GPT-OSS models, but at the time of writing does not have official wheels for CPU inference or for Windows. A sister repository contains [llama-cpp-python 0.3.16 wheels for Python 3.10/3.11](https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/tag/v0.1.0) so that users can avoid having to build the package from source. If you prefer to build from source, please refer to the llama-cpp-python documentation [here](https://github.com/abetlen/llama-cpp-python). I also have a guide to building the package on a Windows system [here](https://github.com/seanpedrick-case/llm_topic_modelling/blob/main/windows_install_llama-cpp-python.txt).
+
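For example, on Linux with Python 3.11 you could install the prebuilt wheel directly from that release before installing the remaining requirements (this is the same wheel URL used in the Dockerfile; adjust it for your platform and Python version):

```bash
pip install https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
```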
+ The repo provides several requirements files for different situations. I would advise using requirements_gpu.txt for GPU environments, and requirements_cpu.txt for CPU environments:
+
+ - **requirements_gpu.txt**: Used for Python 3.11 GPU-enabled environments. Uncomment the last requirement under 'Windows' for Windows compatibility (CUDA 12.4).
+ - **requirements_cpu.txt**: Used for Python 3.11 CPU-only environments. Uncomment the last requirement under 'Windows' for Windows compatibility.
+ - **requirements.txt**: Used for the Python 3.10 GPU-enabled environment on Hugging Face spaces (CUDA 12.4).
+ - **requirements_aws.txt**: Used in conjunction with the Dockerfile for Python 3.11, CPU-only environments.
+
+ 2. **Install packages from the requirements file:**
+ ```bash
+ pip install -r requirements_gpu.txt
+ ```
+ *This command reads every package listed in the file and installs it into your `.venv` environment.*
+
+ You're all set! ✅ Your project is cloned, and all dependencies are installed in an isolated environment.
+
+ When you are finished working, you can leave the virtual environment by simply typing:
+
+ ```bash
+ deactivate
+ ```
+
+ ### Step 4: Verify CUDA compatibility (if using a GPU environment)
+
+ Install the relevant toolkit for CUDA 12.4 from here: https://developer.nvidia.com/cuda-12-4-0-download-archive
+
+ Restart your computer.
+
+ Ensure you have the latest drivers for your NVIDIA GPU. Check your current driver version and memory availability by running `nvidia-smi`.
+
+ In the command line, CUDA compatibility can be checked by running `nvcc --version`.
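Together, the two checks look like this:

```bash
nvidia-smi       # reports driver version, the CUDA version the driver supports, and GPU memory
nvcc --version   # reports the installed CUDA toolkit version (should show 12.4)
```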
+
+ ### Step 5: Ensure you have compatible NVIDIA drivers
+
+ Make sure you have the latest NVIDIA drivers installed on your system for your GPU (be careful in particular, if using WSL, that your drivers are compatible with it). Official drivers can be found here: https://www.nvidia.com/en-us/drivers
+
+ Current drivers can be listed by running `nvidia-smi` in the command line.
+
+ ### Step 6: Run the app
+
+ Go to the app project directory and run `python app.py`.
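Assuming you cloned into `example-repo` as in Step 1 and your virtual environment is still active:

```bash
cd example-repo
python app.py
# Gradio prints a local URL to open in your browser once the app has loaded
```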
+
+ ### Step 7: (optional) Change default configuration
+
+ A number of configuration options can be seen in the tools/config.py file. You can either pass these in as environment variables, or create a config/app_config.env file that is read into the app on initialisation.
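A minimal sketch of such a file, using variable names that appear in tools/config.py (see the tools/config.py diff below); the values shown are illustrative only:

```bash
# config/app_config.env - example values only
RUN_LOCAL_MODEL=1             # set to 0 to hide local models from the model list
RUN_AWS_BEDROCK_MODELS=0      # set to 1 to offer Bedrock models (see tools/config.py)
GEMINI_API_KEY=your-key-here  # only needed for Google API models
```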
app.py CHANGED
@@ -12,7 +12,7 @@ from tools.custom_csvlogger import CSVLogger_custom
 from tools.auth import authenticate_user
 from tools.prompts import initial_table_prompt, prompt2, prompt3, system_prompt, add_existing_topics_system_prompt, add_existing_topics_prompt, verify_titles_prompt, verify_titles_system_prompt, two_para_summary_format_prompt, single_para_summary_format_prompt
 from tools.verify_titles import verify_titles
- from tools.config import RUN_AWS_FUNCTIONS, HOST_NAME, ACCESS_LOGS_FOLDER, FEEDBACK_LOGS_FOLDER, USAGE_LOGS_FOLDER, RUN_LOCAL_MODEL, FILE_INPUT_HEIGHT, GEMINI_API_KEY, model_full_names, BATCH_SIZE_DEFAULT, CHOSEN_LOCAL_MODEL_TYPE, LLM_SEED, COGNITO_AUTH, MAX_QUEUE_SIZE, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, INPUT_FOLDER, OUTPUT_FOLDER, S3_LOG_BUCKET, CONFIG_FOLDER, GRADIO_TEMP_DIR, MPLCONFIGDIR, model_name_map, GET_COST_CODES, ENFORCE_COST_CODES, DEFAULT_COST_CODE, COST_CODES_PATH, S3_COST_CODES_PATH, OUTPUT_COST_CODES_PATH, SHOW_COSTS, SAVE_LOGS_TO_CSV, SAVE_LOGS_TO_DYNAMODB, ACCESS_LOG_DYNAMODB_TABLE_NAME, USAGE_LOG_DYNAMODB_TABLE_NAME, FEEDBACK_LOG_DYNAMODB_TABLE_NAME, LOG_FILE_NAME, FEEDBACK_LOG_FILE_NAME, USAGE_LOG_FILE_NAME, CSV_ACCESS_LOG_HEADERS, CSV_FEEDBACK_LOG_HEADERS, CSV_USAGE_LOG_HEADERS, DYNAMODB_ACCESS_LOG_HEADERS, DYNAMODB_FEEDBACK_LOG_HEADERS, DYNAMODB_USAGE_LOG_HEADERS, S3_ACCESS_LOGS_FOLDER, S3_FEEDBACK_LOGS_FOLDER, S3_USAGE_LOGS_FOLDER
+ from tools.config import RUN_AWS_FUNCTIONS, HOST_NAME, ACCESS_LOGS_FOLDER, FEEDBACK_LOGS_FOLDER, USAGE_LOGS_FOLDER, RUN_LOCAL_MODEL, FILE_INPUT_HEIGHT, GEMINI_API_KEY, model_full_names, BATCH_SIZE_DEFAULT, CHOSEN_LOCAL_MODEL_TYPE, LLM_SEED, COGNITO_AUTH, MAX_QUEUE_SIZE, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, INPUT_FOLDER, OUTPUT_FOLDER, S3_LOG_BUCKET, CONFIG_FOLDER, GRADIO_TEMP_DIR, MPLCONFIGDIR, model_name_map, GET_COST_CODES, ENFORCE_COST_CODES, DEFAULT_COST_CODE, COST_CODES_PATH, S3_COST_CODES_PATH, OUTPUT_COST_CODES_PATH, SHOW_COSTS, SAVE_LOGS_TO_CSV, SAVE_LOGS_TO_DYNAMODB, ACCESS_LOG_DYNAMODB_TABLE_NAME, USAGE_LOG_DYNAMODB_TABLE_NAME, FEEDBACK_LOG_DYNAMODB_TABLE_NAME, LOG_FILE_NAME, FEEDBACK_LOG_FILE_NAME, USAGE_LOG_FILE_NAME, CSV_ACCESS_LOG_HEADERS, CSV_FEEDBACK_LOG_HEADERS, CSV_USAGE_LOG_HEADERS, DYNAMODB_ACCESS_LOG_HEADERS, DYNAMODB_FEEDBACK_LOG_HEADERS, DYNAMODB_USAGE_LOG_HEADERS, S3_ACCESS_LOGS_FOLDER, S3_FEEDBACK_LOGS_FOLDER, S3_USAGE_LOGS_FOLDER, AWS_ACCESS_KEY, AWS_SECRET_KEY
 
 def ensure_folder_exists(output_folder:str):
     """Checks if the specified folder exists, creates it if not."""
@@ -115,6 +115,8 @@ with app:
 summarised_outputs_list = gr.Dropdown(value= list(), choices= list(), visible=False, label="List of summarised outputs", allow_custom_value=True)
 latest_summary_completed_num = gr.Number(0, visible=False)
 
+ summary_xlsx_output_files_list = gr.Dropdown(value= list(), choices= list(), visible=False, label="List of xlsx summary output files", allow_custom_value=True)
+
 original_data_file_name_textbox = gr.Textbox(label = "Reference data file name", value="", visible=False)
 working_data_file_name_textbox = gr.Textbox(label = "Working data file name", value="", visible=False)
 unique_topics_table_file_name_textbox = gr.Textbox(label="Unique topics data file name textbox", visible=False)
@@ -132,13 +134,17 @@ with app:
 cost_code_dataframe = gr.Dataframe(value=pd.DataFrame(), type="pandas", visible=False, wrap=True)
 cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis. Please contact Finance if you can't find your cost code in the given list.", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=False)
 
+ latest_batch_completed = gr.Number(value=0, label="Number of files prepared", interactive=False, visible=False)
+ # Duplicate version of the above variable for when you don't want to initiate the summarisation loop
+ latest_batch_completed_no_loop = gr.Number(value=0, label="Number of files prepared", interactive=False, visible=False)
+
 ###
 # UI LAYOUT
 ###
 
 gr.Markdown("""# Large language model topic modelling
 
- Extract topics and summarise outputs using Large Language Models (LLMs: Gemma 3 4b/GPT-OSS 20b if local (see config.py), Gemini 2.5, or Bedrock models, e.g. Claude 3 Haiku/Claude 3.7 Sonnet). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
+ Extract topics and summarise outputs using Large Language Models (LLMs: Gemma 3 4b/GPT-OSS 20b if local (see tools/config.py to modify), Gemini 2.5, or Bedrock models, e.g. Claude 3 Haiku/Claude 3.7 Sonnet). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
 
 Instructions on use can be found in the README.md file. Try it out with this [dummy development consultation dataset](https://huggingface.co/datasets/seanpedrickcase/dummy_development_consultation/tree/main), which you can also try with [zero-shot topics](https://huggingface.co/datasets/seanpedrickcase/dummy_development_consultation/tree/main). Try also this [dummy case notes dataset](https://huggingface.co/datasets/seanpedrickcase/dummy_case_notes/tree/main).
 
@@ -183,14 +189,12 @@ with app:
 all_in_one_btn = gr.Button("All in one - Extract topics, deduplicate, and summarise", variant="primary")
 extract_topics_btn = gr.Button("1. Extract topics", variant="secondary")
 
- with gr.Row():
-     topic_extraction_output_files = gr.File(height=FILE_INPUT_HEIGHT, label="Output files", scale=3, interactive=False)
-     topic_extraction_output_files_xlsx = gr.File(height=FILE_INPUT_HEIGHT, label="Output xlsx summary file", scale=1, interactive=False)
+ with gr.Row(equal_height=True):
+     output_messages_textbox = gr.Textbox(value="", label="Output messages", scale=1, interactive=False)
+     topic_extraction_output_files = gr.File(label="Extract topics output files", scale=1, interactive=False)
+     topic_extraction_output_files_xlsx = gr.File(label="Overall summary xlsx file", scale=1, interactive=False)
 
 display_topic_table_markdown = gr.Markdown(value="### Language model response will appear here", show_copy_button=True)
- latest_batch_completed = gr.Number(value=0, label="Number of files prepared", interactive=False, visible=False)
- # Duplicate version of the above variable for when you don't want to initiate the summarisation loop
- latest_batch_completed_no_loop = gr.Number(value=0, label="Number of files prepared", interactive=False, visible=False)
 
 data_feedback_title = gr.Markdown(value="## Please give feedback", visible=False)
 data_feedback_radio = gr.Radio(label="Please give some feedback about the results of the topic extraction.",
@@ -287,14 +291,15 @@ with app:
 with gr.Tab(label="LLM and topic extraction settings"):
     gr.Markdown("""Define settings that affect large language model output.""")
     with gr.Accordion("Settings for LLM generation", open = True):
-         temperature_slide = gr.Slider(minimum=0.1, maximum=1.0, value=0.1, label="Choose LLM temperature setting")
+         temperature_slide = gr.Slider(minimum=0.1, maximum=1.0, value=0.1, label="Choose LLM temperature setting", precision=1)
         batch_size_number = gr.Number(label = "Number of responses to submit in a single LLM query", value = BATCH_SIZE_DEFAULT, precision=0, minimum=1, maximum=100)
         random_seed = gr.Number(value=LLM_SEED, label="Random seed for LLM generation", visible=False)
 
     with gr.Accordion("AWS API keys", open = False):
+         gr.Markdown("""Querying Bedrock models with API keys requires a role with IAM permissions for the bedrock:InvokeModel action.""")
         with gr.Row():
-             aws_access_key_textbox = gr.Textbox(label="AWS access key", interactive=False, lines=1, type="password")
-             aws_secret_key_textbox = gr.Textbox(label="AWS secret key", interactive=False, lines=1, type="password")
+             aws_access_key_textbox = gr.Textbox(value=AWS_ACCESS_KEY, label="AWS access key", lines=1, type="password")
+             aws_secret_key_textbox = gr.Textbox(value=AWS_SECRET_KEY, label="AWS secret key", lines=1, type="password")
 
     with gr.Accordion("Gemini API keys", open = False):
         google_api_key_textbox = gr.Textbox(value = GEMINI_API_KEY, label="Enter Gemini API key (only if using Google API models)", lines=1, type="password")
@@ -413,10 +418,11 @@ with app:
 missing_df_state,
 input_tokens_num,
 output_tokens_num,
- number_of_calls_num],
- api_name="extract_topics").\
+ number_of_calls_num,
+ output_messages_textbox],
+ api_name="extract_topics", show_progress_on=output_messages_textbox).\
 success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False, api_name="usage_logs").\
- success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[topic_extraction_output_files_xlsx])
+ success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[topic_extraction_output_files_xlsx, summary_xlsx_output_files_list])
 
 ###
 # DEDUPLICATION AND SUMMARISATION FUNCTIONS
@@ -436,16 +442,16 @@ with app:
 success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
 success(load_in_previous_data_files, inputs=[summarisation_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
 success(sample_reference_table_summaries, inputs=[master_reference_df_state, random_seed], outputs=[summary_reference_table_sample_state, summarised_references_markdown], api_name="sample_summaries").\
- success(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, in_data_files, in_excel_sheets, in_colnames, log_files_output_list_state, summarise_format_radio, output_folder_state, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, summarised_output_markdown, log_files_output, overall_summarisation_input_files, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number], api_name="summarise_topics").\
+ success(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, in_data_files, in_excel_sheets, in_colnames, log_files_output_list_state, summarise_format_radio, output_folder_state, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, summarised_output_markdown, log_files_output, overall_summarisation_input_files, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, output_messages_textbox], api_name="summarise_topics", show_progress_on=[output_messages_textbox, summary_output_files]).\
 success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
- success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_revised_summaries_state, master_unique_topics_df_revised_summaries_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[summary_output_files_xlsx])
+ success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_revised_summaries_state, master_unique_topics_df_revised_summaries_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[summary_output_files_xlsx, summary_xlsx_output_files_list])
 
 # SUMMARISE WHOLE TABLE PAGE
 overall_summarise_previous_data_btn.click(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
 success(load_in_previous_data_files, inputs=[overall_summarisation_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
- success(overall_summary, inputs=[master_unique_topics_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, output_folder_state, in_colnames, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[overall_summary_output_files, overall_summarised_output_markdown, summarised_output_df, conversation_metadata_textbox, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number], scroll_to_output=True, api_name="overall_summary").\
+ success(overall_summary, inputs=[master_unique_topics_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, output_folder_state, in_colnames, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[overall_summary_output_files, overall_summarised_output_markdown, summarised_output_df, conversation_metadata_textbox, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, output_messages_textbox], scroll_to_output=True, api_name="overall_summary", show_progress_on=[output_messages_textbox, overall_summary_output_files]).\
 success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
- success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[overall_summary_output_files_xlsx])
+ success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[overall_summary_output_files_xlsx, summary_xlsx_output_files_list])
 
 
 # All in one button
@@ -509,21 +515,20 @@ with app:
 missing_df_state,
 input_tokens_num,
 output_tokens_num,
- number_of_calls_num]).\
+ number_of_calls_num,
+ output_messages_textbox], show_progress_on=output_messages_textbox).\
 success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
 success(load_in_previous_data_files, inputs=[deduplication_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
 success(deduplicate_topics, inputs=[master_reference_df_state, master_unique_topics_df_state, working_data_file_name_textbox, unique_topics_table_file_name_textbox, in_excel_sheets, merge_sentiment_drop, merge_general_topics_drop, deduplicate_score_threshold, in_data_files, in_colnames, output_folder_state], outputs=[master_reference_df_state, master_unique_topics_df_state, summarisation_input_files, log_files_output, summarised_output_markdown]).\
 success(load_in_previous_data_files, inputs=[summarisation_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
 success(sample_reference_table_summaries, inputs=[master_reference_df_state, random_seed], outputs=[summary_reference_table_sample_state, summarised_references_markdown]).\
- success(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, in_data_files, in_excel_sheets, in_colnames, log_files_output_list_state, summarise_format_radio, output_folder_state, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, summarised_output_markdown, log_files_output, overall_summarisation_input_files, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number]).\
+ success(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, in_data_files, in_excel_sheets, in_colnames, log_files_output_list_state, summarise_format_radio, output_folder_state, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox, display_topic_table_markdown, log_files_output, overall_summarisation_input_files, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, output_messages_textbox], show_progress_on=[output_messages_textbox, summary_output_files]).\
 success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
 success(load_in_previous_data_files, inputs=[overall_summarisation_input_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, deduplication_input_files_status, working_data_file_name_textbox, unique_topics_table_file_name_textbox]).\
- success(overall_summary, inputs=[master_unique_topics_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, output_folder_state, in_colnames, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[overall_summary_output_files, overall_summarised_output_markdown, summarised_output_df, conversation_metadata_textbox, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number]).\
+ success(overall_summary, inputs=[master_unique_topics_df_state, model_choice, google_api_key_textbox, temperature_slide, working_data_file_name_textbox, output_folder_state, in_colnames, context_textbox, aws_access_key_textbox, aws_secret_key_textbox, model_name_map_state], outputs=[overall_summary_output_files, overall_summarised_output_markdown, summarised_output_df, conversation_metadata_textbox, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, output_messages_textbox], show_progress_on=[output_messages_textbox, overall_summary_output_files]).\
 success(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, original_data_file_name_textbox, in_colnames, model_choice, conversation_metadata_textbox_placeholder, input_tokens_num, output_tokens_num, number_of_calls_num, estimated_time_taken_number, cost_code_choice_drop], None, preprocess=False).\
- success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[overall_summary_output_files_xlsx]).\
- success(move_overall_summary_output_files_to_front_page, inputs=[overall_summary_output_files_xlsx], outputs=[topic_extraction_output_files_xlsx])
-
-
+ success(collect_output_csvs_and_create_excel_output, inputs=[in_data_files, in_colnames, original_data_file_name_textbox, in_group_col, model_choice, master_reference_df_state, master_unique_topics_df_state, summarised_output_df, missing_df_state, in_excel_sheets, usage_logs_state, model_name_map_state, output_folder_state], outputs=[overall_summary_output_files_xlsx, summary_xlsx_output_files_list]).\
+ success(move_overall_summary_output_files_to_front_page, inputs=[summary_xlsx_output_files_list], outputs=[topic_extraction_output_files_xlsx])
 
 ###
 # CONTINUE PREVIOUS TOPIC EXTRACTION PAGE
requirements.txt CHANGED
@@ -1,3 +1,4 @@
 
1
  pandas==2.3.2
2
  gradio==5.44.1
3
  transformers==4.56.0
@@ -13,15 +14,13 @@ html5lib==1.1
13
  beautifulsoup4==4.12.3
14
  rapidfuzz==3.13.0
15
  python-dotenv==1.1.0
16
- # Torch and Llama CPP Python
17
  # GPU
18
  torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124 # Latest compatible with CUDA 12.4
19
- https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl # Specify exact llama_cpp for cuda compatibility on Hugging Face
20
- #
21
- # CPU only (for e.g. Hugging Face CPU instances):
22
  #torch==2.7.1 --extra-index-url https://download.pytorch.org/whl/cpu
23
- # For Hugging Face, need a python 3.10 compatible wheel for llama-cpp-python to avoid build timeouts. A Python 3.11 wheel is also available from the same repo
24
  #https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
25
- #https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
26
 
27
 
 
1
+ # Note that this requirements file is optimised for Hugging Face spaces / Python 3.10. Please use requirements_cpu.txt for CPU instances and requirements_gpu.txt for GPU instances using Python 3.11
2
  pandas==2.3.2
3
  gradio==5.44.1
4
  transformers==4.56.0
 
14
  beautifulsoup4==4.12.3
15
  rapidfuzz==3.13.0
16
  python-dotenv==1.1.0
17
+ # Torch and llama-cpp-python
18
  # GPU
19
  torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124 # Latest compatible with CUDA 12.4
20
+ https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
21
+ # CPU only (for e.g. Hugging Face CPU instances)
 
22
  #torch==2.7.1 --extra-index-url https://download.pytorch.org/whl/cpu
23
+ # For Hugging Face, need a python 3.10 compatible wheel for llama-cpp-python to avoid build timeouts
24
  #https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
 
25
 
26
 
requirements_aws.txt CHANGED
@@ -1,3 +1,4 @@
+ # This requirements file is optimised for AWS ECS using Python 3.11 alongside the Dockerfile, and assumes a Python 3.11 compatible llama-cpp-python wheel is available (see Dockerfile). torch and llama-cpp-python are not present here, as they are installed in the main Dockerfile
 pandas==2.3.2
 gradio==5.44.1
 transformers==4.56.0
@@ -12,6 +13,4 @@ google-genai==1.32.0
 html5lib==1.1
 beautifulsoup4==4.12.3
 rapidfuzz==3.13.0
- python-dotenv==1.1.0
- # torch==2.7.1 --extra-index-url https://download.pytorch.org/whl/cpu # Commented out as Dockerfile should install torch
- # llama-cpp-python==0.3.16 # Commented out as Dockerfile should install llama-cpp-python
+ python-dotenv==1.1.0
requirements_cpu.txt ADDED
@@ -0,0 +1,23 @@
+ pandas==2.3.2
+ gradio==5.44.1
+ transformers==4.56.0
+ spaces==0.40.1
+ boto3==1.40.22
+ pyarrow==21.0.0
+ openpyxl==3.1.5
+ markdown==3.7
+ tabulate==0.9.0
+ lxml==5.3.0
+ google-genai==1.32.0
+ html5lib==1.1
+ beautifulsoup4==4.12.3
+ rapidfuzz==3.13.0
+ python-dotenv==1.1.0
+ torch==2.7.1 --extra-index-url https://download.pytorch.org/whl/cpu
+ # Linux, Python 3.11 compatible wheel available:
+ #https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
+ # Windows, Python 3.11 compatible wheel available:
+ #https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64_cpu_openblas.whl
+ # If the above doesn't work for Windows, see 'windows_install_llama-cpp-python.txt' for instructions on how to build from source
+ # Alternatively, try installing the package from source below
+ # llama-cpp-python==0.3.16
requirements_gpu.txt CHANGED
@@ -19,6 +19,8 @@ torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124 # Latest compatible with CUDA 12.4
 # For Linux:
 https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp311-cp311-linux_x86_64.whl
 # For Windows:
- #llama-cpp-python==0.3.16 -C cmake.args="-DGGML_CUDA=on" --verbose
- # If above doesn't work for Windows, try looking at 'windows_install_llama-cpp-python.txt'
+ #https://github.com/seanpedrick-case/llama-cpp-python-whl-builder/releases/download/v0.1.0/llama_cpp_python-0.3.16-cp311-cp311-win_amd64.whl
+ # If the above doesn't work for Windows, try looking at 'windows_install_llama-cpp-python.txt' for instructions on how to build from source
+ # If none of the above work for you, try the following:
+ # llama-cpp-python==0.3.16 -C cmake.args="-DGGML_CUDA=on -DGGML_CUBLAS=on"
tools/aws_functions.py CHANGED
@@ -62,68 +62,6 @@ def connect_to_s3_client(aws_access_key_textbox:str="", aws_secret_key_textbox:s
 
     return s3_client
 
- # def connect_to_sts_client(aws_access_key_textbox:str="", aws_secret_key_textbox:str="", sts_endpoint:str=""):
- #     # If running an anthropic model, assume that running an AWS sts model, load in sts
- #     sts_client = []
-
- #     if aws_access_key_textbox and aws_secret_key_textbox:
- #         print("Connecting to sts using AWS access key and secret keys from user input.")
- #         sts_client = boto3.client('sts',
- #             aws_access_key_id=aws_access_key_textbox,
- #             aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
- #     elif RUN_AWS_FUNCTIONS == "1" and PRIORITISE_SSO_OVER_AWS_ENV_ACCESS_KEYS == "1":
- #         print("Connecting to sts via existing SSO connection")
- #         sts_client = boto3.client('sts', region_name=AWS_REGION)
- #     elif AWS_ACCESS_KEY and AWS_SECRET_KEY:
- #         print("Getting sts credentials from environment variables")
- #         sts_client = boto3.client('sts',
- #             aws_access_key_id=AWS_ACCESS_KEY,
- #             aws_secret_access_key=AWS_SECRET_KEY,
- #             region_name=AWS_REGION)
- #     else:
- #         sts_client = ""
- #         out_message = "Cannot connect to sts service. Please provide access keys under LLM settings, or choose another model type."
- #         print(out_message)
- #         raise Exception(out_message)
-
- #     return sts_client
-
-
-
- # def get_assumed_role_info(aws_access_key_textbox, aws_secret_key_textbox, sts_endpoint):
- #     sts_endpoint = 'https://sts.' + AWS_REGION + '.amazonaws.com'
- #     sts = connect_to_sts_client(aws_access_key_textbox, aws_secret_key_textbox, endpoint_url=sts_endpoint)
-
- #     #boto3.client('sts', region_name=AWS_REGION, endpoint_url=sts_endpoint)
- #     response = sts.get_caller_identity()
-
- #     # Extract ARN of the assumed role
- #     assumed_role_arn = response['Arn']
-
- #     # Extract the name of the assumed role from the ARN
- #     assumed_role_name = assumed_role_arn.split('/')[-1]
-
- #     return assumed_role_arn, assumed_role_name
-
- # if RUN_AWS_FUNCTIONS == "1":
- #     try:
- #         bucket_name = S3_LOG_BUCKET
- #         #session = boto3.Session() # profile_name="default"
- #     except Exception as e:
- #         print(e)
-
- #     try:
- #         assumed_role_arn, assumed_role_name = get_assumed_role_info(aws_access_key_textbox, aws_secret_key_textbox, sts_endpoint)
-
- #         #print("Assumed Role ARN:", assumed_role_arn)
- #         #print("Assumed Role Name:", assumed_role_name)
-
- #         print("Successfully assumed role with AWS STS")
-
- #     except Exception as e:
- #         print("Could not connect to AWS STS due to:", e)
-
-
 # Download direct from S3 - requires login credentials
 def download_file_from_s3(bucket_name:str, key:str, local_file_path:str, aws_access_key_textbox:str="", aws_secret_key_textbox:str="", RUN_AWS_FUNCTIONS=RUN_AWS_FUNCTIONS):
tools/combine_sheets_into_xlsx.py CHANGED
@@ -102,7 +102,6 @@ def csvs_to_excel(
     for idx, csv_path in enumerate(csv_files):
         # Use provided sheet name or derive from file name
         sheet_name = sheet_names[idx] if sheet_names and idx < len(sheet_names) else os.path.splitext(os.path.basename(csv_path))[0]
-        print("csv_path:", csv_path)
         df = pd.read_csv(csv_path)

         if sheet_name == "Original data":
@@ -160,7 +159,7 @@ def csvs_to_excel(

     wb.save(output_filename)

-    print(f"Workbook saved as '{output_filename}'")
+    print(f"Output xlsx summary saved as '{output_filename}'")

     return output_filename

@@ -243,10 +242,9 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     else:
         raise Exception("Could not find unique topic files to put into Excel format")
     if reference_table_csv_path:
-        #reference_table_csv_path = reference_table_csv_path[0]
         csv_files.append(reference_table_csv_path)
         sheet_names.append("Response level data")
-        column_widths["Response level data"] = {"A": 15, "B": 30, "C": 40, "G":100, "H":100}
+        column_widths["Response level data"] = {"A": 15, "B": 30, "C": 40, "H":100}
         wrap_text_columns["Response level data"] = ["C", "G"]
     else:
         raise Exception("Could not find any reference files to put into Excel format")
@@ -308,8 +306,6 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:
     sheet_names.append("Original data")
     column_widths["Original data"] = {"A": 20, "B": 20, "C": 20}
     wrap_text_columns["Original data"] = ["C"]
-
-    print("Creating intro page and text")

     # Intro page text
     intro_text = [
@@ -381,7 +377,7 @@ def collect_output_csvs_and_create_excel_output(in_data_files:List, chosen_cols:

     xlsx_output_filenames = [xlsx_output_filename]

-    return xlsx_output_filenames
+    return xlsx_output_filenames, xlsx_output_filenames
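As context for the column_widths and wrap_text_columns dictionaries this change edits, here is a small sketch of how such per-sheet settings are commonly applied with openpyxl (the helper name apply_sheet_formatting is illustrative, not from the repo):

from openpyxl import Workbook
from openpyxl.styles import Alignment

def apply_sheet_formatting(ws, widths: dict, wrap_cols: list):
    # widths maps column letters to widths, e.g. {"A": 15, "B": 30}
    for col_letter, width in widths.items():
        ws.column_dimensions[col_letter].width = width
    # wrap text in every cell of the listed columns, up to the sheet's current extent
    for col_letter in wrap_cols:
        for cell in ws[col_letter]:
            cell.alignment = Alignment(wrap_text=True, vertical="top")

wb = Workbook()
ws = wb.active
ws.title = "Response level data"
ws.append(["ID", "Topic", "Response"])
apply_sheet_formatting(ws, {"A": 15, "B": 30, "C": 40, "H": 100}, ["C", "G"])
wb.save("example.xlsx")

The doubled return (return xlsx_output_filenames, xlsx_output_filenames) presumably feeds two Gradio output components that expect the same file list, such as a download widget and a logging output.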
tools/config.py CHANGED
@@ -203,6 +203,7 @@ MAX_COMMENT_CHARS = int(get_or_create_env_var('MAX_COMMENT_CHARS', '14000'))

 RUN_LOCAL_MODEL = get_or_create_env_var("RUN_LOCAL_MODEL", "1")
 RUN_GEMINI_MODELS = get_or_create_env_var("RUN_GEMINI_MODELS", "1")
+RUN_AWS_BEDROCK_MODELS = get_or_create_env_var("RUN_AWS_BEDROCK_MODELS", "1")
 GEMINI_API_KEY = get_or_create_env_var('GEMINI_API_KEY', '')

 # Build up options for models
@@ -218,7 +219,7 @@ if RUN_LOCAL_MODEL == "1" and CHOSEN_LOCAL_MODEL_TYPE:
     model_short_names.append(CHOSEN_LOCAL_MODEL_TYPE)
     model_source.append("Local")

-if RUN_AWS_FUNCTIONS == "1":
+if RUN_AWS_BEDROCK_MODELS == "1":
     model_full_names.extend(["anthropic.claude-3-haiku-20240307-v1:0", "anthropic.claude-3-7-sonnet-20250219-v1:0"])
     model_short_names.extend(["haiku", "sonnet"])
     model_source.extend(["AWS", "AWS"])
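The new RUN_AWS_BEDROCK_MODELS flag follows the same env-var toggle pattern as the other settings in config.py. A minimal sketch of how that pattern works (the body of get_or_create_env_var here is an assumption, not the repo's implementation):

import os

def get_or_create_env_var(var_name: str, default_value: str) -> str:
    # Return the variable if already set, otherwise register and return the default
    value = os.environ.get(var_name)
    if value is None:
        os.environ[var_name] = default_value
        value = default_value
    return value

RUN_AWS_BEDROCK_MODELS = get_or_create_env_var("RUN_AWS_BEDROCK_MODELS", "1")

model_full_names, model_short_names, model_source = [], [], []
if RUN_AWS_BEDROCK_MODELS == "1":
    model_full_names.extend(["anthropic.claude-3-haiku-20240307-v1:0",
                             "anthropic.claude-3-7-sonnet-20250219-v1:0"])
    model_short_names.extend(["haiku", "sonnet"])
    model_source.extend(["AWS", "AWS"])

Splitting this flag out from RUN_AWS_FUNCTIONS means Bedrock model availability can now be toggled independently of the other AWS features.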
tools/dedup_summaries.py CHANGED
@@ -529,6 +529,7 @@ def summarise_output_topics(sampled_reference_table_df:pd.DataFrame,
     acc_number_of_calls = 0
     time_taken = 0
     out_metadata_str = "" # Output metadata is currently replaced on starting a summarisation task
+    out_message = list()

     tic = time.perf_counter()

@@ -573,8 +574,8 @@ def summarise_output_topics(sampled_reference_table_df:pd.DataFrame,
         progress(0.1, f"Loading in local model: {CHOSEN_LOCAL_MODEL_TYPE}")
         local_model, tokenizer = load_model(local_model_type=CHOSEN_LOCAL_MODEL_TYPE, repo_id=LOCAL_REPO_ID, model_filename=LOCAL_MODEL_FILE, model_dir=LOCAL_MODEL_FOLDER)

-    summary_loop_description = "Creating summaries. " + str(latest_summary_completed) + " summaries completed so far."
-    summary_loop = tqdm(range(latest_summary_completed, length_all_summaries), desc="Creating summaries", unit="summaries")
+    summary_loop_description = "Revising topic-level summaries. " + str(latest_summary_completed) + " summaries completed so far."
+    summary_loop = tqdm(range(latest_summary_completed, length_all_summaries), desc="Revising topic-level summaries", unit="summaries")

     if do_summaries == "Yes":

@@ -675,9 +676,13 @@ def summarise_output_topics(sampled_reference_table_df:pd.DataFrame,
     acc_input_tokens, acc_output_tokens, acc_number_of_calls = calculate_tokens_from_metadata(out_metadata_str, model_choice, model_name_map)

     toc = time.perf_counter()
-    time_taken = toc - tic
+    time_taken = toc - tic
+
+    out_message = '\n'.join(out_message)
+    out_message = out_message + " " + f"Topic summarisation finished processing. Total time: {time_taken:.2f}s"
+    print(out_message)

-    return sampled_reference_table_df, topic_summary_df_revised, reference_table_df_revised, output_files, summarised_outputs, latest_summary_completed, out_metadata_str, summarised_output_markdown, log_output_files, output_files, acc_input_tokens, acc_output_tokens, acc_number_of_calls, time_taken
+    return sampled_reference_table_df, topic_summary_df_revised, reference_table_df_revised, output_files, summarised_outputs, latest_summary_completed, out_metadata_str, summarised_output_markdown, log_output_files, output_files, acc_input_tokens, acc_output_tokens, acc_number_of_calls, time_taken, out_message

 @spaces.GPU(duration=120)
 def overall_summary(topic_summary_df:pd.DataFrame,
@@ -747,6 +752,7 @@ def overall_summary(topic_summary_df:pd.DataFrame,
     output_tokens_num = 0
     number_of_calls_num = 0
     time_taken = 0
+    out_message = list()

     tic = time.perf_counter()

@@ -792,7 +798,7 @@ def overall_summary(topic_summary_df:pd.DataFrame,
         local_model, tokenizer = load_model(local_model_type=CHOSEN_LOCAL_MODEL_TYPE, repo_id=LOCAL_REPO_ID, model_filename=LOCAL_MODEL_FILE, model_dir=LOCAL_MODEL_FOLDER)
         #print("Local model loaded:", local_model)

-    summary_loop = tqdm(unique_groups, desc="Creating summaries for groups", unit="groups")
+    summary_loop = tqdm(unique_groups, desc="Creating overall summary for groups", unit="groups")

     if do_summaries == "Yes":
         model_source = model_name_map[model_choice]["source"]
@@ -800,7 +806,7 @@ def overall_summary(topic_summary_df:pd.DataFrame,

         for summary_group in summary_loop:

-            print("Creating summary for group:", summary_group)
+            print("Creating overall summary for group:", summary_group)

             summary_text = topic_summary_df.loc[topic_summary_df["Group"]==summary_group].to_markdown(index=False)

@@ -879,6 +885,8 @@ def overall_summary(topic_summary_df:pd.DataFrame,
     toc = time.perf_counter()
     time_taken = toc - tic

-    print("All group summaries created. Time taken:", time_taken)
+    out_message = '\n'.join(out_message)
+    out_message = out_message + " " + f"Overall summary finished processing. Total time: {time_taken:.2f}s"
+    print(out_message)

-    return output_files, html_output_table, summarised_outputs_df, out_metadata_str, input_tokens_num, output_tokens_num, number_of_calls_num, time_taken
+    return output_files, html_output_table, summarised_outputs_df, out_metadata_str, input_tokens_num, output_tokens_num, number_of_calls_num, time_taken, out_message
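Both summarisation functions now collect status strings in out_message and emit one joined message at the end, which is also returned so the UI can display it. The pattern in isolation (the loop body here is illustrative):

import time

out_message = list()
tic = time.perf_counter()

for group in ["Group A", "Group B"]:
    # ... summarise this group ...
    out_message.append(f"Summary created for {group}.")

time_taken = time.perf_counter() - tic
out_message = '\n'.join(out_message)
out_message = out_message + " " + f"Overall summary finished processing. Total time: {time_taken:.2f}s"
print(out_message)  # the same string is returned for display in the app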
tools/llm_api_call.py CHANGED
@@ -17,7 +17,7 @@ GradioFileData = gr.FileData
 from tools.prompts import initial_table_prompt, prompt2, prompt3, initial_table_system_prompt, add_existing_topics_system_prompt, add_existing_topics_prompt, force_existing_topics_prompt, allow_new_topics_prompt, force_single_topic_prompt, add_existing_topics_assistant_prefill, initial_table_assistant_prefill, structured_summary_prompt
 from tools.helper_functions import read_file, put_columns_in_df, wrap_text, initial_clean, load_in_data_file, load_in_file, create_topic_summary_df_from_reference_table, convert_reference_table_to_pivot_table, get_basic_response_data, clean_column_name, load_in_previous_data_files, create_batch_file_path_details
 from tools.llm_funcs import ResponseObject, construct_gemini_generative_model, call_llm_with_markdown_table_checks, create_missing_references_df, calculate_tokens_from_metadata
-from tools.config import RUN_LOCAL_MODEL, AWS_REGION, MAX_COMMENT_CHARS, MAX_OUTPUT_VALIDATION_ATTEMPTS, MAX_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, RUN_AWS_FUNCTIONS, model_name_map, OUTPUT_FOLDER, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, LLM_SEED, MAX_GROUPS, REASONING_SUFFIX
+from tools.config import RUN_LOCAL_MODEL, AWS_REGION, MAX_COMMENT_CHARS, MAX_OUTPUT_VALIDATION_ATTEMPTS, MAX_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, model_name_map, OUTPUT_FOLDER, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, LLM_SEED, MAX_GROUPS, REASONING_SUFFIX
 from tools.aws_functions import connect_to_bedrock_runtime

 if RUN_LOCAL_MODEL == "1":
@@ -1283,6 +1283,7 @@ def wrapper_extract_topics_per_column_value(
     acc_input_tokens = 0
     acc_output_tokens = 0
     acc_number_of_calls = 0
+    out_message = list()

     if grouping_col is None:
         print("No grouping column found")
@@ -1321,8 +1322,14 @@ def wrapper_extract_topics_per_column_value(

     wrapper_first_loop = initial_first_loop_state

-    for i, group_value in tqdm(enumerate(unique_values), desc="Analysing by group", total=len(unique_values), unit="groups"):
-        print(f"\nProcessing segment: {grouping_col} = {group_value} ({i+1}/{len(unique_values)})")
+    if len(unique_values) == 1:
+        loop_object = enumerate(unique_values)
+    else:
+        loop_object = tqdm(enumerate(unique_values), desc="Analysing group", total=len(unique_values), unit="groups")
+
+
+    for i, group_value in loop_object:
+        print(f"\nProcessing group: {grouping_col} = {group_value} ({i+1}/{len(unique_values)})")

         filtered_file_data = file_data.copy()

@@ -1440,7 +1447,7 @@ def wrapper_extract_topics_per_column_value(
             acc_total_time_taken += float(seg_time_taken)
             acc_gradio_df = seg_gradio_df # Keep the latest Gradio DF

-            print(f"Segment {grouping_col} = {group_value} processed. Time: {seg_time_taken:.2f}s")
+            print(f"Group {grouping_col} = {group_value} processed. Time: {seg_time_taken:.2f}s")

         except Exception as e:
             print(f"Error processing segment {grouping_col} = {group_value}: {e}")
@@ -1481,7 +1488,9 @@ def wrapper_extract_topics_per_column_value(

     acc_input_tokens, acc_output_tokens, acc_number_of_calls = calculate_tokens_from_metadata(acc_whole_conversation_metadata, model_choice, model_name_map)

-    print(f"\nWrapper finished processing all segments. Total time: {acc_total_time_taken:.2f}s")
+    out_message = '\n'.join(out_message)
+    out_message = out_message + " " + f"Topic extraction finished processing all groups. Total time: {acc_total_time_taken:.2f}s"
+    print(out_message)

     # The return signature should match extract_topics.
     # The aggregated lists will be returned in the multiple slots.
@@ -1505,7 +1514,8 @@ def wrapper_extract_topics_per_column_value(
         acc_missing_df,
         acc_input_tokens,
         acc_output_tokens,
-        acc_number_of_calls
+        acc_number_of_calls,
+        out_message
     )
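The conditional wrapping of the loop in tqdm avoids rendering a progress bar when there is only one group to process. The pattern on its own (generic; not tied to the repo's data structures):

from tqdm import tqdm

def iterate_groups(unique_values):
    # Only show a progress bar when there is more than one group to track
    if len(unique_values) == 1:
        loop_object = enumerate(unique_values)
    else:
        loop_object = tqdm(enumerate(unique_values), desc="Analysing group",
                           total=len(unique_values), unit="groups")
    for i, group_value in loop_object:
        yield i, group_value

for i, group in iterate_groups(["All responses"]):
    print(f"Processing group {i + 1}: {group}")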
tools/llm_funcs.py CHANGED
@@ -18,7 +18,7 @@ full_text = "" # Define dummy source text (full text) just to enable highlight f
 model = list() # Define empty list for model functions to run
 tokenizer = list() #[] # Define empty list for model functions to run

-from tools.config import RUN_AWS_FUNCTIONS, AWS_REGION, LLM_TEMPERATURE, LLM_TOP_K, LLM_MIN_P, LLM_TOP_P, LLM_REPETITION_PENALTY, LLM_LAST_N_TOKENS, LLM_MAX_NEW_TOKENS, LLM_SEED, LLM_RESET, LLM_STREAM, LLM_THREADS, LLM_BATCH_SIZE, LLM_CONTEXT_LENGTH, LLM_SAMPLE, MAX_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, MAX_COMMENT_CHARS, RUN_LOCAL_MODEL, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, HF_TOKEN, LLM_SEED, LLM_MAX_GPU_LAYERS, SPECULATIVE_DECODING, NUM_PRED_TOKENS
+from tools.config import AWS_REGION, LLM_TEMPERATURE, LLM_TOP_K, LLM_MIN_P, LLM_TOP_P, LLM_REPETITION_PENALTY, LLM_LAST_N_TOKENS, LLM_MAX_NEW_TOKENS, LLM_SEED, LLM_RESET, LLM_STREAM, LLM_THREADS, LLM_BATCH_SIZE, LLM_CONTEXT_LENGTH, LLM_SAMPLE, MAX_TOKENS, TIMEOUT_WAIT, NUMBER_OF_RETRY_ATTEMPTS, MAX_TIME_FOR_LOOP, BATCH_SIZE_DEFAULT, DEDUPLICATION_THRESHOLD, MAX_COMMENT_CHARS, CHOSEN_LOCAL_MODEL_TYPE, LOCAL_REPO_ID, LOCAL_MODEL_FILE, LOCAL_MODEL_FOLDER, HF_TOKEN, LLM_SEED, LLM_MAX_GPU_LAYERS, SPECULATIVE_DECODING, NUM_PRED_TOKENS
 from tools.prompts import initial_table_assistant_prefill

 if SPECULATIVE_DECODING == "True": SPECULATIVE_DECODING = True
@@ -220,8 +220,6 @@ def load_model(local_model_type:str=CHOSEN_LOCAL_MODEL_TYPE,
     # Verify the device and cuda settings
     # Check if CUDA is enabled
     import torch
-    #if RUN_LOCAL_MODEL == "1":
-    #print("Running local model - importing llama-cpp-python")
     from llama_cpp import Llama
     from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

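For readers unfamiliar with what load_model wraps, a minimal hedged sketch of loading a GGUF model with llama-cpp-python (the repo id and filename below are placeholders, and the app's actual load_model handles far more configuration, including speculative decoding):

from llama_cpp import Llama

# from_pretrained fetches the GGUF file from the Hugging Face Hub on first use
llm = Llama.from_pretrained(
    repo_id="unsloth/gemma-3-4b-it-GGUF",  # placeholder repo id
    filename="*Q4_K_M.gguf",               # glob for the quantisation to fetch
    n_ctx=4096,                            # context length
    n_gpu_layers=0,                        # 0 = CPU only; -1 = offload all layers
    seed=42,
)
out = llm("List three topics in this text: ...", max_tokens=128)
print(out["choices"][0]["text"])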
windows_install_llama-cpp-python.txt CHANGED
@@ -1,26 +1,11 @@
 ---

-set PKG_CONFIG_PATH=C:\<path-to-openblas>\OpenBLAS\lib\pkgconfig # Set this in environment variables
-
-pip install llama-cpp-python==0.3.16 --force-reinstall --verbose --no-cache-dir -Ccmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS;-DBLAS_INCLUDE_DIRS=C:/<path-to-openblas>/OpenBLAS/include;-DBLAS_LIBRARIES=C:/<path-to-openblas>/OpenBLAS/lib/libopenblas.lib"
----
-
-# With CUDA
-
-pip install llama-cpp-python==0.3.16 --force-reinstall --no-cache-dir --verbose -C cmake.args="-DGGML_CUDA=on"
-
-
----
-
-
-How to Make it Work: Step-by-Step Guide
-To successfully run your command, you need to set up a proper C++ development environment.
-
-Step 1: Install the C++ Compiler
-Go to the Visual Studio downloads page.
-
-Scroll down to "Tools for Visual Studio" and download the "Build Tools for Visual Studio". This is a standalone installer that gives you the C++ compiler and libraries without installing the full Visual Studio IDE.
+# How to build llama-cpp-python on Windows: Step-by-Step Guide
+
+First, you need to set up a proper C++ development environment.
+
+# Step 1: Install the C++ Compiler
+Go to the Visual Studio downloads page. Scroll down past the main programs to "Tools for Visual Studio" and download the "Build Tools for Visual Studio". This is a standalone installer that gives you the C++ compiler and libraries without installing the full Visual Studio IDE.

 Run the installer. In the "Workloads" tab, check the box for "Desktop development with C++".

@@ -34,18 +19,17 @@ Windows 10 SDK (10.0.20348.0)

 Proceed with the installation.

-Need to use 'x64 Native Tools Command Prompt for VS 2022' to install. Run as administrator
-
-Step 2: Install CMake
-Go to the CMake download page.
+Need to use 'x64 Native Tools Command Prompt for VS 2022' to install the below. Run as administrator.
+
+# Step 2: Install CMake
+Go to the CMake download page: https://cmake.org/download

 Download the latest Windows installer (e.g., cmake-x.xx.x-windows-x86_64.msi).

 Run the installer. Crucially, when prompted, select the option to "Add CMake to the system PATH for all users" or "for the current user." This allows you to run cmake from any command prompt.


-Step 3: Download and Place OpenBLAS
+# Step 3: (FOR CPU INFERENCE ONLY) Download and Place OpenBLAS
 This is often the trickiest part.

 Go to the OpenBLAS releases on GitHub.
@@ -56,14 +40,12 @@ Create a folder somewhere easily accessible, for example, C:\libs\.

 Extract the contents of the OpenBLAS zip file into that folder. Your final directory structure should look something like this:

-Generated code
 C:\libs\OpenBLAS\
 β”œβ”€β”€ bin\
 β”œβ”€β”€ include\
 └── lib\
-Use code with caution.

-3.b. Install Chocolatey
+## 3.b. Install Chocolatey
 https://chocolatey.org/install

 Step 1: Install Chocolatey (if you don't already have it)
@@ -71,25 +53,20 @@ Open PowerShell as an Administrator. (Right-click the Start Menu -> "Windows Pow

 Run the following command to install Chocolatey. It's a single, long line:

-Generated powershell
 Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
-Use code with caution.
-Powershell
-Wait for it to finish. Once it's done, close the Administrator PowerShell window.
+
+Once it's done, close the Administrator PowerShell window.

 Step 2: Install pkg-config-lite using Chocolatey
-IMPORTANT: Open a NEW command prompt or PowerShell window (as a regular user is fine). This is necessary so it recognizes the new choco command.
+IMPORTANT: Open a NEW command prompt or PowerShell window (as a regular user is fine). This is necessary so it recognises the new choco command.

-Run the following command to install a lightweight version of pkg-config:
+Run the following command in console to install a lightweight version of pkg-config:

-Generated cmd
 choco install pkgconfiglite
-Use code with caution.
-Cmd
-Approve the installation by typing Y or A if prompted.
+
+Approve the installation by typing Y or A if prompted.

-Step 4: Run the Installation Command
+# Step 4: Run the Installation Command
 Now you have all the pieces. The final step is to run the command in a terminal that is aware of your new build environment.

 Open the "Developer Command Prompt for VS" from your Start Menu. This is important! This special command prompt automatically configures all the necessary paths for the C++ compiler.
@@ -98,22 +75,35 @@ Open the "Developer Command Prompt for VS" from your Start Menu. This is importa

 set PKG_CONFIG_PATH=C:\<path-to-openblas>\OpenBLAS\lib\pkgconfig # Set this in environment variables

-
 pip install llama-cpp-python==0.3.16 --force-reinstall --verbose --no-cache-dir -Ccmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS;-DBLAS_INCLUDE_DIRS=C:/<path-to-openblas>/OpenBLAS/include;-DBLAS_LIBRARIES=C:/<path-to-openblas>/OpenBLAS/lib/libopenblas.lib"

-## With Cuda
+or, to build a wheel for future installs instead (note that pip wheel, not pip install, is the command that takes --wheel-dir):
+
+pip wheel llama-cpp-python==0.3.16 --wheel-dir dist --verbose --no-cache-dir -Ccmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS;-DBLAS_INCLUDE_DIRS=C:/<path-to-openblas>/OpenBLAS/include;-DBLAS_LIBRARIES=C:/<path-to-openblas>/OpenBLAS/lib/libopenblas.lib"
+
+## With CUDA (NVIDIA GPUs only)
+
+Make sure that you have the CUDA 12.4 toolkit for Windows installed: https://developer.nvidia.com/cuda-12-4-0-download-archive

-### Make sure you are using the x64 version of Developer command tools, e.g. 'x64 Native Tools Command Prompt for VS 2022' ###
+### Make sure you are using the x64 version of Developer command tools for the below, e.g. 'x64 Native Tools Command Prompt for VS 2022' ###

 Use NVIDIA GPU (cuBLAS): If you have an NVIDIA GPU, using cuBLAS is often easier because the CUDA Toolkit installer handles most of the setup.

 Install the NVIDIA CUDA Toolkit.

-Run the install command specifying cuBLAS:
+Run the install command specifying CUDA (for faster inference):

-set PKG_CONFIG_PATH=C:\<path-to-openblas>\OpenBLAS\lib\pkgconfig # Set this in environment variables
+pip install llama-cpp-python==0.3.16 --force-reinstall --verbose -C cmake.args="-DGGML_CUDA=on"

-pip install llama-cpp-python==0.3.16 --force-reinstall --no-cache-dir --verbose -C cmake.args="-DGGML_CUDA=on"
+If you want to create a new wheel to help with future installs, first cd to a folder that you have write access to, then run:
+
+pip wheel llama-cpp-python==0.3.16 --wheel-dir dist --verbose -C cmake.args="-DGGML_CUDA=on"
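Whichever route you take, it is worth verifying the build afterwards. A minimal hedged check in Python (assumes llama-cpp-python 0.3.x; llama_print_system_info is part of its low-level ctypes bindings and reports compile-time feature flags, so BLAS or CUDA should appear in its output if the build picked them up):

import llama_cpp

# Prints compile-time feature flags, e.g. whether BLAS or CUDA kernels were compiled in
print(llama_cpp.llama_print_system_info().decode())

# Should return True on a CUDA-enabled build that can offload layers to the GPU
print(llama_cpp.llama_supports_gpu_offload())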