---
license: llama3
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- Text Generation
- Transformers
- llama
- llama-3
- 8B
- nvidia
- facebook
- meta
- LLM
- fine-tuned
- insurance
- research
- pytorch
- instruct
- chatqa-1.5
- chatqa
- finetune
- gpt4
- conversational
- text-generation-inference
- Inference Endpoints
datasets:
- InsuranceQA

base_model: "nvidia/Llama3-ChatQA-1.5-8B"
finetuned: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B"
quantized: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
---

# Open-Insurance-LLM-Llama3-8B-GGUF

This model is a GGUF-quantized version of an insurance domain-specific language model based on nvidia/Llama3-ChatQA-1.5-8B, fine-tuned for insurance-related queries and conversations.


## Model Details

- **Model Type:** Quantized Language Model (GGUF format)
- **Base Model:** nvidia/Llama3-ChatQA-1.5-8B
- **Finetuned Model:** Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B
- **Quantized Model:** Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF
- **Model Architecture:** Llama
- **Quantization:** 8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), 16-bit
- **Fine-tuning Dataset:** [InsuranceQA](https://github.com/shuzi/insuranceQA)
- **Developer:** Raj Maharajwala
- **License:** llama3
- **Language:** English 
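
The repository ships several quantization levels of the same fine-tuned weights. Below is a minimal sketch for listing the available GGUF files and downloading one of them with `huggingface_hub` (it assumes `huggingface-hub` is installed; the Q4_K_M file name matches the one used in the inference script further down):

```python
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"

# List the GGUF files available in the repository
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
print("Available quantizations:", gguf_files)

# Download the 4-bit (Q4_K_M) variant into a local directory
model_path = hf_hub_download(
    repo_id,
    filename="open-insurance-llm-q4_k_m.gguf",
    local_dir="gguf_dir",
)
print("Model downloaded to:", model_path)
```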

## Setup Instructions

### Environment Setup

#### For Windows
```bash
python -m venv .venv_open_insurance_llm
.\.venv_open_insurance_llm\Scripts\activate
```

#### For Mac/Linux
```bash
python3 -m venv .venv_open_insurance_llm
source .venv_open_insurance_llm/bin/activate
```

### Installation

#### For Mac Users (Metal Support)
```bash
export FORCE_CMAKE=1
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python==0.3.2 --no-cache-dir
```
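
After installing, you can optionally confirm that the build reports GPU offload support. This is a minimal check and assumes your installed llama-cpp-python version exposes the `llama_supports_gpu_offload` binding:

```python
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# Should print True if the wheel was built with Metal (or another GPU backend) enabled
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```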

#### For Windows Users (CPU Support)
```bash
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
```

### Dependencies

Then install the remaining dependencies from `inference_requirements.txt` (attached under `Files and versions`):
```bash
pip install -r inference_requirements.txt
```
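
If you want to sanity-check the setup before running the full interactive script below, here is a minimal single-prompt sketch. It assumes your llama-cpp-python version provides the `Llama.from_pretrained` convenience constructor (which downloads the GGUF file via `huggingface-hub`); the prompt follows the same `System:` / `User:` / `Assistant:` layout the inference script builds.

```python
from llama_cpp import Llama

# Downloads (or reuses) the Q4_K_M GGUF file and loads it
llm = Llama.from_pretrained(
    repo_id="Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF",
    filename="open-insurance-llm-q4_k_m.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all layers if a GPU backend (e.g. Metal) is available
    verbose=False,
)

system = (
    "This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
prompt = f"System: {system}\n\nUser: What is a deductible in health insurance?\n\nAssistant:"

out = llm.create_completion(prompt, max_tokens=256, temperature=0.1, top_p=0.2, top_k=15)
print(out["choices"][0]["text"].strip())
```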

## Inference Loop

```python
# Attached under `Files and Versions` (inference_open-insurance-llm-gguf.py)
import os
import time
from pathlib import Path
from llama_cpp import Llama
from rich.console import Console
from huggingface_hub import hf_hub_download
from dataclasses import dataclass
from typing import List, Dict, Tuple

@dataclass
class ModelConfig:
    # Optimized parameters for coherent responses and efficient performance on devices like MacBook Air M2
    model_name: str = "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
    model_file: str = "open-insurance-llm-q4_k_m.gguf"
    # model_file: str = "open-insurance-llm-q8_0.gguf"  # 8-bit quantization; higher precision, better quality, increased resource usage
    # model_file: str = "open-insurance-llm-q5_k_m.gguf"  # 5-bit quantization; balance between performance and resource efficiency
    max_tokens: int = 1000  # Maximum number of tokens to generate in a single output
    temperature: float = 0.1  # Controls randomness; lower values scale the logit distribution toward more focused, coherent responses
    top_k: int = 15  # After temperature scaling, consider only the 15 most probable tokens during sampling
    top_p: float = 0.2  # Within those top-k tokens, apply nucleus sampling with a cumulative probability cutoff of 0.2
    repeat_penalty: float = 1.2  # Penalize repeated tokens to reduce redundancy
    num_beams: int = 4  # Beam width for beam search; note that llama-cpp-python's sampling API may ignore this parameter
    n_gpu_layers: int = -2  # Number of layers to offload to GPU; -1 for full GPU utilization, -2 for automatic configuration
    n_ctx: int = 2048  # Context window size; Llama 3 models support up to 8192 tokens context length
    n_batch: int = 256  # Number of tokens to process simultaneously; adjust based on available hardware (suggested 512)
    verbose: bool = False  # True for enabling verbose logging for debugging purposes
    use_mmap: bool = False  # Memory-map model to reduce RAM usage; set to True if running on limited memory systems
    use_mlock: bool = True  # Lock model into RAM to prevent swapping; improves performance on systems with sufficient RAM
    offload_kqv: bool = True  # Offload key, query, value matrices to GPU to accelerate inference



class InsuranceLLM:
    def __init__(self, config: ModelConfig):
        self.config = config
        self.llm_ctx = None
        self.console = Console()
        self.conversation_history: List[Dict[str, str]] = []
        
        self.system_message = (
            "This is a chat between a user and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. "
            "The assistant should also indicate when the answer cannot be found in the context. "
            "You are an expert from the Insurance domain with extensive insurance knowledge and "
            "professional writer skills, especially about insurance policies. "
            "Your name is OpenInsuranceLLM, and you were developed by Raj Maharajwala. "
            "You are willing to help answer the user's query with a detailed explanation. "
            "In your explanation, leverage your deep insurance expertise, such as relevant insurance policies, "
            "complex coverage plans, or other pertinent insurance concepts. Use precise insurance terminology while "
            "still aiming to make the explanation clear and accessible to a general audience."
        )

    def download_model(self) -> str:
        try:
            with self.console.status("[bold green]Downloading model..."):
                model_path = hf_hub_download(
                    self.config.model_name,
                    filename=self.config.model_file,
                    local_dir=os.path.join(os.getcwd(), 'gguf_dir')
                )
            return model_path
        except Exception as e:
            self.console.print(f"[red]Error downloading model: {str(e)}[/red]")
            raise

    def load_model(self) -> None:
        try:
            quantized_path = os.path.join(os.getcwd(), "gguf_dir")
            directory = Path(quantized_path)

            try:
                model_path = str(list(directory.glob(self.config.model_file))[0])
            except IndexError:
                model_path = self.download_model()

            with self.console.status("[bold green]Loading model..."):
                self.llm_ctx = Llama(
                    model_path=model_path,
                    n_gpu_layers=self.config.n_gpu_layers,
                    n_ctx=self.config.n_ctx,
                    n_batch=self.config.n_batch,
                    num_beams=self.config.num_beams,
                    verbose=self.config.verbose,
                    use_mlock=self.config.use_mlock,
                    use_mmap=self.config.use_mmap,
                    offload_kqv=self.config.offload_kqv
                )
        except Exception as e:
            self.console.print(f"[red]Error loading model: {str(e)}[/red]")
            raise

    def build_conversation_prompt(self, new_question: str, context: str = "") -> str:
        prompt = f"System: {self.system_message}\n\n"
        
        # Add conversation history
        for exchange in self.conversation_history:
            prompt += f"User: {exchange['user']}\n\n"
            prompt += f"Assistant: {exchange['assistant']}\n\n"
        
        # Add the new question
        if context:
            prompt += f"User: Context: {context}\nQuestion: {new_question}\n\n"
        else:
            prompt += f"User: {new_question}\n\n"
            
        prompt += "Assistant:"
        return prompt

    def generate_response(self, prompt: str) -> Tuple[str, int, float]:
        if not self.llm_ctx:
            raise RuntimeError("Model not loaded. Call load_model() first.")
        
        self.console.print("[bold cyan]Assistant: [/bold cyan]", end="")
        complete_response = ""
        token_count = 0
        start_time = time.time()

        try:
            for chunk in self.llm_ctx.create_completion(
                prompt,
                max_tokens=self.config.max_tokens,
                top_k=self.config.top_k,
                top_p=self.config.top_p,
                temperature=self.config.temperature,
                repeat_penalty=self.config.repeat_penalty,
                stream=True
            ):
                text_chunk = chunk["choices"][0]["text"]
                complete_response += text_chunk
                token_count += 1
                print(text_chunk, end="", flush=True)
            
            elapsed_time = time.time() - start_time
            print()
            return complete_response, token_count, elapsed_time
        except Exception as e:
            self.console.print(f"\n[red]Error generating response: {str(e)}[/red]")
            return "I encountered an error while generating a response. Please try again or ask a different question.", 0, 0

    def run_chat(self):
        try:
            self.load_model()
            self.console.print("\n[bold green]Welcome to Open-Insurance-LLM![/bold green]")
            self.console.print("Enter your questions (type '/bye', 'exit', or 'quit' to end the session, or '/reset' to clear the conversation history)\n")
            self.console.print("Optional: You can provide context by typing 'context:' followed by your context, then 'question:' followed by your question\n")
            self.console.print("Your conversation history will be maintained for context-aware responses.\n")
            
            total_tokens = 0
            
            while True:
                try:
                    user_input = self.console.input("[bold cyan]User:[/bold cyan] ").strip()

                    if user_input.lower() in ["exit", "/bye", "quit"]:
                        self.console.print(f"\n[dim]Total tokens: {total_tokens}[/dim]")
                        self.console.print("\n[bold green]Thank you for using OpenInsuranceLLM![/bold green]")
                        break

                    # Reset conversation with command
                    if user_input.lower() == "/reset":
                        self.conversation_history = []
                        self.console.print("[yellow]Conversation history has been reset.[/yellow]")
                        continue

                    context = ""
                    question = user_input
                    if "context:" in user_input.lower() and "question:" in user_input.lower():
                        parts = user_input.split("question:", 1)
                        context = parts[0].replace("context:", "").strip()
                        question = parts[1].strip()

                    prompt = self.build_conversation_prompt(question, context)
                    response, tokens, elapsed_time = self.generate_response(prompt)
                    
                    # Add to conversation history
                    self.conversation_history.append({
                        "user": question,
                        "assistant": response
                    })
                    
                    # Update total tokens
                    total_tokens += tokens
                    
                    # Print metrics
                    tokens_per_sec = tokens / elapsed_time if elapsed_time > 0 else 0
                    self.console.print(
                        f"[dim]Tokens: {tokens} || " +
                        f"Time: {elapsed_time:.2f}s || " +
                        f"Speed: {tokens_per_sec:.2f} tokens/sec[/dim]"
                    )
                    print()  # Add a blank line after each response
                    
                except KeyboardInterrupt:
                    self.console.print("\n[yellow]Input interrupted. Type '/bye', 'exit', or 'quit' to quit.[/yellow]")
                    continue
                except Exception as e:
                    self.console.print(f"\n[red]Error processing input: {str(e)}[/red]")
                    continue
        except Exception as e:
            self.console.print(f"\n[red]Fatal error: {str(e)}[/red]")
        finally:
            if self.llm_ctx:
                del self.llm_ctx


def main():
    try:
        config = ModelConfig()
        llm = InsuranceLLM(config)
        llm.run_chat()
    except KeyboardInterrupt:
        print("\nProgram interrupted by user")
    except Exception as e:
        print(f"\nApplication error: {str(e)}")


if __name__ == "__main__":
    main()
```

```bash
python3 inference_open-insurance-llm-gguf.py
```
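
During the session, the script recognizes a few plain-text commands: `/bye`, `exit`, or `quit` ends the chat, and `/reset` clears the conversation history. To ground a question in a specific passage, enter a single line of the form `context: <policy excerpt> question: <your question>`; the script splits it into context and question before building the prompt.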

### Nvidia Llama3-ChatQA Paper
arXiv: [https://arxiv.org/pdf/2401.10225](https://arxiv.org/pdf/2401.10225)

## Use Cases

This model is specifically designed for:
- Insurance policy understanding and explanation
- Claims processing assistance
- Coverage analysis
- Insurance terminology clarification
- Policy comparison and recommendations
- Risk assessment queries
- Insurance compliance questions

## Limitations

- The model's knowledge is limited to its training data cutoff
- Should not be used as a replacement for professional insurance advice
- May occasionally generate plausible-sounding but incorrect information

## Bias and Ethics

This model should be used with awareness that:
- It may reflect biases present in insurance industry training data
- Output should be verified by insurance professionals for critical decisions
- It should not be used as the sole basis for insurance decisions
- The model's responses should be treated as informational, not as legal or professional advice

## Citation and Attribution

If you use the base, fine-tuned, or quantized model in your research or applications, please cite:
```
@misc{maharajwala2024openinsurance,
  author = {Raj Maharajwala},
  title = {Open-Insurance-LLM-Llama3-8B-GGUF},
  year = {2024},
  publisher = {HuggingFace},
  linkedin = {https://www.linkedin.com/in/raj6800/},
  url = {https://huggingface.co/Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF}
}
```