---
license: llama3
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- Text Generation
- Transformers
- llama
- llama-3
- 8B
- nvidia
- facebook
- meta
- LLM
- fine-tuned
- insurance
- research
- pytorch
- instruct
- chatqa-1.5
- chatqa
- finetune
- gpt4
- conversational
- text-generation-inference
- Inference Endpoints
datasets:
- InsuranceQA
base_model: "nvidia/Llama3-ChatQA-1.5-8B"
finetuned: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B"
quantized: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
---
# Open-Insurance-LLM-Llama3-8B-GGUF
This model is a GGUF-quantized version of an insurance domain-specific language model based on NVIDIA's Llama3-ChatQA-1.5-8B, fine-tuned for insurance-related queries and conversations.
## Model Details
- **Model Type:** Quantized Language Model (GGUF format)
- **Base Model:** nvidia/Llama3-ChatQA-1.5-8B
- **Finetuned Model:** Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B
- **Quantized Model:** Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF
- **Model Architecture:** Llama
- **Quantization:** 8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), 16-bit (see the download sketch after this list)
- **Finetuned Dataset:** [InsuranceQA](https://github.com/shuzi/insuranceQA)
- **Developer:** Raj Maharajwala
- **License:** llama3
- **Language:** English
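Each quantization variant is shipped as a separate `.gguf` file in the quantized repository. As a minimal sketch (the file names are the ones referenced by the inference script below; the `gguf_dir` target directory is only illustrative), a single variant can be fetched with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download one quantization variant from the GGUF repository.
# Files referenced by the inference script below:
#   open-insurance-llm-q4_k_m.gguf, open-insurance-llm-q5_k_m.gguf, open-insurance-llm-q8_0.gguf
model_path = hf_hub_download(
    repo_id="Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF",
    filename="open-insurance-llm-q4_k_m.gguf",
    local_dir="gguf_dir",  # illustrative local directory
)
print(model_path)
```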
## Setup Instructions
### Environment Setup
#### For Windows
```bash
python -m venv .venv_open_insurance_llm
.\.venv_open_insurance_llm\Scripts\activate
```
#### For Mac/Linux
```bash
python3 -m venv .venv_open_insurance_llm
source .venv_open_insurance_llm/bin/activate
```
### Installation
#### For Mac Users (Metal Support)
```bash
export FORCE_CMAKE=1
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python==0.3.2 --no-cache-dir
```
#### For Windows Users (CPU Support)
```bash
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
```
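A quick, optional sanity check (not part of the original setup steps) can confirm that the build imports correctly and, on Mac, that GPU offload is available; `llama_supports_gpu_offload` comes from llama-cpp-python's low-level bindings:

```python
# Optional sanity check for the llama-cpp-python install
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# True if the compiled backend can offload layers to a GPU (e.g. Metal on Apple Silicon)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```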
### Dependencies
Then install the remaining dependencies from `inference_requirements.txt` (attached under `Files and Versions`):
```bash
pip install -r inference_requirements.txt
```
## Inference Loop
```python
# Attached under `Files and Versions` (inference_open-insurance-llm-gguf.py)
import os
import time
from pathlib import Path
from llama_cpp import Llama
from rich.console import Console
from huggingface_hub import hf_hub_download
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple
@dataclass
class ModelConfig:
    # Parameters tuned for coherent responses and efficient performance on devices like a MacBook Air M2
    model_name: str = "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
    model_file: str = "open-insurance-llm-q4_k_m.gguf"
    # model_file: str = "open-insurance-llm-q8_0.gguf"    # 8-bit quantization; higher precision, better quality, more resource usage
    # model_file: str = "open-insurance-llm-q5_k_m.gguf"  # 5-bit quantization; balance between quality and resource efficiency
    max_tokens: int = 1000        # Maximum number of tokens to generate in a single response
    temperature: float = 0.1      # Scales the probability distribution; lower values produce more deterministic, coherent responses
    top_k: int = 15               # After temperature scaling, consider only the 15 most probable tokens
    top_p: float = 0.2            # Within that set, apply nucleus sampling with a cumulative probability of 20%
    repeat_penalty: float = 1.2   # Penalize repeated tokens to reduce redundancy
    num_beams: int = 4            # Number of beams for beam search; higher values improve quality at the cost of speed
    n_gpu_layers: int = -2        # Number of layers to offload to GPU; -1 for full GPU utilization, -2 for automatic configuration
    n_ctx: int = 2048             # Context window size; Llama 3 models support up to 8192 tokens of context
    n_batch: int = 256            # Number of tokens processed simultaneously; adjust to available hardware (512 suggested)
    verbose: bool = False         # Set True to enable verbose logging for debugging
    use_mmap: bool = False        # Memory-map the model to reduce RAM usage; set True on memory-constrained systems
    use_mlock: bool = True        # Lock the model in RAM to prevent swapping; improves performance when RAM is sufficient
    offload_kqv: bool = True      # Offload key/query/value matrices to the GPU to accelerate inference
class InsuranceLLM:
    def __init__(self, config: ModelConfig):
        self.config = config
        self.llm_ctx = None
        self.console = Console()
        self.conversation_history: List[Dict[str, str]] = []
        self.system_message = (
            "This is a chat between a user and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. "
            "The assistant should also indicate when the answer cannot be found in the context. "
            "You are an expert from the Insurance domain with extensive insurance knowledge and "
            "professional writer skills, especially about insurance policies. "
            "Your name is OpenInsuranceLLM, and you were developed by Raj Maharajwala. "
            "You are willing to help answer the user's query with a detailed explanation. "
            "In your explanation, leverage your deep insurance expertise, such as relevant insurance policies, "
            "complex coverage plans, or other pertinent insurance concepts. Use precise insurance terminology while "
            "still aiming to make the explanation clear and accessible to a general audience."
        )
    def download_model(self) -> str:
        try:
            with self.console.status("[bold green]Downloading model..."):
                model_path = hf_hub_download(
                    self.config.model_name,
                    filename=self.config.model_file,
                    local_dir=os.path.join(os.getcwd(), 'gguf_dir')
                )
            return model_path
        except Exception as e:
            self.console.print(f"[red]Error downloading model: {str(e)}[/red]")
            raise
    def load_model(self) -> None:
        try:
            quantized_path = os.path.join(os.getcwd(), "gguf_dir")
            directory = Path(quantized_path)
            try:
                model_path = str(list(directory.glob(self.config.model_file))[0])
            except IndexError:
                model_path = self.download_model()
            with self.console.status("[bold green]Loading model..."):
                self.llm_ctx = Llama(
                    model_path=model_path,
                    n_gpu_layers=self.config.n_gpu_layers,
                    n_ctx=self.config.n_ctx,
                    n_batch=self.config.n_batch,
                    num_beams=self.config.num_beams,
                    verbose=self.config.verbose,
                    use_mlock=self.config.use_mlock,
                    use_mmap=self.config.use_mmap,
                    offload_kqv=self.config.offload_kqv
                )
        except Exception as e:
            self.console.print(f"[red]Error loading model: {str(e)}[/red]")
            raise
    def build_conversation_prompt(self, new_question: str, context: str = "") -> str:
        prompt = f"System: {self.system_message}\n\n"
        # Add conversation history
        for exchange in self.conversation_history:
            prompt += f"User: {exchange['user']}\n\n"
            prompt += f"Assistant: {exchange['assistant']}\n\n"
        # Add the new question
        if context:
            prompt += f"User: Context: {context}\nQuestion: {new_question}\n\n"
        else:
            prompt += f"User: {new_question}\n\n"
        prompt += "Assistant:"
        return prompt
    def generate_response(self, prompt: str) -> Tuple[str, int, float]:
        if not self.llm_ctx:
            raise RuntimeError("Model not loaded. Call load_model() first.")
        self.console.print("[bold cyan]Assistant: [/bold cyan]", end="")
        complete_response = ""
        token_count = 0
        start_time = time.time()
        try:
            for chunk in self.llm_ctx.create_completion(
                prompt,
                max_tokens=self.config.max_tokens,
                top_k=self.config.top_k,
                top_p=self.config.top_p,
                temperature=self.config.temperature,
                repeat_penalty=self.config.repeat_penalty,
                stream=True
            ):
                text_chunk = chunk["choices"][0]["text"]
                complete_response += text_chunk
                token_count += 1
                print(text_chunk, end="", flush=True)
            elapsed_time = time.time() - start_time
            print()
            return complete_response, token_count, elapsed_time
        except Exception as e:
            self.console.print(f"\n[red]Error generating response: {str(e)}[/red]")
            return "I encountered an error while generating a response. Please try again or ask a different question.", 0, 0
    def run_chat(self):
        try:
            self.load_model()
            self.console.print("\n[bold green]Welcome to Open-Insurance-LLM![/bold green]")
            self.console.print("Enter your questions (type '/bye', 'exit', or 'quit' to end the session)\n")
            self.console.print("Optional: You can provide context by typing 'context:' followed by your context, then 'question:' followed by your question\n")
            self.console.print("Your conversation history will be maintained for context-aware responses.\n")
            total_tokens = 0
            while True:
                try:
                    user_input = self.console.input("[bold cyan]User:[/bold cyan] ").strip()
                    if user_input.lower() in ["exit", "/bye", "quit"]:
                        self.console.print(f"\n[dim]Total tokens: {total_tokens}[/dim]")
                        self.console.print("\n[bold green]Thank you for using OpenInsuranceLLM![/bold green]")
                        break
                    # Reset conversation with command
                    if user_input.lower() == "/reset":
                        self.conversation_history = []
                        self.console.print("[yellow]Conversation history has been reset.[/yellow]")
                        continue
                    context = ""
                    question = user_input
                    if "context:" in user_input.lower() and "question:" in user_input.lower():
                        parts = user_input.split("question:", 1)
                        context = parts[0].replace("context:", "").strip()
                        question = parts[1].strip()
                    prompt = self.build_conversation_prompt(question, context)
                    response, tokens, elapsed_time = self.generate_response(prompt)
                    # Add to conversation history
                    self.conversation_history.append({
                        "user": question,
                        "assistant": response
                    })
                    # Update total tokens
                    total_tokens += tokens
                    # Print metrics
                    tokens_per_sec = tokens / elapsed_time if elapsed_time > 0 else 0
                    self.console.print(
                        f"[dim]Tokens: {tokens} || " +
                        f"Time: {elapsed_time:.2f}s || " +
                        f"Speed: {tokens_per_sec:.2f} tokens/sec[/dim]"
                    )
                    print()  # Add a blank line after each response
                except KeyboardInterrupt:
                    self.console.print("\n[yellow]Input interrupted. Type '/bye', 'exit', or 'quit' to quit.[/yellow]")
                    continue
                except Exception as e:
                    self.console.print(f"\n[red]Error processing input: {str(e)}[/red]")
                    continue
        except Exception as e:
            self.console.print(f"\n[red]Fatal error: {str(e)}[/red]")
        finally:
            if self.llm_ctx:
                del self.llm_ctx
def main():
    try:
        config = ModelConfig()
        llm = InsuranceLLM(config)
        llm.run_chat()
    except KeyboardInterrupt:
        print("\nProgram interrupted by user")
    except Exception as e:
        print(f"\nApplication error: {str(e)}")

if __name__ == "__main__":
    main()
```
Run the inference script:
```bash
python3 inference_open-insurance-llm-gguf.py
```
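For non-interactive, programmatic use, the following minimal sketch reuses the Q4_K_M file and the ChatQA-style `System/User/Assistant` prompt layout from the script above; the question and sampling values shown here are illustrative only:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch the Q4_K_M variant (the same file the chat script uses) and load it
model_path = hf_hub_download(
    "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF",
    filename="open-insurance-llm-q4_k_m.gguf",
    local_dir="gguf_dir",
)
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=-1, verbose=False)

# ChatQA-style prompt, mirroring build_conversation_prompt() in the script above
system = (
    "This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
question = "What does a standard homeowners insurance policy typically cover?"  # illustrative question
prompt = f"System: {system}\n\nUser: {question}\n\nAssistant:"

out = llm.create_completion(
    prompt,
    max_tokens=512,
    temperature=0.1,
    top_k=15,
    top_p=0.2,
    repeat_penalty=1.2,
)
print(out["choices"][0]["text"].strip())
```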
### NVIDIA Llama3-ChatQA Paper
arXiv: [https://arxiv.org/pdf/2401.10225](https://arxiv.org/pdf/2401.10225)
## Use Cases
This model is specifically designed for:
- Insurance policy understanding and explanation
- Claims processing assistance
- Coverage analysis
- Insurance terminology clarification
- Policy comparison and recommendations
- Risk assessment queries
- Insurance compliance questions
## Limitations
- The model's knowledge is limited to its training data cutoff
- Should not be used as a replacement for professional insurance advice
- May occasionally generate plausible-sounding but incorrect information
## Bias and Ethics
This model should be used with awareness that:
- It may reflect biases present in insurance industry training data
- Output should be verified by insurance professionals for critical decisions
- It should not be used as the sole basis for insurance decisions
- The model's responses should be treated as informational, not as legal or professional advice
## Citation and Attribution
If you use the base model or the quantized model in your research or applications, please cite:
```bibtex
@misc{maharajwala2024openinsurance,
  author = {Raj Maharajwala},
  title = {Open-Insurance-LLM-Llama3-8B-GGUF},
  year = {2024},
  publisher = {Hugging Face},
  linkedin = {https://www.linkedin.com/in/raj6800/},
  url = {https://huggingface.co/Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF}
}
```