|
--- |
|
license: llama3 |
|
language: |
|
- en |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
tags: |
|
- Text Generation |
|
- Transformers |
|
- llama |
|
- llama-3 |
|
- 8B |
|
- nvidia |
|
- facebook |
|
- meta |
|
- LLM |
|
- fine-tuned |
|
- insurance |
|
- research |
|
- pytorch |
|
- instruct |
|
- chatqa-1.5 |
|
- chatqa |
|
- finetune |
|
- gpt4 |
|
- conversational |
|
- text-generation-inference |
|
- Inference Endpoints |
|
datasets: |
|
- InsuranceQA |
|
|
|
base_model: "nvidia/Llama3-ChatQA-1.5-8B" |
|
finetuned: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B" |
|
quantized: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF" |
|
--- |
|
|
|
# Open-Insurance-LLM-Llama3-8B-GGUF |
|
|
|
This model is a GGUF-quantized version of an insurance domain-specific language model based on Nvidia's Llama3-ChatQA-1.5-8B, fine-tuned for insurance-related queries and conversations.
|
|
|
|
|
## Model Details |
|
|
|
- **Model Type:** Quantized Language Model (GGUF format) |
|
- **Base Model:** nvidia/Llama3-ChatQA-1.5-8B |
|
- **Finetuned Model:** Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B |
|
- **Quantized Model:** Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF |
|
- **Model Architecture:** Llama |
|
- **Quantization:** 8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), 16-bit (see the download example below)
|
- **Finetuned Dataset:** InsuranceQA (https://github.com/shuzi/insuranceQA)
|
- **Developer:** Raj Maharajwala |
|
- **License:** llama3 |
|
- **Language:** English |
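
A minimal sketch for fetching one of the quantization variants listed above (assuming the GGUF filenames used by the inference script further below, e.g. `open-insurance-llm-q4_k_m.gguf`):

```python
from huggingface_hub import hf_hub_download

# Download one GGUF variant into a local directory; swap the filename for the
# Q8_0 or Q5_K_M file to trade quality against memory and speed.
model_path = hf_hub_download(
    repo_id="Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF",
    filename="open-insurance-llm-q4_k_m.gguf",
    local_dir="gguf_dir",
)
print(model_path)
```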
|
|
|
## Setup Instructions |
|
|
|
### Environment Setup |
|
|
|
#### For Windows |
|
```bash |
|
python -m venv .venv_open_insurance_llm
|
.\.venv_open_insurance_llm\Scripts\activate |
|
``` |
|
|
|
#### For Mac/Linux |
|
```bash |
|
python3 -m venv .venv_open_insurance_llm |
|
source .venv_open_insurance_llm/bin/activate |
|
``` |
|
|
|
### Installation |
|
|
|
#### For Mac Users (Metal Support) |
|
```bash |
|
export FORCE_CMAKE=1 |
|
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python==0.3.2 --no-cache-dir |
|
``` |
|
|
|
#### For Windows Users (CPU Support) |
|
```bash |
|
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu |
|
``` |
|
|
|
### Dependencies |
|
|
|
Then install the dependencies from `inference_requirements.txt` (attached under `Files and Versions`):
|
```bash |
|
pip install -r inference_requirements.txt |
|
``` |
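
To verify the install before running the full script, a quick import check (a minimal sketch; it only confirms that llama-cpp-python loads, and assumes the installed release exposes `__version__` as recent versions do):

```python
# Confirm that llama-cpp-python is importable and print its version.
import llama_cpp
print("llama-cpp-python version:", llama_cpp.__version__)
```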
|
|
|
## Inference Loop |
|
|
|
```python |
|
# Attached under `Files and Versions` (inference_open-insurance-llm-gguf.py) |
|
import os |
|
import time |
|
from pathlib import Path |
|
from llama_cpp import Llama |
|
from rich.console import Console |
|
from huggingface_hub import hf_hub_download |
|
from dataclasses import dataclass |
|
from typing import List, Dict, Any, Tuple |
|
|
|
@dataclass |
|
class ModelConfig: |
|
# Optimized parameters for coherent responses and efficient performance on devices like MacBook Air M2 |
|
model_name: str = "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF" |
|
model_file: str = "open-insurance-llm-q4_k_m.gguf" |
|
# model_file: str = "open-insurance-llm-q8_0.gguf" # 8-bit quantization; higher precision, better quality, increased resource usage |
|
# model_file: str = "open-insurance-llm-q5_k_m.gguf" # 5-bit quantization; balance between performance and resource efficiency |
|
max_tokens: int = 1000 # Maximum number of tokens to generate in a single output |
|
    temperature: float = 0.1 # Controls randomness by scaling the output distribution; lower values yield more focused, coherent responses
|
    top_k: int = 15 # After temperature scaling, restrict sampling to the 15 most probable tokens
|
    top_p: float = 0.2 # Within those top-k tokens, apply nucleus sampling with a cumulative probability cutoff of 0.2
|
repeat_penalty: float = 1.2 # Penalize repeated tokens to reduce redundancy |
|
num_beams: int = 4 # Number of beams for beam search; higher values improve quality at the cost of speed |
|
n_gpu_layers: int = -2 # Number of layers to offload to GPU; -1 for full GPU utilization, -2 for automatic configuration |
|
n_ctx: int = 2048 # Context window size; Llama 3 models support up to 8192 tokens context length |
|
n_batch: int = 256 # Number of tokens to process simultaneously; adjust based on available hardware (suggested 512) |
|
verbose: bool = False # True for enabling verbose logging for debugging purposes |
|
use_mmap: bool = False # Memory-map model to reduce RAM usage; set to True if running on limited memory systems |
|
use_mlock: bool = True # Lock model into RAM to prevent swapping; improves performance on systems with sufficient RAM |
|
offload_kqv: bool = True # Offload key, query, value matrices to GPU to accelerate inference |
|
|
|
|
|
|
|
class InsuranceLLM: |
|
def __init__(self, config: ModelConfig): |
|
self.config = config |
|
self.llm_ctx = None |
|
self.console = Console() |
|
self.conversation_history: List[Dict[str, str]] = [] |
|
|
|
self.system_message = ( |
|
"This is a chat between a user and an artificial intelligence assistant. " |
|
"The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. " |
|
"The assistant should also indicate when the answer cannot be found in the context. " |
|
"You are an expert from the Insurance domain with extensive insurance knowledge and " |
|
"professional writer skills, especially about insurance policies. " |
|
"Your name is OpenInsuranceLLM, and you were developed by Raj Maharajwala. " |
|
"You are willing to help answer the user's query with a detailed explanation. " |
|
"In your explanation, leverage your deep insurance expertise, such as relevant insurance policies, " |
|
"complex coverage plans, or other pertinent insurance concepts. Use precise insurance terminology while " |
|
"still aiming to make the explanation clear and accessible to a general audience." |
|
) |
|
|
|
def download_model(self) -> str: |
|
try: |
|
with self.console.status("[bold green]Downloading model..."): |
|
model_path = hf_hub_download( |
|
self.config.model_name, |
|
filename=self.config.model_file, |
|
local_dir=os.path.join(os.getcwd(), 'gguf_dir') |
|
) |
|
return model_path |
|
except Exception as e: |
|
self.console.print(f"[red]Error downloading model: {str(e)}[/red]") |
|
raise |
|
|
|
def load_model(self) -> None: |
|
try: |
|
quantized_path = os.path.join(os.getcwd(), "gguf_dir") |
|
directory = Path(quantized_path) |
|
|
|
try: |
|
model_path = str(list(directory.glob(self.config.model_file))[0]) |
|
except IndexError: |
|
model_path = self.download_model() |
|
|
|
with self.console.status("[bold green]Loading model..."): |
|
self.llm_ctx = Llama( |
|
model_path=model_path, |
|
n_gpu_layers=self.config.n_gpu_layers, |
|
n_ctx=self.config.n_ctx, |
|
n_batch=self.config.n_batch, |
|
num_beams=self.config.num_beams, |
|
verbose=self.config.verbose, |
|
use_mlock=self.config.use_mlock, |
|
use_mmap=self.config.use_mmap, |
|
offload_kqv=self.config.offload_kqv |
|
) |
|
except Exception as e: |
|
self.console.print(f"[red]Error loading model: {str(e)}[/red]") |
|
raise |
|
|
|
def build_conversation_prompt(self, new_question: str, context: str = "") -> str: |
|
prompt = f"System: {self.system_message}\n\n" |
|
|
|
# Add conversation history |
|
for exchange in self.conversation_history: |
|
prompt += f"User: {exchange['user']}\n\n" |
|
prompt += f"Assistant: {exchange['assistant']}\n\n" |
|
|
|
# Add the new question |
|
if context: |
|
prompt += f"User: Context: {context}\nQuestion: {new_question}\n\n" |
|
else: |
|
prompt += f"User: {new_question}\n\n" |
|
|
|
prompt += "Assistant:" |
|
return prompt |
|
|
|
def generate_response(self, prompt: str) -> Tuple[str, int, float]: |
|
if not self.llm_ctx: |
|
raise RuntimeError("Model not loaded. Call load_model() first.") |
|
|
|
self.console.print("[bold cyan]Assistant: [/bold cyan]", end="") |
|
complete_response = "" |
|
token_count = 0 |
|
start_time = time.time() |
|
|
|
try: |
|
for chunk in self.llm_ctx.create_completion( |
|
prompt, |
|
max_tokens=self.config.max_tokens, |
|
top_k=self.config.top_k, |
|
top_p=self.config.top_p, |
|
temperature=self.config.temperature, |
|
repeat_penalty=self.config.repeat_penalty, |
|
stream=True |
|
): |
|
text_chunk = chunk["choices"][0]["text"] |
|
complete_response += text_chunk |
|
token_count += 1 |
|
print(text_chunk, end="", flush=True) |
|
|
|
elapsed_time = time.time() - start_time |
|
print() |
|
return complete_response, token_count, elapsed_time |
|
except Exception as e: |
|
self.console.print(f"\n[red]Error generating response: {str(e)}[/red]") |
|
return f"I encountered an error while generating a response. Please try again or ask a different question.", 0, 0 |
|
|
|
def run_chat(self): |
|
try: |
|
self.load_model() |
|
self.console.print("\n[bold green]Welcome to Open-Insurance-LLM![/bold green]") |
|
self.console.print("Enter your questions (type '/bye', 'exit', or 'quit' to end the session)\n") |
|
self.console.print("Optional: You can provide context by typing 'context:' followed by your context, then 'question:' followed by your question\n") |
|
self.console.print("Your conversation history will be maintained for context-aware responses.\n") |
|
|
|
total_tokens = 0 |
|
|
|
while True: |
|
try: |
|
user_input = self.console.input("[bold cyan]User:[/bold cyan] ").strip() |
|
|
|
if user_input.lower() in ["exit", "/bye", "quit"]: |
|
self.console.print(f"\n[dim]Total tokens: {total_tokens}[/dim]") |
|
self.console.print("\n[bold green]Thank you for using OpenInsuranceLLM![/bold green]") |
|
break |
|
|
|
# Reset conversation with command |
|
if user_input.lower() == "/reset": |
|
self.conversation_history = [] |
|
self.console.print("[yellow]Conversation history has been reset.[/yellow]") |
|
continue |
|
|
|
context = "" |
|
question = user_input |
|
if "context:" in user_input.lower() and "question:" in user_input.lower(): |
|
parts = user_input.split("question:", 1) |
|
context = parts[0].replace("context:", "").strip() |
|
question = parts[1].strip() |
|
|
|
prompt = self.build_conversation_prompt(question, context) |
|
response, tokens, elapsed_time = self.generate_response(prompt) |
|
|
|
# Add to conversation history |
|
self.conversation_history.append({ |
|
"user": question, |
|
"assistant": response |
|
}) |
|
|
|
# Update total tokens |
|
total_tokens += tokens |
|
|
|
# Print metrics |
|
tokens_per_sec = tokens / elapsed_time if elapsed_time > 0 else 0 |
|
self.console.print( |
|
f"[dim]Tokens: {tokens} || " + |
|
f"Time: {elapsed_time:.2f}s || " + |
|
f"Speed: {tokens_per_sec:.2f} tokens/sec[/dim]" |
|
) |
|
print() # Add a blank line after each response |
|
|
|
except KeyboardInterrupt: |
|
self.console.print("\n[yellow]Input interrupted. Type '/bye', 'exit', or 'quit' to quit.[/yellow]") |
|
continue |
|
except Exception as e: |
|
self.console.print(f"\n[red]Error processing input: {str(e)}[/red]") |
|
continue |
|
except Exception as e: |
|
self.console.print(f"\n[red]Fatal error: {str(e)}[/red]") |
|
finally: |
|
if self.llm_ctx: |
|
del self.llm_ctx |
|
|
|
|
|
def main(): |
|
try: |
|
config = ModelConfig() |
|
llm = InsuranceLLM(config) |
|
llm.run_chat() |
|
except KeyboardInterrupt: |
|
print("\nProgram interrupted by user") |
|
except Exception as e: |
|
print(f"\nApplication error: {str(e)}") |
|
|
|
|
|
if __name__ == "__main__": |
|
main() |
|
``` |
|
|
|
```bash |
|
python3 inference_open-insurance-llm-gguf.py |
|
``` |
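
For a quick, non-interactive check without the full chat loop, the sketch below loads the Q4_K_M file directly (assuming it has already been downloaded to `gguf_dir/`, as the script above does) and runs a single completion with a shortened system message:

```python
from llama_cpp import Llama

# Load the locally downloaded GGUF file; path and filename assume the
# gguf_dir/ layout produced by the inference script above.
llm = Llama(
    model_path="gguf_dir/open-insurance-llm-q4_k_m.gguf",
    n_ctx=2048,
    verbose=False,
)

# ChatQA-style prompt (System / User / Assistant), abbreviated for brevity.
prompt = (
    "System: You are OpenInsuranceLLM, an insurance domain assistant.\n\n"
    "User: What is a deductible in an auto insurance policy?\n\n"
    "Assistant:"
)

out = llm.create_completion(prompt, max_tokens=200, temperature=0.1)
print(out["choices"][0]["text"].strip())
```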
|
|
|
### Nvidia Llama3-ChatQA Paper

arXiv: [https://arxiv.org/pdf/2401.10225](https://arxiv.org/pdf/2401.10225)
|
|
|
## Use Cases |
|
|
|
This model is specifically designed for: |
|
- Insurance policy understanding and explanation |
|
- Claims processing assistance |
|
- Coverage analysis |
|
- Insurance terminology clarification |
|
- Policy comparison and recommendations |
|
- Risk assessment queries |
|
- Insurance compliance questions |
|
|
|
## Limitations |
|
|
|
- The model's knowledge is limited to its training data cutoff |
|
- Should not be used as a replacement for professional insurance advice |
|
- May occasionally generate plausible-sounding but incorrect information |
|
|
|
## Bias and Ethics |
|
|
|
This model should be used with awareness that: |
|
- It may reflect biases present in insurance industry training data |
|
- Output should be verified by insurance professionals for critical decisions |
|
- It should not be used as the sole basis for insurance decisions |
|
- The model's responses should be treated as informational, not as legal or professional advice |
|
|
|
## Citation and Attribution |
|
|
|
If you use the base model or the quantized model in your research or applications, please cite:
|
``` |
|
@misc{maharajwala2024openinsurance, |
|
author = {Raj Maharajwala}, |
|
title = {Open-Insurance-LLM-Llama3-8B-GGUF}, |
|
year = {2024}, |
|
publisher = {HuggingFace}, |
|
linkedin = {https://www.linkedin.com/in/raj6800/}, |
|
url = {https://huggingface.co/Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF} |
|
} |
|
``` |