	---
	license: llama3
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- Text Generation
	- Transformers
	- llama
	- llama-3
	- 8B
	- nvidia
	- facebook
	- meta
	- LLM
	- fine-tuned
	- insurance
	- research
	- pytorch
	- instruct
	- chatqa-1.5
	- chatqa
	- finetune
	- gpt4
	- conversational
	- text-generation-inference
	- Inference Endpoints
	datasets:
	- InsuranceQA

	base_model: "nvidia/Llama3-ChatQA-1.5-8B"
	finetuned: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B"
	quantized: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
	---

	# Open-Insurance-LLM-Llama3-8B-GGUF

	This is a GGUF-quantized version of an insurance domain-specific language model based on NVIDIA Llama3-ChatQA-1.5-8B,
	fine-tuned for insurance-related queries and conversations.


	## Model Details

	- Model Type: Quantized Language Model (GGUF format)
	- Base Model: nvidia/Llama3-ChatQA-1.5-8B
	- Finetuned Model: Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B
	- Quantized Model: Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF
	- Model Architecture: Llama
	- Quantization: 8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), 16-bit
	- Finetuned Dataset: InsuranceQA (https://github.com/shuzi/insuranceQA)
	- Developer: Raj Maharajwala
	- License: llama3
	- Language: English
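
To get a feel for what the quantization levels listed above mean for memory, here is a back-of-the-envelope sketch. It assumes about 8 billion parameters and the nominal bit widths from the list; the names `PARAMS`, `NOMINAL_BITS` and `approx_size_gb` are illustrative only. Real GGUF files are somewhat larger, because K-quants mix precisions and store scales and metadata, so treat the numbers as rough lower bounds.

```python
# Rough lower-bound size estimate: parameters x nominal bits per weight / 8.
# Assumption: ~8e9 parameters and nominal bit widths, not measured file sizes.
PARAMS = 8_000_000_000

NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "16-bit": 16}


def approx_size_gb(bits_per_weight: int, params: int = PARAMS) -> float:
    """Return the approximate weight size in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        print(f"{name:>7}: roughly {approx_size_gb(bits):.1f} GB or more")
```

Under these assumptions the 4-bit file needs about a quarter of the memory of the 16-bit one, which is why the smaller quantizations are the practical choice on laptops.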

	## Setup Instructions

	### Environment Setup

	#### For Windows
	```bash
	python3 -m venv .venv_open_insurance_llm
	.\.venv_open_insurance_llm\Scripts\activate
	```

	#### For Mac/Linux
	```bash
	python3 -m venv .venv_open_insurance_llm
	source .venv_open_insurance_llm/bin/activate
	```
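
If it is unclear whether the environment was activated, a quick check from Python can help. This is a minimal sketch that relies on the standard `sys.prefix` / `sys.base_prefix` comparison for `venv`-style environments; the helper name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```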

	### Installation

	#### For Mac Users (Metal Support)
	```bash
	export FORCE_CMAKE=1
	CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python==0.3.2 --no-cache-dir
	```

	#### For Windows Users (CPU Support)
	```bash
	pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
	```

	### Dependencies

	Then install the remaining dependencies listed in `inference_requirements.txt` (attached under `Files and versions`):
	```bash
	pip install -r inference_requirements.txt
	```
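
After installing, it can be worth confirming that the packages the inference script imports are actually visible to the interpreter. The sketch below only checks the three third-party import names used by the script in this README (`llama_cpp`, `rich`, `huggingface_hub`); it does not read `inference_requirements.txt`, whose exact contents are not shown here. The names `REQUIRED_MODULES` and `missing_modules` are illustrative.

```python
from importlib.util import find_spec

# Import names used by the inference script in this README.
REQUIRED_MODULES = ("llama_cpp", "rich", "huggingface_hub")


def missing_modules(names=REQUIRED_MODULES) -> list:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All inference dependencies are importable.")
```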

	## Inference Loop

	```python
	# Attached under `Files and Versions` (inference_open-insurance-llm-gguf.py)
	import os
	import time
	from pathlib import Path
	from llama_cpp import Llama
	from rich.console import Console
	from huggingface_hub import hf_hub_download
	from dataclasses import dataclass
	from typing import Dict, List, Tuple

	@dataclass
	class ModelConfig:
	    # Parameters chosen for coherent responses and efficient performance on devices like a MacBook Air M2
	    model_name: str = "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
	    model_file: str = "open-insurance-llm-q4_k_m.gguf"
	    # model_file: str = "open-insurance-llm-q8_0.gguf"  # 8-bit quantization; higher precision, better quality, increased resource usage
	    # model_file: str = "open-insurance-llm-q5_k_m.gguf"  # 5-bit quantization; balance between quality and resource usage
	    max_tokens: int = 1000  # Maximum number of tokens to generate per response
	    temperature: float = 0.1  # Sampling temperature; lower values make the output more deterministic
	    top_k: int = 15  # Sample only from the 15 most probable tokens
	    top_p: float = 0.2  # Nucleus sampling: keep the smallest token set whose cumulative probability reaches 0.2
	    repeat_penalty: float = 1.2  # Penalize repeated tokens to reduce redundancy
	    num_beams: int = 4  # Kept from the original script; llama-cpp-python does not document a num_beams option, so it likely has no effect
	    n_gpu_layers: int = -2  # Layers to offload to GPU; llama-cpp-python documents -1 as "offload all layers" (the original script uses -2 for automatic configuration)
	    n_ctx: int = 2048  # Context window size in tokens; Llama 3 models support up to 8192
	    n_batch: int = 256  # Number of prompt tokens processed per batch; adjust to the available hardware (512 suggested)
	    verbose: bool = False  # Set to True to enable verbose logging for debugging
	    use_mmap: bool = False  # Memory-map the model file to reduce RAM usage; set to True on memory-constrained systems
	    use_mlock: bool = True  # Lock the model into RAM to prevent swapping; requires sufficient RAM
	    offload_kqv: bool = True  # Offload key, query, and value matrices to the GPU to accelerate inference



	class InsuranceLLM:
	    def __init__(self, config: ModelConfig):
	        self.config = config
	        self.llm_ctx = None
	        self.console = Console()
	        self.conversation_history: List[Dict[str, str]] = []

	        self.system_message = (
	            "This is a chat between a user and an artificial intelligence assistant. "
	            "The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. "
	            "The assistant should also indicate when the answer cannot be found in the context. "
	            "You are an expert from the Insurance domain with extensive insurance knowledge and "
	            "professional writer skills, especially about insurance policies. "
	            "Your name is OpenInsuranceLLM, and you were developed by Raj Maharajwala. "
	            "You are willing to help answer the user's query with a detailed explanation. "
	            "In your explanation, leverage your deep insurance expertise, such as relevant insurance policies, "
	            "complex coverage plans, or other pertinent insurance concepts. Use precise insurance terminology while "
	            "still aiming to make the explanation clear and accessible to a general audience."
	        )

	    def download_model(self) -> str:
	        try:
	            with self.console.status("[bold green]Downloading model..."):
	                model_path = hf_hub_download(
	                    self.config.model_name,
	                    filename=self.config.model_file,
	                    local_dir=os.path.join(os.getcwd(), "gguf_dir"),
	                )
	            return model_path
	        except Exception as e:
	            self.console.print(f"[red]Error downloading model: {e}[/red]")
	            raise

	    def load_model(self) -> None:
	        try:
	            # Reuse a previously downloaded file if present; otherwise download it
	            local_file = Path(os.getcwd()) / "gguf_dir" / self.config.model_file
	            model_path = str(local_file) if local_file.exists() else self.download_model()

	            with self.console.status("[bold green]Loading model..."):
	                self.llm_ctx = Llama(
	                    model_path=model_path,
	                    n_gpu_layers=self.config.n_gpu_layers,
	                    n_ctx=self.config.n_ctx,
	                    n_batch=self.config.n_batch,
	                    num_beams=self.config.num_beams,  # not a documented Llama() argument; see ModelConfig
	                    verbose=self.config.verbose,
	                    use_mlock=self.config.use_mlock,
	                    use_mmap=self.config.use_mmap,
	                    offload_kqv=self.config.offload_kqv,
	                )
	        except Exception as e:
	            self.console.print(f"[red]Error loading model: {e}[/red]")
	            raise

	    def build_conversation_prompt(self, new_question: str, context: str = "") -> str:
	        prompt = f"System: {self.system_message}\n\n"

	        # Add conversation history
	        for exchange in self.conversation_history:
	            prompt += f"User: {exchange['user']}\n\n"
	            prompt += f"Assistant: {exchange['assistant']}\n\n"

	        # Add the new question
	        if context:
	            prompt += f"User: Context: {context}\nQuestion: {new_question}\n\n"
	        else:
	            prompt += f"User: {new_question}\n\n"

	        prompt += "Assistant:"
	        return prompt

	    def generate_response(self, prompt: str) -> Tuple[str, int, float]:
	        if not self.llm_ctx:
	            raise RuntimeError("Model not loaded. Call load_model() first.")

	        self.console.print("[bold cyan]Assistant: [/bold cyan]", end="")
	        complete_response = ""
	        token_count = 0
	        start_time = time.time()

	        try:
	            for chunk in self.llm_ctx.create_completion(
	                prompt,
	                max_tokens=self.config.max_tokens,
	                top_k=self.config.top_k,
	                top_p=self.config.top_p,
	                temperature=self.config.temperature,
	                repeat_penalty=self.config.repeat_penalty,
	                stream=True,
	            ):
	                text_chunk = chunk["choices"][0]["text"]
	                complete_response += text_chunk
	                token_count += 1  # counts streamed chunks as an approximation of tokens
	                print(text_chunk, end="", flush=True)

	            elapsed_time = time.time() - start_time
	            print()
	            return complete_response, token_count, elapsed_time
	        except Exception as e:
	            self.console.print(f"\n[red]Error generating response: {e}[/red]")
	            return "I encountered an error while generating a response. Please try again or ask a different question.", 0, 0.0

	    def run_chat(self):
	        try:
	            self.load_model()
	            self.console.print("\n[bold green]Welcome to Open-Insurance-LLM![/bold green]")
	            self.console.print("Enter your questions (type '/bye', 'exit', or 'quit' to end the session, '/reset' to clear the history)\n")
	            self.console.print("Optional: You can provide context by typing 'context:' followed by your context, then 'question:' followed by your question\n")
	            self.console.print("Your conversation history will be maintained for context-aware responses.\n")

	            total_tokens = 0

	            while True:
	                try:
	                    user_input = self.console.input("[bold cyan]User:[/bold cyan] ").strip()

	                    if user_input.lower() in ["exit", "/bye", "quit"]:
	                        self.console.print(f"\n[dim]Total tokens: {total_tokens}[/dim]")
	                        self.console.print("\n[bold green]Thank you for using OpenInsuranceLLM![/bold green]")
	                        break

	                    # Reset conversation with command
	                    if user_input.lower() == "/reset":
	                        self.conversation_history = []
	                        self.console.print("[yellow]Conversation history has been reset.[/yellow]")
	                        continue

	                    context = ""
	                    question = user_input
	                    # Locate the markers case-insensitively so "Context:" and "Question:" also work
	                    lowered = user_input.lower()
	                    c_idx = lowered.find("context:")
	                    q_idx = lowered.find("question:")
	                    if 0 <= c_idx < q_idx:
	                        context = user_input[c_idx + len("context:"):q_idx].strip()
	                        question = user_input[q_idx + len("question:"):].strip()

	                    prompt = self.build_conversation_prompt(question, context)
	                    response, tokens, elapsed_time = self.generate_response(prompt)

	                    # Add to conversation history
	                    self.conversation_history.append({
	                        "user": question,
	                        "assistant": response
	                    })

	                    # Update total tokens
	                    total_tokens += tokens

	                    # Print metrics
	                    tokens_per_sec = tokens / elapsed_time if elapsed_time > 0 else 0
	                    self.console.print(
	                        f"[dim]Tokens: {tokens} || "
	                        f"Time: {elapsed_time:.2f}s || "
	                        f"Speed: {tokens_per_sec:.2f} tokens/sec[/dim]"
	                    )
	                    print()  # Add a blank line after each response

	                except KeyboardInterrupt:
	                    self.console.print("\n[yellow]Input interrupted. Type '/bye', 'exit', or 'quit' to quit.[/yellow]")
	                    continue
	                except Exception as e:
	                    self.console.print(f"\n[red]Error processing input: {e}[/red]")
	                    continue
	        except Exception as e:
	            self.console.print(f"\n[red]Fatal error: {e}[/red]")
	        finally:
	            if self.llm_ctx:
	                del self.llm_ctx


	def main():
	    try:
	        config = ModelConfig()
	        llm = InsuranceLLM(config)
	        llm.run_chat()
	    except KeyboardInterrupt:
	        print("\nProgram interrupted by user")
	    except Exception as e:
	        print(f"\nApplication error: {e}")


	if __name__ == "__main__":
	    main()
	```
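
The script samples with a low `temperature` (0.1), `top_k=15` and `top_p=0.2`. The toy example below is a rough illustration of how such settings narrow the set of candidate tokens. It is a sketch under stated assumptions, not the llama.cpp implementation: the function `filter_candidates` and the toy logits are hypothetical, and the order used here (temperature, then top-k, then top-p) is one plausible ordering that the actual sampler chain may not follow.

```python
import math


def filter_candidates(logits, temperature=0.1, top_k=15, top_p=0.2):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise the survivors so they form a distribution again.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


if __name__ == "__main__":
    toy_logits = {"premium": 2.0, "deductible": 1.5, "claim": 1.0, "banana": -3.0}
    print(filter_candidates(toy_logits))                 # settings from ModelConfig
    print(filter_candidates(toy_logits, 1.0, 15, 0.9))   # looser settings for comparison
```

With the values from `ModelConfig`, only the single most likely toy token survives, which is why these settings give near-deterministic answers; the looser settings keep several candidates.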

	```bash
	python3 inference_open-insurance-llm-gguf.py
	```
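
In long sessions the conversation history keeps growing, while the script's context window is `n_ctx = 2048` with up to `max_tokens = 1000` reserved for the answer. One possible way to keep prompts within budget is to drop the oldest exchanges first. The sketch below is hypothetical and not part of the attached script: `estimate_tokens`, `build_prompt` and `trim_history` are illustrative names, and the 4-characters-per-token figure is a crude assumption rather than the real Llama 3 tokenizer.

```python
from typing import Dict, List

CHARS_PER_TOKEN = 4  # assumption: crude average, not the real Llama 3 tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def build_prompt(system: str, history: List[Dict[str, str]], question: str) -> str:
    """Format a prompt in the System/User/Assistant layout used by the script."""
    parts = [f"System: {system}\n\n"]
    for exchange in history:
        parts.append(f"User: {exchange['user']}\n\n")
        parts.append(f"Assistant: {exchange['assistant']}\n\n")
    parts.append(f"User: {question}\n\nAssistant:")
    return "".join(parts)


def trim_history(system, history, question, n_ctx=2048, max_tokens=1000):
    """Drop the oldest exchanges until the prompt fits the estimated budget."""
    budget = n_ctx - max_tokens  # leave room for the generated answer
    trimmed = list(history)
    while trimmed and estimate_tokens(build_prompt(system, trimmed, question)) > budget:
        trimmed.pop(0)  # discard the oldest exchange first
    return trimmed


if __name__ == "__main__":
    history = [{"user": f"question {i}", "assistant": "x" * 2000} for i in range(5)]
    kept = trim_history("You are an insurance assistant.", history, "What is a deductible?")
    print(f"kept {len(kept)} of {len(history)} exchanges")
```

For an exact budget, the character heuristic could be replaced by counting tokens with the loaded model's own tokenizer.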

	### NVIDIA Llama 3 ChatQA Paper
	arXiv: [https://arxiv.org/pdf/2401.10225](https://arxiv.org/pdf/2401.10225)

	## Use Cases

	This model is specifically designed for:
	- Insurance policy understanding and explanation
	- Claims processing assistance
	- Coverage analysis
	- Insurance terminology clarification
	- Policy comparison and recommendations
	- Risk assessment queries
	- Insurance compliance questions

	## Limitations

	- The model's knowledge is limited to its training data and does not extend beyond the training cutoff
	- Should not be used as a replacement for professional insurance advice
	- May occasionally generate plausible-sounding but incorrect information

	## Bias and Ethics

	This model should be used with awareness that:
	- It may reflect biases present in insurance industry training data
	- Output should be verified by insurance professionals for critical decisions
	- It should not be used as the sole basis for insurance decisions
	- The model's responses should be treated as informational, not as legal or professional advice

	## Citation and Attribution

	If you use the base model or the quantized model in your research or applications, please cite:
	```
	@misc{maharajwala2024openinsurance,
	author = {Raj Maharajwala},
	title = {Open-Insurance-LLM-Llama3-8B-GGUF},
	year = {2024},
	publisher = {HuggingFace},
	linkedin = {https://www.linkedin.com/in/raj6800/},
	url = {https://huggingface.co/Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF}
	}
	```