Uploaded model
- Developed by: CRLannister
- License: apache-2.0
- Finetuned from model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.
This model builds on the unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit base model and is fine-tuned for text-generation tasks using parameter-efficient techniques, specifically LoRA (Low-Rank Adaptation), through Hugging Face's TRL library.
Fine-tuning was run with the Unsloth library, which provided roughly a 2x training speed-up over a standard Transformers setup.
Key Features
- Efficient fine-tuning: LoRA adapters were used, significantly reducing computational cost and memory usage compared to full-model fine-tuning.
- High performance: optimized for text generation and conversational AI tasks.
- Fast training: training achieved a 2x speed-up through Unsloth's optimizations and features such as gradient checkpointing (see the sketch below).
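For illustration, the following is a minimal sketch of how LoRA adapters are typically attached with Unsloth at training time. The adapter hyperparameters mirror those listed under Model Details, while use_gradient_checkpointing="unsloth" (the option that enables the gradient checkpointing mentioned above) is an assumption about the training setup rather than a detail taken from this repository.

from unsloth import FastLanguageModel

# Attach trainable LoRA adapters to the 4-bit base model.
# Gradient checkpointing trades extra compute for lower activation memory during training.
model = FastLanguageModel.get_peft_model(
    model,  # a model returned by FastLanguageModel.from_pretrained
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",  # Assumed training-time option
)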
How to Use
Load the Model
To load the fine-tuned model for inference, follow these steps:
from unsloth import FastLanguageModel

# Base model and LoRA adapter locations
max_seq_length = 1024
base_model = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Your base model
lora_path = "CRLannister/finetuned_Llama_3_1_8B_Amharic_lora"  # Path to your saved LoRA weights

# Load the 4-bit base model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model,
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,  # Auto-detect a suitable dtype for the GPU
)
# Re-create the LoRA configuration used during fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
)
# Load the trained weights
model.load_adapter(lora_path, "default")
# Prepare model for inference
FastLanguageModel.for_inference(model)
# Prompt template used to format requests. NOTE: the standard Alpaca template below
# is an assumption; it must match the template the LoRA adapters were trained with.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def generate_output(instruction, input_, max_length=1024):
    # Format the prompt, leaving the response slot empty for generation
    formatted_prompt = alpaca_prompt.format(instruction, input_, '')

    # Tokenize
    inputs = tokenizer(
        [formatted_prompt],
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        padding=True,
    ).to("cuda")

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        use_cache=True,
        temperature=0,    # Ignored when do_sample=False
        do_sample=False,  # Deterministic generation
        num_beams=1,      # Simple greedy decoding
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode, then strip the prompt so only the generated part remains
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_text = result[len(formatted_prompt):].strip()
    return generated_text
# Example call; `query` is assumed to be a dict with "instruction" and "input" keys,
# e.g. query = {"instruction": "...", "input": "..."}
generate_output(query['instruction'], query['input'])
Model Details
Training
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Optimizer: AdamW 8-bit
- Batch size: 32
- Gradient accumulation steps: 4
- Learning rate: 2e-4
- Sequence length: 2048 tokens
Frameworks Used:
- Unsloth for training optimizations (a configuration sketch follows below)
- Transformers
- TRL
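For reference, here is a minimal sketch of a training setup consistent with the settings above, assuming the usual Unsloth + TRL SFTTrainer workflow; the dataset variable, its "text" field, the epoch count, and output_dir are placeholders rather than details from this repository.

import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,                # LoRA-wrapped model from FastLanguageModel.get_peft_model
    tokenizer=tokenizer,
    train_dataset=dataset,      # Placeholder: a dataset with a pre-formatted "text" column
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        optim="adamw_8bit",
        num_train_epochs=1,     # Placeholder: the epoch count is not stated on this card
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="outputs",   # Placeholder
    ),
)
trainer.train()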
Hardware Requirements
This model was trained on GPUs with 4-bit quantization (bnb-4bit) to optimize memory usage. It is suitable for inference on GPUs with at least 16 GB of VRAM.
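As a quick sanity check, a short snippet like the following can report available GPU memory before loading the 4-bit model:

import torch

# Report total VRAM on the first GPU; 16 GB is the rule of thumb quoted above.
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; this 4-bit inference setup expects a GPU.")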
Results
The model was fine-tuned on conversational and text-generation tasks, demonstrating high fluency and coherence. This makes it well suited to applications such as the following (a usage example follows the list):
- Chatbots
- Summarization
- Question answering
- Text completion
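For instance, question answering can reuse the generate_output helper defined above; the prompt below is purely hypothetical, and real prompts should match the language and task mix the adapters were actually trained on.

# Hypothetical question-answering call; adjust the instruction and input to your use case.
answer = generate_output(
    "Answer the question using the given context.",
    "Context: Addis Ababa is the capital of Ethiopia. Question: What is the capital of Ethiopia?",
)
print(answer)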
Contributing
Contributions to this model are welcome! Feel free to open issues or submit pull requests on the Hugging Face repository.
Acknowledgments
Special thanks to the Unsloth team for making fine-tuning faster and more accessible. The base model was developed by Meta and enhanced by the Unsloth community.