Uploaded model
- Developed by: CRLannister
- License: apache-2.0
- Finetuned from model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.
This model builds on the unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit base model and is fine-tuned for text-generation tasks using parameter-efficient techniques, specifically LoRA (Low-Rank Adaptation), through Hugging Face's TRL library.
Fine-tuning was run with the Unsloth library, which provided roughly a 2x training speed-up over a standard Transformers setup.
Key Features
- Efficient fine-tuning: LoRA adapters were used, significantly reducing computational cost and memory usage compared to full-model fine-tuning.
- High performance: optimized for text generation and conversational AI tasks.
- Fast training: training achieved a 2x speed-up through Unsloth's optimizations and features such as gradient checkpointing (see the sketch below).
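For illustration, the following is a minimal sketch of how LoRA adapters are typically attached with Unsloth at training time. The adapter hyperparameters mirror those listed under Model Details, while use_gradient_checkpointing="unsloth" (the option that enables the gradient checkpointing mentioned above) is an assumption about the training setup rather than a detail taken from this repository.

from unsloth import FastLanguageModel

# Attach trainable LoRA adapters to the 4-bit base model.
# Gradient checkpointing trades extra compute for lower activation memory during training.
model = FastLanguageModel.get_peft_model(
    model,  # a model returned by FastLanguageModel.from_pretrained
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",  # Assumed training-time option
)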
How to Use
Load the Model
To load the fine-tuned model for inference, follow these steps:
from unsloth import FastLanguageModel

# Base model and LoRA adapter locations
max_seq_length = 1024
base_model = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Your base model
lora_path = "CRLannister/finetuned_Llama_3_1_8B_Amharic_lora"  # Path to your saved LoRA weights

# Load the 4-bit base model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model,
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,  # Auto-detect a suitable dtype for the GPU
)
# Re-create the LoRA configuration used during fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
)
# Load the trained weights
model.load_adapter(lora_path, "default")
# Prepare model for inference
FastLanguageModel.for_inference(model)
# Prompt template used to format requests. NOTE: the standard Alpaca template below
# is an assumption; it must match the template the LoRA adapters were trained with.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def generate_output(instruction, input_, max_length=1024):
    # Format the prompt, leaving the response slot empty for generation
    formatted_prompt = alpaca_prompt.format(instruction, input_, '')

    # Tokenize
    inputs = tokenizer(
        [formatted_prompt],
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        padding=True,
    ).to("cuda")

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        use_cache=True,
        temperature=0,    # Ignored when do_sample=False
        do_sample=False,  # Deterministic generation
        num_beams=1,      # Simple greedy decoding
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode, then strip the prompt so only the generated part remains
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_text = result[len(formatted_prompt):].strip()
    return generated_text
# Example call; `query` is assumed to be a dict with "instruction" and "input" keys,
# e.g. query = {"instruction": "...", "input": "..."}
generate_output(query['instruction'], query['input'])
Model Details
Training
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Optimizer: AdamW 8-bit
- Batch size: 32
- Gradient accumulation steps: 4
- Learning rate: 2e-4
- Sequence length: 2048 tokens
Frameworks Used:
- Unsloth for training optimizations (a configuration sketch follows below)
- Transformers
- TRL
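For reference, here is a minimal sketch of a training setup consistent with the settings above, assuming the usual Unsloth + TRL SFTTrainer workflow; the dataset variable, its "text" field, the epoch count, and output_dir are placeholders rather than details from this repository.

import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,                # LoRA-wrapped model from FastLanguageModel.get_peft_model
    tokenizer=tokenizer,
    train_dataset=dataset,      # Placeholder: a dataset with a pre-formatted "text" column
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        optim="adamw_8bit",
        num_train_epochs=1,     # Placeholder: the epoch count is not stated on this card
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="outputs",   # Placeholder
    ),
)
trainer.train()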
Hardware Requirements
This model was trained on GPUs with 4-bit quantization (bnb-4bit) to optimize memory usage. It is suitable for inference on GPUs with at least 16 GB of VRAM.
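As a quick sanity check, a short snippet like the following can report available GPU memory before loading the 4-bit model:

import torch

# Report total VRAM on the first GPU; 16 GB is the rule of thumb quoted above.
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; this 4-bit inference setup expects a GPU.")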
Results
The model was fine-tuned on conversational and text-generation tasks, demonstrating high fluency and coherence. This makes it well suited to applications such as the following (a usage example follows the list):
- Chatbots
- Summarization
- Question answering
- Text completion
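For instance, question answering can reuse the generate_output helper defined above; the prompt below is purely hypothetical, and real prompts should match the language and task mix the adapters were actually trained on.

# Hypothetical question-answering call; adjust the instruction and input to your use case.
answer = generate_output(
    "Answer the question using the given context.",
    "Context: Addis Ababa is the capital of Ethiopia. Question: What is the capital of Ethiopia?",
)
print(answer)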
Contributing
Contributions to this model are welcome! Feel free to open issues or submit pull requests on the Hugging Face repository.
Acknowledgments
Special thanks to the Unsloth team for making fine-tuning faster and more accessible. The base model was developed by Meta and enhanced by the Unsloth community.