---
base_model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
license: apache-2.0
language:
- en
---

# Uploaded model

- **Developed by:** CRLannister
- **License:** apache-2.0
- **Finetuned from model:** unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

This model builds upon the base Meta-Llama-3.1-8B-Instruct-bnb-4bit and is fine-tuned for text-generation tasks using parameter-efficient techniques such as LoRA (Low-Rank Adaptation) through Hugging Face's TRL library. Fine-tuning was accelerated with the Unsloth library, enabling faster training and lower memory use.

# Key Features

- **Efficient Fine-Tuning:** LoRA adapters were used, significantly reducing computational cost and memory usage compared to full-model fine-tuning.
- **High Performance:** Optimized for text generation and conversational AI tasks.
- **Fast Training:** Training achieved a 2x speed-up through Unsloth's optimizations and features such as gradient checkpointing.

# How to Use

## Load the Model

To load the fine-tuned model for inference, follow these steps:

```
from unsloth import FastLanguageModel

# Base model and LoRA adapter locations
max_seq_length = 1024
base_model = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Your base model
lora_path = "CRLannister/finetuned_Llama_3_1_8B_Amharic_lora"  # Path to your saved LoRA weights

# Alpaca-style prompt template (assumed here; it must match the template used during fine-tuning)
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Load the 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model,
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

# Attach LoRA adapters with the same configuration used for training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj",
                    "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
)

# Load the trained adapter weights
model.load_adapter(lora_path, "default")

# Prepare model for inference
FastLanguageModel.for_inference(model)

def generate_output(instruction, input_, max_length=1024):
    # Format the prompt
    formatted_prompt = alpaca_prompt.format(instruction, input_, '')

    # Tokenize
    inputs = tokenizer(
        [formatted_prompt],
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        padding=True,
    ).to("cuda")

    # Generate (greedy decoding; sampling is disabled for deterministic outputs)
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        use_cache=True,
        do_sample=False,
        num_beams=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode, then strip the input prompt to keep only the generated part
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_text = result[len(formatted_prompt):].strip()
    return generated_text

# Example query (hypothetical; replace with your own instruction and input)
query = {"instruction": "Answer the question.", "input": "What is the capital of Ethiopia?"}
generate_output(query['instruction'], query['input'])
```

# Model Details

## Training

- **Fine-Tuning Method:** LoRA (Low-Rank Adaptation)
- **Optimizer:** AdamW 8-bit
- **Batch Size:** 32
- **Gradient Accumulation Steps:** 4
- **Learning Rate:** 2e-4
- **Sequence Length:** 2048 tokens

# Frameworks Used

- Unsloth for training optimizations
- Transformers
- TRL
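The training script itself is not included on this card. The snippet below is a minimal sketch of a setup consistent with the hyperparameters listed above, assuming Unsloth's `FastLanguageModel` together with TRL's `SFTTrainer` (as used in Unsloth's example notebooks); the dataset name, text column, epoch count, and output directory are placeholders, not values from this model's actual training run.

```
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 2048  # sequence length listed under Model Details

# Load the 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

# Add LoRA adapters (same configuration as in the inference snippet above)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj",
                    "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",  # memory-efficient gradient checkpointing
)

# Placeholder dataset; expects a "text" column of Alpaca-style formatted prompts
dataset = load_dataset("your_dataset_here", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=32,   # batch size listed under Model Details
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        optim="adamw_8bit",               # AdamW 8-bit optimizer
        num_train_epochs=1,               # placeholder; not stated on this card
        logging_steps=10,
        output_dir="outputs",
    ),
)

trainer.train()
```

After training, `model.save_pretrained(...)` (or `push_to_hub`) saves only the LoRA adapter weights, which is what the inference snippet above loads with `load_adapter`.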
# Hardware Requirements

This model was trained on GPUs with 4-bit quantization (bnb-4bit) to optimize memory usage. It is suitable for inference on GPUs with at least 16 GB of VRAM.

# Results

The model was fine-tuned on conversational and text-generation tasks, demonstrating high fluency and coherence. This makes it ideal for applications such as:

- Chatbots
- Summarization
- Question Answering
- Text Completion

# Contributing

Contributions to this model are welcome! Feel free to open issues or submit pull requests on the Hugging Face repository.

# Acknowledgments

Special thanks to the Unsloth team for making fine-tuning faster and more accessible. The base model was developed by Meta and enhanced by the Unsloth community.