Model Card for Qwen3-0.6B-Arabic-2B-Tokens-LoRA
This model card is for YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned, a version of Qwen/Qwen3-0.6B fine-tuned with Low-Rank Adaptation (LoRA) on a large Arabic corpus for enhanced Arabic text generation.
Model Details
Model Description
This model is a fine-tuned version of the Qwen/Qwen3-0.6B model. It has been adapted for the Arabic language using Low-Rank Adaptation (LoRA) on approximately 2 billion tokens from the MohamedRashad/arabic-billion-words dataset. Fine-tuning was performed in a sequential, chunk-wise manner, iteratively training the model on 50-million-token chunks of the dataset; this approach allows efficient training on a very large corpus.
- Developed by: Youssef Hosni
- Model type: Causal Language Model (Decoder-only Transformer)
- Language(s) (NLP): Arabic (ar)
- License: The base model, Qwen3, is licensed under the Apache 2.0 license.
- Finetuned from model: Qwen/Qwen3-0.6B
Model Sources
- Repository: https://huggingface.co/YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned
Uses
Direct Use
The model is intended for direct use in Arabic text generation. It can be used to continue a given prompt, write stories, answer questions, or generate creative text in Arabic. The notebook provides examples of generating text with prompts like "كان يا ما كان في قديم الزمان،" ("Once upon a time, long ago,").
Downstream Use
This LoRA model can serve as a strong foundation for further fine-tuning on more specific, downstream Arabic NLP tasks such as:
- Dialect-specific chatbots
- Arabic summarization
- Content creation for Arabic websites or social media
Out-of-Scope Use
This model is not intended for use in high-stakes decision-making or for applications where factual accuracy is critical without further rigorous evaluation. Like all language models, it can generate plausible but incorrect information (hallucinate) and may reflect biases from its training data.
Bias, Risks, and Limitations
The model was fine-tuned on the MohamedRashad/arabic-billion-words dataset, which is sourced from newspaper articles across several Arab countries. This data may contain biases reflecting the perspectives and reporting styles of those sources, and the model may therefore generate text that reflects these biases. Users should be aware of this and critically evaluate the model's outputs.
How to Get Started with the Model
Use the code below to get started with the LoRA model. This code demonstrates how to load the base model and apply the fine-tuned LoRA adapters for inference.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

def generate_arabic_text_lora(base_model_id, lora_model_id, prompt_text, max_new_tokens=250, temperature=0.7, do_sample=True):
    """
    Generate Arabic text using a LoRA fine-tuned model.
    """
    # Check for bfloat16 support
    bf16_supported = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

    # Load base model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16 if bf16_supported else torch.float16,
        device_map="auto",
    )

    # Load the LoRA adapters on top of the base model
    model = PeftModel.from_pretrained(base_model, lora_model_id)

    # Set padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.eos_token_id

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    print(f"\nPrompt: {prompt_text}")
    print("\nGenerating text...")

    # Encode the prompt
    inputs = tokenizer(prompt_text, return_tensors="pt", return_attention_mask=True)
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)

    # Generate text
    model.eval()
    with torch.no_grad():
        output_sequences = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature if do_sample else 1.0,
            top_k=50 if do_sample else None,
            top_p=0.95 if do_sample else None,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode the first (and only) generated sequence
    full_generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)

    print("\n--- Full Generated Text (including prompt) ---")
    print(full_generated_text)
    print("\n--- Generation Complete ---")
    return full_generated_text


# --- Example Usage ---
base_model_id = "Qwen/Qwen3-0.6B"
lora_model_id = "YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned"
prompt = "كان يا ما كان في قديم الزمان،"  # "Once upon a time, long ago,"

generate_arabic_text_lora(
    base_model_id=base_model_id,
    lora_model_id=lora_model_id,
    prompt_text=prompt,
)
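For deployment, the adapters can optionally be folded into the base weights so that inference no longer requires peft at runtime. A minimal sketch using PeftModel.merge_and_unload (the output directory name here is arbitrary):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "Qwen/Qwen3-0.6B"
lora_model_id = "YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned"

# Load the base model and attach the LoRA adapters.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base_model, lora_model_id)

# Fold the adapter weights into the base weights and save a plain transformers model.
merged = model.merge_and_unload()
merged.save_pretrained("qwen3-0.6b-arabic-merged")
AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True).save_pretrained("qwen3-0.6b-arabic-merged")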
Training Details
Training Data
The model was fine-tuned on the MohamedRashad/arabic-billion-words dataset. This dataset is described as containing 1.5 billion words from newspaper articles from ten major news sources across eight Arab countries, collected over a fourteen-year period.
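For reference, the corpus can be pulled with the datasets library; the sketch below streams it so the full corpus is not downloaded at once (the split name "train" and the record schema are assumptions, since this card does not show the notebook's loading code):

from datasets import load_dataset

# Stream the corpus rather than downloading all of it up front.
dataset = load_dataset("MohamedRashad/arabic-billion-words", split="train", streaming=True)

# Peek at one record to see the available fields (field names are dataset-specific).
first_record = next(iter(dataset))
print(list(first_record.keys()))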
Training Procedure
The model was trained using a sequential chunk-wise fine-tuning strategy. The training was performed on chunks of approximately 50 million tokens for 5 epochs each. This process was repeated about 40 times to cover roughly 2 billion tokens from the dataset.
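The sequence described above can be pictured as the loop below; take_next_chunk, tokenize_and_group, and build_trainer are hypothetical placeholders for the notebook's actual helpers, and the numbers follow the figures quoted in this card:

CHUNK_TOKENS = 50_000_000   # ~50M tokens per chunk
NUM_CHUNKS = 40             # ~40 chunks -> ~2B tokens in total

for chunk_idx in range(NUM_CHUNKS):
    # Materialize the next ~50M-token slice of the corpus (hypothetical helper).
    raw_chunk = take_next_chunk(dataset, CHUNK_TOKENS)

    # Tokenize and pack into 512-token blocks, then hold out 5% for validation.
    lm_chunk = tokenize_and_group(raw_chunk, block_size=512)
    split = lm_chunk.train_test_split(test_size=0.05)

    # Keep training the same LoRA model on this chunk for 5 epochs,
    # tracking the checkpoint with the best validation loss (hypothetical helper).
    trainer = build_trainer(model, split["train"], split["test"], num_train_epochs=5)
    trainer.train()
    trainer.save_model(f"lora-after-chunk-{chunk_idx}")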
Preprocessing
The text data was preprocessed by tokenizing it and then grouping the tokens into blocks of 512. The DataCollatorForLanguageModeling was used to prepare the data for causal language modeling.
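This is the standard tokenize-then-pack recipe for causal language modeling, and roughly what the tokenize_and_group placeholder in the sketch above would do; raw_chunk stands for one corpus slice as in that sketch, and the "text" column name is an assumption about the dataset schema:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B", trust_remote_code=True)
block_size = 512

def tokenize_function(examples):
    # Tokenize raw articles; the "text" column name is assumed.
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token sequences, then cut them into fixed 512-token blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = raw_chunk.map(tokenize_function, batched=True, remove_columns=raw_chunk.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)

# mlm=False configures the collator for causal (next-token) language modeling.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)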
Training Hyperparameters
- LoRA Rank (r): 16
- LoRA Alpha (lora_alpha): 32
- LoRA Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Epochs per Chunk: 5
- Batch Size: 8 per device
- Gradient Accumulation: 4 steps
- Optimizer: AdamW (adamw_torch)
- Learning Rate: 2e-4
- LR Scheduler: Cosine
- Training Regime: Mixed precision (fp16) with 4-bit quantization (BitsAndBytes NF4); a configuration sketch follows below.
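The hyperparameters above map onto the usual peft / transformers / bitsandbytes configuration objects roughly as follows (settings not listed in this card, such as LoRA dropout or warmup, are omitted rather than guessed):

import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization of the base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapters on all attention and MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Per-chunk training arguments.
training_args = TrainingArguments(
    output_dir="qwen3-arabic-lora",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    fp16=True,
)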
Evaluation
Testing Data, Factors & Metrics
- Testing Data: For each training chunk, a validation set was created by splitting off 5% of the data.
- Metrics: The primary metric for evaluation and model selection was the validation loss, as illustrated in the sketch below.
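Validation loss is what Trainer.evaluate() returns for each chunk's 5% hold-out split; a minimal sketch, reusing the trainer and split from the training sketch above (perplexity is derived here only as a convenience view of the same loss):

import math

# Evaluate on the 5% hold-out split of the current chunk.
metrics = trainer.evaluate()
val_loss = metrics["eval_loss"]

# exp(loss) is the corresponding perplexity, shown only for readability.
print(f"validation loss: {val_loss:.4f} (perplexity: {math.exp(val_loss):.2f})")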
Results
The model's performance was monitored by tracking the validation loss during training. The final model represents the checkpoint with the best validation loss. Qualitative comparisons in the training notebook show that the fine-tuned model produces more coherent and contextually relevant Arabic text compared to the base model for the given prompts.
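A qualitative check of this kind can be reproduced by generating from both models with the same prompt; the sketch below uses the generate_arabic_text_lora helper from the quick-start section for the fine-tuned model and a plain transformers pipeline for the base model (sampling settings mirror the helper's defaults):

from transformers import pipeline

# Same prompt for both models, so the outputs can be compared side by side.
prompt = "كان يا ما كان في قديم الزمان،"

# Base model alone.
base_generator = pipeline("text-generation", model="Qwen/Qwen3-0.6B", trust_remote_code=True)
base_text = base_generator(prompt, max_new_tokens=250, do_sample=True, temperature=0.7)[0]["generated_text"]
print(base_text)

# Base model + LoRA adapters (helper defined in "How to Get Started with the Model").
lora_text = generate_arabic_text_lora(
    base_model_id="Qwen/Qwen3-0.6B",
    lora_model_id="YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned",
    prompt_text=prompt,
)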
Environmental Impact
- Hardware Type: GPU (as specified in the Kaggle environment)
- Hours used: Approximately 24 hours (calculated from ~35 minutes per chunk for ~40 chunks)
- Cloud Provider: Kaggle
- Compute Region: Not specified in the notebook, but typically US regions.