Model Card for Qwen3-0.6B-Arabic-2B-Tokens-LoRA
This model card is for YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned, a version of Qwen/Qwen3-0.6B fine-tuned with Low-Rank Adaptation (LoRA) on a large Arabic corpus for enhanced Arabic text generation.
Model Details
Model Description
This model is a fine-tuned version of the Qwen/Qwen3-0.6B model. It has been adapted for the Arabic language using Low-Rank Adaptation (LoRA) on approximately 2 billion tokens from the MohamedRashad/arabic-billion-words dataset. Fine-tuning was performed in a sequential, chunk-wise manner, iteratively training the model on 50-million-token chunks of the dataset; this approach allows efficient training on a very large corpus.
- Developed by: Youssef Hosni
- Model type: Causal Language Model (Decoder-only Transformer)
- Language(s) (NLP): Arabic (ar)
- License: The base model, Qwen3, is licensed under the Apache 2.0 license.
- Finetuned from model: Qwen/Qwen3-0.6B
Model Sources
- Repository: https://huggingface.co/YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned
Uses
Direct Use
The model is intended for direct use in Arabic text generation. It can be used to continue a given prompt, write stories, answer questions, or generate creative text in Arabic. The notebook provides examples of generating text with prompts like "كان يا ما كان في قديم الزمان،" ("Once upon a time, long ago,").
Downstream Use
This LoRA model can serve as a strong foundation for further fine-tuning on more specific, downstream Arabic NLP tasks such as:
- Dialect-specific chatbots
- Arabic summarization
- Content creation for Arabic websites or social media
Out-of-Scope Use
This model is not intended for use in high-stakes decision-making or for applications where factual accuracy is critical without further rigorous evaluation. Like all language models, it can generate plausible but incorrect information (hallucinate) and may reflect biases from its training data.
Bias, Risks, and Limitations
The model was fine-tuned on the MohamedRashad/arabic-billion-words dataset, which is sourced from newspaper articles across several Arab countries. This data may contain biases reflecting the perspectives and reporting styles of those sources, and the model may therefore generate text that reflects these biases. Users should be aware of this and critically evaluate the model's outputs.
How to Get Started with the Model
Use the code below to get started with the LoRA model. This code demonstrates how to load the base model and apply the fine-tuned LoRA adapters for inference.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

def generate_arabic_text_lora(base_model_id, lora_model_id, prompt_text, max_new_tokens=250, temperature=0.7, do_sample=True):
    """
    Generate Arabic text using a LoRA fine-tuned model.
    """
    # Check for bfloat16 support
    bf16_supported = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

    # Load base model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16 if bf16_supported else torch.float16,
        device_map="auto",
    )

    # Load the LoRA adapters on top of the base model
    model = PeftModel.from_pretrained(base_model, lora_model_id)

    # Set padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.eos_token_id

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    print(f"\nPrompt: {prompt_text}")
    print("\nGenerating text...")

    # Encode the prompt
    inputs = tokenizer(prompt_text, return_tensors="pt", return_attention_mask=True)
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)

    # Generate text
    model.eval()
    with torch.no_grad():
        output_sequences = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature if do_sample else 1.0,
            top_k=50 if do_sample else None,
            top_p=0.95 if do_sample else None,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode the first (and only) generated sequence
    full_generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)

    print("\n--- Full Generated Text (including prompt) ---")
    print(full_generated_text)
    print("\n--- Generation Complete ---")
    return full_generated_text


# --- Example Usage ---
base_model_id = "Qwen/Qwen3-0.6B"
lora_model_id = "YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned"
prompt = "كان يا ما كان في قديم الزمان،"  # "Once upon a time, long ago,"

generate_arabic_text_lora(
    base_model_id=base_model_id,
    lora_model_id=lora_model_id,
    prompt_text=prompt,
)
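For deployment, the adapters can optionally be folded into the base weights so that inference no longer requires peft at runtime. A minimal sketch using PeftModel.merge_and_unload (the output directory name here is arbitrary):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "Qwen/Qwen3-0.6B"
lora_model_id = "YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned"

# Load the base model and attach the LoRA adapters.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base_model, lora_model_id)

# Fold the adapter weights into the base weights and save a plain transformers model.
merged = model.merge_and_unload()
merged.save_pretrained("qwen3-0.6b-arabic-merged")
AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True).save_pretrained("qwen3-0.6b-arabic-merged")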
Training Details
Training Data
The model was fine-tuned on the MohamedRashad/arabic-billion-words dataset. This dataset is described as containing 1.5 billion words from newspaper articles from ten major news sources across eight Arab countries, collected over a fourteen-year period.
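For reference, the corpus can be pulled with the datasets library; the sketch below streams it so the full corpus is not downloaded at once (the split name "train" and the record schema are assumptions, since this card does not show the notebook's loading code):

from datasets import load_dataset

# Stream the corpus rather than downloading all of it up front.
dataset = load_dataset("MohamedRashad/arabic-billion-words", split="train", streaming=True)

# Peek at one record to see the available fields (field names are dataset-specific).
first_record = next(iter(dataset))
print(list(first_record.keys()))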
Training Procedure
The model was trained using a sequential chunk-wise fine-tuning strategy. The training was performed on chunks of approximately 50 million tokens for 5 epochs each. This process was repeated about 40 times to cover roughly 2 billion tokens from the dataset.
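The sequence described above can be pictured as the loop below; take_next_chunk, tokenize_and_group, and build_trainer are hypothetical placeholders for the notebook's actual helpers, and the numbers follow the figures quoted in this card:

CHUNK_TOKENS = 50_000_000   # ~50M tokens per chunk
NUM_CHUNKS = 40             # ~40 chunks -> ~2B tokens in total

for chunk_idx in range(NUM_CHUNKS):
    # Materialize the next ~50M-token slice of the corpus (hypothetical helper).
    raw_chunk = take_next_chunk(dataset, CHUNK_TOKENS)

    # Tokenize and pack into 512-token blocks, then hold out 5% for validation.
    lm_chunk = tokenize_and_group(raw_chunk, block_size=512)
    split = lm_chunk.train_test_split(test_size=0.05)

    # Keep training the same LoRA model on this chunk for 5 epochs,
    # tracking the checkpoint with the best validation loss (hypothetical helper).
    trainer = build_trainer(model, split["train"], split["test"], num_train_epochs=5)
    trainer.train()
    trainer.save_model(f"lora-after-chunk-{chunk_idx}")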
Preprocessing
The text data was preprocessed by tokenizing it and then grouping the tokens into blocks of 512. The DataCollatorForLanguageModeling was used to prepare the data for causal language modeling.
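This is the standard tokenize-then-pack recipe for causal language modeling, and roughly what the tokenize_and_group placeholder in the sketch above would do; raw_chunk stands for one corpus slice as in that sketch, and the "text" column name is an assumption about the dataset schema:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B", trust_remote_code=True)
block_size = 512

def tokenize_function(examples):
    # Tokenize raw articles; the "text" column name is assumed.
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token sequences, then cut them into fixed 512-token blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = raw_chunk.map(tokenize_function, batched=True, remove_columns=raw_chunk.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)

# mlm=False configures the collator for causal (next-token) language modeling.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)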
Training Hyperparameters
- LoRA Rank (r): 16
- LoRA Alpha (lora_alpha): 32
- LoRA Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Epochs per Chunk: 5
- Batch Size: 8 per device
- Gradient Accumulation: 4 steps
- Optimizer: AdamW (adamw_torch)
- Learning Rate: 2e-4
- LR Scheduler: Cosine
- Training Regime: Mixed precision (fp16) with 4-bit quantization (BitsAndBytes NF4); a configuration sketch follows below.
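The hyperparameters above map onto the usual peft / transformers / bitsandbytes configuration objects roughly as follows (settings not listed in this card, such as LoRA dropout or warmup, are omitted rather than guessed):

import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization of the base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapters on all attention and MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Per-chunk training arguments.
training_args = TrainingArguments(
    output_dir="qwen3-arabic-lora",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    fp16=True,
)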
Evaluation
Testing Data, Factors & Metrics
- Testing Data: For each training chunk, a validation set was created by splitting off 5% of the data.
- Metrics: The primary metric for evaluation and model selection was the validation loss, as illustrated in the sketch below.
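Validation loss is what Trainer.evaluate() returns for each chunk's 5% hold-out split; a minimal sketch, reusing the trainer and split from the training sketch above (perplexity is derived here only as a convenience view of the same loss):

import math

# Evaluate on the 5% hold-out split of the current chunk.
metrics = trainer.evaluate()
val_loss = metrics["eval_loss"]

# exp(loss) is the corresponding perplexity, shown only for readability.
print(f"validation loss: {val_loss:.4f} (perplexity: {math.exp(val_loss):.2f})")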
Results
The model's performance was monitored by tracking the validation loss during training. The final model represents the checkpoint with the best validation loss. Qualitative comparisons in the training notebook show that the fine-tuned model produces more coherent and contextually relevant Arabic text compared to the base model for the given prompts.
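A qualitative check of this kind can be reproduced by generating from both models with the same prompt; the sketch below uses the generate_arabic_text_lora helper from the quick-start section for the fine-tuned model and a plain transformers pipeline for the base model (sampling settings mirror the helper's defaults):

from transformers import pipeline

# Same prompt for both models, so the outputs can be compared side by side.
prompt = "كان يا ما كان في قديم الزمان،"

# Base model alone.
base_generator = pipeline("text-generation", model="Qwen/Qwen3-0.6B", trust_remote_code=True)
base_text = base_generator(prompt, max_new_tokens=250, do_sample=True, temperature=0.7)[0]["generated_text"]
print(base_text)

# Base model + LoRA adapters (helper defined in "How to Get Started with the Model").
lora_text = generate_arabic_text_lora(
    base_model_id="Qwen/Qwen3-0.6B",
    lora_model_id="YoussefHosni/Qwen3-0.6b-2B-Token-arabic-LoRA-finetuned",
    prompt_text=prompt,
)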
Environmental Impact
- Hardware Type: GPU (as specified in the Kaggle environment)
- Hours used: Approximately 24 hours (calculated from ~35 minutes per chunk for ~40 chunks)
- Cloud Provider: Kaggle
- Compute Region: Not specified in the notebook, but typically US regions.