metadata
base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct
tags:
  - ellora
  - lora
  - long-context
  - repository-understanding
  - code-analysis
  - progressive-training
  - 2m-context
  - unsloth
  - vllm
  - peft
library_name: peft
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
datasets:
  - codelion/Qwen2.5-Coder-0.5B-Instruct-progressive-2M-context

codelion/qwen2-5-coder-0-5b-instruct-progressive-2000k-lora

🚀 Progressive Context Extension to 2.0M Tokens

This is a progressive LoRA adapter that extends Qwen/Qwen2.5-Coder-0.5B-Instruct to handle 2.0 MILLION token contexts through curriculum learning.

Part of the Ellora project - Recipe #4: Progressive Long Context Extension.

🎯 Key Features

  • Final Context: 2,000,000 tokens (62x the base model's context window)
  • Training Method: Hybrid approach with vLLM + Unsloth optimizations
  • Data Generation: vLLM for 10x+ faster task generation
  • Training: Unsloth for memory-efficient progressive training
  • Single Adapter: One LoRA handles all context lengths up to 2000K
  • Use Cases:
    • Entire codebase analysis
    • Multi-repository understanding
    • Large-scale code generation
    • Cross-file dependency analysis

📊 Training Progression

The model was trained progressively through these stages:

  • Stage 1: 32K tokens (loss: 0.4882)
  • Stage 2: 128K tokens (loss: 0.0641)
  • Stage 3: 512K tokens (loss: 0.1327)
  • Stage 4: 2000K tokens (loss: 0.0484)

Performance Metrics

  • Final Training Loss: 0.0484
  • Total Training Time: 0.17 hours
  • Peak Memory Usage: 4.7 GB
  • LoRA Rank: 64
  • LoRA Alpha: 128

🔧 Usage with Unsloth

from unsloth import FastLanguageModel
from transformers import TextStreamer

# Load model with Unsloth (automatically handles 2M context!)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="codelion/qwen2-5-coder-0-5b-instruct-progressive-2000k-lora",
    max_seq_length=2000000,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)

# Enable native fast generation
FastLanguageModel.for_inference(model)

# Example: Analyze a large codebase
prompt = """Repository Context:
[Your repository content up to 2000K tokens]

Question: Analyze the overall architecture and provide improvement suggestions.

Answer:"""

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2000000).to(model.device)
streamer = TextStreamer(tokenizer)

outputs = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    temperature=0.7,
    do_sample=True
)

🔧 Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-0.5B-Instruct")

# Load the progressive adapter
model = PeftModel.from_pretrained(model, "codelion/qwen2-5-coder-0-5b-instruct-progressive-2000k-lora")

# Now you can use contexts up to 2000K tokens!
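For example, continuing from the snippet above, here is a minimal sketch of running generation over a long repository prompt; the repo_dump.txt file and the question are illustrative placeholders, not part of the original card:

# Hypothetical example: repo_dump.txt holds your concatenated repository sources
repo_text = open("repo_dump.txt").read()

prompt = f"Repository Context:\n{repo_text}\n\nQuestion: Summarize the overall architecture.\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))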

📈 Progressive Training Details

This adapter was trained using a novel progressive curriculum approach with hybrid optimizations:

  1. Stage 1 (32K): Basic file-level understanding
  2. Stage 2 (128K): Multi-file repository comprehension
  3. Stage 3 (512K): Large repository analysis
  4. Stage 4 (2M): Massive codebase understanding

Each stage included data from all previous stages, allowing the model to maintain and build upon its learned capabilities.
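A minimal sketch of this curriculum loop is shown below; load_stage_dataset and make_trainer are hypothetical helpers standing in for the actual training script:

from datasets import concatenate_datasets

stages = [32_000, 128_000, 512_000, 2_000_000]  # context length per stage
accumulated = []

for max_len in stages:
    # Hypothetical helper: returns examples whose prompts fit this stage's context length
    accumulated.append(load_stage_dataset(max_len))
    # Each stage trains on its own data plus all previous stages' data
    stage_data = concatenate_datasets(accumulated)
    # Hypothetical helper: builds a trainer around the SAME LoRA adapter every stage
    trainer = make_trainer(model, stage_data, max_seq_length=max_len)
    trainer.train()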

🛠️ Training Configuration

Progressive Stages: 32K β†’ 128K β†’ 512K β†’ 2000K
Final Context: 2000K tokens
Base Model: Qwen/Qwen2.5-Coder-0.5B-Instruct
Data Generation: vLLM (fast batch inference)
Training: Unsloth (memory-efficient training)
LoRA Rank: 64
LoRA Alpha: 128
Learning Rate: 0.0002
Batch Size: 1
Gradient Accumulation: 4
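As a rough illustration, these hyperparameters map onto a peft LoraConfig and transformers TrainingArguments as sketched below; target_modules, precision, and the output directory are assumptions, not values taken from the card:

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,                      # LoRA Rank
    lora_alpha=128,            # LoRA Alpha
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed, not listed in the card
)

training_args = TrainingArguments(
    output_dir="progressive-lora",   # placeholder
    learning_rate=2e-4,              # Learning Rate: 0.0002
    per_device_train_batch_size=1,   # Batch Size: 1
    gradient_accumulation_steps=4,   # Gradient Accumulation: 4
    bf16=True,                       # assumed precision
)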

🚀 Optimizations Used

Data Generation (vLLM)

  • Batch Generation: Process multiple prompts simultaneously
  • Optimized Memory: GPU memory utilization tuning
  • Fast Inference: 10x+ faster than sequential generation
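A hedged sketch of what batched task generation with vLLM can look like; the file list and prompt template are hypothetical:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-0.5B-Instruct",
    gpu_memory_utilization=0.90,   # tuned GPU memory utilization
)
sampling = SamplingParams(temperature=0.7, max_tokens=512)

# Hypothetical: one prompt per repository file, generated in a single batched call
source_files = ["def add(a, b):\n    return a + b\n"]  # placeholder content
prompts = [f"Write an analysis task for this code:\n\n{src}" for src in source_files]

outputs = llm.generate(prompts, sampling)
tasks = [o.outputs[0].text for o in outputs]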

Training (Unsloth)

  • Custom CUDA Kernels: 2-5x training speedup
  • Flash Attention 2: Efficient attention computation
  • Gradient Checkpointing: Memory-efficient backprop
  • 4-bit Quantization: Reduced memory footprint
  • RSLoRA: Rank-stabilized LoRA for better convergence
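For reference, a sketch of how these options are typically enabled through Unsloth's API; rank and alpha match this card, while target_modules, the sequence length, and the remaining flags are assumptions:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-Coder-0.5B-Instruct",
    max_seq_length=32_000,        # grown per curriculum stage
    load_in_4bit=True,            # 4-bit quantization
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed
    use_gradient_checkpointing="unsloth",   # memory-efficient backprop
    use_rslora=True,                        # rank-stabilized LoRA
)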

📊 Evaluation Tasks

The model excels at:

  • Complete repository architectural analysis
  • Cross-file dependency tracing
  • Large-scale refactoring suggestions
  • Security vulnerability detection across entire codebases
  • Test coverage analysis
  • Documentation generation for entire projects

πŸ† Achievements

  • Successfully extended context from 32K β†’ 2000K tokens
  • Hybrid optimization: vLLM for generation + Unsloth for training
  • Single adapter handles all context lengths
  • Memory-efficient training on single H100 GPU
  • Real repository understanding, not just synthetic data

🔗 Links

  • Training dataset: https://huggingface.co/datasets/codelion/Qwen2.5-Coder-0.5B-Instruct-progressive-2M-context

This model is part of the Ellora project - standardized recipes for enhancing LLM capabilities.