# Qwen2.5-0.5B Text Classification Model
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct using LoRA (Low-Rank Adaptation) for text classification tasks. The model has been specifically trained to classify text into three categories based on VoiceBench dataset patterns.
## Model Description
The model has been trained to classify text into three distinct categories:
- ifeval: Instruction-following tasks with specific formatting requirements and step-by-step instructions
- commoneval: Factual questions and knowledge-based queries requiring direct answers
- wildvoice: Conversational, informal language patterns and natural dialogue
## Performance Results

### Overall Performance
- Overall Accuracy: 93.33% (28/30 correct predictions)
- Training Method: LoRA (Low-Rank Adaptation)
- Trainable Parameters: 0.88% of total parameters (4,399,104 out of 498,431,872)
### Per-Category Performance
| Category | Accuracy | Correct/Total | Description |
|---|---|---|---|
| ifeval | 100% | 10/10 | Perfect performance on instruction-following tasks |
| commoneval | 80% | 8/10 | Good performance on factual questions |
| wildvoice | 100% | 10/10 | Perfect performance on conversational text |
### Confusion Matrix

```
ifeval:
  -> ifeval: 10
commoneval:
  -> commoneval: 8
  -> unknown: 1
  -> wildvoice: 1
wildvoice:
  -> wildvoice: 10
```
## Development Journey & Methods Tried

### Initial Challenges
We started with several approaches that didn't work well:
**GRPO (Group Relative Policy Optimization)**: Initial attempts with GRPO training showed poor performance:
- Loss decreased, but the model wasn't learning the classification task
- The model generated irrelevant responses like "unknown", "txt", and "com"
- Overall accuracy: ~20%
**Full Fine-tuning**: Attempted full fine-tuning of larger models:
- CUDA out-of-memory issues with larger models
- Numerical instability with certain model architectures
- Poor convergence on the classification task
**Complex Prompt Formats**: Tried various complex prompt structures:
- "Classify this text as ifeval, commoneval, or wildvoice: ..."
- The model struggled with complex instructions
- It generated explanations instead of simple labels
### Breakthrough: Direct Classification Approach
The key breakthrough came with a direct, simple approach:
#### 1. Simplified Prompt Format
Instead of complex classification prompts, we used a simple format:

```
Text: {input_text}
Label: {expected_label}
```
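As a concrete illustration, a training string built from this template might be assembled like so (the `build_example` helper and the explicit EOS suffix are assumptions about the training script, not verbatim from it):

```python
def build_example(text: str, label: str, eos_token: str) -> str:
    # Appending the tokenizer's EOS token (use tokenizer.eos_token in practice)
    # teaches the model to stop immediately after emitting the label
    return f"Text: {text}\nLabel: {label}{eos_token}"

print(build_example("What is the capital of France?", "commoneval", "<|endoftext|>"))
# Text: What is the capital of France?
# Label: commoneval<|endoftext|>
```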
#### 2. LoRA (Low-Rank Adaptation)
- Used the PEFT library for efficient fine-tuning
- Trained only 0.88% of the parameters
- Much more stable than full fine-tuning
- Faster training, with no added inference cost once the adapter is merged
#### 3. Focused Training Data
Created clear, distinct examples for each category:
- ifeval: Instruction-following with specific formatting requirements
- commoneval: Factual questions requiring direct answers
- wildvoice: Conversational, informal language patterns
#### 4. Optimal Hyperparameters
- Learning Rate: 5e-4 (higher than initial attempts)
- Batch Size: 2 (smaller for stability)
- Max Length: 128 (shorter sequences)
- Training Steps: 150
- LoRA Rank: 8 (focused learning)
## Usage

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the merged model and tokenizer
model = AutoModelForCausalLM.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")
tokenizer = AutoTokenizer.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")

def classify_text(text):
    prompt = f"Text: {text}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            max_new_tokens=15,
            do_sample=True,
            temperature=0.1,  # low temperature keeps output near-deterministic
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(generated[0], skip_special_tokens=True)
    return response[len(prompt):].strip()

# Test examples
print(classify_text("Follow these instructions exactly: Write 3 sentences about cats."))
# Output: ifeval
print(classify_text("What is the capital of France?"))
# Output: commoneval
print(classify_text("Hey, how are you doing today?"))
# Output: wildvoice
```
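To reproduce accuracy and confusion-matrix numbers like those reported above on your own labeled examples, a minimal evaluation loop might look like this (the `test_set` contents are placeholders, not the original 30-sample evaluation set):

```python
from collections import Counter

# Placeholder examples; the original 30-sample evaluation set is not included here
test_set = [
    ("Follow these instructions exactly: Write 3 sentences about cats.", "ifeval"),
    ("What is the capital of France?", "commoneval"),
    ("Hey, how are you doing today?", "wildvoice"),
]

confusion = Counter()
correct = 0
for text, expected in test_set:
    predicted = classify_text(text)
    confusion[(expected, predicted)] += 1
    correct += predicted == expected

print(f"Accuracy: {correct / len(test_set):.2%}")
for (expected, predicted), n in sorted(confusion.items()):
    print(f"{expected} -> {predicted}: {n}")
```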
### Advanced Usage with Confidence Scoring
```python
from collections import Counter

def classify_with_confidence(text, num_samples=5):
    predictions = []
    for _ in range(num_samples):
        prompt = f"Text: {text}\nLabel:"
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            generated = model.generate(
                **inputs,
                max_new_tokens=15,
                do_sample=True,
                temperature=0.3,  # slightly higher for diversity across samples
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        response = tokenizer.decode(generated[0], skip_special_tokens=True)
        prediction = response[len(prompt):].strip().lower()
        # Map the raw generation onto one of the known labels
        if 'ifeval' in prediction:
            prediction = 'ifeval'
        elif 'commoneval' in prediction:
            prediction = 'commoneval'
        elif 'wildvoice' in prediction:
            prediction = 'wildvoice'
        else:
            prediction = 'unknown'
        predictions.append(prediction)

    # Confidence = share of samples that agree on the majority label
    counts = Counter(predictions)
    most_common = counts.most_common(1)[0]
    confidence = most_common[1] / len(predictions)
    return most_common[0], confidence

# Example with confidence
label, confidence = classify_with_confidence("Please follow these steps: 1) Read 2) Think 3) Write")
print(f"Prediction: {label}, Confidence: {confidence:.2%}")
```
## Training Details

### Model Architecture
- Base Model: Qwen/Qwen2.5-0.5B-Instruct
- Parameters: 498,431,872 total, 4,399,104 trainable (0.88%)
- Precision: FP16 (mixed precision)
- Device: CUDA (GPU accelerated)
### Training Configuration
```python
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# LoRA Configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank
    lora_alpha=16,      # LoRA alpha (scaling)
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Training Arguments (the 128-token maximum sequence length is applied at
# tokenization time; TrainingArguments itself has no max_length field)
training_args = TrainingArguments(
    learning_rate=5e-4,
    per_device_train_batch_size=2,
    max_steps=150,
    fp16=True,
    gradient_accumulation_steps=1,
    warmup_steps=20,
    weight_decay=0.01,
    max_grad_norm=1.0,
)
```
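For context, a minimal sketch of how these pieces might be wired into a training run (the tokenization function, dataset handling, and commented-out Trainer call are assumptions; the original training script is not included):

```python
from peft import get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Wrap the base model with the LoRA adapter defined above
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Expected output (approximately):
# trainable params: 4,399,104 || all params: 498,431,872 || trainable%: 0.8826

def tokenize(example):
    # Training strings follow the "Text: ...\nLabel: ..." template,
    # truncated to the 128-token maximum noted above
    tokens = tokenizer(example["text"], truncation=True, max_length=128)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset.map(tokenize))
# trainer.train()
```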
### Dataset
The model was trained on synthetic data representing three text categories:
- 60 total samples (20 per category)
- ifeval: Instruction-following tasks with specific formatting requirements
- commoneval: Factual questions and knowledge-based queries
- wildvoice: Conversational, informal language patterns
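The generation script is not part of this repository; purely as an illustration, the synthetic samples might look like this (all strings below are assumptions, not the actual training data):

```python
# Illustrative examples only; the actual 60-sample training set is not published
synthetic_samples = [
    {"text": "Write exactly 3 bullet points about oceans, each in all capital letters.", "label": "ifeval"},
    {"text": "What is the largest planet in the solar system?", "label": "commoneval"},
    {"text": "Honestly, that movie was kind of a letdown, you know?", "label": "wildvoice"},
]
```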
## Error Analysis

### Failed Predictions (2 out of 30)
"What is 2 plus 2?" โ Predicted:
unknown
(Expected:commoneval
)- Model generated:
#eval{1} Label: #eval{2} Label: #
- Issue: Model generated code-like syntax instead of simple label
- Model generated:
"What is the opposite of hot?" โ Predicted:
wildvoice
(Expected:commoneval
)- Model generated:
#wildvoice:comoneval:hot:yourresponse:whatis
- Issue: Model generated complex response instead of simple label
- Model generated:
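Both failures still contain label-like fragments, so downstream code should normalize raw generations defensively. A small sketch of one way to do that (the `extract_label` helper is an illustration, not part of the model):

```python
import re

def extract_label(generation: str) -> str:
    """Return the first known label mentioned in a raw generation, else 'unknown'."""
    match = re.search(r"ifeval|commoneval|wildvoice", generation.lower())
    return match.group(0) if match else "unknown"

print(extract_label("#eval{1} Label: #eval{2} Label: #"))             # unknown
print(extract_label("#wildvoice:comoneval:hot:yourresponse:whatis"))  # wildvoice
```

Note that this normalization can recover a malformed generation, but not a genuinely wrong prediction: the second failure still maps to `wildvoice`.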
### Success Factors
- Simple prompt format was crucial for success
- LoRA fine-tuning provided stable training
- Focused training data with clear category distinctions
- Appropriate hyperparameters (learning rate, batch size, etc.)
## Technical Implementation

### Files Structure
```
merged_classification_model/
├── README.md                 # This file
├── config.json               # Model configuration
├── generation_config.json    # Generation settings
├── model.safetensors         # Model weights (988MB)
├── tokenizer.json            # Tokenizer vocabulary
├── tokenizer_config.json     # Tokenizer configuration
├── special_tokens_map.json   # Special tokens mapping
├── added_tokens.json         # Added tokens
├── merges.txt                # BPE merges
├── vocab.json                # Vocabulary
└── chat_template.jinja       # Chat template
```
### Dependencies

```bash
pip install "transformers>=4.56.0"
pip install "torch>=2.0.0"
pip install "peft>=0.17.0"
pip install "accelerate>=0.21.0"
```
## Use Cases
This model is particularly useful for:
- Text categorization in educational platforms
- Content filtering based on text type
- Dataset preprocessing for machine learning pipelines
- VoiceBench-style evaluation systems
- Instruction following detection in AI systems
- Conversational vs. factual text separation
## Limitations
- Synthetic Training Data: Model was trained on synthetic data and may not generalize perfectly to all real-world text
- Three-Category Limitation: Only classifies into the three predefined categories
- Prompt Sensitivity: Performance may vary with different prompt formats
- Edge Cases: Some edge cases (like mathematical questions) may be misclassified
- Language: Primarily trained on English text
## Future Improvements
- Larger Training Dataset: Use real VoiceBench data with proper audio transcription
- More Categories: Expand to include additional text types
- Multilingual Support: Train on multiple languages
- Confidence Calibration: Improve confidence scoring
- Few-shot Learning: Add support for few-shot classification
## Citation

```bibtex
@misc{qwen2.5-0.5b-text-classification,
  title={Qwen2.5-0.5B Text Classification Model for VoiceBench-style Evaluation},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/manbeast3b/qwen2.5-0.5b-text-classification}},
  note={Fine-tuned using LoRA on synthetic text classification data}
}
```
## Contributing
Contributions are welcome! Please feel free to:
- Report issues with the model
- Suggest improvements
- Submit pull requests
- Share your use cases
## License
This model is released under the Apache 2.0 License. See the LICENSE file for more details.
**Model Performance Summary:**
- ✅ 93.33% overall accuracy
- ✅ 100% ifeval accuracy (instruction-following)
- ✅ 100% wildvoice accuracy (conversational)
- ✅ 80% commoneval accuracy (factual questions)
- ✅ Efficient LoRA fine-tuning (0.88% trainable parameters)
- ✅ Fast inference with small model size
- ✅ Easy to use with a simple API
This model represents a successful application of LoRA fine-tuning for text classification, achieving high accuracy with minimal computational resources.