vit-base-cifar10-augmented

This model is a fine-tuned version of google/vit-base-patch16-224 on the CIFAR-10 dataset using data augmentation.

It achieves the following results:

  • Training loss (final epoch): 0.0445
  • Test accuracy (best epoch): 95.54%

🧠 Model Description

The base model is a Vision Transformer (ViT) originally trained on ImageNet-21k. This version has been fine-tuned on CIFAR-10, a standard image classification benchmark, using PyTorch and Hugging Face Transformers.

Training was done using extensive data augmentation, including random crops, flips, rotations, and color jitter to improve generalization on small input images (32×32, resized to 224×224).

✅ Intended Uses & Limitations

Intended uses

  • Educational and research use on small image classification tasks
  • Benchmarking transfer learning for ViT on CIFAR-10
  • Demonstrating the impact of data augmentation on fine-tuning performance

Limitations

  • Not optimized for real-time inference
  • Fine-tuned only on CIFAR-10; not suitable for general-purpose image classification
  • Requires resized input (224×224)

📦 Training and Evaluation Data

  • Dataset: CIFAR-10
  • Size: 60,000 images (10 classes)
  • Split: 75% training, 25% test

All images were resized to 224×224 and normalized using ViT's original mean/std values.

⚙️ Training Procedure

Hyperparameters

  • Learning rate: 1e-4
  • Optimizer: Adam
  • Batch size: 8
  • Epochs: 10
  • Scheduler: ReduceLROnPlateau
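The optimizer/scheduler setup above can be sketched in PyTorch. The tiny linear model is a stand-in for the fine-tuned ViT, and the `factor`/`patience` values are illustrative assumptions, since the card does not specify them:

```python
import torch

# Sketch of the hyperparameter setup listed above: Adam at lr=1e-4 with a
# ReduceLROnPlateau scheduler stepped on the evaluation metric each epoch.
model = torch.nn.Linear(10, 10)  # placeholder for the ViT classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2  # factor/patience assumed
)

for epoch in range(10):
    # ... forward/backward passes over batches of 8 would go here ...
    val_accuracy = 0.95  # placeholder: metric from the evaluation pass
    scheduler.step(val_accuracy)  # cut LR by 10x when accuracy plateaus

print(optimizer.param_groups[0]["lr"])
```

Because the placeholder metric never improves, the scheduler repeatedly reduces the learning rate, which is the behavior a plateaued accuracy would trigger in the real run.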

Data Augmentation Used

  • RandomResizedCrop(224)
  • RandomHorizontalFlip()
  • RandomRotation(10)
  • ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)

Training Results

Epoch   Training Loss   Test Accuracy
  1        0.1969          94.62%
  2        0.1189          95.05%
  3        0.0899          95.54%
  4        0.0720          94.68%
  5        0.0650          94.84%
  6        0.0576          94.76%
  7        0.0560          95.33%
  8        0.0488          94.31%
  9        0.0499          95.42%
 10        0.0445          94.33%

🧪 Framework Versions

  • transformers: 4.50.0
  • torch: 2.6.0+cu124
  • datasets: 3.4.1
  • tokenizers: 0.21.1
  • Model size: 85.8M parameters (F32, Safetensors)