---
library_name: transformers
license: apache-2.0
base_model: google/vit-base-patch16-224
tags:
- image-classification
- chihiro
- studio-ghibli
- custom-dataset
metrics:
- accuracy
- precision
- recall
model-index:
- name: chihiro-classifier-vit
  results:
    - task:
        type: image-classification
        name: Image Classification
      dataset:
        name: Custom Ghibli Dataset
        type: imagefolder
      metrics:
        - name: Test Accuracy
          type: accuracy
          value: 0.9333
        - name: Zero-shot CLIP Accuracy
          type: accuracy
          value: 0.8667
        - name: Zero-shot Precision
          type: precision
          value: 0.8909
        - name: Zero-shot Recall
          type: recall
          value: 0.8667
---

<!-- This model card was customized based on training logs and evaluation metrics. -->

# chihiro-classifier-vit

This model is a fine-tuned version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224), trained on a small custom binary-classification dataset of images from Studio Ghibli films labeled either "chihiro" or "not chihiro".

It was trained with PyTorch using transfer learning on approximately 148 images.

## Model description

The model classifies images into one of two categories: **Chihiro** or **Not Chihiro**. It uses a Vision Transformer (ViT) backbone with a custom classification head for binary output. Data augmentation (random horizontal flip, rotation up to 30°, color jitter, and random resized crop) was applied during training to improve generalization; a sketch of such a pipeline is shown below.
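
A minimal sketch of such an augmentation pipeline using torchvision. The exact parameter values (jitter strengths, crop scale, normalization constants) are assumptions; the card names only the techniques and the 30° rotation.

```python
from torchvision import transforms

IMAGE_SIZE = 224  # ViT-Base/16 expects 224x224 inputs

# Training-time augmentation, per the techniques listed above
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(IMAGE_SIZE),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(30),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # strengths assumed
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Deterministic preprocessing for validation/test, per the card
eval_transform = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```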

## Intended uses & limitations

**Intended Uses:**
- Student computer vision project

**Limitations:**
- Small dataset may limit real-world performance
- Not robust to domain shift or artistic variation
- Not intended for production deployment

## Training and evaluation data

- Custom image dataset of Chihiro vs. non-Chihiro characters
- Loaded using Hugging Face's `imagefolder` format
- Split: 80% train, 10% validation, 10% test (see the loading sketch after this list)
- Augmentation applied during training; deterministic preprocessing during eval
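
A hedged sketch of how the data could be loaded and split 80/10/10; the directory path and split seed are placeholders, not taken from the card.

```python
# Load the custom dataset with Hugging Face Datasets' `imagefolder` builder.
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="path/to/chihiro_dataset", split="train")

# 80/10/10 split: hold out 20%, then split the holdout in half for val/test.
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
```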

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a setup sketch follows the list):
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam
- num_epochs: 12
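
A minimal setup sketch consistent with these hyperparameters. Loading the backbone through `ViTForImageClassification` is an assumption; the card describes a custom PyTorch loop, so only the optimizer, learning rate, loss, and seed below come directly from the list above.

```python
import torch
from torch.optim import Adam
from transformers import ViTForImageClassification

torch.manual_seed(42)  # seed from the hyperparameter list

# Swap the 1000-class ImageNet head for a binary head (chihiro / not chihiro).
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=2,
    ignore_mismatched_sizes=True,
)

optimizer = Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# A standard supervised loop over 12 epochs with batch size 32 would follow.
```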

### Training results

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|:-----:|:----------:|:---------:|:--------:|:-------:|
| 1     | 0.8325     | 58.47%    | 0.7285   | 46.67%  |
| 2     | 0.6038     | 55.08%    | 0.6931   | 60.00%  |
| 3     | 0.6047     | 67.80%    | 0.6170   | 66.67%  |
| 4     | 0.4854     | 77.97%    | 0.7272   | 66.67%  |
| 5     | 0.3989     | 79.66%    | 0.5494   | 66.67%  |
| 6     | 0.3091     | 88.14%    | 0.4649   | 86.67%  |
| 7     | 0.2651     | 88.98%    | 0.5736   | 73.33%  |
| 8     | 0.2043     | 94.07%    | 0.5335   | 73.33%  |
| 9     | 0.2668     | 87.29%    | 0.5765   | 80.00%  |
| 10    | 0.2408     | 87.29%    | 0.5346   | 73.33%  |
| 11    | 0.1047     | 95.76%    | 0.4125   | 73.33%  |
| 12    | 0.1297     | 94.07%    | 0.4084   | 86.67%  |


### Final Test Evaluation

- `Test Loss`: 0.3677
- `Test Accuracy`: 0.7333
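
A hedged sketch of how these test metrics could be computed, reusing `model` and `criterion` from the setup sketch above; `test_loader` is a placeholder `DataLoader` over the test split.

```python
import torch

model.eval()
correct, total, loss_sum = 0, 0, 0.0

with torch.no_grad():
    for pixel_values, labels in test_loader:  # placeholder DataLoader
        logits = model(pixel_values=pixel_values).logits
        loss_sum += criterion(logits, labels).item() * labels.size(0)
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.size(0)

print(f"Test Loss: {loss_sum / total:.4f}, Test Accuracy: {correct / total:.4f}")
```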


## 🧪 Zero-Shot CLIP Comparison

Evaluated with `openai/clip-vit-base-patch32` and no fine-tuning (a sketch of this evaluation follows the metrics):

- `Zero-shot Accuracy`: 86.67%
- `Precision`: 0.8909
- `Recall`: 0.8667
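
A hedged sketch of this zero-shot evaluation for a single image; the prompt wording is an assumption, since the card does not give the exact text prompts used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
clip = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Hypothetical prompts; the actual wording used for the card is not given.
prompts = [
    "a picture of Chihiro from Spirited Away",
    "a picture of an anime character who is not Chihiro",
]

image = Image.open("example.jpg")  # placeholder path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = clip(**inputs).logits_per_image  # shape (1, 2)

pred = logits.argmax(dim=-1).item()  # 0 -> chihiro, 1 -> not chihiro
```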

## Framework versions

- Transformers: used to load the base checkpoint; training ran in a custom PyTorch loop (no `Trainer`)
- PyTorch: 2.x
- Datasets: 2.x
- Tokenizers: N/A