---
library_name: transformers
license: apache-2.0
base_model: google/vit-base-patch16-224
tags:
  - image-classification
  - chihiro
  - studio-ghibli
  - custom-dataset
metrics:
  - accuracy
  - precision
  - recall
model-index:
  - name: chihiro-classifier-vit
    results:
      - task:
          type: image-classification
          name: Image Classification
        dataset:
          name: Custom Ghibli Dataset
          type: imagefolder
        metrics:
          - name: Test Accuracy
            type: accuracy
            value: 0.9333
          - name: Zero-shot CLIP Accuracy
            type: accuracy
            value: 0.8667
          - name: Zero-shot Precision
            type: precision
            value: 0.8909
          - name: Zero-shot Recall
            type: recall
            value: 0.8667
---

# chihiro-classifier-vit

This model is a fine-tuned version of google/vit-base-patch16-224, trained on a small custom binary classification dataset of images from Studio Ghibli films, each labeled either "chihiro" or "not chihiro".

It was trained using PyTorch with transfer learning on a dataset of approximately 148 images.

## Model description

The model classifies images into one of two categories: Chihiro or Not Chihiro. It uses a Vision Transformer (ViT) backbone with a custom classification head for binary output. Data augmentation was used during training to improve generalization. Techniques included random horizontal flip, rotation (30°), color jitter, and random resized crop.
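A minimal sketch of such an augmentation pipeline with torchvision transforms is shown below. Only the 30° rotation is stated above; the jitter strengths and crop parameters are assumptions, and the 0.5 mean/std normalization matches the ViT checkpoint's preprocessing.

```python
from torchvision import transforms

VIT_MEAN = (0.5, 0.5, 0.5)  # google/vit-base-patch16-224 normalizes inputs to [-1, 1]
VIT_STD = (0.5, 0.5, 0.5)

# Training-time augmentation: the four techniques listed above.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random resized crop to the ViT input size
    transforms.RandomHorizontalFlip(),      # random horizontal flip
    transforms.RandomRotation(30),          # rotation up to 30 degrees
    transforms.ColorJitter(brightness=0.2,  # color jitter (strengths assumed)
                           contrast=0.2,
                           saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(VIT_MEAN, VIT_STD),
])

# Deterministic preprocessing for validation and test.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(VIT_MEAN, VIT_STD),
])
```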

## Intended uses & limitations

Intended uses:

- Student computer vision project

Limitations:

- Small dataset may limit real-world performance
- Not robust to domain shift or artistic variation
- Not intended for production deployment

## Training and evaluation data

- Custom image dataset of Chihiro vs. non-Chihiro characters
- Loaded using Hugging Face's imagefolder format (see the loading sketch after this list)
- Split: 80% train, 10% validation, 10% test
- Augmentation applied during training; deterministic preprocessing during eval
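A minimal loading sketch, assuming a local folder with one subdirectory per class; the `data_dir` path is a placeholder, and only the 80/10/10 ratios (not the split mechanics) come from the card:

```python
from datasets import load_dataset

# imagefolder infers labels from subdirectory names, e.g.
# data/chihiro/*.png and data/not_chihiro/*.png (paths are placeholders).
ds = load_dataset("imagefolder", data_dir="data")["train"]

# 80/10/10 split: hold out 20%, then split the holdout in half.
split = ds.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
```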

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of the training setup follows the list):

- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam
- num_epochs: 12
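A sketch of a fine-tuning loop under these hyperparameters. Loading the backbone via `ViTForImageClassification` is an assumption (the card notes a custom PyTorch loop), and `train_dataset` stands in for the augmented training split:

```python
import torch
from torch.utils.data import DataLoader
from transformers import ViTForImageClassification

torch.manual_seed(42)  # seed: 42

# Replace the 1000-class ImageNet head with a 2-class head.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=2,                  # chihiro / not chihiro
    ignore_mismatched_sizes=True,
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning_rate: 0.0001
loss_fn = torch.nn.CrossEntropyLoss()

# `train_dataset` is assumed to yield (pixel_values, label) pairs,
# e.g. the train split with train_transform applied.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

model.train()
for epoch in range(12):  # num_epochs: 12
    for pixel_values, labels in train_loader:
        optimizer.zero_grad()
        logits = model(pixel_values=pixel_values).logits
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
```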

### Training results

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|------:|-----------:|----------:|---------:|--------:|
| 1 | 0.8325 | 58.47% | 0.7285 | 46.67% |
| 2 | 0.6038 | 55.08% | 0.6931 | 60.00% |
| 3 | 0.6047 | 67.80% | 0.6170 | 66.67% |
| 4 | 0.4854 | 77.97% | 0.7272 | 66.67% |
| 5 | 0.3989 | 79.66% | 0.5494 | 66.67% |
| 6 | 0.3091 | 88.14% | 0.4649 | 86.67% |
| 7 | 0.2651 | 88.98% | 0.5736 | 73.33% |
| 8 | 0.2043 | 94.07% | 0.5335 | 73.33% |
| 9 | 0.2668 | 87.29% | 0.5765 | 80.00% |
| 10 | 0.2408 | 87.29% | 0.5346 | 73.33% |
| 11 | 0.1047 | 95.76% | 0.4125 | 73.33% |
| 12 | 0.1297 | 94.07% | 0.4084 | 86.67% |

### Final test evaluation

- Test Loss: 0.3677
- Test Accuracy: 0.7333

## 🧪 Zero-shot CLIP comparison

Evaluated using openai/clip-vit-base-patch32 with no fine-tuning (a sketch of the evaluation appears below):

- Zero-shot Accuracy: 86.67%
- Precision: 0.8909
- Recall: 0.8667
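A sketch of how such a zero-shot baseline can be run with Hugging Face Transformers; the prompt wording and image path are assumptions, not taken from the original evaluation:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One text prompt per class; the phrasing is an assumption.
prompts = [
    "a picture of Chihiro from Spirited Away",
    "a picture of a Studio Ghibli character who is not Chihiro",
]

image = Image.open("example.jpg")  # placeholder path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

# Score the image against both prompts and pick the more probable label.
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

label = "chihiro" if probs[0, 0] > probs[0, 1] else "not chihiro"
print(label, probs.tolist())
```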

## Framework versions

- Transformers: not used (custom PyTorch training loop)
- PyTorch: 2.x
- Datasets: 2.x
- Tokenizers: N/A