Fine-tuned Vision Transformer

This repository contains a Vision Transformer (ViT) model fine-tuned on the ReCAPTCHAv2-29k dataset. The dataset comprises 29,568 labeled images spanning 5 classes, each resized to a resolution of 224×224 pixels.

Model description

This model builds on a pre-trained ViT backbone and is fine-tuned on the ReCAPTCHAv2-29k dataset. It leverages the transformer-based architecture to capture global contextual information effectively, making it well-suited for tasks with diverse visual patterns like ReCAPTCHA classification.
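
For readers who want to set up a similar fine-tuning run, the following is a minimal sketch using the transformers library. The base checkpoint and the label names are not stated in this card, so both are assumptions/placeholders here.

from transformers import ViTForImageClassification

# Placeholder label map: the card only states that there are 5 classes
id2label = {i: f"class_{i}" for i in range(5)}
label2id = {name: i for i, name in id2label.items()}

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",        # assumed backbone, not confirmed by this card
    num_labels=5,
    id2label=id2label,
    label2id=label2id,
    problem_type="multi_label_classification",  # uses a BCE-with-logits loss for multi-hot targets
)
# `model` can then be fine-tuned on the ReCAPTCHAv2-29k images, e.g. with the Trainer API.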

Intended uses & limitations

This fine-tuned ViT model is designed for multi-label classification tasks involving ReCAPTCHA-like visual patterns. Potential applications include:

  • Automated ReCAPTCHA analysis for research or accessibility tools
  • Benchmarking and evaluation of ReCAPTCHA-solving models
  • Educational purposes, such as studying transformer behavior on visual data

The model is particularly useful in academic and experimental contexts where understanding transformer-based classification on noisy or distorted visual data is a priority.

How to use

Here is how to use this model to classify an image from the ReCAPTCHAv2-29k dataset, predicting which of the 5 classes it contains:

import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

url = "https://raw.githubusercontent.com/nobodyPerfecZ/recaptchav2-29k/refs/heads/master/data/bicycle/bicycle_0.png"
image = Image.open(requests.get(url, stream=True).raw)
processor = ViTImageProcessor.from_pretrained(
    "nobodyPerfecZ/vit-finetuned-patch16-224-recaptchav2-v1"
)
model = ViTForImageClassification.from_pretrained(
    "nobodyPerfecZ/vit-finetuned-patch16-224-recaptchav2-v1"
)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
logits = outputs.logits
# The model predicts which of the 5 classes are present (multi-label):
# a class is assigned when its logit reaches the 0.5 threshold
predictions = (logits >= 0.5).to(int)
predicted_class_indices = torch.where(predictions == 1)[1]
labels = [model.config.id2label[idx.item()] for idx in predicted_class_indices]
print(f"Predicted labels: {labels}")

Training data

The ViT model was fine-tuned on the ReCAPTCHAv2-29k dataset, which consists of 29,568 images across 5 classes.
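
As a rough illustration, and assuming the repository stores the images in a data/<class_name>/ folder layout (as the example URL in the usage snippet suggests), the dataset can be read with torchvision's ImageFolder:

from torchvision import datasets

# Hypothetical local clone of the dataset repository; the path and layout are assumptions
dataset = datasets.ImageFolder("recaptchav2-29k/data")
print(f"{len(dataset)} images, classes: {dataset.classes}")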

Training procedure

Preprocessing

The exact details of image preprocessing during training and validation can be found here.

Images are resized to the same resolution (224×224), rescaled, and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
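
The ViTImageProcessor loaded in the usage snippet applies this pipeline automatically. For reference, a rough torchvision equivalent (a sketch, not necessarily the exact training pipeline) looks like this:

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                      # resize to the model's input resolution
    transforms.ToTensor(),                              # rescale pixel values to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])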

Evaluation results

The ViT model was evaluated on a held-out test set from the ReCAPTCHAv2-29k dataset. Two key metrics were used to assess performance:

Metric             Score
Top-1 Accuracy     0.93
Hamming Accuracy   0.97

  • Top-1 Accuracy reflects the proportion of images where the model's most confident prediction matched a true label
  • Hamming Accuracy measures the fraction of correctly predicted label decisions per sample (a sketch of both computations follows below)
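
As a rough illustration of how these two metrics can be computed for multi-hot targets (the exact evaluation script is not part of this card), consider the following sketch:

import torch

def hamming_accuracy(preds: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of per-class label decisions that match the multi-hot targets."""
    return (preds == targets).float().mean().item()

def top1_accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of samples whose highest-scoring class is among the true labels."""
    top1 = logits.argmax(dim=-1)
    return targets[torch.arange(len(targets)), top1].float().mean().item()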

These results indicate strong classification performance, especially given the visual complexity and distortion typical in ReCAPTCHA images.
