Finetuned Vision Transformer
This repository contains a Vision Transformer (ViT) model fine-tuned on the ReCAPTCHAv2-29k dataset. The dataset comprises 29,568 labeled images spanning 5 classes, each resized to a resolution of 224×224 pixels.
Model description
This model builds on a pre-trained ViT backbone and is fine-tuned on the ReCAPTCHAv2-29k dataset. It leverages the transformer-based architecture to capture global contextual information effectively, making it well-suited for tasks with diverse visual patterns like ReCAPTCHA classification.
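The exact fine-tuning recipe is not documented in this card, but a minimal sketch of how a pre-trained backbone can be configured with a new 5-class head looks as follows. The backbone checkpoint `google/vit-base-patch16-224-in21k` and the multi-label problem type are assumptions for illustration, not confirmed details of this model:

```python
from transformers import ViTForImageClassification

# Assumed backbone checkpoint (the actual pre-trained weights are not stated in this card)
backbone = "google/vit-base-patch16-224-in21k"

# Attach a fresh 5-way classification head; "multi_label_classification" makes the
# model use a sigmoid/BCE loss so each class is predicted independently
model = ViTForImageClassification.from_pretrained(
    backbone,
    num_labels=5,
    problem_type="multi_label_classification",
)
```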
Intended uses & limitations
This fine-tuned ViT model is designed for multi-label classification tasks involving ReCAPTCHA-like visual patterns. Potential applications include:
- Automated ReCAPTCHA analysis for research or accessibility tools
- Benchmarking and evaluation of ReCAPTCHA-solving models
- Educational purposes, such as studying transformer behavior on visual data
The model is particularly useful in academic and experimental contexts where understanding transformer-based classification on noisy or distorted visual data is a priority.
How to use
Here is how to use this model to predict which of the 5 classes an image from the ReCAPTCHAv2-29k dataset belongs to:
```python
import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Load an example image from the ReCAPTCHAv2-29k repository
url = "https://raw.githubusercontent.com/nobodyPerfecZ/recaptchav2-29k/refs/heads/master/data/bicycle/bicycle_0.png"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the fine-tuned model
processor = ViTImageProcessor.from_pretrained(
    "nobodyPerfecZ/vit-finetuned-patch16-224-recaptchav2-v1"
)
model = ViTForImageClassification.from_pretrained(
    "nobodyPerfecZ/vit-finetuned-patch16-224-recaptchav2-v1"
)

# Preprocess the image and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts which of the 5 classes are present in the image
predictions = (logits >= 0.5).int()
predicted_class_indices = torch.where(predictions == 1)[1]
labels = [model.config.id2label[idx.item()] for idx in predicted_class_indices]
print(f"Predicted labels: {labels}")
```
Training data
The ViT model was fine-tuned on the ReCAPTCHAv2-29k dataset, which consists of 29,568 images across 5 classes.
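The card does not prescribe a loading pipeline, but judging from the repository layout used in the example above (e.g. `data/bicycle/bicycle_0.png`), a local clone can be read with a standard class-per-folder loader. The local path below is a placeholder:

```python
from torchvision import datasets, transforms

# Placeholder path to a local clone of the recaptchav2-29k repository
data_root = "recaptchav2-29k/data"

# Resize to the 224x224 resolution used by the model
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder infers the class labels from the sub-directory names
dataset = datasets.ImageFolder(root=data_root, transform=transform)
print(len(dataset), dataset.classes)
```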
Training procedure
Preprocessing
The exact preprocessing details for images during training/validation can be found here.
Images are resized/rescaled to the same resolution (224×224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
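For reference, the same resize and normalization can be reproduced outside the image processor with standard torchvision transforms; this is a sketch restating the values above, not the exact training pipeline:

```python
from torchvision import transforms

# Resize to 224x224, rescale pixel values to [0, 1], then normalize each RGB channel
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),  # rescales to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
```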
Evaluation results
The ViT model was evaluated on a held-out test set from the ReCAPTCHAv2-29k dataset. Two key metrics were used to assess performance:
| Metric | Score |
|---|---|
| Top-1 Accuracy | 0.93 |
| Hamming Accuracy | 0.97 |
- Top-1 Accuracy reflects the proportion of images where the model's most confident prediction matched the true label
- Hamming Accuracy measures the fraction of correctly predicted labels per sample
These results indicate strong classification performance, especially given the visual complexity and distortion typical in ReCAPTCHA images.
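For clarity, here is a small sketch of how the two metrics can be computed from model outputs and binary label matrices. The tensors are placeholders, and the interpretation of Top-1 Accuracy (the highest-scoring class counts as correct if it is among the true labels) is an assumption:

```python
import torch

def top1_accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # Fraction of samples whose highest-scoring class is among the true labels
    top1 = logits.argmax(dim=1)
    hits = targets[torch.arange(targets.size(0)), top1]
    return hits.float().mean().item()

def hamming_accuracy(predictions: torch.Tensor, targets: torch.Tensor) -> float:
    # Fraction of individual label decisions (per sample, per class) that are correct
    return (predictions == targets).float().mean().item()

# Placeholder example with 3 samples and 5 classes
logits = torch.randn(3, 5)
predictions = (logits >= 0.5).int()
targets = torch.tensor([[1, 0, 0, 0, 0],
                        [0, 1, 1, 0, 0],
                        [0, 0, 0, 0, 1]])
print(top1_accuracy(logits, targets), hamming_accuracy(predictions, targets))
```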