---
license: mit
tags:
- computer vision
- image classification
- recaptchav2
datasets:
- recaptchav2-29k
---

# Finetuned Vision Transformer

This repository contains a Vision Transformer (ViT) model fine-tuned on the ReCAPTCHAv2-29k dataset. The dataset comprises 29,568 labeled images spanning 5 classes, each resized to a resolution of 224×224 pixels.

## Model description

This model builds on a pre-trained ViT backbone and is fine-tuned on the ReCAPTCHAv2-29k dataset. It leverages the transformer-based architecture to capture global contextual information effectively, making it well suited for tasks with diverse visual patterns such as ReCAPTCHA classification.

## Intended uses & limitations

This fine-tuned ViT model is designed for multi-label classification tasks involving ReCAPTCHA-like visual patterns. Potential applications include:

- Automated ReCAPTCHA analysis for research or accessibility tools
- Benchmarking and evaluation of ReCAPTCHA-solving models
- Educational purposes, such as studying transformer behavior on visual data

The model is particularly useful in academic and experimental contexts where understanding transformer-based classification on noisy or distorted visual data is a priority.

## How to use

Here is how to use this model to classify an image from the ReCAPTCHAv2-29k dataset into one or more of the 5 classes:

```python
import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

url = "https://raw.githubusercontent.com/nobodyPerfecZ/recaptchav2-29k/refs/heads/master/data/bicycle/bicycle_0.png"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained(
    "nobodyPerfecZ/vit-finetuned-patch16-224-recaptchav2-v1"
)
model = ViTForImageClassification.from_pretrained(
    "nobodyPerfecZ/vit-finetuned-patch16-224-recaptchav2-v1"
)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Multi-label decision: apply a sigmoid and keep every class whose
# probability is at least 0.5 (the model may predict several of the 5 classes)
predictions = (torch.sigmoid(logits) >= 0.5).to(int)
predicted_class_indices = torch.where(predictions == 1)[1]
labels = [model.config.id2label[idx.item()] for idx in predicted_class_indices]
print(f"Predicted labels: {labels}")
```

## Training data

The ViT model was fine-tuned on the [ReCAPTCHAv2-29k dataset](https://huggingface.co/datasets/nobodyPerfecZ/recaptchav2-29k), a dataset consisting of 29,568 images and 5 classes.

## Training procedure

### Preprocessing

The exact details of image preprocessing during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py). Images are resized/rescaled to the same resolution (224×224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). A standalone sketch of an equivalent transform is given at the end of this card.

## Evaluation results

The ViT model was evaluated on a held-out test set from the ReCAPTCHAv2-29k dataset. Two key metrics were used to assess performance:

| Metric           | Score |
| ---------------- | ----- |
| Top-1 Accuracy   | 0.93  |
| Hamming Accuracy | 0.97  |

- Top-1 Accuracy reflects the proportion of images where the model's most confident prediction matches a true label
- Hamming Accuracy measures the fraction of correctly predicted labels per sample

These results indicate strong classification performance, especially given the visual complexity and distortion typical of ReCAPTCHA images.
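For reference, the two metrics can be computed along the following lines. This is a minimal sketch rather than the evaluation code actually used for this model; the 0.5 sigmoid threshold and the multi-hot (0/1) target format are assumptions:

```python
import torch


def top1_accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of samples whose highest-scoring class is a true label.

    logits:  (batch, num_classes) raw model outputs
    targets: (batch, num_classes) multi-hot ground truth (0/1)
    """
    top1 = logits.argmax(dim=1)  # most confident class per sample
    hits = targets[torch.arange(len(targets)), top1]  # 1 if that class is a true label
    return hits.float().mean().item()


def hamming_accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of individual per-class label decisions that are correct."""
    preds = (torch.sigmoid(logits) >= 0.5).to(targets.dtype)
    return (preds == targets).float().mean().item()


# Toy example: 2 samples, 5 classes
logits = torch.tensor([[2.1, -1.0, 0.3, -2.0, -0.5],
                       [-1.5, 1.8, 2.2, -0.7, -1.1]])
targets = torch.tensor([[1, 0, 0, 0, 0],
                        [0, 1, 1, 0, 0]])
print(top1_accuracy(logits, targets))    # 1.0 (both argmax classes are true labels)
print(hamming_accuracy(logits, targets)) # 0.9 (9 of 10 label decisions correct)
```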
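The preprocessing described under Training procedure can also be reproduced without `ViTImageProcessor`. Below is a minimal `torchvision` equivalent; the interpolation mode is an assumption, since the exact resize settings come from the linked input pipeline:

```python
from torchvision import transforms

# Resize to 224×224 and normalize each RGB channel to [-1, 1]
# using mean 0.5 and std 0.5, matching the description above.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # interpolation mode is an assumption
    transforms.ToTensor(),          # scales pixel values to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

# pixel_values = preprocess(image).unsqueeze(0)  # same shape as the processor output
```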