bnina-ayoub committed on
Commit b231bbf · verified · 1 Parent(s): 4ff63d9

Update README.md

Files changed (1):
  1. README.md +33 -8
README.md CHANGED
@@ -1,16 +1,22 @@
  ---
  library_name: transformers
- base_model: bninaos/fine-tuned-vit
  tags:
  - generated_from_trainer
  model-index:
  - name: finetuned-ViT-model
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # finetuned-ViT-model

  This model is a fine-tuned version of [bninaos/fine-tuned-vit](https://huggingface.co/bninaos/fine-tuned-vit) on an unknown dataset.
@@ -19,18 +25,37 @@ It achieves the following results on the evaluation set:

  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

  ## Training and evaluation data

- More information needed

  ## Training procedure

  ### Training hyperparameters

  The following hyperparameters were used during training:
@@ -52,4 +77,4 @@ The following hyperparameters were used during training:
  - Transformers 4.50.1
  - Pytorch 2.5.1+cu121
  - Datasets 3.4.1
- - Tokenizers 0.21.0
  ---
  library_name: transformers
+ base_model:
+ - facebook/detr-resnet-50
  tags:
  - generated_from_trainer
+ - industry
+ - construction
  model-index:
  - name: finetuned-ViT-model
    results: []
+ license: mit
+ datasets:
+ - hf-vision/hardhat
+ language:
+ - en
+ pipeline_tag: object-detection
  ---

  # finetuned-ViT-model

  This model is a fine-tuned version of [bninaos/fine-tuned-vit](https://huggingface.co/bninaos/fine-tuned-vit) on an unknown dataset.
 

  ## Model description

+ This model is a demonstration project for the Hugging Face certification assignment and was created for educational purposes.
+ It is a fine-tuned transformer-based object detection model, trained to detect hard hats, heads, and people in images. It uses the `facebook/detr-resnet-50-dc5` checkpoint as a base and is further trained on the `hf-vision/hardhat` dataset.
+
+ The model uses a transformer encoder-decoder on top of a convolutional backbone to predict bounding boxes and class labels for the objects of interest.
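Since the card describes box-plus-label predictions in the DETR style, one concrete detail is worth illustrating: DETR-family models emit boxes as normalized `(center_x, center_y, width, height)`, which downstream tools usually want as absolute corner coordinates. This is a generic sketch, not code from this repository, and the function name is made up for illustration:

```python
# Sketch of DETR-style box post-processing: predicted boxes are
# normalized (center_x, center_y, width, height); convert them to
# absolute (xmin, ymin, xmax, ymax) pixel coordinates.

def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert one normalized center-format box to pixel corner format."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )

# A normalized box covering the center half of a 640x480 image:
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), 640, 480))
# → (160.0, 120.0, 480.0, 360.0)
```

The `transformers` image processors handle this conversion internally; the sketch only shows the geometry.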

  ## Intended uses & limitations

+ - **Intended uses:** This model can be used to demonstrate object detection with a fine-tuned transformer model. It can potentially be used in safety applications to identify individuals wearing or not wearing hardhats on construction sites or in industrial environments.
+ - **Limitations:** The model was trained on a limited amount of data and may not generalize well to images with significantly different characteristics, viewpoints, or lighting conditions. It is not intended for production use without further evaluation and validation.

  ## Training and evaluation data

+ - **Dataset:** The model was trained on the `hf-vision/hardhat` dataset from Hugging Face Datasets, which contains images of construction sites and industrial settings annotated with hardhats, heads, and people.
+ - **Data splits:** The dataset is divided into "train" and "test" splits.
+ - **Data augmentation:** Data augmentation was applied during training using `albumentations` to improve model generalization, including random horizontal flips and random brightness/contrast adjustments.
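One subtlety of the horizontal-flip augmentation above is that bounding boxes must be mirrored together with the pixels. The card says `albumentations` handled this; the toy function below is a hypothetical stand-in that shows the geometry by hand:

```python
# Toy version of the horizontal-flip augmentation's effect on boxes:
# mirroring the image across its vertical axis means every box's x-range
# must be mirrored too (y-coordinates are unchanged).

def hflip_box_xyxy(box, img_w):
    """Mirror an (xmin, ymin, xmax, ymax) box in an image of width img_w."""
    xmin, ymin, xmax, ymax = box
    # The old right edge becomes the new left edge, and vice versa.
    return (img_w - xmax, ymin, img_w - xmin, ymax)

# A hard-hat box near the left edge of a 640-px-wide image:
print(hflip_box_xyxy((10, 50, 110, 150), 640))
# → (530, 50, 630, 150)
```

In `albumentations` this is what passing a `bbox_params` configuration to the transform pipeline takes care of automatically.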

  ## Training procedure

+ - **Base model:** The model was initialized from the `facebook/detr-resnet-50-dc5` checkpoint, a pre-trained DETR model with a ResNet-50 backbone.
+ - **Fine-tuning:** The model was fine-tuned using the Hugging Face `Trainer` with the following hyperparameters:
+   - Learning rate: 1e-6
+   - Weight decay: 1e-4
+   - Batch size: 1
+   - Epochs: 3
+   - Max steps: 500
+   - Optimizer: AdamW
+ - **Evaluation:** The model was evaluated on the test set using standard object detection metrics, including COCO metrics (Average Precision, Average Recall).
+ - **Hardware:** Training was performed on Google Colab using GPU acceleration.
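The fine-tuning setup above could be sketched with the Hugging Face `Trainer` roughly as follows. This is a hedged reconstruction from the listed hyperparameters, not the repository's actual training script; `model`, `train_ds`, and `collate_fn` are placeholders:

```python
from transformers import TrainingArguments, Trainer

# Hyperparameters taken from the list above; everything else is assumed.
args = TrainingArguments(
    output_dir="finetuned-ViT-model",
    learning_rate=1e-6,
    weight_decay=1e-4,
    per_device_train_batch_size=1,
    num_train_epochs=3,
    max_steps=500,                 # when set, max_steps overrides num_train_epochs
    remove_unused_columns=False,   # keep image/annotation columns for the collator
)

# AdamW is the Trainer's default optimizer, matching the card.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   data_collator=collate_fn)
# trainer.train()
```

With a batch size of 1, the 500-step cap means at most 500 images are seen per pass, which is consistent with the card's note that training was limited.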
  ### Training hyperparameters

  The following hyperparameters were used during training:

  […]

  - Transformers 4.50.1
  - Pytorch 2.5.1+cu121
  - Datasets 3.4.1
+ - Tokenizers 0.21.0