Update README.md

README.md (CHANGED)
@@ -9,37 +9,80 @@ metrics:
model-index:
- name: vit-swin-base-224-gpt2-image-captioning
  results: []
license: mit
language:
- en
pipeline_tag: image-to-text
---

# vit-swin-base-224-gpt2-image-captioning

This model is a [VisionEncoderDecoder](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder) model fine-tuned on 60% of the [COCO2014](https://huggingface.co/datasets/HuggingFaceM4/COCO) dataset.
It achieves the following results on the test set:

- Loss: 0.7989
- Rouge1: 53.1153
- Rouge2: 24.2307
- Rougel: 51.5002
- Rougelsum: 51.4983
- Bleu: 17.7765
- Gen Len: 11.2946
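
For reference, ROUGE/BLEU numbers like these can be computed with the [evaluate](https://huggingface.co/docs/evaluate) library. A minimal sketch, with hypothetical captions standing in for model outputs and COCO references:

```python
import evaluate

# hypothetical model outputs and reference captions
predictions = ["a large bus parked on the side of a road"]
references = [["a big bus sits by the side of the street"]]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# evaluate returns scores in [0, 1]; the numbers above are scaled by 100
print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```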

## Model description

The model was initialized with [microsoft/swin-base-patch4-window7-224-in22k](https://huggingface.co/microsoft/swin-base-patch4-window7-224-in22k) as the vision encoder and [gpt2](https://huggingface.co/gpt2) as the text decoder.
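
For context, this pairing is the standard `VisionEncoderDecoderModel` composition. A minimal sketch of how such a model can be assembled (illustrative, not the exact training setup):

```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast

# stitch a Swin vision encoder to a GPT-2 decoder; the cross-attention
# layers connecting them start out randomly initialized
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k",  # vision encoder
    "gpt2",                                          # text decoder
)

# GPT-2 has no padding token, so the EOS token is commonly reused for padding
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
```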

## Intended uses & limitations

You can use this model for image captioning only.

## How to use

You can either use the simple pipeline API:

```python
from transformers import pipeline

image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")

# infer the caption
caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]['generated_text']
print(f"caption: {caption}")
```

Or initialize everything for more flexibility:

```python
import requests
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor

# a helper to load an image from a URL or a local path
def load_image(image_path):
    if image_path.startswith("http"):
        return Image.open(requests.get(image_path, stream=True).raw)
    return Image.open(image_path)

# a function to perform inference
def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # preprocess the image
    img = image_processor(image, return_tensors="pt").to(device)
    # generate the caption (using greedy decoding by default)
    output = model.generate(**img)
    # decode the output
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

device = "cuda" if torch.cuda.is_available() else "cpu"
# load the fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")

# target image
url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
# get the caption
caption = get_caption(model, image_processor, tokenizer, url)
print(f"caption: {caption}")
```
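
Greedy decoding is the default in `generate`; you can pass the usual decoding arguments for potentially better captions (illustrative values, not settings used by this card):

```python
# e.g., replace the generate call above with beam search and a longer budget
output = model.generate(**img, num_beams=3, max_new_tokens=32)
```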

## Training procedure

You can check [this guide](https://www.thepythoncode.com/article/image-captioning-with-pytorch-and-transformers-in-python) to learn how this model was fine-tuned.
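
For orientation, fine-tuning such a model typically goes through `Seq2SeqTrainer`. A heavily condensed sketch (names and values are illustrative, not the guide's exact code; `train_ds`/`eval_ds` stand for preprocessed COCO splits of pixel values and tokenized captions):

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k", "gpt2"
)

training_args = Seq2SeqTrainingArguments(
    output_dir="image-captioning-run",  # hypothetical output path
    per_device_train_batch_size=16,     # illustrative value
    num_train_epochs=2,                 # illustrative value
    predict_with_generate=True,         # generate captions during evaluation for ROUGE/BLEU
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # hypothetical: preprocessed COCO2014 training split
    eval_dataset=eval_ds,    # hypothetical: preprocessed COCO2014 validation split
)
trainer.train()
```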

### Training hyperparameters

The following hyperparameters were used during training:

@@ -67,4 +110,4 @@

### Framework versions

- Transformers 4.26.0
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2