Abdou committed on
Commit c359c0d · 1 Parent(s): bc5d20b

Update README.md

Files changed (1)
  1. README.md +60 -17
README.md CHANGED
@@ -9,37 +9,80 @@ metrics:
  model-index:
  - name: vit-swin-base-224-gpt2-image-captioning
  results: []
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # vit-swin-base-224-gpt2-image-captioning
 
- This model is a fine-tuned version of [](https://huggingface.co/) on the coco dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.7923
- - Rouge1: 41.8451
- - Rouge2: 16.3493
- - Rougel: 38.0288
- - Rougelsum: 38.049
- - Bleu: 10.2776
- - Gen Len: 11.2946
 
  ## Model description
 
- More information needed
 
  ## Intended uses & limitations
 
- More information needed
 
- ## Training and evaluation data
 
- More information needed
 
  ## Training procedure
 
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
@@ -67,4 +110,4 @@ The following hyperparameters were used during training:
  - Transformers 4.26.0
  - Pytorch 1.13.1+cu116
  - Datasets 2.9.0
- - Tokenizers 0.13.2
 
  model-index:
  - name: vit-swin-base-224-gpt2-image-captioning
  results: []
+ license: mit
+ language:
+ - en
+ pipeline_tag: image-to-text
  ---
 
  # vit-swin-base-224-gpt2-image-captioning
 
+ This model is a [VisionEncoderDecoder](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder) model fine-tuned on 60% of the [COCO2014](https://huggingface.co/datasets/HuggingFaceM4/COCO) dataset.
+ It achieves the following results on the testing set:
+ - Loss: 0.7989
+ - Rouge1: 53.1153
+ - Rouge2: 24.2307
+ - Rougel: 51.5002
+ - Rougelsum: 51.4983
+ - Bleu: 17.7765
 
  ## Model description
 
+ The model was initialized with [microsoft/swin-base-patch4-window7-224-in22k](https://huggingface.co/microsoft/swin-base-patch4-window7-224-in22k) as the vision encoder and [gpt2](https://huggingface.co/gpt2) as the text decoder.
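+ 
+ As a rough illustration (not code from this repository), a VisionEncoderDecoder model like this one is typically assembled from the two base checkpoints with `from_encoder_decoder_pretrained` before fine-tuning; the snippet below is a minimal, hypothetical sketch of that step:
+ 
+ ```python
+ from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
+ 
+ # combine the Swin encoder and the GPT-2 decoder into a single encoder-decoder model
+ model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
+     "microsoft/swin-base-patch4-window7-224-in22k",  # vision encoder
+     "gpt2",                                          # text decoder
+ )
+ tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
+ image_processor = ViTImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")
+ 
+ # GPT-2 has no padding token, so reusing the end-of-sequence token is a common convention
+ tokenizer.pad_token = tokenizer.eos_token
+ model.config.decoder_start_token_id = tokenizer.bos_token_id
+ model.config.pad_token_id = tokenizer.pad_token_id
+ ```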
 
  ## Intended uses & limitations
 
+ You can use this model for image captioning only.
+ 
+ ## How to use
+ 
+ You can either use the simple pipeline API:
+ 
+ ```python
+ from transformers import pipeline
+ 
+ image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")
+ # infer the caption
+ caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]['generated_text']
+ print(f"caption: {caption}")
+ ```
 
+ Or initialize everything for more flexibility:
 
+ ```python
+ from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
+ import torch
+ import requests
+ from PIL import Image
+ 
+ # helper to load an image from a local path or a URL
+ def load_image(image_path):
+     if image_path.startswith("http"):
+         return Image.open(requests.get(image_path, stream=True).raw).convert("RGB")
+     return Image.open(image_path).convert("RGB")
+ 
+ # a function to perform inference
+ def get_caption(model, image_processor, tokenizer, image_path):
+     image = load_image(image_path)
+     # preprocess the image
+     img = image_processor(image, return_tensors="pt").to(device)
+     # generate the caption (using greedy decoding by default)
+     output = model.generate(**img)
+     # decode the output
+     caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
+     return caption
+ 
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ # load the fine-tuned image captioning model and corresponding tokenizer and image processor
+ model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
+ tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
+ image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
+ 
+ # target image
+ url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
+ # get the caption
+ caption = get_caption(model, image_processor, tokenizer, url)
+ print(f"caption: {caption}")
+ ```
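+ 
+ Greedy decoding is the default; as a small usage note, standard generation arguments such as beam search can be passed to `generate`. Continuing from the snippet above (the parameter values here are arbitrary examples, not settings from this repository):
+ 
+ ```python
+ # preprocess the same image, then decode with beam search and a caption length limit (illustrative values)
+ img = image_processor(load_image(url), return_tensors="pt").to(device)
+ output = model.generate(**img, num_beams=4, max_length=32)
+ caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
+ print(f"beam-search caption: {caption}")
+ ```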
 
  ## Training procedure
 
+ You can check [this guide](https://www.thepythoncode.com/article/image-captioning-with-pytorch-and-transformers-in-python) to learn how this model was fine-tuned.
+ 
  ### Training hyperparameters
 
  The following hyperparameters were used during training:

  - Transformers 4.26.0
  - Pytorch 1.13.1+cu116
  - Datasets 2.9.0
+ - Tokenizers 0.13.2