---
language:
- en
tags:
- image-to-text
---

## lokibots/vit-patch16-1280-gpt2-large-image-summary

This model generates a text summary of a chart image. It accepts an input image of up to 1280x768 pixels and produces a summary describing the image's contents. **However, the model still requires further training.**

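To make the input-size note concrete, here is a small pure-Python sketch. It makes two assumptions based only on the checkpoint name. First, the ViT encoder splits the image into 16x16-pixel patches. Second, 1280x768 means width x height. `fit_within`, `num_patches`, and the constants are hypothetical helpers, not part of the model or the `transformers` API. The image processor may already resize inputs to its configured size, so treat this only as a way to reason about dimensions.

```python
from __future__ import annotations

# Hypothetical constants: assumed from "patch16" and "1280" in the model name
# and the 1280x768 limit in the description (orientation assumed width x height).
PATCH_SIZE = 16
MAX_WIDTH, MAX_HEIGHT = 1280, 768


def fit_within(width: int, height: int,
               max_width: int = MAX_WIDTH,
               max_height: int = MAX_HEIGHT) -> tuple[int, int]:
    """Return a (width, height) that fits inside max_width x max_height,
    keeping the aspect ratio and never upscaling."""
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, int(width * scale)), max(1, int(height * scale))


def num_patches(width: int, height: int, patch_size: int = PATCH_SIZE) -> int:
    """Count the non-overlapping patch tokens a ViT encoder would produce
    for this size (excluding the [CLS] token)."""
    return (width // patch_size) * (height // patch_size)


if __name__ == "__main__":
    print(fit_within(2560, 1440))               # a 16:9 chart, downscaled by half
    print(num_patches(MAX_WIDTH, MAX_HEIGHT))   # patch count at the full input size
```

Under these assumptions, a full 1280x768 input yields 80 × 48 = 3,840 patches for the encoder. You can pass the size returned by `fit_within` to PIL's `Image.resize` before preprocessing.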
## Sample inference code

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2Tokenizer
from PIL import Image

model_id = "lokibots/vit-patch16-1280-gpt2-large-image-summary"

# Load the ViT-GPT2 encoder-decoder model and its image processor
model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = ViTImageProcessor.from_pretrained(model_id)

# The decoder uses the GPT-2 large tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")

# Replace "image_file" with the path to your chart image
image = Image.open("image_file").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Beam search with 4 beams, capped at 1024 tokens
gen_kwargs = {"max_length": 1024, "num_beams": 4}
output_ids = model.generate(pixel_values, **gen_kwargs)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(preds[0])
```