Update model card for ESTR-CoT

#2
by nielsr - opened
Files changed (1)
  1. README.md +24 -36
README.md CHANGED
@@ -1,50 +1,38 @@
  ---
- language:
- - en
- pipeline_tag: visual-question-answering
  library_name: transformers
-
- inference: false
  ---

- <br>
- <br>
-
- # BLIVA Model Card
-
- ## Model details
-
- **Model type:**
- BLIVA is an open-source Vision-Language model trained by initializing from InstructBLIP and aligning with Vicuna on multimodal instruction-finetuning data.
- It is composed of an EVA-CLIP vision encoder, a Q-Former, a projection layer, and an auto-regressive language model based on the decoder-only transformer architecture.
-
- **Model date:**
- BLIVA_Vicuna was trained in July 2023.

- **Paper or resources for more information:**
- https://gordonhu608.github.io/bliva/

- **License:**
- Non-commercial bespoke license

- **Where to send questions or comments about the model:**
- https://github.com/mlpc-ucsd/BLIVA

- ## Intended use
- **Primary intended uses:**
- The primary use of BLIVA is research on large multimodal models.
 
- **Primary intended users:**
- The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

- ## Training dataset
- Pre-train data: 558K filtered image-text pairs from LAION, CC-3M, and SBU, selected by LLaVA.

- Instruction-finetuning data: COCO-Caption, TextCaps, VQAv2, OKVQA, A-OKVQA, LLaVA-150K, OCR-VQA.

- ## Evaluation dataset
- For zero-shot evaluation on general image tasks, we selected NoCaps, Flickr30K, VizWiz, Visual Spatial Reasoning (VSR), IconQA, Visual Dialog, ScienceQA, MSRVTT QA, TextVQA, and Hateful Memes.

- For zero-shot evaluation on text-rich image OCR tasks, we selected ST-VQA, OCR-VQA, Text-VQA, and Doc-VQA.

- More details are in our GitHub repository: https://github.com/mlpc-ucsd/BLIVA
  ---
+ language: en
  library_name: transformers
+ pipeline_tag: image-text-to-text
+ license: cc-by-nc-4.0
  ---

+ # ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

+ This repository hosts the **ESTR-CoT** model, presented in the paper [ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning](https://huggingface.co/papers/2507.02200).

+ ## Model Description

+ ESTR-CoT (Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning) is a framework for scene text recognition on event camera streams. It targets the scenarios where existing methods struggle, such as low illumination and fast motion, and pairs each recognition result with explicit, interpretable contextual reasoning.

+ The core of ESTR-CoT is a chain-of-thought (CoT) reasoning pipeline. Specifically:
+ - An **EVA-CLIP (ViT-G/14) vision encoder** transforms the input event stream into visual tokens.
+ - A **Llama tokenizer** encodes the generation prompt.
+ - A **Q-Former** aligns the vision tokens with the pre-trained large language model **Vicuna-7B**.

+ This architecture lets ESTR-CoT output the recognition answer and a detailed chain-of-thought reasoning process simultaneously, as illustrated by the sketch below. The framework is optimized with end-to-end supervised fine-tuning on a newly proposed large-scale CoT dataset, built through a three-stage pipeline (generation, polishing, and expert verification) that provides a solid foundation for training reasoning-based large models.
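
+ Below is a minimal, illustrative PyTorch sketch of this pipeline. It is not the authors' implementation (the official code has not been released); the module names, default dimensions, and the single cross-attention layer standing in for the full Q-Former are assumptions made for clarity.

+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class ESTRCoTSketch(nn.Module):
+     """Illustrative skeleton of the ESTR-CoT pipeline described above."""
+ 
+     def __init__(self, vis_dim=1408, llm_dim=4096, num_queries=32):
+         super().__init__()
+         # Stand-in for the frozen EVA-CLIP ViT-G/14 encoder; assumed to
+         # already return token embeddings of width vis_dim.
+         self.vision_encoder = nn.Identity()
+         # Learnable query tokens plus one cross-attention layer as a
+         # stand-in for the full Q-Former stack.
+         self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, vis_dim))
+         self.qformer = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
+         # Projection from Q-Former width into the Vicuna-7B embedding space.
+         self.llm_proj = nn.Linear(vis_dim, llm_dim)
+ 
+     def forward(self, event_tokens, prompt_embeds):
+         # 1. Event-stream representation -> visual tokens (EVA-CLIP in the paper).
+         vis_tokens = self.vision_encoder(event_tokens)
+         # 2. Query tokens cross-attend over the visual tokens (Q-Former).
+         queries = self.query_tokens.expand(vis_tokens.size(0), -1, -1)
+         aligned, _ = self.qformer(queries, vis_tokens, vis_tokens)
+         # 3. Projected visual tokens are prepended to the prompt embeddings
+         #    (the prompt itself is encoded with the Llama tokenizer);
+         #    Vicuna-7B would then generate the CoT reasoning and the answer.
+         return torch.cat([self.llm_proj(aligned), prompt_embeds], dim=1)
+ ```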
 
+ Extensive experiments on benchmark datasets (EventSTR, WordArt*, IC15*) have validated the effectiveness and interpretability of ESTR-CoT.

+ ## Code and Usage

+ The source code and pre-trained models for ESTR-CoT will be released by the authors. Please refer to the official paper for updates regarding code availability.

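+ As a purely hypothetical sketch of what loading might look like once weights are published (the repository id, auto classes, and preprocessing below are placeholders, not a confirmed API):

+ ```python
+ from transformers import AutoProcessor, AutoModelForVision2Seq
+ from PIL import Image
+ 
+ # Placeholder repo id; substitute the official id once the authors release it.
+ model_id = "your-org/ESTR-CoT"
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForVision2Seq.from_pretrained(model_id)
+ 
+ # Event streams are typically rendered into frame-like representations first.
+ event_frame = Image.open("event_frame.png")
+ prompt = "What is the text in this scene? Explain your reasoning step by step."
+ inputs = processor(images=event_frame, text=prompt, return_tensors="pt")
+ generated = model.generate(**inputs, max_new_tokens=256)
+ print(processor.batch_decode(generated, skip_special_tokens=True)[0])
+ ```
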
+ ## Citation

+ ```bibtex
+ @article{estrcot2025,
+   title={ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning},
+   author={Anonymous}, % Author names will be added upon official publication
+   journal={arXiv preprint arXiv:2507.02200},
+   year={2025}
+ }
+ ```