Update model card for ESTR-CoT

#2
by nielsr - opened
Files changed (1)
  1. README.md +24 -36
README.md CHANGED
@@ -1,50 +1,38 @@
  ---
- language:
- - en
- pipeline_tag: visual-question-answering
  library_name: transformers
-
- inference: false
  ---

- <br>
- <br>
-
- # BLIVA Model Card
-
- ## Model details
-
- **Model type:**
- BLIVA is an open-source Vision-Language model trained by initializing from InstructBLIP and aligning with Vicuna on multimodal instruction-finetuning data.
- It is composed of an EVA-CLIP vision encoder, a Q-Former, a projection layer, and an auto-regressive language model based on the decoder-only transformer architecture.
-
- **Model date:**
- BLIVA_Vicuna was trained in July 2023.

- **Paper or resources for more information:**
- https://gordonhu608.github.io/bliva/

- **License:**
- Non-commercial bespoke license

- **Where to send questions or comments about the model:**
- https://github.com/mlpc-ucsd/BLIVA

- ## Intended use
- **Primary intended uses:**
- The primary use of BLIVA is research on large multimodal models.
 
- **Primary intended users:**
- The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

- ## Training dataset
- Pre-train data: 558K filtered image-text pairs from LAION, CC-3M, and SBU, selected by LLaVA.

- Instruction-finetuning data: COCO-Caption, TextCaps, VQAv2, OKVQA, A-OKVQA, LLaVA-150K, OCR-VQA.

- ## Evaluation dataset
- For zero-shot evaluation on general image tasks, we selected NoCaps, Flickr30K, VizWiz, Visual Spatial Reasoning (VSR), IconQA, Visual Dialog, ScienceQA, MSRVTT QA, TextVQA, and Hateful Memes.

- For zero-shot evaluation on text-rich image OCR tasks, we selected ST-VQA, OCR-VQA, Text-VQA, and Doc-VQA.

- More details are in our GitHub repository: https://github.com/mlpc-ucsd/BLIVA
  ---
+ language: en
  library_name: transformers
+ pipeline_tag: image-text-to-text
+ license: cc-by-nc-4.0
  ---

+ # ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

+ This repository hosts the **ESTR-CoT** model, presented in the paper [ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning](https://huggingface.co/papers/2507.02200).

+ ## Model Description

+ ESTR-CoT (Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning) is a framework for scene text recognition on event camera streams. It targets the scenarios where existing methods struggle, such as low illumination and fast motion, and pairs each recognition result with explicit, interpretable contextual reasoning.

+ The core of ESTR-CoT is a chain-of-thought (CoT) reasoning pipeline. Specifically:
+ - An **EVA-CLIP (ViT-G/14) vision encoder** transforms the input event stream into visual tokens.
+ - A **Llama tokenizer** encodes the generation prompt.
+ - A **Q-Former** aligns the vision tokens with the pre-trained large language model **Vicuna-7B**.

+ This architecture lets ESTR-CoT output the recognition answer and a detailed chain-of-thought reasoning process simultaneously, as illustrated by the sketch below. The framework is optimized with end-to-end supervised fine-tuning on a newly proposed large-scale CoT dataset, built through a three-stage pipeline (generation, polishing, and expert verification) that provides a solid foundation for training reasoning-based large models.
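
+ Below is a minimal, illustrative PyTorch sketch of this pipeline. It is not the authors' implementation (the official code has not been released); the module names, default dimensions, and the single cross-attention layer standing in for the full Q-Former are assumptions made for clarity.

+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class ESTRCoTSketch(nn.Module):
+     """Illustrative skeleton of the ESTR-CoT pipeline described above."""
+ 
+     def __init__(self, vis_dim=1408, llm_dim=4096, num_queries=32):
+         super().__init__()
+         # Stand-in for the frozen EVA-CLIP ViT-G/14 encoder; assumed to
+         # already return token embeddings of width vis_dim.
+         self.vision_encoder = nn.Identity()
+         # Learnable query tokens plus one cross-attention layer as a
+         # stand-in for the full Q-Former stack.
+         self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, vis_dim))
+         self.qformer = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
+         # Projection from Q-Former width into the Vicuna-7B embedding space.
+         self.llm_proj = nn.Linear(vis_dim, llm_dim)
+ 
+     def forward(self, event_tokens, prompt_embeds):
+         # 1. Event-stream representation -> visual tokens (EVA-CLIP in the paper).
+         vis_tokens = self.vision_encoder(event_tokens)
+         # 2. Query tokens cross-attend over the visual tokens (Q-Former).
+         queries = self.query_tokens.expand(vis_tokens.size(0), -1, -1)
+         aligned, _ = self.qformer(queries, vis_tokens, vis_tokens)
+         # 3. Projected visual tokens are prepended to the prompt embeddings
+         #    (the prompt itself is encoded with the Llama tokenizer);
+         #    Vicuna-7B would then generate the CoT reasoning and the answer.
+         return torch.cat([self.llm_proj(aligned), prompt_embeds], dim=1)
+ ```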
 
+ Extensive experiments on benchmark datasets (EventSTR, WordArt*, IC15*) have validated the effectiveness and interpretability of ESTR-CoT.

+ ## Code and Usage

+ The source code and pre-trained models for ESTR-CoT will be released by the authors. Please refer to the official paper for updates regarding code availability.

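+ As a purely hypothetical sketch of what loading might look like once weights are published (the repository id, auto classes, and preprocessing below are placeholders, not a confirmed API):

+ ```python
+ from transformers import AutoProcessor, AutoModelForVision2Seq
+ from PIL import Image
+ 
+ # Placeholder repo id; substitute the official id once the authors release it.
+ model_id = "your-org/ESTR-CoT"
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForVision2Seq.from_pretrained(model_id)
+ 
+ # Event streams are typically rendered into frame-like representations first.
+ event_frame = Image.open("event_frame.png")
+ prompt = "What is the text in this scene? Explain your reasoning step by step."
+ inputs = processor(images=event_frame, text=prompt, return_tensors="pt")
+ generated = model.generate(**inputs, max_new_tokens=256)
+ print(processor.batch_decode(generated, skip_special_tokens=True)[0])
+ ```
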
+ ## Citation

+ ```bibtex
+ @article{estrcot2025,
+   title={ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning},
+   author={Anonymous}, % Author names will be added upon official publication
+   journal={arXiv preprint arXiv:2507.02200},
+   year={2025}
+ }
+ ```