Sudhanshu Pandey committed on
Commit
5dcdd42
·
1 Parent(s): a7b8c18
Files changed (1)
  1. README.md +134 -3
README.md CHANGED
@@ -1,3 +1,134 @@
- ---
- license: mit
- ---
+ # **🌟 Table Extraction Tool: OCR & Computer Vision for Structured Data**
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![Build Status](https://img.shields.io/badge/build-passing-brightgreen.svg)](https://github.com/Sudhanshu1304/table-transformer)
+ [![Stars](https://img.shields.io/github/stars/Sudhanshu1304/table-transformer.svg)](https://github.com/Sudhanshu1304/table-transformer/stargazers)
+ [![Watchers](https://img.shields.io/github/watchers/Sudhanshu1304/table-transformer.svg)](https://github.com/Sudhanshu1304/table-transformer/watchers)
+
+ ## Overview
+
+ Table Transformer is an open-source tool that combines OCR and computer vision models to extract structured tabular data from images. It is suited to preprocessing documents for LLMs, feeding data analysis pipelines, and automating data extraction tasks.
+
+ ## Features
+ - 📊 **Automatic Table Detection**: Effortlessly detect tables in images.
+ - 📝 **OCR-based Document Processing**: Extract text with high accuracy.
+ - 🧠 **Integrated Models**: Seamlessly combine OCR and table detection models.
+ - 💾 **Flexible Export Options**: Export data as DataFrame, HTML, CSV, and more.
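The export step in the last bullet could look roughly like the sketch below. It is a minimal illustration using only the Python standard library, not the tool's own code: `rows`, `to_csv`, and `to_html` are hypothetical names, and the project itself may well rely on pandas for this step.

```python
import csv
import html
import io


def to_csv(rows):
    """Serialise a list of rows (lists of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def to_html(rows):
    """Render rows as a plain HTML table; the first row is treated as the header."""
    header, *body = rows
    lines = ["<table>"]
    lines.append("  <tr>" + "".join(f"<th>{html.escape(c)}</th>" for c in header) + "</tr>")
    for row in body:
        lines.append("  <tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical extracted table: one header row followed by data rows.
rows = [["Item", "Price"], ["Tea, green", "3.50"], ["Milk & honey", "4.20"]]
csv_text = to_csv(rows)
html_text = to_html(rows)
```

Cells containing commas are quoted by `csv.writer`, and `html.escape` keeps characters such as `&` from breaking the HTML output.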
+
+ ---
+
+ ## **Tool Overview**
+
+ <div align="center">
+
+ <!-- First Row -->
+ <img src="images/image1.png" alt="Image upload" width="45%" style="margin: 10px;">
+ <img src="images/image2.png" alt="Table detection & extraction" width="45%" style="margin: 10px;">
+
+ <!-- Second Row -->
+ <img src="images/image3.png" alt="Table in HTML format" width="45%" style="margin: 10px;">
+ <img src="images/image4.png" alt="Table exported as CSV" width="45%" style="margin: 10px;">
+
+ </div>
+
+ ---
+
+ ## **Open-Source Tools Used**
+ - **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)**: For text extraction.
+ - **[Hugging Face Table Detection](https://huggingface.co/foduucom/table-detection-and-extraction)**: For table structure detection.
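One way the outputs of these two tools might be combined is sketched below. The input shapes are assumptions made for illustration: a table box as `(x1, y1, x2, y2)` and OCR results as `(text, box)` pairs. Neither is the actual output format of PaddleOCR or the detection model, and `words_in_table`, `group_into_rows`, and `row_tolerance` are hypothetical names rather than part of this repository.

```python
def center(box):
    """Return the (x, y) centre of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def words_in_table(table_box, words):
    """Keep only OCR words whose centre lies inside the detected table box."""
    tx1, ty1, tx2, ty2 = table_box
    kept = []
    for text, box in words:
        cx, cy = center(box)
        if tx1 <= cx <= tx2 and ty1 <= cy <= ty2:
            kept.append((text, box))
    return kept


def group_into_rows(words, row_tolerance=10):
    """Group words into rows by vertical centre, then order each row left to right."""
    rows = []  # each entry: [reference_y, [(x_centre, text), ...]]
    for text, box in sorted(words, key=lambda w: center(w[1])[1]):
        cx, cy = center(box)
        if rows and abs(cy - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((cx, text))
        else:
            rows.append([cy, [(cx, text)]])
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical detection and OCR results for a small two-column table.
table_box = (0, 0, 200, 100)
ocr_words = [
    ("Price", (110, 10, 160, 30)),
    ("Item", (10, 12, 50, 32)),
    ("Tea", (10, 50, 40, 70)),
    ("3.50", (110, 52, 150, 72)),
    ("Page 1", (10, 300, 60, 320)),  # lies outside the table box
]
table_rows = group_into_rows(words_in_table(table_box, ocr_words))
```

Grouping by a fixed vertical tolerance is the simplest possible heuristic; skewed scans or tables with multi-line cells would need something more robust.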
+
+ ---
+
+ ## **Installation**
+
+ ### **Prerequisites**
+ - Python 3.8+
+ - Conda
+
+ ### **Setup**
+
+ 1. **Clone the Repository**
+
+ Clone the repository to your local machine:
+
+ ```bash
+ git clone https://github.com/Sudhanshu1304/table-transformer.git
+ cd table-transformer
+ ```
+
+ 2. **Create and Activate Conda Environment**
+
+ Create a new conda environment and activate it:
+
+ ```bash
+ conda create --name myenv python=3.12.7
+ conda activate myenv
+ ```
+
+ 3. **Install PaddlePaddle**
+
+ Install the CPU build of PaddlePaddle in the conda environment:
+
+ ```bash
+ python -m pip install paddlepaddle==3.0.0rc1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+ ```
+
+ 4. **Install PaddleOCR**
+
+ Install PaddleOCR:
+
+ ```bash
+ pip install paddleocr
+ ```
+
+ 5. **Install Additional Dependencies**
+
+ Install other required packages:
+
+ ```bash
+ pip install ultralytics pandas
+ pip install streamlit
+ ```
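
After the setup steps, a quick check that every package is importable can save a failed app launch. The sketch below uses only the standard library and is not part of the repository: `missing_modules` is a hypothetical helper, and the module names are assumed to match the packages installed above (`paddle` being the import name of `paddlepaddle`).

```python
import importlib.util

# Import names of the packages installed in the setup steps (assumed).
REQUIRED = ["paddle", "paddleocr", "ultralytics", "pandas", "streamlit"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it inside the activated conda environment; `find_spec` only locates the modules and does not import them, so the check stays fast.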
+
+ ### **Project Structure**
+ ```
+ project/
+ ├── src/
+ │   ├── streamlit_app.py      # Streamlit application
+ │   ├── table_creator/
+ │   │   └── processing.py     # Core processing logic
+ │   ├── models/
+ │   │   └── text.py           # Table detection and text recognition
+ │
+ ├── requirements.txt          # Dependencies
+ ├── README.md                 # Project documentation
+ └── .gitignore                # Git ignore configuration
+ ```
+
+ ### **Usage**
+ Run the Streamlit app to interact with the tool:
+
+ ```bash
+ streamlit run src/streamlit_app.py
+ ```
+
+ ### **Contributions**
+ Contributions are welcome! Please fork the repository and submit a pull request with your improvements or new features.
+
+ ### **License**
+ This project is licensed under the MIT License.
+
+ ---
+
+ ## **Connect with Us**
+ Stay updated and get in touch with any questions or contributions:
+
+ - **GitHub**: [Sudhanshu1304](https://github.com/Sudhanshu1304)
+ - **LinkedIn**: [Sudhanshu Pandey](https://www.linkedin.com/in/sudhanshu-pandey-847448193/)
+ - **Medium**: [@sudhanshu.dpandey](https://medium.com/@sudhanshu.dpandey)
+
+ ---
+
+ ## **Support**
+ If you find this tool useful, please consider giving it a ⭐ on GitHub. Your support is greatly appreciated!
+
+ Happy Extracting!