Commit 6943c52 by KevinHuSh · 1 parent: 41c7a59
refine README (#72)
* refine README
* Update README.md
deepdoc/README.md CHANGED (+3 -11)
@@ -1,8 +1,6 @@
 English | [简体中文](./README_zh.md)
 
-
-
----
+# *Deep*Doc
 
 - [1. Introduction](#1)
 - [2. Vision](#2)
@@ -11,7 +9,6 @@ English | [简体中文](./README_zh.md)
 <a name="1"></a>
 ## 1. Introduction
 
----
 With a bunch of documents from various domains with various formats and along with diverse retrieval requirements,
 an accurate analysis becomes a very challenge task. *Deep*Doc is born for that purpose.
 There 2 parts in *Deep*Doc so far: vision and parser.
@@ -19,8 +16,6 @@ There 2 parts in *Deep*Doc so far: vision and parser.
 <a name="2"></a>
 ## 2. Vision
 
----
-
 We use vision information to resolve problems as human being.
 - OCR. Since a lot of documents presented as images or at least be able to transform to image,
 OCR is a very essential and fundamental or even universal solution for text extraction.
@@ -64,19 +59,16 @@ We use vision information to resolve problems as human being.
 <a name="3"></a>
 ## 3. Parser
 
----
-
 Four kinds of document formats as PDF, DOCX, EXCEL and PPT have their corresponding parser.
 The most complex one is PDF parser since PDF's flexibility. The output of PDF parser includes:
 - Text chunks with their own positions in PDF(page number and rectangular positions).
 - Tables with cropped image from the PDF, and contents which has already translated into natural language sentences.
 - Figures with caption and text in the figures.
 
-###Résumé
+### Résumé
 
----
 The résumé is a very complicated kind of document. A résumé which is composed of unstructured text
 with various layouts could be resolved into structured data composed of nearly a hundred of fields.
 We haven't opened the parser yet, as we open the processing method after parsing procedure.
 
-
+