---
language:
- en
- de
tags:
- parser
- parsing
- PDF
- pdfplumber
- txt
- tables
- python
- windows
- RAG
---

---

# <b>PDF to TXT converter, ready to chunk for your RAG</b>
<b>ONLY WINDOWS</b><br>
<b>EXE and PY available (English and German)</b><br>

<b>&#x21e8;</b> Give me a ❤️ if you like it  ;)<br><br>

Most LLM applications just convert your PDF to plain txt and nothing more, as if you had saved the PDF as a txt file. Text blocks often get mixed up and tables become unreadable. It therefore works somewhat better to convert it with the help of a <b>parser</b>.<br>
I work with "<b>pdfplumber/pdfminer</b>" without OCR, so it is very fast!<br>
<ul style="line-height: 1.05;">
<li>Works with a single PDF, a list of multiple PDFs, or a folder</li>
<li>Intelligent multiprocessing</li>
<li>Error tolerant: if a PDF cannot be converted, it is skipped</li>
<li>Instant view of the result</li>
<li>Converts some common tables to JSON inside the txt files</li>
<li>Adds the absolute PAGE number to each page</li>
<li>All txt files are created in the original folder of the PDF</li>
</ul>
<br>
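
The features listed above can be sketched roughly as follows. This is only a minimal illustration, not the actual code of the EXE or PY file: the function names `format_page` and `convert_pdf` and the `[PAGE n]` marker format are assumptions made for this example. It uses only pdfplumber calls that exist (`pdfplumber.open`, `page.extract_text`, `page.extract_tables`, `page.page_number`).

```python
import json
from pathlib import Path


def format_page(page_number, text, tables):
    """Build the txt output for one page: page marker, text, then tables as JSON."""
    parts = [f"[PAGE {page_number}]"]  # marker format is an assumption
    if text:
        parts.append(text.strip())
    for table in tables:
        # Each table is a list of rows; each row is a list of cell strings or None.
        parts.append(json.dumps(table, ensure_ascii=False))
    return "\n".join(parts)


def convert_pdf(pdf_path):
    """Write <name>.txt next to the PDF; return the txt path, or None if skipped."""
    import pdfplumber  # third-party: pip install pdfplumber

    pdf_path = Path(pdf_path)
    try:
        with pdfplumber.open(pdf_path) as pdf:
            pages = [
                format_page(page.page_number, page.extract_text(), page.extract_tables())
                for page in pdf.pages
            ]
    except Exception as exc:  # error tolerant: skip PDFs that cannot be parsed
        print(f"Skipped {pdf_path.name}: {exc}")
        return None
    # The txt file lands in the same folder as the original PDF.
    txt_path = pdf_path.with_suffix(".txt")
    txt_path.write_text("\n\n".join(pages), encoding="utf-8")
    return txt_path
```

To process many files in parallel, a function like `convert_pdf` could be handed to `concurrent.futures.ProcessPoolExecutor.map`; how the real tool schedules its workers is not shown here.
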
I created this with my own brain and the help of ChatGPT, so sorry, I will not fulfill any wishes unless there are real errors.<br>
Building the GUI and the functions, and compiling it on top of that, is really hard for me.<br>
For the Python file you of course need to install any missing libraries.<br>
<br>
I also have a "<b>docling</b>" parser with OCR, but it will only be the Python file, not compiled.<br>
You have to download all libraries, and when you start it the OCR models are also loaded internally. At the moment I run a kind of multi-docling: the number of parallel calls depends on VRAM and on whether you use OCR only for tables or for everything. I have set VRAM = 16 GB, and the number of parallel docling calls is VRAM/1.3, so it uses ~12 GB and processes 12 PDFs at once, text and tables only, so no images and no diagrams. For now, all PDFs must be in the same folder as the Python file. If you change the settings and VRAM consumption rises, you have to change the 1.3 to 2 or more.
<br><br>
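
The VRAM arithmetic above can be written down as a small helper. This is a sketch under assumptions: the name `docling_workers` and its parameters are made up for illustration, and rounding down to a whole number of workers is assumed from the "12 PDFs at once" figure.

```python
def docling_workers(vram_gb, divisor=1.3):
    """Number of parallel docling calls for a given VRAM size (VRAM / divisor)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    # Round down to whole workers, but always allow at least one.
    return max(1, int(vram_gb // divisor))


print(docling_workers(16))       # 12
print(docling_workers(16, 2.0))  # 8
```

Raising the divisor from 1.3 to 2 lowers the number of parallel calls from 12 to 8 on a 16 GB card, which leaves more VRAM per call.
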

<b>Now have fun and leave a comment if you like  ;)</b><br>
On Discord: "sevenof9"
<br>
<br>
<br>
I am not responsible for any errors or crashes on your system. If you use it, you take full responsibility!