---
language:
- en
- de
tags:
- parser
- parsing
- PDF
- pdfplumber
- docling
- txt
- tables
- python
- windows
- RAG
---
# <b>PDF to TXT converter ready to chunk for your RAG</b>
<b>ONLY WINDOWS</b><br>
<b>EXE and PY available (English and German)</b><br>
better input = better output<br>
<b>⇨</b> give me a ❤️, if you like ;)<br><br>
...
Most LLM applications convert your PDF to plain text and nothing more, as if you had saved the PDF as a txt file. Blocks of text that are close together often get mixed up, and tables cannot be read logically.
Therefore it is better to convert it with the help of a <b>"parser"</b>, so the embedder can find better context.<br>
I work with "<b>pdfplumber/pdfminer</b>" without OCR, so it is very fast!<br>
<ul style="line-height: 1.05;">
<li>Works with a single PDF, a list of multiple PDFs, or a folder</li>
<li>Intelligent multiprocessing</li>
<li>Error tolerant: if a PDF cannot be converted, it is skipped, with no special handling</li>
<li>Instant view of the result: select a PDF in the list at the top</li>
<li>Converts some common tables to JSON inside the txt file</li>
<li>Adds the absolute PAGE number to each page</li>
<li>All txt files are created in the original folder of the PDF</li>
<li>All previous txt files are overwritten</li>
</ul>
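
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code of the EXE or PY file: the helper names (`table_to_records`, `format_page`, `convert_pdf`, `convert_many`), the `PAGE n` marker format, and the choice to treat the first table row as the header are all assumptions. Only `pdfplumber.open`, `page.extract_text()` and `page.extract_tables()` are real pdfplumber calls.

```python
import json
from pathlib import Path


def table_to_records(rows):
    """Turn a table (list of rows, first row assumed to be the header) into dicts."""
    if not rows or len(rows) < 2:
        return []
    header = ["" if cell is None else str(cell).strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def format_page(page_number, text, tables):
    """Build the txt block for one page: page marker, text, then tables as JSON."""
    parts = [f"PAGE {page_number}", text.strip()]  # marker format is an assumption
    for rows in tables:
        records = table_to_records(rows)
        if records:
            parts.append(json.dumps(records, ensure_ascii=False))
    return "\n".join(parts)


def convert_pdf(pdf_path):
    """Write <name>.txt next to the PDF, overwriting any previous txt file."""
    import pdfplumber  # third-party; imported here so the helpers above run without it

    pdf_path = Path(pdf_path)
    blocks = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            blocks.append(format_page(number, text, page.extract_tables()))
    out_path = pdf_path.with_suffix(".txt")
    out_path.write_text("\n\n".join(blocks), encoding="utf-8")
    return out_path


def convert_many(pdf_paths):
    """Convert each PDF in turn; a file that fails is skipped, with no special handling."""
    done, skipped = [], []
    for path in pdf_paths:
        try:
            done.append(str(convert_pdf(path)))
        except Exception:
            skipped.append(str(path))
    return done, skipped
```

Calling `convert_many(["a.pdf", "b.pdf"])` would return the list of written txt files and the list of skipped PDFs. The sketch runs sequentially; multiprocessing is left out to keep it short.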
<br>
I created this with my brain and the help of ChatGPT. I am not a coder, sorry, so I will not fulfill any wishes unless there are real errors.<br>
Building the GUI and the functions, and compiling it on top of that, is really hard for me.<br>
For the Python file, of course, you need to install any missing libraries.<br>
<br>
...
<br>
I also have a "<b>docling</b>" parser with OCR (a GPU is needed for fast processing). It is only a Python file, not compiled.<br>
You have to download all the libraries, and on the first start the OCR models are also downloaded internally. At the moment I have prepared a kind of multi-docling:
the number of PDFs processed in parallel depends on your VRAM and on whether you use OCR only for tables or for everything. I have set VRAM = 16GB (my GPU RAM, you should set yours), and the number of parallel docling calls is VRAM/1.3,
so it uses ~12GB (in my version) and processes 12 PDFs at once. Only text and tables are converted, so no images and no diagrams (processing pages in parallel is too complicated). For now, all PDFs must be in the same folder as the Python file.
If you switch OCR to everything, VRAM consumption rises, and you have to change 1.3 to 2 or more.
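
The VRAM rule above is simple arithmetic and can be sketched like this. The function name `parallel_pdf_count` and the rounding down to a whole number are assumptions; the values 16, 1.3 and 2 come from the text.

```python
def parallel_pdf_count(vram_gb, divisor=1.3):
    """Number of PDFs to process at once: VRAM / divisor, rounded down (assumed)."""
    if vram_gb <= 0 or divisor <= 0:
        raise ValueError("vram_gb and divisor must be positive")
    # Always allow at least one worker, even on very small GPUs.
    return max(1, int(vram_gb // divisor))


print(parallel_pdf_count(16))       # 16 / 1.3 = 12.3 -> 12 PDFs at once
print(parallel_pdf_count(16, 2.0))  # OCR for everything: 16 / 2 -> 8 PDFs at once
```

With a smaller GPU, for example 8GB, the same rule gives 6 parallel PDFs at a divisor of 1.3.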
<br><br>
<b>now have fun and leave a comment if you like ;)</b><br>
on discord "sevenof9"
<br>
my embedder collection:<br>
https://huggingface.co/kalle07/embedder_collection
<br>
<br>
I am not responsible for any errors or crashes on your system. If you use it, you take full responsibility!