pdf2txt_parser_converter / README.md

kalle07

Update README.md

4e8efba verified 5 months ago

metadata

language:
  - en
  - de
tags:
  - parser
  - parsing
  - PDF
  - pdfplumber
  - txt
  - tables
  - python
  - windows
  - RAG

PDF to TXT converter, ready to chunk for your RAG

WINDOWS ONLY
EXE and PY available (English and German)

⇨ give me a ❤️, if you like ;)

Most LLM applications convert your PDF to plain text and nothing more, as if you had saved the PDF as a TXT file. Text blocks often end up mixed together and tables become unreadable, so it works a bit better to convert with the help of a parser.
I work with "pdfplumber/pdfminer" without OCR, so it is very fast! I also have a "docling" parser with OCR in progress, but I think it will only be released as a Python file, not compiled.
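
To illustrate the parser approach, here is a minimal sketch of page-wise extraction with pdfplumber. It is not the code of this tool: the function names `extract_pages` and `table_to_records` are hypothetical, and treating the first table row as the header is an assumption that will not hold for every PDF.

```python
def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts.

    Assumption: the first row holds the column names. Empty cells come
    back as None from the extractor, so they are mapped to "".
    """
    if not table:
        return []
    header = [
        (cell or "").strip() or f"col{i + 1}"
        for i, cell in enumerate(table[0])
    ]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_pages(pdf_path):
    """Return a list of (page_number, text, tables) tuples for one PDF."""
    # Third-party package (pip install pdfplumber), imported lazily so the
    # helper above can be used without it.
    import pdfplumber

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # None when no text is found
            tables = page.extract_tables()    # list of tables, each a list of rows
            pages.append((page.page_number, text, tables))
    return pages
```

Because no OCR is involved, this only works for PDFs that contain a real text layer; scanned pages will yield empty text.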

Works with a single PDF, a list of multiple PDFs, or a folder
Intelligent multiprocessing
Error tolerant: if a PDF cannot be converted, it is skipped
Instant view of the result
Converts some common tables to JSON inside the TXT files
Adds the absolute PAGE number to each page
All TXT files are created in the original folder of the PDF
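
The features listed above could be combined roughly as in the following sketch. It is only an illustration under stated assumptions, not the tool's actual implementation: the page marker format `--- PAGE n ---`, the function names, and the `extract` callback (any function that returns `(page_number, text, tables)` tuples for a PDF path) are all hypothetical.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def render_pages(pages):
    """Build the TXT content: a page marker, the page text, then one
    JSON line per table. The marker format is an assumption."""
    parts = []
    for number, text, tables in pages:
        parts.append(f"--- PAGE {number} ---")
        if text:
            parts.append(text)
        for table in tables:
            parts.append(json.dumps(table, ensure_ascii=False))
    return "\n".join(parts) + "\n"


def convert_one(pdf_path, extract):
    """Convert one PDF and write the TXT next to it.

    Returns the TXT path, or None if the PDF could not be converted,
    so a broken file is skipped instead of stopping the whole run.
    """
    pdf_path = Path(pdf_path)
    try:
        pages = extract(pdf_path)
        txt_path = pdf_path.with_suffix(".txt")  # same folder as the PDF
        txt_path.write_text(render_pages(pages), encoding="utf-8")
        return txt_path
    except Exception:
        return None


def convert_many(pdf_paths, extract, max_workers=None):
    """Convert several PDFs in parallel worker processes.

    `extract` must be a top-level function so it can be sent to the
    worker processes.
    """
    pdf_paths = list(pdf_paths)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(convert_one, pdf_paths, [extract] * len(pdf_paths))
        return [path for path in results if path is not None]
```

On Windows, `convert_many` has to be called from inside an `if __name__ == "__main__":` block, because worker processes are started by re-importing the main module.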

I created this with my own brain and the help of ChatGPT, so sorry, I will not fulfill any feature requests unless there are real errors.
Handling the GUI and the functionality, and compiling it on top of that, is really hard for me.

Now have fun, and leave a comment if you like ;)