OCR on Layout Detection

#1
by okayatul - opened

I have successfully done the layout detection. Now I want to run OCR on those bounding boxes and extract the text, but I'm having trouble doing that. Please help.

PaddlePaddle org

In PP-StructureV3, the layout detection of PP-DocLayout_plus-L is only the first step. It is followed by the OCR module of your choice. You can refer to Introduction to PP-StructureV3 and PP-StructureV3 Pipeline Usage Tutorial for more information.
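For the original question, a minimal sketch of the second step looks like this: crop each detected box out of the page image and feed the crop to an OCR engine. Note the assumptions: boxes are taken to be `[x1, y1, x2, y2]` pixel coordinates, and `ocr_engine` is a stand-in for whatever recognition call your PaddleOCR/PaddleX version exposes (the exact result format differs between versions, so check your layout output first).

```python
# Sketch: run OCR on each detected layout box. Assumes boxes are
# [x1, y1, x2, y2] pixel coordinates; verify against your pipeline's
# actual layout-detection output format.
import numpy as np

def crop_box(image: np.ndarray, box) -> np.ndarray:
    """Crop one layout box (clamped to image bounds) from an HxWxC image."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    h, w = image.shape[:2]
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    return image[y1:y2, x1:x2]

def ocr_layout_boxes(image, boxes, ocr_engine):
    """Run an OCR callable over each box crop; collect (box, result) pairs."""
    results = []
    for box in boxes:
        crop = crop_box(image, box)
        if crop.size == 0:  # box fell entirely outside the image
            continue
        results.append((box, ocr_engine(crop)))
    return results
```

With PaddleOCR you would pass something like `lambda img: ocr.ocr(img)` for `ocr_engine`, where `ocr` is a `PaddleOCR` instance, but treat that call signature as an assumption to check against the version you have installed.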

Okay, understood. I'm using the PP-StructureV3 pipeline for parsing resumes, but when I give it table-style resumes, the markdown output contains the tables as HTML. I expected plain text in the markdown file. How do I fix this? Parsing is complicated here because the structure of the PDF is lost. I have also tried the table recognition pipeline v2 in addition to the PP-StructureV3 pipeline.
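If the pipeline keeps tables as HTML fragments inside the markdown, one workaround is a post-processing pass that converts those fragments yourself. A minimal sketch using only the standard library (handles plain `<table>`/`<tr>`/`<td>`/`<th>` markup, no rowspan/colspan; whether this preserves enough structure for your resumes is an assumption to test):

```python
# Post-processing sketch: convert a simple HTML <table> fragment into a
# pipe-style markdown table. Merged cells (rowspan/colspan) are not handled.
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows, each a list of cell strings
        self._row = None      # row currently being parsed
        self._cell = None     # cell text fragments currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def html_table_to_markdown(html: str) -> str:
    """Parse one HTML table and emit a markdown table (first row as header)."""
    parser = TableToMarkdown()
    parser.feed(html)
    if not parser.rows:
        return ""
    lines = ["| " + " | ".join(parser.rows[0]) + " |",
             "|" + "---|" * len(parser.rows[0])]
    lines += ["| " + " | ".join(r) + " |" for r in parser.rows[1:]]
    return "\n".join(lines)
```

You would run this over each `<table>...</table>` span in the pipeline's markdown output before saving it to a text file.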

Please reply

Title:
How to avoid reloading PP-StructureV3 models on every request in Flask/Celery without std::exception crashes?

Body:
I’m using PP-StructureV3 in a Flask + Celery application.

If I reload the pipeline for each request, it works but is too slow (since it reloads ~6 models every time).

If I cache/reuse the pipeline across requests, the first request succeeds, but on the second one I get a std::exception crash from Paddle’s C++ backend.

This makes it look like Paddle’s internal state gets corrupted after reuse.

Question:
What is the recommended way to run PP-StructureV3 in a long-lived server environment (Flask/Celery)?

Is there an officially supported pattern for keeping the models loaded in memory across multiple requests?

Or is the only safe option to reload the pipeline each time, or to run it in short-lived subprocesses?

Do PaddleOCR/PaddleX developers recommend any workaround for avoiding std::exception while still getting reasonable performance?
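The symptoms described above are consistent with the inference engine not being safe for concurrent reuse. A common (not officially documented) pattern is to load one pipeline per worker process, lazily, and serialize inference calls with a lock. A sketch, where `load_pipeline` is a stand-in for your real PP-StructureV3 setup:

```python
# Sketch of a per-process singleton pattern: load the pipeline once per
# worker process and serialize inference with a lock, on the assumption
# that Paddle's C++ inference state must not be entered concurrently.
# `load_pipeline` is a hypothetical stand-in for your pipeline constructor.
import threading

_pipeline = None
_lock = threading.Lock()

def get_pipeline(load_pipeline):
    """Return the process-wide pipeline, loading it on first use (thread-safe)."""
    global _pipeline
    if _pipeline is None:
        with _lock:
            if _pipeline is None:  # double-checked: another thread may have loaded it
                _pipeline = load_pipeline()
    return _pipeline

def predict(inputs, load_pipeline):
    """Run inference while holding the lock, so one request at a time touches Paddle."""
    pipeline = get_pipeline(load_pipeline)
    with _lock:
        return pipeline(inputs)
```

With Celery, the usual complement to this is one pipeline per worker process rather than one shared across processes, e.g. loading it from the `worker_process_init` signal or running the worker with `--pool=solo`; whether that fully avoids the `std::exception` crash in your setup is something to verify.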

