Improving data extraction from complex forms
I’m working on extracting structured data from Bills of Lading and similar documents using SmolDocling. Overall OCR performance is solid, but I’ve run into a recurring issue: some table fields, especially those with long or multi-line text, aren’t being read at all, even though the image quality is high and the text is clearly legible.
This seems to affect fields like item descriptions or freight terms that:
• Span multiple lines within a cell
• Contain detailed specs, units, or numerical values
• Are nested inside dense table structures
Has anyone encountered this and found a reliable way to improve extraction?
I’m open to:
• Fine-tuning ideas
• Preprocessing tricks (e.g., image cleanup, table detection)
• Hugging Face models better suited to structured form extraction
Thanks in advance—any pointers or shared experiences would be super helpful!