

However, these promotional claims don't always match real-world performance, according to recent tests. “I'm typically a pretty big fan of the Mistral models, but the new OCR-specific one they released last week really performed poorly,” Willis noted.
“A colleague sent this PDF and asked if I could help him parse the table it contained,” says Willis. “It's an old document with a table that has some complex layout elements. The new [Mistral] OCR-specific model really performed poorly, repeating the names of cities and botching a lot of the numbers.”
AI app developer Alexander Doria also recently pointed out on X a flaw with Mistral OCR’s ability to understand handwriting, writing, “Unfortunately Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely.”
According to Willis, Google currently leads the field in AI models that can read documents: “Right now, for me the clear leader is Google’s Gemini 2.0 Flash Pro Experimental. It handled the PDF that Mistral did not with a tiny number of mistakes, and I’ve run multiple messy PDFs through it with success, including those with handwritten content.”
Gemini’s performance stems largely from its ability to process expansive documents (in a type of short-term memory called a “context window”), which Willis specifically notes as a key advantage: “The size of its context window also helps, since I can upload large documents and work through them in parts.” This capability, combined with more robust handling of handwritten content, apparently gives Google’s model a practical edge over competitors in real-world document-processing tasks for now.
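To make the idea of working through a large document "in parts" concrete, here is a minimal Python sketch of one way to group pages so each batch fits a fixed token budget. It is not a description of how Willis or Google handle documents. The function names (`estimate_tokens`, `batch_pages`), the budget value, and the four-characters-per-token estimate are all illustrative assumptions, and a real tokenizer would give different counts.

```python
from typing import Iterable, Iterator, List


def estimate_tokens(text: str) -> int:
    # Assumption: roughly four characters per token. This is a crude
    # heuristic, not the count any particular model's tokenizer produces.
    return max(1, len(text) // 4)


def batch_pages(pages: Iterable[str], budget: int) -> Iterator[List[str]]:
    """Group consecutive pages so each batch stays within `budget` tokens.

    A single page that is larger than the budget is still emitted, alone,
    in its own batch; this sketch does not split individual pages.
    """
    batch: List[str] = []
    used = 0
    for page in pages:
        cost = estimate_tokens(page)
        # Close the current batch if this page would push it over budget.
        if batch and used + cost > budget:
            yield batch
            batch, used = [], 0
        batch.append(page)
        used += cost
    if batch:
        yield batch
```

For example, three pages of about 100 estimated tokens each, with a hypothetical budget of 250, would be grouped as two pages followed by one. A larger context window simply means fewer, bigger batches, which keeps more of a table or a multi-page layout in view at once.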
The drawbacks of LLM-based OCR
Despite their promise, LLMs introduce several new problems to document processing. Among them, they can introduce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions in the text (thinking they are part of a user prompt), or just generally misinterpret the data.