GitHub - mindee/doctr: docTR by Mindee (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

Optical Character Recognition made seamless & accessible to anyone, powered by TensorFlow 2 (PyTorch in beta)

What you can expect from this repository:

efficient ways to parse textual information (localize and identify each word) from your documents

guidance on how to integrate this in your current architecture

Quick Tour

Getting your pretrained model

End-to-end OCR in docTR uses a two-stage approach: text detection (localizing words), followed by text recognition (identifying the characters in each word). You can therefore choose the text detection architecture and the text recognition architecture independently from the list of available implementations.

    from doctr.models import ocr_predictor

    model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
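To make the two-stage idea concrete, here is a minimal, library-independent sketch. It does not use docTR: `run_two_stage`, `WordPrediction`, `fake_detect` and `fake_recognize` are hypothetical names introduced only for illustration, and the two stand-in functions return hard-coded values where real models would run.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Assumed box convention for this sketch: (xmin, ymin, xmax, ymax), relative to page size
Box = Tuple[float, float, float, float]


@dataclass
class WordPrediction:
    box: Box
    text: str


def run_two_stage(
    page: Any,
    detect: Callable[[Any], List[Box]],
    recognize: Callable[[Any, Box], str],
) -> List[WordPrediction]:
    """Stage 1 localizes words; stage 2 reads the characters inside each box."""
    boxes = detect(page)
    return [WordPrediction(box=box, text=recognize(page, box)) for box in boxes]


# Hypothetical stand-ins for a detection model and a recognition model
def fake_detect(page: Any) -> List[Box]:
    return [(0.10, 0.10, 0.30, 0.15), (0.35, 0.10, 0.60, 0.15)]


def fake_recognize(page: Any, box: Box) -> str:
    return "hello" if box[0] < 0.2 else "world"


words = run_two_stage(page=None, detect=fake_detect, recognize=fake_recognize)
print(" ".join(word.text for word in words))
```

Because each stage sits behind its own function, either one can be swapped without touching the other, which is the same separation the `det_arch` and `reco_arch` arguments expose.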

Reading files

Documents can be interpreted from PDF or images:

    from doctr.io import DocumentFile

    # PDF
    pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
    # Image
    single_img_doc = DocumentFile.from_images("path/to/your/img.jpg")
    # Webpage
    webpage_doc = DocumentFile.from_url("https://www.yoursite.com").as_images()
    # Multiple page images
    multi_img_doc = DocumentFile.from_images(["path/to/page1.jpg", "path/to/page2.jpg"])

Putting it together

Let's use the default pretrained model for an example:

    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)
    # PDF
    doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
    # Analyze
    result = model(doc)

To make sense of your model's predictions, you can visualize them interactively as follows:

    result.show(doc)

Or even rebuild the original document from its predictions:

    import matplotlib.pyplot as plt

    plt.imshow(result.synthesize())
    plt.axis('off')
    plt.show()

The ocr_predictor returns a Document object with a nested structure (with Page, Block, Line, Word, Artefact). To get a…
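As a rough illustration of how a nested Page / Block / Line / Word structure can be traversed, the sketch below walks a hand-written dictionary. The key names (`blocks`, `lines`, `words`, `value`, `artefacts`) and the `page_lines` helper are assumptions made for this example, modelled on the class names above; the sample data is invented and is not output from the library.

```python
from typing import Any, Dict, List


def page_lines(page: Dict[str, Any]) -> List[str]:
    """Rebuild each text line of a page by joining its words with spaces."""
    lines = []
    for block in page["blocks"]:
        for line in block["lines"]:
            lines.append(" ".join(word["value"] for word in line["words"]))
    return lines


# Hand-written sample page (assumed layout, not real predictions)
sample_page = {
    "blocks": [
        {
            "lines": [
                {"words": [{"value": "Hello"}, {"value": "world"}]},
                {"words": [{"value": "docTR"}]},
            ],
            "artefacts": [],
        }
    ]
}

for text in page_lines(sample_page):
    print(text)
```

Walking from the outermost level inward keeps the reading order of blocks and lines intact, so the joined strings come out in the order they appear in the structure.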