How To Automatically Verify Your Bulk Scan Job Scanned Correctly
Whether you use an unmonitored Automatic Document Feeder (ADF) to bulk scan your documents or you contract out your scanning to a document services company, you want to know whether or not your documents were all scanned correctly.
If you only had a few thousand documents scanned, it’s easy enough to load the image of every document into an image thumbnail program and skim through them 20 images at a time during your lunch break. But what if you have tens of thousands of scanned documents, or more?
Get Your Computer To Read Your Documents
If you used a document services company or you’re familiar with scanning technology, you’ve probably heard of Optical Character Recognition (OCR) programs which can read printed text with accuracy rates exceeding 99%. Your document services company may have tried to upsell you OCR services as part of your scanning contract as a way to add a searchable index to your documents. Whether or not you bought the upgrade, you can use OCR yourself to test your scanned documents.
OCR software can create a text-only version of an image if that image has any text. Some OCR software is totally automatic, so it can create those text versions of each page on your computer overnight while you sleep. (For millions of pages, you might need to dedicate a computer to the task for several days or weeks.)
Many OCR programs you see for sale have point-and-click user interfaces which let you select which documents to process, but there are a number of command-line or scriptable OCR programs which don’t have a graphical user interface. These OCR programs are easy for programmers and system administrators to integrate into other programs, and that’s exactly what you need—a program to read your scanned documents.
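As a rough illustration of how a script can drive a command-line OCR program, here is a minimal Python sketch. It assumes the open source Tesseract program is installed and on the system path; the function names `ocr_command` and `make_text_version` are made up for this example.

```python
import subprocess
from pathlib import Path


def ocr_command(image_path, lang="eng"):
    """Build the Tesseract command that writes <image name>.txt beside the image."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # Tesseract appends ".txt" itself
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def make_text_version(image_path, lang="eng"):
    """Run OCR on one image and return the path of the text file it produces."""
    # check=True raises CalledProcessError if Tesseract cannot read the image.
    subprocess.run(ocr_command(image_path, lang), check=True, capture_output=True)
    return Path(image_path).with_suffix(".txt")
```

Calling `make_text_version` on every image in a folder, one after another, is the kind of job a computer can be left to run overnight.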
How the Document Auto Verifier Works & How To Build It
Any programmer or senior system administrator can write a simple program that does the following for each scanned image:
- Uses a command-line OCR program to create a text version of a scanned document image.
- Checks to see how many English words are in the text version.
- Makes a note in a log if the text version has too few words. (A better way is to copy the image into a special folder for easy review.)
- Starts over on the next image.
If the OCR program finds a lot of words in a document, the document was probably scanned well. If it doesn’t find a lot of words, the scanned image may have a problem—it could be blank, it could be unrecognizable, or the image file could be damaged. (You might also not have a problem—the scanned document could’ve been a picture or blank form.)
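The steps above could be sketched in Python roughly as follows. This is one possible version, not a finished tool. It assumes Tesseract is installed and on the system path, treats any run of two or more letters as an "English word" (a real dictionary check would be stricter), and uses a threshold of 20 words that you would need to tune against your own documents. The names `verify_scans`, `count_words`, and `tesseract_ocr` are invented for this example.

```python
import re
import shutil
import subprocess
from pathlib import Path

MIN_WORDS = 20  # assumed threshold; tune it against a sample of known-good scans
WORD_PATTERN = re.compile(r"[A-Za-z]{2,}")  # rough stand-in for "an English word"
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def tesseract_ocr(image_path):
    """Return the text Tesseract finds in one image."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", "eng"],
        capture_output=True, text=True,
    )
    # If Tesseract exits with an error (for example on a damaged file),
    # treat the image as having no text so it gets flagged for review.
    return result.stdout if result.returncode == 0 else ""


def count_words(text):
    """Count word-like tokens in the OCR output."""
    return len(WORD_PATTERN.findall(text))


def verify_scans(scan_dir, review_dir, ocr=tesseract_ocr, min_words=MIN_WORDS):
    """Copy every image with too few recognized words into review_dir."""
    review_dir = Path(review_dir)
    review_dir.mkdir(parents=True, exist_ok=True)
    flagged = []
    for image_path in sorted(Path(scan_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip folders and non-image files
        words = count_words(ocr(image_path))
        if words < min_words:
            shutil.copy2(image_path, review_dir / image_path.name)
            flagged.append((image_path.name, words))
    return flagged
```

Running `verify_scans("scans", "needs_review")` would return a list of file names and word counts that can be written to a log, and the `needs_review` folder would hold copies of only the images worth a second look.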
The best news is that this process doesn’t need to cost you anything except for the hour it takes your system administrator to set it up and the time you (or your assistant) spend reviewing the log of problem images. That’s because, thanks to Google, one of the best command-line OCR programs is open source software. In other words, it’s free.
Tell your system administrator to search for Tesseract OCR, an open source OCR program whose development Google sponsored for years, and get evidence that your documents were scanned correctly.
For more information on scanning large-format documents, contact our experts in the Production Department at 757-545-7675 or email@example.com.