At StateReference, we gather a wide range of documents from municipalities and state agencies across the Commonwealth of Massachusetts. These documents come to us in a variety of file formats, including Portable Document Format (PDF), Microsoft Excel, and Electronic Mail Format (EML).
Whenever possible, we index the text contained within these documents and make the full text searchable. We also want our users to be able to preview every document on the StateReference website without having to download it and open a separate application.
Optical character recognition (OCR) allows us to extract text from documents where the text is only available as an image that is not machine-readable. While many PDFs we receive from agencies contain machine-readable text, many others are in fact scans or photos, and the text must be reconstructed.
To accomplish this, we use the excellent open-source OCRmyPDF software. This software adds a "text layer" to a PDF document.
Any document on StateReference that we have run through OCR is accompanied by a notice informing the user. In addition, we always present the option to download either the OCR document or the original provided by the agency.
You may find the OCR document more useful, because you can use the "Find" feature of your PDF reader to search it and you can copy and paste text from it. However, the OCR process is not perfect, and there may be errors where text has been incorrectly transcribed.
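For readers curious about what the OCR step looks like in practice, here is a minimal sketch in Python. It is an illustration only, not StateReference's actual pipeline: the function names `build_ocr_command` and `add_text_layer` are hypothetical, and the choice of the `--skip-text` option (which tells OCRmyPDF to leave pages that already contain text alone) is an assumption about how one might handle PDFs that mix scanned and machine-readable pages.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(source: Path, destination: Path) -> list[str]:
    """Build an OCRmyPDF command line that adds a text layer to a PDF.

    Hypothetical helper for illustration; the flag choice is an assumption.
    """
    return [
        "ocrmypdf",
        "--skip-text",  # do not re-OCR pages that already contain text
        str(source),
        str(destination),
    ]


def add_text_layer(source: Path, destination: Path) -> bool:
    """Run OCRmyPDF if it is installed; return True on success."""
    if shutil.which("ocrmypdf") is None:
        # OCRmyPDF is not available on this machine.
        return False
    result = subprocess.run(build_ocr_command(source, destination), check=False)
    return result.returncode == 0
```

In a setup like this, the original file is left untouched and the OCR output is written to a separate file, which is what makes it possible to offer both versions for download.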
Some document types cannot be displayed directly in your web browser, such as a Microsoft Word document or an email. In those cases, we automatically convert the document into a PDF. The PDF is then viewable in your browser using an embedded viewer.
When you are viewing such a preview, StateReference will always display a notice informing you that it is a converted copy. You will also be given the option to download the original document that was provided by the agency.
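The conversion step described above could be sketched roughly as follows. This is a sketch under stated assumptions, not a description of the tooling StateReference actually uses: it assumes LibreOffice's `soffice` command-line tool for converting office documents, and the extension list `NEEDS_CONVERSION` and both function names are hypothetical. Email messages would likely need a different converter and are left out here.

```python
from pathlib import Path

# Hypothetical set of office formats that a browser cannot preview directly.
NEEDS_CONVERSION = {".doc", ".docx", ".xls", ".xlsx"}


def needs_pdf_conversion(path: Path) -> bool:
    """Return True if the file should be converted to PDF for previewing."""
    return path.suffix.lower() in NEEDS_CONVERSION


def build_convert_command(source: Path, out_dir: Path) -> list[str]:
    """Build a LibreOffice command that writes a PDF copy into out_dir.

    Assumes LibreOffice is installed and `soffice` is on the PATH.
    """
    return [
        "soffice",
        "--headless",  # run without opening a window
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]
```

Because the converted PDF is written to a separate directory, the agency's original file remains available for download alongside the preview.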