Engineering Drawing OCR
OCR (Optical Character Recognition) is a technology for systematically reading and indexing the text within images. Used as part of an Engineering Data and Drawing Management System, it unlocks a large portion of the data set that would not ordinarily be retrievable by search. Searching text-format documents such as Word files or text-based PDFs (as opposed to image-only PDFs) is a simple matter of loading each file into a database or search index that supports full-text search (such as Lucene). This approach does not work for image files, however, because they hold pixels rather than machine-readable text and so cannot be indexed in the same way.
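To illustrate why text documents are easy to search and image files are not, here is a minimal sketch of a full-text index in Python. It is a toy stand-in for a real engine such as Lucene, not a description of how any particular product works; the class name InvertedIndex, the file names, and the sample text are all made up for the example.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy full-text index: maps each lowercase token to the documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Split the text into lowercase alphanumeric tokens and record the document under each.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self._postings[token].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every token in the query."""
        tokens = re.findall(r"[a-z0-9]+", query.lower())
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = InvertedIndex()
# Text-format documents supply their own text (hypothetical sample content).
index.add("spec.docx", "Pump station P-101 impeller replacement procedure")
index.add("report.pdf", "Annual inspection report for pump station valves")
# A scanned drawing has no extractable text, so nothing is indexed for it.
index.add("drawing-0042.tif", "")

print(index.search("pump station"))  # both text documents match
print(index.search("impeller"))      # only spec.docx matches
print(index.search("drawing"))       # empty: the scan contributed no tokens
```

The scanned drawing never appears in any result, whatever text is visible on the sheet, because the index only ever sees an empty string for it.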
The Problem
Historical scanned drawings (usually TIFF or PDF) rarely have adequate metadata to allow for search and retrieval. This leads to situations where people either spend a large amount of time browsing through archives to find the files or, worse, give up and end up redrawing the diagram from scratch.
Operations and maintenance manuals, contracts, and similar files are often too large to index appropriately using standard metadata tagging techniques. Searching by the file's project name or asset description may work in some cases, but broader searches by part description or contractor name are impossible.
The Solution
The solution, as adopted by Lunr, is to OCR all image file types as part of the document upload process. The text extracted by this process is then indexed using a standard full-text search solution, allowing search over a much greater portion of your Engineering Data and Drawing library.
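The upload flow described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not Lunr's actual implementation: the DocumentLibrary class, the IMAGE_EXTENSIONS set, and the fake_ocr stub are all hypothetical. In practice the stub would be replaced by a call to a real OCR engine such as Tesseract, and image-only PDFs would need to be detected separately, which this sketch does not attempt.

```python
import re
from pathlib import Path

# Assumed set of image-only formats; adjust to whatever the archive holds.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DocumentLibrary:
    """Hypothetical library that OCRs image uploads before indexing them."""

    def __init__(self, ocr_engine):
        self._ocr_engine = ocr_engine  # any callable taking bytes and returning str
        self._tokens = {}              # filename -> set of indexed tokens

    def upload(self, filename, content, text=""):
        # Image files carry no text of their own, so extract it with OCR first.
        if Path(filename).suffix.lower() in IMAGE_EXTENSIONS:
            text = self._ocr_engine(content)
        self._tokens[filename] = tokenize(text)

    def search(self, query):
        """Return the sorted names of files containing every token in the query."""
        wanted = tokenize(query)
        if not wanted:
            return []
        return sorted(name for name, tokens in self._tokens.items() if wanted <= tokens)


def fake_ocr(image_bytes):
    # Stand-in for a real OCR engine: returns canned title-block text.
    return "ACME CONTRACTORS DRG NO 0042 IMPELLER HOUSING"


library = DocumentLibrary(ocr_engine=fake_ocr)
library.upload("drawing-0042.tif", b"<scanned image bytes>")
library.upload("manual.txt", b"", text="Operations manual for the Acme booster pump")

print(library.search("impeller"))  # found only via the OCR text of the scan
print(library.search("acme"))      # matches both the scan and the manual
```

Passing the OCR engine in as a callable keeps the indexing logic independent of any one OCR library, so the engine can be swapped or mocked in tests.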