Creating a Searchable PDF With Alfresco Software
A PDF document can have an image layer and a text layer. Alfresco is able to index the content contained in the text layer. Thus, a PDF stored in Alfresco with a text layer is searchable in Alfresco.
But what happens with a PDF document without any text layer, like a scanned PDF? Such documents are not indexed, and the search will never retrieve them. This behaviour can be confusing for the user, who won't get the same behaviour for two documents with the same MIME type (PDF). One will show up in the search, while the other won't.
We created an open-source OCR solution to give Alfresco this feature. The goal is to identify all PDFs with no text layer in the repository and run the following actions on each one of them:
- split each document into multiple images: one for each page.
- run an OCR engine on each image, in order to extract the text (and layout) from the image. The input is a page of the PDF document, the output is an hOCR file.
- combine each page image and its corresponding hOCR file into a PDF. The output will contain the visual representation of the input image with a hidden text layer from the hOCR file.
- merge all the PDFs created for each page back into a single PDF.
In short, we take a multiple-page PDF with only an image layer and transform it into another multiple-page PDF which has the same look, plus a hidden text layer that contains the OCR output.
hOCR is an open format based on HTML. It represents an OCR output, combining layout and style information with the recognized text.
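To make the format more concrete, here is a minimal Python sketch that reads the recognized words and their bounding boxes out of an hOCR fragment. It assumes the common hOCR convention of `ocrx_word` spans whose `title` attribute carries a `bbox x0 y0 x1 y1` property. The sample markup and the names `HocrWordParser`, `parse_bbox` and `extract_words` are illustrative only and are not part of the solution described here.

```python
from html.parser import HTMLParser

# Illustrative hOCR fragment (made up for this example).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 10 20 200 40'>
    <span class='ocrx_word' title='bbox 10 20 90 40; x_wconf 96'>Scanned</span>
    <span class='ocrx_word' title='bbox 100 20 200 40; x_wconf 93'>invoice</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return the four bbox integers from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word span being read, if any
        self._chunks = []    # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that sits inside a word span with a known bbox.
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Parse an hOCR string and return its words with their bounding boxes."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling `extract_words(SAMPLE)` yields each word together with the pixel rectangle it occupies on the page image, which is the information a tool needs to place invisible text behind the scanned picture.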
These are the open-source tools that we chose for each step:
- splitting PDF pages: PDFtk
- OCR: Tesseract-ocr
- merging image & hOCR: hOcr2Pdf
- merging PDF pages: PDFJoin
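The way these tools chain together can be sketched in Python as a list of command lines. This is not the script used in the project, only an illustration under several assumptions: `pdftk ... burst` produces one single-page PDF per page, each page is rasterized with poppler's `pdftoppm` (a tool not named above, added here because Tesseract needs an image), Tesseract writes its output with a `.hocr` extension (older versions use `.html`), and `hocr2pdf` reads the hOCR file on standard input. The `Step` type, the function names and the file naming scheme are hypothetical.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class Step:
    argv: List[str]
    stdin: Optional[str] = None  # file fed to the command's standard input


def split_step(input_pdf, workdir):
    """PDFtk: burst the document into one single-page PDF per page."""
    pattern = str(Path(workdir) / "page_%04d.pdf")
    return Step(["pdftk", str(input_pdf), "burst", "output", pattern])


def page_steps(page_pdf):
    """Commands that turn one single-page PDF into a searchable one."""
    base = str(Path(page_pdf).with_suffix(""))
    image, hocr, ocr_pdf = base + ".png", base + ".hocr", base + "_ocr.pdf"
    return [
        # ASSUMPTION: rasterize at 300 dpi with pdftoppm (not in the tool list).
        Step(["pdftoppm", "-r", "300", "-png", "-singlefile", str(page_pdf), base]),
        # Tesseract: the trailing "hocr" config asks for hOCR output.
        Step(["tesseract", image, base, "hocr"]),
        # hocr2pdf: image plus hOCR (on stdin) gives a PDF with hidden text.
        Step(["hocr2pdf", "-i", image, "-o", ocr_pdf], stdin=hocr),
    ]


def merge_step(ocr_pdfs, output_pdf):
    """PDFJoin: concatenate the per-page PDFs into a single document."""
    return Step(["pdfjoin", "--outfile", str(output_pdf), *map(str, ocr_pdfs)])


def run(step):
    """Execute one step, raising if the external tool fails."""
    if step.stdin is None:
        subprocess.run(step.argv, check=True)
    else:
        with open(step.stdin, "rb") as handle:
            subprocess.run(step.argv, stdin=handle, check=True)
```

A driver would call `run(split_step(...))`, list the `page_*.pdf` files that were produced, run `page_steps` for each of them, and finish with `merge_step`. Keeping the command construction separate from `run` makes it possible to inspect or log the plan without the tools being installed.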
We wrote a Linux script to run the whole process, and we call it from Alfresco through a custom ContentTransformer. This transformer is a special one because its source and target MIME types are identical. We also don't want Alfresco developers to use it in an uncontrolled way, so we created it as unregistered, which means that it is not findable through the Transform service and can be called only by direct reference.
As the OCR process can be quite heavy for the server, we chose to run it at night. So, we built a job that runs every night, checks for new PDF documents in the Alfresco repository with no text layer, and manually calls the custom transformer on each one of them. The job then creates a new version of the document in the repository from the ContentTransformer output.
It's very straightforward, in Alfresco, to tell the difference between a PDF with or without a text layer. We use the PDFBox library included in Alfresco for this purpose.
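The decision itself can be sketched independently of the extraction library. The Python snippet below assumes the text of each page has already been extracted (in Java this could be done with PDFBox's `PDFTextStripper`, although the exact calls used in the project are not stated here) and simply checks whether anything other than whitespace came out. The function names and the `min_chars` threshold are hypothetical.

```python
PDF_MIMETYPE = "application/pdf"


def has_text_layer(page_texts, min_chars=1):
    """True once the extracted page texts hold min_chars non-space characters."""
    count = 0
    for text in page_texts:
        count += sum(1 for ch in text if not ch.isspace())
        if count >= min_chars:
            return True
    return False


def needs_ocr(mimetype, page_texts):
    """A document is an OCR candidate if it is a PDF with no usable text."""
    return mimetype == PDF_MIMETYPE and not has_text_layer(page_texts)
```

For example, `needs_ocr("application/pdf", ["", "\n"])` is true for a scanned document whose pages yield no text, while a PDF whose pages yield real words is skipped. Raising `min_chars` is one possible way to treat PDFs that contain only a few stray characters as scanned.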
In conclusion, it would be easy to customize this example to adapt it to other requirements. For instance, we can create a policy to run the transformation on the fly instead of running it at night, or we can directly take an image as an input, or we can create a new document in a separate folder instead of creating a new version. This shows how flexible Alfresco development services and open-source solutions can be.









