Creating a Searchable PDF With Alfresco
A PDF document can have an image layer and a text layer. Alfresco is able to index the content contained in the text layer. Thus, a PDF with a text layer is searchable in Alfresco.
But what happens with a PDF document without any text layer, like a scanned PDF?
Such documents are not indexed, and the search will never retrieve them. This behaviour can be confusing for users, as they won't see the same behaviour from two documents with the same Mime type (PDF). One will show up in the search, while the other won't.
We created an open-source OCR solution to address this question. The idea is to identify all PDFs with no text layer in the repository and run the following actions on each one of them:
- split each one into separate images: one for each page.
- run an OCR engine on each image, in order to extract the text (and layout) of the page. The input is a page of the PDF document, the output is an hOCR file.
- merge each page image and its corresponding hOCR file into a PDF. The output will contain the visual content of the input image with a hidden text layer taken from the hOCR file.
- merge back all PDFs created for each page into a single PDF.
In a few words, we have a multiple-page PDF with only an image layer that we transform into another multiple-page PDF which has the same look, plus a hidden text layer that contains the OCR output.
hOCR is an open format based on HTML. It represents an OCR output by combining layout and style information with the recognized text itself.
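To make the format concrete, here is a minimal sketch that reads the words and their bounding boxes out of an hOCR fragment using only the Python standard library. The sample fragment is hand-written for illustration (it is not real OCR output), and the names `HocrWords` and `parse_hocr_words` are hypothetical; the `ocrx_word` class and the `bbox` property in the `title` attribute follow the usual hOCR conventions.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration only (not real OCR output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 50 40 300 70'>
    <span class='ocrx_word' title='bbox 50 40 140 70; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 150 40 300 70; x_wconf 91'>world</span>
  </span>
</div>
"""


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs for every element of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            # The title looks like "bbox x0 y0 x1 y1; x_wconf NN".
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(v) for v in parts[1:5])

    def handle_data(self, data):
        # Only text that directly follows an ocrx_word start tag is kept.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        # Forget a pending bbox if the word element turned out to be empty.
        self._bbox = None


def parse_hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    return parser.words


if __name__ == "__main__":
    for text, bbox in parse_hocr_words(SAMPLE_HOCR):
        print(text, bbox)
```

The bounding boxes are what allow a later step to place each recognized word as invisible text exactly over its position in the page image.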
Here are the different open-source tools that we chose for each step:
- splitting PDF pages: PDFtk
- OCR: Tesseract-ocr
- merging image & hOCR: hOcr2Pdf
- merging PDF pages: PDFJoin
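The four steps and their tools can be sketched as a list of command lines. This is an illustration under stated assumptions, not the script described in this article: the function names, file naming scheme, and `lang` parameter are hypothetical, and because Tesseract reads images rather than PDFs, the sketch assumes an extra rasterization step with ImageMagick's `convert`, a tool the article does not name. The plan is built separately from its execution so it can be inspected before anything runs.

```python
import subprocess
from pathlib import Path


def build_ocr_plan(source_pdf, work_dir, page_count, output_pdf, lang="eng"):
    """Return a list of (argv, stdin_path) tuples covering the whole pipeline."""
    work = Path(work_dir)
    steps = []

    # Step 1: split the PDF into one single-page PDF per page (PDFtk).
    steps.append(
        (["pdftk", str(source_pdf), "burst", "output", str(work / "page_%04d.pdf")], None)
    )

    ocr_pages = []
    for n in range(1, page_count + 1):
        page_pdf = work / f"page_{n:04d}.pdf"
        image = work / f"page_{n:04d}.tif"
        out_base = work / f"page_{n:04d}"
        # Assumption: recent Tesseract versions write <base>.hocr
        # (older releases wrote <base>.html instead).
        hocr = work / f"page_{n:04d}.hocr"
        ocr_pdf = work / f"ocr_{n:04d}.pdf"

        # Assumption: rasterize the page with ImageMagick (needs Ghostscript).
        steps.append((["convert", "-density", "300", str(page_pdf), str(image)], None))

        # Step 2: OCR the page image into an hOCR file (Tesseract).
        steps.append((["tesseract", str(image), str(out_base), "-l", lang, "hocr"], None))

        # Step 3: merge image and hOCR into a PDF; hocr2pdf reads hOCR on stdin.
        steps.append((["hocr2pdf", "-i", str(image), "-o", str(ocr_pdf)], str(hocr)))
        ocr_pages.append(str(ocr_pdf))

    # Step 4: join the per-page PDFs back into a single document (PDFJoin).
    steps.append((["pdfjoin", "--outfile", str(output_pdf), *ocr_pages], None))
    return steps


def run_plan(steps):
    """Run each command in order, stopping at the first failure."""
    for argv, stdin_path in steps:
        if stdin_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdin_path, "rb") as handle:
                subprocess.run(argv, stdin=handle, check=True)
```

For a two-page document, `build_ocr_plan("in.pdf", "/tmp/work", 2, "out.pdf")` yields eight commands: one split, three per page, and one final join.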
We wrote a Linux script to run the whole process, and we call it from Alfresco through a custom ContentTransformer. This is a special one because it has an identical source and target Mime type. Moreover, we don't want Alfresco to use it in an uncontrolled way, so we created it as "unregistered", which means that it cannot be found through the Transform service and can be called only by direct reference.
As the OCR process can be quite demanding for the server, we want to run it at night. Thus, we created a job that runs every night, checks for new PDF documents in the repository with no text layer, and manually calls the custom transformer on each one of them. Then, the job creates a new version of the document in the repository from the ContentTransformer output.
It's very easy to tell the difference, in Alfresco, between a PDF with and without a text layer. We use the PDFBox library included in Alfresco for this purpose.
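The logic of the nightly job can be outlined as follows. This is only a sketch of the control flow, not the Alfresco implementation: `extract_page_texts`, `ocr_transform`, and `add_version` are hypothetical callables standing in for the PDFBox text extraction, the custom ContentTransformer, and the repository versioning call. The text-layer test here is deliberately simple: a document counts as having a text layer if any page yields non-whitespace text.

```python
def has_text_layer(page_texts):
    """Return True if at least one page produced non-whitespace text."""
    return any(text.strip() for text in page_texts)


def nightly_ocr_job(documents, extract_page_texts, ocr_transform, add_version):
    """Run OCR on every PDF without a text layer and store the result.

    documents          -- iterable of (doc_id, pdf_bytes) pairs (hypothetical shape)
    extract_page_texts -- callable returning one extracted string per page
    ocr_transform      -- callable turning an image-only PDF into a searchable PDF
    add_version        -- callable storing the new content as a new version
    Returns the ids of the documents that were processed.
    """
    processed = []
    for doc_id, pdf_bytes in documents:
        if has_text_layer(extract_page_texts(pdf_bytes)):
            continue  # already searchable, nothing to do
        add_version(doc_id, ocr_transform(pdf_bytes))
        processed.append(doc_id)
    return processed
```

Passing the three operations in as parameters keeps the selection logic independent of the repository, so it can be exercised with in-memory stand-ins before being wired to the real services.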
In conclusion, it would be easy to customize this example to adapt it to other requirements. For instance, we can create an action to call the transformer on the fly instead of calling it at night, or we can directly take an image as an input, or we can create a new document in a specific folder instead of creating a new version. This shows how flexible Alfresco and open-source solutions can be.