PRImA Research Lab 2015-07-17T15:27:13 2018-07-19T07:29:57 Example Page

Aletheia Document Analysis System

Overview: Aletheia is an advanced system for accurate yet cost-effective ground truthing of large amounts of documents. It aids the user with a number of automated and semi-automated tools, which were partly developed and improved based on feedback from major libraries across Europe and from their digitisation service providers, which are using the tool in a production environment.

Novel features include support for top-down ground truthing with sophisticated split and shrink tools, as well as bottom-up ground truthing supporting the aggregation of lower-level elements into more complex structures.
Special features have been developed to support working with the complexities of historical documents. The integrated validator, in combination with powerful correction tools, enables efficient production of highly accurate ground truth.

Aletheia uses the PAGE (Page Analysis and Ground truth Elements) XML format framework, which incorporates several XML schemas representing the whole workflow of document analysis. See also the dedicated infobox.
Layers and reading order

Screenshot of Aletheia showing regions and properties

The PAGE (Page Analysis and Ground truth Elements) format framework incorporates several XML schemas representing the whole workflow of document analysis, including image enhancement, binarisation, geometrical correction, layout analysis, layout evaluation and OCR. The schema used here for document layouts allows for polygonal regions with various attributes (including text content), reading order, layers and more.
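As a rough illustration of the layout schema described above, the following Python sketch reads polygonal regions, their text content and the reading order from a small PAGE-style document. The sample XML, the region ids and the helper names (read_regions, reading_order) are invented for this example, and the namespace shown is assumed to be one of the published schema versions. Element names are matched without their namespace so that the sketch does not depend on a particular version; the published PAGE schema remains the authoritative reference.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only. The namespace is assumed to be one published
# PAGE content schema version; ids, coordinates and text are made up.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1400">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1" type="paragraph">
      <Coords points="100,300 900,300 900,600 100,600"/>
      <TextEquiv><Unicode>Overview: Aletheia is an advanced system.</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2" type="heading">
      <Coords points="100,100 900,100 900,200 100,200"/>
      <TextEquiv><Unicode>Aletheia Document Analysis System</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_regions(root):
    """Map each text region id to (polygon, text)."""
    regions = {}
    for elem in root.iter():
        if local(elem.tag) != "TextRegion":
            continue
        polygon, text = [], ""
        for child in elem:
            name = local(child.tag)
            if name == "Coords":
                polygon = parse_points(child.get("points", ""))
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        text = sub.text or ""
        regions[elem.get("id")] = (polygon, text)
    return regions


def reading_order(root):
    """Return region ids sorted by their reading-order index."""
    refs = [e for e in root.iter() if local(e.tag) == "RegionRefIndexed"]
    refs.sort(key=lambda e: int(e.get("index")))
    return [e.get("regionRef") for e in refs]


root = ET.fromstring(SAMPLE)
regions = read_regions(root)
order = reading_order(root)
for region_id in order:
    polygon, text = regions[region_id]
    print(region_id, len(polygon), "points:", text)
```

Keeping the reading order separate from the region definitions, as in the sample, means the order in which regions are stored in the file does not have to match the order in which they are read.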
Typical Workflows

From Scratch, Top-Down
• Marking regions using manual or semi-automated tools
• Marking text lines with easy-to-use split tools
• Marking words with assistive tools
• Marking glyphs (characters)
• Text transcription and propagation to any required level
• Reading order definition

Preproduction + Correction
• Validation to reduce risk of mistakes
• Correcting text content using rendered text overlay
• Correcting layout using convenient tools such as merge and split
• Automated page analysis with integrated Tesseract OCR or opening externally generated result

Other Software Tools by PRImA

Pattern Recognition and Image Analysis Research Lab, School of Computing, Science and Engineering, University of Salford, Greater Manchester, United Kingdom, www.primaresearch.org

WebAletheia Webapp
A lightweight web-based version of the Aletheia ground truthing system. Ideal for customised workflows and crowdsourcing applications. Go to the PRImA website to try it yourself.

Tesseract OCR to PAGE For Windows
A command line tool to analyse document page images using the open source OCR engine Tesseract and save the results to PAGE XML format. Version 1.3 is based on the latest release of Tesseract (3.03).

PAGE Libraries For Java and C++
Platform-independent libraries to create valid layout descriptions in PAGE XML format. The libraries can be easily integrated in other software projects such as page segmentation methods for ICDAR competitions.

Layout Evaluation Performance Analysis System
This tool is part of a framework for evaluating the performance of layout analysis methods. It combines efficiency and accuracy by using a special interval-based geometric representation of regions. A wide range of sophisticated evaluation measures provides deep insight into the analysed systems, going far beyond simple benchmarking. Support for user-defined profiles allows tuning for any kind of evaluation scenario related to real-world applications.
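The text does not spell out how the interval-based geometric representation works. The Python sketch below therefore rests on an assumption: that each region is stored as horizontal pixel intervals per image row, so that the overlap between a ground-truth region and a detected region can be computed with simple interval arithmetic. All function names and the example rectangles are hypothetical and are not taken from the PRImA tool.

```python
def rect_to_intervals(left, top, right, bottom):
    """Represent a rectangle as {row: [(x_start, x_end)]} with half-open
    intervals, covering rows top..bottom-1 (assumed representation)."""
    return {y: [(left, right)] for y in range(top, bottom)}


def area(region):
    """Number of pixels covered by an interval-based region."""
    return sum(x2 - x1 for spans in region.values() for x1, x2 in spans)


def intersect(a, b):
    """Intersect two interval-based regions row by row."""
    result = {}
    for y in a.keys() & b.keys():
        spans = []
        for a1, a2 in a[y]:
            for b1, b2 in b[y]:
                lo, hi = max(a1, b1), min(a2, b2)
                if lo < hi:  # keep only non-empty overlaps
                    spans.append((lo, hi))
        if spans:
            result[y] = sorted(spans)
    return result


# Hypothetical ground-truth and detected regions, 10 x 10 pixels each.
ground_truth = rect_to_intervals(0, 0, 10, 10)
detected = rect_to_intervals(5, 5, 15, 15)

overlap = intersect(ground_truth, detected)
covered = area(overlap) / area(ground_truth)
print("overlap area:", area(overlap))
print("share of ground truth covered:", covered)
```

Under this assumption, non-rectangular regions are handled the same way: a row simply holds several intervals, and the cost of an intersection grows with the number of intervals rather than with the number of pixels.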