PRImA Research Lab
2015-07-17T15:27:13
2018-07-19T07:29:57
Example Page
Aletheia Document Analysis System
Overview: Aletheia is an advanced system for accurate yet cost-effective ground truthing of large numbers of documents. It aids the user with a number of automated and semi-automated tools, which were partly developed and improved based on feedback from major libraries across Europe and from their digitisation service providers, who use the tool in a production environment.
Novel features include support for top-down ground truthing with sophisticated split and shrink tools, as well as bottom-up ground truthing that aggregates lower-level elements into more complex structures. Special features have been developed to support working with the complexities of historical documents. The integrated validator, in combination with powerful correction tools, enables efficient production of highly accurate ground truth.
Aletheia uses the PAGE (Page Analysis and Ground truth Elements) XML format framework, which incorporates several XML schemas representing the whole workflow of document analysis. See also the dedicated infobox.
Layers and reading order
Screenshot of Aletheia showing regions and properties
The PAGE (Page Analysis and Ground truth Elements) format framework incorporates several XML schemas representing the whole workflow of document analysis, including image enhancement, binarisation, geometrical correction, layout analysis, layout evaluation and OCR. The schema used here for document layouts allows for polygonal regions with various attributes (including text content), reading order, layers and more.
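To make the layout schema more concrete, here is a minimal sketch of reading polygonal regions, their text content and the reading order from a PAGE-style file using only the Python standard library. The embedded sample is a hand-written, heavily simplified fragment, not output from Aletheia. The namespace URI is assumed to match one published version of the PAGE content schema, and the helper names (`parse_points`, `regions_in_reading_order`) are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of one PAGE content schema version; other versions use a different date.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hand-written, simplified sample (hypothetical content, not produced by Aletheia).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1400">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r2" type="paragraph">
      <Coords points="100,300 900,300 900,600 100,600"/>
      <TextEquiv><Unicode>Overview text</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r1" type="heading">
      <Coords points="100,100 900,100 900,200 100,200"/>
      <TextEquiv><Unicode>Aletheia</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def regions_in_reading_order(xml_text):
    """Return the text regions as dicts, sorted by the indexed reading order."""
    root = ET.fromstring(xml_text)
    regions = {}
    for region in root.iterfind(".//pc:TextRegion", NS):
        unicode_el = region.find("pc:TextEquiv/pc:Unicode", NS)
        regions[region.get("id")] = {
            "id": region.get("id"),
            "type": region.get("type"),
            "polygon": parse_points(region.find("pc:Coords", NS).get("points")),
            "text": unicode_el.text if unicode_el is not None else "",
        }
    # Regions may appear in any document order; the ReadingOrder element decides.
    refs = sorted(
        root.iterfind(".//pc:RegionRefIndexed", NS),
        key=lambda ref: int(ref.get("index")),
    )
    return [regions[ref.get("regionRef")] for ref in refs]


for region in regions_in_reading_order(SAMPLE):
    print(region["id"], region["type"], region["polygon"], region["text"])
```

Keeping the reading order as references to region ids, separate from the regions themselves, means the order can be edited without touching the region geometry.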
Typical Workflows
From Scratch, Top-Down
• Marking regions using manual or semi-automated tools
• Marking text lines with easy-to-use split tools
• Marking words with assistive tools
• Marking glyphs (characters)
• Text transcription and propagation to any required level
• Reading order definition
Preproduction + Correction
• Automated page analysis with integrated Tesseract OCR or by opening an externally generated result
• Correcting layout using convenient tools such as merge and split
• Correcting text content using rendered text overlay
• Validation to reduce the risk of mistakes
Other Software Tools by PRImA
Pattern Recognition and Image Analysis Research Lab, School of Computing, Science and Engineering,
University of Salford, Greater Manchester, United Kingdom, www.primaresearch.org
WebAletheia Webapp
A lightweight web-based version of the Aletheia ground truthing system. Ideal for customised workflows and crowdsourcing applications. Go to the PRImA website to try it yourself.

Tesseract OCR to PAGE For Windows
A command line tool to analyse document page images using the open source OCR engine Tesseract and save the results to PAGE XML format. Version 1.3 is based on the latest release of Tesseract (3.03).

PAGE Libraries For Java and C++
Platform-independent libraries to create valid layout descriptions in PAGE XML format. The libraries can be easily integrated in other software projects such as page segmentation methods for ICDAR competitions.

Layout Evaluation Performance Analysis System
This tool is part of a framework for evaluating the performance of layout analysis methods. It combines efficiency and accuracy by using a special interval-based geometric representation of regions. A wide range of sophisticated evaluation measures provides deep insight into the analysed systems, going far beyond simple benchmarking. Support for user-defined profiles allows tuning for any kind of evaluation scenario related to real-world applications.
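The text above does not spell out how the interval-based representation of regions works. As a purely hypothetical illustration of the general idea, the sketch below stores a polygon as horizontal intervals per scanline and computes the overlap between a ground-truth region and a detected region by intersecting those intervals. All function names and the sample rectangles are assumptions made for this example and do not describe the tool's actual implementation.

```python
def polygon_to_intervals(polygon):
    """Map each integer scanline y to a sorted list of (x_start, x_end) intervals."""
    rows = {}
    ys = [y for _, y in polygon]
    n = len(polygon)
    for y in range(min(ys), max(ys)):
        yc = y + 0.5  # sample mid-row so the scanline never passes through a vertex
        xs = []
        for i in range(n):
            (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
            if (y1 <= yc) != (y2 <= yc):  # this edge crosses the scanline
                xs.append(x1 + (yc - y1) * (x2 - x1) / (y2 - y1))
        xs.sort()
        # Even-odd rule: consecutive crossings pair up into inside-intervals.
        rows[y] = list(zip(xs[0::2], xs[1::2]))
    return rows


def area(rows):
    """Approximate region area as the summed length of all intervals."""
    return sum(end - start for spans in rows.values() for start, end in spans)


def overlap_area(rows_a, rows_b):
    """Sum the pairwise interval intersections on scanlines shared by both regions."""
    total = 0.0
    for y in rows_a.keys() & rows_b.keys():
        for a_start, a_end in rows_a[y]:
            for b_start, b_end in rows_b[y]:
                total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total


# Hypothetical ground-truth and detected regions (axis-aligned for simplicity).
ground_truth = polygon_to_intervals([(0, 0), (10, 0), (10, 10), (0, 10)])
detected = polygon_to_intervals([(5, 0), (15, 0), (15, 10), (5, 10)])

print(area(ground_truth), area(detected), overlap_area(ground_truth, detected))
```

Once regions are reduced to sorted intervals, overlap tests become simple one-dimensional comparisons per row, which is one plausible way such a representation could combine efficiency with accuracy.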