Markup
From the very beginning of their work, the AAC's scholars have relied on XML and cognate technologies as the foundation of their corpus build-up. The encoding scheme takes a combined approach: it captures structural features of the texts as well as a certain amount of data describing the physical appearance of the originals.

All the data in the corpus is organized on a one-page-one-document basis. Markup is applied according to a clearly defined workflow that proceeds from basic structural features of the texts, such as divisions, titles, images, headers and footnotes, to more detailed markup in a number of specific projects. Formal information is also preserved when this can be achieved at justifiable expense and whenever such data may have semantic relevance worth considering. Capturing bold and italic text is comparatively straightforward, whereas marking up spaced text passages (which are often of remarkable importance in historical data) poses far more problems. In all decisions, feasibility has been carefully weighed against the value of the information gathered.
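The one-page-one-document principle and the combination of structural and formal markup can be illustrated with a minimal sketch. The element and attribute names used here (`page`, `div`, `title`, `p`, `hi`, `rend`) are assumptions chosen for illustration only; the AAC's actual encoding scheme is not specified in this text.

```python
import xml.etree.ElementTree as ET


def build_page(page_no, title, paragraphs):
    """Build one XML document for one page (hypothetical schema).

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    (text, rend) pairs, where rend is None for plain text or a label
    such as "bold", "italic" or "spaced" for highlighted text.
    """
    page = ET.Element("page", {"n": str(page_no)})
    div = ET.SubElement(page, "div")           # structural markup
    head = ET.SubElement(div, "title")
    head.text = title
    for runs in paragraphs:
        p = ET.SubElement(div, "p")
        last = None                             # most recent <hi> child, if any
        for text, rend in runs:
            if rend is None:
                # Plain text goes before the first child or after the last one.
                if last is None:
                    p.text = (p.text or "") + text
                else:
                    last.tail = (last.tail or "") + text
            else:
                # Formal markup: record the physical appearance of the run.
                last = ET.SubElement(p, "hi", {"rend": rend})
                last.text = text
    return page


page = build_page(
    1,
    "Vorwort",
    [[("Ein ", None), ("gesperrtes", "spaced"), (" Wort.", None)]],
)
xml_text = ET.tostring(page, encoding="unicode")
print(xml_text)
```

Keeping the appearance in a single `rend` attribute, rather than in separate elements per typeface, is one possible design; it lets bold, italic and spaced text be queried in the same way.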

Throughout, the main aim of the AAC's digitization efforts has been the creation of digital resources that meet scholars' needs in philological studies. Hence, "dirty OCR" without subsequent refinement of the digital objects has been ruled out. While it is practically impossible to avoid a certain amount of noise, scholarly investigations need carefully edited texts to yield reliable results. With this guideline in mind, the AAC's working group has always tried to strike a reasonable balance between manual work and the application of automatic and semi-automatic routines. The remarkably heterogeneous structure of the digital objects in the corpus has made this balance harder to strike. Encoders and scholars working on the digital texts have been urged to apply markup very cautiously, using interactive, semi-automated routines rather than letting software make global changes to the texts.
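A semi-automated routine of the kind described above might look like the following sketch, in which software only proposes changes and a person confirms each one. It is not the AAC's tooling. It assumes, purely for illustration, that letterspaced words appear in raw OCR output as single letters separated by single spaces; the function names and the `confirm` callback are hypothetical.

```python
import re

# Assumption: a letterspaced word shows up as three or more single
# letters separated by single spaces, e.g. "s e h r".
SPACED = re.compile(r"\b(?:\w ){2,}\w\b")


def propose(text):
    """Return (start, end, joined) candidates without changing the text."""
    return [
        (m.start(), m.end(), m.group().replace(" ", ""))
        for m in SPACED.finditer(text)
    ]


def apply_confirmed(text, confirm):
    """Apply only those proposals that confirm(original, joined) accepts.

    `confirm` stands in for the interactive step: in practice it would
    show the candidate to an encoder and return their decision.
    """
    out, pos = [], 0
    for start, end, joined in propose(text):
        out.append(text[pos:start])
        original = text[start:end]
        out.append(joined if confirm(original, joined) else original)
        pos = end
    out.append(text[pos:])
    return "".join(out)


line = "Das ist s e h r wichtig."
accepted = apply_confirmed(line, lambda original, joined: True)
rejected = apply_confirmed(line, lambda original, joined: False)
print(accepted)
print(rejected)
```

The pattern will also match false positives, such as two adjacent letterspaced words or a run of genuine single-letter tokens, which is why every candidate passes through the confirmation step instead of being changed globally.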