From the very beginning of their work, the AAC's scholars have relied on XML and
cognate technologies as the foundation of their corpus build-up, applying an encoding
scheme that combines two aims: capturing the structural features of the texts and
recording a certain amount of data describing the physical appearance of the
original texts.
All the data in the corpus has been organized on a one-page-one-document basis. Markup has
been applied according to a clearly defined workflow that proceeds from basic structural
features of the texts, such as divisions, titles, images, headers and footnotes, to more
detailed markup in a number of specific projects. Formal information is also preserved when
this can be achieved at justifiable expense and whenever such data may have semantic
relevance worth considering. Capturing bold and italic text is a comparatively
straightforward task, whereas marking up spaced text passages (which are often of
remarkable importance in historical data) poses considerably more problems. In all decisions,
feasibility has been carefully weighed against the value of the information gathered.
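
The combination of structural and formal markup on a single page document can be pictured with a small sketch. The element and attribute names used here (`page`, `header`, `div`, `title`, `p`, `footnote`, `hi`, `rend`) are illustrative assumptions and not the AAC's actual encoding scheme; the sample text is likewise invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical page document (one page = one document). All element and
# attribute names are assumptions made for illustration only.
PAGE_XML = """<page n="17">
  <header>Running header</header>
  <div type="article">
    <title>Sample title</title>
    <p>A <hi rend="bold">bold</hi>, an <hi rend="italic">italic</hi>
       and a <hi rend="spaced">spaced</hi> expression.</p>
    <footnote n="1">A sample note.</footnote>
  </div>
</page>"""


def summarize_page(xml_text):
    """Separate structural elements from formal (typographic) markup."""
    root = ET.fromstring(xml_text)
    # Structural layer: every element except the typographic highlights.
    structural = [el.tag for el in root.iter() if el.tag != "hi"]
    # Formal layer: the rendering of each highlighted passage and its text.
    formal = [(el.get("rend"), el.text) for el in root.iter("hi")]
    return structural, formal


structural, formal = summarize_page(PAGE_XML)
print(structural)
print(formal)
```

Keeping the typographic information in an attribute rather than in separate element types is one possible design; it lets structural queries ignore the formal layer entirely.
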
The main aim of the AAC's digitization efforts has throughout been the creation of
digital resources that meet scholars' needs in philological studies. Hence,
"dirty OCR" without subsequent refinement of the digital objects has been ruled out. While
it is practically impossible to avoid a certain amount of noise, scholarly investigations
need carefully edited texts to yield reliable results. With this guideline in mind, the
AAC's working group has always tried to strike a reasonable balance between manual work
and the application of automatic and semi-automatic routines. The remarkably heterogeneous
structure of the digital objects in the corpus has made this balance harder to strike. Working
on the digital texts, encoders and scholars have been urged to apply markup very
cautiously, preferring interactive, semi-automated routines to software applications
that make global changes to the texts.
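
What an interactive, semi-automated routine of this kind might look like can be sketched as follows. This is not the AAC's tooling: the regular expression, the function name `mark_spaced`, the `hi`/`rend` markup and the decision to collapse the spacing are all assumptions, and "spaced text" is taken here to mean letter-spaced words. The routine only proposes candidates; a `confirm` callback, standing in for the human encoder, decides each case.

```python
import re

# Candidate pattern (an assumption): three or more single characters
# separated by single spaces, e.g. "g e s p e r r t".
SPACED = re.compile(r"(?<!\w)(?:\w ){2,}\w(?!\w)")


def mark_spaced(text, confirm):
    """Wrap confirmed letter-spaced candidates in hypothetical markup.

    `confirm` is called once per candidate and returns True or False;
    rejected candidates are left exactly as they were.
    """
    def replace(match):
        candidate = match.group(0)
        if confirm(candidate):
            # Collapse the spacing so the word remains searchable.
            word = candidate.replace(" ", "")
            return '<hi rend="spaced">' + word + "</hi>"
        return candidate

    return SPACED.sub(replace, text)


# Scripted decisions stand in for an encoder answering at a prompt:
# "a b c" is a series label here, not letter-spaced emphasis.
sample = "ein g e s p e r r t e s Wort, Reihe a b c"
result = mark_spaced(sample, lambda candidate: candidate != "a b c")
print(result)
```

In interactive use the callback could simply prompt the encoder for each candidate, which keeps false positives such as the series label above out of the markup.
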







