The CLARIN service center of the Zentrum Sprache at the BBAW

Data Curation at the BBAW Language Center and German Text Archive

Digital research data consisting of texts from the late 16th to the early 20th century can be published via the German Text Archive (DTA) at the BBAW Language Center. The module DTA Extensions (DTAE) provides the framework for such publications. The research data are converted into the DTA's input format, the DTA Base Format (DTABf), and integrated into the DTA and CLARIN-D infrastructure. In this way, they become available for research in external contexts and can also be used in combination with the other corpora of the BBAW Language Center.

Preliminaries for Data Curation at BBAW


Suitable extensions to the DTA’s historical corpora are primary sources that were received by a large audience, that are key texts for notable discourses or epochs, or that merit scholarly attention today because of other characteristics.

Ideally, texts should date from between the late 16th and the early 20th century, which is the time frame of the vast majority of DTA corpus texts. However, texts from somewhat outside these limits can also be considered.


The images on which the transcription is based should be available in high resolution, in an uncompressed format (TIFF), and under a license that allows further reuse. The input format for texts is the DTA Base Format (DTABf), an XML format following the TEI P5 Guidelines. Text annotation has to be carried out according to the DTABf, or according to guidelines that are compatible with the DTABf and allow lossless conversion of the texts.
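Because the DTABf is an XML format following TEI P5, a rough pre-check of a transcription file can be sketched as below. This is a minimal illustration, not part of the DTA workflow: the function name `looks_like_tei` and the sample document are assumptions made for this example, and the check only tests well-formedness and the TEI P5 root element. It does not validate against the DTABf schemas.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def looks_like_tei(xml_text: str) -> bool:
    """Return True if xml_text is well-formed XML with a TEI P5 <TEI> root.

    Hypothetical helper: a rough pre-check only, not DTABf validation.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML at all.
        return False
    # ElementTree writes namespaced tags as "{namespace}localname".
    return root.tag == f"{{{TEI_NS}}}TEI"


# Illustrative sample document (not a complete or valid DTABf file).
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader/><text><body><p>Beispiel</p></body></text>"
    "</TEI>"
)

print(looks_like_tei(sample))
print(looks_like_tei("<html><body/></html>"))
```

A file that passes this check would still need to be validated against the DTABf schemas before integration.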

The CLARIN center at BBAW supports the planning and execution of digitization projects, especially with regard to the following tasks:

  • Image Digitization
    The DTA cooperates with various libraries that specialize in collecting prints from the 17th to the 20th century. We can offer advice and connect you with the libraries most suitable for your digitization task.
  • Recording of Metadata
    It is important to record bibliographic metadata as well as information on the digital facsimile and on the text recognition process. The CLARIN center at BBAW provides a web form that facilitates recording metadata according to the DTABf Guidelines and yields a DTABf-conformant TEI header.
  • Text Recognition
    Two workflows are possible for text recognition. The DTA can either perform the entire text recognition task according to a standardized workflow or advise users who perform text recognition on their own. For the latter, we provide comprehensive documentation and schemas of the DTABf. In addition, a framework for the Author mode of the oXygen XML Editor is available, which facilitates text annotation in a WYSIWYG-like view.

DTAE Checklist

The following information is usually collected in order to estimate the effort required to integrate texts into the DTA:

General Information about the Document(s)

  • Short description of text selection criteria
  • Time frame of text origins (publication dates)
  • Language
  • Text type/Genre
  • Discourse
  • Print or manuscript
  • Complete or partial
  • Extent

General Information about the Project

  • Time frame of background project
  • Time frame for data integration
  • Responsible person(s)/institution(s)
  • Website of project
  • Contact person(s) and address

Text Recognition and Annotation

  • Text recognition (TR) completed y/n
  • Edition number of source
  • Metadata at hand y/n
  • Metadata format
  • Guidelines for TR
  • Deviation and general distance of TR guidelines from the DTA guidelines
  • Person/institution/company responsible for text digitization
  • Estimated quality of text recognition
  • Text annotation y/n
  • Guidelines for text annotation
  • Text annotation with TEI y/n
  • Text annotation according to DTABf y/n
  • Estimated effort necessary for conversion into DTABf
  • Text annotation in XML y/n
  • Licenses/conditions for reuse
  • Disclosure risk present y/n
  • Anonymization necessary y/n


  • Images at hand y/n
  • Holding library and shelfmark
  • Format
  • License/conditions for reuse


  • Link to external publication
  • Self-link of DTAE publication
  • Static or dynamic corpus
  • Scheduled extensions

This information is also needed to create minimal metadata records for the DTA publication of each text.
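As a minimal sketch of how some checklist items might be collected in structured form before a metadata record is created, consider the following. The class `ChecklistRecord`, its field names, and the helper `missing_items` are hypothetical and chosen for illustration only. They cover just a few of the checklist items and do not reflect an actual DTA data model.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ChecklistRecord:
    """Hypothetical container for a few DTAE checklist items."""

    language: Optional[str] = None
    text_type: Optional[str] = None
    print_or_manuscript: Optional[str] = None
    text_recognition_completed: Optional[bool] = None
    metadata_at_hand: Optional[bool] = None
    annotation_according_to_dtabf: Optional[bool] = None
    license_for_reuse: Optional[str] = None


def missing_items(record: ChecklistRecord) -> List[str]:
    """Return the names of checklist items that are still unanswered (None)."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Example: a partially completed checklist for a printed German text.
record = ChecklistRecord(
    language="German",
    print_or_manuscript="print",
    text_recognition_completed=True,
)

print(missing_items(record))
```

Using `None` for unanswered items keeps a "no" answer (`False`) distinct from an item that has not been answered yet.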