The work of the CLARIN service center of the Zentrum Sprache focuses primarily on historical corpora of texts from the 16th to the 20th century. Where copyright law permits, the corpora are provided under open licenses and are thus reusable as research data. All corpora are annotated in accordance with recognized standards of good practice (TEI P5, CMDI) and can be accessed via the BBAW repository.
The corpora of the CLARIN service center at the BBAW are:
Deutsches Textarchiv (DTA)
The Deutsches Textarchiv (DTA) project provides a large cross-section of printed works from different disciplines and text types, dating from the 17th to the 19th century, as digitized and annotated full texts. The goal is to create the basis of a reference corpus for the development of the historical New High German language. All texts can be queried across spelling variants and with regard to their linguistic characteristics. In addition, they are presented as electronic full texts together with the corresponding page images and are available for download via the project's website.
DTAE – DTA extensions
Scholars working on digitizing texts from the late 16th to the early 20th century (preparing digital editions, transcriptions, corpora, etc.) have the opportunity to publish these texts via the DTAE module. In this way, the DTA core corpus is continuously supplemented by primary texts from other project contexts, increasing the text basis and variety of the DTA. Moreover, these accumulated texts can be compared as special corpora with the DTA core corpus with respect to their linguistic characteristics.
Digitales Wörterbuch der Deutschen Sprache (DWDS)
The Digital Dictionary of the German Language (Digitales Wörterbuch der Deutschen Sprache; DWDS) is a long-term project of the BBAW. Its aim is to develop a large lexical information system based on the dictionaries already hosted or built up at the BBAW and on large corpora. Three types of corpora, comprising about 2.5 billion text words in total, can be accessed via the web platform: reference corpora, newspaper corpora, and special corpora.
"Dingler online" was a project carried out at the Humboldt University of Berlin (duration 2007-2013) and funded by the German Research Foundation (DFG). Its aim was to digitize all 375 volumes of the technical journal "Polytechnisches Journal" (1820-1931). The resource (205,000 pages) is available as full text and is annotated in its entirety according to the TEI P5 guidelines. Since the end of the project, all of its data resources have been maintained by the CLARIN center at the BBAW and have been further processed in accordance with the recognized CLARIN best practice format for the annotation of historical corpora, the DTA Base Format (DTABf). They are thus sustainably available as research data for future use. Furthermore, the CLARIN center also provides the infrastructure for the online component of the project.
This resource is available via the repository of the Zentrum Sprache at the BBAW.
The corpus C4 is a joint initiative of the Digital Dictionary of the German Language (Digitales Wörterbuch der Deutschen Sprache, DWDS), the Austrian Academy Corpus (AAC), the South Tyrol corpus and the Swiss text corpus (CHTK). The corpus consists of subcorpora from each partner project, which can be queried in a distributed way: the corpora are merged only virtually, and only the query results are presented together.
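The distributed querying described above can be illustrated with a minimal sketch. It is not the actual C4 implementation: the in-memory sample sentences, the function names `query_subcorpus` and `distributed_query`, and the plain word match are all assumptions made for illustration. The point is only that each subcorpus is searched separately and nothing but the hits is combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the partner subcorpora (invented sample data).
SUBCORPORA = {
    "DWDS": ["Das Haus ist alt .", "Ein Haus am See ."],
    "AAC": ["Das Haus steht in Wien ."],
    "South Tyrol": ["Der Hof liegt im Tal ."],
    "CHTK": ["Das Haus hat ein Dach ."],
}


def query_subcorpus(name, sentences, word):
    """Search a single subcorpus and return (source, sentence) hits."""
    return [(name, s) for s in sentences if word in s.split()]


def distributed_query(word, subcorpora=SUBCORPORA):
    """Query every subcorpus separately; combine only the result lists."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(query_subcorpus, name, sentences, word)
            for name, sentences in subcorpora.items()
        ]
        merged = []
        for future in futures:  # submission order keeps the output stable
            merged.extend(future.result())
    return merged


if __name__ == "__main__":
    for source, sentence in distributed_query("Haus"):
        print(f"{source}: {sentence}")
```

In this sketch the subcorpora are never copied into a common index; each hit keeps the name of the partner corpus it came from.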
This corpus comprises all articles of the "Berliner Zeitung" that were published online in the period from January 1994 to December 2005. Extent: 252 million text words (tokens) in 869,000 articles.
This corpus contains all articles of the newspaper "Der Tagesspiegel" that were published online between 1996 and June 2005. Extent: 170 million text words (tokens) in 350,000 articles.
In the course of the curation project of CLARIN's F-AG 10 (the discipline-specific working group for studies of contemporary history), the textual resources of the GDR press portal (DDR Presseportal) are currently being processed to make them CLARIN compliant. The GDR press portal comprises the newspapers "Neues Deutschland" (ND, 1946-1990), "Berliner Zeitung" (BZ, 1945-1993) and "Neue Zeit" (NZ, 1945-1994). The rights of use for ND and BZ within the CLARIN infrastructure were acquired via a cooperation agreement between the licensors on the one side and the licensees (BBAW, Centre for Contemporary History, Berlin State Library) on the other side. A full-text search in these texts is already possible. Further text analysis tools (co-occurrences, time series) are also provided. Using them requires authentication in CLARIN (either through an account on clarin.eu or by means of a regular Shibboleth account). The sources will be integrated into the BBAW repository by the end of the CLARIN construction phase.
ReM – Reference Corpus of Middle High German
The ReM corpus consists of 398 documents (2.5 million tokens) and is searchable via DDC. All documents are also available for reading on a separate website. For further information visit: https://www.linguistics.rub.de/rem/