Digital Humanities and Spanish diachronic online corpora: problems and suggested solutions
By Rocío Díaz Bravo
Universidad de Granada
This paper presents the results of an analysis of Spanish diachronic online corpora, with particular emphasis on lemmatisers and part-of-speech (PoS) taggers, taking into account users’ needs and backgrounds.
My interviews with scholars working on different projects and areas within the history of Spanish show a need for intuitive, user-friendly interfaces, lemmatised and annotated corpora, and advanced search options that support different types of linguistic research (at all linguistic levels, including sociolinguistic variation).
In recent years the number of digital resources and diachronic corpora of Spanish has increased. Very large Spanish diachronic corpora (such as Corpus del Español by Mark Davies and Corpus del Nuevo Diccionario Histórico by Real Academia Española) have made advances from both a textual and a technological point of view, but they still exhibit problems: they are not suitable for many types of linguistic research, and their data need to be revised manually. In addition, an analysis of lemmatisers and PoS taggers for pre-20th century Spanish, tested with texts from different periods in the history of Spanish, leads to the conclusion that these tools also need to improve in accuracy. Furthermore, FreeLing, the only lemmatiser of pre-20th century Spanish that follows international standards, is not user-friendly.
The suggested solutions include the adoption of international standards (such as TEI for editing texts and metadata, or EAGLES for PoS tagging) for reasons of transferability and preservation, as well as greater collaboration (including crowdsourcing) and understanding between disciplines (i.e. digital humanities, computational linguistics and the history of the Spanish language). Many scholars are currently creating and/or using diachronic corpora and could benefit from this research.