Share your datasets

An important aspect of scientific research is that findings are reproducible, falsifiable and transparent. Especially in an empirical approach, it is of the utmost importance to make datasets available. It should become a natural reflex to feel an urge for seeing the data behind the publication. No matter how well the publication describes the variables, it is always interesting and insightful to learn how certain observations are annotated. From your own experience, you probably already know how difficult it usually is to decide which value to assign from the variables you are investigating. These insecureties are also present in other corpus linguists. Perhaps, that is why many (corpus) linguists do not make their datasets freely available. But usually, they bring two kinds of arguments to the table.

Continue reading

Advertisements

Accountability, recall and precision in corpus linguistics

For many inexperienced linguists who start working with corpora, there is the misconception that a query in a corpus leads almost directly towards solving a research question. Nothing, however, is less true than this. A corpus linguistic approach to a research question often involves a lot of work, both on an intellectual and on a technical/mind-numbing level.

Continue reading