Corpora versus datasets

For a corpus linguist, the terms corpus and dataset can be rather confusing. Indeed, the two are very similar:

  • both contain linguistic production,
  • both usually provide further information about the production in the form of annotations,
  • these annotations can be linguistic in nature, but may also provide meta-information about the language producer, or about the context in which the production took place.

In fact, some people would go so far as to say that there is no difference between a corpus and a dataset. However, I do not agree and I would like to suggest a prototype-based approach.

Continue reading

How to extract data from COHA into Excel or R?

The Corpus of Historical American English (COHA) is a wonderful source for corpus-linguistic research on diachronic English phenomena. It contains about 400 million words from newspapers, magazines, and fiction and non-fiction books, covering the period from 1810 up to 2009. A very neat web interface is available for searching COHA, and it offers quite a number of useful search features.

However, the COHA web interface does not allow you to make a really good dataset for corpus linguistic research.
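As a sketch of what the alternative looks like, the snippet below builds a tiny tab-separated file standing in for a COHA export and reads it into an R data frame. The file name and the columns (year, genre, hit) are invented for illustration; a real export from the interface will look different.

```r
# Create a tiny tab-separated file standing in for a COHA export.
# File name and column layout are made up for this example.
writeLines(c("year\tgenre\thit",
             "1810\tFIC\tgotten",
             "1900\tNEWS\tgot"),
           "coha_export.tsv")

# Read it into a data frame, as you would with a real export.
coha <- read.delim("coha_export.tsv", header = TRUE,
                   stringsAsFactors = FALSE)
nrow(coha)  # 2 observations, ready for further annotation
```

From such a data frame you can go on to add your own annotation columns and build a proper dataset.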

Continue reading

Share your datasets

An important aspect of scientific research is that findings are reproducible, falsifiable and transparent. Especially in an empirical approach, it is of the utmost importance to make datasets available. It should become a natural reflex to want to see the data behind a publication. No matter how well the publication describes the variables, it is always interesting and insightful to learn how certain observations were annotated. From your own experience, you probably already know how difficult it usually is to decide which value of the variables you are investigating to assign. These insecurities are present in other corpus linguists as well. Perhaps that is why many (corpus) linguists do not make their datasets freely available. Usually, however, they bring two kinds of arguments to the table.

Continue reading

Accountability, recall and precision in corpus linguistics

For many inexperienced linguists who start working with corpora, there is the misconception that a corpus query leads almost directly to the answer to a research question. Nothing could be further from the truth. A corpus-linguistic approach to a research question often involves a lot of work, both on an intellectual and on a technical, mind-numbing level.
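To make the two technical notions concrete: precision is the proportion of retrieved hits that are genuine instances of the phenomenon, and recall is the proportion of all genuine instances that the query actually retrieved. A minimal R sketch, with made-up counts from a hypothetical query:

```r
# Illustration values only: counts you would obtain by manually
# checking the hits of a query (and a sample of the non-hits).
retrieved   <- 500   # hits returned by the query
true_hits   <- 430   # hits that are genuine instances of the phenomenon
missed_hits <- 45    # genuine instances the query failed to retrieve

precision <- true_hits / retrieved                 # 0.86
recall    <- true_hits / (true_hits + missed_hits)
```

Improving one measure typically hurts the other: a broader query raises recall but lowers precision, and vice versa.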

Continue reading

How to make a frequency table in R

Once you have imported your dataset into R, there are countless possibilities for analyzing the data in a quantitative way, as opposed to the qualitative analysis that went into the annotation. The very first quantitative analysis that you may want to perform on categorical variables is to see how often a certain value occurs relative to the other values of a variable. This is practical, for instance, if you want to find out whether you have a fairly balanced dataset.
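As a minimal sketch of what such a first analysis looks like, the snippet below uses R's built-in table() on a toy data frame; the variable names and values are invented for illustration.

```r
# A toy data frame standing in for an annotated dataset.
d <- data.frame(variant = c("of", "s", "of", "of", "s"),
                genre   = c("news", "news", "fic", "fic", "news"),
                stringsAsFactors = FALSE)

# One-way frequency table: how often does each value occur?
table(d$variant)

# The same as proportions instead of raw counts:
prop.table(table(d$variant))

# Cross-tabulation of two categorical variables:
table(d$variant, d$genre)
```

A quick glance at such a table immediately tells you whether one value dominates the dataset.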

Continue reading

Save your datasets as csv files

Although it is a good idea to build your datasets in spreadsheet software, it is an even better idea to save your dataset (once you have finished the annotation, of course) in the CSV format.
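A minimal sketch of the round trip in base R, using a toy data frame (write.csv and read.csv are part of base R's utils package):

```r
# A toy annotated dataset; the columns are invented for illustration.
d <- data.frame(id = 1:3, variant = c("of", "s", "of"),
                stringsAsFactors = FALSE)

# Save as CSV; row.names = FALSE avoids a spurious first column.
write.csv(d, "dataset.csv", row.names = FALSE)

# Reading it back in later (or on another machine) is just as simple:
d2 <- read.csv("dataset.csv", stringsAsFactors = FALSE)
identical(d$variant, d2$variant)  # TRUE
```

Because CSV is plain text, the file stays readable in any software, on any platform, for decades to come.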

Continue reading