For a corpus linguist, the terms corpus and dataset can sometimes be quite confusing. Indeed, the two are very similar:
- both contain linguistic production,
- both usually provide further information about the production in the form of annotations,
- these annotations can be linguistic in nature, but they may also provide meta-information about the language producer, or about the context in which the production took place.
In fact, some people would go so far as to say that there is no difference between a corpus and a dataset at all. However, I do not agree, and I would like to suggest a prototype-based approach instead.
The Corpus of Historical American English (COHA) is a wonderful source for corpus-linguistic research on diachronic English phenomena. It contains about 400 million words from newspapers, magazines, and fiction and non-fiction books, ranging from 1810 up to 2009. A very convenient web interface is available for searching COHA, and it offers quite a number of useful search features.
However, the COHA web interface does not allow you to build a really good dataset for corpus-linguistic research.
An important aspect of scientific research is that findings are reproducible, falsifiable and transparent. Especially in an empirical approach, it is of the utmost importance to make datasets available. It should become a natural reflex to want to see the data behind a publication. No matter how well the publication describes its variables, it is always interesting and insightful to learn how certain observations were annotated. From your own experience, you probably already know how difficult it usually is to decide which value to assign for the variables you are investigating. Other corpus linguists share these uncertainties as well. Perhaps that is why many (corpus) linguists do not make their datasets freely available. When asked, however, they usually bring two kinds of arguments to the table.
The very first step of any quantitative study is to get the data into software that can perform a quantitative analysis, such as R. This post explains how that is done. For the explanation, we assume a working R installation, but no extra packages are required.
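To make this concrete, here is a minimal sketch using only base R. The file name `dataset.csv` and its columns are made up for illustration; to keep the example self-contained, we first write a tiny dataset to disk and then read it back in, which is the step you would normally start from.

```r
# Hypothetical example data: a tiny annotated dataset with made-up columns.
example <- data.frame(
  token   = c("colour", "color", "colour"),
  variety = c("BrE", "AmE", "BrE"),
  year    = c(1850, 1900, 1950)
)
write.csv(example, "dataset.csv", row.names = FALSE)

# read.csv() is part of base R, so no extra packages are required.
dataset <- read.csv("dataset.csv", header = TRUE, stringsAsFactors = FALSE)

# Inspect the result: str() shows the column types, head() the first rows.
str(dataset)
head(dataset)
```

In your own study you would skip the `write.csv()` step and simply point `read.csv()` at the file you exported from your spreadsheet software.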
Although it is a good idea to build your datasets in spreadsheet software, it is an even better idea to save your dataset (after you are done with the annotation, of course) in the CSV format.
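One advantage of CSV is that it is plain text, so anyone can inspect it without special software. The sketch below, with made-up column names, shows how a dataset can be written to CSV from within R and what the resulting file actually looks like.

```r
# Hypothetical annotated dataset with invented columns.
annotated <- data.frame(
  token = c("give", "gave"),
  tense = c("present", "past")
)

# write.csv() produces a plain-text file that any software can read.
# row.names = FALSE avoids an extra, unnamed column of row numbers.
write.csv(annotated, "annotated.csv", row.names = FALSE)

# Because CSV is plain text, we can look at the raw file directly:
readLines("annotated.csv")
```

Note that spreadsheet software in many European locales exports "CSV" files with a semicolon as separator; such files can be read back into R with `read.csv2()` instead of `read.csv()`.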
Datasets are among the most important objects in a scientific study. It is best to stick to a widely used format for your dataset, so that other people are able to understand what you have done. In order to find a good format for corpus-linguistic datasets, the nature of corpus-linguistic data needs to be investigated.