Corpora versus datasets

As a corpus linguist, I sometimes find the terms corpus and dataset very confusing. Indeed, they are very similar:

  • both contain linguistic production,
  • both usually provide further information about the production in the form of annotations,
  • these annotations can be linguistic in nature, but may also reveal meta-information about the language producer, or the context in which the production took place.

In fact, some people would go so far as to say that there is no difference between a corpus and a dataset. However, I do not agree and I would like to suggest a prototype-based approach.

For both a corpus and a dataset, I stick to the following two minimal definitions:

  • A corpus is a representative sample of actual language production within a meaningful context and with a general purpose.
  • A dataset is a representative sample of a specific linguistic phenomenon in a restricted context and with annotations that relate to a specific research question.

The prototypical corpus and the prototypical dataset can thus be summarized in the following table:

              Prototypical corpus       Prototypical dataset
Language      unrestricted production   specific phenomenon
Context       wide                      restricted
Purpose       general                   research question

Now, obviously, these prototypical cores do not necessarily correspond to all corpora and datasets out there. Let me nonetheless give some extreme examples. As a prototypical corpus, I consider the Usenet corpus of Westbury. This corpus is pure text downloaded from Usenet servers and presented as is (except for a couple of clean-up measures). There is no restriction on the context, and everybody is in principle free to use it for whatever research question they like. For me, a prototypical dataset is the horse name corpus (yes, it is called a corpus; let the troubles begin). Only a very specific phenomenon is sampled in the data (horse names), and there is no context whatsoever. Although you can think up a small number of research questions for this data, its purpose is clearly much more limited than that of the Usenet corpus.

In reality, however, most corpora and datasets are a mix of the two. Most corpora are annotated (with part-of-speech tags, lemmas, syntax, named entities, etc.) and, for copyright reasons, do not contain full texts (usually only paragraphs, chapters, snippets or fragments). And often, corpora are compiled with a certain research question in mind. Conversely, datasets are often based on Keywords in Context, so they in fact also contain a relatively wide context and could be used for other research questions on different linguistic phenomena, too.
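To make the Keywords in Context idea concrete, here is a minimal sketch in Python. The function name kwic, the window size and the sample sentence are my own illustrative assumptions, not taken from any existing tool; real concordancers tokenize far more carefully.

```python
import re


def kwic(text, keyword, window=3):
    """Return (left context, hit, right context) for every hit of keyword."""
    # Naive tokenization: lowercase, keep runs of word characters only.
    tokens = re.findall(r"\w+", text.lower())
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


# Invented sample text, for illustration only.
sample = "The horse ran fast. A grey horse stood by the gate."
for left, hit, right in kwic(sample, "horse"):
    print(f"{left:>15} | {hit} | {right}")
```

Each hit becomes one row with its own left and right context, which is why a dataset built this way still carries enough surrounding text to be reused for other questions.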

For me, however, there is one distinctive characteristic that sets corpora and datasets apart. Datasets are in the unstacked format and freely distributed as CSV files, whereas corpora may come in a wide range of formats (vertical, XML, graphs).
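The difference in format can be sketched roughly as follows, in Python. I am assuming here that "unstacked" means one row per observation with one column per annotation, and that "vertical" means one token per line with tab-separated annotations. The horse names, column names and tags are invented for illustration.

```python
import csv
import io

# Hypothetical dataset: one row per observation, one column per annotation.
dataset_csv = "name,n_words\nSilver Star,2\nThunder,1\n"

# Hypothetical corpus fragment in vertical format: one token per line,
# with tab-separated token, part-of-speech tag and lemma.
corpus_vertical = "The\tDET\tthe\nhorses\tNOUN\thorse\nran\tVERB\trun\n"

# The dataset loads straight into rows keyed by column name.
rows = list(csv.DictReader(io.StringIO(dataset_csv)))

# The corpus is a token stream; each line has to be split into its fields.
tokens = [line.split("\t") for line in corpus_vertical.splitlines()]

print(rows[0]["name"], rows[0]["n_words"])
print([lemma for _token, _pos, lemma in tokens])
```

The dataset is ready for analysis as soon as it is read in, while the corpus still has to be queried or aggregated before it answers any particular question.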
