Corpora versus datasets

As a corpus linguist, I sometimes find the terms corpus and dataset very confusing. Indeed, they are very similar:

  • both contain linguistic production,
  • both usually provide further information about the production in the form of annotations,
  • these annotations can be linguistic in nature, but may also reveal meta-information about the language producer or the context in which the production took place.

In fact, some people would go so far as to say that there is no difference between a corpus and a dataset. However, I do not agree and I would like to suggest a prototype-based approach.
