Very often in linguistics, it is simply not possible to provide a classical definition, with necessary and sufficient conditions, for our categories. This holds for most (perhaps all?) linguistic categories: even basic ones such as parts of speech are not entirely clearly defined. In fact, Langacker (1987) takes that as a sign that we should rethink our linguistic ideas as a whole. But how, then, can we as corpus linguists annotate our data correctly? Well, that is where inter-annotator agreement comes into play: we check how consistently independent annotators assign the same category to the same items.
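To make the idea of inter-annotator agreement a bit more concrete, here is a minimal sketch of one common agreement measure, Cohen's kappa, which corrects the raw proportion of matching labels for the agreement expected by chance. The choice of kappa, the function name `cohens_kappa`, and the part-of-speech labels are all assumptions made for illustration; the text above does not commit to any particular measure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: proportion of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each annotator labelled at random with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six tokens by two annotators.
annotator_a = ["NOUN", "VERB", "ADJ", "NOUN", "VERB", "NOUN"]
annotator_b = ["NOUN", "VERB", "NOUN", "NOUN", "VERB", "ADJ"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"{kappa:.2f}")  # prints 0.45
```

In this toy example the annotators agree on four of six tokens, but because some of that agreement would be expected by chance, kappa comes out noticeably lower than the raw agreement of 0.67.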
As a corpus linguist, I sometimes find the terms corpus and dataset very confusing. Indeed, they are very similar:
- both contain linguistic production,
- both usually provide further information about the production in the form of annotations,
- in both, these annotations can be linguistic in nature, but may also reveal meta-information about the language producer, or the context in which the production took place.
In fact, some people would go so far as to say that there is no difference between a corpus and a dataset. I do not agree, however, and would like to suggest a prototype-based approach instead.