Corpus linguistics is real(ly)? awesome

Now and then, you hear something and wonder why it was said the way it was said. For me, one such case is the word “real” used as a modifier of adjectives without the prescriptively required adverbial suffix “-ly”:

I just heard some real bad news (Kanye West)

That shirt is real fly! (Fresh Prince of Bel-Air)

Prescriptively, one would expect “really bad” and “really fly”. Things like this catch my attention, so I decided to do a small corpus-linguistic investigation to find out what is going on.
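To give a rough idea of what such an investigation involves, here is a minimal sketch in Python. It is not the method used in the post: the four sentences, the adjective list, and the function name `count_modifiers` are all hypothetical stand-ins, and the adjective list is only a crude substitute for proper part-of-speech annotation in a real corpus.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; a real study would query a large, documented corpus.
SENTENCES = [
    "I just heard some real bad news",
    "That shirt is real fly!",
    "That was a really bad idea",
    "This is the real reason",
]

# Hypothetical adjective list, standing in for part-of-speech annotation.
ADJECTIVES = {"bad", "fly", "good", "nice"}

# Capture "real" or "really" together with the word that follows it.
PATTERN = re.compile(r"\b(real|really)\s+(\w+)", re.IGNORECASE)


def count_modifiers(sentences, adjectives):
    """Count how often 'real' and 'really' directly precede a known adjective."""
    counts = Counter()
    for sentence in sentences:
        for modifier, following in PATTERN.findall(sentence):
            # Skip cases such as "real reason", where "real" modifies a noun.
            if following.lower() in adjectives:
                counts[modifier.lower()] += 1
    return counts


counts = count_modifiers(SENTENCES, ADJECTIVES)
print(counts["real"], counts["really"])  # prints: 2 1
```

On real data, the interesting part is comparing these counts across registers, speakers, or time periods, which is exactly where corpus annotations come in.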


Corpora versus datasets

As a corpus linguist, I sometimes find the terms corpus and dataset very confusing. Indeed, they are very similar:

  • both contain linguistic production,
  • both usually provide further information about the production in the form of annotations,
  • these annotations can be linguistic in nature, but may also reveal meta-information about the language producer or the context in which the production took place.

In fact, some people would go so far as to say that there is no difference between a corpus and a dataset. I do not agree, however, and would like to suggest a prototype-based approach.
