Now and then, you hear something and wonder why it was said the way it was. For me, one such case is the use of “real” as a modifier of adjectives without the prescriptively required adverbial suffix “-ly”:
I just heard some real bad news (Kanye West)
That shirt is real fly! (Fresh Prince of Bel-Air)
Prescriptively, one would expect “really bad” and “really fly”. Such things attract my attention, so I decided to do a small corpus-linguistic investigation to find out what is going on.
As a corpus linguist, I sometimes find the terms corpus and dataset very confusing. Indeed, they are very similar:
- both contain linguistic production,
- both usually provide further information about the production in the form of annotations,
- these annotations can be linguistic in nature, but may also reveal meta-information about the language producer, or the context in which the production took place.
In fact, some people would go so far as to say that there is no difference between a corpus and a dataset. However, I do not agree and I would like to suggest a prototype-based approach.
The very first step of any quantitative study is to get the data into software that can perform a quantitative analysis, such as R. This post explains how to do that. The explanation assumes a working R installation, but no extra packages are required.
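To give a first impression of what getting data into R can look like with base R alone, here is a minimal sketch. The toy data, the column names (intensifier, adjective, source) and the temporary file are all made up for illustration; your own data will likely live in an existing file with a different layout. The sketch assumes a tab-separated file with a header row.

```r
# Hypothetical toy data: three observations of "real"/"really" + adjective.
# The column names and values are invented for this illustration.
lines <- c(
  "intensifier\tadjective\tsource",
  "real\tbad\tlyrics",
  "real\tfly\ttv",
  "really\tbad\tlyrics"
)

# Write the toy data to a temporary tab-separated file,
# so that the sketch is self-contained.
path <- tempfile(fileext = ".tsv")
writeLines(lines, path)

# read.delim() ships with base R (package utils, loaded by default).
# quote = "" stops apostrophes and quotation marks in the text
# from being interpreted as string delimiters.
dat <- read.delim(path, quote = "", stringsAsFactors = FALSE)

str(dat)                # inspect the structure and column types
table(dat$intensifier)  # count how often each intensifier occurs
```

Setting quote = "" is a cautious choice for linguistic data, where apostrophes and quotation marks are usually part of the text rather than field delimiters. If your file is comma-separated, read.csv() takes the same arguments.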