Some corpora come without a search interface. How do you search them? Perhaps you read them into a concordance program like AntConc, but then you notice that the corpus has an idiosyncratic format that messes up the lines. AntConc quickly becomes unusable in that case. So, what can you do? The simplest solution is to write a small Python script!
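What might such a script look like? Here is a minimal sketch of a KWIC-style (keyword in context) search over a folder of plain-text files, using only the standard library. The folder name, file pattern, and search pattern are all placeholders you would adapt to your own corpus:

```python
import re
from pathlib import Path

def concordance(corpus_dir, pattern, context=30):
    """Yield simple KWIC-style hits for a regex pattern in all .txt files."""
    regex = re.compile(pattern)
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for m in regex.finditer(text):
            left = text[max(0, m.start() - context):m.start()]
            right = text[m.end():m.end() + context]
            yield path.name, left, m.group(), right

# Hypothetical usage: print all hits for "real" followed by another word
# for name, left, hit, right in concordance("my_corpus", r"\breal\s+\w+"):
#     print(f"{name}: …{left}[{hit}]{right}…")
```

Because you control the script, a corpus with an idiosyncratic format is no longer a problem: you can strip or normalize the odd parts before searching.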
In previous posts, you have already learned how to make a frequency table or a contingency table for categorical variables. Although such a table can be very insightful, patterns usually only become tangible once they are visualized. In this post, we learn how to turn a frequency or contingency table into a barplot with R.
In a previous post, it was explained how to make a simple frequency table in R. Such a table tells you, for a single categorical variable, how often each level (variant) of that variable occurs in your dataset.
A contingency table does the same thing, but for two categorical variables at once, and in comparison to each other: each level of the first categorical variable is cross-tabulated with each level of the second.
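The posts themselves build these tables in R; purely as a language-neutral illustration of the two concepts, here is a minimal Python sketch using only the standard library. The toy data (variants of "real(ly)" by register) is invented:

```python
from collections import Counter

# Invented toy data: each observation pairs two categorical variables,
# here a morphological variant and the register it occurred in.
observations = [
    ("real", "speech"), ("really", "speech"), ("really", "writing"),
    ("real", "speech"), ("really", "writing"), ("really", "speech"),
]

# Frequency table for a single variable: count each level on its own.
freq = Counter(variant for variant, register in observations)

# Contingency table: count each combination of levels of both variables.
contingency = Counter(observations)

# Print the contingency table with variants as rows, registers as columns.
rows = sorted({v for v, _ in observations})
cols = sorted({r for _, r in observations})
print("variant", *cols)
for v in rows:
    print(v, *(contingency[(v, c)] for c in cols))
```

The frequency table collapses over registers (e.g. "really" occurs four times in total), while the contingency table keeps the cross-classification, which is what lets you compare the two variables against each other.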
Very often in linguistics, it is simply not possible to provide a classical definition, with necessary and sufficient conditions, for our categories. This holds for most (perhaps all?) linguistic categories: even basic ones such as parts of speech are not clearly delimited. In fact, Langacker (1987) takes this as a sign that we should rethink our linguistic theory altogether. But how, then, can we annotate our data reliably as corpus linguists? Well, that is where inter-annotator agreement comes into play.
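One common way to quantify inter-annotator agreement is Cohen's kappa, which corrects the raw proportion of agreement for the agreement expected by chance. A minimal sketch in Python, with invented example annotations:

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    # Observed agreement: proportion of items given identical labels.
    po = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Chance agreement: derived from each annotator's label distribution.
    c1, c2 = Counter(ann1), Counter(ann2)
    labels = set(c1) | set(c2)
    pe = sum((c1[lab] / n) * (c2[lab] / n) for lab in labels)
    return (po - pe) / (1 - pe)

# Invented example: two annotators tag ten tokens as NOUN or VERB.
a = ["NOUN", "NOUN", "VERB", "NOUN", "VERB", "NOUN", "NOUN", "VERB", "NOUN", "VERB"]
b = ["NOUN", "NOUN", "VERB", "VERB", "VERB", "NOUN", "NOUN", "NOUN", "NOUN", "VERB"]
# Here po = 0.8 but pe = 0.52, so kappa ≈ 0.58: moderate agreement only.
```

The point of the chance correction is exactly the fuzziness described above: when annotators agree 80% of the time on a two-way distinction, a fair amount of that agreement would have arisen even by guessing.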
Now and then, you hear something and wonder why it was said the way it was said. For me, one such phenomenon is the word “real” used without the prescriptively required adverbial suffix “-ly” as a modifier of adjectives:
I just heard some real bad news (Kanye West)
That shirt is real fly! (Fresh Prince of Bel-Air)
As noted, one would expect “really bad” and “really fly”. These kinds of things attract my attention, so I decided to do a small corpus linguistic investigation to find out what is going on.
For a corpus linguist, the terms corpus and dataset can be quite confusing. Indeed, they are very similar:
- both contain linguistic production,
- both usually provide further information about the production in the form of annotations,
- these annotations can be linguistic in nature, but may also provide meta-information about the language producer, or about the context in which the production took place.
In fact, some people would go so far as to say that there is no difference between a corpus and a dataset. However, I disagree, and I would like to suggest a prototype-based approach instead.
The Corpus of Historical American English (COHA) is a wonderful source for corpus linguistic research on diachronic English phenomena. It contains about 400 million words from newspapers, magazines, and fiction and non-fiction books, spanning 1810 to 2009. A neat web interface is available for searching the COHA, with quite a number of useful search features.
However, the COHA web interface does not allow you to make a really good dataset for corpus linguistic research.