In a previous post, it was explained how you can make a simple frequency table in R. Such a frequency table tells you for a single categorical variable how often each level (variant) of the categorical variable occurs in your dataset.
A contingency table does the same thing, but for two categorical variables at the same time, and in “comparison” to each other. Basically, what happens is that each level of the first categorical variable is considered with respect to each level of the second categorical variable.
For linguists, the contingency table is an important tool. It allows you to quickly find out how two factors are related to each other: which socio-demographic class (low/middle/high) drops or pronounces the ‘r’ (sociolinguistics), what the position of the verb is in different types of sentences (syntax), etc.
Let’s make this a bit simpler and more tangible. Imagine you keep a dataset of going out with friends, in which you record what you have been drinking and what you have been doing. That dataset could look like this:
Now, in R, we could replicate this dataset as follows:
ds = rbind(
  c("1", "beer", "movie"),
  c("2", "cola", "movie"),
  c("3", "cola", "hanging out"),
  c("4", "beer", "movie"),
  c("5", "cola", "hanging out")
)
colnames(ds) = c("obs", "drink", "activity")
ds.df = as.data.frame(ds)
In R, that would look like this:
> ds.df
  obs drink    activity
1   1  beer       movie
2   2  cola       movie
3   3  cola hanging out
4   4  beer       movie
5   5  cola hanging out
If you have been following this blog, you already know how to make a frequency table for the individual factors with the table command. Now, I am going to show you how to investigate the factors ‘drink’ and ‘activity’ at the same time. So, the question is: is there any link between the kind of activity (movie or hanging out) and what I drink? By hand, you could proceed like this: “ok, so every time I just hang out, I seem to drink cola, but if I go to the movies, I have a beer two out of three times”. And there you have your descriptive statistic already.
How can we automate this? In R, we can simply use the ftable command. This command takes one argument, in which you write a formula of the form ‘factor 1’ ~ ‘factor 2’. If we were doing inferential statistics — but we are not, we are doing descriptive statistics — you might read this as: use ‘factor 2’ to predict (~) ‘factor 1’. For a descriptive approach, you can simply interpret the ~-sign as ‘correlation’ (which does not imply causation, but simply a descriptive co-variation).
For our little question: “is there a link between my activities and what I drink?” we just do this:
ds.df.ftab = ftable(ds.df$drink ~ ds.df$activity)
The table in ds.df.ftab looks as follows:
> ftable(ds.df$drink ~ ds.df$activity)
               ds.df$drink beer cola
ds.df$activity
hanging out                   0    2
movie                         2    1
It is fairly simple to read this table. The first row of names shows that the columns reflect the ‘drink’ factor, with the two levels ‘beer’ and ‘cola’; the first column of names shows that the rows reflect the ‘activity’ factor, with the two levels ‘hanging out’ and ‘movie’. The cells then show you that there are zero occurrences of drinking beer while hanging out, but two occurrences of beer drinking during the movies. I drank cola two times while hanging out, and one cola during a movie.
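As a cross-check, the same counts can also be produced with the two-argument form of the table command, where the first argument supplies the rows and the second the columns. A self-contained sketch, rebuilding the toy dataset from above:

```r
# Rebuild the toy dataset so the snippet runs on its own
ds <- rbind(
  c("1", "beer", "movie"),
  c("2", "cola", "movie"),
  c("3", "cola", "hanging out"),
  c("4", "beer", "movie"),
  c("5", "cola", "hanging out")
)
colnames(ds) <- c("obs", "drink", "activity")
ds.df <- as.data.frame(ds)

# table(rows, columns): activity in the rows, drink in the columns
tab <- table(ds.df$activity, ds.df$drink)
tab
#               beer cola
#  hanging out     0    2
#  movie           2    1
```

Whether you prefer ftable with a formula or table with two arguments is mostly a matter of taste for two factors; ftable becomes more useful when more than two factors are involved.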
These absolute frequencies are already pretty cool — and you can imagine how handy this will get with a larger dataset — but what you actually want is relative frequencies, so that you can say something like “on 66% of my visits to the movies, I drank beer”.
It is fairly dangerous to calculate relative frequencies if you do not know what you are doing. In the case of the absolute frequencies that we have here, ranging between zero and two, it is absolute madness to calculate a relative frequency. Perhaps, I’ll explain later why, but as a general rule of thumb, if you do not have at least 15 or better still 30 observations per factor (so 30 observations for activity, or 30 for drink), refrain from calculating a relative frequency!
We can calculate relative frequencies with the prop.table command. The prop.table command takes two arguments: the first one is the table for which you want to calculate the relative frequencies, and the second one is either 1 or 2, to calculate the relative frequencies per row or per column, respectively.
Calculating the relative frequencies per row:
> prop.table(ds.df.ftab, 1)
               ds.df$drink      beer      cola
ds.df$activity
hanging out                0.0000000 1.0000000
movie                      0.6666667 0.3333333
Calculating the relative frequencies per column:
> prop.table(ds.df.ftab, 2)
               ds.df$drink      beer      cola
ds.df$activity
hanging out                0.0000000 0.6666667
movie                      1.0000000 0.3333333
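A quick sanity check on what the margin argument does: with margin 1 every row sums to 1, with margin 2 every column sums to 1. A small sketch, using the equivalent two-argument table form of the same counts:

```r
# Rebuild the toy dataset so the snippet runs on its own
ds <- rbind(
  c("1", "beer", "movie"),
  c("2", "cola", "movie"),
  c("3", "cola", "hanging out"),
  c("4", "beer", "movie"),
  c("5", "cola", "hanging out")
)
colnames(ds) <- c("obs", "drink", "activity")
ds.df <- as.data.frame(ds)

tab <- table(ds.df$activity, ds.df$drink)

row.props <- prop.table(tab, 1)  # margin 1: proportions within each row
col.props <- prop.table(tab, 2)  # margin 2: proportions within each column

rowSums(row.props)  # each row sums to 1
colSums(col.props)  # each column sums to 1
```

If you forget which margin is which, this check takes a few seconds and settles it.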
Let’s try this out on some actual data. Here is a dataset that was collected during a seminar that I held at the HU Berlin. It investigates the reduction of pronouns used as a subject after the main verb in German, collected in clothing shops in Berlin. The thing is, you can either say in High German ‘haben wir nicht’ or in the Berlin style ‘hamwa nich’. So, the subject ‘wir’ is reduced to ‘wa’.
Ok, we read in the dataset as usual:
ds = read.delim("https://corpuslinguisticmethods.files.wordpress.com/2014/02/berlin-red.key",
                header=T, row.names=1, sep="\t")
ds.df = as.data.frame(ds)
# overview of the columns
colnames(ds.df)
The clothing shops have been categorized into four price categories, by looking at the average price for jeans and jackets. This information is captured in the column Price_cat_4. The information about whether or not we have a reduced realization of the pronoun is captured in the column Reduction. Now, we want to find out, whether there is a link between the price category and the amount of reduction that we can observe. We can find this out by means of a contingency table.
So, we run the following R code to get the contingency table
red.price <- ftable(ds.df$Reduction ~ ds.df$Price_cat_4)
that looks like this:
> red.price
                  ds.df$Reduction f t
ds.df$Price_cat_4
1                                 6 9
2                                 6 7
3                                 6 2
4                                 5 3
The common hypothesis is that the vernacular ‘hamwa’ form should occur more in low-prestige environments of the cheaper stores, whereas the standard ‘haben wir’ should occur more in the high-prestige environments of the expensive stores. To observe the relative frequency of reduced versus non-reduced pronouns per price category, we calculate the relative frequencies per row:
red.price.rf = prop.table(red.price, 1)
The relative frequencies now look as follows:
> red.price.rf
                  ds.df$Reduction         f         t
ds.df$Price_cat_4
1                                 0.4000000 0.6000000
2                                 0.4615385 0.5384615
3                                 0.7500000 0.2500000
4                                 0.6250000 0.3750000
If we focus on the relative frequency of the non-reduced observations (the first column), we can see that the shops with the highest prices have more non-reduced (standard) forms (0.625) than the shops with the lowest prices (0.40), and that the second lowest price range is in between (0.46). However, the third category of more expensive, but not yet the most expensive, clothes peaks at 0.75, having the most non-reduced forms of them all. This might be an indication of Labov’s ‘hypercorrection’.
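A pattern like this is often easier to see in a plot than in a table of decimals. Here is a sketch that rebuilds the counts from the contingency table above (rather than re-reading the dataset) and draws the row-wise relative frequencies as grouped bars, one group per price category:

```r
# Counts taken from the contingency table in the post:
# rows = price categories 1-4, columns = Reduction (f = full, t = reduced)
counts <- matrix(c(6, 9,
                   6, 7,
                   6, 2,
                   5, 3),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Price_cat_4 = c("1", "2", "3", "4"),
                                 Reduction   = c("f", "t")))

# Relative frequencies per row, as in prop.table(red.price, 1)
red.price.rf <- prop.table(counts, 1)

# Grouped bars: transpose so that Reduction levels form the bars
# within each price-category group
barplot(t(red.price.rf), beside = TRUE,
        legend.text = TRUE,
        xlab = "Price category", ylab = "Relative frequency")
```

The peak of non-reduced forms in category 3 stands out immediately in the plot, which is exactly the kind of thing a descriptive statistic should make visible.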