Once you have imported your dataset into R, there are countless possibilities for analyzing the data in a quantitative way, as opposed to the qualitative analysis that went into the annotation. The very first quantitative analysis that you may want to perform on a categorical variable is to see how often each of its values occurs relative to the others. This is practical if you want to find out whether you have a fairly balanced dataset.
This post will explain how to make a frequency table in R, and on the way, a number of other practical functions will be introduced.
Let’s create a fictional dataset with demographic information on people who live in the north or in the south. As we explained earlier, we want a dataset in which rows represent individual observations, and in which columns contain the characteristics of these observations. In this dataset, every row must therefore represent a single person. The columns might be variables such as gender and age. We also need to keep track of where each person lives.
The dataset can be constructed like this. For the first observation, we create the following line:
> row1 <- c("1", "male", "old", "south")
This is the first line and contains the following characteristics: the first observation (1), a male person, an old person, a person living in the south.
We can add further observations:
> row2 <- c("2", "male", "young", "south")
> row3 <- c("3", "female", "old", "north")
> row4 <- c("4", "female", "old", "south")
We can now combine these rows together into a single dataset by using the function rbind, which binds the rows.
> obs <- rbind(row1, row2, row3, row4)
If we now let R print out this observation table, it looks like this:
> obs
     [,1] [,2]     [,3]    [,4]
row1 "1"  "male"   "old"   "south"
row2 "2"  "male"   "young" "south"
row3 "3"  "female" "old"   "north"
row4 "4"  "female" "old"   "south"
This already looks quite nice. The first row contains all the data that we entered in the variable row1, and the second row contains all the data that we entered in the variable row2, etc. We still need to give the columns some names, because now, they are still just numbered. That is easy in R, by using this function:
> colnames(obs) <- c("obs", "gender", "age", "location")
As you can see, you just say that the column names (colnames) of the table obs (colnames(obs)) should become the vector of values after the <-. The observation table now looks as follows:
> obs
     obs gender   age     location
row1 "1" "male"   "old"   "south"
row2 "2" "male"   "young" "south"
row3 "3" "female" "old"   "north"
row4 "4" "female" "old"   "south"
To make R treat this table as a dataset that contains data in a structured format, we have to tell R that it is a so-called data frame. The conversion of the table obs to the data frame obs.df goes as follows, and has the following output:
> obs.df <- as.data.frame(obs)
> obs.df
     obs gender   age location
row1   1   male   old    south
row2   2   male young    south
row3   3 female   old    north
row4   4 female   old    south
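As an aside, the same four observations could also be entered column by column with data.frame, which skips the intermediate table. The following is only a sketch of that alternative, not the method used in this post; the name obs.df2 is made up for illustration.

```r
# Build the four observations column by column.
# obs.df2 is a hypothetical name, chosen so it does not overwrite obs.df.
obs.df2 <- data.frame(
  obs      = 1:4,
  gender   = c("male", "male", "female", "female"),
  age      = c("old", "young", "old", "old"),
  location = c("south", "south", "north", "south"),
  stringsAsFactors = TRUE  # store the text columns as categorical factors
)

# Show the type of every column.
str(obs.df2)
```

One difference from the rbind route is that the obs column is stored as a number here rather than as text.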
A nice advantage of having converted the table to a data frame is the fact that we can now access the variables (columns) by means of the $ operator. If we combine the name of the data frame (obs.df) with the $ sign and then the name of a column (e.g. gender) we get the full column, including some information about the different values that occur in this variable:
> obs.df$gender
  row1   row2   row3   row4
  male   male female female
Levels: female male
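Note that in R 4.0.0 and later, as.data.frame no longer turns text columns into factors by default, so the Levels line shown above may be missing on a recent installation. A minimal sketch of how to request factors explicitly, assuming the same four rows as above:

```r
# Recreate the character table from the post, with named rows.
obs <- rbind(
  row1 = c("1", "male", "old", "south"),
  row2 = c("2", "male", "young", "south"),
  row3 = c("3", "female", "old", "north"),
  row4 = c("4", "female", "old", "south")
)
colnames(obs) <- c("obs", "gender", "age", "location")

# Ask for factors explicitly so the result does not depend on the R version.
obs.df <- as.data.frame(obs, stringsAsFactors = TRUE)

# The distinct values of the variable, in alphabetical order.
levels(obs.df$gender)
```

The table function used below works on plain text columns as well, so this step only matters if you want to see or use the levels.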
Imagine we want to find out whether our dataset is balanced with respect to the locations. We can get the variable location from our dataset by doing obs.df$location, and if we feed this variable into the R function table, we get a frequency table for that variable.
> table(obs.df$location)

north south
    1     3
We see that there is only one observation for the north, and three observations for the south.
It is usually nice to see these absolute frequencies as relative frequencies as well. R makes it fairly simple to calculate them by means of the function prop.table, which stands for proportion table, because it calculates proportions for the values in a table. If we apply prop.table to our frequency table for location, we get the following:
> prop.table(table(obs.df$location))

north south
 0.25  0.75
And this shows us that three quarters of our dataset is taken up by observations in the south.
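If percentages are easier to read than proportions, the result of prop.table can be multiplied by 100 and rounded. A small sketch on the same four locations, entered directly as a vector so that it runs on its own; the names counts and shares are just illustrative.

```r
# The location column of the toy dataset, entered directly.
location <- c("south", "south", "north", "south")

counts <- table(location)      # absolute frequencies
shares <- prop.table(counts)   # relative frequencies, summing to 1

# Percentages rounded to one decimal place.
round(100 * shares, 1)
```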
Let us try this out on a more realistic example. Read in the dataset on Old High German semantic classes after article-like determiners, and turn it into a data frame. (read.delim already returns a data frame, so the as.data.frame call is harmless but not strictly needed.)
> ds <- read.delim("https://corpuslinguisticmethods.files.wordpress.com/2013/12/ahd_da_noun_semclass_century_note_exclude.key", header=TRUE, row.names=1, sep=";", fileEncoding="UTF-8")
> ds.df <- as.data.frame(ds)
> head(ds.df)
  determiner    noun semantic.class century             note exclude
0      daemo   dolge       concrete       8                        t
1       daez    dolg       concrete       8                        t
2       ther heilant       abstract      10 is god abstract?       t
3      einen brunnon       concrete      10                        f
4        ein   quena       concrete      10                        f
5       thaz uuazzer       concrete      10                        f
Let us now investigate how well our dataset is distributed across the centuries. With the example from above, this is now very easy. We just make a table for the variable century by typing in the following command:
> table(ds.df$century)

 8  9 10 11
 2 78 37 25
The output is perhaps a little bit confusing, because there are just two rows of numbers. The upper row contains the centuries, and the bottom row contains the number of observations per century.
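If the two rows of numbers are hard to tell apart, the labels and the counts can also be pulled out of the table separately. The sketch below uses a handful of made-up century values, not the real dataset, so the counts are purely illustrative.

```r
# Hypothetical century values, invented for this example.
century <- c(9, 10, 9, 11, 9, 10)

counts <- table(century)

names(counts)          # the upper row: the values that occur
counts[["9"]]          # the count for a single value, looked up by its label
as.data.frame(counts)  # the same information as one row per value
```

The as.data.frame version puts the values in a column called century and the counts in a column called Freq, which can be easier to read than the two-row layout.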
As a little extra, let us try to remove those observations that we wanted to exclude from this dataset, as annotated in the variable exclude. We will go into this in another post later, but we can restrict the dataset to the observations that we did not want to exclude by using this command:
> ds.df.restr <- ds.df[ds.df$exclude == "f", ]
We stored the restricted dataset in the variable ds.df.restr. If we now create the frequency table for century on the basis of this dataset, the output is obviously different (and reflects the exclusions that were made):
> table(ds.df.restr$century)

 9 10 11
53 35 18
As you can see, the two observations from the eighth century are gone (because you cannot do any meaningful quantitative work on two observations), and so are quite a few observations from the other centuries, which were excluded because it was unclear whether religious terms are concrete or abstract.
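To see the restriction step in isolation, here is a minimal sketch on a tiny made-up data frame. The name toy and its values are invented for illustration; only the column names century and exclude are borrowed from the real dataset.

```r
# A made-up miniature version of the dataset.
toy <- data.frame(
  century = c(8, 8, 9, 9, 10, 11),
  exclude = c("t", "t", "f", "t", "f", "f")
)

# Keep only the rows where exclude is "f".
# The empty slot after the comma means "keep all columns".
toy.restr <- toy[toy$exclude == "f", ]

table(toy$century)        # counts before the restriction
table(toy.restr$century)  # counts after the restriction
```

Because both rows for century 8 are marked "t" in this toy example, that century disappears from the second table altogether, just as in the real data above.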