The very first step of any quantitative study is to get the data into software that can do a quantitative analysis, such as R. This post explains how to do that. We assume a working R installation, but no extra packages are required.
R is a command-line program, which beginning corpus linguists sometimes see as a disadvantage. However, the command-line interface is very practical when one wants to make specific steps explicit. Whereas some people prefer graphical user interfaces for beginning corpus linguists, I am strongly convinced that getting used to the command line does not take too long.
The first thing one usually does after starting R is to tell it where the data is stored. This is done with the setwd (set working directory) command. As an example, all my data is stored on a separate hard disk that is mounted under D:\ (under Windows). Imagine a dataset “imag_data.csv” that is stored under “D:\ImaginaryDatasets\imag_data.csv”. To get to the right working directory, I could type the command setwd("D:/ImaginaryDatasets").
First of all, note that the slashes in an R path point in the opposite direction from what you are used to in Windows. Where Windows uses a backslash \, R uses the forward slash /, because the backslash is an escape character inside R strings. Second, it is not yet necessary to specify the filename, since we are only browsing to the correct working directory here. By the way, if you would like to know what the current working directory is, you can simply type getwd().
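The working-directory steps above can be sketched as follows. This is only an illustration: it uses a temporary folder as a stand-in for a real data folder such as D:/ImaginaryDatasets, and the variable names (old_wd, data_dir, win_path) are made up for this example.

```r
# Remember where we started so we can return afterwards
old_wd <- getwd()

# Stand-in for a real data folder; tempdir() exists on every system
data_dir <- tempdir()
setwd(data_dir)

# Show the working directory that R is now using
print(getwd())

# file.path() joins path parts with forward slashes,
# so you never have to type a backslash yourself
win_path <- file.path("D:", "ImaginaryDatasets")
print(win_path)

# Go back to the original working directory
setwd(old_wd)
```

If you do want to paste a Windows path with backslashes, each backslash has to be doubled inside the string, for example "D:\\ImaginaryDatasets".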
Now that we are in the relevant working directory, we can read in data that is stored directly in that folder. So, as an example, download this dataset on Old High German semantic classes after article-like determiners to a specific directory. With the setwd command, set the working directory to the directory in which you just stored the dataset. The next step is to actually read the dataset into R. The following command is used for this, and every little detail about it is explained below:
ds <- read.delim(file="ahd_DA_noun_semclass_century_note_exclude.key", header=TRUE, row.names=1, sep=";", encoding="UTF-8")
As a matter of fact, you can also download this dataset directly into R by referring to the URL as follows:
ds <- read.delim(file="https://corpuslinguisticmethods.files.wordpress.com/2013/12/ahd_da_noun_semclass_century_note_exclude.key", header=TRUE, row.names=1, sep=";", encoding="UTF-8")
There are a couple of functions available in R for reading in a dataset, but I prefer read.delim. These are the arguments I specify when reading in a dataset:
- file: here, you provide the exact file name of the dataset
- header: either TRUE or FALSE. If TRUE, the first line of the csv file is interpreted as containing the names of the columns. If FALSE, the first line is interpreted as already containing data, and R generates column names itself (V1, V2, and so on).
- row.names: either a number, or the whole argument is left out. The number (starting at 1) indicates which column contains the row names. Since I consider it good practice to number my observations, I give R the opportunity to use these numbers as row names.
- sep: this is the character that is used as a delimiter in the csv file. The default of read.delim is a tab, so a semicolon has to be specified explicitly.
- encoding: since we always try to have UTF-8 files, please specify here the encoding of the dataset.
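To see what these arguments do, here is a minimal sketch on a toy file. The three-line dataset and all names in it (toy_lines, toy_file, with_header, no_header) are invented for this illustration and are not the Old High German dataset; the toy data is kept to plain ASCII so that it behaves the same on every system.

```r
# Hypothetical toy data: a header line plus two numbered observations
toy_lines <- c(
  "id;noun;semclass",
  "1;man;human",
  "2;hus;object"
)

# Write the toy data to a temporary semicolon-delimited file
toy_file <- tempfile(fileext = ".csv")
writeLines(toy_lines, toy_file)

# header = TRUE: the first line supplies the column names
# row.names = 1: the first column supplies the row names
with_header <- read.delim(file = toy_file, header = TRUE, row.names = 1,
                          sep = ";", encoding = "UTF-8")
print(with_header)

# header = FALSE and no row.names: the first line is treated as data,
# and R generates the column names itself
no_header <- read.delim(file = toy_file, header = FALSE,
                        sep = ";", encoding = "UTF-8")
print(no_header)

# Remove the temporary file again
unlink(toy_file)
```

In the first data frame the id column has become the row names, so only noun and semclass remain as columns. In the second, the header line shows up as an ordinary first row, which is usually a sign that header was set incorrectly.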
Here is a golden tip to conclude this post: if you hit the tab key twice, R will list suggested values. This is especially handy when typing in the file name of the dataset. So after you type read.delim(file=" you can hit tab twice, and a list of files in the working directory will appear. This also works for completing commands. If you type read.de and then hit tab once, the command will be completed to read.delim.