An important aspect of scientific research is that findings are reproducible, falsifiable and transparent. Especially in an empirical approach, it is of the utmost importance to make datasets available. It should become a natural reflex to want to see the data behind a publication. No matter how well the publication describes the variables, it is always interesting and insightful to learn how certain observations were annotated. From your own experience, you probably already know how difficult it usually is to decide which value to assign to the variables you are investigating. Other corpus linguists share these insecurities. Perhaps that is why many (corpus) linguists do not make their datasets freely available. But usually, they bring two kinds of arguments to the table.
Many inexperienced linguists who start working with corpora have the misconception that a corpus query leads almost directly to the answer to a research question. Nothing could be further from the truth. A corpus linguistic approach to a research question often involves a lot of work, both on an intellectual and on a technical, mind-numbing level.
Once you have imported your dataset into R, there are countless possibilities for analyzing the data in a quantitative way, as opposed to the qualitative analysis that went into the annotation. The very first quantitative analysis that you may want to perform on a categorical variable is to see how often each of its values occurs relative to the others. This is practical if you want to find out whether you have a fairly balanced dataset.
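As a minimal sketch of such a frequency count, the example below uses base R's table() and prop.table(). The variable name tense and its values are hypothetical stand-ins for one annotated column of your own dataset.

```r
# Hypothetical annotated variable; in practice this would be a column
# of your imported dataset, e.g. dataset$tense.
tense <- factor(c("past", "present", "past", "past", "present", "future"))

# Absolute frequencies: how often each value occurs.
counts <- table(tense)

# Relative frequencies: each count divided by the total.
proportions <- prop.table(counts)

print(counts)
print(round(proportions, 2))
```

If one value accounts for most of the observations, the variable is not well balanced, which is worth knowing before you run any further analysis.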
The very first step of any quantitative study is to get the data into software that can do a quantitative analysis, such as R. This post explains how to do that. The explanation assumes a working R installation, but no extra packages are required.
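A minimal sketch of such an import, using only base R, might look as follows. The column names and values are invented for illustration, and the file is written to a temporary location so that the example runs without a real corpus export. With your own data, you would pass the path of your csv file to read.csv() instead.

```r
# Hypothetical annotated dataset: one row per corpus observation.
csv_lines <- c(
  "id,verb,tense,animacy",
  "1,give,past,animate",
  "2,send,present,inanimate",
  "3,give,past,animate"
)

# Write the lines to a temporary file to stand in for a real export.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# read.csv() ships with base R. stringsAsFactors = TRUE turns the text
# columns into factors, which suits categorical annotation.
dataset <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

str(dataset)   # inspect the column types
nrow(dataset)  # number of observations
```

Checking the result with str() right after the import is a quick way to catch columns that were read with an unexpected type.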
Although it is a good idea to build your datasets in spreadsheet software, it is an even better idea to save your dataset (after you have finished the annotation, of course) in the csv format.
Datasets are among the most important objects in a scientific study. It is best to stick to a widely used format for your dataset so that other people are able to understand what you have done. In order to find a good format for corpus linguistic datasets, the nature of corpus linguistic data needs to be investigated.
Corpus linguists believe that theoretical linguistic phenomena should also have an empirical counterpart. Our first reaction to anything theoretical is “let’s see if I can find this in a corpus!” However, as soon as we fire up our computers to see if we can find the phenomenon, we quickly get stuck due to technical limitations. The goal of this website is to provide practical tips and tricks for beginning corpus linguists around the themes of searching in corpora, exporting corpus results, annotation, importing data into statistical software, and producing descriptive tables and figures.
Now, something like this cannot be written in a day. So, there will be additions and changes over the coming months. Everybody is free to comment on the content or to ask for further explanations. Content will first appear as blog posts and will later be transferred into the chapters, so keep the blog in your RSS reader. If you feel like you could add something yourself, please let me know!
Have fun familiarizing yourself with methods in corpus linguistics.