For many inexperienced linguists who start working with corpora, there is the misconception that a corpus query leads almost directly to the answer to a research question. Nothing could be further from the truth. A corpus linguistic approach to a research question often involves a lot of work, both on an intellectual level and on a technical, mind-numbing one.
The intellectual challenge of a corpus linguistic methodology is that you will (usually) have to annotate the observations from your corpus query for the phenomenon that you want to investigate. Indeed, there is usually no available corpus that is perfectly suited to answering your research question. Many corpora are only annotated with lemmata and part-of-speech information, whereas your research question may require annotations for constituents, syntax, or morphology, or sometimes even semantic or pragmatic categories such as animacy, information structure, or implicatures. So, in most cases, a corpus can perhaps give you a set of observations that is relevant to your research question, but the linguistic work that needs to be done to turn this set of observations into something that may answer that question is a whole different matter. We talk about some best practices for formatting your corpus linguistic annotations here.
In this post, we do not focus on turning a set of observations into a set of observations that can answer your research question. Rather, we talk about getting a set of observations that is, above all, relevant to your research question. Three concepts are important when it comes to the relevance of a set of observations: two rather technical terms, recall and precision, and one more theoretical term, accountability.
The term recall comes from information retrieval and is a quantitative measure that tells you whether you have all the relevant data. In relation to this concept, we also use the terms false positive (for an observation that you made and included in the dataset, but that you should not have made and should not have included) and false negative (for an observation that you did not make and that is therefore not in the dataset, but that would have been a valuable observation and should have been in the dataset). Recall measures how well you did in avoiding false negatives. For completeness: a true positive is an observation that you included in the dataset and that you should have included; a true negative is an observation that you did not include in the dataset, and that you indeed should not have included.
| | included in dataset | not included in dataset |
|---|---|---|
| should be included in dataset | true positive | false negative |
| should not be included in dataset | false positive | true negative |
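The standard definitions of recall and precision follow directly from the four cells of this table; a minimal Python sketch, with invented counts for illustration:

```python
def recall(tp: int, fn: int) -> float:
    """Share of all relevant observations that actually made it into the dataset."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Share of observations in the dataset that are actually relevant."""
    return tp / (tp + fp)

# Hypothetical query result: 80 true positives, 20 false negatives,
# 40 false positives.
print(recall(tp=80, fn=20))     # 80/100 = 0.8
print(precision(tp=80, fp=40))  # 80/120, roughly 0.67
```

A query can thus score high on one measure and low on the other: the one above misses little (high recall) but drags in a fair amount of noise (lower precision).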
Whereas “real” information retrieval specialists would actually calculate recall, it suffices for corpus linguists to consider how good the recall of a corpus query is. In other words, with every corpus query, you should consider whether you have obtained all potentially relevant observations from the corpus. In fact, it should not worry you too much if your query is too general, as long as you are certain that you have retrieved everything that is potentially relevant to your research question. The general rationale here is that it is easier to remove observations from a dataset than to add them. Indeed, when you have to add certain data points, you have to explain how you obtained them; the opposite, i.e. removing observations from a dataset that is too general, can be explained much more easily as “restricting the scope of the investigation”. When it comes to assessing whether your corpus query was too general, we use the term precision.
Precision is a term that likewise comes from information retrieval; it is a quantitative measure that tells you whether you have any irrelevant data. In the terminology above, precision tells you how many false positives you have. Whereas for a corpus linguist the most important measure is recall (if you do not have everything you need for your research question, your dataset is severely limited; cf. accountability below), it is a practical necessity not to make your corpus query too general, i.e. not to let its precision drop too low, so that you do not end up with a dataset so large that you cannot see the end of it. As a general rule, start from a corpus query that is too general, and then add principled parameters that reduce the dataset.
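This workflow of casting a wide net and then narrowing it with principled filters can be sketched in a few lines of Python. The tiny corpus, the patterns, and the filtering criterion are all invented for illustration; a real query would run against a corpus query engine, not a list of strings:

```python
import re

# A toy "corpus": clauses containing who/that, only some of which are
# the relative clauses we are after.
corpus = [
    "the man who came yesterday",   # relevant: relative clause
    "who is at the door?",          # irrelevant: interrogative
    "the book that I read",         # relevant: relative clause
    "I know that she left",         # irrelevant: complement clause
]

# Step 1: a deliberately over-general query. High recall (nothing
# relevant is missed), low precision (much noise is retrieved).
broad = [s for s in corpus if re.search(r"\b(who|that)\b", s)]

# Step 2: a principled narrowing parameter, requiring a word directly
# before the pronoun (a crude stand-in for an antecedent test). This
# raises precision at some risk to recall, which is why every such
# filter must be motivated and documented.
narrow = [s for s in broad if re.search(r"\w+ (who|that)\b", s)]
print(narrow)
```

Note that one false positive ("I know that she left") survives the filter: an automatic query rarely achieves perfect precision, and the remaining noise is what manual annotation then removes.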
As an example, I have been analysing verb position in Old High German relative clauses with some collaborators. At first, we built our dataset from a query in the Referenzkorpus Altdeutsch, searching for clauses that the compilers of the corpus had annotated as relative clauses (example query in a restricted corpus). However, it soon became apparent that the compilers had used verb position as a heuristic for tagging relative clauses, i.e. mostly clauses with the verb at the end were tagged as relative clauses. This is particularly problematic, as we are interested in relative clauses with the verb in second position. So we had to expand the query to improve recall, and by doing so, we made precision worse. That is a fair trade-off, however, since our dataset now accurately covers the phenomenon (relative clauses).
When a dataset contains all possible realizations of the phenomenon at hand, we can claim that our dataset is accountable (vis-à-vis the corpus).
The term accountability comes from variationist linguistics and was coined by William Labov in the context of his Principle of Accountability. The idea is that a sociolinguistic study should investigate all possible variants of a variable, e.g. all the allophones of a phoneme. If not all possible variants are investigated, the Principle of Accountability is violated. Leech (1992) refers to this principle as the Principle of Total Accountability, in a reference to Popper’s falsifiability concept.
Returning to the above example of Old High German relative clauses, we can give an extreme example of not being accountable. If we had only considered the clauses that were annotated as relative clauses in the corpus, we would have ended up concluding, with overwhelming quantitative evidence, that relative clauses in Old High German have the verb at the end (or at least late). By expanding the dataset beyond the theoretically motivated annotation of the corpus (the theoretical motivation being that German has verb-final word order), we were able to observe all instances belonging to the phenomenon under investigation, and thus deliver a more truthful account.
Another example from my own research is my investigation of lexical preferences in Dutch swearing, as found on Twitter. The running theory of swearing is that swear words come from taboo domains. Now, although a swearing dictionary exists, I nonetheless looked for observations in the (large) Twitter corpus that contained any word that could reasonably be considered to belong to the taboo domains I was investigating. Many of the observations containing a taboo lexeme were not actually examples of swearing, so my dataset had very low precision. However, after careful annotation of the many thousands of observations, I was fairly certain that my dataset had extremely high recall. This makes me confident that the research that will come out of this dataset will adhere to the principle of accountability.
Generally speaking, the idea of accountability is a plea for avoiding circular arguments. If you start searching a corpus with a very clear idea of what you need to find to prove your theory, you will probably find exactly that. However, to make a solid scientific contribution, you should also give a lot of thought to how you could disprove your theory, and the examples that could disprove it must be included in your dataset, too. Or at least, you should show that you did everything you could to find them.