What is inter-annotator agreement?

Very often in linguistics, it is simply not possible to provide a classical definition with necessary and sufficient conditions for our categories. This is the case for most (perhaps all?) linguistic categories. Even basic categories such as parts of speech are not entirely clearly defined. In fact, Langacker (1987) takes that as a sign that we should rethink our linguistic ideas altogether. But how, then, can a corpus linguist annotate data reliably? Well, that is where inter-annotator agreement comes into play.

Let’s focus on the consequence of fuzzy linguistic categories for corpus linguistics. As you know, during annotation, the corpus linguist has to decide whether a certain observation belongs to one category or to another. Remember the dataset on concrete or abstract Old High German nouns after an article-like determiner? Deciding whether a noun belongs to the concreta or to the abstracta is not that easy, especially for a language from the 9th century. Take, as an example, the noun God: is that concrete or abstract? I guess it depends on how strongly you believe in God.

So, as a corpus linguist, you make some decision for the annotation, but you actually want to provide the user of your dataset with some kind of metric of how certain you are about the annotation of that category. This is where inter-annotator agreement comes into play.

Inter-annotator agreement is a measure of how well two (or more) annotators can make the same annotation decision for a certain category.

From that measure, you can derive two things:

  1. how easy it was to clearly delineate the category: if the annotators make the same decision in almost all cases, then the annotation guidelines, i.e. the definition of the category that needed to be annotated, were very clear, and this implies that it is possible to give the annotator a nicely delineated view of the category.
  2. how trustworthy the annotation is: one prefers categories that are firmly delineated (even if that is a utopia for linguistics), because they make it easier to perform a quantitative analysis. If the inter-annotator agreement is low, the annotators found it difficult to agree on which items belong to the category and which don’t. Although such a category might be very interesting from a qualitative point of view, it is very difficult to incorporate it in a quantitative analysis of the data.

So, let us now calculate an inter-annotator agreement. Download the dataset for real(ly)? good|bad, in which two annotators have annotated whether or not a certain adjective phrase is used attributively. The category “Attributive” is relatively straightforward, in the sense that an adjective (phrase) is used to modify a noun. If it does not modify a noun, it is not used attributively.

There are basically two ways of calculating inter-annotator agreement. The first approach is nothing more than the percentage of observations on which the annotators made the same choice. This approach is somewhat biased, because it might be sheer luck that there is a high overlap. Indeed, this might be the case if there is only a very limited number of category levels (only yes versus no, say), so that the chance of having the same annotation is a priori already 1 out of 2. Also, it is possible that the majority of observations belong to one of the levels of the category, so that the a priori overlap is already potentially high.
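To make the problem with raw percentage agreement concrete, here is a minimal sketch in R. The two annotation vectors are invented for illustration (they are not taken from the dataset used in this post), and the names anno1, anno2, observed and by.chance are hypothetical as well.

```r
# hypothetical annotations of 10 items by two annotators (invented data)
anno1 <- c("yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no")
anno2 <- c("yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "yes", "no")

# raw agreement: share of items on which both annotators chose the same level
observed <- mean(anno1 == anno2)

# agreement expected by chance, given how often each annotator uses each level
lev <- union(anno1, anno2)
p1 <- table(factor(anno1, levels = lev)) / length(anno1)
p2 <- table(factor(anno2, levels = lev)) / length(anno2)
by.chance <- sum(p1 * p2)

observed
by.chance
```

With these invented numbers, the annotators agree on 80% of the items, but because both of them choose "yes" so often, about 68% agreement would already be expected by chance.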

Therefore, an inter-annotator measure has been devised that takes such a priori overlaps into account. That measure is known as Cohen’s Kappa. To calculate inter-annotator agreement with Cohen’s Kappa, we need an additional package for R, called “irr”. Install it as follows:

# install the library
install.packages("irr")

Now you can run the code below to calculate the inter-annotator agreement. Notice how we first construct a data frame with two columns, one for each annotator.

# load the required library
library(irr)

# read in the dataset
ds.full <- read.delim("https://corpuslinguisticmethods.files.wordpress.com/2014/01/coha_real-ly-good-bad_period_attributive_genre.key", 
           header=TRUE, sep="\t")

# combine the two columns of the annotators in a single data frame
# (data.frame() names the columns ds.full.attributive and
# ds.full.attributive.anno2)
ds.iaa <- data.frame(ds.full$attributive, ds.full$attributive.anno2)

# find observations that were annotated by both annotators
# here, we only need to check the annotations of annotator 2,
# because annotator 1 did all observations, whereas annotator 2
# only did a subset
ds.iaa.sharedobs <- droplevels(
    ds.iaa[ds.iaa$ds.full.attributive.anno2 != "", ])

# cross tabulation
table(ds.iaa.sharedobs)

# Cohen's kappa
kappa2(ds.iaa.sharedobs)

kappa2() is the function that will give you the actual inter-annotator agreement. But it is often a good idea to also draw up a cross tabulation of the annotators, so that you get a perspective on the actual numbers:

> kappa2(ds.iaa.sharedobs)
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 46 
   Raters = 2 
    Kappa = 0.826 

        z = 5.69 
  p-value = 1.27e-08

So, the kappa value is 0.826, which is in fact pretty high. Although the Kappa should always be interpreted with respect to the available levels in the category for which inter-annotator agreement is being calculated, a rule of thumb is that any value over 0.8 is outstanding.
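For readers who want to see roughly what goes on behind an unweighted Cohen's kappa, the sketch below computes it by hand as (p.o - p.e) / (1 - p.e), where p.o is the observed agreement and p.e is the agreement expected by chance. The function name kappa.by.hand and the toy vectors are made up for this illustration; it is a sketch for two annotators with complete data, not a replacement for the irr package.

```r
# hypothetical helper: unweighted Cohen's kappa for two annotators
kappa.by.hand <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  lev <- union(a, b)                          # all levels used by either annotator
  tab <- table(factor(a, levels = lev),
               factor(b, levels = lev))       # square cross tabulation
  prop <- tab / sum(tab)                      # cell proportions
  p.o <- sum(diag(prop))                      # observed agreement (the diagonal)
  p.e <- sum(rowSums(prop) * colSums(prop))   # agreement expected by chance
  (p.o - p.e) / (1 - p.e)
}

# invented annotations, not taken from the dataset in this post
anno1 <- c("yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no")
anno2 <- c("yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "yes", "no")
kappa.by.hand(anno1, anno2)
```

In this toy example the annotators agree on 8 out of 10 items, yet the kappa is only 0.375, because most of that agreement is already expected from how often both annotators choose "yes".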

Two annotators are also available in the dataset on lexical variation in the Brown corpora. Perhaps it is a nice exercise for the reader to find the two columns of the annotators, restrict the dataset to observations that have been annotated by both annotators, and then to calculate the inter-annotator agreement.

