Query a text corpus with Python

Some corpora come without a search interface. How do you search them? You might load them into a concordance program such as AntConc, only to notice that the corpus uses an idiosyncratic format that garbles the lines, which quickly makes AntConc pretty much unusable. So, what can you do? The simplest solution is to write a small Python script!
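
To give a rough idea of what such a script might look like, here is a minimal sketch of a keyword-in-context search. It assumes the corpus can be read as plain lines of text; the function name concordance, the width parameter, and the sample sentences are made up for illustration and do not come from any particular corpus or tool.

```python
import re


def concordance(lines, pattern, width=30):
    """Return (line number, left context, match, right context) tuples.

    `lines` is any iterable of strings, for example an open text file.
    `pattern` is a regular expression, matched case-insensitively.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for line_no, line in enumerate(lines, start=1):
        text = line.strip()
        for match in regex.finditer(text):
            # Keep at most `width` characters on either side of the match.
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            hits.append((line_no, left, match.group(0), right))
    return hits


# Hypothetical sample data standing in for a real corpus file.
sample = [
    "The cat sat on the mat.",
    "A dog chased the cat away.",
    "No animals here.",
]

hits = concordance(sample, r"\bcat\b")
for line_no, left, keyword, right in hits:
    print(f"{line_no:>4}  {left:>30} [{keyword}] {right}")
```

For a real corpus, the list could be replaced by an open file object, and any format-specific cleanup (stripping markup, joining broken lines) would go before the search step.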

What is inter-annotator agreement?

Very often in linguistics, it is simply not possible to provide a classical definition, with necessary and sufficient conditions, for our categories. This is the case for most (perhaps all?) linguistic categories. Even basic categories such as parts of speech are not entirely clearly defined. In fact, Langacker (1987) takes that as a sign that we should rethink our linguistic ideas altogether. But how can we, as corpus linguists, then annotate our data correctly? Well, that is where inter-annotator agreement comes into play.
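
As a small illustration, one widely used agreement measure for two annotators is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below is only one possible way to compute it; the function name cohen_kappa and the part-of-speech labels are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six tokens by two annotators.
annotator_a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_b = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

A value of 1 means perfect agreement, while a value around 0 means the annotators agree no more often than chance would predict.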

Corpora versus datasets

For a corpus linguist, the terms corpus and dataset can be very confusing. Indeed, they are very similar:

  • both contain linguistic production,
  • both usually provide further information about the production in the form of annotations,
  • these annotations can be linguistic in nature, but may also reveal meta-information about the language producer, or the context in which the production took place.

In fact, some people would go so far as to say that there is no difference between a corpus and a dataset. However, I do not agree and I would like to suggest a prototype-based approach.

