Query a text corpus with Python

Some corpora come without a search interface. How do you search in them? Perhaps you read them into a concordance program like AntConc, but then you notice that the corpus has some weird idiosyncratic format that messes with the lines. AntConc quickly becomes pretty unusable if that is the case. So, what can you do? The simplest solution is to write a small Python script!

To follow this post, you need a working Python environment. On Windows, I recommend the IDLE environment; on a UNIX system (e.g. Ubuntu or OS X), Python is pre-installed and you can call it directly from the prompt in a terminal. This is not a detailed introduction to Python, but rather an overview of how you go about solving a typical corpus-linguistic problem with scripting. Still, I think you can learn a lot by trying to reverse-engineer the script below.

As the running example for this post, we will use a corpus that I compiled with a colleague for the study of Moroccan Dutch, based on IRC chat logs (big file). The first five lines of the corpus are as follows:

bellamafia	enta bari temche tkowed
ilmas-nador	ala  ana  bri  khtek
Chickaaa	Heeerlijk zo'n kopje warme chocolademelk
ilmas-nador	3ndak  chi  khtk
Chickaaa	met een sultana derbij

As you can see, the format is fairly simple: a username, followed by a tab, and then the message of that user. However, if you read this into AntConc, there is no way to indicate that you only want to search in the messages, or that the KWIC (keyword-in-context) display should not run on into the next lines or into the messages of other people.
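To make the format concrete, here is a minimal sketch of how a single line could be taken apart. The helper name parse_line is my own, purely for illustration, and splitting on the first tab only rests on the assumption that usernames never contain a tab.

```python
def parse_line(line):
    """Split one corpus line into (username, message)."""
    # partition() splits on the first tab only, so the message stays in one piece
    uname, _, msg = line.partition("\t")
    # strip() removes the line ending and surrounding whitespace
    return uname, msg.strip()

print(parse_line("Chickaaa\tmet een sultana derbij\n"))
```

A line without any tab ends up with an empty message, so malformed lines do not crash the script.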

Let us write a Python script that:

  1. reads in the corpus line by line
  2. interprets the tab-delimited format
  3. accepts a regular expression as a search query
  4. returns the lines of the corpus in which the query matches
  5. and puts everything in a tab-delimited format that we can read into Excel to create a corpus-linguistic dataset.

Let’s go through this step by step.

Reading in a Unicode text file in Python can be done with the following procedure:

import codecs

def readCorpus():
  # read in as utf-8 with codecs
  fin = codecs.open("moroccorp.txt", "r", "utf-8")
  lines = fin.readlines() # readlines chops it up in lines for you
  fin.close()

  # initialize an empty dictionary
  crp = {}

  # i will be the line counter, one-based (the first line is line 1)
  i = 1

  # go through the lines
  for line in lines:
    uname = line.split("\t")[0] # before tab is username [0]
    # split the line in tabs, and take everything after the first tab [1:]
    # join everything back together with join()
    # remove line endings with strip()
    msg = " ".join(line.split("\t")[1:]).strip()
    # entry in dictionary
    crp[i] = {"uname": uname, "msg": msg}
    i += 1 # increment i
  return crp
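For comparison, here is a sketch of the same reading step written a little more compactly, assuming Python 3 (io.open also exists in Python 2.6+). The name read_corpus and the path parameter are my own additions, and the throwaway file at the end only serves to try the function out without the real corpus.

```python
import io
import os
import tempfile

def read_corpus(path):
    """Return {line_number: {"uname": ..., "msg": ...}} for a tab-delimited corpus."""
    crp = {}
    # io.open decodes UTF-8 on the fly; the with-block closes the file for us
    with io.open(path, "r", encoding="utf-8") as fin:
        # enumerate() replaces the manual counter; start=1 gives one-based numbers
        for i, line in enumerate(fin, start=1):
            uname, _, msg = line.partition("\t")
            crp[i] = {"uname": uname, "msg": msg.strip()}
    return crp

# Try it on a throwaway file holding two lines from the corpus sample
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".txt")
tmp.write(u"bellamafia\tenta bari temche tkowed\nChickaaa\tmet een sultana derbij\n".encode("utf-8"))
tmp.close()
demo = read_corpus(tmp.name)
os.remove(tmp.name)
print(demo[2]["uname"])
```

Iterating over the file object instead of calling readlines() means the whole corpus does not have to sit in memory as a list of lines first, which can matter for a big file.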

Now we can make a procedure that takes the corpus dictionary from the readCorpus() procedure and a certain regular expression and returns the lines of the corpus in which the regex matches.

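import re

def search(c, q):
  out = []
  # compile the query once; IGNORECASE also finds "Een", "MEISJE", etc.
  regex = re.compile(q, re.IGNORECASE)
  # go through the corpus line by line
  for k in c:
    # findall() returns a (possibly empty) list of all matches in the message
    linehits = regex.findall(c[k]["msg"])
    if linehits:
      # keep line number, username and message of every matching line
      out.append((str(k), c[k]["uname"], c[k]["msg"]))
  return out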
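If you miss the KWIC view of a concordancer, the same loop can be bent into one. The following is only a sketch: the function kwic and its width parameter are hypothetical names of mine, and the toy dictionary stands in for the real corpus. The point is that the context never leaves the message in which the match occurs.

```python
import re

def kwic(c, q, width=30):
    """Return (line number, username, left context, match, right context) tuples."""
    regex = re.compile(q, re.IGNORECASE)
    out = []
    for k in sorted(c):
        msg = c[k]["msg"]
        # finditer() yields one match object per hit, including its position
        for m in regex.finditer(msg):
            left = msg[max(0, m.start() - width):m.start()]
            right = msg[m.end():m.end() + width]
            out.append((str(k), c[k]["uname"], left, m.group(0), right))
    return out

# A one-line toy corpus in the same shape that the reading step produces
toy = {1: {"uname": "Chickaaa", "msg": "Heeerlijk zo'n kopje warme chocolademelk"}}
rows = kwic(toy, r"\bkopje\b")
for row in rows:
    print("\t".join(row))
```

Because the left and right context are separate columns, you can sort on either of them after importing the output into a spreadsheet.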

So, we can call these two procedures to obtain a list of lines in which our regular expression matches, already partitioned into line number, user and message. As an example, we use the regular expression that could have been used for (a part of) the case study described in the paper that accompanies the corpus. It searches for indefinite noun phrases with an adjective and a neuter noun (here: meisje, "girl") in which the adjective ends in an "e", which is a typical marker of Moroccan Dutch.

moroccorp = readCorpus()
hits = search(moroccorp, r"\been [^\s]+e \bmeisje\b")
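Before running a query on the whole corpus, it can help to try it on a handful of messages. In the sketch below, the first two messages are taken from the corpus; the last two are examples I invented to show what the pattern should reject.

```python
import re

pattern = re.compile(r"\been [^\s]+e \bmeisje\b", re.IGNORECASE)

messages = [
    "ik ben opzoek na een mooie meisje",   # from the corpus: adjective ends in -e
    "ewaa snapnietv is een goeie meisje",  # from the corpus: adjective ends in -e
    "dat is een mooi meisje",              # invented: adjective without -e
    "geen leuke meisje",                   # invented: "geen" is not the article "een"
]

# True where the pattern matches somewhere in the message
results = [bool(pattern.search(msg)) for msg in messages]
for found, msg in zip(results, messages):
    print(found, msg)
```

The leading \b is what keeps "geen" out, and since [^\s]+ cannot cross a space, exactly one word may stand between "een" and "meisje".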

The list hits now contains the matching lines. All that is left is to print every hit, with its line number, user and message, in a format that is suitable for a corpus-linguistic dataset:

for hit in hits:
  # one tab-delimited row per hit (this form works in Python 2 and 3)
  print("\t".join(hit))

Putting it all together gives the following script (save it as "search.py"):

import codecs, re

def search(c, q):
  out = []
  regex = re.compile(q, re.IGNORECASE)
  for k in c:
    linehits = regex.findall(c[k]["msg"])
    if linehits:
      out.append((str(k), c[k]["uname"], c[k]["msg"]))
  return out

def readCorpus():
  fin = codecs.open("moroccorp.txt", "r", "utf-8")
  lines = fin.readlines()
  fin.close()
  crp = {}
  i = 1
  for line in lines:
    uname = line.split("\t")[0]
    msg = " ".join(line.split("\t")[1:]).strip()
    crp[i] = {"uname": uname, "msg": msg}
    i += 1
  return crp

moroccorp = readCorpus()
hits = search(moroccorp, r"\been [^\s]+e \bmeisje\b")

for hit in hits:
  print("\t".join(hit))

Run it from a terminal and redirect the output to a file:

$ python search.py > meisje.txt

The file meisje.txt will capture the print output of the script, and it should contain 132 observations. The observations show the expected additional schwa at the end of the adjective:

9602	redadubai	ik ben opzoek na een mooie meisje
20246	Marouanee	ewaa snapnietv is een goeie meisje
21100	Marouanee	liina jij bent zeker een goeie meisje?
21857	ScheleSmurf	Chinees: is een marokkaanse meisje

And since this text file is tab-delimited, it is easy to import into Excel or R.
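If you would rather stay in Python for a first look at the data, the csv module reads the same format. This is a sketch that assumes Python 3; two rows copied from the output above stand in for the contents of meisje.txt, and the column names are my own.

```python
import csv
import io

# Two rows from the output above stand in for the contents of meisje.txt
sample = (
    u"9602\tredadubai\tik ben opzoek na een mooie meisje\n"
    u"21857\tScheleSmurf\tChinees: is een marokkaanse meisje\n"
)

# QUOTE_NONE: treat quote characters in chat messages as ordinary text
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [{"line": int(r[0]), "uname": r[1], "msg": r[2]} for r in reader]

# Count the number of hits per user
per_user = {}
for row in rows:
    per_user[row["uname"]] = per_user.get(row["uname"], 0) + 1
print(per_user)
```

For the real file, replace io.StringIO(sample) with a file opened as UTF-8 text.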


4 thoughts on “Query a text corpus with Python”

  1. Thanks very much for the exercise, I carefully replicated the corpus reading process but only got 42 instances of the pattern, my code matches what’s shown here, just wondering.

    1. Hi there, you are welcome. I have just checked again, and if I run the script on my computer, it returns 132 occurrences:
      $ python search.py | wc -l
      132
      I am a bit puzzled by the discrepancy. Are you on a Windows machine, Linux or Mac? I guess it might have to do with the line breaks? If you provide me with some more information, I may be able to help.

      1. Thanks for the quick reply, Tom. Yes, I’m on Win7. I use Regex every once in a while but if findall() is used, I can’t think of how the code can miss instances. I’m training myself on a range of topics, so the discrepancy here isn’t critical in my context. BTW, I’m training myself to do as much data wrangling as I can in Python.
