How to extract data from COHA into Excel or R?

The Corpus of Historical American English (COHA) is a wonderful source for corpus linguistic research on diachronic English phenomena. It contains about 400 million words from newspapers, magazines, and fiction and non-fiction books, spanning 1810 to 2009. A very neat web interface is available for searching the COHA, with quite a number of useful features.

However, the COHA web interface does not allow you to export your search results as a dataset that is ready for corpus linguistic research.

There is no way to simply download all observations into a CSV file with observations as rows and variables (e.g. year, genre, text) as columns. This is a real drawback of the COHA web interface, but I assume there are good reasons for it. Nonetheless, it is perfectly fine to use a bit of technology to get all the observations out of the web interface anyway. This post explains how.

As a corpus linguist, you are also a little bit of a technician. If you are at first a little bit intimidated by the steps that are described here, please fight the urge to give up. These kinds of steps are still very much necessary for every corpus linguist during the compilation of the dataset. Perhaps someday, corpus tools will be able to freely export data in a useful format, but until that day, corpus linguists are tech nerds who have to hack their datasets together. Learn to appreciate and enjoy it!

This post assumes that you have:

  • a working Python environment (UNIX-based operating systems have Python pre-installed; on Windows you will have to install it yourself)

Real bad: a corpus investigation in COHA

We want to investigate the phrase “real bad” in the COHA to see how “real” without “-ly” can be used as an adverb. One might wonder whether the “-ly” is dropped more readily in attributive contexts (e.g. she is a real(ly) bad girl) or in predicative contexts (e.g. that girl is real(ly) bad). Below, you will find the steps described in words; further down, there is also a screencast.

The first step is to find observations of the phrase “real bad” in COHA. To do so, go to the COHA at http://corpus.byu.edu/coha/. In the search box (behind WORD(S)), simply type in “real bad”. Then click the “search” button, and a frequency overview will appear on the right. If you click on the search phrase “real bad” (underneath the CONTEXT button in the frequency panel), a panel with the individual observations appears at the bottom.

Normally, you would like to get these observations into a standard corpus linguistic dataset, so that you can annotate each observation for the research question that you are asking. In our case, we would like to verify every observation to see if “real” is used as an adverb with “bad”, and whether the Adjective Phrase with “bad” as its head is used attributively or predicatively. The dataset could look like this:

year  genre  text    left.context           keyword   right.context                       real.as.adverb  usage
1815  FIC    Book A  This left context is   real bad  . And this right context is nice.  yes             predicative

Getting from the COHA web interface to such a nice dataset is not trivial. You cannot simply copy-paste the table from the COHA website, and you cannot export all the observations either (probably due to copyright restrictions, I guess). Therefore, you need a nifty workaround, which employs the raw HTML that generates the list of observations, and a Python script with regular expressions to extract the observations from that HTML.

Before we start, you need to make a folder on your computer called “scrape_coha”, and within this folder, you need to make a subfolder called “data”. In “data”, you will save the raw HTML from the COHA corpus. In the top folder “scrape_coha”, you can already save the Python script which you find at the bottom of this post: simply copy the Python code, create a text file in “scrape_coha” called “combine.py” (make sure to remove the .txt extension), and paste the Python code into this file.

First, right-click on the observation list and use the option (in Google Chrome) to show the frame source. A new window or tab will open, containing a bunch of HTML. If you search in this HTML (by means of Ctrl-F) for the word “bad”, you will see that the actual observations are hidden within the HTML. Not far from each observation, you will also find the meta information on year, genre and text. Now, select all the HTML code in your browser (Ctrl-A), copy it (Ctrl-C), make a text file “real_bad_1.txt” in the “data” folder, and paste (Ctrl-V) the copied HTML code into that new text file.

You want to repeat this step for the second and the third page of search results in COHA. So, simply go to the second page in the COHA web interface; right-click on the list of observations; show the source code; select the complete HTML code; copy the code; make a text file “real_bad_2.txt” in the “data” folder; and paste the HTML code into that file. Do exactly the same for the third page, which you can save in a text file “real_bad_3.txt”.
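
If everything went well, your “scrape_coha” folder should now look like this (with the file names chosen above):

scrape_coha/
  combine.py
  data/
    real_bad_1.txt
    real_bad_2.txt
    real_bad_3.txt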

Second, we now apply the Python script to filter the observations and meta information out of the three text files in “data”, and store these observations and metadata in a delimited text file that we can import into Excel for further annotation. The script basically contains a number of regular expressions that first find the complete table row of a single observation, and then search within that table row for the text and the metadata. Try to read and understand the Python script; it might seem a little difficult at first, but it should be fairly self-explanatory.
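
If you want to get a feeling for how the regular expressions work, you can try one of them out in an interactive Python session. The snippet below is only a minimal sketch: the HTML fragment is made up for illustration (it is not actual COHA output), but it has the shape that the script searches for.

import re

# A made-up fragment in the shape of COHA's hidden input fields (hypothetical example).
fragment = ('<input type="hidden" name="texto_12" ID="texto_12" '
            'value="she is a <b><u>real</u></b> <b><u>bad</u></b> girl">')

# The same regular expression that combine.py uses to find the observation text.
regex_text = re.compile(r'input type="hidden" name="texto_\d+" ID="texto_\d+" value="(.+?)">', re.DOTALL)
print(regex_text.findall(fragment))
# prints: ['she is a <b><u>real</u></b> <b><u>bad</u></b> girl']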

So, simply run the Python script in the “scrape_coha” directory, and a new text file “dataset_raw.txt” will appear in that folder. If you open this text file in a text editor, you will see that it simply contains the metadata, the left context, the keyword (“real bad”) and the right context, delimited by tabs. This file can simply be imported into Excel (see video) for further annotation, or you can read it into R as a tab-delimited file.
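
Before moving to Excel or R, you can do a quick sanity check of the output in Python. This is a minimal sketch, assuming the default file name and the UTF-8 encoding that the script writes:

import codecs

# Read the tab-delimited output of combine.py and split every line into columns.
fin = codecs.open("dataset_raw.txt", "r", "utf-8")
rows = [line.rstrip("\n").split("\t") for line in fin]
fin.close()

print(len(rows))   # number of observations
print(rows[0])     # metadata, decade, left context, keyword, right context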

Additional information

Screencast

This video explains everything on the screen. You might need to set the quality to HD to be able to read the on-screen text in the video.

Python script

This is the Python code that you will need for converting the COHA HTML into a tab-delimited text file.

import codecs, re, glob

def getKwic(f):
  # Extract every observation (one table row in the html) together with its
  # metadata from the raw COHA html, and return them as tab-delimited lines.
  out = []
  # Each observation sits in its own <tr name="t..."> ... </tr> table row.
  regex_lines = re.compile(r'<tr name=\"t\d+.+?</tr>', re.DOTALL)
  lines = regex_lines.findall(f)
  for line in lines:
    outline = ""
    # The metadata (year, genre, text) are the linked cells within the row.
    regex_meta = re.compile(r'<td class="auto-style2".+?valign="top" nowrap><a href="x4\.asp\?t=\d+&ID=\d+">(.+?)</a></font></td>', re.DOTALL)
    metas = regex_meta.findall(line)
    # The observation text itself is stored in a hidden input field.
    regex_text = re.compile(r'input type="hidden" name="texto_\d+" ID="texto_\d+" value="(.+?)">', re.DOTALL)
    text = regex_text.findall(line)[0]
    for meta in metas:
      outline += meta.strip().replace("\n", " ").replace("\r", " ") + "\t"
    # Add the decade as an extra column (first three digits of the year plus "0").
    outline += metas[0][0:3] + "0" + "\t"
    # Turn the <b><u>...</u></b> markup around the keyword into tab delimiters, so
    # that left context, keyword and right context end up in separate columns.
    outline += text.replace("</u></b> <b><u>", " ").replace("<b><u>", "\t").replace("</u></b>", "\t").replace("\n", " ").replace("\r", " ").strip()
    out.append(outline)
  return "\n".join(out)

ds = ""

# Read every html dump that was saved in the data folder.
fl = glob.glob("./data/*.txt")
for f in fl:
  fin = codecs.open(f, "r", "utf-8")
  export = fin.read()
  fin.close()

  kwic = getKwic(export)
  ds += kwic + "\n"

# Write all observations to a single tab-delimited text file.
fout = codecs.open("dataset_raw.txt", "w", "utf-8")
fout.write(ds.strip())
fout.close()

23 thoughts on “How to extract data from COHA into Excel or R?”

  1. Hello! I’m doing a linguistic analysis of the COHA as my BA thesis and had a question about this tutorial. Everything is fine until I try to run the combine.py file, when I get the error message:

    Traceback (most recent call last):
    File “C:\Users\Paul\Desktop\COHA\scrape_coha\combine.py”, line 25, in
    export = fin.read()
    File “C:\Python27\lib\codecs.py”, line 671, in read
    return self.reader.read(size)
    File “C:\Python27\lib\codecs.py”, line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xcd in position 8203: invalid continuation byte

    I do not know what is wrong and would very much appreciate it if you could help me!

    Keep up the good work!//Paul

    1. Hello Paul,
      ok, this is a typical unicode error. I would not mind running the code for you here, but you could experiment by changing line 24 ‘fin = codecs.open(f, “r”, “utf-8”)’ to just ‘fin = open(f, “r”)’. Let me know if that helps?

    2. Hi there, here is another thing that might help. If you are on Windows, make sure to use Notepad++ and open the text files one by one. For each file, use the menu ‘Encoding’ and the function ‘Convert to UTF-8 without BOM’. Save the files and try again. That should put all your files in unicode, for which the script is set up.
      Also, if you are on Windows, you might want to change this line
      fin = codecs.open(f, “r”, “utf-8”)
      to this
      fin = codecs.open(f, “r”, “latin-1”)
      so that the file is read in the standard encoding of Windows.
      Good Luck!

  2. Hi,

    Thanks for the useful script. Here are 3 thoughts:

    * Excel stores only around 1 million rows per sheet, so ambitious projects will have to split their data sets across multiple Excel files.

    * The open() call causes a UnicodeDecodeError (invalid continuation byte). I took out the encoding argument and the script worked.

    – If we’re not actively encoding or decoding data, which is the case here since we’re only accessing files via open(), we can leave out the encoding argument; the method provides transparent handling of encoding/decoding (explained in the URL below). As long as the data conforms to Unicode, utf-8 doesn’t need to be specified since codecs cover Unicode.

    https://docs.python.org/2/library/codecs.html

    * An alternative to manually copying HTML code into separate files is to use the urllib2 library to scrape the data; you can also combine it with the Beautiful Soup library in lieu of regex.

  3. Hello Tom,

    I’m working on a term paper for my Linguistics class right now.
    I tried to run the combine.py file, but I got the same error message as Paul did.
    And your suggestions didn’t help me. It still doesn’t work.
    I’m about to fall into despair because I have never worked with Python Scripts or anything like that before.
    I would appreciate help from you a lot.

    Thanks in advance, Alena.

  4. Hi Alena, you could try the suggestion of Betu here: “If we’re not actively encoding or decoding data which is the case here since we’re only accessing files via open(), we can leave out the encoding argument”. So, try to change the line ‘fin = codecs.open(f, “r”, “utf-8”)’ to ‘fin = codecs.open(f, “r”)’.

    If that does not work, you could just send me your data via mail and I can have a look at it.

  5. Thank you! It worked. But when I open up the dataset_raw file, it is empty. I don’t know what I’m doing wrong. Maybe you can send me an email so I can explain my problem in more detail.

  6. Hello,
    This method allows you to extract one page at a time, which is OK when there aren’t more than a dozen, but say you have a hundred pages you’d like to extract: do you know any way to automate this process?

    1. Hi Chris, I presume that the Mark Davies corpora all have the same basic layout, so it should work (perhaps with some minimal changes, but that should be easy for any corpus linguist)

  7. Dear Tom,

    First of all, thanks for writing this post and publishing this video. It’s a really great help.
    I’m working on Spanish linguistics, so I’m working with the corpusdelespañol (same author, same layout), and I have the same problem as Alena. My dataset_raw file is empty. I really would appreciate some help!

    Thanks in advance!

    1. Hi Elise,
      I did not design this to work with the Spanish corpora, but as I said, it is the same software, so it could work. Why don’t you check whether the regular expressions from the script evaluate correctly on the raw HTML that you copy-pasted? That should give you a first clue.

  8. Dear Tom,

    It seemed like the problem had nothing to do with the code itself, but with the privacy settings on my PC. I just needed to switch off (temporarily) AVG and then everything worked perfectly.

    So, if anyone ever reads this post and works with the Spanish Mark Davies corpus, you can use exactly the same code as posted here above.

    I genuinely want to thank you once again, Tom, for sharing your video and your code with us. You probably saved a lot of student lives (or at least their linguistics papers).

    With kind regards,

    Elise

  9. Hello,

    First of all, I would like to thank you for this tutorial. It will no doubt be helpful to many people working with COHA.

    I have the same problem as Alena and Elise. I noted the small changes suggested for Windows users and changed the encoding of the text files accordingly. I also changed the encoding in the code to Latin-1. However, running the code still yields an empty file. Could you help me solve this issue?

    All the best,

    Chris

    1. Hi Chris, I noticed that the COHA is now also available for download. I will see how to update this post with that information. As for your current question, I am going to need a bit more background. If you want, you can send the files you stored in /data/ as an attachment to tomdotruetteatgmaildotcom, and I will give it a look.

  10. Hi Tom,

    Thank you for responding. COHA can indeed be downloaded, but only if you purchase a license, for at least $245. I sent you an e-mail with the files and I hope you will be able to solve the issue :-).

    1. Hi Tom,

      I have the same problem as Chris so if you have solved it, could you please share the solution here? Thank you in advance! 🙂

      1. Thanks, EKBROWN77. I actually managed to download everything and to reformat it all, in order to keep the metadata, KWIC concordance and context, with a personal script in R.
