The Corpus of Historical American English (COHA) is a wonderful source for corpus-linguistic research on diachronic English phenomena. It contains about 400 million words from newspapers, magazines, and fiction and non-fiction books, spanning the years 1810 to 2009. A neat web interface is available for searching COHA, and it offers quite a number of useful search features.
However, the COHA web interface does not let you build a proper corpus-linguistic dataset.
There is no way to simply download all observations into a CSV file with observations as rows and variables (e.g. year, genre, text) as columns. This is a real drawback of using the COHA web interface, although there are presumably good reasons for it. Nonetheless, it is perfectly fine to use a little technology to get all the observations out of the web interface, and this post explains how.
As a corpus linguist, you are also a little bit of a technician. If you are at first intimidated by the steps described here, please fight the urge to give up. Steps like these are still very much part of every corpus linguist's work during the compilation of a dataset. Perhaps someday, corpus tools will freely export data in a useful format, but until that day, corpus linguists are tech nerds who have to hack their datasets together. Learn to appreciate and enjoy it!
This post assumes that you know about:
- HTML: learn about HTML
- Regular expressions: learn about regular expressions
- COHA: learn about the Corpus of Historical American English
This post assumes that you have:
- a working Python environment (macOS and most Linux distributions come with Python pre-installed; on Windows you need to install it yourself)
Real bad: a corpus investigation in COHA
We want to investigate the phrase “real bad” in COHA to see how “real” without “-ly” can be used as an adverb. One might wonder whether the “-ly” is dropped more readily in attributive contexts (e.g. she is a real(ly) bad girl) or in predicative contexts (e.g. that girl is real(ly) bad). Below, you find the steps described in words; further down, a screencast is available.
The first step is to find observations of the phrase “real bad” in COHA. To do so, go to COHA at http://corpus.byu.edu/coha/. In the search box (behind WORD(S)), simply type in “real bad”. Then click the “search” button, and a frequency overview will appear on the right. If you click on the search phrase “real bad” (underneath the CONTEXT button in the frequency panel), a panel with the individual observations appears at the bottom.
Normally, you would like to get these observations into a standard corpus linguistic dataset, so that you can annotate each observation for the research question that you are asking. In our case, we would like to verify every observation to see if “real” is used as an adverb with “bad”, and whether the Adjective Phrase with “bad” as its head is used attributively or predicatively. The dataset could look like this:
| year | genre | text | left context | keyword | right context | adverb? | use |
|------|-------|------|--------------|---------|---------------|---------|-----|
| 1815 | FIC | Book A | This left context is | real bad | . And this right context is nice. | yes | predicative |
Getting to such a nice dataset from the COHA web interface is not trivial. You cannot simply copy-paste the table from the COHA website, and you cannot export all the observations either (probably due to copyright restrictions). Therefore, you need a nifty workaround, which employs the raw HTML that generates the list of observations, together with a Python script that uses regular expressions to extract the observations from this HTML.
Before we start, you need to make a folder on your computer called “scrape_coha”, and within this folder, a subfolder called “data”. In “data”, you will save the raw HTML from the COHA corpus. In the top folder “scrape_coha”, you can already save the Python script that you find at the bottom of this post: simply copy the Python code, create a text file in “scrape_coha” called “combine.py” (make sure to remove the .txt extension), and paste the Python code into this file.
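If you like, you can also create this folder structure from Python itself instead of by hand; a minimal sketch (the folder names follow the ones used in this post):

```python
import os

# Create the working folder "scrape_coha" with its "data" subfolder;
# exist_ok=True means re-running the snippet does no harm.
os.makedirs(os.path.join("scrape_coha", "data"), exist_ok=True)
```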
First, right-click on the observation list and use the option (in Google Chrome) to show the frame source. A new window or tab will open, containing a bunch of HTML. If you search this HTML (by means of Ctrl-F) for the word “bad”, you will see that the actual observations are hidden within it. Not far from each observation, you will also find the meta information on year, genre and text. Now select all the HTML code in your browser (Ctrl-A), copy it (Ctrl-C), create a text file “real_bad_1.txt” in the “data” folder, and paste (Ctrl-V) the copied HTML code into that new file.
Repeat this step for the second and third page of search results in COHA: go to the second page in the COHA web interface; right-click on the list of observations; show the source code; select the complete HTML code; copy it; create a text file “real_bad_2.txt” in the “data” folder; and paste the HTML code into that file. Do exactly the same for the third page, which you save in a text file “real_bad_3.txt”.
Second, we apply the Python script to extract the observations and meta information from the three text files in “data”, and store them in a delimited text file that we can import into Excel for further annotation. The script basically contains a number of regular expressions that first match the complete table row of a single observation, and then search within that row for the text and the metadata. Try to read and understand the Python script; it might seem a little difficult at first, but it should be fairly self-explanatory.
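To give you a feel for the technique before you open the real script: the exact markup of the COHA results page changes over time, so the snippet below runs on a made-up table row and a made-up pattern, purely as an illustration of matching a row and then pulling out metadata and context. The actual script at the bottom of this post uses patterns tuned to the real COHA markup.

```python
import re

# An invented table row in the spirit of the COHA results page;
# the real HTML differs, so treat this purely as an illustration.
row = ('<tr><td>1815</td><td>FIC</td><td>Book A</td>'
       '<td>This left context is <b>real bad</b>. And this right context is nice.</td></tr>')

# Capture year, genre, text, then left context, keyword and right context.
pattern = re.compile(
    r'<td>(\d{4})</td><td>(\w+)</td><td>([^<]+)</td>'
    r'<td>(.*?)<b>(.*?)</b>(.*?)</td>'
)
m = pattern.search(row)
year, genre, text, left, keyword, right = m.groups()

# Join the pieces into one tab-delimited dataset line.
line = "\t".join([year, genre, text, left.strip(), keyword, right.strip()])
print(line)
```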
So, simply run the Python script in the “scrape_coha” directory, and a new text file “dataset_raw.txt” will appear in that folder. If you open it in a text editor, you will see that it simply contains the metadata, the left context, the keyword (“real bad”) and the right context, delimited by tabs. This file can be imported into Excel (see video) for further annotation, or into R as a tab-delimited file.
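If you prefer to stay in Python rather than moving to Excel or R, the tab-delimited file can also be read back with the standard csv module; a minimal sketch with an invented example row (the column order follows the description above):

```python
import csv
import io

# An invented example line in the format of "dataset_raw.txt"
# (tab-delimited: year, genre, text, left context, keyword, right context).
raw = ("1815\tFIC\tBook A\tThis left context is"
       "\treal bad\t. And this right context is nice.\n")

# In practice you would pass open("dataset_raw.txt") instead of io.StringIO.
reader = csv.reader(io.StringIO(raw), delimiter="\t")
for year, genre, text, left, keyword, right in reader:
    print(year, genre, keyword)
```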
This video explains everything on the screen. You might need to set the quality to HD to be able to read the on-screen text in the video.
This is the Python code that you will need to convert the COHA HTML into a tab-delimited text file.