How to extract data from COHA into Excel or R?

The Corpus of Historical American English (COHA) is a wonderful source for corpus-linguistic research on diachronic phenomena in English. It contains about 400 million words from newspapers, magazines, fiction and non-fiction books, covering the period from 1810 to 2009. A neat web interface is available for searching COHA, and it offers quite a number of useful search features.

However, the COHA web interface does not allow you to make a really good dataset for corpus linguistic research.


Save your datasets as csv files

Although it is a good idea to build your datasets in spreadsheet software, it is an even better idea to save the finished dataset (once the annotation is complete, of course) in CSV format.
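As a rough illustration of what a CSV round trip looks like outside a spreadsheet, here is a minimal Python sketch using only the standard library. The rows and the column names (`year`, `genre`, `token`, `annotation`) are made up for the example and are not taken from COHA; the helper names `write_csv` and `read_csv` are likewise hypothetical.

```python
import csv
import io

# Hypothetical annotated observations; the column names are illustrative only.
rows = [
    {"year": 1810, "genre": "fiction", "token": "shall", "annotation": "modal"},
    {"year": 2009, "genre": "magazine", "token": "will", "annotation": "modal"},
]


def write_csv(rows, handle):
    """Write a list of dicts to an open text handle, header line first."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def read_csv(handle):
    """Read the rows back as a list of dicts; every value comes back as a string."""
    return list(csv.DictReader(handle))


# An in-memory buffer stands in for a file so the sketch has no side effects.
buffer = io.StringIO()
write_csv(rows, buffer)
buffer.seek(0)
restored = read_csv(buffer)
```

To write to an actual file instead, the same function could be called as `write_csv(rows, open("dataset.csv", "w", newline="", encoding="utf-8"))`, where `dataset.csv` is a placeholder name. Note that CSV stores plain text only, so numeric columns such as `year` have to be converted again after reading.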
