Corpus linguistics is real(ly)? awesome

Now and then, you hear something and wonder why it was said the way it was said. For me, one such phenomenon is hearing the word “real” without the prescriptively required adverbial “-ly” as a modifier of adjectives:

I just heard some real bad news (Kanye West)

That shirt is real fly! (Fresh Prince of Bel-Air)

As noted, one would expect “really bad” and “really fly”. These kinds of things attract my attention, so I decided to do a small corpus-linguistic investigation to find out what is going on.

Normally, one should do an exhaustive literature review, but for the purpose of this blog, I will focus on the corpus-linguistic methods that one needs to investigate the following two research questions:

  1. Does the use of “real” without “-ly” as an adverb change over time?
  2. Is the use of “real” without “-ly” influenced by whether its related adjective (e.g. “bad” or “fly”) is used attributively (e.g. “some real bad news”) or not (e.g. “is real fly”)?

And obviously, one could combine the two research questions and ask whether the (non-)attributive use of “real” without “-ly” changes over time. As a good indicator that I am onto something, let me show you the Google Ngram plot for “real good” versus “really good”. Obviously, this is not what one would use for a scientific study, but it usually is a good indication of what is going on. Here, we see that “real good” used to be more popular than “really good”. This may partly be due to the fact that “good” can also be a noun, in which case it takes “real” as an adjective without “-ly”.

real(ly)? good

Finding a relevant corpus

The first step in a corpus-linguistic study is to find a corpus relevant to the research questions, from which a dataset can be derived that may actually answer them. There are so many corpora available for English that it sometimes becomes difficult to keep track. However, when it comes to corpora that are freely searchable online, one is largely limited to the corpora of Mark Davies. And when one needs a historical corpus of English, one quickly stumbles upon the Corpus of Historical American English (COHA).

Mark Davies’s interface to his corpora is ingenious and an enormous advantage for the research community. However, it would be great to be able to download the complete list view (and not just a sample) as a simple keyword-in-context spreadsheet. I understand that this is currently not possible due to copyright issues, and it would be great if we could have a more lenient copyright regime for scientific research.

Searching the corpus for relevant data

It is not always simple to determine the best query for obtaining all the relevant material from a corpus. When it comes to the combination of “real(ly)?” with an adjective, quite a large number of combinations are available. Every additional combination yields more observations, but it also prolongs the annotation period. For that reason, we limit ourselves to the adjectives “good” and “bad” (both top-ranking collocates of “real(ly)?”). Interestingly enough, the combination “real(ly)? good” as a non-attributive item (e.g. “but it doesn’t work really good”) is also a common topic in grammar fora.

So, we look for “real good”, “really good”, “real bad” and “really bad” in the COHA. Obviously, we also search for the “really” variant, and not just for “real”, because we want to investigate the variation (Principle of Accountability). To obtain material from the COHA and to construct a dataset from it, one needs to do some computer stuff. By doing this computer stuff, I was able to obtain a table from COHA that looks a bit like this:

year genre text period left context keyword right context
1977 NEWS CSMonitor 1970 percent rise. But growers are worried. ” In Oklahoma farmers are in real bad shape, ” says Deputy Secretary of Agriculture John C. White. ” Almost 13,000
1978 NEWS Chicago 1970 night. This morning I was sick. I had a headache, a real bad one, and I threw up. But I’m OK now. ” Ever
1978 NEWS Chicago 1970 drizzle freezing as it hits the surfaces — and it will continue to be real bad in the morning, ” said Bob Corbett, a meteorologist with the National Weather
1980 FIC CradleWillFall 1980 to see Edna drink much more, because I knew she’d be feeling real bad in the morning, so I got out that nice canned ham and opened it

Annotating the data

This table needs to be expanded with columns that add information about attributive use per observation. It is also useful to record in separate columns whether the adjective is “good” or “bad” and whether the adverb appears with or without “-ly” for every observation. If we leave out (for the example here) the KWIC part, the dataset could look something like this:

year genre text period pos1 pos2 attributive
1977 NEWS CSMonitor 1970 real bad attributive
1978 NEWS Chicago 1970 real bad attributive
1978 NEWS Chicago 1970 real bad non-attributive
1980 FIC CradleWillFall 1980 real bad non-attributive
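The “pos1” and “pos2” columns do not have to be typed in by hand. Assuming the raw COHA table sits in a data frame (here hypothetically called ds.raw) with the search hit in a column called “keyword”, they can be derived with a small sketch like this:

```r
# split the keyword ("real bad", "really good", ...) into its two parts
kw <- strsplit(as.character(ds.raw$keyword), " ")
ds.raw$pos1 <- sapply(kw, `[`, 1)  # the adverb: "real" or "really"
ds.raw$pos2 <- sapply(kw, `[`, 2)  # the adjective: "good" or "bad"
```

The column name “keyword” and the data frame name are assumptions for illustration; adapt them to whatever your export actually contains.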

As you can see in the Real(ly)? (good|bad) dataset, both research questions (period and attributive use) are now annotated. We read the dataset into R with the following commands, which also remove the observations that we do not want to include (tagged as NA in attributive).

ds.full <- read.delim("https://corpuslinguisticmethods.files.wordpress.com/2014/01/coha_real-ly-good-bad_period_attributive_genre.key", 
           header=T, sep="\t")
ds <- droplevels(ds.full[!is.na(ds.full$attributive),])
# first overview of the data
summary(ds)

I asked a colleague to annotate about 50 random observations for “attributiveness”, so that I could see how neatly the category can be delineated. During annotation, some more or less difficult cases already popped up:

  1. “good” can also be used as a noun, and “(a) real good” is not a valid observation (NA in attributive)
  2. What about “anything real(ly) good” (typically attributive, but if you think about it…)?
  3. Sometimes in a conversation, you find something like: “Hey, did you eat that really good cheesecake?” — “Oh yeah, really good!” Is “really good” in the second turn still “attributive”?

In fact, this is also why I preferred the binary distinction between attributive and non-attributive, rather than attributive and predicative. This allows one to be really specific about the definition of the category on one pole only; otherwise, one would have to give a really convincing definition of predicative, too.

To calculate Cohen’s Kappa, we rely on an additional R package called “irr”. This package includes the function “kappa2”, which calculates Cohen’s Kappa for two raters. The two raters are in the two columns “attributive” and “attributive.anno2”:

# kappa of attributive annotation
ds.iaa <- data.frame(ds$attributive, ds$attributive.anno2)
ds.iaa.sharedobs <- droplevels(
    ds.iaa[ds.iaa$ds.attributive.anno2 != "", ]
  )

# cross tabulation
table(ds.iaa.sharedobs)

# Cohen's kappa
kappa2(ds.iaa.sharedobs)

The cross tabulation of the two raters (rows: annotator 1, columns: annotator 2) looks as follows.

attributive non-attributive
attributive 19 4
non-attributive 0 23

From this table, we can learn that four observations annotated as attributive by annotator 1 were annotated as non-attributive by annotator 2, whereas the reverse never happened. Overall, the two annotators diverge in only 4 of the 46 shared observations.

So, how does that look when we calculate a Cohen’s Kappa:

 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 46 
   Raters = 2 
    Kappa = 0.826 

        z = 5.69 
  p-value = 1.27e-08

A Cohen’s Kappa score of 0.826 is actually not bad at all. Although there is quite some discussion on how to interpret Kappa, with two raters and an almost fifty-fifty distribution over the binary distinction between attributive and non-attributive, anything higher than 0.8 counts as very good. Nonetheless, it is a good idea to print out the cross tabulation, so that you get the actual numbers, which are almost always much more revealing than an aggregate statistic.
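For transparency, the Kappa can also be computed by hand from the cross tabulation: observed agreement minus chance agreement, divided by one minus chance agreement. A minimal sketch, with the matrix reproducing the table above:

```r
# cross tabulation of the two annotators
# (rows: annotator 1, columns: annotator 2)
tab <- matrix(c(19, 0, 4, 23), nrow=2,
              dimnames=list(c("attributive", "non-attributive"),
                            c("attributive", "non-attributive")))

n  <- sum(tab)
po <- sum(diag(tab)) / n                      # observed agreement: 42/46
pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement: 0.5
(po - pe) / (1 - pe)                          # 0.826, as reported by kappa2
```

This reproduces the 0.826 that “kappa2” reports, which is a nice sanity check on both the package and our understanding of the statistic.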

Frequency analysis

If we want to do an analysis of this dataset that is based on frequencies, we need to thoroughly inspect the frequency characteristics of our variables. We can use the “table” command in R to get one-dimensional frequency tables for the relevant variables. As an example, let us inspect the distribution of observations over the different periods that are available:

# distribution over periods
table(ds$period)

This yields the following table:

1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 1910 1920
1 5 8 16 43 51 72 80 66 82 87 92
1930 1940 1950 1960 1970 1975 1980 1985 1990 1995 2000 2005
92 112 183 165 108 109 130 136 168 233 303 204

Now, there is a somewhat conservative rule of thumb that says that you need at least thirty observations per level of an independent variable. Right away, it becomes clear that the periods 1810–1840 simply do not contain enough data for a serious quantitative analysis of the distribution of attributive versus non-attributive use of real(ly)?. Of course, we could therefore restrict the dataset to periods after 1840:

# select periods larger than 1840
ds <- droplevels(
        ds.full[ds.full$attributive != "" & 
        ds.full$period > 1840,]
      )

Another option would be to pool together the data from the periods between 1810 and 1840, so that a thirty-year period is represented. Pooling is not always a good idea, but in the case of this example, one could defend pooling these decades so as to gain insight into the past, which is difficult to investigate anyway. To pool the data, the period annotations 1810, 1820, 1830 and 1840 need to be re-annotated:

# re-annotate years before 1850 to a larger period
ds$period[ds$period == 1810] <- "1810-1840"
ds$period[ds$period == 1820] <- "1810-1840"
ds$period[ds$period == 1830] <- "1810-1840"
ds$period[ds$period == 1840] <- "1810-1840"

The distribution of observations over periods now looks as follows:

1810-1840 1850 1860 1870 1880 1890 1900 1910 1920 1930 1940 1950
30 43 51 72 80 66 82 87 92 92 112 183
1960 1970 1975 1980 1985 1990 1995 2000 2005
165 108 109 130 136 168 233 303 204

Every period now has a sufficient amount of observations.

By the way, to make these html tables, use the “xtable” library in R:

library(xtable)
print(xtable(table(ds$period)), type="html")

and for a horizontal table, transpose the table with the “t” command in R:

print(xtable( t(table(ds$period))), type="html")

Ok, with this slightly modified dataset, we can now do some simple frequency analyses for our variables. First, we investigate the distribution of attributive versus non-attributive observations. We use the code below to first make a separate table that contains the absolute (ds.attr) and the relative frequencies (ds.attr.rf) for the values in the “attributive” column. Then, we make a plot, nicely with a main title and some information on the axes, and with the y-axis stretched to represent all possible values between zero and one (ylim=c(0,1)). Finally, I find it good practice to add the absolute values above the bars to show the number of observations that underlie the relative frequencies.

# distribution of attributive use
ds.attr <- table(ds$attributive)
ds.attr.rf <- prop.table(ds.attr)

# plot (add info for axes and main title)
barplot(ds.attr.rf, ylim=c(0,1), xlab="", ylab="Proportion",
  main="Attributive versus non-attributive observations")

# add absolute values on top of the bars
text(0.7,0.55,ds.attr[1])
text(1.9,0.5,ds.attr[2])

attributive_bar

It appears that our dataset contains slightly more attributive uses than non-attributive uses, but that the difference is really small.
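Whether this small difference is compatible with an even split can be checked quickly with a chi-squared goodness-of-fit test on the absolute frequencies. This is just a sanity check, not a substitute for a proper analysis:

```r
# does the attributive / non-attributive split deviate from 50/50?
chisq.test(ds.attr)
```

Applied to a one-dimensional table, chisq.test tests against equal proportions by default, which is exactly the even split we want to probe here.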

By the way, you can produce high-resolution plots with the following code:

# define where to store the png 
# (default: in the getwd() directory)
png("attributive_bar.png", 
  height=2000, width=2000, res=300)

# the (simplified) barplot from above, as an example
barplot(ds.attr.rf)

# write it out
dev.off()

We can do the same for the distribution of “real” versus “really”:

# distribution of really
ds.really <- table(ds$pos1)
ds.really.rf <- prop.table(ds.really)

# make the barplot with main and axes titles
barplot(ds.really.rf, ylim=c(0,1), xlab="", ylab="Proportion",
  main="'Real' versus 'really'")

# add absolute frequencies
text(0.7,0.46,ds.really[1])
text(1.9,0.6,ds.really[2])

And this yields the following graph:

pos1_bar

Obviously, we could do some more frequency tables for this dataset (“good” versus “bad” genres, “real good” versus “real bad” versus “really good” versus “really bad”), but the R code more or less stays the same.
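For instance, the distribution of “good” versus “bad” follows the same recipe, assuming (as in the table above) that the adjective sits in the “pos2” column:

```r
# distribution of the adjective ("good" vs. "bad")
ds.pos2 <- table(ds$pos2)
ds.pos2.rf <- prop.table(ds.pos2)

barplot(ds.pos2.rf, ylim=c(0,1), ylab="Proportion",
  main="'Good' versus 'bad'")
```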

Correlation analysis

Therefore, we move on to the next challenge: a correlation analysis. Before we turn to our initial research questions, I will first perform a simple correlation analysis to show how it works. Here, we will see how the use of “real” versus “really” is related to “genre”. To make a two-way correlation table, I somehow prefer the “ftable” command in R (although there are other possibilities, such as “table” or “xtabs”). The nice thing about “ftable” is that you can simply insert a typical statistics formula of the form “response ~ predictor”. In our case, where we want to see whether we can predict the behavior of “real” versus “really” (the response) from the genre (the predictor), the formula simply becomes “ds$pos1 ~ ds$genre”, which can easily be inserted into “ftable”.

### investigate a first possible correlation between -ly and genre 
### (two categorical variables)
really.genre <- ftable(ds$pos1 ~ ds$genre)

The resulting table looks as follows:

> really.genre
         ds$pos1 real really
ds$genre                    
FIC               903    755
MAG                99    386
NEWS               75    166
NF                 19    143

There are some drawbacks to “ftable” (e.g. it is hard to extract subsets of the data), but overall, it is pretty usable. As an example, we can immediately use it to make a barplot (see below). However, if we want a barplot that tells us for each genre, what the “pos1” proportion is, we need to make “genre” the response in the formula (otherwise, we would see for each level in “pos1” (“real” and “really”) what the “genre” proportion is).

# for a barplot of this, we need to turn around the variables
genre.really <- ftable(ds$genre ~ ds$pos1)

# relative frequencies with prop.table
# argument '2' for relative frequencies with respect to columns
genre.really.rf <- prop.table(genre.really, 2) 

# make the barplot
barplot(genre.really.rf[,order(genre.really.rf[1,], decreasing=T)],
  names.arg=c("Fiction", "News", "Magazine", "Non-fiction"),
  legend.text=c("real X", "really X"), 
  args.legend=c(x=5, y=1.18))

really-genre_bar

This is a bit of a weird result: although “real X” is known to be an informal form, the news genre ranks second. This calls for an inspection of the dataset:

ds[ds$genre == "NEWS" & ds$pos1 == "real",]

We immediately observe that most of the “real” observations in the newspapers are quotes from witnesses or bystanders. This is a fairly marked context that does not necessarily represent journalist-speak. Perhaps it would be a good idea to annotate quotes in the newspaper genre in our dataset?
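A quick and dirty way to find candidates for such a quote annotation is to check whether the left context contains an unclosed quotation mark. This is only a heuristic sketch; the column name “left” for the KWIC left context is an assumption about the dataset:

```r
# heuristic: an odd number of quotation marks in the left context
# suggests the keyword occurs inside a quotation
quote.counts <- lengths(regmatches(ds$left, gregexpr('["“”]', ds$left)))
ds$in.quote <- quote.counts %% 2 == 1
```

Such a flag would of course still need manual checking, but it narrows down the observations one has to look at.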

Returning to our research questions, we remember that we want to investigate two correlations:

  1. Does the use of “real” without “-ly” as an adverb change over time?
  2. Is the use of “real” without “-ly” influenced by whether its related adjective (e.g. “bad” or “fly”) is used attributively (e.g. “some real bad news”) or not (e.g. “is real fly”)?

The first question requires us to cross-tabulate the values in column “pos1” (“real” versus “really”) with the values in column “period”, so that we can see the use of “real” versus “really” per period. As shown above, we can make this cross-tabulation with the “ftable” command in R, and the relative frequencies with “prop.table”:

really.period <- ftable(ds$pos1 ~ ds$period)
really.period.rf <- prop.table(really.period, 1)

# show the table with relative frequencies
> really.period.rf
          ds$pos1      real    really
ds$period                            
1810-1840         0.3000000 0.7000000
1850              0.4651163 0.5348837
1860              0.4313725 0.5686275
1870              0.5000000 0.5000000
1880              0.4875000 0.5125000
1890              0.3484848 0.6515152
1900              0.4512195 0.5487805
1910              0.3908046 0.6091954
1920              0.3586957 0.6413043
1930              0.2500000 0.7500000
1940              0.3482143 0.6517857
1950              0.5355191 0.4644809
1960              0.5575758 0.4424242
1970              0.4907407 0.5092593
1975              0.5871560 0.4128440
1980              0.5000000 0.5000000
1985              0.6176471 0.3823529
1990              0.5595238 0.4404762
1995              0.3390558 0.6609442
2000              0.3432343 0.6567657
2005              0.2352941 0.7647059

If we want to observe the relative frequency of “real” (relative to the use of “real” + “really”), we simply select the first column of this table really.period.rf[,1]. We can in fact simply plot this column in R, and add some bells and whistles to provide some more information with the following code:

# make the plot
plot(really.period.rf[,1], type="b", axes=F, 
  xlab="Period", ylab="Proportion of 'real' without '-ly'",
  main="'Real' as an adverb",
  ylim=c(0.1,0.7))
# smooth line
lines(smooth.spline(really.period.rf[,1]), lwd=2)

# info
axis(1, at=c(1:nrow(really.period.rf)), las=2,
labels=attributes(really.period.rf)$row.vars[[1]])
axis(2, at=c(0.2,0.4,0.6))

really-period

Notice the use of smooth.spline which calculates a smoothed line through the data points. The use and interpretation of this smoothed line is not that simple, and one should always keep a close eye on the actual datapoints. But we will come to that later. In this part, we merely want to focus on the methodological aspects of getting a handle on the data.

The second research question requires us to cross-tabulate the values in “pos1” with the values in “attributive”. We can do an analysis completely parallel to the “pos1” versus “period” analysis, with the difference that this time, we cannot interpret “attributive” as an ordinal variable like “period”. In other words, we are now dealing with two purely categorical variables. For categorical variables, a mosaicplot is a good option in R:

## investigate a second possible correlation 
## between -ly and the attributive use
## (two categorical variables)
really.attr <- ftable(ds$pos1 ~ ds$attributive)
mosaicplot(ds$pos1 ~ ds$attributive,
  main="Real(ly)? (bad|good), (non-)attributively used",
  xlab="Really bad/good?", ylab="Attributively used?")

really-attributive_mosaic

The advantage of a mosaicplot is the combination of relative frequencies and absolute frequencies.

Nonetheless, it is possible to represent this data with a lines plot, too, if the view on the absolute frequencies is not that important:

# relative frequencies
really.attr.rf <- prop.table(really.attr,1)

# plot attributive use
plot(really.attr.rf[1,], ylim=c(0.3,0.7), type="b", 
  axes=FALSE, xlim=c(0.9,2.1), lty=1,
  xlab="'real' vs. 'really'",
  ylab="Proportion of (non-)attributive use")
# add non-attributive use
lines(really.attr.rf[2,], type="b", lty=2)

# info
axis(1, at=c(1,2), labels=c("real","really"))
axis(2, at=c(0.4,0.5,0.6))
legend("topleft", lty=c(1,2), 
  legend=c("Attributive use","Non-attributive use"))

really-attributive_lines

Both the mosaicplot and the lines plot show that there is a clear influence of attributive use on the choice for “real” or “really”.
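The visual impression can be backed up with a simple chi-squared test of independence between the two categorical variables. Again, this is a quick check, not a full model:

```r
# is the choice of 'real' vs. 'really' independent of attributive use?
chisq.test(table(ds$pos1, ds$attributive))
```

A small p-value here would support the impression from the plots that the two variables are associated.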

A combination of both questions forces us to make a more complicated cross-tabulation of three columns. We have to see how the values in “pos1” are related to an interaction of the values in “attributive” and “period”. For this, we can simply extend the formula that we used before: pos1 ~ attributive + period.

really.attr.per <- ftable(ds$pos1 ~ ds$attributive + ds$period)

# show the table
> really.attr.per
                          ds$pos1 real really
ds$attributive  ds$period                    
attributive     1810-1840            8     17
                1850                12     16
                1860                10     18
                1870                23     28
                1880                20     26
                1890                14     29
                1900                16     33
                1910                24     37
                1920                25     38
                1930                12     46
                1940                24     43
                1950                50     54
                1960                34     41
                1970                19     33
                1975                24     24
                1980                21     45
                1985                25     27
                1990                33     28
                1995                33     85
                2000                35    102
                2005                18     88
non-attributive 1810-1840            1      4
                1850                 8      7
                1860                12     11
                1870                13      8
                1880                19     15
                1890                 9     14
                1900                21     12
                1910                10     16
                1920                 8     21
                1930                11     23
                1940                15     30
                1950                48     31
                1960                58     32
                1970                34     22
                1975                40     21
                1980                44     20
                1985                59     25
                1990                61     46
                1995                46     69
                2000                69     97
                2005                30     68

However, now we bump into some trouble with the “ftable” command, because it does not allow us to freely select subsets from this table. For instance, it is quite difficult to grab only the attributive part of the table. This is rather inconvenient, so we need to work around it.

As a workaround, I propose to simply make two subsets of the dataset, one for attributive use, and one for non-attributive use. For each separate dataset, we calculate the relative use of “real” versus “really” in a table, and these tables are then used for plotting.

So, first we make two subsets of the data:

# grab non-attributive use
ds.fattr <- ds[ds$attributive == "non-attributive",]
fattr.really.per <- ftable(ds.fattr$pos1 ~ ds.fattr$period)
fattr.really.per.rf <- prop.table(fattr.really.per, 1)

# grab attributive use
ds.tattr <- ds[ds$attributive == "attributive",]
tattr.really.per <- ftable(ds.tattr$pos1 ~ ds.tattr$period)
tattr.really.per.rf <- prop.table(tattr.really.per, 1)

And then, we use the two tables with relative frequencies for plotting:

# plot of relative frequencies for "real bad/good" per period
# line for non-attributive use
plot(fattr.really.per.rf[,1], type="b", ylim=c(0,1), axes=FALSE,
  xlab="Period", ylab="Proportion",
  main="Proportion of 'real' vs. 'really' as an adverb for good/bad")

# line for attributive use
lines(tattr.really.per.rf[,1], type="b", col="red")

# smooth lines
lines(smooth.spline(fattr.really.per.rf[,1]), lwd=2)
lines(smooth.spline(tattr.really.per.rf[,1]), lwd=2, col="red")

# info
legend("topright", col=c("black", "red"), lwd=2,
  legend=c("non-attributive use", "attributive use"))
axis(1, at=c(1:length(levels(as.factor(ds$period)))),
  labels=levels(as.factor(ds$period)), las=2)
axis(2, at=c(0,0.5,1), labels=c(0,0.5,1))

really-attributive-period

It is also possible to represent the data differently. Above, we took the perspective of the relative frequency of “real” with respect to “really”. Now, we will take the perspective of attributiveness. For each period and for both “real” and “really”, we plot the ratio of non-attributive to attributive use (i.e. the number of observations of non-attributive use divided by the number of observations of attributive use):

# plot of the proportion of attributive versus non-attr. use 
# of real and really bad/good
# plot 'real'
plot(fattr.really.per[,1] / tattr.really.per[,1], 
  xlab="Period", ylim=c(0,2.5), type="b",
  ylab="Proportion of non-attributive over attributive use",
  main="'Real' and 'really' as adverbs for good/bad", 
  axes=FALSE)

# plot 'really'
lines(fattr.really.per[,2] / tattr.really.per[,2], 
  type="b", col="red")

# smooth lines
lines(smooth.spline(fattr.really.per[,1] / tattr.really.per[,1]), 
  lwd=2)
lines(smooth.spline(fattr.really.per[,2] / tattr.really.per[,2]), 
  lwd=2, col="red")

# info
axis(1, at=c(1:length(levels(as.factor(ds$period)))),
  labels=levels(as.factor(ds$period)), las=2)
axis(2, at=c(0,1,2), labels=c(0,1,2))
abline(h=1, col="darkgrey", lty=2)
text(16,1.05, "Attributive = non-attributive use", cex=0.7)
abline(h=2, col="darkgrey", lty=2)
text(6,2.05,"2x more non-attributive than attributive use", cex=0.7)
abline(h=0, col="darkgrey", lty=2)
text(19.5,0.05,"Attributive use only", cex=0.7)
legend("topleft", col=c("black", "red"), 
  legend=c("real", "really"), lwd=2)

attributive-really-period

These plots now give us all the perspectives on the data that we need for an interpretation of the patterns.

Interpretation

Although all of the above already seems like quite some hard work, with the creation of the dataset, the annotation, the cleaning of the data, and then all the descriptive statistics, the real linguistic work only starts now. What do all these lines on these graphs mean “linguistically”? By just describing what we see in the data, we have only taken the first step of a linguistic analysis; now we have to make sense of the data.

Now obviously, the scope of this blog does not allow for a comprehensive interpretation of the data that is presented here. So I invite the readers to propose intelligent interpretations in the comments.
