Share your datasets

An important aspect of scientific research is that findings are reproducible, falsifiable and transparent. Especially in an empirical approach, it is of the utmost importance to make datasets available. Wanting to see the data behind a publication should become a natural reflex. No matter how well the publication describes the variables, it is always interesting and insightful to learn how certain observations were annotated. From your own experience, you probably already know how difficult it usually is to decide which value to assign for the variables you are investigating. Other corpus linguists share these insecurities. Perhaps that is why many (corpus) linguists do not make their datasets freely available. Usually, though, they bring two kinds of arguments to the table.

The first kind of argument is “I want to publish on this data before I give it to others”. This is sometimes also known as the “right of first publication”. I can certainly understand that feeling, and in certain cases it is justified. If your dataset is so specific that it can only answer your one specific research question, it would indeed be foolish to give it away before valorizing it yourself. However, you might also consider how many people actually have the same research question as you do; and if other people have the exact same research question, you should be collaborating with them anyway. If your dataset is fairly general, e.g. when it was not set up with a specific research question in mind, there is not really an argument for holding back the data. By sharing the dataset, you might even gain some recognition in the research community.

The second kind of argument is something like “I do not want some big shot to use my data, with everybody then citing him instead of me”. The horror story goes as follows: after months and months of hard work compiling and annotating The Dataset, you present some first results at a conference. Professor Big Shot from Big Shot University comes to you after the presentation and asks if he could have a look at the data, because he finds it so interesting. Flushed with emotion, you give him your perfectly prepared dataset. Some months later, you discover in the International Journal of Big Shot Linguists an article by Professor Big Shot, who shamelessly used your dataset for a publication, and the only thing you got was a thank-you footnote. The rest of the research community hails Professor Big Shot for his exquisite analysis of some extraordinary data, and the article gets a record number of citations. You, however, never get a part of the glory and leave academia for lack of funding. Obviously, this is bogus, since it is quite unrealistic that your dataset is going to change the field forever. Get real. Moreover, you have certain legal rights to a dataset, as it is your Intellectual Property. Make sure you always include an adequate license with your dataset (this will be discussed below).

Although I can understand the two arguments against publishing your datasets to a certain extent, it is my strong opinion that scholars should nonetheless make their work publicly available. There are at least three arguments for this.

First, scholars in corpus linguistics are usually paid by taxpayers and should therefore make their work publicly available, preferably without additional cost. Scientists have an academic responsibility, but are also there at the grace of and for society. In fact, this is also why you should prefer an open format (or even just plain text in the csv format) over proprietary formats. Proprietary formats can often only be read properly by software that people have to pay for, and science should be free, both in the monetary sense and in the “free as in speech, not free as in beer” sense.

Second, making data available gives a more transparent insight into the findings of your research. The scientific community is a community, and although we do not need to be best friends across the board, scientists should function as an exemplary community of equals who respect each other. Nobody should be afraid to put a dataset (preliminary or finished) out there in front of the critical eye of their peers. I know how harsh and painful it sometimes is to receive critical reviews, but they make you a better scholar. If you keep your cookies in the jar forever, they will go bad.

Third, your work is protected by copyright (several licenses are available, e.g. Creative Commons) and you have the right to complain when somebody violates the terms of your license. In the words of a respected colleague: “if you want that people draw a dragon every time they cite your data/publication, and you write that into the license, you can sue them if you don’t see the dragon”. Perhaps that is pushing it too far, but as the creator of the dataset you have the rights to the Intellectual Property (IP). With a Creative Commons Attribution (CC BY) license, everybody who uses your IP is obliged to credit you. You can fight for your right to, well, be cited!

Postscript: There is quite obviously one major exception to this universal principle of scientific openness, and that is when your data may violate the privacy of your subjects.

