Datasets are among the most important objects in a scientific study. It is best to stick to a widely used format for your dataset, so that other people can understand what you have done. To find a good format for corpus-linguistic datasets, we first need to look at the nature of corpus-linguistic data.
The main task of corpus linguists is to observe linguistic phenomena in large collections of text, i.e. corpora. Typically, these observations are presented as a keyword-in-context (KWIC) concordance, in which every line represents one observation of the linguistic phenomenon. Such an observation may have several qualities. Two qualities that are already present in a KWIC are the left and the right context of the phenomenon. Other qualities are purely linguistic in nature: we may, for example, want to enrich the observation with its part of speech or any other linguistic category. Finally, a quality may also be non-linguistic information, such as the year in which the text containing the observation was produced. All in all, we want to end up with a (long) list of observations of the phenomenon under investigation, enriched with contextual, linguistic and meta-information.
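To make the idea of a KWIC more concrete, here is a minimal sketch in Python of how such a list of observations could be produced. The function name `kwic`, the `window` parameter and the toy sentence are all invented for illustration; a real concordancer would do proper tokenisation rather than splitting on whitespace.

```python
def kwic(tokens, keyword, window=3):
    """Return one row (left context, match, right context) per hit."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # Up to `window` tokens on either side of the hit.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append({"left": left, "match": token, "right": right})
    return rows


# Hypothetical toy corpus, invented for this example only.
text = "the handsome guy met a beautiful girl and the girl smiled"
rows = kwic(text.split(), "girl", window=2)

for row in rows:
    print(f"{row['left']:>15} | {row['match']} | {row['right']}")
```

Each dictionary in `rows` is one observation, and the left and right context are simply the first two qualities recorded for it.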
For this kind of data, a well-known data format exists that is already widely used in corpus linguistics: unstacked data. Unstacked data is a format in which every variable gets its own column. The table below illustrates the unstacked format. It shows three observations (the column “Observation” simply numbers them). For each observation, two variables are recorded, “Variable 1” and “Variable 2”, and each variable takes a certain value. The first observation, for example, has the value “Variable 1, Value 1” for “Variable 1”.
|Observation||Variable 1||Variable 2|
|1||Variable 1, Value 1||Variable 2, Value 1|
|2||Variable 1, Value 2||Variable 2, Value 2|
|3||Variable 1, Value 3||Variable 2, Value 3|
(Unstacked data contrasts with stacked data; see Wikipedia for more on the difference.)
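The difference between the two layouts is easiest to see by converting one into the other. The following is a rough sketch in Python using only plain lists and dictionaries; the helper names `stack` and `unstack` and the column names `variable` and `value` are assumptions made for this example, and the observations are numbered 1 to 3.

```python
# The abstract table in unstacked form: one row per observation,
# one column per variable.
unstacked = [
    {"Observation": 1, "Variable 1": "Variable 1, Value 1", "Variable 2": "Variable 2, Value 1"},
    {"Observation": 2, "Variable 1": "Variable 1, Value 2", "Variable 2": "Variable 2, Value 2"},
    {"Observation": 3, "Variable 1": "Variable 1, Value 3", "Variable 2": "Variable 2, Value 3"},
]


def stack(rows, id_column):
    """Stacked form: one row per (observation, variable) pair."""
    stacked = []
    for row in rows:
        for name, value in row.items():
            if name != id_column:
                stacked.append({id_column: row[id_column], "variable": name, "value": value})
    return stacked


def unstack(rows, id_column):
    """Inverse of stack(): collect all variables of an observation in one row."""
    by_id = {}
    for row in rows:
        record = by_id.setdefault(row[id_column], {id_column: row[id_column]})
        record[row["variable"]] = row["value"]
    return list(by_id.values())


stacked = stack(unstacked, "Observation")
for row in stacked:
    print(row)
```

In the stacked form, the three observations with two variables each become six rows, and calling `unstack` on them restores the original table.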
Let us make this abstract example more tangible with data that might be linguistic in nature. Imagine we observe three noun phrases, and we want to annotate whether they have a definite article and whether the head noun is animate or inanimate. The observation table in the unstacked format would look like this.
|Observation||NP||article_definite?||head_animate?|
|1||a beautiful girl||indefinite||animate|
|2||the handsome guy||definite||animate|
Now, imagine that we would also like to annotate the number of words in the NP. In contrast to the two previous categorical variables “article_definite?” and “head_animate?”, the number of words in an NP is a numeric variable, because its values are counts with a logical ordering. This, however, makes no difference to the way in which we annotate the variable. Just like the previous variables, it becomes an extra column in the dataset.
|Observation||NP||article_definite?||head_animate?||Number of words|
|1||a beautiful girl||indefinite||animate||3|
|2||the handsome guy||definite||animate||3|
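As an illustration, the table above could be assembled and exported along the following lines. This is only a sketch in Python: the column names `observation`, `np` and `n_words` are hypothetical choices for this example, and the word count is approximated by splitting the NP on whitespace.

```python
import csv
import io

# The two annotated noun phrases shown in the table above.
observations = [
    {"np": "a beautiful girl", "article_definite?": "indefinite", "head_animate?": "animate"},
    {"np": "the handsome guy", "article_definite?": "definite", "head_animate?": "animate"},
]

# A new variable is simply a new key (column) on every observation.
for number, row in enumerate(observations, start=1):
    row["observation"] = number
    row["n_words"] = len(row["np"].split())

# Write the unstacked table as CSV: one line per observation.
fieldnames = ["observation", "np", "article_definite?", "head_animate?", "n_words"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(observations)

csv_text = buffer.getvalue()
print(csv_text)
```

The example writes to an in-memory buffer; passing a file opened with `open(..., "w", newline="")` to `csv.DictWriter` instead would produce a file that spreadsheet or statistical software can read.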
This is the best way of gathering data for a corpus-linguistic investigation. Every single observation is a line in your table, and all the additional information that you want to add to the observation is put into separate columns. If your data is prepared like this, it will be a breeze to perform a quantitative analysis afterwards, because this format can easily be imported into spreadsheet software or statistical software such as R. Moreover, it is also the preferred format for making frequency tables and contingency tables, and for running chi-square tests, regressions, etc.
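To show why the unstacked format makes counting so easy, here is a small sketch in Python that derives a frequency table and a contingency table from the two example observations. Only the standard library is used, and the variable names are chosen for this example. With just two observations the counts are of course trivial; the point is only that each table is a single pass over the rows.

```python
from collections import Counter

# The two example observations, one dictionary per row.
observations = [
    {"np": "a beautiful girl", "article_definite?": "indefinite", "head_animate?": "animate"},
    {"np": "the handsome guy", "article_definite?": "definite", "head_animate?": "animate"},
]

# Frequency table: how often does each value of one variable occur?
article_counts = Counter(row["article_definite?"] for row in observations)

# Contingency table: how often does each combination of two variables occur?
contingency = Counter(
    (row["article_definite?"], row["head_animate?"]) for row in observations
)

print(article_counts)
for (article, animacy), count in sorted(contingency.items()):
    print(article, animacy, count)
```

Combinations that never occur, such as a definite NP with an inanimate head in this toy dataset, simply have a count of zero in the `Counter`.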