Save your datasets as csv files

Although it is a good idea to build your datasets in spreadsheet software, it is an even better idea to save your dataset (after you are ready with the annotation, of course) into the csv format.

“csv” stands for “comma separated value”, which means that the columns of the datasheet are separated by a comma, or in fact, any kind of delimiter you want, e.g. a tab. Obviously, the lines of the dataset are retained. In practice, a table is basically represented in a straightforward text format. Remember this table from a previous post?

Definiteness and animacy of NPs
obs NP article_definite? head_animate?
1 a beautiful girl indefinite animate
2 the handsome guy definite animate
3 the chair definite inanimate

This table would be represented in the csv format (with a comma as the separator) as follows:

obs, NP, article_definite?, head_animate?
1, a beautiful girl, indefinite, animate
2, the handsome guy, definite, animate
3, the chair, definite, inanimate

Obviously, the csv format is not so easily readable on screen as a regular table. Nonetheless, there are two different arguments in favor of the csv format. The first argument is the “free, as in open”-argument, the second argument emphasis portability.

Free data

As a scientist, you do not want to sit on your data. Although it is common practice to keep your raw datasets from other scholars (and sometimes for good reasons), it should at least be your intention to make your raw datasets available to other scholars. This guarantees the openness of scientific research, with important values such as verification (other people do not need to believe you on your word, and may want to check your findings). This freedom of data also implies that your data should be available “forever” and independent of specific software. Indeed, software usually does not have a long life. Remember WordPerfect, or Lotus 1-2-3? If you had saved your datasets in the idiosyncratic formats of these software packages, your data would now be lost.

Portable data

The rigidly structured csv format can be interpreted by almost any piece of spreadsheet or statistical software there is. The idea is, just as above, that the dataset is not stored in some proprietary format (such as the Excel format) or in a format that cannot be easily transported (such as the OpenDocument format) to other software. Since csv files are so widely accepted, they are the most appropriate way of porting data from one software to the other. It is also a good idea to share your datasets with others in the csv software, because you can not assume that the other person will use the same software as you do.

Converting to csv

Obviously, there is no problem at all in preparing your dataset in a proprietary software package such as Excel. And by all means, as long as you are still working on the dataset, do save the dataset in the software specific format. However, when you prepare a “stable” version of the dataset for a presentation or a publication, then store a csv version of the dataset.

Most software tools have a csv export function. In MS Excel, you can simply select csv as an option under “save as”. As an example, you could download the Excel file of the Old High German dataset with article-alike determiners followed by concrete or abstract nouns. When trying to save the file, you will notice that there are a couple of csv options, i.e. for MS-DOS, Macintosh, and a generic one. Simply use the generic one, and save the file. The csv file should look like this, when opened in a text editor:

id;determiner;noun;semantic class;century
0;daemo;dolge;concrete;8
1;daez;dolg;concrete;8
2;ther;heilant;abstract;10
3;einen;brunnon;concrete;10
4;ein;quena;concrete;10
...

You will immediately notice that the delimiter for the columns is not a comma, but rather a semicolon. That is fine, as said above, the separator can be anything you want to be. The smart question here, however, would be what happens if you have a semicolon in one of the values? Luckily, the csv format allows you to put values that contain the separator between quotes, so that during import the separators between quotes are not interpreted as column dividers, but rather as parts of the value. MS Excel automatically puts quotes around values that contain the delimiter during saving.

However, if you are on Windows (I do not know how this is on a Mac), the MS Excel file will not be saved as to a unicode UTF-8 text file, which is the preferred encoding for text files. To resolve potential encoding issues, use Notepad++ to convert the csv file to UTF-8. To do so, open the csv file in Notepad++ and in the “Encoding” menu, select “Convert to UTF-8 without BOM”. Then save, the file. Now you have a csv dataset in UTF-8!

Advertisements

2 thoughts on “Save your datasets as csv files

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s