Organization of data files

Kuehl discusses an example of a study in which beef steaks are packaged in a controlled atmosphere to prevent the growth of meat-spoiling bacteria on the surface of the steak. The data are organized in a table. Each row of the table represents a single observation.

Beef steak packaging data

Steak  Treatment   Log count
    1  Commercial       7.66
    6  Commercial       6.98
    7  Commercial       7.80
   12  Vacuum           5.26
    5  Vacuum           5.44
    3  Vacuum           5.80
   10  Mixed Gas        7.41
    9  Mixed Gas        7.33
    2  Mixed Gas        7.04
    8                   3.51
    4                   2.91
   11                   3.66

The first column is a random permutation of the steak numbers 1 through 12, which yields a random assignment of each of the 12 steaks to a treatment group. The second column identifies the treatment ("Commercial", "Vacuum", "Mixed Gas", and a fourth treatment whose label is left blank in the table), and the last column is the response, which in this case is the logarithm of the bacterial count found on the surface of each steak.
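The random assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the original study: the function name assign_treatments, the seed argument, and the placeholder label "Treatment 4" (standing in for the treatment whose label is not given) are all hypothetical.

```python
import random


def assign_treatments(n_units, treatments, seed=None):
    """Randomly assign units 1..n_units to equally sized treatment groups.

    Returns a list of (unit, treatment) pairs, one per experimental unit.
    """
    if n_units % len(treatments) != 0:
        raise ValueError("n_units must be a multiple of the number of treatments")

    rng = random.Random(seed)           # seeded generator for reproducibility
    units = list(range(1, n_units + 1))
    rng.shuffle(units)                  # random permutation of the unit numbers

    group_size = n_units // len(treatments)
    assignment = []
    for i, treatment in enumerate(treatments):
        # Consecutive blocks of the permutation go to consecutive treatments.
        for unit in units[i * group_size:(i + 1) * group_size]:
            assignment.append((unit, treatment))
    return assignment


# "Treatment 4" is a placeholder, not a label taken from the data.
labels = ["Commercial", "Vacuum", "Mixed Gas", "Treatment 4"]
for unit, treatment in assign_treatments(12, labels, seed=2024):
    print(unit, treatment)
```

Each run with a different seed produces a different permutation, but every steak is always used exactly once and every treatment receives three steaks.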
The first column is kept primarily for bookkeeping and could be discarded when performing the statistical analysis. Note, however, that discarding it breaks the connection between individual steaks and the data. In medical studies, the unique identifier of a human subject is typically removed before the data are published, for privacy reasons and where the law requires it.
The table can be stored in a variety of file formats, facilitating interaction with statistical software:

  • CSV (comma-separated values)
  • Databases (MySQL, PostgreSQL, DB2, etc.)
  • Plain text files
  • Custom binary files
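As an illustration of the first option, the sketch below writes a few rows of the table to CSV and reads them back with Python's standard csv module. The column names steak, treatment and log_count, and the helper names write_csv and read_csv, are assumptions made for this example; an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# The first six rows of the table, one tuple per observation.
rows = [
    (1, "Commercial", 7.66),
    (6, "Commercial", 6.98),
    (7, "Commercial", 7.80),
    (12, "Vacuum", 5.26),
    (5, "Vacuum", 5.44),
    (3, "Vacuum", 5.80),
]


def write_csv(rows, stream):
    """Write a header line followed by one CSV line per observation."""
    writer = csv.writer(stream)
    writer.writerow(["steak", "treatment", "log_count"])
    writer.writerows(rows)


def read_csv(stream):
    """Read the observations back, converting text fields to numbers."""
    reader = csv.DictReader(stream)
    return [
        (int(r["steak"]), r["treatment"], float(r["log_count"]))
        for r in reader
    ]


buffer = io.StringIO(newline="")   # stands in for open("steaks.csv", newline="")
write_csv(rows, buffer)
buffer.seek(0)
restored = read_csv(buffer)
print(restored == rows)
```

Every value passes through a decimal text representation on the way out and back in, which is the conversion cost that matters for large data sets.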

The choice of format is not important for small data sets. For large data sets, a plain text file can be inefficient: numbers written out as decimal text generally take more storage than their binary representation, and every read or write requires a conversion between the decimal and binary formats. The same comment applies to CSV. Databases and binary files with fixed-width fields can take advantage of the efficient I/O operations of the hard drive and will typically occupy less storage. Access to the data will also be faster.
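To make the point about fixed-width fields concrete, here is a minimal sketch of a custom binary format built with Python's standard struct module. The record layout (a 4-byte integer, a 1-byte treatment code and an 8-byte double), the numeric treatment codes, and the helper names are hypothetical choices for this example, not a format prescribed by any statistical package.

```python
import io
import struct

# Hypothetical layout: little-endian, no padding.
#   i = 4-byte steak number, B = 1-byte treatment code, d = 8-byte log count
RECORD = struct.Struct("<iBd")

TREATMENT_CODES = {"Commercial": 0, "Vacuum": 1, "Mixed Gas": 2}
TREATMENT_NAMES = {code: name for name, code in TREATMENT_CODES.items()}

rows = [
    (1, "Commercial", 7.66),
    (6, "Commercial", 6.98),
    (7, "Commercial", 7.80),
    (12, "Vacuum", 5.26),
    (5, "Vacuum", 5.44),
    (3, "Vacuum", 5.80),
]


def write_records(rows, stream):
    """Append one fixed-width binary record per observation."""
    for steak, treatment, log_count in rows:
        stream.write(RECORD.pack(steak, TREATMENT_CODES[treatment], log_count))


def read_record(stream, index):
    """Jump directly to record number `index` (0-based) and decode it."""
    stream.seek(index * RECORD.size)   # offset is known without scanning
    steak, code, log_count = RECORD.unpack(stream.read(RECORD.size))
    return steak, TREATMENT_NAMES[code], log_count


data_file = io.BytesIO()               # stands in for a file opened in "wb+" mode
write_records(rows, data_file)
print(RECORD.size, read_record(data_file, 4))
```

Because every record has the same size, the position of any observation can be computed from its index, so a single record can be read without parsing the records before it. In a text or CSV file the lines have varying lengths, and finding the n-th observation generally requires reading everything that precedes it.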