Reading Tables into R

In school, we work mainly with tiny datasets that we can type in by hand if all else fails. In practice, we generally start with raw data in a computer file, and getting those data into a form our software can read is often a major undertaking. We will go through that process once right now.

R can read data from a text file. The file must be in the form of a table, with each column representing a variable and all columns the same length. By default, missing data must be marked "NA". Optionally, the first row of the file may contain names for the variables. You can access such a file named heartatk4R.txt. Download this file and save it in R's working directory (the folder R reports when you type getwd()). The file looks like this.

Patient	DIAGNOSIS	SEX	DRG	DIED	CHARGES	LOS	AGE
1	41041	F	122	0	4752	0010	079
2	41041	F	122	0	3941	0006	034
3	41091	F	122	0	3657	0005	076
4	41081	F	122	0	1481	0002	080
5	41091	M	122	0	1681	0001	055
6	41091	M	121	0	6378.6400	0009	084
7	41091	F	121	0	10958.520	0015	084
8	41091	F	121	0	16583.930	0015	070
9	41041	M	121	0	4015.3300	0002	076
10	41041	F	123	1	1989.4400	0001	065
11	41041	F	121	0	7471.6300	0006	052
12	41091	M	121	0	3930.6300	0005	072
13	41091	F	122	0	NA	0009	083
14	41091	F	122	0	4433.9300	0004	061
15	41041	M	122	0	3318.2100	0002	053
16	41041	M	122	0	4863.8300	0005	077
17	41041	M	121	0	5000.6400	0003	053

Above are only the first 17 cases (of 12,844). To use these data in R, you must assign the contents of the file to a variable.

> heartatk = read.table("heartatk4R.txt",header=TRUE)

The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. (These should not include spaces.) You can now get a table of contents for what you have created in R with

> objects() 

This should return heartatk along with any other variables you may have created. You will not see on this list any of the variables that are inside of heartatk because they are hiding. To see them, type

> names(heartatk)
[1] "Patient"   "DIAGNOSIS" "SEX"       "DRG"       "DIED"      "CHARGES"  
[7] "LOS"       "AGE"   
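
If you want to try the file format rules without the full download, here is a minimal sketch. It assumes a made-up four-row miniature of the file (the name sample_text and the object mini are ours, not part of the original data), with tabs between columns and NA for the missing charge. The text= argument asks read.table to parse a string instead of a file on disk.

```r
# Hypothetical miniature of the file: four rows modeled on the listing,
# with tabs between columns and NA marking a missing charge.
sample_text <- paste(
  "Patient\tDIAGNOSIS\tSEX\tDRG\tDIED\tCHARGES\tLOS\tAGE",
  "1\t41041\tF\t122\t0\t4752\t0010\t079",
  "6\t41091\tM\t121\t0\t6378.6400\t0009\t084",
  "10\t41041\tF\t123\t1\t1989.4400\t0001\t065",
  "13\t41091\tF\t122\t0\tNA\t0009\t083",
  sep = "\n"
)

# text= lets read.table parse a string instead of a file on disk.
mini <- read.table(text = sample_text, header = TRUE)

names(mini)           # the eight variable names from the header row
str(mini)             # one line per column, with its type
is.na(mini$CHARGES)   # TRUE only for the row whose charge is NA
```

In the parsed result the leading zeros are gone (a LOS of 0010 is stored as the number 10), and the columns are the same kind of variables that names() lists for heartatk.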

To bring them out of hiding, you must "attach" them to your R workspace. (This avoids conflicts if several tables include variables with the same name. Attach just one at a time.)

> attach(heartatk)
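
Attaching is convenient, but it is not the only way to reach the variables inside a table. The sketch below shows two common alternatives, the $ operator and with(), plus detach() for undoing an attach. It uses a small made-up table (demo is a hypothetical stand-in, not the real heartatk data) so that it runs on its own.

```r
# Hypothetical stand-in for heartatk with three of its variables.
demo <- data.frame(
  SEX  = c("F", "F", "M", "M", "F"),
  DIED = c(0, 0, 0, 0, 1),
  AGE  = c(79, 34, 55, 84, 65)
)

# $ pulls one variable out of the table by name.
table(demo$SEX)

# with() evaluates an expression using the table's variables,
# without attaching anything to the workspace.
with(demo, mean(AGE))

# If a table has been attached, detach() removes it again.
attach(demo)
mean(AGE)
detach(demo)
```

Either alternative sidesteps the name conflicts mentioned above, because the table is named every time a variable is used.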

These data came from an ActivStats CD which provided this background information:

Heart Attack Patients

This set of data is all of the hospital discharges in New York State with an admitting diagnosis 
of an Acute Myocardial Infarction (AMI), also called a heart attack, who did not have surgery, 
in the year 1993. There are 12,844 cases.

AGE gives age in years.

SEX is coded M for males and F for females.

DIAGNOSIS is in the form of an International Classification of Diseases, 9th Edition, Clinical 
Modification code. These tell which part of the heart was affected.

DRG is the Diagnosis Related Group. It groups together patients with similar management. 
In this data set there are just three different DRGs:

121 for AMIs with cardiovascular complications who did not die.
122 for AMIs without cardiovascular complications who did not die.
123 for AMIs where the patient died.

LOS gives the hospital length of stay in days.

DIED has a 1 for patients who died in hospital and a 0 otherwise.

CHARGES gives the total hospital charges in dollars.

Data provided by Health Process Management of Doylestown, PA.

After you attach the data table, you can work with its internal variables, provided you remember that R is case-sensitive.

> table(sex)
Error in table(sex) : object "sex" not found
> table(SEX)
SEX
   F    M 
5065 7779
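
The same ideas carry over to the other variables. As a minimal sketch, assume a made-up five-row table (patients is a hypothetical stand-in with values modeled on the listing, not the real data). It checks that names must match exactly, cross-tabulates DRG against DIED, and shows that a single NA makes a mean come out NA unless you ask R to remove it with na.rm = TRUE.

```r
# Hypothetical five-row stand-in for heartatk; values modeled on the listing.
patients <- data.frame(
  SEX     = c("F", "M", "M", "F", "F"),
  DRG     = c(122, 122, 121, 123, 122),
  DIED    = c(0, 0, 0, 1, 0),
  CHARGES = c(4752, 1681, 6378.64, 1989.44, NA)
)

# Names must match exactly: SEX exists, sex does not.
"SEX" %in% names(patients)
"sex" %in% names(patients)

# Cross-tabulate DRG against DIED.
with(patients, table(DRG, DIED))

# A single NA makes the mean NA unless it is removed first.
mean(patients$CHARGES)
mean(patients$CHARGES, na.rm = TRUE)
```

With the real table, the same calls would use heartatk in place of patients.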

© 2006 Robert W. Hayden. Data Desk is a registered trademark of Data Description.