Math 263, Section 001 and 003 - Excel/R Assignment 5

Last updated on December 3, 12:58PM.

Excel/R Assignment 5

In this assignment you will:

Software used

The statistical package R. Although RExcel may be used, the recommended approach is to use R directly, without RExcel.

The data file

The data file is Dataset5.csv. This file uses a semicolon as the column separator! You may need to specify this when your spreadsheet program reads the file. You can view it with Excel or with a text editor such as Notepad. There is an alternative version, Dataset5.txt, which can be viewed with a browser.
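If you read the file directly into R rather than into a spreadsheet, the separator can be given explicitly. The following is a minimal sketch: it writes a tiny semicolon-separated stand-in file so that it runs on its own (the two data rows are placeholders, not values from Dataset5.csv). With the real file you would pass "Dataset5.csv" instead of the temporary path.

```r
# Stand-in for Dataset5.csv: a tiny semicolon-separated file.
# The rows below are placeholders, NOT taken from the real dataset.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Gender;Goals",
             "boy;Sports",
             "girl;Grades"), tmp)

# sep = ";" tells R that the columns are separated by semicolons.
x <- read.csv(tmp, sep = ";")
str(x)

# Without sep = ";" every line is read as one single column.
y <- read.csv(tmp)
ncol(y)
```

The function read.csv2 also assumes a semicolon separator, but it additionally treats the comma as the decimal mark, so giving sep explicitly is the more cautious choice here.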

About the dataset

The dataset is a famous dataset from CMU of a survey of kids on their personal goals. The sample consisted of 478 subjects. Please read the description of the dataset at the original website.

Sample script

A sample script is included which, using the \(\chi^2\)-test, tries to answer the question of whether girls and boys have different personal goals. The script is here. You need to understand how the script works and be able to run it with R. The basic procedure for running a script with R is this:

Answer the following questions using the \(\chi^2\)-test

Instructions for each question

Perform a sequence of \(\chi^2\) tests, with different data.

For each test, include the output of the command

table
which creates the cross-tabulated data (2-way table). Alternatively, you may use the command
xtabs
which uses the formula syntax, as in the examples below. The latter is a more powerful cross-tabulation command than the former.

For each test, include the output of the command

chisq.test
which prints the \(\chi^2\)-statistic, degrees of freedom, and P-value.

For every question, state the null and alternative hypotheses precisely. State whether the null hypothesis is rejected, based on the test conducted. Use the 95% confidence level, that is, reject the null hypothesis when the P-value is below 0.05.
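The steps above can be sketched as follows. To stay self-contained, the sketch rebuilds one observation per student from three rows of the School by Goals counts printed in the 'xtabs' example on this page. With the real data you would use the columns read from the data file instead. The names counts, long, obs, tab1, tab2, res and alpha are illustrative choices.

```r
# Counts copied from the School x Goals example on this page
# (three schools only, to keep the sketch short).
counts <- matrix(c(40, 17, 10,
                   47, 25, 12,
                   22, 14, 16),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(School = c("Brentwood Elementary",
                                            "Brentwood Middle",
                                            "Brown Middle"),
                                 Goals  = c("Grades", "Popular", "Sports")))

# Expand to one row per student, the shape of the real data file.
long <- as.data.frame(as.table(counts))
obs  <- long[rep(seq_len(nrow(long)), long$Freq), c("School", "Goals")]

# Two equivalent ways to cross-tabulate.
tab1 <- table(obs$School, obs$Goals)
tab2 <- xtabs(~ School + Goals, data = obs)
print(tab2)

# H0: Goals is independent of School; Ha: Goals depends on School.
res <- chisq.test(tab2)
print(res)

# Decision at the 5% significance level.
alpha <- 0.05
if (res$p.value < alpha) {
  cat("Reject H0 at the 5% level\n")
} else {
  cat("Do not reject H0 at the 5% level\n")
}
```

Passing data = obs to 'xtabs' avoids having to attach the data frame first, which is one reason the formula interface is convenient.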

Important!

Sometimes R will print this message:
	Warning message:
	In chisq.test(table(Age, Money)) :
	Chi-squared approximation may be incorrect
      
This means that the \(\chi^2\) approximation may be unreliable for this data because the usual "rules of thumb" are violated: R prints this warning when at least one expected cell count is below 5. In this case you can call chisq.test as follows:
	chisq.test(table(Age, Money), simulate.p.value=TRUE, B=100000)
      
In this case, R computes the P-value by Monte Carlo simulation instead of relying on the \(\chi^2\) approximation. The parameter B is the number of random tables generated in the simulation. It defaults to 2000; raising it makes the simulated P-value more accurate, but very large values slow down the computation, so keep B below a million.
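To see where the warning comes from and what the simulated P-value looks like, the sketch below uses the School by Goals counts from the 'xtabs' example on this page. The names tab, asym and sim are illustrative, and B is kept small here only so that the sketch runs quickly.

```r
# School x Goals counts copied from the example on this page.
tab <- matrix(c(40, 17, 10,  47, 25, 12,  22, 14, 16,
                 4, 11,  6,  45, 12, 11,  30, 20, 11,
                16, 18, 14,  15,  7,  6,  28, 17,  4),
              ncol = 3, byrow = TRUE,
              dimnames = list(School = c("Brentwood Elementary",
                                         "Brentwood Middle", "Brown Middle",
                                         "Elm", "Main", "Portage", "Ridge",
                                         "Sand", "Westdale Middle"),
                              Goals  = c("Grades", "Popular", "Sports")))

# Expected counts under independence; the warning is triggered
# when any of them is below 5.
asym <- suppressWarnings(chisq.test(tab))
round(asym$expected, 1)
which(asym$expected < 5, arr.ind = TRUE)

# Monte Carlo P-value; set.seed makes the run reproducible.
set.seed(263)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
sim$p.value
```

With simulation the reported degrees of freedom are NA, because the P-value no longer comes from a \(\chi^2\) distribution.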

The questions

Your own question?

There are 55 possible questions to be asked, one for each pair of variables. I suggest you investigate a question or two that interests you about this dataset. Extra credit may be awarded for a creative question or discussion.
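The count of 55 can be checked, and the pairs listed, with a short sketch. The variable names are the ones printed by names(x) in the Troubleshooting section; vars, pairs and formulas are illustrative names.

```r
# Column names as printed by names(x) in the Troubleshooting section.
vars <- c("Gender", "Grade", "Age", "Race", "Urban.Rural", "School",
          "Goals", "Grades", "Sports", "Looks", "Money")

# Number of unordered pairs of 11 variables.
choose(length(vars), 2)

# Every pair of variables, one pair per column.
pairs <- combn(vars, 2)
ncol(pairs)

# Turn each pair into the one-sided formula text that xtabs expects.
formulas <- apply(pairs, 2, function(p) paste("~", p[1], "+", p[2]))
head(formulas, 3)
```

A formula string from this list could then be used as xtabs(as.formula(formulas[1]), data = x), assuming x is the data frame read from the data file.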

Calculate marginal distributions

Calculate the distribution of gender (boys and girls) conditional on attitude towards money; strictly speaking, these are conditional rather than marginal distributions. You should have 4 distributions, each consisting of two complementary probabilities.

In this part, you may follow the script presented in class: chi_alt.R

An example - using 'xtabs' and 'chisq.test' together

A quick way to combine cross-tabulation of two of the 11 variables with the \(\chi^2\)-test is presented below. Note that there are \[ {11 \choose 2} = 55 \] distinct ways to build a two-way table out of 11 variables by cross-tabulation. For example, let's decide whether the goals of students depend on the school. We do the following:
      > chisq.test(print(xtabs(~ School + Goals)))
                            Goals
      School                 Grades Popular Sports
        Brentwood Elementary     40      17     10
        Brentwood Middle         47      25     12
        Brown Middle             22      14     16
        Elm                       4      11      6
        Main                     45      12     11
        Portage                  30      20     11
        Ridge                    16      18     14
        Sand                     15       7      6
        Westdale Middle          28      17      4

      Pearson's Chi-squared test

      data:  print(xtabs(~School + Goals)) 
      X-squared = 34.5069, df = 16, p-value = 0.00464

      Warning message:
      In chisq.test(print(xtabs(~School + Goals))) :
      Chi-squared approximation may be incorrect
    
Note that we inserted 'print' between 'xtabs' (cross-tabulation function of R) and 'chisq.test' (the \(\chi^2\) testing function of R). This has the effect of printing the cross-tabulated data.

An example - computing a marginal distribution

This is how we can compute the distribution of gender within each school (strictly speaking, the conditional distribution of gender given school):
      > round(prop.table(xtabs(~ School + Gender),1),2)
                            Gender
      School                  boy girl
        Brentwood Elementary 0.63 0.37
        Brentwood Middle     0.56 0.44
        Brown Middle         0.50 0.50
        Elm                  0.24 0.76
        Main                 0.46 0.54
        Portage              0.43 0.57
        Ridge                0.48 0.52
        Sand                 0.43 0.57
        Westdale Middle      0.31 0.69
    
Here, we first cross-tabulate with 'xtabs', then convert the counts in each row to proportions with 'prop.table' (the second argument, 1, makes each row sum to 1), and finally round the result to two decimal places with 'round', all in one line of code!
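The second argument of 'prop.table' decides which totals the proportions are taken over. Since the Gender counts themselves are not printed on this page, the sketch below uses three rows of the School by Goals counts from the previous example instead. The names tab, by_row, by_col and overall are illustrative.

```r
# Three rows of the School x Goals counts from the example on this page.
tab <- matrix(c(40, 17, 10,
                47, 25, 12,
                22, 14, 16),
              nrow = 3, byrow = TRUE,
              dimnames = list(School = c("Brentwood Elementary",
                                         "Brentwood Middle",
                                         "Brown Middle"),
                              Goals  = c("Grades", "Popular", "Sports")))

by_row  <- prop.table(tab, 1)  # proportions within each row (school)
by_col  <- prop.table(tab, 2)  # proportions within each column (goal)
overall <- prop.table(tab)     # proportions of the grand total

round(by_row, 2)
rowSums(by_row)   # every row sums to 1
colSums(by_col)   # every column sums to 1
```

Choosing 1 answers "within a school, how are the goals distributed?", while 2 answers "among students with a given goal, how are the schools distributed?".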

Troubleshooting

Below I address some issues that were reported to me.

The "/" in the header of Dataset5.txt causes an error

I cannot confirm that this is an issue, but if you suspect it is, edit Dataset5.txt after downloading and replace "/" in "Urban/Rural" with a period. I am able to do this:
> x <- read.table("Dataset5.txt", header=T)
> names(x)
 [1] "Gender"      "Grade"       "Age"         "Race"        "Urban.Rural"
 [6] "School"      "Goals"       "Grades"      "Sports"      "Looks"      
[11] "Money"      
> 
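If the header contains "Urban/Rural", 'read.table' by default passes the column names through 'make.names', which replaces characters that are not allowed in R variable names with a period. That would give the "Urban.Rural" seen in the transcript above. A minimal sketch follows; the single data row is a placeholder, not a row of the dataset.

```r
# "/" is not valid in an R variable name, so it becomes ".".
make.names("Urban/Rural")

# Placeholder two-column input with the problematic header.
txt <- "Gender Urban/Rural\nboy Rural"
x <- read.table(text = txt, header = TRUE)
names(x)

# check.names = FALSE keeps the header exactly as written.
y <- read.table(text = txt, header = TRUE, check.names = FALSE)
names(y)
```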
    

Working directory

If you have a problem reading data files or sourcing scripts, your working directory is probably not set to the location of the data/script files. You can use the command 'getwd' to find out the current working directory, and 'setwd' to set it. Here is an example:
> getwd()
[1] "/home/marek/public_html/math263/ExcelAssignments/Assignment5"
> setwd("/home/marek/Desktop")
> getwd()
[1] "/home/marek/Desktop"
> 
    
This is done on Linux and would be similar on Mac. On Windows, the location of the files would look like this (depending on your Windows version/configuration):
C:/Documents And Settings/Marek/Desktop
    
or
C:/Users/Marek/Desktop
    
However, to avoid these typically long names of folders, you should use the R console menu ('File/Change dir' on Windows and 'Misc/Change Working Directory' on Mac OS X).
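Whichever way you change the directory, you can confirm from within R that the file is visible before trying to read it. The sketch below is only an illustration: tempdir() stands in for the folder holding your assignment files, and the file it creates is a header-only placeholder, not the real Dataset5.csv. The names old and found are illustrative.

```r
old <- getwd()                 # remember where we started

# tempdir() stands in for the folder holding the assignment files.
setwd(tempdir())
writeLines("Gender;Goals", "Dataset5.csv")   # placeholder file, header only

# TRUE when R can see the file from the current working directory.
found <- file.exists("Dataset5.csv")
found
list.files(pattern = "\\.csv$")

unlink("Dataset5.csv")         # remove the placeholder
setwd(old)                     # go back to the original directory
```

If file.exists returns FALSE for your data file, the working directory is not the folder that contains it.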