Math 263, Section 001 and 003 - Excel/R Assignment 5
Last updated on December 3, 12:58PM.
In this assignment you will:
- Read data from a CSV file using R.
- Cross-tabulate the data using two factors.
- Conduct the \(\chi^2\)-test for various null hypotheses.
Software used
The statistical package R. Although RExcel may be used, it is best to use R without RExcel.
The data file
The data file is Dataset5.csv. This file uses a semicolon as the column separator! You may need to specify this when your spreadsheet program reads the file. You can view it with Excel or a text editor such as Notepad. There is an alternative version, Dataset5.txt, which can be viewed with a browser.
About the dataset
The dataset is a famous one from CMU, taken from a survey of kids about their personal goals. The sample consisted of 478 subjects. Please read the description of the dataset at the original website.
Sample script
There is a sample script included which, using the \(\chi^2\)-test, tries to answer the question of whether girls and boys have different personal goals. The script is here. You need to understand how the script works and you need to be able to run it with R. The basic procedure for running a script with R is this:
- Download the script and save it in a local file on your computer.
- Start R.
- Type in the R console:
source("script.R", echo = T)
You should replace "script.R" with the full path of your file on your computer, e.g.
C:\\tmp\\script.R
if you are running R on Windows and the file has been saved in the folder
C:\tmp
Note the double backslashes, necessary when you use R under Windows. Alternatively, you can use the File menu to change the working folder to the folder where the file resides. If you succeed in doing so, you do not need to change the command above.
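The procedure above can be tried out without the course script. The following is a minimal sketch in which a throwaway file in R's temporary folder stands in for script.R; the file's contents and the variable names are made up for illustration.

```r
# A throwaway script standing in for the downloaded script.R (hypothetical content).
script_path <- file.path(tempdir(), "script.R")
writeLines(c("x <- c(1, 2, 3)",
             "total <- sum(x)",
             "print(total)"),
           script_path)

# Show the full path in the form used by this operating system.
print(normalizePath(script_path))

# echo = TRUE prints each command before its output.
source(script_path, echo = TRUE)
```

On Windows, forward slashes in a path (as in "C:/tmp/script.R") are also accepted by R and avoid the need for doubled backslashes.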
Answer the following questions using \(\chi^2\)-test
Instructions for each question
Perform a sequence of \(\chi^2\) tests, with different data.
For each test, include the output of the command 'table', which creates the cross-tabulated data (2-way table). Alternatively, you may use the command 'xtabs', which uses the formula syntax, as in the examples below. The latter is a more powerful cross-tabulation command than the former.
For each test, include the output of the command 'chisq.test', which prints the \(\chi^2\)-statistic, degrees of freedom, and P-value.
For every question, formulate precisely the null and alternative hypotheses. State whether the null hypothesis is rejected, based on the test conducted. Use the 95% confidence level (that is, significance level \(\alpha = 0.05\)).
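The instructions above can be sketched on made-up data. The vectors below are invented for illustration and are not the survey data; the names tab1, tab2, res, alpha and reject_null are likewise just illustrative. The null hypothesis here is that the two factors are independent, and the alternative is that they are not.

```r
# Made-up responses for illustration only; not the survey data.
Gender <- c(rep("boy", 40), rep("girl", 60))
Goals  <- c(rep("Grades", 15), rep("Sports", 25),   # the 40 boys
            rep("Grades", 40), rep("Sports", 20))   # the 60 girls

tab1 <- table(Gender, Goals)      # cross-tabulation with 'table'
tab2 <- xtabs(~ Gender + Goals)   # the same table with the formula syntax
print(tab1)

res <- chisq.test(tab1)           # chi-squared test of independence
print(res)

alpha <- 0.05                     # corresponds to the 95% confidence level
reject_null <- res$p.value < alpha
print(reject_null)
```

The test result is an ordinary R list, so the statistic, degrees of freedom and P-value are also available as res$statistic, res$parameter and res$p.value.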
Important!
Sometimes R will print this message:
Warning message:
In chisq.test(table(Age, Money)) :
  Chi-squared approximation may be incorrect
This may mean that the \(\chi^2\)-test may not work for this data due to violations of the "rules of thumb" (R issues this warning when some expected cell count is below 5). In this case you can call chisq.test as follows:
chisq.test(table(Age, Money), simulate.p.value=T, B=100000)
In this case, R will modify the method by performing a Monte Carlo simulation to find the P-value. The parameter B sets the number of replicates (randomly generated tables) used in the Monte Carlo simulation. It defaults to 2000, but we may increase the accuracy of the simulation by raising this value. However, very large values will slow down the computation, so keep B below a million.
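The following sketch shows the simulated P-value on a small made-up table with low counts. The table, the seed, and the names sparse, expected and sim are assumptions chosen for illustration; B is kept at 10000 here only so that the example runs quickly.

```r
# Made-up sparse table (illustration only): the counts are too small
# for the chi-squared approximation to be trusted.
sparse <- matrix(c(3, 1, 2,
                   1, 4, 2,
                   2, 1, 5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Age = c("9", "10", "11"),
                                 Money = c("1", "2", "3")))

# Expected counts under independence; values below 5 trigger the warning.
expected <- suppressWarnings(chisq.test(sparse))$expected
print(round(expected, 2))

set.seed(263)  # makes the simulated P-value reproducible
sim <- chisq.test(sparse, simulate.p.value = TRUE, B = 10000)
print(sim)
```

With simulate.p.value = TRUE the printed degrees of freedom are NA, because the P-value no longer comes from the \(\chi^2\) distribution.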
The questions
- Is the opinion on the importance of money independent of gender?
- Is the opinion on the importance of money independent of school?
- Do girls and boys have the same attitude about the importance of good looks (being pretty/handsome)?
- Do the personal goals change with Grade (4, 5 or 6)?
- Are the personal goals of students the same in urban and rural schools?
Your own question?
There are 55 possible questions to be asked. I suggest you investigate a question or two that interests you about this dataset. Extra credit may be awarded for a creative question/discussion.
Calculate marginal distributions
Calculate the distribution of boys and girls conditioned upon their attitude towards money, that is, the proportions of boys and girls within each level of the Money variable. You should have 4 distributions, each consisting of two complementary probabilities.
In this part, you may follow the script presented in class: chi_alt.R
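As a rough sketch of this kind of calculation (not the class script chi_alt.R), the example below uses made-up data and assumes that Money takes four levels, consistent with the four distributions mentioned above. The names counts and cond are illustrative.

```r
# Made-up data for illustration; Money is assumed to take the values 1 to 4.
Gender <- rep(c("boy", "girl"), times = c(20, 20))
Money  <- c(rep(1:4, times = c(2, 4, 6, 8)),   # the 20 boys
            rep(1:4, times = c(3, 5, 5, 7)))   # the 20 girls

counts <- xtabs(~ Money + Gender)   # rows: Money level, columns: Gender
cond   <- prop.table(counts, 1)     # divide each row by its row total

print(round(cond, 2))   # one distribution of boys/girls per Money level
print(rowSums(cond))    # each row adds up to 1
```

Putting Money first in the formula makes it the row variable, so normalizing by rows gives one two-probability distribution per level of Money.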
An example - using 'xtabs' and 'chisq.test' together
A quick way to combine the cross-tabulation of two (of the 11) variables with the test is presented below. Note that there are \[ {11 \choose 2} = 55 \] distinct ways to build a two-way table out of 11 variables by cross-tabulation. For example, let's decide whether the goals of students depend on the school. We do the following:
> chisq.test(print(xtabs(~ School + Goals)))
                      Goals
School                 Grades Popular Sports
  Brentwood Elementary     40      17     10
  Brentwood Middle         47      25     12
  Brown Middle             22      14     16
  Elm                       4      11      6
  Main                     45      12     11
  Portage                  30      20     11
  Ridge                    16      18     14
  Sand                     15       7      6
  Westdale Middle          28      17      4

        Pearson's Chi-squared test

data:  print(xtabs(~School + Goals))
X-squared = 34.5069, df = 16, p-value = 0.00464

Warning message:
In chisq.test(print(xtabs(~School + Goals))) :
  Chi-squared approximation may be incorrect
Note that we inserted 'print' between 'xtabs' (the cross-tabulation function of R) and 'chisq.test' (the \(\chi^2\) testing function of R). This has the effect of printing the cross-tabulated data.
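As a check on the output above, the degrees of freedom follow from the dimensions of the table. With \(r = 9\) schools and \(c = 3\) goals, \[ \mathrm{df} = (r-1)(c-1) = (9-1)(3-1) = 16, \] which matches the reported df = 16. Since the reported P-value 0.00464 is below 0.05, the null hypothesis that Goals and School are independent would be rejected at the 95% confidence level, subject to the warning about the approximation.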
An example - computing a marginal distribution
This is how we can compute the marginal distribution of gender in each school:
> round(prop.table(xtabs(~ School + Gender),1),2)
                      Gender
School                  boy girl
  Brentwood Elementary 0.63 0.37
  Brentwood Middle     0.56 0.44
  Brown Middle         0.50 0.50
  Elm                  0.24 0.76
  Main                 0.46 0.54
  Portage              0.43 0.57
  Ridge                0.48 0.52
  Sand                 0.43 0.57
  Westdale Middle      0.31 0.69
Here, we first cross-tabulate with 'xtabs', then we compute the proportions with 'prop.table' (the second argument, 1, divides each row by its total, so every row sums to 1), and finally we round the result to two decimal places with 'round', all in one line of code!
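The second argument of 'prop.table' decides which totals are used for the division. The sketch below contrasts the choices on a small made-up table; the two schools "A" and "B" and their counts are hypothetical.

```r
# Made-up counts for illustration only (two hypothetical schools).
counts <- matrix(c(12,  8,
                    5, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(School = c("A", "B"),
                                 Gender = c("boy", "girl")))

by_row  <- prop.table(counts, 1)  # each row sums to 1: gender within each school
by_col  <- prop.table(counts, 2)  # each column sums to 1: school within each gender
overall <- prop.table(counts)     # all cells sum to 1: joint proportions

print(round(by_row, 2))
print(round(by_col, 2))
print(round(overall, 2))
```

Which version is appropriate depends on the question being asked, so it helps to check the row or column sums before interpreting the numbers.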
Troubleshooting
Below I address some issues that were reported to me.
The "/" in the header of Dataset5.txt causes an error
I cannot confirm that this is an issue, but if you suspect it is, edit Dataset5.txt after downloading and replace "/" in "Urban/Rural" with a period. I am able to do this:
> x <- read.table("Dataset5.txt", header=T)
> names(x)
 [1] "Gender"      "Grade"       "Age"         "Race"        "Urban.Rural"
 [6] "School"      "Goals"       "Grades"      "Sports"      "Looks"
[11] "Money"
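One likely reason the "/" is harmless: by default, 'read.table' and 'read.csv' adjust column names so that they are valid R names (the check.names = TRUE setting), which replaces a character such as "/" with a period. The sketch below demonstrates this on a made-up three-column file written to a temporary location; it also shows the sep = ";" argument needed for a semicolon-separated file. The file contents and the names fixed and raw are assumptions for illustration.

```r
# Hypothetical three-column miniature of the data file; the real file has 11 columns.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Gender;Urban/Rural;Money",
             "boy;Rural;2",
             "girl;Urban;4"),
           tmp)

# sep = ";" tells read.csv that columns are separated by semicolons.
# With the default check.names = TRUE, "Urban/Rural" becomes "Urban.Rural".
fixed <- read.csv(tmp, sep = ";", header = TRUE)
print(names(fixed))

# check.names = FALSE keeps the header exactly as written in the file.
raw <- read.csv(tmp, sep = ";", header = TRUE, check.names = FALSE)
print(names(raw))
```

A column whose name contains "/" can still be used, but it has to be written with backticks, which is why the default renaming is usually more convenient.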
Working directory
If you have a problem reading data files or sourcing scripts, your working directory is probably not set to the location of the data/script files. You can use the command 'getwd' to find out the current working directory, and 'setwd' to set it. Here is an example:
> getwd()
[1] "/home/marek/public_html/math263/ExcelAssignments/Assignment5"
> setwd("/home/marek/Desktop")
> getwd()
[1] "/home/marek/Desktop"
This is done on Linux and would be similar on Mac. On Windows, the location of the files would look like this (depending on your Windows version/configuration):
C:/Documents And Settings/Marek/Desktop
or
C:/Users/Marek/Desktop
However, to avoid these typically long folder names, you should use the R console menu ('File/Change dir' on Windows and 'Misc/Change Working Directory' on Mac OS X).
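A small sketch of switching the working directory from a script and then restoring it is given below. Here R's temporary folder stands in for the folder that holds your files, and the name old_dir is illustrative.

```r
old_dir <- getwd()       # remember the current working directory

setwd(tempdir())         # tempdir() stands in for the folder with your files
print(getwd())

# List the CSV files visible from the new working directory, if any.
print(list.files(pattern = "\\.csv$"))

setwd(old_dir)           # return to where we started
```

Listing the files this way is a quick check that the data file is where R is looking before calling 'read.table' or 'source'.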