Information about Midterm 1
General Information
- Midterm 1 will cover topics of Chapter 1 and Chapter 2.
- Sections 2.5 and 2.6 are excluded.
- Linear interpolation is not needed for the test.
- The number of questions is exactly 15.
- The questions are primarily multiple-choice.
- There may be a problem involving calculation of regression.
- Several questions require calculator use.
- Table A for normal distribution will be provided.
-
Some questions will require good judgment. For instance,
a statement that 165 is approximately 180 may be valid. If a variable
has mean 180 and standard deviation 20, then 165 is approximately
180. If the standard deviation is 2, it is not a good approximation.
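One way to make this judgment concrete is to measure the gap in units of the standard deviation. The sketch below assumes that "approximately" is read as "within about one standard deviation"; that reading and the helper name `sd_distance` are illustrative assumptions, not a rule stated for the test.

```python
def sd_distance(value, mean, sd):
    """Signed distance of `value` from `mean`, in standard deviations."""
    return (value - mean) / sd

# Same gap of 15 units, very different relative size.
wide = sd_distance(165, 180, 20)    # -0.75: less than one standard deviation away
narrow = sd_distance(165, 180, 2)   # -7.5: far outside the typical spread

for label, d in [("sd = 20", wide), ("sd = 2", narrow)]:
    verdict = "reasonable" if abs(d) <= 1 else "poor"   # assumed cutoff of 1 sd
    print(f"{label}: {d:+.2f} standard deviations, {verdict} approximation")
```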
List of Chapter 1 topics covered
- Quantitative vs. categorical variables.
- Stem plots. Please make sure you are familiar with the notion of splitting.
- Calculation of median, quartiles \(Q_1\) and \(Q_3\), range and \(IQR=Q_3-Q_1\).
-
Understand that there are differences in calculating quartiles; if your calculator
gives different values than the calculation method in the book, which is further
discussed on our Website, you must use the method discussed in the book
and dissected on this page.
- Bar Charts
- Applicable to one quantitative and one categorical variable.
- Bar chart vs. histogram
- Histograms
- One quantitative variable. Contrast with bar chart.
- Bins. Under- and over-summarized.
- Recognizing right/left skewed distributions.
- Recognizing outliers.
- Modality (unimodal/bimodal/multimodal).
- The mean
- What does it measure?
- Understanding the formula
\[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]
where the sample consists of the numbers \(x_1,x_2,\ldots,x_n\) and \(n\) is the sample size.
- Calculating for small samples.
- Estimation based on graphs.
- Standard deviation.
- What does it measure?
- Understanding the formula
\[s_x^2 = \frac{1}{df}\sum_{i=1}^n (x_i-\bar{x})^2\]
where the sample consists of the numbers \(x_1,x_2,\ldots,x_n\) and \(n\) is the sample size
and \(df = n-1\) is the number of degrees of freedom; in particular, do not
make the mistake of dividing by \(n\).
- Be aware that \(s_x^2\) is called sample variance.
- Calculation for tiny samples (1-4 elements).
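The mean and sample variance formulas above can be checked on a tiny sample. This is a minimal sketch using a made-up three-element sample; the values are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
import math
import statistics

sample = [2, 4, 9]                     # hypothetical tiny sample
n = len(sample)

mean = sum(sample) / n                 # (2 + 4 + 9) / 3 = 5.0
squared_devs = [(x - mean) ** 2 for x in sample]   # 9.0, 1.0, 16.0
df = n - 1                             # degrees of freedom, NOT n
variance = sum(squared_devs) / df      # 26 / 2 = 13.0
std_dev = math.sqrt(variance)          # square root of the sample variance

# Cross-check against the standard library, which also divides by n - 1.
print(mean, variance, std_dev)
print(statistics.variance(sample), statistics.stdev(sample))
```

Note that for a one-element sample \(df = 0\), so the formula cannot be evaluated.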
- \(IQR\)
- Estimating from data.
- Estimating from stemplots.
- Estimating from boxplots.
- Understanding boxplots, including representation of outliers
- \(1.5\cdot IQR\) rule; a data point that is not in the interval:
\[ \bigg[Q_1-1.5\cdot IQR, Q_3+1.5\cdot IQR \bigg] \]
is an outlier, where \(Q_1\) is the first quartile and \(Q_3\) is the
third quartile.
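The quartile, \(IQR\) and outlier calculations can be sketched as follows. The quartile convention used here (each quartile is the median of one half of the sorted data, leaving out the overall median when the sample size is odd) is an assumption about the book's method; if the book differs, follow the book. The data set is made up.

```python
def median(sorted_values):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(data):
    """Return (Q1, median, Q3) using the assumed median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # For odd n the overall median is excluded from both halves.
    upper = values[half + 1:] if n % 2 == 1 else values[half:]
    return median(lower), median(values), median(upper)

data = [1, 3, 4, 5, 7, 8, 9, 10, 30]   # hypothetical sample
q1, med, q3 = quartiles(data)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, med, q3, iqr)        # 3.5 7 9.5 6.0
print(low_fence, high_fence)   # -5.5 18.5
print(outliers)                # [30]
```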
- Uniform distribution.
- Density curve
\[ f(x) = \begin{cases} \frac{1}{b-a} & \text{$a\leq x\leq b$}\\
0 &\text{otherwise}
\end{cases}
\]
- Understanding that the area under the curve is 1
- Evaluating mean, median, quartile, range, \(IQR\)
- Estimating probability
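For the uniform distribution every quantity in the list above follows from the rectangle under the density curve. The sketch below uses hypothetical endpoints \(a=2\), \(b=10\); the function names are illustrative only.

```python
def uniform_summary(a, b):
    """Summary measures of the uniform distribution on [a, b]."""
    width = b - a
    return {
        "height": 1 / width,          # density value, so that area = 1
        "mean": (a + b) / 2,
        "median": (a + b) / 2,        # symmetric, so median equals mean
        "q1": a + 0.25 * width,
        "q3": a + 0.75 * width,
        "range": width,
        "iqr": 0.5 * width,
    }

def uniform_probability(a, b, lo, hi):
    """P(lo <= X <= hi): area of the rectangle over the part of [lo, hi] inside [a, b]."""
    lo, hi = max(lo, a), min(hi, b)
    if hi <= lo:
        return 0.0
    return (hi - lo) / (b - a)

summary = uniform_summary(2, 10)
print(summary)
print(uniform_probability(2, 10, 3, 5))    # 0.25
print(uniform_probability(2, 10, -1, 4))   # 0.25, only [2, 4] counts
```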
- Normal distribution
- Density curve.
- Parameters \(\mu\) and \(\sigma\), center, symmetry, spread.
- Familiarity with the form of the equation when given:
\[f(x)=\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]
Memorization is not necessary, but be able to distinguish it from the uniform distribution density curve.
-
Understanding the distinction between the mean \(\mu\) of the normal
distribution (a parameter of the normal density curve) and
a sample mean \[\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\] calculated from
a sample drawn from a normally distributed general population.
-
Understanding the distinction between the standard deviation \(\sigma\) of the normal
distribution (a parameter of the normal density curve) and
a sample standard deviation
\[s_x=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2}\] calculated from
a sample drawn from a normally distributed general population.
-
Understanding notation \(N(\mu,\sigma)\)
-
Understanding the standard normal distribution \(N(0,1)\) and Table A. Note that Table A
tabulates the area under the density curve: the value in the table for given \(z\) is:
\[F(z)=\frac{1}{\sqrt{2\pi}} \int_{-\infty}^z e^{-\frac{1}{2}x^2}\,dx\]
-
Understand that \(F(z)\) is an increasing function, \(\lim_{z\to\infty}F(z)=1\)
and \(\lim_{z\to-\infty}F(z)=0\).
-
Standardization and z-score.
-
68-95-99.7 rule.
-
Calculating probabilities based on Table A, both straight (given z, find p)
and inverse (given p, find z).
- Interpretation of questions in terms of inequalities (Z>z, z1<Z<z2).
- Identification of probability as area under the curve.
- Linear interpolation based on Table A (both straight and inverse).
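The standardization and Table A steps can be mirrored in code. In this sketch the function `table_a` is a hypothetical stand-in for a Table A lookup, computed from the error function instead of read from the printed table, so its values agree with the table only up to rounding. The numbers 180, 20 and 165 are reused from the General Information example.

```python
import math
from statistics import NormalDist

def table_a(z):
    """Area under the standard normal density curve to the left of z, i.e. F(z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 180, 20
z = (165 - mu) / sigma                  # standardization: z = -0.75

p_below = table_a(z)                    # P(Z < z), about 0.2266
p_above = 1 - p_below                   # P(Z > z), about 0.7734
within_1 = table_a(1) - table_a(-1)     # P(-1 < Z < 1), about 0.68
within_2 = table_a(2) - table_a(-2)     # about 0.95
within_3 = table_a(3) - table_a(-3)     # about 0.997

# Inverse lookup (given p, find z): which z has area 0.90 to its left?
z_for_90 = NormalDist().inv_cdf(0.90)   # about 1.28

print(round(p_below, 4), round(p_above, 4))
print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
print(round(z_for_90, 2))
```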
List of Chapter 2 topics covered
- Association and relationship.
- Form (linear, non-linear, no association).
- Strength.
- Direction.
- Plotting for 2 variables.
- Multiple box plots (when one variable is categorical).
- Scatter plots (when both variables are quantitative).
- Response vs. explanatory variables.
- Relationship vs. association.
- Causality.
- Being able to identify in examples.
- Scatter plots
- Identifying form from the plot (linear, non-linear, no association).
- Being able to draw for tiny samples (3 and 4 points).
- Determining association strength (weak, strong, very strong) from a scatter plot.
- Identifying outliers, understanding the difference vs. the single variable case.
- Correlation and correlation coefficient.
- Understand the formula (corrected)
\[ r = \frac{1}{n-1} \sum_{i=1}^n
\left(\frac{x_i-\bar{x}}{s_{x}}\right)
\left(\frac{y_i-\bar{y}}{s_{y}}\right)
\]
- Being able to calculate correlation coefficient by hand for 3 and 4 points.
- Knowledge of basic properties such as range (-1 to 1), lack
of units, symmetry with respect to swapping variables.
- The significance of high vs. low correlation.
- Correlation and direction of a relationship.
- Understanding the limitations of the correlation coefficient for non-linear relationships
(correlation coefficient does not capture non-linear relationships, may be zero
in the presence of a strong non-linear relationship).
- Understanding that correlation only applies to quantitative variables;
for example, there cannot be correlation between gender and life span of people, although it is
known that women live a few years longer than men.
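The correlation formula above can be followed step by step on three points. This is a minimal sketch with made-up data; it also illustrates the symmetry property and a case where \(r=0\) despite an exact non-linear relationship.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r: average product of z-scores, dividing by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

xs = [1, 2, 3]                 # hypothetical 3-point data set
ys = [2, 4, 9]
r = correlation(xs, ys)        # 7 / (2 * sqrt(13)), about 0.9707

r_swapped = correlation(ys, xs)              # swapping the variables leaves r unchanged
r_parabola = correlation([-1, 0, 1], [1, 0, 1])   # y = x^2 exactly, yet r = 0

print(round(r, 4), round(r_swapped, 4), round(r_parabola, 4))
```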
- Least squares, linear regression.
- Formulas for the slope \(b_1\) and intercept \(b_0\) of a regression line
\( y = b_0 + b_1\cdot x \):
\[ b_1 = r\,\frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1\,\bar{x} \]
-
Ability to calculate the regression line, the predicted values \[\hat{y}_i = b_0 + b_1\,x_i,\]
the residuals \(y_i-\hat{y}_i\), and \(ESS\), the sum of squared errors of prediction
(see the Wikipedia page for
variation in the naming of this quantity):
\[ ESS = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
-
The total sum of squares, that is, the sum of squared deviations of \(y\) from its mean \(\bar{y}\):
\[ TSS = \sum_{i=1}^n (y_i - \bar{y})^2 \]
-
Interpolation and extrapolation using regression line
-
Understanding predicted value; ability to calculate, given the regression line equation
-
Understanding residuals; ability to calculate, given the regression line equation
-
Ability to calculate fitted values and residuals by hand for 3-point datasets.
-
Understanding the interpretation of R-squared as the
percent of the variation of \(y\) in the vertical
direction explained by the variation of \(x\); calculation
of the coefficient of determination
\[ R^2 = 1-\frac{ESS}{TSS}\]
-
Knowing that \(R^2 = r^2\) for the two-variable linear
least squares regression (the only kind of regression we
have done so far; there will be other kinds of regression
when this formula does not hold!).
-
NOTE: The fraction
\[\frac{ESS}{TSS}\]
is interpreted as the unexplained portion of the variation of \(y\).
-
There is a third sum of squares, the regression sum of squares or \(RSS\):
\[ RSS = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \]
There is a famous equation, the partition of the sum of squares which states
\[ TSS = ESS + RSS \]
reminiscent of the Pythagorean Theorem. This is a mathematical theorem and it is always exact.
This implies that
\[ R^2 = \frac{RSS}{TSS} \]
-
See also the Wikipedia article
on the coefficient of determination. Unfortunately, there is an incompatibility of notations.
-
Here is a summary of various notations related to the sums of squares (there are many more!):
NOTE: SSM translates to "Sum of squares of error of model", where the "model" refers to
a linear model (regression line).
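The whole regression workflow for a 3-point data set can be sketched in a few lines. The sketch assumes the standard least-squares formulas \(b_1 = r\,s_y/s_x\) and \(b_0 = \bar{y} - b_1\bar{x}\), and it follows the naming used on this page (\(ESS\) for errors, \(RSS\) for regression); other sources may swap these names. The data set is made up.

```python
import math

xs = [1, 2, 3]                 # hypothetical 3-point data set
ys = [2, 4, 9]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

b1 = r * s_y / s_x             # slope, 3.5 for this data
b0 = y_bar - b1 * x_bar        # intercept, -2.0 for this data

fitted = [b0 + b1 * x for x in xs]                  # predicted values
residuals = [y - f for y, f in zip(ys, fitted)]     # observed minus predicted

ess = sum(e ** 2 for e in residuals)                # sum of squared errors, 1.5
tss = sum((y - y_bar) ** 2 for y in ys)             # total sum of squares, 26
rss = sum((f - y_bar) ** 2 for f in fitted)         # regression sum of squares, 24.5

r_squared = 1 - ess / tss                           # coefficient of determination
prediction = b0 + b1 * 2.5                          # interpolation at x = 2.5

print(round(b0, 4), round(b1, 4))
print([round(f, 4) for f in fitted], [round(e, 4) for e in residuals])
print(round(ess, 4), round(rss, 4), round(tss, 4), round(r_squared, 4))
```

For this data \(TSS = ESS + RSS\) reads \(26 = 1.5 + 24.5\), and \(R^2 = 24.5/26\) agrees with \(r^2\).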
Topics explicitly excluded
Section 2.5 (the two-way tables) will not be covered until later and is not included in Midterm 1.