Information about Midterm 2
Last minute tips
-
Be able to set up sample spaces with pairs and tuples.
-
Be familiar with a deck of playing cards.
-
Know how to use Bayes' formula in the context of medical
testing, such as the breast cancer screening example on the
slides.
-
Be able to calculate probabilities using density curves
such as the normal curve and the "triangular" curve.
-
The formula for the correlation of two random variables
(\(\rho_{X,Y}\) below) will not be required on the test.
-
The number of questions is exactly 18.
Calculations are quick and easy.
General Information
-
Midterm 2 is an in-class test.
-
Midterm 2 will cover all topics of Chapters 3 and 4 and
Section 2.6.
-
The number of questions is exactly 18.
-
The questions are primarily multiple-choice.
-
Several questions require calculator use.
-
Table A (normal distribution), Table B (random digits) and Table
C (binomial distribution) will be provided if needed.
-
Please review old tests (2010 Midterm 2 and a portion of
2010 Midterm 3 in this folder) for a
sample of problems that may be similar to some test
questions.
-
You may find these notes
on the three-set Venn diagram and 2-way tables useful.
List of Chapter 3 topics covered
-
The three principles of experimental design.
-
Observational vs. experimental studies.
-
Identification of experimental units.
-
Identification of population.
-
Sampling techniques using a table of random digits.
-
Basic experimental designs:
- Block (=Stratified)
- Matched pair
- Multi-stage
-
Lurking variables (including information in Section 2.6).
Definition, identification, when to watch out for.
-
Confounding (including information in Section 2.6).
Definition, identification, when to watch out for.
-
Bias and variability. Differentiating between the two.
-
Controlling bias. Controlling by randomization.
-
Problems when using anecdotal evidence.
-
Problems when using polling.
List of Chapter 4 topics covered
Probability Models
-
Know the meaning of outcomes.
-
Be familiar with basic set theory: elements, sets, curly
brace notation, pairs, tuples, union, intersection,
complement.
-
Know the meaning of sample spaces and events.
-
Know the difference between outcomes and elementary events.
-
Be able to identify and construct sample spaces.
Be able to describe sample spaces using set notation.
-
Be able to use union, intersection and complement to
describe events described in plain English, using
connectives such as "or", "and" and "not".
-
Be familiar with standard examples used in class such as
multiple coin tosses, die tosses, free throws in
basketball, picking M&M candy out of a jar (with and without
replacement), tosses of a bottle cap, etc.
-
Be able to perform calculations of probabilities of events,
based on laws of probability and set notation
(union, intersection, complement).
-
Know the addition rule for disjoint events and its generalization,
the Inclusion-Exclusion Principle for 2 and 3 events:
\[ P(A\cup B) = P(A) + P(B) \quad\text{if $A\cap B=\emptyset$} \]
\[ P(A\cup B) = P(A) + P(B) - P(A\cap B) \quad\text{(always)} \]
\[ P(A\cup B \cup C) = P(A) + P(B) + P(C) - P(A\cap B) - P(A\cap C) - P(B\cap C) +
P(A\cap B \cap C) \]
-
Know the meaning of independence of events. Be able to
apply the Multiplication Rule for independent events.
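These rules can be checked by brute-force counting on a small sample space. Here is a short Python sketch (an illustration only; the events are arbitrarily chosen) that builds the sample space of two die tosses and verifies the Inclusion-Exclusion Principle for two events and the Multiplication Rule for two independent events.
```python
from fractions import Fraction

# Sample space: all ordered pairs (first die, second die).
S = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

def P(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event), len(S))

# Two illustrative events.
A = {s for s in S if s[0] % 2 == 0}      # first die is even
B = {s for s in S if s[0] + s[1] >= 8}   # sum is at least 8

# Inclusion-Exclusion for two events: P(A or B) = P(A) + P(B) - P(A and B).
assert P(A | B) == P(A) + P(B) - P(A & B)

# Multiplication Rule for independent events:
# C depends only on the first die, D only on the second, so they are independent.
C = {s for s in S if s[0] == 6}          # first die shows 6
D = {s for s in S if s[1] == 6}          # second die shows 6
assert P(C & D) == P(C) * P(D)           # 1/36 = 1/6 * 1/6

print(P(A | B), P(C & D))
```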
Random variables
-
Understand the following definition:
A random variable is a function
on the sample space, with numerical values.
Using mathematical notation, let \(S\) be a sample
space. Any function \(X:S\to \mathbb{R}\) is a random
variable (on the sample space \(S\)).
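To make this definition concrete, here is a minimal Python sketch (an illustration, not from the course materials) that represents a random variable literally as a function on a sample space: two coin tosses and the number of heads.
```python
# Sample space of two coin tosses, written as pairs of outcomes.
S = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def X(s):
    """A random variable: a function that assigns a number to each outcome."""
    return s.count("H")   # number of heads in the outcome

# The set of values X(S) -- the image of the whole sample space.
values = {X(s) for s in S}
print(values)   # {0, 1, 2}
```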
-
Understand the terminology of functions. Thus, if \(X:S\to U\)
then \(S\) is called the domain of \(X\) and
\(U\) is called the range of \(X\). For random variables
the range is therefore a subset of \(\mathbb{R}\),
the real numbers. Note:
\[\mathbb{R}=(-\infty,\infty)\]
\[\mathbb{R}= ]-\infty,\infty[ \]
using two different conventions about denoting intervals.
-
The set of values of a random variable is the set
of numbers \(X(s)\) where \(s\) is an outcome (an element of \(S\)).
It is denoted \(X(S)\) ("X of the entire sample space").
-
Know the definition of a discrete random variable:
A function \(X:S\to\mathbb{R}\) is a discrete random variable
if the set of values of \(X\) is either finite
or countably infinite. Know that the definition
in the book is incorrect, disallowing an infinite set of values.
An example of a discrete random variable with an infinite
set of values is the number of tosses of a coin before you
see the first head. Thus,
-
If in the first toss you get \(Head\), \(X=0\).
-
If the first toss is a \(Tail\) but you get \(Head\) on the second toss, \(X=1\).
-
Thus, \(X\) may be \(0, 1, 2, \ldots\).
-
It may be shown (using, for instance, tree-based calculations), that
\[ P(X=k) = \frac{1}{2^{k+1}} \]
for \(k=0,1,2,\ldots\).
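The formula \(P(X=k)=\frac{1}{2^{k+1}}\) can also be checked by simulation. The following Python sketch (with an arbitrarily chosen number of repetitions) counts the tosses before the first head and compares the observed frequencies with the formula.
```python
import random
from collections import Counter

random.seed(0)                 # for reproducibility
trials = 100_000
counts = Counter()

for _ in range(trials):
    k = 0                          # number of tails seen before the first head
    while random.random() < 0.5:   # "tail" with probability 1/2
        k += 1
    counts[k] += 1

for k in range(6):
    observed = counts[k] / trials
    expected = 1 / 2 ** (k + 1)    # P(X = k) = 1 / 2^(k+1)
    print(f"k={k}: observed {observed:.4f}, formula {expected:.4f}")
```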
-
Know the definition of the probability function for discrete
random variables. The probability function for a random variable
assuming values \(x_1,x_2,\ldots,x_n\) with probabilities
\(p_1,p_2,\ldots,p_n\) is:
\[ f(x_i) = p_i \]
The probability table of \(X\) is simply a table
that lists the values of this function. Since the set of values may
be infinite (but countable), the table may be infinite, and
it may be necessary to give a formula rather than listing
values, as in the previous example.
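For practice with probability tables, here is a Python sketch (using the sum of two dice, a variable chosen only for illustration) that builds the probability function \(f(x)=P(X=x)\) by exact counting.
```python
from fractions import Fraction
from collections import Counter

# Sample space of two dice and the random variable X = sum of the two faces.
S = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
counts = Counter(d1 + d2 for d1, d2 in S)

# Probability table: f(x) = P(X = x).
table = {x: Fraction(n, len(S)) for x, n in sorted(counts.items())}
for x, p in table.items():
    print(f"P(X = {x:2d}) = {p}")

# The probabilities in a table must add up to 1.
assert sum(table.values()) == 1
```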
-
Know the definition of a continuous random variable.
The set of values of such a variable is an uncountable set
such as an interval \([a,b]\) or \([0,\infty[\). The probability
\[ P(X=x)=0 \]
of a particular value \(x\) is always zero for a continuous
random variable. Therefore, calculating probabilities related
to continuous random variables requires knowledge of
the probability density function ("density curve").
If the formula for the density curve is \(y = f(x) \) then
the formula allowing us to compute the probability is:
\[ P(a \le X \le b) = \int_{a}^b f(x)\,dx \]
Thus, the probability is the area under the curve.
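Here is a Python sketch of this idea, using a "triangular" density chosen purely for illustration, \(f(x)=1-|x|\) on \([-1,1]\): it approximates \(P(0\le X\le 0.5)\) by a Riemann sum, and the answer agrees with the exact area \(0.375\) obtained from geometry.
```python
def f(x):
    """A 'triangular' density curve: f(x) = 1 - |x| on [-1, 1], and 0 elsewhere."""
    return max(0.0, 1.0 - abs(x))

def prob(a, b, n=100_000):
    """Approximate P(a <= X <= b) as the area under f between a and b (midpoint rule)."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Exact value by geometry: a trapezoid with parallel sides 1 and 0.5 and width 0.5.
print(prob(0, 0.5))          # approximately 0.375
print(prob(-1, 1))           # total area under the density curve, approximately 1
```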
-
Note that
\[ P(a \le X \le b) = P(a < X \le b)= P(a \le X < b) = P(a < X < b) \]
for continuous random variables. This is definitely not the case
for discrete random variables (why?).
-
Probability distribution functions, probability tables, cumulative distributions,
expected value, variance, standard deviation. Note that the cumulative distribution
function of a random variable \(X\) is defined by:
\[ F(x) = P(X \le x) \]
This formula is valid, regardless of whether \(X\) is discrete or continuous.
However, the calculation is different in these two cases:
\[ F(x) = \sum_{y\leq x} P(X=y) \]
for discrete variables, where the summation is over the values \(y\) that
\(X\) actually assumes. For continuous random variables,
\[F(x) = \int_{-\infty}^x f(y) \,dy \]
where \(f(x)\) is the "density curve". Thus, this is the area under the curve
\(w=f(y)\) and to the left of the vertical line \(y=x\). Note that Table A tabulates
\(F(x)\) for the normal density curve:
\[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2} \]
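Values from Table A can be reproduced from this density curve. The sketch below (a check, not a replacement for Table A) computes \(F(1.0)\) both by numerical integration of the density and via the closed form \(F(x)=\frac{1}{2}\left(1+\operatorname{erf}(x/\sqrt{2})\right)\); Table A gives \(0.8413\) for \(z=1.00\).
```python
import math

def phi(x):
    """Standard normal density curve f(x) = (1 / sqrt(2*pi)) * exp(-x**2 / 2)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x, lower=-10.0, n=100_000):
    """F(x) = area under the density to the left of x (midpoint rule; -10 stands in for minus infinity)."""
    h = (x - lower) / n
    return sum(phi(lower + (i + 0.5) * h) for i in range(n)) * h

x = 1.0
print(normal_cdf(x))                                  # approximately 0.8413
print(0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))     # same value via the error function
```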
-
Calculation of the expected value of a random variable (also called the mean).
Be familiar with the formula:
\[ \mathbb{E}X = \mu_X = \sum_{i=1}^n x_i p_i = \sum_{x} x\, P(X = x)\]
where the summation extends over all values of \(X\).
The set of values could be an infinite, but countable set, such as the natural numbers.
In this case, the formula is an infinite series:
\[ \mathbb{E}X = \mu_X = \sum_{i=1}^\infty x_i p_i = \sum_{x} x\, P(X = x) \]
i.e. formally \(n=\infty\).
For example, the expected number of tosses before the first head is seen in a potentially infinite
sequence of coin tosses is:
\[ \sum_{k=0}^\infty k \cdot \frac{1}{2^{k+1}} = 1.\]
Obtaining this result requires some familiarity with infinite series.
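If infinite series are unfamiliar, a numerical illustration (not a proof) may help: the partial sums of this series approach \(1\).
```python
# Partial sums of sum over k of k / 2^(k+1), which converges to 1.
total = 0.0
for k in range(0, 60):
    total += k / 2 ** (k + 1)
    if k in (5, 10, 20, 50):
        print(f"partial sum up to k={k}: {total:.10f}")
```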
-
Know that \(X+Y\) is only defined if \(X\) and \(Y\) share
the same sample space. In calculus, you have learnt that
two functions may be added only if they have the same
domain. This is the same principle.
-
The rule for the mean of a sum of random variables:
\[ \mu_{X+Y} = \mu_{X}+\mu_{Y} \]
Know that \(X\) and \(Y\) do not have to be independent
for this rule to hold.
-
The rule for the variance of independent
random variables:
\[ \sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 \]
if variables \(X\) and \(Y\) are independent.
-
Another variance formula (note that adding the constant \(b\) does not change the variance):
\[ \sigma_{a\,X+b}^2 = a^2\,\sigma_{X}^2\]
where \(a\) and \(b\) are constants.
-
Combined rule for expected values:
\[ \mu_{a\,X+b} = a\,\mu_X + b \]
where \(a\) and \(b\) are numbers.
-
Combined rule for variances:
\[ \sigma_{a\,X+b}^2 = a^2\,\sigma_X^2\]
where \(a\) and \(b\) are numbers.
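All of these rules can be verified exactly on small probability tables. The Python sketch below (using die rolls and the constants \(a=3\), \(b=2\), chosen only for illustration) computes means and variances from the tables and checks the rules for \(X+Y\) with \(X\) and \(Y\) independent, and for \(aX+b\).
```python
from fractions import Fraction
from itertools import product

# A probability table for one die roll: value -> probability.
die = {x: Fraction(1, 6) for x in range(1, 7)}

def mean(table):
    """Expected value computed from a probability table."""
    return sum(x * p for x, p in table.items())

def var(table):
    """Variance computed from a probability table: E[(X - mu)^2]."""
    mu = mean(table)
    return sum((x - mu) ** 2 * p for x, p in table.items())

# X and Y are two independent die rolls; build the exact table of X + Y.
sum_table = {}
for (x, px), (y, py) in product(die.items(), die.items()):
    sum_table[x + y] = sum_table.get(x + y, 0) + px * py

# Rule for means: mu_{X+Y} = mu_X + mu_Y (holds even without independence).
print(mean(sum_table), mean(die) + mean(die))       # both equal 7
# Variance rule for independent variables: sigma^2_{X+Y} = sigma^2_X + sigma^2_Y.
print(var(sum_table), var(die) + var(die))          # both equal 35/6

# Rules for aX + b with constants a and b (chosen arbitrarily here).
a, b = 3, 2
ax_b = {a * x + b: p for x, p in die.items()}
print(mean(ax_b), a * mean(die) + b)                # mu_{aX+b} = a*mu_X + b
print(var(ax_b), a ** 2 * var(die))                 # sigma^2_{aX+b} = a^2 * sigma^2_X
```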
-
Know the meaning of independence of random variables:
for all \(x\) and \(y\)
\[ P(X=x\;\text{and}\;Y=y) = P(X=x)\,P(Y=y) \]
-
Familiarity with correlation for random variables:
\[ \rho = \rho_{X,Y} = \mathrm{corr}(X,Y)=\mathbb{E}\left(\left(\frac{X-\mu_X}{\sigma_X}\right)
\left(\frac{Y-\mu_Y}{\sigma_Y}\right)\right)
= \sum_{i=1}^m\sum_{j=1}^n \left(\frac{x_i-\mu_X}{\sigma_X}\right)
\left(\frac{y_j-\mu_Y}{\sigma_Y}\right)\cdot p_{ij}
\]
where
\[ p_{ij} = P(X=x_i\;\text{and}\;Y=y_j) \]
(NOTE: We cannot compute \(p_{ij}\) with the product rule \(P(X=x_i)\,P(Y=y_j)\) because \(X\) and \(Y\) may not be
independent; if they were independent, then \(\rho_{X,Y}=0\).)
Here, we avoided excessive subscripts by using common notation for
the expected value of a variable:
\[ \mathbb{E}X = \mu_X \]
-
Know the meaning of the formula:
\[ \sigma_{X+Y}^2 = \sigma_{X}^2 + \sigma_{Y}^2 + 2\,\rho_{X,Y}\,\sigma_X\sigma_Y \]
-
Note that the formula for \(\rho\) is not in the book,
although it appears to be used in several examples. Be able
to use it when \(X\) and \(Y\) assume very few values (2 or
3). Here is an example: For a certain basketball team it
was determined that the probability of scoring on the
first free throw is \(0.8\). However, the probability of
scoring on the second shot depends on whether the player
scored on the first shot. If the player scored on the first
shot, the probability of scoring on the second shot remains
\(0.8\). However, if the player missed on the first shot,
the probability of scoring on the second shot is only
\(0.7\). This lower performance is known in sports as
"choking". Let \(X\) be a random variable which represents
the number of points scored on the first shot (0 or 1), and
let \(Y\) be the number of points scored on the second
shot. Please answer the following questions:
-
Find the four probabilities:
\[p_{ij} = P(X=i\;\text{and}\;Y=j) \]
for \(i,j=0,1\). That is, fill out the
following table:
\[
\begin{array}{c|cc}
 & Y=0 & Y=1 \\ \hline
X=0 & p_{00} & p_{01} \\
X=1 & p_{10} & p_{11}
\end{array}
\]
The above table is the joint probability function
or joint distribution of \(X\) and \(Y\).
-
Find \(\mu_X\) and \(\mu_Y\).
-
Find \(\sigma_X\) and \(\sigma_Y\).
-
Find \(\mathrm{corr}(X,Y)\). Note that in this case \(X\) and \(Y\) are not independent, so the correlation need not be \(0\).
-
Find the probability function (table) for the random variable \(Z\), the total
score of two free throws: \(Z=X+Y\). Find \(\mu_Z\) and \(\sigma_Z\) directly.
-
Verify the equation
\[ \sigma_{Z}^2 = \sigma_{X}^2+\sigma_{Y}^2+2\rho_{X,Y}\,\sigma_X\,\sigma_Y \]
-
Please do finish the calculations above!
-
You may also use the above example to practice
conditional probabilities and the Bayes' formula. For
example, if it is known that a player scored on the
second throw, what is the probability that she/he missed
on the first throw?
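If you would like to check your hand calculations for this example, here is a Python sketch that computes the joint table and the requested quantities directly from the definitions. It is only a checking aid, not the intended way to solve the problem.
```python
import math

# Joint probabilities p[(i, j)] = P(X = i and Y = j) for the free-throw example:
# P(X=1) = 0.8; P(Y=1 | X=1) = 0.8; P(Y=1 | X=0) = 0.7.
p = {
    (0, 0): 0.2 * 0.3,   # miss first, miss second
    (0, 1): 0.2 * 0.7,   # miss first, score second
    (1, 0): 0.8 * 0.2,   # score first, miss second
    (1, 1): 0.8 * 0.8,   # score first, score second
}

# Marginal means and standard deviations.
mu_X = sum(i * pij for (i, j), pij in p.items())
mu_Y = sum(j * pij for (i, j), pij in p.items())
sd_X = math.sqrt(sum((i - mu_X) ** 2 * pij for (i, j), pij in p.items()))
sd_Y = math.sqrt(sum((j - mu_Y) ** 2 * pij for (i, j), pij in p.items()))

# Correlation from the double-sum formula.
rho = sum(((i - mu_X) / sd_X) * ((j - mu_Y) / sd_Y) * pij
          for (i, j), pij in p.items())

# Probability table for Z = X + Y and its mean and variance.
z_table = {}
for (i, j), pij in p.items():
    z_table[i + j] = z_table.get(i + j, 0) + pij
mu_Z = sum(z * pz for z, pz in z_table.items())
var_Z = sum((z - mu_Z) ** 2 * pz for z, pz in z_table.items())

print("mu_X, mu_Y:", mu_X, mu_Y)
print("sd_X, sd_Y:", sd_X, sd_Y)
print("corr(X, Y):", rho)
print("table of Z:", z_table, " mu_Z:", mu_Z)
print("var_Z vs sd_X^2 + sd_Y^2 + 2*rho*sd_X*sd_Y:",
      var_Z, sd_X ** 2 + sd_Y ** 2 + 2 * rho * sd_X * sd_Y)

# Bayes-style question: P(missed the first | scored on the second).
print("P(X=0 | Y=1):", p[(0, 1)] / (p[(0, 1)] + p[(1, 1)]))
```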
-
Know that independence of random variables \(X\) and \(Y\) implies \(\mathrm{corr}(X,Y)=0\).
Know that \(\mathrm{corr}(X,Y)=0 \) does not imply
independence of \(X\) and \(Y\). However, \(\mathrm{corr}(X,Y)=\pm 1\) implies that
\(Y=a\,X+b\) for some constants \(a\neq 0\) and \(b\). Moreover, the sign of \(a\) is
the same as the sign of \(\mathrm{corr}(X,Y)\). This is perfect linear dependence.
NOTE: A similar result was true for sample correlations.
-
Note that \(\mathrm{corr}(X,Y)\) is only defined when \(\sigma_X>0\) and \(\sigma_Y>0\).
In particular, \(X\) and \(Y\) must assume at least two different values.
-
Know the distinction between the mean \(\mu_X\) of a random variable and the sample mean
\[\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\]
The former is a parameter of the population, and
the latter is a statistic (a property of the sample).
-
Similarly, the variance \(\sigma_X^2\) is a parameter, while the sample variance
\[s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]
is a statistic.
-
Also, the correlation \(\rho_{X,Y}\) is a parameter,
while the sample correlation \(r=r_{xy}\) is a statistic.
Recall the formula for the sample correlation:
\[
r=r_{xy}=
\frac{1}{n-1}\sum_{i=1}^n \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y}
\]
Compare with the formula for \(\mathrm{corr}(X,Y)\), which involves
double summation.
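These sample formulas translate directly into code. Here is a plain-Python sketch with made-up data (used only for illustration) that computes the sample mean, sample variance, and sample correlation exactly as defined above.
```python
import math

# Made-up paired data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xbar = sum(xs) / n                                   # sample mean of x
ybar = sum(ys) / n                                   # sample mean of y
sx2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)     # sample variance s_x^2
sy2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)     # sample variance s_y^2
sx, sy = math.sqrt(sx2), math.sqrt(sy2)

# Sample correlation r = (1/(n-1)) * sum of standardized products.
r = sum(((x - xbar) / sx) * ((y - ybar) / sy)
        for x, y in zip(xs, ys)) / (n - 1)

print(xbar, sx2, r)
```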
Conditional probability
-
Know the meaning of \( P(A\,|\,B)\), the conditional probability of \(A\) given \(B\).
- Know the formula:
\[ P(A\,|\,B) = \frac{P(A\cap B)}{P(B)} \]
-
Know the rules of probability for conditional probabilities.
The function \( P'(A) = P(A|B) \) satisfies all Probability Rules
for fixed \(B\). For example,
\[ P(A\cup C|B) = P(A|B) + P(C|B) - P(A\cap C|B) \]
Thus, if you learned a rule for ordinary, non-conditional probability,
there is a corresponding rule for the conditional probability.
-
Know the Law of Alternatives, also known as the Total Probability Formula.
Let \(C_1,C_2,\ldots,C_n\) be
mutually disjoint events ("causes"):
\[C_i\cap C_j = \emptyset\quad\text{when $i\neq j$}\]
and exhaustive events:
\[ C_1\cup C_2\cup \ldots \cup C_n = S \]
Then for every event \(A\) ("consequence"):
\[ P(A) = \sum_{i=1}^n P(A|C_i)\,P(C_i) \]
-
Know Bayes' Formula:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
-
An alternative form of Bayes' Formula for the probability of
the cause, given a known consequence:
\[ P(C_i|A) = \frac{P(A|C_i) P(C_i)}{\sum_{j=1}^nP(A|C_j) P(C_j)} \]
-
Know how to apply Bayes' Formula to common examples discussed by
the book and slides.
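A typical application is the medical-testing setting mentioned in the last-minute tips. The sketch below uses made-up numbers (prevalence, sensitivity, and false-positive rate; these are NOT the numbers from the slides) to compute the probability of disease given a positive test, using the Total Probability Formula and Bayes' Formula.
```python
# Hypothetical numbers for a screening test (for illustration only).
p_disease = 0.01                    # P(C1): prevalence of the disease
p_healthy = 1 - p_disease           # P(C2)
p_pos_given_disease = 0.90          # P(A | C1): sensitivity
p_pos_given_healthy = 0.08          # P(A | C2): false-positive rate

# Total Probability Formula: P(A) = sum over causes of P(A | Ci) * P(Ci).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes' Formula: P(C1 | A) = P(A | C1) * P(C1) / P(A).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(positive) = {p_pos:.4f}")
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")   # about 0.10
```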
-
You may find it useful to read the following article on
Bayes' Theorem
-
The Monty Hall Problem
provides an interesting example of an application of conditional probability.
This example is often used in job interviews.
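If you want to convince yourself of the (famously counterintuitive) answer, here is a small Python simulation sketch; the win rates it estimates, about \(1/3\) for staying and \(2/3\) for switching, agree with the conditional-probability analysis.
```python
import random

random.seed(1)
trials = 100_000
stay_wins = switch_wins = 0

for _ in range(trials):
    doors = [0, 1, 2]
    car = random.choice(doors)                 # the prize is behind one door
    pick = random.choice(doors)                # contestant's initial pick
    # Monty opens a door that is neither the pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    switched = next(d for d in doors if d != pick and d != opened)
    stay_wins += (pick == car)
    switch_wins += (switched == car)

print("P(win | stay)   =", stay_wins / trials)     # about 1/3
print("P(win | switch) =", switch_wins / trials)   # about 2/3
```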