Information about Midterm 2
Last minute tips
-
Be able to set up sample spaces with pairs and tuples.
-
Be familiar with a deck of playing cards.
-
Know how to use Bayes' formula in the context of medical
testing, such as the breast cancer screening example on the
slides.
-
Be able to calculate probabilities using density curves
such as the normal curve and the "triangular" curve.
-
The formula for the correlation of two random variables
(\(\rho_{X,Y}\) below) will not be required on the test.
-
The number of questions is exactly 18.
Calculations are quick and easy.
General Information
-
Midterm 2 is an in-class test.
-
Midterm 2 will cover all topics of Chapters 3 and 4 and
Section 2.6.
-
The number of questions is exactly 18.
-
The questions are primarily multiple-choice.
-
Several questions require calculator use.
-
Table A (normal distribution), Table B (random digits) and Table
C (binomial distribution) will be provided if needed.
-
Please review old tests (2010 Midterm 2 and a portion of
2010 Midterm 3 in this folder) for a
sample of problems that may be similar to some test
questions.
-
You may find these notes
on the three-set Venn diagram and 2-way tables useful.
List of Chapter 3 topics covered
-
The three principles of experimental design.
-
Observational vs. experimental studies.
-
Identification of experimental units.
-
Identification of population.
-
Sampling techniques using a table of random digits.
-
Basic experimental designs:
- Block (=Stratified)
- Matched pair
- Multi-stage
-
Lurking variables (including information in Section 2.6).
Definition, identification, when to watch out for.
-
Confounding (including information in Section 2.6).
Definition, identification, when to watch out for.
-
Bias and variability. Differentiating between the two.
-
Controlling bias. Controlling by randomization.
-
Problems when using anecdotal evidence.
-
Problems when using polling.
List of Chapter 4 topics covered
Probability Models
-
Know the meaning of outcomes.
-
Be familiar with basic set theory: elements, sets, curly
brace notation, pairs, tuples, union, intersection,
complement.
-
Know the meaning of sample spaces and events.
-
Know the difference between outcomes and elementary events.
-
Be able to identify and construct sample spaces.
Be able to describe sample spaces using set notation.
-
Be able to use union, intersection and complement to
describe events described in plain English, using
connectives such as "or", "and" and "not".
-
Be familiar with standard examples used in class such as
multiple coin tosses, die tosses, free throws in
basketball, picking M&M candy out of a jar (with and without
replacement), tosses of a bottle cap, etc.
-
Be able to perform calculations of probabilities of events,
based on laws of probability and set notation
(union, intersection, complement).
-
Know the addition rule for disjoint events and its generalization,
the Inclusion-Exclusion Principle for 2 and 3 events:
\[ P(A\cup B) = P(A) + P(B) \quad\text{if $A\cap B=\emptyset$} \]
\[ P(A\cup B) = P(A) + P(B) - P(A\cap B) \quad\text{(always)} \]
\[ P(A\cup B \cup C) = P(A) + P(B) + P(C) - P(A\cap B) - P(A\cap C) - P(B\cap C) +
P(A\cap B \cap C) \]
-
Know the meaning of independence of events. Be able to
apply the Multiplication Rule for independent events.
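These rules can be checked by brute-force counting on a small sample space. Here is a short Python sketch (an illustration only; the events are arbitrarily chosen) that builds the sample space of two die tosses and verifies the Inclusion-Exclusion Principle for two events and the Multiplication Rule for two independent events.
```python
from fractions import Fraction

# Sample space: all ordered pairs (first die, second die).
S = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

def P(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event), len(S))

# Two illustrative events.
A = {s for s in S if s[0] % 2 == 0}      # first die is even
B = {s for s in S if s[0] + s[1] >= 8}   # sum is at least 8

# Inclusion-Exclusion for two events: P(A or B) = P(A) + P(B) - P(A and B).
assert P(A | B) == P(A) + P(B) - P(A & B)

# Multiplication Rule for independent events:
# C depends only on the first die, D only on the second, so they are independent.
C = {s for s in S if s[0] == 6}          # first die shows 6
D = {s for s in S if s[1] == 6}          # second die shows 6
assert P(C & D) == P(C) * P(D)           # 1/36 = 1/6 * 1/6

print(P(A | B), P(C & D))
```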
Random variables
-
Understand the following definition:
A random variable is a function
on the sample space, with numerical values.
Using mathematical notation, let \(S\) be a sample
space. Any function \(X:S\to \mathbb{R}\) is a random
variable (on the sample space \(S\)).
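To make this definition concrete, here is a minimal Python sketch (an illustration, not from the course materials) that represents a random variable literally as a function on a sample space: two coin tosses and the number of heads.
```python
# Sample space of two coin tosses, written as pairs of outcomes.
S = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def X(s):
    """A random variable: a function that assigns a number to each outcome."""
    return s.count("H")   # number of heads in the outcome

# The set of values X(S) -- the image of the whole sample space.
values = {X(s) for s in S}
print(values)   # {0, 1, 2}
```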
-
Understand the terminology of functions. Thus, if \(X:S\to U\)
then \(S\) is called the domain of \(X\) and
\(U\) is called the range of \(X\). For random variables
the range is therefore a subset of \(\mathbb{R}\),
the real numbers. Note:
\[\mathbb{R}=(-\infty,\infty)\]
\[\mathbb{R}= ]-\infty,\infty[ \]
using two different conventions about denoting intervals.
-
The set of values of a random variable is the set
of numbers \(X(s)\) where \(s\) is an outcome (an element of \(S\)).
It is denoted \(X(S)\) ("X of the entire sample space").
-
Know the definition of a discrete random variable:
A function \(X:S\to\mathbb{R}\) is a discrete random variable
if the set of values of \(X\) is either finite
or countably infinite. Know that the definition
in the book is incorrect, disallowing an infinite set of values.
An example of a discrete random variable with an infinite
set of values is the number of tosses of a coin before you
see the first head. Thus,
-
If in the first toss you get \(Head\), \(X=0\).
-
If the first toss is a \(Tail\) but you get \(Head\) on the second toss, \(X=1\).
-
Thus, \(X\) may be \(0, 1, 2, \ldots\).
-
It may be shown (using, for instance, tree-based calculations), that
\[ P(X=k) = \frac{1}{2^{k+1}} \]
for \(k=0,1,2,\ldots\).
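The formula \(P(X=k)=\frac{1}{2^{k+1}}\) can also be checked by simulation. The following Python sketch (with an arbitrarily chosen number of repetitions) counts the tosses before the first head and compares the observed frequencies with the formula.
```python
import random
from collections import Counter

random.seed(0)                 # for reproducibility
trials = 100_000
counts = Counter()

for _ in range(trials):
    k = 0                          # number of tails seen before the first head
    while random.random() < 0.5:   # "tail" with probability 1/2
        k += 1
    counts[k] += 1

for k in range(6):
    observed = counts[k] / trials
    expected = 1 / 2 ** (k + 1)    # P(X = k) = 1 / 2^(k+1)
    print(f"k={k}: observed {observed:.4f}, formula {expected:.4f}")
```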
-
Know the definition of the probability function for discrete
random variables. The probability function for a random variable
assuming values \(x_1,x_2,\ldots,x_n\) with probabilities
\(p_1,p_2,\ldots,p_n\) is:
\[ f(x_i) = p_i \]
The probability table of \(X\) is simply a table
that lists the values of this function. Since the set of values may
be infinite (but countable), the table may be infinite, and
it may be necessary to give a formula rather than listing
values, as in the previous example.
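For practice with probability tables, here is a Python sketch (using the sum of two dice, a variable chosen only for illustration) that builds the probability function \(f(x)=P(X=x)\) by exact counting.
```python
from fractions import Fraction
from collections import Counter

# Sample space of two dice and the random variable X = sum of the two faces.
S = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
counts = Counter(d1 + d2 for d1, d2 in S)

# Probability table: f(x) = P(X = x).
table = {x: Fraction(n, len(S)) for x, n in sorted(counts.items())}
for x, p in table.items():
    print(f"P(X = {x:2d}) = {p}")

# The probabilities in a table must add up to 1.
assert sum(table.values()) == 1
```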
-
Know the definition of a continuous random variable.
The set of values of such a variable is an uncountable set
such as an interval \([a,b]\) or \([0,\infty[\). The probability
\[ P(X=x)=0 \]
of a particular value \(x\) is always zero for a continuous
random variable. Therefore, calculating probabilities related
to continuous random variables requires knowledge of
the probability density function ("density curve").
If the formula for the density curve is \(y = f(x) \) then
the formula allowing us to compute the probability is:
\[ P(a \le X \le b) = \int_{a}^b f(x)\,dx \]
Thus, the probability is the area under the curve.
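Here is a Python sketch of this idea, using a "triangular" density chosen purely for illustration, \(f(x)=1-|x|\) on \([-1,1]\): it approximates \(P(0\le X\le 0.5)\) by a Riemann sum, and the answer agrees with the exact area \(0.375\) obtained from geometry.
```python
def f(x):
    """A 'triangular' density curve: f(x) = 1 - |x| on [-1, 1], and 0 elsewhere."""
    return max(0.0, 1.0 - abs(x))

def prob(a, b, n=100_000):
    """Approximate P(a <= X <= b) as the area under f between a and b (midpoint rule)."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Exact value by geometry: a trapezoid with parallel sides 1 and 0.5 and width 0.5.
print(prob(0, 0.5))          # approximately 0.375
print(prob(-1, 1))           # total area under the density curve, approximately 1
```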
-
Note that
\[ P(a \le X \le b) = P(a < X \le b)= P(a \le X < b) = P(a < X < b) \]
for continuous random variables. This is definitely not the case
for discrete random variables (why?).
-
Probability distribution functions, probability tables, cumulative distributions,
expected value, variance, standard deviation. Note that the cumulative distribution
function of a random variable \(X\) is defined by:
\[ F(x) = P(X \le x) \]
This formula is valid, regardless of whether \(X\) is discrete or continuous.
However, the calculation is different in these two cases:
\[ F(x) = \sum_{y\leq x} P(X=y) \]
for discrete variables, where the summation is over the values \(y\) that
\(X\) actually assumes. For continuous random variables,
\[F(x) = \int_{-\infty}^x f(y) \,dy \]
where \(f(x)\) is the "density curve". Thus, this is the area under the curve
\(w=f(y)\) and to the left of the vertical line \(y=x\). Note that Table A tabulates
\(F(x)\) for the normal density curve:
\[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2} \]
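Values from Table A can be reproduced from this density curve. The sketch below (a check, not a replacement for Table A) computes \(F(1.0)\) both by numerical integration of the density and via the closed form \(F(x)=\frac{1}{2}\left(1+\operatorname{erf}(x/\sqrt{2})\right)\); Table A gives \(0.8413\) for \(z=1.00\).
```python
import math

def phi(x):
    """Standard normal density curve f(x) = (1 / sqrt(2*pi)) * exp(-x**2 / 2)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x, lower=-10.0, n=100_000):
    """F(x) = area under the density to the left of x (midpoint rule; -10 stands in for minus infinity)."""
    h = (x - lower) / n
    return sum(phi(lower + (i + 0.5) * h) for i in range(n)) * h

x = 1.0
print(normal_cdf(x))                                  # approximately 0.8413
print(0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))     # same value via the error function
```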
-
Calculation of the expected value of a random variable (also called the mean).
Be familiar with the formula:
\[ \mathbb{E}X = \mu_X = \sum_{i=1}^n x_i p_i = \sum_{x} x\, P(X = x)\]
where the summation extends over all values of \(X\).
The set of values could be an infinite, but countable set, such as the natural numbers.
In this case, the formula is an infinite series:
\[ \mathbb{E}X = \mu_X = \sum_{i=1}^\infty x_i p_i = \sum_{x} x\, P(X = x) \]
i.e. formally \(n=\infty\).
For example, the expected number of tosses before the first head is seen in a potentially infinite
sequence of coin tosses is:
\[ \sum_{k=0}^\infty k \cdot \frac{1}{2^{k+1}} = 1.\]
Obtaining this result requires some familiarity with infinite series.
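If infinite series are unfamiliar, a numerical illustration (not a proof) may help: the partial sums of this series approach \(1\).
```python
# Partial sums of sum over k of k / 2^(k+1), which converges to 1.
total = 0.0
for k in range(0, 60):
    total += k / 2 ** (k + 1)
    if k in (5, 10, 20, 50):
        print(f"partial sum up to k={k}: {total:.10f}")
```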
-
Know that \(X+Y\) is only defined if \(X\) and \(Y\) share
the same sample space. In calculus, you have learnt that
two functions may be added only if they have the same
domain. This is the same principle.
-
The rule for the mean of a sum of random variables:
\[ \mu_{X+Y} = \mu_{X}+\mu_{Y} \]
Know that \(X\) and \(Y\) do not have to be independent
for this rule to hold.
-
The rule for the variance of independent
random variables:
\[ \sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 \]
if variables \(X\) and \(Y\) are independent.
-
Another variance formula (note that adding the constant \(b\) does not change the variance):
\[ \sigma_{a\,X+b}^2 = a^2\,\sigma_{X}^2\]
where \(a\) and \(b\) are constants.
-
Combined rule for expected values:
\[ \mu_{a\,X+b} = a\,\mu_X + b \]
where \(a\) and \(b\) are numbers.
-
Combined rule for variances:
\[ \sigma_{a\,X+b}^2 = a^2\,\sigma_X^2\]
where \(a\) and \(b\) are numbers.
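All of these rules can be verified exactly on small probability tables. The Python sketch below (using die rolls and the constants \(a=3\), \(b=2\), chosen only for illustration) computes means and variances from the tables and checks the rules for \(X+Y\) with \(X\) and \(Y\) independent, and for \(aX+b\).
```python
from fractions import Fraction
from itertools import product

# A probability table for one die roll: value -> probability.
die = {x: Fraction(1, 6) for x in range(1, 7)}

def mean(table):
    """Expected value computed from a probability table."""
    return sum(x * p for x, p in table.items())

def var(table):
    """Variance computed from a probability table: E[(X - mu)^2]."""
    mu = mean(table)
    return sum((x - mu) ** 2 * p for x, p in table.items())

# X and Y are two independent die rolls; build the exact table of X + Y.
sum_table = {}
for (x, px), (y, py) in product(die.items(), die.items()):
    sum_table[x + y] = sum_table.get(x + y, 0) + px * py

# Rule for means: mu_{X+Y} = mu_X + mu_Y (holds even without independence).
print(mean(sum_table), mean(die) + mean(die))       # both equal 7
# Variance rule for independent variables: sigma^2_{X+Y} = sigma^2_X + sigma^2_Y.
print(var(sum_table), var(die) + var(die))          # both equal 35/6

# Rules for aX + b with constants a and b (chosen arbitrarily here).
a, b = 3, 2
ax_b = {a * x + b: p for x, p in die.items()}
print(mean(ax_b), a * mean(die) + b)                # mu_{aX+b} = a*mu_X + b
print(var(ax_b), a ** 2 * var(die))                 # sigma^2_{aX+b} = a^2 * sigma^2_X
```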
-
Know the meaning of independence of random variables:
for all \(x\) and \(y\)
\[ P(X=x\;\text{and}\;Y=y) = P(X=x)\,P(Y=y) \]
-
Familiarity with correlation for random variables:
\[ \rho = \rho_{X,Y} = \mathrm{corr}(X,Y)=\mathbb{E}\left(\left(\frac{X-\mu_X}{\sigma_X}\right)
\left(\frac{Y-\mu_Y}{\sigma_Y}\right)\right)
= \sum_{i=1}^m\sum_{j=1}^n \left(\frac{x_i-\mu_X}{\sigma_X}\right)
\left(\frac{y_j-\mu_Y}{\sigma_Y}\right)\cdot p_{ij}
\]
where
\[ p_{ij} = P(X=x_i\;\text{and}\;Y=y_j) \]
(NOTE: We cannot compute \(p_{ij}\) with the product rule \(P(X=x_i)\,P(Y=y_j)\) because \(X\) and \(Y\) may not be
independent; if they were independent, then \(\rho_{X,Y}=0\).)
Here, we avoided excessive subscripts by using common notation for
the expected value of a variable:
\[ \mathbb{E}X = \mu_X \]
-
Know the meaning of the formula:
\[ \sigma_{X+Y}^2 = \sigma_{X}^2 + \sigma_{Y}^2 + 2\,\rho_{X,Y}\,\sigma_X\sigma_Y \]
-
Note that the formula for \(\rho\) is not in the book,
although it appears to be used in several examples. Be able
to use it when \(X\) and \(Y\) assume very few values (2 or
3). Here is an example: For a certain basketball team it
was determined that the probability of scoring on the
first free throw is \(0.8\). However, the probability of
scoring on the second shot depends on whether the player
scored on the first shot. If the player scored on the first
shot, the probability of scoring on the second shot remains
\(0.8\). However, if the player missed on the first shot,
the probability of scoring on the second shot is only
\(0.7\). This lower performance is known in sports as
"choking". Let \(X\) be a random variable which represents
the number of points scored on the first shot (0 or 1), and
let \(Y\) be the number of points scored on the second
shot. Please answer the following questions:
-
Find the four probabilities:
\[p_{ij} = P(X=i\;\text{and}\;Y=j) \]
for \(i,j=0,1\). That is, fill out the
following table:
\[
\begin{array}{c|cc}
 & Y=0 & Y=1 \\ \hline
X=0 & p_{00} & p_{01} \\
X=1 & p_{10} & p_{11}
\end{array}
\]
The above table is the joint probability function
or joint distribution of \(X\) and \(Y\).
-
Find \(\mu_X\) and \(\mu_Y\).
-
Find \(\sigma_X\) and \(\sigma_Y\).
-
Find \(\mathrm{corr}(X,Y)\). Note that in this case \(X\) and \(Y\) are not independent, so the correlation need not be \(0\).
-
Find the probability function (table) for the random variable \(Z\), the total
score of two free throws: \(Z=X+Y\). Find \(\mu_Z\) and \(\sigma_Z\) directly.
-
Verify the equation
\[ \sigma_{Z}^2 = \sigma_{X}^2+\sigma_{Y}^2+2\rho_{X,Y}\,\sigma_X\,\sigma_Y \]
-
Please do finish the calculations above!
-
You may also use the above example to practice
conditional probabilities and the Bayes' formula. For
example, if it is known that a player scored on the
second throw, what is the probability that she/he missed
on the first throw?
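If you would like to check your hand calculations for this example, here is a Python sketch that computes the joint table and the requested quantities directly from the definitions. It is only a checking aid, not the intended way to solve the problem.
```python
import math

# Joint probabilities p[(i, j)] = P(X = i and Y = j) for the free-throw example:
# P(X=1) = 0.8; P(Y=1 | X=1) = 0.8; P(Y=1 | X=0) = 0.7.
p = {
    (0, 0): 0.2 * 0.3,   # miss first, miss second
    (0, 1): 0.2 * 0.7,   # miss first, score second
    (1, 0): 0.8 * 0.2,   # score first, miss second
    (1, 1): 0.8 * 0.8,   # score first, score second
}

# Marginal means and standard deviations.
mu_X = sum(i * pij for (i, j), pij in p.items())
mu_Y = sum(j * pij for (i, j), pij in p.items())
sd_X = math.sqrt(sum((i - mu_X) ** 2 * pij for (i, j), pij in p.items()))
sd_Y = math.sqrt(sum((j - mu_Y) ** 2 * pij for (i, j), pij in p.items()))

# Correlation from the double-sum formula.
rho = sum(((i - mu_X) / sd_X) * ((j - mu_Y) / sd_Y) * pij
          for (i, j), pij in p.items())

# Probability table for Z = X + Y and its mean and variance.
z_table = {}
for (i, j), pij in p.items():
    z_table[i + j] = z_table.get(i + j, 0) + pij
mu_Z = sum(z * pz for z, pz in z_table.items())
var_Z = sum((z - mu_Z) ** 2 * pz for z, pz in z_table.items())

print("mu_X, mu_Y:", mu_X, mu_Y)
print("sd_X, sd_Y:", sd_X, sd_Y)
print("corr(X, Y):", rho)
print("table of Z:", z_table, " mu_Z:", mu_Z)
print("var_Z vs sd_X^2 + sd_Y^2 + 2*rho*sd_X*sd_Y:",
      var_Z, sd_X ** 2 + sd_Y ** 2 + 2 * rho * sd_X * sd_Y)

# Bayes-style question: P(missed the first | scored on the second).
print("P(X=0 | Y=1):", p[(0, 1)] / (p[(0, 1)] + p[(1, 1)]))
```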
-
Know that independence of random variables \(X\) and \(Y\) implies \(\mathrm{corr}(X,Y)=0\).
Know that \(\mathrm{corr}(X,Y)=0 \) does not imply
independence of \(X\) and \(Y\). However, \(\mathrm{corr}(X,Y)=\pm 1\) implies that
\(Y=a\,X+b\) for some constants \(a\neq 0\) and \(b\). Moreover, the sign of \(a\) is
the same as the sign of \(\mathrm{corr}(X,Y)\). This is perfect linear dependence.
NOTE: A similar result was true for sample correlations.
-
Note that \(\mathrm{corr}(X,Y)\) is only defined when \(\sigma_X>0\) and \(\sigma_Y>0\).
In particular, \(X\) and \(Y\) must assume at least two different values.
-
Know the distinction between the mean \(\mu_X\) of a random variable and the sample mean
\[\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\]
The former is a parameter of the population, and
the latter is a statistic (a property of the sample).
-
Similarly, the variance \(\sigma_X^2\) is a parameter, while the sample variance
\[s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]
is a statistic.
-
Also, the correlation \(\rho_{X,Y}\) is a parameter,
while the sample correlation \(r=r_{xy}\) is a statistic.
Recall the formula for the sample correlation:
\[
r=r_{xy}=
\frac{1}{n-1}\sum_{i=1}^n \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y}
\]
Compare with the formula for \(\mathrm{corr}(X,Y)\), which involves
double summation.
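These sample formulas translate directly into code. Here is a plain-Python sketch with made-up data (used only for illustration) that computes the sample mean, sample variance, and sample correlation exactly as defined above.
```python
import math

# Made-up paired data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xbar = sum(xs) / n                                   # sample mean of x
ybar = sum(ys) / n                                   # sample mean of y
sx2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)     # sample variance s_x^2
sy2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)     # sample variance s_y^2
sx, sy = math.sqrt(sx2), math.sqrt(sy2)

# Sample correlation r = (1/(n-1)) * sum of standardized products.
r = sum(((x - xbar) / sx) * ((y - ybar) / sy)
        for x, y in zip(xs, ys)) / (n - 1)

print(xbar, sx2, r)
```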
Conditional probability
-
Know the meaning of \( P(A\,|\,B)\), the conditional probability of \(A\) given \(B\).
- Know the formula:
\[ P(A\,|\,B) = \frac{P(A\cap B)}{P(B)} \]
-
Know the rules of probability for conditional probabilities.
The function \( P'(A) = P(A|B) \) satisfies all Probability Rules
for fixed \(B\). For example,
\[ P(A\cup C|B) = P(A|B) + P(C|B) - P(A\cap C|B) \]
Thus, if you learned a rule for ordinary, non-conditional probability,
there is a corresponding rule for the conditional probability.
-
Know the Law of Alternatives, also known as the Total Probability Formula.
Let \(C_1,C_2,\ldots,C_n\) be
mutually disjoint events ("causes"):
\[C_i\cap C_j = \emptyset\quad\text{when $i\neq j$}\]
and exhaustive events:
\[ C_1\cup C_2\cup \ldots \cup C_n = S \]
Then for every event \(A\) ("consequence"):
\[ P(A) = \sum_{i=1}^n P(A|C_i)\,P(C_i) \]
-
Know Bayes' Formula:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
-
An alternative form of Bayes' Formula for the probability of
the cause, given a known consequence:
\[ P(C_i|A) = \frac{P(A|C_i) P(C_i)}{\sum_{j=1}^nP(A|C_j) P(C_j)} \]
-
Know how to apply Bayes' Formula to common examples discussed by
the book and slides.
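A typical application is the medical-testing setting mentioned in the last-minute tips. The sketch below uses made-up numbers (prevalence, sensitivity, and false-positive rate; these are NOT the numbers from the slides) to compute the probability of disease given a positive test, using the Total Probability Formula and Bayes' Formula.
```python
# Hypothetical numbers for a screening test (for illustration only).
p_disease = 0.01                    # P(C1): prevalence of the disease
p_healthy = 1 - p_disease           # P(C2)
p_pos_given_disease = 0.90          # P(A | C1): sensitivity
p_pos_given_healthy = 0.08          # P(A | C2): false-positive rate

# Total Probability Formula: P(A) = sum over causes of P(A | Ci) * P(Ci).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes' Formula: P(C1 | A) = P(A | C1) * P(C1) / P(A).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(positive) = {p_pos:.4f}")
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")   # about 0.10
```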
-
You may find it useful to read the following article on
Bayes' Theorem
-
The Monty Hall Problem
provides an interesting example of an application of conditional probability.
This example is often used in job interviews.
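If you want to convince yourself of the (famously counterintuitive) answer, here is a small Python simulation sketch; the win rates it estimates, about \(1/3\) for staying and \(2/3\) for switching, agree with the conditional-probability analysis.
```python
import random

random.seed(1)
trials = 100_000
stay_wins = switch_wins = 0

for _ in range(trials):
    doors = [0, 1, 2]
    car = random.choice(doors)                 # the prize is behind one door
    pick = random.choice(doors)                # contestant's initial pick
    # Monty opens a door that is neither the pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    switched = next(d for d in doors if d != pick and d != opened)
    stay_wins += (pick == car)
    switch_wins += (switched == car)

print("P(win | stay)   =", stay_wins / trials)     # about 1/3
print("P(win | switch) =", switch_wins / trials)   # about 2/3
```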