\[ \def\floor#1{\left\lfloor #1\right\rfloor} \]

How to calculate the five number summary (correctly)?

The following summary yields a method compatible with the function fivenum of the statistical package R.

A note on notation

Instead of giving concrete numerical examples, we will write the data, sorted in ascending order, as
$$ x_1,\, x_2,\, \ldots,\, x_n $$
The sample size will always be denoted by \(n\).

The basics

The five number summary of a data sample consists of 5 numbers: the minimum, the first quartile \(Q_1\), the median, the third quartile \(Q_3\), and the maximum.

The "floor" function

If \(x\) is a real number then \(\floor{x}\) is the largest whole number not greater than \(x\). The floor function is available in Microsoft Excel and in most programming languages and software packages. When entering this function from the keyboard, one typically types
floor(x)
instead of \(\floor{x}\).

Examples
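
A few values evaluated in R (the language of fivenum) may help here. This is only a minimal illustration; the variable names vals and floored are chosen for this sketch.

```r
# floor() rounds down (towards minus infinity), not towards zero.
vals    <- c(2.7, 5, -2.7, 26.5)
floored <- floor(vals)
print(floored)   # 2 5 -3 26
```

Note that a whole number is left unchanged and that a negative number moves away from zero.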

Sorting of the data

The first step is ordering the data from the smallest to the largest.

Determining the positions of the five numbers

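Let \(n\) be the sample size. The positions in the sorted dataset are computed from \(n\) only. The positions may be fractional, that is, they may fall between the real data positions, which are \(1, 2, \ldots, n\) (whole numbers). Here are the positions, in the order of difficulty of understanding:

  1. Minimum: position \(1\);
  2. Maximum: position \(n\);
  3. Median: position \(\frac{n+1}{2}\);
  4. \(Q_1\): position \(n_4=\frac{\floor{\frac{n+3}{2}}}{2}\);
  5. \(Q_3\): position \(n+1-n_4\).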
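
The five positions can be sketched in R as follows. The formulas are the ones used in the fivenum code quoted at the end of this page; the function name five_positions and the element names are hypothetical and exist only for this illustration.

```r
# Positions of the five numbers in a sorted sample of size n.
# five_positions is a hypothetical helper, not part of R.
five_positions <- function(n) {
  stopifnot(n >= 1)
  n4 <- floor((n + 3) / 2) / 2          # position of Q1
  c(min = 1, Q1 = n4, median = (n + 1) / 2, Q3 = n + 1 - n4, max = n)
}

print(five_positions(50))   # 1, 13, 25.5, 38, 50
print(five_positions(19))   # 1, 5.5, 10, 14.5, 19
```

Each position is either a whole number or ends in .5, because it is a whole number divided by 2.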

How to calculate the five numbers from their position?

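The general rule is simple: if the position \(p\) is a whole number, the number is \(x_p\); if \(p\) is fractional (it then ends in .5), the number is the average of the two neighbouring data values, \(\frac{1}{2}\left(x_{\floor{p}}+x_{\floor{p}+1}\right)\).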
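
In R, both cases can be handled by one expression, the same one that fivenum uses: average the entries at the floor and at the ceiling of the position. For a whole position the two coincide, so the average is the entry itself. The following is a sketch; value_at is a hypothetical helper name and the data are made up for the illustration.

```r
# Value at a (possibly fractional) position p of an already sorted vector.
# value_at is a hypothetical helper, not part of R.
value_at <- function(x_sorted, p) {
  0.5 * (x_sorted[floor(p)] + x_sorted[ceiling(p)])
}

x <- sort(c(7, 1, 4, 10))     # sorted data: 1 4 7 10
print(value_at(x, 3))         # whole position: x[3] = 7
print(value_at(x, 2.5))       # fractional position: (4 + 7) / 2 = 5.5
```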

An interpretation of \(n_4\)

Let us elaborate on the meaning of the formula
$$ n_4 = \frac{\floor{\frac{n+3}{2}}}{2} $$
Since \(\frac{n+3}{2}=\frac{n+1}{2}+1\) and \(\floor{y+1}=\floor{y}+1\) for every real \(y\), this formula is equivalent to:
$$ n_4 = \frac{\floor{\frac{n+1}{2}}+1}{2} $$
The number \(m=\floor{\frac{n+1}{2}}\) is the position of the median when that position is a whole number (odd \(n\)); otherwise (even \(n\)) it is the nearest data position to the left of the median. Thus,
$$ n_4 = \frac{m+1}{2} $$
That is, \(n_4\) is the position of the median of the data \(x_1,\,x_2,\,\ldots,\,x_m\): all data to the left of the median, together with the median itself when \(n\) is odd. This gives an alternative method to compute the position of \(Q_1\):

Summary of the calculation of \(Q_1\)

  1. Compute the position of the median first; this is \(\frac{n+1}{2}\);
  2. Round it down to the nearest integer; this is \(m\) defined above, that is, the position of the median itself or, if that position is fractional, the nearest position to its left;
  3. Compute the position of the median of data \(x_1,\,x_2,\,\ldots,\,x_m\); this is \(n_4=\frac{m+1}{2}\);
  4. Compute \(Q_1\): if \(n_4\) is a whole number, \(Q_1 = x_{n_4}\); if \(n_4\) is fractional, \(Q_1 = \frac{1}{2}(x_{n_4'}+x_{n_4''})\), where \(n_4'=\floor{n_4}\) and \(n_4''=n_4'+1\) are the two whole numbers nearest to \(n_4\).
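
The four steps can be sketched in R and compared with the second element of fivenum, which is \(Q_1\). The function name q1_by_steps, the random test data and the seed are assumptions made for this illustration only.

```r
# Q1 computed by the four steps above (q1_by_steps is a hypothetical helper).
q1_by_steps <- function(x) {
  x <- sort(x)
  n <- length(x)
  stopifnot(n >= 1)
  m  <- floor((n + 1) / 2)   # steps 1 and 2: median position, rounded down
  n4 <- (m + 1) / 2          # step 3: position of the median of x[1..m]
  0.5 * (x[floor(n4)] + x[ceiling(n4)])   # step 4: covers both cases
}

set.seed(1)                  # arbitrary seed, only for reproducibility
agree <- logical(30)
for (n in 1:30) {
  x <- rnorm(n)
  agree[n] <- isTRUE(all.equal(q1_by_steps(x), fivenum(x)[2]))
}
print(all(agree))
```

The comparison covers sample sizes 1 to 30, so it exercises both even and odd \(n\) as well as whole and fractional values of \(n_4\).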

Examples


\(n=50\)

We compute \( n_4 \) (the position of the first quartile) first:
$$ n_4 = \frac{\floor{\frac{50+3}{2}}}{2} = \frac{\floor{26.5}}{2} = \frac{26}{2} = 13. $$
| Number | Position |
|---|---|
| Minimum | 1 |
| Maximum | 50 |
| Median | \( \frac{50+1}{2}=25.5 \) |
| \( Q_1 \) | 13 |
| \( Q_3 \) | \( 50+1-13=38 \) |

Thus, the positions of the five numbers are (in increasing order):
$$ 1,\, 13,\, 25.5,\, 38,\, 50 $$
The five numbers are:
$$ x_1,\, x_{13},\, \frac{x_{25}+x_{26}}{2},\, x_{38},\, x_{50} $$

\( n=65 \)

We compute \( n_4 \) first:
$$ n_4 = \frac{\floor{\frac{65+3}{2}}}{2} = \frac{\floor{34}}{2} = \frac{34}{2} = 17. $$
| Number | Position |
|---|---|
| Minimum | 1 |
| Maximum | 65 |
| Median | \( \frac{65+1}{2}=33 \) |
| \( Q_1 \) | 17 |
| \( Q_3 \) | \( 65+1-17=49 \) |

Thus, the positions of the five numbers are (in increasing order):
$$ 1,\, 17,\, 33,\, 49,\, 65 $$
The five numbers are:
$$ x_1,\,x_{17},\,x_{33},\,x_{49},\,x_{65} $$

\( n=19 \)

We compute \( n_4 \) first:
$$ n_4 = \frac{\floor{\frac{19+3}{2}}}{2} = \frac{\floor{11}}{2} = \frac{11}{2} = 5.5. $$
| Number | Position |
|---|---|
| Minimum | 1 |
| Maximum | 19 |
| Median | \( \frac{19+1}{2}=10 \) |
| \( Q_1 \) | 5.5 |
| \( Q_3 \) | \( 19+1-5.5=14.5 \) |

Thus, the positions of the five numbers are (in increasing order):
$$ 1,\, 5.5,\, 10,\, 14.5,\, 19 $$
The five numbers are:
$$ x_1,\,\frac{x_5+x_6}{2},\, x_{10},\, \frac{x_{14}+x_{15}}{2},\, x_{19} $$
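
The three examples can be checked in R with a small trick: for the data \(x_i = i\) (the vector 1:n) every value equals its own position, so fivenum should return the positions themselves. The variable names below are chosen for this sketch.

```r
# With x = 1:n each data value equals its position,
# so fivenum(1:n) reproduces the five positions.
res50 <- fivenum(1:50)
res65 <- fivenum(1:65)
res19 <- fivenum(1:19)

print(res50)   # 1 13 25.5 38 50
print(res65)   # 1 17 33 49 65
print(res19)   # 1 5.5 10 14.5 19
```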

For the curious: the R language code of fivenum

The following R session reveals the code of fivenum. The above explanation concerns the 3 lines of R code in the curly braces after the final "else":
> fivenum
function (x, na.rm = TRUE) 
{
    xna <- is.na(x)
    if (na.rm) 
        x <- x[!xna]
    else if (any(xna)) 
        return(rep.int(NA, 5))
    x <- sort(x)
    n <- length(x)
    if (n == 0) 
        rep.int(NA, 5)
    else {
        n4 <- floor((n + 3)/2)/2
        d <- c(1, n4, (n + 1)/2, n + 1 - n4, n)
        0.5 * (x[floor(d)] + x[ceiling(d)])
    }
}
>
	
Apart from those 3 lines, the code mainly deals with removing missing data, sorting, and the case of an empty sample.