A qualitative factor is a variate with a finite, discrete set of values. A factor can be unordered or ordered.
An example of an unordered factor is eye color in a population sample.
An example of an ordered factor is a student's grade on a standardized test.
An ordered factor does not have to be a quantitative factor, as in this last example. An example of a quantitative factor would be the amount of a fertilizer per acre in a crop yield experiment.
Statistical software, such as R, distinguishes between quantitative and qualitative (ordered and unordered) factors when performing analysis of variance. For example, when dealing with an unordered qualitative factor, R assigns the default set of contrasts to the factor, as generated by the function contr.treatment. These contrasts are
$$\mu_i - \mu_1,$$
where $i = 2, \ldots, k$, $\mu_i$ is the mean of the $i$-th treatment, and $k$ is the number of levels of the factor (typically equal to the number of treatments).
We note that the maximum number of linearly independent contrasts is $k-1$, due to the requirement of orthogonality to $\bar{Y}$, the sample mean (equivalently, the coefficients of a contrast must sum to $0$).
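Two contrasts are called orthogonal if they are not correlated as random variables. If we have two contrasts, $C_1$ and $C_2$, and
$$C_1 = \sum_{i=1}^{k} a_i \bar{Y}_i, \qquad C_2 = \sum_{i=1}^{k} b_i \bar{Y}_i,$$
where $\bar{Y}_i$ is the sample mean of the $i$-th of the $k$ treatment groups, then, under the assumption of independence of the sample,
$$\operatorname{Cov}(C_1, C_2) = \sum_{i=1}^{k} a_i b_i \operatorname{Var}(\bar{Y}_i).$$
If the $i$-th treatment group has $n_i$ units then
$$\bar{Y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij},$$
and it is easy to see that
$$\operatorname{Var}(\bar{Y}_i) = \frac{\sigma^2}{n_i},$$
where $\sigma^2$ is the population variance.
Thus, we have the following formula for the covariance of the contrasts:
$$\operatorname{Cov}(C_1, C_2) = \sigma^2 \sum_{i=1}^{k} \frac{a_i b_i}{n_i}.$$
We note that this is a bilinear form of the coefficients. It is positive definite. Moreover, when $n_1 = n_2 = \cdots = n_k = n$ then this form is
$$\frac{\sigma^2}{n} \sum_{i=1}^{k} a_i b_i,$$
and thus proportional to the standard dot product of the vectors $a = (a_1, \ldots, a_k)$ and $b = (b_1, \ldots, b_k)$. Thus, we may identify contrasts with the vectors of their coefficients:
$$C_1 \leftrightarrow (a_1, \ldots, a_k), \qquad C_2 \leftrightarrow (b_1, \ldots, b_k),$$
for the purpose of being able to apply linear algebra and geometric intuition in their study.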
The important fact about bilinear forms is that they define the notion of orthogonality. If the treatment groups are not equal in size then the notion of orthogonality is non-standard, i.e. different from the one defined by the standard dot product.
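To make the dependence on group sizes concrete, here is a minimal R sketch that evaluates the covariance of two contrasts up to the factor $\sigma^2$, that is, the sum of $a_i b_i / n_i$ over the groups. The helper name `contrast_form` and the group sizes are assumptions chosen for illustration only.

```r
# Bilinear form sum_i a_i * b_i / n_i, i.e. Cov(C1, C2) / sigma^2.
# `contrast_form` is a hypothetical helper name, not part of R.
contrast_form <- function(a, b, n) sum(a * b / n)

H <- contr.helmert(4)           # columns hold contrast coefficients
a <- H[, 1]                     # (-1,  1, 0, 0)
b <- H[, 2]                     # (-1, -1, 2, 0)

n_equal   <- c(5, 5, 5, 5)      # illustrative equal group sizes
n_unequal <- c(2, 8, 5, 5)      # illustrative unequal group sizes

contrast_form(a, b, n_equal)    # 0: orthogonal when groups are equal
contrast_form(a, b, n_unequal)  # 0.375: not orthogonal any more
```

The same pair of coefficient vectors is orthogonal for equal groups and correlated for unequal ones, which is the sense in which orthogonality becomes non-standard.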
The case of mutually orthogonal contrasts is especially important. Under the assumption of normality, this implies independence of the contrasts as random variables. Moreover, in analysis of variance, in this case the sum of squares splits exactly between the contrasts, which is essentially the definition of a balanced design. We note that for unequal treatment groups the default contrasts are not orthogonal, and thus the design is not balanced. In R, an explicit warning is printed if the design is not balanced. There could be other reasons for this warning than the contrasts not being orthogonal.
The Gram-Schmidt process may be used to fix a set of contrasts which are not orthogonal, but this typically leads to the loss of the clear intuitive form.
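As a rough sketch of this idea, the R code below runs classical Gram-Schmidt on the columns of a contrast matrix, using the inner product given by the sum of $a_i b_i / n_i$ rather than the standard dot product. The function name `gram_schmidt_contrasts` and the group sizes are hypothetical; this is one possible implementation, not a built-in R facility.

```r
# Gram-Schmidt with respect to <a, b> = sum(a * b / n).
# `gram_schmidt_contrasts` is a hypothetical helper, not an R built-in.
gram_schmidt_contrasts <- function(C, n) {
  ip <- function(a, b) sum(a * b / n)   # weighted inner product
  Q <- C
  for (j in seq_len(ncol(C))) {
    v <- C[, j]
    for (l in seq_len(j - 1)) {         # empty loop when j == 1
      # subtract the projection of column j onto the l-th new contrast
      v <- v - ip(C[, j], Q[, l]) / ip(Q[, l], Q[, l]) * Q[, l]
    }
    Q[, j] <- v
  }
  Q
}

n <- c(2, 8, 5, 5)                      # illustrative unequal group sizes
Q <- gram_schmidt_contrasts(contr.sum(4), n)
G <- t(Q) %*% (Q / n)                   # Gram matrix in the weighted form
round(Q, 3)                             # coefficients are no longer "nice"
round(G, 10)                            # off-diagonal entries vanish
```

Each new column is a linear combination of the original ones, so its coefficients still sum to zero, but the simple 0, 1, -1 pattern of the input is lost.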
Alternatively, one may evaluate the impact of non-orthogonality on analysis of variance. For small deviations from the equal treatment group condition, the impact will be small, resulting in somewhat higher P-values of the F-test.
In addition to the default contrasts, R offers two other "brand" contrasts for qualitative factors:
- Helmert contrasts (generated by contr.helmert)
- Sum contrasts (generated by contr.sum)
Helmert contrasts
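These contrasts are used to express the research hypothesis that the $(i+1)$-th treatment is better than the mean of all treatments from $1$ to $i$. Thus,
$$C_i = \mu_{i+1} - \frac{1}{i} \sum_{j=1}^{i} \mu_j, \qquad i = 1, \ldots, k-1,$$
where $\mu_j$ is the mean of the $j$-th treatment and $k$ is the number of treatments.
In R, the $i$-th contrast is multiplied by $i$ to make the coefficients integer.
Helmert contrasts are orthogonal.
The following R example yields Helmert contrasts for 4 treatment levels. Column $i$ yields the coefficients of the $i$-th contrast.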
> contr.helmert(4)
  [,1] [,2] [,3]
1   -1   -1   -1
2    1   -1   -1
3    0    2   -1
4    0    0    3
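The properties of this matrix can be checked directly. The short sketch below uses only base R functions; the variable name `H` is arbitrary.

```r
H <- contr.helmert(4)

colSums(H)                # all zero: every column is a contrast
crossprod(H)              # t(H) %*% H: diagonal 2, 6, 12; zeros elsewhere

# Dividing column i by i undoes the integer scaling and gives the
# coefficients of "level i+1 minus the mean of levels 1 to i".
sweep(H, 2, 1:3, "/")
```

The zero off-diagonal entries of `crossprod(H)` express orthogonality with respect to the standard dot product, which is the relevant notion when all treatment groups have the same size.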
Sum contrasts
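They can be used to compare the $i$-th mean with the last one. Unlike Helmert contrasts, they are not mutually orthogonal with respect to the standard dot product: any two distinct columns of the matrix below have dot product $1$.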
The following R example gives the coefficients for the sum contrasts:
> contr.sum(4)
  [,1] [,2] [,3]
1    1    0    0
2    0    1    0
3    0    0    1
4   -1   -1   -1
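The pairwise dot products of these columns can be inspected in the same way as for any contrast matrix. This is a minimal check using base R only; `S` is an arbitrary variable name.

```r
S <- contr.sum(4)

colSums(S)      # all zero: every column is a contrast
crossprod(S)    # diagonal entries 2, off-diagonal entries 1
```

The nonzero off-diagonal entries show that distinct sum contrasts are correlated even when all treatment groups have the same size.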
Studentizing a contrast
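For a contrast with coefficients $a_1, \ldots, a_k$, where $\bar{Y}_i$ and $n_i$ are the sample mean and the size of the $i$-th treatment group, the statistic
$$t = \frac{\sum_{i=1}^{k} a_i \bar{Y}_i}{s \sqrt{\sum_{i=1}^{k} a_i^2 / n_i}}$$
has the Student t-distribution with $N-k$ degrees of freedom (as long as $\sum_{i=1}^{k} a_i \mu_i = 0$). Here $N = n_1 + \cdots + n_k$ is the total number of units and $s$ is the square root of the estimate of the variance:
$$s^2 = \frac{1}{N-k} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2.$$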
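A minimal sketch of this computation in R follows, assuming the usual pooled within-group variance estimate with $N-k$ degrees of freedom. The helper `contrast_t` and its argument names are hypothetical, and the data are simulated purely for illustration.

```r
# Studentized contrast: estimate / (s * sqrt(sum(a^2 / n))).
# `contrast_t` is a hypothetical helper; y = responses, g = factor, a = coefficients.
contrast_t <- function(y, g, a) {
  g     <- as.factor(g)
  means <- tapply(y, g, mean)                 # group means
  n     <- tapply(y, g, length)               # group sizes n_i
  df    <- length(y) - nlevels(g)             # N - k
  s2    <- sum((y - means[as.character(g)])^2) / df   # pooled variance estimate
  est   <- sum(a * means)                     # value of the contrast
  tstat <- est / sqrt(s2 * sum(a^2 / n))
  c(estimate = est, t = tstat, df = df,
    p.value = 2 * pt(-abs(tstat), df))        # two-sided P-value
}

set.seed(1)                                   # simulated data, illustration only
g <- gl(4, 6)                                 # 4 treatments, 6 units each
y <- rnorm(24, mean = c(10, 10, 11, 12)[g])
contrast_t(y, g, contr.helmert(4)[, 3])       # level 4 vs. mean of levels 1 to 3
```

With two groups and coefficients `c(-1, 1)` the statistic reduces to the classical pooled two-sample t statistic, which gives a convenient sanity check against `t.test(..., var.equal = TRUE)`.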
Treatment "contrasts" of R
The default setting for R contrasts is somewhat puzzling:
> contr.treatment(4)
  2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
We observe that the sum of the coefficients in each column is not $0$, so these columns are not contrasts in the strict sense. What R calls contrasts, others may call index variables. The bottom line is what the model matrix (which may be called the design matrix by others) is, as this matrix determines which least squares problem is solved. The following R code (which should be put in a file "treatment.R") illustrates this statement:
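# A factor with 4 levels and 2 units per level
Treatment <- as.factor(c("a","a","b","b","c","c","d","d"))
# Explicitly assign the default (treatment) coding
contrasts(Treatment) <- contr.treatment(4)
# Model matrix for a one-way layout with an intercept
m <- model.matrix(~ Treatment)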
We run the program with the following result:
> source("treatment.R")
> as.data.frame(m)
  (Intercept) Treatment2 Treatment3 Treatment4
1           1          0          0          0
2           1          0          0          0
3           1          1          0          0
4           1          1          0          0
5           1          0          1          0
6           1          0          1          0
7           1          0          0          1
8           1          0          0          1
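This model matrix is consistent with the model:
$$Y_{ij} = \mu_1 + (\mu_i - \mu_1) + \varepsilon_{ij}, \qquad i = 1, \ldots, 4,$$
in which the intercept column carries $\mu_1$ and the index variable for level $i$ carries the difference $\mu_i - \mu_1$.
In other words, we use the first mean as the base.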
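The role of the first mean as the base can be seen by fitting a linear model and comparing its coefficients with the group means. The sketch below assumes R's default setting, under which unordered factors receive contr.treatment coding; the response values are simulated for illustration only.

```r
set.seed(1)                                   # simulated responses, illustration only
Treatment <- gl(4, 2, labels = c("a", "b", "c", "d"))
y <- rnorm(8, mean = c(1, 2, 3, 4)[Treatment])

fit   <- lm(y ~ Treatment)                    # uses treatment coding by default
means <- tapply(y, Treatment, mean)           # group means

coef(fit)                                     # intercept, then three differences
c(means[1], means[-1] - means[1])             # the same numbers from the means
```

The intercept equals the mean of the first group, and every other coefficient equals the difference between a group mean and the first one.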