The purpose of ordination is to simplify the interpretation of a complex data set. However, this purpose is defeated if there is a very large number of environmental variables. What counts as a "very large number" is largely a matter of taste and of the objectives of the analysis. Including dozens of environmental variables in a CCA diagram may be very informative to the investigator in an exploratory phase of a study, yet it is difficult to communicate the major patterns of compositional variation to colleagues if the ordination diagrams are cluttered by more than half a dozen arrows (unless the patterns are obvious). Thus it is sometimes desirable to reduce the number of variables included in an analysis.
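Another problem with the inclusion of many variables (e.g. if the number of variables approaches the number of samples) is that the arch effect may appear, because the constraints then become so weak that CCA approaches an unconstrained correspondence analysis.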
It is impossible to reduce the number of variables without some loss of information. When removing variables, one should make sure to retain as much ecologically relevant information as possible. Four ways to do this are described below. The first two (if performed a priori) are valid for both exploratory and hypothesis-driven research; the last two are valid only for exploratory research.
Variables can be chosen for reasons external to the data set. One such reason could be comparability with other studies. If another study was performed in a similar region, it might be valuable to use exactly the same variables. Another external criterion might involve the biology of the species involved. For example, a previous study might have determined that the rooting depth of most of the species did not exceed 15 cm. If so, it might not be useful to include the results of analyses of soil collected at 25 cm.
Environmental variables may be highly correlated, or "redundant", with one another. For example, soil pH, calcium, magnesium, and cation exchange capacity are usually very tightly correlated. If so, any one of these variables could be used as a proxy for all the others. Generally, it is best to choose the variable that is most likely to be the direct cause of species response, and/or a variable that has been used in other ecological studies.
It might not be known beforehand which variables are correlated with each other. In this case, a detailed examination of the correlation matrix would be helpful. A relatively sophisticated way to do this would be to perform a Principal Components Analysis (PCA) on the correlation matrix, and to choose the environmental variable which is most strongly associated with each of the first several principal axes. I do not recommend using the PCA axes themselves as environmental variables for CCA, because this makes the interpretation of the CCA diagrams very confusing.
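The PCA-based screening described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the function name `representative_variables`, the variable names, and the simulated data are all made up for the example, and "most strongly associated" is taken to mean "largest absolute loading".

```python
import numpy as np

def representative_variables(env, names, n_axes=2):
    """For each of the first n_axes principal axes of the correlation matrix,
    pick the not-yet-chosen variable with the largest absolute loading."""
    corr = np.corrcoef(env, rowvar=False)        # variables are columns
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loadings: correlation of each variable with each principal axis.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    chosen = []
    for axis in range(n_axes):
        for j in np.argsort(-np.abs(loadings[:, axis])):
            if names[j] not in chosen:
                chosen.append(names[j])
                break
    return chosen, loadings

# Made-up data: four "base status" variables and two "moisture" variables.
rng = np.random.default_rng(1)
n = 60
base = rng.normal(size=n)
wet = rng.normal(size=n)
names = ["pH", "Ca", "Mg", "CEC", "moisture", "organic_matter"]
env = np.column_stack(
    [base + 0.2 * rng.normal(size=n) for _ in range(4)]
    + [wet + 0.2 * rng.normal(size=n) for _ in range(2)]
)
chosen, loadings = representative_variables(env, names)
print(chosen)  # one variable per principal axis
```

Which member of a tightly correlated group comes out on top is close to arbitrary, so the ecological criteria given above (direct cause of species response, use in other studies) should still decide between near-ties.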
Removal of redundant variables might assist in interpretability, but it must be noted that as long as the correlation between two variables is less than 1, there is some variation in each variable which is not redundant with the other. This variation could potentially influence species composition. The existence of intercorrelated variables is not an obstacle for CCA, but it may be an obstacle for interpretation.
Variables can be removed post hoc in exploratory analyses if they don't seem to explain variation along major axes in an easily interpretable way. The user must be aware, however, that even if the result is uninterpretable based on current expertise and intuition, it may represent a very real and important feature of species composition. Because of this, I recommend removing only variables which are represented by very short arrows in the first several CCA axes, or variables whose arrows are almost identical to those of other variables.
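As a rough aid to this kind of inspection, the following Python sketch flags short arrows and near-duplicate arrows from a table of biplot scores. Everything in it is hypothetical: the scores are invented, the function name `flag_arrows` is made up, and the thresholds (`min_length`, `min_cosine`, `min_ratio`) are arbitrary choices that would need to be adapted to the scaling of a real diagram.

```python
import numpy as np

def flag_arrows(scores, names, min_length=0.2, min_cosine=0.98, min_ratio=0.9):
    """Return (short, twins): variables with short arrows, and pairs of
    variables whose arrows nearly coincide in direction and length."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.linalg.norm(scores, axis=1)     # arrow length over the axes given
    short = [name for name, length in zip(names, lengths) if length < min_length]
    twins = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if min(lengths[i], lengths[j]) < min_length:
                continue                         # short arrows are reported separately
            cosine = scores[i] @ scores[j] / (lengths[i] * lengths[j])
            ratio = min(lengths[i], lengths[j]) / max(lengths[i], lengths[j])
            if cosine >= min_cosine and ratio >= min_ratio:
                twins.append((names[i], names[j]))
    return short, twins

# Invented biplot scores on the first two axes (rows are variables).
names = ["pH", "Ca", "moisture", "slope"]
scores = [[0.80, 0.10], [0.78, 0.12], [-0.10, 0.70], [0.05, -0.03]]
short, twins = flag_arrows(scores, names)
print(short)  # ['slope']
print(twins)  # [('pH', 'Ca')]
```

A flagged variable is only a candidate for removal; the caution above about uninterpretable but real patterns still applies.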
Uninformative variables can be removed by continual inspection of the results, as suggested above, or they can be removed by a semi-automated stepwise procedure. Stepwise analysis includes a wide variety of techniques that have proved useful in exploratory multiple regression (Draper and Smith 1981). The package CANOCO includes a stepwise procedure known as forward selection, which adds environmental variables one at a time, until no other variables "significantly" explain residual variation in species composition. The CANOCO procedure is semi-automated, and allows the user to make choices about inclusion of variables. The procedure occurs as follows:
"Significant" is in quotation marks because it does not represent true statistical significance: even with random data, you will get a large number of falsely significant results due to the problem of multiple comparisons.
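The general idea of forward selection, and the reason its p-values are optimistic, can be sketched in Python with NumPy. This is not CANOCO's implementation. It is a minimal sketch that assumes a simple permutation test in which the values of the candidate variable are shuffled among samples; the function names, the simulated data, and the defaults (`alpha`, `n_perm`) are all assumptions made for the example. Note that only the best of all remaining candidates is tested at each step, which is exactly where the multiple comparisons problem enters.

```python
import numpy as np

def constrained_inertia(Y, X):
    """Inertia of the sites-by-species table Y accounted for by a CCA
    constrained by the columns of X (sum of the canonical eigenvalues)."""
    P = Y / Y.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)            # site and species weights
    expected = np.outer(r, c)
    Q = (P - expected) / np.sqrt(expected)         # chi-square standardized residuals
    Xw = (X - r @ X) * np.sqrt(r)[:, None]         # weighted centering and scaling
    coef, *_ = np.linalg.lstsq(Xw, Q, rcond=None)  # weighted multivariate regression
    return float(np.sum((Xw @ coef) ** 2))

def forward_selection(Y, X, names, alpha=0.05, n_perm=199, seed=0):
    """Add the variable with the largest gain until its permutation p-value
    exceeds alpha (a simplified, assumed stopping rule)."""
    rng = np.random.default_rng(seed)
    selected, remaining, explained = [], list(range(X.shape[1])), 0.0
    while remaining:
        gains = {j: constrained_inertia(Y, X[:, selected + [j]]) - explained
                 for j in remaining}
        best = max(gains, key=gains.get)           # best of MANY candidates
        trial = X[:, selected + [best]].copy()
        hits = 0
        for _ in range(n_perm):
            trial[:, -1] = rng.permutation(X[:, best])
            if constrained_inertia(Y, trial) - explained >= gains[best]:
                hits += 1
        p_value = (hits + 1) / (n_perm + 1)
        if p_value > alpha:
            break
        selected.append(best)
        remaining.remove(best)
        explained += gains[best]
    return [names[j] for j in selected]

# Simulated data: eight species with unimodal responses to one real gradient,
# plus four environmental variables that are pure noise.
rng = np.random.default_rng(42)
n_sites = 40
gradient = np.linspace(0.0, 1.0, n_sites)
optima = np.linspace(0.0, 1.0, 8)
mean_abundance = 30 * np.exp(-((gradient[:, None] - optima) ** 2) / (2 * 0.25 ** 2))
Y = rng.poisson(mean_abundance)
X = np.column_stack([gradient, rng.normal(size=(n_sites, 4))])
names = ["gradient", "noise1", "noise2", "noise3", "noise4"]
picked = forward_selection(Y, X, names)
print(picked)  # the real gradient should enter first
```

Replacing every column of `X` with noise and repeating the run over many seeds gives a feel for how often a meaningless variable is retained as "significant".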
Categorical variables pose a special problem in stepwise analysis, because they are coded as dummy variables (see Environmental variables in constrained ordination). Suppose you had five bedrock types, but only "limestone" was selected by forward selection. Should you select the other variables in the subsequent stages of analysis? The answer is subtle: If you consider there to be one categorical variable (bedrock) with five states, then the answer is yes; if you consider the different bedrock types to be independent of each other, the answer is no.
It is important to realize that the last two remaining variables within a category will always have identical fit, because they contain identical information (if it is not one, then it must be the other). It does not matter which one is chosen, but it is often more interpretable to choose the more common category.
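The identical fit of the last two categories follows from the fact that the dummy variables of one categorical variable sum to 1 in every sample. A small Python check, using an invented set of 20 samples and ordinary least squares as a stand-in for the regression step inside CCA (both are assumptions made for the illustration):

```python
import numpy as np

# Invented bedrock types for 20 samples.
types = ["limestone", "granite", "shale", "sandstone", "basalt"]
bedrock = np.array(["limestone"] * 6 + ["granite"] * 5 + ["shale"] * 4
                   + ["sandstone"] * 3 + ["basalt"] * 2)
dummy = {t: (bedrock == t).astype(float) for t in types}

def residual(y, predictors):
    """Part of y not explained by an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Suppose three of the five types have already entered the model.
entered = [dummy[t] for t in ("limestone", "granite", "shale")]
res_sandstone = residual(dummy["sandstone"], entered)
res_basalt = residual(dummy["basalt"], entered)

# What is left of one dummy is the exact mirror image of the other,
# so either one adds the same information to the model.
print(np.allclose(res_sandstone, -res_basalt))  # True
```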
There is no guarantee that forward selection (or any other stepwise procedure) will result in the "best" set of environmental variables. The only way to determine the best set of variables is to perform a separate CCA for every conceivable combination of variables; this quickly becomes impractical, because the number of combinations roughly doubles with every additional variable. However, the lack of a guarantee should not be a concern for those performing exploratory analyses: the objective is to determine a limited number of variables which explain species composition well, and not to be fixated on p-values and mathematical purity.
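To put a number on "every conceivable combination": k variables have 2^k - 1 non-empty subsets, each of which would need its own CCA. A few lines of Python (the function name `n_subsets` is just for this illustration) show how fast that grows:

```python
def n_subsets(k):
    """Number of non-empty subsets of k environmental variables."""
    return 2 ** k - 1

for k in (5, 10, 20, 40):
    print(k, n_subsets(k))
# 5 31
# 10 1023
# 20 1048575
# 40 1099511627775
```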
Draper, N. R., and H. Smith. 1981. Applied Regression Analysis. Second edition. Wiley, New York.
This page was created and is maintained by Michael Palmer.