Reducing the Number of Variables


The purpose of ordination is to simplify the interpretation of a complex data set. However, this purpose is defeated if there is a very large number of environmental variables. What counts as a "very large number" is largely a matter of taste and of the objectives of the analysis. Including dozens of environmental variables in a CCA diagram may be very informative to the investigator in an exploratory phase of his or her study, yet it is difficult to communicate the major patterns of compositional variation to colleagues if the ordination diagrams are cluttered by more than half a dozen arrows (unless the patterns are obvious). Thus it is sometimes desirable to reduce the number of variables included in an analysis.

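Another problem with the inclusion of many variables is that the arch effect may appear. As the number of variables approaches the number of samples, the constraints lose their force and the results of CCA approach those of unconstrained Correspondence Analysis (CA).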

It is impossible to reduce the number of variables without some loss of information. When removing variables, one should make sure to retain as much ecologically relevant information as possible. Four ways to do this are described below. The first two (if performed a priori) are valid for both exploratory and hypothesis-driven research; the last two are valid only for exploratory research.

Selection by external criteria

Variables can be chosen for reasons external to the data set. One such reason could be comparability to other studies. If another study was performed in a similar region, it might be valuable to use exactly the same variables. Another external criterion might involve the biology of the species involved. For example, a previous study might have determined that the rooting depth of most of the species did not exceed 15 cm. If so, it might not be useful to include the results of analyses of soil collected at a depth of 25 cm.

Examination of correlation structure

Environmental variables may be highly correlated or "redundant" with one another. For example, soil pH, calcium, magnesium, and cation exchange capacity are usually very tightly correlated. If so, any one of these variables could be used as a proxy for all the others. Generally, it is best to choose the variable which is most likely to be the direct cause of species response, and/or a variable which has been used in other ecological studies.

It might not be known beforehand which variables are correlated with each other. In this case, a detailed examination of the correlation matrix would be helpful. A relatively sophisticated way to do this would be to perform a Principal Components Analysis (PCA) on the correlation matrix, and to choose the environmental variable which is most strongly associated with each of the first several principal axes. I do not recommend using the PCA axes themselves as environmental variables for CCA, because this makes the interpretation of the CCA diagrams very confusing.
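The PCA-based screening described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch: the function name representative_variables, the variable names, and the simulated data are assumptions made for the example, and "most strongly associated" is interpreted here as the largest absolute loading on each axis.

```python
import numpy as np

def representative_variables(env, names, n_axes):
    """Pick, for each of the first n_axes principal axes of the correlation
    matrix, the not-yet-chosen variable with the largest absolute loading."""
    corr = np.corrcoef(env, rowvar=False)            # variables are columns
    eigvals, eigvecs = np.linalg.eigh(corr)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
    # Loadings: correlations between the original variables and the axes.
    loadings = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))
    chosen = []
    for k in range(n_axes):
        for j in np.argsort(-np.abs(loadings[:, k])):
            if names[j] not in chosen:
                chosen.append(names[j])
                break
    return chosen

# Hypothetical data: three tightly correlated soil variables plus one
# unrelated variable, measured at 40 samples.
rng = np.random.default_rng(0)
base = rng.normal(size=40)
env = np.column_stack([
    base + 0.1 * rng.normal(size=40),   # "pH"
    base + 0.1 * rng.normal(size=40),   # "Ca"
    base + 0.1 * rng.normal(size=40),   # "Mg"
    rng.normal(size=40),                # "elevation"
])
picks = representative_variables(env, ["pH", "Ca", "Mg", "elevation"], n_axes=2)
print(picks)
```

In this simulated case the first axis is represented by one of the three soil variables and the second by the unrelated variable. Which of the redundant variables is picked is essentially arbitrary, so the ecological considerations described above should still guide the final choice.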

Removal of redundant variables might assist in interpretability, but it must be noted that as long as the correlation between two variables is less than 1, there is some variation in each variable which is not redundant with the other. This variation could potentially influence species composition. The existence of intercorrelated variables is not an obstacle for CCA, but it may be an obstacle for interpretation.

Interpretability

Variables can be removed post-hoc in exploratory analyses if they don't seem to explain variation along major axes in an easily interpretable way. The user must be aware, however, that even if the result is uninterpretable based on current expertise and intuition, it may represent a very real and important feature of species composition. Because of this, I recommend removing only variables which are represented by very short arrows on the first several CCA axes, or variables whose arrows are almost identical to those of other variables.

Stepwise Analysis

Uninformative variables can be removed by continual inspection of the results, as suggested above, or they can be removed by a semiautomated stepwise procedure. Stepwise analysis includes a wide variety of techniques that have proved useful in exploratory multiple regression (Draper and Smith 1981). The package CANOCO includes a stepwise procedure known as forward selection, which adds environmental variables one at a time, until no other variables "significantly" explain residual variation in species composition. The CANOCO procedure is semiautomated, and allows the user to make choices about inclusion of variables. The procedure occurs as follows:

  1. CCA is performed on each variable separately.
  2. CANOCO will tell you how well each of these variables fits the vegetation data.
  3. You choose one of these variables - either on the basis of the goodness of fit, or on how important you think it is that the variable be included in the analysis.
  4. This variable can be tested for "significance" using a randomization test.
  5. CANOCO will use this variable as a covariable, and perform a partial Canonical Correspondence Analysis (pCCA) on each of the remaining variables separately.
  6. Return to step 2 and repeat with the remaining variables.
  7. You decide when to stop - conventionally when the "significance level" (p) is no longer less than 0.05. Alternatively, you can automatically stop once you include a fixed number of variables.
  8. You end by performing a CCA on all of the "significant" variables together.
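The steps above can be sketched in Python with NumPy. This is a simplified illustration under several assumptions, not CANOCO's implementation: the function names cca_inertia and forward_selection are hypothetical, the best-fitting variable is chosen automatically at step 3 rather than by the user, goodness of fit is measured as the extra inertia explained once the already-selected variables are in the model, and the randomization test simply shuffles the candidate variable among samples.

```python
import numpy as np

def cca_inertia(Y, X):
    """Return (constrained, total) inertia for species table Y and predictors X."""
    P = Y / Y.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)          # sample and species weights
    expected = np.outer(r, c)
    Q = (P - expected) / np.sqrt(expected)       # chi-square standardized residuals
    Xw = (X - r @ X) * np.sqrt(r)[:, None]       # weighted centring and scaling
    coef, *_ = np.linalg.lstsq(Xw, Q, rcond=None)
    return float(((Xw @ coef) ** 2).sum()), float((Q ** 2).sum())

def forward_selection(Y, env, names, alpha=0.05, n_perm=199, seed=0):
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    selected, remaining, current = [], list(range(env.shape[1])), 0.0
    while remaining:
        # Steps 1-2 and 5: fit each remaining variable, with the variables
        # already selected acting as covariables (gain = extra inertia explained).
        gains = {j: cca_inertia(Y, env[:, selected + [j]])[0] - current
                 for j in remaining}
        best = max(gains, key=gains.get)         # step 3, automated in this sketch

        def pseudo_f(column):
            X = env[:, selected + [best]].copy()
            X[:, -1] = column                    # candidate occupies the last column
            fit, total = cca_inertia(Y, X)
            return (fit - current) / ((total - fit) / (n - X.shape[1] - 1))

        # Step 4: randomization test (assumed scheme: shuffle the candidate).
        observed = pseudo_f(env[:, best])
        hits = sum(pseudo_f(rng.permutation(env[:, best])) >= observed
                   for _ in range(n_perm))
        p = (hits + 1) / (n_perm + 1)
        if p >= alpha:                           # step 7: conventional stopping rule
            break
        selected.append(best)                    # step 6: repeat with the rest
        remaining.remove(best)
        current += gains[best]
    return [names[j] for j in selected]          # variables for the final CCA (step 8)

# Hypothetical data: 8 species with unimodal responses along one gradient
# ("moisture"), plus two pure-noise variables, at 30 samples.
rng = np.random.default_rng(1)
moisture = np.linspace(0.0, 1.0, 30)
optima = np.linspace(0.0, 1.0, 8)
Y = np.round(20 * np.exp(-(moisture[:, None] - optima) ** 2 / (2 * 0.2 ** 2))) + 1
env = np.column_stack([moisture, rng.normal(size=30), rng.normal(size=30)])
chosen = forward_selection(Y, env, ["moisture", "noise1", "noise2"])
print(chosen)
```

The sketch requires every sample and every species to have a nonzero total. Because a noise variable can still pass the test by chance, the list returned may occasionally contain more than the one variable that actually structures the simulated data.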

The reason "significant" is in quotation marks is that it does not represent true statistical significance - even with random data, you will get a large number of falsely significant results due to the problem of multiple comparisons.
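A rough sense of the scale of the multiple comparisons problem can be had from a short calculation. The sketch below assumes k independent tests, each at the 0.05 level, on purely random data; the function name familywise_error is hypothetical. The tests in forward selection are not independent, and the best-fitting variable is picked at each step, so the figure should be read only as an illustration.

```python
def familywise_error(k, alpha=0.05):
    """Chance that at least one of k independent tests at level alpha
    comes out 'significant' when no variable has any real effect."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(k, round(familywise_error(k), 3))
```

With ten candidate variables, for example, the chance of at least one falsely significant result is already about 0.40.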

Categorical variables pose a special problem in stepwise analysis, because they are coded as dummy variables (see Environmental variables in constrained ordination). Suppose you had five bedrock types, but only "limestone" was selected by forward selection. Should you also select the other bedrock types in the subsequent stages of analysis? The answer is subtle: If you consider there to be one categorical variable (bedrock) with five states, then the answer is yes; if you consider the different bedrock types to be independent of each other, the answer is no.

It is important to realize that the last two remaining dummy variables of a categorical variable will always have identical fit, because once the others have been included they contain identical information (if a sample is not one, then it must be the other). It does not matter which one is chosen, but it is often more interpretable to choose the more common category.
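The identical fit of the last two dummy variables can be checked numerically. The sketch below is an illustration under assumptions: the bedrock names other than limestone are invented, and an ordinary unweighted regression stands in for the weighted regression used in CCA (the same argument applies with weights). After three of the five dummy variables are included as covariables, the residual parts of the remaining two are exact mirror images, so either one explains the same residual variation.

```python
import numpy as np

# Hypothetical bedrock type recorded at 20 samples.
types = ["limestone", "granite", "shale", "basalt", "sandstone"]
bedrock = np.array(types * 4)
dummies = {t: (bedrock == t).astype(float) for t in types}

def residual(target, covariables):
    """Part of target left over after regression on an intercept and covariables."""
    Z = np.column_stack([np.ones(len(target))] + covariables)
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef

# Three dummy variables are already in the model as covariables.
already_in = [dummies[t] for t in ("limestone", "granite", "shale")]
res_basalt = residual(dummies["basalt"], already_in)
res_sandstone = residual(dummies["sandstone"], already_in)

# The two residual variables differ only in sign.
print(np.allclose(res_basalt, -res_sandstone))
```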

There is no guarantee that forward selection (or any other stepwise procedure) will result in the "best" set of environmental variables. The only way to determine the best set of variables is to perform a separate CCA for every conceivable combination of variables; this is, in most cases, impossible with current technology because it involves an astronomically large number of combinations. However, the lack of a guarantee should not be a concern for those performing exploratory analyses: the objective is to determine a limited number of variables which explain species composition well, and not to be fixated on p-values and mathematical purity.
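The growth in the number of combinations is easy to compute: with p candidate variables, each variable is either in or out, which gives 2^p - 1 non-empty subsets. The function name n_subsets below is hypothetical.

```python
def n_subsets(p):
    """Number of non-empty subsets of p candidate variables."""
    return 2 ** p - 1

for p in (5, 10, 20, 40):
    print(p, n_subsets(p))
```

Ten variables already give 1023 possible analyses, and each additional variable roughly doubles the count.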


Reference Cited

(see also suggested references for self-education)

Draper, N. R., and H. Smith. 1981. Applied Regression Analysis. Second edition. Wiley, New York.



This page was created and is maintained by Michael Palmer.