Environmental Variables in Constrained Ordination (e.g. CCA, RDA, DCCA)


Choice of variables - The choice of environmental variables greatly influences the outcome of CCA and other constrained ordinations. For an exploratory analysis, one should certainly include variables which are thought to be related to the most important determinants of species composition. However, it is also often desirable to include other variables which are easy and inexpensive to measure - one may be surprised and find that previously unsuspected factors are quite important. In any case, one can always remove superfluous variables if they are confusing or difficult to interpret. On the other hand, variables for a hypothesis-driven analysis must be chosen very carefully in advance, because post-hoc removal of variables is not valid. See Hypothesis-driven and Exploratory Data Analysis.

It is possible to have only one environmental variable in a constrained ordination. In this case, the species scores indicate how species are arranged along this variable. Since such an ordination is 1-dimensional (i.e. there is one 'canonical axis'), it is not possible to produce a 2-dimensional figure from the canonical axes alone, and the results are better presented in tabular form. However, you can produce a 2-dimensional image in which the second axis is the first "residual" axis.

If you have at least as many variables as you have samples in CCA, then your ordination is no longer 'constrained', and your sample and species scores are the same as in correspondence analysis.  Another way of saying this is that you have explained 100% of the variation in species composition due to overfitting.  See Multiple Regression.  If you consider this overfitting to be undesirable, see Reducing the number of variables.

Interaction terms - Interaction terms are easy to implement in CANOCO, but they are often extremely difficult to interpret. If one suspected, for example, that elevation and precipitation interacted to influence species composition, then one could introduce a variable which is the product of elevation and precipitation. However, the ecological meaning of the location of particular stands or species in ordination space with respect to such a compound variable is unclear.

Another problem with interaction terms is that there can be many of them. With N variables, there are N(N-1)/2 possible interaction terms (e.g. for 5 variables, there are 10 interaction terms) - and it would be very difficult to sort out which of the interactions are meaningful. This excludes higher-order interaction terms (e.g. variable 1 times variable 2 times variable 3).
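The count above is easy to verify. The following is a minimal Python sketch; the five variable names are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from math import comb

# Hypothetical environmental variables (names are illustrative only).
variables = ["elevation", "precipitation", "pH", "slope", "soil_depth"]
n = len(variables)

# Every pairwise (two-way) interaction term: N(N-1)/2 of them.
pairwise = list(combinations(variables, 2))
n_pairwise = len(pairwise)

# Three-way interactions, which the N(N-1)/2 count excludes.
n_three_way = comb(n, 3)

# Interactions of every order from 2 up to N.
n_all_orders = sum(comb(n, k) for k in range(2, n + 1))

print(n_pairwise, n_three_way, n_all_orders)
```

For 5 variables this gives 10 pairwise terms, 10 three-way terms, and 26 interaction terms of all orders combined, which shows how quickly the list grows.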

A quadratic term results when you multiply an environmental variable by itself. This is very useful in multiple regression, especially when one is fitting polynomial functions. However, quadratic terms are to be avoided in CCA because they can cause the arch effect, a warping of ordination space (C. J. F. ter Braak, personal communication). See also Variance explained and Variance partitioning.

My general recommendation is to avoid the use of interaction terms except in the case of a particular hypothesis-driven analysis, in which the null hypothesis is "the environmental variables do not interact to influence species composition". I suspect that such an analysis is rarely the objective of community ecology research.

Linear combinations - As with most other multivariate analyses, environmental variables cannot be linear combinations of other variables. If a variable is a linear combination of others, a "singular matrix" results; this leads to a matrix operation which is analogous to dividing by zero. There are a few situations in which this might be a problem in ecological studies. One example is in soil cations: if all of the cations are individually included, then the variable "total cations" will be a linear combination of other variables. The solution to this problem is to omit from the analysis either the total cations, or one of the component cations. A second situation in which problems with linear combinations might arise is in relative composition data, in which variables add to unity or 100%. For example, soil texture consists of % clay, % sand, and % silt, which must add to 100%. One of these variables must be removed before analysis. A third situation is the case of dummy variables, to be discussed later.

Fortunately, CANOCO detects the presence of linear combinations, and automatically eliminates one of the variables. However, it will not detect linear combinations if the combination is not exact. For example, if there is 33.3% clay, 33.3% sand, and 33.3% silt, CANOCO would not recognize this as a linear combination because the sum would be 99.9%. Unpredictable results might occur.
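One way to catch such near-exact combinations before running an analysis is to check the row totals yourself. The sketch below is a simple pre-check in Python, not a description of how CANOCO tests for linear combinations; the function name, the data, and the tolerance are all assumptions made for illustration.

```python
def sums_to_constant(rows, tol=0.0):
    """Return True if all row totals agree with each other to within `tol`."""
    totals = [sum(row) for row in rows]
    return max(totals) - min(totals) <= tol

# Hypothetical soil texture data: (% clay, % sand, % silt) for three plots.
texture = [
    (20.0, 30.0, 50.0),
    (10.0, 60.0, 30.0),
    (33.3, 33.3, 33.3),  # rounded values: the total is about 99.9, not 100
]

# A strict comparison misses the combination because of the rounded plot.
exact = sums_to_constant(texture)

# A small tolerance (here 0.5 percentage points) reveals it.
approximate = sums_to_constant(texture, tol=0.5)

print(exact, approximate)
```

If the tolerant check succeeds while the strict one fails, the variables are a linear combination in all but rounding, and one of them should be dropped by hand.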

For cases in which the linear combination is indeed exact, I recommend including all variables in the analysis and letting CANOCO remove them for you.  This is because you would (most likely) still want the final variable included in graphical displays (e.g. using CANODRAW).

Transformation of environmental data - The scaling we use while measuring the environment may not be the most relevant scaling for species composition. Unfortunately, we never know how the species "perceive" the environment. We thus need to make educated guesses. In absence of detailed physiological data, I recommend a logarithmic transformation for most soil nutrient data (Palmer 1993), because a 1-unit difference in nutrient concentration is probably much more important at low concentrations than it is at high concentrations.
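The effect of a logarithmic transformation can be seen with a small numerical sketch in Python. The concentrations are hypothetical, and base 10 is an arbitrary choice (any base gives the same relative pattern). Note that the logarithm is undefined for zero, so zero concentrations need separate handling.

```python
import math

# Hypothetical nutrient concentrations (arbitrary units, all greater than 0).
low_a, low_b = 1.0, 2.0        # a 1-unit difference at low concentration
high_a, high_b = 100.0, 101.0  # a 1-unit difference at high concentration

# Size of each 1-unit step after a log10 transformation.
low_step = math.log10(low_b) - math.log10(low_a)
high_step = math.log10(high_b) - math.log10(high_a)

print(round(low_step, 4), round(high_step, 4))
```

On the log scale, the step from 1 to 2 is roughly 70 times larger than the step from 100 to 101, which matches the assumption that differences matter most where nutrients are scarce.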

Changing the units of a variable (e.g. British to metric, centimeters to millimeters, proportion to percentage), and changing the base point (e.g. elevation above a fixed point to elevation above sea level) will have no effects on the outcome of CCA, because these are examples of a linear transformation.
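One way to see why such changes are harmless: once a variable is standardized to zero mean and unit variance, a change of units or base point leaves the values numerically unchanged. The Python sketch below illustrates that general property with hypothetical elevations; it is not a reproduction of CANOCO's internal calculations.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical elevations in meters above sea level.
elev_m = [120.0, 340.0, 560.0, 910.0, 1500.0]

# The same elevations in feet above a fixed point 100 m above sea level:
# a change of base point followed by a change of units (a linear transformation).
elev_ft_above_base = [(v - 100.0) * 3.28084 for v in elev_m]

z_m = standardize(elev_m)
z_ft = standardize(elev_ft_above_base)

# Largest discrepancy between the two standardized versions.
max_diff = max(abs(a - b) for a, b in zip(z_m, z_ft))
print(max_diff)
```

The discrepancy is zero apart from floating-point rounding. A nonlinear change such as a logarithm would not have this property.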

Since the statistical significance of a CCA or RDA is determined by a randomization test, there is no need to transform data to fulfill statistical assumptions. However, transformations can be used to dampen the influence of outliers. The choice of transformation affects the location of sample scores, species scores, and environmental scores. A dampening transformation (e.g. square root) tends to make samples and species more evenly spread out. Only rarely will transformation of environmental variables change the overall interpretation of an ordination.

Dummy variables - Many important aspects of the environment cannot be easily described using continuous variables. Factors such as type of bedrock, land use history, and current management are better described by categorical variables. Multivariate analyses cannot deal with categorical variables directly; they need to be converted into "dummy variables". Dummy variables take the value 1 if the plot belongs to the category, and 0 if it does not. Some statistical packages do the conversion of categorical variables to dummy variables automatically; however, CANOCO 4.5 and earlier versions require that the user perform the conversion beforehand. Canoco 5.0 allows and encourages listing variables as factors, but it is still useful to know that internally the factors are decomposed into dummy variables.

For every categorical variable with K categories, only K-1 dummy variables can be included in the analysis. To illustrate this point, suppose you have a categorical variable for "bedrock" which takes three values (granite, limestone, and basalt) for a data set of 10 plots. Plots 1, 3 and 10 are on granite; 2, 4 and 5 are on limestone; and 6, 7, 8 and 9 are on basalt. The table below illustrates how the dummy variables are constructed. Note that there is a problem with linear combinations: as previously stated, no set of variables can add to a constant, yet the sum of the values for each plot is 1. One of the variables must therefore be removed. Fortunately, this results in no loss of information: if you remove "basalt", the information about basalt is still in the data set because basalt always occurs when you do not have granite or limestone. CANOCO removes superfluous dummy variables automatically, but it is important to be aware of this removal, because the CANOCO output for removed variables appears quite different from that for the other variables.

Example: Creation of dummy variables for the hypothetical data set described in the text.

Plot #    Granite    Limestone    Basalt
   1         1           0           0
   2         0           1           0
   3         1           0           0
   4         0           1           0
   5         0           1           0
   6         0           0           1
   7         0           0           1
   8         0           0           1
   9         0           0           1
  10         1           0           0
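The construction in the table above can be sketched in a few lines of Python. The function name and data layout are hypothetical and chosen only for illustration; the plot assignments are those given in the text.

```python
# Bedrock type for each of the 10 plots, as described in the text.
bedrock = {
    1: "granite", 2: "limestone", 3: "granite", 4: "limestone", 5: "limestone",
    6: "basalt", 7: "basalt", 8: "basalt", 9: "basalt", 10: "granite",
}
categories = ["granite", "limestone", "basalt"]

def make_dummies(assignments, categories):
    """Map each plot to a list of 0/1 values, one per category."""
    return {
        plot: [1 if value == category else 0 for category in categories]
        for plot, value in assignments.items()
    }

# All K = 3 dummy variables: every row sums to 1, a linear combination.
full = make_dummies(bedrock, categories)

# Only K - 1 = 2 dummy variables: "basalt" is dropped but still implied
# by a row in which granite and limestone are both 0.
reduced = make_dummies(bedrock, categories[:-1])

for plot in sorted(full):
    print(plot, full[plot], reduced[plot])
```

In the reduced coding, basalt plots appear as rows of zeros, so no information is lost by dropping the third column.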



Circular data - Some kinds of variables are circular: large values may be very close to small values. In most cases, circular data must be transformed. The two kinds of circular data most likely to accompany species composition data are aspect and day of the year.

Aspect (or compass direction of a slope) can be transformed by trigonometric functions (Roberts 1986). The simplest way to do this is to create two variables, "northness" and "eastness" as follows:

northness = cos(aspect)
eastness = sin(aspect)

Northness will take values close to 1 if the aspect is generally northward, close to -1 if the aspect is southward, and close to 0 if the aspect is either east or west. Eastness behaves similarly, except that values close to 1 represent east-facing slopes.
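A minimal Python sketch of this transformation follows. It assumes that aspect is recorded in degrees clockwise from north, and it converts to radians because the trigonometric functions in Python (as in most software) expect radians. The function name is hypothetical.

```python
import math

def aspect_to_components(aspect_degrees):
    """Return (northness, eastness) for an aspect in degrees clockwise from north."""
    angle = math.radians(aspect_degrees)
    return math.cos(angle), math.sin(angle)

north = aspect_to_components(0.0)
east = aspect_to_components(90.0)
south = aspect_to_components(180.0)

# Aspects of 359 and 1 degrees are far apart numerically but nearly
# identical after the transformation.
almost_north_a = aspect_to_components(359.0)
almost_north_b = aspect_to_components(1.0)

for label, (northness, eastness) in [("N", north), ("E", east), ("S", south)]:
    print(label, round(northness, 3), round(eastness, 3))
```

Forgetting the conversion to radians is a common source of error: cos(90) evaluated in radians is about -0.45, not 0.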

A trigonometric transformation of aspect data, as described above, is rather 'pure' since it retains the continuity of aspect.  However, converting aspect into dummy variables (e.g. N, NE, E, SE, S, SW, W, NW) often produces results that are easier to interpret, as long as there is good representation of each aspect class in the data set.  This approach has the additional advantage of allowing level surfaces (i.e. no aspect) to have a category (i.e. 'level').

The day of the year is circular because early January will have low values, and late December will have high values, yet the climates at the two times will be somewhat similar. One solution is to create new variables "winterness" and "springness" similar to the northness and eastness described above. A second solution is to create dummy variables for each month (or if there are a lot of data, for each week). This solution is MUCH PREFERABLE in most cases because it is easier to interpret.  If sampling occurred over a limited number of months (even if over the course of several years), it is not necessary to perform a transformation, and one could use the Julian date directly.  Yet another option is to use variables that might be more proximate indicators of seasonality, such as day length or weather variables.
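The first solution can be sketched in Python as follows. The assumptions are mine and not from the original: a 365-day year, day 1 falling on 1 January, and a Northern Hemisphere reading in which "winterness" peaks around the new year and "springness" about a quarter of a year later. The function names are hypothetical.

```python
import math

def seasonal_components(day_of_year, year_length=365.0):
    """Return (winterness, springness) for a day of the year (1 = 1 January)."""
    angle = 2.0 * math.pi * day_of_year / year_length
    return math.cos(angle), math.sin(angle)

def distance(p, q):
    """Straight-line distance between two (winterness, springness) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

early_january = seasonal_components(3)
late_december = seasonal_components(362)
early_july = seasonal_components(185)

# Days 3 and 362 differ by 359 as raw numbers but are close after transformation.
winter_gap = distance(early_january, late_december)

# Days half a year apart end up nearly as far apart as possible (maximum is 2).
half_year_gap = distance(early_january, early_july)

print(round(winter_gap, 3), round(half_year_gap, 3))
```

As with aspect, both components are needed together: either one alone cannot distinguish, for example, spring from autumn.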

Species-derived variables - There is a paradox in gradient analysis: species respond to the environment, but they also modify the environment. For example, vegetation itself can be considered an environmental factor to which vegetation responds. There is a suite of variables derived from species data which might be useful in CCA and other constrained ordinations: maximum height of vegetation, total biomass, light penetrating through the canopy, woody plant cover, etc. These variables might be quite informative in an exploratory analysis, though the ecologist must realize that it would be difficult to distinguish cause and effect. Species-derived variables should NOT be used in hypothesis testing, because the same data would be represented in both the dependent and the independent variables. This would lead to circular reasoning. A compromise would be to treat these species-derived variables as "passive" or "supplementary" - i.e. they would be included in diagrams, but would not otherwise influence the ordination.

An extreme case of species-derived variables is to use dummy variables derived from a classification of samples.  The classification could either be a result of a subjective procedure (e.g. the Braun-Blanquet approach or something less formal) or a multivariate analysis.  When the centroids of these dummy variables are plotted along with species scores in a CCA biplot, you have an ideal display of the relationships between your classes, and the species that occur in them.  Of course, any statistical tests would be inappropriate.


References cited

See also selected references for self-education.

Palmer, M. W. 1993. Putting things in even better order: the advantages of canonical correspondence analysis. Ecology 74:2215-30.

Roberts, D. W. 1986. Ordination on the basis of fuzzy set theory. Vegetatio 66:123-31.



This page was created and is maintained by Michael Palmer.