Hypothesis-Driven and Exploratory Data Analysis

The 14th-century maxim known as Ockham's Razor, paraphrased by Jefferys and Berger (1992) as "It is vain to do with more what can be done with less", is usually applied to the interpretation of scientific results. However, it applies equally well to the choice of analysis. Thus if one has a very simple ecological data set, consisting of few species and few samples, ordination is not worthwhile. In such a case, the data are easiest to interpret in a simple table.

In a typical data set, however, there are dozens of species and samples. It is impossible for the human mind to simultaneously contemplate dozens of dimensions. The purpose of ordination is to assist the implementation of Ockham's Razor: a few dimensions are easier to understand than many dimensions. A good ordination technique will be able to determine the most important dimensions (or gradients) in a data set, and ignore "noise" or chance variation.

Both direct and indirect gradient analysis have the potential to reduce the dimensionality of a data set. However, reduction of dimensionality is not the only reason to use ordination. Before the development of CCA, most widely-used ordination techniques were indirect, and the primary goal of ordination was considered "exploratory" (Gauch 1982). It was the job of the ecologist to use his or her knowledge and intuition to collect and interpret data; pure objectivity could potentially interfere with the ability to distinguish important gradients. Ordination was often considered as much an art as a science.

Once CCA was available, multivariate direct gradient analysis became feasible. It became possible to rigorously test statistical hypotheses and go beyond mere "exploratory" analysis. However, testing hypotheses requires complete objectivity, which results in repeatability and falsifiability. The two basic motivations for multivariate direct gradient analysis, hypothesis testing and exploratory analysis, conflict with each other to some extent:

Table 1. Hypothesis-driven analysis, exploratory analysis, and their major characteristics and motivations. This table applies to regression techniques and indirect gradient analysis in addition to CCA.

| Hypothesis-driven analysis | Exploratory analysis |
|---|---|
| Motivating question: "Can I reject the null hypothesis that species are unrelated to a postulated environmental factor or factors?" | Motivating question: "How can I optimally explain or describe variation in my data set?" |
| objective | subjective |
| sites must be representative of universe: random, stratified random, regular placement | sites can be "encountered" or subjectively located |
| analyses must be planned a priori | "data diving" permissible; post-hoc analyses, explanations, hypotheses OK |
| p-values meaningful | p-values only a rough guide |
| stepwise techniques not valid without cross-validation | stepwise techniques (e.g. forward selection) valid and useful |

To perform a hypothesis-driven analysis, one must be very specific about the analyses one wishes to perform. The null hypothesis must be clearly stated, and the data must be collected in a repeatable manner. Usually, the sampling design will involve random, stratified random, or regular distribution of study plots. If there is any subjectivity involved in locating or orienting study plots, the results are technically not valid. All of the analyses, including variations of data transformation and use of different ordination options (e.g. detrending or not), must be planned in advance, or else the user runs the risk of "data diving" or "data mining", i.e. getting an artificially significant result because so many options are tried. Stepwise techniques (discussed later) are automated forms of "data diving", and will typically also lead to incorrect statistical inference (Cliff 1987, Draper and Smith 1981). The reward for rigorously adhering to these rather stringent criteria is that the statistical inference (i.e. the p-value) is valid.
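
To give a rough sense of why trying many options inflates significance, the following Python sketch computes the chance that at least one of several analyses comes out "significant" when the null hypothesis is actually true. It assumes the analyses are statistically independent, which is a simplification (variations on the same data set are usually correlated, so the true inflation is typically somewhat smaller). The function name and the significance level of 0.05 are illustrative choices, not part of any ordination package.

```python
def chance_of_false_positive(n_analyses: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n_analyses tests is 'significant'
    at level alpha when every null hypothesis is true.

    Assumption: the tests are independent of one another.
    """
    return 1.0 - (1.0 - alpha) ** n_analyses


# Print the chance for a few hypothetical numbers of analysis variants.
for n in (1, 5, 10, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

Under this assumption, a single planned analysis keeps the nominal 5% error rate, whereas ten unplanned variants raise the chance of a spurious result to roughly 40%.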

Exploratory analyses might lack statistical rigor, but they are still a mainstay of vegetation research. The purpose of exploratory analysis is to find pattern in nature, which is an inherently subjective enterprise. Exploratory analyses incorporate the wisdom, skill, and intuition of the investigator into the experiment. Unless you can find another investigator with identical wisdom, skill and intuition, the analyses are not strictly repeatable, and are hence not falsifiable. While it is possible to perform exploratory analyses on sample plots located according to a rigorous, objective sampling design, such careful placement is not necessary. Indeed, an exploratory analysis can be aided if the investigator subjectively places study plots in locations he or she considers to be important or interesting. Orienting plots within vegetation which appears homogeneous is highly subjective, but very useful in evaluating differences between plots.

With exploratory analysis, "data diving" (e.g. using different transformations of species abundances, adjusting ordination options, selecting different subsets of environmental variables, or selecting different subsets of study plots) is no longer to be avoided. Instead, it is a way for the investigator to learn more about the data set. Stepwise analysis is a form of automated data diving. It is useful as a tool to help discover "important" or "interesting" variables.

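Ecologists are often misled into thinking that p-values from stepwise methods have a rigorous meaning, and that the results of stepwise methods give the best possible model. Both beliefs are false: the selection procedure itself biases the p-values, and a stepwise search need not arrive at the best model.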
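
The bias in stepwise p-values can be illustrated with a small simulation. The Python sketch below is a deliberately simplified stand-in for the first step of forward selection: it uses a single response variable and simple correlation rather than CCA, and all of the data are random noise. In each trial it picks the "best" of several unrelated candidate variables and checks whether that variable would pass a naive test at the 5% level. The function names, the sample sizes, and the critical value of 0.444 (the two-tailed 5% critical value of the correlation coefficient for 20 observations) are assumptions made for this illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_of_k_false_positive_rate(n_plots=20, n_vars=10, n_trials=500,
                                  r_crit=0.444, seed=1):
    """Fraction of trials in which the best of n_vars pure-noise variables
    exceeds the single-test critical correlation r_crit.

    Assumption: r_crit = 0.444 corresponds to alpha = 0.05 (two-tailed)
    for n_plots = 20; it must be changed if n_plots is changed.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Response and candidate variables are all unrelated random noise.
        response = [rng.gauss(0, 1) for _ in range(n_plots)]
        best = max(
            abs(pearson_r([rng.gauss(0, 1) for _ in range(n_plots)], response))
            for _ in range(n_vars)
        )
        if best > r_crit:
            hits += 1
    return hits / n_trials


print("one planned variable: ", best_of_k_false_positive_rate(n_vars=1))
print("best of ten variables:", best_of_k_false_positive_rate(n_vars=10))
```

With one pre-planned variable the false-positive rate should stay near the nominal 5%, whereas the selected "best of ten" variable appears significant far more often, even though no variable has any real relationship to the response.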

It is possible to combine exploratory analysis and hypothesis-driven analysis into a larger study. One way of doing this is to perform a 2-phase study, in which the first phase is an exploratory analysis, perhaps involving subjectively located plots and employing many variations on analysis. The patterns found in the first phase are then posed as hypotheses for the second phase. The second phase involves the collection of fresh data from objectively located plots, and an entirely planned data analysis.

A second way to combine the two major types of analysis is through data set subdivision. The data set is randomly divided into two subsets: an exploratory subset and a confirmatory subset (alternatively called model building and model validation, respectively). Many varied analyses can be performed on the exploratory subset (including stepwise analysis), and such analyses can be based upon intuition, hunches, or superstition. If interesting patterns are found with respect to particular environmental variables, and using particular data transformations, these patterns can be statistically tested using the confirmatory subset. To use data set subdivision properly, samples must be objectively located.
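
The random division itself is simple to carry out. The following Python sketch shows one possible way to do it, assuming that each sample has a unique identifier; the function name, the plot labels, and the 50/50 split are hypothetical choices made for illustration. Fixing the random seed makes the division repeatable.

```python
import random


def split_samples(sample_ids, exploratory_fraction=0.5, seed=None):
    """Randomly divide sample identifiers into an exploratory subset
    and a confirmatory subset. Every sample lands in exactly one subset.
    """
    rng = random.Random(seed)
    shuffled = list(sample_ids)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_exploratory = round(len(shuffled) * exploratory_fraction)
    return shuffled[:n_exploratory], shuffled[n_exploratory:]


# Hypothetical data set of 40 objectively located plots.
plots = [f"plot_{i:02d}" for i in range(1, 41)]
exploratory, confirmatory = split_samples(plots, seed=42)
print(len(exploratory), "exploratory plots;", len(confirmatory), "confirmatory plots")
```

The confirmatory subset should be set aside untouched until the exploratory work is finished, so that the final test is run only once and on data that played no part in choosing the hypothesis.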

Literature cited

(see also selected references for self-education)

Cliff, N. 1987. Analyzing Multivariate Data. Harcourt Brace Jovanovich, Publishers, San Diego, California.

Draper, N. R., and H. Smith. 1981. Applied Regression Analysis. Second edition. Wiley, New York.

Gauch, H. G., Jr. 1982. Multivariate Analysis and Community Structure. Cambridge University Press, Cambridge.

Hallgren, E., M. W. Palmer, and P. Milberg. 1999. Data diving with cross validation: an investigation of broad-scale gradients in Swedish weed communities. Journal of Ecology 87:1037-1051.

Jefferys, W. H., and J. O. Berger. 1992. Ockham's Razor and Bayesian Analysis. American Scientist 80:64-72.

This page was created and is maintained by Michael Palmer