DATA FORMATS FOR INPUT INTO CANOCO, DECORANA, OR TWINSPAN
CANOCO uses input data in ASCII form. In CANOCO for Windows, it is
theoretically possible that you would never need to see such ASCII files,
since they can be created and read by other facilities. However,
it is good practice to know the general data formats for the purpose of
troubleshooting. Most of this page is valid for the older CANOCO
for DOS. Special considerations for CANOCO for Windows are listed
at the end of this page.
Suppose you had a data set in which four large quadrats were sampled
for birds, and you obtained the following data:
Species         Sample 1  Sample 2  Sample 3  Sample 4
cardinals              1         0         0         3
roadrunners            1         0         0         0
blue birds             3         2         0         0
phoebes                1         0         5         2
titmice                0         9         6         0
red-tails              1         0         0         0
chickadees            20         1         1         0
waxwings              66         0         0         0
How would you get these data into shape, so that CANOCO can read them?
CANOCO is a FORTRAN program, and therefore requires input in FORTRAN
format.
Conceptually, the most straightforward way to input these data into
CANOCO is in "full format". In full format, the samples are the rows, and
the columns are the species. An example of the above data translated into
full format follows. (In the remainder of this page, data sets ready for
analysis are surrounded by horizontal lines; the lines are not part of the
data files.) It is worth noting here that you would in most cases be better
off having your data in reduced condensed format, to be discussed later.
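----------------------------------------------------------------
BIRD DATA IN FULL FORMAT
(I3,8F3.0)
8
  1  1  1  3  1  0  1 20 66
  2  0  0  2  0  9  0  1  0
  3  0  0  0  5  6  0  1  0
  4  3  0  0  2  0  0  0  0
  0  0  0  0  0  0  0  0  0
CARDINALROADRUNNBLUEBIRDPHOEBE  TITMOUSEREDTAILSCHICKADEWAXWINGS
SAMPLE 1SAMPLE 2SAMPLE 3SAMPLE 4
----------------------------------------------------------------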
Let us now dissect the above data file.
-
Line #1: There must be exactly one title line. This line, AS WELL AS ALL
THE REMAINING LINES, must be no more than 80 characters long. It is useful for
this line to be informative, not only because it will remind you what the
data set is, but because it will be printed in most of the computer output.
Some people like to put the analysis date in the title line, so that they
know when they performed the analysis while poring over output.
-
Line #2: This is a FORTRAN formatting line. It tells the computer the layout
of your data. In this case, it says that there is an integer taking up
three spaces (I3), plus eight values taking up three spaces each
(8F3.0). The "F" means "real number" (i.e. there could be a decimal point).
The ".0" means that no decimal places are assumed when a value is written
without a decimal point; if a decimal point is typed, it may appear anywhere
in these three spaces. Even if your data are integers, as they are here, you
must specify them as real numbers because that is how they are analyzed by
the program. Other variations on formatting statements will be described
later.
-
Line #3: This is the number of data values per sample, not including the
sample number.
-
Lines #4-7: These are the actual data. The sample numbers must be ascending,
though not necessarily consecutive. However, missing samples must still
have a sample name. Likewise, missing species must have a species name.
-
Line #8: The final line must begin with a sample #0. This tells the computer
that the data have ended.
-
Line #9: These are the species names. Each species must be given an 8-character
code (which may be upper or lower case, may contain numerals, and may contain
spaces). A maximum of 10 species can fit on a line. If you have more than
10, put species 11-20 on the second line of species names, 21-30 on the
third line, etc. If you have short names, you must still include extra
spaces to make up 8 characters, even if it is the last species in the row.
-
It is conventional for birds to be listed by common names, and other organisms
by Latin binomials.
-
Line #10: Immediately after the species names, and starting on a new line,
are the sample names. The same rules about format for species names apply
here. It is possible to omit the sample names, in which case CANOCO supplies
default names. However, you must supply "hard returns" (i.e. end of paragraph)
- at least one return for every 10 samples. In general, it is a good idea
to make your sample names as descriptive as possible - for example, to
indicate your site, the year, the date, the treatment, etc.
Make sure to end your data set with a hard return.
Make ABSOLUTELY SURE that the file is stored in ASCII form (i.e. "text"
or "data" form with no tabs).
As mentioned above, you cannot have more than 80 characters per line.
What if you just have too much data per sample? You can either use reduced
condensed format, or use the slash (/) to indicate an additional line;
both of these will be discussed later. It is permissible to have data values
without spaces in between, as long as the format statement is precise.
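The rules above can be sketched in a few lines of Python. This is only an illustration, not a tool that comes with CANOCO: the function names, the fixed field width of 3, and the choice to write whole numbers without a decimal point are my own assumptions. It rebuilds the bird example, pads every name to 8 characters, and refuses to produce a line longer than 80 characters.

```python
# Bird counts from the example above: one row per sample, one column per species.
BIRDS = [
    [1, 1, 3, 1, 0, 1, 20, 66],
    [0, 0, 2, 0, 9, 0, 1, 0],
    [0, 0, 0, 5, 6, 0, 1, 0],
    [3, 0, 0, 2, 0, 0, 0, 0],
]
SPECIES = ["CARDINAL", "ROADRUNN", "BLUEBIRD", "PHOEBE",
           "TITMOUSE", "REDTAILS", "CHICKADE", "WAXWINGS"]
SAMPLES = ["SAMPLE 1", "SAMPLE 2", "SAMPLE 3", "SAMPLE 4"]


def name_lines(names, per_line=10, width=8):
    """Pad (or cut) each name to 8 characters and pack 10 names per line."""
    padded = [name[:width].ljust(width) for name in names]
    return ["".join(padded[i:i + per_line])
            for i in range(0, len(padded), per_line)]


def full_format_lines(title, matrix, species, samples, width=3):
    """Return the lines of a full-format file (hypothetical helper)."""
    n = len(species)
    lines = [title, f"(I{width},{n}F{width}.0)", str(n)]
    records = [[number] + list(row)
               for number, row in enumerate(matrix, start=1)]
    records.append([0] * (n + 1))  # the sample 0 record that ends the data
    for record in records:
        fields = [f"{value:{width}d}" for value in record]
        if any(len(field) != width for field in fields):
            raise ValueError("a value does not fit in its field")
        lines.append("".join(fields))
    lines += name_lines(species) + name_lines(samples)
    if any(len(line) > 80 for line in lines):
        raise ValueError("a line is longer than 80 characters")
    return lines


if __name__ == "__main__":
    # The trailing "\n" supplies the hard return that must end the file.
    print("\n".join(full_format_lines(
        "BIRD DATA IN FULL FORMAT", BIRDS, SPECIES, SAMPLES)) + "\n", end="")
```

With more species than fit in 80 characters, the sketch raises an error rather than guessing; that is the point at which you would switch to reduced condensed format or a continued-line format statement.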
Notice in the above data file that there are a lot of zeros - and indeed,
most data sets are loaded with zeros. It wastes space, computer memory,
and effort to include them all. Therefore, it is usually preferable to
have data files in "Cornell reduced condensed format" - so called because
it was originally developed for Cornell Ecology Programs. The data are
given below in this format:
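----------------------------------------------------------------
BIRD DATA IN REDUCED CONDENSED FORMAT
(I3,5(I3,F3.0))
5
  1  1  1  2  1  3  3  4  1  6  1
  1  7 20  8 66
  2  3  2  5  9  7  1
  3  4  5  5  6  7  1
  4  1  3  4  2
  0
CARDINALROADRUNNBLUEBIRDPHOEBE  TITMOUSEREDTAILSCHICKADEWAXWINGS
SAMPLE 1SAMPLE 2SAMPLE 3SAMPLE 4
----------------------------------------------------------------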
-
Line #1 is the title line.
-
Line #2 is the FORTRAN format statement. In this case, we have one three-character
integer indicating the sample number, then up to five couplets of numbers.
Each couplet consists of one three-character integer indicating the species
number, and one three-character real number indicating the species abundance.
Of course, you can change these numbers to fit the characteristics of your
own data set.
-
Line #3 indicates the maximum number of couplets (species number plus
abundance) per line. This number can be as large as you like, as long as all
rows are no longer than 80 characters. This number must correspond with the
formatting statement above.
-
Lines #4 - 8 are the data. They consist of sample number, then pairs of
species numbers and species abundances. Note that there are more than five
species in sample 1, so it must be continued on the second line.
-
The remaining lines are the same as described for
full format.
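As a rough sketch of how the data lines of a reduced condensed file come about, the Python below drops the zeros, numbers the species from 1, and starts a continuation line whenever a sample has more couplets than fit on one line. The function name and the field width of 3 are assumptions of mine, chosen to match the format statement (I3,5(I3,F3.0)) used in the example.

```python
# One row per sample, one column per species (the bird example).
BIRDS = [
    [1, 1, 3, 1, 0, 1, 20, 66],
    [0, 0, 2, 0, 9, 0, 1, 0],
    [0, 0, 0, 5, 6, 0, 1, 0],
    [3, 0, 0, 2, 0, 0, 0, 0],
]


def condensed_data_lines(matrix, couplets_per_line=5, width=3):
    """Return the data lines (hypothetical helper), ending with sample 0."""
    lines = []
    for sample, row in enumerate(matrix, start=1):
        # Keep only the nonzero entries, as (species number, abundance).
        couplets = [(species, value)
                    for species, value in enumerate(row, start=1)
                    if value != 0]
        # A sample with more couplets than fit on a line is continued on
        # the next line, which repeats the sample number.
        for start in range(0, len(couplets), couplets_per_line):
            chunk = couplets[start:start + couplets_per_line]
            lines.append(f"{sample:{width}d}" + "".join(
                f"{species:{width}d}{value:{width}d}"
                for species, value in chunk))
    lines.append(f"{0:{width}d}")  # sample 0 ends the data
    return lines


if __name__ == "__main__":
    print("\n".join(condensed_data_lines(BIRDS)))
```

Calling the same function with couplets_per_line=1 gives one couplet per line. Note that a sample with no nonzero values produces no data line at all in this sketch, although it would still need a sample name.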
A special case of reduced condensed format is when you have only one couplet
per line:
----------------------------------------------------------------
BIRD DATA IN REDUCED CONDENSED FORMAT - with one couplet
(I3,I3,F3.0)
1
  1  1  1
  1  2  1
  1  3  3
  1  4  1
  1  6  1
  1  7 20
  1  8 66
  2  3  2
  2  5  9
  2  7  1
  3  4  5
  3  5  6
  3  7  1
  4  1  3
  4  4  2
  0
CARDINALROADRUNNBLUEBIRDPHOEBE  TITMOUSEREDTAILSCHICKADEWAXWINGS
SAMPLE 1SAMPLE 2SAMPLE 3SAMPLE 4
----------------------------------------------------------------
Why might you want to do this, given that it takes up more space?
Because it is easy to enter data in this format in a spreadsheet,
and it is easier to manipulate in programs other than CANOCO. It is also
a lot easier to make sure your columns are aligned correctly! I don't recommend
that you use this format if you plan on printing hard copies of your data
set.
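One example of manipulating this format outside CANOCO: the Python sketch below expands condensed data lines back into a full samples-by-species table, which is a convenient way to check that a file says what you think it says. The function name is hypothetical, and I assume three-character fields as in (I3,I3,F3.0); it slices by column position, so it only works on properly aligned lines.

```python
def read_condensed(data_lines, n_species, couplets_per_line=1, width=3):
    """Expand condensed data lines into {sample number: list of abundances}."""
    table = {}
    for line in data_lines:
        sample = int(line[:width])
        if sample == 0:  # sample 0 marks the end of the data
            break
        row = table.setdefault(sample, [0.0] * n_species)
        for k in range(couplets_per_line):
            start = width + 2 * width * k
            species_field = line[start:start + width]
            value_field = line[start + width:start + 2 * width]
            if not species_field.strip():
                break  # no more couplets on this line
            row[int(species_field) - 1] = float(value_field)
    return table


if __name__ == "__main__":
    lines = ["  1  1  1", "  1  7 20", "  2  5  9", "  0"]
    for sample, row in read_condensed(lines, n_species=8).items():
        print(sample, row)
```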
The data sets for environmental variables are best kept in separate
files from those for the species data. Environmental data can be in
either full format or Cornell reduced condensed format. In general,
I recommend full format if you have a preponderance of quantitative (e.g.
continuous) variables. Reduced condensed format is better if you have a
preponderance of qualitative (e.g. categorical) variables. Categorical
variables must be coded as dummy variables; please see Environmental
Variables in Constrained Ordination (CCA, RDA).
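For readers who have not met dummy variables, here is a minimal Python sketch of the coding step. The soil types are made-up illustrative data and the function name is my own; each category becomes its own 0/1 variable, and each sample gets a 1 in exactly one of them.

```python
def dummy_code(values):
    """Turn one categorical variable into 0/1 columns, one per category."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels]
            for value in values]
    return levels, rows


if __name__ == "__main__":
    soil = ["clay", "sand", "clay", "loam"]  # hypothetical, one entry per sample
    levels, rows = dummy_code(soil)
    print(levels)
    for row in rows:
        print(row)
```

Because most of the resulting values are zeros, a table built this way is exactly the kind that stays small in reduced condensed format.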
For environmental data (including covariable files), the variable names
are given in place of species names. The sample names can either be left
blank (in which case a number of hard returns should be given at the end
of the file), or they should be identical to the sample names in the species
data file.
Although it can be frustrating to use FORTRAN formatting statements,
they do allow a great deal of flexibility.
SKIPPED COLUMNS
Suppose you had a data file with some information you did not want to use.
You could then use an "X" to indicate skipped columns. A statement like
(I3,10X,5(I3,F3.0))
would indicate that immediately following the sample number, there were
10 characters of either blank spaces, comments, or numbers that you did
not want CANOCO to read.
WARNING: CANODRAW is a program which takes the output of CANOCO
and plots it. Since CANODRAW is not a FORTRAN program, it imperfectly reads
FORTRAN statements. CANODRAW may interpret these skipped columns as data.
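To make the effect of "10X" concrete, the Python sketch below reads one line the way the statement (I3,10X,5(I3,F3.0)) describes it: three characters of sample number, ten characters ignored, then up to five couplets. The function name and the comment text " plot B-7 " are invented for illustration; this mimics the column arithmetic only and is not how CANOCO itself is implemented.

```python
def parse_with_skip(line, skip=10, couplets=5, width=3):
    """Read a sample number, skip some columns, then read the couplets."""
    sample = int(line[:width])
    body = line[width + skip:]  # everything after the skipped columns
    pairs = []
    for k in range(couplets):
        start = 2 * width * k
        species_field = body[start:start + width]
        value_field = body[start + width:start + 2 * width]
        if not species_field.strip():
            break  # no more couplets on this line
        pairs.append((int(species_field), float(value_field)))
    return sample, pairs


if __name__ == "__main__":
    # Sample number, a 10-character comment, then three couplets.
    line = "  2" + " plot B-7 " + "  3  2  5  9  7  1"
    print(parse_with_skip(line))
```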
CONTINUED LINES
For very large environmental data files, with dozens of variables, it may
be impossible to fit all the data for a sample on one line. Therefore,
you could have a FORTRAN formatting statement such as:
(I5,8F3.0/9F5.0/5F5.0)
This means that there is a five-character integer, followed by eight three-character
(including spaces) real numbers, and then a second line consisting of nine
five-character real numbers, and then a third line consisting of five five-character
real numbers. If you have continued lines, then you must make sure your
final sample (the notional "zero" sample which ends the data set) has
the same number of lines.
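A small Python sketch of that layout, assuming exactly the statement (I5,8F3.0/9F5.0/5F5.0) and therefore 8 + 9 + 5 = 22 values per sample. The function name is hypothetical, and I have chosen to write the first eight values as whole numbers and the rest with one decimal place. The same function produces the three-line "zero" sample that ends the data.

```python
def continued_record(sample, values):
    """Lay out 22 values as three lines under (I5,8F3.0/9F5.0/5F5.0)."""
    if len(values) != 22:
        raise ValueError("expected 8 + 9 + 5 = 22 values")
    lines = [
        # Line 1: sample number, then eight whole numbers, 3 characters each.
        f"{sample:5d}" + "".join(f"{v:3.0f}" for v in values[:8]),
        # Line 2: nine values, 5 characters each, with an explicit decimal point.
        "".join(f"{v:5.1f}" for v in values[8:17]),
        # Line 3: the last five values, 5 characters each.
        "".join(f"{v:5.1f}" for v in values[17:]),
    ]
    if [len(line) for line in lines] != [5 + 8 * 3, 9 * 5, 5 * 5]:
        raise ValueError("a value does not fit in its field")
    return lines


if __name__ == "__main__":
    print("\n".join(continued_record(1, [1] * 8 + [12.5] * 9 + [3.5] * 5)))
    # The terminating sample 0 must occupy three lines as well.
    print("\n".join(continued_record(0, [0] * 22)))
```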
Data file names
Although any file extension is acceptable, it is good form to develop a
convention for naming your data files. I give all of my reduced condensed
format files the extension "*.rc", my full format files the extension "*.ful",
my environmental files the extension "*.env", and my covariable files "*.cov".
However, note that "*.spe" is becoming a standard extension for species
data. Choose whatever convention is convenient for you.
Many people don't realize you can use the same file for both the environmental
data and for the covariables. All you need to do is to omit the other variables
from the analysis.
For heaven's sake, don't give your data files names like "data.spe"
or "envdata.env". You will regret this in the long run, when you
accumulate a large number of data files. Try to be as descriptive
as possible.
Common mistakes
Because the input formats for CANOCO are fairly awkward, it is common to
have errors in the files. Common errors include:
-
listing species, site, or environmental variable names in the wrong order
-
mistakes in the format statement
-
misalignment of a column
-
substituting characters (e.g. l and O) for numbers (e.g. 1 and 0)
-
inclusion of tabs
-
saving as a word processor, spreadsheet, or database file
-
insertion of blank lines in incorrect places
-
having more than 80 characters per line
-
having fewer than 8 characters per name (especially if at the end of a
row)
-
forgetting the decimal point
-
forgetting the zeros as the last line of data
-
forgetting to include extra lines of zeros, if the input data have extra
lines in full format.
Fortunately, CANOCO will alert you to a number of errors (e.g. if the number
or names of entities do not match), and the results of other errors will
be obvious (e.g. nonsensical species names). Errors in coding or ordering
species can be detected in CANOCO output if rare species have high weights
and common species have low weights. However, some errors remain elusive,
and can best be found by repeatedly proofing the files.
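Some of the mechanical mistakes in the list above can be caught before CANOCO ever sees the file. The Python sketch below is one possible checker, with function names of my own invention; it looks only for tabs, overlong lines, non-ASCII characters, stray letters in data lines, and name lines that are not a multiple of 8 characters. It cannot detect misordered names or misaligned columns.

```python
import re


def check_lines(lines, max_len=80):
    """Return (line number, problem) pairs for problems on any line."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if "\t" in line:
            problems.append((number, "contains a tab"))
        if len(line) > max_len:
            problems.append((number, f"longer than {max_len} characters"))
        if not line.isascii():
            problems.append((number, "contains non-ASCII characters"))
    return problems


def data_line_is_numeric(line):
    """True if a data line holds only digits, spaces, '.' and '-'."""
    return re.fullmatch(r"[0-9 .\-]*", line) is not None


def name_line_is_padded(line, width=8):
    """True if a name line is a whole number of 8-character names."""
    return len(line) > 0 and len(line) % width == 0


if __name__ == "__main__":
    print(check_lines(["BIRD DATA", "  1\t1", "x" * 81]))
    print(data_line_is_numeric("  l  O  3"))       # letters l and O slipped in
    print(name_line_is_padded("CARDINALPHOEBE"))   # last name not padded
```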
HINT: if you would like to check your input format by printing out
a hard copy of your file, make sure to use a nonproportional (i.e. fixed
letter width) font. This makes it easier to count characters and determine
the alignment of columns.
CANOCO for Windows
CANOCO for Windows has a new facility, WCanoImp, which makes it easier
to create data from a spreadsheet. The general procedure is as follows:
-
Create a block of data, in which the columns are species and the rows are
samples (or the other way around).
-
Copy the data into the clipboard.
-
Open WCanoImp.
-
Select CanoImp options.
-
Save the file using a descriptive name (see above).
-
Provide a descriptive subject line.
-
Exit WCanoImp.
Some general recommendations and comments:
-
You can provide names of species and samples to head the rows and columns.
-
If these names are longer than 8 characters, WCanoImp will truncate
them to 8 characters.
-
Make sure that species and sample names are still distinguishable after
truncation. For example, if you spell out the entire taxon name,
members of the same genus will end up having the same name if the genus
name is more than 6 characters long.
-
Species and/or sample names (very boring ones) will be generated by WCanoImp
if they are missing in the block you copied.
-
For species data, and for environmental data with many categorical variables,
make sure to save your file in reduced condensed format. Otherwise,
you may end up with a very large file (since it would contain a lot of
zeros).
-
It is entirely possible that your raw data are not in a clean column by
row format. If so, check whether your spreadsheet has a summary table
facility (such as Pivot Tables in Microsoft Excel). It is often possible
to use it to create a new data table with samples and species as rows and
columns.
-
In manipulating complex data tables in spreadsheets, be careful when copying
columns and rows which include calculations (such as data transformations).
Often the calculations refer to relative cell locations. Convert
formulae to values where appropriate.
-
Despite the last advice, WCanoImp is perfectly capable of reading the results
of formulae within the block of data.
-
Make sure that samples are in the same order in both the species data and
the environmental data.
-
Eliminating species, samples, and environmental variables is quite
simple in CANOCO for Windows (unlike in older versions). Therefore,
even if you plan on performing many different analyses on subsets of the
data, I generally recommend creating only one huge species data file,
and one huge environmental data file. When running CANOCO,
you can open an old *.con file, alter it, and save the new options into
a new *.con file. Therefore, you can keep your runtime options entirely
the same (where appropriate) with the exception of altering the actual
data sets you analyze, and changing the name of the solution file.
This practice will prevent the proliferation of a large number of data
files.
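The advice above about names remaining distinguishable after truncation is easy to check before you copy a block into WCanoImp. The Python sketch below assumes that truncation simply keeps the first 8 characters; that is my reading of the behaviour, not something taken from the WCanoImp documentation, and the function name is hypothetical.

```python
from collections import defaultdict


def truncation_clashes(names, width=8):
    """Group the names that become identical once cut to 8 characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


if __name__ == "__main__":
    names = ["Quercus alba", "Quercus rubra", "Acer rubrum"]
    for short, full in truncation_clashes(names).items():
        print(repr(short), "<-", full)
```

An empty result means every name survives truncation intact or at least uniquely; anything else lists the names you need to abbreviate by hand.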
This web page is intended as a quick overview. See the CANOCO manual and
readme files for further details and options.
This page was created and is maintained by Michael
Palmer.