DATA FORMATS FOR INPUT INTO CANOCO, DECORANA, OR TWINSPAN

NOTE added in 2013: This section is no longer relevant for users of CANOCO 5.0. I am not eliminating the material, in case users have legacy data sets they need to understand.

CANOCO uses input data in ASCII form. In CANOCO for Windows, it is theoretically possible that you would never need to see such ASCII files, since they can be created and read by other facilities. However, it is good practice to know the general data formats, for the purpose of troubleshooting. Most of this page is valid for the older CANOCO for DOS. Special considerations for CANOCO for Windows are listed at the end of this page.

Suppose you had a data set in which four large quadrats were sampled for birds, and you obtained the following data:

              Sample 1  Sample 2  Sample 3  Sample 4
cardinals         1         0         0         3
roadrunners       1         0         0         0
blue birds        3         2         0         0
phoebes           1         0         5         2
titmice           0         9         6         0
red-tails         1         0         0         0
chickadees       20         1         1         0
waxwings         66         0         0         0

How would you get these data into shape, so that CANOCO can read them?

CANOCO is a FORTRAN program, and therefore requires input in FORTRAN format.

Conceptually, the most straightforward way to input these data into CANOCO is in "full format". In full format, the samples are the rows, and the species are the columns. An example of the above data translated into full format follows. (In what follows, data sets ready for analysis are surrounded by horizontal lines; the lines are not part of the data files.) Note that in most cases you would be better off having your data in reduced condensed format, to be discussed later.

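----------------------------------------------------------------
BIRD DATA IN FULL FORMAT
(I3,8F3.0)
  8
  1  1  1  3  1  0  1 20 66
  2  0  0  2  0  9  0  1  0
  3  0  0  0  5  6  0  1  0
  4  3  0  0  2  0  0  0  0
  0  0  0  0  0  0  0  0  0
CARDINALROADRUNNBLUEBIRDPHOEBE  TITMOUSEREDTAILSCHICKADEWAXWINGS
SAMPLE 1SAMPLE 2SAMPLE 3SAMPLE 4
----------------------------------------------------------------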

Let us now dissect the above data file.

Line #1: There must be exactly one title line. This line, like all the remaining lines, must be no more than 80 characters long. It is useful for this line to be informative, not only because it will remind you what the data set is, but because it will be printed in most of the computer output. Some people like to put the analysis date in the title line, so that they can tell when the analysis was performed while poring over output.
Line #2: This is a FORTRAN formatting line. It tells the computer how your data are laid out. In this case, it says that there is an integer taking up three spaces (I3), followed by eight values taking up three spaces each (8F3.0). The "F" means "real number" (i.e. there could be a decimal point). The ".0" means that no digits are assumed to follow the decimal point when none is typed; if a value does contain a decimal point, it can sit anywhere in these three spaces. Other variations on formatting statements will be described later. Even if your data are integers, as they are here, you must specify them as real numbers because that is how they are analyzed by the program.
Line #3: This is the number of data values per sample, not including the sample number.
Lines #4-7: These are the actual data. The sample numbers must be ascending, though not necessarily consecutive. However, missing samples must still have a sample name. Likewise, missing species must have a species name.
Line #8: The final line must begin with a sample #0. This tells the computer that the data have ended.
Line #9: These are the species names. Each species must be given an 8-character code (which may be upper or lower case, may contain numerals, and may contain spaces). A maximum of 10 species can fit on a line. If you have more than 10, put species 11-20 on the second line of species names, 21-30 on the third line, etc. If you have short names, you must still include extra spaces to make up 8 characters, even if it is the last species in the row.
It is conventional for birds to be listed by common names, and other organisms by Latin binomials.
Line #10: Immediately after the species names, and starting on a new line, are the sample names. The same rules about format for species names apply here. It is possible to omit the sample names, in which case CANOCO supplies default names. However, you must supply "hard returns" (i.e. end of paragraph) - at least one return for every 10 samples. In general, it is a good idea to make your sample names as descriptive as possible - for example, to indicate your site, the year, the date, the treatment, etc.

Make sure to end your data set with a hard return.

Make ABSOLUTELY SURE that the file is stored in ASCII form (i.e. "text" or "data" form with no tabs).
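The line-by-line description above can be turned into a small script. The following is only a sketch in Python, not part of CANOCO: the function name full_format and its arguments are made up for illustration, and it assumes that every value fits in a three-character field.

```python
def full_format(title, species, samples, rows, width=3):
    """Return the text of a full-format file (hypothetical helper).

    rows[i] holds the abundances of sample i + 1, one value per species.
    Names are padded or cut to exactly 8 characters.
    """
    n = len(species)
    lines = [title, f"(I{width},{n}F{width}.0)", f"{n:>{width}}"]
    for number, row in enumerate(rows, start=1):
        lines.append(f"{number:>{width}}" + "".join(f"{v:>{width}}" for v in row))
    # Terminating "sample 0" line, padded with zeros like a data line.
    lines.append("".join(f"{0:>{width}}" for _ in range(n + 1)))
    for names in (species, samples):
        padded = [f"{name[:8]:<8}" for name in names]
        for i in range(0, len(padded), 10):        # at most 10 names per line
            lines.append("".join(padded[i:i + 10]))
    text = "\n".join(lines) + "\n"                 # end with a hard return
    assert all(len(line) <= 80 for line in lines), "line longer than 80 characters"
    assert "\t" not in text, "tabs are not allowed"
    return text


species = ["CARDINAL", "ROADRUNN", "BLUEBIRD", "PHOEBE",
           "TITMOUSE", "REDTAILS", "CHICKADE", "WAXWINGS"]
samples = ["SAMPLE 1", "SAMPLE 2", "SAMPLE 3", "SAMPLE 4"]
rows = [[1, 1, 3, 1, 0, 1, 20, 66],
        [0, 0, 2, 0, 9, 0, 1, 0],
        [0, 0, 0, 5, 6, 0, 1, 0],
        [3, 0, 0, 2, 0, 0, 0, 0]]
text = full_format("BIRD DATA IN FULL FORMAT", species, samples, rows)
print(text, end="")
```

If you save the result yourself, opening the output file with encoding="ascii" makes Python refuse any character that is not plain ASCII.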

As mentioned above, you cannot have more than 80 characters per line. What if you just have too much data per sample? You can either use reduced condensed format, or use the slash (/) to indicate an additional line; both of these will be discussed later. It is permissible to have data values without spaces in between, as long as the format statement is precise.
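To see why values may run together without spaces, it helps to remember that a format statement reads by column position, not by separators. The sketch below imitates that in Python for a format like (I3,8F3.0). The function name read_full_line is hypothetical, and treating an all-blank field as zero is a simplifying assumption of this sketch.

```python
def read_full_line(line, n_values, width=3):
    """Split one full-format data line by column position (illustration only)."""
    total = width * (n_values + 1)
    line = line.ljust(total)                       # short lines are padded with blanks
    fields = [line[i:i + width] for i in range(0, total, width)]
    sample = int(fields[0])
    # Assumption: an all-blank field counts as zero.
    values = [float(f) if f.strip() else 0.0 for f in fields[1:]]
    return sample, values


print(read_full_line("  1  1  1  3  1  0  1 20 66", 8))
print(read_full_line("  7123456789", 3))           # three values with no spaces between
```

The second call returns the values 123, 456, and 789 for sample 7, because each value occupies exactly three columns.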

Notice in the above data file that there are a lot of zeros - and indeed, most data sets are loaded with zeros. It wastes space, computer memory, and effort to include them all. Therefore, it is usually preferable to have data files in "Cornell reduced condensed format" - so called because it was originally developed for Cornell Ecology Programs. The data are given below in this format:

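----------------------------------------------------------------
BIRD DATA IN REDUCED CONDENSED FORMAT
(I3,5(I3,F3.0))
  5
  1  1  1  2  1  3  3  4  1  6  1
  1  7 20  8 66
  2  3  2  5  9  7  1
  3  4  5  5  6  7  1
  4  1  3  4  2
  0
CARDINALROADRUNNBLUEBIRDPHOEBE  TITMOUSEREDTAILSCHICKADEWAXWINGS
SAMPLE 1SAMPLE 2SAMPLE 3SAMPLE 4
----------------------------------------------------------------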

Line #1 is the title line.
Line #2 is the FORTRAN format statement. In this case, we have one three-character integer indicating the sample number, then up to five couplets of numbers. Each couplet consists of one three-character integer indicating the species number, and one three-character real number indicating the species abundance. Of course, you can change these numbers to fit the characteristics of your own data set.
Line #3 indicates the maximum number of couplets (i.e. species) per line. This number can be as large as you like, as long as all rows are no longer than 80 characters. This number must correspond with the formatting statement above.
Lines #4 - 8 are the data. They consist of sample number, then pairs of species numbers and species abundances. Note that there are more than five species in sample 1, so it must be continued on the second line.
The remaining lines are the same as described for full format.
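The conversion from a full table to the data lines of reduced condensed format is mechanical: drop the zeros, number the species, and start a new line after every five couplets. A minimal Python sketch follows. The function name condensed_lines is made up for illustration, and it assumes that species are numbered from 1 in column order and that every number fits in three characters.

```python
def condensed_lines(rows, couplets_per_line=5, width=3):
    """Yield reduced condensed data lines for a samples-by-species table.

    Zero abundances are skipped; a sample with only zeros yields no line.
    """
    for sample, row in enumerate(rows, start=1):
        pairs = [(sp, v) for sp, v in enumerate(row, start=1) if v != 0]
        for i in range(0, len(pairs), couplets_per_line):
            chunk = pairs[i:i + couplets_per_line]
            yield f"{sample:>{width}}" + "".join(
                f"{sp:>{width}}{v:>{width}}" for sp, v in chunk)


rows = [[1, 1, 3, 1, 0, 1, 20, 66],
        [0, 0, 2, 0, 9, 0, 1, 0],
        [0, 0, 0, 5, 6, 0, 1, 0],
        [3, 0, 0, 2, 0, 0, 0, 0]]
for line in condensed_lines(rows):
    print(line)
```

Sample 1 has seven nonzero species, so it is split over two lines, both beginning with the sample number 1.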

A special case of reduced condensed format is when you have only one couplet per line:

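----------------------------------------------------------------
BIRD DATA IN REDUCED CONDENSED FORMAT - with one couplet
(I3,I3,F3.0)
  1
  1  1  1
  1  2  1
  1  3  3
  1  4  1
  1  6  1
  1  7 20
  1  8 66
  2  3  2
  2  5  9
  2  7  1
  3  4  5
  3  5  6
  3  7  1
  4  1  3
  4  4  2
  0
CARDINALROADRUNNBLUEBIRDPHOEBE  TITMOUSEREDTAILSCHICKADEWAXWINGS
SAMPLE 1SAMPLE 2SAMPLE 3SAMPLE 4
----------------------------------------------------------------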

Now why might you want to do this, given that it takes up more space? This is because it is easy to input data in this format in a spreadsheet, and it is easier to manipulate in programs other than CANOCO. It is also a lot easier to make sure your columns are aligned correctly! I don't recommend that you use this format if you plan on printing hard copies of your data set.
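As an illustration of how easy the one-couplet layout is to handle outside CANOCO, here is a short Python sketch that rebuilds the samples-by-species table from such lines. The function name table_from_triples is hypothetical, and the sketch assumes the (I3,I3,F3.0) layout shown above.

```python
def table_from_triples(lines, n_species, width=3):
    """Rebuild a table (sample number -> list of abundances) from one-couplet lines."""
    table = {}
    for line in lines:
        sample = int(line[0:width])
        if sample == 0:                            # terminating line
            break
        species = int(line[width:2 * width])
        value = float(line[2 * width:3 * width])
        table.setdefault(sample, [0.0] * n_species)[species - 1] = value
    return table


data = ["  1  1  1", "  1  7 20", "  2  5  9", "  0"]
print(table_from_triples(data, n_species=8))
```

Species that never appear for a sample simply stay at zero, which is the whole point of the condensed format.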

The data sets for environmental variables are best kept in separate files from those for the species data. Environmental data can be in either full format or Cornell reduced condensed format. In general, I recommend full format if you have a preponderance of quantitative (e.g. continuous) variables. Reduced condensed format is better if you have a preponderance of qualitative (e.g. categorical) variables. Categorical variables must be coded as dummy variables; please see Environmental Variables in Constrained Ordination (CCA, RDA).

For environmental data (including covariable files), the variable names are given in place of species names. The sample names can either be left blank (in which case a number of hard returns should be given at the end of the file), or they should be identical to the sample names in the species data file.

Although it can be frustrating to use FORTRAN formatting statements, they do allow a wide range of flexibility.

SKIPPED COLUMNS

Suppose you had a data file with some information you did not want to use. You could then use an "X" to indicate skipped columns. A statement like:

(I3,10X,5(I3,F3.0))

would indicate that immediately following the sample number, there were 10 characters of either blank spaces, comments, or numbers that you did not want CANOCO to read.
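The effect of the "10X" can be imitated in a few lines of Python. This is only a sketch of the column arithmetic, not how CANOCO itself is written; the function name read_with_skip and the comment text "site B" are invented for the example.

```python
def read_with_skip(line, skip=10, couplets=5, width=3):
    """Read one line laid out as (I3,10X,5(I3,F3.0)); columns 4-13 are ignored."""
    sample = int(line[:width])
    body = line[width + skip:]                     # everything after the skipped columns
    pairs = []
    for i in range(0, couplets * 2 * width, 2 * width):
        species = body[i:i + width]
        value = body[i + width:i + 2 * width]
        if species.strip():                        # unused couplets are blank
            pairs.append((int(species), float(value)))
    return sample, pairs


# Sample number, a 10-character comment, then three couplets.
line = "  2" + " site B   " + "  3  2  5  9  7  1"
print(read_with_skip(line))
```

The comment in columns 4-13 never reaches the numeric conversion, so it can hold any text.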

WARNING: CANODRAW is a program which takes the output of CANOCO and plots it. Since CANODRAW is not a FORTRAN program, it imperfectly reads FORTRAN statements. CANODRAW may interpret these skipped columns as data.

CONTINUED LINES

For very large environmental data files, with dozens of variables, it may be impossible to fit all the data for a sample on one line. Therefore, you could have a FORTRAN formatting statement such as:

(I5,8F3.0/9F5.0/5F5.0)

This means that there is a five-character integer, followed by eight three-character (including spaces) real numbers, and then a second line consisting of nine five-character real numbers, and then a third line consisting of five five-character real numbers. If you have continued lines, then you must make sure your final sample (the notional "zero" sample which ends the data set) has the same number of lines.
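A short Python sketch of how one sample would be laid out under this format may help. The function name continued_record and the layout argument are made up for illustration, and the sketch assumes that each value fits in its field.

```python
def continued_record(sample, values, layout=((8, 3), (9, 5), (5, 5)), id_width=5):
    """Format one sample as (I5,8F3.0/9F5.0/5F5.0): three physical lines.

    layout lists (number of values, field width) for each physical line.
    """
    expected = sum(count for count, _ in layout)
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    lines, start = [], 0
    for n, (count, width) in enumerate(layout):
        prefix = f"{sample:>{id_width}}" if n == 0 else ""   # sample number on first line only
        chunk = values[start:start + count]
        lines.append(prefix + "".join(f"{v:>{width}}" for v in chunk))
        start += count
    return lines


for line in continued_record(1, list(range(1, 23))):
    print(line)
# The terminating "zero" sample needs the same three lines.
for line in continued_record(0, [0] * 22):
    print(line)
```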

Data file names

Although any file extension is acceptable, it is good form to develop a convention for naming your data files. I give all of my reduced condensed format files the extension "*.rc", my full format files the extension "*.ful", my environmental files the extension "*.env", and my covariable files "*.cov". However, note that "*.spe" is becoming a standard extension for species data. Choose whatever format is convenient for you.

Many people don't realize you can use the same file for both the environmental data and for the covariables. All you need to do is to omit the other variables from the analysis.

For heaven's sake, don't give your data files names like "data.spe" or "envdata.env". You will regret this in the long run, when you accumulate a large number of data files. Try to be as descriptive as possible.

Common mistakes

Because the input formats for CANOCO are fairly awkward, it is common to have errors in the files. Common errors include:

listing species, site, or environmental variable names in the wrong order
mistakes in the format statement
misalignment of a column
substituting characters (e.g. l and O) for numbers (e.g. 1 and 0)
inclusion of tabs
saving as a word processor, spreadsheet, or database file
insertion of blank lines in incorrect places
having more than 80 characters per line
having fewer than 8 characters per name (especially if at the end of a row)
forgetting the decimal point
forgetting the zeros as the last line of data
forgetting to include extra lines of zeros, if the input data have extra lines in full format.

Fortunately, CANOCO will alert you to a number of errors (e.g. if the number or names of entities do not match), and the results of other errors will be obvious (e.g. nonsensical species names). Errors in coding or ordering species can be detected in CANOCO output if rare species have high weights and common species have low weights. However, some errors remain elusive, and can best be found by repeatedly proofing the files.
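A few of the purely mechanical errors in the list above (tabs, overlong lines, non-ASCII characters, a missing final hard return) can be caught before CANOCO ever sees the file. The following Python sketch is one possible checker. The function name check_lines is hypothetical, and it cannot detect misaligned columns or misordered names.

```python
def check_lines(text, max_len=80):
    """Return (line number, problem) pairs for a few mechanical errors.

    Line number 0 is used for problems that concern the file as a whole.
    """
    problems = []
    if not text.endswith("\n"):
        problems.append((0, "file does not end with a hard return"))
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((number, "contains a tab"))
        if len(line) > max_len:
            problems.append((number, f"longer than {max_len} characters"))
        if not line.isascii():
            problems.append((number, "contains non-ASCII characters"))
    return problems


sample_text = "BIRD DATA\n(I3,8F3.0)\n  1\t1\n"
for number, problem in check_lines(sample_text):
    print(number, problem)
```

An empty result only means that none of these particular problems was found; proofing the file by eye is still needed.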

HINT: if you would like to check your input format by printing out a hard copy of your file, make sure to use a nonproportional (i.e. fixed letter width) font. This makes it easier to count characters and determine the alignment of columns.

CANOCO for Windows

CANOCO for Windows has a new facility, WCanoImp, which makes it easier to create data from a spreadsheet. The general procedure is as follows:

Create a block of data, in which the columns are species and the rows are samples (or the other way around).
Copy the data into the clipboard.
Open WCanoImp.
Select CanoImp options.
Save the file using a descriptive name (see above).
Provide a descriptive subject line.
Exit WCanoImp.

Some general recommendations and comments:

You can provide names of species and samples to head the rows and columns.
If these names are longer than 8 characters, WCanoImp will truncate them to 8 characters.
Make sure that species names and samples are still distinguishable after truncation. For example, if you spell out the entire taxon name, members of the same genus will end up having the same name if the genus name is more than 6 characters long.
Species and/or sample names (very boring ones) will be generated by WCanoImp if they are missing in the block you copied.
For species data, and for environmental data with many categorical variables, make sure to save your file in reduced condensed format. Otherwise, you may end up with a very large file (since it would contain a lot of zeros).
It is entirely possible that your raw data are not in a clean column by row format. If so, check whether your spreadsheet has a summary table facility (such as Pivot Tables in Microsoft Excel). It is often possible to create a new data table.
In manipulating complex data tables in spreadsheets, be careful when copying columns and rows which include calculations (such as data transformations). Often the calculations refer to relative cell locations. Convert formulae to values where appropriate.
Despite the last advice, WCanoImp is perfectly capable of reading the results of formulae within the block of data.
Make sure that samples are in the same order in both the species data and the environmental data.
Eliminating species, samples, and environmental variables is quite simple in CANOCO for Windows (unlike in older versions). Therefore, even if you plan on performing many different analyses on subsets of the data, I generally recommend creating only one huge species data file, and one huge environmental data file. When running CANOCO, you can open an old *.con file, alter it, and save the new options into a new *.con file. Therefore, you can keep your runtime options entirely the same (where appropriate) with the exception of altering the actual data sets you analyze, and changing the name of the solution file. This practice will prevent the proliferation of a large number of data files.
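The recommendation above about names remaining distinguishable after truncation is easy to check before importing. The Python sketch below assumes that truncation simply keeps the first 8 characters; the function name truncation_clashes is made up for illustration.

```python
from collections import defaultdict


def truncation_clashes(names, length=8):
    """Group names that become identical once cut to `length` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:length]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


names = ["Quercus alba", "Quercus rubra", "Acer rubrum"]
print(truncation_clashes(names))
```

Here the two oaks collide because "Quercus" plus the following space already fills all 8 characters, while the maple keeps a distinct code.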

This web page is intended as a quick overview. See the CANOCO manual and readme files for further details and options.

This page was created and is maintained by Michael Palmer.