Load an input matrix
Open input matrix
To select the input matrix from your local file system, click
"Select file" and browse the file tree. The first time you select a file or a folder after launching CoNet,
it may take a bit of time before your file tree is visualized.
After having selected a file, click "Open". The file path is displayed in the field
above the "Select file" button.
If you want to clear the selected file, click "Select file", then "Cancel".
Alternatively, you can paste the full path of your file into the text field below the "Select file" button.
You can optionally load a second input matrix, which is expected to have the same
number of columns as the first.
In this case, co-occurrence is computed between all cross-matrix row combinations
(and never between two rows belonging to the same matrix).
This results in a bipartite network with two node types, each node type corresponding to
one input matrix.
Note that metadata have to include all rows of the first and second input matrix.
For instance, the first input matrix could represent Eukaryotes and the second Bacteria,
both sampled at the same number of locations. A combined metadata file could then provide the lineages
of both Eukaryotes and Bacteria. The delimiter and transpose options explained below apply to
both input matrices.
When two matrices are read in, each of their rows is assigned a group attribute called "matrix_number",
with different values for the first and second matrix. This group attribute can be entered into the
"Metadata and Feature loading menu", to use group options (see below).
Note that if the second matrix was generated from a biom file, its lineages will not be parsed.
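As a rough illustration of the cross-matrix pairing described above, the following Python sketch enumerates the row pairs that would be compared when two matrices are given. The function name and the dictionary-based matrix layout are assumptions made for this example; they are not part of CoNet.

```python
from itertools import product


def cross_matrix_pairs(first, second):
    """Return all (row_of_first, row_of_second) name pairs.

    `first` and `second` map row names to lists of values (hypothetical layout).
    Pairs of rows from the same matrix are never produced.
    """
    column_counts = {len(v) for v in first.values()} | {len(v) for v in second.values()}
    if len(column_counts) != 1:
        raise ValueError("both matrices need the same number of columns")
    return list(product(first, second))


eukaryotes = {"euk1": [1, 0, 3], "euk2": [2, 2, 0]}
bacteria = {"bac1": [5, 1, 1], "bac2": [0, 4, 2], "bac3": [1, 1, 1]}
pairs = cross_matrix_pairs(eukaryotes, bacteria)
print(len(pairs))  # 2 * 3 = 6 candidate edges
```

With m rows in the first matrix and k rows in the second, this yields m*k candidate edges, all of which connect nodes of different types.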
Matrix property computation
If you enable "Compute matrix info" and push the GO button,
some basic properties of the input matrix are displayed along with
configuration tips. For instance, the number of rows and columns, the minimal
and maximal column sums, and the total possible number of edges are displayed.
The total possible edge number is computed according to the formula:
n*(n-1)/2, where n is the row number (note that this
formula is not valid for association rule mining). Matrix properties
are computed after selected preprocessing and filter steps have been applied.
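The edge number formula can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name is chosen for this example.

```python
def max_edge_number(n_rows):
    """Number of unordered row pairs, n*(n-1)/2, for a single input matrix."""
    if n_rows < 0:
        raise ValueError("row number must not be negative")
    return n_rows * (n_rows - 1) // 2


print(max_edge_number(6))  # 15
```

A matrix with 6 rows can therefore yield at most 15 edges per similarity measure, as long as association rule mining is not used.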
Parsing options
By default, strength of co-occurrence is computed between rows. If it should be
computed between columns, activate the "Transpose data matrix" check box.
If matrix transposition is enabled, feature columns should match the columns of the
transposed matrix (i.e. the rows of the input matrix) and metadata identifiers
should refer to the rows of the transposed matrix.
The columns of the input matrix are by default expected to be tab-delimited.
However, the column separator can be changed by entering the desired
delimiter string in the text field below "Change default column delimiter".
The table can be optionally given in the QIIME taxon count table format, which can be
generated automatically from biom files. A converter for this task is available at http://biom-format.org/.
CoNet accepts QIIME OTU tables with and without taxonomy. When an OTU table with taxonomy is given (where the last column holds the lineages in the format k__KINGDOM;p__PHYLUM;c__CLASS;o__ORDER;f__FAMILY;g__GENUS;s__SPECIES),
metadata and metadata attributes are automatically set from this table (overriding any previously set metadata),
which allows the usage of metadata-dependent CoNet options such as higher-level taxon assignment or the suppression
of relationships between OTUs belonging to the same genus. The taxonomy is also parsed automatically from QIIME phylotype tables
where the lineage is included in the row name (example for a row name: k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella).
Note however that the OTU/phylotype table should not contain features, i.e. non-taxon rows. Note also that the QIIME taxon count table
format, if enabled, is assumed to be valid also for the second input matrix, if one is given. If two matrices are provided,
lineages will be parsed only from the first.
To enable the QIIME taxon count table format, activate the "Table obtained from biom file" check box. With this format enabled
(and lineage information provided), taxon nodes will provide values for the phylum, class, order, family, genus and species attributes,
although these values may be set to none.
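To make the lineage format more concrete, here is a minimal Python sketch that splits a QIIME-style lineage string into taxonomic levels. It is not CoNet's own parser: the function name, the prefix table and the use of the string "none" for missing levels are assumptions for illustration.

```python
# Prefixes as used in lineages such as k__KINGDOM;p__PHYLUM;...;s__SPECIES
LEVELS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
          "f": "family", "g": "genus", "s": "species"}


def parse_qiime_lineage(lineage):
    """Map level names to values; levels that are absent or empty become 'none'."""
    result = {name: "none" for name in LEVELS.values()}
    for part in lineage.split(";"):
        prefix, separator, value = part.strip().partition("__")
        if separator and prefix in LEVELS and value:
            result[LEVELS[prefix]] = value
    return result


example = ("k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
           "f__Prevotellaceae; g__Prevotella")
parsed = parse_qiime_lineage(example)
print(parsed["genus"], parsed["species"])  # Prevotella none
```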
Input matrix format
In general, each row should appear only once in your matrix, i.e. each row name should be unique.
Lines preceded by # are treated as comment lines and omitted.
The input matrix can be a pure numeric matrix with tab-delimited entries, such as the small example below:
9.4 5.4 10.1
13.5 4.4 11.9
14.3 13.4 8.9
It is also possible to specify a matrix with row names and without column names. The row name column
is expected to be the first column and row names should not be purely numeric or blank. Example:
taxon1 1.2 4.4 3.2 0.0 1.1
taxon2 2.2 4.3 1.1 0.0 1.2
taxon3 1.3 2.1 3.1 2.3 1.1
The matrix can also provide row and column names. In this case, it is expected that the column names form the
first row and that the row name column has itself a column name. Row and column names
should not be purely numeric or blank.
A small example illustrates this:
colnames col1 col2 col3 col4
tax1 1.2 1.0 3.2 0.0
tax2 2.2 2.3 1.4 0.0
tax3 1.3 2.1 3.1 2.3
tax4 1.2 1.4 2.1 2.4
tax5 2.3 2.1 4.0 3.1
tax6 2.5 2.4 2.3 1.3
Note that OTU tables generated from biom files are also accepted (see "Parsing options").
Treatment of missing values
Missing values have to be indicated using NaN. For example:
9.4 NaN 10.1
13.5 4.4 11.9
14.3 13.4 NaN
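The format rules above (comment lines, optional row and column names, unique row names, NaN for missing values) can be sketched as a small Python reader. This is an illustrative approximation and not CoNet's parser; the function names and the header-detection heuristic are assumptions.

```python
import math


def _is_number(text):
    try:
        float(text)  # float() also accepts "NaN"
        return True
    except ValueError:
        return False


def parse_matrix(text, delimiter="\t"):
    """Parse matrix text into (column_names, row_names, values).

    column_names and row_names are None when the matrix does not provide them.
    """
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    table = [ln.split(delimiter) for ln in lines]
    column_names = row_names = None
    # Assumed heuristic: a first row with non-numeric entries holds the column names.
    if table and not all(_is_number(field) for field in table[0][1:]):
        column_names = table[0][1:]  # the first field names the row name column
        table = table[1:]
    # A non-numeric first field means the first column holds row names.
    if table and not _is_number(table[0][0]):
        row_names = [row[0] for row in table]
        if len(set(row_names)) != len(row_names):
            raise ValueError("row names should be unique")
        table = [row[1:] for row in table]
    values = [[float(field) for field in row] for row in table]
    return column_names, row_names, values


example = "# comment line\ntax1\t9.4\tNaN\t10.1\ntax2\t13.5\t4.4\t11.9\n"
cols, rows, values = parse_matrix(example)
print(rows)                      # ['tax1', 'tax2']
print(math.isnan(values[0][1]))  # True
```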
Treatment of special characters
Special characters in row names (or in the first metadata column, which represents row names, see below)
are replaced by a dash to avoid confusion with CoNet's own special characters.
However, the original row names are stored as values of the node attribute "oriId".
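The replacement can be pictured with the following Python sketch. Which characters CoNet actually regards as special is not stated here, so the character class below (everything except letters, digits and underscores) is purely an assumption, as is the function name.

```python
import re


def sanitize_row_name(name):
    """Replace each character outside [A-Za-z0-9_] with a dash (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9_]", "-", name)


original = "k__Bacteria;p__Firmicutes"
node_attributes = {"oriId": original}  # the original name stays available
node_id = sanitize_row_name(original)
print(node_id)  # k__Bacteria-p__Firmicutes
```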
Matrix type
The matrix can be of three different types. An abundance matrix has continuous values,
a count matrix has integer values and an incidence matrix has only two values representing
presence (1) and absence (0). Methods like the hypergeometric distribution and association
rule mining can only be applied to the incidence matrix and will produce an error if applied to
an abundance matrix.
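Before choosing a method, it can help to check which of the three types a matrix belongs to. The Python sketch below is one possible way to do this; the function name and the decision to ignore NaN entries are assumptions made for the example.

```python
def matrix_type(values):
    """Classify a matrix (list of rows) as 'incidence', 'count' or 'abundance'."""
    # v == v is False for NaN, so missing values are ignored here (assumption).
    observed = [v for row in values for v in row if v == v]
    if all(v in (0, 1) for v in observed):
        return "incidence"
    if all(float(v).is_integer() for v in observed):
        return "count"
    return "abundance"


print(matrix_type([[9.4, 5.4], [13.5, 4.4]]))  # abundance
print(matrix_type([[3, 0], [12, 7]]))          # count
print(matrix_type([[1, 0], [0, 1]]))           # incidence
```

A check of this kind could be run before applying the hypergeometric distribution or association rule mining, which require an incidence matrix.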
Time series data
CoNet's methods are not tailored to deal with time series.
However, CoNet can compute lagged similarities via the option "Specify maximum lag".
If a value larger than zero is set for this option, input data are treated as time series.
For each time series, shifted versions are generated up to the given maximum shift.
This makes it possible to detect associations between two species with similar, but shifted abundances.
Take for example the following time series observed for species A and B:
speciesA 1 2 3 4 5
speciesB 2 3 4 5 6
If a lag of 1 is specified, then CoNet would generate rows
speciesA-0 1 2 3 4
speciesA-1 2 3 4 5
speciesB-0 2 3 4 5
speciesB-1 3 4 5 6
If missing values are treated, the shifted rows would be generated as follows:
speciesA-0 1 2 3 4 5
speciesA-1 2 3 4 5 NaN
speciesB-0 2 3 4 5 6
speciesB-1 3 4 5 6 NaN
CoNet does not compute auto-correlations, e.g. no similarity is computed between speciesA-0 and speciesA-1.
In addition, similarities between the same shift (larger than zero) are also
not computed, e.g. the combination speciesA-1, speciesB-1 is omitted.
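The shifting and pairing rules described above can be reproduced with a short Python sketch. It mirrors the example rows only; the function names and the `pad_with_nan` switch (standing in for the case where missing values are treated) are assumptions, not CoNet options.

```python
from itertools import combinations


def lagged_rows(rows, max_lag, pad_with_nan=False):
    """Generate shifted copies of each time series, named '<row>-<shift>'."""
    result = {}
    for name, series in rows.items():
        for shift in range(max_lag + 1):
            shifted = list(series[shift:])
            if pad_with_nan:
                shifted += [float("nan")] * shift  # keep the original length
            else:
                shifted = shifted[: len(series) - max_lag]  # trim to a common length
            result[f"{name}-{shift}"] = shifted
    return result


def compared_pairs(lagged_names):
    """Row pairs left after dropping auto-correlations and equal non-zero shifts."""
    pairs = []
    for a, b in combinations(lagged_names, 2):
        name_a, shift_a = a.rsplit("-", 1)
        name_b, shift_b = b.rsplit("-", 1)
        if name_a == name_b:
            continue  # no auto-correlation
        if shift_a == shift_b and int(shift_a) > 0:
            continue  # same shift larger than zero is omitted
        pairs.append((a, b))
    return pairs


series = {"speciesA": [1, 2, 3, 4, 5], "speciesB": [2, 3, 4, 5, 6]}
lagged = lagged_rows(series, max_lag=1)
padded = lagged_rows(series, max_lag=1, pad_with_nan=True)
pairs = compared_pairs(list(lagged))
print(lagged["speciesA-1"])  # [2, 3, 4, 5]
print(len(pairs))            # 3
```

For the two example species and a lag of 1, three row pairs remain: speciesA-0 with speciesB-0, speciesA-0 with speciesB-1, and speciesA-1 with speciesB-0.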
Beware that specifying a lag greatly increases the runtime as well as the risk of false positives.
Metadata and Feature Loading
If you click on the "Metadata and features menu", a sub-menu will be opened.
The four functions of this sub-menu are explained below:
- Metadata
Metadata are row attributes.
For instance, if the input matrix rows represent species and we want to associate each species with its genus,
we can upload a metadata matrix consisting of two tab-delimited columns: the first listing the row names of the input matrix
and the second listing the genera.
More than one attribute per row can be uploaded by providing a matrix with several tab-delimited
columns. The attribute names have to be entered in the text field below "Enter metadata column names", in the
order in which they appear in the metadata matrix. If several attribute names are given, they have to be
separated by a slash. For instance, if we want to load values for the attributes "lineage" and "taxon" on the rows,
we need to upload a matrix with 3 columns (row names, lineage attribute values and taxon attribute values) and
specify the two attribute names separated by a slash in the metadata name input field.
One column of the metadata file may assign group memberships. For instance, a column assigning body sites to
each row groups the rows by body site. If a group attribute has been specified, parent-child relationships between taxa across
different groups are allowed, and column-wise normalization and renormalization (see randomization)
are carried out group-wise.
A second group attribute can be provided, which does not have any of the effects of the first
group attribute. Instead, specifying this group attribute prevents intra-group links, i.e. links
between nodes sharing the same value for this group attribute are forbidden. This is useful to prevent
links between too closely related taxa.
- Lineage
The lineage is a specific row attribute that provides the taxonomic classification, from most
generic to most specific, e.g. Bacteria--Bacteroidetes--Bacteroidia--Bacteroidales--Prevotellaceae--Prevotella.
The two dashes ('--') between the taxonomic levels are the lineage separator, which can be changed using
the option "lineage separator". The lineage column metadata name is fixed to "lineage".
It is recommended that, if both lineage and taxon are provided, the lineage includes the taxon itself as last entry,
e.g. if the taxon is OTU-12345, the lineage could be Bacteria--Bacteroidetes--OTU-12345.
When a matrix consists of taxa on the same level and the lineages are given in the metadata, higher-level
taxa rows can be automatically assigned by summing lower-level taxon rows. The option "explore links
between higher-level taxa" enables automatic assignment of higher-level taxon rows from the lineages, which makes it
possible to compute correlations between higher-level taxa. This option, however, assumes that all taxa in the input matrix are
on the same taxonomic level and that taxonomic levels have different names (Beware: Actinobacteria is the name of the class and the phylum,
so the name would need a modification such as Actinobacteria_class to differentiate between class and phylum).
It is advisable to use this option together with the parent-child exclusion filter.
- Row combination filters
Specifying metadata makes it possible to apply the following filters during network building:
- The parent-child exclusion filter prevents the formation of links between taxa that have a parent-child relationship,
such as in Pasteurellaceae versus Pasteurella. This has an impact on the run-time and the (non-edge-wise) p-value calculation.
To apply this filter, a lineage and a taxon attribute are expected (with metadata column names "lineage" and "taxon" respectively).
The taxon is expected to correspond to the last entry of the lineage.
For instance, one line in the metadata file (with a lineage and a taxon column) could look like this:
id1 Bacteria--Actinobacteria Actinobacteria
- The Within-group relationships only filter can be applied to prevent links between items belonging to different groups. It requires the first group attribute to be set.
- The Between-group relationships only filter can be applied to prevent links between items belonging to the same group. It requires the first group attribute to be set.
- Features
Features are additional rows that differ from the other rows of the input matrix. For instance, if
the matrix describes count data of species, the features could capture physical properties such as the temperature or pH.
In the network, feature nodes are displayed with a different shape than the other nodes.
Features should be set separately, since they need to be excluded from some operations, such as column-wise normalization.
The feature matrix is expected to be tab-delimited and to contain continuous data when the input matrix is of count or abundance type,
or binary data when the input matrix is an incidence (i.e. presence/absence) matrix. Otherwise, the rules for input matrix
parsing also apply to feature matrix parsing, i.e. row and column names should not be purely numeric, special characters will be replaced
and lines preceded by # are interpreted as comment lines and skipped.
The feature matrix should not contain negative values, because these cannot be treated correctly by Kullback-Leibler and other measures that
interpret standardized abundances as probabilities (if you only plan to use correlations, negative values are fine).
In addition, constant features (i.e. those that have the same value
across all samples) should also be avoided.
If "transpose" is enabled, the feature matrix is transposed, i.e. columns are
parsed as rows and vice versa.
If "match samples" is enabled, feature samples are matched to input samples. Samples not present in the feature matrix
will be discarded from the input matrix and vice versa. In addition, samples in the feature matrix are ordered in the same
way as the input matrix. Thus, this option can be activated if the feature samples are in a different order than those of
the input matrix. Note that matching is carried out after transposing, if both options are enabled.
Note that the feature matrix is expected to have the same samples in the same order as the input matrix (after transposing),
except when matching is enabled.
The following feature-specific filters are provided:
- Activating the first filter excludes features from association rule mining.
- Activating the second filter keeps only features in the network that are supported by Spearman correlation.
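To illustrate how the metadata, lineage and parent-child concepts of this sub-menu fit together, the following Python sketch reads one metadata line, tests two lineages for a parent-child relationship and sums rows to a higher taxonomic level. It is a simplified model, not CoNet's implementation: all function names are hypothetical, and each lineage is assumed to end with the taxon itself, as recommended above.

```python
def parse_metadata_line(line, attribute_names):
    """Split a tab-delimited metadata line into (row_name, {attribute: value}).

    `attribute_names` is the slash-separated string, e.g. "lineage/taxon".
    """
    fields = line.rstrip("\n").split("\t")
    return fields[0], dict(zip(attribute_names.split("/"), fields[1:]))


def is_parent_child(lineage_a, lineage_b, separator="--"):
    """True if the taxon ending one lineage is an ancestor in the other lineage."""
    levels_a, levels_b = lineage_a.split(separator), lineage_b.split(separator)
    return levels_a[-1] in levels_b[:-1] or levels_b[-1] in levels_a[:-1]


def sum_by_level(matrix, lineages, level_index, separator="--"):
    """Sum lower-level rows into one row per taxon found at `level_index`."""
    sums = {}
    for row_name, values in matrix.items():
        levels = lineages[row_name].split(separator)
        if level_index >= len(levels):
            continue  # this lineage is too short to provide the requested level
        current = sums.get(levels[level_index], [0.0] * len(values))
        sums[levels[level_index]] = [c + v for c, v in zip(current, values)]
    return sums


row_name, attributes = parse_metadata_line(
    "id1\tBacteria--Actinobacteria\tActinobacteria", "lineage/taxon")
print(attributes["taxon"])  # Actinobacteria

family = "Bacteria--Proteobacteria--Pasteurellaceae"
genus = "Bacteria--Proteobacteria--Pasteurellaceae--Pasteurella"
print(is_parent_child(family, genus))  # True

matrix = {"otu1": [1, 2], "otu2": [3, 4], "otu3": [5, 6]}
lineages = {"otu1": "Bacteria--Bacteroidetes--otu1",
            "otu2": "Bacteria--Bacteroidetes--otu2",
            "otu3": "Bacteria--Firmicutes--otu3"}
phylum_rows = sum_by_level(matrix, lineages, level_index=1)
print(phylum_rows["Bacteroidetes"])  # [4.0, 6.0]
```

Because the parent-child test compares names only, identically named levels (such as Actinobacteria as class and phylum) would be confused, which is why distinct names per level are needed.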
Optionally, the resulting network can be saved to a given location, which makes it possible to store results in formats
not supported by Cytoscape. If no output file is selected, the network is visualized in Cytoscape
without being saved to a file.
To specify an output file location, first select a folder via the "Select folder" button,
then click on the folder into which the output file should be saved and click "Choose". The selected
folder will be displayed in the text area below this button.
Then type a file name in the text area below "Type file name".
To undo the specification of an output file, remove the file name or push the "Select folder" button
and then push "Cancel".
Network formats
The GDL (GraphDataLinker) format is a custom format that stores networks along with their node and edge attributes.
In contrast to all Cytoscape-supported formats, it preserves multi-edges.
GDL networks can be loaded into Cytoscape via a button in the main menu.
The "tab" format is a tab-delimited custom format
that consists of two parts: a node part listing all nodes and their attribute values, one per line,
and an edge part listing all edges and their attribute values, one per line.
Here is a small example:
;NODES genus
a Escherichia
b Yersinia
c Salmonella
d Mycoplasma
;ARCS score
a b 1.1
b c 2.4
c a 3
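The "tab" example above can be read back with a few lines of Python. This is a sketch based only on the layout shown in the example: it assumes that fields are separated by tabs, that the ";NODES" and ";ARCS" lines list the attribute names of their section, and that attribute values are kept as strings. The function name is hypothetical.

```python
def parse_tab_network(text):
    """Parse the 'tab' network format into (nodes, edges).

    nodes: {node_name: {attribute: value}}
    edges: [(source, target, {attribute: value}), ...]
    """
    nodes, edges = {}, []
    section, attribute_names = None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if fields[0] in (";NODES", ";ARCS"):
            # Section header: the remaining fields name the attributes.
            section, attribute_names = fields[0], fields[1:]
        elif section == ";NODES":
            nodes[fields[0]] = dict(zip(attribute_names, fields[1:]))
        elif section == ";ARCS":
            edges.append((fields[0], fields[1], dict(zip(attribute_names, fields[2:]))))
    return nodes, edges


example = (";NODES\tgenus\n"
           "a\tEscherichia\nb\tYersinia\nc\tSalmonella\nd\tMycoplasma\n"
           ";ARCS\tscore\n"
           "a\tb\t1.1\nb\tc\t2.4\nc\ta\t3\n")
nodes, edges = parse_tab_network(example)
print(nodes["b"])  # {'genus': 'Yersinia'}
print(edges[1])    # ('b', 'c', {'score': '2.4'})
```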
The other formats are supported by the following tools:
- VisML: This is the input format of the network and pathway analysis platform VisANT.
- Dot: This is the input format of the command line graph visualization software GraphViz.
Output network properties
Each output network has a "Comment" and a "CALL" entry. The comment is a long string that lists
the details of network generation, whereas the "CALL" entry holds the command line call that was used to generate the network.