Load an input matrix

Open input matrix

To select the input matrix from your local file system, click "Select file" and browse the file tree. The first time you select a file or a folder after launching CoNet, it may take a moment before the file tree is displayed. After having selected a file, click "Open". The file path is displayed in the field above the "Select file" button. If you want to clear the selected file, click "Select file", then "Cancel". Alternatively, you can paste the full path of your file into the text field below the "Select file" button.
You can optionally load a second input matrix. In this case, co-occurrence is computed between all cross-matrix row combinations (and never between two rows belonging to the same matrix). This results in a bipartite network with two node types, each node type corresponding to one input matrix. The first and second matrix are expected to have an equal number of columns.
Note that metadata have to include all rows of the first and second input matrix. For instance, the first input matrix could represent Eukaryotes and the second Bacteria sampled for the same number of locations. A combined metadata file could then provide the lineages of both Eukaryotes and Bacteria. The delimiter and transpose options explained below apply to both input matrices.
When two matrices are read in, each of their rows is assigned a group attribute called "matrix_number", with different values for the first and second matrix. This group attribute can be entered into the "Metadata and Feature loading menu", to use group options (see below).
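
The cross-matrix pairing described above can be illustrated with a minimal Python sketch. The row names and the function name cross_matrix_pairs are hypothetical and are not part of CoNet; the sketch only enumerates which row combinations would be candidates for an edge.

```python
from itertools import product

# Hypothetical row names of the first and second input matrix.
first_matrix_rows = ["euk1", "euk2"]
second_matrix_rows = ["bac1", "bac2", "bac3"]


def cross_matrix_pairs(rows_a, rows_b):
    """Return every pair made of one row from each matrix.

    Pairs of rows from the same matrix are never produced, which is
    what makes the resulting network bipartite.
    """
    return list(product(rows_a, rows_b))


pairs = cross_matrix_pairs(first_matrix_rows, second_matrix_rows)
print(len(pairs))  # 2 rows x 3 rows = 6 candidate pairs
```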

Matrix property computation

If you enable "Compute matrix info" and push the GO button, some basic properties of the input matrix are displayed along with configuration tips. For instance, the row and column number, the minimal and maximal column sums, and the total possible edge number are displayed. The total possible edge number is computed according to the formula n*(n-1)/2, where n is the row number (note that this formula is not valid for association rule mining). Matrix properties are computed after the selected preprocessing and filter steps have been applied.
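
As a quick check of the formula above, the following Python sketch computes the total possible edge number for a given row count. The function name possible_edges is chosen here for illustration only.

```python
def possible_edges(n_rows: int) -> int:
    """Number of unordered row pairs, n*(n-1)/2."""
    if n_rows < 0:
        raise ValueError("row number must not be negative")
    return n_rows * (n_rows - 1) // 2


# A matrix with 6 rows allows 6*5/2 = 15 distinct row pairs.
print(possible_edges(6))
```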

Parsing options

By default, the strength of co-occurrence is computed between rows. If it should be computed between columns, activate the "Transpose data matrix" check box. If transposing is enabled, feature columns should match the columns of the transposed matrix (i.e. the rows of the input matrix) and metadata identifiers should refer to the rows of the transposed matrix.

The columns of the input matrix are by default expected to be tab-delimited. However, the column separator can be changed by entering the desired delimiter string in the text field below "Change default column delimiter".

The table can optionally be given in the QIIME taxon count table format, which can be generated automatically from biom files. A converter for this task is available at http://biom-format.org/. To enable the QIIME taxon count table format, activate the "Table obtained from biom file" check box.

CoNet accepts QIIME OTU tables with and without taxonomy. When an OTU table with taxonomy is given (where the last column holds the lineages in the format k__KINGDOM;p__PHYLUM;c__CLASS;o__ORDER;f__FAMILY;g__GENUS;s__SPECIES), metadata and metadata attributes are automatically set from this table (overriding any previously set metadata). This allows the usage of metadata-dependent CoNet options such as higher-level taxon assignment or the suppression of relationships between OTUs belonging to the same genus. The taxonomy is also parsed automatically from QIIME phylotype tables where the lineage is included in the row name (example for a row name: k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella). With this format enabled (and lineage information provided), taxon nodes will provide values for the phylum, class, order, family, genus and species attributes, although these values may be set to none.

Note, however, that the OTU/phylotype table should not contain features, i.e. non-taxon rows. Note also that the QIIME taxon count table format, if enabled, is assumed to apply to the second input matrix as well, if one is given. If two matrices are provided, lineages will be parsed only from the first.
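
To make the lineage format above concrete, here is a minimal Python sketch that splits a QIIME-style lineage string into taxonomic levels. It is not CoNet's own parser: the function name parse_lineage and the mapping PREFIX_TO_RANK are assumptions made for this illustration, and the placeholder "none" for missing levels merely mirrors the remark above that attribute values may be set to none.

```python
# Assumed mapping from QIIME rank prefixes to attribute names.
PREFIX_TO_RANK = {
    "k": "kingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_lineage(lineage: str) -> dict:
    """Split 'k__X; p__Y; ...' into a rank -> name dictionary.

    Ranks that are absent or empty keep the placeholder "none".
    """
    ranks = {rank: "none" for rank in PREFIX_TO_RANK.values()}
    for entry in lineage.split(";"):
        entry = entry.strip()
        if "__" not in entry:
            continue
        prefix, name = entry.split("__", 1)
        rank = PREFIX_TO_RANK.get(prefix)
        if rank and name:
            ranks[rank] = name
    return ranks


row_name = ("k__Bacteria; p__Bacteroidetes; c__Bacteroidia; "
            "o__Bacteroidales; f__Prevotellaceae; g__Prevotella")
taxonomy = parse_lineage(row_name)
print(taxonomy["genus"], taxonomy["species"])
```

In this example the genus is recovered from the row name, while the species level, which the row name does not provide, keeps the placeholder.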

Input matrix format

In general, each row should appear only once in your matrix, i.e. each row name should be unique. Lines preceded by # are treated as comment lines and omitted. The input matrix can be a pure numeric matrix with tab-delimited entries, such as the small example below:
9.4	5.4	10.1
13.5	4.4	11.9
14.3	13.4	8.9

It is also possible to specify a matrix with row names and without column names. The row name column is expected to be the first column and row names should not be purely numeric or blank. Example:
taxon1	1.2	4.4	3.2	0.0	1.1
taxon2	2.2	4.3	1.1	0.0	1.2
taxon3	1.3	2.1	3.1	2.3	1.1

The matrix can also provide both row and column names. In this case, the column names are expected to form the first row, and the row name column is itself expected to have a column name. Row and column names should not be purely numeric or blank. A small example illustrates this:
colnames	col1	col2	col3	col4
tax1	1.2	1.0	3.2	0.0
tax2	2.2	2.3	1.4	0.0
tax3	1.3	2.1	3.1	2.3
tax4	1.2	1.4	2.1	2.4
tax5	2.3	2.1	4.0	3.1
tax6	2.5	2.4	2.3	1.3

Note that OTU tables generated from biom files are also accepted (see "Parsing options").
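
The layout with row and column names can be read with a few lines of Python. The sketch below is an illustration under stated assumptions, not CoNet's parser: the function name read_named_matrix is hypothetical, and the sketch handles only the layout in which the first non-comment line holds the column names. It skips comment lines starting with # and rejects duplicate row names, following the rules given above.

```python
import io


def read_named_matrix(handle, delimiter="\t"):
    """Read a matrix whose first row holds column names and whose
    first column holds row names. Returns (column names, rows)."""
    col_names = None
    rows = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comment lines are skipped
        fields = line.split(delimiter)
        if col_names is None:
            col_names = fields[1:]  # fields[0] names the row name column
            continue
        name = fields[0]
        if name in rows:
            raise ValueError(f"duplicate row name: {name}")
        rows[name] = [float(value) for value in fields[1:]]
    return col_names, rows


example = (
    "# a comment line\n"
    "colnames\tcol1\tcol2\n"
    "tax1\t1.2\t1.0\n"
    "tax2\t2.2\t2.3\n"
)
col_names, rows = read_named_matrix(io.StringIO(example))
print(col_names, rows["tax2"])
```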

Treatment of missing values

Missing values have to be indicated using NaN. For example:
9.4	NaN	10.1
13.5	4.4	11.9
14.3	13.4	NaN
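
If you prepare or check input files with a script, the NaN marker can be detected as in the following Python sketch. The function name parse_row is hypothetical. Python's float() accepts the string NaN, so no special handling is needed when reading such a row.

```python
import math


def parse_row(line, delimiter="\t"):
    """Convert one tab-delimited numeric row into a list of floats."""
    return [float(token) for token in line.split(delimiter)]


row = parse_row("9.4\tNaN\t10.1")
# Indices of the missing entries in this row.
missing = [index for index, value in enumerate(row) if math.isnan(value)]
print(missing)
```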

Treatment of special characters

Special characters in row names (or in the first metadata column, which represents row names, see below) are replaced by a dash to avoid confusion with CoNet's own special characters. However, the original row names are stored as values of the node attribute "oriId".

Matrix type

The matrix can be of three different types. An abundance matrix has continuous values, a count matrix has integer values and an incidence matrix has only two values representing presence (1) and absence (0). Methods like the hypergeometric distribution and association rule mining can only be applied to the incidence matrix and will produce an error if applied to an abundance matrix.
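
The three definitions above can be expressed as a small Python check. This only illustrates the definitions; it is an assumption of this sketch, not a description of how CoNet determines the matrix type, and the function name matrix_type is hypothetical. Missing values (NaN) are ignored.

```python
def matrix_type(rows):
    """Classify a numeric matrix as 'incidence', 'count' or 'abundance'."""
    # value == value is False only for NaN, so missing values are dropped.
    values = [value for row in rows for value in row if value == value]
    if all(value in (0, 1) for value in values):
        return "incidence"
    if all(float(value).is_integer() for value in values):
        return "count"
    return "abundance"


print(matrix_type([[0, 1, 1], [1, 0, 0]]))
print(matrix_type([[3, 0, 12], [7, 5, 1]]))
print(matrix_type([[9.4, 5.4, 10.1]]))
```

A script of this kind can help to catch the case mentioned above, where an incidence-only method would be applied to an abundance matrix.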

Time series data

CoNet's methods are not tailored to deal with time series. However, CoNet can compute lagged similarities via the option "Specify maximum lag". If a value larger than zero is set for this option, input data are treated as time series. For each time series, shifted versions are generated up to the given maximum shift. This makes it possible to detect associations between two species with similar, but shifted abundances. Take for example the following time series observed for species A and B:
speciesA	1	2	3	4	5
speciesB	2	3	4	5	6
If a lag of 1 is specified, then CoNet would generate the rows
speciesA-0	1	2	3	4
speciesA-1	2	3	4	5
speciesB-0	2	3	4	5
speciesB-1	3	4	5	6
If missing values are treated, the shifted rows would be generated as follows:
speciesA-0	1	2	3	4	5
speciesA-1	2	3	4	5	NaN
speciesB-0	2	3	4	5	6
speciesB-1	3	4	5	6	NaN
CoNet does not compute auto-correlations, e.g. no similarity is computed between speciesA-0 and speciesA-1. In addition, similarities between rows with the same shift (larger than zero) are not computed either, e.g. the combination speciesA-1, speciesB-1 is omitted. Beware that specifying a lag greatly increases the runtime as well as the risk of false positives.
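
The shifting scheme and the two pairing rules above can be reproduced with the following Python sketch. It is not CoNet's implementation: the function names lagged_rows and allowed_pairs are hypothetical, and the behaviour for lags larger than 1 is an assumption extrapolated from the lag-1 example.

```python
import math
from itertools import combinations


def lagged_rows(series, max_lag, keep_missing=False):
    """Generate shifted copies 'name-0' ... 'name-max_lag' of each row.

    Without missing values, every copy is trimmed to the common length.
    With keep_missing=True, shifted copies are padded with NaN instead.
    """
    out = {}
    for name, values in series.items():
        length = len(values)
        for lag in range(max_lag + 1):
            shifted = values[lag:]
            if keep_missing:
                shifted = shifted + [math.nan] * lag
            else:
                shifted = shifted[: length - max_lag]
            out[f"{name}-{lag}"] = shifted
    return out


def allowed_pairs(row_names):
    """Row pairs that remain after applying the two exclusion rules."""
    pairs = []
    for row_a, row_b in combinations(row_names, 2):
        name_a, lag_a = row_a.rsplit("-", 1)
        name_b, lag_b = row_b.rsplit("-", 1)
        if name_a == name_b:
            continue  # no auto-correlation
        if lag_a == lag_b and int(lag_a) > 0:
            continue  # same shift larger than zero
        pairs.append((row_a, row_b))
    return pairs


series = {"speciesA": [1, 2, 3, 4, 5], "speciesB": [2, 3, 4, 5, 6]}
trimmed = lagged_rows(series, max_lag=1)
padded = lagged_rows(series, max_lag=1, keep_missing=True)
pairs = allowed_pairs(list(trimmed))
print(trimmed["speciesA-1"], pairs)
```

For the two example series, three row pairs remain: speciesA-0 with speciesB-0, speciesA-0 with speciesB-1, and speciesA-1 with speciesB-0.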

Metadata and Feature Loading

If you click on the "Metadata and features menu", a sub-menu will be opened. The four functions of this sub-menu are explained below:
  1. Metadata. Metadata are row attributes. For instance, if the input matrix rows represent species and we want to associate each species with its genus, we can upload a metadata matrix consisting of two tab-delimited columns: the first listing the row names of the input matrix and the second listing the genera. More than one attribute per row can be uploaded by providing a matrix with several tab-delimited columns. The attribute names have to be entered in the text field below "Enter metadata column names", in the order in which they appear in the metadata matrix. If several attribute names are given, they have to be separated by a slash. For instance, if we want to load values for the attributes "lineage" and "taxon" on the rows, we need to upload a matrix with 3 columns (row names, lineage attribute values and taxon attribute values) and specify the two attribute names separated by a slash in the metadata name input field.
    One column of the metadata file may assign group memberships. For instance, a column assigning body sites to each row groups rows body-site-wise. If a group attribute has been specified, parent-child relationships between taxa across different groups are allowed and column-wise normalization and renormalization (see randomization) are carried out group-wise.
    A second group attribute can be provided, which does not have any of the effects of the first group attribute. Instead, specifying this group attribute prevents intra-group links, i.e. links between nodes sharing the same value for this group attribute are forbidden. This is useful to prevent links between too closely related taxa.
  2. Lineage. The lineage is a specific row attribute that provides the taxonomic classification, from most generic to most specific, e.g. Bacteria--Bacteroidetes--Bacteroidia--Bacteroidales--Prevotellaceae--Prevotella. The two dashes ('--') between the taxonomic levels are the lineage separator, which can be changed using the option "lineage separator". The lineage column metadata name is fixed to "lineage". If both lineage and taxon are provided, it is recommended that the lineage includes the taxon itself as its last entry, e.g. if the taxon is OTU-12345, the lineage could be Bacteria--Bacteroidetes--OTU-12345.
    When a matrix consists of taxa on the same level and the lineages are given in the metadata, higher-level taxon rows can be assigned automatically by summing lower-level taxon rows. The option "explore links between higher-level taxa" enables this automatic assignment from the lineages, which makes it possible to compute correlations between higher-level taxa. This option, however, assumes that all taxa in the input matrix are on the same taxonomic level and that taxonomic levels have different names (beware: Actinobacteria is the name of both a class and a phylum, so the name would need a modification such as Actinobacteria_class to differentiate between class and phylum). It is advisable to use this option together with the parent-child exclusion filter.
  3. Row combination filters. Specifying metadata makes it possible to apply the following filters during network building:
    1. The parent-child exclusion filter prevents the formation of links between taxa that have a parent-child relationship, such as Pasteurellaceae and Pasteurella. This has an impact on the run-time and the (non-edge-wise) p-value calculation. To apply this filter, a lineage and a taxon attribute are expected (with metadata column names "lineage" and "taxon" respectively). The taxon is expected to correspond to the last entry of the lineage. For instance, one line in the metadata file (with a lineage and a taxon column) could look like this:
      id1	Bacteria--Actinobacteria	Actinobacteria
      
    2. The Within-group relationships only filter can be applied to prevent links between items belonging to different groups. It requires the first group attribute to be set.
    3. The Between-group relationships only filter can be applied to prevent links between items belonging to the same group. It requires the first group attribute to be set.
  4. Features. Features are additional rows that differ from the other rows of the input matrix. For instance, if the matrix describes count data of species, the features could capture physical properties such as the temperature or pH. In the network, feature nodes are displayed with a different shape from the other nodes. Features should be set separately, since they need to be excluded from some operations, such as column-wise normalization.
    The feature matrix is expected to be tab-delimited and to contain continuous data when the input matrix is of count or abundance type, or binary data when the input matrix is an incidence (i.e. presence/absence) matrix. Otherwise, the rules for input matrix parsing also apply to feature matrix parsing, i.e. row and column names should not be purely numeric, special characters will be replaced as described above and lines preceded by # are interpreted as comment lines and skipped. The feature matrix should not contain negative values, because these cannot be treated correctly by Kullback-Leibler and other measures that interpret standardized abundances as probabilities (if you only plan to use correlations, negative values are fine). In addition, constant features (i.e. those that have the same value across all samples) should be avoided. If transpose is enabled, the feature matrix is transposed, i.e. columns are parsed as rows and vice versa. If match samples is enabled, feature samples are matched to input samples: samples not present in the feature matrix are discarded from the input matrix and vice versa, and samples in the feature matrix are ordered in the same way as in the input matrix. This option can therefore be activated if the feature samples are in a different order than those of the input matrix. If both options are enabled, matching is carried out after transposing. Unless matching is enabled, the feature matrix is expected to have the same samples in the same order as the input matrix (after transposing).

    The following feature-specific filters are provided:
    1. Activating the first filter excludes features from association rule mining.
    2. Activating the second filter keeps only features in the network that are supported by Spearman correlation.
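
The lineage-based operations described in this menu (the parent-child relationship and the assignment of higher-level taxon rows by summing) can be illustrated with the following Python sketch. It is a minimal illustration under stated assumptions, not CoNet's implementation: the function names is_parent_child and higher_level_rows and all example data are hypothetical, the default lineage separator '--' is used, and each lineage is assumed to end with the taxon itself, as recommended above.

```python
SEPARATOR = "--"  # default lineage separator


def is_parent_child(lineage_a, lineage_b, separator=SEPARATOR):
    """True if one lineage is a strict prefix of the other."""
    a = lineage_a.split(separator)
    b = lineage_b.split(separator)
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    return len(a) != len(b) and longer[: len(shorter)] == shorter


def higher_level_rows(matrix, lineages, separator=SEPARATOR):
    """Sum the rows of all taxa below each higher-level taxon.

    The last lineage entry is taken to be the taxon itself and is
    therefore not treated as a higher-level taxon.
    """
    sums = {}
    for row_name, values in matrix.items():
        ancestors = lineages[row_name].split(separator)[:-1]
        for taxon in ancestors:
            if taxon in sums:
                sums[taxon] = [x + y for x, y in zip(sums[taxon], values)]
            else:
                sums[taxon] = list(values)
    return sums


# Hypothetical count matrix (two samples) and its lineage metadata.
matrix = {"OTU-1": [1, 2], "OTU-2": [3, 4], "OTU-3": [5, 6]}
lineages = {
    "OTU-1": "Bacteria--Bacteroidetes--OTU-1",
    "OTU-2": "Bacteria--Bacteroidetes--OTU-2",
    "OTU-3": "Bacteria--Firmicutes--OTU-3",
}
higher = higher_level_rows(matrix, lineages)
print(higher["Bacteroidetes"], higher["Bacteria"])
print(is_parent_child("Bacteria--Bacteroidetes", lineages["OTU-1"]))
```

The example also shows why the parent-child exclusion filter is advisable in combination with higher-level taxon assignment: the summed row of Bacteroidetes contains the counts of OTU-1 and would otherwise be compared with it.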

Save output network

Saving the resulting network to a given location is an option that allows saving results in formats not supported by Cytoscape. If no output file is selected, the network is visualized in Cytoscape without being saved to a file.

To specify an output file location, first select a folder via the button "Select folder", then click on the folder into which the output file should be saved and click "Choose". The selected folder will be displayed in the text area below this button. Then type a file name in the text area below "Type file name". To undo the specification of an output file, remove the file name or push the button "Select folder" and then push "Cancel".

Network formats

The GDL (GraphDataLinker) format is a custom format that stores networks along with their node and edge attributes. In contrast to all Cytoscape-supported formats, it preserves multi-edges. GDL networks can be loaded into Cytoscape via a button in the main menu.
The "tab" format is a tab-delimited custom format that consists of two parts: a node part listing all nodes and their attribute values, one per line, and an edge part listing all edges and their attribute values, one per line. Here is a small example:
;NODES	genus
a	Escherichia
b	Yersinia
c	Salmonella
d	Mycoplasma
;ARCS	score
a	b	1.1
b	c	2.4
c	a	3
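
A file in the "tab" format can be read with a short script. The Python sketch below is based only on the small example above and assumes that every field is separated by a tab; the function name parse_tab_network is hypothetical, and real files may contain elements that this sketch does not handle.

```python
def parse_tab_network(text, delimiter="\t"):
    """Parse the node part (;NODES) and the edge part (;ARCS).

    Returns (nodes, edges): nodes maps a node name to its attributes,
    edges is a list of (source, target, attributes) tuples. All
    attribute values are kept as strings.
    """
    nodes, edges = {}, []
    node_attributes, edge_attributes = [], []
    section = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(delimiter)
        if fields[0] == ";NODES":
            section, node_attributes = "nodes", fields[1:]
        elif fields[0] == ";ARCS":
            section, edge_attributes = "arcs", fields[1:]
        elif section == "nodes":
            nodes[fields[0]] = dict(zip(node_attributes, fields[1:]))
        elif section == "arcs":
            attributes = dict(zip(edge_attributes, fields[2:]))
            edges.append((fields[0], fields[1], attributes))
    return nodes, edges


example = (
    ";NODES\tgenus\n"
    "a\tEscherichia\n"
    "b\tYersinia\n"
    "c\tSalmonella\n"
    "d\tMycoplasma\n"
    ";ARCS\tscore\n"
    "a\tb\t1.1\n"
    "b\tc\t2.4\n"
    "c\ta\t3\n"
)
nodes, edges = parse_tab_network(example)
print(nodes["a"], edges[0])
```

Note that node d appears in the node part only: nodes without edges are kept because nodes and edges are listed separately.
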
The other formats are supported by the following tools:

Output network properties

Each output network has a "Comment" and a "CALL" entry. The comment is a long string that lists the details of network generation, whereas the CALL string precedes the command line call that was used to generate the network.