CoNet

CoNet - Frequently Asked Questions (FAQ)

Which data can be analysed with CoNet?
CoNet builds networks from matrices that have items as rows and observations of items as columns. For instance, if you measured the abundance of 50 species in one hundred different locations, this data set forms a 50 x 100 matrix. CoNet is suitable for small (hundreds of rows) to medium-sized (a few thousand rows) matrices.
I have a really big matrix. Can I build a network from it with CoNet?
CoNet is inconvenient for building networks from matrices with more than a few thousand rows (the threshold row number depends on the selected measures and randomization routine). To build networks from bigger matrices, the following options can be considered:
1) filter rarely observed items (this is a good idea in any case, since it avoids bias due to double absences)
2) merge strongly similar items into groups (clustering algorithms can help here)
3) split your data if appropriate
4) consider another tool (check the CoNet manual for a list of other network building tools)
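Option 1 can be done outside CoNet before loading the matrix. The following is a minimal sketch, assuming the matrix is held as a dictionary of row name to list of values; the function name filter_rare_rows and the min_occurrence cutoff are hypothetical and not part of CoNet.

```python
def filter_rare_rows(matrix, min_occurrence):
    """Keep rows (items) that are non-zero in at least min_occurrence columns.

    matrix: dict mapping row name -> list of observations (one per column).
    This is an illustrative pre-filter, not CoNet's own filter implementation.
    """
    kept = {}
    for name, values in matrix.items():
        occurrence = sum(1 for v in values if v > 0)
        if occurrence >= min_occurrence:
            kept[name] = values
    return kept


# Toy abundance matrix: 3 items observed in 5 samples.
abundances = {
    "taxonA": [5, 0, 3, 8, 1],
    "taxonB": [0, 0, 0, 2, 0],  # observed in one sample only
    "taxonC": [1, 4, 0, 0, 6],
}
filtered = filter_rare_rows(abundances, min_occurrence=3)
print(sorted(filtered))
```

A suitable cutoff depends on the number of samples and on how many double absences you are willing to tolerate.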
Can I analyse time series data with CoNet?
In principle yes, but there is no method dedicated to time series in CoNet. However, CoNet offers an option called "maximum lag", which can be used to detect time-shifted dependencies. You can have a look at LSA or lagged correlations (as done here) for other approaches to infer networks from time series data.
Can CoNet deal with missing values?
Yes, but you need to enable missing value treatment in CoNet's configuration menu. The symbol for a missing value is NaN. Missing values are dealt with by omitting them from calculations. For instance, if you have two vectors (2, 4, NaN, 3, 1) and (NaN, 3, 5, 6, 2), the values pairing with a missing value are omitted, so the resulting vectors are: (4, 3, 1) and (3, 6, 2).
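The pairwise omission described above can be reproduced with a few lines of Python. This is only a sketch of the idea, not CoNet's code; the function name pairwise_omit is chosen here for illustration.

```python
import math


def pairwise_omit(x, y):
    """Drop every position where either vector holds a missing value (NaN)."""
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    return [a for a, _ in pairs], [b for _, b in pairs]


nan = float("nan")
x_clean, y_clean = pairwise_omit([2, 4, nan, 3, 1], [nan, 3, 5, 6, 2])
print(x_clean, y_clean)
```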
How can I parse taxon lineages?
There are two ways to parse taxon lineages in CoNet: you can either prepare a lineage file which lists for each taxon its lineage (this is called a metadata file in CoNet) or you can give a taxon table generated from a biom file, which contains lineages in a standard format either in the row names or in a taxonomy column (which is expected to be the last column). Once lineages are parsed, CoNet can automatically assign higher-level taxa from them. Lineages will be set as node attributes in the network.
How should I treat unclassified taxa?
Provide the lineage as far as you have it and set the OTU name as last entry, e.g. myphylum-myclass-OTU12345 or if you do not have OTUs, make sure the last entry is unique, e.g. myphylum-myclass-unclassified1. The tips of the lineages need to be unique, since otherwise they will be grouped together when higher-level taxa are assigned. For instance, if you have two lineages myphylum-myclass-unclassified and myphylum-myclass-myfamily-unclassified, unclassified is treated as a single taxon, although it represents two different taxonomic units here. Since version 1.1.0.b, CoNet ensures that taxonomic levels within the same lineage have unique names, so taxonomic groups like Actinobacteria (class) and Actinobacteria (phylum) will not be mixed up. However, in the example above, "unclassified" is used in two different lineages, referring to two different taxonomic units.
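Before loading lineages, it can help to check for tips that are shared between different lineages. The sketch below assumes lineages are written as hyphen-separated strings, as in the examples above; the function name ambiguous_tips is hypothetical and the check is not part of CoNet.

```python
from collections import defaultdict


def ambiguous_tips(lineages, sep="-"):
    """Return tips (last lineage entries) that occur under more than one parent lineage."""
    parents_by_tip = defaultdict(set)
    for lineage in lineages:
        levels = lineage.split(sep)
        parents_by_tip[levels[-1]].add(sep.join(levels[:-1]))
    return {tip: sorted(parents)
            for tip, parents in parents_by_tip.items()
            if len(parents) > 1}


lineages = [
    "myphylum-myclass-unclassified",
    "myphylum-myclass-myfamily-unclassified",
    "myphylum-myclass-OTU12345",
]
problems = ambiguous_tips(lineages)
print(problems)
```

Any tip reported here should be renamed (e.g. unclassified1, unclassified2) so that it is unique.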
How can I parse sample properties, such as pH and temperature?
In many cases, we are not only interested in taxon-taxon relationships, but also in taxon-feature relationships, where features describe sample properties (e.g. pH, temperature etc.). The best is to load these data as a feature matrix (in the Metadata and Feature submenu of the Input menu). The sample identifiers of the feature matrix are expected to be in the same order as the sample identifiers of the input matrix. If this is not the case, it is possible to order sample identifiers in the feature matrix automatically (option "Match samples"). Note that for count or abundance matrices, features should best be non-negative continuous (or count) data, whereas for presence/absence matrices, they should be binary. Also make sure not to provide constant features, i.e. features that have the same value for every sample. Features are displayed with a different node shape in the resulting network and excluded from normalization steps. In addition, non-Spearman supported feature links can be discarded (Spearman is one of the measures recommended by Legendre & Legendre for identifying taxon-feature links).
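If you prefer to prepare the feature matrix yourself, the two checks mentioned above (sample order and constant features) might look as follows. This is a sketch under the assumption that features are stored as a dictionary of feature name to list of values; match_samples and drop_constant are hypothetical helper names and do not reproduce CoNet's "Match samples" option.

```python
def match_samples(input_samples, feature_samples, feature_rows):
    """Reorder feature columns so that they follow the sample order of the input matrix."""
    index = {sample: i for i, sample in enumerate(feature_samples)}
    order = [index[sample] for sample in input_samples]  # KeyError if a sample is missing
    return {name: [values[i] for i in order]
            for name, values in feature_rows.items()}


def drop_constant(feature_rows):
    """Remove features that have the same value for every sample."""
    return {name: values for name, values in feature_rows.items()
            if len(set(values)) > 1}


input_samples = ["S1", "S2", "S3"]
feature_samples = ["S3", "S1", "S2"]
features = {"pH": [7.5, 6.8, 7.1], "site": [1, 1, 1]}

matched = drop_constant(match_samples(input_samples, feature_samples, features))
print(matched)
```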
What do I do with non-numeric sample properties, such as patient groups or different treatments?
Sometimes, you may have non-numeric sample properties, such as gender, patient or treatment groups. The treatment of these sample properties depends on their type. If they are binary (e.g. gender with 2 values that can be coded as FALSE/TRUE), you can test for associations in CoNet using binary association measures such as the hypergeometric distribution (available through the incidence matrix sub-menu in the methods menu). If they are categorical (e.g. treatment groups), you can either split the data and infer a network for each category (if each category has enough samples) or you can binarize the categories. Here is an example of binarization, where one categorical group variable with 2 groups is split into two binary group variables:

col_names Sample1_Group1 Sample2_Group1 Sample3_Group1 Sample1_Group2 Sample2_Group2 Sample3_Group2
group1 1 1 1 0 0 0
group2 0 0 0 1 1 1

The latter option visualizes which microbiota respond together to a treatment. If you want to look at interactions between microorganisms, it is better to split the samples by treatment.
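The binarization shown in the table above can be generated automatically. The following sketch assumes you have a list of (sample, group) pairs; the function name binarize_groups is hypothetical.

```python
def binarize_groups(sample_groups):
    """Turn one categorical variable into one binary row per category.

    sample_groups: list of (sample_id, group_label) pairs in column order.
    Returns a dict mapping group label -> list of 0/1 values, one per sample.
    """
    groups = sorted({group for _, group in sample_groups})
    return {group: [1 if g == group else 0 for _, g in sample_groups]
            for group in groups}


sample_groups = [
    ("Sample1_Group1", "group1"), ("Sample2_Group1", "group1"),
    ("Sample3_Group1", "group1"), ("Sample1_Group2", "group2"),
    ("Sample2_Group2", "group2"), ("Sample3_Group2", "group2"),
]
binary_rows = binarize_groups(sample_groups)
for group, row in binary_rows.items():
    print(group, *row)
```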
Can I compute relationships between items in two or more matrices?
Yes, but only if all your matrices have the same column number. In case of two matrices, you can simply use the second input matrix field in the Data menu. In this case, CoNet will only compute cross-relationships between the two matrices. In fact, the problem of cross-relationships between items in different matrices with the same column number generalizes to grouping rows in a single matrix. CoNet allows you to define such groups in your input matrix. You can then decide whether to compute all cross-group relationships only, or all intra-group relationships only.
How can I export my network into a format that is recognized by other tools?
You can always export a network you have computed using Cytoscape export facilities ("File"->"Export"). CoNet also supports a few formats not provided by Cytoscape (namely tab-delimited, VisML, Dot and GDL). If you want to export the network into one of those, you can configure CoNet as follows before launching the computation: In the "Data menu", open the "Output menu" and there select the format and location of the output file. This causes CoNet to save the network in the selected format at the selected location, in addition to computing its visualization in Cytoscape.
CoNet runs ages to compute my network. What can I do?
If the input matrix has several thousands of rows, permutations and bootstraps can take a long time to compute. In this situation, it is best to run CoNet on the command line, and better still, to submit jobs to a cluster. For more information on CoNet's command line support, please check here.
CoNet creates some pretty big log files on command line. How can I get rid of them?
Set the --verbosity option to fatal.
I just got an error message saying there was a null pointer exception while computing permutations. What can I do?
The following exception during the permutation step: java.lang.NullPointerException at be.ac.vub.bsb.cooccurrence.util.ArrayTools.fillJSLOutputArray is caused in some cases by the assignment of higher-level taxa. To avoid it, open the Metadata and features sub-menu in the data menu and switch off "explore links between higher-level taxa" and "parent-child exclusion". Please also restart Cytoscape. Sorry about this.
Problem with the scaled variance of log-ratios
In the scaled variance of log ratios, the assignment of copresence and mutual exclusion is wrong, so please do not use this measure.
I got an error message saying: java.lang.IllegalArgumentException: Initial edge selection: The lower threshold is larger than the upper threshold, such that lower and upper thresholds overlap!
This error message means that you asked for more separate positive and negative edges than edges exist. Please reduce the initial edge number by opening the automatic threshold setting sub-menu in the Methods menu and then reducing the edge selection parameter.
I got an error message saying: java.lang.IllegalArgumentException: The requested number of edges is higher than the maximal possible edge number!
This error message means that you asked for more edges than exist. For a matrix of n rows, there are n*(n-1)/2 possible edges. Please reduce the initial edge number by opening the automatic threshold setting sub-menu in the Methods menu and then reducing the edge selection parameter.
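To see how many edges are possible for your matrix before setting the edge selection parameter, the formula above can be evaluated directly. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def max_edges(n_rows):
    """Number of possible undirected edges between n_rows items: n*(n-1)/2."""
    return n_rows * (n_rows - 1) // 2


def check_requested_edges(n_rows, requested):
    """Raise an error if more edges are requested than can exist."""
    limit = max_edges(n_rows)
    if requested > limit:
        raise ValueError(
            f"requested {requested} edges, but only {limit} are possible "
            f"for {n_rows} rows"
        )
    return requested


print(max_edges(50))  # a 50-row matrix allows 50*49/2 edges
```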
Why does CoNet have so many options?
Different data sets require different preprocessing steps and measures. For instance, it doesn't make much sense to apply the hypergeometric distribution to abundance data or to normalize presence/absence data. Preprocessing and randomization options have a crucial impact on the result and need to be carefully chosen. There is also a lack of co-occurrence network validation efforts. With systematic evaluations in place, we would know the best options for each data type, but in the absence of such knowledge, CoNet cannot be preconfigured.
Can you give some advice concerning the configuration?
The role of data preprocessing is to reduce the variation coming from differences in processing. For instance, if different total read numbers were sequenced for each sample, species can covary just because of varying total read numbers. One way to address this problem is to divide each column by its sum, thus working with relative instead of absolute abundances. However, this normalization step distorts Pearson correlations (see Aitchison's "A Concise Guide to Compositional Data Analysis"), an effect that can be avoided by using log-ratio-based distance measures ("Compositional data and their analysis: an introduction", Pawlowsky-Glahn and Egozcue) or mitigated by renormalization (PLoS Comput Biol 8(7): e1002606, 2012).
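The column-sum normalization mentioned above is simple to write down. The sketch below assumes the matrix is a list of rows (items) with samples as columns; to_relative_abundances is a name chosen for illustration and is not a CoNet function.

```python
def to_relative_abundances(matrix):
    """Divide each column (sample) by its sum, giving relative abundances.

    matrix: list of rows, each row a list of counts, one per sample.
    Columns that sum to zero are left as zeros.
    """
    n_cols = len(matrix[0])
    col_sums = [sum(row[j] for row in matrix) for j in range(n_cols)]
    return [[row[j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n_cols)]
            for row in matrix]


# Two samples with the same community but a tenfold difference in total reads.
counts = [[10, 100],
          [30, 300],
          [60, 600]]
relative = to_relative_abundances(counts)
print(relative)
```

After normalization both samples look identical, which removes the variation caused by sequencing depth, at the price of the compositionality effects discussed above.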
Another issue is the presence of rows with many zeros. These zeros can be ambiguous, e.g. represent real absence or presence below detection level. Double zeros may make two rows more similar to each other than is desirable. For instance, consider 2 rare species. They may only occur in the three samples with the highest total read number, making them appear highly correlated. In reality, the 2 species could have occurred in all samples with different abundances, but below detection level, such that the detected relationship is due to differences in processing. For this reason, it is best to remove rare species.
In case of missing values, the option "pairwise_omit" should be enabled in the CoNet configuration menu.
As for the measures, they have different strengths and weaknesses. For example, mutual information (MI) is often recommended to detect dependencies, but its computation is time-consuming. Also, since MI detects generic dependencies, it does not state whether a dependency is negative or positive. In addition, the right way of MI computation is still a matter of debate (Bioinformatics 26(9): 1135-1139, 2010). Spearman correlation can detect non-linear dependencies well, but is sensitive to the rare-species problem. An overall comparison of measures for ecological network inference is still missing. Some hints can be found in "Numerical ecology" by L. Legendre & P. Legendre 1983.
To threshold the measures, CoNet offers various randomization routines. The basic idea of all these routines is the same: compute the measure's value (e.g. Pearson's correlation coefficient) for many randomized data sets and then choose a threshold that is above most values achieved in randomized data. If the hypergeometric distribution is run alone on a presence/absence data set, such a randomization is not necessary, as this measure (as implemented in CoNet) already returns multiple-test corrected p-values.
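The basic idea of such randomization routines can be sketched as a generic permutation test on Pearson's correlation. This is a textbook illustration, not CoNet's implementation; the function names, the number of permutations and the fixed seed are assumptions made for the example.

```python
import random


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long, non-constant vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Share of shuffled data sets whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


x = list(range(1, 11))
y = [2 * v + 1 for v in x]  # perfectly correlated with x
p_value = permutation_pvalue(x, y)
print(p_value)
```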
Which measures should I choose?
Measures can be grouped roughly in three groups (which often cluster together according to their edge rankings): dissimilarities and distances (Bray Curtis and Kullback-Leibler dissimilarity, Euclidean distance), correlations (Pearson, Spearman, Kendall) and measures of general dependency (mutual information). Measures have different weaknesses and strengths. Criteria to compare the performance of measures include: robustness to noise and outliers, compositionality bias, power (sensitivity), treatment of matching zeros, detection of non-linear dependencies, number of samples required, computational complexity and so on. For instance, Spearman can detect some types of non-linear dependencies, but it has a lower power than Pearson and both are biased by compositionality. Bray Curtis and Kullback Leibler dissimilarity are not biased by matching zeros or compositionality, but they are sensitive to outliers. Mutual information can detect non-linear dependencies, but it needs many samples, has a long run time and its results strongly depend on the selected implementation (as described here). A systematic comparison of association measures with respect to 16S data is still missing. To answer the question: choose at least a dissimilarity and a correlation measure. If you have enough samples, include mutual information (mutual information requires binning and the bin number scales with the root of the sample number).
What's the renormalization all about?
One way to normalize your data is to convert absolute into relative abundances by dividing read counts by the sample's total count. This causes distortions in the Pearson and Spearman correlations, an effect known as compositionality bias (see Aitchison's "A Concise Guide to Compositional Data Analysis"). Renormalization (see PLoS Comput Biol 8(7): e1002606, 2012) computes null distributions in such a way that the compositionality bias is mitigated.
What about this Rserve thing? I enabled Rserve in the configuration, and it still doesn't work.
Rserve is an R package that allows R to be contacted from within another application written in Java. You need to start the Rserve server outside CoNet first, before enabling Rserve in the CoNet configuration menu. The help explains how to install Rserve (it's really easy to install and runs everywhere).
I did not get a network. What's the matter?
Assuming that you did not get an error message, this means that CoNet did not find any significant associations for your data, given its current settings. You can run it less stringently by setting union instead of intersection as the network merge strategy, by increasing the p-value threshold or collecting more initial edges (this is equivalent to reducing the effect sizes). However, be aware that when you run CoNet less stringently, you also increase the probability of false positives.
What's the advantage of combining several similarities?
In a nutshell, the idea is that different measures agree on true positive relationships and disagree on false positive relationships. Thus, by combining measures, we would predict fewer false relationships (possibly at the cost of predicting fewer correct relationships). In the future, when data on known relationships are available, measures could be weighted to increase prediction accuracy.
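As a toy illustration of why combining measures by intersection is more stringent than combining them by union, consider edge lists produced by two measures. The sketch below only demonstrates the set logic; merge_edges is a hypothetical name and does not reproduce CoNet's merge code.

```python
def merge_edges(edge_sets, strategy="intersection"):
    """Combine undirected edge lists from several measures.

    edge_sets: list of edge lists, each edge a pair of node names.
    strategy: "intersection" keeps edges found by every measure,
              "union" keeps edges found by at least one measure.
    """
    normalized = [{frozenset(edge) for edge in edges} for edges in edge_sets]
    if not normalized:
        return set()
    if strategy == "intersection":
        return set.intersection(*normalized)
    if strategy == "union":
        return set.union(*normalized)
    raise ValueError(f"unknown strategy: {strategy}")


measure_1_edges = [("A", "B"), ("A", "C")]
measure_2_edges = [("B", "A"), ("C", "D")]  # ("B", "A") is the same edge as ("A", "B")

strict = merge_edges([measure_1_edges, measure_2_edges], "intersection")
lenient = merge_edges([measure_1_edges, measure_2_edges], "union")
print(len(strict), len(lenient))
```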
So CoNet can only find pairwise relationships?
Yes, that's right, with one exception: association rule mining (the apriori algorithm is integrated in CoNet) can find relationships between more than 2 items. However, it is very time-consuming to compute associations between more than 3 items. In the future, CoNet might support some form of sparse multiple regression to find higher-level associations in abundance data, most probably by interfacing an R package designed for that purpose.
How do I interpret CoNet's result?
CoNet outputs an association network which summarizes significant relationships between the input items. These relationships do not make any statements about causality. If for instance a positive relationship was found between two species A and B, there are several possible interpretations: A and B have overlapping niches, A and B are mutualistic, A is a commensalist of B, A is a predator and B is its major prey etc. The same is true for negative relationships (different niches, amensalism, competition, etc.). Some measures can predict asymmetric relationships (e.g. association rule mining) which help to resolve some of these ambiguities, but most measures available in CoNet predict symmetric relationships. For ideas on how to analyse association networks, see BMC Bioinformatics 13:113, 2012 and PLoS Comput Biol 8(7): e1002606, 2012. Cytoscape offers a variety of plugins to compute network properties. This tutorial gives some more advice and examples on network interpretation.
At which edge attribute should I look?
When you do not randomize, your edge attribute of interest is "weight"; when you randomize, it is "pval"; when you merge p-values, it is pval-MERGE_STRATEGY (e.g. pval-brown-merge); and when you correct for multiple testing, it is either "qval" or "sig".
How can I size my nodes by abundance or degree?
CoNet computes for each node its degree and the sum of its corresponding row. Thus, nodes can be sized according to degree or abundance by selecting "Node size" in the VizMapper, choosing either degree or abundance as attribute and "ContinuousMapping" as mapping type.
How can I color or shape my nodes by group membership or feature status?
All nodes in networks computed with CoNet provide the "isafeature" attribute, thus feature nodes can be colored or shaped differentially from taxon nodes by selecting "Node color" or "Node shape" in the VizMapper, then choosing the "isafeature" attribute and selecting "DiscreteMapping" as mapping type. If nodes should be colored or shaped according to group membership, the procedure is the same, except that the previously provided group attribute should be selected as node attribute.
Why doesn't Cytoscape save my CoNet network?
It seems that Cytoscape has a problem with saving unnamed networks, so this problem is solved by assigning a name to your network.
How reproducible are my results?
Networks computed with randomization are not perfectly reproducible, because the randomization distribution might differ slightly from run to run and thus result in slightly different p-values. However, the variability due to this effect is small, i.e. the edge overlap, measured by the Jaccard index (intersection of edges divided by union of edges), should be around 90% or above. Networks generated with CoNet alpha have smaller Jaccard indices of overlap when re-computed with the latest version (typically between 50-60%), because of a major re-implementation of CoNet's core plus subsequent bug fixes. You can however re-run with CoNet's alpha version available here (runs only on command line). Earlier beta versions of CoNet are available from its Cytoscape app store page.
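The edge overlap between two runs can be checked with a short script once both networks are exported as edge lists. This is a minimal sketch; edge_jaccard is a name chosen here, and edges are treated as undirected.

```python
def edge_jaccard(edges_a, edges_b):
    """Jaccard index of two undirected edge lists: |intersection| / |union|."""
    a = {frozenset(edge) for edge in edges_a}
    b = {frozenset(edge) for edge in edges_b}
    union = a | b
    return len(a & b) / len(union) if union else 1.0


run_1 = [("A", "B"), ("A", "C"), ("B", "D")]
run_2 = [("B", "A"), ("A", "C"), ("C", "D")]
overlap = edge_jaccard(run_1, run_2)
print(overlap)  # 2 shared edges out of 4 distinct edges
```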
An error occurred. What do I do now?
CoNet should have generated an error report for you, which you can send to:
If CoNet didn't generate a report or, even worse, Cytoscape crashed, please check whether the log file "output.log" was generated in your Cytoscape application folder. If you did not save your CoNet settings and Cytoscape crashed, please re-configure CoNet with the settings that caused the problem. Then either save the settings into a file or push the "Generate command line call" button and copy the command line call. Please send the settings or the command line call along with the log file (if possible).
Thanks for reporting the error.
How can I cite CoNet?
You can cite CoNet using reference: F1000 5:1519, Cytoscape apps Channel 2016
