Select measures and specify their thresholds
CoNet offers a set of methods to infer relationships between items.
To select a measure of interest, activate its corresponding check box.
Correlations
Correlation measures all have a range of [-1,+1], with -1 for the strongest negative relationship
(anti-correlation), 0 for the neutral case and +1 for the strongest positive relationship.
Briefly stated, Pearson assumes a linear relationship between the data, whereas
Spearman and Kendall are rank-based and thus do not assume linearity.
Note that Pearson, Spearman and Kendall are not defined if the standard deviation is zero.
Thus, two taxa with constant abundances will not be linked by these correlation measures.
For large matrices, the computation of Kendall can take a prohibitive amount of time and
is better carried out on the command line.
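As a quick illustration of the difference between the linear and the rank-based measures (using Python's scipy, not CoNet itself), the three coefficients can be compared on monotonic but non-linear toy data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # monotonic but non-linear

r_pearson = pearsonr(x, y)[0]      # below 1: the relationship is not linear
r_spearman = spearmanr(x, y)[0]    # rank-based: exactly 1
tau_kendall = kendalltau(x, y)[0]  # rank-based: exactly 1
```

Spearman and Kendall score the perfect monotonic dependency with 1, while Pearson penalizes the deviation from linearity.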
For the formulas and in-depth discussions, we refer to the respective Wiki pages:
Pearson,
Spearman,
Kendall
Note that all three correlations are sensitive to the so-called double-zero problem
(Legendre & Legendre), that is
a vector pair with many matched zeros will receive a higher score than the same vector pair
without them. Since the interpretation of zeros is often ambiguous, this is a serious drawback
when dealing with sparse data.
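A minimal numeric illustration of the double-zero problem (plain numpy, not CoNet code): padding a perfectly anti-correlated pair with matched zeros flips the Pearson correlation to a positive value.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 1.0])                      # perfect anti-correlation: r = -1
r_plain = np.corrcoef(x, y)[0, 1]

x_pad = np.array([0.0, 0.0, 0.0, 1.0, 2.0])  # same pair with three matched zeros
y_pad = np.array([0.0, 0.0, 0.0, 2.0, 1.0])
r_padded = np.corrcoef(x_pad, y_pad)[0, 1]   # now positive (0.6875)
```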
In addition, correlation results may be strongly biased when applied to normalized data (see Aitchison 2003).
Renormalization (see the randomization menu) can reduce this bias.
Similarities
Similarities range from 0 to 1 or infinity. The higher their score, the more similar are two objects.
- The Steinhaus similarity is the complement of the Bray Curtis dissimilarity and
is defined as
S(x,y) = 2W/(A+B), where A = sum of vector x, B = sum of vector y and W = sum of the element-wise minima, i.e. min(xi,yi)
It is bounded between 0 and 1.
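The formula translates directly into code; a sketch in Python on hypothetical vectors (not CoNet's implementation):

```python
import numpy as np

def steinhaus(x, y):
    """Steinhaus similarity S(x,y) = 2W/(A+B)."""
    w = np.minimum(x, y).sum()  # W = sum of element-wise minima
    return 2 * w / (x.sum() + y.sum())

s = steinhaus(np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0]))
# A = 6, B = 5, W = 1 + 2 + 1 = 4, so S = 8/11
```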
- The mutual information has been defined and implemented in a variety of ways, and different estimators of mutual information
give different results (Fernandes & Gloor).
The default in CoNet is a fast implementation of mutual information by Jean-Sebastien Lerat. A re-implementation of the ARACNE Matlab script,
which depends on a parameter (the Gauss kernel width), is also available.
With Rserve enabled, one can also select the implementation that is part of the
minet package in R (see config menu).
When computing mutual information with minet or the default, abundance matrices or normalized matrices require a
discretization step, which can be selected in the config menu.
Note that the discretization step computes the bin number as the square root of the sample number, thus mutual information with discretization
requires a sufficiently large sample number.
Whereas for most measures extreme values are indicative of co-presence at one end and mutual exclusion at the other,
a mutual information close to zero signals independence, whereas strong co-presence and strong mutual exclusion patterns
both receive high mutual information values. Thus, when mutual information is combined with the "Top and Bottom" option explained below,
twice the number of top edges is returned instead of top and bottom edges.
Note that for the same reason, mutual information cannot assign the interaction type.
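Both the bin-number rule and the symmetry between co-presence and exclusion can be sketched in a toy re-implementation (Python, not CoNet's actual code): with 16 samples the rule gives 4 bins, and a perfectly increasing and a perfectly decreasing relationship receive the same, maximal mutual information.

```python
import numpy as np

def discretize(v, n_bins):
    """Equal-width binning into n_bins bins (returns indices 0..n_bins-1)."""
    edges = np.linspace(v.min(), v.max(), n_bins + 1)
    return np.clip(np.digitize(v, edges[1:-1]), 0, n_bins - 1)

def mutual_information(bx, by, n_bins):
    """MI in nats from two discretized vectors."""
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (bx, by), 1)                 # joint contingency table
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)              # marginals
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

x = np.arange(16.0)
n_bins = int(np.sqrt(len(x)))                      # bin number = sqrt(sample number)
mi_up = mutual_information(discretize(x, n_bins), discretize(x, n_bins), n_bins)
mi_down = mutual_information(discretize(x, n_bins), discretize(-x, n_bins), n_bins)
# co-presence (mi_up) and exclusion (mi_down) both reach the maximum log(4)
```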
- The variance of log-ratios was introduced by Aitchison and is a measure of dissimilarity between
two components across compositions, i.e. two species across samples. It is defined as:
D(x,y)=var(log(xi/yi))
Aitchison suggested a transformation that scales the variance of log-ratios between 0 and 1:
S(x,y) = exp(-sqrt(D(x,y)))
This transformation converts the measure into a similarity, with 0 corresponding to a lack of proportional relationship and 1 to a perfect
proportional relationship.
Note that a pseudocount is added to deal with zero values. However, the variance of log-ratios depends on the selected pseudocount;
thus, variances of log-ratios computed on different zero-containing data sets are only comparable if the same pseudocount was used.
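A minimal sketch of the measure (toy Python with a hypothetical pseudocount argument) illustrates both the definition and the pseudocount caveat:

```python
import numpy as np

def variance_log_ratios(x, y, pseudocount=0.0):
    """D(x,y) = var(log(xi/yi)), with an optional pseudocount for zeros."""
    return float(np.var(np.log((x + pseudocount) / (y + pseudocount))))

x = np.array([1.0, 2.0, 4.0])
d_prop = variance_log_ratios(x, 2 * x)  # perfectly proportional: D = 0

xz = np.array([0.0, 2.0, 4.0])          # zero-containing data:
yz = np.array([1.0, 0.0, 2.0])          # D changes with the pseudocount
d1 = variance_log_ratios(xz, yz, pseudocount=1.0)
d2 = variance_log_ratios(xz, yz, pseudocount=0.5)
```

Since d1 and d2 differ, scores from runs with different pseudocounts cannot be compared.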
- The distance correlation (also known as Brownian correlation) is a similarity measure in the range of 0 to 1.
In contrast to correlation measures, 0 flags statistical independence. As with mutual information, an interaction type cannot be assigned
with this measure. For this reason, if it is used together with the "Top and Bottom" option, twice the number of top edges is returned.
The distance correlation has been implemented following the implementation in the R package energy (DCOR). For more information,
see the Wiki entry.
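The standard empirical distance correlation can be sketched in a few lines (plain numpy, not the actual DCOR code); note how both a perfectly increasing and a perfectly decreasing relationship score 1, which is why no interaction type can be derived:

```python
import numpy as np

def dcor(x, y):
    """Empirical distance correlation of two vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])            # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # double centering: subtract row and column means, add the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return float(np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean())))

x = np.arange(6.0)
d_up = dcor(x, x)    # perfect co-presence pattern:    1
d_down = dcor(x, -x) # perfect mutual exclusion:  also 1
```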
- The Hilbert-Schmidt independence criterion (HSIC) is another general measure of dependency
(see this Wiki entry).
As in the case of mutual information and distance correlation, no interaction type
can be assigned, and the option "Top and Bottom" simply returns twice the number of top edges.
An HSIC of zero flags independence, and the higher the HSIC, the stronger the dependence.
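A compact sketch of the (biased) empirical HSIC estimator with Gaussian kernels (illustrative Python; CoNet's implementation may differ, and the kernel width used here is an assumption):

```python
import numpy as np

def rbf_kernel(v, sigma=1.0):
    """Gaussian (RBF) kernel matrix of a vector."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (n-1)^2."""
    n = len(x)
    K = rbf_kernel(np.asarray(x, float), sigma)
    L = rbf_kernel(np.asarray(y, float), sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

x = np.arange(8.0)
h_dep = hsic(x, x)             # strong dependence: clearly positive
h_indep = hsic(x, np.zeros(8)) # constant partner carries no information: 0
```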
Dissimilarities
Dissimilarities range from 0 to 1 or from 0 to infinity. The higher the score, the more dissimilar two objects are.
A distance (or metric) has to fulfill the following criteria: 1) it is never negative, 2) it is zero only
if objects are identical, 3) it is symmetric, i.e. d(x,y)=d(y,x), and 4) it fulfills the triangle inequality
(d(x,z) <= d(x,y)+d(y,z)). Dissimilarities need not meet the fourth criterion.
- Euclidean distance is defined as
D(x,y) = sqrt(SUMi(xi - yi)^2) and goes from 0 to infinity.
Each row is divided by its sum, such that it adds up to one.
- Bray Curtis dissimilarity
This is the complement of the Steinhaus similarity described above and is bounded between 0 and 1. The formula is:
D(x,y) = 1 - 2W/(A+B), where A = sum(x), B = sum(y) and W = sum(min(xi,yi))
Each row is divided by its sum, such that it adds up to one.
- Hellinger distance is computed as follows:
D(x,y) = sqrt(SUMi (sqrt(xi) - sqrt(yi))^2), where x and y are supposed to sum to one
Each row is divided by its sum, such that it adds up to one.
The Hellinger distance is closely related to the Kullback-Leibler divergence.
- Kullback-Leibler dissimilarity
D(x,y) = SUMi (xi*log(xi/yi) + yi*log(yi/xi)), where x and y have been standardized to sum to one.
Note that a pseudocount is added to deal with zero values.
- Jensen-Shannon dissimilarity
D(x,y) = 1/2*SUMi (xi*log(xi/Mi) + yi*log(yi/Mi)), where Mi = (xi+yi)/2 and where x and y have been standardized to sum to one.
Note that a pseudocount is added to deal with zero values.
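The formulas above, including the row normalization applied before Euclidean, Bray Curtis and Hellinger, can be sketched as follows (toy Python with a hypothetical pseudocount for the two divergence-based measures, not CoNet's code):

```python
import numpy as np

def normalize(v):
    return v / v.sum()  # divide each row by its sum

def euclidean(x, y):
    return float(np.sqrt(((x - y) ** 2).sum()))

def bray_curtis(x, y):
    return 1 - 2 * np.minimum(x, y).sum() / (x.sum() + y.sum())

def hellinger(x, y):
    return float(np.sqrt(((np.sqrt(x) - np.sqrt(y)) ** 2).sum()))

def kullback_leibler(x, y, pseudocount=1e-6):
    x, y = normalize(x + pseudocount), normalize(y + pseudocount)
    return float((x * np.log(x / y) + y * np.log(y / x)).sum())

def jensen_shannon(x, y, pseudocount=1e-6):
    x, y = normalize(x + pseudocount), normalize(y + pseudocount)
    m = (x + y) / 2
    return float(0.5 * (x * np.log(x / m) + y * np.log(y / m)).sum())

x = normalize(np.array([1.0, 2.0, 3.0]))
y = normalize(np.array([2.0, 2.0, 2.0]))
d_bc = bray_curtis(x, y)  # W = 1/6 + 1/3 + 1/3 = 5/6, A = B = 1, so D = 1/6
```

All five return 0 for identical vectors and grow with increasing dissimilarity.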
Incidence methods
In case the matrix is of type "incidence" (i.e. only contains presence/absence values)
or a conversion to the incidence type has been specified in the preprocessing menu,
a number of methods specific to incidence matrices are available.
- The hypergeometric distribution is often used to compute the p-value of the overlap between two sets, taking
into account the set sizes. The one-tailed version of the hypergeometric test is also known as
Fisher's exact test.
Here, the p-values of the hypergeometric distribution are multiple-test adjusted using the
E-value correction (that is, multiplication of the p-values by the number of tests) and then converted into
significances (sig = -log10(p-value)). The higher the significance, the smaller the
likelihood that the predicted edge is due to chance. Thus, the significance behaves like a similarity
measure. Strictly speaking, it is not a similarity, since significances can be negative, but the significance
threshold is usually set to zero or larger.
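The p-value-to-significance conversion can be illustrated with scipy on hypothetical numbers (the actual E-value correction multiplies by the number of tests CoNet performed):

```python
import math
from scipy.stats import hypergeom

n_samples = 20            # total number of samples
size_a, size_b = 10, 10   # samples in which taxon A / taxon B is present
overlap = 10              # samples in which both are present
n_tests = 100             # hypothetical number of taxon pairs tested

# P(overlap >= 10) under the hypergeometric null
p = hypergeom.sf(overlap - 1, n_samples, size_a, size_b)
e_value = p * n_tests                # E-value correction
significance = -math.log10(e_value)  # sig = -log10(corrected p-value)
```

A large, unlikely overlap yields a small corrected p-value and hence a high significance.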
- The Jaccard distance also measures set overlap and is defined
for two sets A and B as:
D(A,B) = 1 - size(intersection(A,B))/size(union(A,B))
The Jaccard distance ranges from 0 to 1 (see Wiki entry).
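In code form (plain Python sets, illustrative only):

```python
def jaccard_distance(a, b):
    """D(A,B) = 1 - |intersection(A,B)| / |union(A,B)| for two presence sets."""
    return 1 - len(a & b) / len(a | b)

# taxon A present in samples {1,2,3}, taxon B in {2,3,4}: overlap 2 of 4
d = jaccard_distance({1, 2, 3}, {2, 3, 4})
```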
- Association rule mining (not supported for Windows) enumerates
logical rules in presence/absence data sets. These rules can span any item number, but due to
multiple-testing and run-time issues, one should restrict association rule mining to low item numbers.
Numerous filters have been developed to retain only rules of interest.
CoNet wraps Christian Borgelt's implementation of the apriori algorithm (Agrawal et al.).
For the meaning of the filters, we refer users to the
documentation of apriori.
By default, if confidence is not selected as a filter, it is set to 50, whereas support (if not selected) is set to 10.
Using association rule mining requires apriori to be installed (see configuration).
Note that association rule mining is currently the only method that returns a directed network.
Network inference with Minet
Minet is an R package that implements three popular network inference
algorithms used in genetic regulatory network inference, namely
CLR (Faith et al.), ARACNE (Margolin et al.) and MRNET (Meyer et al. 2007).
These algorithms work on the basis of a similarity matrix, where the similarity
is usually mutual information. Note that using a correlation measure instead requires the data to be normally distributed.
Using minet requires Rserve to be enabled. Minet's strategies for mutual information estimation and discretization (required for abundance or normalized matrices)
can be set in the configuration menu.
Note that by default, the mutual information is not computed by minet itself, but with ARACNE's algorithm. This default can be changed in the config menu.
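ARACNE's core pruning step, the data processing inequality, is easy to sketch (a simplified Python toy with tolerance zero, not the minet code): in every fully connected triplet, the weakest edge is removed.

```python
import numpy as np

def aracne_dpi(mi):
    """Prune a symmetric MI matrix: drop the weakest edge of each triangle."""
    n = mi.shape[0]
    keep = mi > 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                # edge (i,j) violates the DPI if it is weaker than both
                # edges passing through the intermediate node k
                if mi[i, k] > 0 and mi[j, k] > 0 and mi[i, j] < min(mi[i, k], mi[j, k]):
                    keep[i, j] = keep[j, i] = False
    return np.where(keep, mi, 0.0)

mi = np.array([[0.0, 3.0, 2.0],
               [3.0, 0.0, 1.0],
               [2.0, 1.0, 0.0]])
pruned = aracne_dpi(mi)  # the weakest edge (1,2) with MI 1 is removed
```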
Threshold setting
In order to set method-specific thresholds, two options are available:
- Manually: The user can manually set the thresholds via the sliders and text fields next to the measures.
Text fields are updated upon slider movement and vice versa.
- Automatically: Click the "Automatic threshold setting" button
at the bottom of the Methods menu to open the Threshold setting menu.
The user can specify a certain edge number or quantile in the text field below the
"Edge selection" choice. CoNet will then set thresholds automatically such that the corresponding edge number
is included in the output network. The quantile is expected to be in the range of 0 to 1. For example, if the quantile
is set to 0.05, the threshold will include the 5% top-scoring edges.
If the edges with the lowest scores should be included as well, activate the "Top and bottom" check box.
This is of interest to capture exclusion for measures other than correlation.
This option is ignored for measures that cannot support it (such as mutual information); for these measures,
an equivalent number of top edges is selected instead.
To disable automated threshold setting, make sure to disable "Top and Bottom" and to clear the edge selection parameter.
If "Force intersection" is enabled, thresholds are set such that resulting network has the selected number
of intersection edges. This option can be used in combination with network merge strategy "intersection",
to obtain an intersection network of given size (see Merge menu help).
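The quantile-based selection described above can be sketched as follows (plain numpy on hypothetical scores): with 100 edge scores and a quantile of 0.05, the upper threshold keeps the 5 top-scoring edges, and with "Top and bottom" enabled the 5 lowest-scoring edges as well.

```python
import numpy as np

scores = np.arange(100.0)           # hypothetical edge scores
q = 0.05

upper = np.quantile(scores, 1 - q)  # threshold for the top-scoring edges
lower = np.quantile(scores, q)      # only used when "Top and bottom" is on

top_edges = scores[scores > upper]
bottom_edges = scores[scores < lower]
```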
Automatically set thresholds can be optionally saved to a file called "thresholds.txt" in a user-selected folder.
To select the folder, click the button "Select folder", click on a folder in the file tree and then click "Choose".
In the folder selection mode, files in the file tree are not clickable.
Note that it is always possible to store the thresholds as part of a CoNet settings file
(see Settings loading/saving) after completion of the network computation task.
With such a settings file, there is no need to recompute the thresholds.
However, to avoid recomputing the thresholds, the guessing parameter should be deleted (such that the field is empty) before the current
settings are saved, or the threshold guessing line should be removed from the settings file.
When loading lower and upper thresholds from a settings file, only one of them will be displayed for each measure in the
interface, however both will be applied during network inference.
Visualize score distributions
Score distributions of selected measures can be visualized by selecting an export folder
and activating the "Export distribution" check box in the Threshold setting menu.
To select an export folder, click the "Select" button and click on a folder, then click "Choose".
The score distributions will be exported into a single pdf file called
"score_distributions.pdf" in the selected folder.
When this option is activated, pushing the "GO" button in the main menu will not result in a network,
but instead the user is asked whether to open the pdf file with the distribution plots.
Pdf files are displayed using the Adobe Acrobat Viewer. In case the user confirms opening the pdf file,
Acrobat Viewer asks whether the user accepts the License Agreement and upon acceptance displays the pdf file.
When score visualization is enabled, previously set thresholds are lost.
Note that threshold guessing and score visualization are not supported for minet, hypergeometric distribution and association rule mining.