P-value assignment and multiple-testing correction

Background

For a matrix with n rows and a symmetric measure, n*(n-1)/2 tests are carried out during network inference. Thus, with increasing row number, it becomes more likely to find significant relationships by chance alone. There are various approaches to correct for this multiple-testing issue, most of which rely on the adjustment of p-values. The p-value is the probability of obtaining an edge score at least as extreme as the observed one by chance alone; thus, the smaller the p-value, the more significant the relationship. P-values can be computed from a random score distribution. In order to calculate such a "background" edge score distribution under a null hypothesis, random data are generated a selected number of times.
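To illustrate the scale of the problem: a matrix with 100 rows entails 100*99/2 = 4950 tests, and at a significance level of 0.05, around 248 of them (4950 * 0.05 = 247.5) are expected to appear significant by chance alone.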

Random score distribution options

Randomization routine

First, the randomization routine needs to be chosen. The default "none" disables any randomization test. The "edgeScores" routine generates a set of edge-specific score distributions, whereas the "lallich" routine is an implementation of the BS_FD (bootstrap-based false discovery) procedure and is recommended for association rule mining.

The "lallich" routine takes the number of false discoveries as a parameter, which is set to one. It ignores the resampling parameter (which is fixed to bootstrap for this routine) and requires a significance level as a parameter (which can be provided via the p-value threshold, e.g. 0.05). In addition, it does not keep bottom edges (i.e. edges with extremely low similarity or high distance scores; see option "Top and bottom" in the methods menu). Note also that the "lallich" routine computes a global score distribution for each method. In contrast to the edge-specific routine, it does not assign p-values to the edges; instead, it adjusts the original thresholds and discards edges with scores below the adjusted thresholds. The adjusted thresholds are stored in the network comment attribute. Since "lallich" does not assign p-values, the multiple-testing correction options are not applicable.

Note that only the "edgeScores" routine allows computing edge-specific p-values. The p-values are one-sided, that is, a very high p-value is indicative of mutual exclusion (e.g. for a dissimilarity, the score was much higher than expected at random). P-values above 0.5 are converted by subtracting them from one prior to multiple-test correction. However, if the p-value does not agree with the initial interaction type (e.g. a high p-value for a co-presence edge), the edge is discarded. P-values are computed from these edge-specific null distributions; a sketch is given below.
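The following minimal sketch (in Python, with illustrative names) shows how such a one-sided p-value could be derived empirically from an edge-specific null distribution, including the conversion and discarding rules described above; the empirical tail computation is an assumption, as CoNet may instead fit a distribution to the null scores.

    import numpy as np

    def edge_pvalue(observed, null_scores, initial_type="copresence"):
        # Hypothetical helper: one-sided empirical p-value for a
        # similarity score against its edge-specific null distribution.
        null_scores = np.asarray(null_scores, dtype=float)
        # Fraction of random scores at least as high as the observed one.
        p = (null_scores >= observed).mean()
        # A p-value above 0.5 points at the opposite interaction type
        # (mutual exclusion); convert it prior to multiple-test correction.
        side = "copresence" if p <= 0.5 else "mutualExclusion"
        if p > 0.5:
            p = 1.0 - p
        # Discard the edge if the p-value disagrees with its initial type.
        return p if side == initial_type else None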

Resampling

By default, the matrix is randomized by permuting the order of columns row-wise ("shuffle_rows"), which preserves the row sums. Alternative resampling procedures are to permute the row order column-wise ("shuffle_cols"), to permute both rows and columns ("shuffle_both"), or to bootstrap.
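A sketch of the four routines, assuming a matrix with items as rows and samples as columns (the bootstrap is assumed here to resample columns with replacement):

    import numpy as np

    def resample(matrix, routine="shuffle_rows", seed=None):
        rng = np.random.default_rng(seed)
        m = np.asarray(matrix, dtype=float)
        if routine == "shuffle_rows":
            # Permute the column order within each row (preserves row sums).
            return rng.permuted(m, axis=1)
        if routine == "shuffle_cols":
            # Permute the row order within each column.
            return rng.permuted(m, axis=0)
        if routine == "shuffle_both":
            return rng.permuted(rng.permuted(m, axis=1), axis=0)
        if routine == "bootstrap":
            # Resample samples (columns) with replacement.
            idx = rng.integers(0, m.shape[1], size=m.shape[1])
            return m[:, idx]
        raise ValueError(routine)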

Iteration number

The iteration number specifies how often the randomization routine is carried out, i.e. how many values the random score distribution contains.

P-value merging

If the computation of a multi-edge network and edge-wise randomizations are enabled, method-specific p-values of multi-edges connecting the same node pair can be merged using one of several strategies (including Fisher's and Brown's methods). Note that p-value filtering and multiple-test correction, if selected, are carried out after the merge, i.e. on the merged p-values.

Note that p-values of edges that were filtered out initially are missing for the p-value merge. In Brown's and Fisher's methods, these missing p-values are implicitly set to one, but they are not considered in any other p-value merge method. It is therefore best to enable the "Force intersection" option in the "Threshold setting" sub-menu of the Methods menu when using a p-value merge option.
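For illustration, a minimal sketch of Fisher's method, one of the merge strategies mentioned above (Brown's method additionally corrects for dependence between the merged p-values):

    import numpy as np
    from scipy.stats import chi2

    def fisher_merge(pvalues):
        # Merge method-specific p-values of a multi-edge. P-values of
        # initially filtered edges enter as 1 and contribute nothing to
        # the test statistic (log(1) = 0).
        p = np.asarray(pvalues, dtype=float)
        statistic = -2.0 * np.log(p).sum()
        return chi2.sf(statistic, 2 * len(p))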

Renormalization

This option is recommended when the input matrix is normalized column-wise and correlation measures are selected. Renormalization (Sathirapongsasuti*, Faust* et al., PLoS Comput Biol 8(7): e1002606, 2012) is a strategy that has been empirically shown to counter the compositionality bias. This bias affects correlation measures and is introduced when normalizing across samples (see Aitchison). The principle is to repeat the sample-wise normalization for each item pair in each randomization round, such that the effect of all other items on this pair is captured.

Renormalization is only compatible with the "shuffle_rows" resampling routine and only possible for column-wise normalization by column sum division. It is not applied to measures known to be robust to compositional bias (Bray-Curtis, Hellinger, Kullback-Leibler, variance of log-ratios) or to incidence matrix measures, and it cannot be combined with association rule mining. Note that features are excluded from renormalization. If groups were specified, renormalization is carried out group-wise. Attention: renormalization is very time-consuming and best run on the command line.

When combining renormalization with pre-processing, permutation and renormalization are carried out on the matrix resulting from all of these pre-processing steps. Without renormalization, resampling is carried out on the original input matrix and the pre-processing steps are repeated for each iteration.
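A minimal sketch of the idea for a single pair under "shuffle_rows", assuming a count matrix that is normalized by column sum division (function and parameter names are illustrative):

    import numpy as np

    def renormalized_null_scores(counts, i, j, iterations=100, seed=None):
        rng = np.random.default_rng(seed)
        work = np.asarray(counts, dtype=float).copy()
        scores = np.empty(iterations)
        for k in range(iterations):
            # shuffle_rows: permute the two rows of the pair independently.
            work[i] = rng.permutation(work[i])
            work[j] = rng.permutation(work[j])
            # Repeat the sample-wise normalization, so that the effect of
            # all other items on this pair is captured.
            rel = work / work.sum(axis=0)
            scores[k] = np.corrcoef(rel[i], rel[j])[0, 1]
        return scores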

Variance pooling

When permutation and bootstrap distributions are combined, their variances can be pooled to account for different iteration numbers using the following formula: pooled_sd = sqrt((var(permutation) + var(bootstrap))/2), i.e. the square root of the mean variance of the bootstrap and permutation distributions. The pooled standard deviation is then used instead of the bootstrap standard deviation to compute the combined p-value. To enable variance pooling, activate the "Pool variances" check box.
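A sketch of the pooling step, with a hypothetical z-score computation for the combined p-value (the exact combination used by CoNet is not spelled out here):

    import numpy as np
    from scipy.stats import norm

    def combined_pvalue(permutation_scores, bootstrap_scores, pool=True):
        perm = np.asarray(permutation_scores, dtype=float)
        boot = np.asarray(bootstrap_scores, dtype=float)
        # pooled_sd = sqrt((var(permutation) + var(bootstrap)) / 2)
        sd = np.sqrt((perm.var() + boot.var()) / 2.0) if pool else boot.std()
        # Illustrative z-test of the bootstrap mean against the null mean.
        z = (boot.mean() - perm.mean()) / sd
        return norm.sf(abs(z))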

Discarding unstable edges

Some edges might be significant only due to outliers and have much lower scores once the outliers are removed. Edge-specific bootstrapping yields a confidence interval for each edge. If the observed edge score does not lie between the 2.5th and 97.5th percentiles of the bootstrap distribution, it can be considered exceptionally high (due to outliers) and the edge is removed.
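A sketch of this stability check, using the percentile confidence interval described above:

    import numpy as np

    def is_unstable(observed_score, bootstrap_scores):
        # Discard edges whose observed score falls outside the central
        # 95% interval of the bootstrap distribution (outlier-driven).
        lo, hi = np.percentile(bootstrap_scores, [2.5, 97.5])
        return not (lo <= observed_score <= hi)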

P-value options

Once p-values have been obtained from the randomization routine, they can be adjusted for multiple testing using various approaches. For each approach, a lower and optionally an upper threshold for the edge p-value can be applied. All edges with p-values above the lower threshold (which defaults to 0.05) are discarded.
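As an example of such an approach, a minimal sketch of the Benjamini-Hochberg adjustment (one common correction; the approaches actually offered are listed in the menu):

    import numpy as np

    def benjamini_hochberg(pvalues):
        # Adjusted p-value of the i-th smallest p-value: the minimum over
        # j >= i of p_(j) * n / j, capped at 1.
        p = np.asarray(pvalues, dtype=float)
        n = len(p)
        order = np.argsort(p)
        adjusted = np.empty(n)
        running_min = 1.0
        for rank in range(n, 0, -1):
            idx = order[rank - 1]
            running_min = min(running_min, p[idx] * n / rank)
            adjusted[idx] = running_min
        return adjusted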

Saving and re-using randomized scores

Since the repeated computation of random scores can be extremely time-consuming, it is possible to save the originally inferred network along with its random scores. To do so, select a folder by clicking the "Select folder" button on the bottom left side of the menu ("Save" box). This opens the file tree, where you can click a folder and then click "Choose". Below the folder selection button, there is a text field "Specify file name", where you can enter the name of the file into which the scores are to be saved. Then tick the "Save randomizations to file" checkbox.
The network can later be restored by selecting the file with the randomization results via the "Open file" button in the middle right of the menu ("Load" box). To clear a selected randomization file, click the "Open file" button again and then click "Cancel".
Note that the CoNet settings should match the settings used to generate the randomization file (except for the p-value thresholds or metadata). For saving settings, click the "Settings loading/saving" button in the main menu.
Loading a network from a randomization file requires a minimal set of settings to match the original run; beyond these, additional options can be enabled.

Loading null distributions

Optionally, edge-specific p-values can be computed from two distributions: the null distribution generated by shuffling and the bootstrap distribution, from which a confidence interval can be derived. The p-value is then obtained from both distributions as described above. In order to compute p-values by combining two randomizations, the following steps are necessary:
  1. Configure and run CoNet to compute the null distribution (using one of the shuffling options and, optionally, renormalization). The null distribution should be saved to a file as explained above (using the "Save" box).
  2. Configure and run CoNet to compute the bootstrap distribution and to compute combined p-values. The previously calculated null distribution can be set in the "Load null distributions" box.
  3. For future runs, both distributions can be reloaded: the null distribution via the "Load null distributions" box and the bootstrap distribution via "Load randomization file". Thus, there is no need to re-compute any of these distributions when changing the p-value treatment.