Preprocess and filter input matrix
Input filtering
In some situations, it is useful to filter the input matrix. For instance, if the matrix represents
bacterial abundances, noise may contribute far more than biology to
the variance of low-abundance bacteria.
In addition, low-abundance species contain many zero values, which are ambiguous: a zero
means either that the taxon does not occur in the sample at all or that it
was below the detection limit.
Zeros require special treatment, either by "smoothing" (e.g. Witten-Bell smoothing), which tries
to "guess" values from the remaining data, or by adding pseudo-counts.
If a row is dominated by zero entries and missing values, it is best discarded.
The filters listed below allow discarding problematic rows and columns.
These filters can also be combined.
- row_minsum The sum of the values of each row should be equal to or larger than the specified minimum. To specify the minimum, write the value
in the text field next to the check box and press enter.
- row_minocc Each row should have at least the specified number of non-zero, non-missing values.
- col_minsum The sum of the values of each column should be equal to or larger than the specified minimum.
- col_minocc Each column should have at least the specified number of non-zero, non-missing values.
- minzeropairs Each row pair should have at least the specified number of non-double-zero value pairs when computing its similarity score.
This filter avoids computing high scores based on double absences.
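
The row filters and the pair check above can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the function names are hypothetical, missing values are assumed to be encoded as None or NaN, and pairs containing a missing value are assumed not to count toward minzeropairs.

```python
import math


def is_missing(value):
    """Assumption of this sketch: None and NaN encode missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def row_sum(row):
    """Sum of the non-missing values of a row (compared against row_minsum)."""
    return sum(v for v in row if not is_missing(v))


def row_occurrence(row):
    """Number of non-zero, non-missing values (compared against row_minocc)."""
    return sum(1 for v in row if not is_missing(v) and v != 0)


def filter_rows(matrix, min_sum=None, min_occ=None):
    """Keep the rows that pass every filter that was specified."""
    kept = []
    for row in matrix:
        if min_sum is not None and row_sum(row) < min_sum:
            continue
        if min_occ is not None and row_occurrence(row) < min_occ:
            continue
        kept.append(row)
    return kept


def non_double_zero_pairs(row_a, row_b):
    """Count value pairs in which at least one of the two values is non-zero."""
    count = 0
    for a, b in zip(row_a, row_b):
        if is_missing(a) or is_missing(b):
            continue  # assumption: incomplete pairs are not counted
        if a != 0 or b != 0:
            count += 1
    return count


matrix = [[5, 0, 3, 0], [0, 0, 1, 0], [2, 2, 2, 2]]
kept = filter_rows(matrix, min_sum=3, min_occ=2)
pairs = non_double_zero_pairs(matrix[0], matrix[1])
```

Column filters work the same way on the transposed matrix. A row pair would be scored only if non_double_zero_pairs returns at least the minzeropairs value.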
Keep sum of filtered rows
Instead of simply discarding filtered rows, sum them into a single row that is kept for further processing steps.
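
As a minimal sketch of this option, assuming a minimum-sum row filter and a matrix without missing values (the function name is hypothetical), the rows that fail the filter can be pooled into one extra row:

```python
def filter_rows_keep_sum(matrix, min_sum):
    """Drop rows whose sum is below min_sum, but keep their element-wise sum.

    Assumption of this sketch: the pooled row is appended as the last row.
    """
    kept = []
    pooled = None
    for row in matrix:
        if sum(row) >= min_sum:
            kept.append(list(row))
        elif pooled is None:
            pooled = list(row)
        else:
            pooled = [p + v for p, v in zip(pooled, row)]
    if pooled is not None:
        kept.append(pooled)
    return kept


result = filter_rows_keep_sum([[10, 20], [1, 0], [0, 2]], min_sum=5)
```

Because the pooled row retains the discarded values, the column sums of the matrix are unchanged by the filtering.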
Standardization
If the columns of the matrix represent different samples, standardization over the samples is necessary.
A number of standardization strategies are available, which can be applied to columns and/or to rows.
Features are excluded from standardization. If a group attribute has been specified (see data menu help),
column-wise standardization strategies are carried out group-wise.
Note that rows or columns with zero sum will be removed from the input matrix.
If both filtering and standardization are specified, filtering is carried out first, followed by standardization.
- col_norm Each column is divided by its sum, converting abundances into column-wise proportions.
- col_downsample Each column is down-sampled such that its sum is equal to the sum of the column with the smallest sum. This option is only available for count matrices.
- row_stand Standardize rows such that each row x is transformed as follows: x_stand = (x-mean(x))/sd(x).
- row_stand_robust Standardize rows as in row_stand, but estimate the mean by the median and the standard deviation by the interquartile range divided by qnorm(0.75) - qnorm(0.25) ~ 1.35, where qnorm is the quantile function of the standard normal distribution (robust standardization follows an R-script by Jacques van Helden).
- row_norm Each row is divided by its sum, converting abundances into row-wise proportions.
- row_downsample Each row is down-sampled such that its sum is equal to the sum of the row with the smallest sum. This option is only available for count matrices.
- log2 Take the base-2 logarithm of each matrix entry.
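
Three of the strategies above (col_norm, row_stand and row_stand_robust) can be sketched with the Python standard library. The function names mirror the option names for readability only; this is not the tool's code. Two choices are assumptions of the sketch: sd(x) is taken as the sample standard deviation, and quartiles use linear interpolation (the default of R's quantile function). Zero-sum columns and constant rows are assumed to have been removed beforehand.

```python
import statistics


def col_norm(matrix):
    """Divide each column by its sum, giving column-wise proportions."""
    col_sums = [sum(col) for col in zip(*matrix)]
    return [[v / s for v, s in zip(row, col_sums)] for row in matrix]


def row_stand(row):
    """x_stand = (x - mean(x)) / sd(x), with the sample standard deviation."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)
    return [(v - mean) / sd for v in row]


def row_stand_robust(row):
    """Like row_stand, but with the median and a rescaled interquartile range."""
    q1, _, q3 = statistics.quantiles(row, n=4, method="inclusive")
    normal = statistics.NormalDist()
    # qnorm(0.75) - qnorm(0.25), approximately 1.35
    iqr_scale = normal.inv_cdf(0.75) - normal.inv_cdf(0.25)
    robust_sd = (q3 - q1) / iqr_scale
    centre = statistics.median(row)
    return [(v - centre) / robust_sd for v in row]


proportions = col_norm([[1, 3], [3, 1]])
z_scores = row_stand([1, 2, 3])
robust_scores = row_stand_robust([1, 2, 3, 4, 5])
```

Dividing the interquartile range by about 1.35 makes it comparable to a standard deviation for normally distributed data, while being less sensitive to outliers.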
To incidence conversion
When an abundance or count matrix is given but incidence matrix methods (such as the hypergeometric distribution) are also
to be applied to it, the matrix can be converted into an incidence matrix for these methods.
The following methods are available for conversion (features are excluded from the conversion and subsequent analysis steps):
- user Each matrix value equal to or above the threshold is considered as presence (and set to one) and each value below
is considered as absence (and set to zero).
- quantiles Each row value equal to or above the given row-specific quantile is considered as presence (and set to one) and each row value below
it is considered as absence (and set to zero).
Quantiles should be set between 0 and 1. For instance, if the quantile is set to 0.75, a row value is considered as present
only if it is equal to or above the third quartile of its row.
- eval Each row value is transformed into a z-score by subtracting the row mean from it and dividing it by the row standard deviation.
With the normal distribution, each z-score is converted into a p-value, which is multiplied by the number of columns to obtain a multiple-test corrected E-value.
An E-value below or equal to the given threshold is considered as presence, else as absence.
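
The three conversion methods can be sketched as follows. The function names are hypothetical and several details are assumptions of this sketch rather than documented behaviour: quantiles are computed by linear interpolation, the standard deviation is the sample standard deviation, and the p-value is the upper-tail probability of the z-score (so that unusually high values count as presence).

```python
import math
import statistics


def to_incidence_user(matrix, threshold):
    """user: values equal to or above the threshold become 1, all others 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in matrix]


def row_quantile(row, q):
    """Quantile of a row by linear interpolation (an assumption of this sketch)."""
    ordered = sorted(row)
    pos = q * (len(ordered) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def to_incidence_quantile(row, q):
    """quantiles: values equal to or above the row's own quantile become 1."""
    cutoff = row_quantile(row, q)
    return [1 if v >= cutoff else 0 for v in row]


def to_incidence_eval(row, threshold):
    """eval: z-score, then upper-tail p-value, then E-value = p * number of columns."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)
    normal = statistics.NormalDist()
    incidence = []
    for v in row:
        z = (v - mean) / sd
        p_value = 1.0 - normal.cdf(z)  # assumption: upper tail
        e_value = p_value * len(row)
        incidence.append(1 if e_value <= threshold else 0)
    return incidence


by_threshold = to_incidence_user([[0.5, 2], [3, 0]], threshold=1)
by_quantile = to_incidence_quantile([1, 2, 3, 4, 5], q=0.75)
by_evalue = to_incidence_eval([0, 0, 0, 0, 0, 0, 0, 0, 0, 100], threshold=0.05)
```

In the last call, only the value 100 stands out enough from the rest of its row to receive an E-value below 0.05, so it is the only entry marked as present.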
Output filtering
Edges can represent positive ("copresence") or negative ("mutual exclusion") relationships between the items.
The output filter allows keeping only copresence or only mutual exclusion edges.
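
The output filter can be pictured with a small sketch. The edge representation is purely hypothetical: each edge is assumed to be a tuple (item_a, item_b, score) in which a positive score denotes copresence and a negative score denotes mutual exclusion.

```python
def filter_edges(edges, keep):
    """Keep only copresence or only mutual exclusion edges.

    Assumption of this sketch: the sign of the third tuple element
    encodes the type of relationship.
    """
    if keep == "copresence":
        return [e for e in edges if e[2] > 0]
    if keep == "mutual_exclusion":
        return [e for e in edges if e[2] < 0]
    raise ValueError("keep must be 'copresence' or 'mutual_exclusion'")


edges = [("a", "b", 0.8), ("a", "c", -0.6)]
positive_only = filter_edges(edges, "copresence")
negative_only = filter_edges(edges, "mutual_exclusion")
```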