Statistics¶
Alleles¶
- limix.stats.allele_expectation(p, nalleles, ploidy)[source]
Allele expectation.
Compute the expectation of each allele from the given probabilities. It accepts three shapes of matrices: - unidimensional array of probabilities; - bidimensional samples-by-alleles probabilities array; - and three dimensional variants-by-samples-by-alleles array.
- Parameters
- Returns
expectation – Last dimension will contain the expectation of each allele.
- Return type
ndarray
Examples
>>> from texttable import Texttable >>> from bgen_reader import read_bgen, allele_expectation, example_files >>> >>> sampleid = "sample_005" >>> rsid = "RSID_6" >>> >>> with example_files("example.32bits.bgen") as filepath: ... bgen = read_bgen(filepath, verbose=False) ... ... locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index ... locus = locus.compute().values[0] ... sample = bgen["samples"].to_frame().query("id == '{}'".format(sampleid)) ... sample = sample.index ... sample = sample[0] ... ... nalleles = bgen["variants"].loc[locus]["nalleles"] ... ploidy = 2 ... ... p = bgen["genotype"][locus].compute()["probs"][sample] ... # For unphased genotypes only. ... e = allele_expectation(bgen, locus)[sample] ... ... alleles = bgen["variants"].loc[locus]["allele_ids"].compute() ... alleles = alleles.values[0].split(",") ... ... tab = Texttable() ... ... print(tab.add_rows( ... [ ... ["", "AA", "AG", "GG", "E[.]"], ... ["p"] + list(p) + [1.0], ... ["#" + alleles[0], 2, 1, 0, e[0]], ... ["#" + alleles[1], 0, 1, 2, e[1]], ... ] ... ).draw()) +----+-------+-------+-------+-------+ | | AA | AG | GG | E[.] | +====+=======+=======+=======+=======+ | p | 0.012 | 0.987 | 0.001 | 1 | +----+-------+-------+-------+-------+ | #A | 2 | 1 | 0 | 1.011 | +----+-------+-------+-------+-------+ | #G | 0 | 1 | 2 | 0.989 | +----+-------+-------+-------+-------+ >>> print("variant: {}".format(rsid)) variant: RSID_6 >>> print("sample : {}".format(sampleid)) sample : sample_005
Note
This function supports unphased genotypes only.
- limix.stats.allele_frequency(X)[source]
Compute allele frequency from its expectation.
Xis a matrix whose last dimension correspond to the different set of chromosomes. The first dimension represent different variants and the second dimension represent the different samples.- Parameters
X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
- Returns
frequenct – Allele frequencies encoded as a variants-by-alleles matrix.
- Return type
ndarray
- limix.stats.compute_dosage(X, alt=None)[source]
Compute dosage from allele expectation.
- Parameters
X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
ref (array_like) – Allele reference of each locus. The allele having the minor allele frequency for the provided
Xis used as the reference if None. Defaults toNone.
- Returns
dosage – Dosage encoded as a variants-by-samples matrix.
- Return type
ndarray
Example
>>> from bgen_reader import read_bgen, allele_expectation, example_files >>> from bgen_reader import compute_dosage >>> >>> with example_files("example.32bits.bgen") as filepath: ... bgen = read_bgen(filepath, verbose=False) ... variant_idx = 2 ... e = allele_expectation(bgen, variant_idx) ... dosage = compute_dosage(e) ... print(dosage[:5]) [0.01550294 0.99383543 1.97933958 0.99560547 1.97879027]
Chi-square mixture¶
- limix.stats.Chi2Mixture(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0, lrt=None)[source]
Mixture of 𝜒² distributions.
Class for evaluation of P values for a test statistic that follows a two-component mixture of chi2
\[p(x) = (1-𝑝)𝜒²(0) + 𝑝𝑎𝜒²(𝑑).\]Here 𝑝 is the probability being in the first component and 𝑎 and 𝑑 are the scale parameter and the number of degrees of freedom of the second component.
Warning
This class is not production-ready. Keep in mind that this interface is likely to change.
- Parameters
scale_min (float) – Minimum value used for fitting the scale parameter.
scale_max (float) – Maximum value used for fitting the scale parameter.
dofmin (float) – Minimum value used for fitting the dof parameter.
dofmax (float) – Maximum value used for fitting the dof parameter.
qmax (float) – Only the top qmax quantile is used for the fit.
n_interval (int) – Number of intervals when performing gridsearch.
tol (float) – Tolerance of being zero.
Example
>>> from numpy.random import RandomState >>> import scipy as sp >>> from limix.stats import Chi2Mixture >>> >>> scale = 0.3 >>> dof = 2 >>> mixture = 0.2 >>> n = 100 >>> >>> random = RandomState(1) >>> x = random.chisquare(dof, n) >>> n0 = int( (1-mixture) * n) >>> idxs = random.choice(n, n0, replace=False) >>> x[idxs] = 0 >>> >>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0, ... dof_min=0.1, dof_max=5.0, ... qmax=0.1, tol=4e-3) >>> chi2mix.estimate_chi2mixture(x) >>> pv = chi2mix.sf(x) >>> print(pv[:4]) [0.2 0.2 0.2 0.2] >>> >>> print('%.2f' % chi2mix.scale) 1.98 >>> print('%.2f' % chi2mix.dof) 0.89 >>> print('%.2f' % chi2mix.mixture) 0.20
Ground truth evaluation¶
- limix.stats.confusion_matrix(df, wsize=50000)[source]
Provide a couple of scores based on the idea of windows around genetic markers.
- Parameters
causals – Indices defining the causal markers.
pos – Within-chromossome base-pair position of each candidate marker, in crescent order.
Kinship¶
- limix.stats.linear_kinship(G, out=None, verbose=True)[source]
Estimate Kinship matrix via linear kernel.
Let 𝑑 be the number of columns of
G. The resulting matrix is given by:\[𝙺 = 𝚇𝚇ᵀ/𝑑\]where
\[𝚇ᵢⱼ = (𝙶ᵢⱼ - 𝑚ⱼ) / 𝑠ⱼ\]is the matrix
Gcolumn-wise normalized by means 𝑚ⱼ and standard deviations 𝑠ⱼ. NaNs are ignored so as to produce matrixKhaving only real numbers.This functions is specially designed to also handle large matrices
Gthat would otherwise require a large amount of memory if it were to be loaded in memory first. For those cases, libraries like Dask come in handy.- Parameters
G (array_like) – Samples-by-variants matrix.
out (ndarray) – A location into which the result is stored.
verbose (bool, optional) –
Truefor showing progress;Falseotherwise. Defauts toTrue.
Examples
>>> from numpy.random import RandomState >>> from limix.stats import linear_kinship >>> >>> random = RandomState(1) >>> X = random.randn(4, 100) >>> K = linear_kinship(X, verbose=False) >>> print(K) [[ 0.91314823 -0.19283362 -0.34133897 -0.37897564] [-0.19283362 0.89885153 -0.2356003 -0.47041761] [-0.34133897 -0.2356003 0.95777313 -0.38083386] [-0.37897564 -0.47041761 -0.38083386 1.23022711]]
Likelihood-ratio test¶
- limix.stats.lrt_pvalues(null_lml, alt_lmls, dof=1)[source]
Compute p-values from likelihood ratios.
These are likelihood ratio test p-values.
Principal component analysis¶
- limix.stats.pca(X, ncomp)[source]
Principal component analysis.
- Parameters
X (array_like) – Samples-by-dimensions array.
ncomp (int) – Number of components.
- Returns
components (ndarray) – First components ordered by explained variance.
explained_variance (ndarray) – Explained variance.
explained_variance_ratio (ndarray) – Percentage of variance explained.
Examples
>>> from numpy import round >>> from numpy.random import RandomState >>> from limix.stats import pca >>> >>> X = RandomState(1).randn(4, 5) >>> r = pca(X, ncomp=2) >>> r['components'] array([[-0.75015369, 0.58346541, -0.07973564, 0.19565682, -0.22846925], [ 0.48842769, 0.72267548, 0.01968344, -0.46161623, -0.16031708]]) >>> r['explained_variance'] array([6.44655993, 0.51454938]) >>> r['explained_variance_ratio'] array([0.92049553, 0.07347181])
P-value correction¶
- limix.stats.multipletests(pvals, alpha=0.05, method='hs', is_sorted=False)[source]
Test results and p-value correction for multiple tests.
- Parameters
pvals (array_like) – Uncorrected p-values.
alpha (float) – FWER, family-wise error rate, e.g.
0.1.method (string) –
Method used for testing and adjustment of pvalues. Can be either the full name or initial letters. Available methods are
`bonferroni` : one-step correction `sidak` : one-step correction `holm-sidak` : step down method using Sidak adjustments `holm` : step-down method using Bonferroni adjustments `simes-hochberg` : step-up method (independent) `hommel` : closed method based on Simes tests (non-negative) `fdr_bh` : Benjamini/Hochberg (non-negative) `fdr_by` : Benjamini/Yekutieli (negative) `fdr_tsbh` : two stage fdr correction (non-negative) `fdr_tsbky` : two stage fdr correction (non-negative)
is_sorted (bool) – If
False(default), the p_values will be sorted, but the corrected pvalues are in the original order. IfTrue, then it assumed that the pvalues are already sorted in ascending order.
- Returns
reject (ndarray, boolean) –
Truefor hypothesis that can be rejected for given alpha.pvals_corrected (ndarray) – P-values corrected for multiple tests.
alphacSidak (float) – Corrected alpha for Sidak method.
alphacBonf (float) – Corrected alpha for Bonferroni method.
Notes
This is a wrapper around a function from the statsmodels package.
- limix.stats.empirical_pvalues(xt, x0)[source]
Function to compute empirical p-values.
Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).
- Parameters
xt (array_like) – Test statistcs observed on data.
x0 (array_like) – Null test statistcs. The minimum p-value that can be estimated is
1./len(x0).
- Returns
pvalues – Estimated empirical p-values.
- Return type
ndarray
Examples
>>> from numpy.random import RandomState >>> from limix.stats import empirical_pvalues >>> >>> random = RandomState(1) >>> x0 = random.chisquare(1, 5) >>> x1 = random.chisquare(1, 10000) >>> >>> empirical_pvalues(x0, x1) array([0.56300000, 1.00000000, 0.83900000, 0.79820000, 0.58030000])