Statistics¶
Alleles¶
- limix.stats.allele_expectation(p, nalleles, ploidy)[source]
Allele expectation.
Compute the expectation of each allele from the given probabilities. It accepts three shapes of matrices: - unidimensional array of probabilities; - bidimensional samples-by-alleles probabilities array; - and three dimensional variants-by-samples-by-alleles array.
- Parameters
- Returns
expectation – Last dimension will contain the expectation of each allele.
- Return type
ndarray
Examples
>>> from texttable import Texttable >>> from bgen_reader import read_bgen, allele_expectation, example_files >>> >>> sampleid = "sample_005" >>> rsid = "RSID_6" >>> >>> with example_files("example.32bits.bgen") as filepath: ... bgen = read_bgen(filepath, verbose=False) ... ... locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index ... locus = locus.compute().values[0] ... sample = bgen["samples"].to_frame().query("id == '{}'".format(sampleid)) ... sample = sample.index ... sample = sample[0] ... ... nalleles = bgen["variants"].loc[locus]["nalleles"] ... ploidy = 2 ... ... p = bgen["genotype"][locus].compute()["probs"][sample] ... # For unphased genotypes only. ... e = allele_expectation(bgen, locus)[sample] ... ... alleles = bgen["variants"].loc[locus]["allele_ids"].compute() ... alleles = alleles.values[0].split(",") ... ... tab = Texttable() ... ... print(tab.add_rows( ... [ ... ["", "AA", "AG", "GG", "E[.]"], ... ["p"] + list(p) + [1.0], ... ["#" + alleles[0], 2, 1, 0, e[0]], ... ["#" + alleles[1], 0, 1, 2, e[1]], ... ] ... ).draw()) +----+-------+-------+-------+-------+ | | AA | AG | GG | E[.] | +====+=======+=======+=======+=======+ | p | 0.012 | 0.987 | 0.001 | 1 | +----+-------+-------+-------+-------+ | #A | 2 | 1 | 0 | 1.011 | +----+-------+-------+-------+-------+ | #G | 0 | 1 | 2 | 0.989 | +----+-------+-------+-------+-------+ >>> print("variant: {}".format(rsid)) variant: RSID_6 >>> print("sample : {}".format(sampleid)) sample : sample_005
Note
This function supports unphased genotypes only.
- limix.stats.allele_frequency(X)[source]
Compute allele frequency from its expectation.
X
is a matrix whose last dimension correspond to the different set of chromosomes. The first dimension represent different variants and the second dimension represent the different samples.- Parameters
X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
- Returns
frequenct – Allele frequencies encoded as a variants-by-alleles matrix.
- Return type
ndarray
- limix.stats.compute_dosage(X, alt=None)[source]
Compute dosage from allele expectation.
- Parameters
X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
ref (array_like) – Allele reference of each locus. The allele having the minor allele frequency for the provided
X
is used as the reference if None. Defaults toNone
.
- Returns
dosage – Dosage encoded as a variants-by-samples matrix.
- Return type
ndarray
Example
>>> from bgen_reader import read_bgen, allele_expectation, example_files >>> from bgen_reader import compute_dosage >>> >>> with example_files("example.32bits.bgen") as filepath: ... bgen = read_bgen(filepath, verbose=False) ... variant_idx = 2 ... e = allele_expectation(bgen, variant_idx) ... dosage = compute_dosage(e) ... print(dosage[:5]) [0.01550294 0.99383543 1.97933958 0.99560547 1.97879027]
Chi-square mixture¶
- limix.stats.Chi2Mixture(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0, lrt=None)[source]
Mixture of 𝜒² distributions.
Class for evaluation of P values for a test statistic that follows a two-component mixture of chi2
\[p(x) = (1-𝑝)𝜒²(0) + 𝑝𝑎𝜒²(𝑑).\]Here 𝑝 is the probability being in the first component and 𝑎 and 𝑑 are the scale parameter and the number of degrees of freedom of the second component.
Warning
This class is not production-ready. Keep in mind that this interface is likely to change.
- Parameters
scale_min (float) – Minimum value used for fitting the scale parameter.
scale_max (float) – Maximum value used for fitting the scale parameter.
dofmin (float) – Minimum value used for fitting the dof parameter.
dofmax (float) – Maximum value used for fitting the dof parameter.
qmax (float) – Only the top qmax quantile is used for the fit.
n_interval (int) – Number of intervals when performing gridsearch.
tol (float) – Tolerance of being zero.
Example
>>> from numpy.random import RandomState >>> import scipy as sp >>> from limix.stats import Chi2Mixture >>> >>> scale = 0.3 >>> dof = 2 >>> mixture = 0.2 >>> n = 100 >>> >>> random = RandomState(1) >>> x = random.chisquare(dof, n) >>> n0 = int( (1-mixture) * n) >>> idxs = random.choice(n, n0, replace=False) >>> x[idxs] = 0 >>> >>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0, ... dof_min=0.1, dof_max=5.0, ... qmax=0.1, tol=4e-3) >>> chi2mix.estimate_chi2mixture(x) >>> pv = chi2mix.sf(x) >>> print(pv[:4]) [0.2 0.2 0.2 0.2] >>> >>> print('%.2f' % chi2mix.scale) 1.98 >>> print('%.2f' % chi2mix.dof) 0.89 >>> print('%.2f' % chi2mix.mixture) 0.20
Ground truth evaluation¶
- limix.stats.confusion_matrix(df, wsize=50000)[source]
Provide a couple of scores based on the idea of windows around genetic markers.
- Parameters
causals – Indices defining the causal markers.
pos – Within-chromossome base-pair position of each candidate marker, in crescent order.
Kinship¶
- limix.stats.linear_kinship(G, out=None, verbose=True)[source]
Estimate Kinship matrix via linear kernel.
Let 𝑑 be the number of columns of
G
. The resulting matrix is given by:\[𝙺 = 𝚇𝚇ᵀ/𝑑\]where
\[𝚇ᵢⱼ = (𝙶ᵢⱼ - 𝑚ⱼ) / 𝑠ⱼ\]is the matrix
G
column-wise normalized by means 𝑚ⱼ and standard deviations 𝑠ⱼ. NaNs are ignored so as to produce matrixK
having only real numbers.This functions is specially designed to also handle large matrices
G
that would otherwise require a large amount of memory if it were to be loaded in memory first. For those cases, libraries like Dask come in handy.- Parameters
G (array_like) – Samples-by-variants matrix.
out (ndarray) – A location into which the result is stored.
verbose (bool, optional) –
True
for showing progress;False
otherwise. Defauts toTrue
.
Examples
>>> from numpy.random import RandomState >>> from limix.stats import linear_kinship >>> >>> random = RandomState(1) >>> X = random.randn(4, 100) >>> K = linear_kinship(X, verbose=False) >>> print(K) [[ 0.91314823 -0.19283362 -0.34133897 -0.37897564] [-0.19283362 0.89885153 -0.2356003 -0.47041761] [-0.34133897 -0.2356003 0.95777313 -0.38083386] [-0.37897564 -0.47041761 -0.38083386 1.23022711]]
Likelihood-ratio test¶
- limix.stats.lrt_pvalues(null_lml, alt_lmls, dof=1)[source]
Compute p-values from likelihood ratios.
These are likelihood ratio test p-values.
Principal component analysis¶
- limix.stats.pca(X, ncomp)[source]
Principal component analysis.
- Parameters
X (array_like) – Samples-by-dimensions array.
ncomp (int) – Number of components.
- Returns
components (ndarray) – First components ordered by explained variance.
explained_variance (ndarray) – Explained variance.
explained_variance_ratio (ndarray) – Percentage of variance explained.
Examples
>>> from numpy import round >>> from numpy.random import RandomState >>> from limix.stats import pca >>> >>> X = RandomState(1).randn(4, 5) >>> r = pca(X, ncomp=2) >>> r['components'] array([[-0.75015369, 0.58346541, -0.07973564, 0.19565682, -0.22846925], [ 0.48842769, 0.72267548, 0.01968344, -0.46161623, -0.16031708]]) >>> r['explained_variance'] array([6.44655993, 0.51454938]) >>> r['explained_variance_ratio'] array([0.92049553, 0.07347181])
P-value correction¶
- limix.stats.multipletests(pvals, alpha=0.05, method='hs', is_sorted=False)[source]
Test results and p-value correction for multiple tests.
- Parameters
pvals (array_like) – Uncorrected p-values.
alpha (float) – FWER, family-wise error rate, e.g.
0.1
.method (string) –
Method used for testing and adjustment of pvalues. Can be either the full name or initial letters. Available methods are
`bonferroni` : one-step correction `sidak` : one-step correction `holm-sidak` : step down method using Sidak adjustments `holm` : step-down method using Bonferroni adjustments `simes-hochberg` : step-up method (independent) `hommel` : closed method based on Simes tests (non-negative) `fdr_bh` : Benjamini/Hochberg (non-negative) `fdr_by` : Benjamini/Yekutieli (negative) `fdr_tsbh` : two stage fdr correction (non-negative) `fdr_tsbky` : two stage fdr correction (non-negative)
is_sorted (bool) – If
False
(default), the p_values will be sorted, but the corrected pvalues are in the original order. IfTrue
, then it assumed that the pvalues are already sorted in ascending order.
- Returns
reject (ndarray, boolean) –
True
for hypothesis that can be rejected for given alpha.pvals_corrected (ndarray) – P-values corrected for multiple tests.
alphacSidak (float) – Corrected alpha for Sidak method.
alphacBonf (float) – Corrected alpha for Bonferroni method.
Notes
This is a wrapper around a function from the statsmodels package.
- limix.stats.empirical_pvalues(xt, x0)[source]
Function to compute empirical p-values.
Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).
- Parameters
xt (array_like) – Test statistcs observed on data.
x0 (array_like) – Null test statistcs. The minimum p-value that can be estimated is
1./len(x0)
.
- Returns
pvalues – Estimated empirical p-values.
- Return type
ndarray
Examples
>>> from numpy.random import RandomState >>> from limix.stats import empirical_pvalues >>> >>> random = RandomState(1) >>> x0 = random.chisquare(1, 5) >>> x1 = random.chisquare(1, 10000) >>> >>> empirical_pvalues(x0, x1) array([0.56300000, 1.00000000, 0.83900000, 0.79820000, 0.58030000])