Statistics¶

Alleles¶

limix.stats.allele_expectation(p, nalleles, ploidy)[source]

Allele expectation.

Compute the expectation of each allele from the given probabilities. It accepts three shapes of matrices: - unidimensional array of probabilities; - bidimensional samples-by-alleles probabilities array; - and three dimensional variants-by-samples-by-alleles array.

Parameters

p (array_like) – Allele probabilities.
nalleles (int) – Number of alleles.
ploidy (int) – Number of complete sets of chromosomes.

Returns

expectation – Last dimension will contain the expectation of each allele.

Return type

ndarray

Examples

>>> from texttable import Texttable
>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>>
>>> sampleid = "sample_005"
>>> rsid = "RSID_6"
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...
...     locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index
...     locus = locus.compute().values[0]
...     sample = bgen["samples"].to_frame().query("id == '{}'".format(sampleid))
...     sample = sample.index
...     sample = sample[0]
...
...     nalleles = bgen["variants"].loc[locus]["nalleles"]
...     ploidy = 2
...
...     p = bgen["genotype"][locus].compute()["probs"][sample]
...     # For unphased genotypes only.
...     e = allele_expectation(bgen, locus)[sample]
...
...     alleles = bgen["variants"].loc[locus]["allele_ids"].compute()
...     alleles = alleles.values[0].split(",")
...
...     tab = Texttable()
...
...     print(tab.add_rows(
...         [
...             ["", "AA", "AG", "GG", "E[.]"],
...             ["p"] + list(p) + [1.0],
...             ["#" + alleles[0], 2, 1, 0, e[0]],
...             ["#" + alleles[1], 0, 1, 2, e[1]],
...         ]
...     ).draw())
+----+-------+-------+-------+-------+
|    |  AA   |  AG   |  GG   | E[.]  |
+====+=======+=======+=======+=======+
| p  | 0.012 | 0.987 | 0.001 | 1     |
+----+-------+-------+-------+-------+
| #A | 2     | 1     | 0     | 1.011 |
+----+-------+-------+-------+-------+
| #G | 0     | 1     | 2     | 0.989 |
+----+-------+-------+-------+-------+
>>> print("variant: {}".format(rsid))
variant: RSID_6
>>> print("sample : {}".format(sampleid))
sample : sample_005

Note

This function supports unphased genotypes only.

limix.stats.allele_frequency(X)[source]

Compute allele frequency from its expectation.

X is a matrix whose last dimension correspond to the different set of chromosomes. The first dimension represent different variants and the second dimension represent the different samples.

Parameters: X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
Returns: frequenct – Allele frequencies encoded as a variants-by-alleles matrix.
Return type: ndarray

limix.stats.compute_dosage(X, alt=None)[source]

Compute dosage from allele expectation.

Parameters

X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
ref (array_like) – Allele reference of each locus. The allele having the minor allele frequency for the provided X is used as the reference if None. Defaults to None.

Returns

dosage – Dosage encoded as a variants-by-samples matrix.

Return type

ndarray

Example

>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>> from bgen_reader import compute_dosage
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...     variant_idx = 2
...     e = allele_expectation(bgen, variant_idx)
...     dosage = compute_dosage(e)
...     print(dosage[:5])
[0.01550294 0.99383543 1.97933958 0.99560547 1.97879027]

Chi-square mixture¶

limix.stats.Chi2Mixture(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0, lrt=None)[source]

Mixture of 𝜒² distributions.

Class for evaluation of P values for a test statistic that follows a two-component mixture of chi2

\[p(x) = (1-𝑝)𝜒²(0) + 𝑝𝑎𝜒²(𝑑).\]

Here 𝑝 is the probability being in the first component and 𝑎 and 𝑑 are the scale parameter and the number of degrees of freedom of the second component.

Warning

This class is not production-ready. Keep in mind that this interface is likely to change.

Parameters

scale_min (float) – Minimum value used for fitting the scale parameter.
scale_max (float) – Maximum value used for fitting the scale parameter.
dofmin (float) – Minimum value used for fitting the dof parameter.
dofmax (float) – Maximum value used for fitting the dof parameter.
qmax (float) – Only the top qmax quantile is used for the fit.
n_interval (int) – Number of intervals when performing gridsearch.
tol (float) – Tolerance of being zero.

Example

>>> from numpy.random import RandomState
>>> import scipy as sp
>>> from limix.stats import Chi2Mixture
>>>
>>> scale = 0.3
>>> dof = 2
>>> mixture = 0.2
>>> n = 100
>>>
>>> random = RandomState(1)
>>> x =  random.chisquare(dof, n)
>>> n0 = int( (1-mixture) * n)
>>> idxs = random.choice(n, n0, replace=False)
>>> x[idxs] = 0
>>>
>>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0,
...                       dof_min=0.1, dof_max=5.0,
...                       qmax=0.1, tol=4e-3)
>>> chi2mix.estimate_chi2mixture(x)
>>> pv = chi2mix.sf(x)
>>> print(pv[:4]) 
[0.2 0.2 0.2 0.2]
>>>
>>> print('%.2f' % chi2mix.scale)
1.98
>>> print('%.2f' % chi2mix.dof)
0.89
>>> print('%.2f' % chi2mix.mixture)
0.20

Ground truth evaluation¶

limix.stats.confusion_matrix(df, wsize=50000)[source]

Provide a couple of scores based on the idea of windows around genetic markers.

Parameters

causals – Indices defining the causal markers.
pos – Within-chromossome base-pair position of each candidate marker, in crescent order.

Kinship¶

limix.stats.linear_kinship(G, out=None, verbose=True)[source]

Estimate Kinship matrix via linear kernel.

Let 𝑑 be the number of columns of G. The resulting matrix is given by:

\[𝙺 = 𝚇𝚇ᵀ/𝑑\]

where

\[𝚇ᵢⱼ = (𝙶ᵢⱼ - 𝑚ⱼ) / 𝑠ⱼ\]

is the matrix G column-wise normalized by means 𝑚ⱼ and standard deviations 𝑠ⱼ. NaNs are ignored so as to produce matrix K having only real numbers.

This functions is specially designed to also handle large matrices G that would otherwise require a large amount of memory if it were to be loaded in memory first. For those cases, libraries like Dask come in handy.

Parameters

G (array_like) – Samples-by-variants matrix.
out (ndarray) – A location into which the result is stored.
verbose (bool, optional) – True for showing progress; False otherwise. Defauts to True.

Examples

>>> from numpy.random import RandomState
>>> from limix.stats import linear_kinship
>>>
>>> random = RandomState(1)
>>> X = random.randn(4, 100)
>>> K = linear_kinship(X, verbose=False)
>>> print(K) 
[[ 0.91314823 -0.19283362 -0.34133897 -0.37897564]
 [-0.19283362  0.89885153 -0.2356003  -0.47041761]
 [-0.34133897 -0.2356003   0.95777313 -0.38083386]
 [-0.37897564 -0.47041761 -0.38083386  1.23022711]]

Likelihood-ratio test¶

limix.stats.lrt_pvalues(null_lml, alt_lmls, dof=1)[source]

Compute p-values from likelihood ratios.

These are likelihood ratio test p-values.

Parameters

null_lml (float) – Log of the marginal likelihood under the null hypothesis.
alt_lmls (array_like) – Log of the marginal likelihoods under the alternative hypotheses.
dof (int) – Degrees of freedom.

Returns

pvalues – P-values.

Return type

ndarray

Principal component analysis¶

limix.stats.pca(X, ncomp)[source]

Principal component analysis.

Parameters

X (array_like) – Samples-by-dimensions array.
ncomp (int) – Number of components.

Returns

components (ndarray) – First components ordered by explained variance.
explained_variance (ndarray) – Explained variance.
explained_variance_ratio (ndarray) – Percentage of variance explained.

Examples

>>> from numpy import round
>>> from numpy.random import RandomState
>>> from limix.stats import pca
>>>
>>> X = RandomState(1).randn(4, 5)
>>> r = pca(X, ncomp=2)
>>> r['components']
array([[-0.75015369,  0.58346541, -0.07973564,  0.19565682, -0.22846925],
       [ 0.48842769,  0.72267548,  0.01968344, -0.46161623, -0.16031708]])
>>> r['explained_variance'] 
array([6.44655993, 0.51454938])
>>> r['explained_variance_ratio'] 
array([0.92049553, 0.07347181])

P-value correction¶

limix.stats.multipletests(pvals, alpha=0.05, method='hs', is_sorted=False)[source]

Test results and p-value correction for multiple tests.

Parameters

pvals (array_like) – Uncorrected p-values.
alpha (float) – FWER, family-wise error rate, e.g. 0.1.

method (string) –

Method used for testing and adjustment of pvalues. Can be either the full name or initial letters. Available methods are

`bonferroni` : one-step correction
`sidak` : one-step correction
`holm-sidak` : step down method using Sidak adjustments
`holm` : step-down method using Bonferroni adjustments
`simes-hochberg` : step-up method  (independent)
`hommel` : closed method based on Simes tests (non-negative)
`fdr_bh` : Benjamini/Hochberg  (non-negative)
`fdr_by` : Benjamini/Yekutieli (negative)
`fdr_tsbh` : two stage fdr correction (non-negative)
`fdr_tsbky` : two stage fdr correction (non-negative)

is_sorted (bool) – If False (default), the p_values will be sorted, but the corrected pvalues are in the original order. If True, then it assumed that the pvalues are already sorted in ascending order.

Returns

reject (ndarray, boolean) – True for hypothesis that can be rejected for given alpha.
pvals_corrected (ndarray) – P-values corrected for multiple tests.
alphacSidak (float) – Corrected alpha for Sidak method.
alphacBonf (float) – Corrected alpha for Bonferroni method.

Notes

This is a wrapper around a function from the statsmodels package.

limix.stats.empirical_pvalues(xt, x0)[source]

Function to compute empirical p-values.

Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).

Parameters

xt (array_like) – Test statistcs observed on data.
x0 (array_like) – Null test statistcs. The minimum p-value that can be estimated is 1./len(x0).

Returns

pvalues – Estimated empirical p-values.

Return type

ndarray

Examples

>>> from numpy.random import RandomState
>>> from limix.stats import empirical_pvalues
>>>
>>> random = RandomState(1)
>>> x0 = random.chisquare(1, 5)
>>> x1 = random.chisquare(1, 10000)
>>>
>>> empirical_pvalues(x0, x1) 
array([0.56300000, 1.00000000, 0.83900000, 0.79820000, 0.58030000])