Statistics

Alleles

limix.stats.allele_expectation(p, nalleles, ploidy)[source]

Allele expectation.

Compute the expectation of each allele from the given probabilities. It accepts three shapes of matrices: - unidimensional array of probabilities; - bidimensional samples-by-alleles probabilities array; - and three dimensional variants-by-samples-by-alleles array.

Parameters
  • p (array_like) – Allele probabilities.

  • nalleles (int) – Number of alleles.

  • ploidy (int) – Number of complete sets of chromosomes.

Returns

expectation – Last dimension will contain the expectation of each allele.

Return type

ndarray

Examples

>>> from texttable import Texttable
>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>>
>>> sampleid = "sample_005"
>>> rsid = "RSID_6"
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...
...     locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index
...     locus = locus.compute().values[0]
...     sample = bgen["samples"].to_frame().query("id == '{}'".format(sampleid))
...     sample = sample.index
...     sample = sample[0]
...
...     nalleles = bgen["variants"].loc[locus]["nalleles"]
...     ploidy = 2
...
...     p = bgen["genotype"][locus].compute()["probs"][sample]
...     # For unphased genotypes only.
...     e = allele_expectation(bgen, locus)[sample]
...
...     alleles = bgen["variants"].loc[locus]["allele_ids"].compute()
...     alleles = alleles.values[0].split(",")
...
...     tab = Texttable()
...
...     print(tab.add_rows(
...         [
...             ["", "AA", "AG", "GG", "E[.]"],
...             ["p"] + list(p) + [1.0],
...             ["#" + alleles[0], 2, 1, 0, e[0]],
...             ["#" + alleles[1], 0, 1, 2, e[1]],
...         ]
...     ).draw())
+----+-------+-------+-------+-------+
|    |  AA   |  AG   |  GG   | E[.]  |
+====+=======+=======+=======+=======+
| p  | 0.012 | 0.987 | 0.001 | 1     |
+----+-------+-------+-------+-------+
| #A | 2     | 1     | 0     | 1.011 |
+----+-------+-------+-------+-------+
| #G | 0     | 1     | 2     | 0.989 |
+----+-------+-------+-------+-------+
>>> print("variant: {}".format(rsid))
variant: RSID_6
>>> print("sample : {}".format(sampleid))
sample : sample_005

Note

This function supports unphased genotypes only.

limix.stats.allele_frequency(X)[source]

Compute allele frequency from its expectation.

X is a matrix whose last dimension correspond to the different set of chromosomes. The first dimension represent different variants and the second dimension represent the different samples.

Parameters

X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.

Returns

frequenct – Allele frequencies encoded as a variants-by-alleles matrix.

Return type

ndarray

limix.stats.compute_dosage(X, alt=None)[source]

Compute dosage from allele expectation.

Parameters
  • X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.

  • ref (array_like) – Allele reference of each locus. The allele having the minor allele frequency for the provided X is used as the reference if None. Defaults to None.

Returns

dosage – Dosage encoded as a variants-by-samples matrix.

Return type

ndarray

Example

>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>> from bgen_reader import compute_dosage
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...     variant_idx = 2
...     e = allele_expectation(bgen, variant_idx)
...     dosage = compute_dosage(e)
...     print(dosage[:5])
[0.01550294 0.99383543 1.97933958 0.99560547 1.97879027]

Chi-square mixture

limix.stats.Chi2Mixture(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0, lrt=None)[source]

Mixture of 𝜒² distributions.

Class for evaluation of P values for a test statistic that follows a two-component mixture of chi2

\[p(x) = (1-𝑝)𝜒²(0) + 𝑝𝑎𝜒²(𝑑).\]

Here 𝑝 is the probability being in the first component and 𝑎 and 𝑑 are the scale parameter and the number of degrees of freedom of the second component.

Warning

This class is not production-ready. Keep in mind that this interface is likely to change.

Parameters
  • scale_min (float) – Minimum value used for fitting the scale parameter.

  • scale_max (float) – Maximum value used for fitting the scale parameter.

  • dofmin (float) – Minimum value used for fitting the dof parameter.

  • dofmax (float) – Maximum value used for fitting the dof parameter.

  • qmax (float) – Only the top qmax quantile is used for the fit.

  • n_interval (int) – Number of intervals when performing gridsearch.

  • tol (float) – Tolerance of being zero.

Example

>>> from numpy.random import RandomState
>>> import scipy as sp
>>> from limix.stats import Chi2Mixture
>>>
>>> scale = 0.3
>>> dof = 2
>>> mixture = 0.2
>>> n = 100
>>>
>>> random = RandomState(1)
>>> x =  random.chisquare(dof, n)
>>> n0 = int( (1-mixture) * n)
>>> idxs = random.choice(n, n0, replace=False)
>>> x[idxs] = 0
>>>
>>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0,
...                       dof_min=0.1, dof_max=5.0,
...                       qmax=0.1, tol=4e-3)
>>> chi2mix.estimate_chi2mixture(x)
>>> pv = chi2mix.sf(x)
>>> print(pv[:4]) 
[0.2 0.2 0.2 0.2]
>>>
>>> print('%.2f' % chi2mix.scale)
1.98
>>> print('%.2f' % chi2mix.dof)
0.89
>>> print('%.2f' % chi2mix.mixture)
0.20

Ground truth evaluation

limix.stats.confusion_matrix(df, wsize=50000)[source]

Provide a couple of scores based on the idea of windows around genetic markers.

Parameters
  • causals – Indices defining the causal markers.

  • pos – Within-chromossome base-pair position of each candidate marker, in crescent order.

Kinship

limix.stats.linear_kinship(G, out=None, verbose=True)[source]

Estimate Kinship matrix via linear kernel.

Let 𝑑 be the number of columns of G. The resulting matrix is given by:

\[𝙺 = 𝚇𝚇ᵀ/𝑑\]

where

\[𝚇ᵢⱼ = (𝙶ᵢⱼ - 𝑚ⱼ) / 𝑠ⱼ\]

is the matrix G column-wise normalized by means 𝑚ⱼ and standard deviations 𝑠ⱼ. NaNs are ignored so as to produce matrix K having only real numbers.

This functions is specially designed to also handle large matrices G that would otherwise require a large amount of memory if it were to be loaded in memory first. For those cases, libraries like Dask come in handy.

Parameters
  • G (array_like) – Samples-by-variants matrix.

  • out (ndarray) – A location into which the result is stored.

  • verbose (bool, optional) – True for showing progress; False otherwise. Defauts to True.

Examples

>>> from numpy.random import RandomState
>>> from limix.stats import linear_kinship
>>>
>>> random = RandomState(1)
>>> X = random.randn(4, 100)
>>> K = linear_kinship(X, verbose=False)
>>> print(K) 
[[ 0.91314823 -0.19283362 -0.34133897 -0.37897564]
 [-0.19283362  0.89885153 -0.2356003  -0.47041761]
 [-0.34133897 -0.2356003   0.95777313 -0.38083386]
 [-0.37897564 -0.47041761 -0.38083386  1.23022711]]

Likelihood-ratio test

limix.stats.lrt_pvalues(null_lml, alt_lmls, dof=1)[source]

Compute p-values from likelihood ratios.

These are likelihood ratio test p-values.

Parameters
  • null_lml (float) – Log of the marginal likelihood under the null hypothesis.

  • alt_lmls (array_like) – Log of the marginal likelihoods under the alternative hypotheses.

  • dof (int) – Degrees of freedom.

Returns

pvalues – P-values.

Return type

ndarray

Principal component analysis

limix.stats.pca(X, ncomp)[source]

Principal component analysis.

Parameters
  • X (array_like) – Samples-by-dimensions array.

  • ncomp (int) – Number of components.

Returns

  • components (ndarray) – First components ordered by explained variance.

  • explained_variance (ndarray) – Explained variance.

  • explained_variance_ratio (ndarray) – Percentage of variance explained.

Examples

>>> from numpy import round
>>> from numpy.random import RandomState
>>> from limix.stats import pca
>>>
>>> X = RandomState(1).randn(4, 5)
>>> r = pca(X, ncomp=2)
>>> r['components']
array([[-0.75015369,  0.58346541, -0.07973564,  0.19565682, -0.22846925],
       [ 0.48842769,  0.72267548,  0.01968344, -0.46161623, -0.16031708]])
>>> r['explained_variance'] 
array([6.44655993, 0.51454938])
>>> r['explained_variance_ratio'] 
array([0.92049553, 0.07347181])

P-value correction

limix.stats.multipletests(pvals, alpha=0.05, method='hs', is_sorted=False)[source]

Test results and p-value correction for multiple tests.

Parameters
  • pvals (array_like) – Uncorrected p-values.

  • alpha (float) – FWER, family-wise error rate, e.g. 0.1.

  • method (string) –

    Method used for testing and adjustment of pvalues. Can be either the full name or initial letters. Available methods are

    `bonferroni` : one-step correction
    `sidak` : one-step correction
    `holm-sidak` : step down method using Sidak adjustments
    `holm` : step-down method using Bonferroni adjustments
    `simes-hochberg` : step-up method  (independent)
    `hommel` : closed method based on Simes tests (non-negative)
    `fdr_bh` : Benjamini/Hochberg  (non-negative)
    `fdr_by` : Benjamini/Yekutieli (negative)
    `fdr_tsbh` : two stage fdr correction (non-negative)
    `fdr_tsbky` : two stage fdr correction (non-negative)
    

  • is_sorted (bool) – If False (default), the p_values will be sorted, but the corrected pvalues are in the original order. If True, then it assumed that the pvalues are already sorted in ascending order.

Returns

  • reject (ndarray, boolean) – True for hypothesis that can be rejected for given alpha.

  • pvals_corrected (ndarray) – P-values corrected for multiple tests.

  • alphacSidak (float) – Corrected alpha for Sidak method.

  • alphacBonf (float) – Corrected alpha for Bonferroni method.

Notes

This is a wrapper around a function from the statsmodels package.

limix.stats.empirical_pvalues(xt, x0)[source]

Function to compute empirical p-values.

Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).

Parameters
  • xt (array_like) – Test statistcs observed on data.

  • x0 (array_like) – Null test statistcs. The minimum p-value that can be estimated is 1./len(x0).

Returns

pvalues – Estimated empirical p-values.

Return type

ndarray

Examples

>>> from numpy.random import RandomState
>>> from limix.stats import empirical_pvalues
>>>
>>> random = RandomState(1)
>>> x0 = random.chisquare(1, 5)
>>> x1 = random.chisquare(1, 10000)
>>>
>>> empirical_pvalues(x0, x1) 
array([0.56300000, 1.00000000, 0.83900000, 0.79820000, 0.58030000])