Quality control

Box-Cox

limix.qc.boxcox(x)

Box-Cox transformation for normality conformance.

It applies the power transformation

\[\begin{split}f(x) = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0; \\ \log(x), & \text{if } \lambda = 0. \end{cases}\end{split}\]

to the provided data so that its distribution is closer to normal. The λ parameter is fitted by maximum likelihood estimation.

Parameters

x (array_like) – Data to be transformed.

Returns

boxcox – Box-Cox transformed data.

Return type

ndarray

Examples

[Figure: example plot (qc-1.png)]
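The power transformation above can be sketched directly with NumPy. This is only an illustration of the formula for a λ chosen by hand; it does not perform the maximum likelihood fit that limix.qc.boxcox is described as doing, and the helper name boxcox_transform is hypothetical.

```python
import numpy as np


def boxcox_transform(x, lam):
    """Apply the Box-Cox formula for a fixed, user-chosen lambda (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # The transformation is only defined for strictly positive data.
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam


# lambda = 0 reduces to the natural logarithm.
y_log = boxcox_transform([1.0, np.e], 0.0)
# lambda = 0.5 is a shifted and scaled square root.
y_sqrt = boxcox_transform([1.0, 4.0], 0.5)
```

In practice λ would be estimated from the data rather than fixed as in this sketch.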

Dependent columns

limix.qc.remove_dependent_cols(X, tol=1.49e-08)

Remove dependent columns.

Return a matrix with dependent columns removed.

Parameters
  • X (array_like) – Matrix that might have dependent columns.

  • tol (float, optional) – Threshold below which columns are considered dependent.

Returns

rank – Full column rank matrix.

Return type

ndarray
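One way to picture the idea is a greedy rank test: keep a column only if adding it increases the rank of the columns kept so far. The sketch below uses that swapped-in technique with numpy.linalg.matrix_rank; it is not claimed to be how limix implements remove_dependent_cols, and the name drop_dependent_cols is hypothetical.

```python
import numpy as np


def drop_dependent_cols(X, tol=1.49e-08):
    """Greedily keep columns that are linearly independent of those already kept."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        candidate = keep + [j]
        # Singular values below `tol` are treated as zero by matrix_rank.
        if np.linalg.matrix_rank(X[:, candidate], tol=tol) == len(candidate):
            keep.append(j)
    return X[:, keep]


# The third column is the sum of the first two, so it is dropped.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
X_full = drop_dependent_cols(X)
```

The result has full column rank, at the cost of one rank computation per column.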

limix.qc.unique_variants(X)

Filters out variants with the same genetic profile.

Parameters

X (array_like) – Samples-by-variants matrix of genotype values.

Returns

genotype – Genotype array with unique variants.

Return type

ndarray

Example

>>> from numpy.random import RandomState
>>> from numpy import kron, ones
>>> from limix.qc import unique_variants
>>> random = RandomState(1)
>>>
>>> N = 4
>>> X = kron(random.randn(N, 3) < 0, ones((1, 2)))
>>>
>>> print(X)  
[[0. 0. 1. 1. 1. 1.]
 [1. 1. 0. 0. 1. 1.]
 [0. 0. 1. 1. 0. 0.]
 [1. 1. 0. 0. 1. 1.]]
>>>
>>> print(unique_variants(X))  
[[0. 1. 1.]
 [1. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]

Genotype

limix.qc.indep_pairwise(X, window_size, step_size, threshold, verbose=True)

Determine pair-wise independent variants.

Independent variants are defined via squared Pearson correlations between pairs of variants inside a sliding window.

Parameters
  • X (array_like) – Samples-by-variants matrix.

  • window_size (int) – Number of variants inside each window.

  • step_size (int) – Number of variants the sliding window skips.

  • threshold (float) – Squared Pearson correlation threshold for independence.

  • verbose (bool) – True for progress information; False otherwise.

Returns

ok – Boolean array defining independent variants.

Return type

ndarray
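The sliding-window idea can be sketched as a greedy pruning loop. This is one plausible reading of the description, not necessarily the exact procedure limix follows, so its output may differ from indep_pairwise on the same data. The function name prune_pairwise is hypothetical.

```python
import numpy as np


def prune_pairwise(X, window_size, step_size, threshold):
    """Flag variants to keep; drop the later variant of any highly correlated pair."""
    X = np.asarray(X, dtype=float)
    n_variants = X.shape[1]
    ok = np.ones(n_variants, dtype=bool)
    # Slide a window of `window_size` variants, advancing `step_size` at a time.
    for start in range(0, n_variants, step_size):
        stop = min(start + window_size, n_variants)
        for i in range(start, stop):
            if not ok[i]:
                continue
            for j in range(i + 1, stop):
                if not ok[j]:
                    continue
                # Squared Pearson correlation between variants i and j.
                r = np.corrcoef(X[:, i], X[:, j])[0, 1]
                if r * r > threshold:
                    ok[j] = False
    return ok


a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2.0 * a                                   # perfectly correlated with a
c = np.array([1.0, -1.0, 1.0, -1.0, 1.0])    # uncorrelated with a
ok = prune_pairwise(np.column_stack([a, b, c]), 3, 1, 0.5)
```

Here the second variant is dropped because it duplicates the information in the first.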

Example

>>> from numpy.random import RandomState
>>> from limix.qc import indep_pairwise
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 20)
>>>
>>> indep_pairwise(X, 4, 2, 0.5, verbose=False)
array([ True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

limix.qc.compute_maf(X)

Compute minor allele frequencies.

X is assumed to encode the number of alleles (or dosage) as 0, 1, or 2, with NaN representing missing values.

Parameters

X (array_like) – Genotype matrix.

Returns

Minor allele frequencies.

Return type

array_like

Examples

>>> from numpy.random import RandomState
>>> from limix.qc import compute_maf
>>>
>>> random = RandomState(0)
>>> X = random.randint(0, 3, size=(100, 10))
>>>
>>> print(compute_maf(X)) 
[0.49  0.49  0.445 0.495 0.5   0.45  0.48  0.48  0.47  0.435]
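Under the 0/1/2 encoding described above, the computation can be sketched as follows. The sketch assumes diploid genotypes with variants in columns, and skips NaN entries; minor_allele_frequency is a hypothetical name, and limix's own implementation may differ in details.

```python
import numpy as np


def minor_allele_frequency(X):
    """Per-variant minor allele frequency from 0/1/2 dosages, ignoring NaN."""
    X = np.asarray(X, dtype=float)
    # Mean dosage divided by 2 is the frequency of the counted allele
    # (assumes two allele copies per sample).
    freq = np.nanmean(X, axis=0) / 2.0
    # The minor allele is whichever of the two alleles is rarer.
    return np.minimum(freq, 1.0 - freq)


X = np.array([[0, 0],
              [1, 0],
              [2, np.nan],
              [1, 1]])
maf = minor_allele_frequency(X)
```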

Impute

limix.qc.mean_impute(X, axis=-1, inplace=False)

Impute NaN values.

It defaults to column-wise imputation.

Parameters
  • X (array_like) – Matrix to have its missing values imputed.

  • axis (int, optional) – Axis value. Defaults to -1.

  • inplace (bool, optional) – Defaults to False.

Returns

Imputed array.

Return type

ndarray
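Column-wise mean imputation can be sketched with NumPy as below. The helper impute_column_means is hypothetical and always works on a copy; it does not reproduce the axis and inplace options of mean_impute.

```python
import numpy as np


def impute_column_means(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = np.array(X, dtype=float)  # np.array copies, so the input is left untouched
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


X = np.array([[np.nan, 1.0],
              [2.0, 3.0],
              [4.0, np.nan]])
X_imputed = impute_column_means(X)
```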

Examples

>>> from numpy.random import RandomState
>>> from numpy import nan, array_str
>>> from limix.qc import mean_impute
>>>
>>> random = RandomState(0)
>>> X = random.randn(5, 2)
>>> X[0, 0] = nan
>>>
>>> print(array_str(mean_impute(X), precision=4))
[[ 0.9233  0.4002]
 [ 0.9787  2.2409]
 [ 1.8676 -0.9773]
 [ 0.9501 -0.1514]
 [-0.1032  0.4106]]

limix.qc.count_missingness(X)

Count the number of missing values per column.

Parameters

X (array_like) – Matrix.

Returns

count – Number of missing values per column.

Return type

ndarray
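A minimal NumPy sketch of the same count, assuming missing values are encoded as NaN (the helper name count_nan_per_column is hypothetical):

```python
import numpy as np


def count_nan_per_column(X):
    """Number of NaN entries in each column."""
    return np.isnan(np.asarray(X, dtype=float)).sum(axis=0)


X = np.array([[np.nan, 1.0],
              [2.0, 3.0],
              [np.nan, np.nan]])
counts = count_nan_per_column(X)
```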

Kinship

limix.qc.normalise_covariance(K, out=None)

Variance rescaling of covariance matrix 𝙺.

Let n be the number of rows (or columns) of 𝙺 and let mᵢ be the average of the values in the i-th column. Gower rescaling is defined as

\[𝙺(n - 1)/(𝚝𝚛𝚊𝚌𝚎(𝙺) - ∑mᵢ).\]

Notes

The reasoning behind the scaling is as follows. Let 𝐠 be a vector of n independent samples and let 𝙲 be Gower’s centering matrix. The unbiased variance estimator is

\[v = ∑ (gᵢ-ḡ)²/(n-1) = 𝚝𝚛𝚊𝚌𝚎((𝐠-ḡ𝟏)ᵀ(𝐠-ḡ𝟏))/(n-1) = 𝚝𝚛𝚊𝚌𝚎(𝙲𝐠𝐠ᵀ𝙲)/(n-1)\]

Let 𝙺 be the covariance matrix of 𝐠. The expectation of the unbiased variance estimator is

\[𝐄[v] = 𝚝𝚛𝚊𝚌𝚎(𝙲𝐄[𝐠𝐠ᵀ]𝙲)/(n-1) = 𝚝𝚛𝚊𝚌𝚎(𝙲𝙺𝙲)/(n-1),\]

assuming that 𝐄[gᵢ]=0. We thus divide 𝙺 by 𝐄[v] to achieve an unbiased normalisation on the random variable gᵢ.

Parameters
  • K (array_like) – Covariance matrix to be normalised.

  • out (array_like, optional) – Result destination. Defaults to None.

Examples

>>> from numpy import dot, mean, zeros
>>> from numpy.random import RandomState
>>> from limix.qc import normalise_covariance
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 10)
>>> K = dot(X, X.T)
>>> Z = random.multivariate_normal(zeros(10), K, 500)
>>> print("%.3f" % mean(Z.var(1, ddof=1)))
9.824
>>> Kn = normalise_covariance(K)
>>> Zn = random.multivariate_normal(zeros(10), Kn, 500)
>>> print("%.3f" % mean(Zn.var(1, ddof=1)))
1.008
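The rescaling formula can also be checked numerically. The sketch below implements it directly with NumPy and confirms that 𝚝𝚛𝚊𝚌𝚎(𝙲𝙺𝙲)/(n-1) equals one after rescaling. The helper gower_rescale is hypothetical and ignores the out argument of normalise_covariance.

```python
import numpy as np


def gower_rescale(K):
    """Divide K by (trace(K) - sum of column means) / (n - 1)."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    col_means = K.mean(axis=0)
    scale = (np.trace(K) - col_means.sum()) / (n - 1)
    return K / scale


rng = np.random.RandomState(0)
A = rng.randn(10, 10)
K = A @ A.T
Kn = gower_rescale(K)

# Centering matrix C = I - (1/n) 1 1^T.
n = K.shape[0]
C = np.eye(n) - np.ones((n, n)) / n
# Expected sample variance under the rescaled covariance; equals 1 up to rounding.
expected_variance = np.trace(C @ Kn @ C) / (n - 1)
```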

Normalisation

limix.qc.mean_standardize(X, axis=-1, inplace=False)

Zero-mean and one-deviation normalisation.

Normalise in such a way that the mean and variance are equal to zero and one, respectively. This transformation is taken over the flattened array by default, otherwise over the specified axis. Missing values represented by NaN are ignored.

Parameters
  • X (array_like) – Array of values.

  • axis (int, optional) – Axis value. Defaults to -1.

  • inplace (bool, optional) – Defaults to False.

Returns

X – Normalized array.

Return type

ndarray
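The core operation, with NaN values ignored, can be sketched for the column-wise case as below. The helper standardize_columns is hypothetical; it uses the population standard deviation and makes no attempt to reproduce the axis convention or inplace option of mean_standardize.

```python
import numpy as np


def standardize_columns(X):
    """Zero-mean, unit-variance scaling of each column, skipping NaN entries."""
    X = np.asarray(X, dtype=float)
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std


X = np.array([[0.0, 10.0],
              [1.0, np.nan],
              [2.0, 30.0]])
Z = standardize_columns(X)
```

Missing entries stay NaN in the result; only the observed values contribute to each column's mean and deviation.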

Example

>>> import limix
>>> from numpy import arange
>>>
>>> X = arange(6).reshape((2, 3)).astype(float)
>>> print(X)
[[0. 1. 2.]
 [3. 4. 5.]]
>>> X = limix.qc.mean_standardize(X, axis=0)
>>> print(X)  
[[-1.22474487  0.          1.22474487]
 [-1.22474487  0.          1.22474487]]

limix.qc.quantile_gaussianize(X, axis=1, inplace=False)

Normalize a sequence of values via their ranks and the inverse Normal c.d.f.

It defaults to column-wise normalization.

Parameters
  • X (array_like) – Array of values.

  • axis (int, optional) – Axis value. Defaults to 1.

  • inplace (bool, optional) – Defaults to False.

Returns

Gaussian-normalized values.

Return type

array_like

Examples

>>> from limix.qc import quantile_gaussianize
>>>
>>> qg = quantile_gaussianize([-1, 0, 2])
>>> print(qg) 
[-0.67448975  0.          0.67448975]
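The output above is consistent with mapping each value's rank r (1 to n) to r/(n+1) and then applying the inverse Normal c.d.f. A minimal sketch of that reading for a one-dimensional input follows. The helper rank_gaussianize is hypothetical, does not handle ties or NaN, and is not claimed to match limix's implementation in general.

```python
from statistics import NormalDist

import numpy as np


def rank_gaussianize(values):
    """Map values to standard Normal quantiles via their ranks (1-D, no ties)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    # Rank 1 goes to the smallest value, rank n to the largest.
    ranks = np.empty(n)
    ranks[np.argsort(values)] = np.arange(1, n + 1)
    # r / (n + 1) lies strictly inside (0, 1), so the inverse c.d.f. is finite.
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(r / (n + 1)) for r in ranks])


qg_sketch = rank_gaussianize([-1, 0, 2])
```

For the input [-1, 0, 2] the ranks map to 0.25, 0.5 and 0.75, whose Normal quantiles are approximately -0.6745, 0 and 0.6745.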