Quantitative trait locusยถ
Introductionยถ
Every genetic model considered here is an instance of generalized linear mixed model (GLMM). It consists in four main components [St16]:
A linear predictor, ๐ณ = M๐ + ๐๐ฎ.
The distribution of the random effects, ๐ฎ โผ ๐(๐, ฮฃ).
The residual distribution, yแตข | ๐ฎ.
The link function, ๐แตข = g(zแตข).
The term ๐แตข represents the mean of yแตข conditioned on ๐ฎ:
The role of the link function is to scale the domain of zแตข, which ranges from -โ to +โ, to the residual distribution parameter ๐แตข. For example, the mean of a Bernoulli distribution is bounded within [0, 1], and therefore requires a link function to translate values of zแตข into values of ๐แตข.
The distribution of the outcome, conditioned on the random effects, has to be one from the exponential family [Ef18] having mean ๐แตข:
A notable instance of the above model is the linear mixed model (LMM). It consists of the identity link function, ๐แตข = g(๐แตข), and of normally distributed residuals, yแตข | ๐ฎ โผ ๐(๐แตข, ๐แตขยฒ) [Mc11]. It is more commonly described by the equation
for which ๐แตขโผ๐(0, ๐แตขยฒ). The random variables ๐ฎ and ๐ are independent from each other as well as ๐แตข and ๐โฑผ for iโ j. Defining ๐ฏ = ๐๐ฎ leads to:
There is another even simpler instance of GLMM that is also used in genetic analysis: a linear model (LM) is merely a LMM without the random effects:
The above models are used to establish a statistical tests to find significant association between genetic loci and phenotype. For that, their parameters have to be estimated.
As an example, let us define two parameters that will describe the overall variances of the random effects and of the residual effects:
If we assume a LMM, this example of model can be described by Eq. (1) for which
Equivalently, we have
for which
Therefore we have a model with three parameters: an array of effect sizes ๐ and variances ๐โ and ๐โ. If ๐ contains the normalized SNP genotypes of the samples, ๐๐แต is an estimation of the genetic relationship between the samples [Wa17].
Statistical testยถ
We use the likelihood ratio test (LRT) approach [LR18] to assess the significance of the association between genetic variants and the phenotype. It is based on the ratio between the marginal likelihood of the null ๐โ and alternative ๐โ models, for which the simpler model ๐โ is defined by placing a constraint on one or more parameters of the alternative model ๐โ.
The parameter inference is done via the maximum likelihood estimation (MLE) approach [ML18], for which the marginal likelihood p(๐ฒ | ๐ผ, ๐; ๐) is maximized over the parameters set ๐. Let ๐โ and ๐โ be the optimal parameters set under the null and alternative models. The likelihood ratio statistics is given by
which asymptotically follows a ฯยฒ distribution [Wh14]. We will make use of the LRT approach in the next sections to flag significant genetic associations.
Single-trait associationยถ
We first consider that the observed phenotype is described by additive effects from covariates and genetic components. Any deviation from that is assumed to be captured by the residual distribution. Let ๐ผ be a matrix of covariates and let ๐ถ be a matrix of genetic variants that we suspect might have some effect on the phenotype. Therefore, we have the linear model:
and we wish to compare the following hypotheses:
Note that the parameters of the above model are the covariate effect sizes, ๐, the effect sizes of a set of genetic variants, ๐, and the variance ๐โ of the noise variable. Under the null hypothesis, we set ๐=๐ and fit the rest of the parameters. Under the alternative hypothesis, we learn all the parameters. At the end, we compare the marginal likelihoods via the likelihood ratio test.
Let us first generate a random data set having a phenotype, covariates, and a set of genetic candidates.
>>> from numpy import ones, stack
>>> from numpy.random import RandomState
>>> from pandas import DataFrame
>>>
>>> random = RandomState(2)
>>>
>>> # sample size
>>> n = 100
>>>
>>> # covariates
>>> offset = ones(n) * random.randn()
>>> age = random.randint(16, 75, n)
>>> M = stack((offset, age), axis=1)
>>> M = DataFrame(stack([offset, age], axis=1), columns=["offset", "age"])
>>> M["sample"] = [f"sample{i}" for i in range(n)]
>>> M = M.set_index("sample")
>>> print(M.head())
offset age
sample
sample0 -0.41676 38.00000
sample1 -0.41676 59.00000
sample2 -0.41676 34.00000
sample3 -0.41676 27.00000
sample4 -0.41676 56.00000
>>> # genetic variants
>>> G = random.randn(n, 4)
>>>
>>> # sampling the phenotype
>>> alpha = random.randn(2)
>>> beta = random.randn(4)
>>> eps = random.randn(n)
>>> y = M @ alpha + G @ beta + eps
We now apply the function limix.qtl.scan()
to our data set
>>> from limix.qtl import scan
>>>
>>> r = scan(G, y, "normal", M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------
๐ฒ ~ ๐(๐ผ๐ถ, 3.462โ
๐ธ)
M = ['offset' 'age']
๐ถ = [2.10096551 0.19582931]
se(๐ถ) = [1.25826998 0.01068367]
lml = -203.98750767964498
Hypothesis 2
------------
๐ฒ ~ ๐(๐ผ๐ถ + G๐, s(3.462โ
๐ธ))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -1.951e+02 9.915e-01 -6.198e-01
std 9.227e+00 9.342e-01 3.974e-01
min -2.031e+02 1.844e-01 -1.025e+00
25% -2.026e+02 1.959e-01 -9.275e-01
50% -1.967e+02 5.965e-01 -6.047e-01
75% -1.893e+02 1.831e+00 -2.970e-01
max -1.841e+02 2.312e+00 -2.448e-01
Likelihood-ratio test p-values
------------------------------
๐โ vs ๐โ
----------------
mean 6.514e-02
std 8.856e-02
min 2.804e-10
25% 2.606e-07
50% 3.651e-02
75% 1.016e-01
max 1.875e-01
Suppose we also have access to the whole genotype of our samples, ๐, and we want to use them to account for population structure and cryptic relatedness in our data [Ho13]. Since the number of genetic variants in ๐ is commonly larger than the number of samples, and because we are not actually interested in their effect sizes, we will include it in our model as a random component. We now have a linear mixed model:
It is important to note that ๐ฏ=๐๐ฎ can be equivalently described by a multivariate Normal distribution with a covariance proportional to ๐บ = ๐๐แต:
We make use of the function limix.stats.linear_kinship()
to define the covariance
matrix ๐บ, and call limix.qtl.scan()
to perform the analysis.
>>> from limix.stats import linear_kinship, multivariate_normal
>>> from numpy import zeros, eye
>>>
>>> # Whole genotype of each sample.
>>> X = random.randn(n, 50)
>>> # Estimate a kinship relationship between samples.
>>> K = linear_kinship(X, verbose=False) + 1e-9 * eye(n)
>>> # Update the phenotype
>>> y += multivariate_normal(random, zeros(n), K)
>>>
>>> r = scan(X, y, "normal", K, ๐ผ=M, verbose=False)
>>> print(r)
Hypothesis 0
------------
๐ฒ ~ ๐(๐ผ๐ถ, 1.436โ
๐บ + 2.934โ
๐ธ)
M = ['offset' 'age']
๐ถ = [1.95338293 0.19448903]
se(๐ถ) = [1.25455536 0.01076470]
lml = -211.3819625136375
Hypothesis 2
------------
๐ฒ ~ ๐(๐ผ๐ถ + G๐, s(1.436โ
๐บ + 2.934โ
๐ธ))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -2.109e+02 1.069e+00 5.922e-02
std 7.210e-01 8.819e-01 2.474e-01
min -2.114e+02 1.919e-01 -5.204e-01
25% -2.113e+02 1.944e-01 -1.102e-01
50% -2.111e+02 8.904e-01 4.920e-02
75% -2.109e+02 1.956e+00 2.366e-01
max -2.076e+02 2.215e+00 6.433e-01
Likelihood-ratio test p-values
------------------------------
๐โ vs ๐โ
----------------
mean 4.843e-01
std 2.705e-01
min 6.294e-03
25% 3.137e-01
50% 4.752e-01
75% 6.953e-01
max 9.929e-01
Non-normal trait associationยถ
If the residuals of the phenotype does not follow a Normal distribution, then we might consider performing the analysis using a generalized linear mixed model. Let us consider Poisson distributed residuals:
where the latent phenotype is described by
for
Note that the term ๐ in the above model is not the residual variable, as it were in the Eq. (1). The term ๐ is used to account for the so-called over-dispersion, i.e., when the residual distribution is not sufficient to explain the variability of yแตข.
>>> from numpy import exp
>>>
>>> z = (y - y.mean()) / y.std()
>>> y = random.poisson(exp(z))
>>>
>>> r = scan(G, y, "poisson", K, M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------
๐ณ ~ ๐(๐ผ๐ถ, 0.154โ
๐บ + 0.000โ
๐ธ) for yแตข ~ Poisson(ฮปแตข=g(zแตข)) and g(x)=eหฃ
M = ['offset' 'age']
๐ถ = [5.17511934 0.04665214]
se(๐ถ) = [0.85159296 0.00604330]
lml = -145.33385788740767
Hypothesis 2
------------
๐ณ ~ ๐(๐ผ๐ถ + G๐, s(0.154โ
๐บ + 0.000โ
๐ธ)) for yแตข ~ Poisson(ฮปแตข=g(zแตข)) and g(x)=eหฃ
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -1.440e+02 2.553e+00 -1.306e-01
std 1.343e+00 2.682e+00 9.268e-02
min -1.453e+02 4.345e-02 -2.227e-01
25% -1.450e+02 4.635e-02 -2.018e-01
50% -1.439e+02 2.456e+00 -1.344e-01
75% -1.428e+02 5.054e+00 -6.321e-02
max -1.427e+02 5.202e+00 -3.085e-02
Likelihood-ratio test p-values
------------------------------
๐โ vs ๐โ
----------------
mean 2.830e-01
std 3.213e-01
min 2.274e-02
25% 2.519e-02
50% 2.113e-01
75% 4.692e-01
max 6.867e-01
Single-trait with interactionยถ
The following linear mixed model is considered:
The operator โ works as follows:
Therefore, the terms ๐ถโ๐ดโ and ๐ถโ๐ดโ can be understood as interaction terms between genetics, ๐ถ, and environments, ๐ดโ and ๐ดโ.
We define three hypotheses from the above linear mixed model:
The hypothesis ๐โ is for no-interaction, ๐โ is for interaction with environments encoded in ๐ดโ, and ๐โ is for interaction with environments encoded in ๐ดโ and ๐ดโ. We perform three statistical tests:
๐โ (null) vs ๐โ (alternative)
๐โ (null) vs ๐โ (alternative)
๐โ (null) vs ๐โ (alternative)
Here is an example.
>>> from numpy import concatenate, newaxis
>>> from limix.qtl import iscan
>>>
>>> # Generate interacting variables (environment)
>>> E0 = random.randn(y.shape[0], 1)
>>> E1 = random.randn(y.shape[0], 1)
>>>
>>> r = iscan(G, y, "normal", K, M, E0=E0, E1=E1, verbose=False)
>>> print(r)
Hypothesis 0
------------
๐ฒ ~ ๐(๐ผ๐ถ, 0.376โ
๐บ + 2.077โ
๐ธ)
M = ['offset' 'age']
๐ถ = [3.12608063 0.06042316]
se(๐ถ) = [1.01867609 0.00870181]
lml = -185.77488727691096
Hypothesis 1
------------
๐ฒ ~ ๐(๐ผ๐ถ + (๐ถโ๐ดโ)๐โ, s(0.376โ
๐บ + 2.077โ
๐ธ))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -1.856e+02 1.611e+00 -2.976e-03
std 1.949e-01 1.658e+00 1.208e-01
min -1.858e+02 6.034e-02 -1.461e-01
25% -1.858e+02 6.058e-02 -4.769e-02
50% -1.856e+02 1.590e+00 -7.487e-03
75% -1.854e+02 3.137e+00 3.722e-02
max -1.854e+02 3.235e+00 1.492e-01
Hypothesis 2
------------
๐ฒ ~ ๐(๐ผ๐ถ + (๐ถโ๐ดโ)๐โ + (๐ถโ๐ดโ)๐โ, s(0.376โ
๐บ + 2.077โ
๐ธ))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -1.852e+02 1.612e+00 7.001e-03
std 7.598e-01 1.659e+00 1.475e-01
min -1.857e+02 5.991e-02 -2.573e-01
25% -1.856e+02 6.096e-02 -4.135e-02
50% -1.855e+02 1.571e+00 3.611e-02
75% -1.851e+02 3.135e+00 7.660e-02
max -1.841e+02 3.241e+00 1.971e-01
Likelihood-ratio test p-values
------------------------------
๐โ vs ๐โ ๐โ vs ๐โ ๐โ vs ๐โ
----------------------------------------
mean 6.867e-01 6.501e-01 5.244e-01
std 3.199e-01 3.350e-01 3.168e-01
min 3.963e-01 1.795e-01 9.940e-02
25% 4.185e-01 5.578e-01 3.784e-01
50% 6.755e-01 7.277e-01 5.971e-01
75% 9.436e-01 8.200e-01 7.431e-01
max 9.995e-01 9.654e-01 8.042e-01
Multi-trait associationยถ
LMM can also be used to jointly model multiple traits. Let n, c, and p be the number of samples, covariates, and traits, respectively. The outcome variable ๐ is a nรp matrix distributed according to
๐ฐ and ๐ผ are design matrices of dimensions pรp and nรc provided by the user, where ๐ผ is the usual matrix of covariates commonly used in single-trait models. ๐ is a cรp matrix of fixed-effect sizes per trait. ๐ is a nรr matrix provided by the user and I is a nรn identity matrices. ๐ฒโ and ๐ฒโ are both symmetric matrices of dimensions pรp, for which ๐ฒโ is guaranteed by our implementation to be of full rank. The parameters of this model are the matrices ๐, ๐ฒโ, and ๐ฒโ. ๐๐๐(โ ) is a function that stacks the columns of the provided matrix into a vector [Ve19].
Let ๐ฒ=๐๐๐(๐) and ๐=๐๐๐(๐). We can extend the model in Eq. (2) to represent three different hypotheses:
the hypotheses being
as before. Here is an example.
>>> from numpy import eye
>>>
>>> p = 2
>>> Y = random.randn(n, p)
>>> A = random.randn(p, p)
>>> A = A @ A.T
>>> A0 = ones((p, 1))
>>> A1 = eye(p)
>>>
>>> r = scan(G, Y, K=K, M=M, A=A, A0=A0, A1=A1, verbose=False)
>>> print(r)
Hypothesis 0
------------
๐ฒ ~ ๐((Aโ๐ผ)๐, Cโโ๐บ + Cโโ๐ธ)
traits = ['0' '1']
M = ['offset' 'age']
๐ถ = [-0.16350676 -0.00299814 -0.34521236 -0.00080406]
se(๐ถ) = [11.30571652 0.09640163 5.36090270 0.04573611]
diag(Cโ) = [0.01404947 0.29153072]
diag(Cโ) = [0.81175806 0.85780008]
lml = -277.3341913587698
Hypothesis 1
------------
๐ฒ ~ ๐((Aโ๐ผ)๐ + (AโโG)๐โ, s(Cโโ๐บ + Cโโ๐ธ))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -2.763e+02 -1.243e-01 -2.842e-02
std 1.329e+00 1.435e-01 1.120e-01
min -2.773e+02 -3.712e-01 -1.666e-01
25% -2.772e+02 -2.032e-01 -7.187e-02
50% -2.767e+02 -6.896e-02 -2.672e-02
75% -2.758e+02 -1.371e-03 1.673e-02
max -2.744e+02 1.386e-04 1.063e-01
Hypothesis 2
------------
๐ฒ ~ ๐((Aโ๐ผ)๐ + (AโโG)๐โ + (AโโG)๐โ, s(Cโโ๐บ + Cโโ๐ธ))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -2.761e+02 -1.245e-01 -1.209e-02
std 1.404e+00 1.439e-01 5.823e-02
min -2.772e+02 -3.745e-01 -1.151e-01
25% -2.770e+02 -2.025e-01 -3.202e-02
50% -2.766e+02 -6.702e-02 -7.898e-03
75% -2.757e+02 -1.441e-03 2.127e-02
max -2.741e+02 -7.171e-04 7.372e-02
Likelihood-ratio test p-values
------------------------------
๐โ vs ๐โ ๐โ vs ๐โ ๐โ vs ๐โ
----------------------------------------
mean 3.973e-01 6.122e-01 8.438e-01
std 3.851e-01 3.942e-01 1.327e-01
min 1.597e-02 9.170e-02 7.251e-01
25% 1.133e-01 4.159e-01 7.319e-01
50% 3.626e-01 7.039e-01 8.370e-01
75% 6.466e-01 9.002e-01 9.488e-01
max 8.478e-01 9.493e-01 9.760e-01
References
- LR18
Wikipedia contributors. (2018, October 21). Likelihood-ratio test. In Wikipedia, The Free Encyclopedia. Retrieved 16:13, November 27, 2018, from https://en.wikipedia.org/w/index.php?title=Likelihood-ratio_test&oldid=865020904
- ML18
Wikipedia contributors. (2018, November 8). Maximum likelihood estimation. In Wikipedia, The Free Encyclopedia. Retrieved 16:08, November 27, 2018, from https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=867823508
- St16
Stroup, W. W. (2016). Generalized linear mixed models: modern concepts, methods and applications. CRC press.
- Ef18
Wikipedia contributors. (2018, October 18). Exponential family. In Wikipedia, The Free Encyclopedia. Retrieved 18:45, November 25, 2018, from https://en.wikipedia.org/w/index.php?title=Exponential_family&oldid=864576150
- Mc11
McCulloch, Charles E., and Shayle R. Searle. Generalized, linear, and mixed models. John Wiley & Sons, 2004.
- Ve19
Wikipedia contributors. (2018, September 11). Vectorization (mathematics). In Wikipedia, The Free Encyclopedia. Retrieved 16:18, November 28, 2018, from https://en.wikipedia.org/w/index.php?title=Vectorization_(mathematics)&oldid=859035294
- Wa17
Wang, B., Sverdlov, S., & Thompson, E. (2017). Efficient estimation of realized kinship from single nucleotide polymorphism genotypes. Genetics, 205(3), 1063-1078.
- Wh14
White, H. (2014). Asymptotic theory for econometricians. Academic press.
- Ho13
Hoffman, G. E. (2013). Correcting for population structure and kinship using the linear mixed model: theory and extensions. PloS one, 8(10), e75707.