gfit {happy}    R Documentation

Fit a Gaussian Mixture Model to an object returned by happy()

Description

gfit() fits a QTL model to a happy() object. The model is a mixture of Gaussians, each with a different mean, and corresponds loosely to the "full" model in hfit(). The difference is that hfit() fits the observed phenotype values to the expected phenotypes under a full model, whereas gfit() uses maximum likelihood to fit the observed phenotype values to a mixture of Gaussians, each with a different mean but common variance. The other functions in this suite are not usually called directly.

The statistical model fitted is as follows. Consider first the case of fitting a QTL at a locus, L. Let y be the vector of trait values. Let X(L) be the design matrix for fitting a QTL at the locus L. Let t(L) be the vector of parameters to be estimated at the locus. For an additive QTL, the parameters are the strain effect sizes; for a full interaction model there is a parameter for every possible strain combination. Then the one-QTL model is

E(y) = X(L).t

There are S(S+1)/2 parameters to be estimated in a full model allowing for interactions between the alleles within the locus (one for each unordered pair of strains, including homozygous pairs). The i,j'th element of the design matrix X(L) is related to the strain probabilities thus:

X_{ij}(L) = F_{iLst}

where the column index j corresponding to the strain pair (s,t) is

j(s,t) = min(s + S(t-1), t + S(s-1))
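As an illustration (the helper pair.index is hypothetical, not part of the happy package), this symmetric index can be computed and checked in R:

```r
# Hypothetical helper: map an ordered strain pair (s, t), with
# s, t in 1..S, to the symmetric column index
# j(s, t) = min(s + S*(t-1), t + S*(s-1)).
pair.index <- function(s, t, S) pmin(s + S * (t - 1), t + S * (s - 1))

# The index is symmetric in (s, t), and over all S^2 ordered pairs it
# takes S*(S+1)/2 distinct values, one per unordered strain pair.
S <- 4
idx <- outer(1:S, 1:S, pair.index, S = S)
length(unique(as.vector(idx)))   # 10, i.e. S*(S+1)/2
```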

In the function hfit(), the observed phenotypes are regressed directly on the expected trait values. This is not an optimal procedure because the data are really a mixture:

y_i ~ sum_{st} F_{iLst} f( (y_i - β_{Lst})/σ_L )

where f(x) is the standard Gaussian density. β_L is the vector of mean trait values for the strain combinations. The parameters β_L, σ_L are estimated by maximum likelihood, and the test for the presence of a QTL at locus L is equivalent to the test that all the β_{Lst} = μ, in which case the model collapses to a single Gaussian distribution.
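To make the mixture concrete, here is a minimal simulation sketch (all names are illustrative, not package code) that draws one phenotype from the mixture for a single individual and evaluates its mixture density:

```r
set.seed(1)
# Strain-pair probabilities F_{iLst} for one individual at locus L
F.i   <- c(0.7, 0.2, 0.1)
beta  <- c(0.0, 1.5, 3.0)   # mean trait value for each strain pair
sigma <- 0.5                # common standard deviation

# Draw the latent strain pair, then the phenotype given that pair
component <- sample(seq_along(F.i), size = 1, prob = F.i)
y.i <- rnorm(1, mean = beta[component], sd = sigma)

# Mixture density of y.i: sum over strain pairs of F times a Gaussian density
dens <- sum(F.i * dnorm(y.i, mean = beta, sd = sigma))
```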

The model-fitting is implemented in the function gfit() by an iterative process, rather like a simplified version of EM. It is slower than hfit(), and generally gives similar results as far as overall QTL detection is concerned, but gives more accurate parameter estimates. The log-likelihood for the data is

L = sum_{i} log ( sum_j p_{ij} frac{exp(-frac{(y_i-β_j)^2}{2σ^2})}{sqrt{2π σ^2}})

= sum_i log ( sum_j p_{ij} exp(-frac{(y_i-β_j)^2}{2σ^2})) - frac{N log(2πσ^2)}{2}
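This log-likelihood can be written directly in R; the helper below (mixture.loglik, a hypothetical name, not the package's internal code) takes an N x J matrix p of weights p_{ij}, a phenotype vector y, a vector of means beta and a common variance sigma2:

```r
# Sketch of the log-likelihood above, using
# e_ij = p_ij * exp(-(y_i - beta_j)^2 / (2 * sigma2))
mixture.loglik <- function(y, p, beta, sigma2) {
  e <- p * exp(-outer(y, beta, "-")^2 / (2 * sigma2))
  sum(log(rowSums(e))) - length(y) * log(2 * pi * sigma2) / 2
}
```

This agrees with summing log(sum_j p_{ij} dnorm(y_i, beta_j, sigma)) over the observations.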

Differentiating with respect to the parameters gives

frac{partial L}{partial σ^2} = sum_i frac{sum_j p_{ij} (y_i-β_j)^2 exp(-frac{(y_i-β_j)^2}{2σ^2}) }{ 2σ^4 sum_j p_{ij} exp(-frac{(y_i-β_j)^2 }{2σ^2})} - frac{N}{2σ^2}

frac{partial L}{partial β_j } = sum_i frac{ p_{ij} frac{(y_i-β_j) }{σ^2} exp( -frac{(y_i-β_j)^2}{2σ^2})}{ sum_j e_{ij}}

= frac{1}{σ^2} ( sum_i frac{y_i e_{ij} }{sum_j e_{ij}} - β_j sum_i frac{ e_{ij} }{sum_j e_{ij}} )

where e_{ij} = p_{ij} exp(-frac{(y_i-β_j)^2}{2σ^2}).

Writing

w_{ij} = frac{p_{ij} exp(-frac{(y_i-β_j)^2}{2σ^2}) }{ sum_j p_{ij} exp(-frac{(y_i-β_j)^2}{2σ^2})}

the maximum-likelihood estimates satisfy

hat{σ}^2 = frac{1}{N} sum_i frac{sum_j p_{ij}(y_i-hat{β}_j)^2 exp(-frac{(y_i-hat{β}_j)^2}{2hat{σ}^2})} {sum_j p_{ij} exp(-frac{(y_i-hat{β}_j)^2}{2hat{σ}^2})}

= frac{1}{N} sum_i sum_j hat{w}_{ij}(y_i-hat{β}_j)^2

hat{β}_j = frac{sum_i hat{w}_{ij} y_i}{sum_i hat{w}_{ij}}

and the log-likelihood is

hat{L} = sum_i(log sum_j hat{e}_{ij}) - frac{N log(2πhat{σ}^2)}{2}
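The updates above can be sketched as a single EM-style step in R (gaussian.step is a hypothetical illustration of what gaussian.iterate computes, not the package code): compute the weights w_{ij}, then update the means and the variance, returning the log-likelihood at the input parameters:

```r
gaussian.step <- function(y, p, beta, sigma2) {
  # e_ij = p_ij * exp(-(y_i - beta_j)^2 / (2 * sigma2))
  e <- p * exp(-outer(y, beta, "-")^2 / (2 * sigma2))
  w <- e / rowSums(e)                                   # weights w_ij
  beta.new   <- colSums(w * y) / colSums(w)             # hat beta_j
  sigma2.new <- sum(w * outer(y, beta.new, "-")^2) / length(y)  # hat sigma^2
  logL <- sum(log(rowSums(e))) - length(y) * log(2 * pi * sigma2) / 2
  list(beta = beta.new, sigma2 = sigma2.new, logL = logL)
}
```

Iterating this step until logL changes by less than eps mirrors gaussian.loop(); as in EM, each step cannot decrease the log-likelihood.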

Usage

gfit( h,eps=1.0e-4, shuffle=FALSE, method="optim" )
gaussian.loop( d, maxit=100, eps=1.0e-3, df=NULL ) 
gaussian.null( n, sigma2 ) 
gaussian.init( d ) 
gaussian.iterate( d, params ) 
gaussian.fn( p, d=NULL )
gaussian.gr( p, d=NULL )

Arguments

h an object returned by a previous call to happy()
shuffle boolean indicating whether to shuffle the phenotypes in order to perform a permutation test
method The optimisation algorithm. The default is to use R's "optim" function, which uses derivative information. Any other value of this argument will use an EM-type iteration.
d a list comprising two elements d, probs
maxit the maximum number of iterations in the ML fitting
eps the termination accuracy in the model fitting: the log-likelihood must change by less than eps between successive iterations
df the degrees of freedom to use. If NULL then this is computed as the rank of the data
n the number of observations with non-missing phenotypes
sigma2 the variance of the errors in the data
params a list with two components, beta = the group means and sigma = the standard deviation
p vector of parameters. For internal use only

Value

gfit() returns a matrix with columns "marker", "LogL", "Null", "Chi", "df", "Pval", "LogPval". Each row of the matrix describes the fit of the model for the corresponding marker interval.
gaussian.loop() fits the model to a single marker and returns a list with the same elements as in hfit()
gaussian.iterate() performs a single iteration of the fitting process and returns a list with the updated LogL, beta, sigma, dbeta and dsigma
gaussian.init() initialises the parameters under the Null model, i.e. to the case where the means are all identical and the variance is the overall variance.
gaussian.null() returns the log-likelihood under the Null model
gaussian.fn() and gaussian.gr() are the function and gradient required by the optim function.

Author(s)

Richard Mott

See Also

happy(), hprob()

Examples

## An example session:
# initialise happy
## Not run: h <- happy('HS.data','HS.alleles')
# fit all the markers
## Not run: f <- gfit(h)

[Package happy version 2.1 Index]