
From:  Henrick Alex Ninaber <anina01 %! at !% iona.cryst.bbk.ac.uk>
Date:  Wed, 27 May 1998 12:57:04 +0100 (BST)
Subject:  Summary: Correlating data with two or more (Gaussian) distributions



Dear CCL,

This is the summary of responses to my question:

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The distribution of certain variables I calculate sometimes shows more than
one mean: in my case this means that the distribution is a sum of two or
more separate distributions. Calculating the correlation with (for
instance) a single Gaussian distribution does not make much sense. Is
anyone familiar with this branch of statistics? I am looking for a theory
dealing with this problem.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Four responses in total:

First response from Peter Shenkin 

>
> How can a single distribution have more than one mean?  There
> are distributions which do not have a mean, but any distribution
> that has a mean has only one.
>
> Recall that in statistics, the "mean" is a synonym for the "average".
>
> Are you really trying to say something else?
>
>        -P.

Maybe I was trying to say something else.

After I mailed him personally, his second response was:

> If you have a reason to expect the underlying distributions to
> be Gaussian, or if you just want to use Gaussians to fit your
> data, you still have to decide how many to use.  Even for data
> exhibiting a single mode, if there is some experimental error,
> the more Gaussians you use, the better will be your fit.  (Just
> like fitting points to a polynomial:  the more terms you use,
> the better will be your fit.)  But of course, "better" does
> not mean "significantly better", in the statistical sense.
>
> To fit to N gaussians, you need 3N-1 parameters.  Each Gaussian
> has a mean and a standard deviation, giving 2N parameters;  then
> each has an amplitude (fraction it contributes), which would give
> another N, except they have to sum to 1; so you have an additional
> N-1 instead of N.
>
> Suppose you try to fit to two Gaussians, using 5 parameters.
> I'd start with an initial guess for the parameters from
> visualizing the data, then do a minimization in the parameter
> space of the RMS deviation of the data points to the sum of the two
> Gaussians, starting with your guess.
>
> If you want to then try with three Gaussians, you can do a
> Fisher F test to determine whether the fit is significantly
> better as a result of adding the third Gaussian.  You can
> continue this process until Fisher says that getting fancier
> no longer makes the fit significantly better.
>
>        -P.
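
The parameter counting and the Fisher F test described above can be
sketched roughly as follows. This is only an illustration, not code from
any respondent: the function names (mixture, rss, f_statistic) are made up
here, the F statistic is the usual one for comparing nested least-squares
fits, and the minimization itself is left to whatever optimizer is at hand.

```python
import math


def gaussian(x, mu, sigma):
    """Normalized Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def n_free_parameters(n_gaussians):
    """N means + N standard deviations + (N - 1) independent fractions."""
    return 3 * n_gaussians - 1


def mixture(x, means, sigmas, fractions):
    """Sum of N Gaussians; `fractions` holds only the first N - 1 fractions.

    The last fraction is fixed by the constraint that all fractions sum to 1.
    """
    weights = list(fractions) + [1.0 - sum(fractions)]
    return sum(w * gaussian(x, m, s) for w, m, s in zip(weights, means, sigmas))


def rss(xs, ys, means, sigmas, fractions):
    """Residual sum of squares between data (xs, ys) and the mixture model."""
    return sum((y - mixture(x, means, sigmas, fractions)) ** 2 for x, y in zip(xs, ys))


def f_statistic(rss_simple, p_simple, rss_complex, p_complex, n_points):
    """F statistic for adding parameters to a nested least-squares model.

    Assumes rss_complex comes from the model with more parameters.
    """
    numerator = (rss_simple - rss_complex) / (p_complex - p_simple)
    denominator = rss_complex / (n_points - p_complex)
    return numerator / denominator
```

The value returned by f_statistic would then be compared with the F
distribution for (p_complex - p_simple, n_points - p_complex) degrees of
freedom, for example going from two Gaussians (5 parameters) to three (8).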


:::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Second response was from David van der Spoel 

> Hi Alex,
>
> If you are in fact looking at dihedral angles you may want to have a look
> at my paper:
> @Article{Spoel97a,
>   author  =      {D. van der Spoel and H. J. C. Berendsen},
>   title   =      {Molecular Dynamics Simulations of {Leu}-Enkephalin in
>                  water and {DMSO}},
>   year    = 1997,
>   journal = "Biophysical Journal",
>   volume  = 72,
>   pages   = {2032-2041}
> }
>
> It defines correlation functions for dihedral angles, and does a
> comparison to other forms of such correlation functions in the literature.
>
> Groeten, David.
> ________________________________________________________________________
> Dr. David van der Spoel         Biomedical center, Dept. of Biochemistry
> s-mail: Husargatan 3, Box 576,  75123 Uppsala, Sweden
> e-mail: spoel at.at xray.bmc.uu.se    www: http://zorn.bmc.uu.se/~spoel
                                  
The dihedral angle is one of the properties I need to look at, dank je
David, I will get your paper.


::::::::::::::::::::::::::::::::::::::::::::::::::

Third response was from Rick Venable 

> The trick here is to be able to separate the populations, either by
> gaussian deconvolution, if the peaks for each mean are well resolved, or
> by using additional properties of each data point to resolve the groups
> in more than one dimension.  Once separated into 2 or more groups, each
> group can be analyzed independently.  You should also think about
> whether having multiple means makes sense for the property you're
> calculating, or if perhaps you've just undersampled the underlying
> distribution.
>
> Deconvolution is essentially what goes on in chromatographic data
> systems-- peaks are separated into independent populations before
> calculating width at half height, integrated area, etc.  There's a fair
> amount of older literature on deconvolution of Gaussian, Lorentzian, and
> other distributions, and probably some good book chapters; I don't have
> refs handy (I'm at home), but could provide some on request.
>
> Using additional properties to resolve the groups in a higher
> dimensional space is another possibility; discriminant analysis is often
> used to treat this problem by statisticians.  For each data point in
> your distribution, you need several additional properties.  A least
> squares fit in N-dimensional space tells which of the N properties offer
> the most discrimination; 2D scatter plots of the more significant
> properties may show clear groupings.  Commercial stat packages have
> modules for this, and I wouldn't be surprised to discover freely
> available Fortran or C code.

Gaussian deconvolution is the theory I was looking for, I will get
articles on it myself.
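
A minimal sketch of separating two Gaussian populations in one dimension
is given below. Note that it uses expectation-maximization (EM), a
technique none of the respondents mentioned, swapped in here because it
needs no hand-picked starting values: the initial means are simply taken
from the quartiles of the data. The function name fit_two_gaussians and
all its details are assumptions for illustration only.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normalized Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(data, n_iter=200):
    """Fit a two-component Gaussian mixture to 1D data with plain EM.

    Returns (fractions, means, sigmas), each a list of length 2.
    """
    xs = sorted(data)
    n = len(xs)
    # Initial guess taken from the data: quartiles as means, overall spread as widths.
    mu = [xs[n // 4], xs[(3 * n) // 4]]
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    sigma = [sd_all, sd_all]
    frac = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in xs:
            p = [frac[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = p[0] + p[1]
            if total == 0.0:  # both densities underflowed; split evenly
                resp.append([0.5, 0.5])
            else:
                resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate fraction, mean and width of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            frac[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = math.sqrt(max(var, 1e-12))  # keep the width strictly positive
    return frac, mu, sigma
```

Once the two components are fitted, each data point can be assigned to the
component with the larger responsibility and the groups analyzed
independently. Whether two components are justified at all is a separate
question that this sketch does not address.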


::::::::::::::::::::::::::::::::::::::::::::::

Last person to answer the question was Jack Smith: 

> You might look into the way polymer scientists characterize molecular
> weight distributions (MWD) using polydispersity indices (PDI), which
> often display bimodality (two peaks).  The polydispersity index is a
> ratio of two types of averages (weight average and number average).  The
> MWD can be described by even higher "moments".  When MWD's are truly
> composite (and not just broad), there are various ways to deconvolute
> composite MWD's into separate "normal" MWD's using such moment
> expansions.
>
> - Jack
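
For reference, the two averages and their ratio can be written down in a
few lines. This is a sketch using the standard definitions (number average
Mn = sum(N*M)/sum(N), weight average Mw = sum(N*M^2)/sum(N*M)); the
function names and the discrete (mass, count) representation of the
distribution are assumptions for illustration.

```python
def number_average(masses, counts):
    """Mn: ordinary mean of the mass, weighting each mass by its count."""
    return sum(n * m for m, n in zip(masses, counts)) / sum(counts)


def weight_average(masses, counts):
    """Mw: mean of the mass, weighting each mass by the total weight n*m it carries."""
    return sum(n * m * m for m, n in zip(masses, counts)) / sum(n * m for m, n in zip(masses, counts))


def polydispersity_index(masses, counts):
    """PDI = Mw / Mn; equals 1 for a single sharp mass and grows as the distribution broadens."""
    return weight_average(masses, counts) / number_average(masses, counts)
```

A PDI well above 1 signals a broad distribution, but on its own it does not
distinguish a broad single peak from a composite of two narrower ones.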


The package Origin is able to do some 'multipeak splitting' but is not
flexible enough: I have to give initial values for the Gaussians, which
doesn't speed up the process. I want to code it myself anyway (using
someone's theory of how to do it properly).

Thank you all very much for responding, I think I have a fairly good idea
now how to solve the problem.


Alex Ninaber




