From anina01 - at - iona.cryst.bbk.ac.uk Wed May 27 07:53:04 1998
Date: Wed, 27 May 1998 12:57:04 +0100 (BST)
From: Henrick Alex Ninaber
Reply-To: Henrick Alex Ninaber
To: chemistry <-at-> www.ccl.net
Subject: Summary: Correlating data with two or more (Gaussian) distributions

Dear CCL,

This is the summary of responses to my question:

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
The distribution of certain variables I calculate sometimes shows more
than one mean: in my case this means that the distribution is a sum of
two or more separate distributions. Calculating the correlation with
(for instance) a single Gaussian distribution does not make much sense.
Is anyone familiar with this branch of statistics? I am looking for a
theory dealing with this problem.
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Four responses in total:

First response from Peter Shenkin

> How can a single distribution have more than one mean?  There
> are distributions which do not have a mean, but any distribution
> that has a mean has only one.
>
> Recall that in statistics, the "mean" is a synonym for the "average".
>
> Are you really trying to say something else?
>
> -P.

Maybe I was trying to say something else.
After I mailed him personally, his second response was:

> If you have a reason to expect the underlying distributions to
> be Gaussian, or if you just want to use Gaussians to fit your
> data, you still have to decide how many to use.  Even for data
> exhibiting a single mode, if there is some experimental error,
> the more Gaussians you use, the better will be your fit.  (Just
> like fitting points to a polynomial:  the more terms you use,
> the better will be your fit.)  But of course, "better" does
> not mean "significantly better", in the statistical sense.
>
> To fit to N gaussians, you need 3N-1 parameters.  Each Gaussian
> has a mean and a standard deviation, giving 2N parameters; then
> each has an amplitude (fraction it contributes), which would give
> another N, except they have to sum to 1; so you have an additional
> N-1 instead of N.
>
> Suppose you try to fit to two Gaussians, using 5 parameters.
> I'd start with an initial guess for the parameters from
> visualizing the data, then do a minimization in the parameter
> space of the RMS deviation of the data points to the sum of the two
> Gaussians, starting with your guess.
>
> If you want to then try with three Gaussians, you can do a
> Fisher F test to determine whether the fit is significantly
> better as a result of adding the third Gaussian.  You can
> continue this process until Fisher says that getting fancier
> no longer makes the fit significantly better.
>
> -P.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Second response was from David van der Spoel

> Hi Alex,
>
> If you are in fact looking at dihedral angles you may want to have a
> look at my paper:
> @Article{Spoel97a,
>   author =       {D. van der Spoel and H. J. C.
>                  Berendsen},
>   title =        {Molecular Dynamics Simulations of {Leu}-Enkephalin in
>                  water and {DMSO}},
>   year =         1997,
>   journal =      "Biophysical Journal",
>   volume =       72,
>   pages =        {2032-2041}
> }
>
> It defines correlation functions for dihedral angles, and does a
> comparison to other forms of such correlation functions in the
> literature.
>
> Groeten, David.
> ________________________________________________________________________
> Dr. David van der Spoel         Biomedical center, Dept. of Biochemistry
> s-mail: Husargatan 3, Box 576,  75123 Uppsala, Sweden
> e-mail: spoel' at \`xray.bmc.uu.se   www: http://zorn.bmc.uu.se/~spoel

The dihedral angle is one of the properties I need to look at. Thank
you, David; I will get your paper.

::::::::::::::::::::::::::::::::::::::::::::::::::

Third response was from Rick Venable

> The trick here is to be able to separate the populations, either by
> gaussian deconvolution, if the peaks for each mean are well resolved, or
> by using additional properties of each data point to resolve the groups
> in more than one dimension.  Once separated into 2 or more groups, each
> group can be analyzed independently.  You should also think about
> whether having multiple means makes sense for the property you're
> calculating, or if perhaps you've just undersampled the underlying
> distribution.
>
> Deconvolution is essentially what goes on in chromatographic data
> systems-- peaks are separated into independent populations before
> calculating width at half height, integrated area, etc.  There's a fair
> amount of older literature on deconvolution of Gaussian, Lorentzian, and
> other distributions, and probably some good book chapters; I don't have
> refs handy (I'm at home), but could provide some on request.
>
> Using additional properties to resolve the groups in a higher
> dimensional space is another possibility; discriminant analysis is often
> used to treat this problem by statisticians.  For each data point in
> your distribution, you need several additional properties.  A least
> squares fit in N-dimensional space tells which of the N properties offer
> the most discrimination; 2D scatter plots of the more significant
> properties may show clear groupings.  Commercial stat packages have
> modules for this, and I wouldn't be surprised to discover freely
> available Fortran or C code.

Gaussian deconvolution is the theory I was looking for; I will get
articles on it myself.

::::::::::::::::::::::::::::::::::::::::::::::

Last person to answer the question was Jack Smith:

> You might look into the way polymer scientists characterize molecular
> weight distributions (MWD) using polydispersity indices (PDI), which
> often display bimodality (two peaks).  The polydispersity index is a
> ratio of two types of averages (weight average and number average).  The
> MWD can be described by even higher "moments".  When MWD's are truly
> composite (and not just broad), there are various ways to deconvolute
> composite MWD's into separate "normal" MWD's using such moment
> expansions.
>
> - Jack

The package Origin is able to do some 'multipeak splitting' but is not
flexible enough: I have to give initial values for the Gaussians, which
doesn't speed up the process. I want to code it myself anyway (using
someone's theory of how to do it properly).

Thank you all very much for responding, I think I have a fairly good
idea now how to solve the problem.

Alex Ninaber
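
For anyone who wants to code this, here is a minimal sketch of the
bookkeeping behind Peter Shenkin's suggestion: the sum of N Gaussians,
the 3N-1 parameter count, the residual sum of squares against binned
data, and the F statistic for comparing a fit with N Gaussians to one
with N+1. It is only an illustration under stated assumptions. All
function names are hypothetical, the Gaussians are taken to be
normalized so that the fractions sum to 1, and the minimizer itself is
left out (any least-squares routine could supply the fitted parameters).
For nonlinear fits like this one the F test is only approximate.

```python
import math


def gaussian(x, mean, sigma):
    """Normalized Gaussian density at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture(x, means, sigmas, fractions):
    """Sum of N Gaussians weighted by fractions (assumed to sum to 1)."""
    return sum(f * gaussian(x, m, s)
               for m, s, f in zip(means, sigmas, fractions))


def n_free_parameters(n_gaussians):
    """N means + N standard deviations + (N - 1) independent fractions."""
    return 3 * n_gaussians - 1


def rss(xs, ys, means, sigmas, fractions):
    """Residual sum of squares between data (xs, ys) and the mixture."""
    return sum((y - mixture(x, means, sigmas, fractions)) ** 2
               for x, y in zip(xs, ys))


def f_statistic(rss_small, p_small, rss_large, p_large, n_points):
    """Extra-sum-of-squares F statistic for two nested fits.

    rss_small, p_small: RSS and parameter count of the simpler fit.
    rss_large, p_large: RSS and parameter count of the fit with more
    Gaussians.  n_points is the number of data points that were fitted.
    """
    numerator = (rss_small - rss_large) / (p_large - p_small)
    denominator = rss_large / (n_points - p_large)
    return numerator / denominator
```

The resulting F value would be compared against the critical value of
the F distribution with (p_large - p_small, n_points - p_large) degrees
of freedom; if it does not exceed that value, the extra Gaussian is not
justified by the data.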