From dok707@cvx12.inet.dkfz-heidelberg.de  Wed Nov 17 10:27:36 1993
Received: from cvx12.inet.dkfz-heidelberg.de  for dok707@cvx12.inet.dkfz-heidelberg.de
	by www.ccl.net (8.6.1/930601.1506) id JAA15253; Wed, 17 Nov 1993 09:36:34 -0500
Received: by cvx12.inet.dkfz-heidelberg.de id AA04274
  (5.65c/IDA-1.4.4 for chemistry@ccl.net); Tue, 16 Nov 1993 20:16:36 +0100
Date: Tue, 16 Nov 1993 20:16:36 +0100
From: Frank Herrmann <F.Herrmann@dkfz-heidelberg.de>
Message-Id: <199311161916.AA04274@cvx12.inet.dkfz-heidelberg.de>
To: chemistry@ccl.net
Subject: quasiparticle force fields


QUASIPARTICLE FORCE FIELDS
--------------------------

Here are two (non-review) articles:

@article{mean_force,
author={Manfred J. Sippl},
title={Calculation of Conformational Ensembles from Potentials of Mean Force},
journal={Journal of Molecular Biology},
volume={213},
pages={859-883},
year=1990}

@article{entire_residues,
author={Paul R. Gerber},
title={Peptide Mechanics: A Force Field for Peptides and Proteins Working with Entire Residues as Smallest Units},
journal={Biopolymers},
volume={32},
pages={1003-1017},
year=1992}

--------------------------------------------------
Frank Herrmann, Dept. of Molecular Biophysics 0810
German Cancer Research Center
Im Neuenheimer Feld 280, D-69120 Heidelberg
Tel: (49) 6221-422336, FAX: (49) 6221-422333
email: F.Herrmann@dkfz-heidelberg.de

From mikes@bioch.ox.ac.uk  Wed Nov 17 12:27:27 1993
Received: from oxmail.ox.ac.uk  for mikes@bioch.ox.ac.uk
	by www.ccl.net (8.6.1/930601.1506) id MAA17229; Wed, 17 Nov 1993 12:12:44 -0500
From: <mikes@bioch.ox.ac.uk>
Received: from bioch.ox.ac.uk by oxmail.ox.ac.uk with SMTP (PP) 
          id <19801-0@oxmail.ox.ac.uk>; Wed, 17 Nov 1993 16:20:11 +0000
Received: from bioch.ox.ac.uk (nmrpcd.ocms) by biochemistry.oxford.ac.uk;
          Wed, 17 Nov 93 16:19:06 GMT
Received: by bioch.ox.ac.uk (920330.SGI/bioch3.0) id AA10871;
          Wed, 17 Nov 93 16:19:15 GMT
Date: Wed, 17 Nov 93 16:19:15 GMT
Message-Id: <9311171619.AA10871@nmrpcd.ocms.bioch.ox.ac.uk>
To: chemistry@ccl.net
Subject: Cluster Analysis


A while ago I sent in a query about Cluster Analysis, and got lots of helpful
replies - thank you to everyone who sent me something.  Here are most of them:

Mike

-----------------------------------------------------------------------------

Arun Malhotra:

Cluster analysis is a useful way to classify structures - we have been using
it to sort thru several (10-20) models of the 16S RNA. You may want to look
up "Modeling the 3-D structure of RNA using discrete nucleotide conformational
sets" Daniel Gautheret, Francois Major and Robert Cedergren, J. Mol. Biol.
(1993) 229, 1049-1064. Francois and others have been using cluster analysis 
for classifying nucleotide structures and discuss some of this in the methods
section. Another recent application of cluster analysis appeared in "Protein
structure comparison by alignment of distance matrices" Liisa Holm and Chris
Sander, J. Mol. Biol. (1993) 233, 123-138. Usually any measure (single-valued)
of difference between two structures can be used depending on what property
you are trying to cluster around - rmsd or a difference distance matrix is 
good for structural comparisons.
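
As a minimal sketch of the second kind of measure (not taken from any of
the packages mentioned here), the RMS difference between the interatomic
distance matrices of two structures can be computed without superposition.
The function name and the list-of-tuples input format are assumptions made
for illustration only.

```python
import math
from itertools import combinations


def distance_matrix_rms(conf_a, conf_b):
    """RMS difference between the interatomic distances of two conformations.

    conf_a, conf_b: lists of (x, y, z) tuples with atoms in the same order.
    Internal distances do not change under rotation or translation, so the
    two structures do not have to be superimposed first.
    """
    if len(conf_a) != len(conf_b):
        raise ValueError("conformations must have the same number of atoms")
    sq_sum, n_pairs = 0.0, 0
    for i, j in combinations(range(len(conf_a)), 2):
        d_a = math.dist(conf_a[i], conf_a[j])
        d_b = math.dist(conf_b[i], conf_b[j])
        sq_sum += (d_a - d_b) ** 2
        n_pairs += 1
    return math.sqrt(sq_sum / n_pairs) if n_pairs else 0.0


straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(distance_matrix_rms(straight, bent))
```

Because only internal distances are compared, a structure and its mirror
image score as identical with this measure.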

There is a Mac package for multivariate analysis/cluster analysis on sumex
(in sci/mac-dendro, mac-mul, graph-mu etc.) that works fairly well. Commercial
statistical analysis packages such as Splus also have cluster analysis.

------------------------------------------------------------------------------

Alain St-Amant:

A very nice piece of work has recently come out of Charlie Brooks' group.
It should be just what you want.  The reference is:

 M. E. Karpen, D. J. Tobias, and C. L. Brooks, "Statistical Clustering
 Techniques for the Analysis of Long Molecular Dynamics Trajectories --
 Analysis of 2.2 ns Trajectories of YPGDV," Biochemistry, Volume 32,
 pages 412-420.

------------------------------------------------------------------------------

Peter S. Shenkin:


I'm glad you asked that question.  :-)

Quentin McDonald and I have written a program that does exactly
what you describe: cluster analysis of molecular conformations.  An
article that covers the approach we used, as well as the implementation, 
has been submitted to J. Comput. Chem.

The program is called XCluster.  It begins by constructing
the matrix of inter-conformational "distances" between all
pairs of conformations read in.  There are several choices
of "distance" available, including RMS of interatomic distances
following rigid-body superposition (which is what I assume you mean
by "rmsd").  Molecular symmetry can be taken into account, and 
clustering can, if desired, be performed on only a subset of the 
atoms -- for example, the ring atoms of a cyclic system.

----------------------------------------------------------------------------

David States:

There are several issues here.  If you do not know how many clusters
there should be, then you don't want to use an algorithm that imposes
a particular answer.  This can happen either explicitly (for example
a binary classification halted at 3 divisions by definition will give 8
classes) or implicitly in a leader-mean classification of the sort you
describe (the number of classes is inversely dependent on the cutoff
radius).
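
The implicit dependence on the cutoff can be seen in a minimal sketch of a
leader-style classification.  This is a plain leader variant in which the
leader is never updated to the class mean, which is an assumption and may
differ from the scheme in the original query; the function name and the
one-dimensional sample data are also made up for illustration.

```python
def leader_clusters(items, distance, cutoff):
    """Assign each item to the first leader within `cutoff`, else start a new class."""
    leaders, clusters = [], []
    for item in items:
        for k, leader in enumerate(leaders):
            if distance(item, leader) <= cutoff:
                clusters[k].append(item)
                break
        else:
            # No existing leader is close enough: this item leads a new class.
            leaders.append(item)
            clusters.append([item])
    return clusters


values = [0.0, 0.4, 1.0, 1.3, 5.0]
for cutoff in (0.1, 0.5, 2.0):
    classes = leader_clusters(values, lambda a, b: abs(a - b), cutoff)
    print(cutoff, len(classes))
```

The same five values fall into fewer classes as the cutoff grows, so the
cutoff fixes the answer just as surely as an explicit class count would.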

To avoid biasing the results of the classification by class number, you
need a method that allows you to compare classifications with different
numbers of classes.  This leads to the general field of Bayesian
classification where a classification is viewed as a model for the
observed distribution and the optimal classification is that
classification which optimally describes the data.  Larry Hunter and I
used this to derive a classification of protein secondary structure
several years ago (Hunter and States (1991), "Bayesian classification
of protein structural motifs.", in Proceedings of HICSS-24, IEEE Press,
Los Alamitos CA, 595-604).

Another issue is flat vs. hierarchical classification.  Are these
proteins derived by an evolutionary process from a common ancestor in
which case a hierarchical model might be more appropriate, or are they
random samples of conformation space in which case a flat
classification would be better. Defining the optimal tree structure 
for a set can itself be a demanding problem.

The computational complexity of the problem depends on your class
definition.  If you seek connected classes (two structures are in the
same class if there is a path of similarity relationships connecting
them, transitive closure) then this is algorithmically a minimal
spanning tree problem for which linear time solutions exist.  On the
other hand, if you demand that all members of a class fall within
some cutoff of every other member (cliques), then you have a graph
partition problem that is NP-complete.

The sensitivity of the classification to errors or ambiguities in the
data is also dependent on class definition.  Transitive closure is
robust to less than perfect sensitivity in defining similarity
relationships but false positive similarity judgements lead directly
to classification errors.  On the other hand, clique definitions are very 
sensitive to false negative similarity judgements and robust to false
positives.
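
A minimal sketch of the connected-class (transitive closure) definition,
using a union-find structure over a list of similarity judgements.  The
function name and the toy pair lists are assumptions for illustration; the
second call shows how a single false positive link merges two classes.

```python
def connected_classes(n_items, similar_pairs):
    """Group items 0..n_items-1 into classes by transitive closure.

    similar_pairs: iterable of (i, j) index pairs judged similar.
    Returns a list of index lists, one per class.
    """
    parent = list(range(n_items))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in similar_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    classes = {}
    for i in range(n_items):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


true_links = [(0, 1), (1, 2), (3, 4)]
print(connected_classes(5, true_links))             # [[0, 1, 2], [3, 4]]
print(connected_classes(5, true_links + [(2, 3)]))  # [[0, 1, 2, 3, 4]]
```

Note that items 0 and 2 end up together although they were never judged
similar directly, which is the chaining behaviour that a clique definition
rules out.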

-----------------------------------------------------------------------------

Frank Kolakowski:

I am not sure exactly what it is that you are asking, but
look at recent efforts by C. Sander and colleagues from EMBL.

AU Ouzounis-C.  Sander-C.  Scharf-M.  Schneider-R.
TI Prediction of protein structure by evaluation of sequence-structure
   fitness.  Aligning sequences to contact profiles derived from
   three-dimensional structures.
SO J-Mol-Biol.  1993 Aug 5.  232(3).  P 805-25.

AU Sander-C.  Schneider-R.
TI The HSSP data base of protein structure-sequence alignments.
SO Nucleic-Acids-Res.  1993 Jul 1.  21(13).  P 3105-9.

----------------------------------------------------------------------------

John Kapenga:

There are a number of ways to cluster things (and some texts on cluster
analysis as well as codes). If you have a pairwise distance function, rmsd say,
as d(p1,p2), then one method is to define the distance between two groups
G1 and G2 as the average of the distances between all pairs (p1,p2) with
p1 in G1 and p2 in G2 (there are other possible group distances).

Start with 100 groups Gi , each with a single point pi
repeat k times
	merge the closest two groups

Then you are left with 100-k groups.
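
The merge loop above can be sketched in a few lines of Python.  This is a
brute-force illustration, not an efficient implementation; the function
name and the integer sample data are assumptions, and d can be any pairwise
distance function.

```python
def average_linkage(points, d, k):
    """Start with one group per point and merge the two closest groups k times."""
    groups = [[p] for p in points]

    def group_distance(g1, g2):
        # Average of d over all cross pairs (p1 in g1, p2 in g2).
        return sum(d(p1, p2) for p1 in g1 for p2 in g2) / (len(g1) * len(g2))

    for _ in range(k):
        if len(groups) < 2:
            break
        # Brute-force search for the closest pair of groups.
        a, b = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups


points = [0, 1, 3, 10, 12, 20]
print(average_linkage(points, lambda p, q: abs(p - q), 3))
# [[0, 1, 3], [10, 12], [20]]
```

Recomputing every group distance on every pass is wasteful, so a real code
would cache or update the group distance matrix after each merge.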

Another method requires you to be able to find the mean of a group,
which must be able to be input to d(,).

Start with a random choice of q1, q2, ... qk from  the pi's
repeat until "converged"
	for each pi put pi in Gj if pi is closest to qj among all the qs
	let qj = mean(Gj) for each j

Which results in k clusters - you do need to watch for empty groups
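
A minimal sketch of this second method for points given as coordinate
tuples with Euclidean d.  The `initial` argument is a hypothetical
convenience for reproducible starts and is not part of the description
above; an empty group simply keeps its previous centre, which is only one
of several ways to handle that case.

```python
import math
import random


def k_means(points, k, initial=None, max_iter=100, seed=0):
    """Alternate between assigning points to the nearest centre and re-averaging."""
    centres = list(initial) if initial else random.Random(seed).sample(points, k)
    groups = []
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
            groups[nearest].append(p)
        new_centres = []
        for j, g in enumerate(groups):
            if g:
                # Coordinate-wise mean of the group.
                new_centres.append(tuple(sum(c) / len(g) for c in zip(*g)))
            else:
                new_centres.append(centres[j])  # empty group: keep old centre
        if new_centres == centres:
            break  # converged: no centre moved
        centres = new_centres
    return centres, groups


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, groups = k_means(pts, 2, initial=[(0, 0), (10, 10)])
print(centres)
print(groups)
```

For molecular conformations the mean step is less innocent than it looks,
since averaging Cartesian coordinates only makes sense after the members
have been superimposed.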

These (and their variations) are perhaps the two most common methods.

-----------------------------------------------------------------------------

Eric Martin:

    One way to do this is to make a similarity matrix based on RMSD of chosen
corresponding atoms.  You could do cluster analysis directly, which would be 
treating distance from each member as a variable in a 100 dimensional property
space.  In my opinion, a better way is to 1st perform multidimensional scaling
on the similarity matrix (proc MDS in SAS).  MDS finds Cartesian coordinates
in a low dimensional Euclidean space such that the distances between the points
best reproduce the similarities.  You can then either perform cluster analysis
on these latent variables, or else superimpose a grid on the space and pick
points near the grid points, or any other grouping or experimental design scheme.
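
For readers without SAS, the idea can be sketched with classical
(Torgerson) scaling, which is a simpler technique than whatever proc MDS
does internally and is swapped in here purely for illustration.  The sketch
assumes the input is a symmetric matrix of Euclidean-like distances and
uses plain power iteration instead of a library eigensolver; the function
name and defaults are made up.

```python
import math


def classical_mds(dist, n_dims=2, n_iter=500):
    """Embed n items in n_dims dimensions from an n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_dims for _ in range(n)]
    for axis in range(n_dims):
        # Power iteration for the dominant eigenvector of the deflated B.
        v = [math.sin(i + 1.0 + axis) for i in range(n)]  # arbitrary start
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * w[i] for i in range(n))
        if eigenvalue <= 1e-12:
            break  # no further axis with positive variance
        scale = math.sqrt(eigenvalue)
        for i in range(n):
            coords[i][axis] = scale * v[i]
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]  # deflate
    return coords


pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
d = [[math.dist(p, q) for q in pts] for p in pts]
emb = classical_mds(d, n_dims=2)
print(emb)
```

The resulting coordinates are only defined up to rotation and reflection,
which does not matter for subsequent clustering or grid selection.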

-----------------------------------------------------------------------------

Ganesan Ravishanker:

That is the simplest way to cluster them. And this procedure, called 2-D
RMS Map by us, turns out to be a very useful tool to cluster and monitor
structural grouping in an MD. We construct a symmetric matrix RMS(i,j)
where i and j are indices of the average structures over portions of
trajectory. This matrix is laid out on a 2-D grid using a continuous
spectrum of color to code the value of RMS at the grid (i,j). As is
obvious, the diagonals are zero and 1A blocks are usually around the
diagonal and various squares or rectangles develop around the diagonal
showing the persistence of a given structural group. We can also block
them using the same color for say rmsd of 0-1, another color for 1-2 etc.
Off diagonal blocks indicate revisitation of structural groups along the
trajectory. 
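
A toy sketch of such a map, written for this digest and not taken from MD
Toolchest.  The function names, the text symbols standing in for colors
and the one-dimensional "structures" are all assumptions; any pairwise rms
function can be passed in.

```python
def rms_map(structures, rms, band_width=1.0):
    """Symmetric matrix of RMS values binned into bands 0, 1, 2, ..."""
    n = len(structures)
    bands = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            band = int(rms(structures[i], structures[j]) // band_width)
            bands[i][j] = bands[j][i] = band
    return bands


def render(bands, symbols=".-+#"):
    """One character per grid cell; bands past the last symbol share it."""
    last = len(symbols) - 1
    return "\n".join("".join(symbols[min(b, last)] for b in row) for row in bands)


# Stand-in "structures": frames 0-2 and 5 form one group, frames 3-4 another.
frames = [0.0, 0.2, 0.4, 3.0, 3.1, 0.3]
bands = rms_map(frames, lambda a, b: abs(a - b))
print(render(bands))
```

In this toy trajectory the last frame returns to the first group, which
shows up as a low-RMS block away from the diagonal.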

We have successfully extended this to even compare trajectories of similar
systems to capture the common structural groups visited in various
trajectories. There is also facility to calculate the rmsd on only a
subset of atoms (say only the sugar atoms in a DNA) which might give the
"substate" picture relevant to only those that are selected.

This and another 50 or so applications are collectively known as "MD
Toolchest" developed here at Wesleyan. A small subset of the program set
is already being distributed and the next considerably expanded version of
it is getting ready for distribution in November. If any of you wish to
know more about it, please let me know and I will personally correspond
with you. Thank you.

--------------------------------------------------------------------------

Heather Gordon:

I have been using fuzzy c-means clustering (a partitional rather than hierarchical method)
to locate similar structures explored in molecular dynamics or Monte Carlo simulations.
We published a paper last year:

"Fuzzy cluster analysis of molecular dynamics trajectories", H. Gordon and R.L. Somorjai,
Proteins: Structure, Function, and Genetics 14:249-264 (1992).

I have since applied fuzzy c-means clustering to some MD trajectories of a simple
carbohydrate with great success.  The approach is, as you suggested, to find an RMS distance
matrix by optimally superimposing structures and then clustering the distance matrix.
The superposition algorithm is one developed by Somorjai and Zuker (a quaternion method)
and the fuzzy clustering algorithm is due to Bezdek.  The code is definitely not elegant(!),
but if you are interested, I can send a copy to you.
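
For orientation only, here is a minimal sketch of the standard fuzzy
c-means update equations (memberships from inverse distance ratios, centres
as membership-weighted means).  It is not the code offered above: it works
on coordinate tuples rather than on an RMS distance matrix, and the
function name, starting centres and defaults are assumptions.

```python
import math


def fuzzy_c_means(points, centres, m=2.0, n_iter=100, tol=1e-9):
    """Return (centres, memberships); memberships[k][i] is point k in cluster i."""
    centres = [tuple(c) for c in centres]
    exponent = 2.0 / (m - 1.0)
    n_axes = len(points[0])
    memberships = []
    for _ in range(n_iter):
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centres]
            if min(dists) == 0.0:
                # Point sits exactly on a centre: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                       for d_i in dists]
            memberships.append(row)
        new_centres = []
        for i in range(len(centres)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centres.append(tuple(
                sum(w * p[axis] for w, p in zip(weights, points)) / total
                for axis in range(n_axes)))
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships


data = [(0.0,), (0.2,), (0.1,), (5.0,), (5.2,), (5.1,)]
centres, memberships = fuzzy_c_means(data, [(1.0,), (4.0,)])
print(centres)
```

Unlike a hard partition, each structure keeps a graded membership in every
cluster, so borderline structures are visible as rows with no dominant
entry.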

You might also be interested in a clustering paper:

"Statistical clustering techniques for the analysis of long molecular dynamics
trajectories:  analysis of a 2.2ns trajectory of YPGDV", ME Karpen, DJ Tobias, CL Brooks III,
Biochemistry 32, 412-420 (1993).

------------------------------------------------------------------------------

Jeff Blaney:

Many methods have been described for clustering conformations into families
during the last decade or so.  The RMS matrix is an effective 
measure of the pairwise similarities between all conformers (see Cohen, F. E., 
Sternberg, M. J. E. "On the Use of Chemically Derived Distance Constraints 
in the Prediction of Protein Structure with Myoglobin as an Example", 
J. Mol. Biol. 1980, 137, 9-22 and Seno, Y., Go, N. "Deoxymyoglobin Studied 
by the Conformational Normal Mode Analysis I.  Dynamics of Globin and 
the Heme-Globin Interaction", J. Mol. Biol. 1990, 216, 95-109.)

Cluster analysis performed on the NxN matrix containing the RMS least 
squares rotation/translation fit deviations for N conformers works well
and has been used since the mid-80's (see Perkins, T. D. J., Barlow, 
D. J. "RAMBLE:  A conformational search program", J. Mol. Graphics 
1990, 8, 156-162).  This method is also available in the program COMPARE, 
which was distributed with DGEOM in 1990 (QCPE Program #590).  COMPARE 
reads a series of PDB format files and generates the RMS matrix in a 
format appropriate for input into ARTHUR (an old chemometrics software 
package) or DATADESK (Mac stats software package), which both perform 
hierarchical clustering. COMPARE's output format can be modified 
easily for other stats packages (e.g. SAS).  Single or complete-linkage 
hierarchical clustering can be used on small problems with less than about 
500 conformers; beyond this the hierarchical algorithms consume very large 
amounts of computer time and become impractical.

Jarvis-Patrick clustering (Jarvis, R. A., Patrick, E. A. "Clustering Using 
a Similarity Measure Based on Shared Near Neighbors", IEEE Trans. Comp. 
1973, C22, 1025-1034) is much faster than hierarchical clustering and can 
be applied to huge datasets (several 100,000 members).
Jarvis-Patrick clustering is routinely used for clustering large chemical 
databases into structurally related families based on two-dimensional 
similarity (see Willett, P. Similarity and Clustering in Chemical 
Information Systems, Research Studies Press: Letchworth, 1987).  Clustering 
1000 conformers takes only a few seconds and requires an insignificant amount 
of time compared to calculating the RMS matrix.  Jarvis-Patrick performs well 
for conformational clustering, and gives results comparable or superior to 
hierarchical clustering on small datasets and can be run easily on datasets 
that are too large for hierarchical clustering.  The new release of DGEOM
to QCPE (imminent...) will have a new version of COMPARE which is self-
contained and includes Jarvis-Patrick clustering.
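
A minimal sketch of shared-near-neighbor clustering in the spirit of
Jarvis-Patrick, written for this digest and not taken from COMPARE.  It
assumes one common variant: two items are linked when each is in the
other's list of k nearest neighbors (self excluded) and the two lists share
at least k_min members; clusters are the connected groups of linked items.
Parameter names are made up.

```python
def jarvis_patrick(dist, k, k_min):
    """Cluster items from an n x n distance matrix by shared near neighbors."""
    n = len(dist)
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(set(order[:k]))

    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbours[i]:
            mutual = j > i and i in neighbours[j]
            if mutual and len(neighbours[i] & neighbours[j]) >= k_min:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


values = [0, 1, 2, 3, 10, 11, 12, 13, 30]
dist = [[abs(a - b) for b in values] for a in values]
print(jarvis_patrick(dist, k=3, k_min=2))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8]]
```

The outlier at 30 stays alone because none of its nearest neighbors lists
it in return.  The sketch still sorts a full distance row per item, so the
speed quoted above applies to the clustering step once neighbor lists are
available.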

Conformational clustering has also been performed directly on Cartesian
coordinates (Murray-Rust, P., Raftery, J. "Computer analysis of molecular 
geometry, Part VI:  Classification of differences in conformation", J. 
Mol. Graphics 1985, 3, 50-59).  Torsion angles are a poor choice due to their 
'leverage' effect:  a small torsion angle change at the beginning of a chain 
leads to a difference of many angstroms at the end of the chain.
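
The size of the leverage effect is easy to estimate: an atom at
perpendicular distance r from the rotated bond moves along a chord of
length 2 r sin(dphi/2) when the torsion changes by dphi.  The helper name
and the sample numbers below are illustrative assumptions.

```python
import math


def chord_displacement(radius, delta_deg):
    """Distance moved by an atom `radius` from the rotation axis for a torsion change."""
    return 2.0 * radius * math.sin(math.radians(delta_deg) / 2.0)


# A 20 degree torsion change seen by atoms 1.5 A and 10 A from the axis.
for r in (1.5, 10.0):
    print(r, round(chord_displacement(r, 20.0), 2))
```

With these numbers the nearby atom moves by about half an angstrom while
the distant one moves by about 3.5 angstroms for the same torsion change.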

Yvonne Martin recently described a fast, simple approach that compares distance
matrices and eliminates duplicate conformers based on any single distance 
difference being greater than a user-specified threshold  (Y. Martin, 
Meeting on Binding Sites: Characterising and Satisfying Steric and Chemical 
Restraints, Molecular Graphics Society, University of York, England, 
March 28-30, 1993).
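
One possible reading of that one-sentence description, sketched here as an
assumption and not as the method presented at the meeting: a conformer is
kept only if, against every conformer already kept, at least one
interatomic distance differs by more than the threshold.  The function name
and input format are made up.

```python
import math
from itertools import combinations


def unique_conformers(conformers, threshold):
    """Drop conformers whose distance matrix matches a kept one within threshold."""

    def distances(conf):
        return [math.dist(a, b) for a, b in combinations(conf, 2)]

    kept, kept_dists = [], []
    for conf in conformers:
        d_new = distances(conf)
        is_duplicate = any(
            max((abs(x - y) for x, y in zip(d_new, d_old)), default=0.0) <= threshold
            for d_old in kept_dists
        )
        if not is_duplicate:
            kept.append(conf)
            kept_dists.append(d_new)
    return kept


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (7.0, 5.0, 5.0)]  # a, translated
c = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(len(unique_conformers([a, b, c], threshold=0.2)))
```

The comparison can stop at the first distance that exceeds the threshold,
which is what makes this kind of filter cheap compared with a full RMS fit.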

------------------------------------------------------------------------------

Konrad Koehler:

   Cluster analysis based on conformation can easily be done by creating a
matrix of all pairwise least squares RMS deviations.  The cluster analysis could 
also be based on torsional angles, but the least squares method is simpler 
and generally works better.  Many statistical packages such as SAS can then
perform a hierarchical cluster analysis based on this matrix.  Hierarchical
cluster analysis is only practical for ~1,000 conformations.  If you have more,
you will have to resort to some non-hierarchical method such as Jarvis-Patrick.

   There are several packages already available for doing cluster analysis
based on conformation.  The XCluster program which is included with MacroModel
(Clark Still, Columbia University) version 4.0 contains a fairly sophisticated
conformational clustering package.  The XCluster manual contains an in depth 
discussion of the theory.  Other packages such as Sybyl also contain 
facilities for cluster analysis of conformations based on either RMS 
deviations or torsional angles.

-----------------------------------------------------------------------------


From mam@xenon.chem.ucla.edu  Wed Nov 17 13:27:28 1993
Received: from argon.chem.ucla.edu  for mam@xenon.chem.ucla.edu
	by www.ccl.net (8.6.1/930601.1506) id MAA18360; Wed, 17 Nov 1993 12:53:17 -0500
Received: from xenon.chem.ucla.edu by argon.chem.ucla.edu 
	(Sendmail 4.1/1.08) id AA12635; Wed, 17 Nov 93 09:50:32 PST
Received: by xenon.chem.ucla.edu (4.1/SMI-4.1)
	id AA15282; Wed, 17 Nov 93 09:54:55 PST
Date: Wed, 17 Nov 93 09:54:55 PST
From: mam@xenon.chem.ucla.edu (McAllister Michael)
Message-Id: <9311171754.AA15282@xenon.chem.ucla.edu>
To: chemistry@ccl.net
Subject: molecular mechanics and Z-matrices


Does anyone know of a standard molecular mechanics program/platform
that will accept Z-matrices as input? In my experience all MM
programs want you to create the molecule within the application (i.e. draw it
manually or call it up from a previously created MM file); however, we have 
hundreds of Z-matrices lying around which we would like to run mechanics on,
but don't particularly wish to recreate their geometries by drawing them!
So, are there any MM programs out there that will read a standard GAUSSIAN
or MOPAC type Z-matrix?
Thanks in advance,  Mike McAllister
You can reply directly to me or to the net, whichever you prefer.
'mam@xenon.chem.ucla.edu'

From jle@world.std.com  Wed Nov 17 22:27:33 1993
Received: from world.std.com  for jle@world.std.com
	by www.ccl.net (8.6.1/930601.1506) id VAA23058; Wed, 17 Nov 1993 21:35:00 -0500
Received: by world.std.com (5.65c/Spike-2.0)
	id AA04796; Wed, 17 Nov 1993 21:34:47 -0500
Date: Wed, 17 Nov 1993 21:34:47 -0500
From: jle@world.std.com (Joe M Leonard)
Message-Id: <199311180234.AA04796@world.std.com>
To: chemistry@ccl.net
Subject: DGEOM questions for conf anal...


Folks,
	I'm trying to use DGEOM ('89-'90 vintage) for conformational
analysis, but am interested in having phenyl/aromatic groups remain
unaltered during the calculations.  I've tried making them their own
residues (while the rest of the molecule's the "other" residue) and
using the RIGID global keyword for the aromatic residues, but the
various rings distort during the calculation.  Am I missing something
here?  Should I also insert  PLANE local constraints for things to work
(I'm uniquely naming atoms in each aromatic residue, but the names are shared
between residues).

	I realize that the rings will flatten out when some form of energy
function's applied to the resulting structures, but I'm interested in 
trying to keep the number of opt cycles to a minimum (and since the
initial phenyl group geoms are ok...).  Have there been changes to newer
versions that make this an easier process (or have changes been made that I
should be aware of)?

	I'd appreciate learning from others who have solved this problem...
If any of the authors have comments, I'd REALLY appreciate hearing from them.

Joe Leonard
jle@world.std.com