CCL: Descriptors
- From: "N. Sukumar" <nagams\a/rpi.edu>
- Subject: CCL: Descriptors
- Date: Wed, 27 Oct 2010 16:48:07 -0400
Sent to CCL by: "N. Sukumar" [nagams|,|rpi.edu]
> Blind QSAR based on large numbers of descriptors just selected by
> sophisticated statistical methods will lead to QSAR equations, which
> look significant, but most often include no physics. They will fail as
> soon as you apply them to a novel situation.
Of course, the fault here lies in the inadequate or improper use of
robust validation techniques. While I agree in principle with Andreas
that a descriptor should have a reasonable relation to the target
property, many descriptors available today may not have an intuitively
OBVIOUS correlation with the property (or target biological activity) of
interest. While sophisticated statistical methods are not required to
construct the most obvious correlations, excessive reliance on picking
so-called "interpretable" descriptors "by hand" merely serves to
reinforce one's existing prejudices ("chemical intuition") and rarely
leads to the discovery of new science or new materials. Anyone wishing to
seriously embark upon a program of PREDICTIVE cheminformatics should, at
the very least, read the following articles by Alex Tropsha:
A. Golbraikh, A. Tropsha, "Beware of q2!", J. Mol. Graph. Model. 20,
269-276 (2002).
Alexander Tropsha, "Best Practices for QSAR Model Development,
Validation, and Exploitation", Molecular Informatics 29 (6-7),
476-488 (2010).
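
The first reference above argues that a high cross-validated q2 alone does not guarantee predictive power and that an external test set is needed. The following is a minimal sketch of that point on purely synthetic data, not a reproduction of anything in those papers: the function names, the sample sizes and the use of Gaussian random "descriptors" are all assumptions made for illustration. One descriptor is picked from many random ones by its fit on a training set, then scored by training r2, by leave-one-out q2, and on an external set that played no part in the selection.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken about the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def loo_q2(x, y):
    """Leave-one-out cross-validated q2 for a one-descriptor linear model."""
    preds = []
    for i in range(len(x)):
        slope, icpt = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + icpt)
    return r_squared(y, preds)


def blind_selection_demo(n_train=30, n_ext=100, n_descriptors=1000, seed=1):
    """Select the best of many RANDOM descriptors on the training set only.

    Hypothetical toy setup: y and all descriptors are independent noise,
    so any apparent correlation is accidental by construction.
    """
    rng = random.Random(seed)
    n = n_train + n_ext
    y = [rng.gauss(0, 1) for _ in range(n)]
    descriptors = [[rng.gauss(0, 1) for _ in range(n)]
                   for _ in range(n_descriptors)]
    y_tr, y_ex = y[:n_train], y[n_train:]

    def train_r2(d):
        slope, icpt = fit_line(d[:n_train], y_tr)
        return r_squared(y_tr, [slope * v + icpt for v in d[:n_train]])

    # The selection step sees only the training compounds.
    best = max(descriptors, key=train_r2)
    x_tr, x_ex = best[:n_train], best[n_train:]
    slope, icpt = fit_line(x_tr, y_tr)
    return {
        "train_r2": train_r2(best),
        # q2 is computed AFTER selection, i.e. selection is outside the CV loop.
        "loo_q2": loo_q2(x_tr, y_tr),
        "external_r2": r_squared(y_ex, [slope * v + icpt for v in x_ex]),
    }


if __name__ == "__main__":
    for name, value in blind_selection_demo().items():
        print(f"{name}: {value:.2f}")
```

In this setup the training statistics are inflated by the selection step, whereas the external set was never used to choose the descriptor and so gives an honest estimate.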
Dr. N. Sukumar
Rensselaer Exploratory Center for Cheminformatics Research
http://reccr.chem.rpi.edu/
--------------------------
"It is nice to know that the computer understands the problem. But I
would like to understand it too." -- Eugene P. Wigner
==============Original message text===============
On Tue, 26 Oct 2010 16:26:51 EDT "Andreas Klamt klamt~~cosmologic.de"
wrote:
Sent to CCL by: Andreas Klamt [klamt~~cosmologic.de]
Dear George,
I would like to send a kind of warning: the large number of molecular
descriptors which nowadays are easily made available by some programs
also poses a kind of danger. If you have thousands of descriptors
available for a property for which you may have, let's say, 50
experimental data points, then the chance that some of them correlate
just accidentally is quite high. If they correlate accidentally, no
statistical method will detect that the correlation is accidental.
Therefore I strongly recommend that you first decide rationally whether
a descriptor may have any reasonable relation to the target property.
There are a few criteria which can be used: If you want to describe a
local property of a molecule, maybe a certain reactivity of a functional
group, do not use global molecular descriptors, because they cannot be
the right descriptors. Vice versa, do not use local descriptors for
global properties (e.g. a logP). Do not use orbital descriptors when you
want to describe molecular mobility (viscosity, diffusion coefficients,
...). It is best to use a small set of descriptors which is known to
include the relevant information, e.g. for any kind of log-partition
coefficient you may either use the 5 Abraham descriptors or the 5
COSMO-RS sigma-moments. ...
Blind QSAR based on large numbers of descriptors just selected by
sophisticated statistical methods will lead to QSAR equations, which
look significant, but most often include no physics. They will fail as
soon as you apply them to a novel situation.
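
The scale of the chance-correlation problem described above can be estimated with a small simulation. This is only a sketch under stated assumptions: the "descriptors" are independent Gaussian noise (real descriptors are intercorrelated), the 50 compounds match the figure in the text, and the function names are hypothetical. It reports the largest absolute Pearson correlation found between a random target and a growing pool of random descriptors.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def best_chance_correlation(n_compounds=50, n_descriptors=2000, seed=0):
    """Largest |r| between a random target and purely random descriptors.

    Target and descriptors are independent noise, so every correlation
    found here is accidental by construction.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_compounds)]
    best = 0.0
    for _ in range(n_descriptors):
        d = [rng.gauss(0, 1) for _ in range(n_compounds)]
        best = max(best, abs(pearson_r(d, y)))
    return best


if __name__ == "__main__":
    for pool in (10, 100, 2000):
        r = best_chance_correlation(n_descriptors=pool)
        print(f"{pool:5d} random descriptors: max |r| = {r:.2f}")
```

Because the same seed is reused, each larger pool contains the smaller one, so the reported maximum can only grow as more descriptors are offered to the selection step.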
Best regards
Andreas
On 26.10.2010 20:23, Erik-Jan Ras Erik-Jan.Ras..avantium.com wrote:
> Sent to CCL by: Erik-Jan Ras [Erik-Jan.Ras~~avantium.com]
> Dear George,
>
> As already indicated by others, there is no uniform selection method
> for choosing which descriptors to use. Some guidelines, depending on the
> modeling method you use, may still be helpful.
>
> If you're using PLS models, a good starting point is the variable
> importance in projection (VIP) for each of the variables in your model.
> A variable with a high VIP will have a high impact on your model
> performance. Typically you start your modeling exercise with all
> available variables. After that, in small iterative steps, you reduce
> your model. At each stage you have to carefully evaluate the predictive
> power of your model. Ideally you would use a substantially large
> external validation set to assess predictive power.
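
For readers without access to a package that reports VIP, the quantity can be computed from a single-response PLS model. The sketch below is an assumption-laden illustration and not the implementation used by Simca-P or any other package: it uses the NIPALS algorithm for one response (PLS1), autoscaled descriptors, and the commonly cited VIP formula in which squared weights are weighted by the Y-variance each latent variable explains. The function names are hypothetical, and constant (zero-variance) descriptor columns are assumed to have been removed beforehand.

```python
import random
from math import sqrt


def autoscale(X):
    """Center each column and scale it to unit sample standard deviation."""
    n, p = len(X), len(X[0])
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        cols.append([(v - mean) / sd for v in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]


def pls1_vip(X, y, n_components):
    """VIP scores from a NIPALS PLS1 model (sketch, single response)."""
    n, p = len(X), len(X[0])
    E = autoscale(X)                      # X residual matrix
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]           # y residual vector
    weights, explained = [], []
    for _ in range(n_components):
        # Weight vector: direction in X of maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # Scores, X loadings and y loading for this latent variable.
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        loads = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next latent variable.
        E = [[E[i][j] - t[i] * loads[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        weights.append(w)
        explained.append(q * q * tt)      # Y sum of squares explained
    total = sum(explained)
    return [
        sqrt(p * sum(ssy * wa[j] ** 2 for ssy, wa in zip(explained, weights)) / total)
        for j in range(p)
    ]


if __name__ == "__main__":
    # Synthetic data: only descriptor 0 carries signal, the rest are noise.
    rng = random.Random(0)
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
    y = [2.0 * row[0] + 0.1 * rng.gauss(0, 1) for row in X]
    for j, v in enumerate(pls1_vip(X, y, n_components=2)):
        print(f"descriptor {j}: VIP = {v:.2f}")
```

With this definition the squared VIP values sum to the number of descriptors, which is why a threshold near 1 is often used as a rule of thumb when pruning variables. Any pruning based on VIP should still be checked against an external validation set, as the text above recommends.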
>
> Also keep in mind that, in theory, only one latent variable should be
> required in your model per response (Y). If (many) more latent
> variables are required, you're dealing with variations in your
> descriptor space (X) that are orthogonal (uncorrelated) to your
> response (Y). In this case you may want to consider using OPLS instead
> of PLS.
>
> Generally speaking, these methods are implemented in commercial
> packages like Simca-P and work quite well (they are also pretty well
> documented and referenced). With a bit more effort, many open source
> libraries are available as well in environments like Matlab, Scilab or R.
>
> Regards,
> Erik-Jan
>
>
> ________________________________________
> From: owner-chemistry+erikjan.ras==avantium.com_-_ccl.net
> [owner-chemistry+erikjan.ras==avantium.com_-_ccl.net] On Behalf Of George
> Lawrence geoe2##hotmail.com [owner-chemistry_-_ccl.net]
> Sent: Tuesday, October 26, 2010 12:51 PM
> To: Erik-Jan Ras
> Subject: CCL: Descriptors
>
> Sent to CCL by: "George Lawrence" [geoe2%hotmail.com]
> While building a model for a set of compounds, how does one choose the
> molecular descriptors? I am using MOE, which has about 333 different
> descriptors. I noticed that some have the same suffix or prefix.
> For example: GCUT (could be SlogP, SMR or PEOE), and then there are
> SlogP_vsa, SMR_vsa and PEOE_vsa, which have different numbers attached
> to them. What does this mean?
>
> Do they describe the same thing? How do the numbers relate to each
> descriptor?
> What are the best methods for deciding on the right choice of descriptors?
>
> George Lawrence
> Geoe2[a]hotmail.com
> Kent U.K.
>
>
--
PD. Dr. Andreas Klamt
CEO / Geschäftsführer
COSMOlogic GmbH & Co. KG
Burscheider Strasse 515
D-51381 Leverkusen, Germany
phone +49-2171-731681
fax +49-2171-731689
e-mail klamt---cosmologic.de
web www.cosmologic.de
HRA 20653 Amtsgericht Koeln, GF: Dr. Andreas Klamt
Komplementaer: COSMOlogic Verwaltungs GmbH
HRB 49501 Amtsgericht Koeln, GF: Dr. Andreas Klamt
===========End of original message text===========