CCL: Descriptors

From: "N. Sukumar" <nagams\a/rpi.edu>
Subject: CCL: Descriptors
Date: Wed, 27 Oct 2010 16:48:07 -0400
 Sent to CCL by: "N. Sukumar" [nagams|,|rpi.edu]
 > Blind QSAR based on large numbers of descriptors just selected by
 > sophisticated statistical methods will lead to QSAR equations, which
 > look significant, but most often include no physics. They will fail as
 > soon as you apply them to a novel situation.
 Of course, the fault here lies in the inadequate or improper use of
 robust validation techniques. While I agree in principle with Andreas
 that a descriptor should have a reasonable relation to the target
 property, many descriptors available today may not have an intuitively
 OBVIOUS correlation with the property (or target biological activity) of
 interest. While sophisticated statistical methods are not required to
 construct the most obvious correlations, excessive reliance on picking
 so-called "interpretable" descriptors "by hand" merely serves to
 reinforce one's existing prejudices ("chemical intuition") and rarely
 leads to the discovery of new science or new materials. Anyone wishing to
 seriously embark upon a program of PREDICTIVE cheminformatics should, at
 the very least, read the following articles by Alex Tropsha:
 A. Golbraikh, A. Tropsha, “Beware of q2!”, J. Mol. Graph. Model. 20,
 269–276 (2002).
 Alexander Tropsha, “Best Practices for QSAR Model Development,
 Validation, and Exploitation”, Molecular Informatics 29 (6-7),
 476–488 (2010).
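
The point about validation can be illustrated with a small sketch on purely random, synthetic data. Nothing here is taken from the cited papers; all names and sizes are illustrative assumptions. The descriptor that best fits 50 training values is picked from 2000 random candidates and then judged twice: by leave-one-out q2 on the training set and by R2 on an external set that played no part in the selection.

```python
import random


def fit_line(x, y):
    """Least-squares line y = slope*x + intercept; also returns the fit r^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy * sxy / (sxx * syy)


def q2_loo(x, y):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / SS_tot."""
    my = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        slope, intercept, _ = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    return 1.0 - press / sum((b - my) ** 2 for b in y)


def r2_external(x_train, y_train, x_test, y_test):
    """R^2 on held-out data, relative to predicting the training mean."""
    slope, intercept, _ = fit_line(x_train, y_train)
    my = sum(y_train) / len(y_train)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x_test, y_test))
    ss_tot = sum((b - my) ** 2 for b in y_test)
    return 1.0 - ss_res / ss_tot


rng = random.Random(1)
n_train, n_test, n_desc = 50, 200, 2000        # illustrative sizes
n_all = n_train + n_test
y = [rng.gauss(0.0, 1.0) for _ in range(n_all)]  # random "activity"
pool = [[rng.gauss(0.0, 1.0) for _ in range(n_all)] for _ in range(n_desc)]
y_tr, y_te = y[:n_train], y[n_train:]

# Descriptor selection sees the whole training set, as in a "blind" workflow.
best = max(pool, key=lambda d: fit_line(d[:n_train], y_tr)[2])
x_tr, x_te = best[:n_train], best[n_train:]

q2 = q2_loo(x_tr, y_tr)
r2_ext = r2_external(x_tr, y_tr, x_te, y_te)
print(f"LOO q^2 on training set: {q2:.2f}")
print(f"R^2 on external set:     {r2_ext:.2f}")
```

Because the selection step has already seen every training value, the cross-validated q2 tends to look encouraging even though the data are noise, while the external R2 does not. Repeating the selection inside each cross-validation fold, or scrambling y, are other ways to expose the same problem.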
 Dr. N. Sukumar
 Rensselaer Exploratory Center for Cheminformatics Research
 http://reccr.chem.rpi.edu/
 --------------------------
 "It is nice to know that the computer understands the problem. But I
 would like to understand it too." -- Eugene P. Wigner
 ==============Original message text===============
 On Tue, 26 Oct 2010 16:26:51 EDT "Andreas Klamt klamt~~cosmologic.de"
 wrote:
 Sent to CCL by: Andreas Klamt [klamt~~cosmologic.de]
 Dear George,
 I would like to send a kind of warning: The large number of molecular
 descriptors that some programs now make easily available also poses
 a kind of danger. If you have thousands of descriptors available for a
 property for which you may have, let's say, 50 experimental data points,
 then the chance that some of them correlate just accidentally is quite
 high. If they correlate accidentally, no statistical method will detect
 that the correlation is accidental. Therefore I strongly recommend that
 you first decide rationally whether a descriptor may have any reasonable
 relation to the target property. There are a few criteria which can be
 used: If you want to describe a local property of a molecule, maybe a
 certain reactivity of a functional group, do not use global molecular
 descriptors, because they cannot be the right descriptors. Vice versa,
 do not use local descriptors for global properties (e.g. a logP). Do not
 use orbital descriptors when you want to describe molecular mobility
 (viscosity, diffusion coefficients, ...). Best use a small set of
 descriptors which is known to include the relevant information, e.g. for
 any kind of log-partition coefficient you may either use the 5 Abraham
 descriptors or the 5 COSMO-RS sigma-moments. ...
 Blind QSAR based on large numbers of descriptors just selected by
 sophisticated statistical methods will lead to QSAR equations, which
 look significant, but most often include no physics. They will fail as
 soon as you apply them to a novel situation.
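
The chance-correlation argument can be checked with a small simulation. This is a minimal sketch using only the Python standard library and purely random numbers; the function names, the seed and the pool sizes are illustrative assumptions, not part of any QSAR package. It draws a random "property" for 50 compounds and reports the best squared correlation found among a small and a large pool of equally random "descriptors".

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_chance_r2(n_samples, n_descriptors, rng):
    """Largest r^2 between a random property and purely random descriptors."""
    y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_descriptors):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, pearson_r(x, y) ** 2)
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable
few = best_chance_r2(n_samples=50, n_descriptors=5, rng=rng)
many = best_chance_r2(n_samples=50, n_descriptors=2000, rng=rng)
print(f"best r^2 among    5 random descriptors: {few:.2f}")
print(f"best r^2 among 2000 random descriptors: {many:.2f}")
```

None of the descriptors carries any information about the property, yet the best of a large pool typically correlates far better than the best of a small one. That is the risk of selecting from thousands of candidates with only about 50 data points.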
 Best regards
 Andreas
 On 26.10.2010 20:23, Erik-Jan Ras Erik-Jan.Ras..avantium.com wrote:
 > Sent to CCL by: Erik-Jan Ras [Erik-Jan.Ras~~avantium.com]
 > Dear George,
 >
 > As already indicated by others, there is no uniform selection method
 > for choosing which descriptors to use. Some guidelines, depending on the
 > modeling method you use, may still be helpful.
 >
 > If you're using PLS models, a good starting point is the variable
 > importance in projection (VIP) of each variable in your model. A variable
 > with a high VIP has a high impact on your model performance. Typically
 > you start your modeling exercise with all available variables. After
 > that, in small iterative steps, you reduce your model. At each stage you
 > have to carefully evaluate the predictive power of your model. Ideally
 > you would use a substantially large external validation set to assess
 > predictive power.
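
As a rough illustration of the VIP idea, the sketch below fits a single-response PLS model (PLS1, NIPALS algorithm) in plain Python and computes VIP scores with the usual formula VIP_j = sqrt(p * sum_a(SS_a * w_ja^2) / sum_a(SS_a)), where SS_a is the part of the Y sum of squares explained by component a. The data are synthetic, and the function names and sizes are illustrative assumptions. Packages such as Simca-P, or PLS libraries in R or Matlab, may scale the data or report VIP slightly differently.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def autoscale(col):
    """Centre a column and scale it to unit standard deviation."""
    m = sum(col) / len(col)
    s = math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
    return [(v - m) / s for v in col]


def pls1_vip(columns, y, n_components):
    """NIPALS PLS1 on autoscaled data; returns one VIP score per descriptor."""
    n, p = len(y), len(columns)
    E = [autoscale(c) for c in columns]   # deflated X, stored column-wise
    f = autoscale(y)                      # deflated y
    weights, explained = [], []
    for _ in range(n_components):
        w = [dot(c, f) for c in E]        # weights proportional to X'y
        norm = math.sqrt(dot(w, w))
        w = [v / norm for v in w]
        t = [sum(w[j] * E[j][i] for j in range(p)) for i in range(n)]  # scores
        tt = dot(t, t)
        q = dot(f, t) / tt                          # y loading
        loadings = [dot(c, t) / tt for c in E]      # X loadings
        # Deflate X and y before extracting the next component.
        E = [[c[i] - t[i] * pj for i in range(n)] for c, pj in zip(E, loadings)]
        f = [f[i] - q * t[i] for i in range(n)]
        weights.append(w)
        explained.append(q * q * tt)      # y sum of squares explained
    total = sum(explained)
    return [
        math.sqrt(p * sum(ss * w[j] ** 2 for ss, w in zip(explained, weights)) / total)
        for j in range(p)
    ]


rng = random.Random(0)
n, p = 40, 6                              # illustrative sizes
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
# Synthetic response: only descriptors 0 and 1 matter, plus some noise.
y = [2.0 * columns[0][i] - columns[1][i] + rng.gauss(0.0, 0.3) for i in range(n)]

vip = pls1_vip(columns, y, n_components=2)
for j, v in enumerate(vip):
    print(f"descriptor {j}: VIP = {v:.2f}")
```

By construction the squared VIP scores average to 1, which is why a VIP above 1 is a common rule of thumb for an influential variable. In an iterative reduction, the lowest-VIP descriptors would be dropped a few at a time, with predictive power re-checked on external data after each step.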
 >
 > Also keep in mind that, in theory, only one latent variable per
 > response (Y) should be required in your model. If (many) more latent
 > variables are required, you're dealing with variations in your descriptor
 > space (X) that are orthogonal (uncorrelated) to your response (Y). In
 > this case you may want to consider using OPLS instead of PLS.
 >
 > Generally speaking, these methods are implemented in commercial
 > packages like Simca-P and work quite well (they are also pretty well
 > documented and referenced). With a bit more effort, many open source
 > libraries are available as well in environments like Matlab, Scilab or R.
 >
 > Regards,
 > Erik-Jan
 >
 >
 > ________________________________________
 > From: owner-chemistry+erikjan.ras==avantium.com_-_ccl.net
 > [owner-chemistry+erikjan.ras==avantium.com_-_ccl.net] On Behalf Of George
 > Lawrence geoe2##hotmail.com [owner-chemistry_-_ccl.net]
 > Sent: Tuesday, October 26, 2010 12:51 PM
 > To: Erik-Jan Ras
 > Subject: CCL: Descriptors
 >
 > Sent to CCL by: "George  Lawrence" [geoe2%hotmail.com]
 > While building a model for a set of compounds, how does one choose the
 > molecular descriptors? I am using MOE, which has about 333
 > different descriptors. I noticed that some have the same suffix or prefix.
 > For example: GCUT (could be SlogP, SMR or PEOE), and then there are
 > SlogP_vsa, SMR_vsa and PEOE_vsa, which have different numbers attached to
 > them. What does this mean?
 >
 > Do they describe the same thing? How do the numbers relate to each
 > descriptor?
 > What are the best methods to use to decide the right choice of descriptors?
 >
 > George Lawrence
 > Geoe2[a]hotmail.com
 > Kent U.K.
 >
 >
 --
 PD. Dr. Andreas Klamt
 CEO / Geschäftsführer
 COSMOlogic GmbH & Co. KG
 Burscheider Strasse 515
 D-51381 Leverkusen, Germany
 phone  	+49-2171-731681
 fax    	+49-2171-731689
 e-mail 	klamt---cosmologic.de
 web    	www.cosmologic.de
 HRA 20653 Amtsgericht Koeln, GF: Dr. Andreas Klamt
 Komplementaer: COSMOlogic Verwaltungs GmbH
 HRB 49501 Amtsgericht Koeln, GF: Dr. Andreas Klamt
 ===========End of original message text===========