From owner-chemistry@ccl.net Thu Mar 27 11:15:00 2014
From: "Wolf Ihlenfeldt wdi^-^xemistry.com"
To: CCL
Subject: CCL: Similarity exclusion filters
Message-Id: <-49874-140327094053-8719-LEO38jZ5uv6jHEHDal/DCw\a/server.ccl.net>
X-Original-From: Wolf Ihlenfeldt
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8
Date: Thu, 27 Mar 2014 14:40:44 +0100
MIME-Version: 1.0

Sent to CCL by: Wolf Ihlenfeldt [wdi[A]xemistry.com]
On Wed, Mar 26, 2014 at 11:14 PM, Andrew Voronkov drugdesign-,-yandex.ru wrote:
>
> Sent to CCL by: Andrew Voronkov [drugdesign[a]yandex.ru]
> Dear CCL users, are you aware of any scripts or programs, which can be used for filtering out similar compounds.
> Let's say there is a set of known compounds for the biotarget. Then I make a screening of certain dataset through this biotarget. After screening, I would like to filter out all compounds, which are similar to the already known dataset, which I provide let's say as smiles file.
> It would be nice to have some script, where I can setup that for example compounds with 80-95% similarity to already known compounds are filtered out.
>
> Are you aware of such software, scripts? Maybe there is someone, who can consult me about writing such script? Maybe something like OpenBabel etc. can be used for that?
>
> Best regards,
> Andrey
>
>

This is a typical problem for a chemistry-aware scripting toolkit, such as our Cactvs Cheminformatics Toolkit (see www.xemistry.com/academic for free academic releases).
Here are simple sample solutions in the Tcl and Python interface languages:

---snip---
# load the known compounds into a table with a fingerprint column
set th [table create]
table addcol $th E_SCREEN
table addfile $th [molfile open [lindex $argv 0]]
# open the screening file to be filtered
set fh [molfile open [lindex $argv 1]]
molfile loop $fh eh {
    set too_similar 0
    # compare the current structure against every known compound
    table loop $th row {
        if {[prop compare E_SCREEN [ens get $eh E_SCREEN] [lindex $row 0] tanimoto]>80} {
            set too_similar 1
            break
        }
    }
    # no known compound is too similar: copy the record to stdout
    if {!$too_similar} {
        molfile copy $fh stdout 1 -1
    }
}
---snip---

run as "csts -f myscript.tcl myfilterset.smi mydb.smi"

---snip---
import sys

# load the known compounds into a table with a fingerprint column
th=Table()
th.addcol('E_SCREEN')
th.addfile(Molfile(sys.argv[0]))
# open the screening file to be filtered
fh=Molfile(sys.argv[1])
for eh in fh:
    too_similar=False
    # compare the current structure against every known compound
    for row in th:
        if (Prop.Compare('E_SCREEN',eh.E_SCREEN,row[0],'tanimoto')>80):
            too_similar=True
            break
    # restart the table iteration for the next structure
    th.rewind()
    # no known compound is too similar: copy the record to stdout
    if not too_similar:
        fh.copy(sys.stdout,1,-1)
---snip---

run as "cspy -f myscript.py myfilterset.smi mydb.smi"

Using a scripting environment instead of a turnkey application gives you much more control over what the software actually does. It is straightforward to use, for example, different similarity source data (path fingerprints or sphere fingerprints instead of the fragment-based fingerprint used in the example code), different comparison algorithms (Cactvs supports about 10 different similarity comparison algorithms, such as Cosine, Dice, Russell-Rao, Kulczynski, Simpson, Yule, Hamann, and Forbes in addition to Tanimoto), or to add additional filters (element composition filter, reactive groups, druglikeness, etc.).

The development of the Python interface, now part of the free academic packages, was kindly supported by Vertex Pharmaceuticals.

--
Wolf-D. Ihlenfeldt - Xemistry GmbH - wdi{}xemistry.com
Phone: +49 6174 201455 - Fax +49 6174 209665
---
xemistry gmbh – Geschäftsführer/Managing Director: Dr. W. D. Ihlenfeldt
Address: Hainholzweg 11, D-61462 Königstein, Germany
HR Königstein B7522 : Ust/VAT ID DE215316329 : DUNS 34-400-1719
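
For readers who want to see the exclusion logic of the scripts above without any toolkit, it can be sketched in plain Python. This is only an illustrative sketch, not Cactvs code: fingerprints are assumed to be available as sets of on-bit indices, the names tanimoto_percent and exclude_similar are made up for this example, the toy fingerprints are invented, and the similarity is scaled to percent so that it matches the >80 threshold used above.

```python
def tanimoto_percent(a, b):
    """Tanimoto similarity, in percent, of two fingerprints given as
    collections of on-bit indices (an assumed representation)."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return 100.0 * len(a & b) / union


def exclude_similar(candidates, known, threshold=80.0):
    """Return the names of candidates whose similarity to every known
    fingerprint is at or below the threshold."""
    kept = []
    for name, fp in candidates:
        # any() stops at the first hit, like the break in the scripts above
        if not any(tanimoto_percent(fp, ref) > threshold for ref in known):
            kept.append(name)
    return kept


# Invented toy fingerprints, purely for illustration
known = [{1, 2, 3, 4, 5}]
candidates = [
    ("near_duplicate", {1, 2, 3, 4, 5, 6}),  # 5 shared bits out of 6
    ("different", {1, 2, 9, 10}),            # 2 shared bits out of 7
]
print(exclude_similar(candidates, known))  # prints ['different']
```

Every candidate is compared against every known compound, so the cost grows with the product of the two set sizes. That is usually acceptable for a filter set of known actives, but larger sets may call for a different approach.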