From owner-chemistry@ccl.net Thu Mar 27 11:15:00 2014
From: "Wolf Ihlenfeldt wdi^-^xemistry.com" <owner-chemistry\a/server.ccl.net>
To: CCL
Subject: CCL: Similarity exclusion filters
Message-Id: <-49874-140327094053-8719-LEO38jZ5uv6jHEHDal/DCw\a/server.ccl.net>
X-Original-From: Wolf Ihlenfeldt <wdi(-)xemistry.com>
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8
Date: Thu, 27 Mar 2014 14:40:44 +0100
MIME-Version: 1.0


Sent to CCL by: Wolf Ihlenfeldt [wdi[A]xemistry.com]
On Wed, Mar 26, 2014 at 11:14 PM, Andrew Voronkov
drugdesign-,-yandex.ru <owner-chemistry{}ccl.net> wrote:
>
> Sent to CCL by: Andrew Voronkov [drugdesign[a]yandex.ru]
> Dear CCL users, are you aware of any scripts or programs that can be used for filtering out similar compounds?
> Let's say there is a set of known compounds for a biotarget. I then screen a certain dataset against this biotarget. After screening, I would like to filter out all compounds that are similar to the already known set, which I provide, let's say, as a SMILES file.
> It would be nice to have a script where I can specify that, for example, compounds with 80-95% similarity to already known compounds are filtered out.
>
> Are you aware of such software or scripts? Maybe someone can advise me on writing such a script? Maybe something like OpenBabel can be used for that?
>
> Best regards,
> Andrey
>
>

This is a typical problem for a chemistry-aware scripting toolkit,
such as our Cactvs Cheminformatics Toolkit (see
www.xemistry.com/academic for free academic releases).

Here are simple sample solutions in the Tcl and Python interface languages:

---snip---
set th [table create]
table addcol $th E_SCREEN
table addfile $th [molfile open [lindex $argv 0]]
set fh [molfile open [lindex $argv 1]]
molfile loop $fh eh {
    set too_similar 0
    table loop $th row {
        if {[prop compare E_SCREEN [ens get $eh E_SCREEN] [lindex $row 0] tanimoto]>80} {
            set too_similar 1
            break
        }
    }
    if {!$too_similar} {
        molfile copy $fh stdout 1 -1
    }
}
---snip---

run as "csts -f myscript.tcl myfilterset.smi mydb.smi"

---snip---
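import sys

# Argument indices are kept as posted; under a standard Python interpreter
# sys.argv[0] is the script name, so they may need to be 1 and 2 depending
# on how cspy populates sys.argv.
th=Table()
th.addcol('E_SCREEN')
th.addfile(Molfile(sys.argv[0]))
fh=Molfile(sys.argv[1])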
for eh in fh:
    too_similar=False
    for row in th:
        if (Prop.Compare('E_SCREEN',eh.E_SCREEN,row[0],'tanimoto')>80):
            too_similar=True
            break
    th.rewind()
    if not too_similar:
        fh.copy(sys.stdout,1,-1)
---snip---

run as "cspy -f myscript.py myfilterset.smi mydb.smi"
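
The logic of both scripts does not depend on the toolkit and can be
sketched in plain Python. This is only a minimal illustration, not Cactvs
code: fingerprints are modelled as Python sets of on-bit positions, and
the names tanimoto_percent and exclusion_filter are made up for this
example. Similarity is expressed on a 0-100 scale so that the cutoff
reads like the >80 test in the scripts above.

```python
def tanimoto_percent(a, b):
    """Tanimoto similarity of two on-bit sets, on a 0-100 scale."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return 100.0 * len(a & b) / union


def exclusion_filter(known, candidates, threshold=80.0):
    """Keep candidates that are not too similar to any known fingerprint."""
    kept = []
    for name, fp in candidates:
        # any() stops at the first hit, like the 'break' in the scripts above
        if not any(tanimoto_percent(fp, ref) > threshold for ref in known):
            kept.append(name)
    return kept


# Toy data: one known compound and two screening hits (hypothetical bit sets)
known = [{1, 2, 3, 4, 5}]
candidates = [
    ("near_duplicate", {1, 2, 3, 4, 5, 6}),  # shares 5 of 6 bits with the known set
    ("different", {1, 2, 7, 8, 9}),          # shares 2 of 8 bits with the known set
]
print(exclusion_filter(known, candidates))
```

With real data, the sets would be replaced by whatever fingerprint the
toolkit computes; only the nested loop and the threshold test carry over.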

Using a scripting environment instead of a turnkey application gives
you much more control over what the software actually does. It is
straightforward, for example, to use different similarity source
data (path fingerprints or sphere fingerprints instead of the
fragment-based fingerprint used in the example code), to switch
comparison algorithms (Cactvs supports about 10 different similarity
measures, such as Cosine, Dice, Russell-Rao, Kulczynski, Simpson,
Yule, Hamman, Forbes in addition to Tanimoto), or to add further
filters (element composition, reactive groups, druglikeness,
etc.).
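
To give a feel for how the choice of comparison algorithm matters, here is
a small plain-Python sketch of four of the measures named above, using
their textbook definitions on sets of on-bit positions. It does not use
the Cactvs API, the function names are chosen for this example only, and
values are on a 0-1 scale; how Cactvs scales its results is not shown
here.

```python
import math


def counts(a, b):
    """Return (bits in a, bits in b, bits in common)."""
    return len(a), len(b), len(a & b)


def tanimoto(a, b):
    na, nb, c = counts(a, b)
    return c / (na + nb - c) if (na + nb - c) else 0.0


def dice(a, b):
    na, nb, c = counts(a, b)
    return 2 * c / (na + nb) if (na + nb) else 0.0


def cosine(a, b):
    na, nb, c = counts(a, b)
    return c / math.sqrt(na * nb) if na and nb else 0.0


def simpson(a, b):
    na, nb, c = counts(a, b)
    return c / min(na, nb) if na and nb else 0.0


# Hypothetical fingerprints: 4 and 6 bits set, 2 in common
x = {1, 2, 3, 4}
y = {3, 4, 5, 6, 7, 8}
for f in (tanimoto, dice, cosine, simpson):
    print(f.__name__, round(f(x, y), 3))
```

For this pair the scores are 0.25 (Tanimoto), 0.4 (Dice), about 0.408
(Cosine) and 0.5 (Simpson), so a cutoff tuned for one measure should not
be reused unchanged with another.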

The development of the Python interface, which is now part of the free
academic packages, was kindly supported by Vertex Pharmaceuticals.


-- 
Wolf-D. Ihlenfeldt -  Xemistry GmbH - wdi{}xemistry.com
Phone: +49 6174 201455 - Fax +49 6174 209665
---
xemistry gmbh – Geschäftsführer/Managing Director: Dr. W. D. Ihlenfeldt
Address: Hainholzweg 11, D-61462 Königstein, Germany
HR Königstein B7522 : Ust/VAT ID DE215316329 : DUNS 34-400-1719