From owner-chemistry@ccl.net Fri Jul 11 08:52:01 2008
From: "Zhao Yuan ccl+/-mail.sioc.ac.cn"
To: CCL
Subject: CCL: questions about ECFPs
Message-Id: <-37326-080711085003-24201-P/1t6ITFX5s2wxuCNkbUoQ^server.ccl.net>
X-Original-From: "Zhao Yuan"
Date: Fri, 11 Jul 2008 08:50:00 -0400

Sent to CCL by: "Zhao Yuan" [ccl]~[mail.sioc.ac.cn]

Hi everyone,

Recently I read the paper describing how to generate Extended-Connectivity Fingerprints (ECFPs):

/////////////////////////////////////////////
High-Throughput Data Analysis. 1. Extended-Connectivity Fingerprints: A High-Dimensional Descriptor for Molecular Data Analysis
David Rogers* and Mathew Hahn
SciTegic, Inc.
/////////////////////////////////////////////

The paper says that ECFPs can be calculated rapidly and can represent a very large number of different features, so I would like to use them to compare two molecules and calculate the similarity between them. However, I ran into several technical problems when following the method.

First, the hash function. I tried many hash functions to encode the initial atom identifiers, but none of them reproduced the results in the reference. Does anyone know which hash function it uses?

Second, after the first iteration, the codes of the root atom's neighbors are appended to the code of the root atom, which gives an array like this:

[1, 3194967052, 1, 1559650422, 1, 1572579716, 2, 3220825640]

I wonder whether this array needs to be sorted again. In my program, the array looks like this:

13194967052, 11559650422, 11572579716, 23220825640

I then convert it to a sorted or unsorted string, which is used as the input to the hash function:

sorted string:   11559650422115725797161319496705223220825640
unsorted string: 13194967052115596504221157257971623220825640

Whichever string I use, the new features I get differ from the reference results.

Third, in the second iteration, some atoms may be connected to the same neighboring atom.
For example, in a four-membered ring, B and C are neighbors of atom A, while D is connected to both B and C. In the second iteration, to which atom's code should D's code be attached: B, C, or both?

A----B
|    |
C----D

Finally, could anybody give me a detailed worked example of an ECFP_4 calculation, showing the initial identifier of each atom and the identifiers at each iteration (both before and after hashing)?

I tried to contact the author, Dr. Rogers, but his e-mail address is no longer valid.

I would sincerely appreciate any help in resolving these problems.

Best Regards,
Zhao Yuan
------------------------------------------------------------
State Key Lab of Bio-organic and Natural Products Chemistry
Shanghai Institute of Organic Chemistry (SIOC), Chinese Academy of Sciences.
Addr. 354, Fenglin Road, Shanghai, China.
Tel.: +86-21-54925275
Email: yzhao.^.mail.sioc.ac.cn
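
For reference, the string construction described in the second question can be sketched as follows. This is only a minimal illustration of what the message describes, not the reference implementation: the helper names `to_pairs` and `concat` are hypothetical, the pairing of each small integer with the code that follows it is read off the quoted array, and the hash step is left out because the hash function used by the reference is unknown.

```python
# Flat array quoted in the message: a small integer followed by a 32-bit code,
# repeated four times.
flat = [1, 3194967052, 1, 1559650422, 1, 1572579716, 2, 3220825640]


def to_pairs(values):
    """Group a flat list into consecutive (prefix, code) pairs."""
    return list(zip(values[0::2], values[1::2]))


def concat(pairs, sort=False):
    """Join each pair as decimal digits, optionally sorting the pairs first."""
    ordered = sorted(pairs) if sort else pairs
    return "".join(f"{prefix}{code}" for prefix, code in ordered)


pairs = to_pairs(flat)
unsorted_string = concat(pairs)
sorted_string = concat(pairs, sort=True)

print("unsorted:", unsorted_string)
print("sorted:  ", sorted_string)
```

Both strings match the ones quoted in the message. Note that hashing such a digit string is a different operation from hashing the array of integers itself, so the two would generally give different values even with the same hash function.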