From owner-chemistry@ccl.net Mon Mar 19 23:42:01 2007
From: "Wolf-D. Ihlenfeldt" wdi=xemistry.com
To: CCL
Subject: CCL: program to split sdf file
Date: Mon, 19 Mar 2007 23:39:58 -0400

Sent to CCL by: "Wolf-D. Ihlenfeldt" [wdi|a|xemistry.com]

There have now been quite a number of helpful suggestions on how to perform
the split - but none of them is really robust, which is essential once files
become too big for visual inspection of the results. Remember: Murphy lives!

Take the following completely legal SD file record:

----

  -ISIS-  03190722292D
$$$$
  1  0  0  0  0  0  0  0  0  0999 V2000
   -2.3167   -0.2167    0.0000 Au  0  0  0  1  0  0  1  0  0  0  0  0
M  END
> <PRICE>
$$$$

$$$$
------

This is a single record, with multiple "$$$$" lines in places where they are
*not* record terminators. All simple string-search methods with awk or
similar tools which just look for the "$$$$" line will fail on it. Never
assume that such records do not exist - I *have* seen "$$$$" in SD data
lines before.

A record splitter thus needs more chemical intelligence to process such
files.

OpenBabel has been suggested for this problem. There are several problems
with that proposal:

a) It does not scale to really big files, because Babel has no way to write
multiple output files. Every batch is a separate command and needs to skip
to its first copy position. Not a problem with a few thousand compounds, but
ultimately this approaches an n**2 performance law, and if you need to split
your full PubChem download of 10 million compounds, you have a problem.
b) While Babel is smart enough to read and output the first records from an
SD input file with repeated "$$$$" lines as above (almost) correctly, its
skip function seems to have less brainpower and gets confused: it simply
quits, silently, without any message. It is not possible to output records
starting after the first record of the repeated multi-record test file
above, or after encountering such a record anywhere in the skipped part. A
bad thing if something like this happens in the middle of your 500 MB file,
which you cannot edit.

c) While on superficial inspection the Babel output looks correct when run
on the first records, a closer look shows that critical information has been
lost. Babel needs to read records into its internal data structure before it
can output them via conversion. However, its Molfile parser is rather simple
and supports few of the more advanced Molfile encoding conventions. In this
case, Babel silently drops the critical H0 designator flag (plus a second
flag) which lets a Molfile reader distinguish between metal Au and AuH3 with
implicit hydrogens. So after the pass through Babel, the compound has
changed, without any notification, from metal Au to AuH3. That can be a
problem.

OK, enough criticism, here is constructive help:

-----snip---store as script.tcl---
set fname [lindex $argv 0]
set fhin [molfile open $fname]
set setsize [lindex $argv 1]
set startrec 1
while 1 {
    set fhout [open [file rootname $fname]_${startrec}_[expr {$startrec+$setsize-1}][file extension $fname] w]
    if {[catch {molfile copy $fhin $fhout $setsize}]} {
        close $fhout
        exit
    }
    close $fhout
    incr startrec $setsize
}
-------

Above is a really simple (not user-proofed, no parameter checking) script
for the CACTVS toolkit (www.xemistry.com/academic). Run it with the generic
script interpreter from the package as

csts -f script.tcl filename.sdf setsize

The script will output a set of files like "myfile_1_99.sdf",
"myfile_100_199.sdf", etc. in the same directory as the source file.
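For readers without access to CACTVS, the same boundary-detection idea can be sketched in plain Python. This is a minimal illustration, not the CACTVS implementation: it walks the V2000 molfile structure (counts line, atom/bond blocks, "M  END", data items) instead of grepping for "$$$$", so the tricky record above is copied byte-for-byte. It assumes well-formed V2000 records with the standard blank-line-terminated data items; the file and function names are made up for the example.

```python
import os


def read_record(fh):
    """Read one SD (V2000) record verbatim; return its raw lines, or None at EOF.

    Walks the molfile structure rather than scanning for '$$$$', so a '$$$$'
    in the comment line or inside a data-item value is not mistaken for a
    record terminator.
    """
    lines = []
    for _ in range(4):                      # name, program, comment, counts
        line = fh.readline()
        if not line:
            return None                     # clean EOF before a new record
        lines.append(line)
    natoms, nbonds = int(lines[3][0:3]), int(lines[3][3:6])
    for _ in range(natoms + nbonds):        # atom block + bond block
        lines.append(fh.readline())
    while True:                             # properties block up to 'M  END'
        line = fh.readline()
        if not line:
            return lines
        lines.append(line)
        if line.startswith("M  END"):
            break
    in_value = False                        # inside a data-item value?
    while True:                             # data items, then the terminator
        line = fh.readline()
        if not line:
            return lines
        lines.append(line)
        if in_value:
            in_value = line.strip() != ""   # a blank line ends the value
        elif line.startswith("$$$$"):
            break                           # a real record terminator
        elif line.startswith(">"):
            in_value = True                 # data-item header starts a value
    return lines


def split_sdf(path, setsize):
    """Copy setsize records at a time into <root>_<first>_<last><ext> files."""
    root, ext = os.path.splitext(path)
    start = 1
    with open(path) as fh:
        while True:
            records = []
            for _ in range(setsize):
                rec = read_record(fh)
                if rec is None:
                    break
                records.append(rec)
            if not records:
                break
            with open(f"{root}_{start}_{start + setsize - 1}{ext}", "w") as out:
                for rec in records:
                    out.writelines(rec)
            start += setsize
```

Calling `split_sdf("myfile.sdf", 3000)` would then write `myfile_1_3000.sdf`, `myfile_3001_6000.sdf`, and so on, with each record's bytes untouched.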
This script:

a) Processes the above sample file (or any other input file) without a
single change of bytes in the split records, similar to line-copying awk
scripts etc. The record copy function does not decode and re-encode the
data; it just keeps an eye on the passing data to detect proper record
boundaries.

b) Does not need to know anything about the input file format. It will
autodetect the format (independent of the suffix) and work with any
supported multi-record format.

W. D. Ihlenfeldt
Xemistry GmbH
wdi-x-xemistry.com

_____

> From: owner-chemistry-x-ccl.net [mailto:owner-chemistry-x-ccl.net]
> Sent: Monday, March 19, 2007 10:36 AM
> To: Ihlenfeldt, W.d.
> Subject: CCL: program to split sdf file
>
> Hi,
>
> Unix has loads of ways of doing this, some of which have already been
> suggested. But if you want a more graphical way of doing things, you
> might want to look at ChemAxon's Instant JChem
> (http://www.chemaxon.com/product/ijc.html), which lets you easily view
> and query chemical data imported from an SD file.
>
> The data can be queried (in your case, just for IDs 1-3,000 etc., but
> much more complex queries are possible) and the results then exported to
> an SD file.
>
> And yes, it is free!
> Alex
>
> Fan, Huajun hjfan^^^pvamu.edu wrote:
>
>> Hi, does anyone know any programs (preferably free) that can split a
>> big sdf file into smaller files? I have an sdf file containing 30,000
>> molecules and want to do a DOCK5 run. It is too big even to read
>> through. I want to split it into 10 smaller files that contain 3,000
>> each. Is it possible? The newest version of Babel does not seem to
>> offer this split function for the SDF format.
>>
>> Thanks in advance.
>>
>> Hua-Jun