CCL: program to split sdf file



 
 
Quite a number of helpful suggestions on how to perform the split have been posted by now, but none of them is really robust, and robustness is essential once files become too big for visual inspection of the results. Remember: Murphy lives!
 
Take the following completely legal SD file record:
 
----

  -ISIS-  03190722292D
$$$$
  1  0  0  0  0  0  0  0  0  0999 V2000
   -2.3167   -0.2167    0.0000 Au  0  0  0  1  0  0  1  0  0  0  0  0
M  END
> <PRICE>
$$$$
 
$$$$
------
 
This is a single record, with multiple "$$$$" lines in places where they are *not* record terminators: the first is the comment line of the header block and the second is the value of the PRICE data field. Only the last one ends the record. All simple string-search methods, with awk or similar tools, that just look for a $$$$ line will fail on it. Never assume that such records do not exist. I *have* seen $$$$ in SD data lines before.
 
A record splitter thus needs more chemical intelligence to process such files.
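
To illustrate what "keeping track of the position inside the record" means, here is a minimal Python sketch of a boundary detector. It is not the method used by any particular toolkit. It assumes well-formed V2000 records with fixed-column counts lines, and the names iter_sd_records, split_text and SAMPLE_RECORD are made up for this example. It treats "$$$$" as a terminator only when it appears outside the header block, the connection table and a data item.

```python
import io
from typing import Iterator, List, TextIO

# The sample record from above, written line by line to keep whitespace explicit.
SAMPLE_RECORD = (
    "\n"
    "  -ISIS-  03190722292D\n"
    "$$$$\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "   -2.3167   -0.2167    0.0000 Au  0  0  0  1  0  0  1  0  0  0  0  0\n"
    "M  END\n"
    "> <PRICE>\n"
    "$$$$\n"
    "\n"
    "$$$$\n"
)


def iter_sd_records(stream: TextIO) -> Iterator[str]:
    """Yield each record verbatim (assumption: well-formed V2000 input)."""
    while True:
        lines: List[str] = []
        # Header block plus counts line: four lines that are never terminators.
        for _ in range(4):
            line = stream.readline()
            if not line:
                if any(l.strip() for l in lines):
                    raise ValueError("input ends inside a record header")
                return  # clean end of input (possibly after trailing blanks)
            lines.append(line)
        counts = lines[3]
        n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
        # Atom and bond blocks have a known length, so just copy them.
        for _ in range(n_atoms + n_bonds):
            line = stream.readline()
            if not line:
                raise ValueError("input ends inside the connection table")
            lines.append(line)
        # Property lines run up to and including "M  END".
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("missing 'M  END'")
            lines.append(line)
            if line.startswith("M  END"):
                break
        # Data items: a "$$$$" line inside an item is data, not a terminator.
        in_data_item = False
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("missing record terminator")
            lines.append(line)
            if in_data_item:
                if not line.strip():
                    in_data_item = False  # blank line closes the data item
            elif line.startswith(">"):
                in_data_item = True
            elif line.startswith("$$$$"):
                break
        yield "".join(lines)


def split_text(text: str) -> List[str]:
    """Convenience wrapper for in-memory input."""
    return list(iter_sd_records(io.StringIO(text)))
```

Calling split_text(SAMPLE_RECORD * 2) gives back two records, each identical to the sample, whereas counting "$$$$" lines in the same text would suggest six.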
 
OpenBabel has been suggested for this problem. There are several problems with that proposal:
a) It does not scale to really big files, because Babel has no method to output multiple files. Every batch is a separate command and needs to skip to the first copy position. That is not a problem with a few thousand compounds, but the total work ultimately approaches an n**2 performance law, and if you need to split your full PubChem download of 10 million compounds, you have a problem.
b) While Babel is smart enough to read and output the first records from an SD input file with repeated records like the one above (almost) correctly, its skip function seems to have less brainpower and gets confused. It simply quits silently, without any message. It is not possible to output records starting after the first record of a test file made of repeated copies of the record above, or after encountering such a record anywhere in the skipped part. That is a bad thing if it happens in the middle of a 500 MB file that you cannot edit.
c) While on superficial inspection the Babel output looks correct when run on the first records, a closer look shows that critical information has been lost. Babel needs to read records into its internal data structure before writing them out via conversion. However, its Molfile parser is rather simple and supports few of the more advanced Molfile encoding conventions. In this case, Babel silently drops the critical H0 designator flag (plus a second flag) which lets a Molfile reader distinguish between metallic Au and AuH3 with implicit H. So after the pass through Babel, the compound has changed, without any notification, from metallic Au to AuH3. That can be a problem.
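
The scaling argument in point a) can be made concrete with a little arithmetic. The following Python sketch rests on an assumption, not a measurement: every batch command reads the file from record 1 up to the last record of its batch. The function name and the batch sizes are invented for illustration; only the 10 million figure comes from the text above.

```python
def records_scanned_with_restarts(total: int, batch: int) -> int:
    """Records read in total when every batch restarts at record 1."""
    scanned = 0
    for first in range(1, total + 1, batch):
        last = min(first + batch - 1, total)
        # Skip first-1 records, then copy up to record `last`.
        scanned += last
    return scanned


# A single-pass splitter reads each record once, i.e. `total` records.
pubchem = 10_000_000
coarse = records_scanned_with_restarts(pubchem, 100_000)  # 100 batches
fine = records_scanned_with_restarts(pubchem, 10_000)     # 1000 batches
```

Under this assumption, 100 batches cost 505,000,000 record reads and 1000 batches cost 5,005,000,000, against 10,000,000 for a single pass. The total grows roughly as n**2 / (2 * batch size).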
 
OK, enough criticism, here is constructive help:
 
-----snip---store as script.tcl---
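# Usage: csts -f script.tcl filename.sdf setsize
set fname [lindex $argv 0]
set fhin [molfile open $fname]
set setsize [lindex $argv 1]
set startrec 1
while 1 {
        # Output name: <root>_<first>_<last><ext>, next to the input file
        set fhout [open [file rootname $fname]_${startrec}_[expr {$startrec+$setsize-1}][file extension $fname] w]
        # Copy the next $setsize records verbatim; an error (e.g. end of input) ends the run
        if {[catch {molfile copy $fhin $fhout $setsize}]} {
                close $fhout
                exit
        }
        close $fhout
        incr startrec $setsize
}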
-------
 
The above is a really simple script (not user-proofed, no parameter checking) for the CACTVS toolkit (www.xemistry.com/academic). Run it with the generic script interpreter from the package as
 
csts -f script.tcl filename.sdf setsize
 
The script will output a set of files such as "myfile_1_100.sdf", "myfile_101_200.sdf", etc. (for a set size of 100) in the same directory as the source file.
 
This script:
a) Processes the above sample file (or any other input file) without changing a single byte in the split records, similar to line-copying awk scripts etc. The record copy function does not decode and re-encode the data; it just keeps an eye on the passing data to detect proper record boundaries.
b) Does not need to know anything about the input file format. It will autodetect the format (independent of the suffix) and work with any supported multi-record format.
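
For readers without the toolkit, the single-pass batching and file naming of the script can be sketched in Python. This is only an illustration of the loop structure, not a replacement for the record copy function: it assumes the caller supplies records that were already split correctly, and the name write_batches is hypothetical. Unlike the script, it does not create an output file once the input is exhausted.

```python
import itertools
from pathlib import Path
from typing import Iterable, List


def write_batches(records: Iterable[str], source: str, set_size: int) -> List[Path]:
    """Write records in groups of set_size to <root>_<first>_<last><ext>."""
    if set_size < 1:
        raise ValueError("set_size must be at least 1")
    src = Path(source)
    remaining = iter(records)
    written: List[Path] = []
    start = 1
    while True:
        # Take the next set_size records; the input is read only once.
        batch = list(itertools.islice(remaining, set_size))
        if not batch:
            break
        end = start + set_size - 1  # nominal range, as in the Tcl script
        out = src.with_name(f"{src.stem}_{start}_{end}{src.suffix}")
        # newline="" keeps the record text exactly as it was passed in.
        with open(out, "w", newline="") as handle:
            handle.writelines(batch)
        written.append(out)
        start += set_size
    return written
```

The last file is named after its nominal range even when it holds fewer records, which mirrors the naming of the script above.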
 
 

W. D. Ihlenfeldt
Xemistry GmbH
wdi-x-xemistry.com
 

 


From: owner-chemistry-x-ccl.net [mailto:owner-chemistry-x-ccl.net]
Sent: Monday, March 19, 2007 10:36 AM
To: Ihlenfeldt, W.d.
Subject: CCL: program to split sdf file

Hi,

Unix has loads of ways of doing this, some of which have already been suggested.
But if you want a more graphical way of doing things, then you might want to look at ChemAxon's Instant JChem (http://www.chemaxon.com/product/ijc.html), which lets you easily view and query chemical data imported from an SD file.

The data can be queried (in your case, just for IDs 1-3,000 etc., but much more complex queries are possible) and the results then exported to an SD file.

And yes, it is free!
Alex

Fan,Huajun hjfan^^^pvamu.edu wrote:

Hi, does anyone know any programs (preferably free) that can split a big sdf file into smaller files? I have an sdf file containing 30,000 molecules and want to run DOCK5 on it. It is too big even to read through. I want to split it into 10 smaller files that contain 3,000 each. Is it possible? The newest version of Babel does not seem to offer this split function for the SDF format.

Thanks in advance.

Hua-Jun