The SDF Toolkit in Perl 5

Introduction

SDF (structure-data file) is a common file format developed by Molecular Design Limited to handle a list of molecular structures associated with properties. The file format has been published (Dalby et al., 1992).
The purpose of this SDF toolkit is to provide functions to read and parse SDFs, filter them, and add or remove properties. It can also read comma-separated values (CSV) tables containing new fields to be added to the SDF. A typical application is to add calculated Log P values or biological data exported from a spreadsheet. The new SDF, including its new data fields, can then be displayed with e.g. ChemFinder, the CACTVS system browser csbr, and probably many other programs.
The SDF toolkit is written in Perl 5, a free, widely available scripting language.
One useful application (at least for me) has been written with this toolkit: "add_prop_sdf". This script reads an SDF, adds properties from a CSV file, and prints out the new SDF. There is no GUI: it is a batch-mode program.
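
The idea of merging CSV columns into an SDF can be sketched in plain Perl, without the toolkit. This is not how add_prop_sdf is implemented. The sub name add_properties is hypothetical, and the sketch assumes that the first CSV column holds the molecule name found on the first line of each SDF record, that no CSV field contains a comma, and that every record ends with a "$$$$" line.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the SDF toolkit.
# $sdf_text: a whole SDF as one string; $csv_text: a CSV with a header line.
sub add_properties {
    my ($sdf_text, $csv_text) = @_;

    my @rows   = grep { /\S/ } split /\r?\n/, $csv_text;
    my @header = split /,/, shift @rows;
    my %row_for;                            # molecule name => array ref of values
    for my $row (@rows) {
        my @values = split /,/, $row, -1;
        $row_for{ $values[0] } = \@values;
    }

    my $out = '';
    # Each SDF record ends with a line containing only "$$$$".
    for my $record (split /^\$\$\$\$\r?\n/m, $sdf_text) {
        next unless $record =~ /\S/;
        my ($name) = $record =~ /\A([^\r\n]*)/;
        if (my $values = $row_for{$name}) {
            for my $i (1 .. $#header) {
                my $value = defined $values->[$i] ? $values->[$i] : '';
                # A data item is a "> <FIELD>" header, the value, a blank line.
                $record .= "> <$header[$i]>\n$value\n\n";
            }
        }
        $out .= $record . "\$\$\$\$\n";
    }
    return $out;
}

# Toy record (connection table omitted) and made-up Log P value.
my $sdf = "mol_1\n\n\nM  END\n\$\$\$\$\n";
my $csv = "name,LogP\nmol_1,1.23\n";
print add_properties($sdf, $csv);
```

Records without a matching CSV row are copied unchanged.
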
Also of interest is the script select_sdf, which can be used to extract specific records from an SDF.
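
Record extraction relies on the "$$$$" line that terminates every SDF record. A minimal sketch follows, assuming records are selected by their 1-based position in the file. The selection criteria actually supported by select_sdf are not described here, and select_records is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper, not the toolkit's select_sdf.
# Returns the records whose 1-based positions are listed in @wanted.
sub select_records {
    my ($sdf_text, @wanted) = @_;
    my %is_wanted = map { $_ => 1 } @wanted;
    my @selected;
    my $position = 0;
    # Each match is one record, with its "$$$$" terminator still attached.
    while ($sdf_text =~ /\G(.*?^\$\$\$\$\r?(?:\n|\z))/gms) {
        $position++;
        push @selected, $1 if $is_wanted{$position};
    }
    return @selected;
}

# Three toy records (connection tables omitted); keep the first and third.
my $sdf = join '', map { "$_\n\n\nM  END\n\$\$\$\$\n" } qw(mol_1 mol_2 mol_3);
print select_records($sdf, 1, 3);
```
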
The SDF toolkit is freely available under the GNU public license (see the file LICENSE.txt).

Downloading

The kit can be downloaded here (~820 KB).

Installation

You'll need Perl 5 or above installed on your system. Unfortunately, I have tested the toolkit only on Unix systems (Linux, IRIX and OpenStep). Thanks to the wide availability of Perl (www.perl.com), it should be easy to run the toolkit on other platforms such as Win32 and Mac: the SDF toolkit does not contain any platform-specific features.

The toolkit is distributed as a tar archive compressed with compress. To extract the archive, use the following standard command:

uncompress < sdf_toolkit.tar.Z | tar xfv -

Unix installation

A)

Using your shell, change the working directory to the installation directory and type the command:

perl test_sdf_fields < sdf_fields.txt

If the output is:

fields => ARRAY(0x205c0)
> <Formula> (11)
C14H22O2

> <BOILING.POINT> (MD-08974) FROM ARCHIVES
-53.4

> <Formula> (11)
C14H22O2

> <MolWeight> (11)
222.33

...

go to point C). If not, there was a problem with the installation: go to point B).

B)

Check if perl is correctly installed on your system.
In your shell, type
perl -v

If you get a message like:

perl: Command not found.

you are out of luck: you'll need to install Perl 5 or change your shell's $PATH variable. See your system administrator.
If you get a message like:

This is perl, version 5.001
you are OK. Pay attention to the version number: it must be 5 or higher. On some Unix systems (e.g. IRIX), one has to type "perl5" to get the right version. If your version is lower than 5, you'll need to install a newer version.
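
The same check can be made from inside a script. A minimal sketch, relying on Perl's built-in variable $], which holds the version of the running interpreter as a number (5.001 for the output shown above):

```perl
# $] is the version of the interpreter that is running this script.
if ($] >= 5) {
    printf "Perl %s is recent enough for the toolkit\n", $];
} else {
    die sprintf("Perl %s is too old: version 5 or above is required\n", $]);
}
```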

C)

Your perl installation seems to be OK. The toolkit comes with a small test suite. To run it, type
make

The toolkit is OK if the make command does not stop with an error message.

D)

To get a better idea of the toolkit's capabilities, type

perl add_prop_sdf -help

The output of this command describes a complete example with detailed explanations.

E)

The script add_prop_sdf can be installed on a Unix system in such a way that it can be run from any directory.
1) type
which perl
to know where your perl executable resides. A typical output is:
/usr/bin/perl
2) The SDF toolkit contains a set of packages (*.pm) that need to be installed in a standard directory. Which one to use depends on your Perl installation.
To find out, type:
perl -e 'print join("\n", @INC), "\n"'
This command prints the list of directories where Perl packages can be installed. On my system, the output is a single directory, /usr/lib/perl5. If you wish, the packages can be installed in a non-default directory (see point 4). Copy all the *.pm files into the directory you've selected.
3) Edit the file add_prop_sdf.
The first line shows which perl executable will be used to interpret the script. The default is:
#!/usr/bin/perl
This first line must match the full path obtained in 1).
4) If the package files (*.pm) have been installed in a non-default directory, edit the first line of the script and add an -I option giving the installation directory of the package files, e.g.:
#!/usr/bin/perl -I/home/brunob/lib/perl
(assuming here that the *.pm files have been copied to /home/brunob/lib/perl)
5) Copy the edited file add_prop_sdf into a directory that your shell searches for programs (type "echo $PATH" to list these directories).
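
To verify that Perl can actually find the installed packages, a small script can search @INC. The sketch below also shows the standard "use lib" pragma, which adds a directory to @INC from inside a script as an alternative to the -I option of point 4. The helper find_package_file is hypothetical, and MDL_sdf.pm is only assumed to be the name of the file holding the MDL_sdf class.

```perl
use strict;
use warnings;

# Alternative to the -I option: add the package directory inside the script.
# The path is the example directory used in point 4.
use lib '/home/brunob/lib/perl';

# Hypothetical helper: return the first directory of @INC that contains
# the given package file, or undef when the file is not found.
sub find_package_file {
    my ($file) = @_;
    for my $dir (@INC) {
        next if ref $dir;                 # skip code-reference hooks in @INC
        return $dir if -f "$dir/$file";
    }
    return;
}

# Assumed file name for the MDL_sdf class.
my $dir = find_package_file('MDL_sdf.pm');
print defined($dir) ? "MDL_sdf.pm found in $dir\n"
                    : "MDL_sdf.pm not found in \@INC\n";
```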

For Perl 5 programmers

The SDF toolkit is a set of packages and classes providing a high level of abstraction. In other words, it is relatively easy to write a small script that manipulates, filters, or combines SDF and CSV files. The file makefile, which contains the commands of the test suite, runs a series of small scripts that test a subset of the SDF toolkit's functionality. These scripts can therefore be used as examples to learn how to use the toolkit. Sorry, there is no documentation yet.

The script test_sdf_fields shows the basics of creating objects from input files. The script extract_prop_sdf shows a simple loop that processes each SDF entry read from the standard input (STDIN).
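
Since the toolkit's own API is not documented here, the following sketch shows the general shape of such a loop in plain Perl rather than with the toolkit classes. The subs data_items and print_field are hypothetical. Records are read one at a time by setting the input record separator $/ to the "$$$$" terminator line, and data items are assumed to follow the usual "> <FIELD>" header, value, blank line layout.

```perl
use strict;
use warnings;

# Hypothetical helper, not the toolkit's API: parse the data items of one
# SDF record into a hash reference (field name => value).
sub data_items {
    my ($record) = @_;
    my %value_for;
    # Header line "> ... <FIELD> ...", then value lines up to a blank line.
    while ($record =~ /^>[^\n]*?<([^>\n]+)>[^\n]*\n(.*?)\r?\n\r?\n/gms) {
        $value_for{$1} = $2;
    }
    return \%value_for;
}

# Print the value of one field for every record read from a filehandle.
sub print_field {
    my ($fh, $field) = @_;
    local $/ = "\$\$\$\$\n";              # one SDF record per read
    while (my $record = <$fh>) {
        my $items = data_items($record);
        print "$items->{$field}\n" if defined $items->{$field};
    }
    return;
}

# Demonstration on an in-memory record (connection table omitted).
my $sdf = "mol_1\n\n\nM  END\n"
        . "> <Formula> (11)\nC14H22O2\n\n"
        . "> <MolWeight> (11)\n222.33\n\n"
        . "\$\$\$\$\n";
open my $fh, '<', \$sdf or die "cannot open in-memory file: $!";
print_field($fh, 'MolWeight');
close $fh;
```

To process the standard input instead, call print_field(\*STDIN, 'MolWeight').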

All the classes of the SDF toolkit derive from one root class (HashObject) a la Smalltalk.
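
For readers new to Perl 5 objects, here is a minimal sketch of a hash-based root class and one derived class. It only illustrates the general pattern. The real HashObject may look quite different, and the names Demo_HashObject, Demo_Record, get, set and describe are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical root class: every object is a blessed hash of attributes.
package Demo_HashObject;

sub new {
    my ($class, %attributes) = @_;
    my $self = { %attributes };
    return bless $self, $class;
}

sub get { my ($self, $name) = @_; return $self->{$name}; }
sub set { my ($self, $name, $value) = @_; $self->{$name} = $value; return $self; }

# A derived class inherits new, get and set through @ISA.
package Demo_Record;
our @ISA = ('Demo_HashObject');

sub describe {
    my ($self) = @_;
    return ref($self) . ' with fields: ' . join(', ', sort keys %$self);
}

package main;

my $record = Demo_Record->new(Formula => 'C14H22O2', MolWeight => 222.33);
print $record->describe, "\n";
```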

One of the most useful classes is MDL_sdf. Alternatively, the class MDL_sdf_non_parsed_molecule can be used in place of MDL_sdf; it runs about 5 times faster because the molecular data of the SDF is not completely parsed.

The CSV_table class can use a fast access method if one knows in advance the key to be searched in the table. Look in the test script test_quick_csv for "readFromInputFileUsingQuickKey".
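
Why a key known in advance speeds up access can be illustrated with a plain hash index. This is a sketch of the general technique, not the CSV_table implementation. The sub index_csv_by_key is hypothetical, and the code assumes a header line and fields without embedded commas or quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: read a CSV once and build a hash index
# (key value => hash ref of the row) on the named column.
sub index_csv_by_key {
    my ($fh, $key_column) = @_;
    my $header_line = <$fh>;
    die "empty CSV input\n" unless defined $header_line;
    $header_line =~ s/\r?\n\z//;
    my @header = split /,/, $header_line;
    die "no column named $key_column\n"
        unless grep { $_ eq $key_column } @header;

    my %row_for;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;
        next unless length $line;
        my %row;
        @row{@header} = split /,/, $line, -1;   # column name => field value
        $row_for{ $row{$key_column} } = \%row;
    }
    return \%row_for;
}

# Made-up table; each lookup is then a single hash access.
my $csv = "name,LogP\nmol_1,1.23\nmol_2,0.45\n";
open my $fh, '<', \$csv or die "cannot open in-memory file: $!";
my $row_for = index_csv_by_key($fh, 'name');
close $fh;
print "LogP of mol_2: $row_for->{mol_2}{LogP}\n";
```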

Arrays and hash tables are usually passed by reference and not by value.
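
In Perl 5 this means putting a backslash in front of the array or hash at the call site and dereferencing inside the sub. A short sketch with a hypothetical sub add_field (not a toolkit function):

```perl
use strict;
use warnings;

# Hypothetical sub: receives a hash reference and an array reference, so
# the caller's data is not copied and can be modified in place.
sub add_field {
    my ($fields, $order, $name, $value) = @_;
    $fields->{$name} = $value;            # dereference the hash
    push @$order, $name;                  # dereference the array
    return;
}

my %fields = (Formula => 'C14H22O2');
my @order  = ('Formula');
add_field(\%fields, \@order, 'MolWeight', 222.33);   # "\" takes a reference

print join(', ', map { "$_=$fields{$_}" } @order), "\n";
```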

Problems and limitations

The SDF toolkit is quite strict about the syntactical correctness of the input files. Some programs export SDF files that are not totally compliant with the published standard (Dalby et al., 1992). In some cases, the SDF toolkit might generate relatively cryptic error messages.
Using big CSV tables can consume a large amount of memory.
Only three people, including myself, have used the toolkit so far, so expect some rough edges.

Reference

Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, Laufer J
"Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited"
Journal of Chemical Information and Computer Sciences 32(3): 244-255, May-Jun 1992

Bruno Bienfait, Ph. D.            Laboratory of Medicinal Chemistry
                                  National Cancer Institute
Email : brunob@helix.nih.gov      National Institutes of Health
Phone : (301) 402-3111            Building 37, Room 5B20
Fax   : (301) 496-5839            Bethesda Maryland 20892 , USA
WWW   : http://www.brunob.org

Modified: Fri Jul 2 18:07:11 1999 GMT