SDF Toolkit
The SDF Toolkit in Perl 5
(This is a copy of http://cactus.cit.nih.gov/SDF_toolkit/)
Introduction
SDF, or Structure Data File, is a common file format developed
by Molecular Design Limited (MDL)
to handle a list of molecular structures with associated properties. The
file format has been published (Dalby et al., J. Chem. Inf. Comput. Sci.
1992, 32, 244-255).
The purpose of this SDF toolkit is to provide functions to read, parse and
filter SDFs, and to add or remove properties. It can also read comma-separated
value (CSV) tables that contain new fields to be added to the SD file.
A typical application is to add calculated Log P values or biological data
exported from a spreadsheet. The new SDF, including the new data fields, can
then be displayed with, e.g., ChemFinder, the CACTVS system browser csbr,
and probably many other programs.
The SDF toolkit is written in Perl 5,
a free, widely available, scripting language.
One useful application (at least for me) has been written with this
toolkit: "add_prop_sdf". This script reads an SDF, adds
properties from a comma-separated values (CSV) file and prints out the
new SDF file. No GUI here, it's a batch mode program.
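To make the idea concrete, here is a minimal shell sketch of what "adding a property from a CSV file" means at the file level. It is not add_prop_sdf and does not use the toolkit. It assumes a two-column CSV (key,value) and assumes that the key is the first line of each SDF record (the molecule name line). The function name add_sdf_field is hypothetical. Run perl add_prop_sdf -help for the real tool's options.

```shell
# Sketch only: NOT the toolkit's add_prop_sdf.
# Assumptions: CSV rows look like "key,value"; the key matches the first
# line of an SDF record. add_sdf_field is a hypothetical helper name.
add_sdf_field() {
    # $1 = CSV file, $2 = name of the new data field, $3 = SDF file
    awk -F, -v field="$2" '
        NR == FNR { value[$1] = $2; next }       # first file: load the CSV rows
        !in_record { key = $0; in_record = 1 }   # first line of a record is the name
        $0 == "$$$$" {                           # record delimiter reached
            if (key in value) { printf "> <%s>\n%s\n\n", field, value[key] }
            in_record = 0
        }
        { print }                                # copy every SDF line unchanged
    ' "$1" "$3"
}
```

A call such as add_sdf_field logp.csv LogP input.sdf > output.sdf would write a new data item just before the $$$$ delimiter of every record whose name appears in the CSV. The sketch requires a non-empty CSV file and does no validation of the SDF syntax.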
Also of interest is the script select_sdf which can be used
to extract specific records of an SDF. Random selection of records
from an SDF can be made with the help of the gen_rnd script.
The SDF_toolkit is freely available under the GNU General Public License.
Comments, criticism, suggestions and bug reports are welcome.
New release 1.11
Click here to read a description of
the changes.
Downloading
The latest release of the kit can be downloaded here or from
http://cactus.nci.nih.gov/SDF_toolkit/SDF_toolkit.tar.Z
(~ 1.3 MB).
The current release is 1.11.
NEW:
A Win32 version is available as a .zip archive (~ 1.1 MB)
here or at:
http://cactus.cit.nih.gov/SDF_toolkit/SDF_toolkit.zip.
Installation
You'll need a recent version of Perl (5 or above) installed
on your system. Unfortunately, I have tested the toolkit only on Unix systems
(Linux, IRIX and OpenStep). Thanks to the wide availability of Perl,
it should be easy to run the toolkit on other platforms
such as Win32 and Mac. The SDF toolkit does not contain any features specific
to a particular platform.
The toolkit is distributed as a tar archive compressed with
compress. To extract the archive, use the following standard command:
uncompress < SDF_toolkit.tar.Z | tar xfv -
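Some systems do not provide the uncompress command. The sketch below wraps the extraction step and falls back to gzip -dc, which can also decode files written by compress. The function name extract_toolkit and the default archive name are assumptions made for illustration.

```shell
# extract_toolkit is a hypothetical helper, not part of the toolkit.
extract_toolkit() {
    # $1 = path of the compressed tar archive (assumed default below)
    archive=${1:-SDF_toolkit.tar.Z}
    if [ ! -r "$archive" ]; then
        echo "cannot read $archive" >&2
        return 1
    fi
    if command -v uncompress >/dev/null 2>&1; then
        uncompress < "$archive" | tar xvf -
    else
        # gzip can also decompress the .Z format produced by compress
        gzip -dc < "$archive" | tar xvf -
    fi
}
```

Calling extract_toolkit without arguments in the download directory unpacks the archive into the current directory.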
Unix installation
A)
Using your shell, change the working directory to the installation
directory and type the command:
perl test_sdf_fields < sdf_fields.txt
If the output is:
fields => ARRAY(0x205c0)
> <Formula> (11)
C14H22O2
> <BOILING.POINT> (MD-08974) FROM ARCHIVES
-53.4
> <Formula> (11)
C14H22O2
> <MolWeight> (11)
222.33
...
go to point C). If not, there was a problem with the installation:
go to point B).
B)
Check if Perl is correctly installed on your system.
In your shell, type
perl -v
If you get a message like:
perl: Command not found.
you are out of luck: you'll need either to install Perl 5 or to
add its directory to your shell's $PATH variable. See your system administrator.
If you get a message like:
This is perl, version 5.001
you are OK. Pay attention to the version number: it must be
5 or higher. On some Unix systems (e.g. IRIX), one has to type "perl5"
to get the right version. If your version is lower than 5, you'll need to install
a newer version. Distributions of Perl, in both binary
executable and source code format, can be downloaded from, e.g.,
http://www.perl.com.
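The check described in point B) can be automated. The following sketch tries the command names perl and perl5 in turn and uses Perl's built-in $] variable (the interpreter version as a number) to test for version 5 or higher. The function name find_perl5 is hypothetical.

```shell
# find_perl5 is a hypothetical helper: it prints the first command name,
# among "perl" and "perl5", that reports a version of 5 or higher.
find_perl5() {
    for candidate in perl perl5; do
        if command -v "$candidate" >/dev/null 2>&1 &&
           "$candidate" -e 'exit($] >= 5 ? 0 : 1)' 2>/dev/null; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no Perl 5 interpreter found in PATH" >&2
    return 1
}
```

If the function prints perl5 rather than perl, use that name in place of perl in the commands of this section.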
C)
Your Perl installation seems to be OK. The toolkit comes with
a small test suite. To run it, type
make
The toolkit is OK if the make command does not stop with an
error message.
D)
To get a better idea of the toolkit's capabilities, type
perl add_prop_sdf -help
The output of this command describes a complete example with
detailed explanations.
E)
The script add_prop_sdf can be installed on a Unix
system in such a way that it can be run from any directory.
1) Type
which perl
to find out where your Perl executable resides. A typical output is:
/usr/bin/perl
2) The SDF toolkit contains a set of packages (*.pm)
that need to be installed in a standard directory. Which one to use will
depend on your Perl installation.
To find out, type:
perl -e 'print join("\n", @INC), "\n"'
This command prints the list of directories that Perl searches for
packages. On my system, the output is a single directory, /usr/lib/perl5.
If you wish, the packages can be installed in a non-default directory
(see point 4). Copy all the *.pm files into the directory you've
selected.
3) Edit the file add_prop_sdf.
The first line shows which Perl executable is going to be used
to interpret the script. The default is:
#!/usr/bin/perl
This first line must match the full path obtained in 1).
4) If the package files (*.pm) have
been installed in a non-default directory, edit the first line of the script
and add an -I option for the package files installation directory, such as:
#!/usr/bin/perl -I/home/brunob/lib/perl
(assuming here that the *.pm files have been copied to,
e.g., /home/brunob/lib/perl)
5) Copy the edited file add_prop_sdf into a directory
which your shell reads to find programs (type "echo $PATH"
to find the list of directories)
6) Type rehash
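Steps 3) and 4) amount to rewriting the first line of the script. A minimal sketch of that edit follows. It assumes, as stated above, that the first line of the script is the #! line. The function name set_perl_shebang is hypothetical, and the paths in the usage note are only the examples used in this section.

```shell
# set_perl_shebang is a hypothetical helper, not part of the toolkit.
set_perl_shebang() {
    # $1 = script to edit, $2 = full path of perl (from "which perl"),
    # $3 = optional directory holding the *.pm files (adds a -I option)
    script=$1
    perl_path=$2
    libdir=${3:-}
    line="#!$perl_path"
    if [ -n "$libdir" ]; then
        line="$line -I$libdir"
    fi
    # Write the new first line followed by the rest of the script, then
    # copy the result back so the file keeps its permissions.
    { printf '%s\n' "$line"; tail -n +2 "$script"; } > "$script.tmp" &&
        cat "$script.tmp" > "$script" &&
        rm -f "$script.tmp"
}
```

For example, set_perl_shebang add_prop_sdf /usr/bin/perl /home/brunob/lib/perl produces the first line shown in step 4).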
Note: Alternatively, one can set the environment variable PERL5LIB
to point to the directories where the SDF_toolkit is installed. In csh and
tcsh, the command to set an environment variable is setenv.
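For Bourne-type shells (sh, bash, ksh), which have neither setenv nor rehash, the equivalent setup can be sketched as follows. The function name add_perl5lib is hypothetical, and the directory is the example path used in step 4).

```shell
# add_perl5lib is a hypothetical helper for Bourne-type shells.
add_perl5lib() {
    # $1 = directory holding the toolkit's *.pm files
    case ":${PERL5LIB:-}:" in
        *":$1:"*) ;;                                  # already listed
        *) PERL5LIB="$1${PERL5LIB:+:$PERL5LIB}" ;;    # prepend the directory
    esac
    export PERL5LIB
}

add_perl5lib /home/brunob/lib/perl   # example path, adjust to your system
hash -r                              # Bourne-shell counterpart of rehash
```

After this, the directory should appear in the list printed by the perl -e command of step 2), because Perl adds the PERL5LIB directories to @INC.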
For Perl 5 programmers
The SDF toolkit is a set of packages and classes providing a high level of
abstraction. In other words, it is relatively easy to write a small
script which manipulates, filters or combines SDF and CSV files. The file makefile,
which includes the commands of the test suite, runs a series of small
scripts which test a subset of the SDF toolkit's functionality. Thus, these
scripts can be used as examples to learn how to use the toolkit. Sorry,
there is not much documentation yet.
The script test_sdf_fields shows the basics of creating objects
from input files. The script extract_prop_sdf shows a simple loop
to do some processing on each SDF entry read from standard input (STDIN).
All the classes of the SDF toolkit derive from one root class (HashObject)
a la Smalltalk.
One of the most useful classes is MDL_sdf. Alternatively, the class
MDL_sdf_non_parsed_molecule can be used in place of MDL_sdf
(MDL_sdf_non_parsed_molecule runs about 5 times faster because the
molecular data of the SDF are not completely parsed).
The CSV_table class can use a fast access method if one knows in
advance the key to be searched in the table. Look in the test script test_quick_csv
for "readFromInputFileUsingQuickKey".
Arrays and hashes are usually passed by reference and not by value.
History
Version 1.11: Dec 31, 2002: Minor bug fixes,
fixes of inconsistencies between documentation and scripts.
Win32 version: Feb 5 2001.
Thanks to Rick Sarvas for providing us with the Win32 archive.
Version 1.10: Jan 27 2000:
More documentation.
A file describing working examples was added.
It shows how we generated downloadable bulk files from the NCI Open Database
structures and biological test data (cancer and AIDS, see http://cactus.nci.nih.gov/ncidb2/download.html).
Working examples of custom filters for select_sdf were added (see the directory
Examples/).
Commands to delete properties were added (see test_remove_sdf_fields2).
Version 1.05: Nov 18 1999:
New tools to manage NCI data were created and/or updated.
select_sdf has a new option (-perlfile) to dynamically load
a custom filter. Try select_sdf -help.
add_prop_sdf has three new options: -perlclass, -noskip, -silent
sort_sdf : new tool to sort all entries
append_sdf : new tool to concatenate SD files while avoiding duplicates
Version 1.04: Sep 8 1999:
The number of lines in the properties block is not checked if a version
stamp is present (v2000).
Version 1.03: Aug 20 1999:
New tool: shuffle_ct to randomize the atom ordering in the connection
table. This is useful to check if calculated properties are independent
of atom ordering. Note that shuffle_ct does not correctly handle
the M CHG and M RAD lines. select_sdf has a new option (-not)
to invert the selection. A new class for reading tables containing
space-separated values has been added.
Version 1.02: July 13 1999:
Fixed the problem with the "Assertion failed at Table.pm line 251"
message. The problem is due to a bug in perl 5.003. Use a more recent version
of perl or download this new version of the toolkit.
Version 1.01: July 12 1999:
Bug fix to support the presence of an "Atom list block" and a "Stext block" in
the connection table (MDL_Molecule class). Refactoring of
the MDL_sdf and MDL_sdf_non_parsed_molecule classes. These
changes affect only the libraries and not the scripts. Addition of gen_rnd,
a script to generate random numbers, of use to randomly select entries
from an SD file.
Version 1.0: July 2 1999
Problems and limitations
The SDF toolkit is quite strict about the syntactical correctness
of the input files. Some programs export SDF files that are not totally
compliant with the published standard (Dalby et al., 1992). In some cases,
the SDF toolkit might generate relatively cryptic error messages.
Using big CSV tables can consume large amounts of memory.
On some Linux systems (e.g. Red Hat 6.0), the test suite fails because
0.0000 values are written as -0.0000.
Reference
Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, Laufer J.
"Description of several chemical structure file formats used by computer programs
developed at Molecular Design Limited."
Journal of Chemical Information and Computer Sciences 32(3): 244-255,
May-Jun 1992.
This toolkit was written by Bruno Bienfait, Ph.D., while he was a postdoc in the
Laboratory of Medicinal Chemistry at the National Cancer Institute,
Bethesda, MD, USA.
You can contact him (e.g. for bug reports) at
brunob@helix.nih.gov or bruno@brunob.org.
Additional programs written by Bruno Bienfait can be found at
http://www.brunob.org.
This page was last changed 31-Dec-2002.