This document describes some of the operations performed to generate the downloadable bulk files from the NCI Open Database structures and biological test data (cancer and AIDS; see http://cactus.cit.nih.gov/ncidb/download.html for more information). Its aim is to show how to combine the tools of the SDF_Toolkit and to provide tricks and recipes through real examples. All of these examples should be run on a Unix system.
All input files are based on the publicly and freely available data from NCI's Developmental Therapeutics Program (DTP, http://dtp.nci.nih.gov).
We collected the structures and biological data from DTP (cancer data as of August 1999, AIDS data as of October 1999), combined them where applicable, and generated MDL SD files from this information.
The SDF_Toolkit can be downloaded at http://cactus.cit.nih.gov/SDF_toolkit/index.html . You'll need version 1.06 or later.
Merge two SD files and remove duplicates. Duplicates are recognized by a shared identifier (a non-chemical data entry in the SD files). The identifier (here: the NSC number) must be present in both input files.
- nciopen_LMCH_aug99_0D.sdf : August 1999 SD file without 3D/2D and stereo information
- aids_o99_chemical_structs.sdf : chemical structures from the DTP site for which AIDS data is available
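Before merging, it can help to see which identifiers the two inputs share. The following is a minimal sketch using only standard Unix tools, not the SDF_Toolkit itself. The helper name sdf_field and the toy files a.sdf and b.sdf are made up for illustration, and the toy entries omit the structure block that a real SD entry carries. It assumes each NSC value sits on the line right after its "> <NSC>" data header.

```shell
# Hypothetical helper (not part of the SDF_Toolkit): print the value of one
# data field for every entry, assuming the value is on the line after the header.
sdf_field() {
    awk -v tag="$1" '
        /^>/ && index($0, "<" tag ">") { want = 1; next }  # data header line
        want { print; want = 0 }                           # first value line
    ' "$2"
}

workdir=$(mktemp -d)

# Toy inputs: data items only, each entry terminated by a $$$$ line.
printf '>  <NSC>\n%s\n\n$$$$\n' 1 2 3 > "$workdir/a.sdf"
printf '>  <NSC>\n%s\n\n$$$$\n' 2 4 > "$workdir/b.sdf"

# comm needs sorted input, so sort each identifier list first.
sdf_field NSC "$workdir/a.sdf" | sort -u > "$workdir/a.ids"
sdf_field NSC "$workdir/b.sdf" | sort -u > "$workdir/b.ids"

# Identifiers present in both files, i.e. the duplicates a merge has to resolve.
shared=$(comm -12 "$workdir/a.ids" "$workdir/b.ids")
echo "$shared"
```

On real files the same two sort steps and the comm call apply unchanged; only the toy printf lines would be replaced by the actual input names.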
Merge two SD files and remove duplicates. Duplicates are recognized by a shared identifier (a non-chemical data entry in the SD files). The identifier must be present in both input files. Make a list of the new entries.
- nciopen_LMCH_aug99_0D.sdf : August 1999 SD file without 3D/2D and stereo information
- aids_o99_chemical_structs.sdf : chemical structures from the DTP site for which AIDS data is available
tail -n 2212 temp.list > 2212_oct99.list
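The tail step relies on the new entries sitting at the end of the identifier list. As a rough sketch under that assumption (the file names and NSC values below are invented for illustration), the count of new entries can be derived from the two list lengths, and comm offers an independent cross-check:

```shell
workdir=$(mktemp -d)

# Toy identifier lists (hypothetical values): the merged list keeps the
# old entries first and appends the new ones at the end.
printf '%s\n' 101 102 103 > "$workdir/old.list"
printf '%s\n' 101 102 103 250 260 > "$workdir/merged.list"

# Number of new entries = merged count minus old count.
n_old=$(wc -l < "$workdir/old.list")
n_merged=$(wc -l < "$workdir/merged.list")
n_new=$((n_merged - n_old))

# Same idea as the tail command above: keep the last n_new lines.
tail -n "$n_new" "$workdir/merged.list" > "$workdir/new.list"

# Cross-check with comm, which needs both inputs sorted the same way.
sort "$workdir/old.list" > "$workdir/old.sorted"
sort "$workdir/merged.list" > "$workdir/merged.sorted"
comm -13 "$workdir/old.sorted" "$workdir/merged.sorted" > "$workdir/new.check"

# Report whether the two ways of finding the new entries agree.
sort "$workdir/new.list" | cmp -s - "$workdir/new.check" && echo "lists agree"
```

If the two lists disagree, the merged list probably does not keep the new entries at the end, and the comm result is the safer one to use.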
Select a subset of an SD file using the NSC number as the identifier.
- aids_o99_chemical_structs.sdf : chemical structures from the DTP site for which AIDS data is available
- 2212_oct99.list : list file created in the previous recipe
Strip hydrogens and charges from the 3D structure files, then derive the 0D and 2D versions.
- 2212_oct99_3D.sdf : chemical structures file created in the previous recipe
- 689_aug99.list : a list of new entries for the August 1999 release.
- cancer_screened_a99_chemical_structs.sdf : file downloaded from the DTP WWW site. This file contains structures for which cancer cell data is available.
cactus_2d_nci 2212_oct99_0D.sdf | remove_stereo_sdf > 2212_oct99_2D.sdf
# Redo the same thing for the 689 file:
select_sdf -labelfile 689_aug99.list -property_name NSC < cancer_screened_a99_chemical_structs.sdf > 689_aug99_3D.sdf
remove_h_sdf < 689_aug99_3D.sdf | remove_charge_sdf | tee 689_aug99_3D_no_H.sdf | zero_sdf > 689_aug99_0D.sdf
cactus_2d_nci 689_aug99_0D.sdf | remove_stereo_sdf > 689_aug99_2D.sdf
tee is a standard Unix command that reads from standard input, writes to standard output, and saves a copy to a file. cactus_2d_nci is a Tcl script (not part of the SDF_Toolkit) that calculates 2D coordinates. This script makes use of the CACTVS system.
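The role tee plays in the pipelines above can be tried out without any SD files. This small sketch uses made-up text and a temporary directory; tr and wc merely stand in for the toolkit stages on either side of tee.

```shell
workdir=$(mktemp -d)

# Stage 1 upper-cases the text, tee saves that intermediate result to a file,
# and the last stage (wc -l) keeps consuming the same stream.
count=$(printf 'one\ntwo\nthree\n' | tr 'a-z' 'A-Z' | tee "$workdir/stage1.txt" | wc -l | tr -d ' ')

echo "lines seen by the last stage: $count"
cat "$workdir/stage1.txt"
```

This is the same pattern that saves 689_aug99_3D_no_H.sdf in the middle of a pipeline while zero_sdf continues to process the stream.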
Remove entries from the NCI files that have an NSC number greater than or equal to 900,000 (these are combinatorial library entries).
- open_397.mol : NCI data file released in March 1997
- 689_aug99_0D.sdf : supplemental structures from the August 99 release
- 2212_oct99_0D.sdf : supplemental structures from the October 99 release
- remove_900000.pm : a Perl module for the tool select_sdf
defined $sdf_entry || die "Assertion failed";
my $value = $sdf_entry->data_for_field_name("NSC");
defined $value || die "Assertion failed: undefined property";
# print STDERR $value, "\n";
return $value < 900000;  # Keep NSCs < 900000
}
1;
##################################
The special filter is loaded and compiled at run time.
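To make the selection rule concrete, here is a standalone sketch of the same cutoff written with awk instead of a select_sdf filter module. It is not how the toolkit works internally. The toy input holds data items only, and the sketch assumes the NSC value is on the line directly after its "> <NSC>" header.

```shell
workdir=$(mktemp -d)

# Toy input: data items only (real entries also carry a structure block).
printf '>  <NSC>\n%s\n\n$$$$\n' 740 900000 123456 > "$workdir/in.sdf"

# Buffer each entry until its terminator line, then print it only when the
# NSC value is below the cutoff. Entries without an NSC field are dropped
# here, whereas the Perl filter above dies on them.
awk -v cutoff=900000 '
    { rec = rec $0 "\n" }                        # collect the current entry
    /^>/ && /<NSC>/ { want = 1; next }           # NSC data header
    want { nsc = $0; want = 0 }                  # value line after the header
    /^\$\$\$\$/ {
        if (nsc != "" && nsc + 0 < cutoff + 0) printf "%s", rec
        rec = ""; nsc = ""
    }
' "$workdir/in.sdf" > "$workdir/out.sdf"

kept=$(grep -c '^\$\$\$\$' "$workdir/out.sdf")
echo "entries kept: $kept"
```

The comparison is strict, so an entry with NSC exactly 900000 is removed, matching the "greater than or equal to" wording of the recipe.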
Sort NCI files by NSC number
- open_397.sdf : NCI data file released in March 1997 (includes 3D information)
- 689_aug99_3D.sdf : supplemental structures from the August 99 release
- 2212_oct99_3D.sdf : supplemental structures from the October 99 release
sort_sdf can require a lot of memory because the whole input file is held in memory. For example, sorting the entire NCI database (about 250,000 entries with biological data added, a ~800 MB SD file) by NSC number required 1.5 GB of memory and about 20 minutes of computer time. This was done on galaxy.nih.gov, an SGI computer with 32 x 250 MHz R10000 processors and 8 GB RAM; only one CPU was used.
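Since a full sort is expensive, it may be worth checking first whether a file is already in NSC order. The sketch below does that with awk and sort -c, reading the file as a stream rather than loading it whole. The helper name nsc_sorted and the toy files are hypothetical, and the NSC value is assumed to follow its "> <NSC>" header on the next line.

```shell
workdir=$(mktemp -d)

# Toy inputs: data items only, one in ascending NSC order and one not.
printf '>  <NSC>\n%s\n\n$$$$\n' 5 40 300 > "$workdir/sorted.sdf"
printf '>  <NSC>\n%s\n\n$$$$\n' 40 5 300 > "$workdir/unsorted.sdf"

# Hypothetical helper: succeed only if the NSC values appear in ascending
# numeric order. sort -c checks the order without producing sorted output.
nsc_sorted() {
    awk '/^>/ && /<NSC>/ { getline; print }' "$1" | sort -n -c 2>/dev/null
}

if nsc_sorted "$workdir/sorted.sdf"; then echo "sorted.sdf: in order"; fi
if ! nsc_sorted "$workdir/unsorted.sdf"; then echo "unsorted.sdf: out of order"; fi
```

The same check can be run on the output of sort_sdf as a quick sanity test.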
- nciopen_LMCH_oct99_2D.sdf : NCI data file (includes 2D information)
- cancer_screened_gi50_a99 : cancer screen data from the DTP WWW site (August 1999 release)
- cancer_screened_lc50_a99
- cancer_screened_tgi_a99
Add AIDS and cancer cell data to an SD file in one operation.
- nciopen_LMCH_oct99_2D.sdf : NCI data file (includes 2D information)
- cancer_screened_gi50_a99.csv : comma-separated value file with a special format which matches the NCI_screen format. Each line contains all the data for one NSC entry.
- cancer_screened_lc50_a99.csv
- cancer_screened_tgi_a99.csv
- aids_ec50_oct99.csv
- aids_ic50_oct99.csv
- aids_conc_oct99.csv
-noskip : an option that keeps all entries, even those for which no biological data is available
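The difference between keeping every entry and keeping only entries with data can be pictured with the standard join command. This is only an analogy, not the toolkit's implementation: the file names and values are invented, and the sketch assumes a simplified data file whose first comma-separated column is the NSC number, which may not match the real NCI_screen layout.

```shell
workdir=$(mktemp -d)

# Hypothetical inputs: identifiers found in the SD file, and a data file
# whose first comma-separated column is assumed to be the NSC number.
printf '%s\n' 1 2 3 > "$workdir/nsc.list"
printf '%s\n' '1,5.2' '3,6.1' > "$workdir/screen.csv"

# Default join: only identifiers that have a data line survive.
with_data=$(join -t, "$workdir/nsc.list" "$workdir/screen.csv")

# -a 1 also keeps unmatched lines from the first file, which mirrors the
# keep-everything behaviour described for -noskip.
keep_all=$(join -t, -a 1 "$workdir/nsc.list" "$workdir/screen.csv")

printf 'with data only:\n%s\nkeep all:\n%s\n' "$with_data" "$keep_all"
```

In the keep-all result, NSC 2 appears without any data columns, much as an SD entry would be kept without added biological data fields.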
Add AIDS and cancer cell data to an SD file in one operation. Same as before, but now only the structures for which all biological data (AIDS and cancer cells) is available are kept.
- nciopen_LMCH_oct99_2D.sdf : NCI data file (includes 2D information)
- cancer_screened_gi50_a99.csv : comma-separated value file with a special format which matches the NCI_screen format. Each line contains all the data for one NSC entry.
- cancer_screened_lc50_a99.csv
- cancer_screened_tgi_a99.csv
- aids_ec50_oct99.csv
- aids_ic50_oct99.csv
- aids_conc_oct99.csv
The -noskip option is not used here, so entries without biological data are skipped.
Bruno Bienfait 1-11-2000