This module implements a set of very simple assignment tools, which may nevertheless prove to be useful. It is almost completely written in the Gifa macro language (the command find_dist is based on a Perl script), and as such can be fully adapted to your needs. Right now, it is principally aimed at protein and peptide assignment. Extending it to oligonucleotides and sugars is probably a simple matter of extending the basic residue data-bases. However, I have no idea whether it can help in the assignment of other kinds of organic molecules.
You will not find any fancy tools or automatic assignment here. The only help provided is a set of tools that let you visualise several spectra at the same time, add notes to peaks, draw lines to help the visual search for alignments, and store the information in several data-bases: one for assigned peaks, one for spins and one for spin systems (a spin system being simply a set of spins). For the moment, this module works only for 2D data-sets, and is aimed mostly at homonuclear spectroscopy (however, I'm sure it can be used for 2D heteronuclear spectroscopy).
However, thanks to the versatility of Gifa (calling UNIX from Gifa, creating/reading files, etc.), it is quite easy to adapt this framework to your own needs, for instance by calling your favorite automatic assignment tools from within this set-up.
A complete assignment is kept in a special directory, called a project. The project, which may reside anywhere on the disk, holds several files and directories used for storing information; it may also contain any file that the user wishes to keep.
In the project directory, you will typically find two files. The file parameters is a macro which is executed when selecting the project. It contains all the definitions and some basic environment variables. The file zoom_window contains the zoom window coordinates for the 'multi-zoom' tool (see the ZOOM command in the documentation).
Five mandatory directories also reside in the project.
The db directory holds the data-bases in dbm format, as described below. The primary sequence is also found in this directory, in a file called primary. Its format is as follows: one residue per line, in one-letter code.
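As an illustration of the primary file format, here is a minimal Python sketch that reads such a file. The function name read_primary and the decision to skip blank lines are assumptions made for this example; they are not part of Gifa.

```python
from pathlib import Path

def read_primary(path):
    """Return the sequence stored one residue per line, in one-letter code."""
    residues = []
    for line in Path(path).read_text().splitlines():
        code = line.strip()
        if code:                      # assumption: blank lines are ignored
            residues.append(code.upper())
    return residues
```

Calling read_primary on the primary file of a project would give the sequence as a list, whose positions (counted from 1) correspond to the index stored in the spin-system entries.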
The spectra and PDB directories hold respectively all the spectra and PDB files associated with the project. Typically, for space optimisation, links to actual experiment files will be stored here rather than the complete file.
The processing and constraint directories hold respectively the intensity curves and the constraint files generated by the Integration menu (see below).
Assignment information is stored as sets of peaks, spins and spin systems, with the following structure :
Except for the primary structure file presented above, there are 3 kinds of data-base files in the db directory. They are all in dbm format, and each of them is thus composed of two files, *.pag and *.dir, which should not be modified directly. The dbm format is a generic UNIX format for flat data-bases. For instance, these files can very easily be accessed from the Perl language. Nevertheless, the dbm format is not fully compatible among all UNIX platforms, and you should be careful with that (especially Linux users).
The name_of_experiment.pag and name_of_experiment.dir files hold the peak data-base for a given experiment. An entry in the peak data-base stores all the pertinent information for a given peak. Peak entries can be created by copying them from the peak-picker, or during the assignment process. Two pointers to the spin data-base are stored with each peak entry, pointing to the parent spins of this peak.
Spin.pag and Spin.dir form the spin data-base. A spin is stored as a chemical shift, a name, and the spin-system to which it belongs.
Finally, Spin_sys.pag and Spin_sys.dir form the spin-system data-base. For each entry, the spin-system type, the index in the primary sequence, and the list of the spins are stored.
In all the data-bases, entries are referred to by a numerical id that ranges from 1 to the highest value. The value of the highest id used is stored in a special entry indexed as "LARGEST". However, due to the very nature of the dbm format used for the file (and of the associative arrays used internally in Gifa), there is no need for the ids to be contiguous. So, if an entry is deleted, the numbering of the other entries is unaffected.
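The id convention can be sketched in Python as follows. A plain dictionary stands in for the dbm file, and the helper names new_entry and entry_ids, as well as the sample values, are hypothetical; only the "LARGEST" entry comes from the module itself.

```python
def new_entry(db, value):
    """Store value under the next free id and update the LARGEST entry."""
    new_id = int(db.get("LARGEST", 0)) + 1
    db[str(new_id)] = value
    db["LARGEST"] = str(new_id)       # updated only once the entry is stored
    return new_id

def entry_ids(db):
    """Return the existing ids in increasing order, skipping LARGEST."""
    return sorted(int(k) for k in db if k != "LARGEST")

spin = {}                             # plain dict standing in for a dbm file
for value in ("8.784 HN 10", "4.456 HA 10", "1.654 QB 10"):
    new_entry(spin, value)
del spin["2"]                         # deleting leaves the other ids untouched
new_entry(spin, "0.946 QD 10")        # the new entry gets id 4, not 2
```

Because a deleted id is never reused in this sketch, any pointer stored elsewhere (for instance in a peak entry) keeps designating the same spin.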
The build list is the main tool for progressing in the assignment work. The idea is to make a list of all the peaks within a spin-system (in the TOCSY sense). Once this list is complete, it is possible to promote the list to a new spin-system which is then entered in the data-base.
The build list is managed with a set of tools found in the graph tool menu. The marker tool lets you detect peak alignments, and create a spin for each alignment. The list can be printed or displayed directly on screen. And of course, the list can be promoted to a spin-system.
When entering the assignment mode, Gifa will set up an assignment environment with the basic menus, the Peak menus and 4 additional menus that give access to all the commands needed to perform spectral assignment. The macro env_att.g actually sets up everything for assignment.
This menu lets you create or select a project, and more generally perform all the operations global to the project.
Produces a short help for the assignment module, as a 'recipe' for using it. The complete assignment help contained in the HTML documentation can be called from this short help.
Creates a new project: a directory holding all the empty data-bases. You will also be prompted for the primary sequence of the studied protein. The primary sequence can be given either literally in 1-letter code, from a file (1- or 3-letter code), or from a PDB file.
Lets you select any previously created project. Only one project can be used at a time.
When selecting a project, the dialog box lets you either create a backup of the current state of the project (simply a tar file), recover from a previous backup (thus deleting the current state), or take no action.
After selection, the number of assigned systems is displayed on screen, then the multi zoom tool and the File Selector tool are opened.
A set of parameters is stored with the project; these parameters can be changed from here. You can define the alignment distance for spin assignment, the distance tolerance for mouse clicking, and several display parameters.
The assignment data-bases are permanently kept on file, however, in case of a program crash, the last modifications may be lost. Clicking here secures the last entries.
Copy the current backup file backup.tar to backup.tar.old, and store the current dbm files of the project in the archive file backup.tar, produced by the command tar.
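The backup step described above can be sketched in Python as follows. This is not the macro used by Gifa: the function name backup_project, the location of backup.tar at the top of the project directory, and the selection of files by their .pag and .dir suffixes are assumptions made for illustration.

```python
import shutil
import tarfile
from pathlib import Path

def backup_project(project_dir):
    """Rotate backup.tar to backup.tar.old, then archive the dbm files."""
    project = Path(project_dir)
    archive = project / "backup.tar"
    if archive.exists():
        # keep the previous archive as backup.tar.old
        shutil.copy2(archive, project / "backup.tar.old")
    with tarfile.open(archive, "w") as tar:
        for f in sorted((project / "db").iterdir()):
            if f.suffix in (".pag", ".dir"):   # the two files of each dbm base
                tar.add(f, arcname=f"db/{f.name}")
    return archive
```

With this scheme, only one older generation is kept: a second backup overwrites backup.tar.old with the archive made by the first one.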
All the spectra used in the project can be accessed from here. A spectrum can be loaded in memory; at the same time, the associated peak data-base is loaded. A spectrum can also be only shown (see the SHOWC command in the documentation), which displays it on screen without loading it in memory.
When loading a file in memory, the related peak data-base is opened, and the peak data-base of the previously displayed spectrum is closed. If the file loaded in memory has no associated peak data-base, a new one is created.
Note that you can also use the super2d tool (in the display menu) to display several spectra superimposed.
From here, you can add a spectrum to the list of the currently used spectra. "Adding" a spectrum consists in either copying or linking it into the dedicated directory (see above, File set-up). Linking sets a UNIX soft-link, which stores only the address of the file, thus saving a significant amount of disk space.
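The difference between copying and linking can be sketched in Python as follows. The function name add_spectrum and its arguments are hypothetical; the sketch only illustrates the two options, not the macro actually used.

```python
import os
import shutil

def add_spectrum(source, spectra_dir, link=True):
    """Copy or soft-link a spectrum file into the project spectra directory."""
    target = os.path.join(spectra_dir, os.path.basename(source))
    if link:
        # the link stores only the address of the file, not its content
        os.symlink(os.path.abspath(source), target)
    else:
        shutil.copy2(source, target)
    return target
```

A linked spectrum becomes unreadable if the original file is moved or deleted, which is the price paid for the saved disk space.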
Lets you add a PDB file to the list of the currently used PDB files, in the same way as for spectra (see the previous command 'Add spectra'). This PDB file list is used in the 'Find distance' command (see the utilities menu below).
Produce statistics on the amino acids contained in the primary sequence.
This command copies the content of the peak table (obtained with the Peak picking tool) to the peak data-base. The peaks will be there, but without assignment of course. This lets you load a first set of peaks, for instance the fingerprint region, from which the assignment work can proceed.
This command can be issued several times, at any moment during the assignment work, thus adding peaks to the assignment data-base.
Remove the unassigned peaks from the peak data-base. The unassigned peaks are those for which no spin has been assigned in F1 or F2.
Copy the peak data-base of a data-set of the spectra list to the peak data-base of another data-set of the list. The command asks for the permission to erase an already existing data-base.
Merges the peak table with the peak data-base of the assignment project.
For each peak of the peak table, the command looks for the closest element of the data-base within the click tolerance, and changes the spectral coordinates and intensity of this element to the peak coordinates and intensity. The assignment information of the dbm element is unchanged. If no data-base element is found within the tolerance, the peak is added to the data-base as an unassigned element.
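The merge logic can be sketched in Python as follows. The data layout (a dictionary of entries), the Euclidean distance in ppm, and the "unk" marker for an unassigned spin are assumptions made for this example; the macro may measure the distance differently.

```python
import math

def merge_peaks(table, db, tol):
    """Merge picked peaks (f1, f2, amp) into db, a dict id -> entry dict.

    tol is the click tolerance, assumed here to be a Euclidean distance in ppm.
    """
    for f1, f2, amp in table:
        best_id, best_dist = None, tol
        for pid, e in db.items():
            dist = math.hypot(e["f1"] - f1, e["f2"] - f2)
            if dist <= best_dist:
                best_id, best_dist = pid, dist
        if best_id is not None:
            # closest entry found: move it, keep its assignment untouched
            db[best_id].update(f1=f1, f2=f2, amp=amp)
        else:
            # nothing close enough: add an unassigned entry
            new_id = max(db, default=0) + 1
            db[new_id] = {"f1": f1, "f2": f2, "amp": amp,
                          "spin1": "unk", "spin2": "unk"}
    return db
```

In this sketch, a freshly added peak can itself be matched by a later peak of the table, so two picked peaks closer than the tolerance end up as a single entry.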
Use this entry if you want to quit the assignment module and restore the normal set-up.
This menu lets you graphically display and modify the assignment data-bases (peak, spin or spin-system data-bases).
This command displays on the current spectral zoom all the peaks in the current peak data base.
This command displays on the current spectral zoom all the unassigned peaks in the current peak database. The unassigned peaks are those for which the F1 or F2 spin has not been assigned.
This command displays on the current spectral zoom all the peaks satisfying criteria given by the user. Peaks can be selected on the value of the peak note, the spin notes, the residue numbers or types, the spin types and the peak intensity (maximum value or threshold). The criteria for peak notes, spin notes and spin types are tested as substrings of the corresponding parameters.
The criteria are combined according to a logical parameter: 'and' means that all the given criteria must be satisfied to display the peak, 'or' means that one satisfied criterion is sufficient to display the peak.
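The combination of criteria can be sketched in Python as follows. Only the substring criteria are covered; the field names, the function name peak_matches and the behaviour with no criterion at all are assumptions made for this example.

```python
def peak_matches(peak, criteria, mode="and"):
    """peak: dict of fields; criteria: dict field -> substring to look for."""
    results = [str(wanted) in str(peak.get(field, ""))
               for field, wanted in criteria.items()]
    if not results:
        return True                   # assumption: no criterion shows every peak
    return all(results) if mode == "and" else any(results)
```

For example, with a peak whose note is "super with ILE 13" and whose spin type is "HG1", the criteria note "ILE" and spin type "HB" select the peak in 'or' mode but not in 'and' mode.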
After selecting this command, the program will wait until you click on a peak on the spectrum, and will high-light the selected peak as well as print its id in the terminal screen.
The previously high-lighted peak can then be edited with this command. You directly see the content of the assignment data-base, and can actually modify it. If the peak is already assigned, you will be able to see/edit the corresponding spins. If the peak is not located in the current zoom window, you can center the zoom window to it by using the button 'center' of the peak formbox.
This command waits until you click on the spectral window, and creates a new entry in the peak data-base. If a peak already exists within the distance tolerance for mouse clicking, the program asks the user for confirmation before creating the peak. The new peak is then edited.
After selecting this command, the program will wait until you click on the spectrum, and will propose the spins close to the click point, indicating along which axis (F1/F2) they are found.
The previously high-lighted spin can then be edited with this command. You directly see the content of the assignment data-base, and can actually modify it. Related peaks and spin system can also be edited.
After selecting this command, the program will ask you along which axis (F1/F2) you want to create a new spin. Then, it will wait until you click on the spectral window, and create a new entry in the spin data-base. The new spin is then edited.
This command produces a clickable list of all the spins in the data-base.
After selecting this command, the program will wait until you click on a peak on the spectrum, and will high-light the related spin-system, if it exists.
The previously high-lighted spin-system can then be edited with this command. You directly see the content of the assignment data-base, and can actually modify it. Related peaks and spins can also be edited.
When the spin system type has been defined by the user, the list of possible spin names is restricted to the list defined in the topology data-base. The refresh button closes the opened system formbox, and opens a new one containing the last modifications. When you close the formbox, the program checks that the topology of the spin system is correct.
This command produces a clickable list of all the spin-systems in the data-base.
This command produces statistics of all the spin-systems in the data-base.
Lets you click on the data-set, and searches for the spins located within the align distance tolerance along the F1 and F2 axes. The spin names and information are shown in a formbox, from which they can be displayed and edited individually.
This command draws the homonuclear NOESY walk in the HN-HN and the HN-HA regions.
This command builds a form box, with one line per residue in the primary sequence of the molecule under study. Each assigned residue is associated with a button showing the corresponding spin-system on screen, and with another button for editing it.
This menu contains all the basic graphic tools which are used to detect peak alignments and build new spin systems.
Most display commands use a contrasting color (see SCOLOR in the documentation). This utility lets you define which color will be used.
This is equivalent to the standard point macro: you can click on the current spectrum, and the coordinates of the clicked point are printed, and a cross is drawn at the click point location. You exit the point command by clicking on the third button of the mouse.
This is the main tool for detecting peak alignment and for building the build list. When activating the command, you are prompted to click on each spectral location that you want to put in the marker. When finished, click on the third button of the mouse. A box is then created, which will remain on screen as long as you do not close it purposely.
From the marker box you can choose to redraw all the horizontal and vertical lines connecting the selected points. You can choose to have diagonal-symmetric locations considered or not, and choose the color. You can show all the peaks in the data-base lying at the intersection of a horizontal and a vertical line, as well as create missing peaks. Finally, you can add all the peaks detected at the intersections to the build list.
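The search for peaks lying at line intersections can be sketched in Python as follows. Each clicked point is taken to define one line at constant F1 and one at constant F2; the function name, the data layout and the use of a single tolerance for both axes are assumptions made for this example.

```python
def peaks_at_intersections(marks, peaks, tol, symmetric=False):
    """marks: clicked (f1, f2) points; peaks: dict id -> (f1, f2)."""
    f1_lines = [m[0] for m in marks]
    f2_lines = [m[1] for m in marks]
    if symmetric:
        # also consider the diagonal-symmetric locations
        f1_lines, f2_lines = f1_lines + f2_lines, f2_lines + f1_lines
    found = []
    for pid, (f1, f2) in sorted(peaks.items()):
        on_f1 = any(abs(f1 - x) <= tol for x in f1_lines)
        on_f2 = any(abs(f2 - y) <= tol for y in f2_lines)
        if on_f1 and on_f2:
            found.append(pid)
    return found
```

The ids returned by such a search are what would be added to the build list; intersections with no peak are the candidates for the "create missing peaks" option.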
This command empties the build-list, which contains peaks.
This command simply prints the content of the build-list, by showing the ids of the peaks it contains.
This displays the peaks of the current build-list on the spectral window.
This is the command that uses the current build list to create a new spin-system. First, alignments are detected among the peaks in the build list, and spins are detected for each spectral coordinate. If needed, new spins are created. Then a new spin system is created.
Finally, the spin-system editing tool is opened, from which you can modify the spin-system parameters and edit each individual spin.
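The promotion of a build list can be sketched in Python as follows. This is a simplified view under stated assumptions: coordinates are grouped into spins with a single alignment tolerance, only the spins created during the promotion are reused, and the data layout and the name promote_build_list are hypothetical.

```python
def promote_build_list(build_list, peaks, spins, systems, tol):
    """Create a spin system from the peaks listed in build_list.

    peaks: id -> (f1, f2); spins: id -> [shift, system_id];
    systems: id -> list of spin ids.
    """
    sys_id = max(systems, default=0) + 1
    members = []
    for pid in build_list:
        for shift in peaks[pid]:
            # reuse a spin of this system aligned with the coordinate, if any
            match = next((s for s in members
                          if abs(spins[s][0] - shift) <= tol), None)
            if match is None:
                match = max(spins, default=0) + 1
                spins[match] = [shift, sys_id]
                members.append(match)
    systems[sys_id] = members
    return sys_id
```

With three peaks at (8.784, 4.456), (8.783, 1.654) and (4.457, 1.655) and a tolerance of 0.01 ppm, the sketch creates three spins, one per distinct chemical shift, and a single system holding them.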
This menu contains various utility commands, which let you check the integrity of the current assignment state and its consistency with structural information obtained from a PDB file. These commands also let you output assignment information.
Checks all current assignment data-bases for integrity. Because of a bug in the foreach code, this command may have to be applied more than once in case of wrong entries. In particular, this command checks the consistency between the chemical shifts of the spins assigned to a peak and the chemical shifts of this peak. It also checks for internal coherence.
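The chemical shift consistency check can be sketched in Python as follows. The data layout, the use of None for an unassigned spin and the tolerance argument are assumptions made for this example; the other integrity checks performed by the command are not covered.

```python
def inconsistent_peaks(peaks, spins, tol):
    """Return ids of peaks whose coordinates disagree with their assigned spins.

    peaks: id -> (f1, f2, spin1_id, spin2_id), with None for an unassigned spin;
    spins: id -> chemical shift in ppm.
    """
    bad = []
    for pid, (f1, f2, s1, s2) in sorted(peaks.items()):
        for coord, sid in ((f1, s1), (f2, s2)):
            if sid is None:
                continue              # nothing to compare for this axis
            if sid not in spins or abs(spins[sid] - coord) > tol:
                bad.append(pid)       # dangling pointer or shift mismatch
                break
    return bad
```

A peak is reported once, either because one of its spin pointers designates a missing entry or because a spin lies further than the tolerance from the peak coordinate on the same axis.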
Check that all spins of each spin system are defined in the current topology database. The topology data base file is located in the /usr/local/gifa/macro/att directory. For each spin system type, it contains the allowed list of spin names, including the individual hydrogen names and generic names for superposed geminal hydrogens. This file can be modified by the user to fit other topologies (see macro programming).
Creates a formbox which permits direct editing of each spin, spin system and peak, according to its index in the data-base.
This command permits to click on the spectrum, and high-light the closest peak in the database. Then, it creates a dialog box to look for the distances between hydrogens involved in the selected correlation. The PDB files you can scan are given by the PDB file list.
You can select atoms by their exact names or by a substring of their names (as an example, looking for the 'HB' substring will permit to find all the hydrogens HB1, HB2, HB3,...). If you want to search among all the residues or all the atoms, put the sign '.' into the corresponding field of the dialog box.
This command is based on the Perl script calcdst.pl, located in the /usr/local/gifa/com/att directory.
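The atom selection rule can be sketched in Python as follows, assuming that the field is used as a regular expression, which would explain why '.' selects everything. This is an assumption about calcdst.pl, not something checked against the script, and the function name select_atoms is hypothetical.

```python
import re

def select_atoms(atom_names, pattern):
    """Return the names containing pattern; '.' selects every atom."""
    return [name for name in atom_names if re.search(pattern, name)]
```

With the names HA, HB1, HB2 and HN, the pattern 'HB' keeps HB1 and HB2, while '.' keeps all four.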
Lists all the peak data-base entries to a file. Here is an example of a listing file:
# Project : /d1bis/people/terez/ranab
# Experiment : spectra/proc.05
# F1     F2     Spin1   Spin2   Amplitude   Peak#  (Note)
0.282    4.855  unk     unk     476142      151
0.900    1.266  HD-20   QG-20   2597383     443
0.900    4.223  HD-20   HA-20   682930      442
0.923    0.912  HG3-7   HG3-7   356122080   430
0.923    1.267  HG3-7   HG2-7   3268115     369
0.923    1.531  HG3-7   HG1-7   2074078     336
0.923    1.886  HG3-7   HB-7    1735731     306
0.923    4.213  HD-7    HA-7    919385      178
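Such a listing is easy to read back from another program. Here is a minimal Python sketch, assuming that the file holds one peak per line, that header lines start with '#', and that the columns are F1, F2, Spin1, Spin2, Amplitude, Peak# and an optional note; the function name parse_peak_listing is hypothetical.

```python
def parse_peak_listing(text):
    """Parse a peak listing into a list of dicts, skipping '#' header lines."""
    peaks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 6)  # the optional note keeps its blanks
        peaks.append({
            "f1": float(fields[0]), "f2": float(fields[1]),
            "spin1": fields[2], "spin2": fields[3],
            "amp": float(fields[4]), "id": int(fields[5]),
            "note": fields[6] if len(fields) > 6 else "",
        })
    return peaks
```

The spin columns are kept as text, since an unassigned spin appears as "unk" in the listing.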
Lists all the spin data-base entries to a file. Here is an example of a listing file:
# Project : /d1bis/people/terez/ranab
#PPM    Name  System  Spin#  (Note)
1.014   HG    15      63
1.083   HG    8       37
1.255   HG1   7       30     super with ILE 13
1.266   HG1   20      79
1.473   HB    11      48
1.664   HB    21      83     super with HG LEU 5
1.728   HB2   22      88     super with HD1 Lys 19 and 18
1.771   HG    16      67
Lists all the spin system data-base entries to a file. Here is an example of a listing file:
# Project : /d1bis/people/terez/ranab
2 LEU
------ 8.784 HN 10
------ 4.456 HA 10
------ 1.654 QB 10 super with HG Leu 2
------ .946 QD 10
3 GLY
------ 8.441 HN 17
------ 4.017 QA 17
4 GLY
------ 8.399 HN 18
------ 4.051 HA1 18
------ 3.979 HA2 18
5 LEU
------ 8.324 HN 21
------ 4.418 HA 21
------ 1.664 QB 21 super with HG LEU 5
------ .969 QD 21
Lists all the assignment entries to a file. Here is an example of a listing file:
# Project : /d1bis/people/terez/ranab
1 F assigned to : 1 Arom-Phe 103 104
------ 7.458 3H 26
------ 7.348 2H 26
1 F assigned to : 1 PHE 101 119
------ 4.344 HA 25
------ 3.261 QB 25
2 L assigned to : 2 LEU 42 43 44 45
------ 8.784 HN 10
------ 4.456 HA 10
------ 1.654 QB 10 super with HG Leu 2
------ .946 QD 10
3 G assigned to : 3 GLY 68 69
------ 8.441 HN 17
------ 4.017 QA 17
Plots to a file the labels of the elements of the current peak data-base which are located in the current zoom window.
This menu allows the calculation of peak intensities, and the output of distance constraint files or peak intensity curves.
Produces a short help for the Integration menu, as a 'recipe' for using this menu. The complete assignment help contained in the HTML documentation can be called from this short help.
Performs the integration of all the peaks of the opened dbm assignment table (it is not possible to perform this integration only on the current zoom window, because of a bug in dbm described in CAVEAT). The calculated volumes are put into the amplitude field of the dbm entries. Different integration methods can be used: 'Max intensity' gives the data-set value at the peak maximum, 'sumrec' and 'amoeba' make use of the commands SUMREC and INTEG respectively.
From the calibration peak set, the program determines by a least-squares method a proportionality coefficient between the intensity values and the inverse sixth power of the distance. This coefficient is then used to associate a distance with each element of the peak data-base. Then, using the uncertainty on the distance provided by the user, it writes a constraint file with distance upper and lower bounds. The file can be in XPLOR or DYANA format.
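The calibration can be sketched in Python as follows. The model is I = k / r^6, and k is taken as the least-squares solution for a line through the origin, k = sum(I_i x_i) / sum(x_i^2) with x_i = r_i^-6. Whether the macro fits exactly this way, and whether the uncertainty is absolute (as assumed here) or relative, is not stated in this document; the function names are hypothetical.

```python
def fit_coefficient(calib):
    """Least-squares k in I = k * r**-6 from (intensity, distance) pairs."""
    num = sum(i * r ** -6 for i, r in calib)
    den = sum((r ** -6) ** 2 for i, r in calib)
    return num / den

def distance_bounds(intensity, k, uncertainty):
    """Return (lower, upper) bounds around r = (k / I)**(1/6)."""
    r = (k / intensity) ** (1.0 / 6.0)
    # assumption: the uncertainty is an absolute value, in distance units
    return r - uncertainty, r + uncertainty
```

Because of the sixth root, a large error on the intensity translates into a small error on the distance: a factor of 2 on I changes r by about 12 %.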
From the calibration peak set, the program determines by a least-squares method a proportionality coefficient between the intensity values and the inverse sixth power of the distance. This coefficient is then used to associate a distance with each element of the peak data-base. Then, using distance intervals, and the upper and lower bounds given by the user for each interval, it writes a constraint file with distance upper and lower bounds. The file can be in XPLOR or DYANA format.
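The interval variant can be sketched in Python as follows. Each interval is described by its maximum distance and by the bounds to write for the peaks falling into it; this layout, the function name and the behaviour beyond the last interval are assumptions made for this example.

```python
def bounds_from_intervals(distance, intervals):
    """intervals: list of (max_distance, lower, upper), sorted by max_distance."""
    for max_distance, lower, upper in intervals:
        if distance <= max_distance:
            return lower, upper
    return None                       # beyond the last interval: no constraint
```

For instance, with the purely illustrative intervals (2.5, 1.8, 2.7), (3.5, 1.8, 3.5) and (5.0, 1.8, 5.0), a calibrated distance of 2.2 gives the bounds 1.8 and 2.7, and a distance of 4.0 gives 1.8 and 5.0.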
Copy the current assignment data-base to a peak file that can be read with PKREAD, and create a lookup database giving the relation between the peak index in the assignment data-base and in the peak file. The peak file name is basename.atr and the lookup file names are: basename.hash.dir and basename.hash.pag.
Reads a peak file with the command PKREAD.
Lists the peak table (see command PKLIST).
Displays the peaks located in the current zoom window (see command SHOWPEAKS).
Evaluates the noise on the data-sets using a zoom window given by the user.
Performs the peak integration according to the noise level and the peak table, by calculating for each peak an amoeba (see command INTEG).
Shows on the disp2d window the amoebas located in the current zoom window (see command SHOW).
After selecting this command, the program will ask you to click on a peak, and will create a formbox, which permits to interactively and graphically modify the peak amoeba by selecting one-by-one pixels. Two possibilities are available: "add" adds the selected pixel to the peak amoeba, "erase" removes the selected pixel from the peak amoeba.
If the flag is set to 'add' and the pixel belongs to another peak amoeba, the program asks the user to decide.
Reads a peak file and an amoeba file according to a basename given by the user. The peak file name is: basename.pek, and the amoeba file name is: basename.amb
Saves the current peak table and the current amoeba to a peak file and to an amoeba file according to a basename given by the user. The peak file name is: basename.pek, and the amoeba file name is: basename.amb. If the peaks/amoeba files already exist with the same basename, the user is asked to remove them.
Permits the integration of peaks along a series of experiments according to an assignment data-base, an amoeba file and the project list of spectra. For each peak, the integral values are written in an independent file, located in the processing directory. Two formats are possible: free ascii format or Tela macro format. The user can select in the formbox the spectra to be used in the integration.
To perform the multiple integration of peaks, the amoeba file basename should be the same as the basename of the lookup data-base relating the peak indexes in the assignment data-base and in the peak file (see the command Copy db to a peak file above).
Permits the display of the curve obtained with the multiple integration utility, by clicking on the corresponding peak.
After selecting this command, the program asks you to click on a peak, and then creates a formbox, from which the given peak can be displayed successively on each spectrum of the project list. The user can select in the list the spectra to be displayed.
With this set-up, a typical assignment work on a small protein, done in the Wüthrich way, consists in:
You will notice that the program slows down when the data-base gets bigger. This is why it is not recommended to start with a big peak-picking, and then to handle a big data-base throughout the whole process.
The intensity analysis (Integration) menu implements a set of simple tools dedicated to the analysis of peak intensity according to the assignment database. Here is a simple 'recipe' on how to use this menu:
If you want to produce a constraint file used for structure generation:
The complete assignment module is written in nothing but macros. So you could have written it yourself! At least you can modify it to fit your needs. Here is some help to do so.
All the macros are in /usr/local/gifa/macro/att; this address is added to the GifaPath when entering the module. Static information (list of possible atom names, residue names, etc.) is defined in the basic_db.g and build_static_db.g macros. You can very simply adapt these macros for some new residues. Some static information is stored in dbm data-bases: the "topology" data-base stores the structures of the different residues defined and the names of valid spins. The data-bases 3let_1let and 1let_3let store lookup tables for converting rapidly between 1-letter residue names (C Y K...) and 3-letter names (Cys Tyr Lys...). These 3 data-bases are also stored in /usr/local/gifa/macro/att and can be recreated with the build_static_db.g macro. This can be done if you want to add new types of residues; it may also be needed in the case of a new installation of Gifa, since the dbm format is not fully binary compatible among all UNIX platforms (be careful, Linux users).
The static data-bases are opened when entering the assignment module. The project data-bases are associated with the associative arrays att[], spin[] and sys[], for the peak data-base, the spin data-base, and the spin-system data-base respectively. Entries are thus of the kind $att[$peak_id], for instance. The different pieces of information are stored as blank-separated fields in the variable. The coding is the following:
$att[#att] = f1 f2 amp #spin1 #spin2 type note
$spin[#spin] = delta name #sys note
$sys[#sys] = index type list_of_spin note
where
#att, #spin and #sys are used here to denote the indexes;
f1 and f2 are the coordinates in ppm;
amp is the peak amplitude in arbitrary units;
type (in att) codes for the kind of experiment;
delta is the chemical shift in ppm;
index is the number in the primary sequence;
type (in sys) is the name of the residue;
note is a free field, that you can use for whatever function.
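For readers who access the data-bases from outside Gifa, the field layout can be decoded as in the following Python sketch. The function names and the sample values are hypothetical, and the spin pointers are kept as text since their exact form for unassigned peaks is not specified here. The sys entries are left out, because the way list_of_spin is delimited is not described above.

```python
def parse_att(entry):
    """Split a peak entry 'f1 f2 amp #spin1 #spin2 type note' into fields."""
    f1, f2, amp, spin1, spin2, kind, *note = entry.split()
    return {"f1": float(f1), "f2": float(f2), "amp": float(amp),
            "spin1": spin1, "spin2": spin2, "type": kind,
            "note": " ".join(note)}   # the note may contain blanks

def parse_spin(entry):
    """Split a spin entry 'delta name #sys note' into fields."""
    delta, name, sys_id, *note = entry.split()
    return {"delta": float(delta), "name": name, "sys": sys_id,
            "note": " ".join(note)}
```

Since the note is the last field, everything after the fixed fields is gathered into it, so notes such as "super with ILE 13" survive the blank-separated coding.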
Each dbm associative array att[], spin[] and sys[] contains the special entry LARGEST, which holds the largest id (#att, #spin, or #sys) assigned so far. So, when creating a new entry (a spin in this example), you are supposed to do something like:
set new_id = ($spin["LARGEST"] + 1)
set $spin[$new_id ] = "New entry ..."
set $spin["LARGEST"] = $new_id ; updated only if no error occurred
When programming some function that scans the whole data-base, you will probably end up writing something like:
foreach i in att                  ; let's scan all the peaks as an example
  if ($i s! "LARGEST") then       ; don't forget this one !
    set peak = $att[$i]           ; this is the complete entry
    ; then parse the entry, this is one way :
    set f1 = (head($peak))
    set peak = (tail($peak))
    set f2 = (head($peak))
    set peak = (tail($peak))
    set amp = (head($peak))
    set peak = (tail($peak))
    ; etc...
  endif
endfor
When loaded, the calibration distances defined by the command Choose the calibration intensities are stored in an associative array called calib_dst[].
If you manage to make something useful, you can transmit it to me so that I will make it available to other users.
Note that this is preliminary work. People have been using this tool here in our lab; however, I'm sure there are still a lot of bugs.
Avoid the following construct, which modifies the entries while scanning the array with foreach:
foreach i in att
  if ($i s! "LARGEST") then
    set att[i] = "what you want"
  endif
endfor
Rather use the following:
for i = 1 to $att["LARGEST"]
  if (exist('att['//$i//']')) then
    set att[i] = "what you want"
  endif
endfor
If you wish to help, please contact me!