To show the molecule on the computer screen, the computer must be told about molecular structure. Molecular modeling software requires that this information be provided in the form of a sketch on the screen which is usually done with a mouse or some other pointing device, or prompts the user for a name of the disk file where the information is stored. This section discusses examples of different ways in which the chemist's drawing of a molecule may be converted into a series of numbers which computer understands.
Each atom of the molecule is usually assigned an ordinal number, from 1 to N, where N is the total number of atoms in a molecule. Most modeling systems do not impose a particular atom numbering, hence the atom numbers usually do not adhere to strict rules of IUPAC convention. However most modeling systems require that atoms belonging to a given residue (e.g. amino acid in a protein) be numbered consecutively.
Usually, the atoms comprising the molecule are assigned a type which describes their chemical identity. This type is usually an integral number or a mnemonic symbol, e.g.: ``21'' or ``Csp3''. The type reflects not only an element but also a particular arrangement of bonds formed by the atom, and its formal charge. The atom type may also depend on atom neighbors in the molecule. For example, the amine, ammonium, imine, amidic, pyridinic, pyrrolic, etc., nitrogens are assigned different types in most molecular modeling systems. There is still no universal table of atom types because different approaches may require different levels of detail in atom type specification. Similarly, bonds are assigned types: single, double, aromatic, hydrogen, etc. Besides real chemical atoms and bonds, most systems introduce the notion of a dummy atom and dummy bond (the terms virtual or pseudo are also frequently used). These objects can be used in marking important features of the molecule, for orienting two molecules relative to each another, or enforcing some geometrical constraints on real atoms and bonds of the molecule.
The relation between atoms must be given to completely specify molecular objects to the computer. This information consists of two parts: the specification of bonds and specification of geometry (the spatial relation between atoms). With millions of known molecules, it is surprising to discover that there is still no accepted standard for representing molecular structure.
Figure 6.1: Adjacency and connectivity matrices for acetic acid. Rows
and columns correspond to the atom numbering given on the left.
Off-diagonal elements of both matrices represent order of the bonds
between atoms pointed by row and column. The diagonal of the connectivity
matrix represents the atomic number.
Bonds can be specified as a list of bonded atom pairs accompanied by bond types or as an adjacency matrix or attachment list. The example of an adjacency matrix is shown in Fig. 6.1. It is a square matrix with the number of columns and rows equal to the number of atoms in the molecule. Positions corresponding to bonded atoms are marked with 1 (or sometimes by bond type), and all other entries are zero. A variant of the adjacency matrix, called connectivity matrix, differs in that it includes the atomic number on the diagonal and for each bond the actual bond order (i.e., 1 for single bond, 2, for double, 1.5 for conjugated, etc.) is used. The attachment list is constructed by listing all atoms bonded to a given atom. An example of an attachment list for acetic acid is given in Fig. 6.2.
Figure 6.2: Attachment list and bonded atom list for acetic acid as depicted on
the left. Single bonds are denoted as (-) and double as (=).
Many computer operations on molecules are based on principles of graph theory (topology). Software manuals and research papers frequently use terms and concepts of the graph theory. Some basic definitions will be given below, however do not be mislead by the simplicity of concepts explained in the next few paragraphs. The graph theory is an active, important and difficult branch of mathematics. It also has many important practical applications in fields as diverse as electronics and cartography. For a recent review of the application of graph theory to molecular modeling consult Marsili, (1990).
Figure 6.3: Two isomorphic graphs.
A graph, such as in Fig. 6.3, is a collection of points, V (most frequently called vertices or nodes), and lines, E (commonly called edges but the term bonds is also used). Graphs are frequently specified as a list of edges (see Fig. 6.3). The valence (or degree) of a vertex is the number of edges originating from this vertex. The sum of degrees of all vertices is even and equal to twice the number of edges. This fundamental property of graphs is called a handshaking lemma. Two vertices are called adjacent if they are joined by an edge. Similarly, two edges are adjacent when they originate at the same vertex. Fig. 6.3a represents a graph which some may consider similar to a molecule. Note, however, that ordinary graphs do not assign any spatial arrangement to the vertices. Edges can be any length, and the graphs which you see depicted in Fig. 6.3a and 6.3b are identical. Graphs representing molecules belong to a class of undirected graphs, i.e., graphs with edges not directed from one vertex to another (edge (i,j) is equivalent to (j,i)). Two graphs are isomorphic if one can convert the first graph to the second one by renumbering vertices. The efficient algorithms for detecting isomorphism are actively researched due to their importance for fast computer searching of chemical databases. The problem is far from trivial, since the number of ways in which the graph vertices can be numbered, , is astronomical even for small molecules.
Figure 6.4: Example of disconnected graph with circuits. This graph contains
three components (1--8, 9--12 and 13--19) and seven independent
circuits.
Some sequences of edges have special names. The ``walk'' is such a sequence of edges where consecutive edges are adjacent. The ``trail'' is a walk in which each edge is traversed only once. The ``path'' is a trail in which all vertices are visited only once, with the exception that the first and last vertex may be identical, in which case the path is closed. Closed paths are also called ``circuits''. Using as an example a graph in Fig. 6.4, the sequence of edges: a b c d j c d is a walk. The walk: j e f i d c is a trail, but is not a path since vertex 5 was crossed twice, however, the walk: r s u v is a path. A graph is connected when there is a path between any pair of vertices. Please note that the two strands of DNA in a double helix, are usually considered separate components since hydrogen bonds are not routinely included as edges in the molecular description. The number of components of the graphs is the number of disjointed sets of vertices between which there is no path. The graph in Fig. 6.4 has 3 components, but the removal of edge u would produce a graph with 4 components. All edges whose removal disconnects the graph are called a disconnecting set.
Figure 6.5: Example of a graph with circuits (right) and its spanning
tree (left).
A connected graph without a circuit is called a tree. The tree has several important properties:
It is easy to see that if a molecule is a tree, all its bond lengths and angles are independent of each another. One can change any of them without affecting the other. If a molecule contains circuits (which chemists call rings or cycles) only the lengths and angles associated with bonds belonging to the disconnecting set can be changed without affecting other bonds and angles. For molecules containing rings, it is desirable to find a spanning tree (see Fig. 6.5) of their graph by removing one edge for every circuit (note that the number of vertices is not affected by this operation). Such a spanning tree must be found if one needs to change a length or an angle involving a bond which is a part of a ring system. This change will always affect angles and lengths associated with bonds which were removed to obtain a spanning tree. There is an important relation which is often used in molecular modeling systems to check the consistency of the molecular object. If n, m, r, and k denote the number of vertices, edges, circuits, and components, respectively, then: r = m - n + k. Check this relation by using Fig. 6.4 as an example. Please note however, that r refers here to the number of smallest circuits. To find the number of smallest circuits, one starts by marking all circuits of size 3, then 4, 5, etc., but counts only those circuits for which there is at least one edge, which was not previously incorporated into a smaller circuit. The process is continued until all edges belonging to rings were used up. For the graph in Fig. 6.4, the closed paths: l m o and k n o are counted as distincts circuits, while the path: k l m n is not considered a distinct circuit, since all its edges were already included in smaller circuits.
The algorithms derived from graph theory are used extensively in molecular modeling software. Even for simple operations, like modifying bond lengths or angles, the topological methods must be used in order to see if a bond is part of a ring system, since only bonds belonging to a disconnecting set may be independently modified. Assigning atom types, detecting aromaticity and checking if the valence of atoms is satisfied requires that the topology of the molecule be analyzed. Topological concepts are used in the calculation of internal coordinates, performing conformational searches and converting standard chemical notation to representations of bonding suitable for computer analysis. We will see some examples in the following sections.