The Computational Chemistry Archives can be searched by sending a search query to the address chemistry-search@server.ccl.net. A list of files which satisfy your query will be sent back to you via e-mail. This is an experimental service; please send your comments and ideas to the author, Jan K. Labanowski at jkl@ccl.net.
The following document describes the format of the search query. The format, while not simple, allows for precise and elaborate search queries. Please take the time to read these instructions before using the e-mail searcher.
A search query consists of a number of lines of text. The first line consists of a single number. The number corresponds to the number of text patterns you wish to search for. Note that if you only wish to search for one text pattern, you may omit this line provided that you also omit the logical relation (see below.)
Following the first line is a list of regular expressions. There should be exactly one regular expression per line; the number of regular expressions (and thus, the number of lines) should be equal to the number in the first line. The format of regular expressions is described below. (Those familiar with Perl already know the format for regular expressions.)
Following the list of regular expressions is a single line containing a logical relation. The logical relation determines which combination of regular expressions constitute a match. For example, you may wish to find all files which mention carbon and oxygen, or MOPAC or MindTool, or hydrogen and helium but not argon. Logical expressions allow this sort of "pick and choose" behavior. Note that if there is only one regular expression, you may omit the logical relation. The format of logical relations is also described below.
A search query with only one regular expression would look like this:
regular expression
A search query with n regular expressions would look like this:
n
regular expression 1
regular expression 2
. . .
regular expression n
logical relation
The following sections explain the format of regular expressions and logical relations in more detail.
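The query layout described above can be sketched as a small Python helper. This is only an illustration: the function name build_query is hypothetical, and the helper simply assembles the lines of the message body; it does not send any mail.

```python
def build_query(patterns, relation=None):
    """Assemble the lines of a search query (hypothetical helper).

    patterns -- delimited regular expressions, e.g. "/amber/"
    relation -- logical relation such as "1 OR 2"; optional for one pattern
    """
    if len(patterns) == 1 and relation is None:
        # One pattern: the count line and the relation may both be omitted.
        return patterns[0]
    if relation is None:
        raise ValueError("a logical relation is required for several patterns")
    # Count line, one regular expression per line, then the relation.
    lines = [str(len(patterns))] + list(patterns) + [relation]
    return "\n".join(lines)


print(build_query(["/amber/"]))
print(build_query([r"/\bcarbon\b/", r"/\bhydrogen\b/"], "1 OR 2"))
```

The returned text would form the body of the message sent to the search address.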
A regular expression is a string of characters which tells the searcher which
string (or strings) you are looking for. The following explains the format of
regular expressions in detail. If you
are familiar with Perl, you already know the syntax. If you are familiar with
Unix, you should know that there are subtle
differences between Perl's regular expressions and Unix' regular expressions.
Delimiters
A regular expression always starts and ends with a character, called the delimiter, which does not appear anywhere else in the expression. The delimiter is included because the searcher removes trailing spaces at the end of a line. Without the delimiters, there would be no way to distinguish between significant and insignificant spaces. Only the characters inside the delimiters are considered part of the regular expression.
Here are some examples of legal regular expressions:
/legal/ ("/" is the delimiter)
.legal. ("." is the delimiter)
blegal b ("b" is the delimiter; note that the space after the "l" is considered part of the regular expression)
Here are some illegal regular expressions:
/illegal (the closing delimiter is missing)
/ill/egal/ (the delimiter appears inside the expression)
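The delimiter rule can be checked mechanically. The following Python sketch is an illustration only; strip_delimiters is a hypothetical name, and the searcher's actual parsing code is not shown in this document.

```python
def strip_delimiters(expr):
    """Return the text between the delimiters, or None if expr is illegal."""
    if len(expr) < 2:
        return None
    delimiter, body = expr[0], expr[1:-1]
    if expr[-1] != delimiter:
        return None          # no closing delimiter, as in /illegal
    if delimiter in body:
        return None          # delimiter reused inside, as in /ill/egal/
    return body


for candidate in ["/legal/", ".legal.", "blegal b", "/illegal", "/ill/egal/"]:
    print(repr(candidate), "->", repr(strip_delimiters(candidate)))
```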
Simple Regular Expressions
In its simplest form, a regular expression is just a word or phrase to search for. For example,
/gauss/
would match any file whose name had the string "gauss" in it, or which mentioned the word "gauss" in the file contents. Thus, files named "gauss", "gaussian" or "degauss" would all be matched, as would a file containing the phrases "de-gauss the monitor" or "gaussian elimination." Here are some more examples:
/carbon/
.hydro.
boxyb
>top ten>
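A simple expression behaves like a substring search. The sketch below uses Python's re module as a stand-in for the searcher (which uses Perl); the helper name matches is hypothetical, and the IGNORECASE flag is an assumption that mirrors the searcher's case-insensitive behavior.

```python
import re


def matches(pattern, text):
    """True if the pattern occurs anywhere in text, ignoring letter case."""
    return re.search(pattern, text, re.IGNORECASE) is not None


for sample in ["gauss", "degauss", "Gaussian elimination", "de-gauss the monitor", "gaus"]:
    print(sample, matches("gauss", sample))
```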
Metacharacters
Some characters have a special meaning to the searcher. These characters are called metacharacters. Although they may seem confusing at first, they add a great deal of flexibility and convenience to the searcher.
The period (.) is a commonly used metacharacter. It matches exactly one character, regardless of what the character is. For example, the regular expression:
/2,.-Dimethylbutane/
will match "2,2-Dimethylbutane" and "2,3-Dimethylbutane". Note that the period matches exactly one character; it will not match a string of characters, nor will it match the null string. Thus, "2,200-Dimethylbutane" and "2,-Dimethylbutane" will not be matched by the above regular expression.
But what if you wanted to search for a string containing a period? For example, suppose we wished to search for references to pi. The following regular expression would not work:
/3.14/ (THIS IS WRONG!)
This would indeed match "3.14", but it would also match "3514", "3f14", or even "3+14". In short, any string of the form "3x14", where x is any character, would be matched by the regular expression above.
To get around this, we introduce a second metacharacter, the backslash (\). The backslash can be used to indicate that the character immediately to its right is to be taken literally. Thus, to search for the string "3.14", we would use:
/3\.14/ (This will work.)
This is called "quoting". We would say that the period in the regular expression above has been quoted. In general, whenever the backslash is placed before a metacharacter, the searcher treats the metacharacter literally rather than invoking its special meaning.
(Unfortunately, the backslash is used for other things besides quoting metacharacters. Many "normal" characters take on special meanings when preceded by a backslash. The rule of thumb is, quoting a metacharacter turns it into a normal character, and quoting a normal character may turn it into a metacharacter.)
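The difference between the unquoted and the quoted period can be seen in a short Python sketch. Python's re module is used here only as a stand-in for the searcher; the sample strings are the ones mentioned above.

```python
import re

samples = ["3.14", "3514", "3f14", "3+14"]

# Unquoted period: matches any single character between "3" and "14".
unquoted_hits = [s for s in samples if re.search(r"3.14", s)]

# Quoted period: matches only a literal "." between "3" and "14".
quoted_hits = [s for s in samples if re.search(r"3\.14", s)]

print("unquoted:", unquoted_hits)
print("quoted:  ", quoted_hits)
```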
Let's look at some more common metacharacters. We consider first the question mark (?). The question mark indicates that the character immediately preceding it should be matched either zero times or one time. Thus
/m?ethane/
would match either "ethane" or "methane". Similarly,
/comm?a/
would match either "coma" or "comma".
Another metacharacter is the star (*). This indicates that the character immediately to its left may be repeated any number of times, including zero. Thus
/ab*c/
would match "ac", "abc", "abbc", "abbbc", "abbbbbbbbc", and any string that starts with an "a", is followed by a sequence of "b"'s, and ends with a "c".
The plus (+) metacharacter indicates that the character immediately preceding it may be repeated one or more times. It is just like the star metacharacter, except it doesn't match the null string. Thus
/ab+c/
would not match "ac", but it would match "abc", "abbc", "abbbc", "abbbbbbbbc" and so on.
Metacharacters may be combined. A common combination includes the period and star metacharacters, with the star immediately following the period. This is used to match an arbitrary string of any length, including the null string. For example:
/cyclo.*ane/
would match "cyclodecane", "cyclohexane" and even "cyclones drive me insane." Any string that starts with "cyclo", is followed by an arbitrary string, and ends with "ane" will be matched. Note that the null string will be matched by the period-star pair; thus, "cycloane" would be matched by the above expression.
If you wanted to search for articles on cyclodecane and cyclohexane, but didn't want to match articles about how cyclones drive one insane, you could string together three periods, as follows:
/cyclo...ane/
This would match "cyclodecane" and "cyclohexane", but would not match "cyclones drive me insane." Only strings eleven characters long which start with "cyclo" and end with "ane" will be matched. (Note that "cyclopentane" would not be matched, however, since cyclopentane has twelve characters, not eleven.)
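The repetition metacharacters discussed so far can be tried out with the following Python sketch. The helper names whole and found are hypothetical; Python's re module stands in for the searcher, and its behavior for these particular metacharacters is the same as Perl's.

```python
import re


def whole(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None


def found(pattern, s):
    """True if the pattern matches somewhere inside s."""
    return re.search(pattern, s) is not None


# Question mark: zero or one "m".
print(whole(r"m?ethane", "ethane"), whole(r"m?ethane", "methane"))
# Star allows zero "b"s, plus requires at least one.
print(whole(r"ab*c", "ac"), whole(r"ab+c", "ac"))
# Period-star spans any text, three periods span exactly three characters.
sentence = "cyclones drive me insane."
print(found(r"cyclo.*ane", sentence), found(r"cyclo...ane", sentence))
print(found(r"cyclo...ane", "cyclohexane"), found(r"cyclo...ane", "cyclopentane"))
```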
Here are some more examples. These involve the backslash. Note that the placement of the backslash is important.
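/a\.*z/
Matches any string starting with an "a", followed by zero or more periods, and ending with a "z". Thus, "az", "a.z" and "a..z" are all matched.
/a.\*z/
Matches any string starting with an "a", followed by one arbitrary character, and terminated with "*z". Thus, "ag*z", "a5*z" and "a@*z" are all matched. Only strings of length four, where the first character is "a", the third "*", and the fourth "z", are matched.
/a\++z/
Matches any string starting with an "a", followed by one or more plus signs, and ending with a "z". Thus, "a+z" and "a++z" are matched, but "az" is not.
/a\+\+z/
Matches only the string "a++z".
/a+\+z/
Matches one or more "a"'s, followed by a single plus sign and a "z". Thus, "a+z" and "aa+z" are matched.
/a.?e/
Matches an "a", followed by at most one arbitrary character, followed by an "e". Thus, "ae", "ace" and "a e" are matched.
/a\.?e/
Matches an "a", followed by at most one period, followed by an "e". Only "ae" and "a.e" are matched.
/a.\?e/
Matches an "a", followed by one arbitrary character, a question mark, and an "e". Thus, "ab?e" and "a5?e" are matched.
/a\.\?e/
Matches only the string "a.?e".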
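The effect of moving the backslash can be verified with a short Python sketch. For each pattern in the list above, the table holds one string that I would expect to match in full and one that should not; these sample strings are my own illustrations, and Python's re module stands in for the searcher.

```python
import re

# pattern -> (string that should match in full, string that should not)
cases = {
    r"a\.*z":  ("a..z", "abz"),
    r"a.\*z":  ("ag*z", "agz"),
    r"a\++z":  ("a++z", "az"),
    r"a\+\+z": ("a++z", "a+z"),
    r"a+\+z":  ("aa+z", "a++z"),
    r"a.?e":   ("ace",  "abce"),
    r"a\.?e":  ("a.e",  "ace"),
    r"a.\?e":  ("ab?e", "abe"),
    r"a\.\?e": ("a.?e", "ab?e"),
}


def behaves_as_expected(pattern, yes, no):
    """True if `yes` matches the whole pattern and `no` does not."""
    return (re.fullmatch(pattern, yes) is not None
            and re.fullmatch(pattern, no) is None)


for pattern, (yes, no) in cases.items():
    print(pattern, behaves_as_expected(pattern, yes, no))
```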
Another useful metacharacter is the digit metacharacter, written "\d", which matches exactly one digit (0 through 9). For example,
/2,\d-Dimethylbutane/
would match "2,2-Dimethylbutane", "2,3-Dimethylbutane" and so forth. Similarly,
/1\.\d\d\d\d\d/
would match any six-digit floating-point number from 1.00000 to 1.99999 inclusive. We could combine the digit metacharacter with other metacharacters; for instance,
/a\d+z/
matches any string starting with "a", followed by a string of digits, followed by a "z". (Note that the plus is used, and thus "az" is not matched.)
The letter "d" in the string "\d" must be lower-case. This is because there is another metacharacter, the non-digit metacharacter, which uses the uppercase "D". The non-digit metacharacter looks like "\D" and matches any character except a digit. Thus,
/a\Dz/
would match "abz", "aTz" or "a%z", but would not match "a2z", "a5z" or "a9z". Similarly,
/\D+/
matches any non-null string which contains no numeric characters.
Notice that in changing the "d" from lower-case to upper-case, we have reversed the meaning of the digit metacharacter. This holds true for most other metacharacters of the format backslash-letter.
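The digit and non-digit metacharacters can be compared side by side in Python, used here as a stand-in for the searcher. The helper name whole is hypothetical, and the sample strings other than those quoted above are my own.

```python
import re


def whole(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None


print(whole(r"2,\d-Dimethylbutane", "2,3-Dimethylbutane"))
print(whole(r"1\.\d\d\d\d\d", "1.41421"), whole(r"1\.\d\d\d\d\d", "2.00000"))
print(whole(r"a\d+z", "a123z"), whole(r"a\d+z", "az"))
# Upper-case D reverses the meaning: any character except a digit.
print(whole(r"a\Dz", "a%z"), whole(r"a\Dz", "a5z"))
```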
There are three other metacharacters in the backslash-letter format. The first is the word metacharacter, which matches exactly one letter, one number, or the underscore character (_). It is written as "\w". Its opposite, "\W", matches any one character except a letter, a number or the underscore. Thus,
/a\wz/
would match "abz", "aTz", "a5z", "a_z", or any three-character string starting with "a", ending with "z", and whose second character was either a letter (upper- or lower-case), a number, or the underscore. Similarly,
/a\Wz/
would not match "abz", "aTz", "a5z", or "a_z". It would match "a%z", "a{z", "a?z" or any three-character string starting with "a" and ending with "z" and whose second character was not a letter, number, or underscore. (This means the second character must either be a symbol or a whitespace character.)
The whitespace metacharacter matches exactly one character of whitespace. (Whitespace is defined as spaces, tabs, newlines, or any character which would not use ink if printed on a printer.) The whitespace metacharacter looks like this: "\s". Its opposite, which matches any character that is not whitespace, looks like this: "\S". Thus,
/a\sz/
would match any three-character string starting with "a" and ending with "z" and whose second character was a space, tab, or newline. Likewise,
/a\Sz/
would match any three-character string starting with "a" and ending with "z" whose second character was not a space, tab or newline. (Thus, the second character could be a letter, number or symbol.)
The word boundary metacharacter matches the boundaries of words; that is, it matches at whitespace, punctuation and the very beginning and end of the text. It looks like "\b". Its opposite, "\B", matches a position that is not a word boundary. Thus:
/\bcomput/
will match "computer" or "computing", but not "supercomputer" since there are no spaces or punctuation between "super" and "computer". Similarly,
/\Bcomput/
will not match "computer" or "computing", unless it is part of a bigger word such as "supercomputer" or "recomputing".
Note that the underscore (_) is considered a "word" character. Thus,
/super\bcomputer/
will not match "super_computer".
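The word, whitespace and word boundary metacharacters can be exercised with the Python sketch below. Python's re module is a stand-in for the searcher; the re.ASCII flag is my assumption, chosen so that "\w" is limited to letters, digits and the underscore as described above. The helper names are hypothetical.

```python
import re


def whole(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s, re.ASCII) is not None


def found(pattern, s):
    """True if the pattern matches somewhere inside s."""
    return re.search(pattern, s, re.ASCII) is not None


# Word and non-word characters.
print(whole(r"a\wz", "a_z"), whole(r"a\Wz", "a_z"), whole(r"a\Wz", "a%z"))
# Whitespace and non-whitespace characters.
print(whole(r"a\sz", "a\tz"), whole(r"a\Sz", "a z"))
# Word boundary and its opposite.
print(found(r"\bcomput", "computer"), found(r"\bcomput", "supercomputer"))
print(found(r"\Bcomput", "computer"), found(r"\Bcomput", "supercomputer"))
# The underscore is a word character, so no boundary separates "super" from "_".
print(found(r"super\b", "super_computer"), found(r"super\b", "super computer"))
```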
There is one other metacharacter starting with a backslash, the octal metacharacter. The octal metacharacter looks like this: "\nnn", where "n" is a number from zero to seven. This is used for specifying control characters that have no typed equivalent. For example,
/\007/
would find all textfiles with an embedded ASCII "bell" character. (The bell is specified by an ASCII value of 7.) You will rarely need to use the octal metacharacter.
There are three other metacharacters that may be of use. The first is the braces metacharacter. This metacharacter follows a normal character and contains two numbers separated by a comma (,) and surrounded by braces ({}). It is like the star metacharacter, except the length of the string it matches must be within the minimum and maximum length specified by the two numbers in braces. Thus,
/ab{3,5}c/
will match "abbbc", "abbbbc" or "abbbbbc". No other string is matched. Likewise,
/.{3,5}pentane/
will match "cyclopentane", "isopentane" or "neopentane", but not "n-pentane", since "n-" is only two characters long.
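The braces metacharacter can be checked with a few whole-string tests in Python, used here as a stand-in for the searcher. The helper name whole is hypothetical.

```python
import re


def whole(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None


# Between three and five "b"s are accepted, no fewer and no more.
for s in ["abbc", "abbbc", "abbbbbc", "abbbbbbc"]:
    print(s, whole(r"ab{3,5}c", s))

# The prefix before "pentane" must be three to five characters long.
for s in ["isopentane", "cyclopentane", "n-pentane"]:
    print(s, whole(r".{3,5}pentane", s))
```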
The alternative metacharacter is represented by a vertical bar (|). It indicates an either/or behavior by separating two or more possible choices. For example:
/isopentane|cyclopentane/
will match any file containing the strings "isopentane" or "cyclopentane" or both. However, it will not match "pentane" or "n-pentane" or "neopentane."
The last metacharacter is the brackets metacharacter. The bracket metacharacter matches one occurrence of any character inside the brackets ([]). For example,
/\s[cmt]an\s/
will match "can", "man" and "tan", but not "ban", "fan" or "pan". Similarly,
/2,[23]-dimethylbutane/
will match "2,2-dimethylbutane" or "2,3-dimethylbutane", but not "2,4-dimethylbutane", "2,23-dimethylbutane" or "2,-dimethylbutane".
Ranges of characters can be specified by using the dash (-) within the brackets. For example,
/a[a-d]z/
will match "aaz", "abz", "acz" or "adz", and nothing else. Likewise,
/textfile0[3-5]/
will match "textfile03", "textfile04", or "textfile05" and nothing else.
If you wish to include a dash within brackets as one of the characters to match, instead of to denote a range, put the dash immediately before the right bracket. Thus:
/a[1234-]z/
and
/a[1-4-]z/
both do the same thing. They both match "a1z", "a2z", "a3z", "a4z" or "a-z", and nothing else.
The bracket metacharacter can also be inverted by placing a caret (^) immediately after the left bracket. Thus,
/textfile0[^02468]/
matches any ten-character string starting with "textfile0" and ending with anything except an even digit. Inversion and ranges can be combined, so that
/\W[^f-h]ood\W/
matches any four-letter word ending in "ood" except for "food", "good" or "hood". (Thus "mood" and "wood" would both be matched.)
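The alternative and bracket metacharacters can be tried out in Python, which serves here as a stand-in for the searcher. The helper names are hypothetical, and the sample strings with surrounding spaces are my own.

```python
import re


def whole(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None


def found(pattern, s):
    """True if the pattern matches somewhere inside s."""
    return re.search(pattern, s) is not None


# Alternatives: either complete string is accepted.
print(found(r"isopentane|cyclopentane", "neopentane"))
# A bracket matches exactly one of the listed characters.
print(found(r"\s[cmt]an\s", " man "), found(r"\s[cmt]an\s", " fan "))
# A trailing dash inside brackets is a literal dash, not a range.
print(whole(r"a[1-4-]z", "a-z"), whole(r"a[1-4-]z", "a5z"))
# A leading caret inverts the bracket.
print(whole(r"textfile0[^02468]", "textfile03"), whole(r"textfile0[^02468]", "textfile04"))
print(found(r"\W[^f-h]ood\W", " mood "), found(r"\W[^f-h]ood\W", " good "))
```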
Note that within brackets, ordinary quoting rules do not apply and other metacharacters are not available. The only characters that can be quoted in brackets are "[", "]", and "\". Thus,
/[\[\\\]]abc/
matches any four-character string ending with "abc" and starting with "[", "]", or "\".
Forbidden Characters
Because of the way the searcher works, the following metacharacters should not be used, even though they are valid Perl metacharacters. They are:
Here are some other things you should know about regular expressions.
/mopac/
and
/Mopac/
and
/MOPAC/
all search for the same set of strings. Each will match "mopac", "MOPAC", "Mopac", "mopaC", "MoPaC", "mOpAc" and so forth. Thus you need not worry about capitalization. (Note, however, that metacharacters must still have the proper case. This is especially important for metacharacters whose case determines whether their meaning is reversed or not.)
\), the tilde (~) or the backtick (`), nor can you use them as delimiters.
x:/regular expression/
where x is either "F", "T" or "B". "F" instructs the searcher to only search filenames for matches to this regular expression. "T" instructs the searcher to only search a file's contents for matches to this regular expression. "B" indicates that the searcher should search both.
For instance,
F:/text/
will match any file with the string "text" in its name. Likewise,
T:/mopac/
will match any file mentioning "mopac" in the contents. It will not match a file named "mopac". Finally,
B:/carbon/
will match any file mentioning "carbon" in its name or contents. Note that if no search scope identifier is included, "B:" is assumed.
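The effect of the search scope identifier can be sketched in Python as follows. This is an illustration under stated assumptions, not the searcher's own code: the function name scope_match is hypothetical, an expression whose first two characters are "F:", "T:" or "B:" is assumed to carry an identifier, and Python's re module stands in for Perl.

```python
import re


def scope_match(scoped_expr, filename, contents):
    """Apply one optionally scoped, delimited expression to a file."""
    scope = "B"                              # default when no identifier is given
    expr = scoped_expr.strip()
    if len(expr) > 2 and expr[0] in "FTB" and expr[1] == ":":
        scope, expr = expr[0], expr[2:].strip()
    # Drop the delimiters; letter case is ignored, as in the searcher.
    pattern = re.compile(expr[1:-1], re.IGNORECASE)
    in_name = pattern.search(filename) is not None
    in_text = pattern.search(contents) is not None
    return {"F": in_name, "T": in_text, "B": in_name or in_text}[scope]


print(scope_match("F:/text/", "textfile01", "nothing relevant"))
print(scope_match("T:/mopac/", "mopac", "no mention of the program"))
print(scope_match("/carbon/", "notes", "carbon dioxide"))
```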
Once the regular expressions have been specified, you need to specify a logical relation between them. This relation determines how matches with the regular expression determine whether or not the file itself matches. For example, you might want to see all files referring to hydrogen and carbon, or those which mention methane or ethane, or those which refer to cyclopentane or neopentane but not both, or that do not refer to uranium. There are three logical operators:
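The AND operator is represented by the word AND, by a single ampersand (&), or by a pair of ampersands (&&). It returns a match only if the expression to its left is true, and the expression to its right is true.
The OR operator is represented by the word OR, by a single vertical bar (|), or by a pair of vertical bars (||). It returns a match if the expression to its left is true, or the expression to its right is true, or both.
The NOT operator is represented by the word NOT, or by an exclamation point (!). It negates the value of the expression to its right.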
Here are some simple examples of search queries with logical expressions. Note that the numbers in the logical relation show which regular expression is meant. "1" refers to the first regular expression, "2" to the second, and so on.
2
/\bcarbon\b/
/\bhydrogen\b/
1 OR 2
This will match any file referring to carbon, or hydrogen, or both.
2
/\bpentane\b/
/\bpropane\b/
1 && 2
This will match any file referring to both pentane and propane.
1
/\bmethane\b/
!1
This matches any file that does not refer to methane.
Here are some more complicated examples.
3
/\bcarbon\b/
/\bhydrogen\b/
/\boxygen\b/
1 AND 2 OR NOT 3
This matches any file that either refers to both carbon and hydrogen, or does not refer to oxygen. Note that this is the same as
3
/\bcarbon\b/
/\bhydrogen\b/
/\boxygen\b/
(1 AND 2) OR (NOT 3)
Here is a slightly different example. Note how the different position of the parentheses alters its meaning.
3
/\bcarbon\b/
/\bhydrogen\b/
/\boxygen\b/
1 AND (2 OR NOT 3)
This matches any file that refers to carbon, and that either refers to hydrogen or does not refer to oxygen. Note that with this relation, a file that did not mention carbon could not be matched, whereas with the previous relation, a file that didn't mention carbon could be matched so long as it did not mention oxygen. The position of the parentheses can drastically alter the meaning of a search.
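The way a logical relation combines the individual results can be sketched in Python. This is a minimal illustration, not the searcher's implementation: evaluate_relation is a hypothetical name, and I assume that NOT binds tighter than AND, which binds tighter than OR, since that is consistent with the equivalence of "1 AND 2 OR NOT 3" and "(1 AND 2) OR (NOT 3)" stated above.

```python
import re

TOKEN = re.compile(r"\s*(\d+|AND|OR|NOT|&&|&|\|\||\||!|\(|\))", re.IGNORECASE)
OPERATORS = {"AND": "and", "&": "and", "&&": "and",
             "OR": "or", "|": "or", "||": "or",
             "NOT": "not", "!": "not"}


def evaluate_relation(relation, results):
    """results[i] is True if regular expression number i+1 matched the file."""
    relation = relation.strip()
    position, parts = 0, []
    while position < len(relation):
        m = TOKEN.match(relation, position)
        if m is None:
            raise ValueError(f"unexpected text at position {position}")
        token = m.group(1).upper()
        if token.isdigit():
            parts.append(str(bool(results[int(token) - 1])))
        else:
            parts.append(OPERATORS.get(token, token))  # parentheses pass through
        position = m.end()
    # Only True/False, and/or/not and parentheses ever reach eval().
    return eval(" ".join(parts), {"__builtins__": {}})


# A file that mentions none of carbon, hydrogen or oxygen:
nothing = [False, False, False]
print(evaluate_relation("1 AND 2 OR NOT 3", nothing))
print(evaluate_relation("1 AND (2 OR NOT 3)", nothing))
```

The two printed values differ, which illustrates how the parentheses change which files are matched.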
Here are some examples of typical search queries. Please study them to gain a better understanding of how the searcher works.
Example 1
Suppose you wish to find files which mention "AMBER". You might use a query that looks like this:
1
/amber/
1
Since you have only one regular expression, you could also use:
/amber/
The letter case is unimportant; you could also use:
/AMBER/
or
/AmbeR/
or
/Amber/
Note that the above queries would also find strings such as "camber", "chamber", "chamberlain", "clamber" or "lambert". This is probably not what you want to do. You need to request the word "amber", and not just any string that contains "amber". One possible way to do this is by putting spaces around the word "amber", like this:
/ amber /
Unfortunately, this would not match a string such as "For this calculation, I used AMBER." This is because, in this string, "AMBER" is followed by a period, not a space. It is therefore best to use the word boundary metacharacter:
/\bamber\b/
Even this is not without problems, however. Consider the string, "Amber3.0 is slower than Amber3.1." This would not be matched, since digits and underscores are considered to be part of the word by the searcher. In this case, the best solution seems to be:
/\bamber[^a-z]/
This searches for any word starting with "amber" which is then followed by a non-alphabetic character.
As this example shows, it is important to analyze all possible combinations which will match your regular expression. You do not want to get too many unrelated files, but you want to be sure that you get all the files which relate to the topic of your search.
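The four variants of the "amber" expression can be compared against a few sample strings in Python, used here as a stand-in for the searcher. The helper name found and the first sample string are my own; the IGNORECASE flag mirrors the searcher's case-insensitive matching.

```python
import re


def found(pattern, text):
    """True if the pattern occurs in text, ignoring letter case."""
    return re.search(pattern, text, re.IGNORECASE) is not None


patterns = ["amber", " amber ", r"\bamber\b", r"\bamber[^a-z]"]
samples = [
    "a combustion chamber",
    "For this calculation, I used AMBER.",
    "Amber3.0 is slower than Amber3.1.",
]

# One row per pattern, one column per sample string.
table = {p: [found(p, s) for s in samples] for p in patterns}
for pattern, row in table.items():
    print(repr(pattern), row)
```

Only the last pattern rejects "chamber" while accepting both of the genuine references.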
Example 2
Suppose you want to search for the information on MM2, MM3, MM2P or MMP2. You can search the archives by giving the following query:
4
/MM2/
/MM3/
/MM2P/
/MMP2/
1 | 2 | 3 | 4
Or you could use:
1
/MM2|MM3|MM2P|MMP2/
1
You could also use:
/MM2|MM3|MM2P|MMP2/
But as we saw in example one, it is a good idea to be as specific as possible. So you might use this instead:
1
/\bMM[23]\b|\bMMP2\b|\bMM2P\b/
1
This is equivalent to:
/\bMM[23]\b|\bMMP2\b|\bMM2P\b/
or
B: /\bMM[23]\b|\bMMP2\b|\bMM2P\b/
This searches for all files whose contents refer to MM2, MM3, MM2P or MMP2, or whose names contain one of those strings.
Example 3
Consider the following query:
/basis\sset/
The \s stands for any white space character. Since all tabs and new lines are converted to single spaces, and multiple spaces are contracted to single ones, the query above is equivalent to:
/basis set/
Besides the term "basis set", it will also find "basis sets", "basis set," and "basis set."
Note that this example is not equivalent to the query:
2
/basis/
/set/
1 & 2
This is because in the first case, the words "basis" and "set" must be side by side. In the second case, they may be separated by many words; in fact, the "set" may be found before the "basis" is. Note also that the latter case will find all the file names having "basis" and "set" in them, while the first will not match any file names since there are no files in the archive with a space embedded in the name.
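The difference between the single expression and the two-expression query can be sketched in Python. The whitespace contraction is imitated with a substitution; the helper names and the sample sentences are my own, and Python's re module stands in for the searcher.

```python
import re


def normalize(text):
    """Turn tabs, newlines and runs of spaces into single spaces."""
    return re.sub(r"\s+", " ", text)


def found(pattern, text):
    """True if the pattern occurs in the normalized text, ignoring case."""
    return re.search(pattern, normalize(text), re.IGNORECASE) is not None


adjacent = "A large basis\n   set was used."
apart = "We set the exponents of the basis functions."

# Single expression: the two words must be side by side.
print(found(r"basis\sset", adjacent), found(r"basis\sset", apart))
# Two expressions joined by "1 & 2": any order, any distance.
print(found("basis", apart) and found("set", apart))
```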
Actually, none of the above queries are good if you really want to find all the references about basis sets. People frequently say "basis" or "set"; sometimes they say "basis functions" or "contracted gaussians", and so on. You would need a more elaborate expression to be confident that you had found most of the references to this topic. It might look like:
/basis|set|gaussians|contracted|6-31G\*|631G\*|gaussian exponent/
Or you could make your query look like this:
2
/basis|set|gaussians| contracted |6-31G\*|631G\*|gaussian exponent/
/\bDZP\b|\bTZP\b|\bDZ\b|\bTZ\b|gaussian function|STO-?\dG|\+G\*/
1 | 2
Note that I did not use "gaussian" but "gaussians". Had I not done so, I would have received all of the files referring to the GAUSSIAN program (and there are plenty of them, many of which have nothing to do with basis sets). Note also that the star was quoted with a backslash; otherwise it would match almost anything (you might want this side effect, by the way, if you wanted sets like "6-31G(3d,2f)" or "6-31G without polarization"). The "-?" means 0 or 1 minus signs (some people say "STO-3G", and some incorrectly say "STO3G"). The "\d" matches a single digit, and "\+G\*" means G preceded by a plus sign and followed by a star. The backslashes are necessary; without them, the plus sign and the star would be read as repetition metacharacters rather than as literal characters (the star, for instance, would mean "0 or more occurrences of G"). This is not what we want.
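The role of the quoted plus and star can be checked with a short Python sketch, in which Python's re module stands in for the searcher. The helper name found and the sample strings "6-31+G*" and "STO-G" are my own illustrations.

```python
import re


def found(pattern, text):
    """True if the pattern occurs somewhere in text."""
    return re.search(pattern, text) is not None


sto = r"STO-?\dG|\+G\*"
for s in ["STO-3G", "STO3G", "6-31+G*", "STO-G"]:
    print(s, found(sto, s))

# Quoted star: a literal "*" must follow the G.
print(found(r"6-31G\*", "6-31G*"), found(r"6-31G\*", "6-31G(3d,2f)"))
# Unquoted star: "zero or more G", so the bare prefix is enough.
print(found(r"6-31G*", "6-31G(3d,2f)"))
```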
Example 4
Suppose you want to search for files which talk about MNDO and d-orbitals. Here is an example of the query which could be used for this purpose:
2
/\bMNDO[^a-z]/
/\bd[^a-z]|\bd[_\s-]*orbital|\bd[_\s-]function/
1 & 2
This will look for MNDO and the strings "d", "d-orbital", "d orbital", "d_orbital", "d-function", "d_ function", "d -orbital", etc. Note the use of [] brackets. Note that the dash is at the end of the brackets, since it would be interpreted as a range if used elsewhere. Note also that we have used a metacharacter within the brackets; the brackets will match the underscore, dash, or any whitespace character. Finally, note that zero or more repetitions of the brackets are allowed, to account for the possibility of somebody (mis)typing in phrases like "dorbital", "d- orbital", "d _function", and so on. The logical relation requires that only files which mention both MNDO and "d" will be collected. This could also be written as:
3
/\bMNDO\b/
/\bd[_\s-]/
/\borbital|\bfunction/
1 & 2 & 3
This is not equivalent to the previous one, but it is close. In both examples, I did not put \b at the end of "orbital" or "function", so "orbitals", "orbital,", "orbital.", and so on will also be matched.
Example 5
Consider the following search query:
3
T: /\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO/
T: /\bCHARGE/
T: /\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]/
1 AND (2 OR 3)
The first regular expression looks for "mopac", "ampac", "am1", "pm3", "mndo", or "mindo". Note that we want to match references to both "mopac" and "mopac6". Thus, it is safer to require a non-letter after "mopac" rather than a space or word boundary. The second expression will match "charge ", "charges", "charge,", "charge.", "charge=", "charge-", and so on. The third expression matches "hydrogen bond", "hydrogen-bond", "hydrogen_bond", "hydrogenbond", "h-bond", "h bond", "h_bond", "hbond", etc. It also matches cases where "bonds" is used instead of "bond", and cases where there is punctuation following the string "bond" (or "bonds").
Note that all the regular expressions given above request searching for the text of the file only, not its name.
The logical relation simply says: "Match all files which mention MOPAC or AMPAC or AM1 or PM3 or MNDO or MINDO and which also say something about charges or hydrogen bonds".
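The whole of Example 5 can be imitated in Python as a rough check of which file contents would be selected. This is a sketch under assumptions: Python's re module stands in for the searcher, the patterns are written without their delimiters, the relation "1 AND (2 OR 3)" is hard-coded, and the sample sentences and the name file_matches are my own.

```python
import re

PATTERNS = [
    r"\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO",
    r"\bCHARGE",
    r"\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]",
]


def file_matches(contents):
    """Apply the three text-only expressions and the relation 1 AND (2 OR 3)."""
    r1, r2, r3 = (re.search(p, contents, re.IGNORECASE) is not None
                  for p in PATTERNS)
    return r1 and (r2 or r3)


for text in [
    "MOPAC6 gives poor hydrogen-bond energies.",
    "AM1 charges for water.",
    "Hydrogen bonds in proteins.",
    "MNDO parameters for sulfur.",
]:
    print(file_matches(text), "-", text)
```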
5. Using Search From Within Mailserv
When using the Mailserv program, you can perform searches of the current directory and all subdirectories by first CD-ing to the appropriate directory and then issuing the following command:
SEARCH n /regular expression 1/ /regular expression 2/ . . . /regular expression n/ logical relation
or
SEARCH /regular expression/
The format of the Mailserv search query is the same as before, except that the word SEARCH precedes it. Note that the search only takes place in the current directory and its subdirectories. This can be used to reduce search time if you have a reasonably good idea of where your target file or files will be, and if they aren't spread out all over the place. Unlike the search program above, this can be used to search not only the Computational Chemistry archives but the Russian archives as well.
This helpfile was originally written by Jan K. Labanowski. It was revamped and converted to HTML by Alan Chalker. Plaintext copies of the original helpfile and the revised helpfile are both available.