The Computational Chemistry Archives can be searched by sending a search query to the address chemistry-search@ccl.net. A list of files which satisfy your query will be sent back to you via e-mail. This is an experimental service; please send your comments and ideas to the author, Jan K. Labanowski at jkl@ccl.net.
The following document describes the format of the search query. The format, while not simple, allows for precise and elaborate search queries. Please take the time to read these instructions before using the e-mail searcher.
A search query consists of a number of lines of text. The first line contains a single number: the number of text patterns you wish to search for. Note that if you only wish to search for one text pattern, you may omit this line provided that you also omit the logical relation (see below).
Following the first line is a list of regular expressions. There should be exactly one regular expression per line; the number of regular expressions (and thus, the number of lines) should be equal to the number in the first line. The format of regular expressions is described below. (Those familiar with Perl already know the format for regular expressions.)
Following the list of regular expressions is a single line containing a logical relation. The logical relation determines which combination of regular expressions constitute a match. For example, you may wish to find all files which mention carbon and oxygen, or MOPAC or MindTool, or hydrogen and helium but not argon. Logical expressions allow this sort of "pick and choose" behavior. Note that if there is only one regular expression, you may omit the logical relation. The format of logical relations is also described below.
A search query with only one regular expression would look like this:
regular expression
A search query with n regular expressions would look like this:
n
regular expression 1
regular expression 2
.
.
.
regular expression n
logical relation
The following sections explain the format of regular expressions and logical relations in more detail.
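The query layout described above can be sketched in Python. This is only an illustration: `parse_query` is a hypothetical helper, and the real searcher's parsing rules are not documented here beyond the layout itself.

```python
def parse_query(text):
    """Split a query into (patterns, relation) following the layout above.

    Hypothetical helper for illustration; not the searcher's actual code.
    """
    # Trailing spaces are dropped, as the helpfile says the searcher does.
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if len(lines) == 1:
        # Short form: one regular expression, no count and no logical relation.
        return [lines[0]], None
    count = int(lines[0])
    patterns = lines[1:1 + count]
    if len(patterns) != count or len(lines) != count + 2:
        raise ValueError("expected %d patterns followed by one logical relation" % count)
    return patterns, lines[-1]


print(parse_query("2\n/carbon/\n/oxygen/\n1 AND 2\n"))
print(parse_query("/gauss/"))
```

The short form returns `None` for the relation, mirroring the rule that a single-pattern query may omit both the count and the logical relation.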
A regular expression is a string of characters which tells the searcher which string (or strings) you are looking for. The following explains the format of regular expressions in detail. If you are familiar with Perl, you already know the syntax. If you are familiar with Unix, you should know that there are subtle differences between Perl's regular expressions and Unix's regular expressions.
Delimiters
A regular expression always starts and ends with a character, called the delimiter, which does not appear anywhere else in the expression. The delimiter is included because the searcher removes trailing spaces at the end of a line. Without the delimiters, there would be no way to distinguish between significant and insignificant spaces. Only the characters inside the delimiters are considered part of the regular expression.
Here are some examples of legal regular expressions:
/legal/ ("/" is the delimiter)
.legal. ("." is the delimiter)
blegal b ("b" is the delimiter; note that the space after the "l" is considered part of the regular expression)
Here are some illegal regular expressions:
/illegal (there is no closing delimiter)
/ill/egal/ (the delimiter "/" also appears inside the expression)
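As a rough sketch of the delimiter rule, the following Python function accepts or rejects a line the way the rule describes. The name `strip_delimiters` is hypothetical, and the searcher's own checks may differ.

```python
def strip_delimiters(line):
    """Return the text between the delimiters, or raise ValueError if the line is illegal.

    Hypothetical helper; it only mirrors the rule stated in the helpfile.
    """
    line = line.rstrip()  # trailing spaces are removed before the delimiters are read
    if len(line) < 2 or line[0] != line[-1]:
        raise ValueError("expression must start and end with the same delimiter")
    body = line[1:-1]
    if line[0] in body:
        raise ValueError("delimiter may not appear inside the expression")
    return body


print(repr(strip_delimiters("/legal/")))
print(repr(strip_delimiters("blegal b   ")))  # the space before the closing "b" is kept
```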
Simple Regular Expressions
In its simplest form, a regular expression is just a word or phrase to search for. For example,
/gauss/
would match any file whose name had the string "gauss" in it, or which mentioned the word "gauss" in the file contents. Thus, files named "gauss", "gaussian" or "degauss" would all be matched, as would a file containing the phrases "de-gauss the monitor" or "gaussian elimination." Here are some more examples:
/carbon/
.hydro.
boxyb
>top ten>
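A simple expression behaves like a substring search. The sketch below uses Python's `re` module as a stand-in for the searcher (an assumption; the searcher itself follows Perl syntax) and passes `re.IGNORECASE` because this helpfile states that letter case is ignored.

```python
import re

# The text between the delimiters of /gauss/, compiled case-insensitively.
pattern = re.compile(r"gauss", re.IGNORECASE)

names = ["gauss", "Gaussian", "degauss", "mopac"]
matched = [name for name in names if pattern.search(name)]
print(matched)
```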
Metacharacters
Some characters have a special meaning to the searcher. These characters are called metacharacters. Although they may seem confusing at first, they add a great deal of flexibility and convenience to the searcher.
The period (.) is a commonly used metacharacter. It matches exactly one character, regardless of what the character is. For example, the regular expression:
/2,.-Dimethylbutane/
will match "2,2-Dimethylbutane" and "2,3-Dimethylbutane". Note that the period matches exactly one character; it will not match a string of characters, nor will it match the null string. Thus, "2,200-Dimethylbutane" and "2,-Dimethylbutane" will not be matched by the above regular expression.
But what if you wanted to search for a string containing a period? For example, suppose we wished to search for references to pi. The following regular expression would not work:
/3.14/ (THIS IS WRONG!)
This would indeed match "3.14", but it would also match "3514", "3f14", or even "3+14". In short, any string of the form "3x14", where x is any character, would be matched by the regular expression above.
To get around this, we introduce a second metacharacter, the backslash (\). The backslash can be used to indicate that the character immediately to its right is to be taken literally. Thus, to search for the string "3.14", we would use:
/3\.14/ (This will work.)
This is called "quoting". We would say that the period in the regular expression above has been quoted. In general, whenever the backslash is placed before a metacharacter, the searcher treats the metacharacter literally rather than invoking its special meaning.
(Unfortunately, the backslash is used for other things besides quoting metacharacters. Many "normal" characters take on special meanings when preceded by a backslash. The rule of thumb is, quoting a metacharacter turns it into a normal character, and quoting a normal character may turn it into a metacharacter.)
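The difference between the unquoted and the quoted period can be checked with a short Python sketch. Python's `re` module is used here as an approximation of the searcher's Perl-style syntax.

```python
import re

samples = ["3.14", "3514", "3f14", "3+14"]

# Unquoted period: "." stands for any single character.
unquoted = [s for s in samples if re.search(r"3.14", s)]

# Quoted period: "\." stands for a literal period only.
quoted = [s for s in samples if re.search(r"3\.14", s)]

print(unquoted)
print(quoted)
```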
Let's look at some more common metacharacters. We consider first the question mark (?). The question mark indicates that the character immediately preceding it should be matched either zero times or one time. Thus
/m?ethane/
would match either "ethane" or "methane". Similarly,
/comm?a/
would match either "coma" or "comma".
Another metacharacter is the star (*). This indicates that the character immediately to its left may be repeated any number of times, including zero. Thus
/ab*c/
would match "ac", "abc", "abbc", "abbbc", "abbbbbbbbc", and any string that starts with an "a", is followed by a sequence of "b"'s, and ends with a "c".
The plus (+) metacharacter indicates that the character immediately preceding it may be repeated one or more times. It is just like the star metacharacter, except it doesn't match the null string. Thus
/ab+c/
would not match "ac", but it would match "abc", "abbc", "abbbc", "abbbbbbbbc" and so on.
Metacharacters may be combined. A common combination includes the period and star metacharacters, with the star immediately following the period. This is used to match an arbitrary string of any length, including the null string. For example:
/cyclo.*ane/
would match "cyclodecane", "cyclohexane" and even "cyclones drive me insane." Any string that starts with "cyclo", is followed by an arbitrary string, and ends with "ane" will be matched. Note that the null string will be matched by the period-star pair; thus, "cycloane" would be matched by the above expression.
If you wanted to search for articles on cyclodecane and cyclohexane, but didn't want to match articles about how cyclones drive one insane, you could string together three periods, as follows:
/cyclo...ane/
This would match "cyclodecane" and "cyclohexane", but would not match "cyclones drive me insane." Only strings eleven characters long which start with "cyclo" and end with "ane" will be matched. (Note that "cyclopentane" would not be matched, however, since cyclopentane has twelve characters, not eleven.)
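The repetition metacharacters above can be tried out with a small Python sketch. The helper `hits` is hypothetical, and `re.search` is assumed to behave like the searcher for these basic patterns.

```python
import re


def hits(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]


optional = hits(r"comm?a", ["coma", "comma", "commma"])
star = hits(r"ab*c", ["ac", "abc", "abbbc", "adc"])
plus = hits(r"ab+c", ["ac", "abc", "abbbc", "adc"])
any_run = hits(r"cyclo.*ane", ["cyclohexane", "cycloane", "cyclones drive me insane", "cyclops"])
three = hits(r"cyclo...ane", ["cyclodecane", "cyclohexane", "cyclopentane", "cyclones drive me insane"])

for result in (optional, star, plus, any_run, three):
    print(result)
```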
Here are some more examples. These involve the backslash. Note that the placement of the backslash is important.
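/a\.*z/
Matches any string starting with an "a", followed by zero or more periods, and terminated with a "z". Thus, "az", "a.z" and "a...z" are all matched.
/a.\*z/
Matches any string starting with an "a", followed by one arbitrary character, and terminated with "*z". Thus, "ag*z", "a5*z" and "a@*z" are all matched. Only strings of length four, where the first character is "a", the third "*", and the fourth "z", are matched.
/a\++z/
Matches any string starting with an "a", followed by one or more plus signs, and terminated with a "z". Thus, "a+z", "a++z" and "a++++z" are all matched.
/a\+\+z/
Matches only the string "a++z".
/a+\+z/
Matches any string starting with one or more "a"'s and terminated with "+z". Thus, "a+z", "aa+z" and "aaaa+z" are all matched.
/a.?e/
Matches "ae", and any three-character string starting with "a" and ending with "e", such as "ace", "a5e" or "a.e".
/a\.?e/
Matches only "ae" and "a.e".
/a.\?e/
Matches any four-character string whose first character is "a", whose third is "?", and whose fourth is "e", such as "ab?e" or "a5?e".
/a\.\?e/
Matches only the string "a.?e".
The digit metacharacter, written "\d", matches exactly one digit (0 through 9). For example,
/2,\d-Dimethylbutane/
would match "2,2-Dimethylbutane", "2,3-Dimethylbutane" and so forth. Similarly,
/1\.\d\d\d\d\d/
would match any six-digit floating-point number from 1.00000 to 1.99999 inclusive. We could combine the digit metacharacter with other metacharacters; for instance,
/a\d+z/
matches any string starting with "a", followed by a string of digits, followed by a "z". (Note that the plus is used, and thus "az" is not matched.)
The letter "d" in the string "\d" must be lower-case. This is because there is another metacharacter, the non-digit metacharacter, which uses the uppercase "D". The non-digit metacharacter looks like "\D" and matches any character except a digit. Thus,
/a\Dz/
would match "abz", "aTz" or "a%z", but would not match "a2z", "a5z" or "a9z". Similarly,
/\D+/
matches any non-null string which contains no numeric characters.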
Notice that in changing the "d" from lower-case to upper-case, we have reversed the meaning of the digit metacharacter. This holds true for most other metacharacters of the format backslash-letter.
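A brief Python sketch of the digit and non-digit metacharacters follows. The helper `found` is hypothetical; Python's `\d` and `\D` are assumed to agree with the searcher for plain ASCII text such as this.

```python
import re


def found(pattern, text):
    """True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None


# "\d+" needs at least one digit between "a" and "z".
digit_results = {t: found(r"a\d+z", t) for t in ["a7z", "a2024z", "az", "abz"]}

# "\D" accepts exactly one character that is not a digit.
nondigit_results = {t: found(r"a\Dz", t) for t in ["abz", "a%z", "a2z"]}

print(digit_results)
print(nondigit_results)
```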
There are three other metacharacters in the backslash-letter format. The first is the word metacharacter, which matches exactly one letter, one number, or the underscore character (_). It is written as "\w". Its opposite, "\W", matches any one character except a letter, a number or the underscore. Thus,
/a\wz/
would match "abz", "aTz", "a5z", "a_z", or any three-character string starting with "a", ending with "z", and whose second character is either a letter (upper- or lower-case), a number, or the underscore. Similarly,
/a\Wz/
would not match "abz", "aTz", "a5z", or "a_z". It would match "a%z", "a{z", "a?z" or any three-character string starting with "a" and ending with "z" whose second character is not a letter, number, or underscore. (This means the second character must be either a symbol or a whitespace character.)
The whitespace metacharacter matches exactly one character of whitespace. (Whitespace is defined as spaces, tabs, newlines, or any character which would not use ink if printed on a printer.) The whitespace metacharacter looks like this: "\s". Its opposite, which matches any character that is not whitespace, looks like this: "\S". Thus,
/a\sz/
would match any three-character string starting with "a" and ending with "z" and whose second character was a space, tab, or newline. Likewise,
/a\Sz/
would match any three-character string starting with "a" and ending with "z" whose second character was not a space, tab or newline. (Thus, the second character could be a letter, number or symbol.)
The word boundary metacharacter matches the boundaries of words; that is, it matches the position where a word meets whitespace or punctuation, or the very beginning or end of the text. It looks like "\b". Its opposite, "\B", matches a position that is not a word boundary. Thus:
/\bcomput/
will match "computer" or "computing", but not "supercomputer", since there are no spaces or punctuation between "super" and "computer". Similarly,
/\Bcomput/
will not match "computer" or "computing", unless it is part of a bigger word such as "supercomputer" or "recomputing".
Note that the underscore (_) is considered a "word" character. Thus,
/\bcomputer/
will not match "super_computer", because there is no word boundary between the underscore and "computer".
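The word boundary rules, including the treatment of the underscore, can be checked with the Python sketch below. It assumes Python's `\b` and `\B` behave like the searcher's, which holds for Perl-style word characters (letters, digits and the underscore).

```python
import re


def found(pattern, text):
    """True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None


words = ["computer", "supercomputer", "super_computer"]

# For each word: (does \bcomput match?, does \Bcomput match?)
results = {w: (found(r"\bcomput", w), found(r"\Bcomput", w)) for w in words}
print(results)
```

In "super_computer" the underscore counts as part of the word, so no boundary precedes "comput" and only the `\B` form matches.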
There is one other metacharacter starting with a backslash, the octal metacharacter. The octal metacharacter looks like this: "\nnn", where "n" is a number from zero to seven. This is used for specifying control characters that have no typed equivalent. For example,
/\007/
would find all text files with an embedded ASCII "bell" character. (The bell is specified by an ASCII value of 7.) You will rarely need to use the octal metacharacter.
There are three other metacharacters that may be of use. The first is the braces metacharacter. This metacharacter follows a normal character and contains two numbers separated by a comma (,) and surrounded by braces ({}). It is like the star metacharacter, except the length of the string it matches must be within the minimum and maximum length specified by the two numbers in braces. Thus,
/ab{3,5}c/
will match "abbbc", "abbbbc" or "abbbbbc". No other string is matched. Likewise,
/.{3,5}pentane/
will match "cyclopentane", "isopentane" or "neopentane", but not "n-pentane", since "n-" is only two characters long.
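The braces metacharacter can be illustrated with the following Python sketch, again using `re.search` as an assumed equivalent of the searcher's matching.

```python
import re

# Between three and five "b" characters are required.
braces = [s for s in ["abbc", "abbbc", "abbbbbc", "abbbbbbc"]
          if re.search(r"ab{3,5}c", s)]

# At least three characters must precede "pentane".
pentanes = [s for s in ["cyclopentane", "isopentane", "neopentane", "n-pentane"]
            if re.search(r".{3,5}pentane", s)]

print(braces)
print(pentanes)
```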
The alternative metacharacter is represented by a vertical bar (|). It indicates an either/or behavior by separating two or more possible choices. For example:
/isopentane|cyclopentane/
will match any file containing the strings "isopentane" or "cyclopentane" or both. However, it will not match "pentane" or "n-pentane" or "neopentane."
The last metacharacter is the brackets metacharacter. The bracket metacharacter matches one occurrence of any character inside the brackets ([]). For example,
/\s[cmt]an\s/
will match "can", "man" and "tan" (when surrounded by whitespace), but not "ban", "fan" or "pan". Similarly,
/2,[23]-dimethylbutane/
will match "2,2-dimethylbutane" or "2,3-dimethylbutane", but not "2,4-dimethylbutane", "2,23-dimethylbutane" or "2,-dimethylbutane". Ranges of characters can be specified by using the dash (-) within the brackets. For example,
/a[a-d]z/
will match "aaz", "abz", "acz" or "adz", and nothing else. Likewise,
/textfile0[3-5]/
will match "textfile03", "textfile04", or "textfile05" and nothing else.
If you wish to include a dash within brackets as one of the characters to match, instead of to denote a range, put the dash immediately before the right bracket. Thus:
/a[1234-]z/
and
/a[1-4-]z/
both do the same thing. They both match "a1z", "a2z", "a3z", "a4z" or "a-z", and nothing else.
The bracket metacharacter can also be inverted by placing a caret (^) immediately after the left bracket. Thus,
/textfile0[^02468]/
matches any ten-character string starting with "textfile0" and ending with anything except an even digit. Inversion and ranges can be combined, so that
/\W[^f-h]ood\W/
matches any four-letter word ending in "ood" except for "food", "good" or "hood". (Thus "mood" and "wood" would both be matched.)
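The bracket forms covered so far (a set of choices, a range, a trailing dash, and inversion) can be compared in a short Python sketch. The helper `hits` is hypothetical, and Python's bracket syntax is assumed to match the searcher's for these cases.

```python
import re


def hits(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]


choice = hits(r"2,[23]-dimethylbutane",
              ["2,2-dimethylbutane", "2,3-dimethylbutane",
               "2,4-dimethylbutane", "2,23-dimethylbutane"])
ranged = hits(r"a[a-d]z", ["aaz", "adz", "aez"])
with_dash = hits(r"a[1-4-]z", ["a1z", "a4z", "a-z", "a5z"])
inverted = hits(r"textfile0[^02468]",
                ["textfile01", "textfile02", "textfile07", "textfile08"])

for result in (choice, ranged, with_dash, inverted):
    print(result)
```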
Note that within brackets, ordinary quoting rules do not apply and other metacharacters are not available. The only characters that can be quoted in brackets are "[", "]", and "\". Thus,
/[\[\\\]]abc/
matches any four-character string ending with "abc" and starting with "[", "]", or "\".
Forbidden Characters
Because of the way the searcher works, the following metacharacters should not be used, even though they are valid Perl metacharacters. They are:
Things To Remember
Here are some other things you should know about regular expressions.
The searcher ignores letter case. For instance,
/mopac/
and
/Mopac/
and
/MOPAC/
all search for the same set of strings. Each will match "mopac", "MOPAC", "Mopac", "mopaC", "MoPaC", "mOpAc" and so forth. Thus you need not worry about capitalization. (Note, however, that metacharacters must still have the proper case. This is especially important for metacharacters whose case determines whether their meaning is reversed or not.)
... the backslash (\), the tilde (~) or the backtick (`), nor can you use them as delimiters.
A regular expression may be preceded by a search scope identifier, in the form:
x:/regular expression/
where x is either "F", "T" or "B". "F" instructs the searcher to only search filenames for matches to this regular expression. "T" instructs the searcher to only search a file's contents for matches to this regular expression. "B" indicates that the searcher should search both. For example,
F:/text/
will match any file with the string "text" in its name. Likewise,
T:/mopac/
will match any file mentioning "mopac" in the contents. It will not match a file named "mopac". Finally,
B:/carbon/
will match any file mentioning "carbon" in its name or contents. Note that if no search scope identifier is included, "B:" is assumed.
Once the regular expressions have been specified, you need to specify a logical relation between them. This relation determines how matches with the individual regular expressions determine whether or not the file itself matches. For example, you might want to see all files referring to hydrogen and carbon, or those which mention methane or ethane, or those which refer to cyclopentane or neopentane but not both, or that do not refer to uranium.
There are three logical operators:
AND, written as the word AND, a single ampersand (&), or a pair of ampersands (&&). It returns a match only if the expression to its left is true and the expression to its right is true.
OR, written as the word OR, a single vertical bar (|), or a pair of vertical bars (||). It returns a match if the expression to its left is true, or the expression to its right is true, or both.
NOT, written as the word NOT or an exclamation mark (!). It negates the value of the expression to its right.
Parentheses can also be used to clarify which subexpressions should be evaluated first. In the absence of parentheses, NOT has highest priority, followed by AND, followed by OR.
Here are some simple examples of search queries with logical expressions. Note that the numbers in the logical relation show which regular expression is meant. "1" refers to the first regular expression, "2" to the second, and so on.
2
/\bcarbon\b/
/\bhydrogen\b/
1 OR 2
This will match any file referring to carbon, or hydrogen, or both.
2
/\bpentane\b/
/\bpropane\b/
1 && 2
This will match any file referring to both pentane and propane.
1
/\bmethane\b/
!1
This matches any file that does not refer to methane.
Here are some more complicated examples.
3
/\bcarbon\b/
/\bhydrogen\b/
/\boxygen\b/
1 AND 2 OR NOT 3
This matches any file that either refers to both carbon and hydrogen, or does not refer to oxygen. Note that this is the same as
3
/\bcarbon\b/
/\bhydrogen\b/
/\boxygen\b/
(1 AND 2) OR (NOT 3)
Here is a slightly different example. Note how the different position of the parentheses alters its meaning.
3
/\bcarbon\b/
/\bhydrogen\b/
/\boxygen\b/
1 AND (2 OR NOT 3)
This matches any file that refers to carbon, and that either refers to hydrogen or does not refer to oxygen. Note that with this relation, a file that did not mention carbon could not be matched, whereas with the previous relation, a file that didn't mention carbon could be matched so long as it did not mention oxygen. The position of the parentheses can drastically alter the meaning of a search.
Here are some examples of typical search queries. Please study them to gain a better understanding of how the searcher works.
Example 1
Suppose you wish to find files which mention "AMBER". You might use a query that looks like this:
1
/amber/
1
Since you have only one regular expression, you could also use:
/amber/
The letter case is unimportant; you could also use:
/AMBER/
or
/AmbeR/
or
/Amber/
Note that the above queries would also find strings such as "camber", "chamber", "chamberlain", "clamber" or "lambert". This is probably not what you want. You need to request the word "amber", and not just any string that contains "amber". One possible way to do this is by putting spaces around the word "amber", like this:
/ amber /
Unfortunately, this would not match a string such as "For these calculations, I used AMBER." This is because, in this string, "AMBER" is followed by a period, not a space. It is therefore best to use the word boundary metacharacter:
/\bamber\b/
Even this is not without problems, however. Consider the string, "Amber3.0 is slower than Amber3.1." This would not be matched, since digits and underscores are considered to be part of the word by the searcher. In this case, the best solution seems to be:
/\bamber[^a-z]/
This searches for any word starting with "amber" which is then followed by a non-alphabetic character.
As this example shows, it is important to analyze all possible combinations which will match your regular expression. You do not want to get too many unrelated files, but you want to be sure that you get all the files which relate to the topic of your search.
Example 2
Suppose you want to search for information on MM2, MM3, MM2P or MMP2. You can search the archives by giving the following query:
4
/MM2/
/MM3/
/MM2P/
/MMP2/
1 | 2 | 3 | 4
Or you could use:
1
/MM2|MM3|MM2P|MMP2/
1
You could also use:
/MM2|MM3|MM2P|MMP2/
But as we saw in example one, it is a good idea to be as specific as possible. So you might use this instead:
1
/\bMM[23]\b|\bMMP2\b|\bMM2P\b/
1
This is equivalent to:
/\bMM[23]\b|\bMMP2\b|\bMM2P\b/
or
B: /\bMM[23]\b|\bMMP2\b|\bMM2P\b/
This searches for all files whose contents refer to MM2, MM3, MM2P or MMP2, or whose names contain MM2, MM3, MMP2 or MM2P.
Example 3
Consider the following query:
/basis\sset/
The \s stands for any whitespace character. Since all tabs and newlines are converted to single spaces, and multiple spaces are contracted to single ones, the query above is equivalent to:
/basis set/
Besides the term "basis set", it will also find "basis sets", "basis set," and "basis set."
Note that this example is not equivalent to the query:
2
/basis/
/set/
1 & 2
This is because in the first case, the words "basis" and "set" must be side by side. In the second case, they may be separated by many words; in fact, "set" may be found before "basis". Note also that the latter query will match all file names having "basis" and "set" in them, while the first will not match any file names, since there are no files in the archive with a space embedded in the name.
Actually, none of the above queries are good if you really want to find all the references about basis sets. People frequently say "basis" or "set"; sometimes they say "basis functions" or "contracted gaussians", and so on. You would need a more elaborate expression to be confident that you had found most of the references to this topic. It might look like:
/basis|set|gaussians|contracted|6-31G\*|631G\*|gaussian exponent/
Or you could make your query look like this:
2
/basis|set|gaussians| contracted |6-31G\*|631G\*|gaussian exponent/
/\bDZP\b|\bTZP\b|\bDZ\b|\bTZ\b|gaussian function|STO-?\dG|\+G\*/
1 | 2
Note that I did not use "gaussian" but "gaussians". Had I not done so, I would have received all of the files referring to the GAUSSIAN program (and there are plenty of them, many of which have nothing to do with basis sets). Note also that the star was quoted with a backslash character; otherwise it would match almost anything (you might want this side effect, by the way, if you wanted sets like "6-31G(3d,2f)" or "6-31G without polarization"). The "-?" means 0 or 1 minus signs (some people say "STO-3G", and some incorrectly say "STO3G"). The "\d" matches a single digit. The "\+G\*" means G preceded by a plus sign and followed by a star. The backslashes are necessary; without them, the plus sign and the star would be interpreted as repetition metacharacters rather than as literal characters. This is not what we want.
Example 4
Suppose you want to search for files which talk about MNDO and d-orbitals. Here is an example of the query which could be used for this purpose:
2
/\bMNDO[^a-z]/
/\bd[^a-z]|\bd[_\s-]*orbital|\bd[_\s-]*function/
1 & 2
This will look for MNDO and the strings "d", "d-orbital", "d orbital", "d_orbital", "d-function", "d_ function", "d -orbital", etc. Note the use of brackets. Note that the dash is at the end of the brackets, since it would be interpreted as a range if used elsewhere. Note also that we have used a metacharacter within the brackets; the brackets will match the underscore, dash, or any whitespace character. Finally, note that zero or more repetitions of the brackets are allowed, to account for the possibility of somebody (mis)typing phrases like "dorbital", "d- orbital", "d _function", and so on.
The logical relation requires that only files where MNDO and "d" were simultaneously mentioned will be collected. This could also be written as:
3
/\bMNDO\b/
/\bd[_\s-]/
/\borbital|\bfunction/
1 & 2 & 3
This is not equivalent to the previous one, but it is close. In both examples, I did not put \b at the end of "orbital" or "function", so "orbitals", "orbital,", "orbital.", and so on will also be matched.
Example 5
Consider the following search query:
3
T: /\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO/
T: /\bCHARGE/
T: /\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]/
1 AND (2 OR 3)
Note that all the regular expressions given above request searching the text of the file only, not its name. The logical relation simply says: "Match all files which mention MOPAC or AMPAC or AM1 or PM3 or MNDO or MINDO and which also say something about charges or hydrogen bonds".
The first regular expression looks for "mopac", "ampac", "am1", "pm3", "mndo", or "mindo". Note that we want to match references to both "mopac" and "mopac6". Thus, it is safer to require a non-letter after "mopac" rather than a space or word boundary. The second expression will match "charge ", "charges", "charge,", "charge.", "charge=", "charge-", and so on. The third expression matches "hydrogen bond", "hydrogen-bond", "hydrogen_bond", "hydrogenbond", "h-bond", "h bond", "h_bond", "hbond", etc. It also matches cases where "bonds" is used instead of "bond", and cases where there is punctuation following the string "bond" (or "bonds").
Summary
Remember:
5. Using Search From Within Mailserv
When using the Mailserv program, you can perform searches of the current directory and all subdirectories by first CD-ing to the appropriate directory and then issuing the following command:
SEARCH
n
/regular expression 1/
/regular expression 2/
.
.
.
/regular expression n/
logical relation
or
SEARCH
/regular expression/
The format of the Mailserv search query is the same as before, except that the word SEARCH precedes it. Note that the search only takes place in the current directory and its subdirectories. This can be used to reduce search time if you have a reasonably good idea of where your target file or files will be, and if they aren't spread out all over the place. Unlike the search program above, this can be used to search not only the Computational Chemistry archives but the Russian archives as well.
6. About This Helpfile
This helpfile was originally written by Jan K. Labanowski. It was revamped and converted to HTML by Alan Chalker. Plaintext copies of the original helpfile and the revised helpfile are both available.