http://server.ccl.net/cca/html_pages/help.search.text.html.shtml |
E-Mail Search Engine
1. A Brief Overview
The Computational Chemistry Archives can be searched by sending a search query to the address chemistry-search@ccl.net. A list of files which satisfy your query will be sent back to you via e-mail. This is an experimental service; please send your comments and ideas to the author, Jan K. Labanowski, at jkl@ccl.net.
The following document describes the format of the search query. The format, while not simple, allows for precise and elaborate search queries. Please take the time to read these instructions before using the e-mail searcher.
A search query consists of a number of lines of text. The first line consists of a single number: the number of text patterns you wish to search for. Note that if you only wish to search for one text pattern, you may omit this line provided that you also omit the logical relation (see below).
Following the first line is a list of regular expressions. There should be exactly one regular expression per line; the number of regular expressions (and thus, the number of lines) should be equal to the number in the first line. The format of regular expressions is described below. (Those familiar with Perl already know the format for regular expressions.)
Following the list of regular expressions is a single line containing a logical relation. The logical relation determines which combination of regular expressions constitutes a match. For example, you may wish to find all files which mention carbon and oxygen, or MOPAC or MindTool, or hydrogen and helium but not argon. Logical relations allow this sort of "pick and choose" behavior. Note that if there is only one regular expression, you may omit the logical relation. The format of logical relations is also described below.
A search query with only one regular expression would look like this:
regular expression
A search query with n regular expressions would look like this:
n
regular expression 1
regular expression 2
.
.
.
regular expression n
logical relation
The following sections explain the format of regular expressions and logical relations in more detail.
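The query layout described above can be checked mechanically before a query is mailed. The following Python sketch is illustrative only: `parse_query` is a hypothetical helper, not part of the searcher, and it assumes that blank lines are ignored and that the single-pattern short form is exactly one line.

```python
def parse_query(text):
    """Split a query into (patterns, relation).

    relation is None when the one-line short form is used.
    Raises ValueError if the line count disagrees with the first line.
    """
    # Assumption: blank lines carry no meaning and can be dropped.
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if len(lines) == 1:
        # Short form: a single regular expression, no count, no relation.
        return lines, None
    count = int(lines[0])           # first line: number of patterns
    patterns = lines[1:1 + count]   # next `count` lines: one pattern each
    rest = lines[1 + count:]        # exactly one line should remain
    if len(patterns) != count or len(rest) != 1:
        raise ValueError("query does not match the count on its first line")
    return patterns, rest[0]
```

A query whose first line promises more patterns than are actually supplied is rejected instead of being silently misread.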
A regular expression is a string of characters which tells the searcher which
string (or strings) you are looking for. The following explains the format of
regular expressions in detail. If you are familiar with Perl, you already
know the syntax. If you are familiar with Unix, you should know that there
are subtle differences between Perl's regular expressions and the regular
expressions used by Unix tools.
A regular expression always starts and ends with a character, called the delimiter, which does not appear anywhere else in the expression. The delimiter is included because the searcher removes trailing spaces at the end of a line. Without the delimiters, there would be no way to distinguish between significant and insignificant spaces. Only the characters inside the delimiters are considered part of the regular expression. Here are some examples of legal regular expressions:
/legal/
.legal.
blegal b
Here are some illegal regular expressions:
/illegal
/ill/egal/
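The delimiter rule can be expressed as a short check. This is a minimal Python sketch, not the searcher's own code; `parse_delimited` is a hypothetical name, and the sketch assumes that an empty expression is not allowed.

```python
def parse_delimited(line):
    """Return the text between the delimiters, or None if the line is malformed."""
    line = line.rstrip()  # the searcher is described as dropping trailing spaces
    if len(line) < 3:     # assumption: at least one character between delimiters
        return None
    delimiter, body = line[0], line[1:-1]
    # The line must end with the same delimiter, which may not appear inside.
    if line[-1] != delimiter or delimiter in body:
        return None
    return body
```

With this check, "blegal b" yields "legal " (the trailing space is kept because it sits inside the delimiters), while "/illegal" and "/ill/egal/" are rejected.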
In its simplest form, a regular expression is just a word or phrase to search for. For example,
/gauss/
would match any file whose name had the string "gauss" in it, or which mentioned the word "gauss" in the file contents. Thus, files named "gauss", "gaussian" or "degauss" would all be matched, as would a file containing the phrases "de-gauss the monitor" or "gaussian elimination." Here are some more examples:
/carbon/
.hydro.
boxyb
>top ten>
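The matching rule for simple expressions (a hit in either the file name or the file contents) can be sketched in Python as follows. `file_matches` is a hypothetical helper; it takes the expression without its delimiters, and it assumes Python's `re.search` treats plain words the same way the searcher does. Whether the real searcher ignores letter case is not stated here, so the sketch leaves matching case-sensitive.

```python
import re

def file_matches(pattern, name, contents):
    """True if the pattern occurs anywhere in the file name or the contents."""
    regex = re.compile(pattern)
    # search() looks for the pattern at any position, not only at the start.
    return bool(regex.search(name) or regex.search(contents))
```

For example, `file_matches("gauss", "degauss", "")` and `file_matches("gauss", "notes", "gaussian elimination")` are both true.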
Some characters have a special meaning to the searcher. These characters are called metacharacters. Although they may seem confusing at first, they add a great deal of flexibility and convenience to the searcher.
The period (.) is a commonly used metacharacter. It matches exactly one character, regardless of what the character is. For example, the regular expression
/2,.-Dimethylbutane/
will match "2,2-Dimethylbutane" and "2,3-Dimethylbutane". Note that the period matches exactly one character; it will not match a string of characters, nor will it match the null string. Thus, "2,200-Dimethylbutane" and "2,-Dimethylbutane" will not be matched by the above regular expression.
But what if you wanted to search for a string containing a period? For example, suppose we wished to search for references to pi. The following regular expression would not work:
/3.14/ (THIS IS WRONG!)
This would indeed match "3.14", but it would also match "3514", "3f14", or even "3+14". In short, any string of the form "3x14", where x is any character, would be matched by the regular expression above.
To get around this, we introduce a second metacharacter, the backslash (\). The backslash can be used to indicate that the character immediately to its right is to be taken literally. Thus, to search for the string "3.14", we would use:
/3\.14/ (This will work.)
This is called "quoting". We would say that the period in the regular expression above has been quoted. In general, whenever the backslash is placed before a metacharacter, the searcher treats the metacharacter literally rather than invoking its special meaning.
(Unfortunately, the backslash is used for other things besides quoting metacharacters. Many "normal" characters take on special meanings when preceded by a backslash. The rule of thumb is: quoting a metacharacter turns it into a normal character, and quoting a normal character may turn it into a metacharacter.)
Let's look at some more common metacharacters.
We consider first the question mark (?). The question mark indicates that the character immediately preceding it should be matched either zero times or one time. Thus
/m?ethane/
would match either "ethane" or "methane". Similarly,
/comm?a/
would match either "coma" or "comma".
Another metacharacter is the star (*). This indicates that the character immediately to its left may be repeated any number of times, including zero. Thus
/ab*c/
would match "ac", "abc", "abbc", "abbbc", "abbbbbbbbc", and any string that starts with an "a", is followed by a sequence of "b"'s, and ends with a "c".
The plus (+) metacharacter indicates that the character immediately preceding it may be repeated one or more times. It is just like the star metacharacter, except it doesn't match the null string. Thus
/ab+c/
would not match "ac", but it would match "abc", "abbc", "abbbc", "abbbbbbbbc" and so on.
Metacharacters may be combined. A common combination includes the period and star metacharacters, with the star immediately following the period. This is used to match an arbitrary string of any length, including the null string. For example:
/cyclo.*ane/
would match "cyclodecane", "cyclohexane" and even "cyclones drive me insane." Any string that starts with "cyclo", is followed by an arbitrary string, and ends with "ane" will be matched. Note that the null string will be matched by the period-star pair; thus, "cycloane" would be matched by the above expression.
If you wanted to search for articles on cyclodecane and cyclohexane, but didn't want to match articles about how cyclones drive one insane, you could string together three periods, as follows:
/cyclo...ane/
This would match "cyclodecane" and "cyclohexane", but would not match "cyclones drive me insane." Only strings eleven characters long which start with "cyclo" and end with "ane" will be matched. (Note that "cyclopentane" would not be matched, however, since cyclopentane has twelve characters, not eleven.)
Here are some more examples. These involve the backslash. Note that the placement of the backslash is important.
/a\.*z/
/a.\*z/
/a\++z/
/a\+\+z/
/a+\+z/
/a.?e/
/a\.?e/
/a.\?e/
/a\.\?e/
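The effect of moving the backslash can be checked directly. The Python sketch below is an illustration under one assumption: that Python's `re` module reads the period, star, plus, question mark and backslash as Perl does, which holds for these constructs. It uses `re.fullmatch`, so each test string must be matched from start to end, whereas the searcher looks for the pattern anywhere in a file. The sample strings are chosen for this sketch and do not come from the searcher.

```python
import re

def whole_match(pattern, text):
    """True if the pattern accounts for the entire text, start to end."""
    return re.fullmatch(pattern, text) is not None

# pattern -> (strings it matches in full, strings it does not)
EXAMPLES = {
    r"3.14":   (["3.14", "3514", "3+14"], ["314"]),
    r"3\.14":  (["3.14"], ["3514", "3+14"]),
    r"a\.*z":  (["az", "a.z", "a...z"], ["abz"]),      # zero or more literal periods
    r"a.\*z":  (["ab*z", "a**z"], ["a*z", "abz"]),     # any character, then a literal star
    r"a\++z":  (["a+z", "a+++z"], ["az"]),             # one or more literal pluses
    r"a\+\+z": (["a++z"], ["a+z", "a+++z"]),           # exactly two literal pluses
    r"a+\+z":  (["a+z", "aaa+z"], ["a++z"]),           # one or more a's, one literal plus
    r"a.?e":   (["ae", "ace", "a.e"], ["abce"]),       # at most one arbitrary character
    r"a\.?e":  (["ae", "a.e"], ["ace"]),               # at most one literal period
    r"a.\?e":  (["ab?e", "a.?e"], ["ae", "a?e"]),      # any character, then a literal ?
    r"a\.\?e": (["a.?e"], ["ab?e", "ae"]),             # literal period, literal ?
}

def unexpected(examples):
    """Return the (pattern, text) pairs whose result differs from the table."""
    bad = []
    for pattern, (yes, no) in examples.items():
        bad += [(pattern, t) for t in yes if not whole_match(pattern, t)]
        bad += [(pattern, t) for t in no if whole_match(pattern, t)]
    return bad

print(unexpected(EXAMPLES))
```

An empty list means every row behaved as the table claims.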
The digit metacharacter looks like "\d" and matches exactly one digit. Thus,
/2,\d-Dimethylbutane/
would match "2,2-Dimethylbutane", "2,3-Dimethylbutane" and so forth. Similarly,
/1\.\d\d\d\d\d/
would match any six-digit floating-point number from 1.00000 to 1.99999 inclusive. We could combine the digit metacharacter with other metacharacters; for instance,
/a\d+z/
matches any string starting with "a", followed by one or more digits, followed by a "z". (Note that the plus is used, and thus "az" is not matched.)
The letter "d" in the string "\d" must be lower-case. This is because there is another metacharacter, the non-digit metacharacter, which uses the uppercase "D". The non-digit metacharacter looks like "\D" and matches any character except a digit. Thus,
/a\Dz/
would match "abz", "aTz" or "a%z", but would not match "a2z", "a5z" or "a9z". Similarly,
/\D+/
matches any non-null string which contains no numeric characters. Notice that in changing the "d" from lower-case to upper-case, we have reversed the meaning of the digit metacharacter. This holds true for most other metacharacters of the format backslash-letter.
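The digit and non-digit metacharacters can be tried out with a few lines of Python. This is a sketch, not the searcher's implementation: the function names are hypothetical, and the `re.ASCII` flag is an assumption added so that "\d" means only the digits 0 through 9, as the text describes.

```python
import re

# re.ASCII restricts \d to the digits 0-9.
SIX_DIGIT = re.compile(r"1\.\d\d\d\d\d", re.ASCII)

def find_values(text):
    """Return every substring of the form 1.ddddd found in the text."""
    return SIX_DIGIT.findall(text)

def has_digit_run(text):
    """True if the text contains "a", one or more digits, then "z"."""
    return re.search(r"a\d+z", text, re.ASCII) is not None

def has_non_digit_between(text):
    """True if the text contains "a", one non-digit character, then "z"."""
    return re.search(r"a\Dz", text, re.ASCII) is not None
```

For instance, `find_values("bond 1.54321 and 1.5 and 1.00000")` picks out "1.54321" and "1.00000" but skips "1.5", which has too few digits.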
There are three other metacharacters in the backslash-letter format. The first is the word metacharacter, which matches exactly one letter, one number, or the underscore character (_). It looks like this: "\w". Its opposite, which matches any character that is not a letter, number, or underscore, looks like this: "\W". Thus,
/a\wz/
would match "abz", "aTz", "a5z", "a_z", or any three-character string starting with "a", ending with "z", and whose second character was either a letter (upper- or lower-case), a number, or the underscore. Similarly,
/a\Wz/
would not match "abz", "aTz", "a5z", or "a_z". It would match "a%z", "a{z", "a?z" or any three-character string starting with "a" and ending with "z" and whose second character was not a letter, number, or underscore. (This means the second character must either be a symbol or a whitespace character.)
The whitespace metacharacter matches exactly one character of whitespace. (Whitespace is defined as spaces, tabs, newlines, or any character which would not use ink if printed on a printer.) The whitespace metacharacter looks like this: "\s". Its opposite, which matches any character that is not whitespace, looks like this: "\S". Thus,
/a\sz/
would match any three-character string starting with "a" and ending with "z" and whose second character was a space, tab, or newline. Likewise,
/a\Sz/
would match any three-character string starting with "a" and ending with "z" whose second character was not a space, tab or newline. (Thus, the second character could be a letter, number or symbol.)
The word boundary metacharacter matches the boundaries of words; that is, it matches the position where a word character (letter, number, or underscore) meets whitespace, punctuation, or the very beginning or end of the text. It does not itself use up a character. It looks like "\b". Its opposite, "\B", matches a position that is not a word boundary. Thus:
/\bcomput/
will match "computer" or "computing", but not "supercomputer", since there are no spaces or punctuation between "super" and "computer". Similarly,
/\Bcomput/
will not match "computer" or "computing", unless it is part of a bigger word such as "supercomputer" or "recomputing".
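The word, whitespace and word boundary metacharacters can be explored with the Python sketch below. The helper names are hypothetical, and `re.ASCII` is an assumption added so that letters and numbers mean plain ASCII characters; Python's `re` otherwise handles these metacharacters as Perl does.

```python
import re

def middle_kind(text):
    """Classify the middle character of a three-character string a?z."""
    # Whitespace is tested first because it also counts as a non-word character.
    if re.fullmatch(r"a\sz", text, re.ASCII):
        return "whitespace"
    if re.fullmatch(r"a\wz", text, re.ASCII):
        return "word"
    if re.fullmatch(r"a\Wz", text, re.ASCII):
        return "symbol"
    return None  # not of the form a?z

def starts_word(fragment, text):
    """True if the fragment occurs right after a word boundary."""
    return re.search(r"\b" + re.escape(fragment), text, re.ASCII) is not None

def inside_word(fragment, text):
    """True if the fragment occurs where there is no word boundary before it."""
    return re.search(r"\B" + re.escape(fragment), text, re.ASCII) is not None
```

Here `middle_kind("a_z")` gives "word" and `middle_kind("a%z")` gives "symbol", while `starts_word("comput", "supercomputer")` is false and `inside_word("comput", "supercomputer")` is true.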
Note that the underscore (_) is considered a word character, not a word boundary. Thus,
/super\bcomputer/
will not match "super_computer".
There is one other metacharacter starting with a backslash, the octal metacharacter. The octal metacharacter looks like this: "\nnn", where "n" is a number from zero to seven. This is used for specifying control characters that have no typed equivalent. For example,
/\007/
would find all text files with an embedded ASCII "bell" character. (The bell is specified by an ASCII value of 7.) You will rarely need to use the octal metacharacter.
There are three other metacharacters that may be of use. The first is the braces metacharacter. This metacharacter follows a normal character and contains two numbers separated by a comma (,) and surrounded by braces ({}). It is like the star metacharacter, except the length of the string it matches must be within the minimum and maximum length specified by the two numbers in braces. Thus,
/ab{3,5}c/
will match "abbbc", "abbbbc" or "abbbbbc". No other string is matched. Likewise,
/.{3,5}pentane/
will match "cyclopentane", "isopentane" or "neopentane", but not "n-pentane", since "n-" is only two characters long.
The alternative metacharacter is represented by a vertical bar (|). It indicates an either/or behavior by separating two or more possible choices. For example:
/isopentane|cyclopentane/
will match any file containing the strings "isopentane" or "cyclopentane" or both. However, it will not match "pentane" or "n-pentane" or "neopentane."
The last metacharacter is the brackets metacharacter. The bracket metacharacter matches one occurrence of any character inside the brackets ([]). For example,
/\s[cmt]an\s/
will match "can", "man" and "tan", but not "ban", "fan" or "pan". Similarly,
/2,[23]-dimethylbutane/
will match "2,2-dimethylbutane" or "2,3-dimethylbutane", but not "2,4-dimethylbutane", "2,23-dimethylbutane" or "2,-dimethylbutane".
Ranges of characters can be used by using the dash (-) within the brackets.
For example,
/a[a-d]z/
will match "aaz", "abz", "acz" or "adz", and nothing else. Likewise,
/textfile0[3-5]/
will match "textfile03", "textfile04", or "textfile05" and nothing else.
If you wish to include a dash within brackets as one of the characters to match, instead of to denote a range, put the dash immediately before the right bracket. Thus:
/a[1234-]z/
and
/a[1-4-]z/
both do the same thing. They both match "a1z", "a2z", "a3z", "a4z" or "a-z", and nothing else.
The bracket metacharacter can also be inverted by placing a caret (^) immediately after the left bracket. Thus,
/textfile0[^02468]/
matches any ten-character string starting with "textfile0" and ending with anything except an even digit. Inversion and ranges can be combined, so that
/\W[^f-h]ood\W/
matches any four-letter word ending in "ood" except for "food", "good" or "hood". (Thus "mood" and "wood" would both be matched.)
Note that within brackets, ordinary quoting rules do not apply and other metacharacters are not available. The only characters that can be quoted in brackets are "[", "]", and "\". Thus,
/[\[\\\]]abc/
matches any four-letter string ending with "abc" and starting with "[", "]", or "\".
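The braces, alternative and bracket metacharacters can be verified against the strings quoted above. The Python sketch below assumes that Python's `re` module reads these constructs as Perl does, which holds for the patterns shown. It uses `re.fullmatch`, so each string is tested as a whole and not as part of a larger file. A few sample strings ("abbc", "aez", "a5z", "textfile0x", "xabc") are additions made for this sketch.

```python
import re

# pattern -> (strings it matches in full, strings it does not)
EXAMPLES = {
    r"ab{3,5}c": (["abbbc", "abbbbc", "abbbbbc"], ["abbc", "abbbbbbc"]),
    r".{3,5}pentane": (["cyclopentane", "isopentane", "neopentane"], ["n-pentane"]),
    r"isopentane|cyclopentane": (["isopentane", "cyclopentane"], ["pentane", "neopentane"]),
    r"2,[23]-dimethylbutane": (
        ["2,2-dimethylbutane", "2,3-dimethylbutane"],
        ["2,4-dimethylbutane", "2,23-dimethylbutane", "2,-dimethylbutane"],
    ),
    r"a[a-d]z": (["aaz", "adz"], ["aez"]),
    r"a[1234-]z": (["a1z", "a4z", "a-z"], ["a5z"]),
    r"a[1-4-]z": (["a1z", "a4z", "a-z"], ["a5z"]),
    r"textfile0[^02468]": (["textfile03", "textfile0x"], ["textfile04"]),
    r"[\[\\\]]abc": (["[abc", "]abc", "\\abc"], ["xabc"]),
}

def unexpected(examples):
    """Return the (pattern, text) pairs whose result differs from the table."""
    bad = []
    for pattern, (yes, no) in examples.items():
        for text in yes:
            if re.fullmatch(pattern, text) is None:
                bad.append((pattern, text))
        for text in no:
            if re.fullmatch(pattern, text) is not None:
                bad.append((pattern, text))
    return bad

print(unexpected(EXAMPLES))
```

An empty list means every pattern accepted and rejected exactly the strings listed for it.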
Because of the way the searcher works, the following metacharacters should
not be used, even though they are valid Perl metacharacters. They
are:
Here are some other things you should know about regular expressions.
For instance,
Once the regular expressions have been specified, you need to specify a
logical relation between them. This relation determines how matches against
the individual regular expressions combine to decide whether or not the file
itself matches.
For example, you might want to see all files referring to hydrogen and
carbon, or those which mention methane or ethane, or those which refer to
cyclopentane or neopentane but not both, or that do not refer to uranium.
There are three logical operators:
Here are some simple examples of search queries with logical expressions.
Note that the numbers in the logical relation show which regular expression
is meant. "1" refers to the first regular expression, "2" to the second,
and so on.
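How a logical relation combines the per-expression results can be sketched in Python. This is an illustration only, and several things in it are assumptions: the operator spellings "and", "or" and "not" are placeholders chosen for the sketch (the searcher's actual operator symbols are not reproduced here), the precedence (not, then and, then or) is assumed, and `evaluate_relation` is a hypothetical name. The one feature taken from the text is that a number n stands for the result of the n-th regular expression.

```python
import re

def evaluate_relation(relation, matched):
    """Evaluate a relation such as "(1 or 2) and not 3".

    matched[i] tells whether regular expression number i + 1 matched the file.
    """
    # Placeholder operator spellings; see the assumptions stated above.
    tokens = re.findall(r"\d+|and|or|not|[()]", relation)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "or":
            take()
            right = parse_and()      # always parse, so the tokens are consumed
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "and":
            take()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() == "not":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            take()                   # consume the closing parenthesis
            return value
        return matched[int(token) - 1]   # "1" refers to the first expression

    return parse_or()
```

With three expressions of which only the second matched, `evaluate_relation("(1 or 2) and not 3", [False, True, False])` is true.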
Here are some more complicated examples.
Here are some examples of typical search queries. Please study them to gain
a better understanding of how the searcher works.
Suppose you wish to find files which mention "AMBER". You might use a query
that looks like this:
As this example shows, it is important to
analyze all possible combinations which will
match your regular expression. You do not want to get too many unrelated files,
but you want to be sure that you get all the files which relate to the
topic of your search.
Suppose you want to search for the information on MM2, MM3, MM2P or MMP2.
You can search the archives by giving the following query:
Note that this example is not equivalent to the query:
Actually, none of the above queries are good if you really want to find
all the references about basis sets. People frequently say "basis" or
"set"; sometimes they say "basis functions" or "contracted gaussians",
and so on. You would need a more elaborate expression to be confident
that you had found most of the references to this topic. It might look like:
Suppose you want to search for files which talk about MNDO and d-orbitals.
Here is an example of the query which could be used for this purpose:
Note that all the regular expressions
given above request searching for the text of the file only, not its name.
The logical relation simply says: "Match all files which mention MOPAC or
AMPAC or AM1 or PM3 or MNDO or MINDO and which also say something about
charges or hydrogen bonds".
When using the Mailserv program, you can perform searches of the current
directory and all subdirectories by first CD-ing to the appropriate directory
and then issuing the following command:
This helpfile was originally written by Jan K. Labanowski. It was revamped
and converted to HTML by Alan Chalker. Plaintext copies of the original
helpfile and the revised helpfile are both available.
Modified: Thu Nov 6 17:00:00 1997 GMT