At last... Searching CCL archives.
- From: jkl ( ( at ) ) ccl.net (Jan Labanowski)
- Subject: At last... Searching CCL archives.
- Date: Tue, 1 Dec 1992 21:14:49 -0500
Dear Subscribers,
Here is something new for the list.
Jan
jkl ( ( at ) ) ccl.net
------------- HOW TO SEARCH COMPUTATIONAL CHEMISTRY ARCHIVES -------------
This file can be obtained from anonymous ftp at www.ccl.net as
pub/chemistry/help.search
or via e-mail by sending a message:
send help.search from chemistry
to OSCPOST ( ( at ) ) ccl.net or OSCPOST ( ( at ) ) OHSTPY.BITNET
Computational chemistry archives can be searched by sending a search
query to the address: chemistry-search ( ( at ) ) ccl.net. The list of files
which satisfy the query will be sent back via e-mail to the originator
of the search request. This is an experimental service and
may be improved (or discontinued) in the future, so please send your
comments and ideas to the author: jkl ( ( at ) ) ccl.net.
The following document describes the format of the search query. The format
is not simple; however, it allows for precise and elaborate search queries.
If you have suggestions on how to make it simpler without losing its
generality, please tell me. I will appreciate your comments.
The search query specifies:
1. How many text patterns to look for,
2. Text patterns to look for --- so called "regular expressions",
3. Logical relation between patterns which needs to be satisfied for
including the file in the search result --- logical expression.
1. REGULAR EXPRESSIONS
======================
In its simplest form, a regular expression is just a word which you want to
find in a file name and/or in the text of the file itself. This word has to be
bracketed with a unique character (i.e., a character which does not appear
inside the regular expression) which will be called a "delimiter". Since
leading and trailing spaces are removed from each line during processing,
the delimiter is the only way to distinguish between significant
and insignificant spaces.
Some characters (so called "metacharacters") and groups of characters have
a special meaning within the regular expression and if their original meaning
is needed, they have to be "quoted" by preceding them with a backslash
character "\". On the other hand, quoting some ordinary letters may
attach a special meaning to them, so use the backslash judiciously.
Note that the list below corresponds to the Perl convention for regular
expressions and is different from the one used in UNIX regular expressions.
Remember also that:
a) the file appears to the searching program as one long line of
text where words are separated by single spaces (new lines are replaced
with spaces; multiple spaces, tabs and other white space are contracted
into a single space; hyphenated words at the end of the line are joined).
There are two exceptions to this rule: 1) file names, if searched, are
treated as separate pieces of text, 2) when searching files containing
archived messages posted to the list on a given date, each message is
searched separately, as if it were a separate file. If the logical
expression is satisfied for the message, then the name of the file
is reported. Please note that only text files are scanned for text
(the contents of binary files are not). However, all file names (for binary
as well as text files) are scanned, if requested.
b) Search is lettercase insensitive, i.e., searching for words: charge,
Charge, cHaRgE, etc., will produce the same result.
c) The description below includes full metacharacter definitions; however,
some will never match in the text, since there are no new lines or tabs,
A-Z is the same as a-z, etc.
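
The preprocessing described in points a) and b) can be sketched roughly as
follows. This is only an illustration written with Python's re module; the
function names normalize and found are hypothetical, and the actual archive
software may well handle the details (for instance hyphenation) differently.

```python
import re

def normalize(text: str) -> str:
    """Approximate the "one long line" view of a file (an assumption,
    not the archive's actual code)."""
    # Join words hyphenated at the end of a line: "semi-\nempirical"
    text = re.sub(r"-\n\s*", "", text)
    # Replace new lines, tabs and runs of white space with one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def found(pattern: str, text: str) -> bool:
    # The search ignores letter case, as stated in point b)
    return re.search(pattern, normalize(text), re.IGNORECASE) is not None

sample = "The semi-\nempirical   MOPAC\tprogram computes the\nCharge distribution."
print(normalize(sample))
print(found(r"charge\sdistribution", sample))
```

Because new lines become spaces, a pattern such as the\scharge can match
words that were on two different lines of the original file.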
Constructs in regular expressions:
. --- Period matches any character except a new line (remember, no
new lines in here).
[ ] --- Any character within the square brackets matches, e.g. [a,b]
matches a, comma, or b. Ranges are also allowed: [0-9a-z] will
match any digit or letter. Note that if you are searching for
"-", you must put it just before the right ], or it will be
treated as a range. Negation within square brackets is achieved
with a caret character immediately following the left bracket: "[^".
[^a-z] means: match everything but letters, [^_] means
everything but underscore. Note that within brackets, characters
.?*+|()$^{} should not be quoted, while [, \, ] should be
entered as \[, \\, \].
\d --- Matches any digit (i.e., is a shorthand for [0-9]).
\D --- Matches everything but digit (i.e., is a shorthand for [^0-9]).
\w --- Matches "word" characters, i.e., letters, digits and underscore
(same as [a-zA-Z0-9_]).
\W --- Matches "nonword" character (same as [^a-zA-Z0-9_]).
\s --- Matches a white-space (i.e., space, new_line, tab, etc.).
\S --- Matches a non-white-space character (i.e., characters which
use pigment in your printer).
\n --- Matches new line (do not use here).
\r --- Matches carriage return (do not use here).
\t --- Matches a tab (do not use here, tabs are converted to spaces).
\f --- Matches a formfeed (do not use)
\b --- Matches a backspace (inside [] only; do not use)
\xxx --- Matches the character whose ASCII octal code is xxx (xxx are digits).
() --- () are special characters to quote substrings for substitution.
We do not do substitutions here. If you search for parentheses,
use \( and \) outside square brackets.
\1, \2 ... \9 --- used only in substitution strings. Do not use here.
x? --- matches 0 or 1 occurrences of character x (or any other, i.e.,
[a-z]? matches 0 or 1 letter).
x* --- matches 0 or more occurrences of character x.
x+ --- matches 1 or more occurrences of character x.
x{m,n} --- matches at least m, but no more than n occurrences of x, e.g.,
[0-9.+-]{2,3} will match: 1.2, .2, +11, -1, 123, +-.
| --- alternative: \son\s|\sin\s|\sup\s|\sat\s will match words:
on, in, up, and at.
\b --- matches a word boundary (outside [] only). It corresponds to
white space, punctuation marks and the very beginning and
end of the line.
\B --- matches where there is no word boundary (e.g., between two letters
inside a word).
^ --- outside [] marks the beginning of the string. Do not use here.
$ --- outside [] marks the end of the string. Do not use here.
Note that in UNIX, parentheses and braces are quoted to get their special
meaning, while here, they need to be quoted to get their ordinary meaning.
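
A few of the constructs above can be tried out locally before sending a
query. The sketch below uses Python's re module, which follows largely the
same conventions as Perl for these particular constructs; this is an
assumption about compatibility, not a description of the archive's software.
The sample line and the helper hits are made up for illustration.

```python
import re

line = "mopac6 gave a charge of -0.25 on the h-bond donor (see mm2 notes)"

def hits(pattern):
    # Return every non-overlapping match, ignoring letter case
    return re.findall(pattern, line, re.IGNORECASE)

print(hits(r"\bMM[23]\b"))      # character class between word boundaries
print(hits(r"[0-9.+-]{2,3}"))   # "-" placed last inside [] is literal
print(hits(r"\son\s|\sof\s"))   # alternatives separated by |
print(hits(r"\(see"))           # a quoted parenthesis is an ordinary character
```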
Regular expressions, bracketed with a delimiter character, can be optionally
preceded with a label and a search scope identifier followed by a colon ":".
The label is an integer number and the scope is one of the letters:
T - text only (file names will not be matched to a regular expression),
F - file name only (text inside the file will not be scanned),
B - both file name and text will be scanned for matching (default).
Both the label and the scope can be omitted, but if either exists, the colon
must be present. For the purpose of the search, all the expressions below
are practically equivalent (the bracketed forms would also match MMPAC and
AOPAC):
1B : /[MA][MO]PAC/
2: #ampac|mopac#
: +MOPAC|AMPAC+
B4: *Ampac|Mopac*
-[am][mo]pac-
Note that the numerical value of integer label is disregarded by the program,
and the program assigns the numbers to regular expressions based on the
order in which they were specified. It is here only for your convenience.
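
As an illustration of the rules above, here is a minimal sketch of how one
such line could be split into its scope and its regular expression. The
function parse_expression is hypothetical and is not the code used by the
archive; it ignores the label, as the text says the program does, and it does
not handle unusual delimiters such as a digit or a colon.

```python
import re

def parse_expression(line):
    """Return (scope, pattern) for one regular-expression line."""
    line = line.strip()
    scope = "B"                       # default: both file name and text
    # Optional label (digits) and scope letter in either order, then ":"
    m = re.match(r"([0-9]*)\s*([TFBtfb]?)\s*([0-9]*)\s*:\s*", line)
    if m:
        scope = (m.group(2) or "B").upper()
        line = line[m.end():]
    if len(line) < 2 or line[0] != line[-1]:
        raise ValueError("expression must be bracketed by a delimiter")
    delimiter, pattern = line[0], line[1:-1]
    if delimiter in pattern:
        raise ValueError("delimiter must not appear inside the expression")
    return scope, pattern

for text in ("1B : /[MA][MO]PAC/", "2: #ampac|mopac#", ": +MOPAC|AMPAC+",
             "B4: *Ampac|Mopac*", "-[am][mo]pac-"):
    print(parse_expression(text))
```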
2. LOGICAL RELATION
===================
Once you have specified your regular expressions, you need to specify a
logical relation between them which qualifies the file for reporting. The
logical relation can contain only operators:
& --- AND (you can also use && if you are a UNIX or C fan)
| --- OR (you can also use || if you are a UNIX or C fan)
! --- NOT
parentheses, ordinal numbers of regular expressions and spaces. The number of
the regular expression corresponds to its status (FOUND or NOT_FOUND) after
searching its scope (i.e., file text and/or file name). For example:
1 & !2 means that the file should be reported if regular expression number 1
matched and expression number 2 did not. You can make your relation as
complicated as you wish and use nested parentheses, but remember that:
!(1 & 2) is equivalent to !1 | !2, and !(1 | 2) is equivalent to !1 & !2.
Remember also that in logic & takes precedence over |, and ! is
a unary operator and takes precedence over both of them, but use
parentheses for better readability.
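
The precedence rules can be checked with a small evaluator. The sketch below
is a hypothetical recursive-descent parser written for illustration only; it
is not the program running behind chemistry-search. The dictionary status
stands for the FOUND / NOT_FOUND state of each numbered expression.

```python
import re

def evaluate(relation, status):
    """Evaluate a relation such as "1 & (2 | !3)".
    status maps each expression number to True (FOUND) or False."""
    tokens = re.findall(r"\d+|&&?|\|\|?|[!()]", relation)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("relation ended unexpectedly")
        pos += 1
        return tokens[pos - 1]

    def parse_or():                   # lowest precedence
        value = parse_and()
        while peek() in ("|", "||"):
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() in ("&", "&&"):
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():                  # unary, highest precedence
        if peek() == "!":
            take()
            return not parse_not()
        if peek() == "(":
            take()
            value = parse_or()
            if take() != ")":
                raise ValueError("missing )")
            return value
        return status[int(take())]

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token in relation")
    return result

status = {1: True, 2: False, 3: True}
print(evaluate("1 & (2 | 3)", status))
print(evaluate("!(1 & 2)", status) == evaluate("!1 | !2", status))
```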
3. QUERY FORMAT AND EXAMPLES
============================
The complete query has the following format:
Number of regular expressions (N)
regular expression 1
regular expression 2
....
regular expression N
logical relation
The query should be sent to chemistry-search ( ( at ) ) ccl.net, and when the
search is finished, the resulting list of files satisfying your query will be
sent to you automatically. Do not be impatient and do not send your
next query before you get the results of the first one --- your request
will be denied. Please remember that this is a flat file search and it is
very demanding as far as I/O and CPU are concerned. Therefore only one query
will be running at any given moment. Before you send a query, try to analyze
it, and make it specific. You do not want to get a listing of the whole
archive. The queries which search only file names are much faster than the
queries which scan the whole file (i.e., you get your results faster, but you
might not like what you get).
If you look for a single term, you can use the abbreviated one-line format
which consists of a regular expression only.
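
To make the layout concrete, the following sketch splits a message body into
its regular-expression lines and its logical relation. The function
split_query is a hypothetical helper; in particular, treating a one-line
message as having the relation "1" is an assumption based on the equivalence
shown in Example 1 below.

```python
def split_query(message):
    """Return (expression_lines, relation) for a query message body."""
    lines = [ln.strip() for ln in message.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty query")
    if len(lines) == 1:
        # Abbreviated one-line format: a single regular expression
        return lines, "1"
    count = int(lines[0])             # number of regular expressions (N)
    if len(lines) != count + 2:
        raise ValueError("expected N expressions and one logical relation")
    return lines[1:count + 1], lines[-1]

print(split_query("2\n/basis/\n/set/\n1 & 2\n"))
print(split_query("/basis\\sset/"))
```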
Now, a few examples.
Example 1.
----------
1
1B: /\bMM[23]\b|MMP2\b/
1
which is equivalent to saying:
B: /\bMM[23]\b|MMP2\b/
or
/\bMM[23]\b|MMP2\b/
since the default is both text and file names. You search for all text files
which refer to MM2, MM3 or MMP2, or file names which contain MM2, MM3 or MMP2.
Example 2.
----------
/basis\sset/
Note that this example is not equivalent to the query:
2
/basis/
/set/
1 & 2
since in the first case, the words "basis" and "set" must be side by side,
while in the latter case they may be separated by many words and in fact
"set" may be found before "basis" is. Also, the latter case will
find all the file names having "basis" or "set" in them, while there are no
file names in the archive which have a space embedded in them.
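
The difference between the two queries can be seen on two made-up sentences.
The snippet below is only an illustration using Python's re module; the
sample texts and the helper found are not taken from the archive.

```python
import re

def found(pattern, text):
    return re.search(pattern, text, re.IGNORECASE) is not None

adjacent = "the 6-31G basis set was used"
apart = "a set of tests for the new basis functions"

for text in (adjacent, apart):
    phrase = found(r"basis\sset", text)                   # one-line query
    both = found(r"basis", text) and found(r"set", text)  # relation 1 & 2
    print(phrase, both)
```

Only the first sentence satisfies the one-line query, while both sentences
satisfy the two-expression query.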
Example 3.
----------
3
1T: /\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO/
2T: /\bCHARGE/
3T: /\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]/
1 & (2 | 3)
Three regular expressions were specified. Expression 1 looks for the words
MOPAC, AMPAC, AM1, PM3, MNDO or MINDO. Note that the text may say either
MOPAC or MOPAC6, so it is safer to require a non-letter after MOPAC rather
than a space or word boundary. The 2nd expression looks for "CHARGE ",
"CHARGES", "CHARGE,", "CHARGE.", "CHARGE=", "CHARGE-", etc. The 3rd one is
a challenge for you. Note that people may say: HYDROGEN BOND, HYDROGEN-BOND,
HYDROGEN_BOND, HYDROGENBOND, H-BOND, H BOND, H_BOND, HBOND, and may say BONDS,
BONDING, and may put .,; after BOND. Note that all the regular expressions
given above request searching for the text of the file only, not its name.
The logical relation simply says: "find me the files which mention MOPAC or
AMPAC or AM1 or PM3 or MNDO or MINDO and say also something about
CHARGes or HYDROGEN BONDs".
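
For those who want to take up the challenge, the three expressions can be
tried on sample sentences before sending the query. The sketch below uses
Python's re module and made-up sentences. The patterns are written without
delimiters and with the hyphen placed last inside the brackets, following the
rule given earlier for "-"; the function report is a hypothetical stand-in
for the relation 1 & (2 | 3).

```python
import re

E1 = r"\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO"
E2 = r"\bCHARGE"
E3 = r"\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]"

def report(text):
    # Status (FOUND or not) of each expression, ignoring letter case
    f = [re.search(p, text, re.IGNORECASE) is not None for p in (E1, E2, E3)]
    return f[0] and (f[1] or f[2])    # relation: 1 & (2 | 3)

print(report("MOPAC6 was used to study H-bonding in water dimers"))
print(report("AM1 charges differ from ab initio ones"))
print(report("Hydrogen bonds computed with an ab initio method"))
```

The last sentence mentions hydrogen bonds but none of the methods listed in
expression 1, so it would not be reported.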
In short:
1. prepare your search query as described above
2. send it to chemistry-search ( ( at ) ) ccl.net
3. wait (for a long, long time) for an answer
-----------------------------
I will welcome suggestions for improving this description and corrections
to my spelling and grammar (as you know, I am not a native English speaker).
Jan Labanowski
jkl ( ( at ) ) ccl.net