At last... Searching CCL archives.

From: jkl ( ( at ) ) ccl.net (Jan Labanowski)
Subject: At last... Searching CCL archives.
Date: Tue, 1 Dec 1992 21:14:49 -0500
 Dear Subscribers,
 Here is something new for the list.
 Jan
 jkl ( ( at ) ) ccl.net
  ------------- HOW TO SEARCH COMPUTATIONAL CHEMISTRY ARCHIVES -------------
 This file can be obtained from anonymous ftp at www.ccl.net as
  pub/chemistry/help.search
 or via e-mail by sending a message:
  send help.search from chemistry
 to OSCPOST ( ( at ) ) ccl.net or OSCPOST ( ( at ) ) OHSTPY.BITNET
 Computational chemistry archives can be searched by sending a search
 query to the address: chemistry-search ( ( at ) ) ccl.net. In response, a list
 of the files which satisfy the query will be sent back via e-mail to the
 originator of the search request. This is an experimental service and
 may be improved (or discontinued) in the future, so please send your
 comments and ideas to the author: jkl ( ( at ) ) ccl.net.
 The following document describes the format of the search query. The format
 is not simple; however, it allows for precise and elaborate search queries.
 If you have suggestions on how to make it simpler without losing its
 generality, please tell me. I will appreciate your comments.
 The search query specifies:
  1. How many text patterns to look for,
  2. Text patterns to look for --- so called "regular expressions",
  3. Logical relation between patterns which needs to be satisfied for
     including the file in the search result --- logical expression.
 1. REGULAR EXPRESSIONS
 ======================
 In its simplest form it is just a word which you want to find in a file name
 and/or in the text of the file itself. This word has to be bracketed with a
 unique character (i.e., a character which does not appear inside the regular
 expression) which will be called a "delimiter". Since spaces at the beginning
 and at the end of the line are removed during processing, the delimiter is
 the only way to distinguish between significant and insignificant spaces.
 Some characters (so called "metacharacters") and groups of characters have
 a special meaning within the regular expression and if their original meaning
 is needed, they have to be "quoted" by preceding them with a backslash
 character "\". On the other hand, quoting some ordinary letters may
 attach some special meaning to them, so use the backslash judiciously.
 Note that the list below corresponds to the Perl convention for regular
 expressions and is different from the one used in UNIX regular expressions.
 Remember also that:
   a) the file appears to the searching program as one long line of
      text where words are separated by single spaces (new lines are replaced
      with spaces; multiple spaces, tabs and other white space are contracted
      into a single space; hyphenated words at the end of a line are joined).
      There are two exceptions to this rule: 1) file names, if searched, are
      treated as separate pieces of text, 2) when searching files containing
      archived messages posted to the list on a given date, each message is
      searched separately, as if it were a separate file. If the logical
      expression is satisfied for the message, then the name of the file
      is reported. Please note that only text files are scanned for text
      (the contents of binary files are not). However, all file names (for
      binary as well as text files) are scanned, if requested.
   b) The search is case insensitive, i.e., searching for the words charge,
      Charge, cHaRgE, etc., will produce the same result.
   c) The description below includes full metacharacter definitions; however,
      some will never match anything in the text, since there are no new lines
      or tabs, A-Z is the same as a-z, etc.
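 The flattening described in a) and b) can be pictured with a short Python
 sketch. This is only an illustration and not the code used by the service:
 the function names normalize and found are made up here, and dropping the
 hyphen when joining a word split across lines is an assumption.

```python
import re

def normalize(text: str) -> str:
    """Flatten a file's text into one long, single-spaced line."""
    # Join a word hyphenated across a line break (assumption: the hyphen is dropped).
    text = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)
    # New lines, tabs and runs of white space all become a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def found(pattern: str, text: str) -> bool:
    """Case-insensitive search of the flattened text."""
    return re.search(pattern, normalize(text), re.IGNORECASE) is not None

sample = "Atomic  cHaRgE\twas computed with a semi-\n   empirical\nmethod."
print(normalize(sample))
print(found("charge", sample), found(r"semiempirical\smethod", sample))
```

 After flattening, a pattern such as \smethod can match across what used to
 be a line break, which is why \n and \t are of no use in a query.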
 Constructs in regular expressions:
      .     --- Period matches any character except a new line (remember, no
                new lines in here).
     [ ]    --- Any character within the square brackets matches, e.g. [a,b]
                matches a, comma, b. Ranges are also allowed: [0-9a-z] will
                match any digit or letter. Note that if you are searching for
                "-", you must put it just before the right ], or it will be
                treated as a range. Negation within square brackets is achieved
                with a caret character immediately following the left bracket:
                "[^". [^a-z] means: match everything but letters, [^_] means
                everything but underscore. Note that within brackets, characters
                .?*+|()$^{} should not be quoted, while [, \, ] should be
                entered as \[, \\, \].
     \d     --- Matches any digit (i.e., is a shorthand for [0-9]).
     \D     --- Matches everything but digit (i.e., is a shorthand for [^0-9]).
     \w     --- Matches "word" characters, i.e., letters, digits and underscore
                (same as [a-zA-Z0-9_]).
     \W     --- Matches "nonword" character (same as [^a-zA-Z0-9_]).
     \s     --- Matches a white-space (i.e., space, new_line, tab, etc.).
     \S     --- Matches a non-white-space character (i.e., characters which
                use pigment in your printer).
     \n     --- Matches new line (do not use here).
     \r     --- Matches carriage return (do not use here).
     \t     --- Matches a tab (do not use here, tabs are converted to spaces).
     \f     --- Matches a formfeed (do not use)
     \b     --- Matches a backspace when used inside [ ] (do not use)
     \xxx   --- Matches the character with ASCII octal code xxx (xxx are digits).
     ()     --- () are special characters to quote substrings for substitution.
                We do not do substitutions here. If you search for parentheses,
                use \( and \) outside square brackets.
     \1, \2 ... \9  --- used only in substitution strings. Do not use here.
     x?     --- matches 0 or 1 occurrences of character x (or any other, i.e.,
                [a-z]? matches 0 or 1 letter).
     x*     --- matches 0 or more occurrences of character x.
     x+     --- matches 1 or more occurrences of character x.
     x{m,n} --- matches at least m, but no more than n occurrences of x, e.g.,
                [0-9.+-]{2,3} will match: 1.2, .2, +11, -1, 123, +-.
     |      --- alternative: \son\s|\sin\s|\sup\s|\sat\s will match words:
                on, in, up, and at.
     \b     --- matches a word boundary (outside [] only), i.e., the position
                between a word character and white space, a punctuation mark,
                or the very beginning or end of the line.
     \B     --- matches anywhere that is not a word boundary (e.g., between
                two letters of the same word).
     ^      --- outside [] marks the beginning of the string. Do not use here.
     $      --- outside [] marks the end of the string. Do not use here.
 Note that in UNIX, parentheses and braces are quoted to get their special
 meaning, while here, they need to be quoted to get their ordinary meaning.
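 A few of the constructs above can be tried out locally before a query is
 mailed. The sketch below uses Python's re module, which follows the same
 conventions for the constructs shown here; it is a stand-in for
 experimenting, not the engine behind the service, and the helper name found
 is made up for this illustration.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Case-insensitive search, mirroring the case-insensitive service."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# x{m,n}: every sample from the list is 2 or 3 characters out of [0-9.+-].
for sample in ["1.2", ".2", "+11", "-1", "123", "+-"]:
    assert re.fullmatch(r"[0-9.+-]{2,3}", sample)

# [^a-z] rejects letters of either case because the search ignores case.
assert not found(r"[^a-z]", "MOPAC")
assert found(r"[^a-z]", "MOPAC6")

# \b keeps MM2 from matching inside a longer word such as MM2X.
assert found(r"\bMM[23]\b", "results from MM2 and MM3")
assert not found(r"\bMM[23]\b", "the MM2X variant")

# Alternation: \s on both sides picks out whole short words.
short_words = re.findall(r"\son\s|\sin\s|\sup\s|\sat\s", "charges on atoms in water")
print([w.strip() for w in short_words])
```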
 Regular expressions, bracketed with a delimiter character, can be optionally
 preceded with a label and a search scope identifier followed by a colon ":".
 The label is an integer number and the scope is one of the letters:
    T - text only (file names will not be matched to a regular expression),
    F - file name only (text inside the file will not be scanned),
    B - both file name and text will be scanned for matching (default).
 Both the label and the scope can be omitted, but if either exists, the colon
 must be present. For the purpose of the search, all the expressions below
 are equivalent in practice (strictly speaking, the bracketed forms would
 also match MMPAC and AOPAC):
    1B : /[MA][MO]PAC/
    2:   #ampac|mopac#
    :    +MOPAC|AMPAC+
    B4:  *Ampac|Mopac*
         -[am][mo]pac-
 Note that the numerical value of the integer label is disregarded by the
 program, which numbers the regular expressions according to the order in
 which they were specified. The label is there only for your convenience.
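 The label, scope and delimiter rules can be summarized in a short Python
 sketch. It is a guess at how such a line could be taken apart, not the
 program actually used; the names PREFIX and parse_expression_line are
 invented for this illustration.

```python
import re

# Optional prefix: digits (label) and one scope letter in either order, then ":".
PREFIX = re.compile(r"^([0-9TFB\s]*):\s*", re.IGNORECASE)

def parse_expression_line(line):
    """Split one query line into (scope, pattern); the numeric label is ignored."""
    line = line.strip()
    scope = "B"  # default: both file name and text
    match = PREFIX.match(line)
    if match:
        letters = [c.upper() for c in match.group(1) if c.isalpha()]
        if len(letters) > 1:
            raise ValueError("more than one scope letter")
        if letters:
            scope = letters[0]
        line = line[match.end():]
    if len(line) < 2 or line[0] != line[-1]:
        raise ValueError("expression must be bracketed by one delimiter character")
    delimiter, pattern = line[0], line[1:-1]
    if delimiter in pattern:
        raise ValueError("delimiter must not appear inside the expression")
    return scope, pattern

lines = [
    "1B : /[MA][MO]PAC/",
    "2:   #ampac|mopac#",
    ":    +MOPAC|AMPAC+",
    "B4:  *Ampac|Mopac*",
    "     -[am][mo]pac-",
]
parsed = [parse_expression_line(ln) for ln in lines]
for scope, pattern in parsed:
    print(scope, pattern)
```

 All five lines come out with scope B, and every pattern finds both MOPAC and
 AMPAC when matched without regard to case.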
 2. LOGICAL RELATION
 ===================
 Once you have specified your regular expressions, you need to specify a
 logical relation between them which qualifies the file for reporting. The
 logical relation can contain only the operators:
   &  --- AND   (you can also use && if you are a UNIX or C fan)
   |  --- OR    (you can also use || if you are a UNIX or C fan)
   !  --- NOT
 parentheses, ordinal numbers of regular expressions and spaces. The number of
 the regular expression corresponds to its status (FOUND or NOT_FOUND) after
 searching its scope (i.e., file text and/or file name). For example:
 1 & !2  means that the file should be reported if regular expression number 1
 matched and expression number 2 did not. You can make your relation as
 complicated as you wish and use nested parentheses, but remember that:
 !(1 & 2) is equivalent to !1 | !2, and !(1 | 2) is equivalent to !1 & !2.
 Remember also that in logic & takes precedence over |, and ! is
 a unary operator which takes precedence over both of them, but use
 parentheses for better readability.
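 The precedence rules can be checked with a small evaluator. The Python
 sketch below is only an illustration of the rules as stated (! before &
 before |), not the code behind the service; the function name evaluate and
 the status dictionary are invented here.

```python
import re
from itertools import product

def evaluate(relation, status):
    """Evaluate e.g. "1 & (2 | !3)"; status maps expression number -> FOUND flag."""
    tokens = re.findall(r"\d+|&&?|\|\|?|[!()]", relation)
    if "".join(tokens) != "".join(relation.split()):
        raise ValueError("unexpected character in logical relation")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() in ("|", "||"):
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() in ("&", "&&"):
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # highest precedence: !, parentheses, numbers
        token = peek()
        if token is None:
            raise ValueError("relation ended unexpectedly")
        take()
        if token == "!":
            return not parse_not()
        if token == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return value
        if token.isdigit():
            return bool(status[int(token)])
        raise ValueError("unexpected token " + repr(token))

    result = parse_or()
    if peek() is not None:
        raise ValueError("trailing tokens in relation")
    return result

# De Morgan's rules hold for every combination of FOUND / NOT_FOUND.
for a, b in product([False, True], repeat=2):
    s = {1: a, 2: b}
    assert evaluate("!(1 & 2)", s) == evaluate("!1 | !2", s)
    assert evaluate("!(1 | 2)", s) == evaluate("!1 & !2", s)
```

 With this ordering, 1 | 2 & 3 is read as 1 | (2 & 3), so a file where only
 expression 1 was found still qualifies.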
 3. QUERY FORMAT AND EXAMPLES
 ============================
 The complete query has the following format:
     Number of regular expressions (N)
     regular expression 1
     regular expression 2
        ....
     regular expression N
     logical relation
 The query should be sent to chemistry-search ( ( at ) ) ccl.net, and when the
 search is finished, the resulting list of files satisfying your query will be
 sent to you automatically. Do not be impatient and do not send your
 next query before you get the results of the first one --- your request
 will be denied. Please remember that this is a flat file search and it is
 very demanding as far as I/O and CPU are concerned. Therefore only one query
 will be running at any given moment. Before you send a query, try to analyze it,
 and make it specific. You do not want to get a listing of the whole archive.
 The queries which search only file names are much faster than the queries
 which scan the whole file (i.e., you get your results faster, but you might
 not like what you get).
 If you look for a single term, you can use the abbreviated one-line format
 which consists of a regular expression only.
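 The two layouts (full and abbreviated) can be told apart mechanically. The
 Python sketch below shows one way to do it; it is an illustration only, the
 name parse_query is invented, and skipping blank lines is an assumption. In
 the one-line form the logical relation is taken to be simply 1.

```python
def parse_query(message):
    """Return (list of expression lines, logical relation) for a query message."""
    # Assumption: blank lines are ignored and leading/trailing spaces dropped.
    lines = [ln.strip() for ln in message.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty query")
    if len(lines) == 1:
        # Abbreviated format: a single regular expression, relation is "1".
        return lines, "1"
    if not lines[0].isdigit():
        raise ValueError("first line must be the number of regular expressions")
    count = int(lines[0])
    if len(lines) != count + 2:
        raise ValueError("expected %d expressions and one logical relation" % count)
    return lines[1:1 + count], lines[-1]

full = r"""
1
1B: /\bMM[23]\b|MMP2\b/
1
"""
short = r"/\bMM[23]\b|MMP2\b/"
print(parse_query(full))
print(parse_query(short))
```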
 Now, a few examples.
 Example 1.
 ----------
 1
 1B: /\bMM[23]\b|MMP2\b/
 1
 which is equivalent to saying:
 B: /\bMM[23]\b|MMP2\b/
 or
 /\bMM[23]\b|MMP2\b/
 since the default is both text and file names. You search for all text files
 which refer to MM2, MM3 or MMP2, or file names which contain MM2, MM3 or MMP2.
 Example 2.
 ----------
 /basis\sset/
 Note that this example is not equivalent to the query:
 2
 /basis/
 /set/
 1 & 2
 since in the first case, the words "basis" and "set" must be side by side,
 while in the latter case they may be separated by many words and in fact
 "set" may be found before "basis" is. Also, the latter case will
 find all the file names having "basis" or "set" in them, while there are no
 file names in the archive which have a space embedded in them.
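 The difference between the two queries of Example 2 shows up on two short
 pieces of text. This is a Python sketch for illustration only; the helper
 found and the sample sentences are made up.

```python
import re

def found(pattern, text):
    """Case-insensitive search, as in the archive search."""
    return re.search(pattern, text, re.IGNORECASE) is not None

adjacent = "a large basis set was used"
separated = "we set the convergence threshold before choosing the basis"

# One-line query /basis\sset/: the two words must be side by side.
one_line = [found(r"basis\sset", t) for t in (adjacent, separated)]

# Two-expression query with relation 1 & 2: order and distance do not matter.
two_part = [found("basis", t) and found("set", t) for t in (adjacent, separated)]

print(one_line)
print(two_part)
```

 Note also that /set/ without \b will be found inside longer words such as
 "subset", so the second query is even less specific than it looks.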
 Example 3.
 ----------
 3
 1T: /\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO/
 2T: /\bCHARGE/
 3T: /\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]/
 1 & (2 | 3)
 Three regular expressions were specified. Expression 1 looks for the words
 MOPAC, AMPAC, AM1, PM3, MNDO, MINDO. Note that it may be either MOPAC or
 MOPAC6 so it is safer to require a non-letter after MOPAC rather than a space
 or word boundary. The 2nd expression looks for "CHARGE ", "CHARGES",
 "CHARGE,", "CHARGE.", "CHARGE=", "CHARGE-", etc. The 3rd one is
 a challenge for you. Note that people may say: HYDROGEN BOND, HYDROGEN-BOND,
 HYDROGEN_BOND, HYDROGENBOND, H-BOND, H BOND, H_BOND, HBOND, and may say BONDS,
 BONDING, and may put .,; after BOND. Note that all the regular expressions
 given above request searching for the text of the file only, not its name.
 The logical relation simply says: "find me the files which mention MOPAC or
 AMPAC or AM1 or PM3 or MNDO or MINDO and say also something about
 CHARGes or HYDROGEN BONDs".
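 Example 3 can be tried on a few sample sentences with the Python sketch
 below. It is an illustration only, not the service itself: the names EXPR
 and report and the sample sentences are made up. The patterns are written
 without delimiters, and with the "-" placed last inside the brackets, as the
 rule for [ ] requires.

```python
import re

EXPR = {
    1: r"\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO",
    2: r"\bCHARGE",
    3: r"\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]",
}

def report(text):
    """Apply the relation 1 & (2 | 3) to one piece of flattened text."""
    status = {n: re.search(p, text, re.IGNORECASE) is not None
              for n, p in EXPR.items()}
    return status[1] and (status[2] or status[3])

# Spellings of "hydrogen bond" that expression 3 is meant to catch.
variants = ["HYDROGEN BOND ", "HYDROGEN-BOND.", "HYDROGEN_BONDS", "HYDROGENBONDING",
            "H-BOND,", "H BOND;", "H_BONDS", "HBONDING"]
caught = [bool(re.search(EXPR[3], v, re.IGNORECASE)) for v in variants]

print(all(caught))
print(report("MOPAC6 gives H-bonding energies for the water dimer."))
print(report("Hydrogen bonds in ice."))
```

 The last sentence is not reported: it mentions hydrogen bonds but none of
 the program or method names from expression 1.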
 In short:
   1. prepare your search query as described above
   2. send it to chemistry-search ( ( at ) ) ccl.net
   3. wait (for a long, long time) for an answer
 -----------------------------
 I will welcome suggestions for improving this description and corrections
 to my spelling and grammar (as you know, I am not a native English speaker).
 Jan Labanowski
 jkl ( ( at ) ) ccl.net