Searching The CCL Archives


Contents

  1. A Brief Overview
  2. Regular Expressions
  3. Logical Relations
  4. Searching Archives via E-mail

1. A Brief Overview

Sometimes searching for a given word in a text is not enough. You may want to find different forms of the same word, or the same word may be spelled in several ways. Some people join two words with a hyphen, some write them together, and others write them separately. Hence, flexible searches are needed, and regular expressions provide them.

Also, you may want to look for a file which contains several pieces of information, e.g., you may want to find files which contain information about MOON and JUPITER. In other words, you want to introduce a logical relation between regular expressions.

Searching a collection of files saves time, since it helps you decide which files may be of interest. The archives can be searched via a Web form or, for those of you who cannot access the World-Wide Web for some reason, through an e-mail interface. The e-mail searcher has the same functions as the Web searcher, but e-mail is more awkward to use. The Web searcher also has the advantage of letting you retrieve or view interesting files with the click of a button. On the other hand, if a search is time-consuming (e.g., if you search through the entire large archive), you will appreciate running the search offline and receiving the results via e-mail, rather than waiting at your terminal for half an hour or more. In fact, the Web form software will refuse to run a search interactively if its estimated time is greater than 10 minutes. For longer searches you will need to enter a valid e-mail address and select E-mail mode on the entry form.

You are also offered two output formats (in both the E-mail and the Interactive mode): Plain Text and HTML (HyperText Markup Language). If you do not have Web access, choose Plain Text. However, if you are connected to the Web, use HTML by all means. Even if you are getting the search results via e-mail, you will be able to display them as a Web page and view selected files with a click of a mouse button.

This description is written specifically for searches of the archives available at OSC. Searching scripts are written in the Perl programming language and therefore Perl syntax for regular expressions is used. It is essentially identical to the UNIX egrep syntax. Also, only a subset of the full syntax of regular expressions is described here, because, in our case, the text being searched is initially preprocessed (only for the purpose of the search) in the following ways:

  1. Most punctuation marks (namely: ( ) [ ] { } , ; . : ? ! = < > ' ` " & ^ @ | \ / ~ ) are converted to spaces. Do not search for these characters, since you will not find them (though they are most likely present in the original text). Note that + # - _ * are not converted to spaces. All white-space characters (i.e., TABs, NEW-LINEs, FORM-FEEDs, and CARRIAGE-RETURNs) are also converted to spaces.

  2. Multiple spaces are converted to a single space character.

  3. Hyphenated words which are split between two lines are joined.

For the search software the text looks like a single looooong line of text, without punctuation marks, and only single spaces between words. Moreover, this long line always starts with a space and ends with a space.
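The preprocessing described above can be approximated in a few lines. The following is only a sketch in Python, not the archive's actual Perl code: the function name preprocess and the order in which the steps are applied are assumptions made for illustration.

```python
import re

# Punctuation the page lists as converted to spaces; + # - _ * are kept.
PUNCTUATION = "()[]{},;.:?!=<>'`\"&^@|\\/~"

def preprocess(text: str) -> str:
    """Approximate the search-time preprocessing (hypothetical helper)."""
    # Step 3, done first here while line breaks still exist: join words
    # hyphenated across a line break, e.g. "quan-\ntum" -> "quantum".
    text = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)
    # Step 1: each listed punctuation mark becomes a space.
    text = text.translate(str.maketrans(PUNCTUATION, " " * len(PUNCTUATION)))
    # Steps 1 and 2: any run of white space becomes a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # The resulting long line starts and ends with a space.
    return " " + text + " "

print(repr(preprocess("Hartree-Fock (HF) results,\nsee quan-\ntum chem!")))
```

Note that the hyphen in Hartree-Fock survives, because only hyphens at a line break are removed, while the parentheses, comma, and exclamation mark turn into spaces.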

To help you decide if the file selected by the search is of interest, a portion of the text (context) surrounding the match is also displayed. You can make this up to 300 characters long. Again, what is displayed is not the original text but the preprocessed version, stripped of punctuation and formatting.
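As an illustration of the context display, here is a minimal Python sketch. How the real software positions the context window is not documented here; centring it on the match, as well as the name first_match_context, are assumptions.

```python
import re
from typing import Optional

def first_match_context(line: str, pattern: str, width: int = 300) -> Optional[str]:
    """Return up to `width` characters around the first match, or None."""
    m = re.search(pattern, line, re.IGNORECASE)
    if m is None:
        return None
    # Share the characters left over after the match between both sides.
    extra = max(width - (m.end() - m.start()), 0)
    start = max(m.start() - extra // 2, 0)
    # Never cut the match itself, even if it is longer than `width`.
    end = max(m.end(), start + width)
    return line[start:end]

line = " " + "x " * 200 + "density functional " + "y " * 200
print(first_match_context(line, r"density[ -]functional", width=60))
```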



2. Regular Expressions

A regular expression is a way to specify flexible matches, as opposed to rigid keywords. While letters and digits have their verbatim meaning in regular expressions, most punctuation marks have a special meaning and are therefore called metacharacters. Since in our case the search is not sensitive to letter case, it does not matter if you use capital or lower case letters within the regular expression. Regular expressions are enclosed within a pair of identical characters (delimiters) which must be different from the ones used inside the regular expression itself. The delimiters make visible any spaces that are part of the regular expression. For example: ? Mozart ? and ! MoZarT ! will find the same text. Some possible constructs within regular expressions are:

You can build very powerful searches using these elements. For example, if you are looking for dates from 1820 until 1899, / 18[2-9][0-9] / will do (note the spaces around it). If you want Sonaten, Sonatas, Sonate, Sonata, Sonatine, Sonatinen, Sonatensatz, etc., / SONAT[AEI]/ may help (note: no space at the end). Different spellings of the same word can also be searched for, e.g., Schroedinger, Schrodinger, or Schrödinger. To catch all variants use: /Schr(o|oe|\366)dinger/ (\366 is the octal code of ö in ISO Latin-1). Looking for a million dollars or more? This one is easy: /\d( ?\d{3}){2,}/ i.e., find at least 2 consecutive groups of 3 digits (each group may, but does not have to, start with a space) following a digit. Be aware that the numbers could be written as 53,123,456 or 53123456 or 53 123 456, and that the commas, if present, were changed to spaces before searching.
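If you want to try these four expressions before submitting a search, the Python sketch below may help. Python's re module is used here as a stand-in for the Perl syntax of the archive searcher (the two agree on these constructs); the helper name found and the sample lines are made up for illustration.

```python
import re

def found(pattern: str, line: str) -> bool:
    """True if the pattern occurs anywhere in the line, ignoring case."""
    return re.search(pattern, line, re.IGNORECASE) is not None

YEARS = r" 18[2-9][0-9] "
SONAT = r" SONAT[AEI]"
SCHROEDINGER = r"Schr(o|oe|\366)dinger"   # \366 is octal for ö
MILLION = r"\d( ?\d{3}){2,}"

print(found(YEARS, " composed in 1827 in Vienna "))    # True
print(found(YEARS, " composed in 1819 in Vienna "))    # False
print(found(SONAT, " three sonatinen for piano "))     # True
print(found(SCHROEDINGER, " the Schrödinger equation "))  # True
print(found(MILLION, " a grant of 53 123 456 dollars "))  # True
print(found(MILLION, " a grant of 12 345 dollars "))      # False
```

The sample lines already look like preprocessed text: no punctuation, single spaces, and a space at each end.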

Here are some examples of valid regular expressions:

/m?ethane/
would match either ethane or methane.
/ab*c/
would match ac, abc, abbc, abbbc, etc., that is any string that starts with an a, is followed by 0 or more b's, and ends with a c.
/ab+c/
would not match ac, but it would match abc, abbc, abbbc, etc.
/cyclo.*ane/
would match cyclodecane, cyclohexane and even cyclones drive me insane. Any string that starts with cyclo, is followed by an arbitrary string, and ends with ane will be matched. Note that the null string will be matched by the period-star pair; thus, cycloane would be matched by the above expression. If you wanted to search for articles on cyclodecane and cyclohexane, but didn't want to match articles about how cyclones drive one insane, you could string together three periods, as follows: /cyclo...ane/
/ c\++ /
would match c+ , c++ , c+++ , etc., while / c\+\+ / would only match c++ .
/\W[^f-h]ood\W/
matches any four-letter word ending in ood except for food, good or hood. (Thus mood and wood would both be matched.)
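The examples above can be checked mechanically. The table-driven Python sketch below uses Python's re module in place of Perl, with sample texts written as they would look after preprocessing; the names CASES, found, and check_all are hypothetical.

```python
import re

# (pattern, sample text, expected result) taken from the examples above.
CASES = [
    (r"m?ethane",        " ethane ",                    True),
    (r"m?ethane",        " methane ",                   True),
    (r"ab*c",            " ac ",                        True),
    (r"ab+c",            " ac ",                        False),
    (r"ab+c",            " abbbc ",                     True),
    (r"cyclo.*ane",      " cycloane ",                  True),
    (r"cyclo.*ane",      " cyclones drive me insane ",  True),
    (r"cyclo...ane",     " cyclones drive me insane ",  False),
    (r"cyclo...ane",     " cyclodecane ",               True),
    (r" c\++ ",          " c+ ",                        True),
    (r" c\+\+ ",         " c+ ",                        False),
    (r" c\+\+ ",         " c++ ",                       True),
    (r"\W[^f-h]ood\W",   " mood ",                      True),
    (r"\W[^f-h]ood\W",   " food ",                      False),
]

def found(pattern: str, text: str) -> bool:
    """True if the pattern occurs anywhere in the text, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

def check_all():
    """Return the cases whose actual result differs from the expected one."""
    return [(p, t) for p, t, expected in CASES if found(p, t) != expected]

print(check_all())  # an empty list means every example behaves as described
```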

While regular expressions seem somewhat cryptic and complicated, they are compact. After only a little practice you can become a guru. One important bit of advice for beginners, though: practice on a small directory that does not contain large files before you submit your final exhaustive search.



3. Logical Relations

Sometimes you may want to find out if the text contains a number of specific words, not only a single keyword. For example you may be interested in a text file which specifically mentions:
1:/Beethoven/ and 2:/Sonatas/.
In short, you want: 1 AND 2.
Or, if you want to learn who wrote:
1:/Sonatas/ besides 2:/Beethoven/ and 3:/Mozart/
you may want to have a logical relation like:
1 AND NOT (2 OR 3)
which is equivalent to:
1 AND (NOT 2 AND NOT 3).
The numbers refer to regular expressions which are to be matched with the preprocessed text of the file. Of course, you may want to create more complicated relations between matched pieces of text. This is needed when you look for a specific topic. But beware: you can miss useful information if you search mechanically. In the above example, the most valuable file may have started with the phrase "Below is a list of all Sonatas, except those composed by Mozart and Beethoven", and the search would have missed it. On the other hand, if you are looking specifically for
1:/Sonatas/ of 2:/Beethoven/ or 3:/Mozart/, that is:
1 AND (2 OR 3)
you would find this file, even if you do not need it. You have to be aware that even the most specific searches will turn up some useless material. For this reason, when a match to the regular expressions is found, a fragment of the surrounding text is displayed to help you decide if the information is useful. But only the first match is shown, and you never know what is written later in the file.
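To see how numbered expressions and a logical relation work together, consider the Python sketch below. It is not the archive's implementation: the function names are hypothetical, and the precedence used when parentheses are omitted (NOT before AND before OR) is an assumption, so parenthesize your own relations as in the examples above.

```python
import re

def evaluate(relation: str, hits: dict) -> bool:
    """Evaluate e.g. '1 AND NOT (2 OR 3)' given {1: True, 2: False, ...}."""
    tokens = re.findall(r"\d+|[A-Z]+|[()]|\S", relation.upper())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()      # always consume the right-hand side
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if tok is None or not tok.isdigit():
            raise ValueError(f"unexpected token: {tok!r}")
        return hits[int(tok)]

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result

def file_matches(relation: str, patterns: dict, line: str) -> bool:
    """Match each numbered pattern against the line, then apply the relation."""
    hits = {n: re.search(p, line, re.IGNORECASE) is not None
            for n, p in patterns.items()}
    return evaluate(relation, hits)

patterns = {1: r"Sonatas", 2: r"Beethoven", 3: r"Mozart"}
line = " Below is a list of all Sonatas except those composed by Mozart and Beethoven "
print(file_matches("1 AND NOT (2 OR 3)", patterns, line))  # False: file is missed
print(file_matches("1 AND (2 OR 3)", patterns, line))      # True: file is found
```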



4. Searching Archives via E-mail

The archives can also be searched via e-mail by sending a message to MAILSERV@server.ccl.net. To get an overview of all commands send a single word:
  help
in the body of your message to MAILSERV@server.ccl.net. More information is available on each command, and the details of e-mail archive searching can be obtained by sending the line:
  help search
to MAILSERV@server.ccl.net. There is also a more detailed help file available regarding email searching. An example of a typical search query follows:
  select chemistry
  cd archived-messages/95
  search HTML 250
  7
  1T:  /charges/
  2T:  / MP[2-4] /
  3T:  / MBPT /
  4T:  / DFT /
  5T:  / LSD /
  6T:  / LDF /
  7T:  / Density[ -]Functional /
  1 AND (2 or 3 or 4 or 5 or 6 or 7)
  dir 
  quit
You may send multiple search queries and other commands within a single request to MAILSERV@server.ccl.net. The searches will be executed one at a time, in the order received.
