Also, you may want to look for a file which contains several pieces of information, e.g., you may want to find files which contain information about MOON and JUPITER. In other words, you want to introduce a logical relation between regular expressions.
Searching a collection of files saves time, since it helps you decide which files may be of interest. In this particular case, the archives can be searched via the Web-Form or through an e-mail interface, which is available for those of you who cannot access the World-Wide-Web for some reason. The functions of the e-mail searcher are identical to those of the Web searcher, but e-mail is more awkward to use. The Web searcher also has the advantage of letting you retrieve or view interesting files with the click of a button. On the other hand, if a search is time-consuming (e.g., if you search through the entire large archive), you will appreciate performing the search offline and receiving the results via e-mail, rather than waiting at your terminal for half an hour or more. In fact, the Web-Form software will refuse to run the search interactively if the estimated time needed is greater than 10 minutes. For longer searches you will need to enter your valid e-mail address and select E-mail mode on the entry form.
You are also offered two output formats (for both the E-mail and the Interactive mode): Plain Text and HTML (HyperText Markup Language). If you do not have Web access, choose Plain Text. If you are connected to the Web, however, by all means use HTML: even if you receive the search results via e-mail, you will be able to display them as a Web page and view selected files with a click of a mouse button.
This description is written specifically for searches of the archives available at OSC. Searching scripts are written in the Perl programming language, and therefore Perl syntax for regular expressions is used. It is essentially identical to the UNIX egrep syntax. Also, only a subset of the full syntax of regular expressions is described here, because, in our case, the text being searched is initially preprocessed (only for the purpose of the search) in the following ways:
To help you decide if the file selected by the search is of interest, a portion of the text (context) surrounding the match is also displayed. You can make this up to 300 characters long. Again, what is displayed is not the actual text of the file, but the preprocessed text, stripped of punctuation and formatting.
2. Regular Expressions
A regular expression is a way to specify flexible matches, as opposed to rigid keywords. While letters and digits have their verbatim meaning in regular expressions, most punctuation marks have a special meaning and are therefore called metacharacters. Since in our case the search is not sensitive to letter case, it does not matter if you use capital or lower case letters within the regular expression. Regular expressions are enclosed within a pair of identical characters (delimiters), which must be different from the characters used inside the regular expression itself. The delimiters make visible any spaces that are part of the regular expression. For example: ? Mozart ? and ! MoZarT ! will find the same text. Some possible constructs within regular expressions are:
/Chopin|moZart|Kuhlau/
will match Chopin, or Mozart, or Kuhlau.
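As a rough illustration of the alternation above, here is a minimal sketch in Python rather than Perl (Python's `re` module uses the same syntax for this construct). The `re.IGNORECASE` flag stands in for the case-insensitive matching described earlier, and the sample sentences and the helper name `first_hit` are made up for the example.

```python
import re

# Alternation: any one of the three names, matched without regard to case.
pattern = re.compile(r"Chopin|moZart|Kuhlau", re.IGNORECASE)

def first_hit(text):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

samples = [
    "piano works by CHOPIN",
    "a sonatina by kuhlau",
    "symphonies by Haydn",
]

found = [first_hit(s) for s in samples]
print(found)
```

The first two samples match despite the different capitalization; the third contains none of the three names.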
You can build very powerful searches using these elements. For example, if you are looking for dates from 1820 until 1899, / 18[2-9][0-9] / will do (note the spaces around it). If you want Sonaten, Sonatas, Sonate, Sonata, Sonatine, Sonatinen, Sonatensatz, etc., / SONAT[AEI]/ may help (note, no space at the end). Different spellings of the same word can also be searched for, e.g., Schroedinger, Schrodinger, or Schrödinger. To catch all variants use: /Schr(o|oe|\366)dinger/.
Looking for a million dollars or more? This one is easy: /\d( ?\d{3}){2,}/, i.e., find at least 2 consecutive groups of 3 digits (each group may, but does not have to, start with a space) following a digit. Be aware that the numbers could be written as 53,123,456, 53123456, or 53 123 456, and that the commas, if present, were changed to spaces before searching.
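The expressions above can be tried out with a short sketch. It is written in Python, whose `re` module accepts these particular patterns unchanged, and it is not the archive's own Perl code. The `preprocess` function is an assumption that imitates only the one preprocessing step mentioned here (commas changed to spaces); the sample strings are invented.

```python
import re

def preprocess(text):
    # Assumption: imitate a single preprocessing step named in the text,
    # replacing commas with spaces before the search.
    return text.replace(",", " ")

def found(pattern, text):
    """Case-insensitive search of the preprocessed text."""
    return re.search(pattern, preprocess(text), re.IGNORECASE) is not None

year = r" 18[2-9][0-9] "            # 1820 through 1899, set off by spaces
sonata = r" SONAT[AEI]"             # Sonata, Sonate, Sonatine, ...
physicist = r"Schr(o|oe|\366)dinger"  # \366 is the octal code of o-umlaut
million = r"\d( ?\d{3}){2,}"        # a digit plus at least two 3-digit groups

year_hits = [found(year, " born in 1856 in "), found(year, " born in 1815 in ")]
sonata_hits = [found(sonata, " Sonatinen "), found(sonata, " Sonnet ")]
name_hits = [found(physicist, s)
             for s in ("Schrodinger", "Schroedinger", "Schr\u00f6dinger")]
money_hits = [found(million, s)
              for s in ("53,123,456", "53123456", "53 123 456", "999,999")]

print(year_hits, sonata_hits, name_hits, money_hits)
```

Note that 999,999 is rejected because only one group of three digits follows the leading digit.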
Here are some examples of valid regular expressions:
/m?ethane/ matches ethane or methane.
/ab*c/ matches ac, abc, abbc, abbbc, etc., that is, any string that starts with an a, is followed by 0 or more b's, and ends with a c.
/ab+c/ would not match ac, but it would match abc, abbc, abbbc, etc.
/cyclo.*ane/ would match cyclodecane, cyclohexane, and even cyclones drive me insane. Any string that starts with cyclo, is followed by an arbitrary string, and ends with ane will be matched. Note that the null string will be matched by the period-star pair; thus, cycloane would be matched by the above expression. If you wanted to search for articles on cyclodecane and cyclohexane, but didn't want to match articles about how cyclones drive one insane, you could string together three periods, as follows: /cyclo...ane/
/ c\++ / would match c+, c++, c+++, etc., while / c\+\+ / would only match c++.
/\W[^f-h]ood\W/ would match any four-character word ending in ood except for food, good or hood. (Thus mood and wood would both be matched.)
While the regular expressions seem somewhat cryptic and complicated, they are compact. After only a little practice you can become a guru. One important piece of advice for beginners, though, is to practice on a small directory that does not contain large files before you submit your final exhaustive search.
3. Logical Relations
Sometimes you may want to find out if the text contains a number of specific words, not only a single keyword. For example, you may be interested in a text file which specifically mentions: 1:/Beethoven/ and 2:/Sonatas/. In short, you want: 1 AND 2.
Or, if you want to learn who wrote 1:/Sonatas/ besides 2:/Beethoven/ and 3:/Mozart/, you may want to have a logical relation like: 1 AND NOT (2 OR 3), which is equivalent to: 1 AND (NOT 2 AND NOT 3). The numbers refer to regular expressions which are to be matched with the preprocessed text of the file.
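To make the relation concrete, here is a minimal sketch of how numbered expressions might be combined. It is written in Python as an illustration and does not reproduce the archive's Perl scripts; the function name `hits` and the sample texts are hypothetical. It also checks that the two equivalent forms of the relation agree.

```python
import re

# Numbered regular expressions, as in the example above.
patterns = {1: r"Sonatas", 2: r"Beethoven", 3: r"Mozart"}

def hits(text):
    """Map each expression number to True/False for this text."""
    return {n: re.search(p, text, re.IGNORECASE) is not None
            for n, p in patterns.items()}

texts = [
    "Piano Sonatas by Kuhlau",
    "Sonatas by Mozart",
    "Symphonies by Beethoven",
]

selected = []
for text in texts:
    h = hits(text)
    form_a = h[1] and not (h[2] or h[3])        # 1 AND NOT (2 OR 3)
    form_b = h[1] and (not h[2] and not h[3])   # 1 AND (NOT 2 AND NOT 3)
    assert form_a == form_b                     # the two forms always agree
    if form_a:
        selected.append(text)

print(selected)
```

Only the Kuhlau text is selected: it mentions sonatas and neither of the two excluded composers.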
Of course, you may want to create more complicated relations between matched pieces of text. This is needed when you look for a specific topic. But beware: you can miss useful information if you search mechanically. In the above example, the most valuable file may have started with the phrase: Below is a list of all Sonatas, except those composed by Mozart and Beethoven, and the search would have missed it. On the other hand, if you are looking specifically for 1:/Sonatas/ of 2:/Beethoven/ or 3:/Mozart/, that is: 1 AND (2 OR 3), you would find this file, even though you do not need it.
You have to be aware that even the most specific searches will turn up some useless material. For this reason, when a match to the regular expressions is found, a fragment of the surrounding text is displayed to help you decide if the information is useful. But only the first match is shown, and you never know what is written later in the file.
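The idea of showing a fragment around the first match only can be sketched as follows. This is an illustrative Python function, not the archive's implementation: the name `first_match_context`, the choice to center the fragment on the match, and the default width of 300 characters (the maximum mentioned earlier) are all assumptions.

```python
import re

def first_match_context(pattern, text, width=300):
    """Return at most `width` characters of text around the first match."""
    m = re.search(pattern, text, re.IGNORECASE)
    if m is None:
        return None
    # Split the remaining budget evenly before and after the match.
    pad = max(0, (width - (m.end() - m.start())) // 2)
    start = max(0, m.start() - pad)
    end = min(len(text), m.end() + pad)
    return text[start:end][:width]

text = ("Below is a list of all Sonatas except those composed "
        "by Mozart and Beethoven")

snippet = first_match_context(r"Mozart|Beethoven", text, width=20)
print(snippet)
```

With a narrow window the later occurrence of Beethoven falls outside the fragment, which is exactly the limitation described above: anything after the first match stays unseen.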
4. Searching Archives via E-mail
The archives can also be searched via e-mail by sending a message to MAILSERV@server.ccl.net. To get an overview of all commands, send a single word:
    help
in the body of your message to MAILSERV@server.ccl.net. More information is available on each command, and the details of e-mail archive searching can be obtained by sending the line:
    help search
to MAILSERV@server.ccl.net. There is also a more detailed help file available regarding e-mail searching.
An example of a typical search query follows:
    select chemistry
    cd archived-messages/95
    search HTML 250 7
    1T: /charges/
    2T: / MP[2-4] /
    3T: / MBPT /
    4T: / DFT /
    5T: / LSD /
    6T: / LDF /
    7T: / Density[ -]Functional /
    1 AND (2 or 3 or 4 or 5 or 6 or 7)
    dir
    quit
In this example:
T specifies that only the text of the file should be searched for a match (other searching scopes are: N for matching file names only, and B for matching regular expressions against both the text and the name of the file; this is the default).
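The scope letters can be illustrated with a small sketch that reads one numbered expression line and applies it to a file name, the file text, or both. This is a hypothetical Python reconstruction based only on the description above: the line format accepted by `parse_line`, the function names, and the sample file are assumptions, and the real MAILSERV parser may behave differently.

```python
import re

# Assumed line shape: number, optional scope letter, colon, then a regular
# expression enclosed in a pair of identical delimiter characters.
LINE = re.compile(r"^(\d+)([TNB]?):\s*(\S)(.*)\3\s*$")

def parse_line(line):
    """Split a line such as '2T: / MP[2-4] /' into (number, scope, regex)."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized expression line: {line!r}")
    number, scope, _delimiter, body = m.groups()
    return int(number), scope or "B", body   # B (both) is the default scope

def is_hit(scope, body, file_name, file_text):
    """Apply the expression to the text (T), the name (N), or both (B)."""
    if scope == "T":
        targets = [file_text]
    elif scope == "N":
        targets = [file_name]
    else:
        targets = [file_name, file_text]
    return any(re.search(body, t, re.IGNORECASE) for t in targets)

number, scope, body = parse_line("2T: / MP[2-4] /")
print(number, scope, repr(body))
print(is_hit(scope, body, "notes.txt", "energies at the MP2 level of theory"))
```

The spaces inside the delimiters are kept as part of the expression, so / MP[2-4] / only matches MP2, MP3, or MP4 when they stand as separate words.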