1. A Brief Overview
Sometimes searching for a given word in a text is not enough.
You may want to find different forms of the same word, or
the same word may be spelled in a number of ways.
Some people join two words with a hyphen, some write them together,
and others write them separately.
Hence, flexible searches are needed. Regular expressions
provide them.
Also, you may want to look for a file which contains several
pieces of information, e.g., you may want to find files
which contain information about MOON and JUPITER. In other words,
you want to introduce a logical relation between regular expressions.
Searching a collection of files saves time, since it helps you decide which
files may be of interest. In this particular case, the archives
can be searched via a Web-Form, or through an e-mail interface that is
available for those of you who cannot access the World-Wide-Web for some reason.
The functions of the e-mail searcher
are identical to those of the Web searcher, but obviously, e-mail is more
awkward to use. Also, the Web searcher has the advantage of being able
to retrieve or view interesting files with the click of a button.
On the other hand, if a search is time-consuming (e.g., if you search through
the entire large archive), you will appreciate
performing the search off line, and receiving the results via e-mail,
rather than waiting at your terminal for half an hour or more.
In fact, the Web-Form software will refuse to run the search interactively if
the estimate of time needed is greater than 10 minutes.
For longer searches you will need to enter
your valid e-mail address and select E-mail mode on the entry form.
You are also presented with two choices of output format (for both the
E-mail and the Interactive mode), namely: Plain Text and HTML
(HyperText Markup Language). If you do not have Web access,
choose Plain Text.
However, if you are connected to the Web, use HTML by all means. Even if you
are getting the search results via e-mail, you will be able to display
your results as a Web page and view selected files with a click of
a mouse button.
This description is written specifically for
searches of the archives available at OSC. The search
scripts are written in the Perl programming language,
and therefore Perl syntax for regular expressions is used.
It is essentially identical to the UNIX egrep syntax.
Also, only a subset of the full syntax
of regular expressions is described here, because, in our case,
the text being searched
is initially preprocessed (only for the purpose of the search)
in the following ways:
- Most punctuation marks (namely:
( ) [ ] { } , ; . : ? ! = < > ' ` " & ^ @ | \ / ~ )
are converted to spaces. Do not search for these characters,
since you will not find them (though they are most likely present in the
original text). Note that + # - _ *
are not converted to spaces.
- All white space characters (i.e., TABs, NEW-LINEs, FORM-FEEDs, and CARRIAGE-RETURNs)
are converted to a space.
- Multiple spaces are converted to a single space character.
- Hyphenated words which are split between two lines are joined.
For the search software the
text looks like a single looooong line of text, without punctuation
marks, and only single spaces between words. Moreover, this long line always
starts with a space and ends with a space.
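The preprocessing steps listed above can be sketched roughly as follows. This is only an illustration in Python, not the actual search script (which is written in Perl). The function name preprocess, the order of the steps, and the rule used to detect a hyphenated line break are assumptions.

```python
import re

# Punctuation that the description says is converted to spaces.
# The characters + # - _ * are deliberately left out of this list.
PUNCTUATION = "()[]{},;.:?!=<>'`\"&^@|\\/~"

def preprocess(text: str) -> str:
    """Return the text roughly as the searcher is described to see it."""
    # Assumption: a hyphen directly before a line break marks a split word.
    text = re.sub(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)", "", text)
    # Convert the listed punctuation marks to spaces.
    text = text.translate({ord(ch): " " for ch in PUNCTUATION})
    # TABs, NEW-LINEs, etc. become spaces; runs of spaces collapse to one.
    text = re.sub(r"\s+", " ", text)
    # The long line always starts and ends with a single space.
    return " " + text.strip(" ") + " "

print(repr(preprocess("Sonata, Op. 27\nNo. 2 (Moon-\nlight)!")))
```

With this sketch, a search for / Moonlight / would succeed on the sample text, while a search for a comma or a parenthesis would not.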
To help you decide if the file selected by the search is of interest,
a portion of the text (context) surrounding the match is
also displayed. You can make this up to 300 characters long.
Again, what is displayed is not the actual text, but the preprocessed version,
stripped of punctuation and formatting.
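As a rough illustration of how such a context could be cut out of the preprocessed line, here is a small Python sketch. The function name first_match_context and the choice to centre the context on the match are assumptions; the description only states that the context can be up to 300 characters long.

```python
import re

MAX_CONTEXT = 300  # upper limit stated in the description

def first_match_context(clean_text, pattern, width=MAX_CONTEXT):
    """Return up to `width` characters around the first match, or None."""
    match = re.search(pattern, clean_text, re.IGNORECASE)
    if match is None:
        return None
    width = min(width, MAX_CONTEXT)
    # Assumption: spread the spare room evenly before and after the match.
    spare = max(width - (match.end() - match.start()), 0)
    start = max(match.start() - spare // 2, 0)
    return clean_text[start:start + width]
```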
2. Regular Expressions
A regular expression is a way to specify flexible matches, as opposed
to rigid keywords.
While letters and digits have their verbatim meaning in
regular expressions, most punctuation marks have a special meaning
and are therefore called metacharacters. Since in our case the search is not
sensitive to letter case, it does not matter if you use capital or lower
case letters within the regular expression. Regular expressions are enclosed within
a pair of identical characters (delimiters),
which must be different from the characters
used inside the regular expression itself. The delimiters make it possible
to see any spaces that are part of the
regular expression. For example, ? Mozart ? and ! MoZarT !
will find the same text. Some
possible constructs within regular expressions are:
- Alternative -- a | character separates
alternative pieces of text, e.g.
/Chopin|moZart|Kuhlau/
will match Chopin, or Mozart, or Kuhlau.
- Atoms, i.e., the basic elements of a regular expression.
They may represent a single character or a group
of characters, or even a complete regular expression enclosed within
parentheses:
- A letter, digit, - , or # will match itself
(except that letters may be written as capital or small, i.e.,
K is the same as k ). To match metacharacters
you need to precede them with a backslash (e.g.,
\+, \*, \$, etc.).
When in doubt, use the \ to ensure that the literal
meaning is preserved. Remember, however, that most of these characters
were temporarily changed to spaces for searching and you will not find them.
Do not use a backslash before ordinary letters, since some sequences have
a special meaning:
- \d matches any digit,
- \D matches a non-digit,
- \s matches a space,
- \S matches a non-space character (i.e., letter, digit,
punctuation, etc.),
- \w corresponds to a word character, i.e.,
any letter, digit, or _ (underscore).
- \W matches characters which are not
matched by \w .
The above sequences are useful. There are also other
sequences of this type but they would not be useful for searching
here as they deal with characters which are explicitly
removed from the text before searching.
- A . (period) matches any single character.
- A [list] , i.e., a list or a range of characters surrounded
by square brackets, matches any character on the list. E.g.,
[abc012] will match any of the first three letters or digits.
The list may include ranges (e.g., [a-z] represents any letter).
The list may also be negated with a ^ (caret) character,
e.g., [^0-9+-] signifies: all characters except
digits, plus signs, and minus signs. Note that the minus sign
specifies a range only when surrounded by two ordinary characters.
Some ranges do not make sense, e.g., when the first character comes later in the
table of character codes than the second one; the range
[f-a] is nonsense.
More than one list or range may be enclosed in brackets, e.g., [a-z0-9] will match
any alphanumeric character.
- A \ (backslash) followed by an octal code will match the character
with that ASCII (or extended ASCII) code. Only eagles should dare.
For example, /G\351za/ , and /G\363recki/ will match
Géza, and Górecki, respectively, if the ISO Latin-1 character
set is used in the file.
- Any regular expression enclosed within parentheses is an atom,
e.g., (\d\d\d) is an atom which matches 3 consecutive digits.
- A Quantifier follows an atom and denotes how many times
the atom needs to occur:
- ? -- 0 or 1 time,
- * -- 0 or more times,
- + -- 1 or more times,
- {2}, {3}, ... -- 2, 3, ... times. In general: {n} means n times,
- {2,}, {3,}, ... -- at least 2 times, at least 3 times, ... In general: {n,}
means at least n times,
- {n,m} -- at least n times but no more than m times.
You can build very powerful searches using these
elements. For example, if you are looking for dates
from 1820 until 1899, / 18[2-9][0-9] /
will do (note the spaces around it). If you want
Sonaten, Sonatas, Sonate, Sonata, Sonatine, Sonatinen, Sonatensatz, etc.,
/ SONAT[AEI]/
may help (note, no space at the end). Different spellings of the
same word can also be searched for, e.g., Schroedinger,
Schrodinger, or Schrödinger. To catch all variants use:
/Schr(o|oe|\366)dinger/
Looking for a million dollars or more? This one is easy:
/\d( ?\d{3}){2,}/
i.e., find at least 2 consecutive groups of 3 digits (each group may,
but does not have to, start with a space) following a digit.
Be aware that the numbers could be written as:
53,123,456 , 53123456 , or 53 123 456
and the commas, if present, were changed to spaces before searching.
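The worked examples above can be tried out with a short Python sketch. Python's re module is used here only as a stand-in for the Perl syntax described in this document, and the constructs used in these patterns should behave the same way in both. The helper name found and the sample texts are made up for illustration. The searches are run against text that has already been preprocessed, so each sample starts and ends with a space.

```python
import re

def found(pattern, clean_text):
    """Case-insensitive search, as described for the archive searcher."""
    return re.search(pattern, clean_text, re.IGNORECASE) is not None

YEARS = r" 18[2-9][0-9] "                # years from 1820 until 1899
SONATA = r" SONAT[AEI]"                  # Sonata, Sonate, Sonatine, ...
SCHROEDINGER = r"Schr(o|oe|\366)dinger"  # \366 is the octal code of o-umlaut
MILLION = r"\d( ?\d{3}){2,}"             # a digit, then 2+ groups of 3 digits

print(found(YEARS, " composed in 1827 "))         # inside the range
print(found(YEARS, " composed in 1815 "))         # outside the range
print(found(SONATA, " three sonatinen "))
print(found(SCHROEDINGER, " Schrödinger equation "))
print(found(MILLION, " a grant of 53 123 456 dollars "))
print(found(MILLION, " a grant of 999 999 dollars "))
```

Note that the octal code matches only if the text uses the ISO Latin-1 code for the accented letter (in Python, a text string holding the same character).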
Here are some examples of valid regular expressions:
/m?ethane/ - would match either ethane or methane.
/ab*c/ - would match ac, abc, abbc, abbbc, etc., that is, any string that
starts with an a, is followed by 0 or more b's, and ends with a c.
/ab+c/ - would not match ac, but it would match abc, abbc, abbbc, etc.
/cyclo.*ane/ - would match cyclodecane, cyclohexane, and
even cyclones drive me insane.
Any string that starts with cyclo, is followed by an arbitrary string, and
ends with ane will be matched. Note that the null
string will be matched by the period-star pair; thus, cycloane
would be matched by the above expression. If you wanted to search for
articles on cyclodecane and cyclohexane, but didn't want to match articles
about how cyclones drive one insane, you could
string together three periods, as follows: /cyclo...ane/
/ c\++ / - would match c+, c++, c+++, etc., while
/ c\+\+ / would only match c++.
/\W[^f-h]ood\W/ - matches any four-letter word ending in ood
except for food, good, or hood.
(Thus mood and wood would both be matched.)
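The examples above can be checked mechanically. The following Python sketch (again using Python's re module as a stand-in for Perl, with made-up names EXAMPLES, is_match, and check_examples) lists each expression together with a few strings that should and should not match, and reports any disagreement.

```python
import re

# (pattern, texts that should match, texts that should not match)
EXAMPLES = [
    (r"m?ethane", ["ethane", "methane"], ["ethyl"]),
    (r"ab*c", ["ac", "abc", "abbbc"], ["adc"]),
    (r"ab+c", ["abc", "abbc"], ["ac"]),
    (r"cyclo.*ane",
     ["cyclodecane", "cycloane", "cyclones drive me insane"], ["cyclone"]),
    (r"cyclo...ane",
     ["cyclodecane", "cyclohexane"], ["cycloane", "cyclones drive me insane"]),
    (r" c\++ ", [" c+ ", " c++ ", " c+++ "], [" c "]),
    (r" c\+\+ ", [" c++ "], [" c+ ", " c+++ "]),
    (r"\W[^f-h]ood\W", [" mood ", " wood "], [" food ", " good ", " hood "]),
]

def is_match(pattern, text):
    """Case-insensitive search anywhere in the text."""
    return re.search(pattern, text, re.IGNORECASE) is not None

def check_examples():
    """Return a list of (pattern, text) pairs that behave unexpectedly."""
    failures = []
    for pattern, hits, misses in EXAMPLES:
        failures += [(pattern, t) for t in hits if not is_match(pattern, t)]
        failures += [(pattern, t) for t in misses if is_match(pattern, t)]
    return failures

print(check_examples())
```

A small table like this is also a convenient way to practise: add your own expression and a handful of test strings before submitting a long search.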
While the regular expressions seem somewhat cryptic and complicated, they
are compact. After only a little practice you can become a guru. One important
bit of advice for beginners though is to train on some
small directory which does not
have large files before you submit your final exhaustive search.
3. Logical Relations
Sometimes you may want to find out if the text contains a number of
specific words, not only a single keyword. For example, you may
be interested in a text file which specifically mentions
1:/Beethoven/ and 2:/Sonatas/.
In short, you want: 1 AND 2.
Or, if you want to learn who wrote
1:/Sonatas/ besides 2:/Beethoven/ and 3:/Mozart/,
you may want to have a logical relation like:
1 AND NOT (2 OR 3)
which is equivalent to:
1 AND (NOT 2 AND NOT 3)
The numbers refer to regular expressions
which are to be matched with the preprocessed text of the file.
Of course, you may want to create more complicated relations between
matched pieces of text. This is needed when you look for a specific
topic. But beware... You can miss useful information, if you
search mechanically. In the above example, the most valuable file may
have started with the phrase: Below is a list of all Sonatas,
except those composed by Mozart and Beethoven, and the search would have
missed it. On the other hand, if you are looking specifically for
1:/Sonatas/ of 2:/Beethoven/ or 3:/Mozart/, that is:
1 AND (2 OR 3)
you would find this file, even though you do not need it.
You have to be aware that even
the most specific searches will turn up some useless material.
For this reason, when a match
to the regular expressions is found, a fragment of surrounding
text is displayed to help you
decide if the information is useful. But, only the first match
is shown, and you never know
what is written later in the file.
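The logical relations discussed in this section can be illustrated with a minimal Python sketch. It is not the actual search software. The names has and evaluate are made up, and the relations are spelled out with Python's own and, or, and not rather than parsed from a string such as 1 AND NOT (2 OR 3).

```python
import re

def has(clean_text, pattern):
    """True if the regular expression occurs anywhere in the text."""
    return re.search(pattern, clean_text, re.IGNORECASE) is not None

def evaluate(clean_text):
    """Evaluate the relations from this section for one preprocessed file."""
    sonatas = has(clean_text, r"Sonatas")
    beethoven = has(clean_text, r"Beethoven")
    mozart = has(clean_text, r"Mozart")
    return {
        "Sonatas AND Beethoven": sonatas and beethoven,
        "Sonatas AND NOT (Beethoven OR Mozart)":
            sonatas and not (beethoven or mozart),
        "Sonatas AND (NOT Beethoven AND NOT Mozart)":
            sonatas and (not beethoven and not mozart),
        "Sonatas AND (Beethoven OR Mozart)":
            sonatas and (beethoven or mozart),
    }

text = (" below is a list of all sonatas except those composed"
        " by mozart and beethoven ")
for relation, result in evaluate(text).items():
    print(relation, "->", result)
```

For the sample text, the two equivalent NOT relations both reject the file, while the last relation accepts it, which is exactly the pitfall described above.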
4. Searching Archives via E-mail
The archives can also be searched via e-mail by sending a message to
MAILSERV@server.ccl.net. To get an overview of all commands,
send the single word:
help
in the body of your message to MAILSERV@server.ccl.net.
More information is available on each command, and the details of
e-mail archive searching can be obtained by sending the line:
help search
to MAILSERV@server.ccl.net.
There is also a more detailed help file available regarding