|
Matching text with Perl Regular Expressions
Uncopyrighted by Jan Labanowski in Oct. 2005.
You can do whatever you want with it, and even
put your name on it, if you think that it will make you look good
or bring you money.
If you read this document, please read it twice,
since some elements are introduced earlier than they are explained
(otherwise, it would be much longer).
This text is primarily intended to be used with
the Perl regular expression exercise page at:
http://www.ccl.net/cgi-bin/ccl/regexp/test_re.pl.
As you read, you should use this form and enter the examples given below
to see your matches highlighted. Yes, it will take time, but you will
learn the power of regular expressions, which have a lot of
uses. Moreover, while regular expressions are similar in spirit
to UNIX shell globs, the similarity is superficial.
Regular expressions are much more complicated and have
different syntax. E.g., the asterisk * in the shell glob
like: ls -l *.doc would
have to be represented in regular expression
as .*
that means: "zero or more repeats (*) of any character
(.)".
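The whole shell glob
*.doc would
correspond roughly to the
/^.*\.doc$/
regular expression (the ^ and $, explained later, tie the match to the
beginning and the end of the file name; without them the expression
would also match a name like report.docx).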
Perl is a popular scripting language to process
text. For this reason, it is often used for writing Web
applications. The processing of entries from Web forms is frequently
accomplished with the Perl interpreter due to its very powerful
regular expression support. Perl is also a very convenient tool
for converting input files from one format to another, for
extracting needed data from large output files, and is an
essential tool in bioinformatics/genomics. If you still do not know
Perl, and do computing, it is time to learn it.
In this short presentation only the basic syntax of the
regular expressions will be covered, and the substitution
of matched text (search and replace) will not be
discussed. Likewise, I will not deal with Unicode.
But hopefully, the background presented here can be
a good starting point and encouragement to study the
Camel Book, as it is
called by Perl aficionados, a primary reference for Perl language:
Programming Perl by Larry Wall, Tom Christiansen, and
Jon Orwant, published by O'Reilly (currently in its 3rd edition).
After you read this tutorial, please complete practical exercises
at the following link:
http://www.ccl.net/cgi-bin/ccl/regexp/test_re.pl
Regular expressions are the UNIX way of
specifying flexible text matches. I will
simplify some things here, since the general form and possible
variants would take a book to describe (and indeed there
are books on this topic alone). The popular form of using the flexible
match in Perl is:
$some_string =~ /some_regular_expression/some_modifiers
where some_regular_expression
is matched against $some_string.
The matching can be additionally modified by
some_modifiers. If the regular
expression matches the string, the =~ relation is TRUE,
otherwise it is FALSE.
There is also a negation form of the relation (where = is
replaced by !):
$some_string !~ /some_regular_expression/some_modifiers
In this case the relation is TRUE if the
regular expression does not match the string, and FALSE, if it matches.
For example, if $some_string contains the text:
The spring is late this year
The relation:
$some_string =~ /the/i
will be TRUE (the modifier i makes the match case
insensitive).
However, the relation:
$some_string !~ /spring/i
will be FALSE due to the negation match
!~ relation. In this simple
case, you can match
the words, and this is what is done most often.
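The two relations can be tried in a short script. This is only a minimal sketch; the variable names $has_the and $no_spring are made up for the illustration.

```perl
use strict;
use warnings;

my $some_string = "The spring is late this year";

# =~ is true when the pattern matches; the i modifier ignores case.
my $has_the   = ($some_string =~ /the/i)    ? 1 : 0;
# !~ is true only when the pattern does NOT match.
my $no_spring = ($some_string !~ /spring/i) ? 1 : 0;

print "contains 'the':      $has_the\n";    # prints 1
print "lacks 'spring':      $no_spring\n";  # prints 0
```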
But it is sometimes not enough. Let us say that the $some_string
contains text justified with spaces (multiple) and new lines,
as in my famous poem The Spring and the Deeper
Meaning of Life:
The spring came late
and this is our fate.
The mood is not good
and there is no food.
If we wanted to check if some specific
word (i.e., series of alphanumeric characters surrounded by white space
or punctuation marks) is present in the text, we have to use
more sophistication. For example, if we wanted to check if
the string contains the word the, the regular expression:
$some_string =~ /the/i
would also match the word there. We would find
the word the in the string:
Spring came late
and this is our fate.
Mood is not good
and there is no food.
where it is not present. You could, of course,
try the expression:
$some_string =~ / the /i
i.e., put spaces around it. But this approach would
fail for the word fate (it would not be found,
since a period, rather than a space, follows it), or for the word late (a NEWLINE
character, not a space, follows it). Here we enter the magic world of special
symbols (called metacharacters) in regular expressions.
There are a dozen of them:
\,
|,
(,
),
[,
{,
^,
$,
*,
+,
? and
. .
I sometimes use the word metacharacter in a loose way,
meaning: "a character or a sequence of characters
that means something other than what is written".
Check the links at the bottom of this page to find more.
You can find words in a number of ways:
$some_string =~ /\Wthe\W/i
Here, the \W is a symbol saying: Match any
"nonword" character (i.e., character other than a letter,
digit or underscore). Equally good (or even better, since the previous
example does not account for the beginning or the end of the
string) we could do with:
$some_string =~ /\bthe\b/i
Here the \b represents Match word
boundary, i.e., a location rather than a character
(inside square brackets, i.e., within a character class, the same \b means
instead Match the backspace character,
i.e., the CTRL-H). We could also use the untidy:
$some_string =~ /\sthe\s/i
i.e., match the word between two whitespace (\s)
characters, but then we would not catch the words followed by punctuation
marks, like period or comma (whitespace is ANY space
character, i.e., new lines, TABs, form feeds, carriage returns, etc).
We could also try to list all popular
characters that are found before and after the word, like:
$some_string =~ /[\s.;:?!]the[\s.;:?!]/i
but obviously the list is much longer (Note:
the notation [abc] means: match a character
that is either a or
b or c, while the notation [^abc] means: match a
character that is neither
a nor b nor c). We could finally do:
$some_string =~ /[^a-z]the[^a-z]/i
i.e., the word the surrounded by nonletters
(the hyphen means a range of characters, e.g., [a-zA-Z0-9] means
all lower and upper case letters and digits). Latin letters
and digits inside a regular expression
mean usually what they stand for (do not quote them with
the backslash \,
since doing so changes them often
to some special symbols as you saw above!). Many punctuation marks and
other nonalphanumeric characters (but not all!!!), however,
have a special meaning in regular expressions. They are called
metacharacters.
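The ways of finding a word discussed above can be compared side by side. This is a sketch only; the variable names are made up, and the test string is the version of the poem in which the word the does not occur.

```perl
use strict;
use warnings;

# The poem without "The"; the letters "the" occur only inside "there".
my $poem = "Spring came late\nand this is our fate.\n"
         . "Mood is not good\nand there is no food.\n";

my $plain   = ($poem =~ /the/i)     ? 1 : 0;  # 1: matches inside "there"
my $bounded = ($poem =~ /\bthe\b/i) ? 1 : 0;  # 0: no whole word "the"

# \W must match a real character, so a word that starts the string is missed;
# \b is a location and works at the start of the string too.
my $nonword_at_start  = ("the end" =~ /\Wthe\W/i) ? 1 : 0;  # 0
my $boundary_at_start = ("the end" =~ /\bthe\b/i) ? 1 : 0;  # 1

print "$plain $bounded $nonword_at_start $boundary_at_start\n";  # prints 1 0 0 1
```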
Moreover, some characters have special meaning in the Perl itself
and you have to quote them, even when they do not have a special meaning
in the regular expressions (it is a small oversimplification, but
it will have to do here). For example, if you look for an
e-mail address that contains @ character, you need
to quote it as \@ since Perl uses this character to
specify lists (arrays) of values [caveat: in some cases, e.g.,
when the regular expression is given in a Perl variable at execution time, the
@ is not expanded into a
list and keeps its literal meaning].
The regular expression:
$some_string =~ /jlabanow@ccl/
in the Perl script would match the string "jlabanow@aol.com"
or "jlabanowANYTHING" if the list
@ccl was not
defined or empty in the Perl script (or even worse: if the list
had some entries, your matching would be quite surprising and
quite unpredictable, since this varies with Perl releases).
Conclusion: Test it before you use it!
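A small sketch of the safe ways to handle the @ character follows. The address jlabanow@ccl.net and the variable names are used only for illustration. The unquoted form /jlabanow@ccl/ is left out on purpose, since under use strict Perl refuses to compile it when no @ccl array is declared.

```perl
use strict;
use warnings;

my $address = 'jlabanow@ccl.net';   # single quotes: nothing is interpolated

# The @ is quoted, so Perl does not look for an array named @ccl.
my $quoted = ($address =~ /jlabanow\@ccl/) ? 1 : 0;             # 1

# A pattern held in a variable is not scanned again for @,
# so the @ inside $pattern keeps its literal meaning.
my $pattern       = 'jlabanow@ccl';
my $from_variable = ($address =~ /$pattern/) ? 1 : 0;           # 1
my $other_address = ('jlabanow@aol.com' =~ /$pattern/) ? 1 : 0; # 0

print "$quoted $from_variable $other_address\n";                # prints 1 1 0
```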
I will list a few important examples of special characters or
character sequences but there are scores of them.
You can learn about all of them by checking the appropriate man pages under
UNIX system on which Perl is installed, namely:
man perlrequick for a short tutorial
man perlretut for a longer tutorial
man perlre for a reference manual for Perl regexp
Repeated pattern matching and greedy vs.
non-greedy match
There are often situations when you need to match
a series of repetitions of characters or character sets. For example,
if you want to match valid decimal numbers (without exponent) like:
-12.123 you could use
the expression: /[0-9.+-]+/,
(match digits, period, plus and minus). The
+ means one or more times.
But obviously, such expression would also match strings like:
1-234.2.4 that are
not valid numbers. Expression like:
/([+-]?[0-9]+\.?[0-9]*)|([+-]?[0-9]*\.?[0-9]+)/ or
equivalent versions like:
/([+-]?\d+\.?\d*)|([+-]?\d*\.?\d+)/  or
/((\+|-)?\d+\.?\d*)|((\+|-)?\d*\.?\d+)/
would only match valid numbers like:
+123,
+123.,
.123,
1.23,
+1.23, etc.
Here, the ? means:
zero or one time,
* means:
zero or more times, the
| denotes the
alternative, parentheses enclose grouping of characters
and the period
., being a metacharacter,
needs to be protected with a backslash to retain its literal meaning.
Note that metacharacters do not have to be backslashed (usually) within
the square brackets (i.e., character classes). Note also that
to retain the original meaning of the minus sign
- it needs to be
specified as the first or the last character within square brackets
(or be quoted with a backslash), or it would be
interpreted as a range.
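Note that without anchors the expressions above are satisfied by a valid number found anywhere inside the string, e.g., the 1 in 1-234.2.4. The sketch below adds the ^ and $ anchors (described later in this text) so that the whole string has to be a number. The function name is_decimal is made up for the illustration.

```perl
use strict;
use warnings;

# Returns 1 if the whole string is a decimal number without exponent.
# The ^ and $ anchors force the pattern to cover the entire string
# instead of a piece of it.
sub is_decimal {
    my ($text) = @_;
    return ($text =~ /^([+-]?\d+\.?\d*|[+-]?\d*\.?\d+)$/) ? 1 : 0;
}

for my $candidate ('+123', '+123.', '.123', '1.23', '-12.123', '1-234.2.4', '.') {
    printf "%-10s %s\n", $candidate, is_decimal($candidate) ? 'valid' : 'not valid';
}
```

The first five strings are reported as valid, and the last two as not valid.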
We often have situations when it is important
that the matched string is the shortest or the longest possible.
The expression: /Bo+/
in the string Booting can
match either Bo or
Boo. By default,
the repetition operators (
+,
*,
? and
{n,m} )
are greedy, i.e., they will try to match the longest possible string.
In the example above, the expression:
/Bo+/ will match
Boo. To make them
non-greedy (i.e., to make them match the shortest possible
string), follow them by a ?.
The expression: /Bo+?/
will match the Bo.
While the greediness of the regular expression is mostly
important when patterns are used for substitutions, it can also
be important in searching. For example, if you search the valid
HTML document, that starts from
the <html> element
and ends with the
</html> element,
the expression
/<.+>/s will match the
whole HTML document, while the expression
/<.+?>/s will match
only the opening
<html> tag.
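The difference between greedy and non-greedy matching can be seen by capturing the matched text with parentheses. This is a sketch; the variable names and the tiny HTML document are made up for the illustration.

```perl
use strict;
use warnings;

# Parentheses capture the matched text so that we can look at it.
my ($greedy)     = "Booting" =~ /(Bo+)/;    # "Boo"
my ($non_greedy) = "Booting" =~ /(Bo+?)/;   # "Bo"

my $html = "<html>\n<body>Hi</body>\n</html>";
my ($whole_document) = $html =~ /(<.+>)/s;   # from the first < to the last >
my ($first_tag)      = $html =~ /(<.+?>)/s;  # "<html>"

print "$greedy $non_greedy $first_tag\n";    # prints Boo Bo <html>
print "whole document matched\n" if $whole_document eq $html;
```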
Popular metacharacters and special character sequences
Find examples of popular metacharacters below.
They are essential for
flexible pattern matching, and hopefully, after analyzing my
poetry presented earlier, you will cherish their usefulness.
- NEWLINE
- This is a mess... The vendors of operating systems worked hard to
make text files from one system look like junk on the other, so you
are stuck. Of course, you are not stuck, you are only punished by their
greed (or egos, as the case may be). The Internet standards and
Microsoft use two ASCII characters
to mark the end of the line: CTRL-M followed by CTRL-J. UNIX uses
CTRL-J, and the classic Mac OS (before OS X) uses CTRL-M. The CTRL-M (carriage return,
\r) is octal 15,
decimal 13, and hex 0D. The CTRL-J (line feed,
\n)
is octal 12, decimal 10, and hex 0A. UNIX often automatically converts
the text from files with the UNIX newline
convention (\n)
to Internet/MS-DOS convention
(\r\n) before feeding it to
electronic mail or serving web pages. If you look for the new lines
in regular expressions, use $
and the m modifier, or
be a guru and search for (\r?\n|\r). In fact, if you
made your data file on a PC under Windows or on the Mac and you
want to feed it to a program on UNIX check if NEWLINEs adhere to UNIX
convention or your input can be rejected. Use a dos2unix command or write a Perl
script with the
$my_data =~ s/(\r?\n|\r)/\n/g,
but check with od -c myfile
command first to see if you really have a problem.
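The substitution given above can be wrapped in a small function. This is a minimal sketch; the name to_unix_newlines and the sample text are made up for the illustration.

```perl
use strict;
use warnings;

# Convert DOS (\r\n) and old Mac (\r) line endings to the UNIX \n.
sub to_unix_newlines {
    my ($text) = @_;
    $text =~ s/(\r?\n|\r)/\n/g;
    return $text;
}

my $mixed = "dos line\r\nmac line\runix line\n";
my $clean = to_unix_newlines($mixed);

print $clean;   # three lines, each ending with a plain \n
```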
- /
- the forward slash is not really a special character, but it is
used as a default character for delimiting regular expression (i.e.,
marking the start and the end of the regular expression). For
this reason it has to be quoted with a backslash within most regular
expressions, so that Perl is not confused about where the regular
expression starts and ends. You can use a special syntax and
specify some other character as a delimiter for regular expressions,
but usually, we just quote the slash with a backslash to recover
its natural meaning within the regular expressions. So to find the local subdirectory
in the string, say: "/usr/local/bin" you would use:
$a_string =~ /\/local\//;
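The special syntax mentioned above is the explicit m operator, which lets you pick another delimiter. A short sketch (the variable names are made up for the illustration):

```perl
use strict;
use warnings;

my $a_string = "/usr/local/bin";

# Default delimiters: every slash inside the pattern is quoted.
my $with_slashes = ($a_string =~ /\/local\//) ? 1 : 0;

# With the explicit m operator another delimiter can be chosen,
# and the slashes no longer need a backslash.
my $with_braces = ($a_string =~ m{/local/}) ? 1 : 0;
my $with_hash   = ($a_string =~ m#/local/#) ? 1 : 0;

print "$with_slashes $with_braces $with_hash\n";   # prints 1 1 1
```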
- \
- the backslash is used for quoting. Many characters have special
meaning within regular expressions. When you want to match the
original character, you quote it with a backslash, like in UNIX shell.
E.g., to find a period (which is a metacharacter)
you would look for \.
rather than a bare period. To match the backslash in the text give
it as \\. But do not
quote the letters and digits, since it would usually assign special
meaning to them. For example:
\A and
\z match the beginning, and
the end of the string, respectively (always, irrespective of the m and
s modifiers), while
\1,
\2, etc. match again the text that was matched by the
pieces of the regular expression enclosed in parentheses
(e.g., the expression /\b(\w+)\s+\1/ will match
the repeated word in the text, while the /(\d+)\1/
will match the repeated digit or a series of digits in a
number, and if you change it to /(\d+)\1+/ it will match the whole
sequence of repetitions. If you are a genomics person, and you want to
find the initiation codon followed by methionine(s) you would search
for /(AUG)\1+/i, but it
does not make sense, does it?). The backslash is also used to specify
character codes in the regular expression. The \nnn specifies an octal code
for a character, while the \xNN a hexadecimal code.
For example: \141 and
\x61 represent code
for lowercase a. Some
popular character codes have escape sequences assigned for convenience:
\a -- bell char, BEL, CTRL-G (\007)
\b -- backspace, BS, CTRL-H (\010), but only inside square brackets; elsewhere \b is the word boundary
\t -- horizontal tab, HT, CTRL-I (\011)
\n -- line feed, LF, CTRL-J (\012)
\f -- form feed, FF, CTRL-L (\014)
\r -- carriage return, CR, CTRL-M (\015)
\e -- ASCII escape, ESC, CTRL-[ (\033).
Other control characters can be entered as
\cX, e.g., the line feed,
CTRL-J, can be entered also as \cJ, while the ESC code as
\c[.
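A few of the backslash sequences above can be tried as follows. This is only a sketch; the sample strings and variable names are made up for the illustration.

```perl
use strict;
use warnings;

# \1 matches again whatever the first pair of parentheses matched.
my ($repeated_word)   = "This is is a test" =~ /\b(\w+)\s+\1/;  # "is"
# The outer parentheses capture the whole run; \2 refers to the inner group.
my ($repeated_digits) = "ab1212cd" =~ /((\d+)\2+)/;             # "1212"

# Octal and hexadecimal codes for the lowercase a.
my $octal = ("a" =~ /^\141$/) ? 1 : 0;   # 1
my $hex   = ("a" =~ /^\x61$/) ? 1 : 0;   # 1

# \cJ is CTRL-J, i.e., the line feed.
my $control = ("\n" =~ /^\cJ\z/) ? 1 : 0;   # 1

print "$repeated_word $repeated_digits $octal $hex $control\n";  # prints is 1212 1 1 1
```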
- .
- period matches any single character. This is not that simple
however, when the string contains new line characters.
There are two modifiers (the stuff that follows the closing slash
/ of the regular expression),
namely: m
and s that affect pattern
matching properties. They are independent of each other (you can use
none, either one, or both), and they tell Perl the following:
- m
-- treat the text/string as multiple lines: the metacharacters
^ and $ also match at the line boundaries inside the string;
- s
-- treat the string as a single line of text: the period also matches
the new line characters (i.e., new lines are treated as if they were
ordinary characters)
If no m or s modifier is given, the period does not match a NEWLINE,
and ^ and $ refer only to the beginning and the end of the string
(the $ also to the spot before a NEWLINE that ends the string).
Therefore, period matches a NEWLINE character only when the s
modifier is used (since we lied to Perl, that there are no NEWLINEs).
If we told Perl that the string has multiple lines of text by
using the m modifier, the NEWLINEs become special and denote ends of
lines -- special spots in the text. The
metacharacters ^ (beginning of string) and
$ (end of
string) will then refer not only to the real beginning and end of
the string, but also to spots just after the NEWLINE, and just before
the NEWLINE, respectively. When you want to match a period verbatim,
you have to quote it as \. with a backslash.
For example, the /B.T/si
will match, "BLT",
"100 MBits", "Ubot sunk", "The b.tch is a dog too.",
and even "Rob\nTom"
(where \n
denotes a new line), while the expression /B.T/m will
not match the "RoB\nTom". The expressions
/B\.T/s and the /B\.T/m will only match a string
like "whateverB.Twhatever".
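The behavior of the period with and without the s modifier can be checked with a short script. This is a sketch; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $two_lines = "RoB\nTom";

my $dot_plain = ($two_lines =~ /B.T/)  ? 1 : 0;  # 0: period stops at the new line
my $dot_m     = ($two_lines =~ /B.T/m) ? 1 : 0;  # 0: m does not change the period
my $dot_s     = ($two_lines =~ /B.T/s) ? 1 : 0;  # 1: with s the period matches \n

# A quoted period matches only a real period.
my $literal_yes = ("whateverB.Twhatever" =~ /B\.T/) ? 1 : 0;  # 1
my $literal_no  = ("BLT" =~ /B\.T/)                 ? 1 : 0;  # 0

print "$dot_plain $dot_m $dot_s $literal_yes $literal_no\n";  # prints 0 0 1 1 0
```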
- [ and ]
- square brackets are used to specify character lists
(as in examples above) for matching. You need to
quote them with a backslash (i.e., as \[ or \] )
to have them match themselves in the regexp. Many special characters
(metacharacters) lose their meaning within the brackets. The period
., alternative
|, parentheses
( or
), braces
{ or
}, asterisk
*, question mark
? and plus
+ stand for
their literal meaning. The escaped letters that denote
single characters (e.g., whitespace
\s, nonwhitespace
\S, digit
\d, nondigit
\D, word char
\w, nonword char
\W, line feed
\n, caret return
\r, control char
\cX, hex code
\xNN, or octal code
\nnn ) retain
their special meaning. Two characters get special
meaning: ^ after
opening [ means
negation while -
used between two characters denotes character range.
Usually, you do not need to quote special characters within the brackets,
e.g., the period . is just
a real period (but it is usually safe to quote
special characters -- when in doubt, always quote everything
besides letters and digits). You need to quote some characters,
however. Obviously, you need to quote square brackets themselves,
if you want them to stand for themselves, or Perl would get confused
about where the character list starts and ends.
The ^
(the caret) as a first character after the opening bracket [ means do not match those
that follow me, e.g., /[^0-9]+/ means: match
one or more nondigit characters. The - (hyphen) between two
characters means a range of character codes, but you really would have
to look at the ASCII table to know what the codes are (under UNIX, just
type: man ascii
to list character codes). It is safe though
to use with letters or digits (e.g., /[a-g]/ will match
all notes in the C major scale in English (but not in German
where b is h), while [0-7]
represents all digits in the octal notation). When you put - before the closing bracket,
or quote it with a backslash, it has its natural meaning.
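The rules for square brackets can be illustrated as follows. This is a sketch; the sample strings and variable names are made up for the illustration.

```perl
use strict;
use warnings;

# ^ right after [ negates the list: one or more nondigit characters.
my ($non_digits) = "abc123" =~ /([^0-9]+)/;   # "abc"

# A range of letters.
my ($notes) = "cabbage!" =~ /([a-g]+)/;       # "cabbage"

# Period and plus are literal inside brackets; the hyphen is literal
# because it stands just before the closing bracket.
my ($signs) = "a+-.b" =~ /([.+-]+)/;          # "+-."

# Square brackets themselves have to be quoted.
my ($bracket) = "x[1]" =~ /([\[\]])/;         # "["

print "$non_digits $notes $signs $bracket\n"; # prints abc cabbage +-. [
```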
- |
- denotes alternative. The expression /a|b|c/ means: match a
or b or c. For example, it will match "Matt",
"D'Ambrosia", and "Jean-Christophe"
but will not match "Anthony" (but /a|b|c/i would).
This notation is slightly confusing when you match alternatives
that are longer than one character. It is probably best to enclose them
in parentheses that make groupings atomic. For example, expressions:
/apples|oranges/,
/(apples|oranges)/,
/(apples)|(oranges)/ or
/((apples)|(oranges))/ all
match the "oranges" in the
string "Apples and oranges" while
the expression
/apple(s)|(r)/i would match
"Apples" in the same string.
- ( and )
- parentheses mark groupings. For example:
/(Frod|Drog|Bilb)o/ would match
"Frodo",
"Drogo", or "Bilbo".
If you want to look for verbatim parentheses you need to quote them with
a backslash (i.e., write them as \( or \) in the
expression). The expression /(CO)/ would match strings:
"CO", "ACORN", "Fe(CO)6" but not
"C0" (with a zero instead of O).
Expression /\(CO\)/
would only match the "Fe(CO)6".
The parentheses make a group of characters atomic, i.e., behave like
a whole. For example: the /bo+/ will
match the string "booo", while
/(bo)+/ will match
"bobobobo".
Parentheses also mark the backreferences to which you can refer
in the regular expression or in the replacement string (that
we do not discuss here). For references, you count the
opening parentheses from the left as 1, 2, 3, ... and refer
to the content that matched them as
\1,
\2,
\3..., for example:
/([a-z]+)([0-9]+)\1\2/
would match "_abc123abc123_"
and "#a1a1#", but not
"#a1a2#". You can also
nest them: /(([a-z]+)([0-9]+))\2\1\3/ will
match "#ab123abab123123#".
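The alternation, grouping and backreference examples above can be checked with a short script. This is a sketch; the variable names, and the extra test strings "orange" and "Gollum", are made up for the illustration.

```perl
use strict;
use warnings;

my $fruit = "Apples and oranges";
my ($case_sensitive) = $fruit =~ /(apples|oranges)/;    # "oranges"
my ($ignore_case)    = $fruit =~ /(apples|oranges)/i;   # "Apples"
# Here the alternatives are "apples" and "r", so a lone r is enough.
my ($surprise)       = "orange" =~ /(apple(s)|(r))/i;   # "r"

# A group followed by a common ending.
my @hobbits = grep { /^(Frod|Drog|Bilb)o$/ } ("Frodo", "Drogo", "Bilbo", "Gollum");

# + repeats the whole group, not only the last letter.
my $group_repeat = ("bobobobo" =~ /^(bo)+$/) ? 1 : 0;   # 1
my $char_repeat  = ("booo"     =~ /^(bo)+$/) ? 1 : 0;   # 0

# Backreferences, plain and nested.
my $back_yes = ("#a1a1#" =~ /([a-z]+)([0-9]+)\1\2/) ? 1 : 0;                     # 1
my $back_no  = ("#a1a2#" =~ /([a-z]+)([0-9]+)\1\2/) ? 1 : 0;                     # 0
my $nested   = ("#ab123abab123123#" =~ /(([a-z]+)([0-9]+))\2\1\3/) ? 1 : 0;      # 1

print "$case_sensitive $ignore_case $surprise\n";   # prints oranges Apples r
print "@hobbits\n";                                 # prints Frodo Drogo Bilbo
print "$group_repeat $char_repeat $back_yes $back_no $nested\n";  # prints 1 0 1 0 1
```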
- ?
- means match zero or one occurrence of a character (or a group
in parentheses), e.g., /Bo?t/
will match Bt
and Bot, but not
Boot, or
But. The
/Many (thanks )?/
will match
"Many ",
"Many x",
"Many thanks ",
"Many thanks thanks " and
"Many thanks x"
but not "Many".
If you want to look for a question mark, quote it as \? in your regexp. For example, the
/Ab\?/ will match
"Ab?" but will not
match the "Ab".
- *
- means zero or more times. E.g., /Bo*t/ will match
"Bt", "Bot", and even "Booooooooot".
You need a backslash quote \* to look for a plain
asterisk. The /2\**5/ will match "1256",
"12*52", "2**573" and even "---2********5----"
while the /2*5/ will match "5", and
"25", "12256", etc. In the string that
contains new line characters, the expressions /.*/
and /.*/m
will match the first line (without the
ending new line character), while the expression /.*/s
will always match the whole damned string, even when it is empty!!!
- +
- means: at least once. For example, expression /Bo+t/ will match
"Bot", "Boot", and even "Boooooooot".
Quote the + with a
backslash (i.e., write \+) to look for a plain + in
the text. The /.+/ and /.+/m will not
match a string containing only the new line characters,
while /.+/s will match a string that
contains only a single new line character and nothing else. Neither /.+/ nor /.+/m nor
/.+/s will match the empty
string, though.
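The repetition operators ?, * and + described above, together with the {n}, {n,m} and {n,} forms described next, can be compared on one list of words. This is a sketch; the word list and variable names are made up, and the ^ and $ anchors are added so that each whole word has to match.

```perl
use strict;
use warnings;

my @words = ("Bt", "Bot", "Boot", "Booot");

my @zero_or_one  = grep { /^Bo?t$/ }     @words;  # Bt Bot
my @zero_or_more = grep { /^Bo*t$/ }     @words;  # Bt Bot Boot Booot
my @one_or_more  = grep { /^Bo+t$/ }     @words;  # Bot Boot Booot
my @exactly_two  = grep { /^Bo{2}t$/ }   @words;  # Boot
my @one_to_two   = grep { /^Bo{1,2}t$/ } @words;  # Bot Boot
my @two_or_more  = grep { /^Bo{2,}t$/ }  @words;  # Boot Booot

print "?     : @zero_or_one\n";
print "*     : @zero_or_more\n";
print "+     : @one_or_more\n";
print "{2}   : @exactly_two\n";
print "{1,2} : @one_to_two\n";
print "{2,}  : @two_or_more\n";
```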
- {n} or {n,m}
- specifies how many repetitions to match. For example:
/Bo{2}t/
will match only Boot, while /Bo{1,2}/ will
match "UBot" or "My Boots". There is also
a /Bo{2,}/ which means
2 or more times, e.g., "The Boot failed",
"Uh... Booot...", "Boooot", etc.
- ^
- the meaning depends on the m modifier (the s modifier does not
affect it). Without m, the caret matches only the
beginning of the string. If m is used, the caret matches the
beginning of the string and the spot after the new line character
(if it is present in the string). For example, the /^The/is will match
"the dog" and
"the dog\n" (\n denotes
a new line character) but will not match
"my cat likes\nthe dog"
while /^The/im will match all four:
"the dog",
"\nthe dog",
"the dog\n" and the
"my cat likes\nthe dog".
- $
- Dollar sign is used only at the end of the regular expression (if
you look for $ itself, quote it with the backslash). Since the $ also marks the
beginning of variable in Perl, you cannot use it in the middle of
the regular expression as Perl would try to replace it with the
variable (scalar, as they call it). Therefore, in the expression
/Many $s/ Perl will
try to put the value of $s
into the regular expression, and if it does not exist, it will
put nothing there, and the above expression can match strings like:
"Many " or
"Many things". Like a caret,
it is affected by the m modifier (but not by s).
Without m, the $ matches only at the end of the string, or just before
a new line character that ends the string. It matches the spot before the new line and the end of the
string if m is used. For example: /time$/s
will match "It's time" and
"It's time\n" but will
not match "It's time\nto go home" while
/time$/m will match all three strings.
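The effect of the m modifier on ^ and $ can be checked with the strings used above. This is a sketch; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $pets = "my cat likes\nthe dog";
my $caret_plain = ($pets =~ /^The/i)  ? 1 : 0;  # 0: "the" is not at the string start
my $caret_s     = ($pets =~ /^The/is) ? 1 : 0;  # 0: s does not change the caret
my $caret_m     = ($pets =~ /^The/im) ? 1 : 0;  # 1: m lets ^ match after the new line

my $home = "It's time\nto go home";
my $dollar_plain = ($home =~ /time$/)  ? 1 : 0;          # 0
my $dollar_m     = ($home =~ /time$/m) ? 1 : 0;          # 1
# Even without m, $ matches before a new line that ends the string.
my $dollar_final = ("It's time\n" =~ /time$/) ? 1 : 0;   # 1

print "$caret_plain $caret_s $caret_m $dollar_plain $dollar_m $dollar_final\n";
# prints 0 0 1 0 1 1
```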
The letters quoted with a backslash often have
special meaning. For this reason, avoid quoting letters, unless
you know what you are doing. I will give here only a few examples.
Check the URLs below, and the Perl man pages given earlier
to learn all of them.
- \d
- Match a digit. It is a shortcut for [0-9].
- \D
- Match a nondigit character. It is a shortcut for [^0-9].
- \s
- Match any whitespace character (space, tab, newline, form feed, carriage
return, etc).
- \S
- Match a nonwhitespace character (i.e., a character that uses
ink on your printer).
- \w
- Match any word character (i.e., a letter, a digit, or an underscore --
now you know that Perl is for programmers since these entities
represent valid characters in the variable name). The \w
can be replaced with [a-zA-Z0-9_] if you want to
make your regular expression look fancy.
- \W
- Match any nonword character (i.e., anything that is not a letter,
a digit, or an underscore).
- \b
- Match a virtual boundary between a word character and the nonword character
(or the beginning or the end of the string) that precedes or follows it.
Why is this needed when we have
both \w and \W? Since matches are used for
substitutions and then it is important that the matched piece of
text does not include surrounding space or punctuation. It is
also convenient in cases when we want to match the whole word
rather than a piece of it, as in the example given earlier.
Incidentally, inside square brackets (i.e., as [\b]) it matches the
backspace character instead.
- \B
- By now, you should suspect that it is the opposite of \b, i.e.,
it does not match the positions around the word. So the
/\B\./ will find a period that does not follow the word,
like in "You do not put period after space . like this".
- \n
- Matches a newline character. Problem with this is that on the UNIX
machine the newline is a New Line Character (NL, CTRL-J), on the
classic Mac it is a Carriage Return Character (CR, CTRL-M), and under
DOS/Windows, Web, Email, etc, it is a two-character
sequence of CR followed by NL.
We are impatiently waiting for the discovery of the NL CR sequence
for the new line character to make text files even more incompatible.
Of course, we could always match the new line as /\r?\n?\r?/
but then we would also match multiple empty lines coming
from DOS.
- \r
- Match the Carriage Return character (usually CR, i.e., CTRL-M,
but the classic Macs are special, and there it matches NL, that is CTRL-J).
- \A
- Matches only the beginning of the string/text, irrespective of the
m and s modifiers.
- \z
- Matches only the end of the string/text, irrespective of the
m and s modifiers.
- \Z
- Matches at the end of the string/text, or before a NEWLINE character that ends it,
irrespective of the m and s modifiers.
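The difference between \B, \A, \z and \Z and their more common cousins can be seen in a short script. This is a sketch; the sample strings and variable names are made up for the illustration.

```perl
use strict;
use warnings;

# \B\. finds only the period that does not follow a word.
my $sentence = "You do not put period after space . like this.";
my $lonely_periods = 0;
$lonely_periods++ while $sentence =~ /\B\./g;     # counts 1

my $lines = "line one\nline two\n";

# \z is the very end of the string; \Z also accepts a final new line.
my $end_z       = ($lines =~ /two\z/) ? 1 : 0;    # 0: a new line follows "two"
my $end_capital = ($lines =~ /two\Z/) ? 1 : 0;    # 1

# \A ignores the m modifier, while ^ obeys it.
my $start_a     = ($lines =~ /\Aline two/m) ? 1 : 0;   # 0
my $start_caret = ($lines =~ /^line two/m)  ? 1 : 0;   # 1

print "$lonely_periods $end_z $end_capital $start_a $start_caret\n";  # prints 1 0 1 0 1
```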
There are also other quoted letters, and special
quantifiers, etc. Describing them would
take a lot of space, and basically, if someone needs to use them, he/she
has to go through the boot camp of the Camel Book first.
Check the links:
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
http://virtual.park.uga.edu/humcomp/perl/regex2a.html
http://www.comp.leeds.ac.uk/Perl/matching.html
and study.
Jan K. Labanowski
|