Guide to Regular Expressions

An online guide to understanding and mastering the wonderful world of regular expressions

Introduction to regular expressions

You have probably often read or heard about regular expressions and their power, but if you are here you probably do not have a very clear idea of what they are. First of all, regular expressions go by various names: regex, regexp, reg ex or reg-ex, all meaning the same thing.

History may sometimes seem boring, but living without history is like living without memory. Stephen Kleene defined regular sets in the 1950s, the concept from which regular expressions originated. The first real application of regular expressions was implemented by Ken Thompson around 1966 in the QED editor. Today all the major programming languages support regular expressions, as do many text editors and practically every IDE.

So what you have to remember about regular expressions is that they are used to search for and replace text inside other text.

Regular Expressions = SEARCH && REPLACE

What are regular expressions?

Regular expressions are patterns that help us search for and replace something inside a string. In computing, a string is a sequence of characters.

We often need to alter a text or search for something inside it. The first thing a programmer may think of is to write some code by hand, but writing code for such a job can turn out to be very expensive. This is where regular expressions come in handy.

Special characters in regular expressions

With the regular expression \beve\b we can find all the occurrences of the word eve inside a text.
There are 7 characters in the regular expression:
\ b e v e \ b
5 regular characters and 2 backslashes. The backslash gives the normal character b a special meaning: the special pair \b indicates the border of a word.

Do a global search, and test for "Eve":
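As a minimal sketch in JavaScript, the search could be run as follows. The sample sentence is an assumption made for illustration; only the pattern \beve\b and the g and i flags come from this guide.

```javascript
// The sample sentence is made up for illustration.
const eveText = "Eve met Steve on the eve of the event. Nobody knew EVE.";

// g finds every match, i ignores upper/lower case.
const eveMatches = eveText.match(/\beve\b/gi);

console.log(eveMatches); // [ 'Eve', 'eve', 'EVE' ]
```

Steve and event are skipped because their eve is not surrounded by word borders on both sides.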





Many normal characters become special characters when preceded by a backslash. Not every regex flavour supports every one of them.

List of special character pairs with a preceding backslash

\a	Bell code (acoustic signal).
\A	Start of text.
\b	Word border.
\B	Any position that is not a word border.
\c	Control code (Ctrl).
\d	Numeric digit.
\D	Everything except \d.
\e	Escape code.
\f	Form feed (end of page) code.
\n	Line feed (end of line) code.
\r	Carriage return code.
\s	Whitespace character (space, tab, line break).
\S	Everything except \s.
\t	Tabulation code.
\u	Unicode code.
\v	Vertical tabulation code.
\w	Word character.
\W	Everything except \w.
\x	Character given by its hexadecimal code.
\z	End of text.

List of single special characters

^	Start of a line.
$	End of a line.
*	Zero or more.
+	One or more.
?	Zero or one.
.	Any character except \n.
( )	Group of characters.
{ }	Number of repetitions.
[ ]	Set of characters.
|	Alternative strings.
/	Delimiter of the regex, followed by the flags.
\	Special character that gives or removes meaning to other characters.

All of the preceding special single characters become normal characters when preceded by a backslash. So the backslash has two uses: it gives a special meaning to some normal characters and removes the special meaning from special characters.
If we are searching for the backslash character itself, we can simply put two backslashes in the regular expression:
\\
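A small JavaScript sketch of the escaped backslash, assuming a hypothetical Windows-style path as sample text:

```javascript
// Inside a JavaScript string literal "\\" is a single backslash,
// so this hypothetical path contains exactly two of them.
const winPath = "C:\\temp\\file.txt";

// In the regex, \\ matches one literal backslash.
const backslashes = winPath.match(/\\/g);

console.log(backslashes.length); // 2
```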

Simple searches

The simplest search you can do is a search for a string of normal characters.
For example, consider searching for the extension of a JPEG file.

Do a global search, and test for ".jpg":
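The following JavaScript sketch runs the search both with .jpg and with \.jpg. The sample text is an assumption chosen so that it contains the word alljpg.

```javascript
// Hypothetical sample text with two real extensions and one plain word.
const fileText = "photo.jpg, holiday.jpg and alljpg";

// The unescaped dot matches any character, so "ljpg" is found too.
const looseJpg = fileText.match(/.jpg/g);

// The escaped dot matches only a literal ".".
const strictJpg = fileText.match(/\.jpg/g);

console.log(looseJpg);  // [ '.jpg', '.jpg', 'ljpg' ]
console.log(strictJpg); // [ '.jpg', '.jpg' ]
```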





The result is not what you might have expected. The regular expression matched even the ljpg in the word alljpg. This happened because the . character is a special character that matches any character except a line break.
To match exactly .jpg, replace the regular expression .jpg with \.jpg and see what is matched.

Numerical digits and spaces

You will often need to search for numbers at a certain position inside some text, or to search for everything except the numerical values. We can do both with the regular expression \d, which matches a numerical digit, that is a character from 0 to 9, and with the complementary regular expression \D, which matches everything but numerical digits.

Two similar regular expressions, \s and \S, search for whitespace characters and for everything but whitespace characters.

The numerical digits \d and \D

Think about a situation in which you want to find the birth date of a person, written inside a very large text that contains other numbers. What you know is that the date is in the format dd/mm/yyyy. So how do we proceed? Easy, we use the \d regular expression.

Do a global search, and find the birth date in the format dd/mm/yyyy:
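A minimal JavaScript sketch of this search, assuming a made-up sentence as sample text. Note that inside a JavaScript regex literal the / character has to be escaped as \/, because an unescaped / would end the literal.

```javascript
// Hypothetical text with one date and a few unrelated numbers.
const bioText = "Born on 23/04/1985 in room 12, she moved 3 times before 2001.";

// Two digits, slash, two digits, slash, four digits.
const birthDates = bioText.match(/\d\d\/\d\d\/\d\d\d\d/g);

console.log(birthDates); // [ '23/04/1985' ]
```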





Now what happens if we use the \D regular expression? Try replacing \d\d/\d\d/\d\d\d\d with \D to see the result.

The spaces \s and \S

With regular expressions we can distinguish between whitespace and other characters. We do that with the \s and \S regular expressions.

Do a global search, and find all the spaces:
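As a short JavaScript sketch, assuming a tiny sample string that contains a space, a tab and a line break:

```javascript
// Hypothetical sample: letters separated by a space, a tab and a line break.
const spacedText = "a b\tc\nd";

// \s matches every whitespace character, not only the plain space.
const whitespace = spacedText.match(/\s/g);

// \S matches everything else.
const nonWhitespace = spacedText.match(/\S/g);

console.log(whitespace.length); // 3
console.log(nonWhitespace);     // [ 'a', 'b', 'c', 'd' ]
```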





Now try replacing \s with \S and the result will change.

Character sets [ ]

We often need to find one of a group of characters inside a text. To do that there is a special regular expression that searches for a set, written in the form [abcd], where the characters searched for are a, b, c and d.
The simplest character set contains a single character, which is the same as searching for the character itself, so the regular expression [a] is equivalent to the regex a.

Group of characters [abc]

If we want to search for any one of a group of characters we write the regex [abc], where the characters we are searching for are a, b and c.

Do a global search, and find all colors with character values:
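A possible JavaScript sketch, assuming that the colors are three-character HTML codes such as #abc. Both the sample text and the exact pattern are assumptions.

```javascript
// Hypothetical list of short HTML color codes.
const letterColorText = "#aaa #abc #123 #cab #a1b";

// A "#" followed by three characters, each taken from the set a, b, c.
const letterColors = letterColorText.match(/#[abc][abc][abc]/g);

console.log(letterColors); // [ '#aaa', '#abc', '#cab' ]
```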





Sequence of characters [-]

If we want to search for a sequence (a range) of characters we write the regex [0-9], where the characters we are searching for go from 0 to 9.

Do a global search, and find all colors with numeric values:
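A possible JavaScript sketch, again assuming three-character HTML color codes as sample data:

```javascript
// Hypothetical list of short HTML color codes.
const digitColorText = "#aaa #123 #905 #1b2";

// A "#" followed by three characters, each in the range 0 to 9.
const digitColors = digitColorText.match(/#[0-9][0-9][0-9]/g);

console.log(digitColors); // [ '#123', '#905' ]
```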





All characters but [^]

Sometimes we need to find everything but some characters. To do that we use a character set with the ^ character added right after the opening bracket. For example, if we want all the numeric HTML colors except those containing the digits from 0 to 4, we write a regex like [^0-4].

Do a global search, and find all colors with numeric values different from 0,1,2,3,4:
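A possible JavaScript sketch of the negated set, assuming a sample list that contains only numeric color codes. The second part shows a ^ used as an ordinary member of a set.

```javascript
// Hypothetical list of numeric HTML color codes.
const numericColorText = "#123 #567 #905 #789 #041";

// A "#" followed by three characters, none of which is 0, 1, 2, 3 or 4.
const highDigitColors = numericColorText.match(/#[^0-4][^0-4][^0-4]/g);

console.log(highDigitColors); // [ '#567', '#789' ]

// A ^ that is not in first position is a plain character of the set.
const caretMatches = "2^3".match(/[3^]/g);

console.log(caretMatches); // [ '^', '3' ]
```

Keep in mind that [^0-4] also accepts letters and spaces, so on mixed text it is safer to combine it with other conditions.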





If we want to use the ^ character itself as a member of the set, the only thing to do is not to put it in the first position.

The Words: \b, \B, \w, \W

We can now search for characters, numbers and sets of them. We have already seen the special character \b (word border); now we are going to see its opposite \B and the word-character identifiers \w and \W.

The borders of a word: \b

We saw the use of the \b regex to find words. Now we want to find words of a certain length, and we can use the \b regex for this job too.

Do a global search, and find all the words of three letters:
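A first attempt can be sketched in JavaScript as three dots between two word borders. The sample sentence and the pattern \b...\b are assumptions made for illustration.

```javascript
// Hypothetical sample sentence.
const dotText = "We do what is fun";

// Any three characters between two word borders.
const dotMatches = dotText.match(/\b...\b/g);

console.log(dotMatches); // [ 'We ', 'do ', ' is', 'fun' ]
```

Only fun is a real three-letter word; the other matches are two-letter words plus a space.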





Now we have a problem, because we have also marked the words do and is: the . regex matches the space character too.

To solve this problem we can use a character set, like this.

Do a global search, and find all the words of three letters:
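A JavaScript sketch of the character-set version, assuming a made-up sample sentence and the set [a-zA-Z] repeated three times:

```javascript
// Hypothetical sample sentence.
const letterText = "We do what is fun for the cat";

// Exactly three letters between two word borders.
const threeLetterWords = letterText.match(/\b[a-zA-Z][a-zA-Z][a-zA-Z]\b/g);

console.log(threeLetterWords); // [ 'fun', 'for', 'the', 'cat' ]
```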





Something very important to notice here is that \b has zero length: it matches a position rather than a character, so it behaves like an anchor.

The characters of a word: \w

Instead of writing the character set [a-zA-Z] we can use the \w regex, keeping in mind that \w also matches digits and the _ character.

Do a global search, and find all the words of three letters:
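A JavaScript sketch with \w, assuming a sample sentence that also contains a number and an identifier with an underscore:

```javascript
// Hypothetical sample sentence.
const wordText = "The cat has 123 toys and a_b";

// Exactly three word characters between two word borders.
const threeCharWords = wordText.match(/\b\w\w\w\b/g);

console.log(threeCharWords); // [ 'The', 'cat', 'has', '123', 'and', 'a_b' ]
```

Unlike [a-zA-Z], the \w regex also accepts 123 and a_b, because digits and the underscore count as word characters.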





Now we can use this simple regex instead of writing out the set of characters each time.

The inverse of borders and characters: \B and \W

The \W regex represents every character that cannot be part of a word. For the regex, a word is a sequence of letters, digits and the _ character; yes, even digits and the _ character can be part of a word. An example is useful here.

Do a global search, to see the use of \W reg-ex:
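A small JavaScript sketch, assuming a made-up sample string that mixes word characters with punctuation:

```javascript
// Hypothetical sample string.
const mixedText = "user_1@mail.com, ok!";

// \W matches every character that is not a letter, a digit or "_".
const nonWordChars = mixedText.match(/\W/g);

console.log(nonWordChars); // [ '@', '.', ',', ' ', '!' ]
```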





To see the inverse, try replacing the \W regex with \w.

The use of \B is a little more complicated. Like \b, the \B regex is an anchor with zero length, and it is the inverse of \b.

The \b regex matches the border of a word. The \B regex matches every position that is not the border of a word, that is every position between two word characters or between two non-word characters. The example below makes this clearer.

Do a global search, to see the use of \B reg-ex:
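A JavaScript sketch that reports where each pattern matches. The three sample words come from this guide; putting them in one string and using matchAll to read the positions are choices made for illustration.

```javascript
// The three words discussed in this section, joined in one string.
const evText = "Never believe ever";

// Positions where "ev" has a word character on both sides.
const innerEv = [...evText.matchAll(/\Bev\B/g)].map((m) => m.index);

// Positions where "ev" starts a word and is followed by a word character.
const leadingEv = [...evText.matchAll(/\bev\B/g)].map((m) => m.index);

console.log(innerEv);   // [ 1, 10 ]  -> inside Never and believe
console.log(leadingEv); // [ 14 ]     -> at the start of ever
```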





As you can see, the regex matches the ev in the words Never and believe, but it does not match the ev in the word ever. This is because ever starts with a word border, so its ev can only be found with the \bev\B regex.

Try replacing \Bev\B with \bev\B to see what happens.

The start and end of a line and of the entire text: ^, $, \A, \Z

We often need to mark the start or end of a line, or of the entire text that we are analysing. To do this we use the ^ and $ regexes for the line, and \A and \Z for the entire text.

The start and end of a line ^, $

Think of the case where you have a text file and want to find all the words that have been split at the end of a line with a hyphen. Regular expressions can do this job.

Do a global search, find all the words divided at the end of a line with a hyphen:
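A minimal JavaScript sketch, assuming a two-line sample text and the pattern \w+-$ with the m flag so that $ matches at the end of every line:

```javascript
// Hypothetical two-line text: "reg-" is inside a line, "someth-" ends one.
const wrappedText = "The reg-ex is someth-\ning very useful.";

// Word characters followed by a hyphen that sits at the end of a line.
const splitWords = wrappedText.match(/\w+-$/gm);

console.log(splitWords); // [ 'someth-' ]
```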





As you can see, the regex matches the word someth- but not the word reg-.

The start and end of a text \A, \Z

Think of the case where you have a file with a start and an end tag on each line and you want to find only the start tag of the whole file. JavaScript does not support \A and \Z, but other languages do.

The quantifiers: +, *, ? and { }

We need to search for sequences of characters of variable length in our regular expressions, and to do that we use quantifiers. They are placed after an element of a regex and change how many times that element may repeat. For example, with \d we can search for a single digit, and with \d+ for one or more digits.

One or more: +

Think about the case where we want to find all the dates inside a text. We know that all the dates are written with slashes, but we do not know whether the format is dd/mm/yyyy or yyyy/mm/dd. The solution comes from the + regex.

Do a global search, find all the dates inside a text:
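A JavaScript sketch with the + quantifier, assuming a made-up sentence that contains one date in each format:

```javascript
// Hypothetical text with a dd/mm/yyyy date and a yyyy/mm/dd date.
const plusText = "Paid on 01/12/2016, shipped 2016/12/05.";

// Three runs of one or more digits separated by slashes.
const plusDates = plusText.match(/\d+\/\d+\/\d+/g);

console.log(plusDates); // [ '01/12/2016', '2016/12/05' ]
```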





As you can see, \d+ matches one or more digits, in this case 2 or 4 digits. If you want to use the [0-9] regex instead, you have to write the + outside the brackets, like this: [0-9]+.

Zero or more: *

Now we have a more complicated case: a date where the month is expressed in Roman numerals, like 01/XII/2016, or a normal date, like 2016/12/01.

Do a global search, find all the dates inside a text:
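One way to sketch this in JavaScript is to let the middle part be zero or more digits followed by zero or more Roman numeral letters. The pattern below is an assumption, not necessarily the one used by the original example; the two dates are the ones mentioned above.

```javascript
// The two dates mentioned in this section, inside a made-up sentence.
const romanText = "Dates: 01/XII/2016 and 2016/12/01.";

// Middle part: \d* covers "12", [IVXLCDM]* covers "XII"; each may be empty.
const romanDates = romanText.match(/\d+\/\d*[IVXLCDM]*\/\d+/g);

console.log(romanDates); // [ '01/XII/2016', '2016/12/01' ]
```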





As you can see, both dates are matched. The * regex means zero or more repetitions of the preceding element.

Zero or one: ?

Now we are in the case where the dates are in the form dd/mm/yyyy or ddmm/yyyy.

Do a global search, find all the dates inside a text:
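A JavaScript sketch with an optional first slash, assuming a made-up sentence around the two date forms:

```javascript
// Hypothetical text with a dd/mm/yyyy date and a ddmm/yyyy date.
const optionalText = "Seen on 01/12/2016 and on 0112/2016.";

// The first slash is followed by ?, so it may be present or absent.
const optionalDates = optionalText.match(/\d\d\/?\d\d\/\d\d\d\d/g);

console.log(optionalDates); // [ '01/12/2016', '0112/2016' ]
```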





As you can see, placing the ? regex after the / character makes the regex recognize the date even when the / is not present, as in the date 0112/2016.

Number of repetitions: { }

There is another case that cannot be resolved with the preceding regexes: dates in different formats such as ddmmyyyy or dd/mm/yyyy. If we use the preceding quantifiers we come up with something like this.

Do a global search, find all the dates inside a text:
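A JavaScript sketch of the problem, assuming a made-up sentence that contains both date forms and a ten-digit phone number. The pattern built from + and ? is an assumption about what such an attempt could look like.

```javascript
// Hypothetical text: two dates and one phone number.
const looseText = "Born 01122016 or 01/12/2016, phone 3331234567.";

// Runs of digits with optional slashes: too permissive.
const looseDates = looseText.match(/\d+\/?\d+\/?\d+/g);

console.log(looseDates); // [ '01122016', '01/12/2016', '3331234567' ]
```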





In this case the regex matched the phone number too, which is something we do not want. To solve this problem we use the { } regex.

There are a number of possible uses of the { } regex:

  1. {number} – a single number inside the braces means an exact number of repetitions. Example: \d{3} matches exactly 3 consecutive digits.
  2. {number1,number2} – two numbers separated by a comma, without spaces, mean from number1 to number2 repetitions. Example: \d{2,5} matches runs of 2 to 5 digits. Applied to whole numbers, it matches 12, 123, 1234 and 12345, but not 1 or 123456.
  3. {number,} – one number followed by a comma means from number to an unlimited number of repetitions. Example: \d{3,} matches the numbers 123, 1234, 12345 and longer, but not 1 or 12.
  4. {number,number} with the same number twice is equivalent to {number}
  5. {1,} is equivalent to +
  6. {0,} is equivalent to *
  7. {0,1} is equivalent to ?

To solve the preceding problem we state exactly how many digits each part of the date has.

Do a global search, find all the dates inside a text:
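A JavaScript sketch with exact repetition counts, assuming a made-up sentence that contains both date forms and a ten-digit phone number. The \b anchors on both sides are an addition made here: without them the first eight digits of the phone number would still match.

```javascript
// Hypothetical text: two dates and one phone number.
const strictText = "Born 01122016 or 01/12/2016, phone 3331234567.";

// Exactly 2 + 2 + 4 digits, optional slashes, word borders on both sides.
const strictDates = strictText.match(/\b\d{2}\/?\d{2}\/?\d{4}\b/g);

console.log(strictDates); // [ '01122016', '01/12/2016' ]
```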





Finally, the correct match.

Laziness: ?

By default all the quantifiers in regular expressions are greedy. What does that mean? To explain it we are going to use an example.

Do a global search, find all the mathematical operations in brackets:
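The greedy behaviour can be sketched in JavaScript with the expression and the regex \(.+\) discussed in this section:

```javascript
// The mathematical expression discussed in this section.
const greedyText = "(3+2)*(4+5)*(1+6)";

// Greedy: .+ takes as many characters as it can.
const greedyMatches = greedyText.match(/\(.+\)/g);

console.log(greedyMatches); // [ '(3+2)*(4+5)*(1+6)' ]
```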





What we got here is not exactly what we expected. We expected each sum in brackets to be matched separately, but the regex matched the whole expression. Why? This happened because the regex is greedy. Let's analyse the regex \(.+\). The first \( matches the first ( in the mathematical expression. Then the .+, instead of matching only the 3+2 part, matches 3+2)*(4+5)*(1+6, and the final \) matches the last ). Because the . regex matches every character except the line break, it matches the brackets ( and ) too, and because it is greedy it continues up to the last position from which the rest of the regex can still match. To remedy this behaviour we use the lazy regex ?: put after the .+, it produces the lazy regex .+?, which stops matching as soon as possible, here right after 3+2. To make this clear, let's see an example.

Do a global search, find all the mathematical operations in brackets:
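The lazy version can be sketched in JavaScript on the same expression:

```javascript
// The mathematical expression discussed in this section.
const lazyText = "(3+2)*(4+5)*(1+6)";

// Lazy: .+? takes as few characters as possible before the next ")".
const lazyMatches = lazyText.match(/\(.+?\)/g);

console.log(lazyMatches); // [ '(3+2)', '(4+5)', '(1+6)' ]
```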





Groups

As in programming languages, regexes have a grouping functionality, so that particular cases can be matched.

Grouping: ( )

We learned about the + regex, but think about the case where you want to apply a repetition not to one character but to several consecutive characters.

Do a global search, find all the repetitions of 01 consecutive characters:
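A JavaScript sketch of a repeated group, assuming a made-up string of binary digits as sample text:

```javascript
// Hypothetical sample of binary strings.
const binaryText = "0101 1001 010101 11";

// The + applies to the whole group "01", not only to the "1".
const repeated01 = binaryText.match(/(01)+/g);

console.log(repeated01); // [ '0101', '01', '010101' ]
```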





Backreference

If we use round brackets we create a group, whose matched text can be used in a replacement or reused later in the same search. To reuse this matched text, called a backreference, we write a backslash followed by the number of the group: \1 for the first group, \2 for the second, and so on.

Do a global search, find all palindrome words of four characters:
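A JavaScript sketch of the palindrome search. The sample sentence and the \b anchors are assumptions; the word Anna comes from this guide.

```javascript
// Hypothetical sample sentence containing Anna and other short words.
const palindromeText = "Anna saw that the noon deed was on a boat";

// Two captured characters, then the second again, then the first again.
const withFlagI = palindromeText.match(/\b(\w)(\w)\2\1\b/gi);
const withoutFlagI = palindromeText.match(/\b(\w)(\w)\2\1\b/g);

console.log(withFlagI);    // [ 'Anna', 'noon', 'deed' ]
console.log(withoutFlagI); // [ 'noon', 'deed' ]
```

Anna is found only with the i flag, because its first letter A and its last letter a differ in case.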





In this example the first (\w) matches the A of Anna, the second (\w) matches the first n, then the \2 backreference, which is associated with the second (\w), matches the second n, and at last the \1 backreference matches the final a. Because A and a differ in case, this last step succeeds only when the case-insensitive flag i is set.

Passive group: (?:)

When we use grouping we create a backreference, which has a cost in terms of processor time and memory. To create a group without creating the backreference we can use the passive group (?: ), which matches the same characters but cannot be referred to with a backreference such as \1.
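The difference can be sketched in JavaScript by looking at what match returns without the g flag. The sample string 0101 is an assumption.

```javascript
// Without the g flag, match returns the whole match followed by the groups.
const capturingResult = "0101".match(/(01)+/);
const passiveResult = "0101".match(/(?:01)+/);

console.log(capturingResult.length); // 2: the whole match plus group 1
console.log(passiveResult.length);   // 1: only the whole match
```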

Alternative: |

We often need to match one of several alternative groups of characters. To accomplish this task we use the alternative regex |, the pipe character.

Do a global search, find all the mobile numbers of 10 digits:
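A JavaScript sketch of the alternative, assuming that the international prefix can be written either as +39 or as 0039. The prefix, the sample numbers and the pattern are all assumptions made for illustration.

```javascript
// Hypothetical numbers: two with an international prefix, one without.
const phoneText = "Call +393331234567 or 00393339876543, not 3331234567.";

// Either "+39" or "0039", followed by exactly ten digits.
const mobileNumbers = phoneText.match(/(?:\+39|0039)\d{10}\b/g);

console.log(mobileNumbers); // [ '+393331234567', '00393339876543' ]
```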





In this way we created a match for both possible forms of the international prefix.

Lookaround: (?!), (?=)

We will now see the most advanced capability of regular expressions: the ability to look ahead or look behind the current position and make the match depend on what is found there.

Lookahead if different: (?!)

This regex makes a match on characters that are not followed by the value in the lookahead, for example email addresses that do not have the .org extension.

Do a global search, find all the emails that do not contain .org:
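A JavaScript sketch of the negative lookahead. The email addresses and the simplified email pattern are assumptions; real addresses would need a more careful pattern.

```javascript
// Hypothetical list of email addresses.
const mailText = "ann@site.org, bob@mail.com, eve@web.net";

// After the dot, the extension must not be exactly "org".
const notOrg = mailText.match(/\b[\w.]+@\w+\.(?!org\b)\w+/g);

console.log(notOrg); // [ 'bob@mail.com', 'eve@web.net' ]
```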





Lookahead if equal: (?=)

This regex makes a match on characters that are followed by the value in the lookahead, for example email addresses that have the .org extension.

Do a global search, find all the emails that contain .org:
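A JavaScript sketch of the positive lookahead, using the same kind of assumed sample addresses and simplified email pattern:

```javascript
// Hypothetical list of email addresses.
const orgMailText = "ann@site.org, bob@mail.com, eve@web.net";

// After the dot, the extension must be exactly "org".
const onlyOrg = orgMailText.match(/\b[\w.]+@\w+\.(?=org\b)\w+/g);

console.log(onlyOrg); // [ 'ann@site.org' ]
```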





Lookbehind

These regexes are symmetrical to lookahead: (?<=) and (?<!) look at what precedes the current position. JavaScript supports them only from ES2018 onwards, so older engines reject them.
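Newer JavaScript engines (ES2018 and later) do accept the lookbehind syntax (?<=) and (?<!). A minimal sketch, assuming such an engine and a made-up sample sentence:

```javascript
// Hypothetical sample sentence with a price and a quantity.
const priceText = "cost $42 and 17 items";

// Digits that are preceded by a dollar sign.
const afterDollar = priceText.match(/(?<=\$)\d+/g);

// Whole numbers that are not preceded by a dollar sign.
const notAfterDollar = priceText.match(/(?<!\$)\b\d+/g);

console.log(afterDollar);    // [ '42' ]
console.log(notAfterDollar); // [ '17' ]
```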

Flags: i, g, m

A regular expression in JavaScript is written as the pattern we have talked about so far, enclosed by two slashes and followed by some flags, as in /\beve\b/mgi, where \beve\b is the regex enclosed by the two slashes and mgi are the flags. The three flags covered here are i, g and m.

Case insensitive: i

This flag makes the search independent of lower or upper case. If we take the first example and change the flags from mgi to mg, we will see that the regex matches nothing, because all the Eve words have the first letter in upper case.

Do a global search, and test for "Eve":
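A JavaScript sketch of the same pattern with and without the i flag, assuming a made-up sample sentence:

```javascript
// Hypothetical sample: every Eve starts with an upper case letter.
const caseText = "Eve called. Nobody answered Eve.";

// Without i the lower case pattern finds nothing: match returns null.
const caseSensitive = caseText.match(/\beve\b/g);

// With i the case of the letters is ignored.
const caseInsensitive = caseText.match(/\beve\b/gi);

console.log(caseSensitive);   // null
console.log(caseInsensitive); // [ 'Eve', 'Eve' ]
```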





Try to add the i flag to see how the match changes.

Multiline: m

If we use the multiline flag, the anchors ^ and $ match the beginning and the end of each line. Without this flag they match only the beginning and the end of the whole text, so the ^ regex is equivalent to \A and the $ regex is equivalent to \Z.
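The effect of the m flag can be sketched in JavaScript on a two-line text (the sample text is an assumption):

```javascript
// Hypothetical two-line text.
const twoLines = "first line\nsecond line";

// Without m, ^ matches only at the start of the whole text.
const firstWordOfText = twoLines.match(/^\w+/g);

// With m, ^ matches at the start of every line.
const firstWordOfLines = twoLines.match(/^\w+/gm);

// The same holds for $ at the end.
const lastWordOfText = twoLines.match(/\w+$/g);
const lastWordOfLines = twoLines.match(/\w+$/gm);

console.log(firstWordOfText);  // [ 'first' ]
console.log(firstWordOfLines); // [ 'first', 'second' ]
console.log(lastWordOfText);   // [ 'line' ]
console.log(lastWordOfLines);  // [ 'line', 'line' ]
```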

Global search: g

The last flag is the global search flag. It lets us do subsequent searches, each one starting from the end of the previous match. If this flag is not enabled, subsequent searches will give the same match.
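A JavaScript sketch of the difference, using exec to run two searches in a row with the same regex (the sample string is an assumption):

```javascript
// Hypothetical sample string with three digits.
const digitText = "a1b2c3";

// With g each call to exec continues from the end of the previous match.
const globalDigit = /\d/g;
const globalFirst = globalDigit.exec(digitText)[0];
const globalSecond = globalDigit.exec(digitText)[0];

// Without g each call starts again from the beginning of the text.
const plainDigit = /\d/;
const plainFirst = plainDigit.exec(digitText)[0];
const plainSecond = plainDigit.exec(digitText)[0];

console.log(globalFirst, globalSecond); // 1 2
console.log(plainFirst, plainSecond);   // 1 1

// With match, the g flag returns all the matches at once.
const allDigits = digitText.match(/\d/g);
console.log(allDigits); // [ '1', '2', '3' ]
```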