Regular Expressions - If only they were not so practical

Sascha

2019-09-26

For a long time, I avoided regular expressions. Every time I approached the subject, my instinctive reaction was flight. I tried again and again to get into the topic. But all the people who use regular expressions cannot be completely wrong, so I gathered my courage and fought my way through. To be honest, I’m still not the biggest fan of regular expressions. Still, they save a lot of time.

Let’s start with a very simple example. Imagine you have a long table of article numbers in Google Sheets. The article numbers have a uniform structure (e.g., article-color-12587). What do you do when you want to change the order of the segments? With a few simple formulas here and there and some extra columns you can do this 😁. With a regular expression you can solve this problem much more elegantly. Google Sheets has a REGEXREPLACE() function.

It can solve this problem with a single formula. But before we come to the solution, we need to know a few basics.

The basics for regular expressions

Regular expressions help identify patterns within strings. The smallest element of a string is a single character. That’s where we start. There are different so-called character classes when using regular expressions:

Character classes

RegEx	Meaning
.	Any character
\d	Digits (0-9)
\D	Any character that is not a digit
\s	Whitespace (space, tab, CR, LF)
\S	Any character that is not whitespace
\w	Alphanumeric characters including “_”
\W	Any character that is not alphanumeric or “_”

This already helps us to look for a date pattern:

\d\d\.\d\d\.\d\d\d\d

Two digits followed by a dot, followed by two digits, a dot and four digits.

The period gets prefixed with a \. This makes it clear that we do not mean any character, but actually the period. It works the other way round for the digits: if the \ were not in front of the d, the letter d would be searched for, not the character class.
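To see why escaping the period matters, here is a small sketch in Python. The sample text and variable names are made up for illustration; the only thing it relies on is the standard `re` module.

```python
import re

# Escaped dots match only a literal period.
strict = re.compile(r"\d\d\.\d\d\.\d\d\d\d")
# Unescaped dots match any character, so other separators slip through.
loose = re.compile(r"\d\d.\d\d.\d\d\d\d")

# Hypothetical sample text with one real date and one look-alike.
text = "Released 26.09.2019, build 26x09y2019"

strict_hits = strict.findall(text)
loose_hits = loose.findall(text)

print(strict_hits)  # ['26.09.2019']
print(loose_hits)   # ['26.09.2019', '26x09y2019']
```

The unescaped version still finds the date, which is why this kind of mistake often goes unnoticed until odd input shows up.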

Special characters

Besides the common characters there are special characters.

Regex	Meaning
c	e.g. the “c” character
^	Beginning of the line or string / negation inside character classes [^...]
$	End of the line or string
\	Escape character: changes the meaning of the next character (\. is a literal period, \d is the digit class)
\n	LF - line feed / line break
\r	CR - carriage return, moves the cursor back to position 1 of the same line
\r\n	Line break DOS / Windows
\t	Tab
\f	FF - form feed or page break, moves to the first line of the next page
\a	Beep (bell)
\e	Escape
\b	Empty string at the beginning or end of a word (word boundary)
\B	Empty string not at the beginning or end of a word
\<	Empty string at the beginning of a word (GNU tools such as grep)
\>	Empty string at the end of a word (GNU tools such as grep)
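A few of these are easier to grasp in action. The following sketch uses Python’s `re` module with a made-up sample text; note that in Python, ^ and $ only anchor to every line when the `re.MULTILINE` flag is set.

```python
import re

# Hypothetical sample: two lines, several words containing "cat".
text = "cat concat cats\ncat"

# \b keeps the match to the whole word "cat".
whole_words = re.findall(r"\bcat\b", text)

# With re.MULTILINE, ^ and $ anchor to the start and end of each line.
line_only = re.findall(r"^cat$", text, flags=re.MULTILINE)

print(whole_words)  # ['cat', 'cat']
print(line_only)    # ['cat']
```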

Custom character classes

You can also define your own character classes:

Regex	Meaning
[abc]	a, b, or c - a so-called simple class
[^abc]	any character that is not a, b, or c
[a-h]	Character range from a to h
[a-h]|[r-u]	characters in the range from a to h or from r to u

In a character class, either single characters are defined [aeiou] or a range [a-h0-9]. With the ^ you can negate the class, i.e. [^abc] means any character that is not an a, b, or c. In addition, you can connect several character classes with the | operator (or).
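As a quick illustration, here is a minimal Python sketch that applies three custom classes to the article number from the introduction. The variable names are my own.

```python
import re

sample = "article-color-12587"

# Simple class: every vowel in the string.
vowels = re.findall(r"[aeiou]", sample)

# Negated class: everything that is not a lowercase letter.
non_letters = re.findall(r"[^a-z]", sample)

# Two ranges connected with the | operator.
in_ranges = re.findall(r"[a-h]|[r-u]", sample)

print("".join(vowels))       # aieoo
print("".join(non_letters))  # --12587
print("".join(in_ranges))    # artcecr
```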

Quantifiers

You can specify in regular expressions how often a character is allowed.

Regex	Meaning
a?	once or not at all
a*	not at all up to any number of times
a+	once up to any number of times
a{3}	exactly three times
a{3,5}	at least three times, but not more than five times

This can be used, for example, to find number blocks: \d{3}-\d{4}-\d{5} finds numbers formatted like this: 123-4567-89101.
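Here is a small Python sketch of the number block pattern. The second test string and the `colou?r` pattern are my own additions to show a failing case and the ? quantifier.

```python
import re

block = re.compile(r"\d{3}-\d{4}-\d{5}")

# fullmatch only succeeds if the whole string fits the pattern.
ok = block.fullmatch("123-4567-89101") is not None
too_short = block.fullmatch("12-4567-89101") is not None

# ? makes the preceding character optional.
spellings = re.findall(r"colou?r", "color colour")

print(ok)         # True
print(too_short)  # False
print(spellings)  # ['color', 'colour']
```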

Greedy and lazy quantifiers

When specifying quantities in regular expressions, there are so-called lazy and greedy quantifiers. Greedy quantifiers try to consume as much as possible per match. Their lazy colleagues want to consume as little as possible for each match. The plain quantifiers from the previous section are greedy by default; appending a ? makes them lazy.

Lazy quantifiers:

Regex	Meaning
a*?	not at all up to any number of times, but as few as possible
a+?	once up to any number of times, but as few as possible
a{3,}?	three times or more, but as few as possible

Greedy quantifiers:

Regex	Meaning
a*	not at all up to any number of times, as often as possible
a+	once up to any number of times, as often as possible
a{3,}	three times or more, as often as possible

Suppose we have the sentence Hello -Bob-, how are -You-? Now we search it with a greedy and a lazy quantifier:

(greedy quantifier): -.*- finds: -Bob-, how are -You-

(lazy quantifier): -.*?- finds: -Bob- and -You-

You can see that the greedy quantifier finds one long match in the sentence. The lazy quantifier finds two short matches instead.
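The same comparison can be reproduced in Python. This is only a sketch; it assumes the greedy pattern is -.*- and the lazy one is -.*?-.

```python
import re

sentence = "Hello -Bob-, how are -You-?"

# Greedy: .* grabs as much as possible, so the match runs
# from the first "-" to the last "-".
greedy = re.findall(r"-.*-", sentence)

# Lazy: .*? stops at the first "-" it can, giving two short matches.
lazy = re.findall(r"-.*?-", sentence)

print(greedy)  # ['-Bob-, how are -You-']
print(lazy)    # ['-Bob-', '-You-']
```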

Groups

You can separate parts of a regular expression from each other by grouping them. Groups are defined by parentheses, and each group gets a number. The complete match gets the number 0; the groups are then counted up from 1, from left to right.

Example SKU:

artikel-rot-2545

regular expression: (\w+)-(\w+)-(\d+)

The regular expression finds the complete article number. In addition, four numbers are assigned (0 to 3). To reference a group, you use $ followed by its number. In our example, it looks like this:

$0 = artikel-rot-2545

$1 = artikel

$2 = rot

$3 = 2545
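The numbering can be checked with a short Python sketch. Python exposes the groups through `match.group(n)` rather than `$n`, but the numbers are the same.

```python
import re

match = re.fullmatch(r"(\w+)-(\w+)-(\d+)", "artikel-rot-2545")

whole = match.group(0)   # the complete match
parts = match.groups()   # groups 1..3 as a tuple

print(whole)  # artikel-rot-2545
print(parts)  # ('artikel', 'rot', '2545')
```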

Now we have everything we need for our original problem. The formula in Google Sheets looks like this:

=REGEXREPLACE(A1;"(\w+)-(\w+)-(\d+)";"$3-$2-$1")

The same principle can be applied in various programming languages. There are slight differences depending on the language:

Python:
import re
re.sub(r'(\w+) (\w+)', r'\2 \1', 'Word1 Word2')  # groups are referenced as \1, \2

Go:
regex := regexp.MustCompile(`(\w+) (\w+)`)
fmt.Println(regex.ReplaceAllString("Word1 Word2", "$2 $1"))

JavaScript:
let regex = /(\w+) (\w+)/;
"Word1 Word2".replace(regex, "$2 $1");

An alternative in JavaScript (inside a string literal, the backslashes must be doubled):

let regex2 = new RegExp("(\\w+) (\\w+)");
"Word1 Word2".replace(regex2, "$2 $1");

Lookaround

Now it gets a bit more complicated 😇. You can also specify the context of a regular expression. That way, we can specifically search for something that comes before or after a specific string.

Look behind

Example SKU list:
artikel-rot-2538
artikel-gelb-2539
artikel-blau-2542
artikel-lila-2543
artikel-rot-2545
artikel-gelb-2546

Regular expression + Look behind:

(?<=artikel-)\w+

The first expression artikel- must precede the second expression \w+

(?<!-)\d+
The first expression - may not precede the second expression \d+

Examples:
Look behind 1: https://regexr.com/4irf2
Look behind 2: https://regexr.com/4irf8
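Both look behind patterns can also be tried in Python. The sketch below uses the first three SKUs from the list; the second result shows a detail that is easy to miss: the negative look behind does not reject the number, it only shifts the start of the match.

```python
import re

skus = [
    "artikel-rot-2538",
    "artikel-gelb-2539",
    "artikel-blau-2542",
]

# Positive look behind: \w+ only matches directly after "artikel-".
colors = [re.search(r"(?<=artikel-)\w+", sku).group() for sku in skus]

# Negative look behind: \d+ may not start right after a "-".
# The first digit follows "-", so the match starts at the second digit.
tails = [re.search(r"(?<!-)\d+", sku).group() for sku in skus]

print(colors)  # ['rot', 'gelb', 'blau']
print(tails)   # ['538', '539', '542']
```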

Look ahead

Example SKU list:
artikel-rot-2538
artikel-gelb-2539
artikel-blau-2542
artikel-lila-2543
artikel-rot-2545
artikel-gelb-2546

Look ahead:

\w+(?=-)

The 2nd expression - must follow the 1st expression \w+

\w+(?!-)

The 2nd expression - may not follow the 1st expression \w+

Examples:
Look ahead 1: https://regexr.com/4irg3
Look ahead 2: https://regexr.com/4irg9
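Here is the same pair of look ahead patterns as a Python sketch on a single SKU. The negative variant produces slightly surprising results, because \w+ simply gives back one character until the next character is no longer a -.

```python
import re

sku = "artikel-rot-2538"

# Positive look ahead: \w+ must be followed by "-".
followed = re.findall(r"\w+(?=-)", sku)

# Negative look ahead: \w+ may not be followed by "-".
# \w+ backtracks by one character to satisfy the condition.
not_followed = re.findall(r"\w+(?!-)", sku)

print(followed)      # ['artikel', 'rot']
print(not_followed)  # ['artike', 'ro', '2538']
```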

The difference between the two types of lookarounds is the direction. Basically, you have to imagine it like this:

First, we look for the regular expression itself. A look ahead then checks whether the additional pattern comes directly after the result. A look behind checks whether it comes directly before the result. If the additional condition is met, the result is returned; the additional pattern itself is not part of the result.

Example for a ‘real world’ regular expression

Finally, a more common example. The following regular expression is one of many ways to recognize an e-mail address.

Full regular expression:

\w+[.-]?\w+@\w+([.-]?\w+)?\.[a-zA-Z]{2,4}

Let’s break it down into its components:

\w+
one or more alphanumeric characters, including “_”

[.-]?
“.” or “-”, zero times or once (optional)

\w+
one or more alphanumeric characters, including “_”

@
the @ sign

\w+
one or more alphanumeric characters, including “_”

([.-]?\w+)?
a group that occurs at most once (optional): [.-]?, an optional “.” or “-”, followed by \w+, one or more alphanumeric characters, including “_”

\.
the . before the domain ending

[a-zA-Z]{2,4}
at least 2 and at most 4 letters in upper or lower case

Here is the link to the example: https://regexr.com/4itu0
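To get a feeling for what this expression accepts and rejects, here is a sketch in Python. It assumes the dot before the domain ending is escaped, as described in the breakdown. The helper name `looks_like_email` and the test addresses are made up for illustration.

```python
import re

EMAIL = re.compile(r"\w+[.-]?\w+@\w+([.-]?\w+)?\.[a-zA-Z]{2,4}")


def looks_like_email(candidate: str) -> bool:
    """Return True if the whole string fits the pattern from this article."""
    return EMAIL.fullmatch(candidate) is not None


print(looks_like_email("john.doe@example.com"))  # True
print(looks_like_email("not-an-email"))          # False
# Limits of this pattern: the local part needs at least two characters,
# and domain endings longer than four letters are rejected.
print(looks_like_email("a@example.com"))         # False
print(looks_like_email("john@example.museum"))   # False
```

The last two cases are valid addresses in practice, which shows why this is only one of many ways to recognize an e-mail address and not a complete validator.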

My conclusion

By now, I find regular expressions pretty handy. They are a very powerful tool that can help you a lot. I would not say that I love them, but I have learned to deal with them. My tip for everyone who wants to get into the topic: examples are everything! To understand regular expressions, it’s not enough just to read through this or another text. You have to try out as much as possible to really understand how regular expressions work.