Stepping Through Text with Regular Expressions

By Andy Oram, 04 November, 2020

Anyone who deals with text has reason to learn a bit about regular expressions—which represent the "re" in the name of the popular grep command—in order to search through text. However, it's another whole level of mastery to parse lines with regular expressions, extracting and classifying the relevant parts of a larger text. This article shows you a few tools that help you use regular expressions at this higher level.

We'll explore ways to break a line into separate strings and extract multiple strings that match a single regular expression. We'll loop over these strings and see how to store them in convenient data structures. And at the end, we'll peek at lookahead, which is a way to match without really matching.

Although I’ll explain each regular expression and function I use, you should already have a basic understanding of regular expressions before reading the article. Introductions are easy to find both online and in many published books.

Because Python is extremely popular and has rich, powerful implementations of regular expressions, I'll write examples in version 3.7 of that language. But all modern languages support regular expressions, so the techniques I show here can be used in just about any language you're likely to come across. The workhorse in this article is Python's findall function, which extracts every string that matches the criteria you specify.

Setting Up

At the beginning of your Python program, import its standard regular expression library through the following line:

import re

Another library, called regex, claims to offer advanced features, but you don't need it for the techniques shown in this article. I'm trying to stick to techniques available to everyone, regardless of what language and library you use.

Don't Resort Immediately to Regular Expressions

In highly structured documents, finding the text you want is easy—in fact, you often don't need regular expressions. Other techniques are both simpler and less resource-intensive. As an example of a structured document, many tools come with configuration files in YAML format, which look like this:

name: Generic
version: 2.7
distribution_type: module

To collect the keys and their values, just split each line on the colon. We’ll assume you read each line into a variable named yaml_text:

yaml_tmp = yaml_text.split(":")

Any language you use is likely to have a function like split in Python. It looks for whatever divider you specify (here, a colon) and moves the text between the dividers into a list of strings. No regular expression is needed. (Python's string split takes only a literal divider; if you do need a pattern as the divider, the separate re.split function accepts a regular expression.) Now that you have the list of strings, you can save them in some convenient data structure such as a Python dictionary (which is called a “hash” or “associative array” in some languages). The following statement uses another Python function, strip, to take off any white space before or after each string:

yaml_data[ yaml_tmp[0].strip() ] = yaml_tmp[1].strip()

It was worth showing some basic housekeeping here so that you can use the strings we'll extract from text in the rest of this article. Another lesson of this section is that regular expressions are heavy equipment, and there's a lot you can do with simpler tools in your chosen programming language.
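
The two statements above can be pulled together into a short runnable sketch. The names yaml_source and the empty yaml_data dictionary are assumptions added for illustration, and the maxsplit argument of 1 is my own addition so that a value containing a colon stays in one piece. This is only a line splitter, not a real YAML parser.

```python
# Hypothetical sample input: the three configuration lines shown earlier.
yaml_source = """name: Generic
version: 2.7
distribution_type: module"""

yaml_data = {}  # The dictionary must exist before we store anything in it.

for yaml_text in yaml_source.splitlines():
    # Split on the first colon only, so a value containing a colon survives.
    yaml_tmp = yaml_text.split(":", 1)
    if len(yaml_tmp) == 2:
        yaml_data[yaml_tmp[0].strip()] = yaml_tmp[1].strip()

print(yaml_data)
# {'name': 'Generic', 'version': '2.7', 'distribution_type': 'module'}
```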

Bulk Extractions

Often you're trying to match several pieces of text in a single line. And, although regular expressions began as ways to scrutinize a single line of text, many languages provide ways to run a single search over multiple lines or a whole file. So you may have the job of looking for something like the sizes shown in the following text:

You need a 3x6x24 board, a 2x4x36 board,
a 2x4 steel plate, and a 1x4 piece of molding.

Our goal here is to find each size (3x6x24) and the material it applies to (board). Let's start with the digits. The following matches any string of digits, such as 6 or 24:

\d+

Strings such as 6x24 can be matched by:

\d+x\d+

Sometimes we have two dimensions (2x4), whereas other times we have three (3x6x24). Although we don't expect our construction to extend into the ten dimensions postulated by modern physics, we might as well create a regular expression that allows any number of strings of digits, always separated by x:

(\d+x)+\d+

After the parentheses, the plus sign applies to everything in the parentheses. So it will match 3x6x, while the \d+ at the end picks up the final 24.
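
A quick way to convince yourself of this is to test the expression against a few candidate strings. The following is a minimal sketch; the list of candidates is made up for illustration, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

size_regex = re.compile(r"(\d+x)+\d+")

# fullmatch succeeds only if the entire string fits the pattern.
for candidate in ["2x4", "3x6x24", "24", "3x"]:
    print(candidate, bool(size_regex.fullmatch(candidate)))
# 2x4 True
# 3x6x24 True
# 24 False   (there must be at least one group like 3x)
# 3x False   (the final string of digits is missing)
```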

We want to capture the size, so we'll wrap the regular expression we've created so far in parentheses:

((\d+x)+\d+)

But wait—we don't want to capture each 2x or 3x. We want just to capture the whole thing. We can't get rid of the inner parentheses, but we can render them non-capturing. Instead of a single open parenthesis, we start the sequence with:

(?:

The question mark in this odd clump of characters has nothing to do with the question mark that regular expressions use elsewhere as a quantifier to mark an item as optional. We're using these three characters to tell the regular expression parser that we're using parentheses just for grouping and don't want the enclosed characters to be remembered.

So far, our regular expression is:

((?:\d+x)+\d+)

which matches strings like 3x6x24. These strings are always followed by spaces, which we throw away with:

\s+
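
The difference between the capturing and non-capturing inner group shows up clearly in what findall returns. Here is a small sketch on a made-up sample string; the variable names are hypothetical.

```python
import re

text = "a 3x6x24 board and a 2x4 steel plate"

# With a capturing inner group, findall returns a tuple per match: the whole
# size, plus whatever the inner group matched on its last repetition.
with_inner_capture = re.findall(r"((\d+x)+\d+)", text)
print(with_inner_capture)     # [('3x6x24', '6x'), ('2x4', '2x')]

# With a non-capturing inner group, only the whole size is remembered.
without_inner_capture = re.findall(r"((?:\d+x)+\d+)", text)
print(without_inner_capture)  # ['3x6x24', '2x4']
```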

Finally, we want another set of capturing parentheses to pick up the materials (board, etc.). We could capture one word with:

(\w+)

But some materials consist of multiple words (steel plate, piece of molding). So I'll exploit another aspect of the sentence: every piece of material is followed by a comma or period. I will capture everything following the size, up to but not including the comma or period that follows:

(.*)[,\.]

In regular expressions, a period represents any character, and that's how I use the first period in the regular expression just shown. Within the square brackets, I want to match a real period, so I precede it with a backslash to say, "This does not have a special meaning; it stands just for itself." (Inside square brackets a period is already literal, so the backslash is optional there, but it does no harm and makes the intent obvious.)

We're not quite finished. Our "grab everything" .* code is greedy: it takes as much text as it can, so the first match would swallow everything up to the last comma or period on the line. We need to make it non-greedy, telling the asterisk to look for the first upcoming comma or period and then stop. Adding a question mark to the asterisk accomplishes that:

(.*?)[,\.]
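
To see the difference, here is a minimal sketch that runs the greedy and non-greedy versions against the second line of our text. For brevity it uses the simple two-dimension size pattern from earlier; the variable names are made up for illustration.

```python
import re

line = "a 2x4 steel plate, and a 1x4 piece of molding."

greedy = re.search(r"\d+x\d+\s+(.*)[,\.]", line)
lazy = re.search(r"\d+x\d+\s+(.*?)[,\.]", line)

print(greedy.group(1))  # steel plate, and a 1x4 piece of molding
print(lazy.group(1))    # steel plate
```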

Now we're done! Take a deep breath before reading out the complete regular expression:

((?:\d+x)+\d+)\s+(.*?)[,\.]

And here's a complete Python program that uses the regular expression. Because we break our string across multiple lines, Python requires us to start and end it with three quotation marks:

import re

construction = """You need a 3x6x24 board, a 2x4x36 board,
    a 2x4 steel plate, and a 1x4 piece of molding."""

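# The r prefix creates a raw string, so Python passes the backslashes to the regular expression untouched.
for m in re.findall(r"((?:\d+x)+\d+)\s+(.*?)[,\.]", construction):
    print(m[1].strip() + " of size " + m[0].strip())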

Here's the output:

board of size 3x6x24
board of size 2x4x36
steel plate of size 2x4
piece of molding of size 1x4
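
Instead of printing the matches, you may prefer to keep them in a data structure for later use. The following sketch assumes that a list of dictionaries is convenient; the names inventory, material, and dimensions are hypothetical choices, not anything the re library requires.

```python
import re

construction = """You need a 3x6x24 board, a 2x4x36 board,
    a 2x4 steel plate, and a 1x4 piece of molding."""

inventory = []
for size, material in re.findall(r"((?:\d+x)+\d+)\s+(.*?)[,\.]", construction):
    inventory.append({
        "material": material.strip(),
        # Turn a string like "3x6x24" into the list of integers [3, 6, 24].
        "dimensions": [int(n) for n in size.split("x")],
    })

for item in inventory:
    print(item)
# {'material': 'board', 'dimensions': [3, 6, 24]}
# {'material': 'board', 'dimensions': [2, 4, 36]}
# {'material': 'steel plate', 'dimensions': [2, 4]}
# {'material': 'piece of molding', 'dimensions': [1, 4]}
```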

Formatting and Compiling

The sequences of characters that you embed in a regular expression are a headache to read. Usually, the programming language lets you break them into lines for easier perusal. For instance, here's our regular expression laid out with comments:

(        # Open capture group 1, the size
(?:      # Open a non-capturing group
\d+x)+   # Strings like 24x
\d+      # Final string of digits
)        # Close capture group 1
\s+      # Discard spaces
(.*?)    # Capture group 2, the material
[,\.]    # Comma or period terminates

If you're going to apply your regular expression repeatedly, you should compile it to reduce overhead. It takes time for the parser just to read each character and put the regular expression into a more efficient internal format. By compiling the expression, you reduce that overhead to a one-time operation, and the program runs faster each time you use the regular expression.

Here we'll compile our expression using the re.compile function:

materials_regex = re.compile(r"""
                 (       # Open capture group 1, the size
                 (?:     # Open a non-capturing group
                 \d+x)+  # Strings like 24x
                 \d+     # Final string of digits
                 )       # Close capture group 1
                 \s+     # Discard spaces
                 (.*?)   # Capture group 2, the material
                 [,\.]   # Comma or period terminates
                 """, re.VERBOSE)

The re.VERBOSE argument at the end enables the line breaks and comments.

Now we'll run the expression on our construction string. Because the materials_regex variable holds a regular expression, we can run findall on it directly:

materials = materials_regex.findall(construction)
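
Putting the pieces of this section together, the following self-contained sketch shows what findall returns for our text: a list of tuples, one per match, with one element per capture group. The loop at the end uses finditer, another method of compiled expressions, which yields match objects so you can ask for groups by number. The layout of the pattern is slightly condensed from the listing above.

```python
import re

construction = """You need a 3x6x24 board, a 2x4x36 board,
    a 2x4 steel plate, and a 1x4 piece of molding."""

materials_regex = re.compile(r"""
    (          # Open capture group 1, the size
    (?:\d+x)+  # One or more strings like 24x, not captured
    \d+        # Final string of digits
    )          # Close capture group 1
    \s+        # Discard spaces
    (.*?)      # Capture group 2, the material
    [,\.]      # Comma or period terminates
    """, re.VERBOSE)

materials = materials_regex.findall(construction)
print(materials)
# [('3x6x24', 'board'), ('2x4x36', 'board'),
#  ('2x4', 'steel plate'), ('1x4', 'piece of molding')]

# finditer yields match objects instead of tuples.
for m in materials_regex.finditer(construction):
    print(m.group(2) + " of size " + m.group(1))
```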

Lookahead

One of the most abstract features of regular expressions is lookahead, which lets you check the characters that follow the current position without consuming them. Many regular expression libraries (including Python's) also allow lookbehind. In Python's re library, lookbehind is restricted to a fixed number of characters because of limitations in implementation, so it cannot contain asterisks or plus signs.
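
The fixed-width restriction on lookbehind is easy to observe in Python's re library. This is a small sketch on a made-up string, not part of the materials example that follows.

```python
import re

text = "board 3x6x24"

# Fixed-width lookbehind: match digits only when an x comes just before them.
after_x = re.findall(r"(?<=x)\d+", text)
print(after_x)  # ['6', '24']

# A lookbehind of variable width is rejected when the pattern is compiled.
rejected = False
try:
    re.compile(r"(?<=x+)\d+")
except re.error as err:
    rejected = True
    print("rejected:", err)
```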

To illustrate the possible value of lookahead, let’s suppose that our text is more compressed and has few cues such as commas:

board 3x6x24 board 2x4x36 steel plate 2x4 piece of molding 1x4

The text lists each material followed by its size. Our cue that we're switching from material to size is a digit. When capturing the material, we must stop when we encounter a digit, but we don't want the regular expression to consume the first digit it finds. That first digit must be left in place to become part of the size. So here is our new regular expression:

materials_regex = re.compile(r"""
                 (.*?)      # Capture group 1, the material
                 (?=\d)     # Lookahead: digit
                 (          # Open capture group 2, the size
                 (?:        # Open a non-capturing group
                 \d+x)+     # Strings like 24x
                 \d+        # Final string of digits
                 )          # Close capture group 2
                 """, re.VERBOSE)

The first capture group is the familiar non-greedy group capturing everything. We tell it to stop when it sees its first digit:

(?=\d)

The (?= sequence triggers lookahead. After that, the next lines pick up with the digit, which we didn't lose because we used lookahead. The rest of the regular expression is the sequence we used before to capture the size. The first group also captures some leading and trailing spaces we don't want, but we can use strip later to remove them.
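
Here is a self-contained sketch that runs the lookahead version against the compressed text. The variable name compressed is an assumption for illustration, and the pattern layout is slightly condensed.

```python
import re

compressed = "board 3x6x24 board 2x4x36 steel plate 2x4 piece of molding 1x4"

materials_regex = re.compile(r"""
    (.*?)      # Capture group 1, the material
    (?=\d)     # Lookahead: the next character must be a digit
    (          # Open capture group 2, the size
    (?:\d+x)+  # One or more strings like 24x, not captured
    \d+        # Final string of digits
    )          # Close capture group 2
    """, re.VERBOSE)

pairs = materials_regex.findall(compressed)
for material, size in pairs:
    # strip removes the spaces that surround each material.
    print(material.strip() + " of size " + size)
# board of size 3x6x24
# board of size 2x4x36
# steel plate of size 2x4
# piece of molding of size 1x4
```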

Most situations where lookahead seems like a good solution can actually be solved without it. In this case, we can eliminate lookahead simply by changing our first "grab everything" sequence to "grab everything except digits":

([^\d]*)

The circumflex or hat character ^ at the start of a bracketed set means "none of the following characters." So [^\d] means "anything except a digit" (the shorthand \D means the same thing). The group ([^\d]*) means "everything up to, but not including, the first digit you find."
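
A brief sketch shows the negated set at work, first alone and then joined to the size pattern we built earlier. The variable names are hypothetical, and the comparison with \D is included only to show that the two spellings behave the same on this text.

```python
import re

compressed = "board 3x6x24 board 2x4x36 steel plate 2x4 piece of molding 1x4"

# Alone, the negated set stops just before the first digit.
first_material = re.match(r"[^\d]*", "steel plate 2x4").group()
print(repr(first_material))  # 'steel plate '

# Joined to the size pattern, it replaces the lookahead version.
with_set = re.findall(r"([^\d]*)((?:\d+x)+\d+)", compressed)
with_shorthand = re.findall(r"(\D*)((?:\d+x)+\d+)", compressed)
print(with_set == with_shorthand)  # True

for material, size in with_set:
    print(material.strip() + " of size " + size)
```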

Our resulting regular expression comes out even simpler without lookahead:

materials_regex = re.compile(r"""
                 ([^\d]*)   # Capture group 1, the material
                 (          # Open capture group 2, the size
                 (?:        # Open a non-capturing group
                 \d+x)+     # Strings like 24x
                 \d+        # Final string of digits
                 )          # Close capture group 2
                 """, re.VERBOSE)

Conclusion

Regular expressions can solve problems in text processing that have no other feasible solution. But regular expressions can also be difficult to get right and are computationally expensive. Over time, developers have learned about common tasks in text processing and have added regular expression features to facilitate solutions. If you explore the options provided by your regular expression library, you can often find an elegant solution to your problem. Several books are available to tell you how to think about your text and what regular expression can parse it.