Awk: The Power and Promise of a 40-Year-Old Language

Languages don't enjoy long lives. Very few people still code with the legacies of the 1970s: ML, Pascal, Scheme, Smalltalk. (The C language is still widely used but in significantly updated versions.) Bucking that trend, the 1977 Unix utility Awk can boast of a loyal band of users and seems poised to continue far into the future. In this article, I’ll explain what makes Awk special and keeps it relevant.

A Descriptive Language

Awk runs on inputs and a script. The inputs can be files, but the command is often used as part of a pipeline, taking input from the previous command's output:

ls | awk '/SAMPLES_[1-9][0-9]/ { ++counter }'

The long quoted text in the above command is the script, which can be included on the command line or read from files. Each script comprises a set of conditions and actions. The condition is often a regular expression enclosed by slashes. The action appears as one or more statements between braces. If the condition matches a part of the input, the action is executed. Here is my trivial, one-line script:

/SAMPLES_[1-9][0-9]/ { ++counter }

The script searches for strings like SAMPLES_19 or SAMPLES_20 and increments a counter each time a string is found. Of course, a real script would use the counter in further calculations.
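
As a minimal sketch of what "using the counter" might look like, the version below adds an END action that prints the count once all input has been read. The wrapper name count_samples and the sample file names are invented for illustration; the condition is the one shown above, and END is standard Awk.

```shell
# count_samples: hypothetical helper; reads names on stdin and prints how many match.
count_samples() {
    awk '
        # counter is never declared; Awk treats an unset variable as zero or empty.
        /SAMPLES_[1-9][0-9]/ { ++counter }   # runs once for every matching input line
        END { print counter + 0 }            # "+ 0" prints 0 rather than an empty line when nothing matched
    '
}

# Stand-in for the output of ls (these file names are made up).
printf '%s\n' SAMPLES_19 SAMPLES_20 SAMPLES_05 notes.txt | count_samples   # prints 2
```

SAMPLES_05 is not counted because the condition requires the first digit to be 1 through 9.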

This is basically how Awk operates: evaluate a condition, then take action when it matches. The script runs in what David Kerns, in an email exchange with me, called an implied loop. In his review of this article, Arnold Robbins, maintainer of the GNU version of Awk (Gawk), calls the programs data-driven.

I see Awk as more of a declarative language than a procedural one. You describe what you want to happen and the conditions under which it happens, instead of specifying a series of sequential statements. Awk certainly executes statements in sequence and offers control flow statements (if, while), so it can serve quite well as a procedural language. Nelson H. F. Beebe, in his review of this article, mentioned writing a program with 23,981 lines of actions in just 12 patterns.

But overall, sequences of statements execute within a framework of declaring the conditions under which these things should happen. The concept of a declarative language has been around almost since the beginning of high-level programming languages and can be found in the popular notion of a promise, invented by Mark Burgess.

Awk documentation usually calls the condition a "pattern" because regular expressions are so often used as conditions. Janis Papanagnou, in his review of this article, explained that he has recommended the word "condition" instead. I realized that this word choice matches my own view of Awk at a high level as a descriptive language. Aleksey Cheusov, in email, said that Awk programs can be viewed as finite state machines, which declare how to move from one state to another.
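
The state-machine view can be illustrated with a small sketch. It is my own example, not one supplied by Cheusov: the START and STOP marker lines and the helper name extract_block are assumptions chosen for the demonstration.

```shell
# extract_block: hypothetical helper; prints the lines found between START and STOP markers.
extract_block() {
    awk '
        /^START$/ { inside = 1; next }   # transition: outside -> inside
        /^STOP$/  { inside = 0; next }   # transition: inside -> outside
        inside    { print }              # action taken only while in the "inside" state
    '
}

printf '%s\n' skip START keep1 keep2 STOP skip-too | extract_block   # prints keep1 and keep2
```

Each condition declares either a transition or what to do in a given state; the implied loop over the input drives the machine.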

Neil Ormos, in an email exchange with me, offered an interesting perspective on when to use Awk:

I'd put Awk in a special category of general-purpose programming languages that are especially well adapted for: (1) personal computing; and (2) programmer-time-efficient prototype development, where the prototype artifact can evolve advantageously into a production-worthy tool with a little incremental effort.

Awk also maintains a delicate balance between being a line-oriented utility like grep and a full programming language. Normally, Awk just applies your script to each line of input, like grep, acting on what matches your condition.

Furthermore, Awk is focused on lines divided into fields that are separated by white space or by any character or regular expression you choose. All behavior is subject to customizations—as Ed Morton suggested in his review, we should speak more generally of "records" instead of "lines"—but traditionally Awk is used on files where each line consists of a regular set of fields. It has proven very useful for parsing log files, for instance.
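
A short sketch of field handling follows. The log layout ("host status bytes") is invented for this example rather than taken from any real server, and both helper names are hypothetical. The first helper relies on the default white-space separator; the second picks a different separator with the standard -F option.

```shell
# sum_bytes_by_status: hypothetical helper for an invented "host status bytes" layout.
sum_bytes_by_status() {
    awk '$2 == 200 { total += $3 } END { print total + 0 }'   # $2 and $3 are the second and third fields
}

# login_names: hypothetical helper; prints the first field of colon-separated records.
login_names() {
    awk -F: '{ print $1 }'
}

printf '%s\n' 'alpha 200 512' 'beta 404 0' 'gamma 200 1024' | sum_bytes_by_status   # prints 1536
printf '%s\n' 'root:x:0:0' 'daemon:x:1:1' | login_names                             # prints root, then daemon
```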

In 1988, Brian Kernighan, one of Awk's original authors, put a set of bug fixes and major new features into a version released under the name Nawk (although he wanted it to replace the original Awk), and the standard version has not changed much since then.

It's Not Just About the Language

Languages are part of a larger environment that often plays more of a role in the choice of language than its actual features. For instance, many people use Python because so many important libraries have been written for that language. Other people use a language for legacy reasons: they have an existing application to maintain or work in an organization that has historically depended on a language.

Many of the people who responded to my outreach for this article focused their appreciation of Awk on factors other than language features. Besides being deeply embedded in many Unix scripts, Awk's presence is guaranteed on every Unix-style system, including GNU/Linux, BSD, and macOS. The utility's suitability for widespread use is bolstered by its ability to accomplish complex tasks without requiring the installation of outside libraries or packages. The language's behavior is also guaranteed in a POSIX standard, which turns out to be surprisingly important to a lot of users. However, many variants have added non-standard features. Gawk and mawk are in common use.

Among people who use Awk on large projects, it's a critical part of their toolkit because it's fast. Michael May and Glaudiston Gomes da Silva told me that they had ported some Java data processing programs to Awk with more than ten-fold reductions in CPU and RAM consumption. One researcher clocked Awk on 25 TB of data with impressive results. Another advised Awk’s use for some tasks, along with other classic Unix tools, instead of Hadoop. And one of the most active sites in data science, Analytics Vidhya, published an article praising Awk.

Cheusov, in correspondence with me, provided more evidence of Awk's speed:

When I worked in computational linguistics, we often parsed gigabytes of text. Programs written in GNU Awk and mawk were much faster than equivalent programs written in Ruby, Python and Perl. Because AWK is so simple, its interpreter can be optimized much more easily than for much more complex languages.

Awk is fast because it has stayed simple and avoided features that are considered necessities in other languages. It concentrates on what it can do well. Several correspondents told me that they appreciated being able to do what they wanted without downloading large modules as they would do for other languages.

Computer science professor Tim Menzies, in his article "Why Gawk?", cited the simplicity and regularity of Awk syntax, which allows it to be learned quickly and to ward off overly complex code. Several of my correspondents also cited the GNU Awk debugger as a boon for Awk development.

Last but not least, we shouldn't ignore the importance of good documentation. Awk documentation is easy to find on the web. The manual for Gawk, written by the software's maintainer, Arnold D. Robbins, is particularly helpful. For example, the Gawk manual carefully distinguishes Gawk extensions from standard features, so that you can avoid the extensions if you want to conform to the standard. I have noticed that GNU tools in general have good manuals, perhaps because Richard M. Stallman and his collaborators have always assigned a high value to documentation.

Expansion Without Bloat

The classic Awk, as created by Alfred Aho, Peter J. Weinberger, and Brian Kernighan (who drew on their initials to create the name of the utility), was informal. It didn't make users declare variables but simply assumed the variables' values to be zero or null the first time they were used. Data types were implied. This kind of casual scripting was common in the 1970s, and anything more formal would have undermined the tool's appeal.

Every language evolves, usually by incorporating popular features from other languages. The trick is to avoid throwing in features of little value that degrade the language by making it hard to use, slow to compile or run, etc. In this regard, Awk has done well. It has resisted modernization in the form of data declarations and objects. Because Awk is very different from general-purpose languages, it doesn't have space for callbacks, polymorphism, and other fads that have become central to application design in many languages. But some variants of Awk added functionality of real value while maintaining Awk's sleek performance and small footprint. Gawk, like many GNU utilities, has upgraded aggressively.

Many dedicated Awk users don't strive for large programs or make use of extended features. Some love Awk for one-liners like the one I showed earlier. Like most Unix and GNU/Linux users, these casual adherents of Awk prefer bigger languages such as Perl (yes, still!) and Python for large tasks. Others, however, write large Awk programs with the help of its newer additions.

Here are some of the features postdating the original 1977 release that users tell me are most useful. I focus on the features that allow Awk programs to grow large and allow programmers to reuse and share code.

Two features were added fairly early to standard Awk: multi-dimensional arrays and user-defined functions. Recent computing algorithms, especially in data science, depend heavily on matrices and higher-dimensional arrays called tensors. So the addition of multi-dimensional arrays to Awk prepared it for modern data processing. User-defined functions provided Awk with a whole new level of reusability. You can call complex code from different statements, and share your functions with colleagues.
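
Here is a rough sketch that combines the two features. The input layout ("row column value") and the names grid_total and row_sum are assumptions made for the example. In standard Awk, a subscript list such as grid[row, col] is stored under a single combined key, so this simulates a two-dimensional array rather than nesting one array inside another.

```shell
# grid_total: hypothetical helper; input lines are "row column value".
grid_total() {
    awk '
        # User-defined function: adds up every cell stored for one row.
        function row_sum(grid, row, ncols,    col, sum) {   # col and sum act as local variables
            for (col = 1; col <= ncols; col++)
                sum += grid[row, col]
            return sum
        }
        {
            grid[$1, $2] = $3              # two subscripts, stored under one combined key
            if ($1 > nrows) nrows = $1     # remember the largest row number seen
            if ($2 > ncols) ncols = $2     # remember the largest column number seen
        }
        END {
            for (r = 1; r <= nrows; r++)
                print "row " r ": " row_sum(grid, r, ncols)
        }
    '
}

printf '%s\n' '1 1 10' '1 2 20' '2 1 5' '2 2 7' | grid_total
# prints:
# row 1: 30
# row 2: 12
```

Listing extra parameters after the real ones is the usual Awk idiom for local variables, since the language has no other way to declare them.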

The other features promoting reuse and large programs are extensions in Gawk:

Namespaces—Once Awk offered user-defined functions, this Gawk extension allowed even more sharing and growth. As in C++ or Java, namespaces in Gawk prevent clashes between function names or other symbols defined in different source files or libraries.

BEGINFILE and ENDFILE—Awk provides BEGIN and END actions to let you do initial processing (before all files are read) and terminal processing (after all files are read). Gawk extends this with BEGINFILE and ENDFILE, which let you specify actions to take before reading or after processing each file in a set of multiple files.

Two-way pipelines—These streamline the operation of coprocesses, which allow you to delegate operations to a separate program and get results back. This form of multiprocessing has been around in other languages for quite a while, most notably in Go. The original form of Awk allowed coprocesses, but only through the cumbersome use of temporary files.

Network programming—This capability takes multiprocessing past the local system, using classic internet sockets to communicate with programs on remote hosts. The remote programs could be coded in any language, not just Awk.

Arbitrary-precision arithmetic—Like multi-dimensional arrays, this feature appeals to scientists who need to go beyond the limitations of conventional integers and floating-point numbers, constrained by microprocessor design.

Plugins/extensions—These allow intrepid programmers to extend Gawk without messing around in the core code.
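
Two of these extensions, BEGINFILE/ENDFILE and two-way pipelines, can be sketched briefly. The code below requires gawk and will not work under other Awk implementations, so the demonstration is skipped when gawk is not installed. The helper names, the temporary file names, and the choice of sort as the coprocess are all assumptions made for illustration.

```shell
# lines_per_file: hypothetical helper; prints "name: count" for each file argument.
lines_per_file() {
    gawk '
        BEGINFILE { n = 0 }                   # runs before the first record of each file
        { n++ }
        ENDFILE   { print FILENAME ": " n }   # runs after the last record of each file
    ' "$@"
}

# sorted_via_coprocess: hypothetical helper; hands its input to sort and reads the result back.
sorted_via_coprocess() {
    gawk '
        BEGIN { cmd = "sort" }
        { print $0 |& cmd }                   # |& sends the record to the coprocess
        END {
            if (NR == 0) exit                 # nothing was sent, so there is nothing to read back
            close(cmd, "to")                  # close only the write end so sort sees end of input
            while ((cmd |& getline line) > 0)
                print line                    # read the sorted lines back from the coprocess
            close(cmd)
        }
    '
}

if command -v gawk >/dev/null 2>&1; then
    demo_dir=$(mktemp -d)
    printf 'a\nb\n' > "$demo_dir/one.txt"
    printf 'c\n'    > "$demo_dir/two.txt"
    (cd "$demo_dir" && lines_per_file one.txt two.txt)    # prints "one.txt: 2" then "two.txt: 1"
    printf '%s\n' pear apple fig | sorted_via_coprocess   # prints apple, fig, pear
    rm -rf "$demo_dir"
fi
```

Closing only the "to" end of the coprocess matters here: sort cannot produce output until it has seen the end of its input.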

Recent Examples of Awk in Action

An article in LWN.net discusses the continued appeal of Awk along with some recent large projects that use it. Other projects mentioned by people I corresponded with include:

  • Validation of a sports schedule, for example, ensuring that a team doesn't have two games at the same time, that a coach isn't coaching two teams at the same time, that a team isn't playing at a time when it isn't available, etc. This program checks about 25 different constraints per team, on average.
  • Converting SQL data across web sites from one schema to another by way of exported/imported CSVs. 
  • A literate programming tool. The concept of "literate programming" was invented by Donald Knuth, in some ways the grandfather of modern programming. Hints of the idea appear in modern commenting systems such as Javadoc. 
  • An IRC client and bot. 
  • Extracting bibliographies from technical journal articles. 
  • runawk, a wrapper for Awk. 
  • Components of the pkgsrc framework for building packages on Unix-like systems, including pkg_summary-utils.

In conclusion, Awk can do much more than the simple line-by-line text processing that is usually considered its forte. The discussions and examples in this article show that the language still has a place in the 21st century.
 


Comments

Anonymous (not verified), Tue, Sep 07, 2021 - 16:12
Gawk is my favorite programming language. From one-liners to multi-thousand line programs, I find using it rules.
Hint: Always use --lint.
Anonymous (not verified), Fri, Sep 10, 2021 - 13:08
I enjoyed the reference to Tim Menzies, he was my CS professor (AI/Data mining) back in college 10+ years ago, and I don't think any professor had quite the influence on me as he did, he was all about the awk/clisp