Great Finds: How to Operate on Multiple, Diverse Files at Once

With disk space nowadays reaching into multiple terabytes, even on a humble laptop, operating systems offer sophisticated tools to search for files. Many of these tools present simple graphical interfaces. But for great flexibility and power, it serves you well to turn to a classic Unix tool, the find command. In this article, we'll show some powerful things you can do with this tool. To take us on this tour, we'll answer the following questions, each in a single command:

  • I have some photos with a .JPEG extension and others with a .JPG extension. How do I find them all with a single command?
  • I suspect that somebody maliciously changed a file somewhere. How do I find every file that changed during the past 6 hours?
  • A user left the company. How can I archive the files she created for the marketing group, while deleting all her other files?

As a shell command, find can be run from a script. So you can figure out a solution to a set of problems and record it for automatic use later.

Although this article is aimed at people with some experience, I hope that people who are totally unfamiliar with find can learn how to use it through careful examination of my examples. We'll head out to the open seas soon, but first I'll offer some tips for effective use of find. If you've used find a fair amount, feel free to skip to the section "The Dilemma of Multiple Names" below.

Two other opening notes:

Syntax—The find command is complicated and potentially confusing for two reasons. First, it combines operations in layers, but has to fit the standard, flat Unix command syntax: command -option -option -option ... Second, because the command runs in the shell, you have to remember to escape (with backslashes, quotation marks, or apostrophes) a lot of special characters.

Variants—GNU/Linux has the GNU version of find. Like most GNU commands, this find is very powerful, with numerous options not found elsewhere. Mac computers offer a BSD version of find by default. This article focuses on options that work on all versions of find, noting a few differences as it goes along.

OK, on to our first find command.

Prints, Pipes, and Performance

As I promised, we'll start small. Let's say you just want to print a list of all the files and directories under a directory named app, which stores the source code, binaries, and documentation for a little application you created.

Here at the start, we encounter a unique aspect of find syntax: you have to specify a directory before any options. The find command operates on this directory, ignoring everything outside it (unless a file is a symbolic link to something outside the directory—an issue we won't deal with in this article). Thus, you can get your printout as follows:

$ find app -print

Many versions of find, including the GNU version provided with Linux, print names by default if you don't ask the command to do something else. So you can achieve the same effect as the previous command with a simple:

$ find app

But not all versions of find execute -print by default, so the option is worth including.

The order of the printout is arbitrary, neither alphabetical nor chronological. I have found this to be true on every system I've used, and I think the reason can be found in the Linux man page for the readdir library call:

The order in which filenames are read by successive calls to readdir() depends on the filesystem implementation; it is unlikely that the names will be sorted in any fashion.

But here's one bit of regularity: the printout starts at the top directory and descends. Thus, each directory is shown before its contents. If you want to reverse this, showing the contents before the directory containing them, use the -depth option:

$ find app -depth -print

You can include multiple directories before the options. Thus, the following command prints the contents of three different directories:

$ find app ../project/tmp /home/mendoso/bin -print

The find command, like nearly all Unix utilities, works nicely with pipes. If you just want to know how many files and directories are in the app directory, pipe the output to wc -l:

$ find app -print | wc -l

Or just the files, not the directories:

$ find app -type f -print | wc -l

Or just the directories, not the files:

$ find app -type d -print | wc -l

The order of options matters! find interprets and executes each option from left to right. If you put -print first, the -type d argument has no effect on it:

$ find app -print -type d | wc -l
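To see the difference for yourself, here is a minimal sketch that builds a throwaway tree and counts names with both orderings. The sandbox directory and the file names in it are made up for illustration.

```shell
# Hypothetical sandbox: one directory, one subdirectory, one regular file.
demo=$(mktemp -d)
mkdir "$demo/app" "$demo/app/doc"
touch "$demo/app/README.txt"

# -type d comes before -print, so only directories are printed.
dirs_only=$(find "$demo/app" -type d -print | wc -l)

# -print comes before -type d, so every name is printed.
everything=$(find "$demo/app" -print -type d | wc -l)

echo "with -type d first: $dirs_only"
echo "with -print first:  $everything"
rm -rf "$demo"
```

The first count should cover just the two directories, while the second should also include the file.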

We've seen that find has two main types of options. The -print option carries out an action, whereas the -type option restricts the output. The power of find lies in all the ways you can restrict output, so that you can run the command on hundreds of thousands of files and get just the ones you want.

We've used a pipe on the output of find, but often you want to go deeper: you want to operate on each file. I'll soon show how you can do this through the -exec option. But we'll use a pipe first. Suppose you want to find the name "Luisa Mendoso" in all the files under the app directory. You can pipe the output of find to a command specially designed to work with it, xargs:

$ find app -type f -print | xargs grep "Luisa Mendoso"
app/doc/server/README.txt:Maintained by "Luisa Mendoso,"

This xargs command grabs a bunch of output from find and passes it to the command:

grep "Luisa Mendoso"

For efficiency, xargs doesn't work on one file at a time. It runs grep on multiple files at once, ensuring that it doesn't exceed any limit imposed by the system on the size of a command line or the number of arguments submitted to the command.
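One caveat with this pipe: by default xargs splits its input on whitespace, so a filename containing a space reaches grep as two broken pieces. A common remedy, sketched here under the assumption that your find supports -print0 and your xargs supports -0 (the GNU and BSD versions both do), is to separate names with NUL bytes. The sandbox and its file names are hypothetical.

```shell
# Hypothetical sandbox containing a filename with a space in it.
demo=$(mktemp -d)
mkdir -p "$demo/app/doc"
printf 'Maintained by "Luisa Mendoso,"\n' > "$demo/app/doc/READ ME.txt"
printf 'nothing to see here\n' > "$demo/app/notes.txt"

# -print0 ends each name with a NUL byte; xargs -0 splits on NUL
# instead of whitespace, so "READ ME.txt" stays in one piece.
# grep -l prints only the names of files that contain a match.
matches=$(find "$demo/app" -type f -print0 | xargs -0 grep -l "Luisa Mendoso")

echo "$matches"
rm -rf "$demo"
```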

But you can do the same thing within find, without a pipe. The syntax is subtle and easy to get wrong, though:

$ find app -type f -exec grep "Luisa Mendoso" {} \;

We see our familiar grep command. But what's the junk after it? The answer brings together several things I laid out earlier in the article.

First, find combines operations in layers. If it accepted a syntax like YAML or JSON, it might be easier to use! Given the limitations of the shell command line, you have to provide an indication that the -exec option is finished, and that another option may follow. That's the purpose of the semicolon, which the shell recognizes as the termination of a command. If you forget it, you'll see the error message:

find: -exec: no terminating ";" or "+"

But here's an additional twist: precisely because the shell recognizes the semicolon as a terminator, you have to escape it with a backslash. You want the semicolon to terminate the grep command, not the find command.

But there's no need to grumble. We've gotten this far in the article and won't be daunted by a punctuation mark or two.

The curly braces {} represent a line of output from find. So if the find command prints the filename app/doc/server/README.txt, grep runs as if you specified:

grep "Luisa Mendoso" app/doc/server/README.txt

Let's run our command now and see the output:

$ find app -type f -exec grep "Luisa Mendoso" {} \;
Maintained by "Luisa Mendoso,"

Oops! Because the grep command this time runs on each file individually, it doesn't print the filename, which you presumably would like to know.

There are various ways to get the filename. Most simply, the GNU and BSD versions of grep provide an -H option for that purpose:

$ find app -type f -exec grep -H "Luisa Mendoso" {} \; 
app/doc/server/README.txt:Maintained by "Luisa Mendoso,"
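The error message shown earlier hints at a second terminator, the plus sign. As a hedged sketch: ending -exec with {} + instead of {} \; asks find to pack many filenames into each grep invocation, much as xargs does, and the plus sign needs no escaping. The sandbox files below are invented for the demonstration.

```shell
# Hypothetical sandbox with one matching and one non-matching file.
demo=$(mktemp -d)
mkdir -p "$demo/app/doc/server"
printf 'Maintained by "Luisa Mendoso,"\n' > "$demo/app/doc/server/README.txt"
printf 'int main(void) { return 0; }\n' > "$demo/app/main.c"

# With "+", find appends as many filenames as fit to one grep command.
# -H still forces the filename prefix in case a batch holds a single file.
batched=$(find "$demo/app" -type f -exec grep -H "Luisa Mendoso" {} +)

echo "$batched"
rm -rf "$demo"
```

On a large tree this form usually starts far fewer grep processes than the {} \; form does.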

Here's another solution that shows off a neat aspect of find:

$ find app -type f -exec grep "Luisa Mendoso" {} \; -print
Maintained by "Luisa Mendoso,"
app/doc/server/README.txt

We've added -print back. When the grep command is successful, the -exec option notes that it's successful; this is called "returning true." So the find command goes on to execute the -print option. If grep doesn't find what it's searching for, the -exec option returns false. In that case, find skips the options that follow.
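If you want only the filenames and not the matching lines, the same true/false mechanism can be used with a silent grep. This is a small sketch rather than a recipe: the -q option tells grep to print nothing and report its result through the exit status alone, and the sandbox files are hypothetical.

```shell
# Hypothetical sandbox: one file mentions the name, the other doesn't.
demo=$(mktemp -d)
mkdir -p "$demo/app"
printf 'Maintained by "Luisa Mendoso,"\n' > "$demo/app/README.txt"
printf 'no author listed\n' > "$demo/app/CHANGES.txt"

# grep -q prints nothing. When it succeeds, -exec returns true and
# find goes on to -print; when it fails, -print is skipped.
owners=$(find "$demo/app" -type f -exec grep -q "Luisa Mendoso" {} \; -print)

echo "$owners"
rm -rf "$demo"
```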

Even though I meant this as a section on find basics, we've already gotten into some hairy areas. But there's much, much more...

The Dilemma of Multiple Names

The first problem I stated at the beginning of the article was:

  • I have some pictures with a .JPEG extension and others with a .JPG extension. How do I find them all in a single command?

First, we'll see how to restrict output by filename:

$ find pics -name "*.jpg" -print

We're using file globbing here. The asterisk * means "anything matches." We have to escape the asterisk so the shell doesn't interpret it, allowing find to interpret it instead. Quotation marks are popular for escaping filenames as we do here.

Both GNU and BSD versions of find provide a case-insensitive version of -name, called -iname, so we can try that in case a filename is in uppercase:

$ find pics -iname "*.jpg" -print

And of course, we can run another command to look for JPEG files:

$ find pics -iname "*.jpeg" -print

But we want to find all JPG and JPEG files using a single command. Maybe that doesn't seem important if all you're doing is printing names. But we'll run some complex commands later, using -exec, that make it worthwhile to bundle all our searches into a single find command.

To specify two unrelated options in the same find command, separate them with the -o option, which stands for "or."

We're creating our most complex command yet, so I'll build it step by step. First, our options for asking for files with .jpg and .jpeg extensions:

-iname "*.jpg"
-iname "*.jpeg"

Link them with the -o option:

-iname "*.jpg" -o -iname "*.jpeg"

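But the -o option brings up the need for more grouping. Remember that I said that the find command has many layers? We'll peel back a layer right now. Because -o means "or", find evaluates the options to its right only when the options to its left fail. Also, options that sit next to each other are joined by an implicit "and", which binds more tightly than -o, so a trailing -print belongs only to the options on its own side of -o. Try the following: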

$ find pics -iname "*.jpg" -o -iname "*.jpeg" -print

Here’s what happened in that command:

  1. The search for "*.jpg" succeeded in finding pics/paris/renoir_museum.jpg, so that test returned true. The -o option prevented the rest of the command, including -print, from executing, so the name was not printed. The same happened for pics/paris/IMG_073975.JPG.
  2. The name pics/boston/athenaeum.jpeg didn't match "*.jpg", which therefore returned false. The -o option therefore went ahead and executed the next search, which returned true, so -print ran. The same happened for pics/boston/state_house.JPEG.

We want something different. We want the -print option to run if either a .JPG or a .JPEG is found. So we'll group them together, which requires parentheses. Because parentheses are shell metacharacters, you must escape them so that the shell, like a discreet doorman, tips its cap and lets them through to the find command:

\(  \)

So our resulting command is:

$ find pics \( -iname "*.jpg" -o -iname "*.jpeg" \) -print

The two -iname options are nested within the parentheses, so everything they find is passed on to be printed.
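A quick way to convince yourself of the difference is to count the names each form prints. The following sketch uses a made-up pics directory with two .jpg-style names, two .jpeg-style names, and one unrelated file; it assumes a find that supports -iname.

```shell
# Hypothetical sandbox with mixed extensions and mixed case.
demo=$(mktemp -d)
mkdir -p "$demo/pics"
touch "$demo/pics/a.jpg" "$demo/pics/b.JPG" \
      "$demo/pics/c.jpeg" "$demo/pics/d.JPEG" "$demo/pics/notes.txt"

# Without parentheses, -print is tied only to the second -iname.
ungrouped=$(find "$demo/pics" -iname "*.jpg" -o -iname "*.jpeg" -print | wc -l)

# With parentheses, -print applies to whichever -iname matched.
grouped=$(find "$demo/pics" \( -iname "*.jpg" -o -iname "*.jpeg" \) -print | wc -l)

echo "ungrouped: $ungrouped, grouped: $grouped"
rm -rf "$demo"
```

The ungrouped form should report only the two .jpeg-style names, while the grouped form reports all four pictures.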

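In general, remember: the -o option skips everything to its right after an operation on its left succeeds. So if you execute:

$ find pics -type f -print -o -exec rm {} \;

The -print option always succeeds, so the -exec option never executes for a regular file. But watch out: for a directory, -type f fails, so find does go on to run rm on it. (Without -r, rm refuses to remove a directory, but the attempt is made.)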

I'll end this section by suggesting another way to specify complex filenames. The GNU version of find offers a set of options that let you specify filenames as regular expressions. You have to be careful here, because the most common metacharacters in regular expressions (* and ?) look just like the metacharacters in file globbing, which we have been using. But they behave differently.

The basic regular expression option is -regex, and a case-insensitive option is -iregex. So either of the following commands matches .JPG and .JPEG files in GNU find:

$ find pics -iregex '.*\.jpe?g' -print
$ find pics -iregex '\(.*\.jpeg\|.*\.jpg\)' -print

The BSD version of find offers a -regex option that I find too poor in features to be useful.
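To make the difference between globbing and regular expressions concrete, here is a small sketch that puts the question mark through both. It assumes GNU find for the -iregex line, and the two file names are invented.

```shell
# Hypothetical sandbox with one .jpg and one .jpeg file.
demo=$(mktemp -d)
mkdir -p "$demo/pics"
touch "$demo/pics/a.jpg" "$demo/pics/b.jpeg"

# Glob: "?" stands for exactly one character, so "*.jp?g" matches
# b.jpeg but not a.jpg.
glob_hits=$(find "$demo/pics" -name "*.jp?g" -print | wc -l)

# Regex (GNU find): "e?" means an optional "e", so both names match.
# Note that -regex and -iregex match against the whole path.
regex_hits=$(find "$demo/pics" -iregex '.*\.jpe?g' -print | wc -l)

echo "glob: $glob_hits, regex: $regex_hits"
rm -rf "$demo"
```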

Good Times and Bad Times

Now we'll look at the highly flexible and useful time options. Our challenge is this:

  • I suspect that somebody maliciously changed a file somewhere. How do I find every file that changed during the past 6 hours?

To find all the files on the system that changed during the past 6 hours, we have to explore Unix timestamps. Normally three timestamps are maintained on every file and directory, and they are simplistically described as follows:

Access time—Updated whenever a file is read—for instance, if you open a PDF with a PDF reader. Some systems turn off checks for access time, because it’s time-consuming for the system to update this timestamp on every read of a file.

Change time—Updated whenever the metadata of a file changes—for instance, if you change the owner or permissions.

Modification time—Updated whenever the content of a file changes--for instance, if you edit and save it.

You can't really understand the differences between these timestamps without running some experiments. They vary not only from one operating system to another but also from one filesystem to another on the same operating system. On GNU/Linux filesystems, they are pretty intuitive. But to get a sense of how they can trip you up, consider some experiments I ran on a Mac stocked with the Apple File System (APFS). The details aren't important. The point is that you can make serious mistakes by guessing what leads to a timestamp change. Test your filesystems carefully before you design your find command.

  • When I open a document in a PDF reader or the LibreOffice office suite, the access time is immediately modified, as you'd expect. But when I open a file with vi or Emacs, the access time remains unmodified. The Mac doesn't seem to track those programs.
  • If I edit and save a file, whether with LibreOffice, vi, or Emacs, all three timestamps are modified.
  • The mv and cp commands, if they leave the target file in the same directory, change only the change time, not the modification time—even though the contents of the file could be totally different! Interestingly, these commands change all three timestamps on the directory that contains the file (but not on any of its parents).
  • If the mv and cp commands move the file to a different directory, they update both the change time and the modification time.
  • The > redirect operator, which stores output from commands in a file, changes all three timestamps.

Enough to leave you scratching your head. Anyway, I've decided that on Apple, I'm going to check the change time to search for an altered file, because it reflects more changes than the modification time does. On GNU/Linux, it’s probably safe to depend on the modification time.
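If you want to run this kind of experiment on your own filesystem, one possible starting point is the sketch below. It changes only a file's permissions and then asks find whether the file looks newer than a reference file, first by modification time (-newer) and then by change time (-cnewer, available in the GNU and BSD versions). The file names are hypothetical, and the results may differ on your system, which is exactly what the experiment is meant to reveal.

```shell
# Hypothetical sandbox for a timestamp experiment.
demo=$(mktemp -d)
touch "$demo/report.txt"
sleep 1
touch "$demo/reference"        # marks a moment between the two events
sleep 1
chmod 400 "$demo/report.txt"   # changes metadata only, not content

# Is report.txt newer than the reference by modification time?
by_mtime=$(find "$demo/report.txt" -newer "$demo/reference" -print | wc -l)

# Is it newer than the reference by change time?
by_ctime=$(find "$demo/report.txt" -cnewer "$demo/reference" -print | wc -l)

echo "newer by modification time: $by_mtime"
echo "newer by change time:       $by_ctime"
rm -rf "$demo"
```

On a typical GNU/Linux filesystem, I would expect the permission change to show up only in the change-time test.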

All the timestamp options take a number with an optional minus or plus sign. Thus, -mmin -360 asks for files that were modified within the past 360 minutes, which means six hours:

$ find / -type f -mmin -360 -print

In contrast, -mmin +360 lists all files whose most recent modification took place more than 6 hours ago. A plain -mmin 360 asks for something modified exactly 359 to 360 minutes ago. If you're measuring days instead of minutes, use -mtime. The POSIX standard for find doesn't offer minute granularity (only -mtime), whereas the BSD version has its own syntax for time units.
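Here is a hedged sketch of the minus and plus signs at work on the six-hour window from our challenge, which is 360 minutes. It assumes a find that supports -mmin (GNU and BSD do) and backdates one made-up file with the standard touch -t option.

```shell
# Hypothetical sandbox: one fresh file and one backdated file.
demo=$(mktemp -d)
touch "$demo/fresh.txt"                   # modified just now
touch -t 200001010000 "$demo/stale.txt"   # modification time: 1 Jan 2000

# Minus sign: modified less than 360 minutes (6 hours) ago.
recent=$(find "$demo" -type f -mmin -360 -print)

# Plus sign: modified more than 360 minutes ago.
older=$(find "$demo" -type f -mmin +360 -print)

echo "recent: $recent"
echo "older:  $older"
rm -rf "$demo"
```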

Archiving and Removing a User's Files

We've seen ways to restrict the output of find by filename and file type, but the options go on and on—you can check the documentation to see what might be useful to you. Two options that are particularly valuable for system administration are -user and -group, which choose files owned by a particular user or group, respectively. Another useful option is -perm, which lets you check for executable files, read-only files, etc.

In this section, we'll clean up the files of Luisa Mendoso, who we're told has left the organization. It would be bad hygiene and bad security to leave around files from a deleted account. On the other hand, files that Mendoso created for the marketing group are valuable, so we'll change the owner on them and archive them. The question we’re answering is:

  • A user left the company. How can I archive the files she created for the marketing group, while deleting all her other files?

In detail, our tasks are to:

  • Find all files owned by the mendoso account and determine which are part of the marketing group.
  • Change her files in the marketing group to user ramjeet and move them to the /home/marketing/archive directory.
  • Remove all other files owned by her.

As a simple test, we can find out whether Mendoso left behind any files outside the marketing group:

$ find /data -user mendoso \! -group marketing -print

The directory here is /data; substitute a slash to search the whole filesystem, though that could take a while. The exclamation point, duly escaped, is a "not" operator for find. In other words, it makes sure that the output lists only files that are not in the marketing group.

But this was just for curiosity's sake. Now we want to get our archiving and cleanup done. The files we want to archive are represented by the following options:

-user mendoso -group marketing -type f

When we find one of these files, we want to change the owner as follows (a privileged command, so we execute it as superuser):

sudo chown ramjeet {} 

and then we want to move it:

mv {} /home/marketing/archive

I have not found a way to combine the chown and mv under the umbrella of a single -exec option, so I will create a separate -exec option for each:

-exec sudo chown ramjeet {} \; -exec mv {} /home/marketing/archive \;
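One possible alternative, which the command in this article does not use, is to hand both steps to a small inline shell script through sh -c. The sketch below is only an illustration under stated assumptions: chmod stands in for sudo chown, because changing ownership requires privileges, and the work and archive directories are made up.

```shell
# Hypothetical sandbox: two files to archive.
demo=$(mktemp -d)
mkdir "$demo/work" "$demo/archive"
touch "$demo/work/slides.odp" "$demo/work/plan.odt"

# Inside the inline script, "$1" is the filename that find substitutes
# for {} and "$2" is the destination. The lone word "sh" fills $0.
# mv runs only if chmod succeeds.
find "$demo/work" -type f \
  -exec sh -c 'chmod 444 "$1" && mv "$1" "$2"' sh {} "$demo/archive" \;

archived=$(find "$demo/archive" -type f -print | wc -l)
remaining=$(find "$demo/work" -type f -print | wc -l)

echo "archived: $archived, remaining: $remaining"
rm -rf "$demo"
```

Whether one -exec or two reads better is a matter of taste; the two-option form keeps each step visible on the find command line.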

We create another command to remove any remaining files owned by mendoso:

-exec rm -f {} \;

But on second thought, let's know what we're removing before it's removed. We’ll use the -ok option, which executes commands as -exec does, but shows us each filename first and prompts us to ask whether we want to go on and execute the command. So now the option looks like:

-ok rm -f {} \;

A note of warning: if you are writing a potentially destructive command or script, test it thoroughly on some dummy data before running it for real.

Now we can combine the two commands that archive files with the command that removes them. We want the rm command to run only on files that weren't handled by the previous command, so we precede it with -o, the "or" operator. Furthermore, after -o we have to specify -user mendoso again so we don't remove other people's files, and specify -type f so the rm command doesn't try to delete directories (failing with error messages).

Here's the stunning command we've ultimately come up with:

$ find /data -user mendoso -group marketing -type f -exec sudo chown ramjeet {} \; -exec mv {} /home/marketing/archive \; -o -user mendoso -type f -ok rm -f {} \;

Before removing a file, the -ok option prompts you like this:

"rm -f /data/mendoso/project2021/slides.odp"?

Typing y at the prompt lets the removal proceed; typing n skips the removal, and the find command moves on to the next file.
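In the spirit of the warning about testing on dummy data, here is a sketch that exercises the same either/or structure in a sandbox. It is not the real command: a -name test stands in for the ownership tests (which need real accounts), plain -exec replaces -ok so the script can run unattended, and all file names are hypothetical.

```shell
# Hypothetical sandbox: two files to archive, one to remove.
demo=$(mktemp -d)
mkdir "$demo/data" "$demo/archive"
touch "$demo/data/campaign.odp" "$demo/data/budget.odp" "$demo/data/scratch.txt"

# Left of -o: archive the files we want to keep.
# Right of -o: remove any regular file the left side did not handle.
find "$demo/data" -type f -name "*.odp" -exec mv {} "$demo/archive" \; \
  -o -type f -exec rm -f {} \;

archived=$(find "$demo/archive" -type f -print | wc -l)
leftover=$(find "$demo/data" -type f -print | wc -l)

echo "archived: $archived, left behind: $leftover"
rm -rf "$demo"
```

Note one consequence of this structure: if the mv fails, its -exec returns false and the file falls through to the rm branch. That is one more reason to keep -ok in the real command.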

If You Find Your Way This Far

This article has shown, I hope, that find is itself a great find. I use it several times a day to figure out where I've neglectfully left my magnificent creations. The command can be integrated into scripts and perform complicated, important administrative tasks.

But I also marvel at the design of find itself—how Unix programmers, when hemmed in by a limited shell syntax, created a unique command that can do so much. It has rough corners and puzzling inconsistencies, but it rewards study.

Thanks to Bruno Gomes Pessanha—system engineer and co-author of LPI Linux Certification in a Nutshell—for reviewing this article and testing the commands.
