Harley Hahn's Guide to
Filters: Comparing and Extracting
In Chapter 16, we discussed — in detail — the concept of a filter: a program that reads and writes textual data, one line at a time, reading from standard input and writing to standard output. At the time, I observed that most filters are designed as tools to do one thing well. As you read the next three chapters — in which we talk about many specific filters — this observation will start to make more and more sense.
In this chapter, we will be discussing filters that are designed to compare files and to extract parts of files. At first, you might think these are dull topics, and I don't blame you. After all, there are enough details in this chapter to choke a good-sized horse. As you read the details, however, and as you start to see the intelligence of the design behind the filters, you will come to appreciate how interesting they actually are. In fact, the filters we will be discussing in this chapter are not only interesting, but they are among the most useful and important programs in the Unix toolbox.
Over the years, Unix programmers have created a variety of tools to help you answer the questions: Do two files contain the exact same data? If not, how does the data differ from one file to the next? Comparing two files is more complicated than you might think, because there are various ways to compare and to display the results.
In the next few sections, we will discuss the most important of these tools. In particular, I'll explain what they do, what types of files they compare, and which of their options are the most useful. Along the way, I'll show you examples and give you important tips. My goal is simple: whenever the need to compare files arises, you should be able to analyze the situation quickly, decide which program and which options to use, and be able to interpret the results.
Figure 17-1 summarizes the most important file comparison programs (the ones we will be covering). For completeness, I have also included related programs for sorting and selecting data from files. We will discuss these programs in Chapter 19.
Figure 17-1: Programs to compare, sort, and select data from files
Related filters: comm, diff, sdiff
You use cmp in only one situation: to see if two files are identical. The syntax is:
cmp file1 file2
where file1 and file2 are the names of files.
The cmp program compares the two files, one byte at a time, to see if they are the same. If the corresponding bytes in both files are exactly the same, the files are identical, in which case cmp does not do anything. (No news is good news.) If the files are not identical, cmp displays a suitable message.
For example, let's say you have two versions of a program: calculate-1.0 and calculate-backup. You want to see if they are exactly the same. Use:
cmp calculate-1.0 calculate-backup
If the files are the same, you will see nothing. If the files don't match, you will see a message similar to the following:
calculate-1.0 calculate-backup differ: byte 31, line 4
As you can see in Figure 17-1, there are several other programs you can use to compare files (comm, diff, diff3 and sdiff). All of these programs work with text files. That is, they expect lines of text: data in which each line contains zero or more regular characters (letters, numbers, punctuation, whitespace) ending with a newline.
Since cmp compares files one byte at a time, it doesn't care what type of data the files contain. Thus, you can use cmp to compare any type of file, text or binary. For instance, the example above compared two binary files that contain executable programs. You could also compare two music files, two pictures, two word processing documents, and so on. (We will discuss text files and binary files in Chapters 19 and 23.)
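Because cmp says nothing when two files match, a shell script will usually look at its exit status rather than its output. Here is a minimal sketch, assuming three small throwaway files created in a scratch directory (the file names and contents are made up for illustration):

```sh
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two byte-for-byte identical files and one that differs.
printf 'hello\nworld\n' > original
printf 'hello\nworld\n' > copy
printf 'hello\nWorld\n' > changed

# cmp prints nothing and returns status 0 when the files match.
cmp original copy
same_status=$?

# Otherwise it reports the first difference and returns status 1.
message=$(cmp original changed)
diff_status=$?

echo "identical pair: status $same_status"
echo "different pair: status $diff_status ($message)"
```

The exact wording of the message can vary slightly from one system to another, which is one more reason to test the exit status instead.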
Related filters: cmp, diff, sdiff
The comm program compares two sorted text files, line by line. You use comm when you have two similar files, and you want to find the differences. The syntax is:
comm [-123] file1 file2
where file1 and file2 are the names of sorted text files.
The nice thing about comm is that it allows you to visualize the differences in the two files. It does so by displaying its output in three columns: the first column contains the lines that are only in the first file; the second column contains the lines that are only in the second file; the third column contains the lines that are in both files. Let me show you an example.
Two close friends, Frick and Frack(*), are wondering how many other friends they have in common. They each make a list of their own friends, type the list into a file, and then use the sort command (Chapter 19) to sort the file. They then use comm to compare the two files.
Frick and Frack were the stage names of two well-known comedy ice skaters, Werner Groebli and Hans Mauch. Groebli and Mauch came to the United States from their native Switzerland in 1937 and, for decades, performed widely as part of the original Ice Follies. Their most famous trick was the "cantilever spread-eagle" (look for photos on the Web).
The sorted list of Frick's friends is in a file called frick:
The sorted list of Frack's friends is in a file called frack:
Frick and Frack compare the two lists by using the command:
comm frick frack
The output is:
Notice the three columns. The first column has only one name (Ben Dover). This shows that there is only one line that is unique to the first file (frick). The second column has two names (Candy Barr, Sue Perficial). This shows that there are two lines that are unique to the second file (frack). The third column has four names (Alison Wonderland, Barbara Seville, Chuck Wagon, Noah Peel). This shows that there are four lines that are in both files. Thus, Frick and Frack conclude they have four friends in common.
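If you want to try this yourself, the two lists can be rebuilt from the names mentioned above. The following is only a sketch: the file contents are reconstructed from the description, and LC_ALL=C is my own addition to make sure comm uses plain byte order when it checks that the files are sorted.

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Lists rebuilt from the names in the text (already in sorted order).
printf '%s\n' 'Alison Wonderland' 'Barbara Seville' 'Ben Dover' \
    'Chuck Wagon' 'Noah Peel' > frick
printf '%s\n' 'Alison Wonderland' 'Barbara Seville' 'Candy Barr' \
    'Chuck Wagon' 'Noah Peel' 'Sue Perficial' > frack

# Column 1 has no indent, column 2 is indented by one tab,
# and column 3 is indented by two tabs.
three_columns=$(LC_ALL=C comm frick frack)
printf '%s\n' "$three_columns"
```

The output has seven lines, one for each distinct name, and the amount of indentation tells you which column a name belongs to.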
In this example, the files are small (between them, Frick and Frack have a total of seven friends), and you are probably wondering why they bother to create two files, sort them, run the comm command, and interpret the output. Wouldn't it be easier for Frick and Frack to ask one another: Do you know Alison? Do you know Barbara? and so on.
The answer is of course it would, but this is a contrived example. What if you were working with sorted files that had hundreds or thousands of lines — customer records or statistical data or a long list of songs for your MP3 player? In such cases, it would be virtually impossible to compare the lists by hand. You must have a program like comm.
Indeed, comm is especially useful when you have two versions of a sorted file that vary slightly — perhaps because of a small mistake — and you want to find that variation. For example, say that you have two very long sorted lists of numbers. Somewhere in the lists, there is a place where the numbers do not agree. You use comm to compare the files and, in the middle of the output, you see:
This shows you the exact place where the two files don't match.
To give you control over the output, comm allows you to suppress the output of the first, second or third columns by using the -1, -2 and -3 options respectively. In our last example, for instance, you could use -3 to suppress the third column, which would eliminate all the unnecessary output:
Now, all you see are the lines that differ, which is all you want to see. Imagine how much time this saves you if the files contained several thousand numbers.
If you want to suppress more than one column, just combine the options. For example, consider the lists of friends of Frick and Frack we discussed above. Let's say Frack wants to display only those people who are his friends and not Frick's friends. All he has to do is suppress the first and third columns:
comm -13 frick frack
The output shows only those lines that are unique to the second file:
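To see how the column options combine, here is a short sketch that isolates each column in turn. As before, the two files are rebuilt from the names given earlier, and LC_ALL=C is an assumption I have added to pin down the sort order.

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'Alison Wonderland' 'Barbara Seville' 'Ben Dover' \
    'Chuck Wagon' 'Noah Peel' > frick
printf '%s\n' 'Alison Wonderland' 'Barbara Seville' 'Candy Barr' \
    'Chuck Wagon' 'Noah Peel' 'Sue Perficial' > frack

only_frick=$(LC_ALL=C comm -23 frick frack)   # keep column 1
only_frack=$(LC_ALL=C comm -13 frick frack)   # keep column 2
in_both=$(LC_ALL=C comm -12 frick frack)      # keep column 3

echo "Only Frick's friends:"; printf '%s\n' "$only_frick"
echo "Only Frack's friends:"; printf '%s\n' "$only_frack"
echo "Friends in common:";    printf '%s\n' "$in_both"
```

When a column is suppressed, comm also drops the tab that would have indented the remaining columns, so each of these results starts at the left margin.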
The most common reason why comm does not work as expected is that the input files are not sorted.
If you need to compare two files but you don't want to sort the lines — say, because it will mix up the data — you cannot use comm. Instead, you should use diff (discussed later in the chapter). This is the case when you compare different versions of the source code for a program.
Related filters: cmp, comm, sdiff
The comm program will show you, visually, how two text files differ. However, comm has two limitations. First, the input files must be sorted which, in many cases, is not possible. For example, say you have two different versions of a long file, such as a computer program or an essay, and you want to know how they differ. Since the lines of a program or an essay aren't sorted, you can't use comm. Of course, you could sort the files first, but then the output wouldn't make any sense: you might find the differences, but you would lose the context.
Moreover, the output of comm is fine when you are comparing small- or even medium-sized files, but it can be confusing when you are working with large files. Again, it's a matter of context. When you compare large files, it is important that the output show you not only the differences, but their locations, and do so in a way that makes it easy to find the lines that differ.
The diff program is designed to overcome these limitations. Thus, you use diff when you want to (1) compare unsorted files, or (2) compare large files. More generally, diff can be used to find the differences in any type of work in which incremental additions, deletions or changes are made from time to time. For instance, for many years, programmers have used diff (or tools like diff) to track the changes between versions of a program as the program is modified.
Before we start, I must warn you that the output of diff will look a bit cryptic until you get used to it. However, you will get used to it. Regardless, diff is a powerful and useful program, and it is important that you learn how to use it, especially if you are a programmer.
The syntax for diff is as follows:
diff [-bBiqswy] [-c|-Clines|-u|-Ulines] file1 file2
where file1 and file2 are the names of text files, and lines is a number of lines of context to show.
When you compare two files that are identical, diff will display no output (similar to cmp). If the files are not the same, diff will, by default, display a set of instructions that, if followed, would change the first file into the second file. Here is an example.
We have two files. The file old-names contains:
The file new-names contains:
You will notice that the only difference is the spelling of "Paige" in the third line. To compare the files, we use:
diff old-names new-names
The output is:
As I explained, the goal of diff is to display the instructions you would need to follow to change the first file into the second file. The syntax of the instructions is simple but terse, and it can take a bit of practice to understand it. However, I do want you to become familiar with these types of instructions, because they are a standard part of the Unix culture. In fact, you will encounter this type of syntax in a variety of situations, not just when using diff.
The output of diff uses three different 1-character instructions: c (change), d (delete), and a (append)(*). In the example above, you see only a single c instruction. This means that, to turn the first file into the second, you only need to make one modification, a simple change.
Why these three instructions? With two files that are reasonably similar to one another, you can always turn one into the other by some combination of change, delete, and append operations. Take a moment to think about this until it makes sense to you.
To the left and right of each c, d or a character, you will see a list of line numbers. There may be a single line number (such as the 3 above), or there may be a sequence of lines (such as 16,18). The numbers to the left refer to lines in the first file; the numbers to the right refer to lines in the second file. In our example, the instruction 3c3 tells us to change line 3 of the first file to line 3 of the second file.
Whenever diff requires a change, it shows you the actual lines from each file. Lines from the first file are marked by a < (less-than) character. Lines from the second file are marked by a > (greater-than) character. For readability, the two sets of lines are separated by a line consisting of several hyphens (---).
Let's consider another example in which old-names contains:
And new-names contains:
In this case, the only difference is that the first file does not contain the name "Will Power". When you use diff with these two files, the output is:
This tells you how to change the first file into the second. All you need to do is append a single line to the first file. Specifically, you would append line 2 of the second file after line 1 of the first file. Notice that diff shows you the actual line that needs to be appended. The > character tells you that this line is from the second file.
Now, consider a third example in which old-names contains:
And new-names contains:
The difference here is that the second file does not contain the name "Mark Mywords". When you use diff, the output is:
In this case, diff is telling you that, to turn the first file into the second, you need to delete line 4 from the first file. Again, the actual line is displayed. The < character tells you the line is from the first file. (Remember, the goal of diff is to tell you how to turn the first file into the second file.)
Note: Within a d command, you can generally ignore the number after the d (in this case, 3). It shows you where diff found a difference in the second file.
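The three instructions are easier to remember once you have produced each of them yourself. The sketch below uses short, made-up files (not the name lists from the examples above) that differ in exactly one way per comparison:

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' alpha bravo charlie delta > base

# Change: line 3 is spelled differently in the second file.
printf '%s\n' alpha bravo CHARLIE delta > changed
change_out=$(diff base changed)

# Append: the second file has an extra line after line 1 of the first.
printf '%s\n' alpha charlie > short
printf '%s\n' alpha bravo charlie > longer
append_out=$(diff short longer)

# Delete: line 4 of the first file is missing from the second.
printf '%s\n' alpha bravo charlie > trimmed
delete_out=$(diff base trimmed)

printf '%s\n\n%s\n\n%s\n' "$change_out" "$append_out" "$delete_out"
```

The three reports begin with 3c3, 1a2 and 4d3 respectively, followed by the affected lines marked with < or >.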
To finish this part of the discussion, let me show you how diff works with a more realistic example. Consider the following two files, each of which contains some code from a Perl script. (Don't worry about what the code does; just concentrate on the output of the diff command.) The first file, command-1.01.pl, contains:
# Check for illegal content
The second file, command-1.02.pl, contains:
# If the address contains a URL, abort
The following diff command compares the two files:
diff command-1.01.pl command-1.02.pl
The output is:
There are two ways to interpret this output. Literally, it tells us what instructions to follow to turn the first file into the second:
• Delete lines 1 and 2 from the first file.
A better way to interpret the output is to be able to read it and — in an instant — understand how the two files are different in a way that makes sense to you. This, of course, is why you are learning to use diff in the first place. The key is being able to read c (change), d (delete), and a (append) commands and instantly grasp their significance. As you can imagine, this takes practice. However, in time, you will be able to read and understand such output quickly and easily.
The diff program is complicated: it has a large number of options and a variety of ways in which it can generate output. In this section and the next, I'll discuss the most important options. For the full details, see the man page for your system (man diff).
The first few options tell diff to ignore certain differences when comparing. The -i (case insensitive) option tells diff to ignore any differences between upper- and lowercase letters. For example, when you use -i, diff considers the following three lines to be the same:
This is a BIG test.
The -w and -b options allow you to control how diff works with whitespace (spaces and tabs). These options are handy when you have data that is formatted with spaces or tabs that you want to ignore. The -w (whitespace) option ignores all whitespace. For example, with -w, the following two lines are considered to be the same.
The -b option is similar, but it doesn't ignore all whitespace; it only ignores differences in the amount of whitespace. For example, if you use -b, the above two lines would not be considered the same because the second line has whitespace, but the first does not. However, the following two lines would be the same:
This is because the two lines both have whitespace; they differ only in the amount of whitespace. The distinction between -w and -b is subtle, so if you have a whitespace problem and you are confused, try both options and see which works best with your particular data.
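If you find the difference between -w and -b hard to keep straight, a quick experiment helps. This sketch assumes a diff that supports -i, -w and -b (as GNU diff does); the one-line files and the helper function named same are invented for the test.

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo 'This is a test.'     > plain
echo 'THIS IS A TEST.'     > shouting
echo 'Thisisatest.'        > squeezed
echo 'This   is  a  test.' > padded

# same A B OPTION: report whether diff OPTION treats A and B as equal.
same() {
    if diff "$3" "$1" "$2" > /dev/null; then echo same; else echo different; fi
}

case_i=$(same plain shouting -i)    # only the case differs
all_w=$(same plain squeezed -w)     # whitespace present vs. absent
none_b=$(same plain squeezed -b)    # whitespace present vs. absent
amount_b=$(same plain padded -b)    # only the amount differs

echo "-i: $case_i"
echo "-w: $all_w"
echo "-b (some vs. none): $none_b"
echo "-b (amount only): $amount_b"
```

Only the third comparison reports a difference: -b forgives a change in the amount of whitespace, but not its complete absence.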
The -B (blank lines) option tells diff to ignore all blank lines. For example, let's say you have two files that contain different versions of an essay you have written. You want to compare them, but one copy is single-spaced, while the other is double-spaced. If you use the -B option, diff will ignore the blank lines and look only at the lines of text.
The rest of the diff options control how diff displays its results. The -q (quiet) option tells diff to leave out the details when two files are not the same. For example, if you compare two files, frick and frack, that are different, and you use -q, all you will see is:
Files frick and frack differ
As such, comparing two files with diff -q is, essentially, the same as using cmp (discussed earlier in the chapter). The biggest difference is that diff only compares text files, while cmp works with any type of file.
As I mentioned earlier, when diff finds that two files are the same, it does not display anything. This is common with many Unix programs: when they have nothing to say, they say nothing. However, there may be times when you want an explicit notice that two files are identical. In such cases, you can use the -s (same) option. For example, if you compare the two files frick and frack and they are the same, you would normally see nothing. If you use -s, however, you will see:
Files frick and frack are identical
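Like cmp, diff also reports its result through its exit status: 0 when the files are the same and 1 when they differ. The sketch below shows both messages along with the status values. The file contents are placeholders, and twin is a hypothetical third file that I have added as an identical copy.

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' alpha bravo > frick
printf '%s\n' alpha bravo > twin
printf '%s\n' alpha zulu  > frack

# -q: say only that the files differ, without the details.
quiet_msg=$(diff -q frick frack)
quiet_status=$?

# -s: say so explicitly when the files are the same.
same_msg=$(diff -s frick twin)
same_status=$?

echo "$quiet_msg (status $quiet_status)"
echo "$same_msg (status $same_status)"
```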
As we discussed earlier, when diff compares two files, the default output consists of instructions (c, d, a) along with line numbers. These instructions, if followed, will turn the first file into the second file. This type of output has the advantage of being terse and, once you get used to it, it actually is readable. In my experience, this is all you need most of the time.
The disadvantage of the default format, however, is that by the time you get used to it, you have made irreversible changes in the gray matter of your brain. The biggest problem is that when you read the output, there is very little context. All you see are some line numbers along with the lines to be changed. For this reason, diff has three options (-c, -u and -y) that will produce more readable types of output. In addition, there is another program, sdiff, that will compare two files side-by-side. Here are the details.
Using diff with the -c (context) option will show you the differences between two files in a format that is less terse and more understandable than the default output. Instead of instructions and line numbers, diff will show you the actual lines that differ, as well as up to three extra lines above and below. Here is an example.
You have two files to compare. The first file, smart-friends contains:
The second file, rich-friends, contains:
First, let's compare the two files in the regular manner:
diff smart-friends rich-friends
The output is terse, but somewhat cryptic:
Now, let's use the -c option:
diff -c smart-friends rich-friends
The output is much longer, but easier to understand:
*** smart-friends 2009-02-14 15:33:50.000000000 -0700
The top two lines give you information about the files. The first file is marked by * (star) characters; the second file is marked by - (hyphen) characters. Following these lines, you see an excerpt from each file, showing exactly what needs to be changed to make the files identical.
Although this format is easier to understand than the default output, it has an obvious disadvantage: because diff displays excerpts from both files, there is duplicate text, making for a lot of output. When you consider that our example compared only two short files with simple differences, you can imagine how long the output would be if you compared two large files with many differences. In such cases, the -c output is much more verbose than the default format.
As a compromise, you can use the -u (unified output) option. This produces output similar to -c without repeating duplicate lines. For example, when you use:
diff -u smart-friends rich-friends
The output is:
--- smart-friends 2009-02-14 15:33:50.000000000 -0700
By default, when you use diff with -c or -u, the output shows three lines of context above and below every difference (fewer if the difference is near the beginning or end of the file).
If you want to display a different number of context lines, you can do so by using -C (uppercase "C") instead of -c, and -U (uppercase "U") instead of -u. Use -C or -U followed by the number of extra lines you want, for example:
diff -C5 file1 file2
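To see the effect of the context setting, try a small number on a short file. The following sketch uses two made-up seven-line files that differ only in line 4, and asks for a single line of context in each format:

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' alpha bravo charlie delta echo foxtrot golf > old
printf '%s\n' alpha bravo charlie DELTA echo foxtrot golf > new

# One line of context on each side of the change.
unified=$(diff -U1 old new)
context=$(diff -C1 old new)

# Skip the two header lines, which contain file names and timestamps.
printf '%s\n' "$unified" | tail -n +3
```

In the unified output, the line starting with @@ gives the range of lines shown from each file; removed lines begin with - and added lines begin with +. In the context output, changed lines are marked with ! instead.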
The final output option generates a side-by-side format, in which each line of the first file is displayed next to the corresponding line in the second file. To use this format, use -y:
diff -y smart-friends rich-friends
The side-by-side output looks like this:
Alba Tross Alba Tross
You can see the advantage of this type of output: it is very easy to see differences. For instance, in our example, it is obvious that three names are common to both files (Alba, Dee and Pat), one name is only in the second file (Mick), and one name is only in the first file (Phil). The disadvantage, of course, is that, with a long file, you get a lot of output.
If you like this type of output, there is a special-purpose program, sdiff (side-by-side diff), you can use instead of diff -y. For example, the following two commands produce the same output:
diff -y smart-friends rich-friends
sdiff smart-friends rich-friends
When it is necessary to do a side-by-side comparison, many people prefer to use sdiff, because it has a lot of specialized options, which affords a great deal of control. The syntax for sdiff is:
sdiff [-bBilsW] [-w columns] file1 file2
where file1 and file2 are the names of text files, and columns is the maximum width, in characters, of each line of output.
Using sdiff is straightforward. For example, to compare the two files from our example we would use:
sdiff smart-friends rich-friends
The output is:
Alba Tross Alba Tross
As I mentioned, sdiff has a lot of options. We'll take a look at the most important ones, some of which are the same as the diff options. To read about the rest of the options, take a look at the man page on your system (man sdiff).
To start, there are several options that allow you to reduce the amount of unnecessary output. First, the -l (lowercase "L") option displays only the left column wherever the two files have common lines. For example, if you use:
sdiff -l smart-friends rich-friends
The output is:
Alba Tross (
The -s (same) option reduces the output even further: it tells sdiff not to display any lines that are the same in both files. For example:
sdiff -s smart-friends rich-friends
The output is minimal and easy to understand:
> Mick Stup
When you work with files that have short lines (as in our example), you will often find that the default output used by sdiff is too wide. When this happens, you can use the -w option to change the width. Just use -w followed by the maximum number of characters you want in each line of output (that is, both columns together). For example:
sdiff -w 30 smart-friends rich-friends
Of course, you can combine more than one option. My favorite strategy is to start with the -s and -w 30 options. For example:
sdiff -s -w 30 smart-friends rich-friends
Once I see the output, I adjust the width of the column to suit my data.
Finally, there are four options similar to those used with diff. The -i option ignores differences between upper- and lowercase letters; -W ignores all whitespace; -b ignores differences in the amount of whitespace; and -B ignores blank lines. (Note that sdiff uses -W, while diff uses -w. The difference is for historical reasons and has never been changed.)
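Here is a small sketch of the -s and -w options working together. It assumes that sdiff is installed (it is normally part of the same package as diff), and it uses two made-up files rather than the lists of friends from our example.

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' alpha bravo charlie delta > old
printf '%s\n' alpha BRAVO charlie delta echo > new

# -s hides the lines common to both files;
# -w 40 keeps each line of output within 40 characters.
only_changes=$(sdiff -s -w 40 old new)
printf '%s\n' "$only_changes"
```

You will see two lines: one showing bravo and BRAVO separated by a | character (a changed line), and one showing echo marked with a > character (a line found only in the second file).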
Over the years, diff has been a very important tool, used by programmers to keep track of different versions of their programs. For example, let's say you are a programmer and you are working on a C program named Foo. The current version is 2.0; it is stored in the file foo-2.0.c. Right now, you are working on version 2.1, which is stored in foo-2.1.c. Once version 2.1 is finished, you can capture the changes by running the following command:
diff foo-2.0.c foo-2.1.c > foo-diff-2.1
The output file (foo-diff-2.1) now contains a list of instructions that, when followed, will turn foo-2.0.c into foo-2.1.c.
In general, a list of instructions that will change one file into another is called a DIFF. Thus, we can say that foo-diff-2.1 contains the diff that changes foo-2.0.c into foo-2.1.c.
Programmers create diffs for two reasons: to save storage space when backing up their work, and to distribute changes to other people.
Suppose Foo is a very large program. It would be prudent to make a backup of every version, but that would take up a lot of storage space. Instead, you back up a full copy of the base version (foo-2.0.c). From then on, you only need to back up the diffs, which are small: foo-diff-2.1, foo-diff-2.2, and so on. (When you get to 3.0, you can save another full copy.)
Let's say you are at version 2.7 when a catastrophic event causes you to lose the original files. To restore them from the backup, you start by copying the base file, foo-2.0.c. You then use the first diff to recreate foo-2.1.c, the second diff to recreate foo-2.2.c, and so on, up to foo-2.7.c. In other words, by backing up a base copy and a series of diffs, you can recreate all the different versions of your program. In fact, using this technique, you can back up different versions of anything stored in a text file: a story, an essay, a sales presentation, and so on.
When you use a diff in this way — to recreate one file from another — we say that you APPLY the diff. The program that is used to apply diffs is called patch (the details of which are beyond the scope of this book). In our example, you would recreate the lost files by copying the base version and all the diffs from the backup, and then using patch to apply one diff after another.
The second way in which programmers use diffs is to distribute changes to their programs. For example, let's say a lot of people have the source code of version 2.0 of your Foo program. It took each person a while to download and install the program, but now they have it. What do you do when you are ready to distribute version 2.1?
You could ask everyone to download the entire new program. However, that would take a long time. Instead, you need only distribute the diff, which is small. To change to version 2.1, all your users need to do is use patch to apply the diff. If, for some reason, they have a problem with the new version, they can use patch to un-apply the diff, and go back to version 2.0.
When programmers use a diff in this way, it is often referred to as a PATCH. Thus, in our example, we would say that you distributed a patch for version 2.1, and your users used the patch program to apply the patch.
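Although the details of patch are beyond our scope, the basic round trip is short enough to sketch. The example below assumes that patch is installed on your system. The two tiny C files are hypothetical stand-ins for foo-2.0.c and foo-2.1.c, and restored.c is a name I made up for the working copy.

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for two versions of the Foo program.
printf '%s\n' '#include <stdio.h>' 'int main(void) {' \
    '    puts("Foo 2.0");' '    return 0;' '}' > foo-2.0.c
sed 's/Foo 2.0/Foo 2.1/' foo-2.0.c > foo-2.1.c

# Capture the diff (diff returns status 1 because the files differ).
diff foo-2.0.c foo-2.1.c > foo-diff-2.1

# Apply the diff to a copy of the base version.
cp foo-2.0.c restored.c
patch restored.c < foo-diff-2.1 > /dev/null
if cmp -s restored.c foo-2.1.c; then applied=yes; else applied=no; fi

# Un-apply it with -R (reverse) to get back to version 2.0.
patch -R restored.c < foo-diff-2.1 > /dev/null
if cmp -s restored.c foo-2.0.c; then reverted=yes; else reverted=no; fi

echo "applied: $applied, reverted: $reverted"
```

Notice that cmp does the checking here: after each step, the working copy should be byte-for-byte identical to the version you expect.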
The advantage of distributing changes in the form of diffs is that it is much, much faster to update software by applying patches than by downloading and installing brand new versions. Indeed, in the early days of the Internet, when downloading was extremely slow, the only practical way to update large programs was by distributing patches that users would then apply on their own.
In the early days of Unix, it was common for programmers to use diff and patch to maintain, back up, and distribute their programs. However, for a long time, there have been much better systems to automate such tasks, and relatively few programmers use diff and patch directly. Instead, they work with a sophisticated VERSION CONTROL SYSTEM, sometimes referred to as a SOURCE CODE CONTROL SYSTEM (SCCS) or a REVISION CONTROL SYSTEM (RCS). Such systems are commonly used by software developers and engineers to manage the development of large programs, documents, blueprints, and so on. In fact, without modern version control systems, it would be impossible for large teams of people to work together on creative projects.
However, regardless of the degree of sophistication, all version control systems rely on the fundamental concepts of creating, distributing and applying diffs. This is why it is important that you understand the basic ideas.
What's in a Name?
The word diff comes from the diff program, which is used to compare two files. Among Unix people, it is common to use "diff" as both a noun and a verb.
For example, you might hear someone say, "Send me your diffs for the Foo program," meaning, "Send me the files that contain the updates for the Foo program."
You will also hear people use diff as a verb: "If you want to see the changes I made to your news article, just diff the two files."
Related filters: colrm, join, paste
The cut program is a filter that extracts specified columns of data and throws away everything else. (This is the opposite of colrm, which deletes specified columns of data, and saves everything else.)
The cut program has a great deal of flexibility. You can extract either specific columns of each line or delimited portions of each line (called fields). If you are a database expert, you can consider cut as implementing the projection of a relation. (If you are not a database expert, don't worry; your life is still complete.)
In this section, I will concentrate on how to use cut to extract columns of data. In the next section, we'll talk about how to extract fields of data.
The syntax of cut (when you are extracting columns) is:
cut -c list [file...]
where list is a list of columns to extract, and file is the name of an input file.
You use the list to tell cut which columns of data you want to extract. Specify one or more column numbers, separated by commas. Do not put any spaces within the list. For example, to extract column 10 only, use 10. To extract columns 1, 8 and 10, use 1,8,10.
You can also specify a range of column numbers by joining the beginning and end of the range with a hyphen. For example, to extract columns 10 through 15, use 10-15. To extract columns 1, 8, and 10 through 15, use 1,8,10-15.
Here is an example of how to use cut. Say that you have a file named info that contains information about a group of people. Each line contains data pertaining to one person. In particular, columns 14-30 contain a name and columns 42-49 contain a phone number. Here is the sample data:
012-34-5678 Ambercrombie, Al 01/01/72 555-1111
To display the names only, use:
cut -c 14-30 info
You will see:
To display the names and phone numbers, use:
cut -c 14-30,42-49 info
You will see:
Ambercrombie, Al 555-1111
If you like, you can leave out the space after -c. In fact, most people do just that. Thus, the following command is equivalent to the last one:
cut -c14-30,42-49 info
Returning to our example, you can save the information by redirecting standard output to a file, for example:
cut -c 14-30,42-49 info > phonelist
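If you would like to experiment, you need a file whose columns line up exactly. The sketch below builds one with printf so that the name falls in columns 14-30 and the phone number in columns 42-49. The first record is taken from the sample data; the second record and the exact spacing between fields are my own assumptions.

```sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Fixed-width layout: name in columns 14-30, phone in columns 42-49.
fmt='%-11s  %-17s %-8s  %-8s\n'
printf "$fmt" '012-34-5678' 'Ambercrombie, Al' '01/01/72' '555-1111' >  info
printf "$fmt" '123-45-6789' 'Example, Eve'     '02/02/80' '555-2222' >> info

# Extract the names alone, then the names with the phone numbers.
names=$(cut -c 14-30 info)
phonelist=$(cut -c 14-30,42-49 info)
printf '%s\n' "$phonelist"
```

Because cut copies the columns exactly as they are, any padding spaces inside a column range are kept. This is why the phone numbers still line up in the output.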
The cut program is handy to use in a pipeline. Here is an example. You share a multiuser Linux computer, and you want to make a list of the userids that are currently logged into the system. Since some userids may be logged in more than once, you want to show how many times each userid is logged in.
Start with who (Chapter 8). This command will generate a report with one line for each userid that is logged in. Here is a typical sample:
harley console Jul 8 10:30
As you can see, the userid is displayed in columns 1 through 8. Thus, we can extract the userids by using:
who | cut -c 1-8
The output is:
Now, let's do more. Let's sort the list of userids using sort, and count the number of duplications using uniq -c. (Both sort and uniq are explained in Chapter 19.) Putting the whole thing together, we have:
who | cut -c 1-8 | sort | uniq -c
(Notice that there is no problem using options within a pipeline.) The output is:
As an interesting variation of this pipeline, let us ask the question: How can we display the names of all userids who are logged in exactly twice? The solution is to search the output of uniq for all the lines that contain "2"(*). You can do so using grep (Chapter 19):
who | cut -c 1-8 | sort | uniq -c | grep "2"
The output is:
Strictly speaking, this grep command will find any lines that contain the character "2". For example, if someone is logged in 12 times or 20 times that will be found as well. A better solution, which uses the techniques that we will discuss in Chapter 20, is to use the command grep "\<2\>". This will find only those lines that contain "2" all by itself.
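Because the output of who depends on who happens to be logged in, it can help to rehearse the pipeline on fixed input. The sketch below uses a hypothetical shell function, sample_who, that prints four made-up lines in the same layout as the who report (all the userids except harley are invented). The final grep pattern, '^ *2 ', is one alternative way to match a count of exactly 2: it assumes that uniq -c writes the count at the start of each line, padded with spaces and followed by a space.

```shell
# Hypothetical stand-in for who: userid in columns 1-8 (sample data is invented)
sample_who() {
    printf '%-8s %-8s %s\n' \
        harley console 'Jul  8 10:30' \
        casey  pts/0   'Jul  8 10:31' \
        harley pts/1   'Jul  8 10:32' \
        tammy  pts/2   'Jul  8 10:40'
}

# One line per userid, preceded by the number of logins
sample_who | cut -c 1-8 | sort | uniq -c

# Only the userids whose count is exactly 2 (not 12, 20 and so on)
sample_who | cut -c 1-8 | sort | uniq -c | grep '^ *2 '
```

Once the pipeline behaves the way you expect, replace sample_who with the real who command.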
To rearrange the columns of a table, use cut followed by paste.
In the last section, I showed you how to use the cut program to extract specified columns of data. However, cut has another use: it can extract fields of data. In order to understand how this works, we need to discuss a few basic ideas.
Consider two different files. The first file contains the following lines:
Ambercrombie Al 123
The second file contains:
In both files, each line contains a last name, a first name, and an identification number. In fact, the two files contain the same information. However, there is a big difference between the files.
The first file is easy for a person to read, because the information is lined up nicely in columns. The second file is more suitable for a program to read, because of the : (colon) characters that separate the three parts of each line. Using the terminology we discussed in Chapter 12, we can call the first file human-readable and the second file machine-readable.
You will often encounter machine-readable files, similar to our second example, when you work with data that is designed to be processed by a program. With such data, each line is referred to as a RECORD; the separate parts of each line are called FIELDS; and the characters that act as separators are DELIMITERS. In our example, there are 4 records, each of which has 3 fields (last name, first name, identification number). Within each record, the delimiters are colons.
Of course, delimiters aren't always colons. In principle, any character that does not appear in the actual data can be used as a delimiter. The most common delimiters are commas, spaces, tabs and whitespace (that is, a combination of tabs and spaces).
Commas, in fact, are used so frequently as delimiters that there is a special name to describe data which is delimited by commas. Such data is said to be stored in CSV ("comma-separated value") format(*).
Until the last few years, CSV format was the most popular storage format for data that might need to be exchanged between programs, particularly with spreadsheet programs such as Microsoft Excel. Today, XML (Extensible Markup Language) is more widely used, because it works with many types of data. CSV format, although it is easy to understand, is much more limited as it can only be used with plain text.
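One practical point about CSV data is worth seeing for yourself. A CSV field may be enclosed in double quotes, for example, when the field itself contains a comma. The cut program knows nothing about such quoting: it splits a line at every delimiter it finds. The following sketch uses a made-up record to show what happens.

```shell
# A made-up CSV record: the first field is quoted because it contains a comma
record='"Hahn, Harley",123'

# cut treats every comma as a delimiter, so the quoted field is split in two
printf '%s\n' "$record" | cut -d ',' -f 1     # prints: "Hahn
printf '%s\n' "$record" | cut -d ',' -f 3     # prints: 123
```

For simple CSV data with no quoted fields, cut -d ',' works well. For data that uses quotes, a tool that understands the CSV rules is a safer choice.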
For reference (in case you ever need it), here is my version of a comprehensive, technical definition of CSV format:
"CSV format is used to store textual data organized into records, each of which ends with a newline character (or return-newline with Windows). Within each record, fields are delimited by commas. Any whitespace (spaces or tabs) before or after fields is ignored. A field may be enclosed by double quotes, which are ignored. A field must be enclosed in double quotes if it contains commas, double quotes or newlines, or if it starts or ends with spaces or tabs. Within a field, a double quote character is represented by two double quotes in a row."
Perhaps the most interesting example of a machine-readable file that uses delimiters is the Unix PASSWORD FILE (/etc/passwd), which we discussed in Chapter 11. The password file contains one line for each userid on the system. Within each line, the various fields are separated by : characters. If a field is empty, you will see two : characters in a row.
To take a look at the password file on your system, use one of the following commands(*):
In old versions of Unix, passwords (encrypted, of course) were kept in the password file, hence the name. With modern Unix, the actual passwords are not kept in this file. As we discussed in Chapter 11, for security reasons the encrypted passwords are stored in a different file (/etc/shadow) called the shadow file.
Now that we have laid the groundwork, let me show you how to use the cut program to extract fields from the lines of a file. The syntax is:
cut -f list [-d delimiter] [-s] [file...]
where list is a list of fields to extract, delimiter is the delimiter used to separate fields, and file is the name of an input file.
The list of fields uses the same format as when you use the -c option. You specify one or more numbers, separated by commas. Do not put any spaces within the list. For example, to extract field 10 only, use 10. To extract fields 1, 8 and 10, use 1,8,10.
You can also specify a range of fields by joining the beginning and end of the range with a hyphen. For example, to extract fields 10 through 15, use 10-15. To extract fields 1, 8, and 10 through 15, use 1,8,10-15.
Here is an example. Within the password file (/etc/passwd), the first field in each line is the userid. Suppose you want to see a list of all the userids registered with the system. Remembering that this file uses a : for a delimiter, all you need to do is extract the first field from each line in the password file. The command is:
cut -f 1 -d ':' /etc/passwd
If you want to sort the list, just pipe the output to sort (Chapter 19):
cut -f 1 -d ':' /etc/passwd | sort
The following example extracts fields number 1, 3, 4 and 5 from the same file:
cut -f 1,3-5 -d ':' /etc/passwd | sort
You will notice that I have quoted the delimiter (the :). This is a good habit in order to make sure that the delimiter is not interpreted incorrectly when the shell parses the command. In this case, it would have been okay to leave out the quotes, but if your delimiter is a space, tab or metacharacter, you must quote it.
What happens if cut encounters a line that does not contain any delimiters? By default, such lines are simply passed through and will be written to standard output. If you want to throw away such lines, you can use the -s (suppress) option.
One last point. As with the -c option we discussed in the last section, you can leave out the space after -f and -d. Thus, the following commands are equivalent to our last two examples:
cut -f1 -d':' /etc/passwd | sort
cut -f1,3-5 -d':' /etc/passwd | sort
Most experienced Unix people leave out the spaces.
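If you would like to experiment without touching the real password file, here is a small sketch that works on a scratch file. The file name passwd.sample and its contents are invented for illustration (the layout follows the colon-delimited format described above), and the last line deliberately contains no delimiter so you can see the effect of -s.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Invented sample data in password-file format, plus one line with no colons
cat > passwd.sample <<'EOF'
root:x:0:0:root:/root:/bin/bash
harley:x:500:500:Harley Hahn:/home/harley:/bin/bash
this line has no delimiter
EOF

cut -f 1 -d ':' passwd.sample          # the undelimited line is passed through
cut -s -f 1 -d ':' passwd.sample       # -s suppresses lines with no delimiter
cut -s -f1,3-5 -d':' passwd.sample     # fields 1, 3, 4 and 5, no spaces after options
```

When cut selects more than one field, it writes them separated by the same delimiter it read, so the last command prints lines such as harley:500:500:Harley Hahn.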
When you want to extract fields from a file that has both delimiters and fixed width columns, you can use either cut -f or cut -c. In such cases, you will find that working with fields and delimiters (-f with -d) is a better choice as it is less prone to error.
Related filters: colrm, cut, join
The paste program combines columns of data. This program has a great deal of flexibility. You can combine several files, each of which has a single column of data, into one large table. You can also combine consecutive lines of data to build multiple columns. In this section, I will concentrate on the most useful feature of paste: combining separate files. If you want more details on what paste can do for you, check the man page (man paste).
The syntax of the paste program is:
paste [-d char...] [file...]
where char is a character to be used as a separator, and file is the name of an input file.
You use paste to combine columns of data into one large table. If you want, you can save the table in a file by redirecting standard output. Here is an example. You have four files named idnumber, name, birthday and phone. The contents of the files are as follows.
The file idnumber:
The file name:
The file birthday:
And finally, the file phone:
You want to build one large file named info that combines all this data into a single table. Within the table, the data from each file should be put into its own column. The command to use is:
paste idnumber name birthday phone > info
The contents of info are:
012-34-5678 Ambercrombie, Al 01/01/72 555-1111
You will notice that the output is spaced a bit oddly. That is because, by default, paste puts a tab character between each column entry, and Unix assumes that tabs are set every 8 positions, starting with position 1. In other words, Unix assumes that tabs are set at positions 1, 9, 17, 25 and so on. (We will discuss the details of how Unix uses tabs in Chapter 18.)
To tell paste to use a different (non-tab) character between columns, use the -d (delimiter) option followed by an alternative character in single quotes. For example, to create the same table with a space between columns, use:
paste -d ' ' idnumber name birthday phone
Now your output looks like this:
012-34-5678 Ambercrombie, Al 01/01/72 555-1111
If you specify more than one delimiter, paste will use each one in turn, repeating if necessary. For example, the following command specifies two different delimiters, a | (vertical bar) and a % (percent sign).
paste -d '|%' idnumber name birthday phone
The output is:
Think of paste as being similar to cat. The difference is that paste combines data horizontally, while cat combines data vertically.
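To see the effect of the delimiter options side by side, you can recreate the four files with a single record each. This is a minimal sketch: the scratch directory is an assumption made for illustration, and the record is the first line of the sample data used earlier in the chapter.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# One-line versions of the four input files
echo '012-34-5678'      > idnumber
echo 'Ambercrombie, Al' > name
echo '01/01/72'         > birthday
echo '555-1111'         > phone

paste idnumber name birthday phone           # columns separated by tabs (default)
paste -d ' ' idnumber name birthday phone    # columns separated by single spaces
paste -d '|%' idnumber name birthday phone   # delimiters used in turn: | % |
```

With four files there are three gaps to fill, so the two-character delimiter list is used as |, then %, then | again.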
Using cut and paste in sequence, you can change the order of columns in a table. For example, say that you have a file named pizza containing information about four different pizzas you are going to make for a party:
mushrooms regular sausage
You want to change the order of the first and second columns. First, save each column to a separate file:
cut -c 1-9 pizza > vegetables
Now combine the three columns into a single table, specifying the order that you want:
paste -d ' ' crust vegetables meat > pizza
Since this is a short file, you can display it using cat (see the discussion in Chapter 16).
The data now looks like this:
regular mushrooms sausage
(Of course, this is a small, contrived example, but think how important this technique would be if you had to interchange columns in a file with hundreds or thousands of lines.)
Once you have made the changes you want, there are two things left to do. First, use the rm program (Chapter 25) to delete the three temporary files:
rm crust vegetables meat
Second, see if you can think of someone to invite to the party who is willing to eat a liver and tomato pizza.
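The whole procedure can be rehearsed on a two-line file. In this sketch, the column positions (vegetables in columns 1-9, crust in 11-17, meat in 19-27) and the second row are assumptions chosen for illustration, so adjust the ranges to match your own data. The result is written to a new file, pizza.new (a name chosen here), so the original stays intact until you have checked the output.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two sample rows in fixed-width columns (layout assumed for this sketch)
printf '%-9s %-7s %s\n' mushrooms regular sausage \
                        tomato    pan     liver > pizza

# Save each column to its own temporary file
cut -c 1-9   pizza > vegetables
cut -c 11-17 pizza > crust
cut -c 19-27 pizza > meat

# Rebuild the table with the first two columns interchanged
paste -d ' ' crust vegetables meat > pizza.new
cat pizza.new

# Delete the temporary files
rm crust vegetables meat
```

Because cut -c keeps the padding spaces within each column, the rearranged columns still line up after paste joins them.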
Review Question #1:
There are ten important Unix programs to compare, sort and select data from files. What are the ten programs?
Why are there so many programs?
Review Question #2:
By default, the comm program compares two files and generates three columns of output. Explain the purpose of each column. How do you suppress a specific column?
Review Question #3:
Both the diff and sdiff programs compare files. When do you use diff and when do you use sdiff?
Review Question #4:
You are given a large text file. Which program would you use to select: a) duplicate lines, b) unique lines, c) lines containing a specific pattern, d) lines beginning with a specific pattern, e) columns of data?
Applying Your Knowledge #1:
You are interested in comparing favorite foods with your two friends Claude and Eustace. Create three files named: me, claude, eustace. Each of the files contains five lines in sorted order, each of which has the name of a food. Using only two Unix commands, display a list of those foods that appear in all three lists. Hint: You may need to create a temporary file.
Applying Your Knowledge #2:
The comm program is used to compare sorted files; the diff program compares unsorted files.
Give three examples of types of data which you would compare using comm.
Give three examples of types of data where you would use diff.
Are there any instances where either program would work?
Applying Your Knowledge #3:
Each line of the Unix password file (/etc/passwd) contains information about a userid. Within the line, the various fields of data are delimited by : (colon) characters. One of the fields contains the name of the shell for that userid. Use the following command to display and study the format of the password file on your system:
less /etc/passwd
(From within less, you can press <Space> to page down and q to quit.)
What command would you use to read the password file and display a list of the various shells used on your system?
How would you sort the output to make it more readable?
How would you eliminate duplications?
Applying Your Knowledge #4:
CSV format (comma-separated value format) describes a file containing machine-readable data in which fields are separated by commas. You have five files — data1, data2, data3, data4 and data5 — each of which contains a single column of data. What command would you use to put the five columns together into one CSV-formatted file named csvdata?
What happens if one of the files has fewer lines in it than the other files?
For Further Thought #1:
The purpose of the diff program is to highlight differences by displaying terse instructions for turning the first file into the second file. Create two files named a and b. Compare the output of the following commands:
What patterns do you notice?
For Further Thought #2:
Why is the output of the diff command so compact?
Should it be easier to understand?
For Further Thought #3:
You are given a text file with 10,000 lines. The file contains two columns of data, and you must change the order of the columns. You can do so quickly and accurately by using the cut and paste commands. Suppose you did not have these commands. Using any other tools at your disposal, how might you accomplish the job?
Consider using a text editor, a word processor, a spreadsheet program, a program of your own, and so on. Is there anything you can think of that is easier than the Unix cut and paste programs? Why is this?
What qualities do the programs in this chapter have that make them so useful?
© All contents Copyright 2023, Harley Hahn