Chapter 19

Filters: Selecting, Sorting, Combining, and Changing

In this chapter, we conclude our discussion of filters by talking about the most interesting and powerful filters in the Unix toolbox: the programs that select data, sort data, combine data, and change data. These programs are incredibly useful, so it behooves us to take the time to discuss them at length.

As you know, powerful programs can take a long time to learn, and that is certainly the case with the filters we will be discussing in this chapter. In fact, these programs are so powerful, you will probably never master all the nuances.

That's okay. I'll make sure you understand the basics, and I'll show you a great many examples. Over time, as your skills and your needs develop, you can check the online manual for more advanced details, and you can use the Web and Usenet to look for help from other people. Most important: whenever you get a chance to talk to a Unix geek in person, ask him or her to show you some favorite tricks using the filters in this chapter. That is the very best way to learn Unix.

This is the last of four chapters devoted to filters (Chapters 16-19). In Chapter 20, we will discuss regular expressions, which are used to specify patterns. Regular expressions increase the power of filters significantly, and in Chapter 20 you will find many examples that pertain to the filters in this chapter, particularly grep, perhaps the most important filter of them all.


Selecting Lines That Contain a Specific Pattern: grep

Related filters: look, strings

The grep program reads from standard input or from one or more files, and extracts all the lines that contain a specified pattern, writing the lines to standard output. For example, you might use grep to search 10 long files for all the lines that contain the word Harley. Or, you might use the sort program (discussed later in the chapter) to sort a large amount of data, and then pipe that data to grep to extract all the lines that contain the characters "note:".

Aside from searching for specific strings of characters, you can use grep with what we call "regular expressions" to search for patterns. When you do so, grep becomes a very powerful tool. In fact, regular expressions are so important, we will discuss them separately in Chapter 20, where you will see a lot of examples using grep. (In fact, as you will see at the end of this section, the re in the name grep stands for "regular expression".)

The syntax for grep is:

grep [-cilLnrsvwx] pattern [file...]

where pattern is the pattern to search for, and file is the name of an input file.

Let's start with a simple example of how you might use grep. In Chapter 11, I explained that most Unix systems keep the basic information about each userid in a file named /etc/passwd. Each userid has one line of information in this file. You can display the information about your userid by using grep to search the file for that pattern. For example, to display information about userid harley, use the command:

grep harley /etc/passwd

If grep does not find any lines that match the specified pattern, there will be no output or warning message. Like most Unix commands, grep is terse. When there is nothing to say, grep says nothing.(*)

* Footnote

Wouldn't it be nice if everyone you knew had the same philosophy?
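
For example, assuming that no userid on your system happens to be named xyzzy (a hypothetical name), the following command displays nothing at all; you are simply returned to the shell prompt:

grep xyzzy /etc/passwd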

When you specify a pattern that contains punctuation or special characters, you should quote them so the shell will interpret the command properly. (See Chapter 13 for a discussion of quoting.) For example, to search a file named info for all the lines that contain a colon followed by a space, use the command:

grep ': ' info

As useful as grep is for searching individual files, where it really comes into its own is in a pipeline. This is because grep can quickly reduce a large amount of raw data into a small amount of useful information. This very important capability makes grep one of the most important programs in the Unix toolbox. Ask any experienced Unix person, and you will find that he or she would not want to live without grep. It will take time for you to appreciate the power of this wonderful program, but we can start with a few simple examples.

When you share a multiuser system with other people, you can use the w program (Chapter 8) to display information about all the users and what they are doing. Here is some sample output:

8:44pm up 9 days, 7:02, 5 users, load: 0.11, 0.02, 0.00
User     tty      login@  idle   JCPU   PCPU  what
tammy    ttyp0   Wed10am 4days  42:41  37:56  -bash
harley   ttyp1    5:47pm        15:11         w
linda    ttyp3    5:41pm    10   2:16     13  -tcsh
casey    ttyp4    4:45pm         1:40   0:36  vi dogstuff
weedly   ttyp5    9:22am  1:40     20      1  gcc catprog.c

Say that you want to display all the users who logged in during the afternoon or evening. You can search for lines of output that contain the pattern "pm". Use the pipeline:

w -h | grep pm

(Notice that I used w with the -h option. This suppresses the header, that is, the first two lines.) Using the above data, the output of the previous command would be:

harley   ttyp1    5:47pm        15:11         w
linda    ttyp3    5:41pm    10   2:16     13  -tcsh
casey    ttyp4    4:45pm         1:40   0:36  vi dogstuff

Suppose we want to display only the userids of the people who logged in during the afternoon and evening. All we have to do is pipe the output of grep to cut (Chapter 17) and extract the first 8 columns of data:

w -h | grep pm | cut -c1-8

The output is:

harley
linda
casey

What about sorting the output? Just pipe it to sort (discussed later in the chapter):

w -h | grep pm | cut -c1-8 | sort

The output is:

casey
harley
linda

What's in a Name?

grep


In the early 1970s, the text editor that was used with the earliest versions of Unix was called ed. Within ed, there was a command that would search a file for all the lines that contained a specified pattern, and then print those lines on the terminal. (In those days, Unix users used terminals that printed output on paper.)

This command was named g, for global, because it was able to search an entire file. When you used g to print all the lines that contained a pattern, the syntax was:

g/re/p

where g stands for "global"; re is a regular expression that describes the pattern you want to search for; and p stands for "print".

It is from this serendipitous abbreviation that the name grep was taken. In other words, grep stands for:

Global: Indicating that grep searches through all of the input data.

Regular Expression: Showing that grep can search for any pattern that can be expressed as a regular expression (discussed in Chapter 20).

Print: Once grep finds what you want, it prints (displays) it for you. As we discussed in Chapter 7, for historical reasons, we often use "print" to mean "display".

Among Unix people, it is common to use "grep" as a verb, in both a technical and non-technical sense. Thus, you might hear someone say, "I lost your email address, so I had to grep all my files to find your phone number." Or, "I grepped my living room several times, and I still can't find the book you lent me."


The Most Important grep Options

The grep program has many options, of which I will discuss only the most important. To start, the -c (count) option displays the number of lines that have been extracted, rather than the lines themselves. Here is an example.

As we will discuss in Chapter 23, the Unix file system uses directories, which are similar to (but not the same as) the folders used with Windows and the Macintosh. A directory can contain both ordinary files and other directories, called subdirectories. For example, a directory might contain 20 files and 3 subdirectories.

As you will see in Chapter 24, you use the ls command to display the names of the files and subdirectories contained in a particular directory. For example, the following command displays the contents of the directory named /etc. (The name /etc will make sense once you read Chapter 23.)

ls /etc

If you run this command on your system, you will see that /etc contains a lot of entries. To see which entries are subdirectories, use the -F option:

ls -F /etc

When you use this option, ls appends a / (slash) character to the end of all subdirectory names. For example, let's say that, within the output, you see:

motd
rc.d/

This means that motd is an ordinary file, and rc.d is a subdirectory.

Suppose you want to count the number of subdirectories in the /etc directory. All you have to do is pipe the output of ls -F(*) to grep -c, and count the slashes:

ls -F /etc | grep -c "/"

* Footnote

By default, ls lists multiple names on each line, to make the output more compact. However, when you pipe the output to another program, ls displays each name on a separate line. If you want to simulate this, use the -1 (the number 1) option, for example:

ls -1 /etc

On my system, the output is:

92

By the way, if you want to count the total entries in a directory, just pipe the output of ls to wc -l (Chapter 18), for example:

ls /etc | wc -l

On my system, there are 242 entries in the /etc directory.

The next option, -i, tells grep to ignore the difference between lower- and uppercase letters when making a comparison. For example, let's say a file named food-costs contains the following five lines:

pizza $25.00
tuna $3.50
Pizza $23.50
PIZZA $21.00
vegetables $18.30

The following command finds all the lines that contain "pizza". Notice that, according to the syntax for grep, the pattern comes before the file name:

grep pizza food-costs

The output consists of a single line:

pizza $25.00

To ignore differences in case, use -i:

grep -i pizza food-costs

This time, the output contains three lines:

pizza $25.00
Pizza $23.50
PIZZA $21.00

— hint —

The -i (ignore) option tells grep to ignore differences between upper- and lowercase. Later in the chapter, we will discuss two other programs, look and sort, that have a similar option. However, with these two programs, you use -f (fold) instead of -i. Don't be confused.

(The word "fold" is a technical term indicating that upper- and lowercase letters should be treated the same. We'll talk about it later.)

Moving on, there will be times when you will want to know the location of the selected lines within the data stream. To do so, you use the -n option. This tells grep to write a relative line number in front of each line of output. (Your data does not have to contain the numbers; grep will count the lines for you as it processes the input.) As an example, consider the following command which uses both the -i and -n options with the file food-costs listed above:

grep -in pizza food-costs

The output is:

1:pizza $25.00
3:Pizza $23.50
4:PIZZA $21.00

The -n option is useful when you need to pin down the exact location of certain lines within a large file. For example, let's say you want to modify all the lines that contain a specific pattern. Once you use grep -n to find the locations of those lines, you can use a text editor to jump directly to where you want to make the changes.
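
For example, the output above shows that the last line containing "pizza" is line 4 of the file food-costs. Most text editors let you start at a particular line. With vi (Chapter 22), for instance, the following command opens food-costs with the cursor positioned on line 4:

vi +4 food-costs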

The next option, -l (list filenames), is useful when you want to search more than one file for a particular pattern. When you use -l, grep does not display the lines that contain the pattern. Instead, grep writes the names of files in which such lines were found.

For example, say you have three files, names, oldnames and newnames. The file names happens to contain "harley"; the file newnames contains "Harley"; and the file oldnames contains neither. To see which files contain the pattern "Harley", you would use:

grep -l Harley names oldnames newnames

The output is:

newnames

Now add in the -i option to ignore differences in case:

grep -il Harley names oldnames newnames

The output is now:

names
newnames

The -L (uppercase "L") option does the opposite of -l. It shows you the files that do not contain a match. In our example, to list the files that do not contain the pattern "Harley", you would use:

grep -L Harley names oldnames newnames

The output is:

names
oldnames

The next option, -w, specifies that you want to search only for complete words. For example, say you have a file named memo that contains the following lines:

We must, of course, make sure that all the
data is now correct before we publish it.
I thought you would know this.

You want to display all the lines that contain the word "now". If you enter:

grep now memo

you will see:

data is now correct before we publish it.
I thought you would know this.

This is because grep selected both "now" and "know". However, if you enter:

grep -w now memo

you will see only the output you want:

data is now correct before we publish it.

The -v (reverse) option selects all the lines that do not contain the specified pattern. This is an especially useful option that you will find yourself using a great deal. As an example, let's say you are a student and you have a file named homework to keep track of your assignments. This file contains one line for each assignment. Once you have finished an assignment, you mark it "DONE". For example:

Math: problems 12-10 to 12-33, due Monday
Basket Weaving: make a 6-inch basket, DONE
Psychology: essay on Animal Existentialism, due end of term
Surfing: catch at least 10 waves, DONE

To list all the assignments that are not yet finished, enter:

grep -v DONE homework

The output is:

Math: problems 12-10 to 12-33, due Monday
Psychology: essay on Animal Existentialism, due end of term

If you want to see the number of assignments that are not finished, combine -c with -v:

grep -cv DONE homework

In this case, the output is:

2

On occasion, you may want to find the lines in which the search pattern consists of the entire line. To do so, use the -x option. For example, say the file names contains the lines:

Harley
Harley Hahn
My friend is Harley.
My other friend is Linda.
Harley

If you want to find all the lines that contain "Harley", use:

grep Harley names

If you want to find only those lines in which "Harley" is the entire line, use the -x option:

grep -x Harley names

In this case, grep will select only the first and last lines.

To search an entire directory tree (see Chapter 23), use the -r (recursive) option. For example, let's say you want to search for the word "initialize" within all the files in the directory named admin, including all subdirectories, all files in those subdirectories, and so on. You would use:

grep -r initialize admin

When you use -r on large directory trees, you will often see error messages telling you that grep cannot read certain files, either because the files don't exist or because you don't have permission to read them. (We will discuss file permissions in Chapter 25.) Typically, you will see one of the following two messages:

No such file or directory
Permission denied

If you don't want to see such messages, use the -s (suppress) option. For example, say you are logged in as superuser, and you want to search all the files on the system for the words "shutdown now".

As we will discuss in Chapter 23, the designation / refers to the root (main) directory of the entire file system. Thus, if we start from the / directory and use the -r (recursive) option, grep will search the entire file system. The command is:

grep -rs 'shutdown now' /

Notice I quoted the search pattern because it contains a space. (Quoting is explained in Chapter 13.)


Variations of grep: fgrep egrep

In the olden days (the 1970s and 1980s), it was common for people to use two other versions of grep: fgrep and egrep.

The fgrep program is a fast version of grep that searches only for "fixed-character" strings. (Hence the name fgrep.) This means that fgrep does not allow the use of regular expressions for matching patterns. When computers were slow and memory was limited, fgrep was more efficient than grep as long as you didn't need regular expressions. Today, computers are fast and have lots of memory, so there is no need to use fgrep. I mention it only for historical reasons.

The egrep program is an extended version of grep. (Hence the name egrep.) The original grep allowed only "basic regular expressions". The egrep program, which came later, supported the more powerful "extended regular expressions". We'll discuss the differences in Chapter 20. For now, all you need to know is that extended regular expressions are better, and you should always use them when you have a choice.

Modern Unix systems allow you to use extended regular expressions by using either egrep or grep -E. However, most experienced Unix users would rather type grep. The solution is to create an alias (see Chapter 13) to change grep to either egrep or grep -E. With the Bourne shell family, you would use one of the following commands:

alias grep='egrep'
alias grep='grep -E'

With the C-Shell family, you would use one of these commands:

alias grep 'egrep'
alias grep 'grep -E'

Once you define such an alias, you can type grep and get the full functionality of extended regular expressions. To make the change permanent, all you need to do is put the appropriate alias command into your environment file (see Chapter 14). Indeed, this is such a useful alias that I suggest you take a moment right now and add it to your environment file. In fact, when you get to Chapter 20, I will assume you are using extended regular expressions.
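
If you want to confirm that your alias is in effect, you can ask the shell. With the Bourne Shell family, use the type command; with the C-Shell family, type alias followed by the name:

type grep
alias grep

With bash, for example, the output will be something like:

grep is aliased to `grep -E'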

Note: If you use Solaris (from Sun Microsystems), the version of egrep you want is in a special directory named /usr/xpg4/bin/(*), which means you must use different aliases. The examples below are only for Solaris. The first one is for the Bourne Shell family; the second is for the C-Shell family:

alias grep='/usr/xpg4/bin/egrep'
alias grep '/usr/xpg4/bin/egrep'

* Footnote

The name xpg4 stands for "X/Open Portability Guide, Issue 4", an old (1992) standard for how Unix systems should behave. The programs in this directory have been modified to behave in accordance with the XPG4 standard.


Selecting Lines Beginning With a Specific Pattern: look

Related filters: grep

The look program searches data that is in alphabetical order and finds all the lines that begin with a specified pattern.

There are two ways to use look. You can use sorted data from one or more files, or you can have look search a dictionary file (explained later in the chapter).

When you use look to search one or more files, the syntax is:

look [-df] pattern [file...]

where pattern is the pattern to search for, and file is the name of a file.

Here is an example. You are a student at a school where, every term, all the students evaluate their professors. This term, you are in charge of the project. You have a large file called evaluations that contains a summary of the evaluations for over a hundred professors. The data is in alphabetical order. Each line of the file contains a ranking (A, B, C, D or F), followed by two spaces, followed by the name of a professor. For example:

A  William Wisenheimer
C  Peter Pedant
F  Norman Knowitall

Your job is to create five lists to post on a Web site. The lists should contain the names of the professors who received an A rating, a B rating, and so on. Since the data is in alphabetical order, you can create the first list (the A professors) by using look to select all the lines of the file that begin with A:

look A evaluations

Although this command will do the job, we can improve upon it. As I mentioned, each line in the data file begins with a single-letter ranking, followed by two spaces. Once you have the names you want, you can use colrm (Chapter 16) to remove the first three characters of each line. The following examples do just that for each of the rankings: they select the appropriate lines from the data file, use colrm to remove the first three characters from each line, and then redirect the output to a file:

look A evaluations | colrm 1 3 > a-professors
look B evaluations | colrm 1 3 > b-professors
look C evaluations | colrm 1 3 > c-professors
look D evaluations | colrm 1 3 > d-professors
look F evaluations | colrm 1 3 > f-professors

Unlike the other programs covered in this chapter, look cannot read from the standard input: it must take its input from one or more files. This means that, strictly speaking, look is not a filter.

The reason for this restriction is that, with standard input, a program can read only one line at a time. However, look uses a search method called a "binary search" that requires access to all the data at once. For this reason, you cannot use look within a pipeline, although you can use it at the beginning of a pipeline.
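
For example, since look writes its output to standard output, you can pipe it to another filter. Using the evaluations file described above, the following command counts how many professors received an A rating by piping the output of look to wc -l (Chapter 18):

look A evaluations | wc -l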

When you have multiple steps, the best strategy is to prepare your data, save it in a file, and then use look to search the file. For example, let's say the four files frosh, soph, junior and senior contain the raw, unsorted evaluation data as described above. Before you can use look to search the data, you must combine and sort the contents of the four files and save the output in a new file, for example:

sort frosh soph junior senior > evaluations
look A evaluations

We will discuss the sort program later in the chapter. At that time, you will learn about two particular options that are relevant to look. The -d (dictionary) option tells sort to consider only letters and numbers. You use -d when you want look to ignore punctuation and other special characters. The -f (fold) option tells sort to ignore differences between upper- and lowercase letters. For example, when you use -f, "Harley" and "harley" are considered the same.

If you use either of these sort options to prepare data, you must use the same options with look, so look will know what type of data to expect. For example:

sort -df frosh soph junior senior > evaluations
look -df A evaluations


When Do You Use look and When Do You Use grep?

Both look and grep select lines from text files based on a specified pattern. For this reason, it makes sense to ask, when do you use look and when do you use grep?

Similar questions arise in many situations, because Unix often offers more than one way to solve a problem. For this reason, it is important to be able to analyze your options wisely, so as to pick the best tool for the job at hand. As an example, let us compare look and grep.

The look program is limited in three important ways. First, it requires sorted input; second, it can read only from a file, not from standard input; third, it can search for patterns only at the beginning of a line. However, within the scope of these limitations, look has two advantages: it is simple to use, and it is very fast.

The grep program is a lot more flexible: it does not require sorted input; it can read either from a file or from standard input (which means you can use it in the middle of a pipeline); and it can search for a pattern anywhere, not just at the beginning of a line.

Moreover, grep allows "regular expressions", which enable you to specify generalized patterns, not just simple characters. For example, you can search for "the letters har, followed by one or more characters, followed by the letters ley, followed by zero or more numbers". (Regular expressions are very powerful, and we will talk about them in detail in Chapter 20.)

By using regular expressions, it is possible to make grep do anything look can do. However, grep will be slower, and the syntax is more awkward.
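
To give you a taste, the grep equivalent of the command look A evaluations is shown below. The ^ (circumflex) character is a regular expression that matches the beginning of a line (see Chapter 20):

grep '^A' evaluations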

So here is my advice: Whenever you need to select lines from a file, ask yourself if look can do the job. If so, use it, because look is fast and simple. If look can't do the job (which will be most of the time), use grep. As a general rule, you should always use the simplest possible solution to solve a problem.

But what about speed? I mentioned that look is faster than grep. How important is that? In the early days of Unix, speed was an important consideration, as Unix systems were shared with other users and computers were relatively slow. When you selected lines of text from a very large file, you could actually notice the difference between look and grep.

Today, however, virtually all Unix systems run on computers which, for practical purposes, are blindingly fast. Thus, the speed at which Unix executes a single command — at least for the commands in this chapter — is irrelevant. Indeed, every example in this chapter will run so quickly as to seem instantaneous. More specifically, if you compare a look command to the equivalent grep command, there is no way you are going to notice the difference in speed.

So my advice is to choose your tools based on simplicity and ease of use, not on tiny differences in speed or efficiency. This is especially important when you are writing programs, including shell scripts. If a program or script is too slow, it is usually possible to find one or two bottlenecks and speed them up. However, if a program is unnecessarily complex or difficult to use, it will, in the long run, waste a lot of your time, which is far more valuable than computer time.

— hint —

Whenever you have a choice of tools, use the simplest one that will do the job.


Finding All the Words That Begin With a Specific Pattern: look

I mentioned earlier that you can use look to search a dictionary file. You do so when you want to find all the words that begin with a specific pattern, for example, all the words that begin with the letters "simult". When you use look in this way, the syntax is simple:

look pattern

where pattern is the pattern to search for.

The "dictionary file" is not an actual dictionary. It is a long, comprehensive list of words that has existed since the early versions of Unix. (Of course, the list has been updated over the years.) The words in the dictionary file are in alphabetical order, one word per line, which makes it easy to search the file using look.

The dictionary file was originally created for use with a program named spell, which provided a crude way to spellcheck documents. The job of spell was to display a list of all the words within a document that were not in the dictionary file. In the olden days, spell could save you a lot of time by finding possible spelling mistakes.

Today, there are much better spellcheck tools and spell is rarely used: indeed, you won't even find it on most Linux or Unix systems. Instead, people use either the spellcheck feature within their word processor or, with text files, an interactive program called aspell, which is one of the GNU utilities. If you want to try aspell, use:

aspell -c file

where file is the name of a file containing plain text. The -c option indicates that you want to check the spelling of the words in the file.

Although spell is not used anymore, the dictionary file still exists, and you can use it in a variety of ways. In particular, you can use the look program to find all the words that begin with a specific pattern. This comes in handy when you are having trouble spelling a word. For example, say that you want to type the word "simultaneous", but you are not sure how to spell it. Enter:

look simult

You will see a list similar to the following:

simultaneity
simultaneous
simultaneously
simultaneousness
simulty

It is now a simple task to pick out the correct word and — if you wish — to copy and paste it from one window to another. (See Chapter 6 for instructions on how to copy and paste.)

We'll talk about the dictionary file again in Chapter 20, at which time I'll show you where to find the actual file on your system, and how to use it to help solve word puzzles.

(By the way, a "simulty" is a private grudge or quarrel.)

— hint —

When you are working with the vi text editor (see Chapter 22), you can display a list of words by using :r! to issue a quick look command. For example:

:r !look simult

This command inserts all the words that begin with "simult" into your editing buffer. You can now choose the word you want and delete all the others.


Sorting Data: sort

Related filters: tsort, uniq

The sort program can perform two related tasks: sorting data, and checking to see if data is already sorted. We'll start with the basics. The syntax for sorting data is:

sort [-dfnru] [-o outfile] [infile...]

where outfile is the name of a file to hold the output, and infile is the name of a file that contains input.

The sort program has a great deal of flexibility. You can compare either entire lines or selected portions of each line (fields). The simplest way to use sort is to sort a single file, compare entire lines, and display the results on your screen. As an example, let's say you have a file called names that contains the following four lines:

Barbara
Al
Dave
Charles

To sort this data and display the results, enter:

sort names

You will see:

Al
Barbara
Charles
Dave

To save the sorted data to a file named masterfile, you can redirect the standard output:

sort names > masterfile

This last example saves the sorted data in a new file. However, there will be many times when you want to save the data in the same file. That is, you will want to replace a file with the same data in sorted order. Unfortunately, you cannot use a command that redirects the output to the input file:

sort names > names

You will recall that, in Chapter 15, I explained that when you redirect the standard output, the shell sets up the output file before running the command. In this case, since names is the output file, the shell will empty it before running the sort command. Thus, by the time sort is ready to read its input, names will be empty. As a result, entering this command would silently wipe out the contents of your input file (*).

* Footnote

Unless you have set the noclobber shell variable. See Chapter 15.

For this reason, sort provides a special option to allow you to save your output to any file you want. Use -o (output) followed by the name of your output file. If the output file is the same as one of your input files, sort will make sure to protect your data. Thus, to sort a file and save the output in the same file, use a command like the following:

sort -o names names

In this case, the original data in names will be preserved until the sort is complete. The output will then be written to the file.

To sort data from more than one file, just specify more than one input file name. For example, to sort the combined contents of the files oldnames, names and extranames, and save the output in the file masterfile, use:

sort oldnames names extranames > masterfile

To sort these same files while saving the output in names (one of the input files), use:

sort -o names oldnames names extranames

The sort program is often used as part of a pipeline to process data that has been produced by another program. The following example combines two files, extracts only those lines that contain the characters "Harley", sorts those lines, and then sends the output to less to be displayed:

cat newnames oldnames | grep Harley | sort | less

By default, sort looks at the entire line when it sorts data. However, if you want, you can tell sort to examine only one or more fields, that is, parts of each line. (We discussed the concept of fields in Chapter 17, when we talked about the cut program.) The options that allow you to use fields with sort afford a great deal of control. However, they are very complex, and I won't go into the details here. If you ever find yourself needing to sort with fields, you will find the details in the Info file (info sort). If your system doesn't have Info files (see Chapter 9), the details will be in the man page instead (man sort).
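
Still, to give you a rough idea, here is a minimal sketch; treat it as a preview only, and check the documentation before depending on it. On most modern systems, the -k option tells sort which field to start comparing at. For example, using the evaluations file from earlier in the chapter, in which the ranking is field 1 and the professor's first name is field 2, the following command sorts the lines by name rather than by ranking:

sort -k2 evaluations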


Controlling the Order in Which Data Is Sorted: sort -dfn

There are a number of options you can use to control how the sort program works:

The -d (dictionary) option looks only at letters, numerals and whitespace (spaces and tabs). Use this option when your data contains characters that will get in the way of the sorting process, such as punctuation.

The -f (fold) option treats lowercase letters as if they were uppercase. Use this option when you want to ignore the distinctions between upper- and lowercase letters. For example, when you use -f, the words harley and Harley are considered to be the same as HARLEY. (The term "fold" is explained below.)

The -n (numeric) option recognizes numbers at the beginning of a line or a field and sorts them numerically. Such numbers may include leading spaces, negative signs and decimal points. Use this option to tell sort that you are using numeric data. For example, let's say you want to sort:

11
2
1
20
10

If you use sort with no options, the output is:

1
10
11
2
20

If you use sort -n, you get:

1
2
10
11
20

The -r (reverse) option sorts the data in reverse order. For example, if you sort the data in the last example using sort -nr, the output is:

20
11
10
2
1

In my experience, you will find yourself using the -r option a lot more than you might think. This is because it is useful to be able to list information in reverse alphabetical order or reverse numeric order.
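
For example, to display the contents of the names file (used earlier in the chapter) in reverse alphabetical order, you would use:

sort -r names

The output is:

Dave
Charles
Barbara
Al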

The final option, -u (unique), tells sort to check for identical lines and suppress all but one. For example, let's say you use sort -u to sort the following data:

Barbara
Al
Barbara
Barbara
Dave

The output is:

Al
Barbara
Dave

— hint —

As an alternative to sort -u, you can use uniq (discussed later in the chapter). The uniq program is simpler but, unlike sort, it does not let you work with specific fields should that be necessary.

What's in a Name?

Fold


There are a variety of Unix programs that have an option to ignore the differences between upper- and lowercase letters. Sometimes, the option is called -i, for "ignore", which only makes sense. Much of the time, however, the option is -f, which stands for FOLD: a technical term indicating that lowercase letters are to be treated as if they were uppercase, or vice versa, without changing the original data. (The use of the term "fold" in this way has nothing to do with the fold program, so don't be confused.)

The term "fold" is most often used as an adjective: "To make sort case insensitive, use the fold option." At times, however, you will see "fold" used as a verb: "When you use the -f option, sort folds lowercase letters into uppercase."

Here is something interesting: the original version of the Unix sort program folded uppercase letters into lowercase. That is, when you used -f, sort treated all letters as if they were lowercase. Modern versions of sort fold lowercase into uppercase. That is, they treat all letters as if they were uppercase. Is the difference significant? The answer is, sometimes, as you will see when we discuss collating sequences.

By the way, no one knows the origin of the term "fold", so feel free to make up your own metaphor.


Checking If Data Is Sorted: sort -c

As I mentioned earlier, sort can perform two related tasks: sorting data, and checking to see if data is already sorted. In this section, we'll talk about checking data. When you use sort in this way, the syntax is:

sort -c[u] [file]

where file is the name of a file.

The -c (check) option tells sort that you don't want to sort the data, you only want to know if it is already sorted. For example, to see if the data within the file names is sorted, you would use:

sort -c names

If the data is sorted, sort will display nothing. (No news is good news.) If the data is not sorted, sort will display a message, for example:

sort: names:5: disorder: Polly Ester

In this case, the message means that the data in names is not sorted (that is, there is "disorder"), starting with line 5, which contains the data Polly Ester.

You can use sort -c within a pipeline to check data that has been written to standard output by another program. For example, let's say you have a program named poetry-generator that generates a large amount of output. The output is supposed to be sorted, but you suspect there may be a problem, so you check it with sort -c:

poetry-generator | sort -c

If you combine -c with the -u (unique) option, sort will check your data in two ways at the same time. While it is looking for unsorted data, it will also look for consecutive lines that are the same. You use -cu when you want to ensure (1) your data is sorted, and (2) all the lines are unique. For example, the file friends contains the following data:

Al Packa
Max Out
Patty Cake
Patty Cake
Shirley U. Jest

You enter:

sort -cu friends

Although the data is sorted, sort detects a duplicate line:

sort: friends:4: disorder: Patty Cake


The ASCII Code; Collating Sequences

Suppose you use the sort program to sort the following data. What will the output be?

zzz
ZZZ
bbb
BBB
aaa
AAA

On some systems, you will get:

AAA
BBB
ZZZ
aaa
bbb
zzz

On other systems, you will get:

AAA
aaa
BBB
bbb
ZZZ
zzz

How can this be? In the early days of Unix, there was just one way of organizing characters. Today, this is not the case, and the results you see when you run sort depend on how characters are organized on your particular system. Here is the story.

Before the 1990s, the character encoding used by Unix (and most computer systems) was the ASCII CODE, often referred to as ASCII. The name stands for "American Standard Code for Information Interchange".

The ASCII code was created in 1967. It specifies a 7-bit pattern for every character, 128 in all. These bit patterns range from 0000000 (0 in decimal) to 1111111 (127 in decimal). For this reason, the 128 ASCII characters are numbered from 0 to 127.

The 128 characters that comprise the ASCII code consist of 33 "control characters" and 95 "printable characters". The control characters were discussed in Chapter 7. The printable characters, shown below, are the 52 letters of the alphabet (26 uppercase, 26 lowercase), 10 numbers, 32 punctuation symbols, and the space character (listed first below):

 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~

The order of the printable characters is the order in which I have listed them. They range from character #32 (space) to character #126 (tilde). (Remember, numbering starts at 0, so the space is actually the 33rd character.) For reference, Appendix D contains a table of the entire ASCII code. You may want to take a moment and look at it now.

For practical purposes, it is convenient to consider the tab to be a printable character even though, strictly speaking, it is actually a control character. The tab is character #9, which places it before the other printable characters. Thus, I offer the following definition: the 96 PRINTABLE CHARACTERS are the tab, space, punctuation symbols, numbers, and letters.

As a convenience, most Unix systems have a reference page showing the ASCII code to allow you to look at it quickly whenever you want. Unfortunately, the ASCII reference page is not standardized, so the way in which you display it depends on which system you are using. See Figure 19-1 for the details.

Figure 19-1: Displaying the ASCII code

You will find a summary of the ASCII code in Appendix D of this book. For online reference, most Unix systems have a handy page containing the entire ASCII code. Traditionally, this page was stored in a file named ascii in the directory /usr/pub/. In recent years, the Unix file system has been reorganized on some systems, and the ASCII reference file has been moved to /usr/share/misc. On other systems, the file has been converted to a page within the online manual. Thus, the way in which you display the ASCII reference page depends on the system you are using.

Type of Unix   Command to Display ASCII Code Page
Linux          man ascii
FreeBSD        less /usr/share/misc/ascii
Solaris        less /usr/pub/ascii  or  man ascii

With respect to a character coding scheme, the order in which the characters are organized is called the COLLATING SEQUENCE. The collating sequence is used whenever you need to put characters in order; for example, when you use the sort program, or when you use a range within a regular expression (discussed in Chapter 20).

With the ASCII code, the collating sequence is simply the order in which the characters appear in the code. This is summarized in Figure 19-2. For a more detailed reference, see Appendix D.

Figure 19-2: The order of characters in the ASCII code

The ASCII code defines the 128 basic characters used by Unix systems. Within the ASCII code, the characters are numbered 0 through 127. The table in this figure summarizes the order of the characters, which is important when you use a program like sort. For example, when you sort text, a space comes before "%" (percent), which comes before the number "3", which comes before the letter "A", and so on.

For a more detailed reference, see Appendix D.

Numbers    Characters
0-31       control characters (including tab)
32         space character
33-47      symbols: ! " # $ % & ' ( ) * + , - . /
48-57      numbers: 0 1 2 3 4 5 6 7 8 9
58-64      more symbols: : ; < = > ? @
65-90      uppercase letters: A B C ... Z
91-96      more symbols: [ \ ] ^ _ `
97-122     lowercase letters: a b c ... z
123-126    more symbols: { | } ~
127        control character (del)

It is important to be familiar with the ASCII code collating sequence, as it is used by default on many Unix systems and programming languages. Although you don't have to memorize the entire ASCII code, you do need to memorize three basic principles:

• Spaces come before numbers.
• Numbers come before uppercase letters.
• Uppercase letters come before lowercase letters.

Here is an example. Assume that your system uses the ASCII collating sequence. You use the sort program to sort the following data (in which the third line starts with a space):

hello
Hello
 hello
1hello
:hello

The output is:

 hello
1hello
:hello
Hello
hello

— hint —

When it comes to the order of characters in the ASCII code, all you need to memorize is: Space, Numbers, Uppercase letters, and Lowercase letters, in that order.

Just remember "SNUL".(*)

* Footnote

If you have trouble remembering the acronym SNUL, let me show you a memory trick used by many smart people. All you need to do is relate the item you want to remember to your everyday life.

For example, let's say you are a mathematician specializing in difference calculus, and you happen to be working with fourth order difference equations satisfied by those Laguerre-Hahn polynomials that are orthogonal on special non-uniform lattices. To remember SNUL, you would just think of "special non- uniform lattices".

See how easy it is to be smart?


Locales and Collating Sequences

In the early days of Unix, everyone used the ASCII code and that was that. However, ASCII is based on English and, as the use of Unix, Linux and the Internet spread throughout the world, it became necessary to devise a system that would work with a large number of languages and a variety of cultural conventions.

In the 1990s, a new system was developed, based on the idea of a "locale", part of the POSIX 1003.2 standard. (POSIX is discussed in Chapters 11 and 16.) A LOCALE is a technical specification describing the language and conventions that should be used when communicating with a user from a particular culture. The intention is that a user can choose whichever locale he wants, and the programs he runs will communicate with him accordingly. For example, if a user chooses the American English locale, his programs should display messages in English, write dates in the format "month-day-year", use "$" as a currency symbol, and so on.

Within Unix, your locale is defined by a set of environment variables that identify your language, your date format, your time format, your currency symbol, and other cultural conventions. Whenever a program needs to know your preferences, all it has to do is look at the appropriate environment variables. In particular, there is an environment variable named LC_COLLATE that specifies which collating sequence you want to use. (The variables all have default values, which you can change if you want.)

To display the current value of all the locale variables on your system — including LC_COLLATE — you use the locale command:

locale

If you are wondering which locales are supported on your system, you can display them all by using the -a (all) option:

locale -a
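
Since this chapter is about filters, you can probably guess how to display only the one line you care about: pipe the output of locale to grep:

locale | grep LC_COLLATE

The exact value varies from one system to another. On a system that uses the en_US locale, for example, you might see something like:

LC_COLLATE="en_US.UTF-8"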

In the United States, Unix systems default to one of two locales. The two locales are basically the same, but have different collating sequences, which means that when you run a program such as sort, your results can vary depending on which locale is being used.

Since many people are unaware of locales, even experienced programmers can be perplexed when they change from one Unix system to another and, all of a sudden, programs like sort do not behave "properly". For this reason, I am going to take a moment to discuss the two American locales and explain what you need to know about them. If you live outside the U.S., the ideas will still apply, but the details will be different.

The first American locale is based on the ASCII code. This locale has two names. It is known as either the C locale (named after the C programming language) or the POSIX locale: you can use whichever name you want. The second American locale is based on American English, and is named en_US, although you will see variations of this name.

The C locale was designed for compatibility, in order to preserve the conventions used by old-time programs (and old-time programmers). The en_US locale was designed to fit into a modern international framework in which American English is only one of many different languages.

As I mentioned, both these locales are the same except for the collating sequence. The C locale uses the ASCII collating sequence, in which uppercase letters come before lowercase letters: ABC...XYZ abc...xyz. This pattern is called the C COLLATING SEQUENCE, because it is used by the C programming language.

The en_US locale uses a different collating sequence in which the lowercase letters and uppercase letters are grouped in pairs: aAbBcCdD... zZ. This pattern is more natural, as it organizes words and characters in the same order as you would find in a dictionary. For this reason, this pattern is called the DICTIONARY COLLATING SEQUENCE.

Until the end of the 1990s, all Unix systems used the C collating sequence, based on the ASCII code, and this is still the case with the systems that use the C/POSIX locale. Today, however, some Unix systems, including a few Linux distributions, are designed to have a more international flavor. As such, they use the en_US locale and the dictionary collating sequence.

Can you see a possible source of confusion? Whenever you run a program that depends on the order of upper- and lowercase letters, the output is affected by your collating sequence. Thus, you can get different results depending on which locale your system uses by default. This may happen, for example, when you use the sort program, or when you use certain types of regular expressions called "character classes" (see Chapter 20).

For reference, Figure 19-3 shows the two collating sequences. Notice that there are significant differences, not only in the order of the letters, but in the order of the punctuation symbols.

Figure 19-3: Collating sequences for the C and en_US locales

In the United States, Unix and Linux systems use one of two locales: C/POSIX based on the ASCII code, or en_US based on American English. The following two charts show the collating sequences for each of these locales. In the C collating sequence (used with the C locale), the numbers, lowercase letters, and uppercase letters are separated by symbols. In the dictionary collating sequence (used with the en_US locale), all the symbols come at the beginning, followed by the numbers and letters.

Note: In both charts, I have used a dot (•) to indicate the space character.

C Locale: C Collating Sequence

space character      •
symbols              ! " # $ % & ' ( ) * + , - . /
numbers              0 1 2 3 4 5 6 7 8 9
more symbols         : ; < = > ? @
uppercase letters    A B C D E F G H I J K L M
                     N O P Q R S T U V W X Y Z
more symbols         [ \ ] ^ _ `
lowercase letters    a b c d e f g h i j k l m
                     n o p q r s t u v w x y z
more symbols         { | } ~

en_US Locale: Dictionary Collating Sequence

symbols              ` ^ ~ < = > |
space character      •
more symbols         _ - , ; : ! ? / . ' " ( )
                     [ ] { } @ $ * \ & # % +
numbers              0 1 2 3 4 5 6 7 8 9
letters              a A b B c C d D e E f F g G
                     h H i I j J k K l L m M n N
                     o O p P q Q r R s S t T u U
                     v V w W x X y Y z Z

As an example of how your choice of locale can make a difference, consider what happens when you sort the following data (in which the third line starts with a space):

hello
Hello
 hello
1hello
:hello

With the C locale (C collating sequence), the output is:

 hello
1hello
:hello
Hello
hello

With the en_US locale (dictionary collating sequence), the output is:

 hello
:hello
1hello
hello
Hello

So which locale should you use? In my experience, if you use the en_US locale, you will eventually encounter unexpected problems that will be difficult to track down. For example, as we will discuss in Chapter 25, you use the rm (remove) program to delete files. Let's say you want to delete all your files whose names begin with an uppercase letter. The traditional Unix command to use is:

rm [A-Z]*

This will work fine if you are using the C locale. However, if you are using the en_US locale, you will end up deleting all the files whose names begin with any upper- or lowercase letter, except the letter a. (Don't worry about the details; they will be explained in Chapter 20.)(*)

* Footnote

There are lots of situations in which the C locale works better than the en_US locale. Here is another one: You are writing a C or C++ program. In your directory, you have files containing code with names that have all lowercase letters, such as program1.c, program2.cpp, data.h, and so on. You also have extra files with names that begin with an uppercase letter, such as Makefile, RCS, README. When you list the contents of the directory using the ls program (Chapter 24), all the "uppercase" files will be listed first, separating the extra files from the code files.

My advice is to set your default to be the C locale, because it uses the traditional ASCII collating sequence. In the long run, this will create fewer problems than using the en_US locale and the dictionary collating sequence. In fact, as you read this book, I assume that you are using the C locale.

So how do you specify your locale? The first step is to determine which collating sequence is the default on your system. If the C locale is already the default, fine. If not, you need to change it.

One way to determine your default collating sequence is to enter the locale command and check the value of the LC_COLLATE environment variable. Is it C or POSIX? Or is it some variation of en_US?

Another way to determine your default collating sequence is to perform the following short test. Create a small file named data using the command:

cat > data

Type the following three lines and then press ^D to end the command:

AAA
[]
aaa

Now sort the contents of the file:

sort data

If you are using the C/POSIX locale, the output will be sorted using the C (ASCII) collating sequence:

AAA
[]
aaa

If you are using the en_US locale, the output will be sorted using the dictionary collating sequence:

[]
aaa
AAA

Before you continue, take a moment to look at the collating sequences in Figure 19-3 and make sure these examples make sense to you.

If your Unix system uses the C or POSIX locale by default, you don't need to do anything. (However, please read through the rest of this section, as one day, you will encounter this problem on another system.)

If your system uses the en_US locale, you need to change the LC_COLLATE environment variable to either C or POSIX. Either of the following commands will do the job with the Bourne Shell family:

export LC_COLLATE=C
export LC_COLLATE=POSIX

With the C-Shell family, you would use:

setenv LC_COLLATE C
setenv LC_COLLATE POSIX

To make the change permanent, all you need to do is put one of these commands into your login file. (Environment variables are discussed in Chapter 12; the login file is discussed in Chapter 14.) For the rest of this book, I will assume that you are, indeed, using the C collating sequence so, if you are not, put the appropriate command in your login file right now.

— hint —

From time to time, you may want to run a single program with a collating sequence that is different from the default. To do so, you can use a subshell to change the value of LC_COLLATE temporarily while you run the program. (We discuss subshells in Chapter 15.)

For example, let's say you are using the C locale, and you want to run the sort program using the en_US (dictionary) collating sequence. You can use:

(export LC_COLLATE=en_US; sort data)

When you run the program in this way, the change you make to LC_COLLATE is temporary, because it exists only within the subshell.


Finding Duplicate Lines: uniq

Related filters: sort

Unix has a number of specialized filters designed to work with sorted data. The most useful such filter is uniq, which examines data line by line, looking for consecutive, duplicate lines.

The uniq program can perform four different tasks:

• Eliminate duplicate lines
• Select duplicate lines
• Select unique lines
• Count the number of duplicate lines

The syntax is:

uniq [-cdu] [infile [outfile]]

where infile is the name of an input file, and outfile is the name of an output file.

Let's start with a simple example. The file data contains the following lines:

Al
Al
Barbara
Barbara
Charles

(Remember, because input for uniq must be sorted, duplicate lines will be consecutive.) You want a list of all the lines in the file with no duplications. The command to use is:

uniq data

The output is straightforward:

Al
Barbara
Charles

If you want to save the output to another file, say, processed-data, you can specify its name as part of the command:

uniq data processed-data

To see only the duplicate lines, use the -d option:

uniq -d data

Using our sample file, the output is:

Al
Barbara

To see only the unique (non-duplicate) lines, use -u:

uniq -u data

In our sample, there is only one such line:

Charles

Question: What do you think happens if you use both -d and -u at the same time? (Try it and see.)

To count how many times each line appears, use the -c option:

uniq -c data

With our sample, the output is:

2 Al
2 Barbara
1 Charles

So far, our example has been simple. The real power of uniq comes when you use it within a pipeline. For example, it is common to combine and sort several files, and then pipe the output to uniq, as in the following two examples:

sort file1 file2 file3 | uniq
cat file1 file2 file3 | sort | uniq

— hint —

If you are using uniq without options, you have an alternative. You can use sort -u instead. For example, the following three commands all have the same effect:

sort -u file1 file2 file3
sort file1 file2 file3 | uniq
cat file1 file2 file3 | sort | uniq

(See the discussion on sort -u earlier in the chapter.)

Here is a real-life example to show you how powerful such constructions can be.

Ashley is a student at a large Southern California school. During the upcoming winter break, her cousin Jessica will be coming to visit from the East Coast, where she goes to a small, progressive liberal arts school. Jessica wants to meet guys, but she is very picky: she only likes very smart guys who are athletic.

It happens that Ashley is on her sorority's Ethics Committee, which gives her access to the student/academic database (don't ask). Using her special status, Ashley logs into the system and creates two files. The first file, math237, contains the names of all the male students taking Math 237 (Advanced Calculus). The second file, pe35, contains the names of all the male students taking Physical Education 35 (Surfing Appreciation).

Ashley's idea is to make a list of possible dates for Jessica by finding all the guys who are taking both courses. Because the files are too large to compare by hand, Ashley (who is both beautiful and smart) uses Unix. Specifically, she uses the uniq program with the -d option, saving the output to a file named possible-guys:

sort math237 pe35 | uniq -d > possible-guys

Ashley then emails the list to Jessica, who is able to check out the guys on Instagram before her trip.

— hint —

You must always make sure that input to uniq is sorted. If not, uniq will not be able to detect the duplications. The results will not be what you expect, but there will be no error message to warn you that something has gone wrong.(*)

* Footnote

This is one time where — even for Ashley and Jessica — a "sorted" affair is considered to be a good thing.
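
If you would like to see this behavior for yourself, here is a quick demonstration you can type in directly. (The printf command simply generates three lines of sample data.)

printf 'Al\nBarbara\nAl\n' | uniq
printf 'Al\nBarbara\nAl\n' | sort | uniq

The first command displays all three lines: the two Al lines are not consecutive, so uniq does not recognize them as duplicates. The second command sorts the data first, so the output is only Al and Barbara.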

Jump to top of page

Merging Sorted Data From Two Files: join

Related filters: colrm, cut, paste

Of all the specialized Unix filters designed to work with sorted data, the most interesting is join, which combines two sorted files based on the values of a particular field. The syntax is:

join [-i] [-a1|-v1] [-a2|-v2] [-1 field1] [-2 field2] file1 file2

where field1 and field2 are numbers referring to specific fields, and file1 and file2 are the names of files containing sorted data.

Before we get to the details, I'd like to show you an example. Let's say you have two sorted files containing information about various people, each of whom has a unique identification number. Within the first file, called names, each line contains an ID number followed by a first name and last name:

111 Hugh Mungus
222 Stew Pendous
333 Mick Stup
444 Melon Collie

In the second file, phone, each line contains an ID number followed by a phone number:

111 101-555-1111
222 202-555-2222
333 303-555-3333
444 404-555-4444

The join program allows you to combine the two files, based on their common values, in this case, the ID number:

join names phone

The output is:

111 Hugh Mungus 101-555-1111
222 Stew Pendous 202-555-2222
333 Mick Stup 303-555-3333
444 Melon Collie 404-555-4444

When join reads its input, it ignores leading whitespace, that is, spaces or tabs at the beginning of a line. For example, the following two lines are considered the same:

111 Hugh Mungus 101-555-1111
    111 Hugh Mungus 101-555-1111

Before we discuss the details of the join program, I'd like to take a moment to go over some terminology. In Chapter 17, we talked about the ideas of fields and delimiters. When you have a file in which every line contains a data record, each separate item within the line is called a field. In our example, each line in the file names contains three fields: an ID number, a first name, and a last name. The file phone contains two fields: an ID number and a phone number.

Within each line, the characters that separate fields are called delimiters. In our example, the delimiters are spaces, although you will often see tabs and commas used in this way. By default, join assumes that each pair of fields is separated by whitespace, that is, by one or more spaces or tabs.

When we combine two sets of data based on matching fields, it is called a JOIN. (The name comes from database theory.) The specific field used for the match is called the JOIN FIELD. By default, join assumes that the join field is the first field of each file but, as you will see in a moment, this can be changed.

To create a join, the program looks for pairs of lines, one from each file, that contain the same value in their join field. For each pair, join generates an output line consisting of three parts: the common join field value, the rest of the line from the first file, and the rest of the line from the second file.

As an example, consider the first line in each of the two files above. The join field has the value 111. Thus, the first line of output consists of 111, a space, Hugh, a space, Mungus, a space, and 101-555-1111. (By default, join uses a single space to separate fields in the output.)

In the example above, every line in the first file matches a line in the second file. However, this might not always be the case. For example, consider the following files. You are making a list of your friends' birthdays and their favorite gifts. The first file, birthdays, contains two fields: first name and birthday:

Al       May-10-1987
Barbara  Feb-2-1992
Dave     Apr-8-1990
Frances  Oct-15-1991
George   Jan-17-1992

The second file, gifts, also contains two fields: first name and favorite gift:

Al        money
Barbara   chocolate
Charles   music
Dave      books
Edward    camera

In this case, you have birthday information and gift information for Al, Barbara and Dave. However, you do not have gift information for Frances and George, and you do not have birthday information for Edward. Consider what happens when you use join:

join birthdays gifts

Because only three lines have matching join fields (the lines for Al, Barbara and Dave), there are only three lines of output:

Al May-10-1987 money
Barbara Feb-2-1992 chocolate
Dave Apr-8-1990 books

However, suppose you want to see all the people with birthday information, even if they do not have gift information. You can use the -a (all) option, followed by a 1:

join -a1 birthdays gifts

This tells join to output all the names in file #1, even if there is no gift information:

Al May-10-1987 money
Barbara Feb-2-1992 chocolate
Dave Apr-8-1990 books
Frances Oct-15-1991
George Jan-17-1992

Similarly, if you want to see all the people with gift information (from file #2), even if they do not have birthday information, you can use -a2:

join -a2 birthdays gifts

The output is:

Al May-10-1987 money
Barbara Feb-2-1992 chocolate
Charles music
Dave Apr-8-1990 books
Edward camera

To list all the names from both files, use both options:

join -a1 -a2 birthdays gifts

The output is:

Al May-10-1987 money
Barbara Feb-2-1992 chocolate
Charles music
Dave Apr-8-1990 books
Edward camera
Frances Oct-15-1991
George Jan-17-1992

When you use join in the regular way (without the -a option) as we did in our first example, the result is called an INNER JOIN. (The term comes from database theory.) With an inner join, the output comes only from lines where the join field matched.

When you use either -a1 or -a2, the output includes lines in which the join field did not match. We call this an OUTER JOIN.

I won't go into the details because a discussion of database theory, however interesting, is beyond the scope of this book. All I want you to remember is that, if you work with what are called "relational databases", the distinction between inner and outer joins is important.

To continue, if you want to see only those lines that don't match, you can use the -v1 or -v2 (reverse) options. When you use -v1, join outputs only those lines from file #1 that don't match, leaving out all the matches. For example:

join -v1 birthdays gifts

The output is:

Frances Oct-15-1991
George Jan-17-1992

When you use -v2, you get only those lines from file #2 that don't match:

join -v2 birthdays gifts

The output is:

Charles music
Edward camera

Of course, you can use both options to get all the lines from both files that don't match:

join -v1 -v2 birthdays gifts

The output is now:

Charles music
Edward camera
Frances Oct-15-1991
George Jan-17-1992

Because join depends on its data being sorted, there are several options to help you control the sorting requirements. First, you can use the -i (ignore) option to tell join to ignore any differences between upper and lower case. For example, when you use this option, CHARLES is treated the same as Charles.

— hint —

You will often use sort to prepare data for join. Remember: With sort, you ignore differences in upper and lower case by using the -f (fold) option. With join, you use the -i (ignore) option. (See the discussion on "fold" earlier in the chapter.)
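
For example, to prepare and join two hypothetical files, names1 and names2, whose join fields vary in upper and lower case, you might use:

sort -f names1 > names1.sorted
sort -f names2 > names2.sorted
join -i names1.sorted names2.sorted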

I mentioned earlier that join assumes the join field is the first field of each file. You can specify that you want to use different join fields by using the -1 and -2 options.

To change the join field for file #1, use -1 followed by the number of the field you want to use. For example, the following command joins two files, data and statistics, using the 3rd field of file #1 and (by default) the 1st field of file #2:

join -1 3 data statistics

To change the join field for file #2, use the -2 option. For example, the following command joins the same two files using the 3rd field of file #1 and the 4th field of file #2:

join -1 3 -2 4 data statistics
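
Remember that each file must be sorted on its own join field. As a sketch (assuming your sort supports the -k option for sorting on a specific field, and using hypothetical output names), you might prepare the two files first:

sort -k 3,3 data > data.sorted
sort -k 4,4 statistics > statistics.sorted
join -1 3 -2 4 data.sorted statistics.sorted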

To conclude our discussion, I would like to remind you that, because join works with sorted data, the results you get may depend on your locale and your collating sequences; that is, on the value of the LC_COLLATE environment variable. (See the discussion about locales earlier in the chapter.)

— hint —

The most common mistake in using join is forgetting to sort the two input files. If one or both of the files are not sorted properly with respect to the join fields, you will see either no output or partial output, and there will be no error message to warn you that something has gone wrong.

Jump to top of page

Creating a Total Ordering
From Partial Orderings: tsort

Related filters: sort

Consider the following problem. You are planning your evening activities, and you have a number of constraints:

• You must clean the dishes before you can watch TV.
• You must eat before you clean the dishes.
• You must shop before you can cook dinner.
• You must shop before you can put away the food.
• You must put away the food before you can cook dinner.
• You must cook dinner before you can eat it.
• You must put away the food before you can watch TV.

As you can see, this is a bit confusing. What you need is a master list that specifies when each activity should be done, such that all of the constraints are satisfied.

In mathematical terms, each of these constraints is called a PARTIAL ORDERING, because they specify the order of some (but not all) of the activities. In our example, each of the partial orderings specifies the order of two activities. Should you be able to construct a master list, it would be a TOTAL ORDERING, because it would specify the order of all of the activities.

The job of the tsort program is to analyze a set of partial orderings, each of which represents a single constraint, and calculate a total ordering that satisfies all the constraints. The syntax is simple:

tsort [file]

where file is the name of a file.

Each line of input must consist of a pair of character strings separated by whitespace (spaces or tabs), such that each pair represents a partial ordering. For example, let's say that the file activities contains the following data:

clean-dishes watch-TV
eat clean-dishes
shop cook
shop put-away-food
put-away-food cook
cook eat
put-away-food watch-TV

Notice that each line in the file consists of two character strings separated by whitespace (in this case, a single space). Each line represents a partial ordering that matches one of the constraints listed above. For example, the first line says that you must clean the dishes before you can watch TV; the second line says you must eat before you can clean the dishes; and so on.

The tsort program will turn the set of partial orderings into a single total ordering. Use the command:

tsort activities

The output is:

shop
put-away-food
cook
eat
clean-dishes
watch-TV

Thus, the solution to the problem is:

• Shop
• Put away your food
• Cook dinner
• Eat dinner
• Clean the dishes
• Watch TV

In general, any set of partial orderings can be combined into a total ordering, as long as there are no loops. For example, consider the following partial orderings:

study watch-TV
watch-TV study

There can be no total ordering, because you can't study before you watch TV, if you insist on watching TV before you study (although many people try). If you were to send this data to tsort, it would display an error message telling you the input contains a loop.
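
If you want to see what such an error looks like, you can generate the loop yourself. (The exact wording of the message varies from one system to another; with GNU tsort, the message mentions that the input contains a loop.)

printf 'study watch-TV\nwatch-TV study\n' | tsort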

What's in a Name?

tsort


Mathematically, it is possible to represent a set of partial orderings using what is called a "directed graph". If there are no loops, it is called a "directed acyclic graph" or DAG. For example, a tree (see Chapter 9) is a DAG.

Once you have a DAG, you can create a total ordering out of the partial orderings by sorting the elements of the graph based on their relative positions, rather than their values. In fact, this is how tsort does its job (although we don't need to worry about the details).

In mathematics, we use the word "topological" to describe properties that depend on relative positions. Thus, tsort stands for "topological sort".

Jump to top of page

Searching for Character Strings
in Binary Files: strings

Related filters: grep

To use the strings program, you need to understand the difference between text files and binary files. Consider the following three definitions:

1. There are 96 printable characters: tab, space, punctuation symbols, numbers, and letters. Any sequence of printable characters is called a CHARACTER STRING or, more informally, a STRING. For example, "Harley" is a string of length 6. (We discussed printable characters earlier in the chapter.)

2. A file that contains only printable characters (with a newline character at the end of each line) is called a TEXT FILE or an ASCII FILE. For the most part, Unix filters are designed to work with text files. Indeed, within this chapter, all the sample files are text files.

3. A BINARY FILE is any non-empty file that is not a text file, that is, any file that contains at least some non-textual data. Some common examples of binary files are executable programs, object files, images, sound files, video files, word processing documents, spreadsheets and databases.

If you are a programmer, you will work with executable programs and object files ("pieces" of programs), all of which are binary files. If you could look inside an executable program or an object file, most of what you would see would be encoded machine instructions, which look like meaningless gibberish. However, most programs do contain some recognizable character strings such as error messages, help information, and so on.

The strings program was created as a tool for programmers to display character strings that are embedded within executable programs and object files. For example, there used to be a custom that programmers would insert a character string into every program showing the version of that program. This allowed anyone to use strings to extract the version of a program from the program itself.

Today, programmers and users have better ways to keep track of such information(*) and the strings program is not used much. Still, you can use it, just for fun, to look "inside" any type of binary file. Although there is rarely a practical reason for doing so, it is cool to check out binary files for hidden messages. The syntax is:

strings [-length] [file...]

where length is the minimum length character string to display, and file is the name of a file, most often a pathname.

* Footnote

As we discussed in Chapter 10, most of the GNU utilities (used with Linux and FreeBSD) support the --version option to display version information.

As an example, let's say you want to look inside the sort program. To start, you use the whereis program to find the pathname — that is, the exact location — of the file that contains the program. (We'll discuss pathnames and whereis in Chapter 24, so don't worry about the details for now.) The command to use is:

whereis sort

Typical output would be:

sort: /usr/bin/sort /usr/share/man/man1/sort.1.gz

The output shows us the exact locations of the program and its man page. We are only interested in the program, so to use strings to look inside the sort program, we use:

strings /usr/bin/sort

Such commands usually generate a lot of output. There are, however, three things you can do to make the output more manageable.

First, by default, strings will only extract character strings that are at least 4 characters long. The idea is to eliminate short, meaningless sequences of characters. Even so, you are likely to see a great many spurious character strings. However, you can eliminate a lot of them by specifying a longer minimum length. To do so, you use an option consisting of a hyphen (-) followed by a number. For example, to specify that you only want to see strings that are at least 7 characters long (a good number), you would use:

strings -7 /usr/bin/sort

Next, you can sort the output and remove duplicate lines. To do so, just pipe the output to sort -iu (discussed earlier in the chapter):

strings -7 /usr/bin/sort | sort -iu

Finally, if there is so much output that it scrolls off your screen before you can read it, you can use less (Chapter 21) to display the output one screenful at a time:

strings -7 /usr/bin/sort | sort -iu | less

If the idea of looking inside programs for hidden messages appeals to you, here is an easy way to use strings to explore a variety of programs. The most important Unix utilities are stored in the two directories /bin and /usr/bin. (We will discuss this in Chapter 23.) Let's say you want to look inside some of the programs in these directories. To start, enter either of the following two cd (change directory) commands. This will change your "working directory" to whichever directory you choose:

cd /bin
cd /usr/bin

Now use the ls (list) program to display a list of all the files in that directory:

ls

All of these files are programs, and you can use strings to look at any of them. Moreover, because the files are in your working directory, you don't have to specify the entire pathname. In this case, the file name by itself is enough. For example, if your working directory is /bin, where the date program resides, you can look inside the date program by using the command:

strings -7 date | sort -iu | less
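
If you would like to sample several programs at once, you can use a small shell loop. (We have not covered loops; this is just a sketch you can type in as shown, assuming Bash, that your working directory is still /bin, and a few programs that are almost always present. The head program, discussed earlier in the book, limits the output to the first 5 lines for each program.)

for program in date ls cat; do
    echo "==== $program ===="
    strings -7 "$program" | sort -iu | head -n 5
done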

In this way, you can look for hidden character strings inside the most important Unix utilities. Once you are finished experimenting, enter the command:

cd

This will change your working directory back to your home directory (explained in Chapter 23).

Jump to top of page

Translating Characters: tr

Related filters: sed

The tr (translate) program can perform three different operations on characters. First, it can change characters to other characters. For example, you might change lowercase characters to uppercase characters, or tabs to spaces. Or, you might change every instance of the number "0" to the letter "X". When you do this, we say that you TRANSLATE the characters.

Second, you can specify that if a translated character occurs more than once in a row, it should be replaced by only a single character. For example, you might replace one or more numbers in a row by the letter "X". Or, you might replace multiple spaces by a single space. When you make such a change, we say that you SQUEEZE the characters.

Finally, tr can delete specified characters. For example, you might delete all the tabs in a file. Or, you might delete all the characters that are not letters or numbers.

In the next few sections, we will examine each of these operations in turn. Before we start, however, let's take a look at the syntax:

tr [-cds] [set1 [set2]]

where set1 and set2 are sets of characters(*).

* Footnote

If you are using Solaris, you should use the Berkeley Unix version of tr. Such programs are stored in the directory /usr/ucb, so all you have to do is make sure this directory is at the beginning of your search path. (The name ucb stands for University of California, Berkeley.)

We discuss Berkeley Unix in Chapter 2, and the search path in Chapter 13.

Notice that the syntax does not let you specify a file name, either for input or output. This is because tr is a pure filter that reads only from standard input and writes only to standard output. If you want to read from a file, you must redirect standard input; if you want to write to a file (to save the output), you must redirect standard output. This will make sense when you see the examples. (Redirection is explained in Chapter 15.)

The basic operation performed by the tr program is translation. You specify two sets of characters. As tr reads the data, it looks for characters in the first set. Whenever tr finds such characters, it replaces them with corresponding characters from the second set. For example, say you have a file named old. You want to change all the "a" characters to "A". The command to do so is:

tr a A < old

To save the output, just redirect it to a file, for example:

tr a A < old > new

By defining longer sets of characters, you can replace more than one character at the same time. The following command looks for and makes three different replacements: "a" is replaced by "A"; "b" is replaced by "B"; and "c" is replaced by "C".

tr abc ABC < old > new

If the second set of characters is shorter than the first, the last character in the second set is duplicated. For example, the following two commands are equivalent:

tr abcde Ax < old > new
tr abcde Axxxx < old > new

They both replace "a" with "A", and the other four characters ("b", "c", "d", "e") with "x".

When you specify characters that have special meaning to the shell, you must quote them (see Chapter 13) to tell the shell to treat the characters literally. You can use either single or double quotes, although, in most cases, single quotes work best. However, if you are quoting only a single character, it is easier to use a backslash (again, see Chapter 13).

As a general rule, it is a good idea to quote all characters that are not numbers or letters. For example, let's say you want to change all the colons, semicolons and question marks to periods. You would use:

tr ':;?' \. < old > new

Much of the power of tr comes from its ability to work with ranges of characters. Consider, for example, the following command which changes all uppercase letters to lowercase:

tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz < old > new

The correspondence between upper- and lowercase letters is clear. However, it's a bother to have to type the entire alphabet twice. Instead, you can use hyphen (-) to define a range of characters, according to the following syntax:

start-end

where start is the first character in the range, and end is the last character in the range.

For example, the previous example can be rewritten as follows:

tr A-Z a-z < old > new

A range can be any set of characters you want, as long as they form a consecutive sequence within the collating sequence you are using. (Collating sequences are discussed earlier in the chapter.) For example, the following command implements a secret code you might use to encode numeric data. The digits 0 through 8 are replaced by the first nine letters of the alphabet, A through I, respectively; because the second set is shorter, the leftover digit 9 is also replaced by I (remember, the last character of the shorter set is duplicated). For example, 375 is replaced by DHF.

tr 0-9 A-I < old > new

As a convenience, there are several abbreviations you can use instead of ranges. These abbreviations are called "predefined character classes", and we will discuss them in detail in Chapter 20 when we talk about regular expressions. For now, all you need to know is that you can use [:lower:] instead of a-z; [:upper:] instead of A-Z; and [:digit:] instead of 0-9. For example, the following two commands are equivalent:

tr A-Z a-z < old > new
tr [:upper:] [:lower:] < old > new

As are these two commands:

tr 0-9 A-I < old > new
tr [:digit:] A-I < old > new

(Note that the square brackets and colons are part of the name.)
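
One caution: to the shell, square brackets are wildcard characters (see Chapter 13). If your working directory happens to contain a file name that matches one of these patterns, the class name will be expanded before tr ever sees it. Should that ever happen to you, simply quote the character classes:

tr '[:upper:]' '[:lower:]' < old > new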

For practical purposes, these three predefined character classes are the ones you are most likely to use with the tr program. However, there are more predefined character classes available if you need them. You will find the full list in Figure 20-3 in Chapter 20.

— hint —

Compared to other filters, tr is unusual in that it does not allow you to specify the names of an input file or output file directly. To read from a file, you must redirect standard input; to write to a file, you must redirect standard output. For this reason, the most common mistake beginners make with tr is to forget the redirection. For example, the following commands will not work:

tr abc ABC old
tr abc ABC old new

Linux will display a vague message telling you that there is an "extra operand". Other types of Unix will display messages that are even less helpful. For this reason, you may, one day, find yourself spending a lot of time trying to figure out why your tr commands don't work.

The solution is to never forget: When you use tr with files, you always need redirection:

tr abc ABC < old
tr abc ABC < old > new

Jump to top of page

Translating Unprintable Characters

So far, all our examples have been straightforward. Still, they were a bit contrived. After all, how many times in your life will you need to change colons, semicolons and question marks to periods? Or change the letters "abc" to "ABC"? Or use a secret code that changes numbers to letters? Traditionally, the tr program has been used for more esoteric translations, often involving non-printable characters. Here is a typical example to give you the idea.

In Chapter 7, I explained that, within a text file, Unix marks the end of each line by a newline (^J) character(*1) and Windows uses a return followed by a newline (^M^J). Old versions of the Macintosh operating system, up to OS 9, used a return (^M) character(*2).

*1 Footnote

As we discussed in Chapter 7, when Unix people write the names of control keys, they often use ^ as an abbreviation for "Ctrl". Thus, ^J refers to <Ctrl-J>.

*2 Footnote

For many years, the Macintosh operating system (Mac OS) used ^M to mark the end of a line of text. As I mentioned, this was the case up to OS 9. In 2001, OS 9 was replaced by OS X, which is based on Unix. Like other Unix-based systems, OS X uses ^J to mark the end of a line of text.

Suppose you have a text file that came from an old Macintosh. Within the file, the end of each line is marked by a return. Before you can use the file with Unix, you need to change all the returns to newlines. This is easy with tr. However, in order to do so, you need a way to represent both the newline and return characters. You have two choices.

First, you can use special codes that are recognized by the tr program: \r for a return, and \n for newline. Alternatively, you can use a \ (backslash) character followed by the 3-digit octal value for the character. In this case, \015 for return, and \012 for newline. For reference, Figure 19-4 shows the special codes and octal values you are most likely to use with tr.

The octal values are simply the base 8(*) number of the character within the ASCII code. For reference, you will find the octal values for the entire ASCII code in Appendix D.

* Footnote

In general, we count in base 10 (decimal), using the 10 digits 0 through 9. When we use computers, however, there are three other bases that are important:

• Base 2 (binary): uses 2 digits, 0-1
• Base 8 (octal): uses 8 digits, 0-7
• Base 16 (hexadecimal): uses 16 digits, 0-9 A-F

We will talk about these number systems in Chapter 21.

Figure 19-4: Codes used by the tr program to represent control characters

The tr program is used to translate (change) specific characters into other characters. To specify non-printable characters, you can either use a special code, or a backslash (\) followed by the 3-digit octal (base 8) value of the character. (You will find the octal values for all the characters in the ASCII code in Appendix D.)

This table shows the special codes and octal values for the four most commonly used control characters. There are others, but you are unlikely to need them with tr.

Note: Since the backslash is used as an escape character (see Chapter 13), if you want to specify an actual backslash, you must use two in a row.

Code   Ctrl Key   Octal Code   Name
\b     ^H         \010         backspace
\t     ^I         \011         tab
\n     ^J         \012         newline/linefeed
\r     ^M         \015         return
\\     (none)     (none)       backslash

Let us consider, then, how to use tr to change returns to newlines. Let's say we have a text file named macfile in which each line ends with a return. We want to change all the returns to newlines and save the output in a file named unixfile. Either of the following commands will do the job:

tr '\r' '\n' < macfile > unixfile
tr '\015' '\012' < macfile > unixfile

As you can see, using these codes is simple once you understand how they work. For example, the following two commands change all the tabs in the file olddata to spaces, saving the output in newdata(*):

tr '\t' ' ' < olddata > newdata
tr '\011' ' ' < olddata > newdata

* Footnote

When you change tabs to spaces, or spaces to tabs, it is often better to use expand and unexpand (Chapter 18). These two programs were designed specifically to make such changes and, hence, offer more flexibility.

Jump to top of page

Translating Characters: Advanced Topics

So far, we have discussed how to use tr for straightforward substitutions, where one character is replaced by another character. We will now turn our attention to more advanced topics. Before we do, here is a reminder of syntax we will be using:

tr [-cds] [set1 [set2]]

where set1 and set2 are sets of characters.

The -s option tells tr that multiple consecutive characters from the first set should be replaced by a single character. As I mentioned earlier, when we do this, we say that we squeeze the characters. Here is an example.

The following two commands replace any digit (0-9) with the uppercase letter "X". The input is read from a file named olddata, and the output is written to a file named newdata:

tr [:digit:] X < olddata > newdata
tr 0-9 X < olddata > newdata

As they stand, these commands replace each occurrence of a digit with an "X". For example, the 6-digit number 120357 would be changed to XXXXXX. Let's say, however, you want to change all multi-digit numbers, no matter how long they are, into a single "X". You would use the -s option:

tr -s [:digit:] X < olddata > newdata
tr -s 0-9 X < olddata > newdata

This tells tr to squeeze all multi-digit numbers into a single character. For example, the number 120357 is now changed to X.

Here is a useful example in which we squeeze multiple characters, without actually changing the character. You want to replace consecutive spaces with a single space. The solution is to replace a space with a space, while squeezing out the extras:

tr -s  ' '  ' '  < olddata > newdata
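
Here is a variation you may find handy. To change runs of spaces and tabs into a single space (for example, to clean up ragged columns of data), you can translate and squeeze at the same time. (This sketch depends on the usual behavior of -s, which squeezes repetitions of the characters in the second set.)

tr -s '\t ' ' ' < olddata > newdata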

The next option, -d, deletes the characters you specify. As such, when you use -d, you define only one set of characters. For example, to delete all the left and right parentheses, use:

tr -d '()' < olddata > newdata

To delete all numbers, use either of the commands:

tr -d [:digit:] < olddata > newdata
tr -d 0-9 < olddata > newdata

The final option, -c, is the most complex and the most powerful. This option tells tr to match all the characters that are not in the first set(*). For example, the following command replaces all characters except a space or a newline with an "X":

tr -c ' \n' X < olddata > newdata

* Footnote

The name -c stands for "complement", a mathematical term. In set theory, the complement of a set refers to all the elements that are not part of the set. For example, with respect to the integers, the complement of the set of all even numbers is the set of all odd numbers. With respect to all the uppercase letters, the complement of the set {ABCDWXYZ} is the set {EFGHIJKLMNOPQRSTUV}.

The effect of this command is to preserve the "image" of the text, without the meaning. For instance, let's say the file olddata contains:

Do you really think you were designed to spend most of
your waking hours working in an office and going to
meetings? — Harley Hahn

The previous command will generate:

XX XXX XXXXXX XXXXX XXX XXXX XXXXXXXX XX XXXXX XXXX XX
XXXX XXXXXX XXXXX XXXXXXX XX XX XXXXXX XXX XXXXX XX
XXXXXXXXX X XXXXXX XXXX

To finish the discussion of tr, here is an interesting example in which we combine the -c (complement) and -s (squeeze) options to count unique words. Let's say you have written two history essays, stored in text files named greek and roman. You want to count the unique words found in both files. The strategy is as follows:

• Use cat to combine the files
• Use tr to place each word on a separate line
• Use sort to sort the lines and eliminate duplications
• Use wc to count the remaining lines

To place each word on a separate line (step 2), all we need to do is use tr to replace every character that is not part of a word with a newline. For example, let's say we have the words:

As you can see

This would change to:

As
you
can
see

To keep things simple, we will say words are constructed from a set of 53 different characters: 26 uppercase letters, 26 lowercase letters, and the apostrophe (that is, the single quote). The following three commands — choose the one you like — will do the job:

tr -cs [:alpha:]\' "\n"
tr -cs [:upper:][:lower:]\' "\n"
tr -cs A-Za-z\' "\n"

The -c option changes the characters that are not in the first set; and the -s option squeezes out repeated characters. The net effect is to replace all characters that are not a letter or an apostrophe with a newline.

Once you have isolated the words, one per line, it is a simple matter to sort them. Just use the sort program with -u (unique) to eliminate duplicate lines, and -f (fold) to ignore differences between upper and lower case. You can then use wc -l to count the number of lines. Here, then, is the complete pipeline:

cat greek roman | tr -cs [:alpha:]\' "\n" | sort -fu | wc -l

More generally:

cat file1... | tr -cs [:alpha:]\' "\n" | sort -fu | wc -l

In this way, a single Unix pipeline can count how many unique words are contained in a set of input files. If you want to save the list of words, all you need to do is redirect the output of the sort program:

cat file1... | tr -cs [:alpha:]\' "\n" | sort -fu > file

Jump to top of page

Non-interactive Text Editing: sed

A text editor is a program that enables you to perform operations on lines of text. Typically, you can insert, delete, make changes, search, and so on. The two most important Unix text editors are vi (which we will discuss in Chapter 22) and Emacs. There are also several other, less important, but simpler text editors which we discussed in Chapter 14: kedit, gedit, Pico and Nano.

The characteristic all these programs have in common is that they are interactive. That is, you work with them by opening a file and then entering commands, one after another until you are done. In this section, I am going to introduce you to a text editor, called sed, which is non-interactive.

With a non-interactive text editor, you compose your commands ahead of time. You then send the commands to the program, which carries them out automatically. Using a non-interactive text editor allows you to automate a large variety of tasks which, otherwise, you would have to carry out by hand.

You can use sed in two ways. First, you can have sed read its input from a file. This allows you to make changes in an existing file automatically. For example, you might want to read a file and change all the occurrences of "harley" to "Harley".

Second, you can use sed as a filter in a pipeline. This allows you to edit the output of a program. It also allows you to pipe the output of sed to yet another program for further processing.

Before we get started, here is a bit of terminology. When you think of data being sent from one program to another in a pipeline, it conjures up the metaphor of water running along a path. For this reason, when data flows from one program to another, we call it a STREAM. More specifically, when data is read by a program, we call the data an INPUT STREAM. When data is written by a program, we call the data an OUTPUT STREAM.

Of all the filters we have discussed, sed is, by far, the most powerful. This is because sed is more than a single-purpose program. It is actually an interpreter for a portable, shell-independent language designed to perform text transformations on a stream of data. Hence the name: sed is an abbreviation of "stream editor".

A full discussion of everything that sed can do is beyond the scope of this book. However, the most useful operation you can perform with sed is to make simple substitutions, so that is what I will teach you. Still, I am leaving out a lot, so when you get a spare moment, look on the Web for a sed tutorial to learn more. If you need a reference, check the man page on your system (man sed).

The syntax to use sed in this way is:

sed [-i] command | -e command... [file...]

where command is a sed command, and file is the name of an input file.

To show you what it looks like to use sed, here is a typical example in which we change every occurrence of "harley" to "Harley". The input comes from a text file named names; the output is written to a file named newnames:

sed 's/harley/Harley/g' names > newnames

I'll explain the details of the actual command in a moment. First, though, we need to talk about input and output files.

The sed program reads one line at a time from the data stream, processing all the data from beginning to end, according to a 3-step procedure:

  1. Read a line from the input stream.
  2. Execute the specified commands, making changes as necessary to the line.
  3. Write the line to the output stream.

By default, sed writes its output to standard output, which means sed does not change the input file. In some cases this is fine, because you don't want to change the original file; you want to redirect standard output to another file. You can see this in the example above. The input comes from names, and the output goes to newnames. The file names is left untouched.

Most of the time, however, you do want to change the original file. To do so, you must use the -i (in-place) option. This causes sed to save its output to a temporary file. Once all the data is processed successfully, sed copies the temporary file to the original file. The net effect is to change the original file, but only if sed finishes without an error. Here is a typical sed command using -i:

sed -i 's/harley/Harley/g' names

In this case, sed modifies the input file names by changing all occurrences of "harley" to "Harley".(*)

* Footnote

The -i option is available only with the GNU version of sed. If your system does not use the GNU utilities — for example, if you use Solaris — you cannot use -i. Instead, to use sed to change a file, you must save the output to a temporary file. You then use the cp (copy) program to copy the temporary file to the original file, and the rm (remove) program to delete the temporary file. For example:

sed 's/harley/Harley/g' names > temp
cp temp names
rm temp

In other words, you must do by hand what the -i option does for you automatically.

When you use sed -i, you must be careful. The changes you make to the input file are permanent, and there is no "undo" command.

— hint —

Before you use sed to change a file, it is a good idea to preview the changes by running the program without the -i option. For example:

sed 's/xx/XXX/g' file | less

This allows you to look at the output, and see if it is what you expected. If so, you can rerun the command with -i to make the changes(*):

sed -i 's/xx/XXX/g' file

* Footnote

There is a general Unix principle that says, before you make important permanent changes, preview them if possible.

We used a similar strategy in Chapter 13 with the history list and with aliases. Both times, we discussed how to avoid deleting the wrong files accidentally by previewing the results before performing the actual deletion.

This principle is so important, I want you to remember it forever or until you die (whichever comes first).

Jump to top of page

Using sed for Substitutions

Related filters: tr

The power of sed comes from the operations you can have it perform. The most important operation is substitution, for which you use the s command. The syntax is:

[address]s/search/replacement/[g]

where address selects the lines on which to operate (a line number, a range of line numbers, or a pattern enclosed in slashes, as explained later in the chapter); search is a regular expression; and replacement is the replacement text.

In its simplest form, you use the substitute command by specifying a search string and a replacement string. For example:

s/harley/Harley/

This command tells sed to search each line of the input stream for the character string "harley". If the string is found, change it to "Harley". By default, sed changes only the first occurrence of the search string on each line. For example, let's say the following line is part of the input stream:

I like harley.  harley is smart.  harley is great.

The above command will change this line to:

I like Harley.  harley is smart.  harley is great.

If you want to change all occurrences of the search string, type the suffix g (for global) at the end of the command:

s/harley/Harley/g

In our example, adding the g causes the original line to be changed to:

I like Harley.  Harley is smart.  Harley is great.

In my experience, when you use sed to make a substitution, you usually want to use g to change all the occurrences of the search string, not just the first one in each line. This is why I have included the g suffix in all our examples.

So far, we have searched only for simple character strings. However, you can make your search a lot more powerful by using what is called a "regular expression" (often abbreviated as "regex"). Using a regular expression allows you to specify a pattern, which gives you more flexibility. However, regexes can be complex, and it will take you a while to learn how to use them well.

I won't go into the details of using regular expressions now. In fact, they are so complicated — and so powerful — that I have devoted an entire chapter to them, Chapter 20. Once you have read that chapter, I want you to come back to this section and spend some time experimenting with regular expressions and sed. (Be sure to use the handy reference tables in Figures 20-1, 20-2 and 20-3.)

For now, I'll show you just two examples that use regular expressions with sed. To start, let's say you have a file named calendar that contains information about your plans for the next several months. You want to change all occurrences of the string "mon" or "Mon" to the word "Monday". Here is a command that makes the change by using a regular expression:

sed -i 's/[mM]on/Monday/g' calendar

To understand this command, all you need to know is that, within a regular expression, the notation [...] matches any single element within the brackets; in this case, either an "m" or an "M". Thus, the search string is either "mon" or "Mon".

This second example is a bit trickier. Earlier in the chapter, when we discussed the tr program, we talked about how Unix, Windows, and the Macintosh all use different characters to mark the end of a line of text. Unix uses a newline (^J); Windows uses a return followed by a newline (^M^J); and the Macintosh uses a return (^M). (These characters are discussed in detail in Chapter 7.)

During the discussion, I showed you how to convert a text file in Macintosh format to Unix format. You do so by using tr to change all the returns to newlines:

tr '\r' '\n' < macfile > unixfile

But what do you do if you have a text file in Windows format and you want to use the file with Unix? In other words, how do you change the "return newline" at the end of each line of text to a simple newline? You can't use tr, because you need to change two characters (^M^J) into one (^J); tr can only change one character into another character.

We can, however, use sed, because sed can change anything into anything. To create the command, we use the fact that the return character (^M) will be at the end of the line, just before the newline (^J). All we need to do is find and delete the ^M.

Here are two commands that will do the conversion. The first command reads its input from a file named winfile, and writes the output to a file named unixfile. The second command uses -i to change the original file itself:

sed 's/.$//' winfile > unixfile
sed -i 's/.$//' winfile

So how does this work? Within a regular expression, a . (dot) character matches any single character; the $ (dollar sign) character matches the end of a line. Thus, the search string .$ refers to the character just before the newline.

Look carefully at the replacement string. Notice that it is empty. This tells sed to change the search string to nothing. That is, we are telling sed to delete the search string. This has the effect of removing the spurious return character from each line in the file.

If you have never used regular expressions before, I don't expect you to feel completely comfortable with the last several commands. However, I promise you, by the time you finish Chapter 20, these examples, and others like them, will be very easy to understand.

— hint —

To delete a character string with sed, you search for the string and replace it with nothing.

This is an important technique to remember, as you can use it with any program that allows search and replace operations. (In fact, you will often use this technique within a text editor.)
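
For example, to delete every occurrence of the string "DRAFT " from a hypothetical file named report, you might preview the result and then make the change:

sed 's/DRAFT //g' report | less
sed -i 's/DRAFT //g' report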

Jump to top of page

Telling sed to Operate Only on Specific Lines

By default, sed performs its operations on every line in the data stream. To change this, you can preface your command with an "address". This tells sed to operate only on the lines with that address. An address has the following syntax:

number[,number] | /regex/

where number is a line number, and regex is a regular expression.

In its simplest form, an address is a single line number. For example, the following command changes only the 5th line of the data stream:

sed '5s/harley/Harley/g' names

To specify a range of lines, separate the two line numbers with a comma. For example, the following command changes lines 5 through 10:

sed '5,10s/harley/Harley/g' names

As a convenience, you can designate the last line of the data stream by the $ (dollar sign) character. For example, to change only the last line of the data stream, you would use:

sed '$s/harley/Harley/g' names

To change lines 5 through the last line, you would use:

sed '5,$s/harley/Harley/g' names

As an alternative to specifying line numbers, you can use a regular expression or a character string(*) enclosed in / (slash) characters. This tells sed to process only those lines that contain the specified pattern. For example, to make a change to only those lines that contain the string "OK", you would use:

sed '/OK/s/harley/Harley/g' names

* Footnote

As we will discuss in Chapter 20, character strings are considered to be regular expressions.

Here is a more complex example. The following command changes only those lines that contain 2 digits in a row:

sed '/[0-9][0-9]/s/harley/Harley/g' names

(The notation [0-9] refers to a single digit from 0 to 9. See Chapter 20 for the details.)

Jump to top of page

Using Very Long sed Commands

As I mentioned earlier, sed is actually an interpreter for a text-manipulation programming language. As such, you can write programs — consisting of as many sed commands as you want — which you can store in files and run whenever you want.

To do so, you identify the program file by using the -f option. For example, to run the sed program stored in a file named instructions, using data from a file named input, you would use:

sed -f instructions input
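
As a minimal sketch, the file instructions might contain ordinary sed commands, one per line. Within such a file, you do not need quotes, and you do not need -e:

s/mon/Monday/g
s/tue/Tuesday/g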

The use of sed to write programs, alas, is beyond the scope of this book. In this chapter, we are concentrating on how to use sed as a filter. Nevertheless, there will be times when you will want sed to perform several operations; in effect, to run a tiny program. When this need arises, you can specify as many sed commands as you want, as long as you precede each one by the -e (editing command) option. Here is an example.

You have a file named calendar in which you keep your schedule. Within the file, you have various abbreviations you would like to expand. In particular, you want to change "mon" to "Monday". The command to use is:

sed -i 's/mon/Monday/g' calendar

However, you also want to change "tue" to "Tuesday". This requires two separate sed commands, both of which must be preceded by the -e option:

sed -i -e 's/mon/Monday/g' -e 's/tue/Tuesday/g' calendar

By now, you can see the pattern. You are going to need seven separate sed commands, one for each day of the week. This, however, will require a very long command line.

As we discussed in Chapter 13, the best way to enter a very long command is to break it onto multiple lines. All you have to do is type a \ (backslash) before you press the <Return> key. The backslash quotes the newline, which allows you to break the command onto more than one line.

As an example, here is a long sed command that changes the abbreviations for all seven days of the week. Notice that all the lines, except the last one, are continued. What you see here is, in reality, one very long command line:

sed -i \
-e 's/mon/Monday/g' \
-e 's/tue/Tuesday/g' \
-e 's/wed/Wednesday/g' \
-e 's/thu/Thursday/g' \
-e 's/fri/Friday/g' \
-e 's/sat/Saturday/g' \
-e 's/sun/Sunday/g' \
calendar

— hint —

When you type \<Return> to continue a line, most shells display a special prompt, called the SECONDARY PROMPT, to indicate that a command is being continued.

Within the Bourne Shell family (Bash, Korn Shell), the default secondary prompt is a > (greater-than) character. You can change the secondary prompt by modifying the PS2 shell variable (although most people don't).

Within the C-Shell family, only the Tcsh has a secondary prompt. By default, it is a ? (question mark). You can change the secondary prompt by modifying the prompt2 shell variable.
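
For example, to change the secondary prompt to "more> ", a Bash user might enter:

PS2="more> "

A Tcsh user might enter:

set prompt2 = "more> "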

(The commands to modify shell variables are explained in Chapter 12. Putting such commands in one of your initialization files is discussed in Chapter 14.)

Jump to top of page



Exercises

Review Question #1:

Of all the filters, grep is the most important. What does grep do?

Why is it especially useful in a pipeline?

Explain the meaning of the following options: -c, -i, -l, -L, -n, -r, -s, -v, -w and -x.

Review Question #2:

What two tasks can the sort program perform?

Explain the meaning of the following options: -d, -f, -n, -o, -r and -u.

Why is the -o option necessary?

Review Question #3:

What is a collating sequence? What is a locale? What is the connection between the two?

Review Question #4:

What four tasks can the uniq program perform?

Review Question #5:

What three tasks can the tr program perform?

When using tr, what special codes do you use to represent backspace, tab, newline/linefeed, return, and backslash?

Applying Your Knowledge #1:

As we will discuss in Chapter 23, the /etc directory is used to hold configuration files (explained in Chapter 6). Create a command that looks through all the files in the /etc directory, searching for lines that contain the word "root". The output should be displayed one screenful at a time. Hint: To specify the file names, use the pattern /etc/*.

Searching through the files in the /etc directory will generate a few spurious error messages. Create a second version of the command that suppresses all such messages.

Applying Your Knowledge #2:

Someone bets you that, without using a dictionary, you can't find more than 5 English words that begin with the letters "book". You are, however, allowed a single Unix command. What command should you use?

Applying Your Knowledge #3:

You are running an online dating service for your school. You have three files containing user registrations: reg1, reg2 and reg3. Within these files, each line contains information about a single person (no pun intended).

Create a pipeline that processes all three files, selecting only those lines that contain the word "female" or "male" (your choice). After eliminating all duplications, the results should be saved in a file named prospects.

Once this is done, create a second pipeline that displays a list of all the people (male or female) who have registered more than once. Hint: Look for duplicate lines within the files.

Applying Your Knowledge #4:

You have a text file named data. Create a pipeline that displays all instances of double words, for example, "hello hello". (Assume that a "word" consists of consecutive upper- or lowercase letters.)

Hint: First create a list of all the words, one per line. Then pipe the output to a program that searches for consecutive identical lines.

For Further Thought #1:

In an earlier question, I observed that grep is the most important filter, and I asked you to explain why it is especially useful in a pipeline. Considering your answer to that question, what is it about the nature of human beings that makes grep seem so powerful and useful?

For Further Thought #2:

Originally, Unix was based on American English and American data processing standards (such as the ASCII code). With the development of internationalization tools and standards (such as locales), Unix can now be used by people from a variety of cultures. Such users are able to interact with Unix in their own languages using their own data processing conventions.

What are some of the tradeoffs in expanding Unix in this way? List three advantages and three disadvantages.

For Further Thought #3:

In this chapter, we talked about the tr and sed programs in detail. As you can see, both of these programs can be very useful. However, they are complex tools that require a lot of time and effort to master.

For some people, this is not a problem. For many other people, however, taking the time to learn how to use a complex tool well is an uncomfortable experience. Why do you think this is so?

Should all tools be designed to be easy to learn?

For Further Thought #4:

Comment on the following statement: There is no program in the entire Unix toolbox that can't be mastered in less time than it takes to learn how to play the piano well.

Jump to top of page