Harley Hahn's Guide to Unix and Linux

The grep program reads from standard input or from one or more files, and extracts all the lines that contain a specified pattern, writing the lines to standard output.

Within a pipeline, grep is useful because it can quickly reduce a large amount of raw data into a small amount of useful information.

The options are:


-c		(count) display the number of lines that have been extracted
-i		(ignore case) ignore the difference between lower- and uppercase letters
-l		(list filenames) instead of displaying the lines of text that contain the pattern,
		display the names of the files in which such lines were found
-L		(opposite of -l) list names of files that do not contain the pattern
-n		(number) write a relative line number in front of each line of output
-r		(recursive) search all subdirectories
-s		(suppress) don't show error messages about files that do not exist or cannot be read
-v		(reverse) select all lines that do not contain the pattern
-w		(word) search only for complete words
-x		(entire line) only select lines in which the pattern is the entire line
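
To see several of these options in action, here is a minimal sketch. The file name sample and its contents are made up for the illustration; they are not part of the exercise.

```shell
# Build a small sample file in a scratch directory (hypothetical data).
dir=$(mktemp -d)
printf '%s\n' 'root is here' 'Root user' 'uprooted tree' 'root' 'nothing' > "$dir/sample"

grep -c  root "$dir/sample"   # count the lines that contain "root": 3
grep -ic root "$dir/sample"   # the same count, ignoring case: 4
grep -n  root "$dir/sample"   # show each matching line with its line number
grep -w  root "$dir/sample"   # "root" as a complete word (skips "uprooted")
grep -x  root "$dir/sample"   # only lines that are exactly "root"
grep -v  root "$dir/sample"   # lines that do not contain "root"
grep -l  root "$dir"/*        # names of the files that contain a match
```

Notice that -w rejects "uprooted" because the pattern is only part of a longer word, while plain grep accepts it.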

Review Question #2:

What two tasks can the sort program perform?

Explain the meaning of the following options: -d, -f, -n, -o, -r and -u.

Why is the -o option necessary?

Answer

The sort program can (1) sort data, (2) check to see if data is already sorted.


-d		(dictionary) look only at letters, numerals and whitespace; ignore
		punctuation
-f		(fold) ignore the difference between lower- and uppercase letters
-n		(numeric) recognize numbers at the beginning of a line or field and
		sort them numerically
-o		(output) write output to specified file
-r		(reverse) sort data in reverse order
-u		(unique) check for identical lines and suppress all but one
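
As a quick illustration of the two tasks and a few of the options, consider the following sketch. The file nums is a made-up example.

```shell
# Four numbers, deliberately out of order, with one duplicate (hypothetical data).
dir=$(mktemp -d)
printf '%s\n' 10 9 100 9 > "$dir/nums"

sort     "$dir/nums"   # character by character: 10, 100, 9, 9
sort -n  "$dir/nums"   # numerically: 9, 9, 10, 100
sort -nr "$dir/nums"   # numerically, in reverse: 100, 10, 9, 9
sort -nu "$dir/nums"   # numerically, duplicates suppressed: 9, 10, 100

# The second task: check whether the data is already sorted.
# With -c, sort writes no sorted output; it reports via its exit status.
if sort -c -n "$dir/nums" 2>/dev/null; then
    echo "already sorted"
else
    echo "not sorted"
fi
```

The first command shows why -n matters: without it, "100" comes before "9" because the character "1" comes before "9".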

The -o option is necessary to protect data when the output file is the same as one of the input files. Without -o, you might be tempted to redirect standard output to the file. However, the shell would empty the file before sort even started, wiping out its contents. So, without -o, you would have to save the output in a temporary file, and then move the temporary file to the original file by hand.
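
Here is a small sketch of the difference. The file fruit is a made-up example, and the unsafe command is left as a comment so that nothing is destroyed if you run the block.

```shell
# A short unsorted file (hypothetical data).
dir=$(mktemp -d)
printf '%s\n' pear apple orange > "$dir/fruit"

# Unsafe: the shell empties the file before sort reads it.
#   sort "$dir/fruit" > "$dir/fruit"

# Safe: sort reads all of its input before it writes the output file.
sort -o "$dir/fruit" "$dir/fruit"
cat "$dir/fruit"
```

After the command runs, the file contains the same three lines in sorted order.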

Review Question #3:

What is a collating sequence? What is a locale? What is the connection between the two?

Answer

A collating sequence describes the order in which characters are placed when sorted. In the U.S., the two principal collating sequences are:

• C collating sequence, based on the ASCII code, used by the C (POSIX) locale; named after the C programming language. Within the C collating sequence, uppercase letters come before lowercase letters (ABC... XYZabc...xyz).

• Dictionary collating sequence, used by the en_US locale, in which uppercase letters and lowercase letters are grouped in pairs (AaBbCcDd... Zz).

In modern versions of Unix or Linux, the collating sequence depends on the choice of locale.

A locale is a technical specification describing the language and conventions that should be used when communicating with a user from a particular culture. The intention is that a user can choose whichever locale he wants, and the programs he runs will communicate with him accordingly. For users of American English, the default locale is either the C (POSIX) locale based on the ASCII code; or the en_US locale, part of a newer international system.

The locale specifies a default collating sequence, although that can be changed by the user.
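
One way to see the connection is to sort the same data under different locales. The sketch below assumes that you set the locale for a single command by using the LC_ALL environment variable, which overrides the other locale variables, including LC_COLLATE. The word list is made up, and the en_US.UTF-8 locale may not be installed on every system, so no output is shown for that command.

```shell
# Sort five words using the C collating sequence.
c_order=$(printf '%s\n' banana Apple cherry apple Banana | LC_ALL=C sort)
echo "$c_order"
# Apple
# Banana
# apple
# banana
# cherry

# The same data using the en_US locale. The result depends on whether
# this locale is installed on your system.
printf '%s\n' banana Apple cherry apple Banana | LC_ALL=en_US.UTF-8 sort
```

With the C locale, all the uppercase words come first, just as the C collating sequence requires.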

Review Question #4:

What four tasks can the uniq program perform?

Answer

The uniq program can:

• Eliminate duplicate lines
• Select duplicate lines
• Select unique lines
• Count the number of duplicate lines
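
The four tasks can be sketched as follows. The file letters is a made-up example. Remember that uniq compares only adjacent lines, so the input must already be sorted.

```shell
# Six lines, already in sorted order (hypothetical data).
dir=$(mktemp -d)
printf '%s\n' a a b c c c > "$dir/letters"

uniq    "$dir/letters"   # eliminate duplicate lines: a, b, c
uniq -d "$dir/letters"   # select duplicate lines: a, c
uniq -u "$dir/letters"   # select unique lines: b
uniq -c "$dir/letters"   # count: 2 for a, 1 for b, 3 for c
```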

Review Question #5:

What three tasks can the tr program perform?

When using tr, what special codes do you use to represent: backspace, tab, newline/linefeed, return and backslash?

Answer

The tr program can:

• Change characters to other characters
• Replace identical consecutive characters by a single character
• Delete specified characters

When using tr, you can use the following special codes:


backspace		\b
tab		\t
newline/linefeed		\n
return		\r
backslash		\\
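
To illustrate, here is one short example of each task, followed by an example that uses two of the special codes. The input strings are made up.

```shell
echo 'hello world' | tr 'a-z' 'A-Z'       # change characters: HELLO WORLD
echo 'too    many   spaces' | tr -s ' '   # squeeze repeats: too many spaces
echo 'p.u.n.c.t' | tr -d '.'              # delete characters: punct
printf 'one\ttwo\n' | tr '\t' '\n'        # change each tab to a newline
```

The last command writes "one" and "two" on separate lines.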

Applying Your Knowledge #1:

As we will discuss in Chapter 23, the /etc directory is used to hold configuration files (explained in Chapter 6). Create a command that looks through all the files in the /etc directory, searching for lines that contain the word "root". The output should be displayed one screenful at a time. Hint: To specify the file names, use the pattern /etc/*.

Searching through the files in the /etc directory will generate a few spurious error messages. Create a second version of the command that suppresses all such messages.

Answer

grep -w root /etc/* | less

To suppress error messages, use the -s option:

grep -ws root /etc/* | less

Applying Your Knowledge #2:

Someone bets you that, without using a dictionary, you can't find more than 5 English words that begin with the letters "book". You are, however, allowed a single Unix command. What command should you use?

Answer

look book

Applying Your Knowledge #3:

You are running an online dating service for your school. You have three files containing user registrations: reg1, reg2 and reg3. Within these files, each line contains information about a single person (no pun intended).

Create a pipeline that processes all three files, selecting only those lines that contain the word "female" or "male" (your choice). After eliminating all duplications, the results should be saved in a file named prospects.

Once this is done, create a second pipeline that displays a list of all the people (male or female) who have registered more than once. Hint: Look for duplicate lines within the files.

Answer

You can use either of the following pipelines:
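
As a minimal sketch of two such pipelines, consider the following. This is my reconstruction under stated assumptions, not necessarily the exact commands intended: I assume the word chosen is "female", I use -w because "female" contains "male", and I save the second result in a file with a different name (prospects.named) only so that the two results can be compared. The contents of the registration files are made up.

```shell
# Hypothetical sample data, so the sketch can be run as is.
cd "$(mktemp -d)"
printf '%s\n' 'Ann female 20' 'Bob male 21'   > reg1
printf '%s\n' 'Ann female 20' 'Cat female 22' > reg2
printf '%s\n' 'Dan male 23'                   > reg3

# First pipeline: combine the files, then search the combined data.
cat reg1 reg2 reg3 | grep -w female | sort -u > prospects

# Second pipeline: give the files directly to grep. Because there is
# more than one file, grep starts each line with the name of the file.
grep -w female reg1 reg2 reg3 | sort -u > prospects.named
```

In this sketch, the file name prefix makes otherwise identical lines from different files look different to sort -u, so the second version keeps one copy per file.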

The second grep command will show you the name of the file in which each line was found. The first one won't.

To display a list of duplicate registrations:

sort reg1 reg2 reg3 | uniq -d

Applying Your Knowledge #4:

You have a text file named data. Create a pipeline that displays all instances of double words, for example, "hello hello". (Assume that a "word" consists of consecutive upper- or lowercase letters.)

Hint: First create a list of all the words, one per line. Then pipe the output to a program that searches for consecutive identical lines.

Answer

tr -cs '[:alpha:]' '[\n*]' < data | uniq -d
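
To watch the pipeline work, you can try it on a small test file. The contents below are made up, and the file is created in a scratch directory.

```shell
# Two lines of text, each containing one double word (hypothetical data).
dir=$(mktemp -d)
printf '%s\n' 'hello hello world' 'this is is a test' > "$dir/data"

# tr changes every run of non-letters into a single newline, leaving one
# word per line; uniq -d then shows each word that repeats consecutively.
tr -cs '[:alpha:]' '[\n*]' < "$dir/data" | uniq -d
# hello
# is
```

Because the comparison is case sensitive, "Hello hello" would not be reported as a double word.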

For Further Thought #1:

In an earlier question, I observed that grep is the most important filter, and I asked you to explain why it is especially useful in a pipeline. Considering your answer to that question, what is it about the nature of human beings that makes grep seem so powerful and useful?

Answer

By our nature, human beings are wired to recognize patterns. To do so, we must, at every moment, extract meaning from huge amounts of information. In several important ways, grep does just that with textual data. In this way, grep is able to help us find conceptual patterns by quickly examining large amounts of data. (It is no coincidence that grep was invented by a human being.)

For Further Thought #2:

Originally, Unix was based on American English and American data processing standards (such as the ASCII code). With the development of internationalization tools and standards (such as locales), Unix can now be used by people from a variety of cultures. Such users are able to interact with Unix in their own languages using their own data processing conventions.

What are some of the tradeoffs in expanding Unix in this way? List three advantages and three disadvantages.

Answer

Advantages:

• Allows many people to use Unix or Linux using the language and conventions with which they are already comfortable.

• Opens the world of Unix and Linux to hundreds of millions more people. This will have a large, positive effect on the world. For example, many more volunteers will become involved with open source development.

• Helps the economies of developing countries by making sophisticated computer tools available at low cost.

Disadvantages:

• Creates an enormous burden with respect to creating and maintaining different versions of software. Even with the use of locales, internationalized software is larger, more complex and requires a lot more testing than software written only for English.

• It reduces the motivation for non-English speaking users to learn English which, after all, is the world's primary business and computing language.

• The system of locales creates unnecessary complications for English-speaking users, for example, the collating sequence problems we discussed in the chapter.

For Further Thought #3:

In this chapter, we talked about the tr and sed programs in detail. As you can see, both of these programs can be very useful. However, they are complex tools that require a lot of time and effort to master.

For some people, this is not a problem. For many other people, however, taking the time to learn how to use a complex tool well is an uncomfortable experience. Why do you think this is so?

Should all tools be designed to be easy to learn?

Answer

Many people have trouble learning how to use complex tools. There are a variety of reasons. The most important are:

• Not smart enough.
• Don't want to put in the necessary effort.
• Don't have enough time.

However, this does not mean that all tools should be designed to be easy to learn. The tools that should be easy to learn are those designed for people who are less intelligent, lazy, or short of time. There is also a place for easy-to-learn tools for smart, motivated users, if the tools are used only occasionally to carry out important tasks.

As a general rule, all tools should be fast and easy to use once you have mastered them. This is a rule that is often forgotten by companies that design software for the mass market.

For Further Thought #4:

Comment on the following statement: There is no program in the entire Unix toolbox that can't be mastered in less time than it takes to learn how to play the piano well.

Answer

To learn how to play the piano well, it generally takes about an hour of practice daily for at least several years. Since no single Unix tool requires this much work, the statement is literally true.

The spirit of the statement is that, although some people (including students) might complain that it takes a fair bit of effort to master the basic Unix tools, such complaints should be viewed with perspective. Learning to use Unix well takes less effort, over the long run, than is required to master other, less technical skills.

You might suggest to your students that the effort you are asking them to put forth is less than the effort they spent learning how to drive.
