Exercises and Answers for Chapter 17...

Filters: Comparing and Extracting

Review Question #1:

There are ten important Unix programs to compare, sort and select data from files. What are the ten programs?

Why are there so many programs?

Answer

The 10 important Unix programs used to compare, sort and select data from files are:

cmp
comm
diff
sdiff
cut
paste
sort
uniq
grep
look

For reference, you will find a summary of these 10 programs — all of which are filters — in the table below.

This table shows information about each of these programs: its purpose, the type of files it operates on, the number of files it operates on, and the chapter in which you will find the description of the program and how to use it.


There are so many programs because each one meets a specific need with respect to a specific type of file. This will begin to make more and more sense to you as you learn about each of the various programs. In time, you will appreciate the importance of having the right tool at the right time.

Imagine that, instead of having so many separate tools, there were one (or perhaps two or three) complex programs that, in principle, could do all the work. True, you would only have to learn how to use one program. However, it wouldn't be long before you came to the conclusion that programs that try to do "everything" take a long time to learn and are far too complicated to use quickly and efficiently on a daily basis.

Indeed, in such a situation, you would find yourself with a strong motivation to create many separate tools, each of which was relatively simple to use and each of which did only one thing well.

Isn't it nice that so many things about Unix are designed so well?

Filter  Purpose                                          Chapter  Type of files    Number of files
cmp     Compare two files                                17       binary or text   Two
comm    Compare two sorted files, show differences       17       text: sorted     Two
diff    Compare two files, show differences              17       text             Two
sdiff   Compare two files, show differences              17       text             Two
cut     Extract specified columns/fields of data         17       text             One or more
paste   Combine columns of data                          17       text             One or more
sort    Sort data                                        19       text             One or more
uniq    Select duplicate/unique lines                    19       text: sorted     One
grep    Select lines containing a specified pattern      19       text             One or more
look    Select lines beginning with a specified pattern  19       text: sorted     One

Review Question #2:

By default, the comm program compares two files and generates three columns of output. Explain the purpose of each column. How do you suppress a specific column?

Answer

The first column contains the lines that are only in the first file; the second column contains the lines that are only in the second file; the third column contains the lines that are in both files.

To suppress the output of the first, second or third columns, use the options -1, -2 and -3 respectively. To suppress more than one column, combine the options.

The following example compares two data files and suppresses the first and third columns. That is, it displays only the lines that appear only in the second file:

comm -13 file1 file2
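
To see the three columns in action, here is a small sketch you can run. The names file1 and file2 come from the example above, but their contents are invented for illustration. Setting LC_ALL=C is an extra precaution, assumed here so that comm and the sample files agree on what "sorted" means.

```shell
# Work in a scratch directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C    # plain byte order, so the sample files count as sorted

# Sample data (invented for this illustration); both files are already sorted.
printf '%s\n' apple banana cherry > file1
printf '%s\n' banana cherry date  > file2

# All three columns: column 2 is indented by one tab, column 3 by two tabs.
comm file1 file2

# Suppress columns 1 and 3, leaving only the lines that are only in file2.
only_in_second=$(comm -13 file1 file2)
echo "$only_in_second"

cd - > /dev/null
rm -rf "$workdir"
```

With this data, the last command displays only date, the one line that is in file2 but not in file1.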

Review Question #3:

Both the diff and sdiff programs compare files. When do you use diff and when do you use sdiff?

Answer

Using sdiff is equivalent to using diff with the -y option: it generates output using a side-by-side comparison.

Use sdiff when you want easy-to-understand output. Use diff when you want more compact output.

Review Question #4:

You are given a large text file. Which program would you use to select: a) duplicate lines, b) unique lines, c) lines containing a specific pattern, d) lines beginning with a specific pattern, e) columns of data?

Answer

a) uniq    duplicate lines
b) uniq    unique lines
c) grep    lines containing a specific pattern
d) look    lines beginning with a specific pattern
e) cut     columns of data
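
The following sketch tries four of these selections on a tiny file. The file name data and its contents are invented for illustration. The answer above does not say which uniq options to use; the sketch assumes -d for duplicate lines and -u for unique lines. The look program is left out only to keep the example short.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C

# Sample data (invented); sorted, because uniq only compares adjacent lines.
printf '%s\n' 'ant 1' 'bee 2' 'bee 2' 'cat 3' > data

dups=$(uniq -d data)        # a) one copy of each line that is repeated
singles=$(uniq -u data)     # b) lines that appear exactly once
matches=$(grep 'ee' data)   # c) lines containing the pattern "ee"
first3=$(cut -c 1-3 data)   # e) character columns 1 through 3

printf '%s\n' "$dups" "$singles" "$matches" "$first3"

cd - > /dev/null
rm -rf "$workdir"
```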

Applying Your Knowledge #1:

You are interested in comparing favorite foods with your two friends Claude and Eustace. Create three files named: me, claude, eustace. Each of the files contains five lines in sorted order, each of which has the name of a food. Using only two Unix commands, display a list of those foods that appear in all three lists. Hint: You may need to create a temporary file.

Answer

Remember to sort the three files. Then use:

comm -12 me claude > temp
comm -12 temp eustace
rm temp
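
Here is the same solution as a complete sketch you can run. The foods are made up for illustration, and LC_ALL=C is an assumption added so the sample lists are sorted the way comm expects.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C

# Five foods per file, in sorted order (sample data invented for illustration).
printf '%s\n' apples bread cheese pasta pizza > me
printf '%s\n' bread cheese grapes pizza sushi > claude
printf '%s\n' bread curry pizza salad tacos   > eustace

comm -12 me claude > temp                # foods shared by me and claude
in_all_three=$(comm -12 temp eustace)    # of those, the ones eustace also lists
rm temp

echo "$in_all_three"

cd - > /dev/null
rm -rf "$workdir"
```

With these lists, only bread and pizza appear in all three files.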

Applying Your Knowledge #2:

The comm program is used to compare sorted files; the diff program compares unsorted files.

Give three examples of types of data which you would compare using comm.

Give three examples of types of data where you would use diff.

Are there any instances where either program would work?

Answer

Use comm to compare small files with sorted data, such as:

• Two lists of names
• Two almost identical lists of numbers
• Two lists of part numbers

Use diff to compare unsorted data, or large files with sorted data, such as:

• Two versions of a program
• Two versions of an essay
• Two long sorted lists of names

Either command will work with short files containing sorted data. However, the programs do generate different output. The output from comm can be confusing with large files. In such cases, diff is better, even if the data is sorted.

Applying Your Knowledge #3:

Each line of the Unix password file (/etc/passwd) contains information about a userid. Within the line, the various fields of data are delimited by : (colon) characters. One of the fields contains the name of the shell for that userid. Use the following command to display and study the format of the password file on your system:

less /etc/passwd

(From within less, you can press <Space> to page down and q to quit.)

What command would you use to read the password file and display a list of the various shells used on your system?

How would you sort the output to make it more readable?

How would you eliminate duplications?

Answer

Here is a typical line from a password file:

harley:x:500:500:Harley Hahn:/home/harley:/bin/bash

As you can see, there are 7 fields delimited by : (colon) characters. The name of the shell program is in the 7th field. To extract this field from all the lines in the password file, use:

cut -f 7 -d ':' /etc/passwd

This command lists the various shells being used on your system. To sort the output, use:

cut -f 7 -d ':' /etc/passwd | sort

To eliminate duplications, use either of the following:

cut -f 7 -d ':' /etc/passwd | sort | uniq
cut -f 7 -d ':' /etc/passwd | sort -u
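
If you would like to experiment without depending on what is in your own password file, the sketch below runs the same pipelines on a small stand-in. The file name sample_passwd and every line in it except the harley line are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C

# A made-up file in /etc/passwd format; the userids and shells are hypothetical.
cat > sample_passwd <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
harley:x:500:500:Harley Hahn:/home/harley:/bin/bash
guest:x:501:501:Guest:/home/guest:/bin/sh
EOF

# Both ways of eliminating duplications should give the same list.
shells=$(cut -f 7 -d ':' sample_passwd | sort | uniq)
shells_u=$(cut -f 7 -d ':' sample_passwd | sort -u)

echo "$shells"

cd - > /dev/null
rm -rf "$workdir"
```

Although /bin/bash appears twice in the sample file, it is listed only once in the output.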

Applying Your Knowledge #4:

CSV format (comma-separated value format) describes a file containing machine-readable data in which fields are separated by commas. You have five files — data1, data2, data3, data4 and data5 — each of which contains a single column of data. What command would you use to put the five columns together into one CSV-formatted file named csvdata?

What happens if one of the files has fewer lines in it than the other files?

Answer

paste -d',' data1 data2 data3 data4 data5 > csvdata

If one of the files has fewer lines than the others, paste will use all the data it can find. When a file runs out, paste inserts an empty field in its place. Thus, wherever a column has fewer lines than another, the output will have two commas in a row (or a trailing comma, if the short file is the last one listed).
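
To watch the empty field appear, here is a scaled-down sketch that uses three files instead of five. The contents of the files are invented, and data2 is deliberately one line short.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three short single-column files (invented data); data2 has only two lines.
printf '%s\n' a1 a2 a3 > data1
printf '%s\n' b1 b2    > data2
printf '%s\n' c1 c2 c3 > data3

paste -d',' data1 data2 data3 > csvdata
cat csvdata

# The third line is where data2 ran out of data.
last_line=$(tail -n 1 csvdata)

cd - > /dev/null
rm -rf "$workdir"
```

The third line of csvdata is a3,,c3: the two commas in a row mark the field that data2 could not supply.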

For Further Thought #1:

The purpose of the diff program is to highlight differences by displaying terse instructions for turning the first file into the second file. Create two files named a and b. Compare the output of the following commands:

diff a b
diff b a

What patterns do you notice?

Answer

The first command generates the instructions to turn the file a into the file b. The second command generates the instructions to turn the file b into the file a.

Thus, the two sets of output are inverses of one another.
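
A small sketch makes the inverse relationship concrete. The contents of a and b are invented for illustration: b is the same as a with one line removed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' one two three > a
printf '%s\n' one three     > b

# diff exits with status 1 when the files differ, hence the "|| true".
forward=$(diff a b || true)
backward=$(diff b a || true)

printf '%s\n' "--- diff a b ---" "$forward" "--- diff b a ---" "$backward"

cd - > /dev/null
rm -rf "$workdir"
```

The first command reports 2d1 (delete line 2 of a to get b). The second reports 1a2 (after line 1 of b, add a line to get a). A delete in one direction is an add in the other.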

For Further Thought #2:

Why is the output of the diff command so compact?

Should it be easier to understand?

Answer

The output of diff is designed to generate as little output as possible. This is suitable for users who are adept at reading this type of output and for programs that interpret this type of output.

There is no need to make the output of diff easier to understand for two reasons. First, once you get used to it, it is not hard to understand. Second, if you really need less compact, more verbose output, you can use the sdiff program.

For Further Thought #3:

You are given a text file with 10,000 lines. The file contains two columns of data, and you must change the order of the columns. You can do so quickly and accurately by using the cut and paste commands. Suppose you did not have these commands. Using any other tools at your disposal, how might you accomplish the job?

Consider using a text editor, word processor, a spreadsheet program, writing a program of your own, and so on. Is there anything you can think of that is easier than the Unix cut and paste programs? Why is this?

What qualities do the programs in this chapter have that make them so useful?

Answer

With a spreadsheet program, you would have to select the data from one column and cut it (^X) to the clipboard (an awkward process when you have 10,000 lines). You would then create a new column next to the second column and paste the data (^V) into the new column. Finally, you would delete the original column.

This type of operation is difficult to do with a word processor, unless the data is set up as a table. If it is set up as a table (unlikely with 10,000 lines), you could use a procedure similar to the one for the spreadsheet, but it would be a lot more problematic.

With a text editor also, this is a time-consuming procedure. However, it is possible, as long as the text editor can operate in "column mode". Even then, it will be very awkward. If the text editor does not have a column mode, forget it.

If you are an experienced programmer, writing such a program (say, a Python, JavaScript, PHP, Bash, or Perl script) isn't difficult, but it would be a lot more trouble than using the Unix commands. When it comes to such tasks, there is nothing better than cut and paste.

The programs in this chapter are so useful because they are designed well, require little effort on the part of the user, and work with large amounts of data as easily as with small amounts of data.
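
For comparison, here is one way the job might look with cut and paste. This is a sketch under stated assumptions: the two columns are separated by tabs, and a three-line file named twocol (invented here) stands in for the 10,000-line file. The same commands would work unchanged on the larger file.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny two-column, tab-separated stand-in for the 10,000-line file.
printf '%s\t%s\n' alpha 1 beta 2 gamma 3 > twocol

cut -f 1 twocol > col1        # the first column
cut -f 2 twocol > col2        # the second column
paste col2 col1 > swapped     # rejoin the columns in the opposite order

cat swapped
swapped_first=$(head -n 1 swapped)

cd - > /dev/null
rm -rf "$workdir"
```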
