Harley Hahn's Guide to
Exercises and Answers for Chapter 17...
Filters: Comparing and Extracting
Review Question #1:
There are ten important Unix programs to compare, sort and select data from files. What are the ten programs?
Why are there so many programs?
The 10 important Unix programs used to compare, sort and select data from files are:
For reference, you will find a summary of these 10 programs — all of which are filters — in the table below.
This table will show you information about each of these programs: its purpose, the type of files it operates on, the number of files it operates on, and the chapter in which you will find the description of the program and how to use it.
There are so many programs, because each one meets a specific need with respect to a specific type of file. This will begin to make more and more sense to you as you learn about each of the various programs. In time, you will appreciate the importance of having the right tool at the right time.
Imagine that, instead of having so many separate tools, there were one (or perhaps two or three) complex programs that, in principle, could do all the work. True, you would only have to learn how to use one program. However, it wouldn't be long before you came to the conclusion that programs that try to do "everything" take a long time to learn and are far too complicated to use quickly and efficiently on a daily basis.
Indeed, in such a situation, you would find yourself with a strong motivation to create many separate tools, each of which was relatively simple to use and each of which did only one thing well.
Isn't it nice that so many things about Unix are designed so well?
Review Question #2:
By default, the comm program compares two files and generates three columns of output. Explain the purpose of each column. How do you suppress a specific column?
The first column contains the lines that are only in the first file; the second column contains the lines that are only in the second file; the third column contains the lines that are in both files.
To suppress the output of the first, second or third columns, use the options -1, -2 and -3 respectively. To suppress more than one column, combine the options.
The following example compares two data files and suppresses the first and third columns. That is, it displays all the lines that are only in the second file:
comm -13 file1 file2
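To see the three columns and the effect of suppressing them, here is a minimal sketch. The file contents are hypothetical sample data, and the files are created in a temporary directory so that nothing of yours is overwritten.

```shell
# Hypothetical sample data; comm expects both files to be sorted.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/file1"
printf '%s\n' banana cherry date > "$dir/file2"

# All three columns: only in file1, only in file2, in both files.
comm "$dir/file1" "$dir/file2"

# Suppress columns 1 and 3, leaving the lines found only in file2.
only_in_second=$(comm -13 "$dir/file1" "$dir/file2")
echo "$only_in_second"    # prints: date
```

Because columns are indented with tabs, suppressing a column also removes the indentation that column would have caused.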
Review Question #3:
Both the diff and sdiff programs compare files. When do you use diff and when do you use sdiff?
Using sdiff is equivalent to using diff with the -y option: it generates output using a side-by-side comparison.
Use sdiff when you want easy-to-understand output. Use diff when you want more compact output.
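As a rough illustration of the difference, the following sketch compares two small hypothetical files both ways. It assumes a version of diff that supports the -y option (GNU diff does); on such systems, sdiff with the same two file names should produce a similar side-by-side listing.

```shell
# Hypothetical sample files that differ only on the second line.
dir=$(mktemp -d)
printf '%s\n' one two three > "$dir/old"
printf '%s\n' one TWO three > "$dir/new"

# diff exits with status 1 when the files differ, so "|| true" keeps
# the script going even if it is run with "set -e".
compact=$(diff "$dir/old" "$dir/new" || true)
echo "$compact"           # terse change instructions, starting with 2c2

# Side-by-side comparison (assumes diff supports -y).
diff -y "$dir/old" "$dir/new" || true
```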
Review Question #4:
You are given a large text file. Which program would you use to select: a) duplicate lines, b) unique lines, c) lines containing a specific pattern, d) lines beginning with a specific pattern, e) columns of data?
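Here is a sketch of how each selection might look in practice, using a small hypothetical file named data. For (d), grep with the ^ anchor is used here as an assumption; where it is installed, the look program is another way to select lines that begin with a specific pattern.

```shell
# Hypothetical sorted sample data with two colon-separated columns.
dir=$(mktemp -d)
printf '%s\n' 'ant:1' 'ant:1' 'bee:2' 'cat:3' > "$dir/data"

# uniq compares adjacent lines only, so the input should be sorted.
dups=$(uniq -d "$dir/data")             # a) duplicate lines: ant:1
uniques=$(uniq -u "$dir/data")          # b) unique lines: bee:2 and cat:3
matches=$(grep 'e' "$dir/data")         # c) lines containing "e": bee:2
starts=$(grep '^ca' "$dir/data")        # d) lines beginning with "ca": cat:3
column=$(cut -d ':' -f 1 "$dir/data")   # e) first column: ant ant bee cat

printf '%s\n' "$dups" "$uniques" "$matches" "$starts" "$column"
```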
Applying Your Knowledge #1:
You are interested in comparing favorite foods with your two friends Claude and Eustace. Create three files named: me, claude, eustace. Each of the files contains five lines in sorted order, each of which has the name of a food. Using only two Unix commands, display a list of those foods that appear in all three lists. Hint: You may need to create a temporary file.
Remember to sort the three files. Then use:
comm -12 me claude > temp
comm -12 temp eustace
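The whole exercise can be tried out with a sketch like the one below. The food lists are hypothetical, and the files live in a temporary directory. The first comm command saves the foods common to me and claude; a second comm command then compares that temporary file with eustace.

```shell
# Hypothetical food lists, five lines each, passed through sort to be safe.
dir=$(mktemp -d)
printf '%s\n' apples bread cheese pasta pizza | sort > "$dir/me"
printf '%s\n' apples cheese grapes pizza soup | sort > "$dir/claude"
printf '%s\n' bread cheese pizza rice soup | sort > "$dir/eustace"

# Step 1: foods that appear in both me and claude.
comm -12 "$dir/me" "$dir/claude" > "$dir/temp"

# Step 2: of those, the foods that also appear in eustace.
common=$(comm -12 "$dir/temp" "$dir/eustace")
echo "$common"            # prints cheese and pizza, one per line
```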
Applying Your Knowledge #2:
The comm program is used to compare sorted files; the diff program compares unsorted files.
Give three examples of types of data which you would compare using comm.
Give three examples of types of data where you would use diff.
Are there any instances where either program would work?
Use comm to compare small files with sorted data, such as:
• Two lists of names
Use diff to compare unsorted data, or large files with sorted data, such as:
• Two versions of a program
Either command will work with short files containing sorted data. However, the programs do generate different output. The output from comm can be confusing with large files. In such cases, diff is better, even if the data is sorted.
Applying Your Knowledge #3:
Each line of the Unix password file (/etc/passwd) contains information about a userid. Within the line, the various fields of data are delimited by : (colon) characters. One of the fields contains the name of the shell for that userid. Use the following command to display and study the format of the password file on your system:
less /etc/passwd
(From within less, you can press <Space> to page down and q to quit.)
What command would you use to read the password file and display a list of the various shells used on your system?
How would you sort the output to make it more readable?
How would you eliminate duplications?
Here is a typical line from a password file:
As you can see, there are 7 fields delimited by : (colon) characters. The name of the shell program is in the 7th field. To extract this field from all the lines in the password file, use:
cut -f 7 -d ':' /etc/passwd
This command lists the various shells being used on your system. To sort the output, use:
cut -f 7 -d ':' /etc/passwd | sort
To eliminate duplications, use either of the following:
cut -f 7 -d ':' /etc/passwd | sort | uniq
cut -f 7 -d ':' /etc/passwd | sort -u
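If you would like to experiment without touching the real password file, here is a minimal sketch that runs the same pipeline on a small sample file. The file name passwd.sample and all of the entries in it are hypothetical; they only imitate the seven-field format.

```shell
# Hypothetical entries in the seven-field, colon-delimited format.
dir=$(mktemp -d)
cat > "$dir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash
bob:x:1001:1001:Bob:/home/bob:/bin/zsh
EOF

# Extract field 7 (the shell), sort it, and drop the duplicates.
shells=$(cut -f 7 -d ':' "$dir/passwd.sample" | sort | uniq)
echo "$shells"
```

Here /bin/bash appears twice in the sample data but only once in the output, because sort places the duplicate lines next to each other where uniq can remove them.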
Applying Your Knowledge #4:
CSV format (comma-separated value format) describes a file containing machine-readable data in which fields are separated by commas. You have five files — data1, data2, data3, data4 and data5 — each of which contains a single column of data. What command would you use to put the five columns together into one CSV-formatted file named csvdata?
What happens if one of the files has fewer lines in it than the other files?
paste -d',' data1 data2 data3 data4 data5 > csvdata
If one of the files has fewer lines in it than the other files, paste will use all the data it can find. When it runs out, it will insert an empty field. Thus, when a column has fewer lines than another column, there will be two commas in a row in the final output wherever missing data was detected.
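The empty field is easy to see with a small sketch. To keep it short, this example uses three hypothetical files instead of five, and the second file is one line shorter than the others.

```shell
# Hypothetical single-column files; data2 has only two lines.
dir=$(mktemp -d)
printf '%s\n' a1 a2 a3 > "$dir/data1"
printf '%s\n' b1 b2 > "$dir/data2"
printf '%s\n' c1 c2 c3 > "$dir/data3"

# Join the columns with commas to build the CSV file.
paste -d',' "$dir/data1" "$dir/data2" "$dir/data3" > "$dir/csvdata"
cat "$dir/csvdata"
# prints:
# a1,b1,c1
# a2,b2,c2
# a3,,c3
```

If the short file were the last one on the command line, the empty field would show up as a comma at the end of the line rather than as two commas in a row.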
For Further Thought #1:
The purpose of the diff program is to highlight differences by displaying terse instructions for turning the first file into the second file. Create two files named a and b. Compare the output of the following commands:
diff a b
diff b a
What patterns do you notice?
The first command generates the instructions to turn the file a into the file b. The second command generates the instructions to turn the file b into the file a.
Thus, the two sets of output are inverses of one another.
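You can watch the inversion happen with a sketch like this one. The contents of a and b are hypothetical: b is simply a with one extra line at the end.

```shell
# Hypothetical files: b has one more line than a.
dir=$(mktemp -d)
printf '%s\n' one two > "$dir/a"
printf '%s\n' one two three > "$dir/b"

# diff exits with status 1 when the files differ, hence "|| true".
forward=$(diff "$dir/a" "$dir/b" || true)    # turn a into b
backward=$(diff "$dir/b" "$dir/a" || true)   # turn b into a

echo "$forward"     # 2a3 followed by "> three": add a line
echo "$backward"    # 3d2 followed by "< three": delete a line
```

An "add" instruction in one direction becomes a "delete" instruction in the other, and the > and < markers swap.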
For Further Thought #2:
Why is the output of the diff command so compact?
Should it be easier to understand?
The diff program is designed to generate as little output as possible. This is suitable for users who are adept at reading this type of output and for programs that interpret this type of output.
There is no need to make the output of diff easier to understand for two reasons. First, once you get used to it, it is not hard to understand. Second, if you really need less compact, more verbose output, you can use the sdiff program.
For Further Thought #3:
You are given a text file with 10,000 lines. The file contains two columns of data, and you must change the order of the columns. You can do so quickly and accurately by using the cut and paste commands. Suppose you did not have these commands. Using any other tools at your disposal, how might you accomplish the job?
Consider using a text editor, a word processor, a spreadsheet program, a program of your own, and so on. Is there anything you can think of that is easier than the Unix cut and paste programs? Why is this?
What qualities do the programs in this chapter have that make them so useful?
With a spreadsheet program, you would have to select the data from one column and cut it (^X) to the clipboard (an awkward process when you have 10,000 lines). You would then create a new column next to the second column and paste the data (^V) into the new column. Finally, you would delete the original column.
This type of operation is difficult to do with a word processor, unless the data is set up as a table. If it is set up as a table (unlikely with 10,000 lines), you could use a procedure similar to the one for the spreadsheet, but it would be a lot more problematic.
With a text editor, this is also a time-consuming procedure. However, it is possible, as long as the text editor can operate in "column mode". Even then, it will be very awkward. If the text editor does not have a column mode, forget it.
The programs in this chapter are so useful because they are designed well, require little effort on the part of the user, and work with large amounts of data as easily as with small amounts of data.
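For comparison, here is a sketch of the cut and paste approach mentioned in the question. It assumes the two columns are separated by tabs, which is the default delimiter for both programs; the file names and data are hypothetical. The same three commands would work unchanged on a file with 10,000 lines.

```shell
# Hypothetical tab-separated file with two columns.
dir=$(mktemp -d)
printf 'alpha\t1\nbeta\t2\ngamma\t3\n' > "$dir/data"

# Split the columns into separate files, then rejoin them in reverse order.
cut -f 1 "$dir/data" > "$dir/col1"
cut -f 2 "$dir/data" > "$dir/col2"
paste "$dir/col2" "$dir/col1" > "$dir/swapped"

cat "$dir/swapped"    # first line is now: 1<Tab>alpha
```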
© All contents Copyright 2024, Harley Hahn