Harley Hahn

Chapter 16...

Filters: Introduction and Basic Operations

In Chapter 15, we discussed how the Unix philosophy led to the development of many programs, each of which was a tool designed to do one thing well. We also talked about how to redirect input and output, and how to create pipelines in which data is passed from one program to the next.

In the next four chapters (16, 17, 18 and 19), we will continue the discussion by taking a look at a number of very useful Unix programs called "filters". (I'll give you the exact definition soon.) Using these programs with the techniques we discussed in Chapter 15, you will be able to build flexible, customized solutions to a wide variety of problems.

We'll start our discussion by talking about some general topics to help you understand the importance of filters and how they are used. We will then move on to discuss the most important Unix filters. Although some of the filters are related, they are independent tools that do not have to be learned in a particular order. If you want to learn about one specific filter, you can jump right to that section. However, if you have the time, I'd prefer that you read all four chapters in order, from beginning to end, as I will be developing various important ideas along the way. If you want to see a list of the filters before we start, take a look at Figure 16-1, which you will find later in this chapter.

In Chapter 20, we will discuss a very important facility called regular expressions, which are used to specify patterns. Regular expressions can increase the power of filters significantly, so you should consider the next four chapters and Chapter 20 as being complementary.

Variations of Commands and Options

The purpose of Chapters 16, 17, 18 and 19 is to discuss the basic Unix filters. All of these programs are available with most versions of Unix and Linux. If one of the programs is not available on your system, it may be because the program is not installed by default and you need to install a particular package. For example, with some Linux distributions, you won't be able to use the strings program (Chapter 19), unless you have installed the binutils (binary file utilities) package.

As you know, the details of a particular program can vary from one system to another. In these four chapters, I will describe the GNU version of each command. Since Linux and FreeBSD use the GNU utilities (see Chapter 2), if you are a Linux or FreeBSD user, what you read here should work unchanged on your system. If you use another type of Unix, there may be differences, but they will be small.

For example, later in this chapter, I will discuss three options you can use with the cat command. If you use Linux or FreeBSD, these options will work exactly as I show you. If you use Solaris, one of the options (-s) has a different meaning.

As we discuss each filter, I will introduce you to the most important options for that program. You should understand that most programs will have other options we will not discuss. In fact, almost all the GNU utilities have a lot of options, many more than you will normally need.

As you read this chapter, please remember that, whenever you want to learn about a program, you can read the definitive documentation for your system by using the man command to access the online manual and, if you are using the GNU utilities, the info command to access the Info system. (This is all explained in Chapter 9.) In particular, you can use the man command to display a list of all the options available with a specific command. For example, to learn about the cat command (discussed in this chapter), you can use:

man cat
info cat

Before we start, let me mention two more points that apply to the GNU utilities (used with Linux and FreeBSD). First, as we discussed in Chapter 10, most of the GNU utilities have two types of options. There are short options, consisting of a dash (hyphen) followed by a single character, and long options consisting of two dashes followed by a word. In Chapter 10, we called these the "dash" and "dash-dash" options, because that is how most people talk about them.

As a general rule, the most important options are the short ones. In most cases, the long options are either synonyms for shorter options or more esoteric options you probably won't need. For this reason, in this book, I generally only talk about the short options.

However, there is one long option you should remember. With the GNU utilities, most commands recognize an option named --help. You can use this option to display the syntax for almost any command, including a summary of the command's options. For example, to display the syntax and the options for the cat command, you can use:

cat --help

Filters

In Chapter 15, you saw how a series of programs can be used in sequence to create a pipeline, almost like an assembly line. Consider, for example, the following command, in which data passes through four programs in sequence: cat, grep, sort and less.

cat new old extra | grep Harley | sort | less

In this pipeline, we combine three files named new, old and extra (using cat), extract all the lines that contain Harley (using grep), and then sort these results (using sort). We then use less to display the final output one screenful at a time.
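
If you would like to watch this pipeline run on real data, here is a minimal sketch. The contents of the three files are invented for the illustration, and I have left less off the end so the result can be captured in a variable.

```shell
# Work in a scratch directory (assumes mktemp is available).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT

# Three small sample files; the contents are made up for this example.
printf 'Harley wrote a note\nnothing to see here\n' > new
printf 'Harley was here\n' > old
printf 'an unrelated line\nHarley again\n' > extra

# Combine the files, keep only the lines containing Harley, then sort them.
# For long output, you would add "| less" to the end of the pipeline.
result=$(cat new old extra | grep Harley | sort)
printf '%s\n' "$result"
```

Only the three lines that contain Harley survive, and they come out in sorted order.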

Don't worry about the details of how to use cat, grep and sort. We'll talk about all that later. For now, all I want is for you to appreciate how useful a program can be if it is designed so that it can be used within a pipeline.

We call such programs filters. For example, cat, grep and sort are all filters. Such programs read data, perform some operation on the data, and then write the results. More precisely, a FILTER is any program that reads and writes textual data, one line at a time, reading from standard input and writing to standard output. As a general rule, most filters are designed as tools, to do one thing well.

Interestingly enough, the first and last programs in a pipeline do not have to be filters. In our example, for instance, we use less to display the output of sort. We will discuss less in detail in Chapter 21. For now, I'll tell you that when less displays output, it allows you to look at all the data one screenful at a time, scroll backwards and forwards, search for a specific pattern, and so on. Clearly, less does not write to standard output one line at a time, which means it is not a filter.

Similarly, the first command in this particular pipeline, cat (which combines files), does not read from the standard input. Although cat can be used as a filter, in this situation it reads its input from files, not from standard input. Thus, in our example, cat is not a filter.

— hint —

When you find yourself creating a specific pipeline to use over and over, you can define it permanently by creating an alias in your environment file. (See Chapter 14.) This will allow you to use the pipeline whenever you want, without having to type it every time.

Should You Create Your Own Filters?

If you are a programmer, it is not hard to make your own filters. All you have to do is write a program or shell script that reads and writes textual data, one line at a time, using standard I/O. Any program that does this is a filter and, hence, can be used in a pipeline.
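
To give you an idea of how little is involved, here is a minimal sketch of a filter written as a shell function. The name bracket is hypothetical, something I made up for this example: it reads standard input one line at a time and writes each line to standard output, wrapped in square brackets.

```shell
# bracket: a hypothetical filter (the name is invented for this example).
# It reads standard input one line at a time and writes each line,
# wrapped in square brackets, to standard output.
bracket() {
    # The extra test handles a final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '[%s]\n' "$line"
    done
}

# Because it uses only standard I/O, it can be used within a pipeline.
result=$(printf 'first\nsecond\n' | bracket)
printf '%s\n' "$result"
```

Because bracket uses nothing but standard input and standard output, it can sit anywhere in a pipeline, just like the filters that come with your system.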

Before you run off to design your own programs, however, let me remind you that every Unix and Linux system comes with hundreds of programs, many of which are filters. Indeed, over the last thirty-five years, some of the smartest programmers in history have been creating and perfecting Unix filters.

This means that, if you think of a great idea for a new filter, chances are someone else had the same idea a long time ago and the filter already exists. In fact, most of the tools we will discuss in this chapter are over thirty years old! Thus, when you have a problem, it behooves you to find out what is already available, before you take the time to write your own program. That is why I spent so much time teaching you about the online manual and the Info system (Chapter 9), as well as the man, whatis, apropos and info commands (also Chapter 9).

The art of using Unix well does not necessarily lie in being able to write programs to create new tools, although that certainly is handy. For most people, using Unix well means being able to solve problems by putting together tools that already exist.

The Problem Solving Process

If you watch an experienced Unix person use filters to build a pipeline, the technique looks mysterious. Out of nowhere, it seems, he or she will know exactly which filters to use, and exactly how to combine them in just the right way. Eventually, you will be able to do the same. All it takes is knowledge and practice. At first, however, it helps to break down the process into a series of steps, so let's take a moment to do just that.

Your goal is to figure out how to solve the problem at hand by combining a number of filters into a single pipeline. If necessary, you can use more than one command line, or even a simple shell script containing a list of commands. However, the smartest Unix people solve most of their problems with a single command line, so start with that as your goal.

Right now, there isn't a lot you can do until you actually learn how to use some of the Unix filters. So as you read this section, think of it as general-advice-that-will-eventually-make-sense-to-you. What I am about to explain is a roadmap that shows you where you will be going. Concentrate on the general ideas and, later, when you get stuck, you can unstick yourself by coming back and reading this section again.

So: you have a problem you want to solve using filters and a pipeline. How do you go about doing so? Here are the steps.

1. Frame the problem.

Start by thinking. Find a quiet place, close your eyes, and think about how you can break your problem into parts, each of which can be carried out by a separate program. At this point, you don't need to know which tools you will be using to perform the various tasks. All you need to do is think.

When experienced Unix people think about a problem, they turn it over in their mind, looking at it from different points of view, until they find something that looks like it might work. Then they look for the tools to do the job. Then they experiment, to see how it works. If you watch them work, you will notice they never get frustrated. (Take a moment to think about that.)

2. Choose your tools.

There are hundreds of Unix programs — many of which are filters — and, to use Unix well, you need to know which programs are the best for whatever problem you happen to encounter. This, of course, sounds impossible. How can you memorize the function of hundreds of programs, let alone the details?

Actually, you will find that most Unix problems can be solved by selecting filters from a relatively small toolbox of around thirty programs. Over the years, these are the programs that have proven to be the most versatile and the most useful, and these are the programs we will be discussing in the next four chapters. For reference, Figure 16-1 (later in this chapter) contains a list of these important filters.

3. Talk to other people.

Once you have thought about how to frame your problem and you have an idea what tools you might use, look for people you can ask for suggestions. It's true that you should read the manual before you ask for help (see Chapter 9) but, traditionally, Unix has always been taught by word of mouth. To mature as a Unix person, you must see how other, more experienced people solve problems.

4. Select options.

Once you have studied your problem and chosen your tools, you should take a few moments to look at the documentation in the online manual (Chapter 9). Do this for each program you are thinking of using. Your goal is to check out the options, looking for the ones that are relevant to what you are trying to do.

It will always be the case that you can safely ignore most options. However, it is always a good idea to at least skim the description of the options in case there is one you need for this particular problem. Otherwise, you run the risk of doing a lot of extra work because you didn't know that a particular option was available. The smartest, most knowledgeable people I know check the online manual several times a day.

— hint —

When it comes to solving problems using redirection, filters and pipelines, the three most important skills are thinking, RTFMing(*), and asking other people for their opinions.

* Footnote

RTFM is explained in the glossary and in Chapter 9.

The Simplest Possible Filter: cat

A filter reads from standard input one line at a time, does something, and then writes the results to standard output one line at a time. What would be the simplest possible filter? The one that does nothing at all.

The name of this filter is cat (you will see why in a moment), and all it does is copy data from standard input to standard output, without doing anything special or changing the data in any way.

Here is a simple example you can perform for yourself. Enter the command:

cat

The cat program will start and wait for data from standard input. Since, by default, standard input is the keyboard, cat is waiting for you to type something.

Type whatever you want. At the end of each line, press the <Return> key. Each time you press <Return>, the line you have just typed is sent to cat, which copies it to standard output (by default, your screen). The result is that each line you type is displayed twice: once when you type it, and once by cat. For example:

this is line 1
this is line 1
this is line 2
this is line 2

When you are finished, press ^D (<Ctrl-D>), the eof key. This tells Unix that there is no more input (see Chapter 7). The cat command will end, and you will be returned to a shell prompt.

By now, you are probably asking, what use is a filter that does nothing? Actually, there are several uses and they are all important.

Since cat doesn't do anything, there is no point using it within a pipeline. (Take a moment to think about this until it makes sense.) However, cat can be handy all by itself when you combine it with I/O redirection. It can also be useful at the beginning of a pipeline when you need to combine more than one file. Here are some examples.

The first use of cat is to combine it with redirection to create a small file quickly. Consider the following command:

cat > data

The standard input (by default) is the keyboard, but the standard output has been redirected to a file named data. Thus, every line that you type is copied directly to this file as soon as you press the <Return> key. You can type as many lines as you want and, when you are finished, you can tell cat there is no more data by pressing ^D.

If the file data does not already exist, Unix will create it for you. If the file does exist, its contents will be replaced. In this way, you can use cat to create a new file, or replace the contents of an existing file. Experienced users do this a lot, when all they want to do is create or replace a small file.

The reason I say a "small file" is that, the moment you press <Return>, the line you just typed is copied to standard output. If you want to change the line, you have to stop cat, restart it, and type everything all over again.

I find that using cat in this way is a great way to create or replace a small file quickly, say 4-5 lines at most. Using cat is faster (and more fun) than starting a text editor, such as vi or Emacs, typing the text, and then stopping the text editor. Of course, it is easy to make mistakes so, if I want to type more than 5 lines, I'll use an editor, which lets me make changes as I type.

The second use for cat is to append a small number of lines to an existing file. To do this, you use >> to redirect the standard output, for example:

cat >> data

Now, whatever you type is appended to the file data. (As I explained in Chapter 15, when you redirect output with >>, the shell appends the output.)
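
Here is a small sketch that shows the difference between > and >> without any typing at the keyboard. In place of the keyboard, it uses a here-document (not something covered in this section) to supply the input lines.

```shell
# Work in a scratch directory (assumes mktemp is available).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT

# A here-document stands in for the keyboard, so this runs unattended.
cat > data <<'EOF'
this is line 1
this is line 2
EOF

# ">" replaces the contents of data: the two earlier lines are gone.
cat > data <<'EOF'
a fresh start
EOF

# ">>" appends to data, keeping what is already there.
cat >> data <<'EOF'
one more line
EOF

contents=$(cat data)
printf '%s\n' "$contents"
```

At the end, data holds two lines: the one written by the second > and the one added by >>.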

The third use for cat is to display a short file. Simply redirect the standard input to the file you want to display. For example:

cat < data

In this case, the input comes from the file data, and the output goes to the screen (by default). In other words, you have just displayed the file data. Of course, if the file is longer than the size of your screen, some of the lines will scroll off the screen before you can read them. In such cases, you would not use cat to display the file; you would use less (Chapter 21) to display the file one screenful at a time.

The fourth use for cat is to display the last part of any file. Let's say you use the previous command to display the file named data. If data is a short file that will fit on the screen, all is well. However, if data is longer than the size of the monitor, all but the last part will scroll off the screen. Usually, this will happen fast and all you will be left with is the last part of the file — as much as will fit on your screen — which is exactly what you want.

If you'd like to try this for yourself, you can use cat to display one of the configuration files I discussed in Chapter 6. For example, try this command:

cat < /etc/profile

Notice how, when you display a long file, cat leaves you looking at the last part of it.

As a convenience, if you leave out the < character, cat will read directly from the file. (We'll talk about this in the next section.) So, if you want to experiment with the configuration files from Chapter 6, the commands to use are:

cat /boot/grub/menu.lst
cat /etc/hosts
cat /etc/inittab
cat /etc/passwd
cat /etc/profile
cat /etc/samba/smb.conf

(Notes: 1. If you are not using Linux, your system may not have all of these files. 2. On most systems, you will need to be superuser to display the menu.lst file.)

As you experiment with these commands, you will see that, if the file is short, cat will show it all to you quickly. If the file is long, most of the file will scroll by so quickly you won't be able to read it. What you will see, as we discussed, is the last part of the file (as much as will fit on your screen).

Later in the chapter, we will meet another program, called tail, which can be used to display the end of a file quickly. Most of the time, tail works better than cat. However, in some cases, cat is actually a better choice. This is because tail displays the number of lines you specify, with 10 being the default; cat shows you as many lines as will fit on your screen. To see what I mean, try using tail on some of the files listed above. For example:

tail /etc/profile
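
If you would rather not depend on the configuration files, the following sketch builds a 25-line sample file and compares the default behavior of tail with an explicit line count. The -n option is how you specify the number of lines; the file name sample is made up for the example.

```shell
# Work in a scratch directory (assumes mktemp is available).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT

# Build a 25-line sample file.
i=1
while [ "$i" -le 25 ]; do
    echo "line $i"
    i=$((i + 1))
done > sample

# By default, tail shows the last 10 lines; -n asks for a different number.
default_out=$(tail sample)
three_out=$(tail -n 3 sample)
printf '%s\n' "$three_out"
```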

Moving on, the fifth use of cat is to make a copy of a file by redirecting both the standard input and output. For example, to copy the file data to another file named newdata, enter:

cat < data > newdata

Of course, Unix has a better command to use for copying files. It's called cp, and we'll talk about it in Chapter 25. However, it is interesting to know that cat can do the job if it needs to.
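
As a quick check that cat really does make a faithful copy, here is a sketch that copies a small file and then compares the two files with cmp (a filter from Chapter 17). The file contents are invented for the example.

```shell
# Work in a scratch directory (assumes mktemp is available).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT

printf 'alpha\nbeta\n' > data

# Copy the file by redirecting both standard input and standard output.
cat < data > newdata

# With -s, cmp prints nothing and simply reports success if the files match.
if cmp -s data newdata; then
    status="identical"
else
    status="different"
fi
echo "$status"
```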

There is even more that cat can do, which we'll get to in the next section. Before we do, let's take a moment to reflect on something truly remarkable. We started with the simplest possible filter, cat, a filter that — by definition — does nothing. However, by using cat with I/O redirection, we were able to make it sit up and perform a variety of tricks. (This is all summarized in Figure 16-2, later in the chapter.)

Putting cat through its paces in this way provides us with a good example of the elegance of Unix. What seems like a simple concept — that data should flow from standard input to standard output — turns out to bear fruit in so many unexpected ways. Look how much we can do with a filter that does nothing. Imagine what is yet to come!

— hint —

Part of the charm of Unix is, all of a sudden, having a great insight and saying to yourself, "So that's why they did it that way."

Increasing the Power of Filters

By making one significant change to a filter, it is possible to increase its usefulness enormously. The enhancement is to be able to specify the names of one or more input files.

As you know, the strict definition of a filter requires it to read its data from the standard input. If you want to read data from a file, you must redirect the standard input from that file, for example:

cat < data

However, what if we also had the option of reading from a file whose name we could specify as an argument? For example:

cat data

This is indeed the case with cat, and the last two commands are equivalent. Thus, to display a short file quickly, all you need to do is type cat followed by the name of a file, such as in the last example. Experienced Unix users often use cat in this way, as a quick way to display a short file. (For longer files, you would use less; see Chapter 21.)
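
You can confirm the equivalence for yourself with a sketch like the following, which captures the output of both forms and compares them. The contents of data are made up for the example.

```shell
# Work in a scratch directory (assumes mktemp is available).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT

printf 'a short file\nwith two lines\n' > data

# Read the file by redirecting standard input...
from_stdin=$(cat < data)
# ...and by naming the file as an argument.
from_arg=$(cat data)

if [ "$from_stdin" = "$from_arg" ]; then
    echo "same output"
fi
```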

At first, such a small change — leaving out the < character — seems insignificant, but this is not the case. It is true that we have made the command line simpler, which is convenient, but there is a price to pay. The cat program itself must now be more complex. It not only must be able to read from the standard input, it must also be able to read from any file. Moreover, by extending the power of cat, we have lost some of the beauty and simplicity of a pure filter.

Nevertheless, many filters are extended in just this way, not because it makes it easy to read from one file, but because it makes it possible to read from multiple files. For example, here is an abbreviated version of the syntax for the cat command:

cat [file...]

where file is the name of a file from which cat will read its input.

Notice the three dots after the file argument. This means that you can specify more than one file name. (See Chapter 10 for an explanation of command syntax.)

Thus — and here is the important point — in extending the power of cat to read from any file, we have also allowed it to read from more than one file. When we specify more than one input file, cat will read all the data from each of them in turn. As it reads, it will write each line of text, in the order it was encountered, to the standard output. This means we can use cat to combine the contents of as many files as we want.

This is a very important concept, so take a moment to consider the following examples carefully:

cat name address phone
cat name address phone > info
cat name address phone | sort

The first example combines the contents of multiple files (name, address and phone) and displays the result on your screen; the second example combines the same files and writes the data to another file (info); the third example pipes the data to a program (sort) for further processing.
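
To make this concrete, here is a sketch of the second example with some sample data. The contents of name, address and phone are invented for the illustration.

```shell
# Work in a scratch directory (assumes mktemp is available).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT

# Three one-line files; the contents are made up for this example.
printf 'Harley Hahn\n' > name
printf '123 Example Street\n' > address
printf '555-0100\n' > phone

# Combine the files, in the order given, and save the result in info.
cat name address phone > info

info_contents=$(cat info)
printf '%s\n' "$info_contents"
```

Notice that the data appears in info in exactly the order in which you named the files.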

As I mentioned, many other filters, not just cat, can also read input from multiple files. Technically, this is not necessary. If we want to operate on data from more than one file, we can collect the data with cat and then pipe it to whichever filter we want. For example, let's say we wanted to combine the data from three files and then sort it. There is no need for sort to be able to read from multiple files. All we need to do is combine the data using cat, and then pipe it to sort:

cat name address phone | sort

This is appealing in one sense. By extending cat to read from files as well as standard input, we have lost some of the elegance of the overall design. However, by using cat to feed other filters, we can at least retain the purity of the other filters.

However, as in many aspects of life, utility has won out over beauty and purity. It is just too much trouble to combine files with cat every time we want to send such data to a filter. Thus, most filters allow us to specify multiple file names as arguments.

For example, the following three commands all sort the data from more than one file. The first command displays the output on your screen; the second command saves the output to a file; the third command pipes the output to another program for further processing. (Don't worry about the details for now.)

sort name address phone
sort name address phone > info
sort name address phone | grep Harley
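
The following sketch shows that the two approaches produce the same result: sorting the output of cat, and giving the file names directly to sort. The one-word file contents are invented so the sorted order is easy to see.

```shell
# Work in a scratch directory (assumes mktemp is available).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT

# Sample data, made up for this example.
printf 'cherry\n' > name
printf 'apple\n' > address
printf 'banana\n' > phone

# The "pure" way: cat combines the files, sort reads standard input.
via_cat=$(cat name address phone | sort)
# The convenient way: sort reads the files itself.
direct=$(sort name address phone)

printf '%s\n' "$direct"
```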

At this point, I'd like you to consider the following philosophical question. By definition, a filter must read its data from standard input. Does this mean that a program that can read its data from a file is not really a filter?

There are two possible answers, both of which are acceptable. First, we can decide that when a program like cat or sort reads from standard input, it is acting like a filter, but when it reads from a file, it is not acting as a filter. This approach maintains the purity of the system. However, it also means that a great many programs may or may not be filters, depending on how they are used.

Alternatively, we can broaden the definition of a filter to allow it to read from either standard input or from a file. This definition is practical, but it sacrifices some of the beauty of the original design.
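
If you write your own filters, one simple way to get the broader behavior is to let cat do the reading. In the sketch below, upper is a hypothetical filter (the name and the sample data are my inventions): with no arguments it reads standard input, and with arguments it reads the named files.

```shell
# upper: a hypothetical filter that changes lowercase letters to uppercase.
# With no arguments, cat copies standard input; with arguments, cat reads
# the named files in order. Either way, tr sees a single stream of data.
upper() {
    cat -- "$@" | tr '[:lower:]' '[:upper:]'
}

# Work in a scratch directory (assumes mktemp is available).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
trap 'rm -rf "$tmp"' EXIT
printf 'from a file\n' > data

as_filter=$(printf 'from a pipe\n' | upper)   # reads standard input
with_file=$(upper data)                       # reads the named file
printf '%s\n%s\n' "$as_filter" "$with_file"
```

The part that does the real work (tr) remains a pure filter; all the knowledge about files lives in cat.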

A List of the Most Useful Filters

At this point, we have discussed the basic ideas related to filters. In the rest of the chapter — and in Chapters 17, 18 and 19 — I will discuss a variety of different filters, one after another. As a preview, Figure 16-1 shows a list of what I consider to be the most useful Unix filters.

Regardless of which type of Unix or Linux you are using, you will find that most of what you read in these four chapters will work on your system. This is because the basic details of the filters we will be covering are the same from one system to another. Indeed, most of these filters have worked the same way for over thirty years!

Before we continue, though, let me remind you that, at any time, you can check the definitive reference for how a program works on your system by using the man command to display the man page for that program. If you are using the GNU utilities, which is the case with Linux and FreeBSD (see Chapter 2), you can also use the info command to access the Info system. For example:

man cat
info cat

With most of the GNU utilities, you can display the syntax of a command and a summary of its options by using the --help option. For example:

cat --help

For a discussion of how to use man and info, see Chapter 9. For a discussion of syntax, see Chapter 10.

Figure 16-1: The Most Useful Unix Filters

This table shows the most important Unix filters, most of which are over thirty years old. You can solve many different types of problems using the filters from this list. Most often, you will need only a single filter; you will rarely need more than four.

awk and perl are complex programming languages you can use to write programs to act as filters within a pipeline. For more information, start with the online manual (man awk, man perl), and then look on the Web, where you will find a great deal of information.

Filter    Chapter  See also            Purpose
awk       -        perl                Programming language: manipulate text
cat       16       rev, split, tac     Combine files; copy standard input to standard output
colrm     16       cut, join, paste    Delete specified columns of data
comm      17       cmp, diff, sdiff    Compare two sorted files, show differences
cmp       17       comm, diff, sdiff   Compare two files
cut       17       colrm, join, paste  Extract specified columns/fields of data
diff      17       cmp, comm, sdiff    Compare two files, show differences
expand    18       unexpand            Change tabs to spaces
fold      18       fmt, pr             Format long lines into shorter lines
fmt       18       fold, pr            Format paragraphs to make them look nice
grep      19       look, strings       Select lines containing a specified pattern
head      16       tail                Select lines from beginning of data
join      19       colrm, cut, paste   Combine columns of data, based on common fields
look      19       grep                Select lines that begin with a specified pattern
nl        18       wc                  Create line numbers
paste     17       colrm, cut, join    Combine columns of data
perl      -        awk                 Prog. language: manipulate text, files, processes
pr        18       fold, fmt           Format text into pages or columns
rev       16       cat, tac            Reverse order of characters in each line of data
sdiff     17       cmp, comm, diff     Compare two files, show differences
sed       19       tr                  Non-interactive text editing
sort      19       tsort, uniq         Sort data; check if data is sorted
split     16       cat                 Split a large file into smaller files
strings   19       grep                Search for character strings in binary files
tac       16       cat, rev            Combine files while reversing order of lines of text
tail      16       head                Select lines from end of data
tr        19       sed                 Change or delete selected characters
tsort     19       sort                Create a total ordering from partial orderings
unexpand  18       expand              Change spaces to tabs
uniq      19       sort                Select duplicate/unique lines
wc        18       nl                  Count lines, words and characters

Combining Files: cat

Related filters: rev, split, tac

The cat program copies data, unchanged, to the standard output. The data can come from the standard input or from one or more files. The syntax is:

cat [-bns] [file...]

where file is the name of a file.

We have already covered several ways in which you can use the cat program. However, by far, the most important use for cat is to combine multiple files. Here are some typical examples that combine three files. Of course, you can use as many files as you want.

cat name address phone
cat name address phone > info
cat name address phone | sort

These patterns are worth memorizing, as you will use them a lot. For reference, they are summarized in Figure 16-2.

In the first example, cat reads and combines the contents of three files (in this case, name, address and phone), and displays the output on your screen. Normally, you would only use such a command if the files were so short that the combined output would not scroll off the screen. More likely, you would pipe the output to less (Chapter 21) to display the output one screenful at a time. For example:

cat name address phone | less

The second example combines the same three files, but redirects standard output to another file (in this case, info). If the file does not exist, the shell will create it. If the file does exist, it will be replaced, which means that the data originally in the file will be lost forever. (See Chapter 15 for a discussion of redirection and file replacement.)

The third example combines the same files and pipes the output to another program for further processing, in this case, sort (Chapter 19).

When you use cat to combine files, there is a common mistake you must be sure to avoid: do not redirect output to one of the input files. For example, say you want to append the contents of address and phone to the file name. You might think that all you have to do is combine all three files and save the result in name:

cat name address phone > name

This will not work, because of the way the shell handles redirection. Before the shell starts a program whose standard output is redirected to a file, it makes sure that the file exists and is empty. In this case, if name does not exist, the shell will create it. If name does exist, the shell will empty it. In our example, by the time cat is ready to read from name, the file is already empty.

When you enter a command like the one above, you will see a message similar to:

cat: name: input file is output file

It looks like a warning message but, actually, it is already too late. Even pressing ^C (to abort the command) won't do any good. By the time you see this message, the contents of name have been deleted.

The safe way to append the contents of address and phone to the file name is to use:

cat address phone >> name

Notice we do not use our output file as an input file. Rather, we append the contents of all the other files to the output file.
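
To see the difference for yourself, here is a minimal sketch that tries both commands in a scratch directory. The sample data and the file name.backup are made up for illustration, and whether cat prints its message may depend on your version of cat.

```shell
# Work in a scratch directory so no real files are at risk.
cd "$(mktemp -d)" || exit 1
echo "Harley Hahn"     > name
echo "123 Main Street" > address
echo "555-1234"        > phone
cp name name.backup            # hypothetical backup so we can start over

# The mistake: the shell empties name before cat reads it.
cat name address phone > name || echo "cat reported a problem"
cat name                       # the original line of name is gone

# The safe way: restore name, then append the other two files.
cp name.backup name
cat address phone >> name
cat name                       # all three lines, in order
```

Making a backup copy before experimenting is a design choice of the sketch, not a requirement of cat.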

To conclude our discussion of cat, here are the most useful options:

• The -n (number) option will place a line number in front of each line.

• The -b (blank) option is used with -n and tells cat not to number blank lines.

• The -s (squeeze) option replaces more than one consecutive blank line with a single blank line.
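
The three options are easiest to understand by trying them on a small file. The following is only a sketch: the file notes and its contents are hypothetical, and the exact spacing of the line numbers may vary from one system to another.

```shell
cd "$(mktemp -d)" || exit 1

# A sample file: two lines of text separated by three blank lines.
printf 'first\n\n\n\nsecond\n' > notes

cat -n notes        # numbers all 5 lines, including the blank ones
cat -n -b notes     # numbers only the 2 lines that are not blank
cat -s notes        # squeezes the three blank lines into one
```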

What's in a Name?

cat


The main use of the cat program is to combine the contents of multiple files into a single output stream. For this reason, it would be natural to assume that cat stands for "concatenate". Actually, this is not the case.

The name cat comes from the archaic word "catenate", which means "to join in a chain". As all classically educated Unix users know, catena is the Latin word for chain.

Figure 16-2: The Many Uses of the cat Program

The cat program is the simplest possible filter. It reads from standard input and writes to standard output without modifying the data. In spite of its simplicity, cat can perform a surprising number of tasks, which are summarized in the table. The power of such a simple filter comes from the richness of the Unix I/O redirection and pipeline capabilities.

Most of the time, cat is used to combine files, either to be displayed (by piping the output to less), to be saved in another file (by redirecting standard output to that file), or to be piped to another program for further processing. See text for details.

Syntax                           Purpose
cat > file                       Read from keyboard, create new file or replace existing file
cat >> file                      Read from keyboard, append to existing file
cat < file                       Display an existing file
cat file                         Display an existing file
cat < file1 > file2              Copy a file
cat file1 file2 file3 | less     Combine multiple files, display one screenful at a time
cat file1 file2 file3 > file4    Combine multiple files, save output in a different file
cat file1 file2 file3 | program  Combine multiple files, pipe output to another program

Splitting Files: split

Related filters: cat

We have just discussed how to use cat to combine two or more files into one large file. What if you want to do the opposite: split a large file into smaller files? To do so, you use the split program. The syntax is:

split [-d] [-a num] [-l lines] [file [prefix]]

where num is the number of characters or digits to use as a suffix when creating file names; lines is the maximum number of lines for each new file; file is the name of an input file; and prefix is a name to use when creating file names.

The split program was developed in the early 1970s, when large text files could create problems. In those days disk storage was limited and processors were slow.(*) Today, large hard disks are ubiquitous, and computers are extremely fast. As a result, we rarely have problems storing and manipulating large text files. Still, there will be times when you will want to break a large file into pieces and, when you do, split can save you a lot of time. For example, you may want to email a very large file to someone whose email account has a limit on the size of received messages.

* Footnote

In 1976, when I was a first-year graduate student working with Unix, I wrote a C program to mathematically manipulate the data in a file that, by today's standards, was relatively small. At the time, however, the file was considered large, and the computer did not have nearly enough memory to hold all the data. As a result, my program had to be very complex, as it had to be able to process data in small pieces, which were swapped in and out of memory as necessary.

By default, split creates files that are 1,000 lines long. For example, say that you have a file named data with 57,984 lines, which you want to break into smaller files. You would use the command:

split data

This creates 58 new files: 57 files containing 1,000 lines each, and 1 last file containing the remaining 984 lines.

If you want to change the maximum size of the files, use the -l (lines) option. For example, to split the file data (with 57,984 lines) into files containing 5,000 lines, you would use:

split -l 5000 data

This command creates 12 new files: 11 files containing 5,000 lines (55,000 lines in all), and 1 last file containing 2,984 lines (the remainder).
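
If you would like to check this arithmetic, here is a small sketch. It assumes the seq program is available to generate a stand-in file with 57,984 lines, as the real contents of data do not matter for counting.

```shell
cd "$(mktemp -d)" || exit 1
seq 1 57984 > data      # stand-in data: one number per line

split -l 5000 data

ls x* | wc -l           # number of new files: 12
wc -l < xaa             # lines in the first file: 5000
wc -l < xal             # lines in the last file: 2984
```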

— hint —

When you use options that require large numbers, you do not type a comma (or, in Europe, a period) to break the number into groups of three. For example, you would use:

split -l 5000 data

You would not use:

split -l 5,000 data

If you do use a comma, it will cause a syntax error and you will get a message such as:

split: 5,000: invalid number of lines

By now you are probably wondering, what are the names of all these new files? If split is going to create files automatically, it should use names that make sense. However, it must also be careful not to replace any of your existing files accidentally.

By default, split uses names that start with the letter x, followed by a 2-character suffix. The suffixes are aa, ab, ac, ad, and so on. For instance, in the last example, where split created 12 new files, the names (by default) would be:

xaa  xab  xac  xad  xae  xaf
xag  xah  xai  xaj  xak  xal

If split requires more than 26 files, the names after xaz are xba, xbb, xbc, and so on. Since there are 26 letters in the alphabet, this allows for up to 676 (26x26) new file names, xaa through xzz.

If you don't like these names, there are two different ways to change them. First, if you use the -d (digits) option, split will use 2-digit numbers starting with 00 at the end of the file name, rather than a 2-letter suffix. For example, consider the following command, which uses the same file data (containing 57,984 lines) we used above:

split -d -l 5000 data

The 12 new files are named:

x00  x01  x02  x03  x04  x05
x06  x07  x08  x09  x10  x11

If you don't want your file names to start with x, you can specify your own name to be used as a prefix, for example:

split -d -l 5000 data harley

The new files are named:

harley00  harley01  harley02  harley03
harley04  harley05  harley06  harley07
harley08  harley09  harley10  harley11

When you use split with the -d option, you can create up to 100 files (10x10), using the suffixes 00 to 99. Without -d, you can create up to 676 files (26x26), using the suffixes aa to zz. If you need more files, you can use the -a option followed by the number of digits or characters you want in the suffix. For example:

split -d -a 3 data

The new file names will use 3-digit suffixes:

x000  x001  x002  x003...

Similarly, you can use -a without the -d option:

split -a 3 data

In this case, the new file names use 3-letter suffixes:

xaaa  xaab  xaac  xaad...

In this way, you can use split to break up very large input files without running out of file names.
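
Here is a short sketch that puts the options together and then uses cat to confirm that the pieces can be joined back into the original. The prefix part and the seq-generated data are assumptions made for the example, and -d is assumed to be supported by your version of split.

```shell
cd "$(mktemp -d)" || exit 1
seq 1 57984 > data                  # stand-in data: one number per line

# 3-digit numeric suffixes, 5,000 lines per file, names starting with "part"
split -d -a 3 -l 5000 data part
ls part*                            # part000 through part011

# Because the suffixes sort in order, cat can rebuild the original file.
cat part* | cmp - data && echo "the pieces rejoin to match data"
```

The fixed-width suffix is what makes the last step work: the shell expands part* in alphabetical order, which is also the order in which the pieces were written.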

By default, split creates 1,000-line files. However, as I mentioned, you can create any size files you want, even small ones. Here is a typical, everyday example I am sure you can relate to.

You are working for a powerful U.S. senator who is running for President of the United States. It is two weeks before the election and the campaign is suffering. The senator is desperate and, because you know Unix, you are promoted to be the new Chief of Staff.

Your first day on the job, you are given a very large text file, named supporters, in which each line contains the name and phone number of a potential voter. Your job is to organize volunteers around the country to call all the people on the list and urge them to vote for your candidate. You decide there is only enough time for each volunteer to call 40 people. You log into your Unix system and enter the command:

split -d -l 40 supporters voter

You now have a set of files, named voter00, voter01, voter02, and so on. Each of these files (except, possibly, the last one) contains exactly 40 names. You order your staff to email each volunteer one file, along with instructions to call all 40 names on his or her list.

Because of your hard work and superior Unix skills, your candidate is elected. Within a week, you are appointed to a position of great power and influence.

Combining Files While Reversing Lines: tac

Related filters: cat, rev

As we have discussed, cat is the most basic of all the filters, as well as one of the most useful filters. The tac program is similar to cat with one major difference: tac reverses the order of the lines of text before writing them to standard output. (The name tac is cat spelled backwards.)

The syntax for tac is:

tac [file...]

Like cat, tac reads from standard input and writes to standard output, and tac can combine input files. For instance, you have a file named log. You want to reverse the order of all the lines in log and write the results to a new file named reverse-log. The command to use is:

tac log > reverse-log

For example, let's say log contains:

Oct 01: event 1 took place
Oct 02: event 2 took place
Oct 03: event 3 took place
Oct 04: event 4 took place

After running the tac command above, reverse-log would contain:

Oct 04: event 4 took place
Oct 03: event 3 took place
Oct 02: event 2 took place
Oct 01: event 1 took place

At this point, you might be wondering if tac is nothing more than a curiosity. Perhaps it was written simply because someone thought the name was cute (being cat backwards). However, would you ever actually need this program?

The answer is you don't need tac all that often. However, when you do need it, it is invaluable. For example, say you have a program that writes notes to a log file (a common occurrence). The oldest notes will be at the beginning of the file; the newest notes will be at the end of the file. The file, which is named log, is now 5,000 lines long, and you want to display the notes from newest to oldest.

Without tac, there is no simple way to display the lines of a long file in reverse order. With tac, it's easy. Just use tac to reverse the lines and pipe the output to less (Chapter 21):

tac log | less

If you need to combine files, you can do that as well, for example:

tac log1 log2 log3 | less

This command reverses the lines in three files, combines them, and then pipes the output to less.
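
When you give tac more than one file, it helps to know exactly what is reversed. The sketch below uses two tiny, made-up log files and assumes the GNU version of tac, which reverses each file in turn. If you want to reverse the combined data as a whole, you can combine first with cat and then pipe to tac.

```shell
cd "$(mktemp -d)" || exit 1
printf 'a1\na2\n' > log1
printf 'b1\nb2\n' > log2

tac log1 log2          # each file reversed in turn: a2 a1 b2 b1
cat log1 log2 | tac    # the combined stream reversed: b2 b1 a2 a1
```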

Reversing the Order of Characters: rev

Related filters: cat, tac

The tac program reverses the lines within a file, but what if you want to reverse the characters within each line? In such cases, you use rev. The syntax is:

rev [file...]

where file is the name of a file.

Here is an example. You have a file named data that contains:

12345
abcde
AxAxA

You enter:

rev data

The output is:

54321
edcba
AxAxA

Suppose you want to reverse the order of the characters in each line and reverse the lines in the file. Just pipe the output of rev to tac, for example:

rev data | tac

The output is:

AxAxA
edcba
54321

What do you think would happen if you used tac first?

tac data | rev

To complete this section, let's consider one more example. You have a file named pattern that contains the following 4 lines:

   X
  XX
 XXX
XXXX

Consider the output from the following four commands (the $ is the shell prompt):

$ cat pattern
   X
  XX
 XXX
XXXX

$ tac pattern
XXXX
 XXX
  XX
   X

$ rev pattern
X   
XX  
XXX 
XXXX

$ rev pattern | tac
XXXX
XXX 
XX  
X   

Does it all make sense to you?
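
If you want to check your reasoning about the order of rev and tac, one possible approach is to save the output of each pipeline and compare the results with cmp. This is only a sketch: the output file names are hypothetical, and the data file is the 3-line example from earlier in this section.

```shell
cd "$(mktemp -d)" || exit 1
printf '12345\nabcde\nAxAxA\n' > data

rev data | tac > rev-then-tac
tac data | rev > tac-then-rev

# cmp is silent and succeeds when the two files are identical.
cmp rev-then-tac tac-then-rev && echo "same output either way"

# Running rev twice restores the original data.
rev data | rev | cmp - data && echo "rev undoes itself"
```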

Selecting Lines From the Beginning or End of Data: head, tail

When you have more data than you can understand easily, there are two programs that allow you to select part of the data quickly: head selects lines from the beginning of the data; tail selects lines from the end of the data.

Most of the time, you will use head and tail to display the beginning or end of a file. For this reason, I will defer the principal discussion of these commands until Chapter 21, where we will talk about the file display commands. In this section, I will show you how to use head and tail as filters within a pipeline.

When you use head and tail as filters, the syntax is simple:

head [-n lines]
tail [-n lines]

where lines is the number of lines you want to select. (In Chapter 21, we will use a more complex syntax.)

By default, both head and tail select 10 lines of data. For example, let's say you have a program called calculate that generates many lines of data. To display the first 10 lines, you would use:

calculate | head

To display the last 10 lines, you would use:

calculate | tail

If you want to select a different number of lines, use the -n option followed by that number. For example, to select 15 lines, you would use:

calculate | head -n 15
calculate | tail -n 15

You will often use head and tail at the end of a complex pipeline to select part of the data generated by the previous commands. For example, you have four files: data1, data2, data3 and data4. You want to combine the contents of the files, sort everything, and then display the first and last 20 lines.

To combine the files, you use the cat program (discussed earlier in the chapter). To perform the sort, you use the sort program (Chapter 19):

cat data1 data2 data3 data4 | sort | head -n 20
cat data1 data2 data3 data4 | sort | tail -n 20

Sometimes you will want to send the output of head or tail to another filter. For example, the following pipeline selects 300 lines from the beginning of the sort output, which we then send to less (Chapter 21) to be displayed one screenful at a time:

cat data1 data2 data3 data4 | sort | head -n 300 | less

Similarly, it is often handy to save the output of head or tail to a file. The following example selects the last 10 lines of output and saves them to a file named most-recent:

cat data1 data2 data3 data4 | sort | tail > most-recent
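
To experiment with head and tail in a pipeline without preparing any files, you can use a program that generates predictable output. In this sketch, I am assuming seq is available as a stand-in for a program like calculate. The last command shows one way the two filters might be combined to select lines from the middle of the data.

```shell
seq 1 100 | head -n 3                 # the first 3 lines: 1, 2, 3
seq 1 100 | tail -n 3                 # the last 3 lines: 98, 99, 100

# The first 20 lines, then the last 5 of those: lines 16 through 20.
seq 1 100 | head -n 20 | tail -n 5
```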

— hint —

Originally, head and tail did not require you to use the -n option; you could simply type a hyphen followed by a number. For example, the following commands all display 15 lines of output:

calculate | head -n 15
calculate | tail -n 15
calculate | head -15
calculate | tail -15

Officially, modern versions of head and tail are supposed to require the -n option, which is why I have included it. However, most versions of Unix and Linux will accept both types of syntax so — as long as your mother isn't watching — you can usually leave out the -n.

Deleting Columns of Data: colrm

Related filters: cut, paste

The colrm ("column remove") program reads from the standard input, deletes specified columns of data, and then writes the remaining data to the standard output. The syntax is:

colrm [startcol [endcol]]

where startcol and endcol specify the starting and ending range of the columns to be removed. Numbering starts with column 1.

Here is an example: You are a tenured professor at a college in California, and you need a list of grades for all the students in your PE 201 class ("Intermediate Surfing"). This list should not show the students' names.

You have a master data file, named students, which contains one line of information about each student. Each line has a student number, a name, the final exam grade, and the course grade:

012-34-5678  Ambercrombie, Al  95%  A
123-45-6789  Barton, Barbara   65%  C
234-56-7890  Canby, Charles    77%  B
345-67-8901  Danfield, Deann   82%  B

To construct the list of grades, you need to remove the names, which are in columns 14 through 30, inclusive. Use the command:

colrm 14 30 < students

The output is:

012-34-5678   95%  A
123-45-6789   65%  C
234-56-7890   77%  B
345-67-8901   82%  B

As a quick review of piping and redirection, let me show you two more examples. First, if the list happens to be very long, you can pipe it to less, to display the data one screenful at a time:

colrm 14 30 < students | less

Second, if you want to save the output, you can redirect it to a file:

colrm 14 30 < students > grades

If you specify only a starting column, colrm will remove all the columns from that point to the end of the line. For example:

colrm 14 < students

displays:

012-34-5678
123-45-6789
234-56-7890
345-67-8901

If you specify neither a starting nor ending column, colrm will delete nothing.
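
Here is a sketch you can use to try colrm on a copy of the sample data. It builds a 2-line students file in a scratch directory and, because colrm is not installed on every system, checks for the program before using it. The column numbers assume the spacing shown in the sample lines above.

```shell
cd "$(mktemp -d)" || exit 1

# Two lines of sample data, spaced exactly as in the example above.
cat > students <<'EOF'
012-34-5678  Ambercrombie, Al  95%  A
123-45-6789  Barton, Barbara   65%  C
EOF

if command -v colrm >/dev/null 2>&1; then
    colrm 14 30 < students     # remove columns 14 through 30 (the names)
    colrm 14 < students        # remove column 14 through the end of each line
else
    echo "colrm is not installed on this system"
fi
```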



Exercises

Review Question #1:

What is a filter? Why are filters so important?

Review Question #2:

You need to solve a difficult problem using filters and a pipeline. What four steps should you follow?

What are the three most important skills you need?

Review Question #3:

Why is cat the simplest possible filter?

In spite of its simplicity, cat can be used for a variety of purposes. Name four.

Review Question #4:

What is the difference between tac and rev?

Applying Your Knowledge #1:

A scientist ran an experiment that generated data that accumulated in a sequence of files: data1, data2, data3, data4 and data5. He wants to know how many lines of data he has altogether.

The command wc -l reads from standard input and counts the number of lines. How would you use this command to count the total number of lines in the five files?

Applying Your Knowledge #2:

You have a text file named important. What commands would you use to display the contents of this file in the following four different ways?

(a) As is
(b) Reverse the order of the lines
(c) Reverse the order of the characters within each line
(d) Reverse both the lines and characters

For (b), (c), and (d), which command performs the opposite transformation? How would you test this?

Applying Your Knowledge #3:

In Chapter 6, we discussed the Linux program dmesg, which displays the messages generated when the system was booted. Typically, there are a great many such messages. What command would you use to display the last 25 boot messages?

For Further Thought #1:

Figure 16-1 lists the most important Unix filters. Not counting awk and perl, which are programming languages, there are 19 different filters in the list. For most problems, you will need only a single filter; you will rarely need more than four. Why do you think this is the case?

Can you think of any tools that, in your opinion, are missing from the list?

For Further Thought #2:

The split program was developed in the early 1970s, when large text files could create problems, because disk storage was relatively slow and very expensive. Today, disks are fast and cheap. Do we still need a program like split? Why?
