The section User-Defined Functions describes how to write your own awk functions. Writing functions is important because it lets you encapsulate algorithms and program tasks in a single place. This simplifies programming, making program development more manageable and programs more readable.
One valuable way to learn a new programming language is to read programs in that language. To that end, this major node and the section Practical awk Programs provide a good-sized body of code for you to read and, hopefully, to learn from.
This major node presents a library of useful awk functions. Many of the sample programs presented later in this Info file use these functions. The functions are presented here in a progression from simple to complex.
The section Extracting Programs from Texinfo Source Files presents a program that you can use to extract the source code for these example library functions and programs from the Texinfo source for this Info file. (This has already been done as part of the gawk distribution.)
If you have written one or more useful, general purpose awk functions and would like to contribute them to the author's collection of awk programs, see the section How to Contribute for more information.
The programs in this major node and in the section Practical awk Programs freely use features that are gawk-specific. It is straightforward to rewrite these programs for different implementations of awk, keeping the following points in mind.
Diagnostic error messages are sent to `/dev/stderr'. Use `| "cat 1>&2"' instead of `> "/dev/stderr"' if your system does not have a `/dev/stderr' or if you cannot use gawk.
A number of programs use nextfile (see the section Using gawk's nextfile Statement) to skip any remaining input in the input file. The section Implementing nextfile as a Function shows you how to write a function that does the same thing.
Finally, some of the programs choose to ignore upper- and lowercase distinctions in their input. They do so by assigning one to IGNORECASE. You can achieve almost the same effect by adding the following rule to the beginning of the program:
    # ignore case
    { $0 = tolower($0) }
Also, verify that all regexp and string constants used in comparisons use only lowercase letters.
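As a minimal sketch of the same idea, the following helper lowercases only the string being tested rather than the whole record, so the record itself is left unchanged (the rule above, by contrast, also lowercases anything the program later prints from $0). The function name matches_nocase and the sample string are hypothetical, chosen only for illustration:

```awk
# matches_nocase -- hypothetical helper, not part of the library
# Returns true if str matches re when case in str is ignored.
# The regexp re must itself be written in lowercase.
function matches_nocase(str, re)
{
    return tolower(str) ~ re
}

BEGIN {
    line = "Disk ERROR on device"       # hypothetical input record
    if (matches_nocase(line, "error"))
        print "matched:", line
}
```

Because it does not rely on IGNORECASE, this approach should work in any awk that provides tolower.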
Due to the way the awk language evolved, variables are either global (usable by the entire program) or local (usable just by a specific function). There is no intermediate state analogous to static variables in C.
Library functions often need to have global variables that they can use to preserve state information between calls to the function--for example, getopt's variable _opti (see the section Processing Command-Line Options). Such variables are called private, since the only functions that need to use them are the ones in the library.
When writing a library function, you should try to choose names for your private variables that will not conflict with any variables used by either another library function or a user's main program. For example, a name like `i' or `j' is not a good choice, because user programs often use variable names like these for their own purposes.
The example programs shown in this major node all start the names of their private variables with an underscore (`_'). Users generally don't use leading underscores in their variable names, so this convention immediately decreases the chances that the variable name will be accidentally shared with the user's program.
In addition, several of the library functions use a prefix that helps indicate what function or set of functions use the variables--for example, _pw_byname in the user database routines (see the section Reading the User Database). This convention is recommended, since it even further decreases the chance of inadvertent conflict among variable names. Note that this convention can be used equally well for variable names and for private function names. (Not all of the library routines were rewritten to use this convention, in order to show how the author's awk programming style has evolved and to provide some basis for this discussion.)
As a final note on variable naming, if a function makes global variables available for use by a main program, it is a good convention to start that variable's name with a capital letter--for example, getopt's Opterr and Optind variables (see the section Processing Command-Line Options). The leading capital letter indicates that it is global, while the fact that the variable name is not all capital letters indicates that the variable is not one of awk's built-in variables, such as FS.
It is also important that all variables in library functions that do not need to save state are, in fact, declared local. (gawk's --dump-variables command-line option is useful for verifying this.) If this is not done, the variable could accidentally be used in the user's program, leading to bugs that are very difficult to track down:
    function lib_func(x, y,    l1, l2)
    {
        ...
        use variable some_var   # some_var should be local
        ...                     # but is not by oversight
    }
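The effect of a forgotten local declaration can be seen in a small sketch. The function names leaky and tidy are hypothetical; the only difference between them is that tidy lists i as an extra parameter, which makes it local:

```awk
# leaky -- i is global by oversight
function leaky(n)
{
    for (i = 1; i <= n; i++)
        ;                   # loop body intentionally empty
    return n
}

# tidy -- i is declared local via the extra parameter
function tidy(n,    i)
{
    for (i = 1; i <= n; i++)
        ;
    return n
}

BEGIN {
    i = 10                  # the "user's" variable
    tidy(3)
    print i                 # still 10
    leaky(3)
    print i                 # now 4: the library clobbered it
}
```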
A different convention, common in the Tcl community, is to use a single associative array to hold the values needed by the library function(s), or "package." This significantly decreases the number of actual global names in use. For example, the functions described in the section Reading the User Database might have used array elements PW_data["inited"], PW_data["total"], PW_data["count"], and PW_data["awklib"], instead of _pw_inited, _pw_awklib, _pw_total, and _pw_count.
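A minimal sketch of the single-array convention follows. The function pw_note_entry and the "last" key are hypothetical and are not part of the user database routines; they only show how one global array can carry all of a package's state:

```awk
# pw_note_entry -- hypothetical example of package state in one array
function pw_note_entry(name)
{
    if (! PW_data["inited"]) {      # first call: set up package state
        PW_data["inited"] = 1
        PW_data["total"] = 0
    }
    PW_data["total"]++              # one more entry seen
    PW_data["last"] = name          # illustrative key, an assumption
    return PW_data["total"]
}

BEGIN {
    pw_note_entry("root")
    pw_note_entry("daemon")
    print PW_data["total"], PW_data["last"]
}
```

With this layout the only global name the package adds to the program's namespace is PW_data.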
The conventions presented in this minor node are exactly that: conventions. You are not required to write your programs this way--we merely recommend that you do so.
This minor node presents a number of functions that are of general programming use.
Implementing nextfile as a Function
The nextfile statement presented in the section Using gawk's nextfile Statement is a gawk-specific extension--it is not available in most other implementations of awk. This minor node shows two versions of a nextfile function that you can use to simulate gawk's nextfile statement if you cannot use gawk.
A first attempt at writing a nextfile function is as follows:
    # nextfile -- skip remaining records in current file
    # this should be read in before the "main" awk program

    function nextfile()    { _abandon_ = FILENAME; next }

    _abandon_ == FILENAME  { next }
Because it supplies a rule that must be executed first, this file should be included before the main program. This rule compares the current data file's name (which is always in the FILENAME variable) to a private variable named _abandon_. If the file name matches, then the action part of the rule executes a next statement to go on to the next record. (The use of `_' in the variable name is a convention. It is discussed more fully in the section Naming Library Function Global Variables.)
The use of the next statement effectively creates a loop that reads all the records from the current data file. The end of the file is eventually reached and a new data file is opened, changing the value of FILENAME. Once this happens, the comparison of _abandon_ to FILENAME fails and execution continues with the first rule of the "real" program.
The nextfile function itself simply sets the value of _abandon_ and then executes a next statement to start the loop.
This initial version has a subtle problem. If the same data file is listed twice on the command line, one right after the other or even with just a variable assignment between them, this code skips right through the file a second time, even though it should stop when it gets to the end of the first occurrence. A second version of nextfile that remedies this problem is shown here:
    # nextfile -- skip remaining records in current file
    # correctly handle successive occurrences of the same file
    # this should be read in before the "main" awk program

    function nextfile()   { _abandon_ = FILENAME; next }

    _abandon_ == FILENAME {
        if (FNR == 1)
            _abandon_ = ""
        else
            next
    }
The nextfile function has not changed. It makes _abandon_ equal to the current file name and then executes a next statement. The next statement reads the next record and increments FNR so that FNR is guaranteed to have a value of at least two. However, if nextfile is called for the last record in the file, then awk closes the current data file and moves on to the next one. Upon doing so, FILENAME is set to the name of the new file and FNR is reset to one. If this next file is the same as the previous one, _abandon_ is still equal to FILENAME. However, FNR is equal to one, telling us that this is a new occurrence of the file and not the one we were reading when the nextfile function was executed. In that case, _abandon_ is reset to the empty string, so that further executions of this rule fail (until the next time that nextfile is called).
If FNR is not one, then we are still in the original data file and the program executes a next statement to skip through it.
An important question to ask at this point is: given that the functionality of nextfile can be provided with a library file, why is it built into gawk? Adding features for little reason leads to larger, slower programs that are harder to maintain. The answer is that building nextfile into gawk provides significant gains in efficiency. If the nextfile function is executed at the beginning of a large data file, awk still has to scan the entire file, splitting it up into records, just to skip over it. The built-in nextfile can simply close the file immediately and proceed to the next one, which saves a lot of time. This is particularly important in awk, because awk programs are generally I/O-bound (i.e., they spend most of their time doing input and output, instead of performing computations).
When writing large programs, it is often useful to know that a condition or set of conditions is true. Before proceeding with a particular computation, you make a statement about what you believe to be the case. Such a statement is known as an assertion. The C language provides an <assert.h> header file and corresponding assert macro that the programmer can use to make assertions. If an assertion fails, the assert macro arranges to print a diagnostic message describing the condition that should have been true but was not, and then it kills the program. In C, using assert looks like this:
    #include <assert.h>

    int myfunc(int a, double b)
    {
        assert(a <= 5 && b >= 17.1);
        ...
    }
If the assertion fails, the program prints a message similar to this:
    prog.c:5: assertion failed: a <= 5 && b >= 17.1
The C language makes it possible to turn the condition into a string for use in printing the diagnostic message. This is not possible in awk, so this assert function also requires a string version of the condition that is being tested. Following is the function:
    # assert -- assert that a condition is true. Otherwise exit.

    function assert(condition, string)
    {
        if (! condition) {
            printf("%s:%d: assertion failed: %s\n",
                FILENAME, FNR, string) > "/dev/stderr"
            _assert_exit = 1
            exit 1
        }
    }

    END {
        if (_assert_exit)
            exit 1
    }
The assert function tests the condition parameter. If it is false, it prints a message to standard error, using the string parameter to describe the failed condition. It then sets the variable _assert_exit to one and executes the exit statement. The exit statement jumps to the END rule. If the END rule finds _assert_exit to be true, it then exits immediately.
The purpose of the test in the END rule is to keep any other END rules from running. When an assertion fails, the program should exit immediately. If no assertions fail, then _assert_exit is still false when the END rule is run normally, and the rest of the program's END rules execute. For all of this to work correctly, `assert.awk' must be the first source file read by awk.
The function can be used in a program in the following way:
    function myfunc(a, b)
    {
        assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
        ...
    }
If the assertion fails, you see a message similar to the following:
    mydata:1357: assertion failed: a <= 5 && b >= 17.1
There is a small problem with this version of assert. An END rule is automatically added to the program calling assert. Normally, if a program consists of just a BEGIN rule, the input files and/or standard input are not read. However, now that the program has an END rule, awk attempts to read the input data files or standard input (see the section Startup and Cleanup Actions), most likely causing the program to hang as it waits for input.
There is a simple workaround to this: make sure the BEGIN rule always ends with an exit statement.
The way printf and sprintf (see the section Using printf Statements for Fancier Printing) perform rounding often depends upon the system's C sprintf subroutine. On many machines, sprintf rounding is "unbiased," which means it doesn't always round a trailing `.5' up, contrary to naive expectations. In unbiased rounding, `.5' rounds to even, rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. This means that if you are using a format that does rounding (e.g., "%.0f"), you should check what your system does. The following function does traditional rounding; it might be useful if your awk's printf does unbiased rounding:
    # round -- do normal rounding

    function round(x,    ival, aval, fraction)
    {
        ival = int(x)    # integer part, int() truncates

        # see if fractional part
        if (ival == x)   # no fraction
            return x

        if (x < 0) {
            aval = -x     # absolute value
            ival = int(aval)
            fraction = aval - ival
            if (fraction >= .5)
                return int(x) - 1   # -2.5 --> -3
            else
                return int(x)       # -2.3 --> -2
        } else {
            fraction = x - ival
            if (fraction >= .5)
                return ival + 1
            else
                return ival
        }
    }

    # test harness
    { print $0, round($0) }
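To compare the two behaviors without typing input, a non-interactive variant of the test harness can be sketched as follows. It repeats the round function from above so that it is self-contained; the sample values are chosen only for illustration, and what "%.0f" prints is deliberately not annotated, since it depends on the system's C library:

```awk
function round(x,    ival, aval, fraction)
{
    ival = int(x)                   # integer part, int() truncates
    if (ival == x)                  # no fraction
        return x
    if (x < 0) {
        aval = -x                   # absolute value
        ival = int(aval)
        fraction = aval - ival
        return (fraction >= .5) ? int(x) - 1 : int(x)
    }
    fraction = x - ival
    return (fraction >= .5) ? ival + 1 : ival
}

BEGIN {
    # first column: the value; second: what this system's printf
    # does with "%.0f"; third: traditional rounding
    n = split("1.5 2.5 4.5 -2.5 -2.3", vals, " ")
    for (k = 1; k <= n; k++)
        printf("%5s  %.0f  %d\n", vals[k], vals[k], round(vals[k]))
}
```

On a system with unbiased rounding, the second and third columns differ for values such as 2.5 and 4.5, where round returns 3 and 5.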
The Cliff random number generator is a very simple random number generator that "passes the noise sphere test for randomness by showing no structure." It is easily programmed, in less than 10 lines of awk code:
    # cliff_rand.awk -- generate Cliff random numbers

    BEGIN { _cliff_seed = 0.1 }

    function cliff_rand()
    {
        _cliff_seed = (100 * log(_cliff_seed)) % 1
        if (_cliff_seed < 0)
            _cliff_seed = - _cliff_seed
        return _cliff_seed
    }
This algorithm requires an initial "seed" of 0.1. Each new value uses the current seed as input for the calculation. If the built-in rand function (see the section Numeric Functions) isn't random enough, you might try using this function instead.
One commercial implementation of awk supplies a built-in function, ord, which takes a character and returns the numeric value for that character in the machine's character set. If the string passed to ord has more than one character, only the first one is used.
The inverse of this function is chr (from the function of the same name in Pascal), which takes a number and returns the corresponding character. Both functions are written very nicely in awk; there is no real reason to build them into the awk interpreter:
    # ord.awk -- do ord and chr

    # Global identifiers:
    #    _ord_:        numerical values indexed by characters
    #    _ord_init:    function to initialize _ord_

    BEGIN    { _ord_init() }

    function _ord_init(    low, high, i, t)
    {
        low = sprintf("%c", 7) # BEL is ascii 7
        if (low == "\a") {    # regular ascii
            low = 0
            high = 127
        } else if (sprintf("%c", 128 + 7) == "\a") {
            # ascii, mark parity
            low = 128
            high = 255
        } else {        # ebcdic(!)
            low = 0
            high = 255
        }

        for (i = low; i <= high; i++) {
            t = sprintf("%c", i)
            _ord_[t] = i
        }
    }
Some explanation of the numbers used by chr is worthwhile. The most prominent character set in use today is ASCII. Although an eight-bit byte can hold 256 distinct values (from 0 to 255), ASCII only defines characters that use the values from 0 to 127. In the now distant past, at least one minicomputer manufacturer used ASCII, but with mark parity, meaning that the leftmost bit in the byte is always 1. This means that on those systems, characters have numeric values from 128 to 255. Finally, large mainframe systems use the EBCDIC character set, which uses all 256 values. While there are other character sets in use on some older systems, they are not really worth worrying about:
    function ord(str,    c)
    {
        # only first character is of interest
        c = substr(str, 1, 1)
        return _ord_[c]
    }

    function chr(c)
    {
        # force c to be numeric by adding 0
        return sprintf("%c", c + 0)
    }

    #### test code ####
    # BEGIN    \
    # {
    #    for (;;) {
    #        printf("enter a character: ")
    #        if (getline var <= 0)
    #            break
    #        printf("ord(%s) = %d\n", var, ord(var))
    #    }
    # }
An obvious improvement to these functions is to move the code for the _ord_init function into the body of the BEGIN rule. It was written this way initially for ease of development. There is a "test program" in a BEGIN rule, to test the function. It is commented out for production use.
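A non-interactive way to try the two functions can be sketched as follows. This is a simplified stand-in, not the library file: it assumes plain ASCII, fills the table in the BEGIN rule as suggested above, and covers only values 1 through 127:

```awk
# Simplified ord/chr sketch; assumes an ASCII system.
BEGIN {
    for (_ord_i = 1; _ord_i <= 127; _ord_i++)
        _ord_[sprintf("%c", _ord_i)] = _ord_i
}

function ord(str)
{
    return _ord_[substr(str, 1, 1)]     # only first character counts
}

function chr(c)
{
    return sprintf("%c", c + 0)         # adding 0 forces c to be numeric
}

BEGIN {
    print ord("A")              # 65
    print chr(97)               # a
    print chr(ord("a") + 1)     # b
    print ord("hello")          # 104, the value for "h"
}
```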
When doing string processing, it is often useful to be able to join all the strings in an array into one long string. The following function, join, accomplishes this task. It is used later in several of the application programs (see the section Practical awk Programs).
Good function design is important; this function needs to be general but it should also have a reasonable default behavior. It is called with an array as well as the beginning and ending indices of the elements in the array to be merged. This assumes that the array indices are numeric--a reasonable assumption since the array was likely created with split (see the section String Manipulation Functions):
    # join.awk -- join an array into a string

    function join(array, start, end, sep,    result, i)
    {
        if (sep == "")
            sep = " "
        else if (sep == SUBSEP) # magic value
            sep = ""
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result sep array[i]
        return result
    }
An optional additional argument is the separator to use when joining the strings back together. If the caller supplies a non-empty value, join uses it; if it is not supplied, it has a null value. In this case, join uses a single blank as a default separator for the strings. If the value is equal to SUBSEP, then join joins the strings with no separator between them. SUBSEP serves as a "magic" value to indicate that there should be no separation between the component strings. (It would be nice if awk had an assignment operator for concatenation. The lack of an explicit operator for concatenation makes string operations more difficult than they really need to be.)
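The three separator cases can be exercised with a short sketch. It repeats the join function so that it is self-contained; the sample string and the `/' separator are chosen only for illustration:

```awk
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP)     # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}

BEGIN {
    n = split("usr local bin", parts, " ")
    print join(parts, 1, n, "/")        # usr/local/bin
    print join(parts, 1, n)             # usr local bin
    print join(parts, 2, n, SUBSEP)     # localbin
}
```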
The systime and strftime functions described in the section Using gawk's Timestamp Functions provide the minimum functionality necessary for dealing with the time of day in human readable form. While strftime is extensive, the control formats are not necessarily easy to remember or intuitively obvious when reading a program.
The following function, gettimeofday, populates a user-supplied array with preformatted time information. It returns a string with the current time formatted in the same way as the date utility:
    # gettimeofday.awk -- get the time of day in a usable format

    # Returns a string in the format of output of date(1)
    # Populates the array argument time with individual values:
    #    time["second"]       -- seconds (0 - 59)
    #    time["minute"]       -- minutes (0 - 59)
    #    time["hour"]         -- hours (0 - 23)
    #    time["althour"]      -- hours (1 - 12)
    #    time["monthday"]     -- day of month (1 - 31)
    #    time["month"]        -- month of year (1 - 12)
    #    time["monthname"]    -- name of the month
    #    time["shortmonth"]   -- short name of the month
    #    time["year"]         -- year modulo 100 (0 - 99)
    #    time["fullyear"]     -- full year
    #    time["weekday"]      -- day of week (Sunday = 0)
    #    time["altweekday"]   -- day of week (Monday = 0)
    #    time["dayname"]      -- name of weekday
    #    time["shortdayname"] -- short name of weekday
    #    time["yearday"]      -- day of year (1 - 366)
    #    time["timezone"]     -- abbreviation of timezone name
    #    time["ampm"]         -- AM or PM designation
    #    time["weeknum"]      -- week number, Sunday first day
    #    time["altweeknum"]   -- week number, Monday first day

    function gettimeofday(time,    ret, now, i)
    {
        # get time once, avoids unnecessary system calls
        now = systime()

        # return date(1)-style output
        ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)

        # clear out target array
        delete time

        # fill in values, force numeric values to be
        # numeric by adding 0
        time["second"]       = strftime("%S", now) + 0
        time["minute"]       = strftime("%M", now) + 0
        time["hour"]         = strftime("%H", now) + 0
        time["althour"]      = strftime("%I", now) + 0
        time["monthday"]     = strftime("%d", now) + 0
        time["month"]        = strftime("%m", now) + 0
        time["monthname"]    = strftime("%B", now)
        time["shortmonth"]   = strftime("%b", now)
        time["year"]         = strftime("%y", now) + 0
        time["fullyear"]     = strftime("%Y", now) + 0
        time["weekday"]      = strftime("%w", now) + 0
        # "%u" counts Monday as 1, so derive Monday = 0 from "%w"
        time["altweekday"]   = (time["weekday"] + 6) % 7
        time["dayname"]      = strftime("%A", now)
        time["shortdayname"] = strftime("%a", now)
        time["yearday"]      = strftime("%j", now) + 0
        time["timezone"]     = strftime("%Z", now)
        time["ampm"]         = strftime("%p", now)
        time["weeknum"]      = strftime("%U", now) + 0
        time["altweeknum"]   = strftime("%W", now) + 0

        return ret
    }
The string indices are easier to use and read than the various formats required by strftime. The alarm program presented in the section An Alarm Clock Program uses this function. A more general design for the gettimeofday function would have allowed the user to supply an optional timestamp value to use instead of the current time.
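One possible shape for that more general design is sketched below. The name timeinfo and its stamp parameter are hypothetical; only a few of the array elements are filled in to keep the sketch short, and, like gettimeofday, it relies on gawk's systime and strftime:

```awk
# timeinfo -- hypothetical variant of gettimeofday that accepts
# an optional timestamp; if stamp is omitted, the current time is used
function timeinfo(time, stamp,    now)
{
    now = (stamp == "") ? systime() : stamp + 0

    delete time
    time["fullyear"] = strftime("%Y", now) + 0
    time["month"]    = strftime("%m", now) + 0
    time["monthday"] = strftime("%d", now) + 0

    # same date(1)-style return value as gettimeofday
    return strftime("%a %b %d %H:%M:%S %Z %Y", now)
}

BEGIN {
    # 200 days after the epoch: mid-July 1970 in any time zone
    timeinfo(t, 200 * 86400)
    print t["fullyear"], t["month"]

    # no timestamp: falls back to the current time
    print timeinfo(t)
}
```

Testing stamp against the null string, rather than against zero, lets a caller pass a timestamp of 0 and still have it honored.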
This minor node presents functions that are useful for managing command-line datafiles.
The BEGIN and END rules are each executed exactly once, at the beginning and end of your awk program, respectively (see the section The BEGIN and END Special Patterns). We (the gawk authors) once had a user who mistakenly thought that the BEGIN rule is executed at the beginning of each data file and the END rule is executed at the end of each data file. When informed that this was not the case, the user requested that we add new special patterns to gawk, named BEGIN_FILE and END_FILE, that would have the desired behavior. He even supplied us the code to do so.
Adding these special patterns to gawk wasn't necessary; the job can be done cleanly in awk itself, as illustrated by the following library program. It arranges to call two user-supplied functions, beginfile and endfile, at the beginning and end of each data file. Besides solving the problem in only nine(!) lines of code, it does so portably; this works with any implementation of awk:
    # transfile.awk
    #
    # Give the user a hook for filename transitions
    #
    # The user must supply functions beginfile() and endfile()
    # that each take the name of the file being started or
    # finished, respectively.

    FILENAME != _oldfilename \
    {
        if (_oldfilename != "")
            endfile(_oldfilename)
        _oldfilename = FILENAME
        beginfile(FILENAME)
    }

    END   { endfile(FILENAME) }
This file must be loaded before the user's "main" program, so that the rule it supplies is executed first.
This rule relies on awk's FILENAME variable that automatically changes for each new data file. The current file name is saved in a private variable, _oldfilename. If FILENAME does not equal _oldfilename, then a new data file is being processed and it is necessary to call endfile for the old file. Because endfile should only be called if a file has been processed, the program first checks to make sure that _oldfilename is not the null string. The program then assigns the current file name to _oldfilename and calls beginfile for the file. Because, like all awk variables, _oldfilename is initialized to the null string, this rule executes correctly even for the first data file.
The program also supplies an END rule to do the final processing for the last file. Because this END rule comes before any END rules supplied in the "main" program, endfile is called first. Once again the value of multiple BEGIN and END rules should be clear.
This version has the same problem as the first version of nextfile (see the section Implementing nextfile as a Function). If the same data file occurs twice in a row on the command line, then endfile and beginfile are not executed at the end of the first pass and at the beginning of the second pass. The following version solves the problem:
    # ftrans.awk -- handle data file transitions
    #
    # user supplies beginfile() and endfile() functions

    FNR == 1 {
        if (_filename_ != "")
            endfile(_filename_)
        _filename_ = FILENAME
        beginfile(FILENAME)
    }

    END  { endfile(_filename_) }
The section Counting Things shows how this library function can be used and how it simplifies writing the main program.
Another request for a new built-in function was for a rewind function that would make it possible to reread the current file. The requesting user didn't want to have to use getline (see the section Explicit Input with getline) inside a loop.
However, as long as you are not in the END rule, it is quite easy to arrange to immediately close the current input file and then start over with it from the top. For lack of a better name, we'll call it rewind:
    # rewind.awk -- rewind the current file and start over

    function rewind(    i)
    {
        # shift remaining arguments up
        for (i = ARGC; i > ARGIND; i--)
            ARGV[i] = ARGV[i-1]

        # make sure gawk knows to keep going
        ARGC++

        # make current file next to get done
        ARGV[ARGIND+1] = FILENAME

        # do it
        nextfile
    }
This code relies on the ARGIND variable (see the section Built-in Variables That Convey Information), which is specific to gawk. If you are not using gawk, you can use ideas presented in the previous minor node (see the section Noting Data File Boundaries) to either update ARGIND on your own or modify this code as appropriate.
The rewind function also relies on the nextfile keyword (see the section Using gawk's nextfile Statement). See the section Implementing nextfile as a Function for a function version of nextfile.
Normally, if you give awk a data file that isn't readable, it stops with a fatal error. There are times when you might want to just ignore such files and keep going. You can do this by prepending the following program to your awk program:
    # readable.awk -- library file to skip over unreadable files

    BEGIN {
        for (i = 1; i < ARGC; i++) {
            if (ARGV[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=.*/ \
                || ARGV[i] == "-")
                continue    # assignment or standard input
            else if ((getline junk < ARGV[i]) < 0) # unreadable
                delete ARGV[i]
            else
                close(ARGV[i])
        }
    }
In gawk, the getline won't be fatal (unless --posix is in force). Removing the element from ARGV with delete skips the file (since it's no longer in the list).
Occasionally, you might not want awk to process command-line variable assignments (see the section Assigning Variables on the Command Line). In particular, if you have file names that contain an `=' character, awk treats the file name as an assignment, and does not process it.
Some users have suggested an additional command-line option for gawk to disable command-line assignments. However, some simple programming with a library file does the trick:
    # noassign.awk -- library file to avoid the need for a
    # special option that disables command-line assignments

    function disable_assigns(argc, argv,    i)
    {
        for (i = 1; i < argc; i++)
            if (argv[i] ~ /^[A-Za-z_][A-Za-z_0-9]*=.*/)
                argv[i] = ("./" argv[i])
    }

    BEGIN {
        if (No_command_assign)
            disable_assigns(ARGC, ARGV)
    }
You then run your program this way:
awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
The function works by looping through the arguments. It prepends `./' to any argument that matches the form of a variable assignment, turning that argument into a file name.
The use of No_command_assign allows you to disable command-line assignments at invocation time, by giving the variable a true value. When not set, it is initially zero (i.e., false), so the command-line arguments are left alone.
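The rewriting step can be tried in isolation on an ordinary array. In the following sketch, the array args and its contents are hypothetical stand-ins for ARGV; the function is repeated so that the example is self-contained:

```awk
function disable_assigns(argc, argv,    i)
{
    for (i = 1; i < argc; i++)
        if (argv[i] ~ /^[A-Za-z_][A-Za-z_0-9]*=.*/)
            argv[i] = ("./" argv[i])
}

BEGIN {
    # hypothetical argument list: a file named "a=b.txt" and a plain file
    args[0] = "awk"
    args[1] = "a=b.txt"
    args[2] = "data"

    disable_assigns(3, args)

    print args[1]       # ./a=b.txt
    print args[2]       # data
}
```

Because `./a=b.txt' no longer starts with a valid variable name, awk treats it as a file name in the current directory.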
Most utilities on POSIX compatible systems take options, or "switches," on the command line that can be used to change the way a program behaves. awk is an example of such a program (see the section Command-Line Options). Often, options take arguments; i.e., data that the program needs to correctly obey the command-line option. For example, awk's -F option requires a string to use as the field separator. The first occurrence on the command line of either -- or a string that does not begin with `-' ends the options.
Modern Unix systems provide a C function named getopt for processing command-line arguments. The programmer provides a string describing the one-letter options. If an option requires an argument, it is followed in the string with a colon. getopt is also passed the count and values of the command-line arguments and is called in a loop. getopt processes the command-line arguments for option letters. Each time around the loop, it returns a single character representing the next option letter that it finds, or `?' if it finds an invalid option. When it returns -1, there are no options left on the command line.
When using getopt, options that do not take arguments can be grouped together. Furthermore, options that take arguments require that the argument is present. The argument can immediately follow the option letter or it can be a separate command-line argument.
Given a hypothetical program that takes three command-line options, -a, -b, and -c, where -b requires an argument, all of the following are valid ways of invoking the program:
    prog -a -b foo -c data1 data2 data3
    prog -ac -bfoo -- data1 data2 data3
    prog -acbfoo data1 data2 data3
Notice that when the argument is grouped with its option, the rest of the argument is considered to be the option's argument. In this example, -acbfoo indicates that all of the -a, -b, and -c options were supplied, and that `foo' is the argument to the -b option.
getopt provides four external variables that the programmer can use:
optind
    The index in the argument value array (argv) where the first non-option command-line argument can be found.
optarg
    The string value of the argument to an option.
opterr
    Usually getopt prints an error message when it finds an invalid option. Setting opterr to zero disables this feature. (An application might want to print its own error message.)
optopt
    The letter representing the current command-line option.
The following C fragment shows how getopt might process command-line arguments for awk:
    int
    main(int argc, char *argv[])
    {
        ...
        /* print our own message */
        opterr = 0;
        while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) {
            switch (c) {
            case 'f':    /* file */
                ...
                break;
            case 'F':    /* field separator */
                ...
                break;
            case 'v':    /* variable assignment */
                ...
                break;
            case 'W':    /* extension */
                ...
                break;
            case '?':
            default:
                usage();
                break;
            }
        }
        ...
    }
As a side point, gawk actually uses the GNU getopt_long function to process both normal and GNU-style long options (see the section Command-Line Options).
The abstraction provided by getopt is very useful and is quite handy in awk programs as well. Following is an awk version of getopt. This function highlights one of the greatest weaknesses in awk, which is that it is very poor at manipulating single characters. Repeated calls to substr are necessary for accessing individual characters (see the section String Manipulation Functions). (This function was written before gawk acquired the ability to split strings into single characters using "" as the separator. We have left it alone, since using substr is more portable.)
The discussion that follows walks through the code a bit at a time:
    # getopt.awk -- do C library getopt(3) function in awk

    # External variables:
    #    Optind -- index in ARGV of first non-option argument
    #    Optarg -- string value of argument to current option
    #    Opterr -- if nonzero, print our own diagnostic
    #    Optopt -- current option letter

    # Returns:
    #    -1     at end of options
    #    ?      for unrecognized option
    #    <c>    a character representing the current option

    # Private Data:
    #    _opti  -- index in multi-flag option, e.g., -abc
The function starts out with a list of the global variables it uses, what the return values are, what they mean, and any global variables that are "private" to this library function. Such documentation is essential for any program, and particularly for library functions.
The getopt
function first checks that it was indeed called with a string of options
(the options
parameter). If options
has a zero length,
getopt
immediately returns -1:
    function getopt(argc, argv, options,    thisopt, i)
    {
        if (length(options) == 0)    # no options given
            return -1
        if (argv[Optind] == "--") {  # all done
            Optind++
            _opti = 0
            return -1
        } else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) {
            _opti = 0
            return -1
        }
The next thing to check for is the end of the options. A @option{--} ends the command-line options, as does any command-line argument that does not begin with a `-'. Optind is used to step through the array of command-line arguments; it retains its value across calls to getopt, because it is a global variable.
The regular expression that is used, /^-[^: \t\n\f\r\v\b]/, is perhaps a bit of overkill; it checks for a `-' followed by anything that is not whitespace and not a colon. If the current command-line argument does not match this pattern, it is not an option, and it ends option processing.
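To see which arguments that pattern accepts, the following sketch applies the same regular expression to a few sample strings. The function name is_option_arg and the sample list are hypothetical; only the regular expression is taken from getopt.

```awk
# is_option_arg -- return 1 if arg looks like an option argument, else 0.
# Hypothetical helper; the regular expression is the one used by getopt.
function is_option_arg(arg)
{
    return (arg ~ /^-[^: \t\n\f\r\v\b]/)
}

BEGIN {
    n = split("-a -cbARG bax - -:", sample, " ")
    for (i = 1; i <= n; i++)
        printf("<%s> %s\n", sample[i],
               is_option_arg(sample[i]) ? "looks like an option" \
                                        : "ends option processing")
}
```

A lone `-' and the string `-:' both fail the test, since the pattern requires a second character that is neither whitespace nor a colon.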
        if (_opti == 0)
            _opti = 2
        thisopt = substr(argv[Optind], _opti, 1)
        Optopt = thisopt
        i = index(options, thisopt)
        if (i == 0) {
            if (Opterr)
                printf("%c -- invalid option\n", thisopt) > "/dev/stderr"
            if (_opti >= length(argv[Optind])) {
                Optind++
                _opti = 0
            } else
                _opti++
            return "?"
        }
The _opti variable tracks the position in the current command-line argument (argv[Optind]). If multiple options are grouped together with one `-' (e.g., @option{-abx}), it is necessary to return them to the user one at a time.
If _opti is equal to zero, it is set to two, which is the index in the string of the next character to look at (we skip the `-', which is at position one). The variable thisopt holds the character, obtained with substr. It is saved in Optopt for the main program to use.
If thisopt is not in the options string, then it is an invalid option. If Opterr is nonzero, getopt prints an error message on the standard error that is similar to the message from the C version of getopt.
Because the option is invalid, it is necessary to skip it and move on to the
next option character. If _opti
is greater than or equal to the
length of the current command-line argument, it is necessary to move on
to the next argument, so Optind
is incremented and _opti
is reset
to zero. Otherwise, Optind
is left alone and _opti
is merely
incremented.
In any case, because the option is invalid, getopt
returns `?'.
The main program can examine Optopt
if it needs to know what the
invalid option letter actually is. Continuing on:
        if (substr(options, i + 1, 1) == ":") {
            # get option argument
            if (length(substr(argv[Optind], _opti + 1)) > 0)
                Optarg = substr(argv[Optind], _opti + 1)
            else
                Optarg = argv[++Optind]
            _opti = 0
        } else
            Optarg = ""
If the option requires an argument, the option letter is followed by a colon in the options string. If there are remaining characters in the current command-line argument (argv[Optind]), then the rest of that string is assigned to Optarg. Otherwise, the next command-line argument is used (`-xFOO' vs. `-x FOO'). In either case, _opti is reset to zero, because there are no more characters left to examine in the current command-line argument. Continuing:
        if (_opti == 0 || _opti >= length(argv[Optind])) {
            Optind++
            _opti = 0
        } else
            _opti++
        return thisopt
    }
Finally, if _opti is either zero or greater than or equal to the length of the current command-line argument, it means this element in argv is through being processed, so Optind is incremented to point to the next element in argv. If neither condition is true, then only _opti is incremented, so that the next option letter can be processed on the next call to getopt.
The BEGIN rule initializes both Opterr and Optind to one. Opterr is set to one, since the default behavior is for getopt to print a diagnostic message upon seeing an invalid option. Optind is set to one, since there's no reason to look at the program name, which is in ARGV[0]:
    BEGIN {
        Opterr = 1    # default is to diagnose
        Optind = 1    # skip ARGV[0]

        # test program
        if (_getopt_test) {
            while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
                printf("c = <%c>, optarg = <%s>\n", _go_c, Optarg)
            printf("non-option arguments:\n")
            for (; Optind < ARGC; Optind++)
                printf("\tARGV[%d] = <%s>\n", Optind, ARGV[Optind])
        }
    }
The rest of the BEGIN
rule is a simple test program. Here is the
result of two sample runs of the test program:
    $ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
    -| c = <a>, optarg = <>
    -| c = <c>, optarg = <>
    -| c = <b>, optarg = <ARG>
    -| non-option arguments:
    -|         ARGV[3] = <bax>
    -|         ARGV[4] = <-x>

    $ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
    -| c = <a>, optarg = <>
    error--> x -- invalid option
    -| c = <?>, optarg = <>
    -| non-option arguments:
    -|         ARGV[4] = <xyz>
    -|         ARGV[5] = <abc>
In both runs,
the first @option{--} terminates the arguments to @command{awk}, so that it does
not try to interpret the @option{-a}, etc., as its own options.
Several of the sample programs presented in
@ref{Sample Programs, ,Practical @command{awk} Programs},
use getopt
to process their arguments.
The PROCINFO
array
(see section Built-in Variables)
provides access to the current user's real and effective user and group id
numbers, and if available, the user's supplementary group set.
However, because these are numbers, they do not provide very useful
information to the average user. There needs to be some way to find the
user information associated with the user and group numbers. This
minor node presents a suite of functions for retrieving information from the
user database. See section Reading the Group Database,
for a similar suite that retrieves information from the group database.
The POSIX standard does not define the file where user information is kept. Instead, it provides the <pwd.h> header file and several C language subroutines for obtaining user information. The primary function is getpwent, for "get password entry." The "password" comes from the original user database file, `/etc/passwd', which stores user information, along with the encrypted passwords (hence the name).
While an @command{awk} program could simply read `/etc/passwd' directly, this file may not contain complete information about the system's set of users.(57) To be sure you are able to produce a readable and complete version of the user database, it is necessary to write a small C program that calls getpwent. getpwent is defined as returning a pointer to a struct passwd. Each time it is called, it returns the next entry in the database. When there are no more entries, it returns NULL, the null pointer. When this happens, the C program should call endpwent to close the database.
Following is @command{pwcat}, a C program that "cats" the password database.
    /*
     * pwcat.c
     *
     * Generate a printable version of the password database
     */
    #include <stdio.h>
    #include <pwd.h>

    int
    main(int argc, char **argv)
    {
        struct passwd *p;

        while ((p = getpwent()) != NULL)
            printf("%s:%s:%ld:%ld:%s:%s:%s\n",
                p->pw_name, p->pw_passwd, (long) p->pw_uid,
                (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);

        endpwent();
        return 0;
    }
If you don't understand C, don't worry about it. The output from @command{pwcat} is the user database, in the traditional `/etc/passwd' format of colon-separated fields. The fields are:
The user's login name.
The user's encrypted password. This may not be available on some systems.
The user's numeric user-id number.
The user's numeric group-id number.
The user's full name, and perhaps other information associated with the user.
The user's login (or "home") directory (familiar to shell programmers as $HOME).
The program that is run when the user logs in. This is usually a shell, such as @command{bash}.
    $ pwcat
    -| root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
    -| nobody:*:65534:65534::/:
    -| daemon:*:1:1::/:
    -| sys:*:2:2::/:/bin/csh
    -| bin:*:3:3::/bin:
    -| arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
    -| miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
    -| andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
    ...
With that introduction, following is a group of functions for getting user information. There are several functions here, corresponding to the C functions of the same names:
    # passwd.awk -- access password file information

    BEGIN {
        # tailor this to suit your system
        _pw_awklib = "/usr/local/libexec/awk/"
    }

    function _pw_init(    oldfs, oldrs, olddol0, pwcat, using_fw)
    {
        if (_pw_inited)
            return

        oldfs = FS
        oldrs = RS
        olddol0 = $0
        using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
        FS = ":"
        RS = "\n"

        pwcat = _pw_awklib "pwcat"
        while ((pwcat | getline) > 0) {
            _pw_byname[$1] = $0
            _pw_byuid[$3] = $0
            _pw_bycount[++_pw_total] = $0
        }
        close(pwcat)
        _pw_count = 0
        _pw_inited = 1
        FS = oldfs
        if (using_fw)
            FIELDWIDTHS = FIELDWIDTHS
        RS = oldrs
        $0 = olddol0
    }
The BEGIN rule sets a private variable to the directory where @command{pwcat} is stored. Because it is used to help out an @command{awk} library routine, we have chosen to put it in `/usr/local/libexec/awk'; however, you might want it to be in a different directory on your system.
The function _pw_init keeps three copies of the user information in three associative arrays. The arrays are indexed by username (_pw_byname), by user-id number (_pw_byuid), and by order of occurrence (_pw_bycount). The variable _pw_inited is used for efficiency; _pw_init needs only to be called once.
Because this function uses getline to read information from @command{pwcat}, it first saves the values of FS, RS, and $0. It notes in the variable using_fw whether field splitting with FIELDWIDTHS is in effect or not. Doing so is necessary, since these functions could be called from anywhere within a user's program, and the user may have his or her own way of splitting records and fields.
The value of using_fw comes from checking PROCINFO["FS"], which is "FIELDWIDTHS" if field splitting is being done with FIELDWIDTHS. This makes it possible to restore the correct field-splitting mechanism later. The test can only be true for @command{gawk}. It is false if using FS or on some other @command{awk} implementation.
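The save-and-restore pattern can be tried in isolation. The sketch below is a hypothetical helper, read_colon_field, that is not part of passwd.awk; it assumes a POSIX shell with an echo command is available to supply one line of colon-separated input.

```awk
# read_colon_field -- run cmd and return field n of its first output line,
# treating that line as colon-separated.  Hypothetical helper that follows
# the save-and-restore pattern of _pw_init in miniature.
function read_colon_field(cmd, n,    oldfs, oldrs, olddol0, using_fw, result)
{
    oldfs = FS
    oldrs = RS
    olddol0 = $0
    using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")    # can only be true in gawk
    FS = ":"
    RS = "\n"
    result = ""
    if ((cmd | getline) > 0)    # plain getline replaces $0 and the fields
        result = $n
    close(cmd)
    FS = oldfs
    if (using_fw)
        FIELDWIDTHS = FIELDWIDTHS
    RS = oldrs
    $0 = olddol0                # resplit with the caller's settings
    return result
}

BEGIN {
    $0 = "one two three"
    print read_colon_field("echo a:b:c", 2)
    print $2
}
```

If the restore step works as intended, the caller's record is untouched afterwards: the second print statement still sees `two' as $2, even though the function temporarily split its own input on colons.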
The main part of the function uses a loop to read database lines, split the line into fields, and then store the line into each array as necessary. When the loop is done, _pw_init cleans up by closing the pipeline, setting _pw_inited to one, and restoring FS (and FIELDWIDTHS if necessary), RS, and $0. The use of _pw_count is explained shortly.
The getpwnam
function takes a username as a string argument. If that
user is in the database, it returns the appropriate line. Otherwise it
returns the null string:
    function getpwnam(name)
    {
        _pw_init()
        if (name in _pw_byname)
            return _pw_byname[name]
        return ""
    }
Similarly, the getpwuid function takes a user-id number argument. If that user number is in the database, it returns the appropriate line. Otherwise it returns the null string:
    function getpwuid(uid)
    {
        _pw_init()
        if (uid in _pw_byuid)
            return _pw_byuid[uid]
        return ""
    }
The getpwent function simply steps through the database, one entry at a time. It uses _pw_count to track its current position in the _pw_bycount array:
    function getpwent()
    {
        _pw_init()
        if (_pw_count < _pw_total)
            return _pw_bycount[++_pw_count]
        return ""
    }
The endpwent function resets _pw_count to zero, so that subsequent calls to getpwent start over again:
    function endpwent()
    {
        _pw_count = 0
    }
A conscious design decision in this suite is that each subroutine calls _pw_init to initialize the database arrays. The overhead of running a separate process to generate the user database, and the I/O to scan it, are only incurred if the user's main program actually calls one of these functions. If this library file is loaded along with a user's program, but none of the routines are ever called, then there is no extra runtime overhead. (The alternative is to move the body of _pw_init into a BEGIN rule, which always runs @command{pwcat}. This simplifies the code but runs an extra process that may never be needed.)
In turn, calling _pw_init
is not too expensive, because the
_pw_inited
variable keeps the program from reading the data more than
once. If you are worried about squeezing every last cycle out of your
@command{awk} program, the check of _pw_inited
could be moved out of
_pw_init
and duplicated in all the other functions. In practice,
this is not necessary, since most @command{awk} programs are I/O-bound, and it
clutters up the code.
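The guard-variable idea is independent of the user database and can be sketched on its own. Every name below (_sq_init, _sq_inited, _sq_table, _sq_builds, square) is hypothetical; a small table of squares stands in for the expensive step of running @command{pwcat}.

```awk
# Hypothetical names throughout; only the guard-variable idea comes from
# the library code.
function _sq_init(    i)
{
    if (_sq_inited)
        return              # already built: nothing more to do
    for (i = 1; i <= 5; i++)
        _sq_table[i] = i * i
    _sq_builds++            # counts how many times the table was built
    _sq_inited = 1
}

function square(n)
{
    _sq_init()              # cheap after the first call
    return (n in _sq_table) ? _sq_table[n] : ""
}

BEGIN {
    a = square(3)
    b = square(4)
    print a, b, _sq_builds
}
```

However many times square is called, _sq_builds should stay at one, which is the property that keeps the repeated calls to _pw_init inexpensive.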
The @command{id} program in section Printing out User Information,
uses these functions.
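Since each of these functions returns a whole database line, a caller usually has to split that line to get at a single field. The following sketch shows one way to do so; pw_field is a hypothetical helper, not part of passwd.awk, and the sample record is copied from the @command{pwcat} output shown earlier so that the example runs without the library.

```awk
# pw_field -- return field n of a passwd-format line, or "" if out of range.
# Hypothetical helper; not part of passwd.awk.
function pw_field(line, n,    parts, count)
{
    count = split(line, parts, ":")
    return (n >= 1 && n <= count) ? parts[n] : ""
}

BEGIN {
    # A real program would obtain this line from getpwnam("arnold").
    entry = "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
    printf("home = %s, shell = %s\n", pw_field(entry, 6), pw_field(entry, 7))
}
```

Using split on the returned string, rather than assigning it to $0, leaves the caller's current record and fields alone.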
Much of the discussion presented in section Reading the User Database, applies to the group database as well. Although there has traditionally been a well-known file (`/etc/group') in a well-known format, the POSIX standard only provides a set of C library routines (<grp.h> and getgrent) for accessing the information.
Even though this file may exist, it likely does not have
complete information. Therefore, as with the user database, it is necessary
to have a small C program that generates the group database as its output.
@command{grcat}, a C program that "cats" the group database, is as follows:
    /*
     * grcat.c
     *
     * Generate a printable version of the group database
     */
    #include <stdio.h>
    #include <grp.h>

    int
    main(int argc, char **argv)
    {
        struct group *g;
        int i;

        while ((g = getgrent()) != NULL) {
            printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
                                 (long) g->gr_gid);
            for (i = 0; g->gr_mem[i] != NULL; i++) {
                printf("%s", g->gr_mem[i]);
                if (g->gr_mem[i+1] != NULL)
                    putchar(',');
            }
            putchar('\n');
        }
        endgrent();
        return 0;
    }
Each line in the group database represents one group. The fields are separated with colons and represent the following information:
The group's name.
The group's encrypted password. In practice, this field is never used; it is usually empty or set to `*'.
The group's numeric group-id number; this number should be unique within the file.
A comma-separated list of usernames. These users are members of the group. Modern Unix systems allow users to be members of several groups simultaneously. If your system does, then there are elements "group1" through "groupN" in PROCINFO for those group-id numbers. (Note that PROCINFO is a @command{gawk} extension; see section Built-in Variables.)
    $ grcat
    -| wheel:*:0:arnold
    -| nogroup:*:65534:
    -| daemon:*:1:
    -| kmem:*:2:
    -| staff:*:10:arnold,miriam,andy
    -| other:*:20:
    ...
Here are the functions for obtaining information from the group database. There are several, modeled after the C library functions of the same names:
    # group.awk -- functions for dealing with the group file

    BEGIN    \
    {
        # Change to suit your system
        _gr_awklib = "/usr/local/libexec/awk/"
    }

    function _gr_init(    oldfs, oldrs, olddol0, grcat, using_fw, n, a, i)
    {
        if (_gr_inited)
            return

        oldfs = FS
        oldrs = RS
        olddol0 = $0
        using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
        FS = ":"
        RS = "\n"

        grcat = _gr_awklib "grcat"
        while ((grcat | getline) > 0) {
            if ($1 in _gr_byname)
                _gr_byname[$1] = _gr_byname[$1] "," $4
            else
                _gr_byname[$1] = $0
            if ($3 in _gr_bygid)
                _gr_bygid[$3] = _gr_bygid[$3] "," $4
            else
                _gr_bygid[$3] = $0

            n = split($4, a, "[ \t]*,[ \t]*")
            for (i = 1; i <= n; i++)
                if (a[i] in _gr_groupsbyuser)
                    _gr_groupsbyuser[a[i]] = \
                        _gr_groupsbyuser[a[i]] " " $1
                else
                    _gr_groupsbyuser[a[i]] = $1

            _gr_bycount[++_gr_count] = $0
        }
        close(grcat)
        _gr_count = 0
        _gr_inited++
        FS = oldfs
        if (using_fw)
            FIELDWIDTHS = FIELDWIDTHS
        RS = oldrs
        $0 = olddol0
    }
The BEGIN rule sets a private variable to the directory where @command{grcat} is stored. Because it is used to help out an @command{awk} library routine, we have chosen to put it in `/usr/local/libexec/awk'. You might want it to be in a different directory on your system.
These routines follow the same general outline as the user database routines (see section Reading the User Database). The _gr_inited variable is used to ensure that the database is scanned no more than once. The _gr_init function first saves FS, FIELDWIDTHS, RS, and $0, and then sets FS and RS to the correct values for scanning the group information.
The group information is stored in several associative arrays. The arrays are indexed by group name (_gr_byname), by group-id number (_gr_bygid), and by position in the database (_gr_bycount). There is an additional array indexed by username (_gr_groupsbyuser), which is a space-separated list of groups that each user belongs to.
Unlike the user database, it is possible to have multiple records in the
database for the same group. This is common when a group has a large number
of members. A pair of such entries might look like the following:
    tvpeople:*:101:johnny,jay,arsenio
    tvpeople:*:101:david,conan,tom,joan
For this reason, _gr_init looks to see if a group name or group-id number has already been seen. If it has, then the usernames are simply concatenated onto the previous list of users. (There is actually a subtle problem with the code just presented. Suppose that the first time there were no names. This code adds the names with a leading comma. It also doesn't check that there is a $4.)
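One possible way around both problems is sketched below. The helper append_members is hypothetical and is not part of group.awk; it assumes that a stored record whose member list is still empty ends with a colon, as in the `nogroup:*:65534:' line of the sample @command{grcat} output.

```awk
# append_members -- add the comma-separated list members to a stored group
# record.  Hypothetical helper sketching one fix; not part of group.awk.
function append_members(record, members)
{
    if (members == "")
        return record               # there was no $4: nothing to add
    if (record ~ /:$/)
        return record members       # list was empty so far: no comma
    return record "," members
}

BEGIN {
    print append_members("tvpeople:*:101:", "david,conan,tom,joan")
    print append_members("tvpeople:*:101:johnny,jay,arsenio", "david,conan")
}
```

With a helper of this kind, the two concatenations in _gr_init could call append_members(_gr_byname[$1], $4) and append_members(_gr_bygid[$3], $4) instead of always inserting a comma.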
Finally, _gr_init closes the pipeline to @command{grcat}, restores FS (and FIELDWIDTHS if necessary), RS, and $0, initializes _gr_count to zero (it is used later), and makes _gr_inited nonzero.
The getgrnam
function takes a group name as its argument, and if that
group exists, it is returned. Otherwise, getgrnam
returns the null
string:
    function getgrnam(group)
    {
        _gr_init()
        if (group in _gr_byname)
            return _gr_byname[group]
        return ""
    }
The getgrgid function is similar; it takes a numeric group-id and looks up the information associated with that group-id:
    function getgrgid(gid)
    {
        _gr_init()
        if (gid in _gr_bygid)
            return _gr_bygid[gid]
        return ""
    }
The getgruser function does not have a C counterpart. It takes a username and returns the list of groups that have the user as a member:
    function getgruser(user)
    {
        _gr_init()
        if (user in _gr_groupsbyuser)
            return _gr_groupsbyuser[user]
        return ""
    }
The getgrent function steps through the database one entry at a time. It uses _gr_count to track its position in the list:
    function getgrent()
    {
        _gr_init()
        if (++_gr_count in _gr_bycount)
            return _gr_bycount[_gr_count]
        return ""
    }
The endgrent function resets _gr_count to zero so that getgrent can start over again:
    function endgrent()
    {
        _gr_count = 0
    }
As with the user database routines, each function calls _gr_init to initialize the arrays. Doing so only incurs the extra overhead of running @command{grcat} if these functions are used (as opposed to moving the body of _gr_init into a BEGIN rule).
Most of the work is in scanning the database and building the various
associative arrays. The functions that the user calls are themselves very
simple, relying on @command{awk}'s associative arrays to do work.
The @command{id} program in section Printing out User Information,
uses these functions.