This appendix contains information mainly of interest to implementors and maintainers of @command{gawk}. Everything in it applies specifically to @command{gawk} and not to other implementations.
See section Extensions in @command{gawk} Not in POSIX @command{awk}, for a summary of the GNU extensions to the @command{awk} language and program. All of these features can be turned off by invoking @command{gawk} with the @option{--traditional} option or with the @option{--posix} option.
If @command{gawk} is compiled for debugging with `-DDEBUG', then there is one more option available on the command line:
-W parsedebug
--parsedebug
This option is intended only for serious @command{gawk} developers and not for the casual user. It probably has not even been compiled into your version of @command{gawk}, since it slows down execution.
If you find that you want to enhance @command{gawk} in a significant fashion, you are perfectly free to do so. That is the point of having free software; the source code is available and you are free to change it as you want (see section GNU General Public License).
This minor node discusses the ways you might want to change @command{gawk} as well as any considerations you should bear in mind.
You are free to add any new features you like to @command{gawk}. However, if you want your changes to be incorporated into the @command{gawk} distribution, there are several steps that you need to take in order to make it possible for me to include your changes:
Put the return type of the function, even if it is int, on the line above the line with the name and arguments of the function.
Put spaces around parentheses used in control structures (if, while, for, do, switch, and return).
Do not use the comma operator to produce multiple side effects, except in for loop initialization and increment parts, and in macro bodies.
Use comparisons against NULL and '\0' in the conditions of if, while, and for statements, as well as in the cases of switch statements, instead of just the plain pointer or character value.
Use the TRUE, FALSE, and NULL symbolic constants and the character constant '\0' where appropriate, instead of 1 and 0.
Use the ISALPHA, ISDIGIT, etc. macros, instead of the traditional lowercase versions; these macros are better behaved for non-ASCII character sets.
Do not use the alloca function for allocating memory off the stack. Its use causes more portability trouble than is worth the minor benefit of not having to free the storage. Instead, use malloc and free.
Submit changes as diffs that can be applied to the source code automatically (using patch). If I have to apply the changes manually, using a text editor, I may not do so, particularly if there are lots of changes.
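Several of the style rules above can be seen together in a short, self-contained sketch. It is not taken from the @command{gawk} sources: the function names are made up for illustration, and TRUE, FALSE, and ISDIGIT are defined locally as stand-ins for the definitions that @command{gawk} itself provides.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Local stand-ins (assumptions); gawk supplies its own versions. */
#ifndef TRUE
#define TRUE	1
#define FALSE	0
#endif
#define ISDIGIT(c)	isdigit((unsigned char) (c))

/* first_digit --- return a pointer to the first decimal digit in s, or NULL */

const char *
first_digit(const char *s)
{
	assert(s != NULL);

	while (*s != '\0') {	/* explicit comparison against '\0' */
		if (ISDIGIT(*s))
			return s;
		s++;
	}

	return NULL;
}

/* has_digit --- return TRUE if s contains a decimal digit, FALSE otherwise */

int
has_digit(const char *s)
{
	if (first_digit(s) != NULL)	/* explicit comparison against NULL */
		return TRUE;

	return FALSE;
}
```

Each function has a one-line descriptive comment, the return type sits on its own line above the function name, and control keywords are followed by a space before the parenthesis.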
Although this sounds like a lot of work, please remember that while you may write the new code, I have to maintain it and support it. If it isn't possible for me to do that with a minimum of extra work, then I probably will not.
If you want to port @command{gawk} to a new operating system, there are several steps to follow:
Following these steps makes it much easier to integrate your changes into @command{gawk} and have them co-exist happily with other operating systems' code that is already there.
In the code that you supply and maintain, feel free to use a coding style and brace layout that suits your taste.
Danger Will Robinson! Danger!!
Warning! Warning!
The Robot
Beginning with @command{gawk} 3.1, it is possible to add new built-in functions to @command{gawk} using dynamically loaded libraries. This facility is available on systems (such as GNU/Linux) that support the dlopen and dlsym functions. This minor node describes how to write and use dynamically loaded extensions for @command{gawk}. Experience with programming in C or C++ is necessary when reading this minor node.
Caution: The facilities described in this minor node are very much subject to change in the next @command{gawk} release. Be aware that you may have to re-do everything, perhaps from scratch, upon the next release.
The truth is that @command{gawk} was not designed for simple extensibility. The facilities for adding functions using shared libraries work, but are something of a "bag on the side." Thus, this tour is brief and simplistic; would-be @command{gawk} hackers are encouraged to spend some time reading the source code before trying to write extensions based on the material presented here. Of particular note are the files `awk.h', `builtin.c', and `eval.c'. Reading `awk.y' in order to see how the parse tree is built would also be of use.
With the disclaimers out of the way, the following types, structure members, functions, and macros are declared in `awk.h' and are of use when writing extensions. The next minor node shows how they are used:
AWKNUM
    An AWKNUM is the internal type of @command{awk} floating-point numbers. Typically, it is a C double.
NODE
    Just about everything is done using objects of type NODE. These contain both strings and numbers, as well as variables and arrays.
AWKNUM force_number(NODE *n)
    Forces a value to be numeric and returns the numeric value contained in the node.
void force_string(NODE *n)
    Guarantees that a NODE's string value is current. It may end up calling an internal @command{gawk} function. It also guarantees that the string is zero-terminated.
n->param_cnt
    The number of parameters actually passed in a function call at runtime.
n->stptr
n->stlen
    The data and length of a NODE's string value, respectively. The string is not guaranteed to be zero-terminated. If you need to pass the string value to a C library function, save the value in n->stptr[n->stlen], assign '\0' to it, call the routine, and then restore the value.
n->type
    The type of the NODE. This is a C enum. Values should be either Node_var or Node_var_array for function parameters.
n->vname
    The variable name of a node.
void assoc_clear(NODE *n)
    Clears the associative array pointed to by n. Make sure that `n->type == Node_var_array' first.
NODE **assoc_lookup(NODE *symbol, NODE *subs, int reference)
    Finds, and installs if necessary, array elements. symbol is the array, subs is the subscript. This is usually a value created with tmp_string (see below). reference should be TRUE if it is an error to use the value before it is created. Typically, FALSE is the correct value to use from extension functions.
NODE *make_string(char *s, size_t len)
    Takes a C string and turns it into a pointer to a NODE that can be stored appropriately. This is permanent storage; understanding of @command{gawk} memory management is helpful.
NODE *make_number(AWKNUM val)
    Takes an AWKNUM and turns it into a pointer to a NODE that can be stored appropriately. This is permanent storage; understanding of @command{gawk} memory management is helpful.
NODE *tmp_string(char *s, size_t len)
    Takes a C string and turns it into a pointer to a NODE that can be stored appropriately. This is temporary storage; understanding of @command{gawk} memory management is helpful.
NODE *tmp_number(AWKNUM val)
    Takes an AWKNUM and turns it into a pointer to a NODE that can be stored appropriately. This is temporary storage; understanding of @command{gawk} memory management is helpful.
NODE *dupnode(NODE *n)
    Duplicates a node. In most cases, this increments an internal reference count instead of actually duplicating the entire NODE; understanding of @command{gawk} memory management is helpful.
void free_temp(NODE *n)
    Releases the memory associated with a NODE allocated with tmp_string or tmp_number. Understanding of @command{gawk} memory management is helpful.
void make_builtin(char *name, NODE *(*func)(NODE *), int count)
    Registers a C function pointed to by func as new built-in function name. name is a regular C string. count is the maximum number of arguments that the function takes. The function should be written in the following manner:

         /* do_xxx -- do xxx function for gawk */

         NODE *
         do_xxx(NODE *tree)
         {
             ...
         }

NODE *get_argument(NODE *tree, int i)
    Called from within a C extension function to get the i'th argument from the function call. The first argument is argument zero.
void set_value(NODE *tree)
    Called from within a C extension function to set the return value that the @command{awk} program sees from the new function.
void update_ERRNO(void)
    Called from within a C extension function to set the value of @command{gawk}'s ERRNO variable, based on the current value of the C errno variable. It is provided as a convenience.
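The save, terminate, call, and restore sequence described for n->stptr and n->stlen can be sketched in isolation. The struct below is a hypothetical stand-in for the two NODE members, not the real NODE type, and the sketch assumes (as the description above implies) that the byte at stptr[stlen] is addressable and writable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two NODE members; not gawk's NODE. */
struct strval {
	char *stptr;	/* string data, not necessarily zero-terminated */
	size_t stlen;	/* number of bytes in the string */
};

/* parse_long --- hand a counted string to strtol, which needs a terminator */

long
parse_long(struct strval *n)
{
	char save;
	long result;

	assert(n != NULL && n->stptr != NULL);

	save = n->stptr[n->stlen];	/* save the byte just past the string */
	n->stptr[n->stlen] = '\0';	/* terminate temporarily */
	result = strtol(n->stptr, NULL, 10);
	n->stptr[n->stlen] = save;	/* put the original byte back */

	return result;
}
```

With a buffer holding "12345" and a length of 3, the function sees only "123", and the buffer is unchanged afterwards.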
An argument that is supposed to be an array needs to be handled with some extra code, in case the array being passed in is actually from a function parameter. The following "boiler plate" code shows how to do this:
     NODE *the_arg;

     the_arg = get_argument(tree, 2); /* assume need 3rd arg, 0-based */

     /* if a parameter, get it off the stack */
     if (the_arg->type == Node_param_list)
         the_arg = stack_ptr[the_arg->param_cnt];

     /* parameter referenced an array, get it */
     if (the_arg->type == Node_array_ref)
         the_arg = the_arg->orig_array;

     /* check type */
     if (the_arg->type != Node_var && the_arg->type != Node_var_array)
         fatal("newfunc: third argument is not an array");

     /* force it to be an array, if necessary, clear it */
     the_arg->type = Node_var_array;
     assoc_clear(the_arg);
Again, you should spend time studying the @command{gawk} internals; don't just blindly copy this code.
Two useful functions that are not in @command{awk} are chdir (so that an @command{awk} program can change its directory) and stat (so that an @command{awk} program can gather information about a file). This minor node implements these functions for @command{gawk} in an external extension library.
Using chdir and stat
This minor node shows how to use the new functions at the @command{awk} level once they've been integrated into the running @command{gawk} interpreter. Using chdir is very straightforward. It takes one argument, the new directory to change to:

     ...
     newdir = "/home/arnold/funstuff"
     ret = chdir(newdir)
     if (ret < 0) {
         printf("could not change to %s: %s\n", newdir, ERRNO) > "/dev/stderr"
         exit 1
     }
     ...
The return value is negative if the chdir failed, and ERRNO (see section Built-in Variables) is set to a string indicating the error.
Using stat is a bit more complicated. The C stat function fills in a structure that has a fair amount of information. The right way to model this in @command{awk} is to fill in an associative array with the appropriate information:

     file = "/home/arnold/.profile"
     fdata[1] = "x"     # force `fdata' to be an array
     ret = stat(file, fdata)
     if (ret < 0) {
         printf("could not stat %s: %s\n", file, ERRNO) > "/dev/stderr"
         exit 1
     }
     printf("size of %s is %d bytes\n", file, fdata["size"])
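The stat function always clears the data array, even if the stat fails. It fills in the following elements:
"name"
    The name of the file that was stat'ed.
"dev"
"ino"
    The file's device and inode numbers, respectively.
"mode"
    The file's mode, as a numeric value. This includes both the file's type and its permissions.
"nlink"
    The number of hard links (directory entries) the file has.
"uid"
"gid"
    The numeric user and group ID numbers of the file's owner.
"size"
    The size in bytes of the file.
"blocks"
    The number of disk blocks the file actually occupies.
"atime"
"mtime"
"ctime"
    The file's last access, modification, and inode update times, respectively. These are numeric timestamps, suitable for formatting with strftime (see section Built-in Functions).
"pmode"
    The file's printable mode: a string representation of the file's type and permissions, for example "drwxr-xr-x".
"type"
    A printable string representation of the file's type. The value is one of the following:
    "blockdev"
    "chardev"
        The file is a block or character device.
    "directory"
        The file is a directory.
    "fifo"
        The file is a named pipe (also known as a FIFO).
    "file"
        The file is a regular file.
    "socket"
        The file is an AF_UNIX ("Unix domain") socket in the filesystem.
    "symlink"
        The file is a symbolic link.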
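Several additional elements may be present depending upon the operating system and the type of the file. You can test for them in your @command{awk} program by using the in operator (see section Referring to an Array Element):
"blksize"
    The preferred block size for I/O to the file. This element is not present on all systems in the C stat structure.
"linkval"
    If the file is a symbolic link, this element is the name of the file the link points to.
"rdev"
"major"
"minor"
    If the file is a block or character device file, these values represent the numeric device number and the major and minor components of that number, respectively.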
C Code for chdir and stat
Here is the C code for these extensions. They were written for GNU/Linux. The code needs some more work for complete portability to other POSIX-compliant systems:

     #include "awk.h"

     #include <sys/sysmacros.h>

     /* do_chdir -- provide dynamically loaded chdir() builtin for gawk */

     static NODE *
     do_chdir(tree)
     NODE *tree;
     {
         NODE *newdir;
         int ret = -1;

         newdir = get_argument(tree, 0);
The file includes the "awk.h" header file for definitions for the @command{gawk} internals. It includes <sys/sysmacros.h> for access to the major and minor macros.
By convention, for an @command{awk} function foo, the function that implements it is called `do_foo'. The function should take a `NODE *' argument, usually called tree, that represents the argument list to the function. The newdir variable represents the new directory to change to, retrieved with get_argument. Note that the first argument is numbered zero.
This code actually accomplishes the chdir. It first forces the argument to be a string and passes the string value to the chdir system call. If the chdir fails, ERRNO is updated. The result of force_string has to be freed with free_temp:

         if (newdir != NULL) {
             (void) force_string(newdir);
             ret = chdir(newdir->stptr);
             if (ret < 0)
                 update_ERRNO();

             free_temp(newdir);
         }
Finally, the function returns the return value to the @command{awk} level, using set_value. Then it must return a value from the call to the new built-in (this value is ignored by the interpreter):

         /* Set the return value */
         set_value(tmp_number((AWKNUM) ret));

         /* Just to make the interpreter happy */
         return tmp_number((AWKNUM) 0);
     }
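The error-handling pattern in do_chdir can be tried outside @command{gawk} with a small standalone sketch. The function name try_chdir and the msg parameter are made up for this example; the message obtained from strerror is only an approximation of what an @command{awk} program would find in ERRNO, since the exact text that update_ERRNO produces is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* try_chdir --- change directory; on failure, point *msg at a description */

int
try_chdir(const char *dir, const char **msg)
{
	int ret;

	assert(dir != NULL && msg != NULL);

	*msg = "";
	ret = chdir(dir);
	if (ret < 0)
		*msg = strerror(errno);	/* a string describing the failure */

	return ret;
}
```

As in the extension, a negative return value signals failure and the caller can then report the accompanying message.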
The stat built-in is more involved. First comes a function that turns a numeric mode into a printable representation (e.g., octal 644 becomes `-rw-r--r--'). This is omitted here for brevity:

     /* format_mode -- turn a stat mode field into something readable */

     static char *
     format_mode(fmode)
     unsigned long fmode;
     {
         ...
     }
Next comes the actual do_stat function itself. First come the variable declarations and argument checking:

     /* do_stat -- provide a stat() function for gawk */

     static NODE *
     do_stat(tree)
     NODE *tree;
     {
         NODE *file, *array;
         struct stat sbuf;
         int ret;
         char *msg;
         NODE **aptr;
         char *pmode;    /* printable mode */
         char *type = "unknown";

         /* check arg count */
         if (tree->param_cnt != 2)
             fatal("stat: called with %d arguments, should be 2",
                 tree->param_cnt);
Then comes the actual work. First, we get the arguments. Then, we always clear the array. To get the file information, we use lstat, in case the file is a symbolic link. If there's an error, we set ERRNO and return:

         /*
          * file is first arg,
          * array to hold results is second
          */
         file = get_argument(tree, 0);
         array = get_argument(tree, 1);

         /* empty out the array */
         assoc_clear(array);

         /* lstat the file, if error, set ERRNO and return */
         (void) force_string(file);
         ret = lstat(file->stptr, & sbuf);
         if (ret < 0) {
             update_ERRNO();

             set_value(tmp_number((AWKNUM) ret));

             free_temp(file);
             return tmp_number((AWKNUM) 0);
         }
Now comes the tedious part: filling in the array. Only a few of the calls are shown here, since they all follow the same pattern:
         /* fill in the array */
         aptr = assoc_lookup(array, tmp_string("name", 4), FALSE);
         *aptr = dupnode(file);

         aptr = assoc_lookup(array, tmp_string("mode", 4), FALSE);
         *aptr = make_number((AWKNUM) sbuf.st_mode);

         aptr = assoc_lookup(array, tmp_string("pmode", 5), FALSE);
         pmode = format_mode(sbuf.st_mode);
         *aptr = make_string(pmode, strlen(pmode));
When done, we free the temporary value containing the file name, set the return value, and return:
         free_temp(file);

         /* Set the return value */
         set_value(tmp_number((AWKNUM) ret));

         /* Just to make the interpreter happy */
         return tmp_number((AWKNUM) 0);
     }
Finally, it's necessary to provide the "glue" that loads the new function(s) into @command{gawk}. By convention, each library has a routine named dlload that does the job:

     /* dlload -- load new builtins in this library */

     NODE *
     dlload(tree, dl)
     NODE *tree;
     void *dl;
     {
         make_builtin("chdir", do_chdir, 1);
         make_builtin("stat", do_stat, 2);
         return tmp_number((AWKNUM) 0);
     }
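To make the idea of registration more concrete, the following toy model records a name, a function pointer, and a maximum argument count, then dispatches calls by name. Everything in it (the table, the function names, the use of double arguments) is hypothetical and far simpler than what @command{gawk} does internally; it only illustrates the role the three arguments to make_builtin play.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BUILTINS	16	/* arbitrary limit for this sketch */

typedef double (*builtin_fn)(const double *args, int nargs);

struct builtin {
	const char *name;	/* name used to call the function */
	builtin_fn func;	/* C function that implements it */
	int max_args;		/* maximum number of arguments accepted */
};

static struct builtin table[MAX_BUILTINS];
static int n_builtins = 0;

/* register_builtin --- record a new function; return 0, or -1 if table is full */

int
register_builtin(const char *name, builtin_fn func, int max_args)
{
	assert(name != NULL && func != NULL);

	if (n_builtins >= MAX_BUILTINS)
		return -1;

	table[n_builtins].name = name;
	table[n_builtins].func = func;
	table[n_builtins].max_args = max_args;
	n_builtins++;

	return 0;
}

/* call_builtin --- call a registered function by name; -1 if unknown or too many args */

int
call_builtin(const char *name, const double *args, int nargs, double *result)
{
	int i;

	assert(name != NULL && result != NULL);

	for (i = 0; i < n_builtins; i++) {
		if (strcmp(table[i].name, name) == 0) {
			if (nargs > table[i].max_args)
				return -1;
			*result = table[i].func(args, nargs);
			return 0;
		}
	}

	return -1;
}

/* sum_args --- example function: add up the arguments */

double
sum_args(const double *args, int nargs)
{
	double total = 0.0;
	int i;

	for (i = 0; i < nargs; i++)
		total += args[i];

	return total;
}
```

Registering sum_args under the name "sum" with a maximum of two arguments makes a two-argument call succeed and a three-argument call fail before the function is ever invoked.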
And that's it! As an exercise, consider adding functions to implement system calls such as chown, chmod, and umask.
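One detail the walk-through mentions without showing is the use of the major and minor macros from <sys/sysmacros.h>. The sketch below shows how a device number can be split into its two components on a GNU/Linux system; the function name split_device is made up, and the header is specific to the GNU C library and similar systems.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/sysmacros.h>

/* split_device --- break a device number into major and minor components */

void
split_device(dev_t dev, unsigned int *maj, unsigned int *min)
{
	assert(maj != NULL && min != NULL);

	*maj = major(dev);
	*min = minor(dev);
}
```

Passing the st_rdev member of a stat structure for a device file gives the two numbers that would populate the "major" and "minor" elements.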
Now that the code is written, it must be possible to add it at runtime to the running @command{gawk} interpreter. First, the code must be compiled. Assuming that the functions are in a file named `filefuncs.c', and idir is the location of the @command{gawk} include files, the following steps create a GNU/Linux shared library:
     $ gcc -shared -DHAVE_CONFIG_H -c -O -g -Iidir filefuncs.c
     $ ld -o filefuncs.so -shared filefuncs.o
Once the library exists, it is loaded by calling the extension built-in function. This function takes two arguments: the name of the library to load and the name of a function to call when the library is first loaded. The named function adds the new functions to @command{gawk}. The extension function returns the value returned by that initialization function within the shared library:

     # file testff.awk
     BEGIN {
         extension("./filefuncs.so", "dlload")

         chdir(".")  # no-op

         data[1] = 1 # force `data' to be an array
         print "Info for testff.awk"
         ret = stat("testff.awk", data)
         print "ret =", ret
         for (i in data)
             printf "data[\"%s\"] = %s\n", i, data[i]
         print "testff.awk modified:",
             strftime("%m %d %y %H:%M:%S", data["mtime"])
     }
Here are the results of running the program:
     $ gawk -f testff.awk
     -| Info for testff.awk
     -| ret = 0
     -| data["blksize"] = 4096
     -| data["mtime"] = 932361936
     -| data["mode"] = 33188
     -| data["type"] = file
     -| data["dev"] = 2065
     -| data["gid"] = 10
     -| data["ino"] = 878597
     -| data["ctime"] = 971431797
     -| data["blocks"] = 2
     -| data["nlink"] = 1
     -| data["name"] = testff.awk
     -| data["atime"] = 971608519
     -| data["pmode"] = -rw-r--r--
     -| data["size"] = 607
     -| data["uid"] = 2076
     -| testff.awk modified: 07 19 99 08:25:36
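The last line of the output comes from formatting the numeric "mtime" value. The same conversion can be reproduced in C with the standard strftime function. This sketch uses gmtime so that the result does not depend on the time zone; the function name format_mtime is made up for the example. The sample run above shows 08:25:36, presumably in the local time zone of the machine it ran on, whereas the same timestamp in UTC is 05:25:36.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* format_mtime --- format a timestamp as "month day year H:M:S", in UTC */

size_t
format_mtime(time_t when, char *buf, size_t len)
{
	struct tm *tm;

	assert(buf != NULL && len > 0);

	tm = gmtime(&when);	/* UTC, independent of the TZ setting */
	if (tm == NULL) {
		buf[0] = '\0';
		return 0;
	}

	/* same format string as in testff.awk */
	return strftime(buf, len, "%m %d %y %H:%M:%S", tm);
}
```

For the value 932361936 this produces "07 19 99 05:25:36", which differs from the sample output only in the hour.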
AWK is a language similar to PERL, only considerably more elegant.
Arnold Robbins
Hey!
Larry Wall
This minor node briefly lists extensions and possible improvements that indicate the directions we are currently considering for @command{gawk}. The file `FUTURES' in the @command{gawk} distribution lists these extensions as well.
Following is a list of probable future changes visible at the @command{awk} language level:
A RECLEN variable for fixed-length records
    Along with FIELDWIDTHS, this would speed up the processing of fixed-length records. PROCINFO["RS"] would be "RS" or "RECLEN", depending upon which kind of record processing is in effect.
Additional printf specifiers
    The 1999 ISO C standard added a number of additional printf format specifiers. These should be evaluated for possible inclusion in @command{gawk}.
lint warnings
Following is a list of probable improvements that will make @command{gawk}'s source code easier to work with:
Following is a list of probable improvements that will make @command{gawk} perform better:
dfa
    The dfa pattern matcher from GNU @command{grep} has some problems. Either a new version or a fixed one will deal with some important regexp matching issues.
Finally, the programs in the test suite could use documenting in this Info file.
See section Making Additions to @command{gawk}, if you are interested in tackling any of these projects.