            Awk -- A Pattern Scanning and Processing Language




                                                 Alfred V. Aho

                                                 Brian W. Kernighan

                                                 Peter J. Weinberger



                                                 Edited for UTS


                            TABLE OF CONTENTS


1.    Introduction

1.1      Usage
1.2      Program Structure
1.3      Records, Fields and Pre-defined Variables

2.    Patterns

2.1      Regular Expressions
2.2      Relational Expressions
2.3      Combinations of Patterns
2.4      Pattern Ranges
2.5      BEGIN and END

3.    Actions

3.1      Printing
3.2      Variables, Expressions, and Assignments
3.3      Field Variables
3.4      String Concatenation
3.5      Arrays
3.6      Built-in Functions
3.7      Flow-of-Control Statements

References

Appendix A.    Awk Grammar


1.    INTRODUCTION

Awk is a programming  language designed to  make many common  information
retrieval and text manipulation tasks easy to state and perform.

The basic operation of  awk is to  scan a  set of input  lines in  order,
searching for lines which match  any of a set of patterns which the  user
has specified.  For each pattern, an action can be specified; this action
will be performed on each line that matches the pattern.

Awk patterns  may  include  arbitrary  boolean  combinations  of  regular
expressions and  of  relational operators  on strings,  numbers,  fields,
variables, and array  elements.  Actions  may include  the same  pattern-
matching constructions as in patterns,  as well as arithmetic and  string
expressions and assignments, if-else, while and for statements, and  mul-
tiple output streams.  For example, the awk program

          { print $3, $2 }

prints the third and second columns of a  table in that order.  The  pro-
gram

          $2 ~ /A|B|C/

prints all input lines with an A, B, or C in the second field.  The  pro-
gram

          $1 != prev  { print; prev = $1 }

prints all lines in which the first field is different from the  previous
first field.
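
As a hedged illustration, the last program can be tried from a shell on a
few made-up lines.  The input data is hypothetical, and a POSIX-style
shell with a printf command is assumed.

```shell
# Invented three-line input; the first field repeats on line 2.
printf 'a 1\na 2\nb 3\n' | awk '$1 != prev { print; prev = $1 }'
# prints:
#   a 1
#   b 3
```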


1.1      USAGE

The command

          awk  [-Fc]  program  [files]

executes the  awk commands  in the  string program  on the  set of  named
files, or on the standard input if there are no  files or a '-' is given.
The program can also be placed in a file  pfile and executed by the  com-
mand

          awk  [-Fc]  -f pfile  [files]

The -F option is  used to specify  the character c  that separates  input
fields (see 1.3).  The default character is a blank and may be changed in
program as well as on the command line.
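
A minimal sketch of both command forms follows.  The colon-separated
input line is invented, and the mktemp command used to create a scratch
program file is an assumption about the surrounding system, not part of
awk.

```shell
# Program given as a string; input arrives on the standard input.
printf 'root:x:0:0\n' | awk -F: '{ print $1, $3 }'
# prints: root 0

# The same program stored in a scratch file and run with -f.
pfile=$(mktemp)
printf '{ print $1, $3 }\n' > "$pfile"
printf 'root:x:0:0\n' | awk -F: -f "$pfile"
rm -f "$pfile"
```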


1.2      PROGRAM STRUCTURE

An awk program is a sequence of statements of the form:

          pattern  { action }
          pattern  { action }
          ...

Each line of input is matched against each of the patterns in turn.   For
each pattern that matches, the  associated action is executed.  When  all
the patterns have been tested, the next line is fetched and the  matching
starts over.

Either the pattern or the action may be left out, but not both.  If there
is no action  for a pattern, the  matching line is  simply copied to  the
output.  (Thus  a line  which  matches several  patterns can  be  printed
several times.)  If there is no pattern for an action, then the action is
performed for  every input  line.  A  line which  matches no  pattern  is
ignored.

Since patterns and actions are both optional, actions must be enclosed in
braces to distinguish them from patterns.

Comments may be placed in awk programs.  They begin with the character  #
and end with the end of the line, as in

          { print x, y }     # this is a comment
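
The rules of this section can be seen together in a small sketch.  The
input lines are invented.  The first line matches both patterns and so
appears twice in the output; the other lines match nothing and are
ignored.

```shell
printf 'apple pie\nbanana split\ncherry\n' | awk '
/apple/                        # no action: the matching line is copied
/pie/    { print "pie: " $0 }  # pattern with an action
'
# prints:
#   apple pie
#   pie: apple pie
```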



1.3      RECORDS, FIELDS AND PRE-DEFINED VARIABLES

Awk input is  divided into  "records" terminated by  a record  separator.
The default record separator  is a newline,  so by default awk  processes
its input a line at a time.  The number  of the current record is  avail-
able in a variable named NR.

Each input record is considered to  be divided into "fields." Fields  are
normally separated  by white space  -- blanks  or tabs --  but the  input
field separator may be changed, as described below.  Fields are  referred
to as "$1", "$2" and so forth, where $1 is the first field, and $0 is the
whole input record itself.   Fields may  be assigned to.   The number  of
fields in the current record is available in a variable named NF.

The variables FS and RS refer to  the input field and record  separators;
they may be changed  at any time  to any single character.  The  optional
command-line argument -Fc may also be used to set FS to the character c.

If the record separator  is empty, an  empty input line  is taken as  the
record separator,  and blanks,  tabs and  newlines are  treated as  field
separators.

The variables OFS and ORS may be used to change the current output  field
separator and output  record separator.  The  output record separator  is
appended to the output of the print statement.

The variable FILENAME contains the name of the current input file.
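
As a sketch of the empty record separator described above, the following
reads two blank-line-separated records from invented input and prints NR,
NF and the first field of each, joined by an output field separator of
"-".  It relies on the BEGIN pattern, which is described in 2.5.

```shell
printf 'a b\nc\n\nd\n' |
    awk 'BEGIN { RS = ""; OFS = "-" } { print NR, NF, $1 }'
# prints:
#   1-3-a
#   2-1-d
```

The first record spans two input lines, and the newline inside it counts
as a field separator, so it has three fields.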




2.    PATTERNS

A pattern  in front  of an  action  acts as  a selector  that  determines
whether the action is  to be executed.   A variety of expressions may  be
used as  patterns:  regular  expressions, arithmetic  relational  expres-
sions, string-valued expressions, and  arbitrary boolean combinations  of
these.


2.1      REGULAR EXPRESSIONS

The simplest  regular  expression  is  a  literal  string  of  characters
enclosed in slashes, like

          /smith/

This is actually a complete awk program which will print all lines  which
contain any occurrence of the  name "smith".  If a line contains  "smith"
as part of a larger word, it will also be printed, as in "blacksmithing."

Awk regular expressions include the regular expression forms found in the
UTS programs ed,  ned and grep[1]  (without back-referencing).  In  addi-
tion, awk allows parentheses for grouping, | for alternatives, + for "one
or more", and ? for  "zero or one", all as in lex[2].  Character  classes
may be abbreviated:  [a-zA-Z0-9] is  the set of  all letters and  digits.
As an example, the awk program

          /[Aa]ho|[Ww]einberger|[Kk]ernighan/

will print all lines which contain  any of the names "Aho,"  "Weinberger"
or "Kernighan," whether capitalized or not.

Regular expressions (with the extensions  listed above) must be  enclosed
in slashes, just  as in ed,  ned and sed.   Within a regular  expression,
blanks and  the regular  expression metacharacters  are significant.   To
turn off the magic meaning  of one of the regular expression  characters,
precede it with a backslash.  An example is the pattern

          /\/.*\//

which matches any string of characters enclosed in slashes.

One can also specify that any field or variable matches a regular expres-
sion (or does not match it) with the operators ~ and !~.  The program

          $1 ~ /[jJ]ohn/

prints all lines where the first  field matches "john" or "John."  Notice
that this will also match "Johnson", "Johnny-on-the-spot", and so on.  To
restrict the match to exactly "john" or "John", use

          $1 ~ /^[jJ]ohn$/

The caret ^ refers to the beginning of a line or field; the dollar sign $
refers to the end.
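
A short sketch of the anchored match, run on invented names:

```shell
printf 'John\nJohnson\njohn\nSt.John\n' | awk '$1 ~ /^[jJ]ohn$/'
# prints:
#   John
#   john
```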


2.2      RELATIONAL EXPRESSIONS

An awk pattern can be a  relational expression involving the usual  rela-
tional operators <, <=, ==, !=, >= and >.  An example is

          $2 > $1 + 100

which selects lines where the second field is more than 100 greater  than
the first field.  Similarly,

          NF % 2 == 0

prints lines with an even number of fields.

In relational tests, if neither  operand is numeric, a string  comparison
is made; otherwise it is numeric.  Thus,

          $1 >= "s"

selects lines that begin  with an s, t,  u, etc.  In  the absence of  any
other information, fields are treated as strings, so the program

          $1 > $2

will perform a string comparison.
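
Many later awk implementations compare two fields numerically when both
look like numbers, so the kind of comparison is best made explicit when
it matters.  The sketch below uses two common idioms that are not part of
this text: adding 0 to force a numeric comparison, and concatenating the
null string to force a string comparison.  The input line is invented.

```shell
# Numeric: 10 is greater than 9.
printf '10 9\n' | awk '$1 + 0 > $2 + 0 { print "numeric: " $1 " > " $2 }'
# prints: numeric: 10 > 9

# String: "10" sorts before "9" because "1" precedes "9".
printf '10 9\n' | awk '($1 "") < ($2 "") { print "string: " $1 " < " $2 }'
# prints: string: 10 < 9
```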


2.3      COMBINATIONS OF PATTERNS

A pattern can be any boolean combination of patterns, using the operators
|| (or), && (and), and ! (not).  For example,

          $1 >= "s" && $1 < "t" && $1 != "smith"

selects lines where the first field begins with "s", but is not  "smith".
&& and || guarantee  that their operands  will be evaluated from left  to
right; evaluation stops as soon as the truth or falsehood is determined.
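
For instance, on a few invented names the combined pattern keeps only the
one that begins with "s" and is not "smith":

```shell
printf 'smith\nsmythe\ntaylor\nadams\n' |
    awk '$1 >= "s" && $1 < "t" && $1 != "smith"'
# prints: smythe
```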


2.4      PATTERN RANGES

The "pattern" that selects  an action  may also consist  of two  patterns
separated by a comma, as in

          pat1, pat2  { ... }

In this case, the action is performed for each line between an occurrence
of pat1 and the next occurrence of pat2 (inclusive).  For example,

          /start/, /stop/

prints all lines between start and stop, while

          NR == 100, NR == 200 { ... }

does the action for lines 100 through 200 of the input.
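
Both kinds of range can be sketched on small invented inputs.  The second
command uses lines 2 through 4 instead of 100 through 200 to keep the
example short.

```shell
printf 'a\nstart\nb\nstop\nc\n' | awk '/start/, /stop/'
# prints:
#   start
#   b
#   stop

printf '1\n2\n3\n4\n5\n' | awk 'NR == 2, NR == 4'
# prints:
#   2
#   3
#   4
```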


2.5      BEGIN AND END

The special pattern BEGIN matches the beginning of the input, before  the
first record  is read.  The  pattern END  matches the end  of the  input,
after the last record has been processed.   BEGIN and END thus provide  a
way to gain control before  and after processing, for initialization  and
wrap-up.

As an example, the field separator can be set to a colon by

          BEGIN { FS = ":" }
          ... rest of program ...

Or the input lines may be counted by

          END  { print NR }

If BEGIN is present, it must be the first  pattern; END must be the  last
if used.
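
Putting the two together, a sketch on invented colon-separated input sets
the field separator first, prints the second field of each line, and
reports the line count at the end:

```shell
printf 'a:b\nc:d\n' | awk '
BEGIN { FS = ":" }
      { print $2 }
END   { print NR }
'
# prints:
#   b
#   d
#   2
```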


3.    ACTIONS

An awk action is a sequence of action statements terminated by semicolons
or newlines.  Newlines may be escaped by preceding them with  a backslash
so that an action  statement can be  continued on the  next line.   These
action statements can be used  to do a variety of bookkeeping and  string
manipulating tasks.


3.1      PRINTING

The simplest action is to print some or all  of a record; this is  accom-
plished by the awk command print.  The awk program

          { print }

prints each record, thus copying  the input to  the output intact.   More
useful is to print a field or fields from each record.  For instance,

          { print $2, $1 }

prints the first two fields in reverse order.  Items separated by a comma
in the  print statement  will be  separated by the  current output  field
separator when  output.   Items not  separated  by commas  will  be  con-
catenated, so

          { print $1 $2 }

runs the first and second fields together.

The predefined variables NF and NR can be used; for example

          { print NR, NF, $0 }

prints each  record preceded  by  the record  number and  the  number  of
fields.

Output may be diverted to multiple files; the program

          { print $1 >"foo1"; print $2 >"foo2" }

writes the first field,  $1, on the  file foo1, and  the second field  on
file foo2.  The >> notation can also be used:

          { print $1 >>"foo" }

appends the output to the file foo.  (In each case, the output files  are
created if necessary.)   The file name can  be a variable  or a field  as
well as a constant; for example,

          { print $1 >$2 }

uses the contents of field 2 as a file name.

Naturally, there is a limit on the  number of output files; currently  it
is 10.
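
The following sketch runs the two-file program in a scratch directory so
that foo1 and foo2 do not clutter the current one.  The mktemp command
and the input data are assumptions made for the illustration.

```shell
dir=$(mktemp -d)
(
    cd "$dir" || exit 1
    printf 'a 1\nb 2\n' | awk '{ print $1 >"foo1"; print $2 >"foo2" }'
    cat foo1 foo2        # first fields, then second fields
)
rm -r "$dir"
# prints:
#   a
#   b
#   1
#   2
```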

Similarly, output can be piped into another process; for instance,

          { print | "mail zzz" }

mails the output to zzz.

Awk also provides the printf statement for output formatting:

          printf format, expr, expr, ...

formats the expressions  in the  list according to  the specification  in
format and prints them.  For example,

          { printf "%8.2f  %10d\n", $1, $2 }

prints $1 as a floating point number in a field 8 characters wide,  with
two digits after the decimal point, then two blanks, then $2 as a decimal
integer in a field 10 characters wide, followed by a newline.  No  output
separators are produced automatically; you
must add them  yourself, as in  this example.  The  version of printf  is
identical to that used with C[3].
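
To make the field widths visible, this sketch replaces the blanks in the
format with "|" delimiters, a change made only for illustration.  The
input values are invented.

```shell
printf '3.14159 42\n' | awk '{ printf "|%8.2f|%10d|\n", $1, $2 }'
# prints: |    3.14|        42|
```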


3.2      VARIABLES, EXPRESSIONS, AND ASSIGNMENTS

Variables in awk are not declared.  They take on numeric (floating point)
or string values according to context.  For example, in

          { x = 1 }

x is clearly a number, while in

          { x = "smith" }

it is clearly a string.  Strings are converted to numbers and vice  versa
whenever context demands it.  For instance,

          { x = "3" + "4" }

assigns 7 to  x.  Strings  which cannot be  interpreted as  numbers in  a
numerical context  will generally  have  numeric value  zero, but  it  is
unwise to count on this behavior.

By default, variables other than pre-defined variables (see 1.3) are ini-
tialized to the null string,  which has numerical value zero; this  elim-
inates the need for most  BEGIN sections.  For example,  the sums of  the
first two fields can be computed by

                { s1 += $1; s2 += $2 }
          END   { print s1, s2 }


Arithmetic is done internally in  floating point.  The arithmetic  opera-
tors are +, -,  *, / and  % (mod).  The  C increment ++ and decrement  --
operators are also available, and so are the assignment operators +=, -=,
*=, /= and %=.  These operators may all be used in expressions.
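
Two of the examples from this section, sketched as runnable commands.
The first needs no input because everything happens in a BEGIN action
(see 2.5); the second sums two invented columns without initializing s1
or s2.

```shell
awk 'BEGIN { x = "3" + "4"; print x }'
# prints: 7

printf '1 2\n3 4\n' | awk '{ s1 += $1; s2 += $2 } END { print s1, s2 }'
# prints: 4 6
```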


3.3      FIELD VARIABLES

Fields in awk have essentially all of the properties of variables.   They
may be used  in arithmetic or  string operations, and  may be assigned  a
value.  Thus one can replace the first field with a sequence number  like
this:

          { $1 = NR; print }

or accumulate two fields into a third, like this:

          { $1 = $2 + $3; print $0 }

or assign a string to a field:

          { if ($3 > 1000)
                $3 = "too big"
            print
          }

which replaces the third field by "too big" when  it is, and in any  case
prints the record.

Field references may be numerical expressions, as in

          { print $i, $(i+1), $(i+n) }

Whether a field is deemed numeric or string depends on context; in  ambi-
guous cases like

          { if ($1 == $2) ... }

fields are treated as strings.
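
A brief sketch of field assignment and computed field references on
invented input.  The values of i and n are chosen here only so that the
three references select fields 1, 2 and 3.

```shell
printf 'x 2 3\ny 20 30\n' | awk '{ $1 = $2 + $3; print $0 }'
# prints:
#   5 2 3
#   50 20 30

printf 'a b c\n' | awk '{ i = 1; n = 2; print $i, $(i+1), $(i+n) }'
# prints: a b c
```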


3.4      STRING CONCATENATION

Strings may be concatenated.  For example

          length($1 $2 $3)

returns the combined length of the first three fields (see 3.6).  Or,  in
a print statement,

          { print $1 " is " $2 }

prints the two fields separated by " is ".  Variables and numeric expres-
sions may also appear in concatenations.
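
Both concatenations can be sketched on one invented input line:

```shell
printf 'ab cde f\n' | awk '{ print length($1 $2 $3); print $1 " is " $2 }'
# prints:
#   6
#   ab is cde
```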


3.5      ARRAYS

Array elements are not declared; they spring into existence by being men-
tioned.  Subscripts may  have any  non-null value, including  non-numeric
strings.  As an example of  a conventional numeric subscript, the  state-
ment

          x[NR] = $0

assigns the current input record to the NRth element of the array x.   In
fact, it is possible  in principle (though  perhaps slow) to process  the
entire input in a random order with the awk program

                { x[NR] = $0 }
          END   { ... program ... }

The first action merely records each input line in the array x.

Array elements may  be named  by non-numeric  values, which  gives awk  a
capability rather like the associative memory of Snobol tables.   Suppose
the input contains fields with values like apple, orange, etc.  Then  the
program

          /apple/    { x["apple"]++ }
          /orange/   { x["orange"]++ }
          END        { print x["apple"], x["orange"] }

increments counts for the named  array elements, and  prints them at  the
end of the input.
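
Two sketches on invented input follow.  The first is one hypothetical way
to fill in the END action of the line-saving program: it prints the saved
lines in reverse order, using the for statement described in 3.7.  The
second runs the counting program as given.

```shell
printf 'one\ntwo\nthree\n' | awk '
      { x[NR] = $0 }                          # save every line
END   { for (i = NR; i >= 1; i--) print x[i] }
'
# prints:
#   three
#   two
#   one

printf 'apple\norange\napple\n' | awk '
/apple/    { x["apple"]++ }
/orange/   { x["orange"]++ }
END        { print x["apple"], x["orange"] }
'
# prints: 2 1
```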


3.6      BUILT-IN FUNCTIONS

Awk provides a "length"  function to compute  the length of  a string  of
characters.  This program prints each record, preceded by its length:

          { print length, $0 }

length by itself is a  "pseudo-variable" which yields  the length of  the
current record; length(argument) is a function which yields the length of
its argument, as in the equivalent

          { print length($0), $0 }

The argument may be any expression.

Awk also provides the  arithmetic functions sqrt,  for square root;  log,
for base e logarithm; and exp and int for exponential and integer part of
their respective arguments.

The name  of  one  of  these  built-in  functions,  without  argument  or
parentheses, stands for the  value of the  function on the whole  record.
The program

          { print $0, sqrt }

prints each record and  its square root  (or 0 if  there are  non-numeric
characters in the record or more than one field).

The function substr(s, m, n) produces the  substring of s that begins  at
position m (origin 1) and is at most n characters long.  If n is omitted,
the substring goes to the end of s.

The function  index(s1, s2) returns  the  position where  the  string  s2
occurs in s1, or zero if it does not.

The function sprintf(f, e1, e2, ...)  produces the value  of the  expres-
sions e1, e2, etc., in the printf format specified by f.  Thus, for exam-
ple,

          { x = sprintf("%8.2f %10d", $1, $2) }

sets x to the string produced by formatting the values of $1 and $2.

Each input line is split into  fields automatically as necessary.  It  is
also possible to split any variable or string into fields:

          { n = split(s, array, sep) }

splits the string s into array[1], ..., array[n].  The number of elements
found is returned.  If the sep argument is provided, it is used as
the field separator; otherwise FS is used as the separator.
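
The functions of this section, called with explicit arguments, can be
sketched in a single BEGIN action (see 2.5) that needs no input.  The
strings and the names s, part and x are invented for the illustration.

```shell
awk 'BEGIN {
    s = "hello world"
    print length(s)               # 11
    print substr(s, 7)            # world
    print substr(s, 1, 5)         # hello
    print index(s, "wor")         # 7
    print index(s, "xyz")         # 0
    print int(3.9), sqrt(16)      # 3 4
    n = split("a:b:c", part, ":")
    print n, part[1], part[3]     # 3 a c
    x = sprintf("%5.1f|%3d", 3.14159, 42)
    print x                       # "  3.1| 42"
}'
```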


3.7      FLOW-OF-CONTROL STATEMENTS

Awk provides the basic flow-of-control statements if-else, while and for,
and statement grouping with braces, as in C.  We showed  the if statement
in section 3.3 without describing it.

          if ( condition ) {
                statements
          } else {
                statements
          }

The condition in parentheses is evaluated;  if it is true, the  statement
following the if is done.  The else part is optional.

The while statement is exactly like that of C.  For example, to print all
input fields one per line,

          i = 1
          while (i <= NF) {
                print $i
                ++i
          }


The for statement is also exactly that of C:

          for (i = 1; i <= NF; i++)
                print $i

does the same job as the while statement above.
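
A sketch of the for loop and of if-else on invented input.  The threshold
of 1000 echoes the example in 3.3.

```shell
printf 'a b c\n' | awk '{ for (i = 1; i <= NF; i++) print $i }'
# prints:
#   a
#   b
#   c

printf '5\n1500\n' | awk '{
    if ($1 > 1000) {
        print $1, "too big"
    } else {
        print $1, "ok"
    }
}'
# prints:
#   5 ok
#   1500 too big
```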

There is  an alternate  form of  the for  statement which  is suited  for
accessing the elements of an associative array:

          for (i in array)
                statement

does statement with i set in turn to each subscript of array.  The elements
are accessed in  an apparently random  order.  Chaos will  ensue if i  is
altered, or if any new elements are accessed during the loop.
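
Because the order of traversal is not predictable, this sketch pipes the
output through sort, a separate command assumed to be available.  The
array name count and the input are invented.

```shell
printf 'apple\norange\napple\n' |
    awk '{ count[$1]++ } END { for (name in count) print name, count[name] }' |
    sort
# after sorting, prints:
#   apple 2
#   orange 1
```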

The expression in the condition part of an  if, while or for can  include
relational operators like <, <=,  >, >=, == ("is equal to") and !=  ("not
equal to"); regular expression matches with the match operators ~ and !~;
the logical operators ||, && and !; and of course  parentheses for group-
ing.

The break statement causes an immediate  exit from an enclosing while  or
for; the continue statement causes the next iteration to begin.

The statement next causes awk to skip immediately to the next record  and
begin scanning the patterns from the top.  The statement exit  causes the
program to behave as if the end of the input had occurred.
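
A small sketch of next and exit on invented input: line 2 is skipped,
and reading stops when line 4 is reached, so only lines 1 and 3 are
printed.

```shell
printf '1\n2\n3\n4\n5\n' | awk '
$1 == 2 { next }     # skip this record; later patterns never see it
$1 == 4 { exit }     # behave as if the input had ended
        { print }
'
# prints:
#   1
#   3
```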




REFERENCES

 [1]  UTS Programmer's Manual, Volume 1.

 [2]  Lex -- A Lexical Analyzer Generator.

 [3]  B. W.  Kernighan and  D. M.  Ritchie, The  C Programming  Language,
      Prentice-Hall, Englewood Cliffs, New Jersey (1978).


APPENDIX A.    AWK GRAMMAR

The following grammar for awk  is patterned after the one  in The C  Pro-
gramming Language.  Words and characters in bold are literal and the sym-
bols number, character, string, variable and null have the meanings  that
one would intuitively expect.

program:
        begin pat_act_stmts end

begin:
        BEGIN { stmt_list }
        begin newline
        null

end:
        END { stmt_list }
        end newline
        null

pat_act_stmts:
        pat_act_stmts pat_act_stmt terminator
        pat_act_stmts pat_act_stmt
        null

pat_act_stmt:
        pattern
        pattern { stmt_list }
        pattern , pattern
        pattern , pattern { stmt_list }
        { stmt_list }

pattern:
        compound_pattern
        regular_expr
        relational_expr
        lexical_expr

compound_pattern:
        pattern || pattern
        pattern && pattern
        ! pattern
        ( compound_pattern )

regular_expr:
        / r /

r:
        character
        .
        ]
        ^
        $
        r | r
        r r
        r *
        r +
        r ?
        ( r )

relational_expr:
        expr relational_op expr
        ( relational_expr )

relational_op:
        !=  <  <=  == >=  >  >>

lexical_expr:
        expr matching_op regular_expr
        ( lexical_expr )

matching_op:
        ~  !~

expr:
        term
        expr term
        variable_expr assignment_op expr

assignment_op:
        +=  -=  *=  /=  %=  =

variable_expr:
        number
        string
        variable
        variable [ expr ]
        field

field:
        $term

term:
        variable_expr
        function
        function ( )
        function ( expr )
        sprintf print_list
        substr ( expr , expr , expr )
        substr ( expr , expr )
        split ( expr , variable , expr )
        split ( expr , variable )
        index ( expr , expr )
        ( expr )
        term + term
        term - term
        term * term
        term / term
        term % term
        - term
        + term
        ++ variable_expr
        -- variable_expr
        variable_expr ++
        variable_expr --

function:
        length
        log
        int
        exp
        sqrt

stmt_list:
        stmt_list stmt
        null

stmt:
        simple_stmt terminator
        if_stmt stmt
        if_stmt stmt else_stmt stmt
        while_stmt stmt
        for_stmt
        next terminator
        exit terminator
        break terminator
        continue terminator
        { stmt_list }

terminator:
        newline
        ;

simple_stmt:
        print print_list redirection expr
        print print_list
        printf print_list redirection expr
        printf print_list
        expr
        null

redirection:
        relational_op
        |

print_list:
        expr
        print_expr_list
        null

print_expr_list:
        expr , expr
        print_expr_list , expr
        ( print_expr_list )

if_stmt:
        if ( conditional ) opt_newline

else_stmt:
        else opt_newline

while_stmt:
        while ( conditional ) opt_newline

for_stmt:
        for ( simple_stmt ; conditional ; simple_stmt )
                opt_newline stmt
        for ( simple_stmt ; ; simple_stmt )
                opt_newline stmt
        for ( variable in variable ) opt_newline stmt

opt_newline:
        newline
        null

conditional:
        expr
        relational_expr
        lexical_expr
        compound_conditional

compound_conditional:
        conditional || conditional
        conditional && conditional
        ! conditional
        ( compound_conditional )
