
                              Writing Tools

                     The Style and Diction Programs




                                                 L. L. Cherry

                                                 W. Vesterman



                                                 Edited for UTS


                            TABLE OF CONTENTS


Abstract

1.    Introduction

2.    Style

2.1      What is a Sentence?
2.2      The Style Output Report

2.2.1       Readability Measures
2.2.2       Sentence Lengths and Types
2.2.3       Sentence Beginnings
2.2.4       Word Usage
2.2.5       Verb Usage

2.3      Accuracy
2.4      Early Use

3.    Diction

3.1      The Diction Data Base
3.2      Early Use

4.    Suggest

4.1      The Suggest Data Base

5.    Deroff

6.    Conclusions

References




                            TABLE OF EXAMPLES


Example 1.    Sample Style Output


                             TABLE OF TABLES


Table 1.    Sentence Identification on 20 Technical Documents

Table 2.    Text Statistics on 20 Technical Documents

Table 3.    Text Statistics on Single Authors


ABSTRACT

Text processing systems are now in heavy use in many companies to  format
documents.  With many documents stored on line, it has become possible to
use computers to study writing itself and to help writers produce  better
written and more readable prose.   The system of programs described  here
is an initial step  toward such help.   It includes programs  and a  data
base designed to produce a  stylistic profile of writing at the word  and
sentence levels.   The system  measures  readability, sentence  and  word
length, sentence types, sentence beginnings, word usage, and verb  usage.
It also locates common examples of  wordy phrasing and bad diction.   The
system is useful for evaluating a document's style, locating sentences
that may be difficult  to read  or excessively wordy,  and determining  a
particular writer's style over several documents.




1.    INTRODUCTION

Computers have become important in the document preparation process, with
programs to check for  spelling errors and  to format documents.  As  the
amount of text stored on line increases, it becomes feasible and  attrac-
tive to study writing style and to attempt to help  the writer in produc-
ing readable documents.  The system of writing tools described here is  a
first step toward  such help.   The system includes  programs and a  data
base to analyze writing style at the word and sentence level.  We use the
term 'style' in this paper to describe the results of a writer's particu-
lar choices among  individual words  and sentence  forms.  Although  many
judgements of style  are subjective, particularly  those of word  choice,
there are some objective measures that experts agree lead to good  style.
Three programs  have been  written  to measure  some of  the  objectively
definable characteristics of writing style and to identify some  commonly
misused or unnecessary phrases.  Although a document that conforms to the
stylistic rules is not guaranteed to  be coherent and readable, one  that
violates all the rules is likely to be difficult or tedious to read.  The
program style calculates  readability, variability  of sentence  lengths,
types and beginnings, word  usage, and verb  usage.  It assumes that  the
sentences are well-formed, i.e. that  each sentence has  a verb and  that
the subject and  verb agree  in number.  The  diction program  identifies
phrases that are either  bad usage or  unnecessarily wordy.  The  suggest
program acts as a thesaurus for the phrases found by diction.


2.    STYLE

The program style reads a  document and prints  a summary of  readability
indices; sentence lengths, types and beginnings; and word and verb usage.
It can locate all sentences in a document that are longer than a given
length, that have a readability index higher than a given number, that
contain a passive verb or a nominalization, or that begin with an
expletive.  It can also print the part of speech of  each word in a docu-
ment.  See the entry for style in [2] for  details on how to produce  the
different reports.

Style is based on the system called  parts, described in [3], that  finds
English word classes or parts of speech.  Parts is a set of programs that
uses a small dictionary (about 350  words) and suffix rules to  partially
assign word classes to English text.  It then uses experimentally derived
rules of word order to assign word classes to all words in the text  with
an accuracy of about 95%.  Because parts uses only a small dictionary and
general rules,  it works  on  text about  any subject,  from  physics  to
psychology.  Measures of style have  been built into the output phase  of
the programs  that  make up  parts.   Some of  the  measures  are  simple
counters of the word classes  found by parts; many are more  complicated.
For example, the verb count  is the total number  of verb phrases.   This
includes phrases like:

     has been going
     was only going
     to go

each of which counts as  one verb.  Example 1 shows  the output of  style
run on this document.  As the example shows, style output  is in six sec-
tions.  After a brief discussion of sentences, we will describe the parts
in order.

 ________________________________________________________________________
| readability:    mean grade level: 11.9                                |
|       Kincaid 10.8,   ARI 11.4,   Coleman-Liau 12.4,   Flesch 13.0    |
| sentence lengths:    257 sentences, 4601 words, 23339 letters         |
|       mean words/sentence: 17.9,     mean letters/word:    5.1        |
|       64 short sentences, <13 words, 24.9%, shortest has 2 words      |
|       23  long sentences, >28 words,  8.9%, longest has 42 words      |
|       1 questions, 3 imperative sentences                             |
|       61.4% non-functional words, mean length:  6.4 letters           |
| sentence types:    out of 257 total sentences                         |
|        129 simple   50.2%,      88 complex    34.2%,                  |
|         22 compound  8.6%,      18 cmpd-cmplx  7.0%                   |
| sentence beginnings:    out of 257 total sentences                    |
|          21 prep  8.2%,     6 verb  2.3%,     0 conjunct  0.0%,       |
|           3 expl  1.2%,    15 adv   5.8%,    27 sub-conj 10.5%,       |
|         185 subj 72.0%:                                               |
|       52 nouns, 21 pronouns, 1 possessives, 29 adjectives, 82 articles|
| word usage:    out of 4601 total words                                |
|        510 prep 11.1%,  176 conj  3.8%,  162 adv  3.5%,  80 nom  1.7%,|
|       1394 noun 30.3%,  144 pron  3.1%,  855 adj 18.6%                |
| verb usage:    out of 496 total verbs                                 |
|       179 to be 36.1%,   64 auxiliary 12.9%,   93 infinitives 18.8%   |
|       82 passive verbs are 20.3% of all noninfinitive verbs           |
|_______________________________________________________________________|

                    Example 1.    Sample Style Output



2.1      WHAT IS A SENTENCE?

Readers of documents  have little  trouble deciding  where the  sentences
end.  People don't even have to stop and think about  uses of the charac-
ter '.'  in phrases  like '1.25',  'A. J.  Jones', 'Ph.D.',  'i. e.',  or
'etc.'.  When a computer reads a document, finding the ends  of sentences
is not as easy.  First we must throw away the printer's marks and format-
ting commands that litter the text in computer form.  Then  style defines
a sentence as a string of words ending in '.', '?', '!', or '\.'.  A sen-
tence ending with '?' is  considered to be a question; a sentence  ending
with '!'  or '\.'  is considered  to be  imperative.  (Note:   '\.'  will
appear as a normal period  when formatted, but is recognized by style  as
marking an  imperative sentence.)   Style properly  handles numbers  with
embedded decimal points and commas,  strings of letters and numbers  with
embedded decimal points  used for  computer file names,  and many  common
abbreviations.  Numbers that end sentences cause a sentence break if  the
next word begins with a capital  letter.  Initials only cause a  sentence
break if the next word begins with a capital and is found in the diction-
ary of function words used by parts.  So the string 'J. D. Jones' does
not cause a sentence break.  Most sentences are broken at the proper
place, although occasionally either two sentences are called one or a
fragment is called a sentence.  (See section 5 for more details.)
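These boundary rules can be sketched in a few lines of Python.  The
fragment below is an illustrative approximation, not the parts
implementation; its abbreviation and function-word lists are small
stand-ins for style's much larger tables.

```python
import re

# Stand-in tables; style's real dictionaries are larger.
ABBREVS = {"i.e.", "e.g.", "etc.", "Ph.D.", "Mr.", "Dr."}
FUNCTION_WORDS = {"The", "A", "An", "In", "On", "It", "This", "These"}

def split_sentences(text):
    """Break text on '.', '?', '!' while skipping decimal points,
    known abbreviations, and initials like 'J. D. Jones'."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not tok.endswith((".", "?", "!")):
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in ABBREVS:
            continue                 # common abbreviation: no break
        if re.fullmatch(r"[A-Z]\.", tok) and nxt not in FUNCTION_WORDS:
            continue                 # initial before a name: no break
        if re.fullmatch(r"[\d.,]*\d\.", tok) and nxt and not nxt[:1].isupper():
            continue                 # number: break only before a capital
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

As in the accuracy discussion below, unknown abbreviations (say, 'Fig.')
would still cause a spurious break in this sketch.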


2.2      THE STYLE OUTPUT REPORT


2.2.1       READABILITY MEASURES

The first section of style  output consists of four readability  indices.
As Klare points  out in  [10], readability indices  estimate the  reading
skills needed by the  reader to understand  a document.  The  readability
indices reported  by style  are based  on measures of  sentence and  word
lengths.  Although the indices  may not measure  whether the document  is
coherent and well organized, experience has shown that high indices  seem
to suggest  stylistic difficulty.   Documents  with short  sentences  and
short words have low scores; those with long sentences and  many polysyl-
labic words have high scores.  The four formulae reported are the Kincaid
Formula [9], the Automated Readability Index [12], the Coleman-Liau  For-
mula [6], and a normalized version of the Flesch Reading Ease Score  [8].
The formulae differ because  they were experimentally derived using  dif-
ferent texts and subject groups.   We will discuss  each of the  formulae
briefly; for a more detailed discussion the reader should see [10].

The Kincaid Formula, given by:

     Reading_Grade = 11.8*syl_per_word + 0.39*words_per_sent - 15.59

was based on Navy training manuals that ranged in difficulty from 5.5  to
16.3 in reading grade level.  The score reported by this formula tends to
be in the  midrange of the  four scores.   Because it is  based on  adult
training manuals rather than school  book text, this formula is  probably
the best one to apply to technical documents.

The Automated Readability Index (ARI), based on text from grades 0 to  7,
was derived to be easy to automate.  The formula is:

     Reading_Grade = 4.71*let_per_word + 0.5*words_per_sent - 21.43

ARI tends to produce scores that are higher than Kincaid and Coleman-Liau
but are usually slightly lower than Flesch.

The Coleman-Liau Formula, based on text ranging in difficulty from 0.4 to
16.3, is:

     Reading_Grade = 5.89*let_per_word - 0.3*sent_per_100_words - 15.8

Of the four formulae this one usually gives the lowest grade when applied
to technical documents.

The last formula, the Flesch Reading Ease Score, is based on grade school
text covering grades 3 to 12.  The formula, given by:

     Reading_Ease = 206.835 - 84.6*syl_per_word - 1.015*words_per_sent

is usually reported in the range 0 (difficult) to 100 (easy).  The  score
reported by  style is  scaled  to be  comparable to  the other  formulas,
except that the maximum grade  level reported is set  to 17.  The  Flesch
score is usually the highest of the four scores on technical documents.
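All four formulas depend only on counts of sentences, words, letters, and
syllables, so they are easy to compute.  The sketch below (the function
name is ours) applies the formulas as given above, leaving the Flesch
score on its native 0-100 scale rather than rescaling it to a grade level
as style does.

```python
def readability(words, sentences, letters, syllables):
    """Compute the four indices from raw counts."""
    wps = words / sentences          # words per sentence
    lpw = letters / words            # letters per word
    spw = syllables / words          # syllables per word

    kincaid = 11.8 * spw + 0.39 * wps - 15.59
    ari = 4.71 * lpw + 0.5 * wps - 21.43
    coleman_liau = 5.89 * lpw - 0.3 * (100.0 * sentences / words) - 15.8
    flesch_ease = 206.835 - 84.6 * spw - 1.015 * wps
    return kincaid, ari, coleman_liau, flesch_ease
```

With the counts from Example 1 (257 sentences, 4601 words, 23339
letters), the two letter-based formulas give ARI 11.4 and Coleman-Liau
12.4, matching the sample output.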

Coke [4] found that the  Kincaid formula is  probably the best  predictor
for technical documents;  both ARI  and Flesch tend  to overestimate  the
difficulty; Coleman-Liau tends to underestimate.  On text between  grades
7 and 9 the four  formulas tend to be about  the same.  On easy text  the
Coleman-Liau formula is probably preferred  since it is reasonably  accu-
rate at the lower grades and it is safer to present text that is a little
too easy than a little too hard.

If a document has particularly difficult technical content, especially if
it includes much mathematics, it  is probably best to make the text  easy
to read, i.e. a lower readability  index by shortening the sentences  and
words.  This will allow the  reader to concentrate on the technical  con-
tent and not the  long sentences.   The user should  remember that  these
indices are estimators;  they should  not be taken  as absolute  numbers.
Style called with '-r n' will print all sentences with an Automated  Rea-
dability Index equal to or greater than n.


2.2.2       SENTENCE LENGTHS AND TYPES

The next two sections of style output deal with sentence lengths and
types.  Almost all books on writing style or effective writing  emphasize
the importance of variety in sentence length and structure for good writ-
ing.  Ewing's  first rule  in discussing  style in the  book Writing  for
Results [7] is:

     Vary the sentence structure and length of your sentences.

Leggett, Mead, and Charvat break this rule into three in the Prentice-Hall
Handbook for Writers [11] as follows:

     34a.  Avoid the overuse of short simple sentences.
     34b.  Avoid the overuse of long compound sentences.
     34c.  Use various sentence structures to avoid monotony
           and increase effectiveness.

Although experts agree that these rules are important, not all writers
follow them.  Sample technical documents have been found with almost no
sentence length or type variability.   One document had  90% of its  sen-
tences about the same length  as the mean; another was found to have  80%
simple sentences.

The output sections labeled 'sentence lengths' and 'sentence types'  give
both length and structure measures.  Style reports on the number and mean
length of  both sentences  and words,  and the  number of  questions  and
imperative sentences.  The measures  of nonfunction words are an  attempt
to look at the content  words in the  document.  In English,  nonfunction
words are nouns,  adjectives, adverbs,  and nonauxiliary verbs;  function
words are  prepositions,  conjunctions, articles,  and  auxiliary  verbs.
Since most function words are short, they tend to lower  the average word
length.  The average length of  non-function words may  be a more  useful
measure for comparing  word choice  of different writers  than the  total
average word length.  The percentages of short and long sentences measure
sentence length variability.  Short sentences are those at least 5  words
less than the average; long  sentences are those at  least 10 words  more
than the average.  Last in the sentence information section is the length
and location of the longest and  shortest sentences.  If the flag  '-s n'
is used, style will print all sentences longer than n.
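The length-variability measures reduce to simple arithmetic over
per-sentence word counts; a minimal sketch, with names of our own
choosing:

```python
def length_report(sentence_lengths):
    """Count short sentences (at least 5 words below the mean) and
    long sentences (at least 10 words above the mean)."""
    mean = sum(sentence_lengths) / len(sentence_lengths)
    short = sum(1 for n in sentence_lengths if n <= mean - 5)
    long_ = sum(1 for n in sentence_lengths if n >= mean + 10)
    return mean, short, long_
```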

Because of the difficulties in dealing with  the many uses of commas  and
conjunctions in  English, sentence  type definitions  vary slightly  from
those of standard textbooks,  but still measure  the same  constructional
activity.

 1.  A simple sentence has one verb and no dependent clause.

 2.  A complex  sentence has  one independent  clause and  one  dependent
     clause, each with one verb.  Complex sentences are found by  identi-
     fying sentences that contain either  a subordinate conjunction or  a
     clause beginning  with words  like 'that' or  'who'.  The  preceding
     sentence has such a clause.

 3.  A compound sentence has more than one verb and no dependent  clause.
     Sentences joined by ';' are also counted as compound.

 4.  A compound-complex sentence has either several dependent clauses  or
     one dependent clause and a compound verb in either the  dependent or
     independent clause.

Even using these broader definitions,  simple sentences dominate many  of
the technical documents that  have been tested,  but Example 1 shows  the
variety in both sentence types and sentence lengths.
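Given a verb count and a dependent-clause count for a sentence (however
obtained), the four definitions reduce to a small decision table.  A
sketch, assuming some upstream counter supplies the two numbers:

```python
def sentence_type(n_verbs, n_dependent_clauses):
    """Map the two counts onto the four sentence types defined above."""
    if n_dependent_clauses == 0:
        return "simple" if n_verbs == 1 else "compound"
    if n_dependent_clauses == 1 and n_verbs == 2:
        return "complex"          # one independent + one dependent clause
    return "compound-complex"     # several dependent clauses, or
                                  # a dependent clause plus a compound verb
```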


2.2.3       SENTENCE BEGINNINGS

Another accepted principle of  style is variety  in sentence  beginnings.
Because style determines the type of sentence beginning by looking at the
part of speech of the first word  in the sentence, the sentences  counted
under the heading 'subject opener' may not all really begin with the sub-
ject.  However, a large  percentage of sentences  in this category  still
shows lack  of variety  in  sentence beginnings.   Other sentence  opener
measures help the user  determine if there  are transitions between  sen-
tences and where the  subordination occurs.  Adverbs and conjunctions  at
the beginning of sentences  are mechanisms for  transitions between  sen-
tences.  A pronoun at the beginning shows a link to  something previously
mentioned and suggests connectivity.

The location of subordination can  be determined by comparing the  number
of sentences that begin with a subordinator with the number  of sentences
with complex clauses.  If few  sentences start with subordinate  conjunc-
tions then the  subordination is embedded  or at the  end of the  complex
sentences.  For variety the writer  may want to transform some  sentences
to have leading subordination.

The last category of  beginnings, expletives, is  commonly overworked  in
technical writing.  Expletives  are the words  'it' and 'there',  usually
with the verb 'to  be', in clauses  where the subject  follows the  verb.
For example:

     It is often necessary to continue a line on the next line.
     There were too many users on the system.

This phrasing tends to emphasize  the object rather  than the subject  of
the sentence.  The flag '-e' will cause style to print all sentences that
begin with an expletive.
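A surface check for expletive openers needs only the first two words of
the sentence; the sketch below is a simplified approximation of what the
'-e' flag reports, not style's actual test.

```python
def begins_with_expletive(sentence):
    """Flag sentences opening with 'It' or 'There' followed by a
    form of 'to be'."""
    words = sentence.lower().split()
    return (len(words) >= 2
            and words[0] in ("it", "there")
            and words[1] in ("is", "was", "are", "were"))
```

Being purely surface-based, the check also flags sentences like 'It was
raining.', where 'it' is not a true expletive.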


2.2.4       WORD USAGE

The word usage measures are an  attempt to identify some other  construc-
tional features  of writing  style.   There are  many different  ways  in
English to say the same thing.  The constructions differ from one another
in the words  used.  The  following sentences all  convey about the  same
meaning but differ in word usage:

     My program is used to perform all communication between the systems.
     My program performs all communication between the systems.
     My program is used to communicate between the systems.
     My program communicates between the systems.
     All communication between the systems is performed by my program.

The distribution of the parts of speech and verb constructions helps
identify overuse of particular constructions.  Although the measures used
by style are crude, they do point out problem areas.  For each  category,
style reports a percentage  and a raw  count.  In addition to looking  at
the percentage, the user may find it useful to compare the raw count with
the number of sentences.  If, for example, the number of infinitives (see
section 2.2.5) is almost equal to the number of sentences, then many of the
sentences in the document  are arranged like  the first and third in  the
previous example.  The user may want to transform some of these sentences
into another form.  Some of  the implications of the word usage  measures
are discussed below.

Pronouns
     add cohesiveness and connectivity to  a document by providing  back-
     reference.  They are often a shorthand notation for something previ-
     ously mentioned, and therefore connect  the sentence containing  the
     pronoun with the word to  which the pronoun refers.  Although  there
     are other mechanisms for such connections, documents with few pronouns
     (under 1%) tend to be wordy and to have little connectivity.

Adverbs
     can provide  transition between  sentences and  order in  space  and
     time.  In these functions,  adverbs, like pronouns, provide  connec-
     tivity and cohesiveness.

Conjunctions
     provide parallelism in a  document by connecting  two or more  equal
     units.  These  units may  be whole sentences,  verb phrases,  nouns,
     adjectives, or prepositional  phrases.  The  compound and  compound-
     complex sentences reported under  sentence type are parallel  struc-
     tures.  Other uses of  parallel structures are  shown by the  degree
     that the number  of conjunctions reported  under word usage  exceeds
     the compound sentence measures.

Nouns and adjectives
     Having the number of adjectives nearly the same as (or greater than)
     the number  of nouns  may suggest  the overuse  of modifiers.   Some
     technical writers qualify every  noun with one  or more  adjectives.
     Qualifiers in phrases like 'simple linear single-link network model'
     often lend more obscurity than precision to a text.

Nominalizations
     are verbs that are changed  to nouns by adding  one of the  suffixes
     'ment', 'ance',  'ence', or 'ion'.   Examples are  'accomplishment',
     'admittance',  'adherence',  and  'abbreviation'.   When  a   writer
     transforms a nominalized sentence into one that is not  nominalized,
     the effectiveness of the sentence is increased in several ways.  The
     noun becomes an  active verb and  frequently one complicated  clause
     becomes two shorter clauses.  For example:


          Their inclusion of this provision is admission of
          the importance of the system.

          When they included this provision, they admitted
          the importance of the system.

     Coleman found that the transformed  sentences were easier to  under-
     stand, even  when the  transformation produced  sentences that  were
     slightly longer, provided the transformation  broke one clause  into
     two.  Writers who find their document contains many nominalizations
     (over 5%) may want to transform some of the sentences to use active
     verbs.  The flag '-n' causes style to print all sentences containing
     nominalizations.
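The suffix test described above can be written directly; as the comment
notes, a realistic version also needs length and part-of-speech guards,
which the length check below only crudely approximates.

```python
SUFFIXES = ("ment", "ance", "ence", "ion")

def is_nominalization(word):
    """Suffix test from the text.  The length guard screens out short
    words like 'lion' and 'cement' that merely end in a suffix."""
    w = word.lower()
    return len(w) > 6 and w.endswith(SUFFIXES)
```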


2.2.5       VERB USAGE

Verbs are measured in several ways to try to determine what types of verb
constructions are most frequent in the document.  Technical writing tends
to contain many passive verb constructions and other uses of the verb 'to
be'.  The category of  verbs labeled 'to  be' includes both passives  and
sentences of the form:

     subject to-be predicate

In counting verbs,  whole verb  phrases are  counted as  one verb.   Verb
phrases  containing  auxiliary  verbs  are  counted  in  the  'auxiliary'
category.  The verb phrases  counted here  are those whose  tense is  not
simple present or simple past.  It might eventually be useful  to do more
detailed measures of verb tense or mood.  There is a category for 'infin-
itives' also.  The  percentages reported for  these three categories  are
based on the total number  of verb phrases  found.  These categories  are
not mutually exclusive; they cannot be added, since, for example,  'to be
going' counts as both 'to be' and 'infinitive'.  Use of these three types
of verb constructions varies significantly among authors.

Style reports passive verbs as a percentage of the noninfinitive verbs in
the document.   Most style  books  warn against  the overuse  of  passive
verbs.  Coleman [5] has shown that sentences with active verbs are easier
to  learn  than  those   with  passive  verbs.   Although  the   inverted
object/subject order of the passive voice seems to emphasize the  object,
Coleman's experiments showed that there is little difference in retention
by word position.  He  also showed that  the direct object  of an  active
verb is retained better than the subject of a passive verb.  These exper-
iments support  the advice  of the  style books  suggesting that  writers
should try to use active  verbs wherever possible.  The flag '-p'  causes
style to print all sentences containing passive verbs.
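A very rough passive detector can look for a form of 'to be' followed by
a regular '-ed' participle.  The sketch below is only a per-sentence
flag, like the '-p' listing; style's reported figure is instead a
percentage of noninfinitive verb phrases, and this surface check misses
irregular participles such as 'written'.

```python
TO_BE = {"is", "are", "was", "were", "be", "been", "being"}

def passive_percentage(sentences):
    """Percentage of sentences with a 'to be' + '-ed' word pattern."""
    def looks_passive(sentence):
        words = sentence.lower().rstrip(".?!").split()
        return any(a in TO_BE and b.endswith("ed")
                   for a, b in zip(words, words[1:]))
    hits = sum(looks_passive(s) for s in sentences)
    return 100.0 * hits / len(sentences)
```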


2.3      ACCURACY

Sentence Identification
The correctness of the style output on the 20 document sample was checked
in detail.  Style misidentified  129 sentence fragments as sentences  and
incorrectly joined two or more  sentences 75 times  in the 3287  sentence
sample.  The problems were usually caused by nonstandard formatting
commands, unknown abbreviations, or lists of nonsentences.  An impossibly
long sentence found as  the longest sentence  in the document usually  is
the result of a long list of nonsentences.

Sentence Types
Style correctly identified sentence type on 86.5% of the sentences in the
sample.  The type distribution of  the sentences was 52.5% simple,  29.9%
complex, 8.5% compound and 9.1%  compound-complex.  The program  reported
49.5% simple, 31.9%  complex, 8.0%  compound and 10.4%  compound-complex.
Looking at the errors on the  individual documents, the number of  simple
sentences was underreported  by about  4% and the  complex and  compound-
complex were  overreported by  3% and  2%, respectively.   The  following
table compares the sentence types  reported by the program with the  real
sentence types:


      Table 1.    Sentence Identification on 20 Technical Documents

                                     Reported Sentence Types

                            simple   complex   compound   cmpd-cmplx

      Real     simple        1566      132         49          17
    Sentence   complex         47      892          6          65
      Types    compound        40        6        207          23
               cmpd-cmplx       0       52          5         249


Word Usage
The accuracy of identifying word types  reflects that of parts, which  is
about 95% correct.  The largest source of confusion is between  nouns and
adjectives.  The verb counts were checked on about 20 sentences from each
document and found to be about 98% correct.


2.4      EARLY USE

To get base-line statistics and check style's accuracy, it was run on  20
technical documents.  There were a total of 3287 sentences in the sample.
The shortest document had  67 sentences; the  longest had 339  sentences.
The documents covered a wide range of subject matter, including theoreti-
cal computing, physics, psychology, engineering, and affirmative  action.

Table 2 gives the minimum, maximum, mean, and standard deviation of the
style measures.  As the table shows, most of the measurements have a wide
range of values across the sample documents.

          Table 2.    Text Statistics on 20 Technical Documents

                   variable        minimum   maximum   mean    std dev
   -------------------------------------------------------------------
   read-      Kincaid                9.5      16.9     13.3      2.2
   ability    A. R. I.               9.0      17.4     13.3      2.2
              Coleman-Liau          10.0      16.0     12.7      1.8
              Flesch                 8.9      17.0     14.4      2.2
   -------------------------------------------------------------------
   sentence   mean sent length      15.5      30.3     21.6      4.0
   lengths    mean word length       4.61      5.63     5.08     0.29
              short sentences       23%       46%      33%       5.9
              long sentences         7%       20%      14%       2.9
              mean nonfunc length    5.72      7.30     6.52     0.45
   -------------------------------------------------------------------
   sentence   simple                31%       71%      49%      11.4
   types      complex               19%       50%      33%       8.3
              compound               2%       14%       7%       3.3
              compound-complex       2%       19%      10%       4.8
   -------------------------------------------------------------------
   sentence   prepositions           6%       19%      12%       3.4
   begin-     expletives             0%        6%       2%       1.7
   nings      verbs                  0%        4%       1%       1.0
              adverbs                0%       20%       9%       4.6
              conjunctions           0%        4%       0%       1.5
              subord conjunct's      1%       12%       5%       2.7
              subjects              56%       85%      70%       8.0
   -------------------------------------------------------------------
   word       prepositions          10.1%     15.0%    12.3%     1.6
   usage      nouns                 23.6%     31.6%    27.8%     1.7
              conjunctions           1.8%      4.8%     3.4%     0.9
              pronouns               1.2%      8.4%     2.5%     1.1
              adverbs                1.2%      5.0%     3.4%     1.0
              adjectives            15.4%     27.1%    21.1%     3.4
              nominalizations        2.0%      5.0%     3.3%     0.8
   -------------------------------------------------------------------
   verb       to be                 26%       64%      44.7%    10.3
   usage      auxiliary             10%       40%      21.0%     8.7
              infinitive             8%       24%      15.1%     4.8
              passive               12%       50%      29.0%     9.3


As a  comparison, Table  3 gives  the median  results for  two  different
technical authors, a sample  of instructional material,  and a sample  of
the Federalist Papers.   The two  authors show  similar styles,  although
author 2 uses somewhat shorter sentences and longer words than  author 1.
Author 1 uses all types of sentences,  while author 2 prefers simple  and
complex sentences, using few compound or compound-complex sentences.  The
other major difference in the styles of these authors is the location  of
subordination.  Author 1 seems to prefer embedded or trailing  subordina-
tion, while author 2 begins  many sentences with the subordinate  clause.

Both authors' documents were written for a technical audience.  The
instructional documents, which are
written for craftspeople, vary surprisingly little from the two technical
samples.  The sentences and words are  a little longer, and they  contain
many passive and auxiliary  verbs, few adverbs,  and almost no  pronouns.
The instructional documents contain many  imperative sentences, so  there
are many sentences with  verb beginnings.  The  sample of the  Federalist
Papers contrasts with the other samples in almost every way.

              Table 3.    Text Statistics on Single Authors

                variable        author 1   author 2   instruct   Federal.
-------------------------------------------------------------------------
read-      Kincaid               11.0       10.3       10.8       16.3
ability    A. R. I.              11.0       10.3       11.9       17.8
           Coleman-Liau           9.3       10.1       10.2       12.3
           Flesch                10.3       10.7       10.1       15.0
-------------------------------------------------------------------------
sentence   mean sent length      22.64      19.61      22.78      31.85
lengths    mean word length       4.47       4.66       4.65       4.95
           short sentences       35%        43%        35%        40%
           long sentences        18%        15%        16%        21%
           mean nonfunc length    5.64       5.92       6.04       6.87
-------------------------------------------------------------------------
sentence   simple                36%        43%        40%        31%
types      complex               34%        41%        37%        34%
           compound              13%         7%         4%        10%
           compound-complex      16%         8%        14%        25%
-------------------------------------------------------------------------
sentence   prepositions          11%        14%         6%         5%
begin-     expletives             3%         3%         0%         3%
nings      verbs                  3%         2%        14%         2%
           adverbs                9%         9%         6%         4%
           conjunctions           1%         0%         0%         3%
           subord conjunct's      8%        14%        11%         3%
           subjects              65%        59%        54%        66%
-------------------------------------------------------------------------
word       prepositions          10.0%      10.8%      12.3%      15.9%
usage      nouns                 27.7%      26.5%      29.1%      24.9%
           conjunctions           3.2%       2.4%       3.9%       3.4%
           pronouns               5.3%       4.3%       2.1%       6.5%
           adverbs                5.0%       4.6%       3.5%       3.7%
           adjectives            17.0%      19.0%      15.4%      12.4%
           nominalizations        1.0%       2.0%       2.0%       3.0%
-------------------------------------------------------------------------
verb       to be                 42%        43%        45%        37%
usage      auxiliary             17%        19%        32%        32%
           infinitive            17%        15%        12%        21%
           passive               20%        19%        36%        20%

-------------------------------- Page 14 --------------------------------

3.    DICTION

The diction program prints all sentences in a document containing phrases
that are either frequently misused or suggest wordiness.  The program  is
an extension of Aho's fgrep [1], [2] string matching program.  Fgrep takes
as input a file of patterns to be  matched and a file of text to be  sear-
ched, and prints each line that contains any of the patterns, without in-
dicating which pattern was matched.  The following changes have been made
to produce diction:

 1.  The basic unit operated on is the sentence rather than the line.
 2.  Upper-case letters are changed to lower-case.
 3.  Punctuation characters are changed to blanks.
 4.  All matches in the sentence are found and surrounded by '[' and ']'.
 5.  A method for suppressing a string match has been added (see Section 3.1).
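A minimal sketch of this per-sentence processing (hypothetical helper in Python; the real program is built on fgrep's Aho-Corasick machinery and is far more efficient):

```python
import re

def diction(sentence, phrases):
    """Sketch of diction's per-sentence processing: lower-case the text,
    blank out punctuation, then bracket every phrase match."""
    # Changes 2 and 3: map to lower-case, replace punctuation with blanks.
    text = re.sub(r"[^\w\s]", " ", sentence.lower())
    # Change 4: bracket the longest phrase matching at each position.
    out, i = [], 0
    while i < len(text):
        hit = max((p for p in phrases if text.startswith(p, i)),
                  key=len, default=None)
        if hit:
            out.append("[" + hit + "]")
            i += len(hit)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Change 1 (operating on sentences rather than lines) is assumed to have happened before this function is called.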

A data base of  over 500 phrases  has been compiled  as a default  phrase
file for diction.  Before attempting to locate phrases, the program  maps
upper-case letters to lower-case and substitutes blanks for  punctuation.
Sentence boundaries were deemed less  critical in diction than in  style,
so abbreviations and other uses of the character '.' are not treated spe-
cially.  Diction  brackets all  phrases matched  in a  sentence with  the
characters '[' and ']'.  Although many of the phrases in the default data
base are correct in some contexts, in other contexts they are  incorrect
or merely wordy.  Some examples of commonly found phrases and  suggested
alternatives are:


                          Phrase           Alternative

                    a great deal of         much
                    as a consequence of     because
                    at this time            now
                    for the purpose of      for, to
                    in conjunction with     with
                    in many cases           often
                    in the form of          as
                    is used to              OMIT
                    one of the              one, a
                    through the use of      by, with


Some of the entries are short forms of problem phrases.  For example, the
phrase 'the fact'  is found in  each of  the following and  is enough  to
point out the wordiness to the user:

-------------------------------- Page 15 --------------------------------

                  Phrase                           Alternative

   accounted for by the fact that          due to, caused by, because
   an example of this is the fact that     thus
   based on the fact that                  because
   despite the fact that                   although, though
   due to the fact that                    because
   in the light of the fact that           because
   in view of the fact that                since, as, because
   notwithstanding the fact that           although
   on account of the fact that             because
   the fact is                             AVOID


The  user  may   supply  an   additional  phrase  file   with  the   flag
'-f phrase.file'.  In that case the default file is loaded first, follow-
ed by the user file.  This mechanism allows users to suppress phrases con-
tained in the default  file or to include  their own pet peeves that  are
not in the default  file.  The flag  '-n' excludes  the default  file al-
together.


3.1      THE DICTION DATA BASE

The default phrase file for  diction is in  '/usr/lib/diction/diction.d'.
When creating a phrase file,  a blank should usually precede each  phrase
to avoid matching substrings in words.  Occasionally a trailing blank  is
also desired.  For example,  to find all  occurrences of the word  'the',
the phrase ' the ' should be used.  The blanks cause only the word  'the'
to be matched and  not the string  'the' in words like 'there',  'other',
and 'lathe'.  One side effect  of surrounding words  with blanks is  that
when two phrases occur without intervening words, only the first  will be
matched.
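The effect of the surrounding blanks can be illustrated with a toy substring search (plain Python; not the fgrep machinery diction actually uses):

```python
def occurrences(text, phrase):
    """Start indices of every occurrence of phrase in text."""
    return [i for i in range(len(text)) if text.startswith(phrase, i)]

# Diction sees sentences with punctuation replaced by blanks, so padding
# a phrase with blanks restricts it to whole-word matches.
text = " there is a lathe by the other door "
assert len(occurrences(text, "the")) == 4    # there, lathe, the, other
assert len(occurrences(text, " the ")) == 1  # only the word 'the'
```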

If necessary, some  matches may  be suppressed.  Any  phrase that  begins
with '~' will not be  matched.  Because the matching algorithm finds  the
longest substring,  the suppression  of a  match allows  phrases in  some
correct contexts not  to be  matched while allowing  the phrase in  other
contexts to be found.  For example,  the word 'which' is usually  correct
when preceded by a preposition or  comma.  The default phrase file  sup-
presses matches of 'which' preceded by a common preposition or by a dou-
ble blank, and therefore flags only the suspect uses.  The double  blank
accounts for a comma that has been replaced by a blank.  Similarly,  the
entries ' rather'
and '~ rather than' cause  the word  'rather' to be  matched except  when
followed by the word 'than'.
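The interaction of longest-match and '~' suppression can be sketched as follows (hypothetical function names; the real program resolves this inside its string matcher):

```python
def load_phrases(lines):
    """Entries beginning with '~' are suppressions, as in diction's file."""
    flagged = {p for p in lines if not p.startswith("~")}
    suppressed = {p[1:] for p in lines if p.startswith("~")}
    return flagged, suppressed

def flag_at(text, i, flagged, suppressed):
    """The longest phrase matching at position i wins; if that longest
    match is a suppressed entry, nothing is flagged here."""
    best = max((p for p in flagged | suppressed if text.startswith(p, i)),
               key=len, default=None)
    return best if best in flagged else None

flagged, suppressed = load_phrases([" rather", "~ rather than"])
text = " i would rather not say  rather than argue "
assert flag_at(text, 8, flagged, suppressed) == " rather"   # flagged
assert flag_at(text, 24, flagged, suppressed) is None       # suppressed
```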

-------------------------------- Page 16 --------------------------------

3.2      EARLY USE

In the first few weeks that diction was available, about 35,000 sentences
were processed with about 5,000  string matches.  The users seem to  make
the suggested changes about 50-75% of the  time.  Almost 200 of the  over
500 phrases in  the default  file were matched.   Although most of  these
phrases are valid and correct  in some contexts,  the 50-75% change  rate
seems to show that the phrases are used much more often than concise dic-
tion warrants.




4.    SUGGEST

The suggest program  is an  interactive thesaurus for  phrases found  and
bracketed by diction, and  gives suggested substitutions for the  phrases
that will improve the diction of the document.  If invoked with an  argu-
ment, suggest will respond with  suggestions for only that phrase.   With
no arguments, suggest repeatedly  prompts for phrases  and makes  sugges-
tions, until a blank line or end-of-file is found.


4.1      THE SUGGEST DATA BASE

The suggest data base is in '/usr/lib/diction/suggest.d'.  It consists of
lines having  the  questionable  phrase and  the  suggested  alternatives
separated  by  ' -> '.   Literal  phrases  appear  in  lower-case,  while
instructions appear in upper-case, for example:

     in the majority of cases -> usually, generally
     which -> USE that IF THE CLAUSE IS RESTRICTIVE
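Parsing this format is straightforward; a sketch (hypothetical function name):

```python
def load_suggestions(lines):
    """Parse suggest-style data base lines of the form 'phrase -> advice'."""
    table = {}
    for line in lines:
        phrase, _, advice = line.partition(" -> ")
        table[phrase.strip()] = advice.strip()
    return table

db = load_suggestions([
    "in the majority of cases -> usually, generally",
    "which -> USE that IF THE CLAUSE IS RESTRICTIVE",
])
assert db["in the majority of cases"] == "usually, generally"
```

Suggest's interactive mode then amounts to looking each prompted phrase up in such a table until a blank line or end-of-file is read.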




5.    DEROFF

The formatting commands embedded in  the text increase the difficulty  of
finding sentences.  Not all text in a document is in sentence form; there
are headings, tables, equations  and lists, for  example.  Headings  like
'The Suggest Data Base'  above should be  discarded, not attached to  the
next sentence.  However, many documents contain formatting commands that
change fonts, and these commands usually operate on the most  important
words in the document, so discarding all formatting commands is  not

-------------------------------- Page 17 --------------------------------

correct.  To  allow  programs to  find  sentence boundaries  better,  the
deformatting program, deroff [2],  has been given  some knowledge  of
the formatting packages used  on the UTS  operating system.  Deroff  will
now do the following:

 1.  Suppress formatting  commands that  are used  for titles,  headings,
     author's names, etc., including their argument lists.

 2.  Suppress displays, tables, footnotes and text that is centered or in
     no-fill mode.

 3.  Suppress lists if the '-l' flag is provided.

 4.  Replace string and number register references with the place  holder
     'x'.
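A greatly simplified filter in the same spirit (the block delimiters here are hypothetical ms-macro examples; real deroff understands the formatting packages in detail and also removes the text arguments of title and heading requests):

```python
import re

# Displays, tables, and footnotes are bracketed by begin/end requests.
SKIP = {".DS": ".DE", ".TS": ".TE", ".FS": ".FE"}

def deroff(lines):
    out, stop = [], None
    for line in lines:
        req = line.split(None, 1)[0] if line.startswith(".") else None
        if stop:                    # inside a suppressed block (step 2)
            if req == stop:
                stop = None
        elif req in SKIP:
            stop = SKIP[req]
        elif req is not None:       # drop formatting requests (step 1)
            continue
        else:                       # step 4: register references become 'x'
            out.append(re.sub(r"\\n\(..|\\\*\(..", "x", line))
    return out
```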

Both style and diction  call deroff before  they look at  the text.   The
user should supply the  '-l' flag (which is  passed on to deroff) if  the
document contains  many lists  of nonsentences  that should  be  skipped.
This is a separate flag because of the variety of  ways the list commands
are used.  Often,  lists are  sentences that  should be  included in  the
analysis.  The user must determine how lists are used in  the document to
be analyzed.




6.    CONCLUSIONS

A system of writing tools that measure some of the objective characteris-
tics of writing style has  been developed.  The tools are general  enough
that they may be applied to documents on any subject with equal accuracy.
Although the measurements are only of the surface structure of  the text,
they do point out problem areas.  In addition to helping writers  produce
better documents, these programs may  be useful for studying the  writing
process and finding other formulae for measuring readability.

-------------------------------- Page 18 --------------------------------

REFERENCES

 [1]  Aho, A. V. and M. J.  Corasick, "Efficient String Matching: an  Aid
      to Bibliographic Search", Communications of the ACM, June 1975, 18,
      333-340.

 [2]  Amdahl Corporation, UTS Programmer's Manual, Volume 1.

 [3]  Cherry, L. L.,  "PARTS -  A System  for Assigning  Word Classes  to
      English Text", Computing Science  Technical Report #81, 1980,  Bell
      Laboratories, Murray Hill, N. J. 07974.

 [4]  Coke, E. U., private communication.

 [5]  Coleman, E.  B., "Learning  of Prose  Written in  Four  Grammatical
      Transformations", Journal of Applied Psychology, 1965, 49, 332-341.

 [6]  Coleman, M.  and  T.  L.  Liau,  "A  Computer  Readability  Formula
      Designed for Machine Scoring", Journal of Applied Psychology, 1975,
      60, 283-284.

 [7]  Ewing, D. W.,  Writing for  Results, John  Wiley &  Sons, Inc.,  New
      York, N. Y., 1974.

 [8]  Flesch, R.,  "A  New Readability  Yardstick",  Journal  of  Applied
      Psychology, 1948, 32, 221-233.

 [9]  Kincaid, J. P., R. P.  Fishburne, R. L. Rogers  and B. S.  Chissom,
      "Derivation of  new  readability  formulas  (Automated  Readability
      Index, Fog  count,  and  Flesch  Reading  Ease  Formula)  for  Navy
      enlisted personnel", Navy Training  Command Research Branch  Report
      8-75, February 1975.

[10]  Klare, G. R., "Assessing Readability", Reading Research  Quarterly,
      1974-1975, 10, 62-102.

[11]  Leggett, G., C. D. Mead and W. Charvat, Prentice-Hall Handbook  for
      Writers, Seventh  Edition,  Prentice-Hall Inc.,  Englewood  Cliffs,
      N. J., 1978.

[12]  Smith, E.  A. and  P. Kincaid,  "Derivation and  validation of  the
      automated readability  index  for use  with  technical  materials",
      Human Factors, 1970, 12, 457-464.
