It is well-known that part of speech depends on context. The word "table," for example, can be a verb in some contexts (e.g., "he will table the motion") and a noun in others (e.g., "the table is ready"). A program has been written which tags each word in an input sentence with the most likely part of speech. The program produces the following output for the two "table" sentences just mentioned:

    he/PPS will/MD table/VB the/AT motion/NN
    the/AT table/NN is/BEZ ready/JJ

(PPS = subject pronoun; MD = modal; VB = verb (no inflection); AT = article; NN = noun; BEZ = present 3rd sg form of "to be"; JJ = adjective; notation is borrowed from [Francis and Kucera, pp. 6-8])

Part of speech tagging is an important practical problem with potential applications in many areas, including speech synthesis, speech recognition, spelling correction, proofreading, query answering, machine translation and searching large text databases (e.g., patents, newspapers). The author is particularly interested in speech synthesis applications, where it is clear that pronunciation sometimes depends on part of speech. Consider the following three examples where pronunciation depends on part of speech. First, there are words like "wind" where the noun has a different vowel from the verb: the noun "wind" has a short vowel as in "the wind is strong," whereas the verb "wind" has a long vowel as in "don't forget to wind your watch." Secondly, the pronoun "that" is stressed, as in "did you see that?", unlike the complementizer "that," as in "it is a shame that he's leaving." Thirdly, note the difference between "oily fluid" and "transmission fluid": as a general rule, an adjective-noun sequence such as "oily fluid" is typically stressed on the right, whereas a noun-noun sequence such as "transmission fluid" is typically stressed on the left. These are but three of the many constructions which would sound more natural if the synthesizer had access to accurate part of speech information.

Perhaps the most important application of tagging programs is as a tool for future research. A number of large projects such as [COBUILD] have recently been collecting large corpora (10-1000 million words) in order to better describe how language is actually used in practice: "For the first time, a dictionary has been compiled by the thorough examination of a representative group of English texts, spoken and written, running to many millions of words. This means that in addition to all the tools of the conventional dictionary makers... the dictionary is based on hard, measurable evidence." [COBUILD, p. xv] It is likely that there will be more and more research projects collecting larger and larger corpora. A reliable parts program might greatly enhance the value of these corpora to many of these researchers.

The program uses a linear time dynamic programming algorithm to find an assignment of parts of speech to words that optimizes the product of (a) lexical probabilities (the probability of observing part of speech i given word j), and (b) contextual probabilities (the probability of observing part of speech i given k previous parts of speech). Probability estimates were obtained by training on the tagged Brown Corpus [Francis and Kucera], a corpus of approximately 1,000,000 words with part of speech tags assigned laboriously by hand over many years. Program performance is encouraging (95-99% "correct," depending on the definition of "correct"). A small 400-word sample is presented in the Appendix, and is judged to be 99.5% correct. It is surprising that a local "bottom-up" approach can perform so well.
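To make the optimization concrete, here is a minimal sketch of such a trigram-based dynamic programming search in Python. It is not the author's implementation: the table names lexical_prob and contextual_prob, the boundary tag "<s>", and the handling of unseen events are illustrative assumptions, and the paper's own algorithm may differ in detail (for instance, in smoothing and in the direction of the contextual conditioning).

    import math

    def viterbi_tag(words, tags, lexical_prob, contextual_prob):
        """Assign each word the part of speech maximizing the product of
        lexical probabilities P(tag | word) and contextual trigram
        probabilities P(tag | two previous tags)."""
        # Each search state is the pair of the two most recent tags;
        # "<s>" marks the sentence boundary.  best maps a state to the
        # (log score, tag path) of the best partial assignment.
        best = {("<s>", "<s>"): (0.0, [])}
        for word in words:
            new_best = {}
            for (t1, t2), (score, path) in best.items():
                for tag in tags:
                    lex = lexical_prob.get((tag, word), 0.0)
                    ctx = contextual_prob.get((tag, t1, t2), 0.0)
                    if lex == 0.0 or ctx == 0.0:
                        continue  # zero-probability assignment; prune
                    s = score + math.log(lex) + math.log(ctx)
                    key = (t2, tag)  # slide the two-tag window forward
                    if key not in new_best or s > new_best[key][0]:
                        # (a production version would store backpointers
                        # rather than copying paths)
                        new_best[key] = (s, path + [tag])
            best = new_best
        # Return the tag path of the highest-scoring complete state.
        return max(best.values(), key=lambda v: v[0])[1]

Because only the best partial assignment is kept for each pair of preceding tags, the work per word is bounded by a constant that depends only on the size of the tag set, so the search is linear in the length of the sentence, as claimed.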
Most errors are attributable to defects in the lexicon; remarkably few errors are related to the inadequacies of the extremely over-simplified grammar (a trigram model). Apparently, "long distance" dependencies are not very important, at least most of the time. One might have thought that n-gram models weren't adequate for the task, since it is well-known that they are inadequate for determining grammaticality: "We find that no finite-state Markov process that produces symbols with transition from state to state can serve as an English grammar. Furthermore, the particular subclass of such processes that produce n-order statistical approximations to English do not come closer, with increasing n, to matching the output of an English grammar." [Chomsky, p. 113] Chomsky's conclusion was based on the observation that constructions such as:

• if S1, then S2
• either S3, or S4
• the man who said that S5, is arriving today

have long distance dependencies that span across any fixed-length window n. Thus, n-gram models are clearly inadequate for many natural language applications. However, for the tagging application, the n-gram approximation may be acceptable, since long distance dependencies do not seem to be very important.

Statistical n-gram models were quite popular in the 1950s, and have been regaining popularity over the past few years. The IBM speech group is perhaps the strongest advocate of n-gram methods, especially in other applications such as speech recognition. Robert Mercer (private communication, 1982) has experimented with the tagging application, using a restricted corpus (laser patents) and a small vocabulary (1000 words). Another group of researchers working in Lancaster around the same time, Leech, Garside and Atwell, also found n-gram models highly effective; they report 96.7% success in automatically tagging the LOB Corpus, using a bigram model modified with heuristics to cope with more important trigrams. The present work was developed independently of the LOB project.

Many people who have not worked in computational linguistics have a strong intuition that lexical ambiguity is usually not much of a problem. It is commonly believed that most words have just one part of speech, and that the few exceptions such as "table" are easily disambiguated by context in most cases. In contrast, most experts in computational linguistics have found lexical ambiguity to be a major issue; it is said that practically any content word can be used as a noun, verb or adjective,[1] and that local context is not always adequate to disambiguate. Introductory texts are full of ambiguous sentences such as "flying planes can be dangerous," where no amount of syntactic parsing will help. These examples are generally taken to indicate that the parser must allow for multiple possibilities and that grammar formalisms such as LR(k) are inadequate for natural language, since these formalisms cannot cope with ambiguity. This argument was behind a large set of objections to Marcus' "LR(k)-like" deterministic parser. Although it is clear that an expert in computational linguistics can dream up arbitrarily hard sentences, it may be, as Marcus suggested, that most texts are not very hard in practice. Recall that Marcus hypothesized that most decisions can be resolved by the parser within a small window (i.e., three buffer cells), and that there are only a few problematic cases where the parser becomes confused. He called these confusing cases "garden paths," by analogy with the famous example:

• The horse raced past the barn fell.
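For concreteness, the probability tables assumed by the search sketch above might be estimated by simple relative frequency from a hand-tagged corpus such as the Brown Corpus. The corpus representation below (a list of sentences, each a list of (word, tag) pairs) and the absence of smoothing are simplifying assumptions, not details taken from the paper.

    from collections import Counter

    def estimate_probabilities(tagged_sentences):
        """Estimate lexical probabilities P(tag | word) and contextual
        trigram probabilities P(tag | two previous tags) by relative
        frequency.  No smoothing is applied; unseen events get no entry."""
        word_tag = Counter()    # counts of (tag, word)
        words = Counter()       # counts of word
        trigram = Counter()     # counts of (tag, prev1, prev2)
        bigram = Counter()      # counts of (prev1, prev2)
        for sentence in tagged_sentences:
            t1, t2 = "<s>", "<s>"  # boundary tags, matching the search
            for word, tag in sentence:
                word_tag[(tag, word)] += 1
                words[word] += 1
                trigram[(tag, t1, t2)] += 1
                bigram[(t1, t2)] += 1
                t1, t2 = t2, tag
        lexical_prob = {(tag, w): c / words[w]
                        for (tag, w), c in word_tag.items()}
        contextual_prob = {(tag, a, b): c / bigram[(a, b)]
                           for (tag, a, b), c in trigram.items()}
        return lexical_prob, contextual_prob

The two tables plug directly into viterbi_tag above; a real system would also need some way of handling words and trigrams that never occur in the training corpus.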
With just a few exceptions such as these "garden paths," Marcus assumes, there is almost always a unique "best" interpretation which can be found with very limited resources. The proposed stochastic approach is largely compatible with this; the proposed approach assumes that it is almost always sufficient to assign each word a unique "best" part of speech (and this can be accomplished with a very efficient linear time dynamic programming algorithm). After reading introductory discussions of "flying planes can be dangerous," one might have expected that lexical ambiguity was so pervasive that it would be hopeless to try to assign just one part of speech to each word, and in just one linear time pass over the input words.

[1] From an information theory point of view, one can quantify ambiguity in bits. In the case of the tagged Brown Corpus, the lexical entropy, the conditional entropy of the part of speech given the word, is about 0.25 bits per part of speech. This is considerably smaller than the contextual entropy, the conditional entropy of the part of speech given the next two parts of speech, which is estimated to be about 2 bits per part of speech.
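The entropy figures in the footnote can be made concrete with a short computation. Below is a minimal sketch, under the same assumed corpus representation as above, of the two conditional entropies: the lexical entropy H(tag | word) and the contextual entropy H(tag | next two tags). The function names are illustrative, and the estimates are simple plug-in (relative frequency) estimates.

    import math
    from collections import Counter

    def conditional_entropy(pairs):
        """Plug-in estimate of H(Y | X) in bits from (x, y) observations."""
        joint = Counter(pairs)
        marginal = Counter()
        for (x, _), c in joint.items():
            marginal[x] += c
        total = sum(joint.values())
        h = 0.0
        for (x, _), c in joint.items():
            # -sum over (x, y) of p(x, y) * log2 p(y | x)
            h -= (c / total) * math.log2(c / marginal[x])
        return h

    def lexical_entropy(tagged_sentences):
        """H(tag | word): about 0.25 bits on the tagged Brown Corpus."""
        return conditional_entropy((word, tag)
                                   for sentence in tagged_sentences
                                   for word, tag in sentence)

    def contextual_entropy(tagged_sentences):
        """H(tag | next two tags): about 2 bits on the same corpus."""
        observations = []
        for sentence in tagged_sentences:
            tags = [tag for _, tag in sentence]
            for i in range(len(tags) - 2):
                observations.append(((tags[i + 1], tags[i + 2]), tags[i]))
        return conditional_entropy(observations)

That the lexical entropy is so much smaller than the contextual entropy is the quantitative sense in which words, by themselves, leave relatively little doubt about their part of speech.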