alwaysaditi's picture
End of training
dc78b20 verified
raw
history blame contribute delete
No virus
4.09 kB
parsing sentences using statistical information gathered from a treebank was first examined a decade ago in (chitrad and grishman, 1990) and is by now a fairly well-studied problem ((charniak, 1997), (collins, 1997), (ratnaparkhi, 1997)). but to date, the end product of the parsing process has for the most part been a bracketing with simple constituent labels like np, vp, or sbar. the penn treebank contains a great deal of additional syntactic and semantic information from which to gather statistics; reproducing more of this information automatically is a goal which has so far been mostly ignored. this paper details a process by which some of this information—the function tags— may be recovered automatically. in the penn treebank, there are 20 tags (figure 1) that can be appended to constituent labels in order to indicate additional information about the syntactic or semantic role of the constituent. we have divided them into four categories (given in figure 2) based on those in the bracketing guidelines (bies et al., 1995). a constituent can be tagged with multiple tags, but never with two tags from the same category.1 in actuality, the case where a constituent has tags from all four categories never happens, but constituents with three tags do occur (rarely). at a high level, we can simply say that having the function tag information for a given text is useful just because any further information would help. but specifically, there are distinct advantages for each of the various categories. grammatical tags are useful for any application trying to follow the thread of the text—they find the 'who does what' of each clause, which can be useful to gain information about the situation or to learn more about the behaviour of the words in the sentence. the form/function tags help to find those constituents behaving in ways not conforming to their labelled type, as well as further clarifying the behaviour of adverbial phrases. information retrieval applications specialising in describing events, as with a number of the muc applications, could greatly benefit from some of these in determining the where-when-why of things. noting a topicalised constituent could also prove useful to these applications, and it might also help in discourse analysis, or pronoun resolution. finally, the 'miscellaneous' tags are convenient at various times; particularly the clr 'closely related' tag, which among other things marks phrasal verbs and prepositional ditransitives. to our knowledge, there has been no attempt so far to recover the function tags in parsing treebank text. in fact, we know of only one project that used them at all: (collins, 1997) defines certain constituents as complements based on a combination of label and function tag information. this boolean condition is then used to train an improved parser.this boolean condition is then used to train an improved parser. this work presents a method for assigning function tags to text that has been parsed to the simple label level. • there is no reason to think that this work could not be integrated directly into the parsing process, particularly if one's parser is already geared partially or entirely towards feature-based statistics; the function tag information could prove quite useful within the parse itself, to rank several parses to find the most plausible. parsing sentences using statistical information gathered from a treebank was first examined a decade ago in (chitrad and grishman, 1990) and is by now a fairly well-studied problem ((charniak, 1997), (collins, 1997), (ratnaparkhi, 1997)). but to date, the end product of the parsing process has for the most part been a bracketing with simple constituent labels like np, vp, or sbar. in fact, we know of only one project that used them at all: (collins, 1997) defines certain constituents as complements based on a combination of label and function tag information. there are, it seems, two reasonable baselines for this and future work. we have found it useful to define our statistical model in terms of features.