The task of identifying sentence boundaries in text has not received as much attention as it deserves. Many freely available natural language processing tools require their input to be divided into sentences, but make no mention of how to accomplish this (e.g. (Brill, 1994; Collins, 1996)). Others perform the division implicitly without discussing performance (e.g. (Cutting et al., 1992)). At first glance, it may appear that using a short list of sentence-final punctuation marks, such as ., ?, and !, is sufficient. However, these punctuation marks are not used exclusively to mark sentence breaks. For example, embedded quotations may contain any of the sentence-ending punctuation marks, and . is used as a decimal point, in email addresses, to indicate ellipsis, and in abbreviations. Both ! and ? are somewhat less ambiguous, but appear in proper names and may be used multiple times for emphasis to mark a single sentence boundary.

Lexically-based rules could be written and exception lists used to disambiguate the difficult cases described above. However, the lists will never be exhaustive, and multiple rules may interact badly, since punctuation marks exhibit absorption properties: sites which logically should be marked with multiple punctuation marks will often have only one ((Nunberg, 1990), as summarized in (White, 1995)). For example, a sentence-ending abbreviation will most likely not be followed by an additional period if the abbreviation already contains one (e.g. note that D.C. is followed by only a single . in "The President lives in Washington, D.C."). As a result, we believe that manually writing rules is not a good approach.

Instead, we present a solution based on a maximum entropy model which requires a few hints about what information to use and a corpus annotated with sentence boundaries. We present two systems for identifying sentence boundaries. The model trains easily and performs comparably to systems that require vastly more information. Training on 39,441 sentences takes 18 minutes on a Sun Ultra Sparc, and disambiguating the boundaries in a single Wall Street Journal article requires only 1.4 seconds.

To our knowledge, there have been few papers about identifying sentence boundaries. Liberman and Church suggest in (Liberman and Church, 1992) that a system could be quickly built to divide newswire text into sentences with a nearly negligible error rate, but do not actually build such a system. We have described an approach to identifying sentence boundaries which performs comparably to other state-of-the-art systems that require vastly more resources.

The authors would like to acknowledge the support of ARPA grant N66001-94-C-6043, ARO grant DAAH0494-G-0426 and NSF grant SBR89-20230. We would like to thank David Palmer for giving us the test data he and Marti Hearst used for their sentence detection experiments, and we would also like to thank the anonymous reviewers for their helpful insights.
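As a rough illustration of this style of approach (not the exact system described above), the sketch below treats every ., ?, and ! as a candidate boundary, extracts simple contextual features, and trains a logistic regression classifier, which is a maximum entropy model for the binary boundary decision. The feature set, the toy training data, and the use of scikit-learn are assumptions made for illustration only; the paper's actual features and implementation may differ.

# A minimal sketch of sentence-boundary detection as classification of
# candidate punctuation marks. The features, toy data, and use of
# scikit-learn's LogisticRegression (a binary maximum entropy model) are
# illustrative assumptions, not the system described in the summary above.
import re

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

CANDIDATE = re.compile(r"[.?!]")


def candidates(text):
    """Indices of every potential sentence-final punctuation mark."""
    return [m.start() for m in CANDIDATE.finditer(text)]


def features(text, i):
    """Simple contextual features around the candidate mark at index i."""
    left, right = text[:i].split(), text[i + 1:].split()
    prev_tok = left[-1] if left else ""
    next_tok = right[0] if right else ""
    return {
        "punct": text[i],
        "prev_tok": prev_tok.lower(),
        "next_tok": next_tok.lower(),
        "prev_contains_period": "." in prev_tok,   # abbreviations such as "D.C."
        "prev_is_short": len(prev_tok) <= 3,       # short tokens are often abbreviations
        "next_is_capitalized": next_tok[:1].isupper(),
    }


# Toy annotated corpus: lists of gold sentences. A real system would train on
# tens of thousands of labeled sentences (39,441 in the experiments above).
docs = [
    ["The President lives in Washington, D.C.", "He works there."],
    ["Prices rose 3.5 percent yesterday.", "Analysts were surprised!"],
    ["Dr. Smith asked a question.", "Nobody answered it."],
]

X, y = [], []
for sentences in docs:
    text, gold, pos = " ".join(sentences), set(), 0
    for s in sentences:
        pos = text.index(s, pos) + len(s)
        gold.add(pos - 1)                # index of each sentence-final mark
    for i in candidates(text):
        X.append(features(text, i))
        y.append(1 if i in gold else 0)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X), y)

# Split unseen text wherever the classifier labels a candidate as a boundary.
test = "Mr. Jones paid 4.5 dollars. Was that too much?"
pieces, start = [], 0
for i in candidates(test):
    if clf.predict(vec.transform([features(test, i)]))[0]:
        pieces.append(test[start:i + 1].strip())
        start = i + 1
if test[start:].strip():
    pieces.append(test[start:].strip())
print(pieces)

A full system would of course be trained on a large annotated corpus, such as the 39,441 Wall Street Journal sentences mentioned above, rather than on toy examples.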