The empiricist revolution in computational linguistics has dramatically shifted the accepted boundary between what kinds of knowledge are best supplied by humans and what kinds are best learned from data, with much of the human-supplied knowledge now taking the form of annotations of data. As we look to the future, we expect that relatively unsupervised methods will grow in applicability, reducing the need for expensive human annotation of data. With respect to part-of-speech tagging, we believe that the way forward from the relatively small number of languages for which we can currently identify parts of speech in context with reasonable accuracy will make use of unsupervised methods that require only an untagged corpus and a lexicon of words and their possible parts of speech. We believe this because such lexicons exist for many more languages (in the form of conventional dictionaries) than extensive human-tagged training corpora do.

Unsupervised part-of-speech tagging, as defined above, has been attempted using a variety of learning algorithms (Brill 1995; Church 1988; Cutting et al. 1992; Elworthy 1994; Kupiec 1992; Merialdo 1991). While this makes unsupervised part-of-speech tagging a relatively well-studied problem, published results to date have not been comparable with respect to the training and test data used, or the lexicons made available to the learners. In this paper, we provide the first comprehensive comparison of methods for unsupervised part-of-speech tagging. In addition, we explore two new ideas for improving tagging accuracy. First, we explore an HMM approach to tagging that uses context on both sides of the word to be tagged, inspired by previous work on building bidirectionality into graphical models (Lafferty et al. 2001; Toutanova et al. 2003). Second, we describe a method for sequential unsupervised training of tag-sequence and lexical probabilities in an HMM, which we observe leads to improved accuracy over simultaneous training with certain types of models.

In Section 2, we provide a brief description of the methods we evaluate and review published results. Section 3 describes the contextualized variation on HMM tagging that we have explored. In Section 4, we provide a direct comparison of several unsupervised part-of-speech taggers, which is followed by Section 5, in which we present a new method for training with suboptimal lexicons. In Section 6, we revisit our new approach to HMM tagging, this time in the supervised framework.

We have presented a comprehensive evaluation of several methods for unsupervised part-of-speech tagging, comparing several variations of hidden Markov model taggers and unsupervised transformation-based learning using the same corpus and the same lexicons. This result falls only slightly below that of the full-blown, training-intensive, dependency-based conditional model. In future work, we will consider increasing the context size, which helped Toutanova et al. (2003).
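To make the unsupervised setting above concrete, the following is a minimal sketch of Viterbi decoding in an HMM tagger whose search space is constrained by a lexicon of possible tags per word. The function names, data structures, and smoothing constant are assumptions made for illustration; in practice the transition and emission probabilities would be estimated from an untagged corpus (e.g., with EM), which is not shown here.

```python
import math

def viterbi(words, lexicon, trans, emit, start="<s>"):
    """Return the most probable tag sequence for `words` (sketch only).

    lexicon: dict word -> set of candidate tags (from a dictionary)
    trans:   dict (prev_tag, tag) -> P(tag | prev_tag)
    emit:    dict (tag, word) -> P(word | tag)
    """
    # best[tag] = (log-prob of best path ending in tag, that path)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for tag in lexicon[w]:                  # only dictionary-licensed tags
            cands = []
            for prev, (logp, path) in best.items():
                # Tiny floor probability stands in for real smoothing.
                p = trans.get((prev, tag), 1e-12) * emit.get((tag, w), 1e-12)
                cands.append((logp + math.log(p), path + [tag]))
            new_best[tag] = max(cands)          # keep the best incoming path
        best = new_best
    return max(best.values())[1]
```

Restricting each position to the lexicon's candidate tags, rather than the full tag set, is what makes training and decoding with only a dictionary and raw text tractable in this setting.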
Using a 50%-50% train-test split of the Penn Treebank to assess HMMs, maximum entropy Markov models (MEMMs), and conditional random fields (CRFs), Lafferty et al. (2001) found that CRFs, which make use of observation features from both the past and the future, outperformed HMMs, which in turn outperformed MEMMs.
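The distinction at issue is access to future context. As a rough illustration (the feature templates below are invented for this sketch, not taken from any of the cited systems), a CRF-style observation-feature function can condition on words on both sides of the current position, whereas a standard left-to-right HMM generates each word from its tag alone:

```python
def context_features(words, i):
    """Illustrative observation features for position i, drawing on
    context from both the past and the future (sketch only)."""
    prev_w = words[i - 1] if i > 0 else "<s>"                 # past context
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"   # future context
    return {
        "w=" + words[i]: 1.0,            # current word
        "w-1=" + prev_w: 1.0,            # word to the left
        "w+1=" + next_w: 1.0,            # word to the right
        "suffix3=" + words[i][-3:]: 1.0, # crude morphological cue
    }
```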