The Penn Chinese Treebank (CTB) is an ongoing project whose objective is to create a segmented Chinese corpus annotated with POS tags and syntactic brackets. The first installment of the project (CTB-I) consists of Xinhua newswire from 1994 to 1998, totaling 100,000 words, fully segmented, POS-tagged and syntactically bracketed. It has been released to the public via the Penn Linguistic Data Consortium (LDC). The preliminary results of this phase of the project have been reported in Xia et al. (2000). The second installment, the 400,000-word CTB-II, is currently being developed and is expected to be completed early in 2003. CTB-II will follow the standards set up in the segmentation (Xia 2000b), POS tagging (Xia 2000a) and bracketing guidelines (Xue and Xia 2000). To diversify the sources, it will use articles from People's Daily, Hong Kong newswire and material translated into Chinese from other languages, in addition to the Xinhua newswire used in CTB-I.

The availability of CTB-I changed our approach to CTB-II considerably. Because CTB-I exists, we were able to train new automatic Chinese language processing (CLP) tools, which crucially use annotated corpora as training material. These tools are then used for preprocessing in the development of CTB-II. We also developed tools to control the quality of the corpus. In this paper, we address three issues in the development of the Chinese Treebank: annotation speed, annotation accuracy and usability of the corpus. Specifically, we attempt to answer four questions: (i) how do we speed up the annotation process, (ii) how do we maintain high quality, i.e. annotation accuracy and inter-annotator consistency, during the annotation process, (iii) for what purposes is the corpus applicable, and (iv) what are our future plans?
Although we will touch upon linguistic problems that are specific to Chinese, we believe these issues are general enough to be relevant to the development of any single-language corpus.

1 Annotation speed

Three main factors affect annotation speed: annotators' background, guideline design and, more importantly, the availability of preprocessing tools. We will discuss how each of these three factors affects annotation speed.

1.1 Annotator background

Even with the best sets of guidelines, it is important that annotators have received considerable training in linguistics, particularly in syntax. In both the segmentation/POS tagging phase and the syntactic bracketing phase, understanding the structure of the sentences is essential for correct annotation with reasonable speed.