alwaysaditi committed on
Commit dc78b20
1 Parent(s): 56ddc78

End of training

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. DATASET_PACSUM/Copy_of_Data_Creation_and_Preprocessing.ipynb +0 -0
  2. DATASET_PACSUM/README.md +109 -0
  3. DATASET_PACSUM/config.json +59 -0
  4. DATASET_PACSUM/dataset/inputs/A00-1031.txt +1 -0
  5. DATASET_PACSUM/dataset/inputs/A00-1043.txt +1 -0
  6. DATASET_PACSUM/dataset/inputs/A00-2004.txt +1 -0
  7. DATASET_PACSUM/dataset/inputs/A00-2009.txt +1 -0
  8. DATASET_PACSUM/dataset/inputs/A00-2018.txt +1 -0
  9. DATASET_PACSUM/dataset/inputs/A00-2019.txt +1 -0
  10. DATASET_PACSUM/dataset/inputs/A00-2024.txt +1 -0
  11. DATASET_PACSUM/dataset/inputs/A00-2026.txt +1 -0
  12. DATASET_PACSUM/dataset/inputs/A00-2030.txt +1 -0
  13. DATASET_PACSUM/dataset/inputs/A00-2031.txt +1 -0
  14. DATASET_PACSUM/dataset/inputs/A00-2034.txt +1 -0
  15. DATASET_PACSUM/dataset/inputs/A88-1019.txt +1 -0
  16. DATASET_PACSUM/dataset/inputs/A92-1006.txt +1 -0
  17. DATASET_PACSUM/dataset/inputs/A92-1018.txt +1 -0
  18. DATASET_PACSUM/dataset/inputs/A92-1021.txt +1 -0
  19. DATASET_PACSUM/dataset/inputs/A94-1006.txt +1 -0
  20. DATASET_PACSUM/dataset/inputs/A94-1009.txt +1 -0
  21. DATASET_PACSUM/dataset/inputs/A94-1016.txt +1 -0
  22. DATASET_PACSUM/dataset/inputs/A97-1004.txt +1 -0
  23. DATASET_PACSUM/dataset/inputs/A97-1011.txt +1 -0
  24. DATASET_PACSUM/dataset/inputs/A97-1014.txt +1 -0
  25. DATASET_PACSUM/dataset/inputs/A97-1029.txt +1 -0
  26. DATASET_PACSUM/dataset/inputs/A97-1030.txt +1 -0
  27. DATASET_PACSUM/dataset/inputs/A97-1039.txt +1 -0
  28. DATASET_PACSUM/dataset/inputs/A97-1052.txt +1 -0
  29. DATASET_PACSUM/dataset/inputs/C00-1007.txt +1 -0
  30. DATASET_PACSUM/dataset/inputs/C00-1044.txt +1 -0
  31. DATASET_PACSUM/dataset/inputs/C00-1072.txt +1 -0
  32. DATASET_PACSUM/dataset/inputs/C00-2136.txt +1 -0
  33. DATASET_PACSUM/dataset/inputs/C00-2137.txt +1 -0
  34. DATASET_PACSUM/dataset/inputs/C00-2163.txt +1 -0
  35. DATASET_PACSUM/dataset/inputs/C02-1011.txt +1 -0
  36. DATASET_PACSUM/dataset/inputs/C02-1054.txt +1 -0
  37. DATASET_PACSUM/dataset/inputs/C02-1114.txt +1 -0
  38. DATASET_PACSUM/dataset/inputs/C02-1144.txt +1 -0
  39. DATASET_PACSUM/dataset/inputs/C02-1145.txt +1 -0
  40. DATASET_PACSUM/dataset/inputs/C02-1150.txt +1 -0
  41. DATASET_PACSUM/dataset/inputs/C02-2025.txt +1 -0
  42. DATASET_PACSUM/dataset/inputs/C04-1010.txt +1 -0
  43. DATASET_PACSUM/dataset/inputs/C04-1024.txt +1 -0
  44. DATASET_PACSUM/dataset/inputs/C04-1041.txt +1 -0
  45. DATASET_PACSUM/dataset/inputs/C04-1051.txt +1 -0
  46. DATASET_PACSUM/dataset/inputs/C04-1059.txt +1 -0
  47. DATASET_PACSUM/dataset/inputs/C04-1072.txt +1 -0
  48. DATASET_PACSUM/dataset/inputs/C04-1080.txt +1 -0
  49. DATASET_PACSUM/dataset/inputs/C04-1081.txt +1 -0
  50. DATASET_PACSUM/dataset/inputs/C04-1111.txt +1 -0
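
For orientation, each `DATASET_PACSUM/dataset/inputs/*.txt` file added in this commit is a single-line plain-text document (one paper per file), as the `+1 -0` counts above indicate. Below is a minimal sketch of reading them, assuming the repository has been cloned and the script runs from its root directory:

```python
# Sketch only: load the single-line input documents added in this commit.
# Assumes the repository has been cloned locally and this runs from its root.
from pathlib import Path

input_dir = Path("DATASET_PACSUM/dataset/inputs")
documents = {path.stem: path.read_text(encoding="utf-8").strip()
             for path in sorted(input_dir.glob("*.txt"))}

print(f"{len(documents)} documents loaded")
print(documents["A00-1031"][:200], "...")  # first 200 characters of one paper
```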
DATASET_PACSUM/Copy_of_Data_Creation_and_Preprocessing.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
DATASET_PACSUM/README.md ADDED
@@ -0,0 +1,109 @@
+ ---
+ license: apache-2.0
+ base_model: allenai/led-base-16384
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: DATASET_PACSUM
+ results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # DATASET_PACSUM
+
+ This model is a fine-tuned version of [allenai/led-base-16384](https://huggingface.co/allenai/led-base-16384) on the None dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 2.5461
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 5e-05
+ - train_batch_size: 2
+ - eval_batch_size: 2
+ - seed: 42
+ - gradient_accumulation_steps: 4
+ - total_train_batch_size: 8
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 5
+ - mixed_precision_training: Native AMP
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss |
+ |:-------------:|:-----:|:----:|:---------------:|
+ | 2.8648 | 0.1 | 10 | 2.8816 |
+ | 2.9889 | 0.2 | 20 | 2.7866 |
+ | 3.0516 | 0.3 | 30 | 2.7394 |
+ | 2.6605 | 0.4 | 40 | 2.7132 |
+ | 2.8093 | 0.5 | 50 | 2.6759 |
+ | 2.9206 | 0.6 | 60 | 2.6607 |
+ | 2.8094 | 0.7 | 70 | 2.6576 |
+ | 2.5233 | 0.8 | 80 | 2.6327 |
+ | 2.6508 | 0.9 | 90 | 2.6117 |
+ | 2.8456 | 1.0 | 100 | 2.5861 |
+ | 2.4622 | 1.1 | 110 | 2.5942 |
+ | 2.2871 | 1.2 | 120 | 2.5751 |
+ | 2.4482 | 1.3 | 130 | 2.5776 |
+ | 2.4079 | 1.4 | 140 | 2.5777 |
+ | 2.2842 | 1.5 | 150 | 2.5621 |
+ | 2.6267 | 1.6 | 160 | 2.5463 |
+ | 2.3895 | 1.7 | 170 | 2.5503 |
+ | 2.2786 | 1.8 | 180 | 2.5470 |
+ | 2.3628 | 1.9 | 190 | 2.5420 |
+ | 2.2809 | 2.0 | 200 | 2.5367 |
+ | 2.2726 | 2.1 | 210 | 2.5405 |
+ | 2.1934 | 2.2 | 220 | 2.5676 |
+ | 2.2447 | 2.3 | 230 | 2.5399 |
+ | 2.4508 | 2.4 | 240 | 2.5435 |
+ | 2.2969 | 2.5 | 250 | 2.5490 |
+ | 2.4206 | 2.6 | 260 | 2.5317 |
+ | 2.0131 | 2.7 | 270 | 2.5378 |
+ | 2.0025 | 2.8 | 280 | 2.5492 |
+ | 2.2179 | 2.9 | 290 | 2.5280 |
+ | 2.2082 | 3.0 | 300 | 2.5190 |
+ | 1.9491 | 3.1 | 310 | 2.5608 |
+ | 2.291 | 3.2 | 320 | 2.5448 |
+ | 2.0431 | 3.3 | 330 | 2.5319 |
+ | 2.0671 | 3.4 | 340 | 2.5529 |
+ | 2.1939 | 3.5 | 350 | 2.5388 |
+ | 2.0606 | 3.6 | 360 | 2.5306 |
+ | 2.0088 | 3.7 | 370 | 2.5557 |
+ | 2.1919 | 3.8 | 380 | 2.5317 |
+ | 2.2516 | 3.9 | 390 | 2.5290 |
+ | 1.9401 | 4.0 | 400 | 2.5404 |
+ | 2.1101 | 4.1 | 410 | 2.5354 |
+ | 1.8906 | 4.2 | 420 | 2.5520 |
+ | 1.9808 | 4.3 | 430 | 2.5488 |
+ | 1.8195 | 4.4 | 440 | 2.5496 |
+ | 1.8512 | 4.5 | 450 | 2.5535 |
+ | 2.0464 | 4.6 | 460 | 2.5519 |
+ | 2.0176 | 4.7 | 470 | 2.5450 |
+ | 2.0686 | 4.8 | 480 | 2.5460 |
+ | 2.0267 | 4.9 | 490 | 2.5463 |
+ | 1.8617 | 5.0 | 500 | 2.5461 |
+
+
+ ### Framework versions
+
+ - Transformers 4.41.2
+ - Pytorch 2.3.0+cu121
+ - Datasets 2.20.0
+ - Tokenizers 0.19.1
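
The hyperparameters listed in the model card above map directly onto `Seq2SeqTrainingArguments` from the pinned Transformers release. The following is a hedged sketch, not the actual training script: the output directory, evaluation cadence, `predict_with_generate`, and the omitted dataset/tokenization steps are assumptions inferred from the tables above; the Adam betas and epsilon match the Trainer defaults, so they need no explicit configuration.

```python
# Sketch only: a Seq2SeqTrainer setup matching the hyperparameters in the
# model card above. Dataset loading and tokenization are intentionally omitted.
from transformers import (
    AutoTokenizer,
    LEDForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "allenai/led-base-16384"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = LEDForConditionalGeneration.from_pretrained(base_model)

training_args = Seq2SeqTrainingArguments(
    output_dir="DATASET_PACSUM",     # assumed; matches the model name
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,   # effective train batch size: 2 * 4 = 8
    num_train_epochs=5,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                       # "Native AMP" mixed precision
    eval_strategy="steps",           # every 10 steps, inferred from the
    eval_steps=10,                   # validation-loss table above
    logging_steps=10,
    predict_with_generate=True,      # assumption: typical for summarization
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_train,  # assumed preprocessed splits
#     eval_dataset=tokenized_eval,
#     tokenizer=tokenizer,
# )
# trainer.train()
```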
DATASET_PACSUM/config.json ADDED
@@ -0,0 +1,59 @@
+ {
+   "_name_or_path": "allenai/led-base-16384",
+   "activation_dropout": 0.0,
+   "activation_function": "gelu",
+   "architectures": [
+     "LEDForConditionalGeneration"
+   ],
+   "attention_dropout": 0.0,
+   "attention_window": [
+     1024,
+     1024,
+     1024,
+     1024,
+     1024,
+     1024
+   ],
+   "bos_token_id": 0,
+   "classif_dropout": 0.0,
+   "classifier_dropout": 0.0,
+   "d_model": 768,
+   "decoder_attention_heads": 12,
+   "decoder_ffn_dim": 3072,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 6,
+   "decoder_start_token_id": 2,
+   "dropout": 0.1,
+   "early_stopping": true,
+   "encoder_attention_heads": 12,
+   "encoder_ffn_dim": 3072,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 6,
+   "eos_token_id": 2,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "length_penalty": 2.0,
+   "max_decoder_position_embeddings": 1024,
+   "max_encoder_position_embeddings": 16384,
+   "max_length": 512,
+   "min_length": 100,
+   "model_type": "led",
+   "no_repeat_ngram_size": 3,
+   "num_beams": 2,
+   "num_hidden_layers": 6,
+   "pad_token_id": 1,
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.2",
+   "use_cache": false,
+   "vocab_size": 50265
+ }
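
The generation fields in this config (`num_beams`, `max_length`, `min_length`, `length_penalty`, `no_repeat_ngram_size`, `early_stopping`) act as defaults for `model.generate()`. Below is a hedged inference sketch: the hub id is a placeholder (a local checkpoint path works just as well), and giving the first token global attention is the usual LED convention rather than anything stated in this commit.

```python
# Sketch only: summarize one of the dataset inputs with the fine-tuned LED model.
# "your-username/DATASET_PACSUM" is a placeholder model id; substitute the real
# hub id or a local checkpoint directory.
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

model_id = "your-username/DATASET_PACSUM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LEDForConditionalGeneration.from_pretrained(model_id)

document = open("DATASET_PACSUM/dataset/inputs/A00-1031.txt", encoding="utf-8").read()

inputs = tokenizer(
    document,
    max_length=16384,   # matches max_encoder_position_embeddings in config.json
    truncation=True,
    return_tensors="pt",
)

# LED uses windowed local attention (attention_window above); the common
# convention is to mark the first token for global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

# num_beams=2, max_length=512, min_length=100, length_penalty=2.0 and
# no_repeat_ngram_size=3 are picked up from the config as generation defaults.
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```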
DATASET_PACSUM/dataset/inputs/A00-1031.txt ADDED
@@ -0,0 +1 @@
+ a large number of current language processing systems use a part-of-speech tagger for pre-processing. the tagger assigns a (unique or ambiguous) part-ofspeech tag to each token in the input and passes its output to the next processing level, usually a parser. furthermore, there is a large interest in part-ofspeech tagging for corpus annotation projects, who create valuable linguistic resources by a combination of automatic processing and human correction. for both applications, a tagger with the highest possible accuracy is required. the debate about which paradigm solves the part-of-speech tagging problem best is not finished. recent comparisons of approaches that can be trained on corpora (van halteren et al., 1998; volk and schneider, 1998) have shown that in most cases statistical aproaches (cutting et al., 1992; schmid, 1995; ratnaparkhi, 1996) yield better results than finite-state, rule-based, or memory-based taggers (brill, 1993; daelemans et al., 1996). they are only surpassed by combinations of different systems, forming a &quot;voting tagger&quot;. among the statistical approaches, the maximum entropy framework has a very strong position. nevertheless, a recent independent comparison of 7 taggers (zavrel and daelemans, 1999) has shown that another approach even works better: markov models combined with a good smoothing technique and with handling of unknown words. this tagger, tnt, not only yielded the highest accuracy, it also was the fastest both in training and tagging. the tagger comparison was organized as a &quot;blackbox test&quot;: set the same task to every tagger and compare the outcomes. this paper describes the models and techniques used by tnt together with the implementation. the reader will be surprised how simple the underlying model is. the result of the tagger comparison seems to support the maxime &quot;the simplest is the best&quot;. however, in this paper we clarify a number of details that are omitted in major previous publications concerning tagging with markov models. as two examples, (rabiner, 1989) and (charniak et al., 1993) give good overviews of the techniques and equations used for markov models and part-ofspeech tagging, but they are not very explicit in the details that are needed for their application. we argue that it is not only the choice of the general model that determines the result of the tagger but also the various &quot;small&quot; decisions on alternatives. the aim of this paper is to give a detailed account of the techniques used in tnt. additionally, we present results of the tagger on the negra corpus (brants et al., 1999) and the penn treebank (marcus et al., 1993). the penn treebank results reported here for the markov model approach are at least equivalent to those reported for the maximum entropy approach in (ratnaparkhi, 1996). for a comparison to other taggers, the reader is referred to (zavrel and daelemans, 1999).tnt is freely available to universities and related organizations for research purposes (see http://www.coli.uni-sb.derthorstenant). a large number of current language processing systems use a part-of-speech tagger for pre-processing. for a comparison to other taggers, the reader is referred to (zavrel and daelemans, 1999). we have shown that a tagger based on markov models yields state-of-the-art results, despite contrary claims found in the literature. 
the penn treebank results reported here for the markov model approach are at least equivalent to those reported for the maximum entropy approach in (ratnaparkhi, 1996). the tagger assigns a (unique or ambiguous) part-ofspeech tag to each token in the input and passes its output to the next processing level, usually a parser. furthermore, there is a large interest in part-ofspeech tagging for corpus annotation projects, who create valuable linguistic resources by a combination of automatic processing and human correction. additionally, we present results of the tagger on the negra corpus (brants et al., 1999) and the penn treebank (marcus et al., 1993). for example, the markov model tagger used in the comparison of (van halteren et al., 1998) yielded worse results than all other taggers. it is a very interesting future research topic to determine the advantages of either of these approaches, to find the reason for their high accuracies, and to find a good combination of both.
DATASET_PACSUM/dataset/inputs/A00-1043.txt ADDED
@@ -0,0 +1 @@
+ current automatic summarizers usually rely on sentence extraction to produce summaries. human professionals also often reuse the input documents to generate summaries; however, rather than simply extracting sentences and stringing them together, as most current summarizers do, humans often &quot;edit&quot; the extracted sentences in some way so that the resulting summary is concise and coherent. we analyzed a set of articles and identified six major operations that can be used for editing the extracted sentences, including removing extraneous phrases from an extracted sentence, combining a reduced sentence with other sentences, syntactic transformation, substituting phrases in an extracted sentence with their paraphrases, substituting phrases with more general or specific descriptions, and reordering the extracted sentences (jing and mckeown, 1999; jing and mckeown, 2000). we call the operation of removing extraneous phrases from an extracted sentence sentence reduction. it is one of the most effective operations that can be used to edit the extracted sentences. reduction can remove material at any granularity: a word, a prepositional phrase, a gerund, a to-infinitive or a clause. we use the term &quot;phrase&quot; here to refer to any of the above components that can be removed in reduction. the following example shows an original sentence and its reduced form written by a human professional: original sentence: when it arrives sometime next year in new tv sets, the v-chip will give parents a new and potentially revolutionary device to block out programs they don't want their children to see. reduced sentence by humans: the v-chip will give parents a device to block out programs they don't want their children to see. we implemented an automatic sentence reduction system. input to the reduction system includes extracted sentences, as well as the original document. output of reduction are reduced forms of the extracted sentences, which can either be used to produce summaries directly, or be merged with other sentences. the reduction system uses multiple sources of knowledge to make reduction decisions, including syntactic knowledge, context, and statistics computed from a training corpus. we evaluated the system against the output of human professionals. the program achieved a success rate of 81.3%, meaning that 81.3% of reduction decisions made by the system agreed with those of humans. sentence reduction improves the conciseness of automatically generated summaries, making it concise and on target. it can also improve the coherence of generated summaries, since extraneous phrases that can potentially introduce incoherece are removed. we collected 500 sentences and their corresponding reduced forms written by humans, and found that humans reduced the length of these 500 sentences by 44.2% on average. this indicates that a good sentence reduction system can improve the conciseness of generated summaries significantly. in the next section, we describe the sentence reduction algorithm in details. in section 3, we introduce the evaluation scheme used to access the performance of the system and present evaluation results. in section 4, we discuss other applications of sentence reduction, the interaction between reduction and other modules in a summarization system, and related work on sentence simplication. 
finally, we the goal of sentence reduction is to &quot;reduce without major loss&quot;; that is, we want to remove as many extraneous phrases as possible from an extracted sentence so that it can be concise, but without detracting from the main idea the sentence conveys. ideally, we want to remove a phrase from an extracted sentence only if it is irrelevant to the main topic. to achieve this, the system relies on multiple sources of knowledge to make reduction decisions. we first introduce the resources in the system and then describe the reduction algorithm. (1) the corpus. one of the key features of the system is that it uses a corpus consisting of original sentences and their corresponding reduced forms written by humans for training and testing purpose. this corpus was created using an automatic program we have developed to automatically analyze human-written abstracts. the program, called the decomposition program, matches phrases in a human-written summary sentence to phrases in the original document (jing and mckeown, 1999). the human-written abstracts were collected from the free daily news service &quot;communicationsrelated headlines&quot;, provided by the benton foundation (http://www.benton.org). the articles in the corpus are news reports on telecommunication related issues, but they cover a wide range of topics, such as law, labor, and company mergers. database to date. it provides lexical relations between words, including synonymy, antonymy, meronymy, entailment (e.g., eat —> chew), or causation (e.g., kill --* die). these lexical links are used to identify the focus in the local context. (4) the syntactic parser. we use the english slot grammar(esg) parser developed at ibm (mccord, 1990) to analyze the syntactic structure of an input sentence and produce a sentence parse tree. the esg parser not only annotates the syntactic category of a phrase (e.g., &quot;np&quot; or &quot;vp&quot;), it also annotates the thematic role of a phrase (e.g., &quot;subject&quot; or &quot;object&quot;). there are five steps in the reduction program: step 1: syntactic parsing. we first parse the input sentence using the esg parser and produce the sentence parse tree. the operations in all other steps are performed based on this parse tree. each following step annotates each node in the parse tree with additional information, such as syntactic or context importance, which are used later to determine which phrases (they are represented as subtrees in a parse tree) can be considered extraneous and thus removed. step 2: grammar checking. in this step, we determine which components of a sentence must not be deleted to keep the sentence grammatical. to do this, we traverse the parse tree produced in the first step in top-down order and mark, for each node in the parse tree, which of its children are grammatically obligatory. we use two sources of knowledge for this purpose. one source includes simple, linguistic-based rules that use the thematic role structure produced by the esg parser. for instance, for a sentence, the main verb, the subject, and the object(s) are essential if they exist, but a prepositional phrase is not; for a noun phrase, the head noun is essential, but an adjective modifier of the head noun is not. the other source we rely on is the large-scale lexicon we described earlier. the information in the lexicon is used to mark the obligatory arguments of verb phrases. 
for example, for the verb &quot;convince&quot;, the lexicon has the following entry: this entry indicates that the verb &quot;convince&quot; can be followed by a noun phrase and a prepositional phrase starting with the preposition &quot;of' (e.g., he convinced me of his innocence). it can also be followed by a noun phrase and a to-infinitive phrase (e.g., he convinced me to go to the party). this information prevents the system from deleting the &quot;of&quot; prepositional phrase or the to-infinitive that is part of the verb phrase. at the end of this step, each node in the parse tree — including both leaf nodes and intermediate nodes — is annotated with a value indicating whether it is grammatically obligatory. note that whether a node is obligatory is relative to its parent node only. for example, whether a determiner is obligatory is relative to the noun phrase it is in; whether a prepositional phrase is obligatory is relative to the sentence or the phrase it is in. step 3: context information. in this step, the system decides which components in the sentence are most related to the main topic being discussed. to measure the importance of a phrase in the local context, the system relies on lexical links between words. the hypothesis is that the more connected a word is with other words in the local context, the more likely it is to be the focus of the local context. we link the words in the extracted sentence with words in its local context, if they are repetitions, morphologically related, or linked in wordnet through one of the lexical relations. the system then computes an importance score for each word in the extracted sentence, based on the number of links it has with other words and the types of links. the formula for computing the context importance score for a word w is as follows: here, i represents the different types of lexical relations the system considered, including repetition, inflectional relation, derivational relation, and the lexical relations from wordnet. we assigned a weight to each type of lexical relation, represented by li in the formula. relations such as repetition or inflectional relation are considered more important and are assigned higher weights, while relations such as hypernym are considered less important and assigned lower weights. nu (w) in the formula represents the number of a particular type of lexical links the word w has with words in the local context. after an importance score is computed for each word, each phrase in the 'sentence gets a score by adding up the scores of its children nodes in the parse tree. this score indicates how important the phrase is in the local context. step 4: corpus evidence. the program uses a corpus consisting of sentences reduced by human professionals and their corresponding original sentences to compute how likely humans remove a certain phrase. the system first parsed the sentences in the corpus using esg parser. it then marked which subtrees in these parse trees (i.e., phrases in the sentences) were removed by humans. using this corpus of marked parse trees, we can compute how likely a subtree is removed from its parent node. 
for example, we can compute the probability that the &quot;when&quot; temporal clause is removed when the main verb is &quot;give&quot;, represented as prob(&quot;when-clause is removed&quot; i &quot;v=give&quot;), or the probability that the to-infinitive modifier of the head noun &quot;device&quot; is removed, represented as prob(&quot;to-infinitive modifier is removed&quot; i&quot;n=device&quot;). these probabilities are computed using bayes's rule. for example, the probability that the &quot;when&quot; temporal clause is removed when the main verb is &quot;give&quot;, prob(&quot;when-clause is removed&quot; i &quot;v=give&quot;), is computed as the product of prob( &quot;v=give&quot; i &quot;when-clause is removed&quot;) (i.e., the probability that the main verb is &quot;give&quot; when the &quot;when&quot; clause is removed) and prob(&quot;when-clause is removed&quot;) (i.e., the probability that the &quot;when&quot; clause is removed), divided by prob(&quot;v=give&quot;) (i.e., the probability that the main verb is &quot;give&quot;). besides computing the probability that a phrase is removed, we also compute two other types of probabilities: the probability that a phrase is reduced (i.e., the phrase is not removed as a whole, but some components in the phrase are removed), and the probability that a phrase is unchanged at all (i.e., neither removed nor reduced). these corpus probabilities help us capture human practice. for example, for sentences like &quot;the agency reported that ...&quot; , &quot;the other source says that ...&quot; , &quot;the new study suggests that ...&quot; , the thatclause following the say-verb (i.e., report, say, and suggest) in each sentence is very rarely changed at all by professionals. the system can capture this human practice, since the probability that that-clause of the verb say or report being unchanged at all will be relatively high, which will help the system to avoid removing components in the that-clause. these corpus probabilities are computed beforehand using a training corpus. they are then stored in a table and loaded at running time. step 5: final decision. the final reduction decisions are based on the results from all the earlier steps. to decide which phrases to remove, the system traverses the sentence parse tree, which now have been annotated with different types of information from earlier steps, in the top-down order and decides which subtrees should be removed, reduced or unchanged. a subtree (i.e., a phrase) is removed only if it is not grammatically obligatory, not the focus of the local context (indicated by a low importance score), and has a reasonable probability of being removed by humans. figure 1 shows sample output of the reduction program. the reduced sentences produced by humans are also provided for comparison.the reduced sentences produced by humans are also provided for comparison. current automatic summarizers usually rely on sentence extraction to produce summaries. figure 1 shows sample output of the reduction program. any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the national science foundation. it is one of the most effective operations that can be used to edit the extracted sentences. the final reduction decisions are based on the results from all the earlier steps. we call the operation of removing extraneous phrases from an extracted sentence sentence reduction. 
reduction can remove material at any granularity: a word, a prepositional phrase, a gerund, a to-infinitive or a clause. we analyzed a set of articles and identified six major operations that can be used for editing the extracted sentences, including removing extraneous phrases from an extracted sentence, combining a reduced sentence with other sentences, syntactic transformation, substituting phrases in an extracted sentence with their paraphrases, substituting phrases with more general or specific descriptions, and reordering the extracted sentences (jing and mckeown, 1999; jing and mckeown, 2000). step 5: final decision. they are then stored in a table and loaded at running time.
DATASET_PACSUM/dataset/inputs/A00-2004.txt ADDED
@@ -0,0 +1 @@
+ even moderately long documents typically address several topics or different aspects of the same topic. the aim of linear text segmentation is to discover the topic boundaries. the uses of this procedure include information retrieval (hearst and plaunt, 1993; hearst, 1994; yaari, 1997; reynar, 1999), summarization (reynar, 1998), text understanding, anaphora resolution (kozima, 1993), language modelling (morris and hirst, 1991; beeferman et al., 1997b) and improving document navigation for the visually disabled (choi, 2000). this paper focuses on domain independent methods for segmenting written text. we present a new algorithm that builds on previous work by reynar (reynar, 1998; reynar, 1994). the primary distinction of our method is the use of a ranking scheme and the cosine similarity measure (van rijsbergen, 1979) in formulating the similarity matrix. we propose that the similarity values of short text segments is statistically insignificant. thus, one can only rely on their order, or rank, for clustering.even moderately long documents typically address several topics or different aspects of the same topic. a segmentation algorithm has two key elements, a, clustering strategy and a similarity measure. we would also like to develop a linear time and multi-source version of the algorithm. thus, one can only rely on their order, or rank, for clustering. the significance of our results has been confirmed by both t-test and ks-test. the definition of a topic segment ranges from complete stories (allan et al., 1998) to summaries (ponte and croft, 1997). given the quality of an algorithm is task dependent, the following experiments focus on the relative performance. c99, k98 and r98 are all polynomial time algorithms. existing work falls into one of two categories, lexical cohesion methods and multi-source methods (yaari, 1997). it would be interesting to compare c99 with the multi-source method described in (beeferman et al., 1999) using the tdt corpus. if one disregards segmentation accuracy, h94 has the best algorithmic performance (linear). our evaluation strategy is a variant of that described in (reynar, 1998, 71-73) and the tdt segmentation task (allan et al., 1998). our results show divisive clustering (r98) is more precise than sliding window (h94) and lexical chains (k98) for locating topic boundaries.
DATASET_PACSUM/dataset/inputs/A00-2009.txt ADDED
@@ -0,0 +1 @@
+ word sense disambiguation is often cast as a problem in supervised learning, where a disambiguator is induced from a corpus of manually sense—tagged text using methods from statistics or machine learning. these approaches typically represent the context in which each sense—tagged instance of a word occurs with a set of linguistically motivated features. a learning algorithm induces a representative model from these features which is employed as a classifier to perform disambiguation. this paper presents a corpus—based approach that results in high accuracy by combining a number of very simple classifiers into an ensemble that performs disambiguation via a majority vote. this is motivated by the observation that enhancing the feature set or learning algorithm used in a corpus—based approach does not usually improve disambiguation accuracy beyond what can be attained with shallow lexical features and a simple supervised learning algorithm. for example, a naive bayesian classifier (duda and hart, 1973) is based on a blanket assumption about the interactions among features in a sensetagged corpus and does not learn a representative model. despite making such an assumption, this proves to be among the most accurate techniques in comparative studies of corpus—based word sense disambiguation methodologies (e.g., (leacock et al., 1993), (mooney, 1996), (ng and lee, 1996), (pedersen and bruce, 1997)). these studies represent the context in which an ambiguous word occurs with a wide variety of features. however, when the contribution of each type of feature to overall accuracy is analyzed (eg. (ng and lee, 1996)), shallow lexical features such as co—occurrences and collocations prove to be stronger contributors to accuracy than do deeper, linguistically motivated features such as part—of—speech and verb—object relationships. it has also been shown that the combined accuracy of an ensemble of multiple classifiers is often significantly greater than that of any of the individual classifiers that make up the ensemble (e.g., (dietterich, 1997)). in natural language processing, ensemble techniques have been successfully applied to part— of—speech tagging (e.g., (brill and wu, 1998)) and parsing (e.g., (henderson and brill, 1999)). when combined with a history of disambiguation success using shallow lexical features and naive bayesian classifiers, these findings suggest that word sense disambiguation might best be improved by combining the output of a number of such classifiers into an ensemble. this paper begins with an introduction to the naive bayesian classifier. the features used to represent the context in which ambiguous words occur are presented, followed by the method for selecting the classifiers to include in the ensemble. then, the line and interesi data is described. experimental results disambiguating these words with an ensemble of naive bayesian classifiers are shown to rival previously published results. this paper closes with a discussion of the choices made in formulating this methodology and plans for future work.this work extends ideas that began in collaboration with rebecca bruce and janyce wiebe. a preliminary version of this paper appears in (pedersen, 2000). word sense disambiguation is often cast as a problem in supervised learning, where a disambiguator is induced from a corpus of manually sense—tagged text using methods from statistics or machine learning. this paper closes with a discussion of the choices made in formulating this methodology and plans for future work. 
each of the nine member classifiers votes for the most probable sense given the particular context represented by that classifier; the ensemble disambiguates by assigning the sense that receives a majority of the votes. a naive bayesian classifier assumes that all the feature variables representing a problem are conditionally independent given the value of a classification variable. these approaches typically represent the context in which each sense—tagged instance of a word occurs with a set of linguistically motivated features. this approach was evaluated using the widely studied nouns line and interest, which are disambiguated with accuracy of 88% and 89%, which rivals the best previously published results. experimental results disambiguating these words with an ensemble of naive bayesian classifiers are shown to rival previously published results.
DATASET_PACSUM/dataset/inputs/A00-2018.txt ADDED
@@ -0,0 +1 @@
+ we present a new parser for parsing down to penn tree-bank style parse trees [16] that achieves 90.1% average precision/recall for sentences of length < 40, and 89.5% for sentences of length < 100, when trained and tested on the previously established [5,9,10,15,17] &quot;standard&quot; sections of the wall street journal tree-bank. this represents a 13% decrease in error rate over the best single-parser results on this corpus [9]. following [5,10], our parser is based upon a probabilistic generative model. that is, for all sentences s and all parses 7r, the parser assigns a probability p(s , 7r) = p(r), the equality holding when we restrict consideration to 7r whose yield * this research was supported in part by nsf grant lis sbr 9720368. the author would like to thank mark johnson and all the rest of the brown laboratory for linguistic information processing. is s. then for any s the parser returns the parse ir that maximizes this probability. that is, the parser implements the function arg maxrp(7r s) = arg maxirp(7r, s) = arg maxrp(w). what fundamentally distinguishes probabilistic generative parsers is how they compute p(r), and it is to that topic we turn next.it is to this project that our future parsing work will be devoted. what fundamentally distinguishes probabilistic generative parsers is how they compute p(r), and it is to that topic we turn next. we present a new parser for parsing down to penn tree-bank style parse trees [16] that achieves 90.1% average precision/recall for sentences of length < 40, and 89.5% for sentences of length < 100, when trained and tested on the previously established [5,9,10,15,17] &quot;standard&quot; sections of the wall street journal tree-bank. indeed, we initiated this line of work in an attempt to create a parser that would be flexible enough to allow modifications for parsing down to more semantic levels of detail. we have presented a lexicalized markov grammar parsing model that achieves (using the now standard training/testing/development sections of the penn treebank) an average precision/recall of 91.1% on sentences of length < 40 and 89.5% on sentences of length < 100. this corresponds to an error reduction of 13% over the best previously published single parser results on this test set, those of collins [9]. in the previous sections we have concentrated on the relation of the parser to a maximumentropy approach, the aspect of the parser that is most novel.
DATASET_PACSUM/dataset/inputs/A00-2019.txt ADDED
@@ -0,0 +1 @@
+ a good indicator of whether a person knows the meaning of a word is the ability to use it appropriately in a sentence (miller and gildea, 1987). much information about usage can be obtained from quite a limited context: choueka and lusignan (1985) found that people can typically recognize the intended sense of a polysemous word by looking at a narrow window of one or two words around it. statistically-based computer programs have been able to do the same with a high level of accuracy (kilgarriff and palmer, 2000). the goal of our work is to automatically identify inappropriate usage of specific vocabulary words in essays by looking at the local contextual cues around a target word. we have developed a statistical system, alek (assessing lexical knowledge), that uses statistical analysis for this purpose. a major objective of this research is to avoid the laborious and costly process of collecting errors (or negative evidence) for each word that we wish to evaluate. instead, we train alek on a general corpus of english and on edited text containing example uses of the target word. the system identifies inappropriate usage based on differences between the word's local context cues in an essay and the models of context it has derived from the corpora of well-formed sentences. a requirement for alek has been that all steps in the process be automated, beyond choosing the words to be tested and assessing the results. once a target word is chosen, preprocessing, building a model of the word's appropriate usage, and identifying usage errors in essays is performed without manual intervention. alek has been developed using the test of english as a foreign language (toefl) administered by the educational testing service. toefl is taken by foreign students who are applying to us undergraduate and graduate-level programs.toefl is taken by foreign students who are applying to us undergraduate and graduate-level programs. a good indicator of whether a person knows the meaning of a word is the ability to use it appropriately in a sentence (miller and gildea, 1987). the unsupervised techniques that we have presented for inferring negative evidence are effective in recognizing grammatical errors in written text. however, its techniques could be incorporated into a grammar checker for native speakers. approaches to detecting errors by non-native writers typically produce grammars that look for specific expected error types (schneider and mccoy, 1998; park, palmer and washburn, 1997). the problem of error detection does not entail finding similarities to appropriate usage, rather it requires identifying one element among the contextual cues that simply does not fit. alek has been developed using the test of english as a foreign language (toefl) administered by the educational testing service. under this approach, essays written by esl students are collected and examined for errors. this system was tested on eight essays, but precision and recall figures are not reported. an incorrect usage can contain two or three salient contextual elements as well as a single anomalous element. comparison of these results to those of other systems is difficult because there is no generally accepted test set or performance baseline.
DATASET_PACSUM/dataset/inputs/A00-2024.txt ADDED
@@ -0,0 +1 @@
+ there is a big gap between the summaries produced by current automatic summarizers and the abstracts written by human professionals. certainly one factor contributing to this gap is that automatic systems can not always correctly identify the important topics of an article. another factor, however, which has received little attention, is that automatic summarizers have poor text generation techniques. most automatic summarizers rely on extracting key sentences or paragraphs from an article to produce a summary. since the extracted sentences are disconnected in the original article, when they are strung together, the resulting summary can be inconcise, incoherent, and sometimes even misleading. we present a cut and paste based text summarization technique, aimed at reducing the gap between automatically generated summaries and human-written abstracts. rather than focusing on how to identify key sentences, as do other researchers, we study how to generate the text of a summary once key sentences have been extracted. the main idea of cut and paste summarization is to reuse the text in an article to generate the summary. however, instead of simply extracting sentences as current summarizers do, the cut and paste system will &quot;smooth&quot; the extracted sentences by editing them. such edits mainly involve cutting phrases and pasting them together in novel ways. the key features of this work are:there is a big gap between the summaries produced by current automatic summarizers and the abstracts written by human professionals. the key features of this work are: any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the national science foundation. we thank ibm for licensing us the esg parser and the mitre corporation for licensing us the coreference resolution system. finally, we conclude and discuss future work. we will also extend the system to query-based summarization and investigate whether the system can be modified for multiple document summarization. this paper presents a novel architecture for text summarization using cut and paste techniques observed in human-written abstracts. ing operations. related work is discussed in section 6. we identified six operations that can be used alone or together to transform extracted sentences into sentences in human-written abstracts. (mani et al., 1999) addressed the problem of revising summaries to improve their quality. however, the combination operations and combination rules that we derived from corpus analysis are significantly different from those used in the above system, which mostly came from operations in traditional natural language generation. such edits mainly involve cutting phrases and pasting them together in novel ways.
DATASET_PACSUM/dataset/inputs/A00-2026.txt ADDED
@@ -0,0 +1 @@
+ this paper presents three trainable systems for surface natural language generation (nlg). surface nlg, for our purposes, consists of generating a grammatical natural language phrase that expresses the meaning of an input semantic representation. the systems take a &quot;corpus-based&quot; or &quot;machinelearning&quot; approach to surface nlg, and learn to generate phrases from semantic input by statistically analyzing examples of phrases and their corresponding semantic representations. the determination of the content in the semantic representation, or &quot;deep&quot; generation, is not discussed here. instead, the systems assume that the input semantic representation is fixed and only deal with how to express it in natural language. this paper discusses previous approaches to surface nlg, and introduces three trainable systems for surface nlg, called nlg1, nlg2, and nlg3. quantitative evaluation of experiments in the air travel domain will also be discussed.this paper presents three trainable systems for surface natural language generation (nlg). quantitative evaluation of experiments in the air travel domain will also be discussed. this paper presents the first systems (known to the author) that use a statistical learning approach to produce natural language text directly from a semantic representation. we conjecture that nlg2 and nlg3 should work in other domains which have a complexity similar to air travel, as well as available annotated data. the nlg2 and nlg3 systems automatically attempt to generalize from the knowledge inherent in the training corpus of templates, so that they can generate templates for novel attribute sets. in contrast, (langkilde and knight, 1998) uses corpus-derived statistical knowledge to rank plausible hypotheses from a grammarbased surface generation component. templates are the easiest way to implement surface nlg. this limitation can be overcome by using features on values, so that nlg2 and nlg3 might discover — to use a hypothetical example — that &quot;flights leaving $city-fr&quot; is preferred over &quot;flights from $city-fr&quot; when $city-fr is a particular value, such as &quot;miami&quot;. our current approach has the limitation that it ignores the values of attributes, even though they might strongly influence the word order and word choice.
DATASET_PACSUM/dataset/inputs/A00-2030.txt ADDED
@@ -0,0 +1 @@
+ since 1995, a few statistical parsing algorithms (magerman, 1995; collins, 1996 and 1997; charniak, 1997; rathnaparki, 1997) demonstrated a breakthrough in parsing accuracy, as measured against the university of pennsylvania treebank as a gold standard. yet, relatively few have embedded one of these algorithms in a task. chiba, (1999) was able to use such a parsing algorithm to reduce perplexity with the long term goal of improved speech recognition. in this paper, we report adapting a lexicalized, probabilistic context-free parser with head rules (lpcfg-hr) to information extraction. the technique was benchmarked in the seventh message understanding conference (muc-7) in 1998. several technical challenges confronted us and were solved: treebank on wall street journal adequately train the algorithm for new york times newswire, which includes dozens of newspapers? manually creating sourcespecific training data for syntax was not required. instead, our parsing algorithm, trained on the upenn treebank, was run on the new york times source to create unsupervised syntactic training which was constrained to be consistent with semantic annotation.this simple semantic annotation was the only source of task knowledge used to configure the model. we have demonstrated, at least for one problem, that a lexicalized, probabilistic context-free parser with head rules (lpcfghr) can be used effectively for information extraction. instead, our parsing algorithm, trained on the upenn treebank, was run on the new york times source to create unsupervised syntactic training which was constrained to be consistent with semantic annotation. while performance did not quite match the best previously reported results for any of these three tasks, we were pleased to observe that the scores were at or near state-of-the-art levels for all cases. since 1995, a few statistical parsing algorithms (magerman, 1995; collins, 1996 and 1997; charniak, 1997; rathnaparki, 1997) demonstrated a breakthrough in parsing accuracy, as measured against the university of pennsylvania treebank as a gold standard. we evaluated the new approach to information extraction on two of the tasks of the seventh message understanding conference (muc-7) and reported in (marsh, 1998). our system for muc-7 consisted of the sentential model described in this paper, coupled with a simple probability model for cross-sentence merging. for the following example, the template relation in figure 2 was to be generated: &quot;donald m. goldstein, a historian at the university of pittsburgh who helped write...&quot;
DATASET_PACSUM/dataset/inputs/A00-2031.txt ADDED
@@ -0,0 +1 @@
+ parsing sentences using statistical information gathered from a treebank was first examined a decade ago in (chitrad and grishman, 1990) and is by now a fairly well-studied problem ((charniak, 1997), (collins, 1997), (ratnaparkhi, 1997)). but to date, the end product of the parsing process has for the most part been a bracketing with simple constituent labels like np, vp, or sbar. the penn treebank contains a great deal of additional syntactic and semantic information from which to gather statistics; reproducing more of this information automatically is a goal which has so far been mostly ignored. this paper details a process by which some of this information—the function tags— may be recovered automatically. in the penn treebank, there are 20 tags (figure 1) that can be appended to constituent labels in order to indicate additional information about the syntactic or semantic role of the constituent. we have divided them into four categories (given in figure 2) based on those in the bracketing guidelines (bies et al., 1995). a constituent can be tagged with multiple tags, but never with two tags from the same category.1 in actuality, the case where a constituent has tags from all four categories never happens, but constituents with three tags do occur (rarely). at a high level, we can simply say that having the function tag information for a given text is useful just because any further information would help. but specifically, there are distinct advantages for each of the various categories. grammatical tags are useful for any application trying to follow the thread of the text—they find the 'who does what' of each clause, which can be useful to gain information about the situation or to learn more about the behaviour of the words in the sentence. the form/function tags help to find those constituents behaving in ways not conforming to their labelled type, as well as further clarifying the behaviour of adverbial phrases. information retrieval applications specialising in describing events, as with a number of the muc applications, could greatly benefit from some of these in determining the where-when-why of things. noting a topicalised constituent could also prove useful to these applications, and it might also help in discourse analysis, or pronoun resolution. finally, the 'miscellaneous' tags are convenient at various times; particularly the clr 'closely related' tag, which among other things marks phrasal verbs and prepositional ditransitives. to our knowledge, there has been no attempt so far to recover the function tags in parsing treebank text. in fact, we know of only one project that used them at all: (collins, 1997) defines certain constituents as complements based on a combination of label and function tag information. this boolean condition is then used to train an improved parser.this boolean condition is then used to train an improved parser. this work presents a method for assigning function tags to text that has been parsed to the simple label level. • there is no reason to think that this work could not be integrated directly into the parsing process, particularly if one's parser is already geared partially or entirely towards feature-based statistics; the function tag information could prove quite useful within the parse itself, to rank several parses to find the most plausible. 
parsing sentences using statistical information gathered from a treebank was first examined a decade ago in (chitrad and grishman, 1990) and is by now a fairly well-studied problem ((charniak, 1997), (collins, 1997), (ratnaparkhi, 1997)). but to date, the end product of the parsing process has for the most part been a bracketing with simple constituent labels like np, vp, or sbar. in fact, we know of only one project that used them at all: (collins, 1997) defines certain constituents as complements based on a combination of label and function tag information. there are, it seems, two reasonable baselines for this and future work. we have found it useful to define our statistical model in terms of features.
DATASET_PACSUM/dataset/inputs/A00-2034.txt ADDED
@@ -0,0 +1 @@
+ diathesis alternations are alternate ways in which the arguments of a verb are expressed syntactically. the syntactic changes are sometimes accompanied by slight changes in the meaning of the verb. an example of the causative alternation is given in (1) below. in this alternation, the object of the transitive variant can also appear as the subject of the intransitive variant. in the conative alternation, the transitive form alternates with a prepositional phrase construction involving either at or on. an example of the conative alternation is given in (2). we refer to alternations where a particular semantic role appears in different grammatical roles in alternate realisations as &quot;role switching alternations&quot; (rsas). it is these alternations that our method applies to. recently, there has been interest in corpus-based methods to identify alternations (mccarthy and korhonen, 1998; lapata, 1999), and associated verb classifications (stevenson and merlo, 1999). these have either relied on a priori knowledge specified for the alternations in advance, or are not suitable for a wide range of alternations. the fully automatic method outlined here is applied to the causative and conative alternations, but is applicable to other rsas.however, a considerably larger corpus would be required to overcome the sparse data problem for other rsa alternations. we have discovered a significant relationship between the similarity of selectional preferences at the target slots, and participation in the causative and conative alternations. diathesis alternations are alternate ways in which the arguments of a verb are expressed syntactically. the fully automatic method outlined here is applied to the causative and conative alternations, but is applicable to other rsas. we propose a method to acquire knowledge of alternation participation directly from corpora, with frequency information available as a by-product. notably, only one negative decision was made because of the disparate frame frequencies, which reduces the cost of combining the argument head data. diathesis alternations have been proposed for a number of nlp tasks. earlier work by resnik (1993) demonstrated a link between selectional preference strength and participation in alternations where the direct object is omitted. the syntactic changes are sometimes accompanied by slight changes in the meaning of the verb. these have either relied on a priori knowledge specified for the alternations in advance, or are not suitable for a wide range of alternations. for the conative, a sample of 16 verbs was used and this time accuracy was only 56%.
DATASET_PACSUM/dataset/inputs/A88-1019.txt ADDED
@@ -0,0 +1 @@
+ it is well-known that part of speech depends on context. the word &quot;table,&quot; for example, can be a verb in some contexts (e.g., &quot;he will table the motion&quot;) and a noun in others (e.g., &quot;the table is ready&quot;). a program has been written which tags each word in an input sentence with the most likely part of speech. the program produces the following output for the two &quot;table&quot; sentences just mentioned: (pps = subject pronoun; md = modal; vb = verb (no inflection); at = article; nn = noun; bez = present 3rd sg form of &quot;to be&quot;; jj = adjective; notation is borrowed from [francis and kucera, pp. 6-8]) part of speech tagging is an important practical problem with potential applications in many areas including speech synthesis, speech recognition, spelling correction, proof-reading, query answering, machine translation and searching large text data bases (e.g., patents, newspapers). the author is particularly interested in speech synthesis applications, where it is clear that pronunciation sometimes depends on part of speech. consider the following three examples where pronunciation depends on part of speech. first, there are words like &quot;wind&quot; where the noun has a different vowel than the verb. that is, the noun &quot;wind&quot; has a short vowel as in &quot;the wind is strong,&quot; whereas the verb &quot;wind&quot; has a long vowel as in &quot;don't forget to wind your watch.&quot; secondly, the pronoun &quot;that&quot; is stressed as in &quot;did you see that?&quot; unlike the complementizer &quot;that,&quot; as in &quot;it is a shame that he's leaving.&quot; thirdly, note the difference between &quot;oily fluid&quot; and &quot;transmission fluid&quot;; as a general rule, an adjective-noun sequence such as &quot;oily fluid&quot; is typically stressed on the right whereas a noun-noun sequence such as &quot;transmission fluid&quot; is typically stressed on the left. these are but three of the many constructions which would sound more natural if the synthesizer had access to accurate part of speech information. perhaps the most important application of tagging programs is as a tool for future research. a number of large projects such as [cobuild] have recently been collecting large corpora (101000 million words) in order to better describe how language is actually used in practice: &quot;for the first time, a dictionary has been compiled by the thorough examination of representative group of english texts, spoken and written, running to many millions of words. this means that in addition to all the tools of the conventional dictionary makers... the dictionary is based on hard, measureable evidence.&quot; [cobuild, p. xv] it is likely that there will be more and more research projects collecting larger and larger corpora. a reliable parts program might greatly enhance the value of these corpora to many of these researchers. the program uses a linear time dynamic programming algorithm to find an assignment of parts of speech to words that optimizes the product of (a) lexical probabilities (probability of observing part of speech i given word j), and (b) contextual probabilities (probability of observing part of speech i given k previous parts of speech). probability estimates were obtained by training on the tagged brown corpus [francis and kucera], a corpus of approximately 1,000,000 words with part of speech tags assigned laboriously by hand over many years. 
program performance is encouraging (95-99% &quot;correct&quot;, depending on the definition of &quot;correct&quot;). a small 400 word sample is presented in the appendix, and is judged to be 99.5% correct. it is surprising that a local &quot;bottom-up&quot; approach can perform so well. most errors are attributable to defects in the lexicon; remarkably few errors are related to the inadequacies of the extremely over-simplified grammar (a trigram model). apparently, &quot;long distance&quot; dependences are not very important, at least most of the time. one might have thought that ngram models weren't adequate for the task since it is wellknown that they are inadequate for determining grammaticality: &quot;we find that no finite-state markov process that produces symbols with transition from state to state can serve as an english grammar. furthermore, the particular subclass of such processes that produce norder statistical approximations to english do not come closer, with increasing n, to matching the output of an english grammar.&quot; [chomsky, p. 113] chomslcy's conclusion was based on the observation that constructions such as: have long distance dependencies that span across any fixed length window n. thus, ngram models are clearly inadequate for many natural language applications. however, for the tagging application, the ngram approximation may be acceptable since long distance dependencies do not seem to be very important. statistical ngram models were quite popular in the 1950s, and have been regaining popularity over the past few years. the ibm speech group is perhaps the strongest advocate of ngram methods, especially in other applications such as speech recognition. robert mercer (private communication, 1982) has experimented with the tagging application, using a restricted corpus (laser patents) and small vocabulary (1000 words). another group of researchers working in lancaster around the same time, leech, garside and atwell, also found ngram models highly effective; they report 96.7% success in automatically tagging the lob corpus, using a bigram model modified with heuristics to cope with more important trigrams. the present work developed independently from the lob project. many people who have not worked in computational linguistics have a strong intuition that lexical ambiguity is usually not much of a problem. it is commonly believed that most words have just one part of speech, and that the few exceptions such as &quot;table&quot; are easily disambiguated by context in most cases. in contrast, most experts in computational linguists have found lexical ambiguity to be a major issue; it is said that practically any content word can be used as a noun, verb or adjective,i and that local context is not always adequate to disambiguate. introductory texts are full of ambiguous sentences such as where no amount of syntactic parsing will help. these examples are generally taken to indicate that the parser must allow for multiple possibilities and that grammar formalisms such as lr(k) are inadequate for natural language since these formalisms cannot cope with ambiguity. this argument was behind a large set of objections to marcus' &quot;lr(k)-like&quot; deterministic parser. although it is clear that an expert in computational linguistics can dream up arbitrarily hard sentences, it may be, as marcus suggested, that most texts are not very hard in practice. 
recall that marcus hypothesized most decisions can be resolved by the parser within a small window (i.e., three buffer cells), and there are only a few problematic cases where the parser becomes confused. he called these confusing cases &quot;garden paths,&quot; by analogy with the famous example: • the horse raced past the barn fell. with just a few exceptions such as these &quot;garden paths,&quot; marcus assumes, there is almost always a unique &quot;best&quot; interpretation which can be found with very limited resources. the proposed stochastic approach is largely compatible with this; the proposed approach 1. from an information theory point of view, one can quantity ambiguity in bits. in the case of the brown tagged corpus, the lexical entropy, the conditional entropy of the part of speech given the word is about 0.25 bits per part of speech. this is considerably smaller than the contextual entropy, the conditional entropy of the part of speech given the next two parts of speech. this entropy is estimated to be about 2 bits per part of speech. assumes that it is almost always sufficient to assign each word a unique &quot;best&quot; part of speech (and this can be accomplished with a very efficient linear time dynamic programming algorithm). after reading introductory discussions of &quot;flying planes can be dangerous,&quot; one might have expected that lexical ambiguity was so pervasive that it would be hopeless to try to assign just one part of speech to each word and in just one linear time pass over the input words.the proposed method omitted only 5 of 243 noun phrase brackets in the appendix. it is well-known that part of speech depends on context. find all assignments of parts of speech to &quot;a&quot; and score. after reading introductory discussions of &quot;flying planes can be dangerous,&quot; one might have expected that lexical ambiguity was so pervasive that it would be hopeless to try to assign just one part of speech to each word and in just one linear time pass over the input words. the word &quot;table,&quot; for example, can be a verb in some contexts (e.g., &quot;he will table the motion&quot;) and a noun in others (e.g., &quot;the table is ready&quot;). this entropy is estimated to be about 2 bits per part of speech. assumes that it is almost always sufficient to assign each word a unique &quot;best&quot; part of speech (and this can be accomplished with a very efficient linear time dynamic programming algorithm). this is considerably smaller than the contextual entropy, the conditional entropy of the part of speech given the next two parts of speech. there is some tendency to underestimate the number of brackets and run two noun phrases together as in [np the time fairchild].
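The excerpt above describes finding the tag assignment that maximises the product of lexical and contextual probabilities with a linear-time dynamic programming search. A minimal Viterbi-style sketch of that search follows; the toy probability tables are invented rather than estimated from the Brown corpus, and a bigram context stands in for the trigram model for brevity.

```python
def viterbi(words, tags, lexical, context, start="<s>"):
    """Best tag sequence under the product of lexical probabilities
    P(word | tag) and contextual probabilities P(tag | previous tag)."""
    best = {t: context.get((start, t), 1e-12) * lexical.get((words[0], t), 1e-12)
            for t in tags}
    back_pointers = []
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev, score = max(((p, best[p] * context.get((p, t), 1e-12)) for p in tags),
                              key=lambda pair: pair[1])
            new_best[t] = score * lexical.get((word, t), 1e-12)
            pointers[t] = prev
        best = new_best
        back_pointers.append(pointers)
    # follow the back-pointers from the best final tag
    path = [max(best, key=best.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

tags = ["AT", "NN", "VB", "PPS", "MD"]
lexical = {("he", "PPS"): 0.9, ("will", "MD"): 0.8, ("table", "NN"): 0.6,
           ("table", "VB"): 0.4, ("the", "AT"): 0.9}
context = {("<s>", "PPS"): 0.4, ("<s>", "AT"): 0.4, ("PPS", "MD"): 0.5,
           ("MD", "VB"): 0.7, ("VB", "AT"): 0.4, ("AT", "NN"): 0.8}
print(viterbi("he will table the table".split(), tags, lexical, context))
# ['PPS', 'MD', 'VB', 'AT', 'NN']
```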
DATASET_PACSUM/dataset/inputs/A92-1006.txt ADDED
@@ -0,0 +1 @@
+ this paper presents the joyce system as an example of a fully-implemented, application-oriented text generation system. joyce covers the whole range of tasks associated with text generation, from content selection to morphological processing. it was developped as part of the interface of the software design environment ulysses. the following design goals were set for it: while we were able to exploit existing research for many of the design issues, it turned out that we needed to develop our own approach to text planning (ra.mbow 1990). this paper will present the system and attempt to show how these design objectives led to particular design decisions. the structure of the paper is as follows. in section 2, we will present the underlying application and give examples of the output of the system. in section 3, we will discuss the overall structure of joyce. we then discuss the three main components in turn: the text planner in section 4, the sentence planner in section 5 and the realizer in section 6. we will discuss the text planner in some detail since it represents a new approach to the problem. section 7 traces the generation of a short text. in section 8, we address the problem of portability, and wind up by discussing some shortcomings of joyce in the conclusion.in section 8, we address the problem of portability, and wind up by discussing some shortcomings of joyce in the conclusion. this paper presents the joyce system as an example of a fully-implemented, application-oriented text generation system. we are aware of several shortcomings of joyce, which we will address in future versions of the system. ple in text planning, it appears to play an important role as a constraint on possible text structures. it passes it through the incrementor to the formater, which downgrades it when a classified corrected reading leaves through p34. ii has met the design objectives of speed and quality, and our experience in porting the text generator to new task: and to new applications indicates that joyce is a flexibl( system that can adapt to a variety of text generatior tasks. initial results, including a prototype, are encouraging. porting is an important way to evaluate complete applied text generation systems, since there is no canonical set of tasks that such a system must be able to perform and on which it can be tested. the analyzer downgrades it to secret. furthermore, it helps determine the use of connectives between rhetorically related clauses. the joyce text generation system was developped part of the software design environment ulysses (korelsky and ulysses staff 1988; rosenthal et al 1988) ulysses includes a graphical environment for the design of secure, distributed software systems.
DATASET_PACSUM/dataset/inputs/A92-1018.txt ADDED
@@ -0,0 +1 @@
+ many words are ambiguous in their part of speech. for example, &quot;tag&quot; can be a noun or a verb. however, when a word appears in the context of other words, the ambiguity is often reduced: in &quot;a tag is a part-of-speech label,&quot; the word &quot;tag&quot; can only be a noun. a part-of-speech tagger is a system that uses context to assign parts of speech to words. automatic text tagging is an important first step in discovering the linguistic structure of large text corpora. part-of-speech information facilitates higher-level analysis, such as recognizing noun phrases and other patterns in text. for a tagger to function as a practical component in a language processing system, we believe that a tagger must be: robust text corpora contain ungrammatical constructions, isolated phrases (such as titles), and nonlinguistic data (such as tables). corpora are also likely to contain words that are unknown to the tagger. it is desirable that a tagger deal gracefully with these situations. efficient if a tagger is to be used to analyze arbitrarily large corpora, it must be efficient—performing in time linear in the number of words tagged. any training required should also be fast, enabling rapid turnaround with new corpora and new text genres. accurate a tagger should attempt to assign the correct part-of-speech tag to every word encountered. tunable a tagger should be able to take advantage of linguistic insights. one should be able to correct systematic errors by supplying appropriate a priori &quot;hints.&quot; it should be possible to give different hints for different corpora. reusable the effort required to retarget a tagger to new corpora, new tagsets, and new languages should be minimal.reusable the effort required to retarget a tagger to new corpora, new tagsets, and new languages should be minimal. many words are ambiguous in their part of speech. one should be able to correct systematic errors by supplying appropriate a priori &quot;hints.&quot; it should be possible to give different hints for different corpora. the algorithm has an accuracy of approximately 80% in assigning grammatical functions. we have used the tagger in a number of applications. by using the fact that words are typically associated with only a few part-ofspeech categories, and carefully ordering the computation, the algorithms have linear complexity (section 3.3). for example, &quot;tag&quot; can be a noun or a verb. several different approaches have been used for building text taggers. probabilities corresponding to category sequences that never occurred in the training data are assigned small, non-zero values, ensuring that the model will accept any sequence of tokens, while still providing the most likely tagging. we describe three applications here: phrase recognition; word sense disambiguation; and grammatical function assignment. if a noun phrase is labeled, it is also annotated as to whether the governing verb is the closest verb group to the right or to the left. taggit disambiguated 77% of the corpus; the rest was done manually over a period of several years.
DATASET_PACSUM/dataset/inputs/A92-1021.txt ADDED
@@ -0,0 +1 @@
+ there has been a dramatic increase in the application of probabilistic models to natural language processing over the last few years. the appeal of stochastic techniques over traditional rule-based techniques comes from the ease with which the necessary statistics can be automatically acquired and the fact that very little handcrafted knowledge need be built into the system. in contrast, the rules in rule-based systems are usually difficult to construct and are typically not very robust. one area in which the statistical approach has done particularly well is automatic part of speech tagging, assigning each word in an input sentence its proper part of speech [church 88; cutting et al. 92; derose 88; deroualt and merialdo 86; garside et al. 87; jelinek 85; kupiec 89; meteer et al. 911. stochastic taggers have obtained a high degree of accuracy without performing any syntactic analysis on the input. these stochastic part of speech taggers make use of a markov model which captures lexical and contextual information. the parameters of the model can be estimated from tagged ([church 88; derose 88; deroualt and merialdo 86; garside et al. 87; meteer et al. 91]) or untag,ged ([cutting et al. 92; jelinek 85; kupiec 89]) text. once the parameters of the model are estimated, a sentence can then be automatically tagged by assigning it the tag sequence which is assigned the highest probability by the model. performance is often enhanced with the aid of various higher level pre- and postprocessing procedures or by manually tuning the model. a number of rule-based taggers have been built [klein and simmons 63; green and rubin 71; hindle 89]. [klein and simmons 63] and [green and rubin 71] both have error rates substantially higher than state of the art stochastic taggers. [hindle 89] disambiguates words within a deterministic parser. we wanted to determine whether a simple rule-based tagger without any knowledge of syntax can perform as well as a stochastic tagger, or if part of speech tagging really is a domain to which stochastic techniques are better suited. in this paper we describe a rule-based tagger which performs as well as taggers based upon probabilistic models. the rule-based tagger overcomes the limitations common in rule-based approaches to language processing: it is robust, and the rules are automatically acquired. in addition, the tagger has many advantages over stochastic taggers, including: a vast reduction in stored information required, the perspicuity of a small set of meaningful rules as opposed to the large tables of statistics needed for stochastic taggers, ease of finding and implementing improvements to the tagger, and better portability from one tag set or corpus genre to another.we have presented a simple part of speech tagger which performs as well as existing stochastic taggers, but has significant advantages over these taggers. there has been a dramatic increase in the application of probabilistic models to natural language processing over the last few years. the fact that the simple rule-based tagger can perform so well should offer encouragement for researchers to further explore rule-based tagging, searching for a better and more expressive set of patch templates and other variations on this simple but effective theme. 
in addition, the tagger has many advantages over stochastic taggers, including: a vast reduction in stored information required, the perspicuity of a small set of meaningful rules as opposed to the large tables of statistics needed for stochastic taggers, ease of finding and implementing improvements to the tagger, and better portability from one tag set or corpus genre to another. the rule-based tagger overcomes the limitations common in rule-based approaches to language processing: it is robust, and the rules are automatically acquired. perhaps the biggest contribution of this work is in demonstrating that the stochastic method is not the only viable approach for part of speech tagging. the tagger is extremely portable. the appeal of stochastic techniques over traditional rule-based techniques comes from the ease with which the necessary statistics can be automatically acquired and the fact that very little handcrafted knowledge need be built into the system.
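The excerpt above contrasts rule-based tagging with stochastic tagging: a small ordered list of contextual patches is applied to an initial most-likely-tag assignment. The sketch below illustrates only the application step; the initial-tag dictionary, the tag names and the two patch rules are invented for illustration, not the automatically acquired rule set the excerpt refers to.

```python
MOST_LIKELY = {"the": "DET", "old": "ADJ", "can": "MODAL", "rust": "NOUN"}

# each patch: (from_tag, to_tag, condition on the previous tag)
RULES = [
    ("MODAL", "NOUN", lambda prev: prev in ("DET", "ADJ")),   # "the old can"
    ("NOUN", "VERB", lambda prev: prev == "MODAL"),           # "can rust"
]

def tag(words):
    # initial-state annotator: give every word its most likely tag
    tags = [MOST_LIKELY.get(w, "NOUN") for w in words]
    # apply the patch rules in order, left to right
    for from_tag, to_tag, condition in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and condition(tags[i - 1]):
                tags[i] = to_tag
    return list(zip(words, tags))

print(tag("the old can can rust".split()))
# [('the', 'DET'), ('old', 'ADJ'), ('can', 'NOUN'), ('can', 'MODAL'), ('rust', 'VERB')]
```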
DATASET_PACSUM/dataset/inputs/A94-1006.txt ADDED
@@ -0,0 +1 @@
+ the statistical corpus-based renaissance in computational linguistics has produced a number of interesting technologies, including part-of-speech tagging and bilingual word alignment. unfortunately, these technologies are still not as widely deployed in practical applications as they might be. part-ofspeech taggers are used in a few applications, such as speech synthesis (sproat et al., 1992) and question answering (kupiec, 1993b). word alignment is newer, found only in a few places (gale and church, 1991a; brown et al., 1993; dagan et al., 1993). it is used at ibm for estimating parameters of their statistical machine translation prototype (brown et al., 1993). we suggest that part of speech tagging and word alignment could have an important role in glossary construction for translation. glossaries are extremely important for translation. how would microsoft, or some other software vendor, want the term &quot;character menu&quot; to be translated in their manuals? technical terms are difficult for translators because they are generally not as familiar with the subject domain as either the author of the source text or the reader of the target text. in many cases, there may be a number of acceptable translations, but it is important for the sake of consistency to standardize on a single one. it would be unacceptable for a manual to use a variety of synonyms for a particular menu or button. customarily, translation houses make extensive job-specific glossaries to ensure consistency and correctness of technical terminology for large jobs. a glossary is a list of terms and their translations.' we will subdivide the task of constructing a glossary into two subtasks: (1) generating a list of terms, and (2) finding the translation equivalents. the first task will be referred to as the monolingual task and the second as the bilingual task. how should a glossary be constructed? translation schools teach their students to read as much background material as possible in both the source and target languages, an extremely time-consuming process, as the introduction to hann's (1992, p. 8) text on technical translation indicates: contrary to popular opinion, the job of a technical translator has little in common with other linguistic professions, such as literature translation, foreign correspondence or interpreting. apart from an expert knowledge of both languages..., all that is required for the latter professions is a few general dictionaries, whereas a technical translator needs a whole library of specialized dictionaries, encyclopedias and 'the source and target fields are standard, though many other fields can also be found, e.g., usage notes, part of speech constraints, comments, etc. technical literature in both languages; he is more concerned with the exact meanings of terms than with stylistic considerations and his profession requires certain 'detective' skills as well as linguistic and literary ones. beginners in this profession have an especially hard time... this book attempts to meet this requirement. unfortunately, the academic prescriptions are often too expensive for commercial practice. translators need just-in-time glossaries. they cannot afford to do a lot of background reading and &quot;detective&quot; work when they are being paid by the word. they need something more practical. we propose a tool, termight, that automates some of the more tedious and laborious aspects of terminology research. 
the tool relies on part-of-speech tagging and word-alignment technologies to extract candidate terms and translations. it then sorts the extracted candidates and presents them to the user along with reference concordance lines, supporting efficient construction of glossaries. the tool is currently being used by the translators at at&t business translation services (formerly at&t language line services). termight may prove useful in contexts other than human-based translation. primarily, it can support customization of machine translation (mt) lexicons to a new domain. in fact, the arguments for constructing a job-specific glossary for human-based translation may hold equally well for an mt-based process, emphasizing the need for a productivity tool. the monolingual component of termight can be used to construct terminology lists in other applications, such as technical writing, book indexing, hypertext linking, natural language interfaces, text categorization and indexing in digital libraries and information retrieval (salton, 1988; cherry, 1990; harding, 1982; bourigault, 1992; damerau, 1993), while the bilingual component can be useful for information retrieval in multilingual text collections (landauer and littman, 1990).we have shown that terminology research provides a good application for robust natural language technology, in particular for part-of-speech tagging and word-alignment algorithms. the statistical corpus-based renaissance in computational linguistics has produced a number of interesting technologies, including part-of-speech tagging and bilingual word alignment. in particular, we have found the following to be very effective: as the need for efficient knowledge acquisition tools becomes widely recognized, we hope that this experience with termight will be found useful for other text-related systems as well. unfortunately, these technologies are still not as widely deployed in practical applications as they might be. in fact, the arguments for constructing a job-specific glossary for human-based translation may hold equally well for an mt-based process, emphasizing the need for a productivity tool. the monolingual component of termight can be used to construct terminology lists in other applications, such as technical writing, book indexing, hypertext linking, natural language interfaces, text categorization and indexing in digital libraries and information retrieval (salton, 1988; cherry, 1990; harding, 1982; bourigault, 1992; damerau, 1993), while the bilingual component can be useful for information retrieval in multilingual text collections (landauer and littman, 1990). primarily, it can support customization of machine translation (mt) lexicons to a new domain.
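The excerpt above describes extracting and sorting candidate terms from tagged text before a translator reviews them. The sketch below shows one plausible monolingual step, assuming already-tagged input and taking maximal multi-word noun sequences as candidates; the tag set and the noun-sequence pattern are simplifying assumptions, not the tool's actual extraction grammar.

```python
from collections import Counter

def candidate_terms(tagged_tokens, noun_tags=("NN", "NNS", "NNP")):
    """Collect maximal noun sequences of two or more words, by frequency."""
    terms, current = Counter(), []
    for word, tag in tagged_tokens:
        if tag in noun_tags:
            current.append(word.lower())
        else:
            if len(current) >= 2:           # keep multi-word candidates only
                terms[" ".join(current)] += 1
            current = []
    if len(current) >= 2:
        terms[" ".join(current)] += 1
    return terms.most_common()

tagged = [("Open", "VB"), ("the", "DT"), ("character", "NN"), ("menu", "NN"),
          ("and", "CC"), ("the", "DT"), ("character", "NN"), ("menu", "NN"),
          ("appears", "VBZ")]
print(candidate_terms(tagged))   # [('character menu', 2)]
```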
DATASET_PACSUM/dataset/inputs/A94-1009.txt ADDED
@@ -0,0 +1 @@
+ part-of-speech tagging is the process of assigning grammatical categories to individual words in a corpus. one widely used approach makes use of a statistical technique called a hidden markov model (hmm). the model is defined by two collections of parameters: the transition probabilities, which express the probability that a tag follows the preceding one (or two for a second order model); and the lexical probabilities, giving the probability that a word has a given tag without regard to words on either side of it. to tag a text, the tags with non-zero probability are hypothesised for each word, and the most probable sequence of tags given the sequence of words is determined from the probabilities. two algorithms are commonly used, known as the forward-backward (fb) and viterbi algorithms. fb assigns a probability to every tag on every word, while viterbi prunes tags which cannot be chosen because their probability is lower than the ones of competing hypotheses, with a corresponding gain in computational efficiency. for an introduction to the algorithms, see cutting et at. (1992), or the lucid description by sharman (1990). there are two principal sources for the parameters of the model. if a tagged corpus prepared by a human annotator is available, the transition and lexical probabilities can be estimated from the frequencies of pairs of tags and of tags associated with words. alternatively, a procedure called baumwelch (bw) re-estimation may be used, in which an untagged corpus is passed through the fb algorithm with some initial model, and the resulting probabilities used to determine new values for the lexical and transition probabilities. by iterating the algorithm with the same corpus, the parameters of the model can be made to converge on values which are locally optimal for the given text. the degree of convergence can be measured using a perplexity measure, the sum of plog2p for hypothesis probabilities p, which gives an estimate of the degree of disorder in the model. the algorithm is again described by cutting et ad. and by sharman, and a mathematical justification for it can be found in huang et at. (1990). the first major use of hmms for part of speech tagging was in claws (garside et a/., 1987) in the 1970s. with the availability of large corpora and fast computers, there has been a recent resurgence of interest, and a number of variations on and alternatives to the fb, viterbi and bw algorithms have been tried; see the work of, for example, church (church, 1988), brill (brill and marcus, 1992; brill, 1992), derose (derose, 1988) and kupiec (kupiec, 1992). one of the most effective taggers based on a pure hmm is that developed at xerox (cutting et al., 1992). an important aspect of this tagger is that it will give good accuracy with a minimal amount of manually tagged training data. 96% accuracy correct assignment of tags to word token, compared with a human annotator, is quoted, over a 500000 word corpus. the xerox tagger attempts to avoid the need for a hand-tagged training corpus as far as possible. instead, an approximate model is constructed by hand, which is then improved by bw re-estimation on an untagged training corpus. in the above example, 8 iterations were sufficient. the initial model set up so that some transitions and some tags in the lexicon are favoured, and hence having a higher initial probability. convergence of the model is improved by keeping the number of parameters in the model down. 
to assist in this, low frequency items in the lexicon are grouped together into equivalence classes, such that all words in a given equivalence class have the same tags and lexical probabilities, and whenever one of the words is looked up, then the data common to all of them is used. re-estimation on any of the words in a class therefore counts towards re-estimation for all of them'. the results of the xerox experiment appear very encouraging. preparing tagged corpora either by hand is labour-intensive and potentially error-prone, and although a semi-automatic approach can be used (marcus et al., 1993), it is a good thing to reduce the human involvement as much as possible. however, some careful examination of the experiment is needed. in the first place, cutting et a/. do not compare the success rate in their work with that achieved from a hand-tagged training text with no re-estimation. secondly, it is unclear how much the initial biasing contributes the success rate. if significant human intervention is needed to provide the biasing, then the advantages of automatic training become rather weaker, especially if such intervention is needed on each new text domain. the kind of biasing cutting et a/. describe reflects linguistic insights combined with an understanding of the predictions a tagger could reasonably be expected to make and the ones it could not. the aim of this paper is to examine the role that training plays in the tagging process, by an experimental evaluation of how the accuracy of the tagger varies with the initial conditions. the results suggest that a completely unconstrained initial model does not produce good quality results, and that one 'the technique was originally developed by kupiec (kupiec, 1989). accurately trained from a hand-tagged corpus will generally do better than using an approach based on re-estimation, even when the training comes from a different source. a second experiment shows that there are different patterns of re-estimation, and that these patterns vary more or less regularly with a broad characterisation of the initial conditions. the outcome of the two experiments together points to heuristics for making effective use of training and reestimation, together with some directions for further research. work similar to that described here has been carried out by merialdo (1994), with broadly similar conclusions. we will discuss this work below. the principal contribution of this work is to separate the effect of the lexical and transition parameters of the model, and to show how the results vary with different degree of similarity between the training and test data.from the observations in the previous section, we propose the following guidelines for how to train a hmm for use in tagging: able, use bw re-estimation with standard convergence tests such as perplexity. the principal contribution of this work is to separate the effect of the lexical and transition parameters of the model, and to show how the results vary with different degree of similarity between the training and test data. part-of-speech tagging is the process of assigning grammatical categories to individual words in a corpus. one widely used approach makes use of a statistical technique called a hidden markov model (hmm). we will discuss this work below. in the end it may turn out there is simply no way of making the prediction without a source of information extrinsic to both model and corpus. 
work similar to that described here has been carried out by merialdo (1994), with broadly similar conclusions. the general pattern of the results presented does not vary greatly with the corpus and tagset used. during the first experiment, it became apparent that baum-welch re-estimation sometimes decreases the accuracy as the iteration progresses. to tag a text, the tags with non-zero probability are hypothesised for each word, and the most probable sequence of tags given the sequence of words is determined from the probabilities.
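The excerpt above measures convergence of Baum-Welch re-estimation with a perplexity-style quantity over hypothesis probabilities. The sketch below computes the corresponding per-word entropy (the sum mentioned in the excerpt, taken with a negative sign so the value is non-negative) averaged over a toy corpus; the tag distributions are invented simply to show the value falling as the model grows more certain.

```python
from math import log2

def average_entropy(tag_distributions):
    """tag_distributions: one {tag: probability} dict per word token."""
    def entropy(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)
    return sum(entropy(d) for d in tag_distributions) / len(tag_distributions)

# invented hypothesis probabilities at two stages of re-estimation
iteration_1 = [{"NN": 0.5, "VB": 0.5}, {"AT": 0.9, "NN": 0.1}]
iteration_5 = [{"NN": 0.9, "VB": 0.1}, {"AT": 0.99, "NN": 0.01}]

print(round(average_entropy(iteration_1), 3))   # about 0.734
print(round(average_entropy(iteration_5), 3))   # about 0.275 (more confident model)
```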
DATASET_PACSUM/dataset/inputs/A94-1016.txt ADDED
@@ -0,0 +1 @@
+ machine-readable dictionary (the collins spanish/english), the lexicons used by the kbmt modules, a large set of user-generated bilingual glossaries as well as a gazetteer and a list of proper and organization names. the outputs from these engines (target language words and phrases) are recorded in a chart whose positions correspond to words in the source language input. as a result of the operation of each of the mt engines, new edges are added to the chart, each labeled with the translation of a region of the input string and indexed by this region's beginning and end positions. we will refer to all of these edges as components (as in &quot;components of the translation&quot;) for the remainder of this article. the kbmt and ebmt engines also carry a quality score for each output element. the kbmt scores are produced based on whether any questionable heuristics were used in the source analysis or target generation. the ebmt scores are produced using a technique based on human judgements, as described in (nirenburg et al., 1994a), submitted. figure 1 presents a general view of the operation of our multi-engine mt system. the chart manager selects the overall best cover from the collection of candidate partial translations by normalizing each component's quality score (positive, with larger being better), and then selecting the best combination of components with the help of the chart walk algorithm. figure 2 illustrates the result of this process on the example spanish sentence: al momenta de su yenta a iberia, viasa contaba con ocho aviones, que tenzan en promedio 13 anos de vuelo which can be translated into english as at the moment of its sale to iberia, viasa had eight airplanes, which had on average thirteen years of flight (time). this is a sentence from one of the 1993 arpa mt evaluation texts. for each component, the starting and ending positions in the chart, the corresponding source language words, and alternative translations are shown, as well as the engine and the engine-internal quality scores. inspection of these translations shows numerous problems; for example, at position 12, &quot;aviones&quot; is translated, among other things, as &quot;aircrafts&quot;. it must be remembered that these were generated automatically from an on-line dictionary, without any lexical feature marking or other human intervention. it is well known that such automatic methods are at the moment less than perfect, to say the least. in our current system, this is not a major problem, since the results go through a mandatory editing step, as described below. the chart manager normalizes the internal scores to make them directly comparable. in the case of kbmt and ebmt, the pre-existing scores are modified, while lexical transfer results are scored based on the estimated reliability of individual databases, from 0.5 up to 15. currently the kbmt scores are reduced by a constant, except for known erroneous output, which has its score set to zero. the internal ebmt scores range from 0 being perfect to 10,000 being worthless; but the scores are nonlinear. so a region selected by a threshold is converted linearly into scores ranging from zero to a normalized maximum ebmt score. the normalization levels were empirically determined in the initial experiment by having several individuals judge the comparative average quality of the outputs in an actual translation run. 
in every case, the base score produced by the scoring functions is currently multiplied by the length of the candidate in words, on the assumption that longer items are better. we intend to test a variety of functions in order to find the right contribution of the length factor. figure 3 presents the chart walk algorithm used to produce a single, best, non-overlapping, contiguous combination (cover) of the available component translations, assuming correct component quality scores. the code is organized as a recursive divideand-conquer procedure: to calculate the cover of a region of the input, it is repeatedly split into two parts, at each possible position. each time, the best possible cover for each part is recursively found, and the two scores are combined to give a score for the chart walk containing the two best subwalks. these different splits are then compared with each other and with components from the chart spanning the whole region (if any), and the overall best result is without dynamic programming, this would have a d 2 combinatorial time complexity. dynamic programl 2.5 ming utilizes a large array to store partial results, so that the best cover of any given subsequence is only computed once; the second time that a recursive call would compute the same result, it is retrieved from the array instead. this reduces the time complexity to 0(n3), and in practice it uses an insignificant part of total processing time. g 5 all possible combinations of components are cornd 2 pared: this is not a heuristic method, but an efficient exhaustive one. this is what assures that the chog 5 sen cover is optimal. this assumes, in addition to the scores actually being correct, that the scores are compositional, in the sense that the combined score for a set of components really represents their quality as a group. this might not be the case, for example, if gaps or overlaps are allowed in some cases (perhaps where they contain the same words in the same positions). we calculate the combined score for a sequence of d 2 components as the weighted average of their individual scores. weighting by length is necessary so that g 5 the same components, when combined in a different order, produce the same combined scores. otherwise the algorithm can produce inconsistent results. e 8.8 the chart walk algorithm can also be thought of as filling in the two-dimensional dynamic-programming arrayl . figure 4 shows an intermediate point in the filling of the array. in this figure, each element (i,j) is initially the best score of any single chart compod 2 nent covering the input region from word i to word j. dashes indicate that no one component covers exnote that this array is a different data structure from the chart. actly that region. (in rows 1 through 7, the array has not yet been operated on, so it still shows its initial state.) after processing (see rows 9 through 22), each element is the score for the best set of components covering the input from word i to word j (the best cover for this substring)2. (only a truncated score is shown for each element in the figure, for readability. there is also a list of best components associated with each element.) the array is upper triangular since the starting position of a component i must be less than or equal to its ending position j. for any position, the score is calculated based on a combination of scores in the row to its left and in the column below it, versus the previous contents of the array cell for its position. 
so the array must be filled from the bottom-up, and left to right. intuitively, this is because larger regions must be built up from smaller regions within them. for example, to calculate element (8,10), we compute the length-weighted averages of the scores of the best walks over the pair of elements (8,8) and (9,10) versus the pair (8,9) and (10,10), and compare them with the scores of any single chart components going from 8 to 10 (there were none), and take the maximum. referring to figure 2 again, this corresponds to a choice between combining the translations of (8,8) viasa and (9,10) contaba con versus combining the (not shown) translations of (8,9) viasa contaba and (10,10) con. (this (8,9) element was itself previously built up from single word components.) thus, we compare (2*1+ 10*2)/3 = 7.33 with (3.5*2+2*1)/3 = 3.0 and select the first, 7.33. the first wins because contaba con has a high score as an idiom from the glossary. figure 5 shows the final array. when the element in the top-right corner is produced (5.78), the algorithm is finished, and the associated set of components is the final chart walk result shown in figure 2. it may seem that the scores should increase towards the top-right corner. this has not generally been the case. while the system produces a number of high-scoring short components, many lowscoring components have to be included to span the entire input. since the score is a weighted average, these low-scoring components pull the combined score down. a clear example can be seen at position (18,18), which has a score of 15. the scores above and to its right each average this 15 with a 5, for total values of 10.0 (all the lengths happen to be 1), and the score continues to decrease with distance from this point as one moves towards the final score, which does include the component for (18,18) in the cover. the chart-oriented integration of mt engines does not easily support deviations from the linear order of the source text elements, as when discontinuous constituents translate contiguous strings or in the case of cross-component substring order differences. we use a language pair-dependent set of postprocessing rules to alleviate this (for example, by switching the order of adjacent single-word adjective and noun components).ultimately, a multi-engine system depends on the quality of each particular engine. a less ambitious version of this idea would be to run the low-scoring engines only where there are gaps in the normally high-scoring engines. we use a language pair-dependent set of postprocessing rules to alleviate this (for example, by switching the order of adjacent single-word adjective and noun components). the outputs from these engines (target language words and phrases) are recorded in a chart whose positions correspond to words in the source language input. the chart-oriented integration of mt engines does not easily support deviations from the linear order of the source text elements, as when discontinuous constituents translate contiguous strings or in the case of cross-component substring order differences. as a result of the operation of each of the mt engines, new edges are added to the chart, each labeled with the translation of a region of the input string and indexed by this region's beginning and end positions. we will refer to all of these edges as components (as in &quot;components of the translation&quot;) for the remainder of this article. 
machine-readable dictionary (the collins spanish/english), the lexicons used by the kbmt modules, a large set of user-generated bilingual glossaries as well as a gazetteer and a list of proper and organization names.
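The excerpt above walks through the chart-walk dynamic programme and its length-weighted score combination, including the comparison of (2*1+10*2)/3 with (3.5*2+2*1)/3. The sketch below re-implements that selection over a toy chart whose components are invented but chosen to reproduce exactly that arithmetic; it is an illustration of the recurrence, not the system's code.

```python
from functools import lru_cache

# toy chart: (start, end) -> quality score, word positions 0-based and inclusive
components = {(0, 0): 2.0, (1, 2): 10.0, (0, 1): 3.5, (2, 2): 2.0}
N = 3  # number of words in the input

@lru_cache(maxsize=None)
def best_cover(i, j):
    """Return (score, spans) for the best non-overlapping cover of words i..j."""
    best = (components[(i, j)], ((i, j),)) if (i, j) in components else (float("-inf"), ())
    for split in range(i, j):                      # try every split point
        left_score, left_spans = best_cover(i, split)
        right_score, right_spans = best_cover(split + 1, j)
        left_len, right_len = split - i + 1, j - split
        # combine the two sub-covers as a length-weighted average of their scores
        combined = (left_score * left_len + right_score * right_len) / (left_len + right_len)
        if combined > best[0]:
            best = (combined, left_spans + right_spans)
    return best

score, spans = best_cover(0, N - 1)
print(round(score, 2), list(spans))   # 7.33 [(0, 0), (1, 2)], matching the worked example
```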
DATASET_PACSUM/dataset/inputs/A97-1004.txt ADDED
@@ -0,0 +1 @@
+ the task of identifying sentence boundaries in text has not received as much attention as it deserves. many freely available natural language processing tools require their input to be divided into sentences, but make no mention of how to accomplish this (e.g. (brill, 1994; collins, 1996)). others perform the division implicitly without discussing performance (e.g. (cutting et al., 1992)). on first glance, it may appear that using a short list, of sentence-final punctuation marks, such as ., ?, and !, is sufficient. however, these punctuation marks are not used exclusively to mark sentence breaks. for example, embedded quotations may contain any of the sentence-ending punctuation marks and . is used as a decimal point, in email addresses, to indicate ellipsis and in abbreviations. both ! and ? are somewhat less ambiguous *the authors would like to acknowledge the support of arpa grant n66001-94-c-6043, aro grant daah0494-g-0426 and nsf grant sbr89-20230. but appear in proper names and may be used multiple times for emphasis to mark a single sentence boundary. lexically-based rules could be written and exception lists used to disambiguate the difficult cases described above. however, the lists will never be exhaustive, and multiple rules may interact badly since punctuation marks exhibit absorption properties. sites which logically should be marked with multiple punctuation marks will often only have one ((nunberg, 1990) as summarized in (white, 1995)). for example, a sentence-ending abbreviation will most likely not be followed by an additional period if the abbreviation already contains one (e.g. note that d.0 is followed by only a single . in the president lives in washington, d.c.). as a result, we believe that manually writing rules is not a good approach. instead, we present a solution based on a maximum entropy model which requires a few hints about what. information to use and a corpus annotated with sentence boundaries. the model trains easily and performs comparably to systems that require vastly more information. training on 39441 sentences takes 18 minutes on a sun ultra sparc and disambiguating the boundaries in a single wall street journal article requires only 1.4 seconds.we would also like to thank the anonymous reviewers for their helpful insights. training on 39441 sentences takes 18 minutes on a sun ultra sparc and disambiguating the boundaries in a single wall street journal article requires only 1.4 seconds. the task of identifying sentence boundaries in text has not received as much attention as it deserves. we would like to thank david palmer for giving us the test data he and marti hearst used for their sentence detection experiments. the model trains easily and performs comparably to systems that require vastly more information. to our knowledge, there have been few papers about identifying sentence boundaries. liberman and church suggest in (liberma.n and church, 1992) that a. system could be quickly built to divide newswire text into sentences with a nearly negligible error rate, but do not actually build such a system. we have described an approach to identifying sentence boundaries which performs comparably to other state-of-the-art systems that require vastly more resources. instead, we present a solution based on a maximum entropy model which requires a few hints about what. information to use and a corpus annotated with sentence boundaries. we present two systems for identifying sentence boundaries. 
many freely available natural language processing tools require their input to be divided into sentences, but make no mention of how to accomplish this (e.g.
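The excerpt above treats every '.', '?' and '!' as a candidate boundary and lets a maximum entropy model decide from a few contextual hints. The sketch below shows the candidate-and-features setup with a trivially hand-weighted score standing in for the trained model; the abbreviation list, the features and the weights are all illustrative assumptions.

```python
import re

ABBREVIATIONS = {"mr", "mrs", "dr", "e.g", "i.e"}   # illustrative list only

def candidates(text):
    return [m.start() for m in re.finditer(r"[.?!]", text)]

def features(text, i):
    before = text[:i].split()
    prefix = before[-1] if before else ""
    following = text[i + 1:].lstrip()
    return {
        "prefix_is_abbrev": prefix.lower().rstrip(".") in ABBREVIATIONS,
        "next_is_capitalised": bool(following) and following[0].isupper(),
        "at_end_of_text": not following,
    }

def is_boundary(feats):
    # hand-set weights standing in for a trained maximum entropy model
    score = 2.0 if feats["next_is_capitalised"] or feats["at_end_of_text"] else -2.0
    if feats["prefix_is_abbrev"]:
        score -= 3.0
    return score > 0

text = "Mr. Hill lives in Washington. He works for Mrs. Jones."
for i in candidates(text):
    print(repr(text[max(0, i - 10):i + 1]), is_boundary(features(text, i)))
```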
DATASET_PACSUM/dataset/inputs/A97-1011.txt ADDED
@@ -0,0 +1 @@
+ we are concerned with surface-syntactic parsing of running text. our main goal is to describe syntactic analyses of sentences using dependency links that show the head-modifier relations between words. in addition, these links have labels that refer to the syntactic function of the modifying word. a simplified example is in figure 1, where the link between i and see denotes that i is the modifier of see and its syntactic function is that of subject. similarly, a modifies bird, and it is a determiner. first, in this paper, we explain some central concepts of the constraint grammar framework from which many of the ideas are derived. then, we give some linguistic background to the notations we are using, with a brief comparison to other current dependency formalisms and systems. new formalism is described briefly, and it is utilised in a small toy grammar to illustrate how the formalism works. finally, the real parsing system, with a grammar of some 2 500 rules, is evaluated. the parser corresponds to over three man-years of work, which does not include the lexical analyser and the morphological disambiguator, both parts of the existing english constraint grammar parser (karlsson et al., 1995). the parsers can be tested via www'.voutilainen and juha heikkild created the original engcg lexicon. we are using atro voutilainen's (1995) improved part-of-speech disambiguation grammar which runs in the cg-2 parser. the parsers can be tested via www'. we are concerned with surface-syntactic parsing of running text. in this paper, we have presented some main features of our new framework for dependency syntax. however, the comparison to other current systems suggests that our dependency parser is very promising both theoretically and practically. our work is partly based on the work done with the constraint grammar framework that was originally proposed by fred karlsson (1990). for instance, the results are not strictly comparable because the syntactic description is somewhat different. the evaluation was done using small excerpts of data, not used in the development of the system. our main goal is to describe syntactic analyses of sentences using dependency links that show the head-modifier relations between words. the distinction between the complements and the adjuncts is vague in the implementation; neither the complements nor the adjuncts are obligatory. means that a nominal head (nom-head is a set that contains part-of-speech tags that may represent a nominal head) may not appear anywhere to the left (not *-1). this &quot;anywhere&quot; to the left or right may be restricted by barriers, which restrict the area of the test.
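The excerpt above represents an analysis as labelled head-modifier links, as in "i" modifying "see" with the function subject. A minimal data-structure sketch of such links follows; the function labels are illustrative, not the parser's actual syntactic tag inventory.

```python
from collections import namedtuple

Link = namedtuple("Link", ["modifier", "head", "function"])

# labelled dependency links for the example "i see a bird"
analysis = [
    Link(modifier="i", head="see", function="subject"),
    Link(modifier="bird", head="see", function="object"),
    Link(modifier="a", head="bird", function="determiner"),
]

def modifiers_of(head, links):
    """All words that attach to the given head, with their link labels."""
    return [(link.modifier, link.function) for link in links if link.head == head]

print(modifiers_of("see", analysis))   # [('i', 'subject'), ('bird', 'object')]
```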
DATASET_PACSUM/dataset/inputs/A97-1014.txt ADDED
@@ -0,0 +1 @@
+ the work reported in this paper aims at providing syntactically annotated corpora (treebanks') for stochastic grammar induction. in particular, we focus on several methodological issues concerning the annotation of non-configurational languages. in section 2, we examine the appropriateness of existing annotation schemes. on the basis of these considerations, we formulate several additional requirements. a formalism complying with these requirements is described in section 3. section 4 deals with the treatment of selected phenomena. for a description of the annotation tool see section 5.for a description of the annotation tool see section 5. its extension is subject to further investigations. as the annotation scheme described in this paper focusses on annotating argument structure rather than constituent trees, it differs from existing treebanks in several aspects. the work reported in this paper aims at providing syntactically annotated corpora (treebanks') for stochastic grammar induction. these differences can be illustrated by a comparison with the penn treebank annotation scheme. a uniform representation of local and non-local dependencies makes the structure more transparent'. partial automation included in the current version significantly reduces the manna.1 effort. the development of linguistically interpreted corpora presents a laborious and time-consuming task. owing to the partial automation, the average annotation efficiency improves by 25% (from around 4 minutes to 3 minutes per sentence). combining raw language data with linguistic information offers a promising basis for the development of new efficient and robust nlp methods. such a word order independent representation has the advantage of all structural information being encoded in a single data structure. in order to make the annotation process more efficient, extra effort has been put. into the development of an annotation tool. realworld texts annotated with different strata of linguistic information can be used for grammar induction.
DATASET_PACSUM/dataset/inputs/A97-1029.txt ADDED
@@ -0,0 +1 @@
+ in the past decade, the speech recognition community has had huge successes in applying hidden markov models, or hmm's to their problems. more recently, the natural language processing community has effectively employed these models for part-ofspeech tagging, as in the seminal (church, 1988) and other, more recent efforts (weischedel et al., 1993). we would now propose that hmm's have successfully been applied to the problem of name-finding. we have built a named-entity (ne) recognition system using a slightly-modified version of an hmm; we call our system &quot;nymble&quot;. to our knowledge, nymble out-performs the best published results of any other learning name-finder. furthermore, it performs at or above the 90% accuracy level, often considered &quot;near-human performance&quot;. the system arose from the ne task as specified in the last message understanding conference (muc), where organization names, person names, location names, times, dates, percentages and money amounts were to be delimited in text using sgml-markup. we will describe the various models employed, the methods for training these models and the method for &quot;decoding&quot; on test data (the term &quot;decoding&quot; borrowed from the speech recognition community, since one goal of traversing an hmm is to recover the hidden state sequence). to date, we have successfully trained and used the model on both english and spanish, the latter for met, the multi-lingual entity task.given the incredibly difficult nature of many nlp tasks, this example of a learned, stochastic approach to name-finding lends credence to the argument that the nlp community ought to push these approaches, to find the limit of phenomena that may be captured by probabilistic, finite-state methods. in the past decade, the speech recognition community has had huge successes in applying hidden markov models, or hmm's to their problems. we have shown that using a fairly simple probabilistic model, finding names and other numerical entities as specified by the muc tasks can be performed with &quot;near-human performance&quot;, often likened to an f of 90 or above. to date, we have successfully trained and used the model on both english and spanish, the latter for met, the multi-lingual entity task. we would like to incorporate the following into the current model: while our initial results have been quite favorable, there is still much that can be done potentially to improve performance and completely close the gap between learned and rule-based name-finding systems. the basic premise of the approach is to consider the raw text encountered when decoding as though it had passed through a noisy channel, where it had been originally marked with named entities.'
DATASET_PACSUM/dataset/inputs/A97-1030.txt ADDED
@@ -0,0 +1 @@
+ text processing applications, such as machine translation systems, information retrieval systems or natural-language understanding systems, need to identify multi-word expressions that refer to proper names of people, organizations, places, laws and other entities. when encountering mrs. candy hill in input text, for example, a machine translation system should not attempt to look up the translation of candy and hill, but should translate mrs. to the appropriate personal title in the target language and preserve the rest of the name intact. similarly, an information retrieval system should not attempt to expand candy to all of its morphological variants or suggest synonyms (wacholder et al. 1994). the need to identify proper names has two aspects: the recognition of known names and the discovery of new names. since obtaining and maintaining a name database requires significant effort, many applications need to operate in the absence of such a resource. without a database, names need to be discovered in the text and linked to entities they refer to. even where name databases exist, text needs to be scanned for new names that are formed when entities, such as countries or commercial companies, are created, or for unknown names which become important when the entities they refer to become topical. this situation is the norm for dynamic applications such as news providing services or internet information indexing. the next section describes the different types of proper name ambiguities we have observed. section 3 discusses the role of context and world knowledge in their disambiguation; section 4 describes the process of name discovery as implemented in nominator, a module for proper name recognition developed at the ibm t.j. watson research center. sections 5-7 elaborate on nominator's disambiguation heuristics.sections 5-7 elaborate on nominator's disambiguation heuristics. ambiguity remains one of the main challenges in the processing of natural language text. because of these difficulties, we believe that for the forseeable future, practical applications to discover new names in text will continue to require the sort of human effort invested in nominator. text processing applications, such as machine translation systems, information retrieval systems or natural-language understanding systems, need to identify multi-word expressions that refer to proper names of people, organizations, places, laws and other entities. an evaluation of an earlier version of nominator, was performed on 88 wall street journal documents (nist 1993) that had been set aside for testing. in the rest of the paper we describe the resources and heuristics we have designed and implemented in nominator and the extent to which they resolve these ambiguities. name identification requires resolution of a subset of the types of structural and semantic ambiguities encountered in the analysis of nouns and noun phrases (nps) in natural language processing. all of these ambiguities must be dealt with if proper names are to be identified correctly. it assigns weak types such as ?human or fails to assign a type if the available information is not sufficient.
DATASET_PACSUM/dataset/inputs/A97-1039.txt ADDED
@@ -0,0 +1 @@
+ systems that generate natural language output as part of their interaction with a user have become a major area of research and development. typically, natural language generation is divided into several phases, namely text planning (determining output content and structure), sentence planning (determining abstract target language resources to express content, such as lexical items and syntactic constructions), and realization (producing the final text string) (reiter, 1994). while text and sentence planning may sometimes be combined, a realizer is almost always included as a distinct module. it is in the realizer that knowledge about the target language resides (syntax, morphology, idiosyncratic properties of lexical items). realization is fairly well understood both from a linguistic and from a computational point of view, and therefore most projects that use text generation do not include the realizer in the scope of their research. instead, such projects use an off-the-shelf realizer, among which penman (bateman, 1996) and surge/fuf (elhadad and robin, 1996) are probably the most popular. in this technical note and demo we present a new off-theshelf realizer, realpro. realpro is derived from previous systems (iordanskaja et al., 1988; iordanslcaja et al., 1992; rambow and korelsky, 1992), but represents a new design and a completely new implementation. realpro has the following characteristics, which we believe are unique in this combination: we reserve a more detailed comparison with penman and fuf, as well as with alethgen/gl (coch, 1996) (which is perhaps the system most similar to realpro, since they are based on the same linguistic theory and are both implemented with speed in mind), for a more extensive paper. this technical note presents realpro, concentrating on its structure, its coverage, its interfaces, and its performance.systems that generate natural language output as part of their interaction with a user have become a major area of research and development. this technical note presents realpro, concentrating on its structure, its coverage, its interfaces, and its performance. we are grateful to r. kittredge, t. korelsky, d. mccullough, a. nasr, e. reiter, and m. white as well as to three anonymous reviewers for helpful comments about earlier drafts of this technical note and/or about realpro. the input to realpro is a syntactic dependency structure. the development of realpro was partially supported by usaf rome laboratory under contracts f3060293-c-0015, f30602-94-c-0124, and f30602-92-c-0163, and by darpa under contracts f30602-95-2-0005 and f30602-96-c-0220. this means that realpro gives the developer control over the output, while taking care of the linguistic details. realpro is licensed free of charge to qualified academic institutions, and is licensed for a fee to commercial sites. the system is fully operational, runs on pc as well as on unix work stations, and is currently used in an application we have developed (lavoie et al., 1997) as well as in several on-going projects (weather report generation, machine translation, project report generation). the architecture of realpro is based on meaningtext theory, which posits a sequence of correspondences between different levels of representation.
DATASET_PACSUM/dataset/inputs/A97-1052.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ predicate subcategorization is a key component of a lexical entry, because most, if not all, recent syntactic theories 'project' syntactic structure from the lexicon. therefore, a wide-coverage parser utilizing such a lexicalist grammar must have access to an accurate and comprehensive dictionary encoding (at a minimum) the number and category of a predicate's arguments and ideally also information about control with predicative arguments, semantic selection preferences on arguments, and so forth, to allow the recovery of the correct predicate-argument structure. if the parser uses statistical techniques to rank analyses, it is also critical that the dictionary encode the relative frequency of distinct subcategorization classes for each predicate. several substantial machine-readable subcategorization dictionaries exist for english, either built largely automatically from machine-readable versions of conventional learners' dictionaries, or manually by (computational) linguists (e.g. the alvey nl tools (anlt) dictionary, boguraev et al. (1987); the comlex syntax dictionary, grishman et al. (1994)). unfortunately, neither approach can yield a genuinely accurate or comprehensive computational lexicon, because both rest ultimately on the manual efforts of lexicographers / linguists and are, therefore, prone to errors of omission and commission which are hard or impossible to detect automatically (e.g. boguraev & briscoe, 1989; see also section 3.1 below for an example). furthermore, manual encoding is labour intensive and, therefore, it is costly to extend it to neologisms, information not currently encoded (such as relative frequency of different subcategorizations), or other (sub)languages. these problems are compounded by the fact that predicate subcategorization is closely associated to lexical sense and the senses of a word change between corpora, sublanguages and/or subject domains (jensen, 1991). in a recent experiment with a wide-coverage parsing system utilizing a lexicalist grammatical framework, briscoe & carroll (1993) observed that half of parse failures on unseen test data were caused by inaccurate subcategorization information in the anlt dictionary. the close connection between sense and subcategorization and between subject domain and sense makes it likely that a fully accurate 'static' subcategorization dictionary of a language is unattainable in any case. moreover, although schabes (1992) and others have proposed `lexicalized' probabilistic grammars to improve the accuracy of parse ranking, no wide-coverage parser has yet been constructed incorporating probabilities of different subcategorizations for individual predicates, because of the problems of accurately estimating them. these problems suggest that automatic construction or updating of subcategorization dictionaries from textual corpora is a more promising avenue to pursue. preliminary experiments acquiring a few verbal subcategorization classes have been reported by brent (1991, 1993), manning (1993), and ushioda et at. (1993). in these experiments the maximum number of distinct subcategorization classes recognized is sixteen, and only ushioda et at. attempt to derive relative subcategorization frequency for individual predicates. we describe a new system capable of distinguishing 160 verbal subcategorization classes—a superset of those found in the anlt and comlex syntax dictionaries. 
the classes also incorporate information about control of predicative arguments and alternations such as particle movement and extraposition. we report an initial experiment which demonstrates that this system is capable of acquiring the subcategorization classes of verbs and the relative frequencies of these classes with comparable accuracy to the less ambitious extant systems. we achieve this performance by exploiting a more sophisticated robust statistical parser which yields complete though 'shallow' parses, a more comprehensive subcategorization class classifier, and a priori estimates of the probability of membership of these classes. we also describe a small-scale experiment which demonstrates that subcategorization class frequency information for individual verbs can be used to improve parsing accuracy.predicate subcategorization is a key component of a lexical entry, because most, if not all, recent syntactic theories 'project' syntactic structure from the lexicon. the experiment and comparison reported above suggests that our more comprehensive subcategorization class extractor is able both to assign classes to individual verbal predicates and also to rank them according to relative frequency with comparable accuracy to extant systems. boguraev & briscoe, 1987). we achieve this performance by exploiting a more sophisticated robust statistical parser which yields complete though 'shallow' parses, a more comprehensive subcategorization class classifier, and a priori estimates of the probability of membership of these classes. we also describe a small-scale experiment which demonstrates that subcategorization class frequency information for individual verbs can be used to improve parsing accuracy. therefore, a wide-coverage parser utilizing such a lexicalist grammar must have access to an accurate and comprehensive dictionary encoding (at a minimum) the number and category of a predicate's arguments and ideally also information about control with predicative arguments, semantic selection preferences on arguments, and so forth, to allow the recovery of the correct predicate-argument structure. brent's (1993) approach to acquiring subcategorization is based on a philosophy of only exploiting unambiguous and determinate information in unanalysed corpora.
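The acquisition procedure described in the passage above ultimately comes down to tallying, for each verb, how often each hypothesised subcategorization frame is observed and converting the counts into relative frequencies. The sketch below is a minimal, hypothetical illustration of that final tallying step; the verbs, frame labels and noise threshold are invented for illustration and are not the paper's 160-class inventory or its filtering procedure.

```python
from collections import Counter, defaultdict

def subcat_frequencies(observations, min_count=3):
    """Turn (verb, frame) observations into per-verb relative frequencies.

    observations: iterable of (verb, frame) pairs, e.g. produced by a
    shallow parser plus a frame classifier.  Frames seen fewer than
    min_count times for a verb are treated as noise and dropped.
    """
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1

    lexicon = {}
    for verb, frames in counts.items():
        kept = {f: c for f, c in frames.items() if c >= min_count}
        total = sum(kept.values())
        if total:
            lexicon[verb] = {f: c / total for f, c in kept.items()}
    return lexicon

# hypothetical toy observations; frame names are invented labels
obs = [("give", "NP_NP"), ("give", "NP_PP_to"), ("give", "NP_NP"),
       ("give", "NP_NP"), ("give", "NP_PP_to"), ("give", "NP_PP_to"),
       ("give", "NP"), ("sleep", "INTRANS")] * 2
print(subcat_frequencies(obs))
```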
DATASET_PACSUM/dataset/inputs/C00-1007.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ moreover, in many cases it is very important not to deviate from certain linguistic standards in generation, in which case hand-crafted grammars give excellent control. however, in other applications for nlg the variety of the output is much bigger, and the demands on the quality of the output somewhat less stringent. a typical example is nlg in the context of (interlingua- or transfer-based) machine translation. another reason for relaxing the quality of the output may be that not enough time is available to develop a full grammar for a new target language in nlg. in all these cases, stochastic ("empiricist") methods provide an alternative to hand-crafted ("rationalist") approaches to nlg. to our knowledge, the first to use stochastic techniques in nlg were langkilde and knight (1998a) and (1998b). in this paper, we present fergus (flexible empiricist/rationalist generation using syntax). fergus follows langkilde and knight's seminal work in using an n-gram language model, but we augment it with a tree-based stochastic model and a traditional tree-based syntactic grammar. more recent work on aspects of stochastic generation includes (langkilde and knight, 2000), (malouf, 1999) and (ratnaparkhi, 2000). before we describe in more detail how we use stochastic models in nlg, we recall the basic tasks in nlg (rambow and korelsky, 1992; reiter, 1994). during text planning, content and structure of the target text are determined to achieve the overall communicative goal. during sentence planning, linguistic means - in particular, lexical and syntactic means - are determined to convey smaller pieces of meaning. during realization, the specification chosen in sentence planning is transformed into a surface string, by linearizing and inflecting words in the sentence (and typically, adding function words). as in the work by langkilde and knight, our work ignores the text planning stage, but it does address the sentence planning and the realization stages. the structure of the paper is as follows. exploiting a probabilistic hierarchical model for generation. srinivas bangalore and owen rambow, at&t labs research, 180 park avenue, florham park, nj 07932.
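One component named in the passage above is an n-gram language model used to choose among alternative surface strings. The snippet below is only a schematic stand-in for that idea - an add-one-smoothed bigram model trained on a toy corpus - and is not the FERGUS tree model or its actual training setup.

```python
import math
from collections import Counter

corpus = [["the", "dog", "barked"], ["the", "cat", "slept"],
          ["a", "dog", "slept"]]            # toy training sentences (invented)

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]
    unigrams.update(toks[:-1])              # histories only
    bigrams.update(zip(toks[:-1], toks[1:]))
vocab = len(unigrams) + 1                   # +1 for unseen histories

def logprob(sentence):
    """Add-one smoothed bigram log-probability of a candidate string."""
    toks = ["<s>"] + sentence + ["</s>"]
    score = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        score += math.log((bigrams[(prev, cur)] + 1) /
                          (unigrams[prev] + vocab))
    return score

# rank two candidate linearizations of the same content words
candidates = [["the", "dog", "slept"], ["dog", "the", "slept"]]
print(max(candidates, key=logprob))
```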
DATASET_PACSUM/dataset/inputs/C00-1044.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ such features include sense, register, domain specificity, pragmatic restrictions on usage, semantic markedness, and orientation, as well as automatically identified links between words (e.g., semantic relatedness, synonymy, antonymy, and meronymy). automatically learning features of this type from huge corpora allows the construction or augmentation of lexicons, and the assignment of semantic labels to words and phrases in running text. this information in turn can be used to help determine additional features at the lexical, clause, sentence, or document level. this paper explores the benefits that some lexical features of adjectives offer for the prediction of a contextual sentence-level feature, subjectivity. subjectivity in natural language refers to aspects of language used to express opinions and evaluations. the computational task addressed here is to distinguish sentences used to present opinions and other forms of subjectivity (subjective sentences, e.g., "at several different layers, it's a fascinating title") from sentences used to objectively present factual information (objective sentences, e.g., "bell industries inc. increased its quarterly to 10 cents from 7 cents a share"). much research in discourse processing has focused on task-oriented and instructional dialogs. the task addressed here comes to the fore in other genres, especially news reporting and internet forums, in which opinions of various agents are expressed and where subjectivity judgements could help in recognizing inflammatory messages ("flames") and mining online sources for product reviews. other tasks for which subjectivity recognition is potentially very useful include information extraction and information retrieval. assigning subjectivity labels to documents or portions of documents is an example of non-topical characterization of information. current information extraction and retrieval technology focuses almost exclusively on the subject matter of the documents. yet, additional components of a document influence its relevance to particular users or tasks, including, for example, the evidential status of the material presented, and attitudes adopted in favor of or against a particular person, event, or position (e.g., articles on a presidential campaign written to promote a specific candidate). in summarization, subjectivity judgments could be included in document profiles to augment automatically produced document summaries, and to help the user make relevance judgments when using a search engine. other work on subjectivity (wiebe et al., 1999; bruce and wiebe, 2000) has established a positive and statistically significant correlation with the presence of adjectives. effects of adjective orientation and gradability on sentence subjectivity. vasileios hatzivassiloglou, department of computer science, columbia university, new york, ny 10027, [email protected]. janyce m. wiebe, department of computer science, new mexico state university, las cruces, nm 88003.
DATASET_PACSUM/dataset/inputs/C00-1072.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ topic signatures can be used to identify the presence of a complex concept - a concept that consists of several related components in fixed relationships. restaurant-visit, for example, involves at least the concepts menu, eat, pay, and possibly waiter, and dragon boat festival (in taiwan) involves the concepts calamus (a talisman to ward off evil), moxa (something with the power of preventing pestilence and strengthening health), pictures of chung kuei (a nemesis of evil spirits), eggs standing on end, etc. only when the concepts co-occur is one licensed to infer the complex concept; eat or moxa alone, for example, are not sufficient. at this time, we do not consider the interrelationships among the concepts. since many texts may describe all the components of a complex concept without ever explicitly mentioning the underlying complex concept - a topic - itself, systems that have to identify topic(s), for summarization or information retrieval, require a method of inferring complex concepts from their component words in the text. 2 related work. in the late 1970s, dejong (dejong, 1982) developed a system called frump (fast reading understanding and memory program) to skim newspaper stories and extract the main details. frump uses a data structure called sketchy script to organize its world knowledge. each sketchy script is what frump knows about what can occur in particular situations such as demonstrations, earthquakes, labor strikes, and so on. frump selects a particular sketchy script based on clues to stylized events in news articles. in other words, frump selects an empty template whose slots will be filled on the fly as frump reads a news article. a summary is generated based on what has been captured or filled in the template. the recent success of information extraction research has encouraged the frump approach. the summons (summarizing online news articles) system (mckeown and radev, 1999) takes template outputs of information extraction systems developed for the muc conference and generates summaries of multiple news articles. frump and summons both rely on prior knowledge of their domains. however, to acquire such prior knowledge is labor-intensive and time-consuming. for example, the university of massachusetts circus system used in the muc-3 (saic, 1998) terrorism domain required about 1500 person-hours to define extraction patterns (riloff, 1996). the automated acquisition of topic signatures for text summarization. chin-yew lin and eduard hovy, information sciences institute, university of southern california, marina del rey, ca 90292, usa, {cyl,hovy}@isi.edu. abstract: in order to produce a good summary, one has to identify the most relevant portions of a given text.
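The computation the passage above leads up to is scoring how strongly each term is associated with a topic by contrasting its frequency in on-topic documents with its frequency elsewhere. The sketch below uses a simple smoothed log-ratio rather than the likelihood-ratio statistic used for topic signatures in the paper, so it should be read as an illustrative stand-in; the toy documents are invented.

```python
import math
from collections import Counter

def topic_signature(relevant_docs, background_docs, top_n=5):
    """Rank terms by a smoothed log-ratio of relative frequencies.

    relevant_docs / background_docs: lists of token lists.
    """
    rel = Counter(t for d in relevant_docs for t in d)
    bg = Counter(t for d in background_docs for t in d)
    rel_total, bg_total = sum(rel.values()), sum(bg.values())

    scores = {}
    for term, count in rel.items():
        p_rel = (count + 1) / (rel_total + 1)
        p_bg = (bg[term] + 1) / (bg_total + 1)
        scores[term] = math.log(p_rel / p_bg)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# invented toy documents for a "restaurant-visit" topic
relevant = [["menu", "waiter", "pay", "eat"], ["menu", "eat", "tip"]]
background = [["election", "vote", "pay"], ["stock", "market", "pay"]]
print(topic_signature(relevant, background))
```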
DATASET_PACSUM/dataset/inputs/C00-2136.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ we evaluate exdisco by comparing the performance of discovered patterns against that of manually constructed systems on actual extraction tasks. 0 introduction. information extraction is the selective extraction of specified types of information from natural language text. the information to be extracted may consist of particular semantic classes of objects (entities), relationships among these entities, and events in which these entities participate. the extraction system places this information into a database for retrieval and subsequent processing. in this paper we shall be concerned primarily with the extraction of information about events. in the terminology which has evolved from the message understanding conferences (muc, 1995; muc, 1993), we shall use the term subject domain to refer to a broad class of texts, such as business news, and the term scenario to refer to the specification of the particular events to be extracted. for example, the "management succession" scenario for muc-6, which we shall refer to throughout this paper, involves information about corporate executives starting and leaving positions. the fundamental problem we face in porting an extraction system to a new scenario is to identify the many ways in which information about a type of event may be expressed in the text. typically, there will be a few common forms of expression which will quickly come to mind when a system is being developed. however, the beauty of natural language (and the challenge for computational linguists) is that there are many variants which an imaginative writer can use, and which the system needs to capture. finding these variants may involve studying very large amounts of text in the subject domain. this has been a major impediment to the portability and performance of event extraction systems. we present in this paper a new approach to finding these variants automatically from a large corpus, without the need to read or annotate the corpus. this approach has been evaluated on actual event extraction scenarios. in the next section we outline the structure of our extraction system, and describe the discovery task in the context of this system. automatic acquisition of domain knowledge for information extraction. roman yangarber and ralph grishman, courant institute of mathematical sciences, new york university; pasi tapanainen, conexor oy, helsinki, finland.
DATASET_PACSUM/dataset/inputs/C00-2137.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ when the results are better with the new technique, a question arises as to whether these result differences are due to the new technique actually being better or just due to chance. unfortunately, one usually cannot directly answer the question "what is the probability that the new technique is better given the results on the test data set": p(new technique is better | test set results). but with statistics, one can answer the following proxy question: if the new technique was actually no different than the old technique (the null hypothesis), what is the probability that the results on the test set would be at least this skewed in the new technique's favor (box et al.)? that is, what is p(test set results at least this skewed in the new technique's favor | new technique is no different than the old)? if the probability is small enough (5% is often used as the threshold), then one will reject the null hypothesis and say that the differences in the results are "statistically significant" at that threshold level. this paper examines some of the possible methods for trying to detect statistically significant differences in three commonly used metrics: recall, precision and balanced f-score. many of these methods are found to be problematic in a set of experiments that are performed. these methods have a tendency to underestimate the significance of the results, which tends to make one believe that some new technique is no better than the current technique even when it is. this underestimate comes from these methods assuming that the techniques being compared produce independent results, when in our experiments the techniques being compared tend to produce positively correlated results. to handle this problem, we point out some statistical tests, like the matched-pair t, sign and wilcoxon tests (harnett, 1982, sec. 8.7 and 15.5), which do not make this assumption. one can use these tests on the recall metric, but the precision and balanced f-score metrics have too complex a form for these tests. for such complex metrics, we use a compute-intensive randomization test (cohen, 1995, sec. 5.3), which also avoids this independence assumption. more accurate tests for the statistical significance of result differences. alexander yeh, mitre corp., 202 burlington rd. this paper reports on work performed at the mitre corporation under the support of the mitre sponsored research program. warren greiff, lynette hirschman, christine doran, john henderson, kenneth church, ted dunning, wessel kraaij, mitch marcus and an anonymous reviewer provided helpful suggestions. copyright 2000 the mitre corporation. all rights reserved.
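For the compute-intensive randomization test mentioned at the end of the passage above, the idea is to repeatedly swap the two systems' outputs on randomly chosen test cases and measure how often the resulting f-score gap is at least as large as the one actually observed. The following is a minimal sketch of such a paired randomization test; the per-document counts, the number of shuffles and the two-sided formulation are illustrative choices, not the paper's exact protocol.

```python
import random

def fscore(tp, fp, fn):
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def f_from_counts(counts):
    """Pool per-case (tp, fp, fn) triples and compute a single F-score."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return fscore(tp, fp, fn)

def randomization_test(sys_a, sys_b, trials=1000, seed=0):
    """Approximate p-value for the observed F-score difference.

    sys_a, sys_b: per-test-case (tp, fp, fn) triples for two systems,
    aligned so that index i refers to the same test case in both.
    """
    rng = random.Random(seed)
    observed = abs(f_from_counts(sys_a) - f_from_counts(sys_b))
    hits = 0
    for _ in range(trials):
        a, b = [], []
        for ca, cb in zip(sys_a, sys_b):
            if rng.random() < 0.5:        # swap the two systems' outputs
                ca, cb = cb, ca
            a.append(ca)
            b.append(cb)
        if abs(f_from_counts(a) - f_from_counts(b)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# invented per-document (tp, fp, fn) counts for an old and a new system
old = [(8, 2, 3), (5, 1, 4), (7, 3, 2), (6, 2, 2)]
new = [(9, 1, 2), (6, 1, 3), (7, 2, 2), (7, 2, 1)]
print(randomization_test(old, new))
```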
DATASET_PACSUM/dataset/inputs/C00-2163.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ here f_1^J = f denotes the (french) source and e_1^I = e denotes the (english) target string. most smt models (brown et al., 1993; vogel et al., 1996) try to model word-to-word correspondences between source and target words using an alignment mapping from source position j to target position i = a_j. we can rewrite the probability pr(f_1^J | e_1^I) by introducing the hidden alignments a_1^J := a_1 ... a_j ... a_J (a_j in {0, ..., I}): pr(f_1^J | e_1^I) = sum_{a_1^J} pr(f_1^J, a_1^J | e_1^I) = sum_{a_1^J} prod_{j=1}^{J} pr(f_j, a_j | f_1^{j-1}, a_1^{j-1}, e_1^I). to allow for french words which do not directly correspond to any english word, an artificial empty word e_0 is added to the target sentence at position i = 0. the different alignment models we present provide different decompositions of pr(f_1^J, a_1^J | e_1^I). an alignment a_1^J for which a_1^J = argmax_{a_1^J} pr(f_1^J, a_1^J | e_1^I) holds for a specific model is called the viterbi alignment of this model. in this paper we will describe extensions to the hidden-markov alignment model from (vogel et al., 1996) and compare these to models 1 - 4 of (brown et al., 1993). we propose to measure the quality of an alignment model using the quality of the viterbi alignment compared to a manually-produced alignment. this has the advantage that once having produced a reference alignment, the evaluation itself can be performed automatically. in addition, it results in a very precise and reliable evaluation criterion which is well suited to assess various design decisions in modeling and training of statistical alignment models. it is well known that manually performing a word alignment is a complicated and ambiguous task (melamed, 1998). therefore, to produce the reference alignment we use a refined annotation scheme which reduces the complications and ambiguities occurring in the manual construction of a word alignment. as we use the alignment models for machine translation purposes, we also evaluate the resulting translation quality of different models. 2 alignment with hmm. in the hidden-markov alignment model we assume a first-order dependence for the alignments a_j and that the translation probability depends only on a_j and not on a_{j-1}: pr(f_j, a_j | f_1^{j-1}, a_1^{j-1}, e_1^I) = p(a_j | a_{j-1}, I) * p(f_j | e_{a_j}). a comparison of alignment models for statistical machine translation. franz josef och and hermann ney, lehrstuhl für informatik vi, computer science department, rwth aachen - university of technology, d-52056 aachen, germany.
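To make the HMM decomposition above concrete: each source position j contributes a jump probability p(a_j | a_{j-1}, I) times a lexicon probability p(f_j | e_{a_j}), and the Viterbi alignment is the a_1^J that maximises the product. The toy code below enumerates all alignments of a two-word sentence pair by brute force; the probability tables are invented and the empty word e_0 is ignored for brevity, so this illustrates the model form, not the paper's training or search procedure.

```python
from itertools import product

e = ["the", "house"]                       # e_1 .. e_I (invented example)
f = ["la", "maison"]                       # f_1 .. f_J
lex = {("la", "the"): 0.7, ("la", "house"): 0.1,
       ("maison", "the"): 0.1, ("maison", "house"): 0.8}

def jump(prev, cur, I):                    # crude stand-in for p(a_j | a_{j-1}, I)
    return 0.6 if cur == prev + 1 else 0.4 / max(I - 1, 1)

def hmm_prob(alignment):
    """Pr(f, a | e) = prod_j p(a_j | a_{j-1}, I) * p(f_j | e_{a_j})."""
    I, prob, prev = len(e), 1.0, 0         # start from dummy position 0
    for j, a_j in enumerate(alignment):
        prob *= jump(prev, a_j, I) * lex.get((f[j], e[a_j - 1]), 1e-4)
        prev = a_j
    return prob

# Viterbi alignment by exhaustive search over a_1^J in {1..I}^J
alignments = product(range(1, len(e) + 1), repeat=len(f))
best = max(alignments, key=hmm_prob)
print(best, hmm_prob(best))
```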
DATASET_PACSUM/dataset/inputs/C02-1011.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ we address here the problem of base np translation, in which for a given base noun phrase in a source language (e.g., ?information age? in english), we are to find out its possible translation(s) in a target language (e.g., ? in chinese). we define a base np as a simple and non-recursive noun phrase. in many cases, base nps represent holistic and non-divisible concepts, and thus accurate translation of them from one language to another is extremely important in applications like machine translation, cross language information retrieval, and foreign language writing assistance. in this paper, we propose a new method for base np translation, which contains two steps: (1) translation candidate collection, and (2) translation selection. in translation candidate collection, for a given base np in the source language, we look for its translation candidates in the target language. to do so, we use a word-to-word translation dictionary and corpus data in the target language on the web. in translation selection, we determine the possible translation(s) from among the candidates. we use non-parallel corpus data in the two languages on the web and employ one of the two methods which we have developed. in the first method, we view the problem as that of classification and employ an ensemble of na?ve bayesian classifiers constructed with the em algorithm. we will use ?em-nbc-ensemble? to denote this method, hereafter. in the second method, we view the problem as that of calculating similarities between context vectors and use tf-idf vectors also constructed with the em algorithm. we will use ?em-tf-idf? to denote this method. experimental results indicate that our method is very effective, and the coverage and top 3 accuracy of translation at the final stage are 91.4% and 79.8%, respectively. the results are significantly better than those of the baseline methods relying on existing technologies. the higher performance of our method can be attributed to the enormity of the web data used and the employment of the em algorithm.the higher performance of our method can be attributed to the enormity of the web data used and the employment of the em algorithm. we address here the problem of base np translation, in which for a given base noun phrase in a source language (e.g., ?information age? we also acknowledge shenjie li for help with program coding. this paper has proposed a new and effective method for base np translation by using web data and the em algorithm. the results are significantly better than those of the baseline methods relying on existing technologies. in english), we are to find out its possible translation(s) in a target language (e.g., ? 2.1 translation with non-parallel. we conducted experiments on translation of the base nps from english to chinese. experimental results indicate that our method is very effective, and the coverage and top 3 accuracy of translation at the final stage are 91.4% and 79.8%, respectively. in chinese). we extracted base nps (noun-noun pairs) from the encarta 1 english corpus using the tool developed by xun et al(2000). for nagata et al?s method, we found that it was almost impossible to find partial-parallel corpora in the non-web data. they observed that there are many partial parallel corpora between english and japanese on the web, and most typically english translations of japanese terms (words or phrases) are parenthesized and inserted immediately after the japanese terms in documents written in japanese.
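For the EM-TF-IDF variant described in the passage above, translation selection reduces to comparing the context vector of the source base NP with the context vectors of its translation candidates. The snippet below shows only that final cosine-similarity ranking over pre-built tf-idf vectors; the vectors and candidate names are invented and the EM re-estimation of the vectors is omitted.

```python
import math

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_translation(source_vec, candidate_vecs):
    """Rank candidates by closeness of their context vectors to the source's."""
    return sorted(candidate_vecs,
                  key=lambda c: cosine(source_vec, candidate_vecs[c]),
                  reverse=True)

# invented tf-idf context vectors (context word -> weight)
source = {"digital": 0.8, "era": 0.6, "technology": 0.4}
candidates = {
    "candidate_1": {"digital": 0.7, "technology": 0.5, "era": 0.3},
    "candidate_2": {"cooking": 0.9, "recipe": 0.6},
}
print(select_translation(source, candidates))
```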
DATASET_PACSUM/dataset/inputs/C02-1054.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ named entity (ne) recognition is a task in whichproper nouns and numerical information in a docu ment are detected and classified into categories suchas person, organization, and date. it is a key technol ogy of information extraction and open-domain question answering (voorhees and harman, 2000). we are building a trainable open-domain question answering system called saiqa-ii. in this paper, we show that an ne recognizer based on support vector machines (svms) gives better scores thanconventional systems. svms have given high per formance in various classification tasks (joachims, 1998; kudo and matsumoto, 2001). however, it turned out that off-the-shelf svm classifiers are too inefficient for ne recognition. the recognizer runs at a rate of only 85 bytes/sec on an athlon 1.3 ghz linux pc, while rule-based systems (e.g., isozaki, (2001)) can process several kilobytes in a second. the major reason is the inefficiency of svm classifiers. there are otherreports on the slowness of svm classifiers. another svm-based ne recognizer (yamada and mat sumoto, 2001) is 0.8 sentences/sec on a pentium iii 933 mhz pc. an svm-based part-of-speech (pos). tagger (nakagawa et al, 2001) is 20 tokens/sec on an alpha 21164a 500 mhz processor. it is difficult to use such slow systems in practical applications. in this paper, we present a method that makes the ne system substantially faster. this method can also be applied to other tasks in natural languageprocessing such as chunking and pos tagging. another problem with svms is its incomprehensibil ity. it is not clear which features are important or how they work. the above method is also useful for finding useless features. we also mention a method to reduce training time. 1.1 support vector machines. suppose we have a set of training data for a two class problem: , where ffflfi is a feature vector of the ffi -th sample in the training data and !$#%# is the label forthe sample. the goal is to find a decision func tion that accurately predicts for unseen . a non-linear svm classifier gives a decision function ( ) * sign ,+-) for an input vector where +-) .* / 0 21)3 546879: !6; here, () *=!$# means is a member of a cer tain class and () $* # means is not a mem ber. 7 s are called support vectors and are repre sentatives of training examples. is the numberof support vectors. therefore, computational com plexity of +?) is proportional to . support vectorsand other constants are determined by solving a cer tain quadratic programming problem. 4687@ is akernel that implicitly maps vectors into a higher di mensional space. typical kernels use dot products: 4687@ a*cbed7@ . a polynomial kernel of degree fis given by bg? *hi#j!kg l . we can use vari mm m m n m m m m m m m m m n m o o o o o n o o o o o o o o o o o o m : positive example, o : negative example n m , n o : support vectors figure 1: support vector machine ous kernels, and the design of an appropriate kernel for a particular application is an important research issue.figure 1 shows a linearly separable case. the de cision hyperplane defined by +-) p*rq separatespositive and negative examples by the largest mar gin. the solid line indicates the decision hyperplaneand two parallel dotted lines indicate the margin be tween positive and negative examples. since such aseparating hyperplane may not exist, a positive pa rameter s is introduced to allow misclassifications. see vapnik (1995). 1.2 svm-based ne recognition. 
as far as we know, the first svm-based ne system was proposed by yamada et al (2001) for japanese.his system is an extension of kudo?s chunking sys tem (kudo and matsumoto, 2001) that gave the best performance at conll-2000 shared tasks. in theirsystem, every word in a sentence is classified sequentially from the beginning or the end of a sen tence. however, since yamada has not compared it with other methods under the same conditions, it is not clear whether his ne system is better or not. here, we show that our svm-based ne system ismore accurate than conventional systems. our sys tem uses the viterbi search (allen, 1995) instead of sequential determination.for training, we use ?crl data?, which was prepared for irex (information retrieval and extrac tion exercise1, sekine and eriguchi (2000)). it has about 19,000 nes in 1,174 articles. we also use additional data by isozaki (2001). both datasets are based on mainichi newspaper?s 1994 and 1995 cd-roms. we use irex?s formal test data calledgeneral that has 1,510 named entities in 71 ar ticles from mainichi newspaper of 1999. systems are compared in terms of general?s f-measure 1http://cs.nyu.edu/cs/projects/proteus/irexwhich is the harmonic mean of ?recall? and ?preci sion? and is defined as follows. recall = m/(the number of correct nes), precision = m/(the number of nes extracted by a system), where m is the number of nes correctly extracted and classified by the system.we developed an svm-based ne system by following our ne system based on maximum entropy (me) modeling (isozaki, 2001). we sim ply replaced the me model with svm classifiers.the above datasets are processed by a morphological analyzer chasen 2.2.12. it tokenizes a sen tence into words and adds pos tags. chasen uses about 90 pos tags such as common-noun and location-name. since most unknown words are proper nouns, chasen?s parameters for unknownwords are modified for better results. then, a char acter type tag is added to each word. it uses 17character types such as all-kanji and small integer. see isozaki (2001) for details. now, japanese ne recognition is solved by theclassification of words (sekine et al, 1998; borth wick, 1999; uchimoto et al, 2000). for instance, the words in ?president george herbert bush saidclinton is . . . are classified as follows: ?president? = other, ?george? = person-begin, ?her bert? = person-middle, ?bush? = person-end, ?said? = other, ?clinton? = person-single, ?is? = other. in this way, the first word of a person?s name is labeled as person-begin. the last word is labeled as person-end. other words in the nameare person-middle. if a person?s name is expressed by a single word, it is labeled as person single. if a word does not belong to any namedentities, it is labeled as other. since irex de fines eight ne classes, words are classified into 33 ( *utwvex!k# ) categories.each sample is represented by 15 features be cause each word has three features (part-of-speech tag, character type, and the word itself), and two preceding words and two succeeding words are also used for context dependence. although infrequent features are usually removed to prevent overfitting, we use all features because svms are robust. each sample is represented by a long binary vector, i.e., a sequence of 0 (false) and 1 (true). for instance, ?bush? in the above example is represented by a 2http://chasen.aist-nara.ac.jp/ vector p*yg[z\#^]_ g[z `a] described below. only 15 elements are 1. bdcfe8ghji // current word is not ?alice? bdc klghme // current word is ?bush? 
bdc nghji // current word is not ?charlie? : bdcfe^opikpqpghme // current pos is a proper noun bdcfe^opinipghji // current pos is not a verb : bdc nqre^sre ghji // previous word is not ?henry? bdc nqre^skghme // previous word is ?herbert? :here, we have to consider the following problems. first, svms can solve only a two-class problem. therefore, we have to reduce the above multi class problem to a group of two-class problems. second, we have to consider consistency among word classes in a sentence. for instance, a word classified as person-begin should be followed by person-middle or person-end. it impliesthat the system has to determine the best combina tions of word classes from numerous possibilities.here, we solve these problems by combining exist ing methods. there are a few approaches to extend svms to cover t -class problems. here, we employ the ?oneclass versus all others? approach. that is, each clas sifier (%u ) is trained to distinguish members of a class v from non-members. in this method, two or more classifiers may give !$# to an unseen vector or no classifier may give !$# . one common way to avoid such situations is to compare + u ) values and to choose the class index v of the largest + u ) . the consistency problem is solved by the viterbi search. since svms do not output probabilities, we use the svm+sigmoid method (platt, 2000). that is, we use a sigmoid function wxg? j*y#zi#{! |l}~ {g to map + u ) to a probability-like value. the output of the viterbi search is adjusted by a postprocessor for wrong word boundaries. the adjustment rules are also statistically determined (isozaki, 2001). 1.3 comparison of ne recognizers. we use a fixed value ?* #q9q . f-measures are not very sensitive to  unless  is too small. whenwe used 1,038,986 training vectors, general?s f measure was 89.64% for ?*?q?# and 90.03% for 6*?#q9q . we employ the quadratic kernel ( f *y? ) because it gives the best results. polynomial kernels of degree 1, 2, and 3 resulted in 83.03%, 88.31%, f-measure (%) ? ? rg+dt ? ? me ? ? svm 0 20 40 60 80 100 120 crl data ???e? ?^??:??? 76 78 80 82 84 86 88 90 number of nes in training data ( ?? ) figure 2: f-measures of ne systems and 87.04% respectively when we used 569,994 training vectors. figure 2 compares ne recognizers in terms ofgeneral?s f-measures. ?svm? in the figure in dicates f-measures of our system trained by kudo?s tinysvm-0.073 with s?*?q?# . it attained 85.04% when we used only crl data. ?me? indicates our me system and ?rg+dt? indicates a rule-basedmachine learning system (isozaki, 2001). according to this graph, ?svm? is better than the other sys tems.however, svm classifiers are too slow. fa mous svm-light 3.50 (joachims, 1999) took 1.2 days to classify 569,994 vectors derived from 2 mb documents. that is, it runs at only 19 bytes/sec. tinysvm?s classifier seems best optimized among publicly available svm toolkits, but it still works at only 92 bytes/sec.our svm-based ne recognizer attained f = 90.03%. we also thank shigeru katagiri and ken-ichiro ishii for their support. named entity (ne) recognition is a task in whichproper nouns and numerical information in a docu ment are detected and classified into categories suchas person, organization, and date. tinysvm?s classifier seems best optimized among publicly available svm toolkits, but it still works at only 92 bytes/sec. that is, it runs at only 19 bytes/sec. it is a key technol ogy of information extraction and open-domain question answering (voorhees and harman, 2000). 
fa mous svm-light 3.50 (joachims, 1999) took 1.2 days to classify 569,994 vectors derived from 2 mb documents. is better than the other sys tems.however, svm classifiers are too slow. in this paper, we show that an ne recognizer based on support vector machines (svms) gives better scores thanconventional systems. we are building a trainable open-domain question answering system called saiqa-ii. according to this graph, ?svm? svms have given high per formance in various classification tasks (joachims, 1998; kudo and matsumoto, 2001). indicates a rule-basedmachine learning system (isozaki, 2001). ?me? indicates our me system and ?rg+dt? however, it turned out that off-the-shelf svm classifiers are too inefficient for ne recognition.
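The passage above describes mapping one-vs-rest SVM decision values to probability-like scores with a sigmoid and then choosing a consistent label sequence with a Viterbi search. The fragment below illustrates only the sigmoid mapping and a per-token argmax over classes; the margins and sigmoid parameters are invented, and the Viterbi step over label-sequence constraints is left out.

```python
import math

def sigmoid(margin, a=-2.0, b=0.0):
    """Platt-style mapping of an SVM margin to (0, 1); a and b are assumed."""
    return 1.0 / (1.0 + math.exp(a * margin + b))

def best_class(decision_values):
    """decision_values: {class_label: f_c(x)} from one-vs-rest SVMs."""
    probs = {c: sigmoid(v) for c, v in decision_values.items()}
    return max(probs, key=probs.get), probs

# invented margins for one token, e.g. the word "bush" in context
margins = {"PERSON-END": 1.3, "LOCATION-SINGLE": -0.2, "OTHER": -1.1}
print(best_class(margins))
```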
DATASET_PACSUM/dataset/inputs/C02-1114.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ semantic knowledge for particular domains isincreasingly important in nlp. many applications such as word-sense disambiguation, in formation extraction and speech recognitionall require lexicons. the coverage of handbuilt lexical resources such as wordnet (fellbaum, 1998) has increased dramatically in re cent years, but leaves several problems andchallenges. coverage is poor in many criti cal, rapidly changing domains such as current affairs, medicine and technology, where much time is still spent by human experts employed to recognise and classify new terms. mostlanguages remain poorly covered in compari son with english. hand-built lexical resourceswhich cannot be automatically updated can of ten be simply misleading. for example, using wordnet to recognise that the word apple refers to a fruit or a tree is a grave error in the many situations where this word refers to a computer manufacturer, a sense which wordnet does notcover. for nlp to reach a wider class of appli cations in practice, the ability to assemble andupdate appropriate semantic knowledge auto matically will be vital. this paper describes a method for arranging semantic information into a graph (bolloba?s, 1998), where the nodes are words and the edges(also called links) represent relationships be tween words. the paper is arranged as follows. section 2 reviews previous work on semanticsimilarity and lexical acquisition. section 3 de scribes how the graph model was built from the pos-tagged british national corpus. section 4 describes a new incremental algorithm used to build categories of words step by step from thegraph model. section 5 demonstrates this algo rithm in action and evaluates the results againstwordnet classes, obtaining state-of-the-art re sults. section 6 describes how the graph modelcan be used to recognise when words are polysemous and to obtain groups of words represen tative of the different senses.semantic knowledge for particular domains isincreasingly important in nlp. section 6 describes how the graph modelcan be used to recognise when words are polysemous and to obtain groups of words represen tative of the different senses. so far we have presented a graph model built upon noun co-occurrence which performs much better than previously reported methods at the task of automatic lexical acquisition. 2 1http://infomap.stanford.edu/graphs 2http://muchmore.dfki.defigure 1: automatically generated graph show ing the word apple and semantically related nouns this isan important task, because assembling and tuning lexicons for specific nlp systems is increas ingly necessary. many applications such as word-sense disambiguation, in formation extraction and speech recognitionall require lexicons. section 5 demonstrates this algo rithm in action and evaluates the results againstwordnet classes, obtaining state-of-the-art re sults. this research was supported in part by theresearch collaboration between the ntt communication science laboratories, nippon tele graph and telephone corporation and csli,stanford university, and by ec/nsf grant ist 1999-11438 for the muchmore project. acknowledgements the authors would like to thank the anonymous reviewers whose comments were a great help inmaking this paper more focussed: any short comings remain entirely our own responsibility. we now take a step furtherand present a simple method for not only as sembling words with similar meanings, but for empirically recognising when a word has several meanings.
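The graph described in the passage above is simply words as nodes with edges linking words that co-occur often enough. A small, hypothetical construction of such a graph - sentence-level co-occurrence with a top-k cutoff, rather than the paper's BNC-based procedure - might look like this:

```python
from collections import Counter
from itertools import combinations

def build_graph(sentences, top_k=2):
    """Link each noun to its top_k most frequently co-occurring nouns."""
    cooc = Counter()
    for sent in sentences:
        for a, b in combinations(sorted(set(sent)), 2):
            cooc[(a, b)] += 1
    graph = {}
    for (a, b), c in cooc.items():
        graph.setdefault(a, Counter())[b] = c
        graph.setdefault(b, Counter())[a] = c
    return {w: [n for n, _ in nbrs.most_common(top_k)]
            for w, nbrs in graph.items()}

# invented noun lists standing in for pos-tagged corpus sentences
sents = [["apple", "pear", "fruit"], ["apple", "banana", "fruit"],
         ["apple", "computer", "software"]]
print(build_graph(sents))
```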
DATASET_PACSUM/dataset/inputs/C02-1144.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ broad-coverage lexical resources such as wordnet are extremely useful in applications such as word sense disambiguation (leacock, chodorow, miller 1998) and question answering (pasca and harabagiu 2001). however, they often include many rare senses while missing domain-specific senses. for example, in wordnet, the words dog, computer and company all have a sense that is a hyponym of person. such rare senses make it difficult for a coreference resolution system to use wordnet to enforce the constraint that personal pronouns (e.g. he or she) must refer to a person. on the other hand, wordnet misses the user-interface object sense of the word dialog (as often used in software manuals). one way to deal with these problems is to use a clustering algorithm to automatically induce semantic classes (lin and pantel 2001). many clustering algorithms represent a cluster by the centroid of all of its members (e.g., k means) (mcqueen 1967) or by a representative element (e.g., k-medoids) (kaufmann and rousseeuw 1987). when averaging over all elements in a cluster, the centroid of a cluster may be unduly influenced by elements that only marginally belong to the cluster or by elements that also belong to other clusters. for example, when clustering words, we can use the contexts of the words as features and group together the words that tend to appear in similar contexts. for instance, u.s. state names can be clustered this way because they tend to appear in the following contexts: (list a) ___ appellate court campaign in ___ ___ capital governor of ___ ___ driver's license illegal in ___ ___ outlaws sth. primary in ___ ___'s sales tax senator for ___ if we create a centroid of all the state names, the centroid will also contain features such as: (list b) ___'s airport archbishop of ___ ___'s business district fly to ___ ___'s mayor mayor of ___ ___'s subway outskirts of ___ because some of the state names (like new york and washington) are also names of cities. using a single representative from a cluster may be problematic too because each individual element has its own idiosyncrasies that may not be shared by other members of the cluster. in this paper, we propose a clustering algo rithm, cbc (clustering by committee), in which the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members. the subset is viewed as a committee that determines which other elements belong to the cluster. by carefully choosing committee members, the features of the centroid tend to be the more typical features of the target class. for example, our system chose the following committee members to compute the centroid of the state cluster: illinois, michigan, minnesota, iowa, wisconsin, indiana, nebraska and vermont. as a result, the centroid contains only features like those in list a. evaluating clustering results is a very difficult task. we introduce a new evaluation methodol ogy that is based on the editing distance between output clusters and classes extracted from wordnet (the answer key).we presented a clustering algorithm, cbc, for automatically discovering concepts from text. we introduce a new evaluation methodol ogy that is based on the editing distance between output clusters and classes extracted from wordnet (the answer key). this research was partly supported by natural sciences and engineering research council of canada grant ogp121338 and scholarship pgsb207797. as a result, the centroid contains only features like those in list a. 
evaluating clustering results is a very difficult task. however, they often include many rare senses while missing domain-specific senses. we generated clusters from a news corpus using cbc and compared them with classes extracted from wordnet (miller 1990). the parameters k and t are usually considered to be small numbers. broad-coverage lexical resources such as wordnet are extremely useful in applications such as word sense disambiguation (leacock, chodorow, miller 1998) and question answering (pasca and harabagiu 2001). five of the 943 clusters discovered by cbc from s13403 along with their features with top-15 highest mutual information and the wordnet classes that have the largest intersection with each cluster. test data. clustering algorithms are generally categorized as hierarchical and partitional. to extract classes from wordnet, we first estimate the probability of a random word belonging to a subhierarchy (a synset and its hyponyms).
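The step the passage above motivates is forming a cluster centroid from a small committee rather than from every member, so that only the typical features of the class survive, and then assigning other elements by similarity to that centroid. Below is a minimal, hypothetical version of the averaging and assignment; the feature vectors and threshold are invented, and CBC's careful committee selection is assumed to have already happened.

```python
import math

def centroid(vectors):
    """Average the committee members' feature vectors."""
    out = {}
    for vec in vectors:
        for feat, w in vec.items():
            out[feat] = out.get(feat, 0.0) + w / len(vectors)
    return out

def cosine(u, v):
    dot = sum(u[f] * v.get(f, 0.0) for f in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# invented context features; the committee is a tight subset of the cluster
committee = [{"governor of ___": 0.9, "___ sales tax": 0.7},
             {"governor of ___": 0.8, "primary in ___": 0.6}]
state_centroid = centroid(committee)

# a candidate word joins the cluster if it is close enough to the centroid
candidate = {"governor of ___": 0.5, "mayor of ___": 0.4}
print(cosine(candidate, state_centroid) > 0.3)
```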
DATASET_PACSUM/dataset/inputs/C02-1145.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ the penn chinese treebank (ctb) is an ongoing project, with its objective being to create a segmented chinese corpus annotated with pos tags and syntactic brackets. the first installment of the project (ctb-i) consists of xinhua newswire between the years 1994 and 1998, totaling 100,000 words, fully segmented, pos-tagged and syntactically bracketed and it has been released to the public via the penn linguistic data consortium (ldc). the preliminary results of this phase of the project have been reported in xia et al (2000). currently the second installment of the project, the 400,000-word ctb-ii is being developed and is expected to be completed early in the year 2003. ctb-ii will follow the standards set up in the segmentation (xia 2000b), pos tagging (xia 2000a) and bracketing guidelines (xue and xia 2000) and it will use articles from peoples' daily, hong kong newswire and material translated into chinese from other languages in addition to the xinhua newswire used in ctb-i in an effort to diversify the sources. the availability of ctb-i changed our approach to ctb-ii considerably. due to the existence of ctb-i, we were able to train new automatic chinese language processing (clp) tools, which crucially use annotated corpora as training material. these tools are then used for preprocessing in the development of the ctb-ii. we also developed tools to control the quality of the corpus. in this paper, we will address three issues in the development of the chinese treebank: annotation speed, annotation accuracy and usability of the corpus. specifically, we attempt to answer four questions: (i) how do we speed up the annotation process, (ii) how do we maintain high quality, i.e. annotation accuracy and inter-annotator consistency during the annotation process, and (iii) for what purposes is the corpus applicable, and (iv) what are our future plans? although we will touch upon linguistic problems that are specific to chinese, we believe these issues are general enough for the development of any single language corpus. 1 annotation speed. there are three main factors that affect the annotation speed : annotators? background, guideline design and more importantly, the availability of preprocessing tools. we will discuss how each of these three factors affects annotation speed. 1.1 annotator background. even with the best sets of guidelines, it is important that annotators have received considerable training in linguistics, particularly in syntax. in both the segmentation/pos tagging phase and the syntactic bracketing phase, understanding the structure of the sentences is essential for correct annotation with reasonable speed. for example, the penn chinese treebank (ctb) is an ongoing project, with its objective being to create a segmented chinese corpus annotated with pos tags and syntactic brackets. for example, in both the segmentation/pos tagging phase and the syntactic bracketing phase, understanding the structure of the sentences is essential for correct annotation with reasonable speed. the preliminary results of this phase of the project have been reported in xia et al (2000). even with the best sets of guidelines, it is important that annotators have received considerable training in linguistics, particularly in syntax. 
the first installment of the project (ctb-i) consists of xinhua newswire between the years 1994 and 1998, totaling 100,000 words, fully segmented, pos-tagged and syntactically bracketed and it has been released to the public via the penn linguistic data consortium (ldc). currently the second installment of the project, the 400,000-word ctb-ii is being developed and is expected to be completed early in the year 2003. 1.1 annotator background. we will discuss how each of these three factors affects annotation speed. the availability of ctb-i changed our approach to ctb-ii considerably. background, guideline design and more importantly, the availability of preprocessing tools. ctb-ii will follow the standards set up in the segmentation (xia 2000b), pos tagging (xia 2000a) and bracketing guidelines (xue and xia 2000) and it will use articles from peoples' daily, hong kong newswire and material translated into chinese from other languages in addition to the xinhua newswire used in ctb-i in an effort to diversify the sources.
DATASET_PACSUM/dataset/inputs/C02-1150.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ open-domain question answering (lehnert, 1986; harabagiu et al, 2001; light et al, 2001) and storycomprehension (hirschman et al, 1999) have become important directions in natural language pro cessing. question answering is a retrieval task morechallenging than common search engine tasks be cause its purpose is to find an accurate and conciseanswer to a question rather than a relevant docu ment. the difficulty is more acute in tasks such as story comprehension in which the target text is less likely to overlap with the text in the questions. for this reason, advanced natural language techniques rather than simple key term extraction are needed.one of the important stages in this process is analyz ing the question to a degree that allows determining the ?type? of the sought after answer. in the treccompetition (voorhees, 2000), participants are requested to build a system which, given a set of en glish questions, can automatically extract answers (a short phrase) of no more than 50 bytes from a5-gigabyte document library. participants have re research supported by nsf grants iis-9801638 and itr iis 0085836 and an onr muri award. alized that locating an answer accurately hinges on first filtering out a wide range of candidates (hovy et al, 2001; ittycheriah et al, 2001) based on some categorization of answer types. this work develops a machine learning approach to question classification (qc) (harabagiu et al, 2001; hermjakob, 2001). our goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answeringprocess. for example, when considering the question q: what canadian city has the largest popula tion?, the hope is to classify this question as havinganswer type city, implying that only candidate an swers that are cities need consideration.based on the snow learning architecture, we develop a hierarchical classifier that is guided by a lay ered semantic hierarchy of answer types and is able to classify questions into fine-grained classes. wesuggest that it is useful to consider this classifica tion task as a multi-label classification and find that it is possible to achieve good classification results(over 90%) despite the fact that the number of dif ferent labels used is fairly large, 50. we observe thatlocal features are not sufficient to support this accu racy, and that inducing semantic features is crucial for good performance. the paper is organized as follows: sec. 2 presents the question classification problem; sec. 3 discusses the learning issues involved in qc and presents ourlearning approach; sec. 4 describes our experimen tal study.this paper presents a machine learning approach to question classification. 4 describes our experimen tal study. in future work we plan to investigate further the application of deeper semantic analysis (including better named entity and semantic categorization) to feature extraction, automate the generation of thesemantic features and develop a better understand ing to some of the learning issues involved in thedifference between a flat and a hierarchical classi fier. question answering is a retrieval task morechallenging than common search engine tasks be cause its purpose is to find an accurate and conciseanswer to a question rather than a relevant docu ment. we define question classification(qc) here to be the task that, given a question, maps it to one of k classes, which provide a semantic constraint on the sought-after answer1. 
open-domain question answering (lehnert, 1986; harabagiu et al, 2001; light et al, 2001) and storycomprehension (hirschman et al, 1999) have become important directions in natural language pro cessing. the ambiguity causes the classifier not to output equivalent term as the first choice. we designed two experiments to test the accuracy ofour classifier on trec questions. what do bats eat?. in this case, both classes are ac ceptable. the first experi ment evaluates the contribution of different featuretypes to the quality of the classification.
DATASET_PACSUM/dataset/inputs/C02-2025.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ for the past decade or more, symbolic, linguistically ori- ented methods and statistical or machine learning ap- proaches to nlp have often been perceived as incompat- ible or even competing paradigms. while shallow and probabilistic processing techniques have produced use- ful results in many classes of applications, they have not met the full range of needs for nlp, particularly where precise interpretation is important, or where the variety of linguistic expression is large relative to the amount of training data available. on the other hand, deep approaches to nlp have only recently achieved broad enough grammatical coverage and sufficient processing efficiency to allow the use of precise linguistic grammars in certain types of real-world applications. in particular, applications of broad-coverage analyti- cal grammars for parsing or generation require the use of sophisticated statistical techniques for resolving ambigu- ities; the transfer of head-driven phrase structure gram- mar (hpsg) systems into industry, for example, has am- plified the need for general parse ranking, disambigua- tion, and robust recovery techniques. we observe general consensus on the necessity for bridging activities, com- bining symbolic and stochastic approaches to nlp. but although we find promising research in stochastic pars- ing in a number of frameworks, there is a lack of appro- priately rich and dynamic language corpora for hpsg. likewise, stochastic parsing has so far been focussed on information-extraction-type applications and lacks any depth of semantic interpretation. the redwoods initia- tive is designed to fill in this gap. in the next section, we present some of the motivation for the lingo redwoods project as a treebank develop- ment process. although construction of the treebank is in its early stages, we present in section 3 some prelim- inary results of using the treebank data already acquired on concrete applications. we show, for instance, that even simple statistical models of parse ranking trained on the redwoods corpus built so far can disambiguate parses with close to 80% accuracy. 2 a rich and dynamic treebank the redwoods treebank is based on open-source hpsg resources developed by a broad consortium of re- search groups including researchers at stanford (usa), saarbru?cken (germany), cambridge, edinburgh, and sussex (uk), and tokyo (japan). their wide distribution and common acceptance make the hpsg framework and resources an excellent anchor point for the redwoods treebanking initiative. the key innovative aspect of the redwoods ap- proach to treebanking is the anchoring of all linguis- tic data captured in the treebank to the hpsg frame- work and a generally-available broad-coverage gram- mar of english, the lingo english resource grammar (flickinger, 2000) as implemented with the lkb gram- mar development environment (copestake, 2002). un- like existing treebanks, there is no need to define a (new) form of grammatical representation specific to the tree- bank.the lingo redwoods treebank motivation and preliminary applications stephan oepen, kristina toutanova, stuart shieber, christopher manning, dan flickinger, and thorsten brants {oe |kristina |manning |dan}@csli.stanford.edu, [email protected], [email protected] abstract the lingo redwoods initiative is a seed activity in the de- sign and development of a new type of treebank.
DATASET_PACSUM/dataset/inputs/C04-1010.txt ADDED
@@ -0,0 +1 @@
+ there has been a steadily increasing interest in syntactic parsing based on dependency analysis in re cent years. one important reason seems to be thatdependency parsing offers a good compromise be tween the conflicting demands of analysis depth, on the one hand, and robustness and efficiency, on the other. thus, whereas a complete dependency structure provides a fully disambiguated analysisof a sentence, this analysis is typically less complex than in frameworks based on constituent analysis and can therefore often be computed determin istically with reasonable accuracy. deterministicmethods for dependency parsing have now been ap plied to a variety of languages, including japanese (kudo and matsumoto, 2000), english (yamada and matsumoto, 2003), turkish (oflazer, 2003), and swedish (nivre et al, 2004). for english, the interest in dependency parsing has been weaker than for other languages. to some extent, this can probably be explained by the strong tradition of constituent analysis in anglo-american linguistics, but this trend has been reinforced by the fact that the major treebank of american english,the penn treebank (marcus et al, 1993), is anno tated primarily with constituent analysis. on the other hand, the best available parsers trained on thepenn treebank, those of collins (1997) and charniak (2000), use statistical models for disambigua tion that make crucial use of dependency relations. moreover, the deterministic dependency parser of yamada and matsumoto (2003), when trained on the penn treebank, gives a dependency accuracy that is almost as good as that of collins (1997) and charniak (2000). the parser described in this paper is similar to that of yamada and matsumoto (2003) in that it uses a deterministic parsing algorithm in combination with a classifier induced from a treebank. however, there are also important differences between the twoapproaches. first of all, whereas yamada and matsumoto employs a strict bottom-up algorithm (es sentially shift-reduce parsing) with multiple passes over the input, the present parser uses the algorithmproposed in nivre (2003), which combines bottom up and top-down processing in a single pass in order to achieve incrementality. this also means that the time complexity of the algorithm used here is linearin the size of the input, while the algorithm of ya mada and matsumoto is quadratic in the worst case. another difference is that yamada and matsumoto use support vector machines (vapnik, 1995), whilewe instead rely on memory-based learning (daele mans, 1999). most importantly, however, the parser presented in this paper constructs labeled dependency graphs, i.e. dependency graphs where arcs are labeled with dependency types. as far as we know, this makesit different from all previous systems for dependency parsing applied to the penn treebank (eis ner, 1996; yamada and matsumoto, 2003), althoughthere are systems that extract labeled grammatical relations based on shallow parsing, e.g. buchholz (2002). the fact that we are working with labeled dependency graphs is also one of the motivations for choosing memory-based learning over sup port vector machines, since we require a multi-class classifier. even though it is possible to use svmfor multi-class classification, this can get cumber some when the number of classes is large. (for the the ? dep finger-pointing ? np-sbj has already ? advp begun ? vp . ? 
dep figure 1: dependency graph for english sentenceunlabeled dependency parser of yamada and matsumoto (2003) the classification problem only in volves three classes.) the parsing methodology investigated here haspreviously been applied to swedish, where promis ing results were obtained with a relatively smalltreebank (approximately 5000 sentences for train ing), resulting in an attachment score of 84.7% and a labeled accuracy of 80.6% (nivre et al, 2004).1 however, since there are no comparable resultsavailable for swedish, it is difficult to assess the significance of these findings, which is one of the reasons why we want to apply the method to a bench mark corpus such as the the penn treebank, even though the annotation in this corpus is not ideal for labeled dependency parsing.the paper is structured as follows. section 2 describes the parsing algorithm, while section 3 ex plains how memory-based learning is used to guidethe parser. experimental results are reported in sec tion 4, and conclusions are stated in section 5.the conversion of the penn tree bank to dependency trees has been performed using head rules kindly provided by hiroyasu yamada and yuji matsumoto. there has been a steadily increasing interest in syntactic parsing based on dependency analysis in re cent years. experimental results are reported in sec tion 4, and conclusions are stated in section 5. sentences whose unlabeled dependency structure is completely correct (yamada and mat sumoto, 2003). one important reason seems to be thatdependency parsing offers a good compromise be tween the conflicting demands of analysis depth, on the one hand, and robustness and efficiency, on the other. the memory-based classifiers used in the experiments have been constructed using thetilburg memory-based learner (timbl) (daelemans et al, 2003). first of all, we see that model 1 gives better accuracy than model 2 with the smaller label set g, which confirms our expectations that the added part-of-speech featuresare helpful when the dependency labels are less informative. acknowledgements the work presented in this paper has been supportedby a grant from the swedish research council (621 2002-4207). all metrics except cm are calculated as meanscores per word, and punctuation tokens are con sistently excluded.table 1 shows the attachment score, both unla beled and labeled, for the two different state models with the two different label sets.
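The parsing algorithm described in the excerpt (a single left-to-right pass with a stack, an input buffer, and a classifier choosing among a small set of transitions) can be sketched roughly as below. This is my simplification, not the authors' code: the classifier ("guide") is a placeholder where the paper uses a trained memory-based learner, and dependency labels are omitted for brevity.

```python
# Minimal transition-based (arc-eager style) parsing loop: at each step a
# guide function picks one of four transitions; arcs are (head, dependent)
# word-index pairs. A real guide would be a trained classifier.
def parse(words, guide):
    stack, buffer = [], list(range(len(words)))
    arcs = []
    while buffer:
        action = guide(stack, buffer, words)
        if action == "LEFT-ARC" and stack:
            arcs.append((buffer[0], stack.pop()))    # stack top depends on next input word
        elif action == "RIGHT-ARC" and stack:
            arcs.append((stack[-1], buffer[0]))      # next input word depends on stack top
            stack.append(buffer.pop(0))
        elif action == "REDUCE" and stack:
            stack.pop()
        else:                                        # SHIFT
            stack.append(buffer.pop(0))
    return arcs

# placeholder guide that always shifts, just to show the interface
print(parse(["economic", "news", "had", "little", "effect"], lambda s, b, w: "SHIFT"))
```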
DATASET_PACSUM/dataset/inputs/C04-1024.txt ADDED
@@ -0,0 +1 @@
+ large context-free grammars extracted from tree banks achieve high coverage and accuracy, but they are difficult to parse with because of their massive ambiguity. the application of standard chart-parsing techniques often fails due to excessive memory and runtime requirements.treebank grammars are mostly used as probabilis tic grammars and users are usually only interested in the best analysis, the viterbi parse. to speed up viterbi parsing, sophisticated search strategies havebeen developed which find the most probable anal ysis without examining the whole set of possible analyses (charniak et al, 1998; klein and manning,2003a). these methods reduce the number of gener ated edges, but increase the amount of time needed for each edge. the parser described in this paper follows a contrary approach: instead of reducing the number of edges, it minimises the costs of building edges in terms of memory and runtime.the new parser, called bitpar, is based on a bit vector implementation (cf. (graham et al, 1980)) of the well-known cocke-younger-kasami (cky) algorithm (kasami, 1965; younger, 1967). it buildsa compact ?parse forest? representation of all anal yses in two steps. in the first step, a cky-style recogniser fills the chart with constituents. in the second step, the parse forest is built top-down from the chart. viterbi parses are computed in four steps. again, the first step is a cky recogniser which is followed by a top-down filtering of the chart, the bottom-up computation of the viterbi probabilities, and the top-down extraction of the best parse.the rest of the paper is organised as follows: sec tion 2 explains the transformation of the grammar to chomsky normal form. the following sectionsdescribe the recogniser algorithm (sec. 3), improvements of the recogniser by means of bit-vector op erations (sec. 4), and the generation of parse forests(sec. 5), and viterbi parses (sec. 6). section 7 discusses the advantages of the new architecture, sec tion 8 describes experimental results, and section 9 summarises the paper.(the rule a section 7 discusses the advantages of the new architecture, sec tion 8 describes experimental results, and section 9 summarises the paper. the cky algorithm requires a grammar in chom sky normal form where the right-hand side of eachrule either consists of two non-terminals or a single terminal symbol. large context-free grammars extracted from tree banks achieve high coverage and accuracy, but they are difficult to parse with because of their massive ambiguity. 5), and viterbi parses (sec. the application of standard chart-parsing techniques often fails due to excessive memory and runtime requirements.treebank grammars are mostly used as probabilis tic grammars and users are usually only interested in the best analysis, the viterbi parse. boring symbols on the right-hand sides of rules. bitpar uses a modified ver sion of the cky algorithm allowing also chain rules (rules with a single non-terminal on the right-handside). 4), and the generation of parse forests(sec. to speed up viterbi parsing, sophisticated search strategies havebeen developed which find the most probable anal ysis without examining the whole set of possible analyses (charniak et al, 1998; klein and manning,2003a). 3), improvements of the recogniser by means of bit-vector op erations (sec. these methods reduce the number of gener ated edges, but increase the amount of time needed for each edge.
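The recognition step described in the excerpt is the classic CKY algorithm over a grammar in Chomsky normal form. A toy recogniser is sketched below; BitPar's contribution is to replace the per-cell sets with bit vectors so that the inner loop reduces to cheap bitwise operations, which this sketch does not attempt. Grammar and sentence are invented.

```python
# CKY recognition for a CNF grammar: chart[i][j] holds the nonterminals
# that can derive words[i:j]; the sentence is accepted if the start symbol
# covers the whole input.
def cky_recognize(words, lexical, binary, start="S"):
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = {lhs for lhs, term in lexical if term == w}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # split point
                for lhs, (b, c) in binary:
                    if b in chart[i][k] and c in chart[k][j]:
                        chart[i][j].add(lhs)
    return start in chart[0][n]

lexical = [("NP", "she"), ("V", "eats"), ("NP", "fish")]
binary = [("S", ("NP", "VP")), ("VP", ("V", "NP"))]
print(cky_recognize(["she", "eats", "fish"], lexical, binary))  # True
```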
DATASET_PACSUM/dataset/inputs/C04-1041.txt ADDED
@@ -0,0 +1 @@
+ lexicalised grammar formalisms such as lexicalized tree adjoining grammar (ltag) and com binatory categorial grammar (ccg) assign one or more syntactic structures to each word in a sentencewhich are then manipulated by the parser. supertag ging was introduced for ltag as a way of increasingparsing efficiency by reducing the number of struc tures assigned to each word (bangalore and joshi, 1999). supertagging has more recently been applied to ccg (clark, 2002; curran and clark, 2003).supertagging accuracy is relatively high for man ually constructed ltags (bangalore and joshi,1999). however, for ltags extracted automati cally from the penn treebank, performance is much lower (chen et al, 1999; chen et al, 2002). in fact, performance for such grammars is below that needed for successful integration into a full parser (sarkar et al, 2000). in this paper we demonstratethat ccg supertagging accuracy is not only sufficient for accurate and robust parsing using an auto matically extracted grammar, but also offers several practical advantages. our wide-coverage ccg parser uses a log-linear model to select an analysis. the model paramaters are estimated using a discriminative method, that is,one which requires all incorrect parses for a sentence as well as the correct parse. since an auto matically extracted ccg grammar can produce anextremely large number of parses, the use of a su pertagger is crucial in limiting the total number of parses for the training data to a computationally manageable number. the supertagger is also crucial for increasing thespeed of the parser. we show that spectacular in creases in speed can be obtained, without affectingaccuracy or coverage, by tightly integrating the su pertagger with the ccg grammar and parser. to achieve maximum speed, the supertagger initially assigns only a small number of ccg categories toeach word, and the parser only requests more cate gories from the supertagger if it cannot provide an analysis. we also demonstrate how extra constraints on the category combinations, and the application of beam search using the parsing model, can further increase parsing speed.this is the first work we are aware of to succes fully integrate a supertagger with a full parser which uses a lexicalised grammar automatically extractedfrom the penn treebank. we also report signifi cantly higher parsing speeds on newspaper text than any previously reported for a full wide-coverage parser. our results confirm that wide-coverage ccg parsing is feasible for many large-scale nlp tasks.this research was supported by epsrc grant gr/m96889, and a commonwealth scholarship and a sydney university travelling scholarship to the second author. this paper has shown that by tightly integrating a supertagger with a ccg parser, very fast parse times can be achieved for penn treebank wsj text. our results confirm that wide-coverage ccg parsing is feasible for many large-scale nlp tasks. lexicalised grammar formalisms such as lexicalized tree adjoining grammar (ltag) and com binatory categorial grammar (ccg) assign one or more syntactic structures to each word in a sentencewhich are then manipulated by the parser. supertag ging was introduced for ltag as a way of increasingparsing efficiency by reducing the number of struc tures assigned to each word (bangalore and joshi, 1999). the best speeds we have reported for the ccg parser are an order of magnitude faster. 
to give one example, the number of categories in the tag dictionary?s entry for the wordis is 45 (only considering categories which have appeared at least 10 times in the training data). we also report signifi cantly higher parsing speeds on newspaper text than any previously reported for a full wide-coverage parser. however, in the sentence mr. vinken is chairman of elsevier n.v., the dutch publishing group., the supertag ger correctly assigns 1 category to is for ? = 0.1, and 3 categories for ? = 0.01.
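The integration strategy described above (assign very few categories per word under a tight probability cut-off, and relax the cut-off only if the parser fails) can be sketched as follows. Function names, the cut-off values, and the toy parser stub are assumptions for illustration, not the authors' interface.

```python
# Adaptive supertagging loop: keep only categories within a factor beta of
# the best one per word; back off to a looser beta when parsing fails.
def assign_categories(tag_probs, beta):
    best = max(tag_probs.values())
    return [c for c, p in tag_probs.items() if p >= beta * best]

def parse_with_backoff(sentence_probs, try_parse, betas=(0.1, 0.01, 0.001)):
    for beta in betas:                                  # tightest cut-off first
        categories = [assign_categories(tp, beta) for tp in sentence_probs]
        result = try_parse(categories)
        if result is not None:                          # parser found an analysis
            return result
    return None                                         # give up or fall back

# toy usage: a fake parser that needs at least two categories on the second word
probs = [{"NP": 0.9, "N": 0.1}, {"(S\\NP)/NP": 0.6, "(S\\NP)/(S\\NP)": 0.3}]
print(parse_with_backoff(probs, lambda cats: "ok" if len(cats[1]) > 1 else None))
```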
DATASET_PACSUM/dataset/inputs/C04-1051.txt ADDED
@@ -0,0 +1 @@
+ the importance of learning to manipulate monolingual paraphrase relationships for applications like summarization, search, and dialog has been highlighted by a number of recent efforts (barzilay & mckeown 2001; shinyama et al 2002; lee & barzilay 2003; lin & pantel 2001). while several different learning methods have been applied to this problem, all share a need for large amounts of data in the form of pairs or sets of strings that are likely to exhibit lexical and/or structural paraphrase alternations. one approach1 1 an alternative approach involves identifying anchor points--pairs of words linked in a known way--and collecting the strings that intervene. (shinyama, et al 2002; lin & pantel 2001). since our interest is in that has been successfully used is edit distance, a measure of similarity between strings. the assumption is that strings separated by a small edit distance will tend to be similar in meaning: the leading indicators measure the economy? the leading index measures the economy?. lee & barzilay (2003), for example, use multi sequence alignment (msa) to build a corpus of paraphrases involving terrorist acts. their goal is to extract sentential templates that can be used in high-precision generation of paraphrase alter nations within a limited domain. our goal here is rather different: our interest lies in constructing a monolingual broad-domain corpus of pairwise aligned sentences. such data would be amenable to conventional statistical machine translation (smt) techniques (e.g., those discussed in och & ney 2003).2 in what follows we compare two strategies for unsupervised construction of such a corpus, one employing string similarity and the other associating sentences that may overlap very little at the string level. we measure the relative utility of the two derived monolingual corpora in the context of word alignment techniques developed originally for bilingual text. we show that although the edit distance corpus is well-suited as training data for the alignment algorithms currently used in smt, it is an incomplete source of information about paraphrase relations, which exhibit many of the characteristics of comparable bilingual corpora or free translations. many of the more complex alternations that characterize monolingual paraphrase, such as large-scale lexical alternations and constituent reorderings, are not readily learning sentence level paraphrases, including major constituent reorganizations, we do not address this approach here. 2 barzilay & mckeown (2001) consider the possibility of using smt machinery, but reject the idea because of the noisy, comparable nature of their dataset. captured by edit distance techniques, which conflate semantic similarity with formal similarity. we conclude that paraphrase research would benefit by identifying richer data sources and developing appropriate learning techniques.we remain, however, responsible for all content. edit distance identifies sentence pairs that exhibit lexical and short phrasal alternations that can be aligned with considerable success. we conclude that paraphrase research would benefit by identifying richer data sources and developing appropriate learning techniques. we have also benefited from discussions with ken church, mark johnson, daniel marcu and franz och. 
the importance of learning to manipulate monolingual paraphrase relationships for applications like summarization, search, and dialog has been highlighted by a number of recent efforts (barzilay & mckeown 2001; shinyama et al 2002; lee & barzilay 2003; lin & pantel 2001). given a large dataset and a well-motivated clustering of documents, useful datasets can be gleaned even without resorting to more sophisticated techniques figure 2. captured by edit distance techniques, which conflate semantic similarity with formal similarity. the second relied on a discourse-based heuristic, specific to the news genre, to identify likely paraphrase pairs even when they have little superficial similarity. while several different learning methods have been applied to this problem, all share a need for large amounts of data in the form of pairs or sets of strings that are likely to exhibit lexical and/or structural paraphrase alternations. our two paraphrase datasets are distilled from a corpus of news articles gathered from thousands of news sources over an extended period.
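The string-similarity strategy described in the excerpt pairs sentences whose word-level edit distance is small. A minimal sketch is shown below, reusing the excerpt's own example sentence pair; the threshold value is an arbitrary choice, not taken from the paper.

```python
# Word-level Levenshtein distance; sentence pairs under a small threshold
# are treated as candidate paraphrase pairs.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]

s1 = "the leading indicators measure the economy".split()
s2 = "the leading index measures the economy".split()
if edit_distance(s1, s2) <= 3:                         # distance 2 here
    print("candidate paraphrase pair")
```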
DATASET_PACSUM/dataset/inputs/C04-1059.txt ADDED
@@ -0,0 +1 @@
+ language models (lm) are applied in many natural language processing applications, such as speech recognition and machine translation, to encapsulate syntactic, semantic and pragmatic information. for systems which learn from given data we frequently observe a severe drop in performance when moving to a new genre or new domain. in speech recognition a number of adaptation techniques have been developed to cope with this situation. in statistical machine translation we have a similar situation, i.e. estimate the model parameter from some data, and use the system to translate sentences which may not be well covered by the training data. therefore, the potential of adaptation techniques needs to be explored for machine translation applications. statistical machine translation is based on the noisy channel model, where the translation hypothesis is searched over the space defined by a translation model and a target language (brown et al, 1993). statistical machine translation can be formulated as follows: )()|(maxarg)|(maxarg* tptspstpt tt ?== where t is the target sentence, and s is the source sentence. p(t) is the target language model and p(s|t) is the translation model. the argmax operation is the search, which is done by the decoder. in the current study we modify the target language model p(t), to represent the test data better, and thereby improve the translation quality. (janiszek, et al 2001) list the following approaches to language model adaptation: ? linear interpolation of a general and a domain specific model (seymore, rosenfeld, 1997). back off of domain specific probabilities with those of a specific model (besling, meier, 1995). retrieval of documents pertinent to the new domain and training a language model on-line with those data (iyer, ostendorf, 1999, mahajan et. al. 1999). maximum entropy, minimum discrimination adaptation (chen, et. al., 1998). adaptation by linear transformation of vectors of bigram counts in a reduced space (demori, federico, 1999). smoothing and adaptation in a dual space via latent semantic analysis, modeling long-term semantic dependencies, and trigger combinations. (j. bellegarda, 2000). our approach can be characterized as unsupervised data augmentation by retrieval of relevant documents from large monolingual corpora, and interpolation of the specific language model, build from the retrieved data, with a background language model. to be more specific, the following steps are carried out to do the language model adaptation. first, a baseline statistical machine translation system, using a large general language model, is applied to generate initial translations. then these translations hypotheses are reformulated as queries to retrieve similar sentences from a very large text collection. a small domain specific language model is build using the retrieved sentences and linearly interpolated with the background language model. this new interpolated language model in applied in a second decoding run to produce the final translations. there are a number of interesting questions pertaining to this approach: ? which information can and should used to generate the queries: the first-best translation only, or also translation alternatives. how should we construct the queries, just as simple bag-of-words, or can we incorporate more structure to make them more powerful. how many documents should be retrieved to build the specific language models, and on what granularity should this be done, i.e. what is a document in the information retrieval process. 
the paper is structured as follows: section 2 outlines the sentence retrieval approach, and three bag-of-words query models are designed and explored; structured query models are introduced in section 3. in section 4 we present translation experiments are presented for the different query. finally, summary is given in section 5.in this paper, we studied language model adaptation for statistical machine translation. this might be especially useful for structured query models generated from the translation lattices. finally, summary is given in section 5. language models (lm) are applied in many natural language processing applications, such as speech recognition and machine translation, to encapsulate syntactic, semantic and pragmatic information. in section 4 we present translation experiments are presented for the different query. for systems which learn from given data we frequently observe a severe drop in performance when moving to a new genre or new domain. the paper is structured as follows: section 2 outlines the sentence retrieval approach, and three bag-of-words query models are designed and explored; structured query models are introduced in section 3. in speech recognition a number of adaptation techniques have been developed to cope with this situation. our language model adaptation is an unsupervised data augmentation approach guided by query models. on the other side the oracle experiment also shows that the optimally expected improvement is limited by the translation model and decoding algorithm used in the current smt system. this also means tmq is subject to more noise. experiments are carried out on a standard statistical machine translation task defined in the nist evaluation in june 2002.
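The adaptation step described above builds a small language model from retrieved sentences and linearly interpolates it with the background model. A minimal sketch of the interpolation is given below; the probabilities and interpolation weight are invented toy values, and real models would be n-gram models estimated from large corpora.

```python
# Linear interpolation of a domain-specific LM with a background LM:
# P(w | h) = lam * P_specific(w | h) + (1 - lam) * P_background(w | h)
def interpolate(p_specific, p_background, lam=0.5):
    return lam * p_specific + (1 - lam) * p_background

# toy example: retrieved in-domain data makes a domain word more likely
p = interpolate(p_specific=0.02, p_background=0.001, lam=0.3)
print(p)  # ~0.0067
```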
DATASET_PACSUM/dataset/inputs/C04-1072.txt ADDED
@@ -0,0 +1 @@
+ to automatically evaluate machine translations, the machine translation community recently adopted an n-gram co-occurrence scoring procedure bleu (papineni et al 2001). a similar metric, nist, used by nist (nist 2002) in a couple of machine translation evaluations in the past two years is based on bleu. the main idea of bleu is to measure the translation closeness between a candidate translation and a set of reference translations with a numerical metric. although the idea of using objective functions to automatically evaluate machine translation quality is not new (su et al 1992), the success of bleu prompts a lot of interests in developing better automatic evaluation metrics. for example, akiba et al (2001) proposed a metric called red based on edit distances over a set of multiple references. nie?en et al (2000) calculated the length normalized edit distance, called word error rate (wer), between a candidate and multiple reference translations. leusch et al (2003) proposed a related measure called position independent word error rate (per) that did not consider word position, i.e. using bag-of-words instead. turian et al (2003) introduced general text matcher (gtm) based on accuracy measures such as recall, precision, and f-measure. with so many different automatic metrics available, it is necessary to have a common and objective way to evaluate these metrics. comparison of automatic evaluation metrics are usually conducted on corpus level using correlation analysis between human scores and automatic scores such as bleu, nist, wer, and per. however, the performance of automatic metrics in terms of human vs. system correlation analysis is not stable across different evaluation settings. for example, table 1 shows the pearson?s linear correlation coefficient analysis of 8 machine translation systems from 2003 nist chinese english machine translation evaluation. the pearson? correlation coefficients are computed according to different automatic evaluation methods vs. human assigned adequacy and fluency. bleu1, 4, and 12 are bleu with maximum n-gram lengths of 1, 4, and 12 respectively. gtm10, 20, and 30 are gtm with exponents of 1.0, 2.0, and 3.0 respectively. 95% confidence intervals are estimated using bootstrap resampling (davison and hinkley 1997). from the bleu group, we found that shorter bleu has better adequacy correlation while longer bleu has better fluency correlation. gtm with smaller exponent has better adequacy correlation and gtm with larger exponent has better fluency correlation. nist is very good in adequacy correlation but not as good as gtm30 in fluency correlation. based on these observations, we are not able to conclude which metric is the best because it depends on the manual evaluation criteria. this results also indicate that high correlation between human and automatic scores in both adequacy and fluency cannot always been achieved at the same time. the best performing metrics in fluency according to table 1 are bleu12 and gtm30 (dark/green cells). however, many metrics are statistically equivalent (gray cells) to them when we factor in the 95% confidence intervals. for example, even per is as good as bleu12 in adequacy. one reason for this might be due to data sparseness since only 8 systems are available. the other potential problem for correlation analysis of human vs. automatic framework is that high corpus-level correlation might not translate to high sentence-level correlation. 
however, high sentence-level correlation is often an important property that machine translation researchers look for. for example, candidate translations shorter than 12 words would have zero bleu12 score but bleu12 has the best correlation with human judgment in fluency as shown in table 1. in order to evaluate the ever increasing number of automatic evaluation metrics for machine translation objectively, efficiently, and reliably, we introduce a new evaluation method: orange. we describe orange in details in section 2 and briefly introduce three new automatic metrics that will be used in comparisons in section 3. the results of comparing several existing automatic metrics and the three new automatic metrics using orange will be presented in section 4. we conclude this paper and discuss future directions in section 5.we conclude this paper and discuss future directions in section 5. however, we plan to conduct the sampling procedure to verify this is indeed the case. we conjecture that this is the case for the currently available machine translation systems. the orange score for each metric is calculated as the average rank of the average reference (oracle) score over the whole corpus (872 sentences) divided by the length of the n-best list plus 1. the results of comparing several existing automatic metrics and the three new automatic metrics using orange will be presented in section 4. if the portion is small then the orange method can be confidently applied. to automatically evaluate machine translations, the machine translation community recently adopted an n-gram co-occurrence scoring procedure bleu (papineni et al 2001). assuming the length of the n-best list is n and the size of the corpus is s (in number of sentences), we compute orange as follows: orange = )1( )( 1 + ??? ranging from 0 to 9 (rouge-s0 to s9) and without any skip distance limit (rouge-s*) we compute the average score of the references and then rank the candidate translations and the references according to these automatic scores.
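The ORANGE formula in the extracted text above is garbled. Going only by the excerpt's prose definition (the average rank of the oracle/reference score over the whole corpus, divided by the length of the n-best list plus 1), a reconstruction might look like the following; tie handling and the exact normalisation are my assumptions, not taken from the paper.

```python
# ORANGE-style score: average rank of the reference (oracle) within each
# sentence's metric-ranked n-best list, normalised by the list length + 1.
def orange_score(oracle_ranks, n_best_size):
    """oracle_ranks[i] = rank (1 = best) of the reference score within the
    automatic-metric ranking of sentence i's n-best list."""
    avg_rank = sum(oracle_ranks) / len(oracle_ranks)
    return avg_rank / (n_best_size + 1)    # lower is better: references ranked near the top

print(orange_score([1, 3, 2, 1], n_best_size=1024))  # ~0.0017
```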
DATASET_PACSUM/dataset/inputs/C04-1080.txt ADDED
@@ -0,0 +1 @@
+ the empiricist revolution in computational linguistics has dramatically shifted the accepted boundary between what kinds of knowledge are best supplied by humans and what kinds are best learned from data, with much of the human supplied knowledge now being in the form of annotations of data. as we look to the future, we expect that relatively unsupervised methods will grow in applicability, reducing the need for expensive human annotation of data. with respect to part-of-speech tagging, we believe that the way forward from the relatively small number of languages for which we can currently identify parts of speech in context with reasonable accuracy will make use of unsupervised methods that require only an untagged corpus and a lexicon of words and their possible parts of speech. we believe this based on the fact that such lexicons exist for many more languages (in the form of conventional dictionaries) than extensive human-tagged training corpora exist for. unsupervised part-of-speech tagging, as defined above, has been attempted using a variety of learning algorithms (brill 1995, church, 1988, cutting et. al. 1992, elworthy, 1994 kupiec 1992, merialdo 1991). while this makes unsupervised part-of-speech tagging a relatively well-studied problem, published results to date have not been comparable with respect to the training and test data used, or the lexicons which have been made available to the learners. in this paper, we provide the first comprehensive comparison of methods for unsupervised part-of speech tagging. in addition, we explore two new ideas for improving tagging accuracy. first, we explore an hmm approach to tagging that uses context on both sides of the word to be tagged, inspired by previous work on building bidirectionality into graphical models (lafferty et. al. 2001, toutanova et. al. 2003). second we describe a method for sequential unsupervised training of tag sequence and lexical probabilities in an hmm, which we observe leads to improved accuracy over simultaneous training with certain types of models. in section 2, we provide a brief description of the methods we evaluate and review published results. section 3 describes the contextualized variation on hmm tagging that we have explored. in section 4 we provide a direct comparison of several unsupervised part-of-speech taggers, which is followed by section 5, in which we present a new method for training with suboptimal lexicons. in section 6, we revisit our new approach to hmm tagging, this time, in the supervised framework.in the future, we will consider making an increase the context-size, which helped toutanova et al (2003). in section 6, we revisit our new approach to hmm tagging, this time, in the supervised framework. the empiricist revolution in computational linguistics has dramatically shifted the accepted boundary between what kinds of knowledge are best supplied by humans and what kinds are best learned from data, with much of the human supplied knowledge now being in the form of annotations of data. this result falls only slightly below the full-blown training intensive dependency-based conditional model. we have presented a comprehensive evaluation of several methods for unsupervised part-of-speech tagging, comparing several variations of hidden markov model taggers and unsupervised transformation-based learning using the same corpus and same lexicons. 
in section 4 we provide a direct comparison of several unsupervised part-of-speech taggers, which is followed by section 5, in which we present a new method for training with suboptimal lexicons. using a 50% 50% train-test split of the penn treebank to assess hmms, maximum entropy markov models (memms) and conditional random fields (crfs), they found that crfs, which make use of observation features from both the past and future, outperformed hmms which in turn outperformed memms.
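The taggers compared in the excerpt are lexicon-constrained HMMs: each word may only receive tags its dictionary entry allows, and decoding picks the best tag sequence under transition and emission probabilities. A toy Viterbi decoder with hand-set probabilities is sketched below; in the unsupervised setting described in the excerpt, these probabilities would instead be estimated from untagged text (e.g. with EM).

```python
# Viterbi decoding over a tag lexicon: for each word, consider only the tags
# the lexicon allows; unseen transitions/emissions get a small floor value.
def viterbi(words, lexicon, trans, emit, start="<s>"):
    best = {start: (1.0, [])}                      # tag -> (prob, tag sequence)
    for w in words:
        new_best = {}
        for tag in lexicon[w]:
            p, prev_tag = max(
                (prob * trans.get((pt, tag), 1e-6) * emit.get((tag, w), 1e-6), pt)
                for pt, (prob, _) in best.items()
            )
            new_best[tag] = (p, best[prev_tag][1] + [tag])
        best = new_best
    return max(best.values())[1]

lexicon = {"the": ["DT"], "can": ["MD", "NN"], "rusted": ["VBD"]}
trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("DT", "MD"): 0.1,
         ("NN", "VBD"): 0.4, ("MD", "VBD"): 0.2}
emit = {("DT", "the"): 0.7, ("NN", "can"): 0.1, ("MD", "can"): 0.3, ("VBD", "rusted"): 0.05}
print(viterbi(["the", "can", "rusted"], lexicon, trans, emit))  # ['DT', 'NN', 'VBD']
```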
DATASET_PACSUM/dataset/inputs/C04-1081.txt ADDED
@@ -0,0 +1 @@
+ unlike english and other western languages, many asian languages such as chinese, japanese, and thai, do not delimit words by white-space. wordsegmentation is therefore a key precursor for language processing tasks in these languages. for chinese, there has been significant research on find ing word boundaries in unsegmented sequences(see (sproat and shih, 2002) for a review). un fortunately, building a chinese word segmentation system is complicated by the fact that there is no standard definition of word boundaries in chinese. approaches to chinese segmentation fall roughly into two categories: heuristic dictionary-based methods and statistical machine learning methods.in dictionary-based methods, a predefined dictio nary is used along with hand-generated rules for segmenting input sequence (wu, 1999). howeverthese approaches have been limited by the impossibility of creating a lexicon that includes all possible chinese words and by the lack of robust statistical inference in the rules. machine learning approaches are more desirable and have been successful in both unsupervised learning (peng and schuur mans, 2001) and supervised learning (teahan et al, 2000). many current approaches suffer from either lackof exact inference over sequences or difficulty in incorporating domain knowledge effectively into seg mentation. domain knowledge is either not used, used in a limited way, or used in a complicated way spread across different components. for example,the n-gram generative language modeling based ap proach of teahan et al(2000) does not use domainknowledge. gao et al(2003) uses class-based language for word segmentation where some word cat egory information can be incorporated. zhang et al (2003) use a hierarchical hidden markov model to incorporate lexical knowledge. a recent advance in this area is xue (2003), in which the author uses a sliding-window maximum entropy classifier to tag chinese characters into one of four position tags, and then covert these tags into a segmentation using rules. maximum entropy models give tremendousflexibility to incorporate arbitrary features. how ever, a traditional maximum entropy tagger, as used in xue (2003), labels characters without consideringdependencies among the predicted segmentation labels that is inherent in the state transitions of finite state sequence models. linear-chain conditional random fields (crfs) (lafferty et al, 2001) are models that address both issues above. unlike heuristic methods, they are principled probabilistic finite state models onwhich exact inference over sequences can be ef ficiently performed. unlike generative n-gram or hidden markov models, they have the ability to straightforwardly combine rich domain knowledge, for example in this paper, in the form of multiple readily-available lexicons. furthermore, they arediscriminatively-trained, and are often more accurate than generative models, even with the same fea tures. in their most general form, crfs are arbitrary undirected graphical models trained to maximize the conditional probability of the desired outputs given the corresponding inputs. in the linear-chainspecial case we use here, they can be roughly un derstood as discriminatively-trained hidden markovmodels with next-state transition functions represented by exponential models (as in maximum en tropy classifiers), and with great flexibility to viewthe observation sequence in terms of arbitrary, over lapping features, with long-range dependencies, and at multiple levels of granularity. 
these beneficialproperties suggests that crfs are a promising ap proach for chinese word segmentation.new word detection is one of the most impor tant problems in chinese information processing.many machine learning approaches have been pro posed (chen and bai, 1998; wu and jiang, 2000; nie et al, 1995). new word detection is normally considered as a separate process from segmentation.however, integrating them would benefit both seg mentation and new word detection. crfs provide aconvenient framework for doing this. they can pro duce not only a segmentation, but also confidence in local segmentation decisions, which can be usedto find new, unfamiliar character sequences sur rounded by high-confidence segmentations. thus, our new word detection is not a stand-alone process, but an integral part of segmentation. newly detected words are re-incorporated into our word lexicon,and used to improve segmentation. improved seg mentation can then be further used to improve new word detection. comparing chinese word segmentation accuracyacross systems can be difficult because many re search papers use different data sets and different ground-rules. some published results claim 98% or99% segmentation precision and recall, but these ei ther count only the words that occur in the lexicon, or use unrealistically simple data, lexicons that haveextremely small (or artificially non-existant) outof-vocabulary rates, short sentences or many numbers. a recent chinese word segmentation competition (sproat and emerson, 2003) has made compar isons easier. the competition provided four datasets with significantly different segmentation guidelines, and consistent train-test splits. the performance ofparticipating system varies significantly across different datasets. our system achieves top performance in two of the runs, and a state-of-the-art per formance on average. this indicates that crfs are a viable model for robust chinese word segmentation.this indicates that crfs are a viable model for robust chinese word segmentation. unlike english and other western languages, many asian languages such as chinese, japanese, and thai, do not delimit words by white-space. wordsegmentation is therefore a key precursor for language processing tasks in these languages. the contribution of this paper is three-fold. our system achieves top performance in two of the runs, and a state-of-the-art per formance on average. for chinese, there has been significant research on find ing word boundaries in unsegmented sequences(see (sproat and shih, 2002) for a review). the performance ofparticipating system varies significantly across different datasets. acknowledgmentsthis work was supported in part by the center for intelligent information retrieval, in part by the cen tral intelligence agency, the national security agencyand national science foundation under nsf grant #iis 0326249, and in part by spawarsyscen-sd grant number n66001-02-1-8903. to make a comprehensive evaluation, we use allfour of the datasets from a recent chinese word segmentation bake-off competition (sproat and emer son, 2003). conditional random fields (crfs) are undirected graphical models trained to maximize a conditional probability (lafferty et al, 2001). however, training is a one-time process, and testing time is still linear in the length of the input.
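The excerpt describes segmentation as character tagging: each character receives one of four position tags, and the predicted tag sequence is then converted back into words. The conversion step can be sketched as below; the tag names (B/M/E/S for begin/middle/end/single) and the example are mine, and the paper's CRF would supply the predicted tags.

```python
# Convert a per-character position-tag sequence into a word segmentation.
def tags_to_words(chars, tags):
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):          # end of a word, or a single-character word
            words.append(current)
            current = ""
    if current:                        # flush a dangling word on malformed tag sequences
        words.append(current)
    return words

print(tags_to_words(list("北京大学生"), ["B", "E", "B", "M", "E"]))  # ['北京', '大学生']
```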
DATASET_PACSUM/dataset/inputs/C04-1111.txt ADDED
@@ -0,0 +1 @@
+ the natural language processing (nlp) com munity has recently seen a growth in corpus-based methods. algorithms light in linguistic theories but rich in available training data have been successfully applied to several applications such as ma chine translation (och and ney 2002), information extraction (etzioni et al 2004), and question an swering (brill et al 2001). in the last decade, we have seen an explosion in the amount of available digital text resources. it is estimated that the internet contains hundreds of terabytes of text data, most of which is in an unstructured format. yet, many nlp algorithms tap into only megabytes or gigabytes of this information. in this paper, we make a step towards acquiring semantic knowledge from terabytes of data. we present an algorithm for extracting is-a relations, designed for the terascale, and compare it to a state of the art method that employs deep analysis of text (pantel and ravichandran 2004). we show that by simply utilizing more data on this task, we can achieve similar performance to a linguisticallyrich approach. the current state of the art co occurrence model requires an estimated 10 years just to parse a 1tb corpus (see table 1). instead of using a syntactically motivated co-occurrence ap proach as above, our system uses lexico-syntactic rules. in particular, it finds lexico-pos patterns by making modifications to the basic edit distance algorithm. once these patterns have been learnt, the algorithm for finding new is-a relations runs in o(n), where n is the number of sentences. in semantic hierarchies such as wordnet (miller 1990), an is-a relation between two words x and y represents a subordinate relationship (i.e. x is more specific than y). many algorithms have recently been proposed to automatically mine is-a (hypo nym/hypernym) relations between words. here, we focus on is-a relations that are characterized by the questions ?what/who is x?? for example, table 2 shows a sample of 10 is-a relations discovered by the algorithms presented in this paper. in this table, we call azalea, tiramisu, and winona ryder in stances of the respective concepts flower, dessert and actress. these kinds of is-a relations would be useful for various purposes such as ontology con struction, semantic information retrieval, question answering, etc. the main contribution of this paper is a comparison of the quality of our pattern-based and co occurrence models as a function of processing time and corpus size. also, the paper lays a foundation for terascale acquisition of knowledge. we will show that, for very small or very large corpora or for situations where recall is valued over precision, the pattern-based approach is best.there is a long standing need for higher quality performance in nlp systems. the natural language processing (nlp) com munity has recently seen a growth in corpus-based methods. our biggest challenge as we venture to the terascale is to use our new found wealth not only to build better systems, but to im prove our understanding of language. we will show that, for very small or very large corpora or for situations where recall is valued over precision, the pattern-based approach is best. also, the paper lays a foundation for terascale acquisition of knowledge. previous approaches to extracting is-a relations fall under two categories: pattern-based and co occurrence-based approaches. re cently, pantel and ravichandran (2004) extended this approach by making use of all syntactic de pendency features for each noun. 
there is promise for increasing our system accuracy by re ranking the outputs of the top-5 hypernyms. the per formance of the system in the top 5 category is much better than that of wordnet (38%). the focus is on the precision and recall of the systems as a func tion of the corpus size. algorithms light in linguistic theories but rich in available training data have been successfully applied to several applications such as ma chine translation (och and ney 2002), information extraction (etzioni et al 2004), and question an swering (brill et al 2001).
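The pattern-based approach in the excerpt learns lexico-POS patterns automatically; as a much-simplified illustration of the general idea, a single hand-written surface pattern of the "Y such as X" family can already extract is-a pairs. The pattern and example sentence below are mine, not from the paper, and a real system would use many learned patterns over POS-tagged text.

```python
# Toy pattern-based is-a extraction with one "Y such as X, X and X" pattern.
import re

PATTERN = re.compile(r"(\w+) such as ((?:\w+)(?:, \w+)*(?:,? and \w+)?)")

def extract_isa(sentence):
    pairs = []
    for concept, inst_str in PATTERN.findall(sentence):
        for inst in re.split(r",? and |, ", inst_str):
            pairs.append((inst.strip(), concept))      # (instance, class)
    return pairs

print(extract_isa("he grows flowers such as azaleas, tulips and roses"))
# [('azaleas', 'flowers'), ('tulips', 'flowers'), ('roses', 'flowers')]
```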