Named entity (NE) recognition is a task in which proper nouns and numerical information in a document are detected and classified into categories such as PERSON, ORGANIZATION, and DATE. It is a key technology of information extraction and open-domain question answering (Voorhees and Harman, 2000). We are building a trainable open-domain question answering system called SAIQA-II. In this paper, we show that an NE recognizer based on Support Vector Machines (SVMs) gives better scores than conventional systems. SVMs have given high performance in various classification tasks (Joachims, 1998; Kudo and Matsumoto, 2001). However, it turned out that off-the-shelf SVM classifiers are too inefficient for NE recognition. Our recognizer runs at a rate of only 85 bytes/sec on an Athlon 1.3 GHz Linux PC, while rule-based systems (e.g., Isozaki (2001)) can process several kilobytes in a second. The major reason is the inefficiency of SVM classifiers. There are other reports on the slowness of SVM classifiers: another SVM-based NE recognizer (Yamada and Matsumoto, 2001) runs at 0.8 sentences/sec on a Pentium III 933 MHz PC, and an SVM-based part-of-speech (POS) tagger (Nakagawa et al., 2001) runs at 20 tokens/sec on an Alpha 21164A 500 MHz processor. It is difficult to use such slow systems in practical applications.

In this paper, we present a method that makes the NE system substantially faster. This method can also be applied to other tasks in natural language processing such as chunking and POS tagging. Another problem with SVMs is their incomprehensibility: it is not clear which features are important or how they work. The above method is also useful for finding useless features. We also mention a method to reduce training time.

1.1 Support Vector Machines

Suppose we have a set of training data for a two-class problem: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$, where $\mathbf{x}_i$ is the feature vector of the $i$-th sample in the training data and $y_i \in \{+1, -1\}$ is the label of the sample. The goal is to find a decision function that accurately predicts $y$ for an unseen $\mathbf{x}$. A non-linear SVM classifier gives a decision function $f(\mathbf{x}) = \mathrm{sign}(g(\mathbf{x}))$ for an input vector $\mathbf{x}$, where

$g(\mathbf{x}) = \sum_{i=1}^{\ell} w_i K(\mathbf{x}, \mathbf{z}_i) + b.$

Here, $f(\mathbf{x}) = +1$ means $\mathbf{x}$ is a member of a certain class and $f(\mathbf{x}) = -1$ means it is not a member. The vectors $\mathbf{z}_i$ are called support vectors and are representatives of the training examples; $\ell$ is the number of support vectors. Therefore, the computational complexity of $g(\mathbf{x})$ is proportional to $\ell$. The support vectors and the other constants are determined by solving a certain quadratic programming problem. $K(\mathbf{x}, \mathbf{z})$ is a kernel that implicitly maps vectors into a higher-dimensional space. Typical kernels use dot products: $K(\mathbf{x}, \mathbf{z}) = k(\mathbf{x} \cdot \mathbf{z})$. A polynomial kernel of degree $d$ is given by $k(u) = (1 + u)^d$. We can use various kernels, and the design of an appropriate kernel for a particular application is an important research issue.

[Figure 1: Support Vector Machine. Positive and negative examples on either side of the decision hyperplane; the circled examples are the support vectors.]

Figure 1 shows a linearly separable case. The decision hyperplane defined by $g(\mathbf{x}) = 0$ separates positive and negative examples by the largest margin. The solid line indicates the decision hyperplane, and the two parallel dotted lines indicate the margin between positive and negative examples. Since such a separating hyperplane may not exist, a positive parameter $C$ is introduced to allow misclassifications; see Vapnik (1995).
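To make the formulation concrete, the following is a minimal sketch of evaluating the decision function with a polynomial kernel. It is illustrative only, not the system's implementation; the weights $w_i$, support vectors $\mathbf{z}_i$, and bias $b$ are assumed to have already been obtained from the quadratic programming step.

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    """k(u) = (1 + u)^d applied to the dot product x . z."""
    return (1.0 + np.dot(x, z)) ** d

def g(x, support_vectors, weights, b, d=2):
    """Kernel expansion g(x) = sum_i w_i K(x, z_i) + b.
    The cost grows linearly with the number of support vectors,
    which is the bottleneck discussed above."""
    return sum(w * polynomial_kernel(x, z, d)
               for w, z in zip(weights, support_vectors)) + b

def classify(x, support_vectors, weights, b, d=2):
    """Decision function f(x) = sign(g(x))."""
    return 1 if g(x, support_vectors, weights, b, d) >= 0 else -1
```

The per-example cost of `g` is exactly the point made above: with tens of thousands of support vectors, every classification requires one kernel evaluation per support vector.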
1.2 SVM-based NE Recognition

As far as we know, the first SVM-based NE system was proposed by Yamada et al. (2001) for Japanese. Their system is an extension of Kudo's chunking system (Kudo and Matsumoto, 2001), which gave the best performance at the CoNLL-2000 shared task. In their system, every word in a sentence is classified sequentially from the beginning or the end of the sentence. However, since Yamada et al. did not compare it with other methods under the same conditions, it is not clear whether their NE system is better or not. Here, we show that our SVM-based NE system is more accurate than conventional systems. Our system uses a Viterbi search (Allen, 1995) instead of sequential determination.

For training, we use "CRL data", which was prepared for IREX (Information Retrieval and Extraction Exercise, http://cs.nyu.edu/cs/projects/proteus/irex; Sekine and Eriguchi, 2000). It has about 19,000 NEs in 1,174 articles. We also use additional data by Isozaki (2001). Both datasets are based on Mainichi Newspaper's 1994 and 1995 CD-ROMs. We use IREX's formal test data, called GENERAL, which has 1,510 named entities in 71 articles from Mainichi Newspaper of 1999. Systems are compared in terms of GENERAL's F-measure, the harmonic mean of recall and precision: recall = M/(the number of correct NEs) and precision = M/(the number of NEs extracted by a system), where M is the number of NEs correctly extracted and classified by the system.

We developed an SVM-based NE system by following our NE system based on maximum entropy (ME) modeling (Isozaki, 2001); we simply replaced the ME model with SVM classifiers. The above datasets are processed by the morphological analyzer ChaSen 2.2.1 (http://chasen.aist-nara.ac.jp/). It tokenizes a sentence into words and adds POS tags. ChaSen uses about 90 POS tags such as common-noun and location-name. Since most unknown words are proper nouns, ChaSen's parameters for unknown words are modified for better results. Then, a character type tag is added to each word. It uses 17 character types such as all-kanji and small-integer. See Isozaki (2001) for details.

Now, Japanese NE recognition is solved by the classification of words (Sekine et al., 1998; Borthwick, 1999; Uchimoto et al., 2000). For instance, the words in "President George Herbert Bush said Clinton is . . ." are classified as follows: "President" = OTHER, "George" = PERSON-BEGIN, "Herbert" = PERSON-MIDDLE, "Bush" = PERSON-END, "said" = OTHER, "Clinton" = PERSON-SINGLE, "is" = OTHER. In this way, the first word of a person's name is labeled PERSON-BEGIN and the last word is labeled PERSON-END. Other words in the name are PERSON-MIDDLE. If a person's name is expressed by a single word, it is labeled PERSON-SINGLE. If a word does not belong to any named entity, it is labeled OTHER. Since IREX defines eight NE classes, words are classified into 33 ($= 8 \times 4 + 1$) categories.

Each sample is represented by 15 features because each word has three features (part-of-speech tag, character type, and the word itself), and the two preceding words and the two succeeding words are also used for context dependence. Although infrequent features are usually removed to prevent overfitting, we use all features because SVMs are robust. Each sample is represented by a long binary vector, i.e., a sequence of 0 (false) and 1 (true). For instance, "Bush" in the above example is represented by a vector $\mathbf{x} = (x[1], \ldots, x[D])$ described below. Only 15 elements are 1.

x[1] = 0      // current word is not 'Alice'
x[2] = 1      // current word is 'Bush'
x[3] = 0      // current word is not 'Charlie'
  :
x[15029] = 1  // current POS is a proper noun
x[15030] = 0  // current POS is not a verb
  :
x[39181] = 0  // previous word is not 'Henry'
x[39182] = 1  // previous word is 'Herbert'
  :
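A small sketch of this encoding follows, under the assumption that one dimension is assigned to each (window position, feature kind, value) triple; the dictionary and the boundary handling here are illustrative, not the paper's actual feature table.

```python
from collections import defaultdict
from itertools import count

# Illustrative feature dictionary: each (window position, feature
# kind, value) triple receives its own dimension of the binary
# vector, assigned on first sight during training.
index_of = defaultdict(count().__next__)

def active_features(words, pos_tags, char_types, i):
    """Indices of the (up to) 15 features that are 1 for the word at
    position i: word, POS tag, and character type for each of the
    five window positions i-2 .. i+2 (sentence-boundary padding is
    omitted for brevity)."""
    active = []
    for offset in range(-2, 3):
        j = i + offset
        if 0 <= j < len(words):
            for kind, value in (("word", words[j]),
                                ("pos", pos_tags[j]),
                                ("char", char_types[j])):
                active.append(index_of[(offset, kind, value)])
    return sorted(active)
```

Storing only the indices of the 1-elements is the natural representation here, since only 15 of the hundreds of thousands of dimensions are active for any sample.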
Here, we have to consider the following problems. First, SVMs can solve only two-class problems; therefore, we have to reduce the above multi-class problem to a group of two-class problems. Second, we have to consider consistency among word classes in a sentence. For instance, a word classified as PERSON-BEGIN should be followed by PERSON-MIDDLE or PERSON-END. This implies that the system has to determine the best combination of word classes from numerous possibilities. Here, we solve these problems by combining existing methods.

There are a few approaches to extending SVMs to cover multi-class problems. Here, we employ the "one class versus all others" approach: each classifier $g^c(\mathbf{x})$ is trained to distinguish members of a class $c$ from non-members. In this method, two or more classifiers may give $+1$ to an unseen vector, or no classifier may give $+1$. One common way to avoid such situations is to compare the $g^c(\mathbf{x})$ values and choose the class index $c$ of the largest $g^c(\mathbf{x})$. The consistency problem is solved by the Viterbi search. Since SVMs do not output probabilities, we use the SVM+sigmoid method (Platt, 2000); that is, we use a sigmoid function $s(x) = 1/(1 + \exp(-\beta x))$ to map $g^c(\mathbf{x})$ to a probability-like value (a sketch of this scheme is given at the end of Section 1.3). The output of the Viterbi search is adjusted by a postprocessor for wrong word boundaries. The adjustment rules are also statistically determined (Isozaki, 2001).

1.3 Comparison of NE Recognizers

We use a fixed value $\beta = 10.0$. F-measures are not very sensitive to $\beta$ unless $\beta$ is too small. When we used 1,038,986 training vectors, GENERAL's F-measure was 89.64% for $\beta = 0.1$ and 90.03% for $\beta = 10.0$. We employ the quadratic kernel ($d = 2$) because it gives the best results: polynomial kernels of degree 1, 2, and 3 resulted in F-measures of 83.03%, 88.31%, and 87.04%, respectively, when we used 569,994 training vectors.

[Figure 2: F-measures of NE systems (SVM, ME, and RG+DT) plotted against the number of NEs in the training data; F-measures range over 76-90%.]

Figure 2 compares NE recognizers in terms of GENERAL's F-measures. "SVM" in the figure indicates F-measures of our system trained by Kudo's TinySVM-0.07 with $C = 0.1$. It attained 85.04% when we used only CRL data. "ME" indicates our ME system, and "RG+DT" indicates a rule-based machine learning system (Isozaki, 2001). According to this graph, "SVM" is better than the other systems.

However, SVM classifiers are too slow. The well-known SVM-Light 3.50 (Joachims, 1999) took 1.2 days to classify 569,994 vectors derived from 2 MB of documents; that is, it runs at only 19 bytes/sec. TinySVM's classifier seems the best optimized among publicly available SVM toolkits, but it still works at only 92 bytes/sec.
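As promised above, here is a minimal sketch of the per-word classification scheme of Section 1.2 put together: one-vs-rest SVM scores are mapped to probability-like values with the sigmoid ($\beta = 10.0$, as fixed above) and a Viterbi search selects a consistent label sequence. The `allowed` transition predicate and the data layout are illustrative assumptions, not the paper's code.

```python
import math

BETA = 10.0  # the fixed sigmoid parameter used in Section 1.3

def sigmoid(g, beta=BETA):
    """Platt-style mapping of an SVM output g^c(x) to a
    probability-like value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-beta * g))

def viterbi(scores, classes, allowed):
    """scores[t][c]: sigmoid-mapped score of class c for word t.
    allowed(p, c): hypothetical predicate, True iff class p may be
    followed by class c (e.g. PERSON-BEGIN by PERSON-MIDDLE/END);
    it is assumed to admit at least one predecessor per class.
    Returns the best-scoring consistent label sequence."""
    best = {c: math.log(scores[0][c]) for c in classes}
    back = [{}]
    for t in range(1, len(scores)):
        new_best, ptr = {}, {}
        for c in classes:
            # best predecessor among classes allowed to precede c
            score, prev = max((best[p] + math.log(scores[t][c]), p)
                              for p in classes if allowed(p, c))
            new_best[c], ptr[c] = score, prev
        best, back = new_best, back + [ptr]
    c = max(best, key=best.get)          # best final class
    path = [c]
    for ptr in reversed(back[1:]):       # follow back pointers
        c = ptr[c]
        path.append(c)
    return list(reversed(path))
```

In the real system, the `scores[t][c]` values would come from the 33 one-vs-rest classifiers, and the boundary-adjusting postprocessor runs on the Viterbi output.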