Spaces:

sunnychenxiwang
/

EasyDetect

Sleeping

App Files Files Community

EasyDetect / pipeline /nltk /chunk /__init__.py

sunnychenxiwang

update nltk

d916065 about 1 year ago

raw

history blame

7.6 kB

	# Natural Language Toolkit: Chunkers
	#
	# Copyright (C) 2001-2023 NLTK Project
	# Author: Steven Bird <[email protected]>
	# Edward Loper <[email protected]>
	# URL: <https://www.nltk.org/>
	# For license information, see LICENSE.TXT
	#

	"""
	Classes and interfaces for identifying non-overlapping linguistic
	groups (such as base noun phrases) in unrestricted text. This task is
	called "chunk parsing" or "chunking", and the identified groups are
	called "chunks". The chunked text is represented using a shallow
	tree called a "chunk structure." A chunk structure is a tree
	containing tokens and chunks, where each chunk is a subtree containing
	only tokens. For example, the chunk structure for base noun phrase
	chunks in the sentence "I saw the big dog on the hill" is::

	(SENTENCE:
	(NP: <I>)
	<saw>
	(NP: <the> <big> <dog>)
	<on>
	(NP: <the> <hill>))

	To convert a chunk structure back to a list of tokens, simply use the
	chunk structure's ``leaves()`` method.

	This module defines ``ChunkParserI``, a standard interface for
	chunking texts; and ``RegexpChunkParser``, a regular-expression based
	implementation of that interface. It also defines ``ChunkScore``, a
	utility class for scoring chunk parsers.

	RegexpChunkParser
	=================

	``RegexpChunkParser`` is an implementation of the chunk parser interface
	that uses regular-expressions over tags to chunk a text. Its
	``parse()`` method first constructs a ``ChunkString``, which encodes a
	particular chunking of the input text. Initially, nothing is
	chunked. ``parse.RegexpChunkParser`` then applies a sequence of
	``RegexpChunkRule`` rules to the ``ChunkString``, each of which modifies
	the chunking that it encodes. Finally, the ``ChunkString`` is
	transformed back into a chunk structure, which is returned.

	``RegexpChunkParser`` can only be used to chunk a single kind of phrase.
	For example, you can use an ``RegexpChunkParser`` to chunk the noun
	phrases in a text, or the verb phrases in a text; but you can not
	use it to simultaneously chunk both noun phrases and verb phrases in
	the same text. (This is a limitation of ``RegexpChunkParser``, not of
	chunk parsers in general.)

	RegexpChunkRules
	----------------

	A ``RegexpChunkRule`` is a transformational rule that updates the
	chunking of a text by modifying its ``ChunkString``. Each
	``RegexpChunkRule`` defines the ``apply()`` method, which modifies
	the chunking encoded by a ``ChunkString``. The
	``RegexpChunkRule`` class itself can be used to implement any
	transformational rule based on regular expressions. There are
	also a number of subclasses, which can be used to implement
	simpler types of rules:

	- ``ChunkRule`` chunks anything that matches a given regular
	expression.
	- ``StripRule`` strips anything that matches a given regular
	expression.
	- ``UnChunkRule`` will un-chunk any chunk that matches a given
	regular expression.
	- ``MergeRule`` can be used to merge two contiguous chunks.
	- ``SplitRule`` can be used to split a single chunk into two
	smaller chunks.
	- ``ExpandLeftRule`` will expand a chunk to incorporate new
	unchunked material on the left.
	- ``ExpandRightRule`` will expand a chunk to incorporate new
	unchunked material on the right.

	Tag Patterns
	~~~~~~~~~~~~

	A ``RegexpChunkRule`` uses a modified version of regular
	expression patterns, called "tag patterns". Tag patterns are
	used to match sequences of tags. Examples of tag patterns are::

	r'(<DT>\|<JJ>\|<NN>)+'
	r'<NN>+'
	r'<NN.*>'

	The differences between regular expression patterns and tag
	patterns are:

	- In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so
	``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not
	``'<NN'`` followed by one or more repetitions of ``'>'``.
	- Whitespace in tag patterns is ignored. So
	``'<DT> \| <NN>'`` is equivalent to ``'<DT>\|<NN>'``
	- In tag patterns, ``'.'`` is equivalent to ``'[^{}<>]'``; so
	``'<NN.*>'`` matches any single tag starting with ``'NN'``.

	The function ``tag_pattern2re_pattern`` can be used to transform
	a tag pattern to an equivalent regular expression pattern.

	Efficiency
	----------

	Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a
	rate of about 300 tokens/second, with a moderately complex rule set.

	There may be problems if ``RegexpChunkParser`` is used with more than
	5,000 tokens at a time. In particular, evaluation of some regular
	expressions may cause the Python regular expression engine to
	exceed its maximum recursion depth. We have attempted to minimize
	these problems, but it is impossible to avoid them completely. We
	therefore recommend that you apply the chunk parser to a single
	sentence at a time.

	Emacs Tip
	---------

	If you evaluate the following elisp expression in emacs, it will
	colorize a ``ChunkString`` when you use an interactive python shell
	with emacs or xemacs ("C-c !")::

	(let ()
	(defconst comint-mode-font-lock-keywords
	'(("<[^>]+>" 0 'font-lock-reference-face)
	("[{}]" 0 'font-lock-function-name-face)))
	(add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

	You can evaluate this code by copying it to a temporary buffer,
	placing the cursor after the last close parenthesis, and typing
	"``C-x C-e``". You should evaluate it before running the interactive
	session. The change will last until you close emacs.

	Unresolved Issues
	-----------------

	If we use the ``re`` module for regular expressions, Python's
	regular expression engine generates "maximum recursion depth
	exceeded" errors when processing very large texts, even for
	regular expressions that should not require any recursion. We
	therefore use the ``pre`` module instead. But note that ``pre``
	does not include Unicode support, so this module will not work
	with unicode strings. Note also that ``pre`` regular expressions
	are not quite as advanced as ``re`` ones (e.g., no leftward
	zero-length assertions).

	:type CHUNK_TAG_PATTERN: regexp
	:var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
	pattern is valid.
	"""

	from nltk.chunk.api import ChunkParserI
	from nltk.chunk.regexp import RegexpChunkParser, RegexpParser
	from nltk.chunk.util import (
	ChunkScore,
	accuracy,
	conllstr2tree,
	conlltags2tree,
	ieerstr2tree,
	tagstr2tree,
	tree2conllstr,
	tree2conlltags,
	)
	from nltk.data import load

	# Standard treebank POS tagger
	_BINARY_NE_CHUNKER = "chunkers/maxent_ne_chunker/english_ace_binary.pickle"
	_MULTICLASS_NE_CHUNKER = "chunkers/maxent_ne_chunker/english_ace_multiclass.pickle"


	def ne_chunk(tagged_tokens, binary=False):
	"""
	Use NLTK's currently recommended named entity chunker to
	chunk the given list of tagged tokens.
	"""
	if binary:
	chunker_pickle = _BINARY_NE_CHUNKER
	else:
	chunker_pickle = _MULTICLASS_NE_CHUNKER
	chunker = load(chunker_pickle)
	return chunker.parse(tagged_tokens)


	def ne_chunk_sents(tagged_sentences, binary=False):
	"""
	Use NLTK's currently recommended named entity chunker to chunk the
	given list of tagged sentences, each consisting of a list of tagged tokens.
	"""
	if binary:
	chunker_pickle = _BINARY_NE_CHUNKER
	else:
	chunker_pickle = _MULTICLASS_NE_CHUNKER
	chunker = load(chunker_pickle)
	return chunker.parse_sents(tagged_sentences)