# Natural Language Toolkit: Clusterers
#
# Copyright (C) 2001-2023 NLTK Project
# Author: Trevor Cohn <[email protected]>
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT
""" | |
This module contains a number of basic clustering algorithms. Clustering | |
describes the task of discovering groups of similar items with a large | |
collection. It is also describe as unsupervised machine learning, as the data | |
from which it learns is unannotated with class information, as is the case for | |
supervised learning. Annotated data is difficult and expensive to obtain in | |
the quantities required for the majority of supervised learning algorithms. | |
This problem, the knowledge acquisition bottleneck, is common to most natural | |
language processing tasks, thus fueling the need for quality unsupervised | |
approaches. | |
This module contains a k-means clusterer, an EM clusterer and a group average
agglomerative clusterer (GAAC). All three clusterers involve finding good
cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrarily chosen means, then allocates
each vector to the cluster with the closest mean. It then recalculates the
mean of each cluster as the centroid of the vectors in the cluster. This
process repeats until the cluster memberships stabilise. This is a
hill-climbing algorithm which may converge to a local optimum. Hence the
clustering is often repeated with random initial means and the most commonly
occurring output means are chosen.

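For example, a minimal sketch of such repeated runs, assuming the repeats
argument of KMeansClusterer reruns clustering as described above::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
    # repeats=10 reruns k-means from fresh random means and keeps the
    # most commonly occurring output means
    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)
    clusters = clusterer.cluster(vectors, assign_clusters=True)
    print(clusters)           # cluster index for each input vector
    print(clusterer.means())  # the chosen cluster means
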
The GAAC clusterer starts with each of the *N* vectors as singleton clusters.
It then iteratively merges pairs of clusters which have the closest centroids.
This continues until there is only one cluster. The order of merges gives rise
to a dendrogram - a tree with the earlier merges lower than later merges. The
membership of a given number of clusters *c*, *1 <= c <= N*, can be found by
cutting the dendrogram at depth *c*.

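For example, a minimal sketch of GAAC clustering, following the demo in
nltk.cluster.gaac; the dendrogram method and its show printer are assumed
from that module::

    from numpy import array
    from nltk.cluster import GAAClusterer

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0], [2, 3], [3, 1]]]
    # request the memberships found by cutting the dendrogram into 2 clusters
    clusterer = GAAClusterer(2)
    print(clusterer.cluster(vectors, True))
    # the dendrogram records the full order of merges
    clusterer.dendrogram().show()
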
The Gaussian EM clusterer models the vectors as being produced by a mixture
of k Gaussian sources. The parameters of these sources (prior probability,
mean and covariance matrix) are then found to maximise the likelihood of the
given data. This is done with the expectation maximisation algorithm. It
starts with k arbitrarily chosen means, priors and covariance matrices. It
then calculates the membership probabilities for each vector in each of the
clusters - this is the 'E' step. The cluster parameters are then updated in
the 'M' step using the maximum likelihood estimate from the cluster membership
probabilities. This process continues until the likelihood of the data does
not significantly increase.

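For example, a minimal sketch of EM clustering, based on the demo in
nltk.cluster.em; the explicit initial means are an illustrative choice, and
the priors and covariance matrices take their defaults::

    from numpy import array
    from nltk.cluster import EMClusterer

    vectors = [array(f) for f in [[0.5, 0.5], [1.5, 0.5], [1, 3]]]
    # seed with two slightly different initial means so the sources separate
    clusterer = EMClusterer([[4, 2], [4, 2.01]])
    print(clusterer.cluster(vectors, True))   # hard assignments after fitting
    print(clusterer.classify(array([1, 3])))  # classify a fresh vector
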
All the clusterers extend the ClusterI interface, which defines the common
operations available for each clusterer. These operations include:

- cluster: clusters a sequence of vectors
- classify: assign a vector to a cluster
- classification_probdist: give the probability distribution over cluster memberships

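For example, a minimal sketch exercising all three operations; EMClusterer is
used here since, unlike the default 0/1 likelihoods, it supplies non-trivial
membership probabilities::

    from numpy import array
    from nltk.cluster import EMClusterer

    vectors = [array(f) for f in [[0.5, 0.5], [1.5, 0.5], [1, 3]]]
    clusterer = EMClusterer([[4, 2], [4, 2.01]])
    clusterer.cluster(vectors)                # cluster a sequence of vectors
    print(clusterer.classify(array([1, 3])))  # assign a vector to a cluster
    pdist = clusterer.classification_probdist(array([1, 3]))
    for c in range(clusterer.num_clusters()): # probability of each membership
        print(c, pdist.prob(c))
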
The current existing clusterers also extend VectorSpaceClusterer, an
abstract class which allows for singular value decomposition (SVD) and vector
normalisation. SVD is used to reduce the dimensionality of the vector space in
such a manner as to preserve as much of the variation as possible, by
reparameterising the axes in order of variability and discarding all bar the
first d dimensions. Normalisation ensures that vectors lie on the unit
hypersphere.

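For example, a minimal sketch of these options; normalise and svd_dimensions
are the constructor arguments defined by VectorSpaceClusterer::

    from numpy import array
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [array(f) for f in [[3, 3, 1], [1, 2, 1], [4, 2, 1], [4, 0, 1]]]
    # normalise each vector to unit length, then keep only the 2 most
    # variable SVD dimensions before clustering
    clusterer = KMeansClusterer(2, euclidean_distance,
                                normalise=True, svd_dimensions=2)
    print(clusterer.cluster(vectors, True))
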
Usage example (see also demo())::

    from nltk import cluster
    from nltk.cluster import euclidean_distance
    from numpy import array

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)
    clusterer = cluster.KMeansClusterer(2, euclidean_distance)
    clusterer.cluster(vectors, True)

    # classify a new vector
    print(clusterer.classify(array([3, 3])))

Note that the vectors must use numpy array-like
objects. nltk_contrib.unimelb.tacohn.SparseArrays may be used for
efficiency when required.
"""

from nltk.cluster.em import EMClusterer
from nltk.cluster.gaac import GAAClusterer
from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import (
    Dendrogram,
    VectorSpaceClusterer,
    cosine_distance,
    euclidean_distance,
)