Spaces:
Sleeping
Sleeping
.. Copyright (C) 2001-2023 NLTK Project | |
.. For license information, see LICENSE.TXT | |
======================================= | |
Demonstrate word embedding using Gensim | |
======================================= | |
>>> from nltk.test.gensim_fixt import setup_module | |
>>> setup_module() | |
We demonstrate three functions: | |
- Train the word embeddings using brown corpus; | |
- Load the pre-trained model and perform simple tasks; and | |
- Pruning the pre-trained binary model. | |
>>> import gensim | |
--------------- | |
Train the model | |
--------------- | |
Here we train a word embedding using the Brown Corpus: | |
>>> from nltk.corpus import brown | |
>>> train_set = brown.sents()[:10000] | |
>>> model = gensim.models.Word2Vec(train_set) | |
It might take some time to train the model. So, after it is trained, it can be saved as follows: | |
>>> model.save('brown.embedding') | |
>>> new_model = gensim.models.Word2Vec.load('brown.embedding') | |
The model will be the list of words with their embedding. We can easily get the vector representation of a word. | |
>>> len(new_model.wv['university']) | |
100 | |
There are some supporting functions already implemented in Gensim to manipulate with word embeddings. | |
For example, to compute the cosine similarity between 2 words: | |
>>> new_model.wv.similarity('university','school') > 0.3 | |
True | |
--------------------------- | |
Using the pre-trained model | |
--------------------------- | |
NLTK includes a pre-trained model which is part of a model that is trained on 100 billion words from the Google News Dataset. | |
The full model is from https://code.google.com/p/word2vec/ (about 3 GB). | |
>>> from nltk.data import find | |
>>> word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt')) | |
>>> model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False) | |
We pruned the model to only include the most common words (~44k words). | |
>>> len(model) | |
43981 | |
Each word is represented in the space of 300 dimensions: | |
>>> len(model['university']) | |
300 | |
Finding the top n words that are similar to a target word is simple. The result is the list of n words with the score. | |
>>> model.most_similar(positive=['university'], topn = 3) | |
[('universities', 0.70039...), ('faculty', 0.67809...), ('undergraduate', 0.65870...)] | |
Finding a word that is not in a list is also supported, although, implementing this by yourself is simple. | |
>>> model.doesnt_match('breakfast cereal dinner lunch'.split()) | |
'cereal' | |
Mikolov et al. (2013) figured out that word embedding captures much of syntactic and semantic regularities. For example, | |
the vector 'King - Man + Woman' is close to 'Queen' and 'Germany - Berlin + Paris' is close to 'France'. | |
>>> model.most_similar(positive=['woman','king'], negative=['man'], topn = 1) | |
[('queen', 0.71181...)] | |
>>> model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1) | |
[('France', 0.78840...)] | |
We can visualize the word embeddings using t-SNE (https://lvdmaaten.github.io/tsne/). For this demonstration, we visualize the first 1000 words. | |
| import numpy as np | |
| labels = [] | |
| count = 0 | |
| max_count = 1000 | |
| X = np.zeros(shape=(max_count,len(model['university']))) | |
| | |
| for term in model.index_to_key: | |
| X[count] = model[term] | |
| labels.append(term) | |
| count+= 1 | |
| if count >= max_count: break | |
| | |
| # It is recommended to use PCA first to reduce to ~50 dimensions | |
| from sklearn.decomposition import PCA | |
| pca = PCA(n_components=50) | |
| X_50 = pca.fit_transform(X) | |
| | |
| # Using TSNE to further reduce to 2 dimensions | |
| from sklearn.manifold import TSNE | |
| model_tsne = TSNE(n_components=2, random_state=0) | |
| Y = model_tsne.fit_transform(X_50) | |
| | |
| # Show the scatter plot | |
| import matplotlib.pyplot as plt | |
| plt.scatter(Y[:,0], Y[:,1], 20) | |
| | |
| # Add labels | |
| for label, x, y in zip(labels, Y[:, 0], Y[:, 1]): | |
| plt.annotate(label, xy = (x,y), xytext = (0, 0), textcoords = 'offset points', size = 10) | |
| | |
| plt.show() | |
------------------------------ | |
Prune the trained binary model | |
------------------------------ | |
Here is the supporting code to extract part of the binary model (GoogleNews-vectors-negative300.bin.gz) from https://code.google.com/p/word2vec/ | |
We use this code to get the `word2vec_sample` model. | |
| import gensim | |
| # Load the binary model | |
| model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary = True) | |
| | |
| # Only output word that appear in the Brown corpus | |
| from nltk.corpus import brown | |
| words = set(brown.words()) | |
| print(len(words)) | |
| | |
| # Output presented word to a temporary file | |
| out_file = 'pruned.word2vec.txt' | |
| with open(out_file,'w') as f: | |
| word_presented = words.intersection(model.index_to_key) | |
| f.write('{} {}\n'.format(len(word_presented),len(model['word']))) | |
| | |
| for word in word_presented: | |
| f.write('{} {}\n'.format(word, ' '.join(str(value) for value in model[word]))) | |