.. Copyright (C) 2001-2023 NLTK Project
.. For license information, see LICENSE.TXT


=======================================
Demonstrate word embedding using Gensim
=======================================

    >>> from nltk.test.gensim_fixt import setup_module
    >>> setup_module()

We demonstrate three tasks:
- Training word embeddings on the Brown Corpus;
- Loading a pre-trained model and performing simple tasks with it; and
- Pruning the pre-trained binary model.

    >>> import gensim


---------------
Train the model
---------------

Here we train a word embedding using the Brown Corpus:

    >>> from nltk.corpus import brown
    >>> train_set = brown.sents()[:10000]
    >>> model = gensim.models.Word2Vec(train_set)

Training the model can take some time, so once it is trained it can be saved and reloaded later:

    >>> model.save('brown.embedding')
    >>> new_model = gensim.models.Word2Vec.load('brown.embedding')

The model maps each word in its vocabulary to an embedding vector, so we can easily get the vector representation of a word.

    >>> len(new_model.wv['university'])
    100
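
The 100 dimensions are simply gensim's default `vector_size`; a different
dimensionality can be requested at training time. A minimal sketch (not run
here, since training again would be slow):

|    model_300 = gensim.models.Word2Vec(train_set, vector_size=300)
|    len(model_300.wv['university'])   # 300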

Gensim already implements a number of supporting functions for working with word embeddings.
For example, to compute the cosine similarity between two words:

    >>> new_model.wv.similarity('university', 'school') > 0.3
    True
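
`similarity` is the cosine of the angle between the two embedding vectors; a
minimal sketch of the same computation with numpy:

|    import numpy as np
|    v1 = new_model.wv['university']
|    v2 = new_model.wv['school']
|    # Cosine similarity: dot product of the vectors divided by the product of their norms
|    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))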


---------------------------
Using the pre-trained model
---------------------------

NLTK includes a pre-trained model which is a pruned subset of a model trained on 100 billion words from the Google News dataset.
The full model is available from https://code.google.com/p/word2vec/ (about 3 GB).

    >>> from nltk.data import find
    >>> word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
    >>> model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

We pruned the model to only include the most common words (~44k words).

    >>> len(model)
    43981
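
Whether a word survived the pruning can be checked against the vocabulary
directly (`key_to_index` is gensim's word-to-index mapping):

|    'university' in model.key_to_index   # True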

Each word is represented in the space of 300 dimensions:

    >>> len(model['university'])
    300

Finding the top n words most similar to a target word is simple. The result is a list of n words, each paired with its similarity score.

    >>> model.most_similar(positive=['university'], topn=3)
    [('universities', 0.70039...), ('faculty', 0.67809...), ('undergraduate', 0.65870...)]

Finding the word that does not belong in a list is also supported, although implementing this yourself is simple (see the sketch after the example).

    >>> model.doesnt_match('breakfast cereal dinner lunch'.split())
    'cereal'
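
A do-it-yourself version embeds each word and picks the one least similar to
the mean of the group; a minimal sketch with numpy:

|    import numpy as np
|    def odd_one_out(words):
|        # Unit-normalise each vector, then score each word against the group mean
|        vecs = np.array([model[w] / np.linalg.norm(model[w]) for w in words])
|        sims = vecs @ vecs.mean(axis=0)
|        return words[int(np.argmin(sims))]
|
|    odd_one_out('breakfast cereal dinner lunch'.split())   # 'cereal'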

Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. For example,
the vector 'King - Man + Woman' is close to 'Queen' and 'Germany - Berlin + Paris' is close to 'France'.

    >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
    [('queen', 0.71181...)]

    >>> model.most_similar(positive=['Paris', 'Germany'], negative=['Berlin'], topn=1)
    [('France', 0.78840...)]
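
The same analogies can be computed by hand with vector arithmetic and
`similar_by_vector`; note that the raw input words may themselves rank
highest, which `most_similar` filters out for you:

|    vec = model['king'] - model['man'] + model['woman']
|    model.similar_by_vector(vec, topn=5)   # 'queen' appears among the top hits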

We can visualize the word embeddings using t-SNE (https://lvdmaaten.github.io/tsne/). For this demonstration, we visualize the first 1000 words.

|    import numpy as np
|    labels = []
|    count = 0
|    max_count = 1000
|    X = np.zeros(shape=(max_count, len(model['university'])))
|
|    for term in model.index_to_key:
|        X[count] = model[term]
|        labels.append(term)
|        count += 1
|        if count >= max_count: break
|
|    # It is recommended to use PCA first to reduce to ~50 dimensions
|    from sklearn.decomposition import PCA
|    pca = PCA(n_components=50)
|    X_50 = pca.fit_transform(X)
|
|    # Use t-SNE to further reduce to 2 dimensions
|    from sklearn.manifold import TSNE
|    model_tsne = TSNE(n_components=2, random_state=0)
|    Y = model_tsne.fit_transform(X_50)
|
|    # Show the scatter plot
|    import matplotlib.pyplot as plt
|    plt.scatter(Y[:, 0], Y[:, 1], 20)
|
|    # Add labels
|    for label, x, y in zip(labels, Y[:, 0], Y[:, 1]):
|        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points', size=10)
|
|    plt.show()



------------------------------
Prune the trained binary model
------------------------------

Here is the supporting code used to extract part of the binary model
(GoogleNews-vectors-negative300.bin.gz) from https://code.google.com/p/word2vec/.
We used this code to produce the `word2vec_sample` model above.

|    import gensim
|    # Load the binary model
|    model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
|
|    # Only keep words that appear in the Brown Corpus
|    from nltk.corpus import brown
|    words = set(brown.words())
|    print(len(words))
|
|    # Write the retained words and their vectors to a file in word2vec text format
|    out_file = 'pruned.word2vec.txt'
|    with open(out_file, 'w') as f:
|        word_presented = words.intersection(model.index_to_key)
|        # First line: vocabulary size and vector dimensionality
|        f.write('{} {}\n'.format(len(word_presented), len(model['word'])))
|
|        for word in word_presented:
|            f.write('{} {}\n'.format(word, ' '.join(str(value) for value in model[word])))
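
The pruned text file can then be loaded back with the same
`load_word2vec_format` call used above; a quick sanity check:

|    pruned = gensim.models.KeyedVectors.load_word2vec_format('pruned.word2vec.txt', binary=False)
|    print(len(pruned), len(pruned['university']))   # vocabulary size and dimensionality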