CC100 GloVe Embeddings for the Khmer (km) Language

Model Description

Language: Khmer (km)
Embedding Algorithm: GloVe (Global Vectors for Word Representation)
Vocabulary Size: 417310
Vector Dimensions: 300
Training Data: CC100 dataset

Training Information

We trained the GloVe embeddings with the original C code. Training ran for 100 iterations, stochastically sampling nonzero elements from the co-occurrence matrix, and produced 300-dimensional vectors. The context window was symmetric: ten words to the left and ten words to the right. For languages with more than 1 million tokens of training data, words with fewer than 5 co-occurrences were excluded; for languages with smaller datasets, the threshold was lowered to 2.
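To make the symmetric ten-word window concrete, here is a minimal Python sketch of windowed co-occurrence counting. It is an illustration only, not the original C code: the function name `cooccurrence_counts` is hypothetical, and the 1/distance weighting of each pair is an assumption that follows the GloVe paper rather than anything stated in this card.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=10):
    """Accumulate symmetric co-occurrence weights within `window` tokens.

    Assumption: a pair at distance d contributes 1/d, as in the GloVe paper.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Scan only the left context; recording both (word, ctx) and
        # (ctx, word) makes the resulting window symmetric.
        start = max(0, i - window)
        for j in range(start, i):
            weight = 1.0 / (i - j)
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return dict(counts)
```

Pairs whose accumulated count is zero never appear in the result, which matches the idea of training only on the nonzero elements of the co-occurrence matrix.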

We trained the static word embeddings on data from CC100. We set x_max = 100 and α = 3/4 in the GloVe weighting function and used AdaGrad optimization with an initial learning rate of 0.05.
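The two hyperparameters above control how strongly each co-occurrence count contributes to the GloVe objective. The following sketch shows the weighting function as defined in the GloVe paper, using the values reported here as defaults; the function name `glove_weight` is chosen for illustration and is not taken from the C code.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max) ** alpha below x_max, capped at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0
```

Counts at or above x_max all receive weight 1, so very frequent pairs do not dominate the loss, while rare pairs are down-weighted smoothly.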

Usage

These embeddings can be used for NLP tasks such as text classification and named entity recognition, or as input features for neural networks.
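As a starting point, the sketch below loads vectors and compares two words. It assumes the embeddings are distributed in the standard GloVe text format (one token per line, followed by its space-separated vector components); this card does not state the file format or file name, so both are assumptions, and the helper names are hypothetical.

```python
import math


def load_glove_text(path, encoding="utf-8"):
    """Read a GloVe-style text file into a dict of token -> list of floats."""
    vectors = {}
    with open(path, encoding=encoding) as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Example (hypothetical file name; adjust to the file shipped with this release):
# vectors = load_glove_text("vectors.txt")
# print(cosine_similarity(vectors["ខ្មែរ"], vectors["ភាសា"]))
```

With a vocabulary of 417310 entries and 300 dimensions, storing plain Python lists is memory-hungry; converting the rows to a NumPy array is a reasonable next step for real workloads.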

Citation

If you use these embeddings in your research, please cite:

@misc{gurgurov2024gremlinrepositorygreenbaseline,
      title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge}, 
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193}, 
}

License

These embeddings are released under the CC-BY-SA 4.0 License.