metadata

license: apache-2.0
datasets:
  - AiresPucrs/sentiment-analysis
language:
  - en
metrics:
  - accuracy
library_name: keras

Embedding-model-16

Model Overview

Embedding-model-16 is a Keras model trained for sentiment analysis. Its embedding layer provides 16-dimensional word embeddings that can be extracted and reused.

Details

Size: 160,289 parameters
Model type: word embeddings
Optimizer: Adam
Number of Epochs: 20
Embedding size: 16
Hardware: Tesla V4
Emissions: Not measured
Total Energy Consumption: Not measured

How to Use

The following code snippet downloads the model and its vocabulary file, then extracts the learned word embeddings:

import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Download the model
hf_hub_download(repo_id="AiresPucrs/english-embedding-vocabulary-16",
                filename="english_embedding_vocabulary_16.keras",
                local_dir="./",
                repo_type="model"
                )

# Download the embedding vocabulary txt file
hf_hub_download(repo_id="AiresPucrs/english-embedding-vocabulary-16",
                filename="english_embedding_vocabulary.txt",
                local_dir="./",
                repo_type="model"
                )

model = tf.keras.models.load_model('english_embedding_vocabulary_16.keras')

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Read the vocabulary, one token per line (the `with` block closes the file)
with open('english_embedding_vocabulary.txt', encoding='utf-8') as fp:
    english_embedding_vocabulary = [line.strip() for line in fp]

embeddings = model.get_layer('embedding').get_weights()[0]

words_embeddings = {}

# Map each vocabulary word to its embedding vector
for i, word in enumerate(english_embedding_vocabulary):
    # Skip index 0 (""), which is just the PAD token
    if i == 0:
        continue
    words_embeddings[word] = embeddings[i]

print("Embeddings Dimensions: ", np.array(list(words_embeddings.values())).shape)
print("Vocabulary Size: ", len(words_embeddings.keys()))
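
Once the embeddings are in a dictionary, a common next step is to look up a word's nearest neighbors. The sketch below is one possible way to do this with cosine similarity. The `most_similar` helper and the toy three-dimensional vectors are hypothetical and used only for illustration; in practice you would pass the `words_embeddings` dictionary built above, whose vectors have 16 dimensions.

```python
import numpy as np

def most_similar(word, words_embeddings, top_k=3):
    """Return the top_k (word, score) pairs closest to `word` by cosine similarity.

    Hypothetical helper for illustration; it is not part of the model repository.
    """
    if word not in words_embeddings:
        raise KeyError(f"{word!r} is not in the vocabulary")
    vocab = [w for w in words_embeddings if w != word]
    if not vocab:
        return []
    matrix = np.stack([np.asarray(words_embeddings[w], dtype=float) for w in vocab])
    query = np.asarray(words_embeddings[word], dtype=float)
    # Cosine similarity: dot product divided by the product of the vector norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = (matrix @ query) / np.maximum(norms, 1e-12)
    best = np.argsort(scores)[::-1][:top_k]
    return [(vocab[i], float(scores[i])) for i in best]

# Toy vectors (assumed values, not taken from the trained model).
toy_embeddings = {
    "good": np.array([1.0, 0.9, 0.0]),
    "great": np.array([0.9, 1.0, 0.1]),
    "bad": np.array([-1.0, -0.8, 0.0]),
    "table": np.array([0.0, 0.1, 1.0]),
}

neighbors = most_similar("good", toy_embeddings, top_k=2)
print(neighbors)
```

Because the embeddings were learned for a sentiment task, neighbors may reflect sentiment polarity more than general semantic similarity, so results should be interpreted with that in mind.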

Intended Use

This model was created for research purposes only. We do not recommend any application of this model outside this scope.

Performance Metrics

The model achieved an accuracy of 84% on validation data.

Training Data

The model was trained using a dataset that was put together by combining several datasets for sentiment classification available on Kaggle:

The IMDB 50K dataset: 50K movie reviews for natural language processing or text analytics.
The Twitter US Airline Sentiment dataset: originally from Crowdflower's Data for Everyone library.
Our google_play_apps_review dataset: built using the google_play_scraper in this notebook.
The EcoPreprocessed dataset: scraped Amazon product reviews.

Limitations

We do not recommend using this model in real-world applications. It was solely developed for academic and educational purposes.

Cite as

@misc{teenytinycastle,
    doi = {10.5281/zenodo.7112065},
    url = {https://github.com/Nkluge-correa/teeny-tiny_castle},
    author = {Nicholas Kluge Corr{\^e}a},
    title = {Teeny-Tiny Castle},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
}

License

This model is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.