Francesco-A committed e2f6cd1 (parent: 4952086): Update README.md

Files changed: README.md (+53 -7)

README.md (after the change):
---
[...]
datasets:
- imdb
model-index:
- name: distilbert-base-uncased-finetuned-imdb
  results: []
language:
- en
metrics:
- perplexity
---

# distilbert-base-uncased-finetuned-imdb-v2

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the imdb dataset.
It achieves the following results on the evaluation set:

[...]
 
## Model description

This model is a fine-tuned version of DistilBERT base uncased on the IMDb dataset. It was trained with a masked language modeling (MLM) objective: rather than predicting the next word, it learns to predict tokens that have been randomly masked out of a sentence, using the context on both sides. Fine-tuning adapts the model to the language patterns and sentiment vocabulary of movie reviews. The masking setup is sketched below.
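
In MLM training, the masking is usually applied on the fly by a data collator. The card does not describe the training code, so the following is only a minimal sketch of the conventional setup with transformers' `DataCollatorForLanguageModeling`; the 15% masking rate is the standard default, not a documented detail of this model:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Conventional MLM collator: randomly selects ~15% of tokens for prediction,
# mostly replacing them with [MASK]. The rate actually used for this model
# is an assumption, not stated in the card.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Collate one tokenized review into a training batch
batch = data_collator([tokenizer("This movie was surprisingly good.")])
print(batch["input_ids"])  # some positions replaced by the [MASK] id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```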

## Intended uses & limitations

This model is intended for the fill-mask task: given a sentence containing a [MASK] token, it predicts the most likely words for the masked position from the surrounding context. This makes it useful for completing sentences or phrases, improving auto-completion in writing applications, and enhancing conversational agents' responses. However, it may struggle with domain-specific language or topics not represented in the IMDb dataset, and since it was fine-tuned on English text it is unlikely to perform well in other languages. A minimal pipeline example follows.
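
The quickest way to try the checkpoint is the transformers fill-mask pipeline; a minimal sketch, using this repo's model id:

```python
from transformers import pipeline

# Load this checkpoint through the high-level fill-mask pipeline
fill_mask = pipeline(
    "fill-mask",
    model="Francesco-A/distilbert-base-uncased-finetuned-imdb-v2",
)

# Each prediction carries the filled token, its score, and the full sequence
for pred in fill_mask("This movie is really [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```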

## Training and evaluation data

The model was fine-tuned on a 40,000-review subset of the IMDb dataset and evaluated on a separate set of 6,000 reviews. A plausible way to construct such a split is sketched below.
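
The exact split procedure is not given in the card. A plausible reconstruction with the datasets library follows; the 40,000/6,000 sizes come from the card, while the source splits, shuffling seed, and selection method are assumptions:

```python
from datasets import load_dataset

# IMDb ships 25k labeled train + 25k test + 50k unsupervised reviews.
# One way to get 40,000 / 6,000: pool the labeled splits and subsample.
# (Hypothetical: the card does not say which splits the subset came from.)
pooled = load_dataset("imdb", split="train+test").shuffle(seed=42)
train_ds = pooled.select(range(40_000))
eval_ds = pooled.select(range(40_000, 46_000))
print(len(train_ds), len(eval_ds))  # 40000 6000
```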

## Training procedure

The following hyperparameters were used during training:

[...]

Framework versions:

- Pytorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3
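
The metric declared in the metadata is perplexity, which for a masked language model is conventionally reported as the exponential of the evaluation cross-entropy loss. A minimal sketch with a placeholder loss value (not a result from this card):

```python
import math

eval_loss = 2.45  # placeholder value, not a result from this card
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")  # 11.59 for a loss of 2.45
```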

## How to use

```python
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Francesco-A/distilbert-base-uncased-finetuned-imdb-v2")
model = AutoModelForMaskedLM.from_pretrained("Francesco-A/distilbert-base-uncased-finetuned-imdb-v2")

# Example sentence
sentence = "This movie is really [MASK]."

# Tokenize the sentence
inputs = tokenizer(sentence, return_tensors="pt")

# Get the model's predictions
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the top-k predicted tokens
k = 5  # number of top predictions to retrieve
masked_token_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_logits = outputs.logits[0, masked_token_index]
topk_values, topk_indices = torch.topk(torch.softmax(predicted_token_logits, dim=-1), k)

# Convert predicted token ids to words and probabilities to Python floats
predicted_tokens = [tokenizer.decode(idx.item()) for idx in topk_indices]
probs = topk_values.tolist()

# Display the top predicted words and their probabilities
df = pd.DataFrame({"Predicted Words": predicted_tokens, "Probability": probs})
print(df)
```
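
Running this prints a small table of the five most likely fillers for the masked position. It is the manual equivalent of the fill-mask pipeline shown earlier, and is the route to take when you need the raw logits or a custom ranking.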