|
--- |
|
license: apache-2.0 |
|
base_model: distilbert-base-uncased |
|
tags: |
|
- generated_from_trainer |
|
datasets: |
|
- imdb |
|
model-index: |
|
- name: distilbert-base-uncased-finetuned-imdb-v2
|
results: [] |
|
language: |
|
- en |
|
metrics: |
|
- perplexity |
|
--- |
|
|
|
# distilbert-base-uncased-finetuned-imdb-v2 |
|
|
|
This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the imdb dataset. |
|
It achieves the following results on the evaluation set: |
|
- Loss: 2.3033 |
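
Since perplexity is the exponential of the cross-entropy loss, this corresponds to a perplexity of roughly exp(2.3033) ≈ 10.0 on the evaluation set.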
|
|
|
## Model description |
|
|
|
This model is a fine-tuned version of DistilBERT base uncased on the IMDb dataset. It was trained with a masked language modeling objective, in which randomly masked tokens in a review are predicted from the surrounding context, adapting the model to the language patterns and sentiment vocabulary of movie reviews.
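
As an illustration of the masking step, the sketch below uses `DataCollatorForLanguageModeling` to produce MLM inputs; the 15% masking probability is the library default and an assumption here, not a value recorded in this card.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Collator that randomly replaces tokens with [MASK] for the MLM objective.
# mlm_probability=0.15 is the library default, assumed here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

examples = [tokenizer("A surprisingly touching film with a strong cast.")]
batch = collator(examples)

# 'labels' keeps the original ids at masked positions and -100 everywhere else,
# so the loss is computed only on the masked tokens.
print(batch["input_ids"])
print(batch["labels"])
```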
|
|
|
## Intended uses & limitations |
|
|
|
This model is primarily designed for the fill-mask task: given a sentence with a masked position, it predicts the most likely missing word from the surrounding context. This makes it useful for completing sentences or phrases, improving auto-completion in writing applications, and enhancing conversational agents' responses. However, it may struggle with domain-specific language or topics not represented in the IMDb dataset, and it is unlikely to perform well on languages other than English.
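
For a quick check of the fill-mask behaviour, the `pipeline` API can be used; this is a minimal sketch, with the repository id taken from the usage example further below.

```python
from transformers import pipeline

# Fill-mask pipeline backed by this fine-tuned checkpoint
fill_mask = pipeline(
    "fill-mask",
    model="Francesco-A/distilbert-base-uncased-finetuned-imdb-v2",
)

# Each prediction contains the filled token, its score, and the completed sentence
for prediction in fill_mask("This movie is really [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```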
|
|
|
## Training and evaluation data |
|
|
|
The model was fine-tuned on a 40,000-review subset of the IMDb dataset and evaluated on a separate held-out set of 6,000 reviews.
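
The exact subsampling is not recorded in this card; the sketch below shows one way a comparable 40,000/6,000 split could be drawn from the `imdb` dataset (pooling the labelled and unlabelled reviews, and the seed, are assumptions).

```python
from datasets import load_dataset, concatenate_datasets

# The IMDb dataset ships with 'train', 'test', and 'unsupervised' splits
imdb = load_dataset("imdb")

# Pool labelled and unlabelled reviews, shuffle, and slice off the two subsets.
# The pooling strategy and seed=42 are assumptions, not values recorded in this card.
pool = concatenate_datasets([imdb["train"], imdb["unsupervised"]]).shuffle(seed=42)

train_ds = pool.select(range(40_000))          # 40,000 reviews for fine-tuning
eval_ds = pool.select(range(40_000, 46_000))   # 6,000 held-out reviews for evaluation

print(train_ds)
print(eval_ds)
```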
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (an illustrative `TrainingArguments` mapping is sketched after the list):
|
- learning_rate: 2e-05 |
|
- train_batch_size: 64 |
|
- eval_batch_size: 64 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 3 |
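
A minimal sketch of how these values map onto a `TrainingArguments` object; arguments not listed in this card (output directory, evaluation strategy, mixed precision, and so on) are illustrative assumptions.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; arguments not listed
# in this card (output_dir, evaluation_strategy, ...) are assumptions.
training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-imdb-v2",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    evaluation_strategy="epoch",
)
```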
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | |
|
|:-------------:|:-----:|:----:|:---------------:| |
|
| 2.4912 | 1.0 | 625 | 2.3564 | |
|
| 2.4209 | 2.0 | 1250 | 2.3311 | |
|
| 2.4 | 3.0 | 1875 | 2.3038 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.31.0 |
|
- Pytorch 2.0.1+cu118 |
|
- Datasets 2.14.4 |
|
- Tokenizers 0.13.3 |
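
To approximately reproduce this environment, the listed versions can be pinned, e.g. `pip install transformers==4.31.0 datasets==2.14.4 tokenizers==0.13.3`, together with a PyTorch 2.0.1 build matching the local CUDA setup.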
|
|
|
## How to use |
|
|
|
```python
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Francesco-A/distilbert-base-uncased-finetuned-imdb-v2")
model = AutoModelForMaskedLM.from_pretrained("Francesco-A/distilbert-base-uncased-finetuned-imdb-v2")

# Example sentence with a masked position
sentence = "This movie is really [MASK]."

# Tokenize the sentence
inputs = tokenizer(sentence, return_tensors="pt")

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Get the top-k predicted tokens and their probabilities at the masked position
k = 5  # Number of top predictions to retrieve
masked_token_index = inputs["input_ids"].tolist()[0].index(tokenizer.mask_token_id)
predicted_token_logits = outputs.logits[0, masked_token_index]
topk_values, topk_indices = torch.topk(torch.softmax(predicted_token_logits, dim=-1), k)

# Convert the top predicted token ids to words and the probabilities to Python floats
predicted_tokens = [tokenizer.decode(idx.item()) for idx in topk_indices]
probs = topk_values.tolist()

# Collect the top predicted words and their probabilities in a DataFrame
df = pd.DataFrame({
    "Predicted Words": predicted_tokens,
    "Probability": probs,
})

# Display the predictions
print(df)
```