---
license: mit
datasets:
- FinGPT/fingpt-sentiment-train
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
widget:
- text: "The current lay-offs are additional to the temporary lay-offs agreed in December 2008 and in May 2009."
- text: "Last month we added to our $GILD position and started a new one in $BWLD. We see slow, steady, unspectacular growth going forward near term."
---

# Model Card for financial_sentiment_analysis_gpt2_model

https://huggingface.co/rezacsedu/financial_sentiment_analysis_gpt2_model

## Model Details

### Model Description

This is a GPT-2 model fine-tuned on the https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train dataset for financial sentiment analysis. It classifies a piece of English financial text into one of nine sentiment labels, ranging from strongly negative to strongly positive.

- **Developed by:** Rezaul Karim, Ph.D.
- **Model type:** GPT2ForSequenceClassification (fine-tuned GPT-2)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from the model:** `gpt2`, as loaded in the fine-tuning code below

### Model Sources

- **Repository:** https://github.com/rezacsedu/financial_sentiment_analysis_LLM
- **Paper:** on the way
- **Demo:** on the way

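## Uses

The model is already fine-tuned, so it can be used directly for financial sentiment analysis without further training. The snippet below loads it and returns every label with its probability, sorted from most to least likely.
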
```
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and tokenizer from the directory written by
# trainer.save_model(); the tokenizer files must be saved there as well
model = AutoModelForSequenceClassification.from_pretrained("fine_tuned_finsetiment_model")
tokenizer = AutoTokenizer.from_pretrained("fine_tuned_finsetiment_model")
model.eval()

# GPT-2 ships without a padding token; reuse EOS so padding=True works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Map class indices to labels (integer keys, matching enumerate() below)
label_mapping_reverse = {
    0: 'Mildly positive',
    1: 'Mildly negative',
    2: 'Moderately negative',
    3: 'Moderately positive',
    4: 'Positive',
    5: 'Negative',
    6: 'Neutral',
    7: 'Strongly negative',
    8: 'Strongly positive'
}

def model_predict(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    # Get predictions from the model
    with torch.no_grad():
        logits = model(**inputs).logits

    # Convert to probabilities
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

    # Create a list of tuples with label and probability
    label_prob_pairs = [(label_mapping_reverse[label_idx], prob.item())
                        for label_idx, prob in enumerate(probabilities.squeeze(0))]

    # Sort the list by probability in descending order
    sorted_label_prob_pairs = sorted(label_prob_pairs, key=lambda pair: pair[1], reverse=True)

    # Return the sorted list of label-probability pairs
    return sorted_label_prob_pairs

# Example usage
text = "Intel Corporation (NASDAQ: INTC) has unveiled a remote verification platform called Project Amber"
predictions = model_predict(text)
for label, prob in predictions:
    print(f"{label}: {prob:.3f}")
```

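When only the overall polarity matters, the nine fine-grained labels can be collapsed into three coarse classes. The following is a minimal sketch and not part of the original model code: the helper name `coarse_sentiment` is hypothetical, and it assumes its input is a list of `(label, probability)` pairs like the one returned by `model_predict`, with label strings spelled as in `label_mapping_reverse`.

```python
def coarse_sentiment(label_prob_pairs):
    """Sum fine-grained label probabilities into positive/negative/neutral."""
    totals = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    for label, prob in label_prob_pairs:
        name = label.lower()
        if "positive" in name:
            totals["positive"] += prob
        elif "negative" in name:
            totals["negative"] += prob
        else:
            totals["neutral"] += prob
    # Most probable coarse class first
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Illustrative probabilities, not real model output
example = [
    ("Mildly positive", 0.40),
    ("Neutral", 0.35),
    ("Positive", 0.15),
    ("Negative", 0.10),
]
for polarity, prob in coarse_sentiment(example):
    print(f"{polarity}: {prob:.2f}")
```

Summing probabilities, rather than taking the polarity of the single top label, keeps the result stable when several closely related labels share the probability mass.
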
## Training Details

### Training Data

The model is trained on the `train` split of FinGPT/fingpt-sentiment-train. The code below tokenizes the `input` column, maps each distinct `output` string to an integer label, and holds out 30% of the examples for testing. For this run, 100 training and 100 evaluation examples are then sampled from the two splits.

```
import random
import string

from datasets import DatasetDict, load_dataset
from transformers import GPT2Tokenizer

dataset = load_dataset("FinGPT/fingpt-sentiment-train")

# GPT-2 has no padding token, so reuse the end-of-sequence token
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["input"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

def generate_random_id():
    return ''.join(random.choices(string.ascii_lowercase + string.digits, k=10))

# Assign an integer index to every distinct sentiment label
unique_outputs = set(dataset['train']['output'])

#label_mapping = {'mildly positive': 0, 'positive': 1, 'strong positive':2, 'moderately positive': 3, 'negative': 4, 'neutral': 5} # Add more mappings as needed
label_mapping = {label: index for index, label in enumerate(unique_outputs)}

def transform_dataset(dataset):
    dataset = dataset.rename_column('input', 'text')
    dataset = dataset.rename_column('output', 'label_text')

    dataset = dataset.remove_columns(['instruction'])

    dataset = dataset.add_column('id', [generate_random_id() for _ in range(dataset.num_rows)])
    dataset = dataset.add_column('label', [label_mapping[label_text] for label_text in dataset['label_text']])

    return dataset

transformed_dataset = DatasetDict({'train': transform_dataset(tokenized_datasets['train'])})
transformed_dataset['train'].set_format(type=None, columns=['id', 'text', 'label', 'label_text', 'input_ids', 'attention_mask'])

# Hold out 30% of the examples as a test split
train_test_split = transformed_dataset['train'].train_test_split(test_size=0.3, seed=42)

tokenized_datasets['test'] = train_test_split['test']
tokenized_datasets['train'] = train_test_split['train']

# Sample 100 examples from each split
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))
```

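Because `unique_outputs` is a `set`, the order in which `enumerate` visits the labels, and therefore the integer assigned to each one, is not guaranteed to be the same in every Python process. The following is a minimal sketch of one way to make the mapping reproducible, assuming it is acceptable to order the labels alphabetically. `build_label_mapping` is a hypothetical helper and the label strings are illustrative; the resulting indices need not match the index-to-label table shown in the Uses section.

```python
def build_label_mapping(labels):
    """Return (label -> index, index -> label) using a stable, sorted order."""
    ordered = sorted(set(labels))
    label_to_id = {label: index for index, label in enumerate(ordered)}
    id_to_label = {index: label for label, index in label_to_id.items()}
    return label_to_id, id_to_label


# Illustrative label strings; duplicates are collapsed before indexing
label_to_id, id_to_label = build_label_mapping(
    ["positive", "neutral", "negative", "neutral", "mildly positive"]
)
print(label_to_id)
print(id_to_label)
```

Keeping the inverse mapping next to the forward one lets inference code use exactly the indices that training used.
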
### Fine-tune Procedure

```
from transformers import GPT2ForSequenceClassification
from transformers import TrainingArguments, Trainer

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=9)
# Tell the classification head which token is padding, so that it pools the
# last non-padding token of each padded sequence
model.config.pad_token_id = tokenizer.pad_token_id

training_args = TrainingArguments(
    output_dir="test_trainer",
    #evaluation_strategy="epoch",
    per_device_train_batch_size=1,  # Reduce batch size here
    per_device_eval_batch_size=1,   # Optionally, reduce for evaluation as well
    gradient_accumulation_steps=4
)

# compute_metrics is defined in the Evaluation section; run that code first
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()
trainer.save_model("fine_tuned_finsetiment_model")
# Save the tokenizer alongside the model so both load from the same directory
tokenizer.save_pretrained("fine_tuned_finsetiment_model")
```

#### Training Hyperparameters

- **Training regime:** fp32; the `TrainingArguments` above set no mixed-precision flags
- **Batch size:** 1 per device, with 4 gradient accumulation steps
- **Other settings:** `TrainingArguments` defaults

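With `per_device_train_batch_size=1` and `gradient_accumulation_steps=4`, gradients from four examples are accumulated before each optimizer update. The arithmetic can be sketched as follows, assuming a single device and the 100-example training subset selected earlier. Both helpers are hypothetical and are not `transformers` functions; they round up, whereas the `Trainer` may round differently when the division is not exact.

```python
import math


def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * accumulation_steps * num_devices


def optimizer_steps_per_epoch(num_examples, per_device_batch_size,
                              accumulation_steps, num_devices=1):
    """Optimizer updates needed to see every example once (rounded up)."""
    micro_batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return max(1, math.ceil(micro_batches / accumulation_steps))


print(effective_batch_size(1, 4))            # 4
print(optimizer_steps_per_epoch(100, 1, 4))  # 25
```
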
## Evaluation

The model is evaluated on accuracy, using the `compute_metrics` function that is passed to the `Trainer`.

```
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred unpacks into the raw logits and the reference label ids
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    return metric.compute(predictions=predictions, references=labels)
```

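To make explicit what `compute_metrics` measures: the predicted class for each example is the index of its largest logit, and accuracy is the fraction of predictions that equal the reference label. Below is a minimal pure-Python sketch with made-up logits (three classes rather than nine, for brevity); `accuracy_from_logits` is a hypothetical name, not part of `evaluate`.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose arg-max index equals the reference label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == ref) for pred, ref in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four examples and three classes
logits = [
    [2.0, 0.1, -1.0],  # predicted class 0
    [0.3, 0.2, 1.5],   # predicted class 2
    [0.0, 0.9, 0.4],   # predicted class 1
    [1.2, 1.1, 0.0],   # predicted class 0
]
labels = [0, 2, 1, 1]
print(accuracy_from_logits(logits, labels))  # 0.75
```
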
## Model Card Contact

[email protected]