---
license: apache-2.0
pipeline_tag: text-classification
language:
- en
metrics:
- accuracy
library_name: transformers
tags:
- finance
---

## **Sentiment Inference Model for Stock-Related Comments**

#### *A project by NUS ISS students Frank Cao, Gerong Zhang, Jiaqi Yao, Sikai Ni, Yunduo Zhang*

<br />

### Description

This model is roberta-base fine-tuned on 3,200,000 comments from StockTwits, using the user-supplied 'Bullish' and 'Bearish' tags as labels.

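As a minimal sketch of how the user-supplied tags could be turned into training labels, the snippet below maps 'Bearish' to 0 and 'Bullish' to 1, matching the label convention this card gives for the model's output. The names `TAG_TO_ID` and `encode_examples` and the sample rows are illustrative assumptions, not taken from the project's training code.

```python
# Hypothetical label encoding; the project's actual training code may differ.
TAG_TO_ID = {"Bearish": 0, "Bullish": 1}


def encode_examples(rows):
    """Return (text, label_id) pairs, skipping rows without a usable tag."""
    return [(text, TAG_TO_ID[tag]) for text, tag in rows if tag in TAG_TO_ID]


# Made-up sample rows in (comment, user tag) form.
sample_rows = [
    ("loading up before earnings", "Bullish"),
    ("no tag on this one", None),
    ("this is heading lower", "Bearish"),
]
encoded = encode_examples(sample_rows)
print(encoded)
```

Rows without a 'Bullish' or 'Bearish' tag are dropped here, on the assumption that only tagged comments can serve as labeled examples.
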
Try entering something an individual investor might say on an investment forum into the Inference API, for example 'red' and 'green'.

[Code on GitHub](https://github.com/Gitrexx/PLPPM_Sentiment_Analysis_via_Stocktwits/tree/main/SentimentEngine)

<br />

### Training information

- Batch size: 32
- Learning rate: 2e-5

| Epoch | Train loss | Validation loss | Validation accuracy |
| ----- | ---------- | --------------- | ------------------- |
| 1 | 0.3495 | 0.2956 | 0.8679 |
| 2 | 0.2717 | 0.2235 | 0.9021 |
| 3 | 0.2360 | 0.1875 | 0.9210 |
| 4 | 0.2106 | 0.1603 | 0.9343 |

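The hyperparameters above can be collected into a small configuration object. This is only a sketch: `FineTuneConfig` and `steps_per_epoch` are hypothetical names, the epoch count is read off the table, and this card does not state how the 3,200,000 comments were split between training and validation, so the step count below is an upper bound that assumes every comment is used for training.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Values stated in this model card.
    base_model: str = "roberta-base"
    batch_size: int = 32
    learning_rate: float = 2e-5
    # Assumption: four epochs, as listed in the results table.
    num_epochs: int = 4


def steps_per_epoch(num_examples: int, config: FineTuneConfig) -> int:
    """Number of optimizer steps per epoch, assuming one step per batch."""
    return math.ceil(num_examples / config.batch_size)


config = FineTuneConfig()
# Assumption: all 3,200,000 comments are used for training (upper bound).
upper_bound_steps = steps_per_epoch(3_200_000, config)
print(upper_bound_steps)  # 100000
```
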
<br />

### How to use

```python
import re

from transformers import RobertaForSequenceClassification, RobertaTokenizer
from transformers import pipeline
import pandas as pd
import emoji


# the model was trained on text preprocessed as below
def process_text(texts):
    # remove URLs
    texts = re.sub(r'https?://\S+', "", texts)
    texts = re.sub(r'www.\S+', "", texts)
    # decode the HTML entity &#39; to a plain apostrophe
    texts = texts.replace('&#39;', "'")
    # rewrite #tag as hashtag_tag and $TICKER as cashtag_TICKER
    texts = re.sub(r'(\#)(\S+)', r'hashtag_\2', texts)
    texts = re.sub(r'(\$)([A-Za-z]+)', r'cashtag_\2', texts)
    # rewrite @username as mention_username
    texts = re.sub(r'(\@)(\S+)', r'mention_\2', texts)
    # replace emoji with their text names
    texts = emoji.demojize(texts, delimiters=("", " "))

    return texts.strip()


tokenizer_loaded = RobertaTokenizer.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')
model_loaded = RobertaForSequenceClassification.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')

nlp = pipeline("text-classification", model=model_loaded, tokenizer=tokenizer_loaded)

sentences = pd.Series(['just buy', 'just sell it',
                       'entity rocket to the sky!',
                       'go down', 'even though it is going up, I still think it will not keep this trend in the near future'])
# If the input text contains URLs or @, # or $ symbols, apply the preprocessing for more accurate results:
# sentences = list(sentences.apply(process_text))
sentences = list(sentences)
results = nlp(sentences)
print(results)  # 2 labels, label 0 is bearish, label 1 is bullish
```
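
To illustrate what the preprocessing does to a raw post, here is a standard-library-only sketch of the regex substitutions from `process_text`. It leaves out the apostrophe replacement and the `emoji.demojize` step so that it runs without extra dependencies; `rewrite_tags` and the sample post are made up for illustration.

```python
import re


def rewrite_tags(text):
    """Apply only the regex substitutions of process_text (no emoji step)."""
    text = re.sub(r'https?://\S+', "", text)  # drop http(s) URLs
    text = re.sub(r'www.\S+', "", text)  # drop www-style URLs
    text = re.sub(r'(\#)(\S+)', r'hashtag_\2', text)  # #tag -> hashtag_tag
    text = re.sub(r'(\$)([A-Za-z]+)', r'cashtag_\2', text)  # $XYZ -> cashtag_XYZ
    text = re.sub(r'(\@)(\S+)', r'mention_\2', text)  # @user -> mention_user
    return text.strip()


# Made-up sample post.
sample = "$AAPL to the moon #bullish @bob https://example.com/chart"
cleaned = rewrite_tags(sample)
print(cleaned)  # cashtag_AAPL to the moon hashtag_bullish mention_bob
```

The patterns are copied unchanged from `process_text`, since the card states that the model was trained on text preprocessed this way.
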