---
pipeline_tag: fill-mask
widget:
- text: "đậu xanh rau <mask>"
---

# <a name="introduction"></a> ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing (EMNLP 2023 - Main)

**Disclaimer**: The paper contains actual comments on social networks that might be construed as abusive, offensive, or obscene.

ViSoBERT is a state-of-the-art pre-trained language model for Vietnamese social media tasks:

- ViSoBERT is the first monolingual masked language model (MLM), based on the XLM-R architecture, pre-trained from scratch specifically on Vietnamese social media text.

- ViSoBERT outperforms previous monolingual, multilingual, and multilingual social media approaches, setting new state-of-the-art results on four downstream Vietnamese social media tasks.

The general architecture and experimental results of ViSoBERT can be found in our [paper](https://openreview.net/forum?id=gqkg54QNDY):

    @inproceedings{
    anonymous2023plmvismt,
    title={{PLM}4Vi{SMT}: A Pre-Trained Language Model for Vietnamese Social Media Text Processing},
    author={Anonymous},
    booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
    year={2023},
    url={https://openreview.net/forum?id=gqkg54QNDY}
    }

**Please CITE** our paper when ViSoBERT is used to help produce published results or is incorporated into other software.

**Installation** Install `transformers` and `sentencepiece` with pip: `pip install transformers sentencepiece`. The usage example also requires PyTorch (`torch`).
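
To confirm that the environment is ready before loading the model, a small check like the following may help. It is only a sketch: `missing_packages` and `REQUIRED` are hypothetical names introduced here, and the package list assumes the usage example's imports (`transformers`, `torch`) plus `sentencepiece` for the tokenizer.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names assumed to be needed: transformers and torch are imported by the
# usage example, sentencepiece is listed in the installation step.
REQUIRED = ("transformers", "sentencepiece", "torch")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level packages from `names` that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Any name reported as missing can be installed with `pip install <name>`.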

**Example usage**

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load the pre-trained ViSoBERT encoder and its tokenizer from the Hugging Face Hub.
    model = AutoModel.from_pretrained('uitnlp/visobert')
    tokenizer = AutoTokenizer.from_pretrained('uitnlp/visobert')

    # Tokenize one sentence and return PyTorch tensors.
    encoding = tokenizer('dau xanh rau ma', return_tensors='pt')

    # Inference only, so disable gradient tracking.
    with torch.no_grad():
        output = model(**encoding)
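
Because ViSoBERT is a masked language model, it can also be queried for masked-token predictions. The sketch below assumes the checkpoint loads through the `transformers` `fill-mask` pipeline and uses `<mask>` as its mask token, as in the widget text of this model card. `fill_mask` and `summarize` are hypothetical helper names introduced for illustration; they are not part of ViSoBERT or `transformers`.

```python
from typing import Dict, List, Tuple


def fill_mask(text: str, top_k: int = 5,
              model_name: str = "uitnlp/visobert") -> List[Dict]:
    """Run the fill-mask pipeline on `text`, which must contain one mask token."""
    # Imported lazily so that `summarize` can be used without transformers installed.
    from transformers import pipeline

    # Assumption: the checkpoint is compatible with the generic fill-mask pipeline.
    unmasker = pipeline("fill-mask", model=model_name)
    return unmasker(text, top_k=top_k)


def summarize(predictions: List[Dict]) -> List[Tuple[str, float]]:
    """Reduce pipeline output to (token, score) pairs, highest score first."""
    pairs = [
        (p["token_str"].strip(), round(float(p["score"]), 4))
        for p in predictions
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Calling `summarize(fill_mask("đậu xanh rau <mask>"))` downloads the checkpoint on first use and returns the candidate tokens with their scores; the actual predictions depend on the released weights.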