---
pipeline_tag: fill-mask
widget:
- text: "đậu xanh rau <mask>"
---
# <a name="introduction"></a> ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing (EMNLP 2023 - Main)

**Disclaimer**: The paper contains actual comments from social networks that might be construed as abusive, offensive, or obscene.

ViSoBERT is the state-of-the-art language model for Vietnamese social media tasks:

- ViSoBERT is the first monolingual masked language model (MLM, XLM-R architecture) pre-trained from scratch specifically on Vietnamese social media text.
- ViSoBERT outperforms previous monolingual, multilingual, and multilingual social media approaches, obtaining new state-of-the-art performance on four downstream Vietnamese social media tasks.

The general architecture and experimental results of ViSoBERT can be found in our [paper](https://arxiv.org/abs/2310.11166):

    @misc{nguyen2023visobert,
          title={ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing},
          author={Quoc-Nam Nguyen and Thang Chau Phan and Duc-Vu Nguyen and Kiet Van Nguyen},
          year={2023},
          eprint={2310.11166},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

**Please CITE** our paper when ViSoBERT is used to help produce published results or is incorporated into other software.

**Installation**

Install the `transformers` and `SentencePiece` packages (the usage example also imports `torch`, so PyTorch must be installed as well):

    pip install transformers
    pip install SentencePiece
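
**Example usage**

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the pre-trained encoder and its tokenizer from the Hugging Face Hub.
model = AutoModel.from_pretrained('uitnlp/visobert')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/visobert')

# Tokenize one sentence and return PyTorch tensors.
encoding = tokenizer('dau xanh rau ma', return_tensors='pt')

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    output = model(**encoding)

# output.last_hidden_state holds one contextual vector per input token.
```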
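
The model card metadata tags this checkpoint for `fill-mask`, so masked-token prediction can likely be run through the `transformers` `pipeline` API. The following is a minimal sketch rather than the authors' own code: the helper names `load_fill_mask`, `top_mask_predictions`, and `main` are hypothetical, and it assumes that the published checkpoint ships with its masked-language-modeling head and uses `<mask>` as the mask token, as in the widget example.

```python
def load_fill_mask(model_name="uitnlp/visobert"):
    """Build a fill-mask pipeline (downloads the weights on first use)."""
    # Imported lazily so the helper below works without loading the model.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_name)


def top_mask_predictions(fill_mask, text, top_k=5):
    """Return (token, score) pairs for the single masked position in `text`.

    `fill_mask` is any callable with the fill-mask pipeline interface:
    it accepts the text and `top_k`, and returns a list of dicts that
    contain at least the keys "token_str" and "score".
    """
    predictions = fill_mask(text, top_k=top_k)
    # Strip the whitespace that subword tokenizers may attach to a token.
    return [(p["token_str"].strip(), float(p["score"])) for p in predictions]


def main():
    # Assumption: the checkpoint exposes an MLM head usable by the pipeline.
    fill_mask = load_fill_mask()
    # Same input as the widget example in the model card metadata.
    for token, score in top_mask_predictions(fill_mask, "đậu xanh rau <mask>"):
        print(f"{token}\t{score:.4f}")
```

Calling `main()` prints the highest-scoring completions for the masked position. The pipeline is passed in as an argument so that it is loaded once and can be reused across many inputs.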