---
pipeline_tag: fill-mask
widget:
- text: "hào quang rực <mask>"
---

# <a name="introduction"></a> ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing (EMNLP 2023 - Main)

**Disclaimer**: The paper contains actual comments on social networks that might be construed as abusive, offensive, or obscene.

ViSoBERT is the state-of-the-art language model for Vietnamese social media tasks:

- ViSoBERT is the first monolingual MLM ([XLM-R](https://github.com/facebookresearch/XLM#xlm-r-new-model) architecture) built specifically for Vietnamese social media texts.
- ViSoBERT outperforms previous monolingual, multilingual, and multilingual social media approaches, achieving new state-of-the-art performance on four downstream Vietnamese social media tasks.

The general architecture and experimental results of ViSoBERT can be found in our [paper](https://arxiv.org/abs/2310.11166):

```bibtex
@misc{nguyen2023visobert,
      title={ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing},
      author={Quoc-Nam Nguyen and Thang Chau Phan and Duc-Vu Nguyen and Kiet Van Nguyen},
      year={2023},
      eprint={2310.11166},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

The pretraining dataset of our paper is available at: [Pretraining dataset](https://drive.google.com/drive/folders/1C144LOlkbH78m0-JoMckpRXubV7XT7Kb)

**Please CITE** our paper when ViSoBERT is used to help produce published results or is incorporated into other software.

**Installation**

Install the `transformers` and `sentencepiece` packages:

```bash
pip install transformers
pip install sentencepiece
```

**Example usage**

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pretrained ViSoBERT model and its tokenizer
model = AutoModel.from_pretrained('uitnlp/visobert')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/visobert')

# Encode a Vietnamese social media phrase
encoding = tokenizer('hào quang rực rỡ', return_tensors='pt')

# Forward pass without gradients; output.last_hidden_state
# holds the contextual embedding of each token
with torch.no_grad():
    output = model(**encoding)
```
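
Because ViSoBERT is pretrained with masked language modeling (the `fill-mask` widget above uses the same `<mask>` token), it can also be queried directly for masked-token predictions. Below is a minimal sketch using the Hugging Face `fill-mask` pipeline; the example sentence comes from the widget, and the exact predictions and scores will depend on the model:

```python
from transformers import pipeline

# Masked-token prediction with ViSoBERT
fill_mask = pipeline('fill-mask', model='uitnlp/visobert')

# Ask the model to complete the widget example "hào quang rực <mask>"
for prediction in fill_mask('hào quang rực <mask>'):
    print(prediction['token_str'], round(prediction['score'], 4))
```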