---
pipeline_tag: fill-mask
widget:
- text: "hào quang rực <mask>"
---

# <a name="introduction"></a> ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing (EMNLP 2023 - Main)

**Disclaimer**: The paper contains actual comments on social networks that might be construed as abusive, offensive, or obscene.

ViSoBERT is the state-of-the-art language model for Vietnamese social media tasks:

- ViSoBERT is the first monolingual MLM ([XLM-R](https://github.com/facebookresearch/XLM#xlm-r-new-model) architecture) built specifically for Vietnamese social media texts.
- ViSoBERT outperforms previous monolingual, multilingual, and multilingual social media approaches, achieving new state-of-the-art performance on four downstream Vietnamese social media tasks.

The general architecture and experimental results of ViSoBERT can be found in our [paper](https://arxiv.org/abs/2310.11166):

```bibtex
@misc{nguyen2023visobert,
      title={ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing},
      author={Quoc-Nam Nguyen and Thang Chau Phan and Duc-Vu Nguyen and Kiet Van Nguyen},
      year={2023},
      eprint={2310.11166},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

The pretraining dataset of our paper is available at: [Pretraining dataset](https://drive.google.com/drive/folders/1C144LOlkbH78m0-JoMckpRXubV7XT7Kb)

**Please CITE** our paper when ViSoBERT is used to help produce published results or is incorporated into other software.

**Installation**

Install the `transformers` and `sentencepiece` packages:

```bash
pip install transformers
pip install sentencepiece
```

**Example usage**

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pretrained ViSoBERT model and its tokenizer
model = AutoModel.from_pretrained('uitnlp/visobert')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/visobert')

# Encode a Vietnamese social media phrase
encoding = tokenizer('hào quang rực rỡ', return_tensors='pt')

# Forward pass without gradients; output.last_hidden_state
# holds the contextual embedding of each token
with torch.no_grad():
    output = model(**encoding)
```
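
Because ViSoBERT is pretrained with masked language modeling (the `fill-mask` widget above uses the same `<mask>` token), it can also be queried directly for masked-token predictions. Below is a minimal sketch using the Hugging Face `fill-mask` pipeline; the example sentence comes from the widget, and the exact predictions and scores will depend on the model:

```python
from transformers import pipeline

# Masked-token prediction with ViSoBERT
fill_mask = pipeline('fill-mask', model='uitnlp/visobert')

# Ask the model to complete the widget example "hào quang rực <mask>"
for prediction in fill_mask('hào quang rực <mask>'):
    print(prediction['token_str'], round(prediction['score'], 4))
```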