widget:
  - text: "Cà phê được trồng nhiều ở khu vực Tây <mask> của Việt Nam."
    example_title: "Example 2"
---

# <a name="introduction"></a> CafeBERT: A Pre-Trained Language Model for Vietnamese (NAACL-2024 Findings)

The pre-trained CafeBERT model is the state-of-the-art language model for Vietnamese *(cafe, or coffee, is a popular morning drink in Vietnam)*.

CafeBERT is a large-scale multilingual language model with strong support for Vietnamese. It is based on XLM-RoBERTa, the state-of-the-art multilingual language model, and enhanced with additional pre-training on a large Vietnamese corpus spanning many domains, including Wikipedia and newspapers. CafeBERT achieves outstanding performance on the VLUE benchmark and on downstream tasks such as machine reading comprehension, text classification, natural language inference, and part-of-speech tagging.

The general architecture and experimental results of CafeBERT can be found in our paper.

Please **CITE** our paper when CafeBERT is used to help produce published results or is incorporated into other software.

**Installation**

Install the `transformers` and `sentencepiece` packages:

```shell
pip install transformers
pip install sentencepiece
```

**Example usage**

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained('uitnlp/CafeBERT')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/CafeBERT')

encoding = tokenizer('Cà phê được trồng nhiều ở khu vực Tây Nguyên của Việt Nam.', return_tensors='pt')

with torch.no_grad():
    output = model(**encoding)
```
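
The example above stops at the raw encoder output. As a follow-up, here is a minimal, self-contained sketch of mean-pooling that output into a single sentence vector. The tensors below are random stand-ins for `output.last_hidden_state` and `encoding['attention_mask']` (the hidden size of 1024 assumes an XLM-R-large backbone); mean pooling is a common pooling choice, not one prescribed by the CafeBERT paper.

```python
import torch

# Stand-ins for the real model outputs: batch of 1 sentence, 14 tokens.
last_hidden_state = torch.randn(1, 14, 1024)  # like output.last_hidden_state
attention_mask = torch.ones(1, 14)            # like encoding['attention_mask']

# Zero out padding positions, then average over the token dimension.
mask = attention_mask.unsqueeze(-1)                                   # (1, 14, 1)
sentence_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(sentence_embedding.shape)  # torch.Size([1, 1024])
```

With a real batch, padded positions have `attention_mask == 0`, so dividing by `mask.sum(dim=1)` averages only over actual tokens.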