Extend vocabulary and Pretrain
We used SentencePiece to train a new tokenizer on Vietnamese, English, and Chinese text. Its vocabulary was then merged with Flan-T5's original vocabulary, with duplicate tokens removed. The merged vocabulary contains 106,611 tokens.
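The merge step can be illustrated with a minimal sketch. It works on plain token lists and is not the exact procedure used for this model, which is not described here; `merge_vocab` and the sample tokens are hypothetical names chosen for illustration.

```python
def merge_vocab(base_tokens, new_tokens):
    """Append tokens from `new_tokens` that `base_tokens` lacks.

    Base tokens keep their positions, so their IDs (and the pretrained
    embeddings tied to those IDs) stay valid.
    """
    seen = set(base_tokens)
    merged = list(base_tokens)
    for token in new_tokens:
        if token not in seen:
            seen.add(token)
            merged.append(token)
    return merged

# Toy stand-ins (assumptions), not real vocabulary entries from either tokenizer.
base = ["▁the", "▁model", "s"]
new = ["▁model", "▁Việt", "▁Nam", "s"]
merged = merge_vocab(base, new)
print(len(merged), merged)
```

Keeping the original tokens first means only the newly appended entries need fresh embeddings. With the transformers library, the embedding matrix is usually grown to the new vocabulary size with `model.resize_token_embeddings(len(tokenizer))`; whether that call was used for this model is an assumption.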
We then ran one epoch of continual pretraining (also called incremental pretraining) on the Flan-T5-Large model. The pretraining corpus exceeds 100 GB and draws on the following sources:
- NewsCorpus
- Vietnamese Wikipedia
- Vietnamese books
- Vietnamese legal documents
- Vietnamese legal text
- English Wikipedia
- Chinese text
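The pretraining objective is not stated here. T5-style models are commonly pretrained with span corruption, so the following toy sketch shows how one training pair is formed under that assumption. `span_corrupt` is a hypothetical helper, the spans are picked by hand rather than sampled, and the `<extra_id_N>` sentinel names follow the standard T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; build the matching target.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # the target names the sentinel...
        targets.extend(tokens[start:end])    # ...then gives the masked tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    # T5 closes the target with one final sentinel.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets

tokens = "Hà Nội là thủ đô của Việt Nam".split()
inputs, targets = span_corrupt(tokens, [(2, 3), (5, 7)])
print(" ".join(inputs))
print(" ".join(targets))
```

In real pretraining the spans are sampled at random over subword IDs rather than whitespace-split words; the word-level version above only makes the input/target layout easy to read.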
How to use
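import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Download the tokenizer (with the extended vocabulary) and the model weights.
tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
# Move the model to the GPU when one is available; otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)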
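Once loaded, the model can be called for generation. The helper below is a minimal sketch, assuming the `tokenizer` and `model` objects created above; `generate_text` and the sample prompt are hypothetical, and because this checkpoint is only continually pretrained, output quality without finetuning is not guaranteed.

```python
def generate_text(prompt, tokenizer, model, max_new_tokens=64):
    """Run one prompt through a seq2seq model and return the decoded text."""
    # Tokenize to PyTorch tensors and place them on the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # generate() returns a batch of token-id sequences; decode the first one.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example call, using the `tokenizer` and `model` loaded above:
# print(generate_text("Hà Nội là thủ đô của", tokenizer, model))
```

Reading the device from `model.device` keeps the inputs and the weights together whether the model sits on the CPU or a GPU.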
Finetune and Benchmark
- Wikilingua
- Vietnews
- Pho_NER
- .....
Citation
- Hatto
- Ipcoms