|
--- |
|
library_name: transformers |
|
tags: |
|
- llama-factory |
|
license: apache-2.0 |
|
datasets: |
|
- wikimedia/wikipedia |
|
- tinhpx2911/vanhoc_processed |
|
- Intel/orca_dpo_pairs |
|
- ura-hcmut/orca_dpo_pairs |
|
- ura-hcmut/PhoMT-dpo |
|
- ura-hcmut/OPUS100-dpo |
|
- ura-hcmut/vietnews-dpo |
|
- ura-hcmut/wiki_lingua-dpo |
|
- ura-hcmut/VSEC-dpo |
|
- ura-hcmut/10vancauhoi |
|
- ura-hcmut/zalo_e2eqa-dpo |
|
language: |
|
- vi |
|
- en |
|
extra_gated_prompt: >- |
|
Please read the Apache 2 license before accepting it. |
|
extra_gated_fields: |
|
Name: text |
|
Email: text |
|
Affiliation: text |
|
Country: text |
|
I accept the Apache 2 License Agreement: checkbox |
|
--- |
|
|
|
# MixSUra-SFT |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
With a strong commitment to enhancing the quality of large language models for Vietnamese, researchers from Ho Chi Minh City University of Technology (HCMUT) - Vietnam National University HCMC and Stanford University undertook this collaborative effort.

We carefully fine-tuned various models on Vietnamese articles sourced from Wikipedia and, where necessary, other sources. In line with our dedication to fostering community progress, we offer our models free of charge for research purposes.

For further details of our research, please explore the information provided below.
|
|
|
- **Developed by:** |
|
- Duc Q. Nguyen |
|
- Sang T. Truong |
|
- Toan D. V. Nguyen |
|
- Dong D. Le |
|
- Nhi N. Truong |
|
- Tho Quan |
|
- Sanmi Koyejo |
|
- **Funded by:** |
|
- Microsoft Accelerating Foundation Models Research program |
|
- Stanford University |
|
- Ho Chi Minh City University of Technology (HCMUT) - VNU-HCM
|
- DsciLab (Faculty of Computer Science & Engineering, HCMUT - VNU-HCM) |
|
- **Model type:** Text generation |
|
- **Languages:** Vietnamese, English |
|
- **License:** Apache 2.0 |
|
- **Finetuned from model:** Mixtral 8x7B |
|
|
|
### Model Sources |
|
|
|
We publicly provide starter source code for fine-tuning, evaluating, and deploying our models.
|
|
|
- **Framework:** [ViLLM](https://github.com/stair-lab/villm) |
|
- **Paper:** Our paper was accepted at NAACL 2024. [Link](https://arxiv.org/abs/2403.02715) |
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
### Direct Use |
|
|
|
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
|
|
|
You can use our models to perform a variety of tasks, including the following (a prompt sketch follows the list):
|
|
|
* Question answering (with context) |
|
* Summarization |
|
* Language modelling |
|
* Text classification |
|
* Translation |
|
* Code generation |
|
* Reasoning |
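
As a rough illustration, each of these tasks can be phrased as a Vietnamese instruction inside the chat template used in the getting-started example further down this card. The prompt wordings below are our own illustrative suggestions, not prescribed formats:

```python
# Illustrative task instructions (our own wording, not prescribed formats).
# Each one is inserted into the [INST] template from the
# getting-started example further down this card.
template = (
    "<s> [INST] Bạn là một trợ lý thông minh. Hãy thực hiện các yêu cầu "
    "hoặc trả lời câu hỏi từ người dùng bằng tiếng Việt.\n {query}[/INST] "
)

task_instructions = {
    # "Based on the following passage, answer the question: ..."
    "question_answering": "Dựa vào đoạn văn sau, hãy trả lời câu hỏi: ...",
    # "Summarize the following passage: ..."
    "summarization": "Tóm tắt đoạn văn sau: ...",
    # "Translate the following sentence into English: ..."
    "translation": "Dịch câu sau sang tiếng Anh: ...",
}

prompt = template.format(query=task_instructions["summarization"])
```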
|
|
|
### Downstream Use |
|
|
|
This model can serve as an encoder for a wide range of downstream tasks, spanning from pure natural language processing to combinations of natural language processing with computer vision or speech processing. |
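
For example, here is a minimal sketch of using the model as a feature extractor, relying on the standard `output_hidden_states=True` option in transformers (loading with `device_map="auto"` assumes `accelerate` is installed and sufficient memory is available):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ura-hcmut/MixSUra-SFT")
model = AutoModelForCausalLM.from_pretrained(
    "ura-hcmut/MixSUra-SFT", device_map="auto"
)
model.eval()

inputs = tokenizer("Xin chào Việt Nam!", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last hidden layer into one feature vector per sequence,
# which downstream classifiers or multimodal heads can consume.
features = outputs.hidden_states[-1].mean(dim=1)
```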
|
|
|
### Out-of-Scope Use |
|
|
|
While our models have been fine-tuned on extensive Vietnamese datasets, they may not perform optimally in specialized domains requiring deep expertise, such as medicine, politics, or chemistry. We kindly request that you refrain from employing our models for political purposes or any endeavors that may cause harm to individuals or compromise the sovereignty and territorial integrity of Vietnam.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Unless required by applicable law, the MixSUra materials and any output and results therefrom are provided on an "as is" basis, without warranties of any kind, either express or implied, including, without limitation, any warranties of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the MixSUra materials and assume any risks associated with your use of the MixSUra materials and any output and results.
|
|
|
### Recommendations |
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. For the model to work well, you may need to perform prompt engineering to craft appropriate prompts before inference, as sketched below.
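
As one example of such prompt engineering (the few-shot wording below is our own, not a prescribed format), prepending a couple of worked examples to the instruction often steers the model toward the desired output format:

```python
# Hypothetical few-shot prompt; the worked examples are illustrative only.
template = (
    "<s> [INST] Bạn là một trợ lý thông minh. Hãy thực hiện các yêu cầu "
    "hoặc trả lời câu hỏi từ người dùng bằng tiếng Việt.\n {query}[/INST] "
)

few_shot = (
    # "Classify the sentiment of the following sentence (positive/negative)."
    "Phân loại cảm xúc của câu sau (tích cực/tiêu cực).\n"
    "Câu: Món ăn này rất ngon. -> tích cực\n"  # "This dish is delicious. -> positive"
    "Câu: Dịch vụ quá tệ. -> tiêu cực\n"       # "The service is terrible. -> negative"
    "Câu: {text} ->"
)

prompt = template.format(
    query=few_shot.format(text="Bộ phim này thật tuyệt vời.")  # "This movie is wonderful."
)
```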
|
|
|
## How to Get Started with the Model |
|
|
|
If you intend to use Ollama, please check this [repo](https://ollama.com/nqduc/mixsura-sft). |
|
Use the code below to get started with the model. |
|
|
|
```python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generation settings: top_k takes an integer number of candidate tokens,
# top_p a probability mass in (0, 1], and do_sample must be enabled for
# temperature/top_k/top_p to take effect.
pipeline_kwargs = {
    "temperature": 0.5,
    "max_new_tokens": 8192,
    "top_k": 3,
    "top_p": 0.99,
    "do_sample": True,
}

if __name__ == "__main__":
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        "ura-hcmut/MixSUra-SFT",
        device_map="auto"
    )
    model.eval()

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        "ura-hcmut/MixSUra-SFT",
        trust_remote_code=True
    )

    pipeline = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        return_full_text=False,
        task='text-generation',
        **pipeline_kwargs
    )

    # Instruction template. The Vietnamese system prompt reads: "You are an
    # intelligent assistant. Follow the user's requests or answer their
    # questions in Vietnamese."
    query_template = "<s> [INST] Bạn là một trợ lý thông minh. Hãy thực hiện các yêu cầu hoặc trả lời câu hỏi từ người dùng bằng tiếng Việt.\n {query}[/INST] "

    # Simple interactive loop; type "exit" to quit.
    while True:
        query = input("Query: ")
        if query == "exit":
            break

        query = query_template.format(query=query)
        answer = pipeline(query)[0]["generated_text"]
        print(answer)
```
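
MixSUra is built on Mixtral 8x7B, so loading it in half precision takes on the order of 90 GB of GPU memory. If resources are limited, one option is the 4-bit quantization integration in transformers. The following is a sketch assuming the `bitsandbytes` package is installed, not a configuration we have specifically validated:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Store weights in 4-bit, run computation in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ura-hcmut/MixSUra-SFT",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ura-hcmut/MixSUra-SFT")
```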
|
|
|
## Finetuning Details |
|
|
|
### Finetuning Data |
|
|
|
List of datasets used for finetuning: |
|
- Pretraining: |
|
* Vietnamese Wikipedia: [https://huggingface.co/datasets/vietgpt/wikipedia_vi](https://huggingface.co/datasets/vietgpt/wikipedia_vi) |
|
* Vanhoc: [https://huggingface.co/datasets/tinhpx2911/vanhoc_processed](https://huggingface.co/datasets/tinhpx2911/vanhoc_processed) |
|
- Supervised finetuning: |
|
* orca_dpo_pairs: [https://huggingface.co/datasets/Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs) |
|
* Our datasets: listed under the Datasets tab of [https://huggingface.co/ura-hcmut](https://huggingface.co/ura-hcmut)
|
|
|
### Finetuning Procedure |
|
|
|
We finetune our models with the causal language modelling (next-token prediction) objective. A tutorial is available at [https://huggingface.co/docs/transformers/tasks/language_modeling](https://huggingface.co/docs/transformers/tasks/language_modeling).
|
|
|
Our framework is available at: [https://github.com/martinakaduc/SUra-Factory](https://github.com/martinakaduc/SUra-Factory) |
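
As a minimal sketch of the objective (following the linked tutorial rather than our exact SUra-Factory pipeline, and setting aside the memory needed to load the full base model), causal language modelling sets the labels equal to the input ids, and the model is trained to predict each token from its left context:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mixtral-8x7B-v0.1"  # base model; loading it needs substantial memory
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

text = "Hà Nội là thủ đô của Việt Nam."  # "Hanoi is the capital of Vietnam."
batch = tokenizer(text, return_tensors="pt").to(model.device)

# For causal LM, labels are the input ids themselves; the model shifts them
# internally so that position t predicts token t+1.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)  # cross-entropy over next-token predictions
```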
|
|
|
#### Finetuning Hyperparameters |
|
|
|
- **Training regime:** BFloat16 Mixed Precision |
|
- **LoRA rank:** 256
|
- **Batch size:** 2048 |
|
- **Optimizer:** AdamW |
|
- **Learning rate:** 1e-4 |
|
- **Epochs:** 2 |
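
For reference, a hypothetical mapping of these hyperparameters onto a `peft` LoRA configuration and transformers `TrainingArguments` might look as follows. The target modules and LoRA alpha are assumptions on our part, and the effective batch size of 2048 is reached via per-device batch size × gradient accumulation × number of GPUs:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

lora_config = LoraConfig(
    r=256,                # LoRA rank from the list above
    lora_alpha=512,       # assumption: alpha = 2 * rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

# Loading the full base model requires substantial memory (see above).
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="mixsura-sft",
    bf16=True,                        # bfloat16 mixed precision
    learning_rate=1e-4,
    num_train_epochs=2,
    optim="adamw_torch",
    per_device_train_batch_size=4,    # assumption
    gradient_accumulation_steps=128,  # 4 GPUs x 4 x 128 = 2048 effective
)
```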
|
|
|
## Evaluation |
|
|
|
Our models are tested on a variety of tasks. Details of the evaluation process can be found at our [Leaderboard](https://ai.stanford.edu/~sttruong/villm).
|
|
|
|
|
## Environmental Impact |
|
|
|
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly --> |
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** 4 x A100 40GB |
|
- **Hours used:** 1,850
|
- **Carbon Emitted:** ~200 kg CO2 eq. |
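
For transparency, here is a back-of-envelope sketch of how a figure of this magnitude can be reproduced. All three constants below are assumptions on our part (in particular, we assume the 1,850 hours count total GPU-hours), not measurements:

```python
gpu_hours = 1850          # reported above; assumed to be total GPU-hours
power_kw = 0.4            # assumption: ~400 W average draw per A100 40GB
carbon_kg_per_kwh = 0.27  # assumption: grid carbon intensity

energy_kwh = gpu_hours * power_kw              # 740 kWh
emissions_kg = energy_kwh * carbon_kg_per_kwh  # ~200 kg CO2 eq.
print(f"~{emissions_kg:.0f} kg CO2 eq.")
```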
|
|
|
## Citation |
|
|
|
If you use MixSUra materials in your research, please cite our model(s) as follows.
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@inproceedings{crossing2024,
    title = "Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models",
    author = "Truong, Sang T. and Nguyen, Duc Q. and Nguyen, Toan D. V. and Le, Dong D. and Truong, Nhi N. and Quan, Tho and Koyejo, Sanmi",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "",
}
```
|
|
|
## Model Card Authors |
|
|
|
## Contact |
|
|
|
* Mr. Duc Q. Nguyen: [email protected] |
|
* Mr. Sang T. Truong: [email protected] |
|
* Assoc. Prof. Tho Quan: [email protected] |