|
--- |
|
pipeline_tag: text-generation |
|
inference: true |
|
widget: |
|
- text: 'def print_hello_world():' |
|
example_title: Hello world |
|
group: Python |
|
license: bigscience-openrail-m |
|
datasets: |
|
- books |
|
- arxiv |
|
- c4 |
|
- falcon-refinedweb |
|
- wiki |
|
- github-issues |
|
- stack_markdown |
|
library_name: transformers |
|
tags: |
|
- code |
|
language: |
|
- en |
|
--- |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/643a9dd0c5f633a7fa7e804a/HkB0QYV0BbmB3ktMugbZy.png) |
|
|
|
|
|
# Refact-1.6B-base |
|
|
|
Finally, the model we started training with our [blog post](https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/) is ready!
|
The model may still have some issues, especially with the FIM format.
|
|
|
|
|
# It Works As a Chat |
|
|
|
The primary application of this model is code completion (infill) in multiple programming languages, but it also works quite well as a chat model.
|
|
|
|
|
# Example |
|
|
|
Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output: |
|
|
|
```python |
|
# pip install -q transformers |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
checkpoint = "smallcloudai/Refact-1_6B-fim" |
|
device = "cuda" # for GPU usage or "cpu" for CPU usage |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(checkpoint) |
|
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device) |
|
|
|
prompt = '<fim_prefix>def print_hello_world():\n """<fim_suffix>\n print("Hello world!")<fim_middle>' |
|
|
|
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device) |
|
outputs = model.generate(inputs, max_length=100, do_sample=True, temperature=0.2)  # do_sample=True so temperature takes effect
|
print("-"*80) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
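The decoded output echoes the whole prompt followed by the generated middle. Purely as an illustrative sketch (assuming the prompt format above and the tokenizer's standard `eos_token` attribute), you can recover just the completion like this:

```python
# Illustrative post-processing: keep only the text generated after the
# <fim_middle> sentinel and strip the end-of-sequence token if present.
completion = tokenizer.decode(outputs[0])
middle = completion.split("<fim_middle>")[-1]
if tokenizer.eos_token:
    middle = middle.replace(tokenizer.eos_token, "")
print(middle)
```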
|
|
|
# Chat Format |
|
|
|
The same model also works as a chat model (experimental).
|
|
|
```python |
|
prompt_template = ("<empty_output>SYSTEM {system}\n"
                   "<empty_output>USER {query}\n"
                   "<empty_output>ASSISTANT")
|
prompt = prompt_template.format(system="You are a programming assistant", |
|
query="How do I sort a list in Python?") |
|
``` |
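A minimal sketch of generating a reply, reusing the tokenizer and model loaded in the fill-in-the-middle example above (the generation settings here are illustrative, not recommended defaults):

```python
# Run the chat prompt through the same model as in the FIM example.
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=200, do_sample=True, temperature=0.2)
# The decoded output echoes the prompt; the assistant reply follows "ASSISTANT".
print(tokenizer.decode(outputs[0]))
```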
|
|
|
# Architecture |
|
|
|
As described in more detail in the blog post, we used: |
|
|
|
- [ALiBi](https://arxiv.org/abs/2108.12409) based attention |
|
- [LayerNorm](https://arxiv.org/abs/1607.06450v1) instead of [RMSNorm](https://arxiv.org/pdf/1910.07467.pdf) |
|
- [Multi Query Attention](https://arxiv.org/abs/1911.02150) |
|
|
|
We also used LiON, flash attention, and early dropout. None of this prevents you from running the model with standard tooling -- see the examples above.
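Purely as an illustration of the ALiBi idea referenced above (a toy sketch, not this model's actual implementation; the head count and context length below are example values):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Per-head slopes form a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    # (this simple formula assumes num_heads is a power of two).
    start = 2 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(num_heads)])
    # Relative offsets (key position - query position); with causal masking only
    # j <= i is kept, so the bias is zero on the diagonal and increasingly
    # negative for older keys. It is added to attention logits before softmax.
    positions = torch.arange(seq_len)
    offsets = positions[None, :] - positions[:, None]    # (seq, seq)
    return slopes[:, None, None] * offsets[None, :, :]   # (heads, seq, seq)

bias = alibi_bias(num_heads=32, seq_len=8)
print(bias.shape)  # torch.Size([32, 8, 8])
```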
|
|
|
|
|
# Training |
|
|
|
For the base model, we used our own dataset, which contains only code with permissive licenses, plus open text datasets.
Filtering is the key to the success of this model:
|
|
|
- We only used text in English |
|
- Only topics related to computer science |
|
- Applied heavy deduplication (a toy illustration is sketched below)
|
|
|
The text-to-code proportion was 50:50, and the model was trained for 1.2T tokens.
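Purely as an illustration of the kind of filtering step involved, a toy exact-deduplication pass might hash normalized text and keep only the first occurrence; the real pipeline is more involved (see the blog post), and the function below is hypothetical:

```python
import hashlib

def dedup_exact(documents):
    """Toy exact deduplication: keep the first occurrence of each document,
    comparing whitespace-normalized content. Real pipelines typically add
    near-duplicate detection on top of this."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```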
|
|
|
We don't release the base model because its Fill-in-the-Middle (FIM) completions tend to repeat themselves too much, which limits its practical use.
If you still want it, send us a message on Discord.
|
|
|
|
|
# Limitations and Bias |
|
|
|
The Refact-1.6B model was trained primarily on English text, although it has seen many more languages in
code comments. Its performance on non-English languages is therefore lower.
|
|
|
|
|
# Model Stats |
|
|
|
- **Architecture:** LLaMA-like model with multi-query attention
|
- **Objectives:** Fill-in-the-Middle, Chat
|
- **Tokens context:** 4096 |
|
- **Pretraining tokens:** 1.2T |
|
- **Finetuning tokens:** 40B |
|
- **Precision:** bfloat16 |
|
- **GPUs:** 64 NVIDIA A5000
|
- **Training time:** 28 days
|
|
|
|
|
# License |
|
|
|
The model is licensed under the BigScience OpenRAIL-M v1 license agreement.
|
|
|
|
|
# Citation |
|
|
|
If you use this model, please link to this page.