---
task_categories:
- text-generation
---
# Description
This language model is version 0.0 of a Gradio Coding Assistant. It is an instruction fine-tuned version of [StarCoder](https://huggingface.co/bigcode/starcoder) designed to assist developers who use [gradio](https://www.gradio.app).

# Dataset
The dataset is multi-source. Its content comes from the following sources:
- The Stack

More precisely, we looked into [the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup), which contains code under permissive licenses. We shortlisted the files whose content incorporates the keyword `gradio` (see the filtering sketch after this list).
- GitHub Issues

We scraped all the issues of the official repository [gradio-app/gradio](https://github.com/gradio-app/gradio) and added them to our training dataset.
- Spaces on Hugging Face Hub

We used the [HuggingFace_Hub API](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api) to scrape the data from the Spaces built with gradio. We kept only those with permissive licenses, namely MIT and Apache 2.0, and this set of code was further deduplicated.
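
As an illustration, the keyword shortlisting over The Stack can be reproduced with the `datasets` library. The snippet below is a minimal sketch, not the exact pipeline we ran: restricting to the Python subset (`data_dir="data/python"`) and using a plain substring check are assumptions made for brevity.

```python
from itertools import islice
from datasets import load_dataset

# Stream the deduplicated Stack so that nothing has to be fully downloaded.
# Restricting to the Python subset is an assumption made for this sketch.
stack = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",
    split="train",
    streaming=True,
)

# Shortlist files whose content mentions the keyword `gradio`.
gradio_files = stack.filter(lambda example: "gradio" in example["content"])

# Peek at the first few shortlisted files.
for example in islice(gradio_files, 3):
    print(example["content"][:80])
```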

# Training setting and hyperparameters
For our fine-tuning, we decided to follow a two-step strategy:
- Pretraining (fine-tuning) with next-token prediction on the previously built gradio dataset (this step familiarizes the model with the gradio syntax).
- Instruction fine-tuning on an instruction dataset (this step makes the model conversational).

For both steps, we made use of parameter-efficient fine-tuning via the [PEFT](https://github.com/huggingface/peft) library, more precisely [LoRA](https://arxiv.org/abs/2106.09685). Our training script is the [StarCoder fine-tuning script](https://github.com/bigcode-project/starcoder).

## Resources
Our training was done on 8 A100 GPUs with 80GB of memory each.

## Pretraining
These are the parameters that we used:
- learning rate: 5e-4
- warmup_steps: 5
- gradient_accumulation_steps: 4
- batch_size: 1
- sequence length: 2048
- max_steps: 1000
- weight_decay: 0.05
- learning rate scheduler: cosine
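
For reference, these settings map onto the Hugging Face `Trainer` API roughly as sketched below. This is not the exact invocation of the fine-tuning script; the output directory is a placeholder, and the 2048-token sequence length is handled at data-packing time rather than through `TrainingArguments`.

```python
from transformers import TrainingArguments

# Sketch of the pretraining hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="starcoder-gradio-pretraining",  # placeholder name
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=5,
    weight_decay=0.05,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
)
```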

**LoRA parameters**:
- r = 16
- alpha = 32
- dropout = 0.05
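
In PEFT, these values translate into a `LoraConfig` roughly as follows. This is a sketch: the `target_modules` list is an assumption based on StarCoder's (GPTBigCode) attention projection names, not a value stated on this card.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Assumption: the usual attention projections of StarCoder (GPTBigCode).
    target_modules=["c_proj", "c_attn", "q_attn"],
)

base_model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```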

We stopped the training early and kept *checkpoint-100* for the second step.

## Fine-tuning
This step consisted of instruction fine-tuning the previous checkpoint. For that purpose, we used a modified version of [openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).
The template for the instruction fine-tuning was `Question: {question}\n\nAnswer: {answer}`. We used exactly the same parameters as during pretraining and kept *checkpoint-50*.
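
For illustration, each training example is rendered with that template before tokenization. The sketch below assumes the preprocessed Guanaco examples expose `question` and `answer` fields; the field names are an assumption, not guaranteed column names of the dataset.

```python
# Minimal sketch of the instruction template used for fine-tuning.
def format_example(question: str, answer: str) -> str:
    # The argument names are placeholders for the preprocessed Guanaco columns.
    return f"Question: {question}\n\nAnswer: {answer}"

print(format_example(
    "How do I add a slider to a gradio interface?",
    "Pass gr.Slider(...) as one of the inputs of gr.Interface.",
))
```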

# Usage
The usage is straightforward and very similar to that of any other instruction fine-tuned model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_name = "ArmelR/starcoder-gradio-v0"
model = AutoModelForCausalLM.from_pretrained(checkpoint_name)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)

prompt = "Create a gradio application that helps to convert a temperature in Celsius into a temperature in Fahrenheit"
inputs = tokenizer(f"Question: {prompt}\n\nAnswer: ", return_tensors="pt")

outputs = model.generate(
  inputs["input_ids"],
  do_sample=True,       # temperature and top_p only take effect when sampling
  temperature=0.2,
  top_p=0.95,
  max_new_tokens=200
)

# Strip the prompt tokens so that only the generated answer is printed.
input_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True))
```
# Updates
- Gradio dataset: `.filter(lambda x: ("gradio" in x["content"] or "gr." in x["content"]) and "streamlit" not in x["content"])`
- Guanaco: `ArmelR/oasst1_guanaco`
- StarCoderbase (950, 1350)
  - max_steps = 2000
  - shuffle_buffer = 100
  - batch_size = 2
  - gradient_accumulation_steps = 4
  - num_warmup_steps = 100
  - weight_decay = 0.01
- StarCoderplus (2000)

- Guanaco multi-turn: `HuggingFaceH4/oasst1_en`
# More information
For further information, refer to [StarCoder](https://huggingface.co/bigcode/starcoder).