---
language:
- en
tags:
- llama
- llm
- fine-tuning
- fill-in-the-middle
- instruction-following
license: apache-2.0
datasets:
- mlabonne/FineTome-100k
- mlfoundations/dclm-baseline-1.0-parquet
- wikimedia/wikipedia
- bigcode/starcoderdata
pipeline_tag: text-generation
---

# Custom LLM with Full Fine-Tuning

## Model Overview

This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version, which used a high-rank adapter, this model employs full fine-tuning for greater learning capacity across a variety of tasks.

- **Developer:** Eric Florenzano
- **Model Type:** Large Language Model (LLM)
- **Language(s):** English, with a focus on Python for code-related tasks
- **License:** Apache-2.0
- **Base Model:** meta-llama/Meta-Llama-3.1-8B

## Unique Training Approach

This model is trained directly on a mixture of high-quality datasets covering general text completion, code completion, and instruction following. Key features include:

- **Full Fine-Tuning:** Unlike the previous LoRA approach, this version updates all model parameters.
- **Diverse Dataset Mixture:** Combines pretraining and instruction datasets for comprehensive language understanding.
- **Multi-Format Instruction Tuning:** Alternates between ChatML and Llama Chat templates for flexible instruction following.
- **Contextual Data Prefixing:** Uses source information to address data imbalance during training (see the sketch after this list).
- **Fill-in-the-Middle (FIM) Training:** Incorporates FIM tasks for enhanced context understanding.
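
As a rough illustration of contextual data prefixing, the sketch below prepends a source tag to each sample before tokenization. The exact prefix format used during training is not documented in this card, so the `format_with_source` helper and its tag string are assumptions for illustration only.

```python
# Minimal sketch of contextual data prefixing (illustrative only).
# The actual prefix format used in training is not specified in this card;
# the tag string below is hypothetical.

def format_with_source(text: str, source: str) -> str:
    """Prepend a source tag so the model can condition on where a sample came from."""
    return f"[source: {source}]\n{text}"

samples = [
    ("def add(a, b):\n    return a + b", "bigcode/starcoderdata"),
    ("Paris is the capital of France.", "wikimedia/wikipedia"),
]

for text, source in samples:
    print(format_with_source(text, source))
```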

## Training Data

The model is trained on a blend of high-quality data sources (a loading sketch follows the list):

- **FineTome-100k:** High-quality instruction-tuning data for general language tasks.
- **dclm-baseline-1.0-parquet:** Apple's pretraining corpus for text completion/prediction.
- **English, Spanish, and French Wikipedia:** For broad language understanding.
- **StarCoder data:** High-quality, Python-focused code dataset for code completion tasks.
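
The sketch below shows one way to stream and mix these sources with the `datasets` library. The dataset configurations, column names, and mixture weights are assumptions for illustration; they are not the exact settings used to train this model.

```python
# Sketch: stream the listed sources and sample from them with illustrative weights.
# Dataset configs, column names ("text" / "content"), and weights are assumptions.
import random
from datasets import load_dataset

sources = {
    "dclm": (load_dataset("mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True), "text", 0.5),
    "wikipedia": (load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True), "text", 0.3),
    "starcoder": (load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True), "content", 0.2),
}

iterators = {name: iter(dataset) for name, (dataset, _, _) in sources.items()}
names = list(sources)
weights = [sources[name][2] for name in names]

def sample_mixture(n: int):
    """Yield n raw-text snippets drawn across sources according to the weights."""
    for _ in range(n):
        name = random.choices(names, weights=weights, k=1)[0]
        column = sources[name][1]
        example = next(iterators[name])
        yield name, example[column]

for source_name, text in sample_mixture(3):
    print(source_name, "->", text.replace("\n", " ")[:80])
```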

## Training Procedure

### Setup

```bash
pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn
```

## Key Features

1. **Full Fine-Tuning:** Updates all model parameters for comprehensive learning.
2. **8-bit AdamW Optimizer:** Uses `adamw_bnb_8bit` for memory-efficient training.
3. **Flash Attention 2:** Implements `flash_attention_2` for faster attention computation.
4. **Gradient Checkpointing:** Enables training with limited GPU memory.
5. **Liger Kernel and Packing:** Uses `use_liger=true` for optimized training kernels and `packing=true` for efficient sequence packing.
6. **BFloat16 Precision:** Uses `bfloat16` for balanced precision and performance.
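
These options map to standard Hugging Face settings. The sketch below shows roughly how the base model and optimizer choices above can be wired up with `transformers`; it is a simplified illustration rather than the actual training script (the Liger kernel and packing flags are passed through TRL, as in the training command further down).

```python
# Sketch: how the features above map to Hugging Face settings
# (illustrative; not the actual training script for this model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "meta-llama/Meta-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # BFloat16 precision
    attn_implementation="flash_attention_2",  # Flash Attention 2
)
model.gradient_checkpointing_enable()         # trade compute for GPU memory

args = TrainingArguments(
    output_dir="continuedtraining3_output",
    optim="adamw_bnb_8bit",                   # 8-bit AdamW from bitsandbytes
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
)
```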

## Advanced Training Techniques

This model incorporates several advanced training techniques to enhance its capabilities:

### 1. Fill-in-the-Middle (FIM) Capability

FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation.

#### Using FIM with the Model

To use the FIM capability, structure your input with special tokens:

- `<|fim_start|>`: Marks the start of the FIM input
- `<|fim_marker|>`: Separates the prefix from the suffix
- `<|fim_gen|>`: Indicates where the generated content should begin
- `<|fim_end|>`: Marks the end of the FIM sequence, emitted after the generated content

Example FIM input:
```
<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>
```

The model generates the missing middle after `<|fim_gen|>`, producing the content that belongs between the prefix and suffix.
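
Assuming the FIM tokens above are part of the model's vocabulary, a minimal generation call might look like the following sketch (the prefix and suffix strings are just examples):

```python
# Sketch: build a FIM prompt from a prefix and suffix and let the model fill the middle.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ericflo/Llama-3.1-8B-ContinuedTraining3-FFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prefix = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
suffix = "\n\nprint(fibonacci(10))\n"
prompt = f"<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Everything generated after the prompt is the infilled middle (up to <|fim_end|>).
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(completion.split("<|fim_end|>")[0])
```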

### 2. Reverse Prediction and Instruction Backtranslation

This technique, also known as instruction backtranslation, enhances the model's context understanding by training it to predict previous parts of a conversation or text.

#### How it works:
1. The model is given a snippet of conversation or text.
2. It is then tasked with predicting what came before this snippet.
3. This process helps the model understand context, conversation flow, and the logical progression of ideas.

#### Benefits:
- Improved context understanding
- Enhanced ability to maintain coherent, contextually appropriate conversations
- Better grasp of cause-and-effect relationships in text

#### Example use case:
Input:
```
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
Task: Predict the previous exchange in this conversation.

Possible model output:
```
Human: What's the capital of France?
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks.
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
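
One plausible way to build such reverse-prediction training pairs is sketched below. The prompt wording and the split point are assumptions for illustration; the exact format used during training is not spelled out in this card.

```python
# Sketch: turn a conversation into a reverse-prediction training pair
# (the prompt wording is hypothetical; only the overall idea comes from this card).

conversation = [
    ("Human", "What's the capital of France?"),
    ("Assistant", "The capital of France is Paris, known as the \"City of Light\"."),
    ("Human", "Thank you for the information about Paris. Can you recommend some popular tourist attractions there?"),
]

def make_reverse_prediction_pair(turns, split_index):
    """Ask the model to reconstruct everything that came before `split_index`."""
    later = "\n".join(f"{role}: {text}" for role, text in turns[split_index:])
    earlier = "\n".join(f"{role}: {text}" for role, text in turns[:split_index])
    prompt = f"Given this part of a conversation:\n{later}\n\nPredict the exchange that came before it."
    return {"prompt": prompt, "target": earlier}

pair = make_reverse_prediction_pair(conversation, split_index=2)
print(pair["prompt"])
print("---")
print(pair["target"])
```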

### 3. Meta-FIM

Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts.

#### Benefits:
- Enhanced understanding of long-range dependencies in text
- Improved ability to maintain coherence across longer contexts
- Better performance on tasks requiring integration of information from multiple parts of a document or conversation

#### Example:
```
<|fim_start|>Human: What's the weather like today?
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France.
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information?
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks?
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones:
1. Eiffel Tower
2. Louvre Museum
3. Notre-Dame Cathedral
4. Arc de Triomphe
5. Sacré-Cœur Basilica<|fim_end|>
```

In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points.
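
A Meta-FIM training example like the one above can be produced by cutting a long transcript at two points and wrapping the pieces with the FIM tokens. The helper below is a hedged sketch of that idea; the random cut-point strategy is an assumption, not the documented procedure.

```python
# Sketch: build a Meta-FIM example by cutting a long text into prefix / middle / suffix
# and wrapping it with the FIM tokens described above (cut points chosen at random here).
import random

def make_meta_fim_example(text: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    lines = text.splitlines(keepends=True)
    # Pick two cut points so that prefix, middle, and suffix are all non-empty.
    i, j = sorted(rng.sample(range(1, len(lines)), 2))
    prefix, middle, suffix = "".join(lines[:i]), "".join(lines[i:j]), "".join(lines[j:])
    return f"<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>{middle}<|fim_end|>"

transcript = (
    "Human: What's the weather like today?\n"
    "Assistant: I don't have real-time weather data. Where are you located?\n"
    "Human: I'm in Paris, France.\n"
    "Assistant: Paris generally has a temperate climate; a weather app will have specifics.\n"
)
print(make_meta_fim_example(transcript))
```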

## Evaluation

| Tasks            | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|------------------|--------:|------------------|-------:|-------------|---|-------:|---|--------|
| tinyBenchmarks   | N/A     |                  |        |             |   |        |   |        |
| - tinyArc        | 0       | none             |     25 | acc_norm    | ↑ | 0.5791 | ± | N/A    |
| - tinyGSM8k      | 0       | flexible-extract |      5 | exact_match | ↑ | 0.5536 | ± | N/A    |
|                  |         | strict-match     |      5 | exact_match | ↑ | 0.5536 | ± | N/A    |
| - tinyHellaswag  | 0       | none             |     10 | acc_norm    | ↑ | 0.8391 | ± | N/A    |
| - tinyMMLU       | 0       | none             |      0 | acc_norm    | ↑ | 0.6377 | ± | N/A    |
| - tinyTruthfulQA | 0       | none             |      0 | acc         | ↑ | 0.4914 | ± | N/A    |
| - tinyWinogrande | 0       | none             |      5 | acc_norm    | ↑ | 0.7608 | ± | N/A    |

### Training Command

```bash
python sft_14.py \
    --run_name="llama3.1-8b-continued3" \
    --model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \
    --dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \
    --report_to="wandb" \
    --optim="adamw_bnb_8bit" \
    --lr_scheduler_type="cosine" \
    --max_steps=100000 \
    --max_seq_length=64000 \
    --learning_rate=0.00001 \
    --attn_implementation="flash_attention_2" \
    --save_strategy="steps" \
    --save_steps=50 \
    --save_total_limit=10 \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --logging_steps=1 \
    --num_train_epochs=1 \
    --push_to_hub \
    --hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT" \
    --hub_strategy="all_checkpoints" \
    --gradient_checkpointing \
    --use_liger=true \
    --packing=true \
    --torch_dtype="bfloat16" \
    --output_dir="continuedtraining3_output"
```

## Intended Uses

This model is designed for:

- Text Completion and Generation
- Code Completion (especially Python)
- Instruction Following
- General Language Understanding
- Context-Aware Text Infilling (using FIM)

## Limitations and Biases

- The model may exhibit biases present in the training data.
- It lacks real-time knowledge beyond its training data.
- It should not be used for critical decision-making without human oversight.

## Technical Specifications

- **Base Model:** meta-llama/Meta-Llama-3.1-8B
- **Training Approach:** Full Fine-Tuning
- **Library:** Hugging Face Transformers and TRL

## Contact

For inquiries about this model, please contact Eric Florenzano through the [model repository](https://huggingface.co/ericflo/Llama-3.1-8B-ContinuedTraining3-FFT).