---
language:
- en
tags:
- llama
- llm
- fine-tuning
- fill-in-the-middle
- instruction-following
license: apache-2.0
datasets:
- mlabonne/FineTome-100k
- mlfoundations/dclm-baseline-1.0-parquet
- wikimedia/wikipedia
- bigcode/starcoderdata
pipeline_tag: text-generation
---

# Custom LLM with Full Fine-Tuning

## Model Overview

This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version, which used a high-rank adapter, this model employs full fine-tuning for greater learning capacity across a variety of tasks.

- **Developer:** Eric Florenzano
- **Model Type:** Large Language Model (LLM)
- **Language(s):** English, with a focus on Python for code-related tasks
- **License:** Apache-2.0
- **Base Model:** meta-llama/Meta-Llama-3.1-8B

## Unique Training Approach

This model is trained directly on a mixture of high-quality datasets covering general text completion, code completion, and instruction following. Key features include:

- **Full Fine-Tuning:** Unlike the previous LoRA approach, this version updates all model parameters.
- **Diverse Dataset Mixture:** Combines pretraining and instruction datasets for comprehensive language understanding.
- **Multi-Format Instruction Tuning:** Alternates between ChatML and Llama Chat templates for flexible instruction following.
- **Contextual Data Prefixing:** Uses source information to address data imbalance during training (see the sketch after this list).
- **Fill-in-the-Middle (FIM) Training:** Incorporates FIM tasks for enhanced context understanding.
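
As a rough illustration of contextual data prefixing, the sketch below prepends a source tag to each sample before tokenization. The exact prefix format used during training is not documented in this card, so the `format_with_source` helper and its tag string are assumptions for illustration only.

```python
# Minimal sketch of contextual data prefixing (illustrative only).
# The actual prefix format used in training is not specified in this card;
# the tag string below is hypothetical.

def format_with_source(text: str, source: str) -> str:
    """Prepend a source tag so the model can condition on where a sample came from."""
    return f"[source: {source}]\n{text}"

samples = [
    ("def add(a, b):\n    return a + b", "bigcode/starcoderdata"),
    ("Paris is the capital of France.", "wikimedia/wikipedia"),
]

for text, source in samples:
    print(format_with_source(text, source))
```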

## Training Data

The model is trained on a blend of high-quality data sources (a loading sketch follows the list):

- **FineTome-100k:** High-quality instruction-tuning data for general language tasks.
- **dclm-baseline-1.0-parquet:** Apple's pretraining corpus for text completion/prediction.
- **English, Spanish, and French Wikipedia:** For broad language understanding.
- **StarCoder data:** High-quality, Python-focused code dataset for code completion tasks.
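
The sketch below shows one way to stream and mix these sources with the `datasets` library. The dataset configurations, column names, and mixture weights are assumptions for illustration; they are not the exact settings used to train this model.

```python
# Sketch: stream the listed sources and sample from them with illustrative weights.
# Dataset configs, column names ("text" / "content"), and weights are assumptions.
import random
from datasets import load_dataset

sources = {
    "dclm": (load_dataset("mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True), "text", 0.5),
    "wikipedia": (load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True), "text", 0.3),
    "starcoder": (load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True), "content", 0.2),
}

iterators = {name: iter(dataset) for name, (dataset, _, _) in sources.items()}
names = list(sources)
weights = [sources[name][2] for name in names]

def sample_mixture(n: int):
    """Yield n raw-text snippets drawn across sources according to the weights."""
    for _ in range(n):
        name = random.choices(names, weights=weights, k=1)[0]
        column = sources[name][1]
        example = next(iterators[name])
        yield name, example[column]

for source_name, text in sample_mixture(3):
    print(source_name, "->", text.replace("\n", " ")[:80])
```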

## Training Procedure

### Setup

```bash
pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn
```

## Key Features

1. **Full Fine-Tuning:** Updates all model parameters for comprehensive learning.
2. **8-bit AdamW Optimizer:** Uses `adamw_bnb_8bit` for memory-efficient training.
3. **Flash Attention 2:** Implements `flash_attention_2` for faster attention computation.
4. **Gradient Checkpointing:** Enables training with limited GPU memory.
5. **Liger Kernel and Packing:** Uses `use_liger=true` for optimized training kernels and `packing=true` for efficient sequence packing.
6. **BFloat16 Precision:** Uses `bfloat16` for balanced precision and performance.
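
These options map to standard Hugging Face settings. The sketch below shows roughly how the base model and optimizer choices above can be wired up with `transformers`; it is a simplified illustration rather than the actual training script (the Liger kernel and packing flags are passed through TRL, as in the training command further down).

```python
# Sketch: how the features above map to Hugging Face settings
# (illustrative; not the actual training script for this model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "meta-llama/Meta-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # BFloat16 precision
    attn_implementation="flash_attention_2",  # Flash Attention 2
)
model.gradient_checkpointing_enable()         # trade compute for GPU memory

args = TrainingArguments(
    output_dir="continuedtraining3_output",
    optim="adamw_bnb_8bit",                   # 8-bit AdamW from bitsandbytes
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
)
```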

## Advanced Training Techniques

This model incorporates several advanced training techniques to enhance its capabilities:

### 1. Fill-in-the-Middle (FIM) Capability

FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation.

#### Using FIM with the Model

To use the FIM capability, structure your input with special tokens:

- `<|fim_start|>`: Marks the start of the FIM input
- `<|fim_marker|>`: Separates the prefix from the suffix
- `<|fim_gen|>`: Indicates where the generated content should begin
- `<|fim_end|>`: Marks the end of the FIM sequence, emitted after the generated content

Example FIM input:
```
<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>
```

The model generates the missing middle after `<|fim_gen|>`, producing the content that belongs between the prefix and suffix.
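
Assuming the FIM tokens above are part of the model's vocabulary, a minimal generation call might look like the following sketch (the prefix and suffix strings are just examples):

```python
# Sketch: build a FIM prompt from a prefix and suffix and let the model fill the middle.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ericflo/Llama-3.1-8B-ContinuedTraining3-FFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prefix = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
suffix = "\n\nprint(fibonacci(10))\n"
prompt = f"<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Everything generated after the prompt is the infilled middle (up to <|fim_end|>).
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(completion.split("<|fim_end|>")[0])
```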

### 2. Reverse Prediction and Instruction Backtranslation

This technique, also known as instruction backtranslation, enhances the model's context understanding by training it to predict previous parts of a conversation or text.

#### How it works:
1. The model is given a snippet of conversation or text.
2. It is then tasked with predicting what came before this snippet.
3. This process helps the model understand context, conversation flow, and the logical progression of ideas.

#### Benefits:
- Improved context understanding
- Enhanced ability to maintain coherent, contextually appropriate conversations
- Better grasp of cause-and-effect relationships in text

#### Example use case:
Input:
```
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
Task: Predict the previous exchange in this conversation.

Possible model output:
```
Human: What's the capital of France?
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks.
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
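
One plausible way to build such reverse-prediction training pairs is sketched below. The prompt wording and the split point are assumptions for illustration; the exact format used during training is not spelled out in this card.

```python
# Sketch: turn a conversation into a reverse-prediction training pair
# (the prompt wording is hypothetical; only the overall idea comes from this card).

conversation = [
    ("Human", "What's the capital of France?"),
    ("Assistant", "The capital of France is Paris, known as the \"City of Light\"."),
    ("Human", "Thank you for the information about Paris. Can you recommend some popular tourist attractions there?"),
]

def make_reverse_prediction_pair(turns, split_index):
    """Ask the model to reconstruct everything that came before `split_index`."""
    later = "\n".join(f"{role}: {text}" for role, text in turns[split_index:])
    earlier = "\n".join(f"{role}: {text}" for role, text in turns[:split_index])
    prompt = f"Given this part of a conversation:\n{later}\n\nPredict the exchange that came before it."
    return {"prompt": prompt, "target": earlier}

pair = make_reverse_prediction_pair(conversation, split_index=2)
print(pair["prompt"])
print("---")
print(pair["target"])
```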

### 3. Meta-FIM

Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts.

#### Benefits:
- Enhanced understanding of long-range dependencies in text
- Improved ability to maintain coherence across longer contexts
- Better performance on tasks requiring integration of information from multiple parts of a document or conversation

#### Example:
```
<|fim_start|>Human: What's the weather like today?
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France.
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information?
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks?
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones:
1. Eiffel Tower
2. Louvre Museum
3. Notre-Dame Cathedral
4. Arc de Triomphe
5. Sacré-Cœur Basilica<|fim_end|>
```

In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points.
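
A Meta-FIM training example like the one above can be produced by cutting a long transcript at two points and wrapping the pieces with the FIM tokens. The helper below is a hedged sketch of that idea; the random cut-point strategy is an assumption, not the documented procedure.

```python
# Sketch: build a Meta-FIM example by cutting a long text into prefix / middle / suffix
# and wrapping it with the FIM tokens described above (cut points chosen at random here).
import random

def make_meta_fim_example(text: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    lines = text.splitlines(keepends=True)
    # Pick two cut points so that prefix, middle, and suffix are all non-empty.
    i, j = sorted(rng.sample(range(1, len(lines)), 2))
    prefix, middle, suffix = "".join(lines[:i]), "".join(lines[i:j]), "".join(lines[j:])
    return f"<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>{middle}<|fim_end|>"

transcript = (
    "Human: What's the weather like today?\n"
    "Assistant: I don't have real-time weather data. Where are you located?\n"
    "Human: I'm in Paris, France.\n"
    "Assistant: Paris generally has a temperate climate; a weather app will have specifics.\n"
)
print(make_meta_fim_example(transcript))
```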

## Evaluation

| Tasks            | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|------------------|--------:|------------------|-------:|-------------|---|-------:|---|--------|
| tinyBenchmarks   | N/A     |                  |        |             |   |        |   |        |
| - tinyArc        | 0       | none             |     25 | acc_norm    | ↑ | 0.5791 | ± | N/A    |
| - tinyGSM8k      | 0       | flexible-extract |      5 | exact_match | ↑ | 0.5536 | ± | N/A    |
|                  |         | strict-match     |      5 | exact_match | ↑ | 0.5536 | ± | N/A    |
| - tinyHellaswag  | 0       | none             |     10 | acc_norm    | ↑ | 0.8391 | ± | N/A    |
| - tinyMMLU       | 0       | none             |      0 | acc_norm    | ↑ | 0.6377 | ± | N/A    |
| - tinyTruthfulQA | 0       | none             |      0 | acc         | ↑ | 0.4914 | ± | N/A    |
| - tinyWinogrande | 0       | none             |      5 | acc_norm    | ↑ | 0.7608 | ± | N/A    |

### Training Command

```bash
python sft_14.py \
    --run_name="llama3.1-8b-continued3" \
    --model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \
    --dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \
    --report_to="wandb" \
    --optim="adamw_bnb_8bit" \
    --lr_scheduler_type="cosine" \
    --max_steps=100000 \
    --max_seq_length=64000 \
    --learning_rate=0.00001 \
    --attn_implementation="flash_attention_2" \
    --save_strategy="steps" \
    --save_steps=50 \
    --save_total_limit=10 \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --logging_steps=1 \
    --num_train_epochs=1 \
    --push_to_hub \
    --hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT" \
    --hub_strategy="all_checkpoints" \
    --gradient_checkpointing \
    --use_liger=true \
    --packing=true \
    --torch_dtype="bfloat16" \
    --output_dir="continuedtraining3_output"
```

## Intended Uses

This model is designed for:

- Text Completion and Generation
- Code Completion (especially Python)
- Instruction Following
- General Language Understanding
- Context-Aware Text Infilling (using FIM)

## Limitations and Biases

- The model may exhibit biases present in the training data.
- It lacks real-time knowledge beyond its training data.
- It should not be used for critical decision-making without human oversight.

## Technical Specifications

- **Base Model:** meta-llama/Meta-Llama-3.1-8B
- **Training Approach:** Full Fine-Tuning
- **Library:** Hugging Face Transformers and TRL

## Contact

For inquiries about this model, please contact Eric Florenzano through the [model repository](https://huggingface.co/ericflo/Llama-3.1-8B-ContinuedTraining3-FFT).