# Llama-3-3B CodeSearchNet Fine-tuned
This repository hosts a **Llama 3 (3B) model** fine-tuned on the **CodeSearchNet dataset**, which contains code in six programming languages.
## Model Details
- **Base Model**: Llama 3 (3B)
- **Fine-tuning Dataset**: CodeSearchNet
- **Languages Covered**: Python, Java, JavaScript, PHP, Ruby, Go
- **Training Method**: Supervised fine-tuning (SFT) with a contrastive loss objective for code search tasks
- **Tokenization**: Llama 3 tokenizer with additional tokens for code-specific keywords
- **Frameworks Used**: Hugging Face `transformers`, PyTorch, PEFT (for LoRA-based tuning); see the loading sketch below
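Assuming the fine-tuned weights are published as a PEFT/LoRA adapter, loading looks roughly like this. The base checkpoint name and the adapter repository id are placeholders, not confirmed identifiers for this model.

```python
# Minimal loading sketch (placeholder model ids; replace with the actual repo names).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.2-3B"                   # assumed 3B base checkpoint
ADAPTER_REPO = "your-username/llama-3-3b-codesearchnet"  # placeholder adapter repo

# Load the tokenizer from the adapter repo so any added code-specific tokens are included.
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO)

base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# If the tokenizer added new tokens, the embedding matrix must match its vocabulary size.
base.resize_token_embeddings(len(tokenizer))

# Attach the LoRA adapter produced by the PEFT-based fine-tuning.
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()
```

If the adapter was instead merged into the base weights before upload, a plain `AutoModelForCausalLM.from_pretrained(ADAPTER_REPO)` is sufficient.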
## Dataset
The model was fine-tuned on the **CodeSearchNet** dataset, which contains:
- Function-level code snippets
- Paired natural language descriptions
- Multiple programming languages for multi-language search support
### Dataset Sources
- [CodeSearchNet Dataset](https://github.com/github/CodeSearchNet)
- Contains roughly 2M function-documentation pairs drawn from open-source repositories
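For reference, the corpus can be inspected with the Hugging Face `datasets` library. This is a sketch of one way to obtain the data, not the exact preprocessing pipeline used for fine-tuning; field names follow the Hub version of the dataset.

```python
from datasets import load_dataset

# Each language is a separate configuration: python, java, javascript, php, ruby, go.
ds = load_dataset("code_search_net", "python", trust_remote_code=True)

sample = ds["train"][0]
print(sample["func_name"])                  # function name
print(sample["func_documentation_string"])  # paired natural-language description
print(sample["func_code_string"][:200])     # function-level code snippet
```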
## Training Setup
- **Hardware**: NVIDIA A100 GPUs
- **Batch Size**: 16
- **Learning Rate**: 2e-5 with cosine annealing
- **Max Sequence Length**: 512
- **Fine-tuning Duration**: 3 epochs
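As a rough illustration, the hyperparameters above map onto Hugging Face `TrainingArguments` as follows. This mirrors the listed values; it is not the exact training script used for this model.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-3b-codesearchnet-sft",  # placeholder output path
    per_device_train_batch_size=16,            # Batch Size: 16
    learning_rate=2e-5,                        # Learning Rate: 2e-5
    lr_scheduler_type="cosine",                # cosine annealing schedule
    num_train_epochs=3,                        # Fine-tuning Duration: 3 epochs
    bf16=True,                                 # assumed mixed precision on A100 GPUs
)
# The 512-token limit is applied when tokenizing the training examples
# (truncation / max_length in the data preparation step), not in TrainingArguments.
```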
## Intended Use
- **Code Search**: Retrieve relevant code snippets given a natural language query
- **Code Completion**: Provide context-aware code suggestions
- **Code-to-Text Generation**: Explain code functionality in natural language
- **Multi-language Code Retrieval**: Search across different programming languages
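A minimal code-to-text example, assuming `model` and `tokenizer` were loaded as in the Model Details section. The prompt format is illustrative only; no specific prompt template is documented for this model.

```python
# Illustrative prompt; adjust wording to your use case.
prompt = (
    "Explain what the following Python function does:\n\n"
    "def add(a, b):\n"
    "    return a + b\n\n"
    "Explanation:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```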