Abstract
Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search across multiple programming languages. Distinguishing between task-wise and language-wise adaptation further reveals the sensitivity of code retrieval to syntactic and linguistic variations.
Community
LoRACode introduces a Low-Rank Adaptation (LoRA)-based fine-tuning framework for efficient and scalable code embeddings, significantly reducing trainable parameters while improving code retrieval performance in Code2Code and Text2Code search tasks.
LoRA-based Parameter-Efficient Fine-Tuning for Code Embeddings: Unlike traditional fine-tuning methods that require updating the entire model, LoRACode applies LoRA by introducing low-rank adaptation matrices in the query and value projection layers of transformer models (a minimal sketch follows below). This reduces trainable parameters to ~1.83%–1.85% of the base model while maintaining or improving retrieval accuracy. Fine-tuning is also fast: 2 million samples in 25 minutes on two H100 GPUs.
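A minimal sketch of this setup with the Hugging Face peft library, assuming a RoBERTa-style encoder such as UniXcoder whose attention projections are named `query` and `value`; the rank, alpha, and dropout values are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Sketch: attach LoRA adapters to the query/value projections of an encoder.
# Hyperparameters below are assumptions for illustration only.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base_model = AutoModel.from_pretrained("microsoft/unixcoder-base")

lora_config = LoraConfig(
    r=16,                               # low-rank dimension (assumed)
    lora_alpha=32,                      # scaling factor (assumed)
    target_modules=["query", "value"],  # q/v projection layers in the attention blocks
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports the small fraction of trainable parameters
```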
Task-Specific and Language-Specific Adapters: LoRACode fine-tunes adapters for Code2Code retrieval (matching code snippets to code snippets) and Text2Code retrieval (mapping natural language queries to code). It also trains separate LoRA adapters for six programming languages (Go, Java, JavaScript, PHP, Python, Ruby), which outperform generic task-based adapters: language-specific fine-tuning captures syntactic and contextual variations better than multilingual training (see the adapter-swapping sketch below).
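A hypothetical sketch of serving language-specific adapters on a single shared base model, switching the active adapter to match the query's target language; the adapter paths and names are illustrative, not released artifacts.

```python
# Sketch: one LoRA adapter per language, swapped on a shared base model at
# retrieval time. Paths and adapter names below are hypothetical.
from transformers import AutoModel
from peft import PeftModel

LANG_ADAPTERS = {
    "python": "adapters/loracode-text2code-python",
    "java":   "adapters/loracode-text2code-java",
    "go":     "adapters/loracode-text2code-go",
}

base_model = AutoModel.from_pretrained("microsoft/unixcoder-base")

# Load the first adapter, then register the remaining ones under named slots.
model = PeftModel.from_pretrained(base_model, LANG_ADAPTERS["python"], adapter_name="python")
for lang, path in LANG_ADAPTERS.items():
    if lang != "python":
        model.load_adapter(path, adapter_name=lang)

def activate_language(lang: str):
    """Switch the active LoRA adapter to the one trained for `lang`."""
    model.set_adapter(lang)
    return model
```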
Integration of LoRA with Contrastive Fine-Tuning: LoRACode employs a contrastive learning objective with a cosine similarity-based loss function to improve retrieval accuracy. The fine-tuning process is implemented using ContrastiveTrainer, a custom extension of the Hugging Face Trainer. Embeddings are extracted using a last-token pooling strategy to retain semantic richness (a sketch of both components follows below).
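A sketch of the two components under stated assumptions: last-token pooling over a padded batch, and an in-batch contrastive loss over cosine similarities (InfoNCE-style, treating the code at the same batch index as the positive). The loss inside the paper's ContrastiveTrainer may differ in detail.

```python
# Sketch of last-token pooling and a cosine-similarity contrastive loss.
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of the last non-padding token of each sequence."""
    seq_lengths = attention_mask.sum(dim=1) - 1                       # index of last real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, seq_lengths]                      # shape: (batch, hidden)

def contrastive_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: each query's positive is the code at the same index."""
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = query_emb @ code_emb.T / temperature                     # cosine-similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)       # diagonal = positives
    return F.cross_entropy(logits, labels)
```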
Performance Gains Over Existing Models: Up to 9.1% increase in Mean Reciprocal Rank (MRR) for Code2Code retrieval. Up to 86.69% increase in MRR for Text2Code retrieval (Python-specific model). Surpasses CodeBERT, GraphCodeBERT, and UniXcoder in retrieval accuracy while being significantly more computationally efficient.
Scalability and Computational Efficiency: LoRACode fine-tunes open models with far fewer resources than proprietary systems such as OpenAI's embeddings require, while achieving comparable or better retrieval accuracy. It also enables cross-language retrieval, allowing embeddings trained on one language to generalize to others.