arxiv:2503.05315

LoRACode: LoRA Adapters for Code Embeddings

Published on Mar 7 · Submitted by amanchadha on Mar 10

Abstract

Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search tasks across multiple programming languages. Distinguishing between task-wise and language-wise adaptation helps probe the sensitivity of code retrieval to syntactic and linguistic variations.

Community


  • LoRACode introduces a Low-Rank Adaptation (LoRA)-based fine-tuning framework for efficient and scalable code embeddings, significantly reducing trainable parameters while improving code retrieval performance in Code2Code and Text2Code search tasks.

  • LoRA-based Parameter-Efficient Fine-Tuning for Code Embeddings: Unlike traditional fine-tuning methods that modify the entire model, LoRACode applies LoRA by introducing low-rank adaptation matrices in the query and value projection layers of transformer models. This reduces trainable parameters to ~1.83%–1.85% of the base model while maintaining or improving retrieval accuracy, and makes fine-tuning significantly faster: 2 million samples in 25 minutes on two H100 GPUs (a configuration sketch follows this list).

  • Task-Specific and Language-Specific Adapters: Adapters are fine-tuned separately for Code2Code retrieval (matching code snippets to code queries) and Text2Code retrieval (mapping natural-language queries to code). Separate LoRA adapters fine-tuned for six programming languages (Go, Java, JavaScript, PHP, Python, Ruby) outperform generic task-based adapters, as language-specific fine-tuning captures syntactic and contextual variations better than multilingual training (see the adapter-switching sketch after this list).

  • Integration of LoRA with Contrastive Fine-Tuning: LoRACode employs a contrastive learning objective with a cosine-similarity-based loss function to improve retrieval accuracy. The fine-tuning process is implemented using ContrastiveTrainer, a custom Hugging Face extension, and embeddings are extracted with a last-token pooling strategy to retain semantic richness (see the pooling and loss sketch after this list).

  • Performance Gains Over Existing Models: Up to 9.1% increase in Mean Reciprocal Rank (MRR) for Code2Code retrieval. Up to 86.69% increase in MRR for Text2Code retrieval (Python-specific model). Surpasses CodeBERT, GraphCodeBERT, and UniXcoder in retrieval accuracy while being significantly more computationally efficient.

  • Scalability and Computational Efficiency: LoRACode fine-tunes models using fewer resources than OpenAI’s proprietary embeddings while achieving comparable or better retrieval accuracy. It enables cross-language retrieval, allowing embeddings trained on one language to generalize to others.
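The bullet on parameter-efficient fine-tuning describes adding low-rank matrices to the query and value projections. Below is a minimal sketch of how such a setup might look with the Hugging Face `peft` library, assuming UniXcoder as the base encoder; the rank, scaling factor, and dropout values are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Sketch: attach LoRA adapters to the query/value projections of a code encoder.
# Base model and LoRA hyperparameters are assumptions for illustration.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("microsoft/unixcoder-base")

lora_config = LoraConfig(
    r=16,                                # low-rank dimension (assumed)
    lora_alpha=32,                       # scaling factor (assumed)
    target_modules=["query", "value"],   # attention q/v projection layers
    lora_dropout=0.05,                   # assumed
    bias="none",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction (~2%)
```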
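For the language-specific adapters, one plausible deployment pattern is to keep a separate LoRA adapter per programming language and swap it in at query time. The sketch below uses `peft` adapter switching; the adapter paths and names are hypothetical, not released artifacts of the paper.

```python
# Sketch: per-language LoRA adapters on a shared base encoder.
# Adapter paths/names below are made up for illustration.
from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained("microsoft/unixcoder-base")
model = PeftModel.from_pretrained(base, "adapters/lora-python", adapter_name="python")
model.load_adapter("adapters/lora-java", adapter_name="java")

model.set_adapter("python")   # embed Python snippets with the Python adapter
# ... encode Python batch ...
model.set_adapter("java")     # swap to the Java adapter for Java snippets
# ... encode Java batch ...
```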
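The contrastive objective and last-token pooling mentioned above can be sketched as follows. This is an illustrative reconstruction, not the paper's ContrastiveTrainer code; the InfoNCE-style formulation and the temperature value are assumptions.

```python
# Sketch: last-token pooling and a cosine-similarity contrastive loss.
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states, attention_mask):
    """Take the hidden state of the last non-padding token per sequence."""
    last_idx = attention_mask.sum(dim=1) - 1                       # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]                      # (batch, hidden)

def contrastive_loss(query_emb, code_emb, temperature=0.05):
    """InfoNCE-style loss: matching (query, code) pairs lie on the diagonal."""
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = query_emb @ code_emb.T / temperature                  # cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```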
