Abstract
Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search across multiple programming languages. Distinguishing between task-wise and language-wise adaptation further reveals the sensitivity of code retrieval to syntactic and linguistic variations.
Community
LoRACode introduces a Low-Rank Adaptation (LoRA)-based fine-tuning framework for efficient and scalable code embeddings, significantly reducing trainable parameters while improving code retrieval performance in Code2Code and Text2Code search tasks.
LoRA-based Parameter-Efficient Fine-Tuning for Code Embeddings: Unlike traditional fine-tuning methods that require updating the entire model, LoRACode applies LoRA by introducing low-rank adaptation matrices in the query and value projection layers of transformer models (a minimal sketch follows below). This reduces trainable parameters to ~1.83%–1.85% of the base model while maintaining or improving retrieval accuracy. Fine-tuning is also fast: 2 million samples in 25 minutes on two H100 GPUs.
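A minimal sketch of this setup with the Hugging Face peft library, assuming a RoBERTa-style encoder such as UniXcoder whose attention projections are named `query` and `value`; the rank, alpha, and dropout values are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Sketch: attach LoRA adapters to the query/value projections of an encoder.
# Hyperparameters below are assumptions for illustration only.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base_model = AutoModel.from_pretrained("microsoft/unixcoder-base")

lora_config = LoraConfig(
    r=16,                               # low-rank dimension (assumed)
    lora_alpha=32,                      # scaling factor (assumed)
    target_modules=["query", "value"],  # q/v projection layers in the attention blocks
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports the small fraction of trainable parameters
```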
Task-Specific and Language-Specific Adapters: LoRACode fine-tunes adapters for Code2Code retrieval (matching code snippets to code snippets) and Text2Code retrieval (mapping natural language queries to code). It also trains separate LoRA adapters for six programming languages (Go, Java, JavaScript, PHP, Python, Ruby), which outperform generic task-based adapters: language-specific fine-tuning captures syntactic and contextual variations better than multilingual training (see the adapter-swapping sketch below).
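A hypothetical sketch of serving language-specific adapters on a single shared base model, switching the active adapter to match the query's target language; the adapter paths and names are illustrative, not released artifacts.

```python
# Sketch: one LoRA adapter per language, swapped on a shared base model at
# retrieval time. Paths and adapter names below are hypothetical.
from transformers import AutoModel
from peft import PeftModel

LANG_ADAPTERS = {
    "python": "adapters/loracode-text2code-python",
    "java":   "adapters/loracode-text2code-java",
    "go":     "adapters/loracode-text2code-go",
}

base_model = AutoModel.from_pretrained("microsoft/unixcoder-base")

# Load the first adapter, then register the remaining ones under named slots.
model = PeftModel.from_pretrained(base_model, LANG_ADAPTERS["python"], adapter_name="python")
for lang, path in LANG_ADAPTERS.items():
    if lang != "python":
        model.load_adapter(path, adapter_name=lang)

def activate_language(lang: str):
    """Switch the active LoRA adapter to the one trained for `lang`."""
    model.set_adapter(lang)
    return model
```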
Integration of LoRA with Contrastive Fine-Tuning: LoRACode employs a contrastive learning objective with a cosine similarity-based loss function to improve retrieval accuracy. The fine-tuning process is implemented using ContrastiveTrainer, a custom extension of the Hugging Face Trainer. Embeddings are extracted using a last-token pooling strategy to retain semantic richness (a sketch of both components follows below).
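A sketch of the two components under stated assumptions: last-token pooling over a padded batch, and an in-batch contrastive loss over cosine similarities (InfoNCE-style, treating the code at the same batch index as the positive). The loss inside the paper's ContrastiveTrainer may differ in detail.

```python
# Sketch of last-token pooling and a cosine-similarity contrastive loss.
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of the last non-padding token of each sequence."""
    seq_lengths = attention_mask.sum(dim=1) - 1                       # index of last real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, seq_lengths]                      # shape: (batch, hidden)

def contrastive_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: each query's positive is the code at the same index."""
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = query_emb @ code_emb.T / temperature                     # cosine-similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)       # diagonal = positives
    return F.cross_entropy(logits, labels)
```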
Performance Gains Over Existing Models: Up to 9.1% increase in Mean Reciprocal Rank (MRR) for Code2Code retrieval. Up to 86.69% increase in MRR for Text2Code retrieval (Python-specific model). Surpasses CodeBERT, GraphCodeBERT, and UniXcoder in retrieval accuracy while being significantly more computationally efficient.
Scalability and Computational Efficiency: LoRACode fine-tunes open models with far fewer resources than proprietary systems such as OpenAI's embeddings require, while achieving comparable or better retrieval accuracy. It also enables cross-language retrieval, allowing embeddings trained on one language to generalize to others.