---
license: apache-2.0
datasets:
- bigcode/the-stack-dedup
library_name: transformers
language:
- code
---
## CodeSage-Large
### Updates
* [12/2024] <span style="color:blue">We are excited to announce the release of the CodeSage V2 model family with significantly improved performance and flexible embedding dimensions!</span> Please check out our [models](https://huggingface.co/codesage) and [blog post](https://code-representation-learning.github.io/codesage-v2.html) for more details.
* [11/2024] You can now access CodeSage models through SentenceTransformer.
### Model description
CodeSage is a new family of open code embedding models with an encoder architecture that supports a wide range of source code understanding tasks. It is introduced in the paper:
[Code Representation Learning At Scale by
Dejiao Zhang*, Wasi Uddin Ahmad*, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang](https://arxiv.org/abs/2402.01935) (* indicates equal contribution).
### Pretraining data
This checkpoint is trained on The Stack (deduplicated) dataset (https://huggingface.co/datasets/bigcode/the-stack-dedup). Supported languages (9 in total) are as follows: c, c-sharp, go, java, javascript, typescript, php, python, and ruby.
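For reference, below is a minimal sketch of how one language split of this corpus can be inspected with the `datasets` library; the `data_dir` layout and streaming flag follow the dataset card, and gated-access login is assumed to be configured on your side.
```
from datasets import load_dataset

# The Stack (dedup) is a gated dataset: accept the terms on its page and log in
# via `huggingface-cli login` (or set HF_TOKEN) before loading.
ds = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",  # any of the nine languages listed above
    split="train",
    streaming=True,          # stream instead of downloading the full corpus
)
print(next(iter(ds))["content"][:200])  # preview the first source file
```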
### Training procedure
This checkpoint is first trained on code data via masked language modeling (MLM) and then on bimodal text-code pair data. Please refer to the paper for more details.
### How to Use
This checkpoint consists of a 1.3B-parameter encoder that extracts 1024-dimensional code embeddings.
1. Accessing CodeSage via Hugging Face Transformers: the model can be loaded with the `AutoModel` functionality and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
```
from transformers import AutoModel, AutoTokenizer
checkpoint = "codesage/codesage-large"
device = "cuda" # for GPU usage or "cpu" for CPU usage
# Note: CodeSage requires adding eos token at the end of each tokenized sequence
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
```
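As a quick, hedged extension of the example above, embeddings can be compared with cosine similarity; how the model output is reduced to a single vector per snippet (mean pooling over tokens below) is an assumption for illustration, not the official recipe.
```
import torch
import torch.nn.functional as F

def embed(code):
    # Reuses the tokenizer, model, and device set up above.
    ids = tokenizer.encode(code, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(ids)[0]
    # If the output is token-level (batch, seq_len, 1024), mean-pool over tokens;
    # if it is already a pooled (batch, 1024) vector, use it as-is.
    return out.mean(dim=1) if out.dim() == 3 else out

a = embed("def add(a, b):\n\treturn a + b")
b = embed("def sum_two(x, y):\n\treturn x + y")
print(F.cosine_similarity(a, b).item())  # higher values indicate more similar code
```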
2. Accessing CodeSage via SentenceTransformer
```
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("codesage/codesage-large", trust_remote_code=True)
```
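A short usage sketch follows; the snippets are illustrative, and `encode` plus `util.cos_sim` are standard SentenceTransformers calls rather than CodeSage-specific APIs.
```
from sentence_transformers import util

snippets = [
    "def print_hello_world():\tprint('Hello World!')",
    "def say_hi():\tprint('Hi!')",
]
embeddings = model.encode(snippets)                # one vector per snippet
print(embeddings.shape)                            # expected: (2, 1024)
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the two snippets
```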
### BibTeX entry and citation info
```
@inproceedings{
zhang2024codesage,
title={CodeSage: Code Representation Learning At Scale},
author={Dejiao Zhang* and Wasi Ahmad* and Ming Tan and Hantian Ding and Ramesh Nallapati and Dan Roth and Xiaofei Ma and Bing Xiang},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=vfzRRjumpX}
}
```