Commit 256829f by codesage (1 parent: 60bb0ab)

Update README.md

Files changed (1): README.md (+47 -0)
@@ -1,3 +1,50 @@
  ---
  license: apache-2.0
+ datasets:
+ - bigcode/the-stack-dedup
+ library_name: transformers
  ---
+
+ ## CodeSage-Small
+
+ ### Model description
+ CodeSage is a new family of open code embedding models with an encoder architecture that supports a wide range of source code understanding tasks. It was introduced in the paper:
+
+ [Code Representation Learning At Scale by
+ Dejiao Zhang*, Wasi Uddin Ahmad*, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang](https://arxiv.org/abs/2402.01935) (* indicates equal contribution).
+
+ ### Pretraining data
+ This checkpoint is trained on the deduplicated Stack dataset (https://huggingface.co/datasets/bigcode/the-stack-dedup). The nine supported languages are c, c-sharp, go, java, javascript, typescript, php, python, and ruby.
+
+ ### Training procedure
+ This checkpoint is first trained on code data via masked language modeling (MLM) and then on bimodal text-code pair data. Please refer to the paper for more details.
+
+ ### How to use
+ This checkpoint consists of an encoder (130M parameters) that can be used to extract code embeddings of dimension 1024. It can be loaded with `AutoModel` and uses the StarCoder tokenizer (https://arxiv.org/pdf/2305.06161.pdf).
+
+ ```
+ from transformers import AutoModel, AutoTokenizer
+
+ checkpoint = "codesage/codesage-small"
+ device = "cuda" # for GPU usage or "cpu" for CPU usage
+
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
+ model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+
+ inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
+ embedding = model(inputs)[0]  # first element of the model output
+ print(f'Dimension of the embedding: {embedding.size()[-1]}, with norm={embedding.norm().item()}')  # last axis is the embedding dimension
+ print(embedding)
+ ```
+
+ ### BibTeX entry and citation info
+ ```
+ @inproceedings{
+ zhang2024codesage,
+ title={CodeSage: Code Representation Learning At Scale},
+ author={Dejiao Zhang* and Wasi Ahmad* and Ming Tan and Hantian Ding and Ramesh Nallapati and Dan Roth and Xiaofei Ma and Bing Xiang},
+ booktitle={The Twelfth International Conference on Learning Representations},
+ year={2024},
+ url={https://openreview.net/forum?id=vfzRRjumpX}
+ }
+ ```
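
Note on the "How to use" snippet in this diff: the README stops at printing the embedding, so it may help to see one way such embeddings could be compared. The sketch below is not part of the commit and is not confirmed by the README. It assumes the model output is a sequence of per-token vectors, uses mean pooling as one common (assumed) way to reduce them to a single vector, and uses cosine similarity as the comparison. The helper names `mean_pool` and `cosine_similarity` and the toy values are hypothetical, and plain Python lists stand in for tensors so that the example runs without downloading the checkpoint.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into a single fixed-size vector (hypothetical helper)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec in token_vectors:
        if len(vec) != dim:
            raise ValueError("all token vectors must have the same dimension")
        for i, value in enumerate(vec):
            sums[i] += value
    return [s / len(token_vectors) for s in sums]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for the token-level vectors of two code snippets (made-up values,
# 3 dimensions instead of the checkpoint's 1024).
snippet_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
snippet_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]

vec_a = mean_pool(snippet_a)
vec_b = mean_pool(snippet_b)
print(f"cosine similarity: {cosine_similarity(vec_a, vec_b):.4f}")
```

If the first element of the model output has shape `(1, sequence_length, 1024)`, which is an assumption about this checkpoint, then `embedding[0].tolist()` from the README snippet could be passed to `mean_pool` directly. In practice the same reduction is usually done on tensors rather than lists.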