---

license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---


# MiniCPM-2B-Text-Embedding-cft-pos

## Description

This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) for text-embedding tasks. The model is fine-tuned with contrastive fine-tuning and LoRA on NLI datasets.

⚠️ The training process ignores hard-negative samples and treats the other in-batch samples plus their entailments as in-batch negatives. ⚠️ For the version that uses hard-negative examples during training, see [MiniCPM-2B-Text-Embedding-cft](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft).
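The actual training code is not part of this card, but the in-batch-negative scheme can be illustrated with a minimal InfoNCE sketch. Names, shapes, and the choice to use only the other entailments as negatives are assumptions for illustration; in the setup described above, the other premises in the batch could also be appended to the candidate set.

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Illustrative InfoNCE loss with in-batch negatives.

    z_a: (B, D) embeddings of the premises
    z_b: (B, D) embeddings of their entailments
    For row i, z_b[i] is the positive; every other row of z_b acts as an
    in-batch negative. No hard negatives are used.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```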

## Base Model

[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

## Usage

1. Clone the MiniCPM-2B-dpo-bf16 repository

```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```

2. Set `add_eos_token` to `true` in the cloned repository's `tokenizer_config.json`

```json
"add_eos_token": true
```
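Because the sentence embedding is read from the last token's hidden state, it is worth confirming that the tokenizer now appends the EOS token. This is a quick sanity check, not part of the original instructions; the path placeholder is the same one used in the usage code below.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<your-cloned-base-model-path>")
ids = tokenizer("hello world")["input_ids"]
# With add_eos_token enabled, the sequence should end with the EOS token.
assert ids[-1] == tokenizer.eos_token_id
```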

3. Use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np


class MiniCPMSentenceEmbedding:

    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        # The sentence embedding is the final hidden state of the last (EOS) token.
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []
        for s in sentences:
            out.append(self.get_last_hidden_state(s))
        return out


minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>',
                                                      'trapoom555/MiniCPM-2B-Text-Embedding-cft-pos')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
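The example above prints raw embedding vectors; comparing sentences requires an explicit similarity measure. A minimal follow-up reusing `encoded_sentences` from the snippet above (cosine similarity is an assumption about how the embeddings are meant to be compared, consistent with the `sentence-similarity` pipeline tag; the numeric result depends on the model weights):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the two example sentences encoded above.
similarity = cosine_similarity(encoded_sentences[0], encoded_sentences[1])
print(f"cosine similarity: {similarity:.4f}")
```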

## Training Details

⚠️ The training process ignores hard-negative samples and treats the other in-batch samples plus their entailments as in-batch negatives. ⚠️

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 40                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 1e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epochs              | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |
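The full training scripts are listed as coming soon below; in the meantime, the hyperparameters in the table map onto a standard PEFT setup. The following is a minimal sketch only: the LoRA target modules, the scheduler horizon, and the warmup handling are assumptions, since this card does not state them.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "openbmb/MiniCPM-2B-dpo-bf16",
    torch_dtype=torch.bfloat16,   # bf16 training precision, per the table
    trust_remote_code=True,
)

# r, lora_alpha, and lora_dropout come from the table above;
# target_modules is an assumption (attention projections).
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="FEATURE_EXTRACTION",
)
model = get_peft_model(base, lora_config)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# The table lists CosineAnnealingLR with 100 warmup steps; warmup would be
# handled by a separate wrapper in practice. T_max here is illustrative.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```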

## Training Scripts

**_(coming soon...)_**

## Evaluation Results

**_(coming soon...)_**

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This project is an open-topic final project for the Spring 2024 NLP course at Tsinghua University.