
	---
	license: mit
	language:
	- en
	tags:
	- sentence-embedding
	- sentence-similarity
	- transformers
	- feature-extraction
	pipeline_tag: sentence-similarity
	---

	# MiniCPM-2B-Text-Embedding-cft-pos

	## Description

	This is a version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) fine-tuned for text embedding tasks. It was trained on NLI datasets with contrastive fine-tuning and LoRA.

	⚠️ The training process ignores hard-negative samples and treats the other in-batch samples and their entailments as in-batch negatives. ⚠️ For the version that uses hard-negative examples during training, see [MiniCPM-2B-Text-Embedding-cft](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft).

	## Base Model

	[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)

	## Usage

	1. Clone the MiniCPM-2B-dpo-bf16 repository

	```bash
	git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
	```

	2. Set `add_eos_token` to `true` in the cloned repository's `tokenizer_config.json`

	```json
	"add_eos_token": true
	```
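If you prefer not to edit the file by hand, the change in step 2 can be scripted. The following is a minimal sketch using only the Python standard library; the helper name `enable_add_eos_token` is hypothetical, and it assumes `tokenizer_config.json` sits at the top level of the cloned directory.

```python
import json
from pathlib import Path


def enable_add_eos_token(model_dir: str) -> dict:
    """Set "add_eos_token": true in <model_dir>/tokenizer_config.json.

    Hypothetical helper, not part of the model repository.
    Returns the updated configuration as a dict.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["add_eos_token"] = True  # all other settings are left untouched
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config


# Example call (the path is a placeholder for your cloned repository):
# enable_add_eos_token("MiniCPM-2B-dpo-bf16")
```

With this setting the tokenizer appends an EOS token to each input, so the last position, which the usage code reads as the embedding, is presumably the EOS token.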

	3. Use the model

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch
	import numpy as np

	class MiniCPMSentenceEmbedding:
	    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
	        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
	        self.model = AutoModelForCausalLM.from_pretrained(model_path,
	                                                          torch_dtype=torch.bfloat16,
	                                                          device_map='cuda',
	                                                          trust_remote_code=True)
	        if adapter_path is not None:
	            # Load fine-tuned LoRA
	            self.model.load_adapter(adapter_path)

	    def get_last_hidden_state(self, text):
	        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
	        with torch.no_grad():
	            # Final-layer hidden state of the last token is used as the embedding
	            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
	        return out.squeeze().float().cpu().numpy()

	    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
	        """
	        Returns a list of embeddings for the given sentences.

	        Args:
	            sentences: List of sentences to encode

	        Returns:
	            List of embeddings for the given sentences
	        """

	        out = []

	        for s in sentences:
	            out.append(self.get_last_hidden_state(s))

	        return out

	# Replace the first argument with the path to the base model cloned in step 1
	minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft-pos')

	example_sentences = ["I don't like apples", "I like apples"]

	encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)

	print(encoded_sentences)

	```
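The embeddings returned by `encode` are plain vectors, so comparing two sentences comes down to a vector similarity. This card does not prescribe a metric; the sketch below assumes cosine similarity, a common choice for sentence embeddings, and uses toy vectors in place of real model output so that it runs without a GPU. The function name `cosine_similarity` is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for model embeddings (assumed values)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1
print(cosine_similarity(a, c))  # close to 0
```

With the model loaded, the same function should accept the NumPy arrays from the usage example, for instance `cosine_similarity(encoded_sentences[0], encoded_sentences[1])`.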

	## Training Details

	⚠️ The training process ignores hard-negative samples and treats the other in-batch samples and their entailments as in-batch negatives. ⚠️
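To make the in-batch negative scheme concrete, here is a minimal pure-Python sketch of an InfoNCE loss under one reading of the statement above: for each sentence, its own entailment is the positive, and every other sentence in the batch plus every other sentence's entailment is a negative. Cosine similarity as the scoring function and the name `info_nce_in_batch` are assumptions; the actual training code is not published in this card.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce_in_batch(anchors: List[Vector], positives: List[Vector],
                      temperature: float = 0.05) -> float:
    """Mean InfoNCE loss with in-batch negatives and no hard negatives.

    For anchor i, positives[i] is the positive; all other anchors and all
    other positives in the batch act as negatives.
    """
    if not anchors or len(anchors) != len(positives):
        raise ValueError("need the same non-zero number of anchors and positives")
    losses = []
    for i, anchor in enumerate(anchors):
        pos_logit = _cosine(anchor, positives[i]) / temperature
        logits = [pos_logit]
        for j in range(len(anchors)):
            if j != i:
                logits.append(_cosine(anchor, anchors[j]) / temperature)
                logits.append(_cosine(anchor, positives[j]) / temperature)
        # Numerically stable log-sum-exp over the positive and all negatives
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - pos_logit)
    return sum(losses) / len(losses)


# Toy 2-D "embeddings" (assumed values, not model output)
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]     # each positive is near its anchor
mismatched = [[0.1, 1.0], [1.0, 0.1]]  # positives swapped between anchors

print(info_nce_in_batch(anchors, aligned))
print(info_nce_in_batch(anchors, mismatched))
```

The aligned batch yields a much smaller loss than the mismatched one, which is the behavior the contrastive objective rewards. A low temperature such as 0.05 sharpens this contrast.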

	| Training Details        | Value             |
	|-------------------------|-------------------|
	| Loss                    | InfoNCE           |
	| Batch Size              | 40                |
	| InfoNCE Temperature     | 0.05              |
	| Learning Rate           | 1e-05             |
	| Warmup Steps            | 100               |
	| Learning Rate Scheduler | CosineAnnealingLR |
	| LoRA Rank               | 8                 |
	| LoRA Alpha              | 32                |
	| LoRA Dropout            | 0.1               |
	| Training Precision      | bf16              |
	| Max Epoch               | 1                 |
	| GPU                     | RTX3090           |
	| Num GPUs                | 4                 |
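The table lists a peak learning rate, a warmup length, and a cosine annealing scheduler, but not how they are combined. The sketch below shows one plausible combination, assuming a linear warmup over the first 100 steps followed by the cosine annealing formula used by PyTorch's `CosineAnnealingLR`, with a minimum rate of 0. The function name `learning_rate` and the total step count of 1000 are hypothetical.

```python
import math


def learning_rate(step: int, total_steps: int, base_lr: float = 1e-5,
                  warmup_steps: int = 100, min_lr: float = 0.0) -> float:
    """Learning rate at a 0-indexed step.

    Assumed shape: linear warmup to base_lr, then cosine annealing to min_lr.
    """
    if step < warmup_steps:
        # Linear warmup (an assumption; the card only gives the step count)
        return base_lr * (step + 1) / warmup_steps
    t = step - warmup_steps
    t_max = max(1, total_steps - warmup_steps)
    progress = min(t, t_max) / t_max
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; the real number of steps is not stated
for step in (0, 50, 99, 100, 550, 1000):
    print(step, learning_rate(step, TOTAL_STEPS))
```

The rate rises to 1e-05 by the end of warmup and then decays smoothly toward zero by the final step.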

	## Training Scripts

	_(coming soon...)_

	## Evaluation Results

	_(coming soon...)_

	## Contributors

	Trapoom Ukarapol, Zhicheng Lee, Amy Xin

	## Footnotes

	This project is the topic-free final project of the Tsinghua University NLP course for Spring 2024.