MiniCPM-2B-Text-Embedding-cft
Description
This is a fine-tuned version of MiniCPM-2B-dpo-bf16 for text embedding tasks. The model was fine-tuned on NLI datasets using contrastive fine-tuning with LoRA.
⚠️ The training process ignores hard-negative samples and treats the other in-batch samples and their entailments as in-batch negatives. ⚠️ If you want the version that uses hard-negative examples during training, please refer here.
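For illustration, the sketch below shows one way such an in-batch InfoNCE objective could be implemented, where each premise's positive is its own entailment and the negatives are the other in-batch premises and entailments. This is a minimal sketch under that assumption, not the released training code; the function name and tensor shapes are ours.

import torch
import torch.nn.functional as F

def in_batch_infonce(premise_emb, entailment_emb, temperature=0.05):
    # premise_emb, entailment_emb: (batch_size, hidden_dim) sentence embeddings.
    # The positive for premise i is entailment i; the negatives are every other
    # in-batch entailment and every other in-batch premise (no hard negatives).
    p = F.normalize(premise_emb, dim=-1)
    e = F.normalize(entailment_emb, dim=-1)
    sim_pe = p @ e.T / temperature                      # premise vs. entailment similarities
    sim_pp = p @ p.T / temperature                      # premise vs. other-premise similarities
    sim_pp.fill_diagonal_(float('-inf'))                # a premise is never its own negative
    logits = torch.cat([sim_pe, sim_pp], dim=1)         # (batch, 2 * batch) candidate scores
    labels = torch.arange(p.size(0), device=p.device)   # positives sit on the diagonal of sim_pe
    return F.cross_entropy(logits, labels)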
Base Model
openbmb/MiniCPM-2B-dpo-bf16
Usage
- Clone the MiniCPM-2B-dpo-bf16 repository
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
- In the cloned repository, change the following setting in tokenizer_config.json:
"add_eos_token": true
- Use the model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        # Use the last-layer hidden state of the final (EOS) token as the sentence embedding
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []
        for s in sentences:
            out.append(self.get_last_hidden_state(s))
        return out
minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft-pos')
example_sentences = ["I don't like apples", "I like apples"]
encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)
print(encoded_sentences)
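The returned embeddings can then be compared, for example with cosine similarity; a minimal sketch continuing the example above:

# Cosine similarity between the two example sentences (continues the snippet above)
a, b = encoded_sentences
cosine_similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine_similarity)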
Training Details
⚠️ The training process ignores hard-negative samples and treats the other in-batch samples and their entailments as in-batch negatives. ⚠️
| Training Details | Value |
|---|---|
| Loss | InfoNCE |
| Batch Size | 40 |
| InfoNCE Temperature | 0.05 |
| Learning Rate | 1e-05 |
| Warmup Steps | 100 |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank | 8 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| Training Precision | bf16 |
| Max Epoch | 1 |
| GPU | RTX3090 |
| Num GPUs | 4 |
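The training scripts are not yet released (see below); purely as an illustration, a LoRA adapter with the hyperparameters in the table could be configured with PEFT roughly as follows. The target modules are an assumption on our part, since the card does not list them.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "openbmb/MiniCPM-2B-dpo-bf16",
    torch_dtype=torch.bfloat16,  # bf16 training precision, as in the table
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=8,               # LoRA Rank
    lora_alpha=32,     # LoRA Alpha
    lora_dropout=0.1,  # LoRA Dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()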
Training Scripts
(coming soon...)
Evaluation Results
(coming soon...)
Contributors
Trapoom Ukarapol, Zhicheng Lee, Amy Xin
Footnotes
This project is the topic-free final project of the Spring 2024 NLP course at Tsinghua University.