NMIXX-e5
This repository contains an e5-mistral-based embedding model fine-tuned with a triplet-loss setup on the nmixx-fin/NMIXX_train
dataset. It produces high-quality sentence embeddings for Korean financial text, optimized for semantic-similarity tasks in the finance domain.
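The triplet objective pulls an anchor sentence toward a semantically matching positive and pushes it away from a negative. Below is a minimal sketch of such a loss on already-pooled embeddings; the cosine formulation and the margin value are illustrative assumptions, since the exact training objective and hyperparameters are not documented here.

import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin: float = 0.5):
    # Encourage sim(anchor, positive) to exceed sim(anchor, negative) by `margin`.
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

# Usage (hypothetical tensors of shape [batch, dim]):
# loss = cosine_triplet_loss(emb_anchor, emb_positive, emb_negative)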
How to use
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """
    Pool the last token's hidden state from the model's output.
    """
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        # Left-padded batch: the final position is the last real token for every row.
        return last_hidden_states[:, -1]
    else:
        # Right-padded batch: index each row at its last non-padding position.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
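# Why last-token pooling: the e5-mistral backbone is a decoder-only (causal)
# model, so only the last non-padding token has attended to the entire input;
# its hidden state therefore serves as the sentence embedding.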
def format_instruct(task_description: str, query: str) -> str:
    """
    Format the instruction and query for the model.
    """
    return f'Instruct: {task_description}\nQuery: {query}'
# 1. Load model and tokenizer
model_name = "nmixx-fin/nmixx-e5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
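# Optional: the backbone is a 7B-parameter Mistral model, so loading in half
# precision can roughly halve memory use (torch_dtype is a standard
# transformers argument; whether it fits your hardware is an assumption):
# model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)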
# 2. Prepare input texts
# For asymmetric tasks like retrieval, add the instruction to queries only.
# The Korean instruction below means: "Find the sentence whose meaning is most
# similar to the given reference sentence." The query and document strings
# here are toy examples for demonstration.
task_instruction = "제시된 기준 문장과 의미가 가장 유사한 문장을 찾으세요."
queries = [
    format_instruct(task_instruction, "금융금융"),
    format_instruct(task_instruction, "융금융금")
]
documents = [
    "금융입니다",
    "금융, 이라구요",
    "금, 뭐요?"
]
# 3. Tokenize and generate embeddings
input_texts = queries + documents
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt').to(device)
with torch.no_grad():
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)  # Normalize embeddings
# 4. Calculate similarity
# Separate query and document embeddings
query_embeddings = embeddings[:len(queries)]
doc_embeddings = embeddings[len(queries):]
# Since the embeddings are L2-normalized, the dot product equals cosine
# similarity; the factor of 100 only scales the scores for readability.
scores = (query_embeddings @ doc_embeddings.T) * 100
print("Cosine-Similarity scores:")
print(scores.tolist())
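To turn the score matrix into retrieval results, you can rank the documents for each query. A short follow-on sketch using the variables defined above:

# Pick the highest-scoring document per query (row-wise argmax).
best = scores.argmax(dim=1)
for query_idx, doc_idx in enumerate(best.tolist()):
    print(f"Query {query_idx} -> best match: {documents[doc_idx]}")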
Model tree for nmixx-fin/nmixx-e5
Base model: intfloat/e5-mistral-7b-instruct