NMIXX-e5

This repository contains an e5-mistral-based embedding model fine-tuned with a triplet-loss setup on the nmixx-fin/NMIXX_train dataset. It produces high-quality sentence embeddings for Korean financial text, optimized for semantic similarity tasks in the finance domain.
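
The fine-tuning objective can be pictured as a plain triplet loss over sentence embeddings: an anchor sentence is pulled toward a semantically similar (positive) sentence and pushed away from a dissimilar (negative) one. The sketch below is illustrative only; the cosine-distance formulation and the margin value are assumptions, not details taken from this repository.

import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin: float = 0.5):
    """Triplet loss over L2-normalized embeddings using cosine distance."""
    anchor = F.normalize(anchor, p=2, dim=1)
    positive = F.normalize(positive, p=2, dim=1)
    negative = F.normalize(negative, p=2, dim=1)
    pos_dist = 1 - (anchor * positive).sum(dim=1)  # distance to the positive sentence
    neg_dist = 1 - (anchor * negative).sum(dim=1)  # distance to the negative sentence
    # Push the positive closer to the anchor than the negative by at least `margin`.
    return F.relu(pos_dist - neg_dist + margin).mean()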


How to use

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """
    Pool the last token's hidden state from the model's output.
    """
    # If every sequence attends at the final position, the batch is left-padded
    # and the last position already holds each sequence's final token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right-padded batch: gather each sequence's last non-padding position.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def format_instruct(task_description: str, query: str) -> str:
    """
    Format the instruction and query for the model.
    """
    return f'Instruct: {task_description}\nQuery: {query}'

# 1. Load model and tokenizer
model_name = "nmixx-fin/nmixx-e5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
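# Optional (an assumption, not part of the original example): load the weights in
# bfloat16 to reduce memory use; the published checkpoint is stored in BF16.
# model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)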

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# 2. Prepare input texts
# For asymmetric tasks like retrieval, add the instruction to queries only.
# Instruction (Korean): "Find the sentence whose meaning is most similar to the given reference sentence."
task_instruction = "제시된 기준 문장과 의미가 가장 유사한 문장을 찾으세요."
# Toy Korean queries and documents used purely for demonstration.
queries = [
    format_instruct(task_instruction, "금융금융"),
    format_instruct(task_instruction, "융금융금")
]
documents = [
    "금융입니다",
    "금융, 이라구요",
    "금, 뭐요?"
]

# 3. Tokenize and generate embeddings
input_texts = queries + documents
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1) # Normalize embeddings

# 4. Calculate similarity
# Separate query and document embeddings
query_embeddings = embeddings[:len(queries)]
doc_embeddings = embeddings[len(queries):]

# Cosine similarity (embeddings are already L2-normalized), scaled by 100 for readability
scores = (query_embeddings @ doc_embeddings.T) * 100
print("Cosine-Similarity scores:")
print(scores.tolist())
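
The printed scores are cosine similarities scaled by 100, so higher values mean closer meaning. As a small follow-up (not part of the original example), the score matrix can be used to pick the best-matching document for each query:

best = scores.argmax(dim=1)
for q_idx, d_idx in enumerate(best.tolist()):
    print(f"Query {q_idx}: best match -> {documents[d_idx]}")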
Model size: 7.11B parameters · Tensor type: BF16 · Format: Safetensors