---
license: mit
datasets:
- nmixx-fin/NMIXX_train
language:
- ko
base_model:
- intfloat/e5-mistral-7b-instruct
---

# NMIXX-e5

This repository contains an e5-mistral-based embedding model fine-tuned with a triplet-loss setup on the `nmixx-fin/NMIXX_train` dataset. It produces high-quality sentence embeddings for Korean financial text, optimized for semantic similarity tasks in the finance domain.

---

# How to use

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """Pool the hidden state of each sequence's last non-padding token."""
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        # With left padding, the final position always holds the last real token.
        return last_hidden_states[:, -1]
    else:
        # With right padding, index each sequence at its true length.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def format_instruct(task_description: str, query: str) -> str:
    """Format the instruction and query for the model."""
    return f'Instruct: {task_description}\nQuery: {query}'


# 1. Load model and tokenizer
model_name = "nmixx-fin/nmixx-e5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# 2. Prepare input texts
# For asymmetric tasks like retrieval, add instructions to queries only.
task_instruction = "제시된 기준 문장과 의미가 가장 유사한 문장을 찾으세요."  # "Find the sentence most similar in meaning to the given reference sentence."
queries = [
    format_instruct(task_instruction, "금융금융"),
    format_instruct(task_instruction, "융금융금")
]
documents = [
    "금융입니다",
    "금융, 이라구요",
    "금, 뭐요?"
]

# 3. Tokenize and generate embeddings
input_texts = queries + documents
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)  # L2-normalize so dot products equal cosine similarity

# 4. Calculate similarity
# Separate query and document embeddings
query_embeddings = embeddings[:len(queries)]
doc_embeddings = embeddings[len(queries):]

# Calculate cosine similarity (scaled to 0-100)
scores = (query_embeddings @ doc_embeddings.T) * 100
print("Cosine-Similarity scores:")
print(scores.tolist())
```
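The `scores` tensor has one row per query and one column per document. If you want a ranking rather than raw scores, you can sort each row. The snippet below is a minimal sketch that continues from the example above and reuses its `scores`, `queries`, and `documents` variables; the `top_k`, `top_scores`, and `top_idx` names are only illustrative.

```python
# Rank documents for each query by cosine similarity (continues from the example above).
top_k = min(3, len(documents))
top_scores, top_idx = torch.topk(scores, k=top_k, dim=1)

for q_i, query in enumerate(queries):
    print(f"\nQuery: {query}")
    for rank, (score, d_i) in enumerate(zip(top_scores[q_i].tolist(), top_idx[q_i].tolist()), start=1):
        print(f"  {rank}. {documents[d_i]} (score: {score:.2f})")
```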
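Since the base model (`intfloat/e5-mistral-7b-instruct`) has roughly 7B parameters, loading it in full precision may exceed typical GPU memory. A possible workaround, sketched below under the assumption that half-precision weights are acceptable for your accuracy needs, is to pass a reduced dtype to `from_pretrained`:

```python
import torch
from transformers import AutoModel

# Assumption: half precision is acceptable for your use case.
# torch.bfloat16 can be swapped for torch.float16 on GPUs without bfloat16 support.
model = AutoModel.from_pretrained("nmixx-fin/nmixx-e5", torch_dtype=torch.bfloat16)
```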