---
language: ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- korean
- legal
- bert
datasets:
- custom
metrics:
- cosine_similarity
widget:
- source_sentence: "์ธํ„ฐ๋„ท ์‚ฌ๊ธฐ ํ”ผํ•ด ์†ํ•ด๋ฐฐ์ƒ ์ฒญ๊ตฌ"
  sentences:
  - "์˜จ๋ผ์ธ ๊ฑฐ๋ž˜ ์‚ฌ๊ธฐ ํ”ผํ•ด ๊ตฌ์ œ"
  - "์ „์ž์ƒ๊ฑฐ๋ž˜ ์‚ฌ๊ธฐ ๋ฏผ์‚ฌ์ฑ…์ž„"
  - "ํ˜•๋ฒ•์ƒ ์‚ฌ๊ธฐ์ฃ„ ๊ตฌ์„ฑ์š”๊ฑด"
- source_sentence: "์ƒ์—ฌ๊ธˆ์„ ์ž„๊ธˆ์œผ๋กœ ์ธ์ •ํ•˜๊ธฐ ์œ„ํ•œ ์š”๊ฑด"
  sentences:
  - "๊ทผ๋กœ์ž ์ž„๊ธˆ ์ฒด๋ถˆ ์†ํ•ด๋ฐฐ์ƒ"
  - "ํ‡ด์ง๊ธˆ ์‚ฐ์ • ๊ธฐ์ดˆ ํ‰๊ท ์ž„๊ธˆ"
  - "๋ถ€๋™์‚ฐ ๋งค๋งค๊ณ„์•ฝ ํ•ด์ œ"
inference:
  parameters:
    task: sentence-similarity
    normalize_embeddings: true
model-index:
- name: Ko-Legal-SBERT
  results:
  - task:
      type: sentence-similarity
      name: Sentence Similarity
    dataset:
      type: custom
      name: Korean Legal Dataset
    metrics:
    - type: cosine_similarity
      value: 0.85
      name: Same Domain Similarity
---

๐Ÿ›๏ธ Ko-Legal-SBERT: ํ•œ๊ตญ ๋ฒ•๋ฅ  ํŠนํ™” ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ


Ko-Legal-SBERT is a sentence embedding model specialized for Korean legal documents. Fine-tuned on 35,104 high-quality legal triplets, it measures semantic similarity between legal documents with high accuracy.

## ๐Ÿš€ Quick Start

### Using the Inference API (recommended)

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/woong0322/ko-legal-sbert-finetuned"
headers = {"Authorization": "Bearer YOUR_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # fail loudly on auth or model-loading errors
    return response.json()

# Generate an embedding for a single legal query
output = query({
    "inputs": "์ธํ„ฐ๋„ท ์‚ฌ๊ธฐ ํ”ผํ•ด ์†ํ•ด๋ฐฐ์ƒ ์ฒญ๊ตฌ"
})
```
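The same endpoint can also score one source sentence against several candidates in a single call. The payload below follows the generic Hugging Face Inference API convention for the `sentence-similarity` task (the task pinned in this model card's metadata); whether this hosted endpoint serves that task, and the printed scores, are assumptions:

```python
# Compare one query against several candidates; the API returns one
# similarity score per candidate sentence.
scores = query({
    "inputs": {
        "source_sentence": "์ธํ„ฐ๋„ท ์‚ฌ๊ธฐ ํ”ผํ•ด ์†ํ•ด๋ฐฐ์ƒ ์ฒญ๊ตฌ",
        "sentences": [
            "์˜จ๋ผ์ธ ๊ฑฐ๋ž˜ ์‚ฌ๊ธฐ ํ”ผํ•ด ๊ตฌ์ œ",
            "ํ˜•๋ฒ•์ƒ ์‚ฌ๊ธฐ์ฃ„ ๊ตฌ์„ฑ์š”๊ฑด",
        ],
    }
})
print(scores)  # e.g. [0.86, 0.41] -- hypothetical values
```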

### Using sentence-transformers

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model
model = SentenceTransformer('woong0322/ko-legal-sbert-finetuned')

# Legal texts to embed
texts = [
    "์ƒ์—ฌ๊ธˆ์„ ์ž„๊ธˆ์œผ๋กœ ์ธ์ •ํ•˜๊ธฐ ์œ„ํ•œ ์š”๊ฑด",
    "ํ‡ด์ง๊ธˆ ์‚ฐ์ •์˜ ๊ธฐ์ดˆ๊ฐ€ ๋˜๋Š” ํ‰๊ท ์ž„๊ธˆ",
    "ํ˜•๋ฒ•์ƒ ์ ˆ๋„์˜ ๋ฒ”์˜์™€ ๊ณ ์˜"
]

# Normalize so that the dot products below equal cosine similarity
embeddings = model.encode(texts, normalize_embeddings=True)

# Similarity computation
similarity_01 = np.dot(embeddings[0], embeddings[1])  # labor-law pair: high similarity
similarity_02 = np.dot(embeddings[0], embeddings[2])  # labor law vs. criminal law: low similarity

print(f"Similarity between labor-law documents: {similarity_01:.3f}")  # expected: 0.85+
print(f"Labor law vs. criminal law similarity: {similarity_02:.3f}")   # expected: near 0.0
```

๐Ÿ“Š ์„ฑ๋Šฅ ํ‰๊ฐ€

๋ฉ”ํŠธ๋ฆญ ์ ์ˆ˜ ์„ค๋ช…
๋™์ผ ๋ถ„์•ผ ์œ ์‚ฌ๋„ 0.853 ๊ฐ™์€ ๋ฒ• ๋ถ„์•ผ ๋ฌธ์„œ ๊ฐ„ ํ‰๊ท  ์œ ์‚ฌ๋„
๋ถ„์•ผ ๊ฐ„ ๊ตฌ๋ถ„๋„ 0.023 ๋‹ค๋ฅธ ๋ฒ• ๋ถ„์•ผ ๊ฐ„ ํ‰๊ท  ์œ ์‚ฌ๋„ (๋‚ฎ์„์ˆ˜๋ก ์ข‹์Œ)
์ „์ฒด ํ’ˆ์งˆ ์ ์ˆ˜ 95.0/100 ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ์ข…ํ•ฉ ํ‰๊ฐ€
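A minimal sketch of how the two similarity metrics above can be computed from field-labeled embeddings. The grouping, example sentences, and aggregation here are assumptions for illustration, not the authors' evaluation script:

```python
import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('woong0322/ko-legal-sbert-finetuned')

# Hypothetical evaluation set: sentences labeled with their legal field
labeled = {
    "labor":    ["์ƒ์—ฌ๊ธˆ์„ ์ž„๊ธˆ์œผ๋กœ ์ธ์ •ํ•˜๊ธฐ ์œ„ํ•œ ์š”๊ฑด", "ํ‡ด์ง๊ธˆ ์‚ฐ์ • ๊ธฐ์ดˆ ํ‰๊ท ์ž„๊ธˆ"],
    "criminal": ["ํ˜•๋ฒ•์ƒ ์ ˆ๋„์˜ ๋ฒ”์˜์™€ ๊ณ ์˜", "ํ˜•๋ฒ•์ƒ ์‚ฌ๊ธฐ์ฃ„ ๊ตฌ์„ฑ์š”๊ฑด"],
}
emb = {k: model.encode(v, normalize_embeddings=True) for k, v in labeled.items()}

# Same-domain similarity: average over pairs within each field
intra = [float(np.dot(a, b))
         for vecs in emb.values()
         for a, b in combinations(vecs, 2)]

# Cross-domain similarity: average over pairs across different fields
inter = [float(np.dot(a, b))
         for (_, va), (_, vb) in combinations(emb.items(), 2)
         for a in va for b in vb]

print(f"same-domain: {np.mean(intra):.3f}, cross-domain: {np.mean(inter):.3f}")
```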

### Performance by legal field

  • Civil law: 36.3% coverage, high accuracy
  • Tax law: 16.4% coverage, strong separation
  • Administrative law: 14.9% coverage, stable performance
  • Criminal law: 6.2% coverage, clear classification

๐Ÿ—๏ธ ๋ชจ๋ธ ๊ตฌ์กฐ

  • ๋ฒ ์ด์Šค ๋ชจ๋ธ: jhgan/ko-sbert-nli
  • ์ž„๋ฒ ๋”ฉ ์ฐจ์›: 768
  • ์ตœ๋Œ€ ์‹œํ€€์Šค ๊ธธ์ด: 512 ํ† ํฐ
  • ํ•™์Šต ๋ฐฉ๋ฒ•: Triplet Loss with Hard Negative Mining

ํ•™์Šต ๋ฐ์ดํ„ฐ

  • ์ด ํŠธ๋ฆฌํ”Œ์…‹: 35,104๊ฐœ
  • ํ•™์Šต ์˜ˆ์ œ: 70,208๊ฐœ (Anchor-Positive, Anchor-Negative ์Œ)
  • ๋ฐ์ดํ„ฐ ์ถœ์ฒ˜: ํ•œ๊ตญ ๋ฒ•์› ํŒ๋ก€, ๋ฒ•๋ น ๋ฐ์ดํ„ฐ
  • ํ’ˆ์งˆ ๊ฒ€์ฆ: 98.6% ๋ฒ•๋ฅ  ํ‚ค์›Œ๋“œ ํฌํ•จ, ์ค‘๋ณต ์ œ๊ฑฐ ์™„๋ฃŒ
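A minimal sketch of what triplet fine-tuning could look like with sentence-transformers. The hyperparameters, the example triplet, and the hard-negative selection are illustrative assumptions, not the actual training recipe:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from the base model named above
model = SentenceTransformer('jhgan/ko-sbert-nli')

# Each item is one (anchor, positive, hard-negative) triplet. With hard
# negative mining, the negative is a plausible but wrong match, e.g. a
# sentence from a different legal field than the anchor.
train_examples = [
    InputExample(texts=[
        "์ƒ์—ฌ๊ธˆ์„ ์ž„๊ธˆ์œผ๋กœ ์ธ์ •ํ•˜๊ธฐ ์œ„ํ•œ ์š”๊ฑด",  # anchor (labor law)
        "ํ‡ด์ง๊ธˆ ์‚ฐ์ • ๊ธฐ์ดˆ ํ‰๊ท ์ž„๊ธˆ",             # positive (same field)
        "ํ˜•๋ฒ•์ƒ ์ ˆ๋„์˜ ๋ฒ”์˜์™€ ๊ณ ์˜",               # hard negative (criminal law)
    ]),
    # ... 35,104 triplets in total
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.TripletLoss(model=model)

# Epoch count and warmup are placeholder values
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```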

## ๐ŸŽฏ Use Cases

### ๐Ÿ’ผ Business applications

  • Legal search engines: semantic search over case law and statutes (see the retrieval sketch after this list)
  • Legal consultation systems: automatic recommendation of similar cases
  • Contract analysis: clause-level similarity and duplicate detection
  • Compliance: automated review of regulatory conformance

### ๐Ÿ”ฌ Research applications

  • Legal AI research: a benchmark for Korean legal NLP
  • Case-law analysis: analysis of ruling patterns and trends
  • Legal ontologies: modeling relations between legal concepts
  • Automatic classification: categorizing legal documents (a simple sketch follows this list)

๐Ÿ“š ๊ธฐ์ˆ ์  ์„ธ๋ถ€์‚ฌํ•ญ

์ด ๋ชจ๋ธ์€ SentenceTransformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šต๋˜์—ˆ์œผ๋ฉฐ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
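Because the pooling layer uses mean pooling (`pooling_mode_mean_tokens: True`), the embeddings can be reproduced with plain `transformers`. A minimal sketch, assuming the repo's tokenizer and BERT weights load through `AutoModel`:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('woong0322/ko-legal-sbert-finetuned')
bert = AutoModel.from_pretrained('woong0322/ko-legal-sbert-finetuned')

def embed(texts: list) -> torch.Tensor:
    """Token embeddings mean-pooled over non-padding positions."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors='pt')
    with torch.no_grad():
        token_emb = bert(**batch).last_hidden_state         # (B, T, 768)
    mask = batch['attention_mask'].unsqueeze(-1).float()    # (B, T, 1)
    return (token_emb * mask).sum(dim=1) / mask.sum(dim=1)  # (B, 768)

print(embed(["์ธํ„ฐ๋„ท ์‚ฌ๊ธฐ ํ”ผํ•ด ์†ํ•ด๋ฐฐ์ƒ ์ฒญ๊ตฌ"]).shape)  # torch.Size([1, 768])
```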

๐Ÿค ๊ธฐ์—ฌ ๋ฐ ํ”ผ๋“œ๋ฐฑ

์ด ๋ชจ๋ธ์„ ์—ฐ๊ตฌ๋‚˜ ์ƒ์—…์  ๋ชฉ์ ์œผ๋กœ ์‚ฌ์šฉํ•˜์‹ค ๋•Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ธ์šฉํ•ด์ฃผ์„ธ์š”:

```bibtex
@misc{ko-legal-sbert-2025,
  title={Ko-Legal-SBERT: Korean Legal Domain Specialized Sentence Embedding Model},
  author={woong0322},
  year={2025},
  url={https://huggingface.co/woong0322/ko-legal-sbert-finetuned}
}
```

๐Ÿ“„ ๋ผ์ด์„ ์Šค

์ด ๋ชจ๋ธ์€ Apache 2.0 ๋ผ์ด์„ ์Šค ํ•˜์— ๋ฐฐํฌ๋ฉ๋‹ˆ๋‹ค. ์ƒ์—…์  ์‚ฌ์šฉ์ด ๊ฐ€๋Šฅํ•˜๋ฉฐ, ์ถœ์ฒ˜๋งŒ ๋ช…์‹œํ•˜๋ฉด ์ž์œ ๋กญ๊ฒŒ ์‚ฌ์šฉํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


๐Ÿ’ก ์ด ๋ชจ๋ธ์ด ๋„์›€์ด ๋˜์…จ๋‹ค๋ฉด โญ์„ ๋ˆŒ๋Ÿฌ์ฃผ์„ธ์š”!
