🤬 hongssi/final_abuse_manual_model

hongssi/final_abuse_manual_model์€ ํ•œ๊ตญ์–ด ๋ฌธ์žฅ์—์„œ ์š•์„ค, ํ˜์˜ค ํ‘œํ˜„, ๋ชจ์š•์„ฑ ๋ฐœ์–ธ ๋“ฑ์„ ํƒ์ง€ํ•˜๋Š” ๋‹ค์ค‘ ๋ ˆ์ด๋ธ” ๋ถ„๋ฅ˜ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
Smilegate์˜ UNSMILE ๋ฐ์ดํ„ฐ์…‹ ์„ ๊ธฐ๋ฐ˜์œผ๋กœ, beomi/KcELECTRA-small ๋ชจ๋ธ์„ ํŒŒ์ธํŠœ๋‹ํ•˜์—ฌ ์ œ์ž‘๋˜์—ˆ์Šต๋‹ˆ๋‹ค.


๐Ÿง  ๋ชจ๋ธ ๊ฐœ์š”

  • ✅ Base model: beomi/KcELECTRA-small
  • ✅ Task: multi-label classification (sigmoid-based)
  • ✅ Output: a probability in [0.0, 1.0] for each label
  • ✅ Purpose: detecting and classifying profanity/insults in call centers, online communities, chatbots, etc.

๐Ÿท๏ธ ํด๋ž˜์Šค ๋ผ๋ฒจ (11๊ฐœ)

```
[
  "여성/가족", "남성", "성소수자", "인종/국적", "연령",
  "지역", "종교", "기타 혐오", "악플/욕설", "clean", "개인지칭"
]
```

ํ•œ ๋ฌธ์žฅ์ด ์—ฌ๋Ÿฌ ๋ผ๋ฒจ์— ํ•ด๋‹น๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (multi-label classification)


🧾 Training Details

Item            Value
Dataset         UNSMILE
Samples         95,000+ sentences
Architecture    ELECTRA-small with an 11-node classification head
Tokenizer       KcELECTRA tokenizer (uncased, 128 tokens max)
Input length    max_length=128
Loss            Binary cross-entropy (BCEWithLogitsLoss)
Optimizer       AdamW
Learning rate   5e-5
Batch size      32
Epochs          5
Metrics         Macro F1, binary accuracy
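The loss choice above is what makes the multi-label setup work: BCEWithLogitsLoss applies a sigmoid followed by binary cross-entropy independently to each of the 11 outputs, rather than a softmax over them. A minimal pure-Python rendition of the single-label term (an illustration of the math, not the actual training code):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy on one label's raw logit; per-element
    equivalent of torch.nn.BCEWithLogitsLoss."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(round(bce_with_logits(3.0, 1.0), 4))  # → 0.0486
print(round(bce_with_logits(3.0, 0.0), 4))  # → 3.0486
```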

๐Ÿ“Š ๋ชจ๋ธ ์„ฑ๋Šฅ

ํด๋ž˜์Šค F1 ์ ์ˆ˜
์•…ํ”Œ/์š•์„ค 0.87
์—ฌ์„ฑ/๊ฐ€์กฑ 0.84
์„ฑ์†Œ์ˆ˜์ž 0.78
clean 0.91
๊ธฐํƒ€ ํ‰๊ท  Macro F1: 0.83

ํ‰๊ฐ€ ๊ธฐ์ค€์€ UNSMILE validation set ๊ธฐ๋ฐ˜์ด๋ฉฐ, ์‹ค์‚ฌ์šฉ ํ™˜๊ฒฝ์—์„œ ์ „์ฒ˜๋ฆฌ ๋ฐ ์‚ฌ์ „ ํƒ์ง€ ์‹œ์Šคํ…œ๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


📥 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

labels = [
    "여성/가족", "남성", "성소수자", "인종/국적", "연령",
    "지역", "종교", "기타 혐오", "악플/욕설", "clean", "개인지칭"
]

model_id = "hongssi/final_abuse_manual_model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "야 너는 사람도 아니다"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits)[0]  # independent sigmoid per label

results = {label: float(prob) for label, prob in zip(labels, probs)}
print(results)
```

🧠 Combined Use: with a Profanity Dictionary

๋ณธ ๋ชจ๋ธ์€ Aho-Corasick ๊ธฐ๋ฐ˜์˜ ์š•์„ค ์‚ฌ์ „ ํƒ์ง€์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ, ๋ชจ๋ธ์ด ํƒ์ง€ํ•˜์ง€ ๋ชปํ•œ ๋ช…์‹œ์  ๋น„์†์–ด๋„ ๋ณด์™„ํ•  ์ˆ˜ ์žˆ์–ด ์‹ค์‚ฌ์šฉ์—์„œ ๋”์šฑ ์•ˆ์ •์ ์ธ ์šด์˜์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.


โœ… ๋ผ์ด์„ ์Šค

  • ๋ณธ ๋ชจ๋ธ์€ MIT ๋ผ์ด์„ ์Šค๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.
  • ํ•™์Šต ๋ฐ์ดํ„ฐ์ธ UNSMILE์€ Smilegate์—์„œ ๊ณต๊ฐœํ•œ ์ €์ž‘๋ฌผ๋กœ, ํ•ด๋‹น ๋ผ์ด์„ ์Šค๋ฅผ ๋ฐ˜๋“œ์‹œ ํ™•์ธํ•˜์„ธ์š”.

๐Ÿ™‹โ€โ™‚๏ธ ์ž‘์„ฑ์ž

  • ๐Ÿ‘ค hongssi (ํ™ํƒœํœ˜)
  • โœ‰๏ธ [email protected]
  • ๐Ÿ”— ๊ด€๋ จ ํ”„๋กœ์ ํŠธ: FastAPI ๊ธฐ๋ฐ˜ ์š•์„ค ํƒ์ง€ API ์„œ๋ฒ„
