polyglot-ko-1b-txt2sql

polyglot-ko-1b-txt2sql is a text-generation model fine-tuned to convert Korean natural-language questions into SQL queries.
It is based on EleutherAI/polyglot-ko-1.3b and was fine-tuned parameter-efficiently with QLoRA.


λͺ¨λΈ 정보

  • Base model: EleutherAI/polyglot-ko-1.3b
  • Fine-tuning: QLoRA (4bit quantization + PEFT)
  • Task: Text2SQL (μžμ—°μ–΄ β†’ SQL λ³€ν™˜)
  • Tokenizer: λ™μΌν•œ ν† ν¬λ‚˜μ΄μ € μ‚¬μš©

Training data

λͺ¨λΈμ€ ν•œκ΅­μ–΄ SQL λ³€ν™˜ νƒœμŠ€ν¬λ₯Ό μœ„ν•΄ μ„€κ³„λœ μžμ—°μ–΄ 질문-쿼리 νŽ˜μ–΄λ‘œ νŒŒμΈνŠœλ‹λ˜μ—ˆμŠ΅λ‹ˆλ‹€.
λ°μ΄ν„°λŠ” λ‹€μŒ 두 가지 μ†ŒμŠ€ 기반으둜 κ΅¬μ„±λ˜μ—ˆμŠ΅λ‹ˆλ‹€:

  • shangrilar/ko_text2sql 데이터셋 일뢀
  • OpenAI 기반 LLM(GPT) 좔둠을 톡해 μƒμ„±λœ synthetic Korean SQL pairs

Usage example

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load the fine-tuned model and its tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained("your-username/polyglot-ko-1b-txt2sql", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("your-username/polyglot-ko-1b-txt2sql")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = """
당신은 SQL μ „λ¬Έκ°€μž…λ‹ˆλ‹€.

### DDL:
CREATE TABLE players (
  player_id INT PRIMARY KEY AUTO_INCREMENT,
  username VARCHAR(255) UNIQUE NOT NULL,
  email VARCHAR(255) UNIQUE NOT NULL,
  password_hash VARCHAR(255) NOT NULL,
  date_joined DATETIME NOT NULL,
  last_login DATETIME
);

### Question:
μ‚¬μš©μž 이름에 'admin'이 ν¬ν•¨λœ 계정 μˆ˜λŠ”?

### SQL:
"""

# Greedy decoding; the output contains the prompt followed by the generated SQL.
outputs = generator(prompt, do_sample=False, max_new_tokens=128)
print(outputs[0]["generated_text"])
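
The pipeline returns the prompt together with the completion, so to recover only the generated SQL you can split on the final ### SQL: marker, as in this minimal sketch:

# Keep only the text generated after the final "### SQL:" marker.
generated = outputs[0]["generated_text"]
sql = generated.split("### SQL:")[-1].strip()
print(sql)

Alternatively, passing return_full_text=False in the generator call makes the pipeline return only the newly generated text.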