---
license: gemma
datasets:
- jslin09/LegalElements
language:
- zh
base_model:
- google/gemma-2-2b
library_name: transformers
widget:
- text: "Is this review positive or negative? Review: Best cast iron skillet you will ever buy."
  example_title: "Sentiment analysis"
---

# Model Card for Gemma2-2b-ner

This model is fine-tuned from [Gemma2:2b](https://huggingface.co/google/gemma-2-2b). It annotates "criminal facts" paragraphs for the offense of fraud that were generated by large language models. Following the three-stage theory of crime commonly used in Taiwanese criminal law scholarship, it labels each paragraph according to the constituent elements set out in the statutory fraud provisions. Models that can generate fraud "criminal facts" include [BLOOM 560M Fraud](https://huggingface.co/jslin09/bloom-560m-finetuned-fraud), fine-tuned from BLOOM 560M, and [Gemma2:2b Fraud](https://huggingface.co/jslin09/gemma2-2b-fraud), fine-tuned from Gemma2. To see how the model performs in practice, try the [demo platform](https://huggingface.co/spaces/jslin09/LE-NER).

## Model Details

### Model Description

The model currently reaches an average precision of 0.98 and an average recall of 0.75 when identifying the constituent elements in fraud criminal facts. Training started from a corpus of 979 records and followed a reinforcement learning workflow: annotations generated by the model were corrected by hand (human alignment) and then added back to the corpus for further training. The final training corpus contains 2577 records, and the model was completed after 3 rounds of fine-tuning. The table below shows how precision and recall changed across versions.

|Version|Records|Precision|Recall|
|---|---|---|---|
|v1|979|0.272727273|0.218623482|
|v2|1538|0.725888325|0.581300813|
|v3|1886|0.717277487|0.465986395|
|v4|2173|0.826086957|0.550724638|
|v5|2577|0.983606557|0.75|

- **Developed by:** [Chun-Hsien Lin](https://huggingface.co/jslin09)
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** Traditional Chinese
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [Gemma2-2b](https://huggingface.co/google/gemma-2-2b)

### Model Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

The model currently recognizes the following seven named-entity labels. When it cannot identify a constituent-element entity, it returns None.
  
```python
from colorama import Fore, Back, Style

# Label -> (Chinese name, English name)
elements = {'LEO_SOC': ('犯罪主體', 'Subject of Crime'),
            'LEO_VIC': ('客體', 'Victim'),
            'LEO_ACT': ('不法行為', 'Behavior'),
            'LEO_SLE': ('主觀要件', 'Subjective Legal Element of the Offense'),
            'LEO_CAU': ('因果關係', 'Causation'),
            'LEO_ROH': ('危害結果', 'Result of Hazard'),
            'LEO_ATP': ('未遂', 'Attempted')
           }
# Label -> colorama foreground/background combination
tag_color = {'LEO_SOC': Fore.BLACK + Back.RED,
             'LEO_VIC': Fore.BLACK + Back.YELLOW,
             'LEO_ACT': Fore.BLACK + Back.GREEN,
             'LEO_SLE': Fore.BLACK + Back.MAGENTA,
             'LEO_CAU': Fore.BLACK + Back.CYAN,
             'LEO_ROH': Fore.BLACK + Back.BLUE,
             'LEO_ATP': Fore.WHITE + Back.BLACK,
            }
```
  
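To make the model's annotations easier to see, pass the annotated output generated by the model, together with the label of interest, to the following function. It colors the annotated spans using colorama.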
  
```python
import re

from colorama import Fore, Back, Style


def tag_in_color(response_content, tag):
    '''
    Color the annotated spans of one label.

    Requires the `tag_color` mapping defined above.

    Parameters:
        response_content (str): Fully annotated model output, tags included.
        tag (str): Label name in English, without brackets.
    Returns:
        str: The text with this label's tags removed and colorama color codes added.
    '''
    # The annotated text follows the "標註結果:" ("annotation result:") marker in the model output.
    response_body = response_content.split("標註結果:\n")[1]
    start_index = 0
    # Use regular expressions to find where each opening and closing tag starts.
    # re.escape() stops the square brackets from being treated as regex syntax.
    findall_open_tags = [m.start() for m in re.finditer(re.escape(f"[{tag}]"), response_body)]
    findall_close_tags = [m.start() for m in re.finditer(re.escape(f"[/{tag}]"), response_body)]
    try:
        parts = [response_body[start_index:findall_open_tags[0]]]  # text before the first tag
    except IndexError:
        parts = []
    # Walk through each tag, extract its text, and color it.
    for j, idx in enumerate(findall_open_tags):
        tag_text = response_body[idx + len(tag) + 2:findall_close_tags[j]]
        parts.append(f"{tag_color[tag]}" + tag_text + Style.RESET_ALL)  # color the tagged text
        closed_tag = findall_close_tags[j] + len(tag) + 3
        try:
            next_open_tag = findall_open_tags[j+1]
            parts.append(response_body[closed_tag:next_open_tag])  # text up to the next opening tag
        except IndexError:
            parts.append(response_body[findall_close_tags[-1] + len(tag) + 3:])  # text after the last tag
    result = ''.join(parts)
    if result == '':
        # Message text: "*** no annotation result ***"
        color_result = f"{tag_color[tag]}{tag}" + Fore.RESET + Back.RESET + " " + Fore.YELLOW + Back.RED + "*** 無標註結果 ***" + Fore.RESET + Back.RESET
    else:
        # Heading text: "colored annotation result:"
        color_result = Fore.RED + Back.YELLOW + "標註著色結果:\n" + Fore.RESET + Back.RESET + result
    return color_result
```
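
If the annotations are needed as data rather than as colored terminal output, the tagged spans can also be collected per label. The following is a minimal sketch, not part of the original model card. It assumes the model wraps each element as `[LABEL]text[/LABEL]` (the format the coloring function above searches for) and that tags are not nested. The helper names `extract_elements` and `strip_tags` and the sample sentence are hypothetical.

```python
import re
from collections import defaultdict

# The seven labels listed in the `elements` table above.
LABELS = ("LEO_SOC", "LEO_VIC", "LEO_ACT", "LEO_SLE", "LEO_CAU", "LEO_ROH", "LEO_ATP")

# Matches [LABEL]text[/LABEL]. The backreference \1 requires the closing tag
# to carry the same label as the opening tag. Assumes tags are not nested.
TAG_PATTERN = re.compile(r"\[(" + "|".join(LABELS) + r")\](.*?)\[/\1\]", re.DOTALL)


def extract_elements(annotated_text):
    """Return a dict mapping each label found to the list of texts it tags."""
    found = defaultdict(list)
    for match in TAG_PATTERN.finditer(annotated_text):
        found[match.group(1)].append(match.group(2))
    return dict(found)


def strip_tags(annotated_text):
    """Return the text with all well-formed tag pairs removed."""
    return TAG_PATTERN.sub(lambda m: m.group(2), annotated_text)


# Hypothetical annotated sentence, written in English for illustration only.
sample = (
    "[LEO_SOC]Defendant A[/LEO_SOC] "
    "[LEO_ACT]falsely claimed to sell a phone[/LEO_ACT] "
    "to [LEO_VIC]B[/LEO_VIC]."
)
print(extract_elements(sample))
print(strip_tags(sample))
```

Labels that do not appear in the text are simply absent from the returned dictionary, so a caller can check which constituent elements the model failed to find.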
  
### Direct Use

[More Information Needed]

### Downstream Use [optional]

[More Information Needed]

### Out-of-Scope Use

The model can currently annotate only the constituent elements in "criminal facts" drafted (or generated by a language model) for the offense of fraud under the Criminal Code of the Republic of China. Annotating the constituent elements of other offenses is a direction for future development and for expanding the corpus.

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

The model was produced by fine-tuning Gemma2:2b with reinforcement learning, iterating over multiple rounds in which the generated data were aligned by hand. The training dataset is the [LegalElements dataset](https://huggingface.co/datasets/jslin09/LegalElements). Users can download it and keep iterating on it to correct and extend its contents.

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]