|
--- |
|
license: cc-by-sa-4.0 |
|
language: ja |
|
tags: |
|
- generated_from_trainer |
|
- text-classification |
|
|
|
metrics: |
|
- accuracy |
|
|
|
widget: |
|
- text: "💪(^ω^ 🍤)"
  example_title: "Facemark 1"
- text: "(੭ु∂∀6)੭ु⁾⁾ ஐ•*¨*•.¸¸"
  example_title: "Facemark 2"
- text: ":-P"
  example_title: "Facemark 3"
- text: "(o.o)"
  example_title: "Facemark 4"
- text: "(10/7~)"
  example_title: "Non-facemark 1"
- text: "??<<「ニャア(しゃーねぇな)」プイッ"
  example_title: "Non-facemark 2"
- text: "(0.01)"
  example_title: "Non-facemark 3"
|
--- |
|
|
|
|
|
|
# Facemark Detection |
|
|
|
This model classifies a given text as a facemark (label `1`) or not.
|
|
|
This model is a fine-tuned version of [cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking) on an original facemark dataset. |
|
It achieves the following results on the evaluation set: |
|
- Loss: 0.1301 |
|
- Accuracy: 0.9896 |
|
|
|
## Model description |
|
|
|
The model is a binary classifier: given a short text snippet, it predicts whether the snippet is a facemark (a kaomoji-style emoticon, label `1`) or not.
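
A minimal inference sketch using the Transformers `pipeline` API. The model id below is a placeholder for this repository's actual Hub id, and the underlying Japanese tokenizer additionally requires the `fugashi` and `ipadic` packages:

```python
from transformers import pipeline

# "<this-repo-id>" is a placeholder; replace it with this model's Hub id
# or a local checkpoint directory.
classifier = pipeline("text-classification", model="<this-repo-id>")

print(classifier("💪(^ω^ 🍤)"))  # expected: facemark
print(classifier("(3,000円)"))   # expected: non-facemark
```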
|
|
|
## Intended uses & limitations |
|
|
|
Extract a facemark-prone portion of text and pass it to the model.

Facemark candidates can be extracted with a regex, but a regex alone usually also matches many non-facemarks, which is why this classifier is needed.

For example, I used the following Perl regex to extract facemark-prone text:
|
|
|
```perl
use strict;
use warnings;
use utf8;

my $input_text = "facemark-prone text";

my $text       = '[0-9A-Za-zぁ-ヶ一-龠]';   # alphanumerics, kana, kanji
my $non_text   = '[^0-9A-Za-zぁ-ヶ一-龠]';  # anything else
my $allow_text = '[ovっつ゜ニノ三二]';       # characters allowed around a facemark
my $hw_kana    = '[ヲ-゚]';                  # half-width katakana
my $open_bracket  = '[\(∩꒰(]';              # opening brackets (ASCII and full-width)
my $close_bracket = '[\)∩꒱)]';              # closing brackets (ASCII and full-width)

my $around_face = '(?:' . $non_text . '|' . $allow_text . ')*';
# 3-8 chars between brackets that are not just plain text
my $face        = '(?!(?:' . $text . '|' . $hw_kana . '){3,8}).{3,8}';
my $face_char   = $around_face . $open_bracket . $face . $close_bracket . $around_face;

my $facemark;
if ($input_text =~ /($face_char)/) {
    $facemark = $1;
}
```
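
For reference, here is a rough Python port of the same pattern, assuming the standard `re` module (the function name is mine; the character classes and quantifiers mirror the Perl version above):

```python
import re

TEXT       = r'[0-9A-Za-zぁ-ヶ一-龠]'   # alphanumerics, kana, kanji
NON_TEXT   = r'[^0-9A-Za-zぁ-ヶ一-龠]'  # anything else
ALLOW_TEXT = r'[ovっつ゜ニノ三二]'       # characters allowed around a facemark
HW_KANA    = r'[ヲ-゚]'                  # half-width katakana
OPEN_BRACKET  = r'[\(∩꒰(]'
CLOSE_BRACKET = r'[\)∩꒱)]'

AROUND_FACE = f'(?:{NON_TEXT}|{ALLOW_TEXT})*'
FACE        = f'(?!(?:{TEXT}|{HW_KANA}){{3,8}}).{{3,8}}'
FACE_CHAR   = f'{AROUND_FACE}{OPEN_BRACKET}{FACE}{CLOSE_BRACKET}{AROUND_FACE}'

def extract_facemark_candidate(text: str) -> str | None:
    """Return the first facemark-prone span in `text`, or None."""
    m = re.search(f'({FACE_CHAR})', text)
    return m.group(1) if m else None
```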
|
Examples of facemarks are:
|
``` |
|
(^U^)← |
|
。\n\n⊂( *・ω・ )⊃ |
|
っ(。>﹏<) |
|
タカ( ˘ω' ) ヤスゥ… |
|
。(’↑▽↑) |
|
……💰( ˘ω˘ )💰 |
|
ーーー(*´꒳`*)!( |
|
…(o:∇:o) |
|
!!…(;´Д`)? |
|
(*´﹃ `*)✿ |
|
``` |
|
Examples of non-facemarks are: |
|
``` |
|
(3,000円) |
|
: (1/3) |
|
(@nVApO) |
|
(10/7~) |
|
?<<「ニャア(しゃーねぇな)」プイッ |
|
(残り 51字) |
|
(-0.1602) |
|
(25-0) |
|
(コーヒー飲んだ) |
|
(※軽トラ) |
|
``` |
|
|
|
This model is intended to be used on facemark-prone text like the examples above.
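
Putting the two sketches together, a hypothetical end-to-end flow would extract a candidate with the regex and only then classify it:

```python
for raw in ["ーーー(*´꒳`*)!(", "(3,000円)"]:
    candidate = extract_facemark_candidate(raw)  # regex port above
    if candidate is not None:
        print(candidate, classifier(candidate))  # pipeline sketched above
```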
|
|
|
## Training and evaluation data |
|
|
|
The facemark data was collected both manually and automatically from Twitter timelines.
|
|
|
* train.csv : 35591 samples (29911 facemark, 5680 non-facemark) |
|
* test.csv : 3954 samples (3315 facemark, 639 non-facemark) |
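
The exact CSV layout is not documented here. `run_glue.py` expects a `label` column plus a text column, so the files presumably look something like the sketch below (the column name and the `0` label for non-facemarks are assumptions):

```csv
sentence,label
"💪(^ω^ 🍤)",1
"(3,000円)",0
```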
|
|
|
## Training procedure |
|
|
|
The model was fine-tuned with the `run_glue.py` example script from the Transformers repository:

```bash
|
python ./examples/pytorch/text-classification/run_glue.py \ |
|
--model_name_or_path=cl-tohoku/bert-base-japanese-whole-word-masking \ |
|
--do_train --do_eval \ |
|
--max_seq_length=128 --per_device_train_batch_size=32 \ |
|
--use_fast_tokenizer=False --learning_rate=2e-5 --num_train_epochs=50 \ |
|
--output_dir=facemark_classify \ |
|
--save_steps=1000 --save_total_limit=3 \ |
|
--train_file=train.csv \ |
|
--validation_file=test.csv |
|
``` |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 2e-05 |
|
- train_batch_size: 32 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 50.0 |
|
|
|
### Training results |
|
|
|
It achieves the following results on the evaluation set: |
|
- Loss: 0.1301 |
|
- Accuracy: 0.9896 |
|
|
|
### Framework versions |
|
|
|
- Transformers 4.26.0.dev0 |
|
- Pytorch 1.11.0+cu102 |
|
- Datasets 2.7.1 |
|
- Tokenizers 0.13.2 |
|
|