---
license: cc-by-sa-4.0
language: ja
tags:
- generated_from_trainer
- text-classification
metrics:
- accuracy
widget:
- text: "💪(^ω^ 🍤)"
example_title: "Facemark 1"
- text: "(੭ु∂∀6)੭ु⁾⁾ ஐ•*¨*•.¸¸"
example_title: "Facemark 2"
- text: ":-P"
example_title: "Facemark 3"
- text: "(o.o)"
example_title: "Facemark 4"
- text: "(10/7~)"
example_title: "Non-facemark 1"
- text: "??<<「ニャア(しゃーねぇな)」プイッ"
example_title: "Non-facemark 2"
- text: "(0.01)"
example_title: "Non-facemark 3"
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->
# Facemark Detection
This model classifies a given text as a facemark (label `1`) or not.
This model is a fine-tuned version of [cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking) on an original facemark dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896
## Model description
This model classifies a given text as a facemark (label `1`) or not.
## Intended uses & limitations
Extract a facemark-prone portion of text and feed it to the model.
Facemark candidates can be extracted with a regex, but such extraction usually also captures many non-facemarks.
For example, the following Perl regex extracts facemark-prone text:
```perl
my $input_text = "facemark prone text";
my $text = '[0-9A-Za-zぁ-ヶ一-龠]';
my $non_text = '[^0-9A-Za-zぁ-ヶ一-龠]';
my $allow_text = '[ovっつ゜ニノ三二]';
my $hw_kana = '[ヲ-゚]';
my $open_bracket = '[\(∩꒰(]';
my $close_bracket = '[\)∩꒱)]';
my $around_face = '(?:' . $non_text . '|' . $allow_text . ')*';
my $face = '(?!(?:' . $text . '|' . $hw_kana . '){3,8}).{3,8}';
my $face_char = $around_face . $open_bracket . $face . $close_bracket . $around_face;
my $facemark;
if ($input_text =~ /($face_char)/) {
    $facemark = $1;
}
```
Examples of facemarks are:
```
(^U^)←
。\n\n⊂( *・ω・ )⊃
っ(。>﹏<)
タカ( ˘ω' ) ヤスゥ…
。(’↑▽↑)
……💰( ˘ω˘ )💰
ーーー(*´꒳`*)!(
…(o:∇:o)
!!…(;´Д`)?
(*´﹃ `*)✿
```
Examples of non-facemarks are:
```
(3,000円)
: (1/3)
(@nVApO)
(10/7~)
?<<「ニャア(しゃーねぇな)」プイッ
(残り 51字)
(-0.1602)
(25-0)
(コーヒー飲んだ)
(※軽トラ)
```
This model is intended for facemark-prone text like the examples above.
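Inference can be sketched with the `transformers` pipeline API. Note the assumptions: the model id `omzn/facemark_detection` is inferred from this repository's name, and the facemark class is assumed to be exported as `LABEL_1` (the `run_glue.py` default when the CSV labels are 0/1); verify both against the actual repository and its `config.json`.

```python
from typing import Dict

def is_facemark(prediction: Dict) -> bool:
    # Assumption: the fine-tuned head names the facemark class "LABEL_1";
    # check id2label in the model's config.json before relying on this.
    return prediction["label"] == "LABEL_1" and prediction["score"] >= 0.5

if __name__ == "__main__":
    # transformers is imported lazily so the helper above stays importable
    # without it installed.
    from transformers import pipeline

    # Model id assumed from this repository's name; adjust if it differs.
    classifier = pipeline("text-classification", model="omzn/facemark_detection")
    for sample in ["(o.o)", "(10/7~)"]:
        pred = classifier(sample)[0]  # {"label": ..., "score": ...}
        print(sample, pred, is_facemark(pred))
```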
## Training and evaluation data
Facemark data was collected manually and automatically from Twitter timelines.
* train.csv : 35591 samples (29911 facemark, 5680 non-facemark)
* test.csv : 3954 samples (3315 facemark, 639 non-facemark)
## Training procedure
The model was fine-tuned with the Hugging Face `run_glue.py` text-classification example script:
```bash
python ./examples/pytorch/text-classification/run_glue.py \
--model_name_or_path=cl-tohoku/bert-base-japanese-whole-word-masking \
--do_train --do_eval \
--max_seq_length=128 --per_device_train_batch_size=32 \
--use_fast_tokenizer=False --learning_rate=2e-5 --num_train_epochs=50 \
--output_dir=facemark_classify \
--save_steps=1000 --save_total_limit=3 \
--train_file=train.csv \
--validation_file=test.csv
```
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50.0
### Training results
It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896
### Framework versions
- Transformers 4.26.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.7.1
- Tokenizers 0.13.2