---
library_name: transformers
tags:
- page
- classification
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---

# Image classification using a fine-tuned ViT - for sorting historical documents

### Goal: sort archive page images (for further content-based processing)

**Scope:** image processing, training and evaluation of the ViT model,
input file/directory processing, output of the top-N predicted classes 🏷️ (categories),
summarizing predictions into a tabular format,
and HF 😊 hub support for the model

## Versions 🏁

There are currently 5 versions of the model available for download; all of them share the same set of categories,
but they differ in data annotations and base models. The latest approved version, `v2.1`, is the default and can be found in the `main` branch
of the HF 😊 hub [^1] 🔗

| Version | Base                    | Pages |   PDFs   | Description                                                                        |
|--------:|-------------------------|:-----:|:--------:|:------------------------------------------------------------------------------------|
|  `v2.0` | `vit-base-patch16-224`  | 10073 | **3896** | annotations with mistakes, more heterogeneous data                                 |
|  `v2.1` | `vit-base-patch16-224`  | 11940 | **5002** | `main`: more diverse pages in each category, fewer annotation mistakes             |
|  `v2.2` | `vit-base-patch16-224`  | 15855 | **5730** | same data as `v2.1` plus some restored pages from `v2.0`                           |
|  `v3.2` | `vit-base-patch16-384`  | 15855 | **5730** | same data as `v2.2`, but a slightly larger base model with higher input resolution |
|  `v5.2` | `vit-large-patch16-384` | 15855 | **5730** | same data as `v2.2`, but the largest base model with higher input resolution       |
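
Since the default `v2.1` version lives in the `main` branch of the HF 😊 hub repository [^1], a particular revision can be selected via the `revision` argument. A minimal loading sketch (branch names other than `main` are an assumption here):

```python
from transformers import ViTForImageClassification, ViTImageProcessor

# "main" holds the default v2.1 weights; other branch names (e.g. "v3.2")
# are assumed to match the version tags above - check the hub repository.
MODEL = "k4tel/vit-historical-page"
processor = ViTImageProcessor.from_pretrained(MODEL, revision="main")
model = ViTForImageClassification.from_pretrained(MODEL, revision="main")
```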


## Model description 📇

🔲 Fine-tuned model repository: **vit-historical-page** [^1] 🔗

🔳 Base model repository: Google's **vit-base-patch16-224**, **vit-base-patch16-384**, **vit-large-patch16-384** [^2] [^6] [^7] 🔗

### Data 📜

Training set sizes per version:

* `v2.0`: **8950** images
* `v2.1`: **10745** images
* `v2.2`, `v3.2`, `v5.2`: **14565** images

### Categories 🏷️


|     Label | Description                                                                                                     |
|----------:|:----------------------------------------------------------------------------------------------------------------|
|    `DRAW` | **📈 - drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions** |
|  `DRAW_L` | **📈📏 - drawings, etc., presented within a table-like layout or including a legend formatted as a table**       |
| `LINE_HW` | **✏️📏 - handwritten text organized in a tabular or form-like structure**                                        |
|  `LINE_P` | **📏 - printed text organized in a tabular or form-like structure**                                              |
|  `LINE_T` | **📏 - machine-typed text organized in a tabular or form-like structure**                                        |
|   `PHOTO` | **🌄 - photographs or photographic cutouts, potentially with text captions**                                     |
| `PHOTO_L` | **🌄📏 - photos presented within a table-like layout or accompanied by tabular annotations**                     |
|    `TEXT` | **📰 - mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements**          |
| `TEXT_HW` | **✏️📄 - only handwritten text in paragraph or block form (non-tabular)**                                        |
|  `TEXT_P` | **📄 - only printed text in paragraph or block form (non-tabular)**                                              |
|  `TEXT_T` | **📄 - only machine-typed text in paragraph or block form (non-tabular)**                                        |

Evaluation set:  **1290** images (taken from `v2.2` annotations)
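
As a quick illustration of the top-N category 🏷️ output mentioned in the scope, the model can be run through the generic `image-classification` pipeline; the image path below is a placeholder:

```python
from transformers import pipeline

classify = pipeline("image-classification", model="k4tel/vit-historical-page")

# top_k=3 returns the three best-scoring categories for the page image
for prediction in classify("page_scan.png", top_k=3):  # placeholder path
    print(f"{prediction['label']}: {prediction['score']:.4f}")
```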

#### Data preprocessing 

During training, each of the following transforms was applied independently with a 50% probability (a runnable sketch follows the list):

* `transforms.ColorJitter(brightness=0.5)`
* `transforms.ColorJitter(contrast=0.5)`
* `transforms.ColorJitter(saturation=0.5)`
* `transforms.ColorJitter(hue=0.5)`
* `transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))`
* `transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))`
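
A minimal sketch of how these augmentations could be composed with torchvision and PIL; using `transforms.RandomApply` for the 50% chance is an assumption, since the card does not show the actual composition:

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# each augmentation is wrapped so that it fires with probability 0.5
augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```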

### Training Hyperparameters

* `eval_strategy="epoch"`
* `save_strategy="epoch"`
* `learning_rate=5e-5`
* `per_device_train_batch_size=8`
* `per_device_eval_batch_size=8`
* `num_train_epochs=3`
* `warmup_ratio=0.1`
* `logging_steps=10`
* `load_best_model_at_end=True`
* `metric_for_best_model="accuracy"`
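
These map directly onto `transformers.TrainingArguments`; a hedged sketch, where the output directory is an assumption:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./vit-historical-page",  # assumed output path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```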

### Results 📊

**v2.0** Evaluation set's accuracy (**Top-3**):  **95.58%** 

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1147_model_v20_conf_mat_TOP-3.png?raw=true)

**v2.1** Evaluation set's accuracy (**Top-3**):  **99.84%**

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1157_model_v21_conf_mat_TOP-3.png?raw=true)

**v2.2** Evaluation set's accuracy (**Top-3**):  **100.00%**

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1201_model_v22_conf_mat_TOP-3.png?raw=true)

**v2.0** Evaluation set's accuracy (**Top-1**):  **84.96%** 

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1152_model_v20_conf_mat_TOP-1.png?raw=true)

**v2.1** Evaluation set's accuracy (**Top-1**):  **96.36%** 

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1156_model_v21_conf_mat_TOP-1.png?raw=true)

**v2.2** Evaluation set's accuracy (**Top-1**):  **99.61%** 

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1202_model_v22_conf_mat_TOP-1.png?raw=true)

#### Result tables

- **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1142_model_v20_TOP-3_EVAL.csv) 🔗

- **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1148_model_v20_TOP-1_EVAL.csv) 🔗

- **v2.1** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1153_model_v21_TOP-3_EVAL.csv) 🔗

- **v2.1** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1151_model_v21_TOP-1_EVAL.csv) 🔗

- **v2.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1156_model_v22_TOP-3_EVAL.csv) 🔗

- **v2.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1158_model_v22_TOP-1_EVAL.csv) 🔗

#### Table columns

- **FILE** - name of the input file
- **PAGE** - page number within the file
- **CLASS-N** - label of the TOP-N guessed category 🏷️
- **SCORE-N** - confidence score of the TOP-N guessed category 🏷️
- **TRUE** - actual category label 🏷️
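
For example, Top-1 accuracy can be recomputed from one of the linked tables with pandas; the file name below is a placeholder, and the exact header spelling is assumed to follow the column list above:

```python
import pandas as pd

# placeholder path - substitute one of the linked *_EVAL.csv tables
df = pd.read_csv("model_TOP-1_EVAL.csv")

# a prediction counts as correct when the TOP-1 guess matches the actual label
top1_accuracy = (df["CLASS-1"] == df["TRUE"]).mean()
print(f"Top-1 accuracy: {top1_accuracy:.2%}")
```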

### Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

### Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰
- **Shared by** ATRIUM [^4] & UFAL [^5]
- **Model type:** fine-tuned ViT with a 224x224 [^2] 🔗 or 384x384 [^6] [^7] 🔗 input resolution

**©️ 2022 UFAL & ATRIUM**

[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384