---
library_name: transformers
tags:
- page
- classification
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---

# Image classification using a fine-tuned ViT - for sorting historical documents

### Goal: sort archive page images (for further content-based processing)

**Scope:** image processing, training and evaluation of the ViT model,
input file/directory processing, output of the top-N predicted classes 🏷️ (categories),
summarizing predictions into a tabular format,
and HF 😊 hub support for the model

## Versions 🏁

There are currently 5 versions of the model available for download; all of them share the same set of categories,
but they differ in data annotations and base models. The latest approved version, `v2.1`, is the default and can be found in the `main` branch
of the HF 😊 hub [^1] 🔗

| Version | Base                    | Pages |   PDFs   | Description                                                                        |
|--------:|-------------------------|:-----:|:--------:|:------------------------------------------------------------------------------------|
|  `v2.0` | `vit-base-patch16-224`  | 10073 | **3896** | annotations with mistakes, more heterogeneous data                                 |
|  `v2.1` | `vit-base-patch16-224`  | 11940 | **5002** | `main`: more diverse pages in each category, fewer annotation mistakes             |
|  `v2.2` | `vit-base-patch16-224`  | 15855 | **5730** | same data as `v2.1` plus some restored pages from `v2.0`                           |
|  `v3.2` | `vit-base-patch16-384`  | 15855 | **5730** | same data as `v2.2`, but a slightly larger base model with higher input resolution |
|  `v5.2` | `vit-large-patch16-384` | 15855 | **5730** | same data as `v2.2`, but the largest base model with higher input resolution       |
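
Since the default `v2.1` version lives in the `main` branch of the HF 😊 hub repository [^1], a particular revision can be selected via the `revision` argument. A minimal loading sketch (branch names other than `main` are an assumption here):

```python
from transformers import ViTForImageClassification, ViTImageProcessor

# "main" holds the default v2.1 weights; other branch names (e.g. "v3.2")
# are assumed to match the version tags above - check the hub repository.
MODEL = "k4tel/vit-historical-page"
processor = ViTImageProcessor.from_pretrained(MODEL, revision="main")
model = ViTForImageClassification.from_pretrained(MODEL, revision="main")
```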


## Model description 📇

🔲 Fine-tuned model repository: **vit-historical-page** [^1] 🔗

🔳 Base model repository: Google's **vit-base-patch16-224**, **vit-base-patch16-384**, **vit-large-patch16-384** [^2] [^6] [^7] 🔗

### Data 📜

Training set sizes per version:

* `v2.0`: **8950** images
* `v2.1`: **10745** images
* `v2.2`, `v3.2`, `v5.2`: **14565** images

### Categories 🏷️


|     Label | Description                                                                                                     |
|----------:|:----------------------------------------------------------------------------------------------------------------|
|    `DRAW` | **📈 - drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions** |
|  `DRAW_L` | **📈📏 - drawings, etc., presented within a table-like layout or including a legend formatted as a table**       |
| `LINE_HW` | **✏️📏 - handwritten text organized in a tabular or form-like structure**                                        |
|  `LINE_P` | **📏 - printed text organized in a tabular or form-like structure**                                              |
|  `LINE_T` | **📏 - machine-typed text organized in a tabular or form-like structure**                                        |
|   `PHOTO` | **🌄 - photographs or photographic cutouts, potentially with text captions**                                     |
| `PHOTO_L` | **🌄📏 - photos presented within a table-like layout or accompanied by tabular annotations**                     |
|    `TEXT` | **📰 - mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements**          |
| `TEXT_HW` | **✏️📄 - only handwritten text in paragraph or block form (non-tabular)**                                        |
|  `TEXT_P` | **📄 - only printed text in paragraph or block form (non-tabular)**                                              |
|  `TEXT_T` | **📄 - only machine-typed text in paragraph or block form (non-tabular)**                                        |

Evaluation set:  **1290** images (taken from `v2.2` annotations)
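
As a quick illustration of the top-N category 🏷️ output mentioned in the scope, the model can be run through the generic `image-classification` pipeline; the image path below is a placeholder:

```python
from transformers import pipeline

classify = pipeline("image-classification", model="k4tel/vit-historical-page")

# top_k=3 returns the three best-scoring categories for the page image
for prediction in classify("page_scan.png", top_k=3):  # placeholder path
    print(f"{prediction['label']}: {prediction['score']:.4f}")
```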

#### Data preprocessing 

During training, each of the following transforms was applied independently with a 50% probability (a runnable sketch follows the list):

* `transforms.ColorJitter(brightness=0.5)`
* `transforms.ColorJitter(contrast=0.5)`
* `transforms.ColorJitter(saturation=0.5)`
* `transforms.ColorJitter(hue=0.5)`
* `transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))`
* `transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))`
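
A minimal sketch of how these augmentations could be composed with torchvision and PIL; using `transforms.RandomApply` for the 50% chance is an assumption, since the card does not show the actual composition:

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# each augmentation is wrapped so that it fires with probability 0.5
augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```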

### Training Hyperparameters

* `eval_strategy="epoch"`
* `save_strategy="epoch"`
* `learning_rate=5e-5`
* `per_device_train_batch_size=8`
* `per_device_eval_batch_size=8`
* `num_train_epochs=3`
* `warmup_ratio=0.1`
* `logging_steps=10`
* `load_best_model_at_end=True`
* `metric_for_best_model="accuracy"`
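
These map directly onto `transformers.TrainingArguments`; a hedged sketch, where the output directory is an assumption:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./vit-historical-page",  # assumed output path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```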

### Results 📊

**v2.0** Evaluation set's accuracy (**Top-3**):  **95.58%** 

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1147_model_v20_conf_mat_TOP-3.png?raw=true)

**v2.1** Evaluation set's accuracy (**Top-3**):  **99.84%**

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1157_model_v21_conf_mat_TOP-3.png?raw=true)

**v2.2** Evaluation set's accuracy (**Top-3**):  **100.00%**

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1201_model_v22_conf_mat_TOP-3.png?raw=true)

**v2.0** Evaluation set's accuracy (**Top-1**):  **84.96%** 

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1152_model_v20_conf_mat_TOP-1.png?raw=true)

**v2.1** Evaluation set's accuracy (**Top-1**):  **96.36%** 

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1156_model_v21_conf_mat_TOP-1.png?raw=true)

**v2.2** Evaluation set's accuracy (**Top-1**):  **99.61%** 

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250526-1202_model_v22_conf_mat_TOP-1.png?raw=true)

#### Result tables

- **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1142_model_v20_TOP-3_EVAL.csv) 🔗

- **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1148_model_v20_TOP-1_EVAL.csv) 🔗

- **v2.1** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1153_model_v21_TOP-3_EVAL.csv) 🔗

- **v2.1** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1151_model_v21_TOP-1_EVAL.csv) 🔗

- **v2.2** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1156_model_v22_TOP-3_EVAL.csv) 🔗

- **v2.2** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1158_model_v22_TOP-1_EVAL.csv) 🔗

#### Table columns

- **FILE** - name of the input file
- **PAGE** - page number within the file
- **CLASS-N** - label of the TOP-N guessed category 🏷️
- **SCORE-N** - confidence score of the TOP-N guessed category 🏷️
- **TRUE** - actual category label 🏷️
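
For example, Top-1 accuracy can be recomputed from one of the linked tables with pandas; the file name below is a placeholder, and the exact header spelling is assumed to follow the column list above:

```python
import pandas as pd

# placeholder path - substitute one of the linked *_EVAL.csv tables
df = pd.read_csv("model_TOP-1_EVAL.csv")

# a prediction counts as correct when the TOP-1 guess matches the actual label
top1_accuracy = (df["CLASS-1"] == df["TRUE"]).mean()
print(f"Top-1 accuracy: {top1_accuracy:.2%}")
```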

### Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

### Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰
- **Shared by** ATRIUM [^4] & UFAL [^5]
- **Model type:** fine-tuned ViT with a 224x224 [^2] 🔗 or 384x384 [^6] [^7] 🔗 input resolution

**©️ 2022 UFAL & ATRIUM**

[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384