---
library_name: transformers
tags:
- page
- classification
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---
# Image classification using fine-tuned ViT for historical document sorting
### Goal: sort archive page images (for their further content-based processing)
**Scope:** processing of images, training and evaluation of the ViT model,
input file/directory processing, output of the top-N predicted classes
(categories), summarizing predictions into a tabular format, and
HF hub support for the model
## Versions
There are currently 5 versions of the model available for download. All of them share the same set of categories
but differ in their data annotations. The latest approved version, `v2.1`, is the default and can be found in the `main` branch
of the HF hub [^1].
| Version | Base | Pages | PDFs | Description |
|--------:|------------------------|:-----:|:--------:|:-----------------------------------------------------------------------------|
| `v2.0` | `vit-base-patch16-224` | 10073 | **3896** | annotations with mistakes, more heterogeneous data |
| `v2.1` | `vit-base-patch16-224` | 11940 | **5002** | `main`: more diverse pages in each category, fewer annotation mistakes |
| `v2.2` | `vit-base-patch16-224` | 15855 | **5730** | same data as `v2.1` + some restored pages from `v2.0` |
| `v3.2` | `vit-base-patch16-384` | 15855 | **5730** | same data as `v2.2`, but a slightly larger model base with higher resolution |
| `v5.2` | `vit-large-patch16-384` | 15855 | **5730** | same data as `v2.2`, but the largest model base with higher resolution |
## Model description
- Fine-tuned model repository: **vit-historical-page** [^1]
- Base model repositories: Google's **vit-base-patch16-224** [^2], **vit-base-patch16-384** [^6], and **vit-large-patch16-384** [^7]
### Data
Training set sizes:
* **8950** images for `v2.0`
* **10745** images for `v2.1`
* **14565** images for `v2.2`, `v3.2`, and `v5.2`
### Categories
| Label | Description |
|----------:|:---------------------------------------------------------------------------------------------------|
| `DRAW` | **drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions** |
| `DRAW_L` | **drawings, etc., but presented within a table-like layout or including a legend formatted as a table** |
| `LINE_HW` | **handwritten text organized in a tabular or form-like structure** |
| `LINE_P` | **printed text organized in a tabular or form-like structure** |
| `LINE_T` | **machine-typed text organized in a tabular or form-like structure** |
| `PHOTO` | **photographs or photographic cutouts, potentially with text captions** |
| `PHOTO_L` | **photos presented within a table-like layout or accompanied by tabular annotations** |
| `TEXT` | **mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements** |
| `TEXT_HW` | **only handwritten text in paragraph or block form (non-tabular)** |
| `TEXT_P` | **only printed text in paragraph or block form (non-tabular)** |
| `TEXT_T` | **only machine-typed text in paragraph or block form (non-tabular)** |
Evaluation set: **1290** images (taken from `v2.2` annotations)
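For reference, the eleven labels above can be collected into the id-to-label mapping style commonly used in model configs. A minimal sketch (the alphabetical index order here is an assumption, not taken from the actual model config):

```python
# The eleven page categories from the table above; the index order is
# alphabetical and assumed -- the real model config may order them differently.
LABELS = [
    "DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
    "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T",
]
id2label = {i: label for i, label in enumerate(LABELS)}
label2id = {label: i for i, label in enumerate(LABELS)}
```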
#### Data preprocessing
During training, each of the following transforms was applied randomly with a 50% chance:
* `transforms.ColorJitter(brightness=0.5)`
* `transforms.ColorJitter(contrast=0.5)`
* `transforms.ColorJitter(saturation=0.5)`
* `transforms.ColorJitter(hue=0.5)`
* `transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))`
* `transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))`
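The same augmentation recipe can be sketched with plain PIL instead of torchvision (a `ColorJitter` factor of 0.5 means sampling the enhancement factor from [0.5, 1.5]). This is an illustrative approximation, not the training code itself; the hue jitter is omitted because PIL's `ImageEnhance` has no direct equivalent:

```python
import random

from PIL import Image, ImageEnhance, ImageFilter


def augment(img: Image.Image, p: float = 0.5) -> Image.Image:
    """Apply each training-time augmentation independently with probability p."""
    if random.random() < p:  # brightness jitter, factor in [0.5, 1.5]
        img = ImageEnhance.Brightness(img).enhance(random.uniform(0.5, 1.5))
    if random.random() < p:  # contrast jitter
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.5, 1.5))
    if random.random() < p:  # saturation jitter (PIL calls it "Color")
        img = ImageEnhance.Color(img).enhance(random.uniform(0.5, 1.5))
    if random.random() < p:  # sharpness jitter
        img = ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5))
    if random.random() < p:  # Gaussian blur with a random radius in [0, 2]
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))
    return img
```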
### Training Hyperparameters
* `eval_strategy="epoch"`
* `save_strategy="epoch"`
* `learning_rate=5e-5`
* `per_device_train_batch_size=8`
* `per_device_eval_batch_size=8`
* `num_train_epochs=3`
* `warmup_ratio=0.1`
* `logging_steps=10`
* `load_best_model_at_end=True`
* `metric_for_best_model="accuracy"`
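The hyperparameters above can be collected into a single dictionary; with the `transformers` library installed, it could then be splatted into `TrainingArguments(output_dir=..., **training_kwargs)` (the output directory is hypothetical and not part of the listed settings):

```python
# Training hyperparameters from the list above, gathered as keyword arguments.
training_kwargs = {
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 3,
    "warmup_ratio": 0.1,
    "logging_steps": 10,
    "load_best_model_at_end": True,
    "metric_for_best_model": "accuracy",
}
```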
### Results
**v2.0** Evaluation set's accuracy (**Top-3**): **95.58%**

**v2.1** Evaluation set's accuracy (**Top-3**): **99.84%**

**v2.2** Evaluation set's accuracy (**Top-3**): **100.00%**

**v2.0** Evaluation set's accuracy (**Top-1**): **84.96%**

**v2.1** Evaluation set's accuracy (**Top-1**): **96.36%**

**v2.2** Evaluation set's accuracy (**Top-1**): **99.61%**

#### Result tables
- **v2.0** Manually **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1142_model_v20_TOP-3_EVAL.csv)
- **v2.0** Manually **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1148_model_v20_TOP-1_EVAL.csv)
- **v2.1** Manually **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1153_model_v21_TOP-3_EVAL.csv)
- **v2.1** Manually **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1151_model_v21_TOP-1_EVAL.csv)
- **v2.2** Manually **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1156_model_v22_TOP-3_EVAL.csv)
- **v2.2** Manually **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250526-1158_model_v22_TOP-1_EVAL.csv)
#### Table columns
- **FILE** - name of the file
- **PAGE** - number of the page
- **CLASS-N** - label of the top-N predicted category
- **SCORE-N** - confidence score of the top-N predicted category
- **TRUE** - actual category label
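A minimal sketch of how one such table row could be assembled from raw model logits, using only the standard library (the function name, arguments, and rounding are illustrative assumptions, not the project's actual code):

```python
import math


def top_n_row(file_name, page, logits, id2label, true_label, n=3):
    """Build one result-table row: FILE, PAGE, CLASS-1..N, SCORE-1..N, TRUE."""
    # Softmax over the logits to get per-category scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Indices of the N highest-scoring categories, best first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    row = {"FILE": file_name, "PAGE": page}
    for rank, i in enumerate(top, start=1):
        row[f"CLASS-{rank}"] = id2label[i]
        row[f"SCORE-{rank}"] = round(scores[i], 4)
    row["TRUE"] = true_label
    return row
```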
### Contacts
For support write to: [email protected]
Official repository: UFAL [^3]
### Acknowledgements
- **Developed by:** UFAL [^5]
- **Funded by:** ATRIUM [^4]
- **Shared by:** ATRIUM [^4] & UFAL [^5]
- **Model type:** fine-tuned ViT with 224x224 [^2] or 384x384 [^6] [^7] input resolution
**Β©οΈ 2022 UFAL & ATRIUM**
[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384