k4tel/vit-historical-page

@@ -7,7 +7,7 @@ tags:
 # Image processing using ViT - for historical documents
-**Goal:** This project solves the task of page image classification
 **Scope:** Processing of images, training and evaluation of ViT model,
 input file/directory processing, class (category) results of top
@@ -18,23 +18,12 @@ HF 😊 hub support for the model
 Fine-tuned model files can be found here:  [huggingface.co/k4tel/vit-historical-page](https://huggingface.co/k4tel/vit-historical-page) 🔗
-- **Developed by:** Kate L [github/k4tel](https://github.com/K4TEL/ltp-ocr.git)
 - **Funded by ATRIUM:**
 - **Shared by ATRIUM & UFAL:**
 - **Model type:** fine-tuned ViT
 - **Base model repository:** [google/vit](https://huggingface.co/google/vit-base-patch16-224) 🔗
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Paper:** not yet
-- **Demo:** [github](https://github.com/K4TEL/ltp-ocr.git)
-### Direct Use
-Page image classification into 11 predefined categories.
 #### Training Hyperparameters
 * eval_strategy "epoch"
@@ -52,30 +41,6 @@ Page images classification in to 11 predefined categories.
 Training set of the model: **8950** images
-#### Categories
-- **DRAW 📈**:	1182	(11.89%)  - drawings, maps, paintings with text
-- **DRAW_L 📈📏**:	813	(8.17%)   - drawings, maps, paintings with a table legend or inside tabular layout / forms
-- **LINE_HW ✏️📏**:	596	(5.99%)   - handwritten text lines inside tabular layout / forms
-- **LINE_P 📏**:	603	(6.06%)   - printed text lines inside tabular layout / forms
-- **LINE_T 📏**:	1332	(13.39%)  - machine typed text lines inside tabular layout / forms
-- **PHOTO 🌄**:	1015	(10.21%)  - photos with text
-- **PHOTO_L 🌄📏**:	782	(7.86%)   - photos inside tabular layout / forms
-- **TEXT 📰**:	853	(8.58%)   - mixed types, printed, and handwritten texts
-- **TEXT_HW ✏️📄**:	732	(7.36%)   - only handwritten text
-- **TEXT_P 📄**:	691	(6.95%)   - only printed text
-- **TEXT_T 📄**:	1346	(13.53%)  - only machine typed text
 #### Data preprocessing
 During training the following transforms were applied randomly with a 50% chance:
@@ -87,15 +52,47 @@ During training the following transforms were applied randomly with a 50% chance
 * transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
 * transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
-Evaluation set (10% of the above stats): **995** images
 ### Results 📊
-Evaluation set's accuracy (Top-3):  **99.6%**
-⚠️ Regarding the model output, **Top-3** is enough to cover most of the images,
-setting **Top-5** will help with a small number of difficult to classify samples.
-Finally, using **Top-11** option will give you a **raw version** of class scores returned by the model
 #### Contacts

 # Image processing using ViT - for historical documents
+### Goal: This project solves the task of page image classification
 **Scope:** Processing of images, training and evaluation of ViT model,
 input file/directory processing, class (category) results of top
 Fine-tuned model files can be found here:  [huggingface.co/k4tel/vit-historical-page](https://huggingface.co/k4tel/vit-historical-page) 🔗
+- **Developed by:** Kate L
 - **Funded by ATRIUM:**
 - **Shared by ATRIUM & UFAL:**
 - **Model type:** fine-tuned ViT
 - **Base model repository:** [google/vit](https://huggingface.co/google/vit-base-patch16-224) 🔗
 #### Training Hyperparameters
 * eval_strategy "epoch"
 Training set of the model: **8950** images
 #### Data preprocessing
 During training the following transforms were applied randomly with a 50% chance:
 * transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
 * transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
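The "50% chance" gating and the parameter ranges in the two transforms above can be illustrated without any image library. The following is a minimal sketch, not the project's training code: the function name `sample_augmentation_plan` is hypothetical, and it assumes each transform is gated independently with probability 0.5, which the model card does not state explicitly. Only the ranges (sharpness factor in 0.5 to 1.5, blur radius in 0 to 2) come from the list above.

```python
import random


def sample_augmentation_plan(rng: random.Random, p: float = 0.5) -> dict:
    """Decide, for one image, which augmentations fire and how strongly.

    Assumption: every transform is switched on independently with
    probability p. A value of None means the transform is skipped.
    """
    plan = {"sharpness": None, "blur_radius": None}
    if rng.random() < p:
        # Same range as ImageEnhance.Sharpness(...).enhance(uniform(0.5, 1.5))
        plan["sharpness"] = rng.uniform(0.5, 1.5)
    if rng.random() < p:
        # Same range as ImageFilter.GaussianBlur(radius=uniform(0, 2))
        plan["blur_radius"] = rng.uniform(0.0, 2.0)
    return plan


rng = random.Random(0)  # fixed seed so the sketch is repeatable
plans = [sample_augmentation_plan(rng) for _ in range(10_000)]

sharpened = sum(pl["sharpness"] is not None for pl in plans) / len(plans)
blurred = sum(pl["blur_radius"] is not None for pl in plans) / len(plans)
print(f"sharpened: {sharpened:.2f}, blurred: {blurred:.2f}")
```

Under this assumption roughly half of the training images are sharpened or softened, roughly half are blurred, and about a quarter receive both.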
+#### Categories
+| Label | Ratio | Description |
+| --- | --- | --- |
+| **DRAW** | 11.89% | **📈 - drawings, maps, paintings with text** |
+| **DRAW_L** | 8.17% | **📈📏 - drawings ... with a table legend or inside tabular layout / forms** |
+| **LINE_HW** | 5.99% | **✏️📏 - handwritten text lines inside tabular layout / forms** |
+| **LINE_P** | 6.06% | **📏 - printed text lines inside tabular layout / forms** |
+| **LINE_T** | 13.39% | **📏 - machine typed text lines inside tabular layout / forms** |
+| **PHOTO** | 10.21% | **🌄 - photos with text** |
+| **PHOTO_L** | 7.86% | **🌄📏 - photos inside tabular layout / forms or with a tabular annotation** |
+| **TEXT** | 8.58% | **📰 - mixed types of printed and handwritten texts** |
+| **TEXT_HW** | 7.36% | **✏️📄 - only handwritten text** |
+| **TEXT_P** | 6.95% | **📄 - only printed text** |
+| **TEXT_T** | 13.53% | **📄 - only machine typed text** |
+Evaluation set (same category proportions as in the table above): **995** images
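To get a feel for how many evaluation pages each category contributes, the ratios from the table can be multiplied by the evaluation set size. This is a rough sketch that assumes the split follows the listed ratios exactly; the per-category evaluation counts themselves are not given in the model card, so the printed numbers are estimates only. The names `EVAL_SET_SIZE`, `CATEGORY_RATIOS`, and `expected_counts` are introduced here for illustration.

```python
EVAL_SET_SIZE = 995  # evaluation images, as stated above

# Category ratios in percent, copied from the Categories table.
CATEGORY_RATIOS = {
    "DRAW": 11.89,
    "DRAW_L": 8.17,
    "LINE_HW": 5.99,
    "LINE_P": 6.06,
    "LINE_T": 13.39,
    "PHOTO": 10.21,
    "PHOTO_L": 7.86,
    "TEXT": 8.58,
    "TEXT_HW": 7.36,
    "TEXT_P": 6.95,
    "TEXT_T": 13.53,
}

# Estimated number of evaluation images per category (kept as floats,
# since the real split necessarily rounds to whole images).
expected_counts = {
    label: EVAL_SET_SIZE * pct / 100 for label, pct in CATEGORY_RATIOS.items()
}

for label, count in expected_counts.items():
    print(f"{label:<8} ~{count:.0f} images")

total_ratio = sum(CATEGORY_RATIOS.values())
print(f"ratios sum to {total_ratio:.2f}%")
```

The ratios are rounded to two decimals, so they sum to slightly less than 100% and the estimates add up to slightly less than 995.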
 ### Results 📊
+Evaluation set's accuracy (**Top-3**):  **99.6%**
+![TOP-3 confusion matrix - trained ViT](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/plots/20250209-1526_conf_mat.png?raw=true)
+Evaluation set's accuracy (**Top-1**):  **97.3%**
+![TOP-1 confusion matrix - trained ViT](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/plots/20250218-1523_conf_mat.png?raw=true)
+#### Result tables
+- Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/tables/20250209-1534_model_1119_3_TOP-3_EVAL.csv) 🔗
+- Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/tables/20250218-1519_model_1119_3_TOP-1_EVAL.csv) 🔗
+#### Table columns
+- **FILE** - name of the file
+- **PAGE** - number of the page
+- **CLASS-N** - predicted category label of the N-th best guess (TOP-N)
+- **SCORE-N** - model score of the N-th best guess (TOP-N)
+- **TRUE** - actual (ground-truth) category label
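The columns above are enough to recompute a Top-N accuracy from a result table. The sketch below assumes the CSV header uses concrete names such as `CLASS-1`, `SCORE-1`, `CLASS-2`, and so on (one pair per guess); that naming, the helper `top_n_accuracy`, and the four sample rows are all hypothetical and only illustrate the calculation, they are not taken from the published tables.

```python
import csv
import io


def top_n_accuracy(rows, n):
    """Share of rows whose TRUE label appears among the first n guesses."""
    rows = list(rows)
    hits = sum(
        row["TRUE"] in [row[f"CLASS-{k}"] for k in range(1, n + 1)]
        for row in rows
    )
    return hits / len(rows)


# Made-up rows in the assumed column layout, for illustration only.
SAMPLE = """FILE,PAGE,CLASS-1,SCORE-1,CLASS-2,SCORE-2,CLASS-3,SCORE-3,TRUE
doc_a,1,TEXT_T,0.91,TEXT_P,0.06,TEXT,0.02,TEXT_T
doc_a,2,LINE_P,0.55,LINE_T,0.40,TEXT_P,0.03,LINE_T
doc_b,1,PHOTO,0.48,DRAW,0.30,PHOTO_L,0.20,PHOTO_L
doc_b,2,DRAW_L,0.70,DRAW,0.20,LINE_HW,0.05,TEXT
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
top1 = top_n_accuracy(rows, 1)  # only the first row is right on the first guess
top3 = top_n_accuracy(rows, 3)  # three of four rows contain TRUE in the top 3
print(f"Top-1: {top1:.2f}, Top-3: {top3:.2f}")
```

To run this on a real table, replace `io.StringIO(SAMPLE)` with an open file handle for the downloaded CSV, after checking its actual header names.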
 #### Contacts