---
language: multilingual
tags:
- document-classification
- text-classification
- multilingual
- doclaynet
- e5
pipeline_tag: text-classification
base_model: intfloat/multilingual-e5-large
datasets:
- pierreguillou/DocLayNet-base
metrics:
- accuracy
model-index:
- name: multilingual-e5-doclaynet
  results:
  - task:
      type: text-classification
      name: Document Classification
    dataset:
      name: DocLayNet
      type: pierreguillou/DocLayNet-base
    metrics:
    - type: accuracy
      value: 0.9719
      name: Test Accuracy
    - type: loss
      value: 0.5192
      name: Test Loss
library_name: transformers
---

# Multilingual E5 for Document Classification (DocLayNet)

This model is a fine-tuned version of intfloat/multilingual-e5-large for document text classification on the DocLayNet dataset.

## Evaluation results

- Test loss: 0.5192
- Test accuracy: 0.9719

## Usage

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="kaixkhazaki/multilingual-e5-doclaynet")
prediction = pipe("This is some text from a financial report")
print(prediction)
```

## Model description

- Base model: intfloat/multilingual-e5-large
- Task: Document text classification
- Languages: Multilingual

## Training data

- Dataset: DocLayNet-base
- Source: https://huggingface.co/datasets/pierreguillou/DocLayNet-base
- Categories:

```python
{
    'financial_reports': 0,
    'government_tenders': 1,
    'laws_and_regulations': 2,
    'manuals': 3,
    'patents': 4,
    'scientific_articles': 5
}
```

## Training procedure

Trained on a single GPU for 2 epochs, taking approximately 20 minutes.

Hyperparameters:

```python
{
    'batch_size': 8,
    'num_epochs': 10,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_ratio': 0.1,
    'gradient_clip': 1.0,
    'label_smoothing': 0.1,
    'optimizer': 'AdamW',
    'scheduler': 'cosine_with_warmup'
}
```
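The pipeline returns the predicted category as a string label. Assuming the fine-tuned model's config stores the category names from the mapping above (rather than generic `LABEL_0`-style names), a prediction can be converted back to its integer id like this; the `prediction` dict below is an illustrative stand-in for actual pipeline output:

```python
# Category mapping from the DocLayNet training data (see "Training data" above)
label2id = {
    'financial_reports': 0,
    'government_tenders': 1,
    'laws_and_regulations': 2,
    'manuals': 3,
    'patents': 4,
    'scientific_articles': 5,
}
# Invert the mapping to go from integer id back to category name
id2label = {v: k for k, v in label2id.items()}

# Illustrative stand-in for a pipeline result such as
# [{'label': 'financial_reports', 'score': 0.97}]
prediction = {'label': 'financial_reports', 'score': 0.97}
print(label2id[prediction['label']])  # 0
```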
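The `'scheduler': 'cosine_with_warmup'` entry refers to a learning-rate schedule that warms up linearly and then decays along a cosine curve. The exact training code is not shown in this card (in practice it would likely come from `transformers.get_cosine_schedule_with_warmup`), so the following is only an illustrative sketch of how the `learning_rate` and `warmup_ratio` hyperparameters interact:

```python
import math

def cosine_with_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr over the first warmup_ratio of steps,
    then cosine decay toward zero. Illustrative sketch only; not the
    card author's actual training code."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: over 1000 optimizer steps, the peak LR is hit at step 100
lrs = [cosine_with_warmup_lr(s, 1000) for s in range(1000)]
```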
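The `'label_smoothing': 0.1` hyperparameter means the loss is computed against a softened target distribution: with smoothing ε and K classes, the true class receives probability (1 − ε) + ε/K and every other class ε/K. A minimal sketch of that loss on already-normalized probabilities (training would instead apply this to logits, e.g. via PyTorch's `CrossEntropyLoss(label_smoothing=0.1)`):

```python
import math

def label_smoothed_ce(probs, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution.
    Illustrative sketch; assumes probs is a valid probability vector."""
    k = len(probs)
    uniform = smoothing / k
    loss = 0.0
    for i, p in enumerate(probs):
        # Smoothed target mass for class i
        q = (1.0 - smoothing) + uniform if i == target else uniform
        loss -= q * math.log(p)
    return loss

# With smoothing=0 this reduces to plain negative log-likelihood
probs = [0.7, 0.1, 0.05, 0.05, 0.05, 0.05]
plain_nll = label_smoothed_ce(probs, 0, smoothing=0.0)
smoothed = label_smoothed_ce(probs, 0, smoothing=0.1)
```

Because the smoothed target spreads some mass onto low-probability classes, the smoothed loss is slightly higher than the plain NLL when the model is confident and correct; this discourages overconfident predictions.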