---
license: apache-2.0
language:
- sr
metrics:
- f1
- recall
- precision
- accuracy
base_model:
- classla/bcms-bertic
pipeline_tag: token-classification
library_name: transformers
tags:
- NER
- Serbian language
- Named Entity Recognition
- Legal
- NLP
- BERT
- Legal documents
- Court ruling
contributors:
- Vladimir Kalušev [Hugging Face](https://huggingface.co/kalusev)
- Branko Brkljač [Hugging Face](https://huggingface.co/brkljac)
---

# NER4Legal_SRB

## Model Description

NER4Legal_SRB is a fine-tuned Named Entity Recognition (NER) model designed for processing Serbian legal documents. It was created as part of the conference paper "Named Entity Recognition for Serbian Legal Documents: Design, Methodology and Dataset Development", accepted for publication at the 15th International Conference on Information Society and Technology, Kopaonik, Serbia, March 9-12, 2025.

The model aims to automate tasks involving legal documents, such as document archiving, search, and retrieval. It leverages the [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) pre-trained BERT model, carefully adapted to the specific task of identifying and classifying a predefined set of named entities in Serbian legal texts. The model can be run on both CPU and GPU. The provided model was trained on all data from the NER4Legal_SRB dataset described in the reference paper.

## Abstract

Recent advancements in the field of natural language processing (NLP), and especially in large language models (LLMs) and their numerous applications, have brought research attention to the design of different document processing tools and enhancements in the process of document archiving, search, and retrieval. The domain of official legal documents is especially interesting due to the vast amount of data generated daily, as well as the significant community of interested practitioners (lawyers, law offices, administrative workers, state institutions, and citizens). Providing efficient ways to automate everyday work involving legal documents is therefore expected to have a significant impact in different fields. In this work, we present an LLM-based solution for Named Entity Recognition (NER) in the case of legal documents written in the Serbian language. It leverages the pre-trained bidirectional encoder representations from transformers (BERT), carefully adapted to the specific task of identifying and classifying specific data points from textual content. Besides the development of a novel dataset for the Serbian language (involving public court rulings) and the presented system design and applied methodology, the paper also discusses the achieved performance metrics and their implications for an objective assessment of the proposed solution. Cross-validation tests performed on the created manually labeled dataset, with a mean F1 score of 0.96, together with additional results on examples of intentionally modified text inputs, confirm the applicability of the proposed system design and the robustness of the developed NER solution.

## Base Model

The model is fine-tuned from the [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) base model, which is a pre-trained BERT model designed for the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages.

## Dataset

This model was fine-tuned on a manually labeled dataset of Serbian legal documents, including public court rulings.
The dataset was specifically developed for this task to enable precise identification and classification of entities in Serbian legal texts.

## Performance Metrics

The model achieved a mean F1 score of 0.96 during cross-validation tests on the labeled dataset, demonstrating robust performance and applicability to real-world scenarios. For detailed information about the performed model evaluation and the reported results, please consult the original conference paper.

## Contributors

- Vladimir Kalušev [https://huggingface.co/kalusev](https://huggingface.co/kalusev)
- Branko Brkljač [https://huggingface.co/brkljac](https://huggingface.co/brkljac), [https://brkljac.github.io/](https://brkljac.github.io/)

## Usage

Here’s how to use the model in Python:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("kalusev/NER4Legal_SRB", use_auth_token=True)
model = AutoModelForTokenClassification.from_pretrained("kalusev/NER4Legal_SRB", use_auth_token=True).to(device)

# Define the label mapping (id_to_label)
id_to_label = {
    0: 'O',
    1: 'B-COURT',
    2: 'B-DATE',
    3: 'B-DECISION',
    4: 'B-LAW',
    5: 'B-MONEY',
    6: 'B-OFFICIAL GAZZETE',
    7: 'B-PERSON',
    8: 'B-REFERENCE',
    9: 'I-COURT',
    10: 'I-LAW',
    11: 'I-MONEY',
    12: 'I-OFFICIAL GAZZETE',
    13: 'I-PERSON',
    14: 'I-REFERENCE'
}

# NER with GPU/CPU fallback
def perform_ner(text):
    """
    Perform Named Entity Recognition on a single text with GPU memory fallback.

    Args:
        text (str): Input text.

    Returns:
        list: List of (token, predicted label) pairs.
    """
    try:
        # Tokenize the input text
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)

        # Get predictions from the model
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=2).squeeze().tolist()
    except RuntimeError as e:
        if "CUDA out of memory" in str(e):
            print("Switching to CPU due to memory constraints.")
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
            with torch.no_grad():
                outputs = model.cpu()(**inputs)  # Run model on CPU
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=2).squeeze().tolist()
        else:
            raise e  # Re-raise other exceptions

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
    labels = [id_to_label[pred] for pred in predictions]

    # Filter out special tokens
    results = [
        (token, label)
        for token, label in zip(tokens, labels)
        if token not in tokenizer.all_special_tokens
    ]
    return results


# Example usage
text = """Rešenjem Apelacionog suda u Novom Sadu, Gž1. 1901/10 od 12.05.2010. godine žalba tuženog je usvojena, a presuda Opštinskog suda u Novom Sadu, P. 5734/04 od 29.01.2009. godine, ukinuta i predmet upućen ovom sudu na ponovno suđenje."""

# Perform NER
results = perform_ner(text)

# Print tokens and labels as a formatted table
print("Token | Predicted Label")
print("----------------------------------------")
for token, label in results:
    print(f"{token:<17} | {label}")
```

A minimal sketch for grouping these token-level predictions into entity spans is included at the end of this card.

*NER4Legal_SRB performance in the presence of noisy inputs*

### If you would like to use this software, please consider citing the following publication:

- *Kalušev, V., Brkljač, B. (2025). **Named entity recognition for Serbian legal documents: Design, methodology and dataset development**.
In Proceedings of the 15th International Conference on Information Society and Technology (ICIST), Kopaonik, Serbia, 9-12 March 2025, Vol. -, ISBN -, accepted for publication.*

```bibtex
@inproceedings{KalusevNER2025,
    author = {Kalu{\v{s}}ev, Vladimir and Brklja{\v{c}}, Branko},
    booktitle = {15th International Conference on Information Society and Technology (ICIST)},
    doi = {-},
    month = mar,
    pages = {1--16},
    title = {Named entity recognition for Serbian legal documents: {D}esign, methodology and dataset development},
    year = {2025}
}
```

```bibtex
@misc{kalušev2025namedentityrecognitionserbian,
      title={Named entity recognition for Serbian legal documents: Design, methodology and dataset development},
      author={Vladimir Kalušev and Branko Brkljač},
      year={2025},
      eprint={2502.10582},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10582},
}
```
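
### Grouping token-level predictions into entity spans

The `perform_ner` function above returns per-token labels. The following is a minimal, illustrative sketch of how such output could be merged into entity-level spans; it assumes a WordPiece-style tokenizer (subword continuations marked with `##`, as used by the bcms-bertic tokenizer), and `group_entities` is a hypothetical helper, not part of the released code.

```python
def group_entities(token_label_pairs):
    """Merge BIO-tagged (token, label) pairs into (entity_text, entity_type) spans."""
    entities, current_tokens, current_type = [], [], None

    def flush():
        if current_tokens:
            # Re-attach "##" subword continuations to the preceding piece
            text = ""
            for tok in current_tokens:
                text += tok[2:] if tok.startswith("##") else ((" " + tok) if text else tok)
            entities.append((text, current_type))

    for token, label in token_label_pairs:
        if label == "O":
            flush()
            current_tokens, current_type = [], None
        elif label.startswith("B-") or label[2:] != current_type:
            # A new entity starts (explicit B- tag, or an I- tag of a different type)
            flush()
            current_tokens, current_type = [token], label[2:]
        else:
            # I- continuation of the current entity
            current_tokens.append(token)
    flush()
    return entities


# Example usage with the `results` returned by perform_ner(text) above:
# for span_text, entity_type in group_entities(results):
#     print(f"{entity_type:<20} | {span_text}")
```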