---
license: apache-2.0
language:
- it
- en
pipeline_tag: token-classification
tags:
- legal
- finance
- medical
- privacy
- named-entity-recognition
---

# Italian_NER_XXL_v2

## 🚀 Model Overview

Welcome to the second generation of our state-of-the-art Named Entity Recognition model for Italian text. Building on the success of our previous version, Italian_NER_XXL_v2 delivers significantly enhanced performance, with an **accuracy of 87.5%** and an **F1 score of 89.2%**: an improvement of more than 8 percentage points in accuracy over our previous model.

## 💡 Key Improvements

- **Enhanced Accuracy**: From 79% to 87.5%
- **Better Context Understanding**: Improved recognition of entities in complex sentences
- **Reduced False Positives**: More precise identification of sensitive information
- **Expanded Training Data**: Trained on a more diverse corpus of Italian text

## 🏆 Market Leadership

Italian_NER_XXL_v2 remains the only model in Italy capable of identifying a comprehensive range of **52** different entity categories, maintaining our unique position in the Italian NLP landscape. This breadth of entity recognition makes our model the premier choice for privacy, legal, and financial applications.

## 🔬 Technical Foundation

The model builds on a transformer-based architecture: a fine-tuned BERT variant optimized for Italian language understanding.
We've implemented advanced techniques including:

- Custom attention mechanisms for better contextual understanding
- Specialized token classification heads for each entity category
- Enhanced preprocessing pipeline for Italian text

## 📋 Recognized Categories

Our model identifies an extensive range of entities across multiple domains:

### Personal Information

- **NOME**: First name of a person
- **COGNOME**: Last name of a person
- **DATA_NASCITA**: Date of birth
- **DATA_MORTE**: Date of death
- **ETA**: Age of a person
- **CODICE_FISCALE**: Italian tax code
- **PROFESSIONE**: Occupation or profession
- **STATO_CIVILE**: Civil status

### Contact Information

- **INDIRIZZO**: Physical address
- **NUMERO_TELEFONO**: Phone number
- **EMAIL**: Email address
- **CODICE_POSTALE**: Postal code

### Financial Information

- **VALUTA**: Currency
- **IMPORTO**: Monetary amount
- **NUMERO_CARTA**: Credit/debit card number
- **CVV**: Card security code
- **NUMERO_CONTO**: Bank account number
- **IBAN**: International bank account number
- **BIC**: Bank identifier code
- **P_IVA**: VAT number
- **TASSO_MUTUO**: Mortgage rate
- **NUM_ASSEGNO_BANCARIO**: Bank check number
- **BANCA**: Bank name

### Legal Entities

- **RAGIONE_SOCIALE**: Company legal name
- **TRIBUNALE**: Court identifier
- **LEGGE**: Law reference
- **N_SENTENZA**: Court judgment number
- **N_LICENZA**: License number
- **AVV_NOTAIO**: Lawyer or notary reference
- **REGIME_PATRIMONIALE**: Marital property regime

### Medical Information

- **CARTELLA_CLINICA**: Medical record
- **MALATTIA**: Disease or medical condition
- **MEDICINA**: Medicine or medical treatment
- **STORIA_CLINICA**: Clinical history
- **STRENGTH**: Medicine strength
- **FREQUENZA**: Treatment frequency
- **DURATION**: Duration of treatment
- **DOSAGGIO**: Medicine dosage
- **FORM**: Medicine form (e.g., tablet)

### Technical Information

- **IP**: IP address
- **IPV6_1**: IPv6 address
- **MAC**: MAC address
- **USER_AGENT**: Browser user agent
- **IMEI**: Mobile device identifier

### Geographic and Temporal Data

- **STATO**: Country or nation
- **LUOGO**: Geographic location
- **ORARIO**: Specific time
- **DATA**: Generic date

### Document and Vehicle Information

- **NUMERO_DOCUMENTO**: Document number
- **TARGA_VEICOLO**: Vehicle license plate
- **FOGLIO**: Document sheet reference
- **PARTICELLA**: Land registry parcel
- **MAPPALE**: Land registry map reference
- **SUBALTERNO**: Land registry subordinate reference

### Web and Security

- **URL**: Web address
- **PASSWORD**: Password
- **PIN**: Personal identification number
- **BRAND**: Commercial brand or trademark

## 💻 Implementation

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("DeepMount00/Italian_NER_XXL_v2")
model = AutoModelForTokenClassification.from_pretrained("DeepMount00/Italian_NER_XXL_v2")

# Create NER pipeline; "simple" aggregation merges sub-word tokens into whole entities
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
example = """Il commendatore Gianluigi Alberico De Laurentis-Ponti, con residenza legale in Corso Imperatrice 67, Torino, avente codice fiscale DLNGGL60B01L219P, è amministratore delegato della "De Laurentis Advanced Engineering Group S.p.A.", che si trova in Piazza Affari 32, Milano (MI); con una partita IVA di 09876543210, la società è stata recentemente incaricata di sviluppare una nuova linea di componenti aerospaziali per il progetto internazionale di esplorazione di Marte."""

# Run NER
ner_results = nlp(example)

# Print each detected entity with its category and confidence score
for entity in ner_results:
    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.4f})")
```

## 🚀 Use Cases

- **Privacy Compliance**: GDPR data mapping and PII detection
- **Document Anonymization**: Automated redaction of sensitive information
- **Legal Document Analysis**: Extraction of key entities from contracts and legal texts
- **Financial Monitoring**: Detection of financial entities for compliance and fraud prevention
- **Medical Record Processing**: Structured extraction from clinical notes and reports

## 🔮 Future Development

We're committed to continuous improvement of the model:

- Quarterly updates with further accuracy enhancements
- Expansion to include new entity types based on user feedback
- Development of domain-specific variants for specialized applications
- Integration of contextual entity linking capabilities

## 👥 Contribution and Contact

Your feedback is essential to improving this model. If you're interested in contributing, have suggestions, or need a customized NER solution, please contact:

Michele Montebovi

Email: [montebovi.michele@gmail.com](mailto:montebovi.michele@gmail.com)

We welcome collaboration from the Italian NLP community to further enhance this tool and expand its applications across industries.

## 📝 Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{montebovi2025italiannerxxl,
  author = {Montebovi, Michele},
  title = {Italian\_NER\_XXL\_v2: A Comprehensive Named Entity Recognition Model for Italian},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DeepMount00/Italian_NER_XXL_v2}}
}
```
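
## 🛡️ Anonymization Sketch

The document anonymization use case listed above can be approached by replacing each detected span with its category label. The following is a minimal sketch, not part of the model's own tooling: the `redact` helper, its `min_score` threshold, and the sample entities are hypothetical and written by hand for illustration. It assumes the entity dictionaries carry the `entity_group`, `score`, `start`, and `end` keys that the `transformers` NER pipeline returns with an aggregation strategy, and that the spans do not overlap.

```python
from typing import Iterable, Mapping


def redact(text: str, entities: Iterable[Mapping], min_score: float = 0.5) -> str:
    """Replace each sufficiently confident entity span with a [LABEL] placeholder."""
    # Drop low-confidence detections (the threshold is an illustrative choice).
    kept = [e for e in entities if e["score"] >= min_score]
    # Edit from right to left so earlier character offsets remain valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written entities in the pipeline's output shape (NOT real model output).
sample_text = "Mario Rossi vive a Torino."
sample_entities = [
    {"entity_group": "NOME", "score": 0.99, "word": "Mario", "start": 0, "end": 5},
    {"entity_group": "COGNOME", "score": 0.98, "word": "Rossi", "start": 6, "end": 11},
    {"entity_group": "LUOGO", "score": 0.30, "word": "Torino", "start": 19, "end": 25},
]

redacted = redact(sample_text, sample_entities)
print(redacted)
```

With the pipeline from the Implementation section loaded, the same helper could be called as `redact(example, nlp(example))`. Lowering `min_score` redacts more aggressively at the cost of more false positives, which may be the safer trade-off for privacy-sensitive documents.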