---
license: gpl-3.0
language:
- en
base_model:
- google-bert/bert-base-uncased
---

# PatentBERT - PyTorch

BERT model specialized for patent classification using the **CPC (Cooperative Patent Classification) system**. (PyTorch version of the original [PatentBert](https://github.com/jiehsheng/PatentBERT/) model.)

## 📊 Specifications

- **Output classes**: 656 (CPC subclass labels)
- **Classification system**: CPC (Cooperative Patent Classification)
- **Architecture**: BERT-base (hidden size 768, 12 layers, 12 attention heads)
- **Vocabulary**: 30,522 tokens
- **Format**: SafeTensors

## 🏷️ CPC Classes (Label Distribution)

The model predicts the 656 CPC subclass labels used in the original PatentBERT training:

### Main Sections (Subclass Counts)

- **A (84 classes)**: Human Necessities - Agriculture, Food, Health, Sports
- **B (171 classes)**: Performing Operations; Transporting - Manufacturing, Transport
- **C (88 classes)**: Chemistry; Metallurgy - Chemical processes, Materials
- **D (40 classes)**: Textiles; Paper - Fibers, Fabrics, Paper-making
- **E (31 classes)**: Fixed Constructions - Building, Mining, Roads
- **F (101 classes)**: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
- **G (81 classes)**: Physics - Optics, Acoustics, Computing, Measuring
- **H (51 classes)**: Electricity - Electronics, Power generation, Communication
- **Y (9 classes)**: General Tagging of New Technological Developments

### Example CPC Subclasses

- `A01B`: SOIL WORKING IN AGRICULTURE OR FORESTRY
- `B25J`: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- `C07D`: HETEROCYCLIC COMPOUNDS
- `G06F`: ELECTRIC DIGITAL DATA PROCESSING
- `H04L`: TRANSMISSION OF DIGITAL INFORMATION

## 🚀 Usage

```python
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')

# Inference example
text = "A method for producing synthetic materials with enhanced thermal properties..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.softmax(dim=-1)

# Get prediction
predicted_class_id = predictions.argmax(dim=-1).item()
confidence = predictions.max().item()

# Use model labels (CPC codes); transformers stores id2label with integer keys
predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
print(f"Confidence: {confidence:.2%}")
```
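Patents frequently fall under more than one CPC subclass, so the single argmax above can hide relevant labels. Below is a minimal top-k variant of the same snippet; the choice of `k = 5` is illustrative and not part of the original model card:

```python
from transformers import BertForSequenceClassification, BertTokenizer
import torch

model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')

text = "A method for producing synthetic materials with enhanced thermal properties..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

with torch.no_grad():
    # Shape (656,): probability for each CPC subclass
    probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)

# Report the k most probable CPC subclasses instead of only the top one
top = torch.topk(probs, k=5)
for prob, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {prob:.2%}")
```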
## 📁 Included Files

- `model.safetensors`: Model weights (420 MB)
- `config.json`: Configuration with integrated CPC labels
- `vocab.txt`: Tokenizer vocabulary
- `tokenizer_config.json`: Tokenizer configuration
- `labels.json`: Complete CPC label mapping (656 authentic labels)
- `README.md`: This documentation

## 🔬 Training

This model was trained on a large patent corpus to classify documents automatically according to the CPC system, using the same 656 CPC codes as the original PatentBERT training data.

## 📖 References

- [Cooperative Patent Classification (CPC)](https://www.cooperativepatentclassification.org/)
- [Original PatentBERT Paper](https://arxiv.org/abs/2103.02557)

## 📝 Citation

If you use this model, please cite the original PatentBERT work and mention this PyTorch conversion.

```bibtex
@article{patent_bert,
  author  = "Jieh-Sheng Lee and Jieh Hsiang",
  title   = "{PatentBERT: Patent classification with fine-tuning a pre-trained BERT model}",
  journal = "World Patent Information",
  volume  = "61",
  number  = "101965",
  year    = "2020",
}
```