PatentBERT - PyTorch
A BERT model specialized for patent classification using the Cooperative Patent Classification (CPC) system. This is a PyTorch version of the original PatentBERT model.
Specifications
- Output classes: 656 (CPC subclass labels)
- Classification system: CPC (Cooperative Patent Classification)
- Architecture: BERT-base (768 hidden, 12 layers, 12 attention heads)
- Vocabulary: 30,522 tokens
- Format: SafeTensors
CPC Classes (Real Distribution)
The model predicts CPC subclasses drawn from the same label set that was used in PatentBERT training:
Main Sections (Actual Counts)
- A (84 classes): Human Necessities - Agriculture, Food, Health, Sports
- B (171 classes): Performing Operations; Transporting - Manufacturing, Transport
- C (88 classes): Chemistry; Metallurgy - Chemical processes, Materials
- D (40 classes): Textiles; Paper - Fibers, Fabrics, Paper-making
- E (31 classes): Fixed Constructions - Building, Mining, Roads
- F (101 classes): Mechanical Engineering; Lighting; Heating; Weapons; Blasting
- G (81 classes): Physics - Optics, Acoustics, Computing, Measuring
- H (51 classes): Electricity - Electronics, Power generation, Communication
- Y (9 classes): General Tagging of New Technological Developments
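The per-section counts above can be checked against the stated total of 656 output classes. The following is a minimal sketch with the counts copied from the list; the names `SECTION_COUNTS` and `section_of` are illustrative and are not part of the model repository.

```python
# Subclass counts per CPC section, copied from the list above.
SECTION_COUNTS = {
    "A": 84, "B": 171, "C": 88, "D": 40, "E": 31,
    "F": 101, "G": 81, "H": 51, "Y": 9,
}


def section_of(cpc_subclass: str) -> str:
    """Return the CPC section letter (first character) of a subclass code."""
    section = cpc_subclass.strip().upper()[:1]
    if section not in SECTION_COUNTS:
        raise ValueError(f"Unknown CPC section in {cpc_subclass!r}")
    return section


total = sum(SECTION_COUNTS.values())
print(total)               # 656
print(section_of("G06F"))  # G
```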
Example of CPC Subclasses
- A01B: SOIL WORKING IN AGRICULTURE OR FORESTRY
- B25J: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- C07D: HETEROCYCLIC COMPOUNDS
- G06F: ELECTRIC DIGITAL DATA PROCESSING
- H04L: TRANSMISSION OF DIGITAL INFORMATION
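Each subclass code follows the same shape: a section letter, a two-digit class number, and a subclass letter. As a rough sketch, a code such as `H04L` can be split into those parts with a regular expression; `SUBCLASS_RE` and `parse_subclass` are hypothetical helper names, not something shipped with the model.

```python
import re

# One section letter (A-H or Y), a two-digit class, then one subclass letter.
SUBCLASS_RE = re.compile(r"^([A-HY])(\d{2})([A-Z])$")


def parse_subclass(code: str) -> dict:
    """Split a CPC subclass code into its section, class, and subclass parts."""
    normalized = code.strip().upper()
    match = SUBCLASS_RE.match(normalized)
    if match is None:
        raise ValueError(f"Not a CPC subclass code: {code!r}")
    section, class_digits, _subclass_letter = match.groups()
    return {
        "section": section,
        "class": section + class_digits,
        "subclass": normalized,
    }


print(parse_subclass("H04L"))
# {'section': 'H', 'class': 'H04', 'subclass': 'H04L'}
```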
Usage
from transformers import BertForSequenceClassification, BertTokenizer
import json
import torch
# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')
# Inference example
text = "A method for producing synthetic materials with enhanced thermal properties..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.softmax(dim=-1)
# Get prediction
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()
# Use model labels (CPC codes); transformers stores id2label with integer keys after loading
predicted_label = model.config.id2label[predicted_class_id]
print(f"Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
print(f"Confidence: {confidence:.2%}")
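A patent often fits more than one CPC subclass, so it can be useful to look at the few most probable labels rather than only the top one. The sketch below ranks a plain list of probabilities; `top_k_labels` is a hypothetical helper, and the toy probabilities and three-label mapping are made up for illustration.

```python
def top_k_labels(probs, id2label, k=5):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[idx], prob) for idx, prob in ranked[:k]]


# Toy example with made-up probabilities over three labels.
toy_probs = [0.1, 0.7, 0.2]
toy_id2label = {0: "A01B", 1: "G06F", 2: "H04L"}
print(top_k_labels(toy_probs, toy_id2label, k=2))
# [('G06F', 0.7), ('H04L', 0.2)]
```

With the model loaded as in the usage example, the call would look like `top_k_labels(predictions[0].tolist(), model.config.id2label)`, assuming `id2label` is keyed by integer class IDs.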
Included Files
- model.safetensors: Model weights (420 MB)
- config.json: Configuration with integrated CPC labels
- vocab.txt: Tokenizer vocabulary
- tokenizer_config.json: Tokenizer configuration
- labels.json: Complete CPC label mapping (656 authentic labels)
- README.md: This documentation
Performance
This model was trained on a large patent corpus to automatically classify documents according to the CPC system, using the exact same 656 CPC codes from the original PatentBERT training data.
References
Citation
If you use this model, please cite the original PatentBERT work and mention this PyTorch conversion.
@article{patent_bert,
  author = "Jieh-Sheng Lee and Jieh Hsiang",
  title = "{PatentBERT: Patent classification with fine-tuning a pre-trained BERT model}",
  journal = "World Patent Information",
  volume = "61",
  number = "101965",
  year = "2020",
}
Base model: google-bert/bert-base-uncased