PatentBERT - PyTorch

A BERT model fine-tuned for patent classification under the CPC (Cooperative Patent Classification) system. This is a PyTorch conversion of the original PatentBERT model.

📊 Specifications

  • Output classes: 656 (CPC subclass labels)
  • Classification system: CPC (Cooperative Patent Classification)
  • Architecture: BERT-base (768 hidden, 12 layers, 12 attention heads)
  • Vocabulary: 30,522 tokens
  • Format: SafeTensors

🏷️ CPC Classes (Real Distribution)

The model predicts classes according to the authentic CPC system used in PatentBERT training:

Main Sections (Actual Counts)

  • A (84 classes): Human Necessities - Agriculture, Food, Health, Sports
  • B (171 classes): Performing Operations; Transporting - Manufacturing, Transport
  • C (88 classes): Chemistry; Metallurgy - Chemical processes, Materials
  • D (40 classes): Textiles; Paper - Fibers, Fabrics, Paper-making
  • E (31 classes): Fixed Constructions - Building, Mining, Roads
  • F (101 classes): Mechanical Engineering; Lighting; Heating; Weapons; Blasting
  • G (81 classes): Physics - Optics, Acoustics, Computing, Measuring
  • H (51 classes): Electricity - Electronics, Power generation, Communication
  • Y (9 classes): General Tagging of New Technological Developments
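
As a quick consistency check, the per-section counts in the list above can be summed to confirm that they add up to the 656 output classes. The sketch below only transcribes those numbers; the names `SECTION_COUNTS` and `total` are illustrative and not part of the model repository.

```python
# Per-section label counts, transcribed from the list above.
SECTION_COUNTS = {
    "A": 84, "B": 171, "C": 88, "D": 40, "E": 31,
    "F": 101, "G": 81, "H": 51, "Y": 9,
}

total = sum(SECTION_COUNTS.values())
print(total)  # 656

# Share of labels per section, largest section first.
for section, count in sorted(SECTION_COUNTS.items(), key=lambda kv: -kv[1]):
    print(f"{section}: {count:3d} labels ({count / total:.1%})")
```

Section B alone accounts for roughly a quarter of the label space, which is worth keeping in mind when reading per-section results.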

Examples of CPC Subclasses

  • A01B: SOIL WORKING IN AGRICULTURE OR FORESTRY
  • B25J: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
  • C07D: HETEROCYCLIC COMPOUNDS
  • G06F: ELECTRIC DIGITAL DATA PROCESSING
  • H04L: TRANSMISSION OF DIGITAL INFORMATION
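
Each subclass code in the examples above has the same shape: a section letter, a two-digit class number, and a subclass letter. A minimal sketch of splitting a code into those parts might look like the following. The helper name `parse_cpc_subclass` and the returned keys are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter

# Section letter (A-H or Y), two-digit class, subclass letter, e.g. "G06F".
CPC_SUBCLASS = re.compile(r"^([A-HY])(\d{2})([A-Z])$")


def parse_cpc_subclass(code):
    """Split a CPC subclass code into section, class, and subclass."""
    match = CPC_SUBCLASS.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a CPC subclass code: {code!r}")
    section, class_digits, _ = match.groups()
    return {
        "section": section,                # e.g. "G"
        "class": section + class_digits,   # e.g. "G06"
        "subclass": match.group(0),        # e.g. "G06F"
    }


examples = ["A01B", "B25J", "C07D", "G06F", "H04L"]
print(parse_cpc_subclass("G06F"))
print(Counter(parse_cpc_subclass(code)["section"] for code in examples))
```

Grouping predicted labels by `section` or `class` in this way makes it possible to score predictions at a coarser level than the 656 subclasses.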

🚀 Usage

from transformers import BertForSequenceClassification, BertTokenizer
import json
import torch

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')

# Inference example
text = "A method for producing synthetic materials with enhanced thermal properties..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.softmax(dim=-1)

# Get prediction
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()

# Look up the CPC code; transformers converts id2label keys to int when loading the config
predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
print(f"Confidence: {confidence:.2%}")
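
A patent often fits more than one subclass, so it can be useful to look at several of the highest-scoring labels rather than only the top one. The sketch below ranks labels from a plain list of logits. The function name `top_k_predictions` is hypothetical, and the logits and labels in the toy example are made up; the function assumes `id2label` is keyed by integer class IDs.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs."""
    # Numerically stable softmax over a plain list of floats.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    # Class indices sorted by descending probability.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy example: three made-up logits and labels, not real model output.
toy_labels = {0: "A01B", 1: "G06F", 2: "H04L"}
for label, prob in top_k_predictions([0.2, 2.5, 1.1], toy_labels, k=2):
    print(f"{label}: {prob:.2%}")
```

With the model loaded as in the usage example, the same function could be called as `top_k_predictions(outputs.logits[0].tolist(), model.config.id2label)`.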

📁 Included Files

  • model.safetensors: Model weights (420 MB)
  • config.json: Configuration with integrated CPC labels
  • vocab.txt: Tokenizer vocabulary
  • tokenizer_config.json: Tokenizer configuration
  • labels.json: Complete CPC label mapping (656 authentic labels)
  • README.md: This documentation
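
The label mapping can also be read straight from config.json without loading the model weights. The sketch below assumes the file follows the usual transformers layout, in which `id2label` is a JSON object whose keys are class IDs stored as strings; the helper name `load_id2label` and the local path are illustrative. The layout of labels.json is not described here, so it is not parsed.

```python
import json


def load_id2label(config_path):
    """Read id2label from a config.json and return it keyed by int class ID."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # JSON object keys are always strings, so convert them to integers.
    return {int(idx): label for idx, label in config["id2label"].items()}
```

Assuming the repository has been downloaded locally, `load_id2label("patentbert-pytorch/config.json")` would then return a dictionary that can be indexed with the integer class ID produced by `argmax`.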

🔬 Performance

The model was trained on a large patent corpus to assign CPC codes to patent documents, and it uses the same 656 CPC codes as the original PatentBERT training data.

📖 References

📝 Citation

If you use this model, please cite the original PatentBERT work and mention this PyTorch conversion.

@article{patent_bert, 
  author = "Jieh-Sheng Lee and Jieh Hsiang",
  title = "{PatentBERT: Patent classification with fine-tuning a pre-trained BERT model}",
  journal = "World Patent Information",
  volume = "61",
  number = "101965",
  year = "2020",
}