# ๐Ÿ”„ TensorFlow โ†’ PyTorch Conversion

This section guides you through converting the PatentBERT model from TensorFlow to PyTorch and uploading it to Hugging Face Hub.

## ๐Ÿ“‹ Conversion Plan:

1. **TensorFlow Model Download** (previous cells)
2. **Weight Extraction** - Extract parameters from TensorFlow checkpoint
3. **PyTorch Conversion** - Create equivalent PyTorch model
4. **Model Testing** - Verify that the conversion works
5. **Hugging Face Upload** - Publish to Hub for public use

## โš ๏ธ Prerequisites:
- PatentBERT model downloaded (run previous cells first)
- Python 3.7+ with TensorFlow 1.15
- Separate environment with PyTorch to avoid conflicts

In [1]:
# Step 1: Environment verification and preparation

import os
import sys
import json
import numpy as np
import tensorflow as tf

print("๐Ÿ” Environment verification...")
print(f"Python: {sys.version}")
print(f"TensorFlow: {tf.__version__}")
print(f"NumPy: {np.__version__}")

# Verify that PatentBERT model has been downloaded
model_folder = './'
required_files = [
 'model.ckpt-181172.data-00000-of-00001',
 'model.ckpt-181172.index',
 'model.ckpt-181172.meta',
 'bert_config.json',
 'vocab.txt'
]

print(f"\n๐Ÿ“‚ Checking model files in {model_folder}:")
missing_files = []
for file in required_files:
 filepath = os.path.join(model_folder, file)
 if os.path.exists(filepath):
 print(f"โœ… {file}")
 else:
 print(f"โŒ {file} - MISSING")
 missing_files.append(file)

if missing_files:
 print(f"\nโš ๏ธ Missing files: {missing_files}")
 print("๐Ÿ’ก Please run the previous cells first to download the model")
else:
 print("\nโœ… All model files are present!")

# Create working directories for conversion
conversion_dir = "/tmp/patentbert_conversion"
tf_weights_dir = os.path.join(conversion_dir, "tf_weights")
pytorch_dir = os.path.join(conversion_dir, "pytorch_model")

for dir_path in [conversion_dir, tf_weights_dir, pytorch_dir]:
 os.makedirs(dir_path, exist_ok=True)
 print(f"๐Ÿ“ Created: {dir_path}")

print(f"\n๐ŸŽฏ Ready for conversion!")
print(f"๐Ÿ“Š Working directories configured")

๐Ÿ” Environment verification...
Python: 3.7.16 (default, Jan 17 2023, 22:20:44) 
[GCC 11.2.0]
TensorFlow: 1.15.0
NumPy: 1.21.5

๐Ÿ“‚ Checking model files in ./:
โœ… model.ckpt-181172.data-00000-of-00001
โœ… model.ckpt-181172.index
โœ… model.ckpt-181172.meta
โœ… bert_config.json
โœ… vocab.txt

โœ… All model files are present!
๐Ÿ“ Created: /tmp/patentbert_conversion
๐Ÿ“ Created: /tmp/patentbert_conversion/tf_weights
๐Ÿ“ Created: /tmp/patentbert_conversion/pytorch_model

๐ŸŽฏ Ready for conversion!
๐Ÿ“Š Working directories configured


In [2]:
# Step 2: TensorFlow model weights extraction

print("๐Ÿ”„ Extracting weights from TensorFlow PatentBERT model...")

def extract_tf_weights():
 """Extract all weights from TensorFlow checkpoint"""
 
 # File paths
 checkpoint_path = "./model.ckpt-181172"
 config_path = "./bert_config.json"
 vocab_path = "./vocab.txt"
 
 # Read BERT configuration
 with open(config_path, 'r') as f:
 config = json.load(f)
 
 print(f"๐Ÿ“– Model configuration:")
 print(f" โ€ข Hidden size: {config.get('hidden_size', 768)}")
 print(f" โ€ข Number of layers: {config.get('num_hidden_layers', 12)}")
 print(f" โ€ข Attention heads: {config.get('num_attention_heads', 12)}")
 print(f" โ€ข Vocabulary size: {config.get('vocab_size', 30522)}")
 
 # List all variables in checkpoint
 var_list = tf.train.list_variables(checkpoint_path)
 print(f"๐Ÿ” Found {len(var_list)} variables in checkpoint")
 
 # Filter important variables (ignore optimization variables)
 skip_patterns = ['adam', 'beta', 'global_step', 'learning_rate']
 important_vars = []
 
 for name, shape in var_list:
 if not any(pattern in name.lower() for pattern in skip_patterns):
 important_vars.append((name, shape))
 
 print(f"๐Ÿ“Š {len(important_vars)} important variables to extract")
 
 # Extract and save weights
 weights_info = {}
 total_size = 0
 
 print("๐Ÿ”„ Extraction in progress...")
 for i, (name, shape) in enumerate(important_vars):
 try:
 # Load variable
 weight = tf.train.load_variable(checkpoint_path, name)
 
 # Create safe filename
 safe_name = name.replace('/', '_').replace(':', '_').replace(' ', '_')
 filename = f"{safe_name}.npy"
 
 # Save in NumPy format
 filepath = os.path.join(tf_weights_dir, filename)
 np.save(filepath, weight)
 
 # Record metadata
 weights_info[name] = {
 'filename': filename,
 'shape': list(shape),
 'dtype': str(weight.dtype),
 'size_mb': weight.nbytes / (1024 * 1024)
 }
 
 total_size += weight.nbytes
 
 # Show progress
 if (i + 1) % 20 == 0 or (i + 1) == len(important_vars):
 print(f" Progress: {i + 1}/{len(important_vars)} ({(i+1)/len(important_vars)*100:.1f}%)")
 
 except Exception as e:
 print(f"โš ๏ธ Error for {name}: {e}")
 continue
 
 # Create complete metadata
 metadata = {
 'model_info': {
 'name': 'PatentBERT',
 'source': 'TensorFlow',
 'checkpoint_path': checkpoint_path,
 'extraction_date': '2025-07-20'
 },
 'config': config,
 'weights_info': weights_info,
 'statistics': {
 'total_weights': len(weights_info),
 'total_size_mb': total_size / (1024 * 1024),
 'original_variables': len(var_list),
 'extracted_variables': len(weights_info)
 }
 }
 
 # Save metadata
 metadata_path = os.path.join(tf_weights_dir, 'extraction_metadata.json')
 with open(metadata_path, 'w') as f:
 json.dump(metadata, f, indent=2)
 
 # Copy configuration files
 import shutil
 shutil.copy(config_path, os.path.join(tf_weights_dir, 'bert_config.json'))
 shutil.copy(vocab_path, os.path.join(tf_weights_dir, 'vocab.txt'))
 
 print(f"โœ… Extraction completed!")
 print(f"๐Ÿ“ Weights saved in: {tf_weights_dir}")
 print(f"๐Ÿ“Š {len(weights_info)} weights extracted")
 print(f"๐Ÿ’พ Total size: {total_size / (1024 * 1024):.1f} MB")
 
 # Show some examples of extracted weights
 print(f"\n๐Ÿ“‚ Examples of created files:")
 files = sorted(os.listdir(tf_weights_dir))
 for i, file in enumerate(files[:5]):
 print(f" โ€ข {file}")
 if len(files) > 5:
 print(f" ... and {len(files) - 5} other files")
 
 return tf_weights_dir, metadata

# Execute extraction
try:
 weights_dir, metadata = extract_tf_weights()
 print("\n๐ŸŽ‰ Extraction successful!")
 
except Exception as e:
 print(f"โŒ Error during extraction: {e}")
 import traceback
 traceback.print_exc()

๐Ÿ”„ Extracting weights from TensorFlow PatentBERT model...
๐Ÿ“– Model configuration:
 โ€ข Hidden size: 768
 โ€ข Number of layers: 12
 โ€ข Attention heads: 12
 โ€ข Vocabulary size: 30522
๐Ÿ” Found 604 variables in checkpoint
๐Ÿ“Š 176 important variables to extract
๐Ÿ”„ Extraction in progress...
 Progress: 20/176 (11.4%)
 Progress: 20/176 (11.4%)
 Progress: 40/176 (22.7%)
 Progress: 40/176 (22.7%)
 Progress: 60/176 (34.1%)
 Progress: 60/176 (34.1%)
 Progress: 80/176 (45.5%)
 Progress: 80/176 (45.5%)
 Progress: 100/176 (56.8%)
 Progress: 100/176 (56.8%)
 Progress: 120/176 (68.2%)
 Progress: 120/176 (68.2%)
 Progress: 140/176 (79.5%)
 Progress: 140/176 (79.5%)
 Progress: 160/176 (90.9%)
 Progress: 160/176 (90.9%)
 Progress: 176/176 (100.0%)
โœ… Extraction completed!
๐Ÿ“ Weights saved in: /tmp/patentbert_conversion/tf_weights
๐Ÿ“Š 176 weights extracted
๐Ÿ’พ Total size: 419.5 MB

๐Ÿ“‚ Examples of created files:
 โ€ข bert_config.json
 โ€ข bert_embeddings_LayerNorm_gamma.npy
 โ€ข bert_embed

In [1]:
# Step 3: Convert TensorFlow weights to PyTorch format

print("๐ŸŽฏ Converting TensorFlow weights to PyTorch format...")

corrected_upload_script = """#!/usr/bin/env python3
import os
import sys
from huggingface_hub import HfApi, create_repo, upload_folder
from transformers import BertForSequenceClassification, BertTokenizer

def check_model_files(model_dir):
 \"\"\"Check for required model files with support for both formats.\"\"\"
 
 # Required base files
 required_base = ['config.json', 'vocab.txt', 'tokenizer_config.json']
 
 # Model files (at least one of these)
 model_files = ['model.safetensors', 'pytorch_model.bin']
 
 missing_base = []
 for file in required_base:
 if not os.path.exists(os.path.join(model_dir, file)):
 missing_base.append(file)
 
 # Check for at least one model file
 found_model_files = []
 for f in model_files:
 if os.path.exists(os.path.join(model_dir, f)):
 found_model_files.append(f)
 
 if missing_base:
 print(f"โŒ Missing required files: {missing_base}")
 return False
 
 if not found_model_files:
 print(f"โŒ No model file found. Expected one of: {model_files}")
 return False
 
 # Show found files
 all_files = os.listdir(model_dir)
 print(f"โœ… Model files found: {all_files}")
 print(f"โœ… Model weights format: {found_model_files[0]}")
 return True

def test_model_loading(model_dir):
 \"\"\"Test model loading to verify it works.\"\"\"
 try:
 print("๐Ÿงช Model loading test...")
 
 # Load model and tokenizer
 model = BertForSequenceClassification.from_pretrained(model_dir)
 tokenizer = BertTokenizer.from_pretrained(model_dir)
 
 print(f"โœ… Model loaded: {model.config.num_labels} classes, {model.config.hidden_size} hidden")
 print(f"โœ… Tokenizer loaded: {len(tokenizer)} tokens")
 
 # Quick inference test
 text = "A method for producing synthetic materials"
 inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
 
 import torch
 with torch.no_grad():
 outputs = model(**inputs)
 predictions = outputs.logits.softmax(dim=-1)
 
 print(f"โœ… Inference test successful: shape {predictions.shape}")
 return True
 
 except Exception as e:
 print(f"โŒ Test error: {e}")
 return False

def upload_to_huggingface(model_dir, repo_name, token, private=False):
 \"\"\"Upload model to Hugging Face Hub with support for all formats.\"\"\"
 
 print("๐Ÿš€ Upload to Hugging Face Hub")
 print(f"๐Ÿ“‚ Model: {model_dir}")
 print(f"๐Ÿท๏ธ Repository: {repo_name}")
 print(f"๐Ÿ”’ Private: {private}")
 
 # File verification
 if not check_model_files(model_dir):
 return False
 
 # Loading test
 if not test_model_loading(model_dir):
 print("โš ๏ธ Warning: Model doesn't load correctly, but continuing upload...")
 
 try:
 # Initialize API
 api = HfApi(token=token)
 
 # Check connection
 user_info = api.whoami()
 print(f"โœ… Connected as: {user_info['name']}")
 
 # Create or verify repository
 try:
 create_repo(repo_name, token=token, private=private, exist_ok=True)
 print(f"โœ… Repository created/verified: https://huggingface.co/{repo_name}")
 except Exception as e:
 print(f"โš ๏ธ Repository warning: {e}")
 
 # Upload complete folder
 print("๐Ÿ“ค Uploading files...")
 
 # Determine model format
 model_format = "SafeTensors" if os.path.exists(os.path.join(model_dir, 'model.safetensors')) else "PyTorch"
 
 # Create informative commit message
 commit_message = f\"\"\"Upload PatentBERT PyTorch model

BERT model fine-tuned for patent classification, converted from TensorFlow to PyTorch.

Specifications:
- Format: {model_format}
- Classes: Auto-detected from config.json 
- Conversion: TensorFlow 1.15 โ†’ PyTorch via transformers
- CPC Labels: Real Cooperative Patent Classification labels included

Included files:
{', '.join(sorted(os.listdir(model_dir)))}
\"\"\"
 
 upload_folder(
 folder_path=model_dir,
 repo_id=repo_name,
 token=token,
 commit_message=commit_message,
 ignore_patterns=[".git", ".gitattributes", "*.tmp"]
 )
 
 print("๐ŸŽ‰ Upload completed successfully!")
 print(f"๐ŸŒ Model available at: https://huggingface.co/{repo_name}")
 
 # Usage instructions
 print("\\n๐Ÿ“‹ Usage instructions:")
 print(f"from transformers import BertForSequenceClassification, BertTokenizer")
 print(f"model = BertForSequenceClassification.from_pretrained('{repo_name}')")
 print(f"tokenizer = BertTokenizer.from_pretrained('{repo_name}')")
 
 return True
 
 except Exception as e:
 print(f"โŒ Upload error: {e}")
 import traceback
 traceback.print_exc()
 return False

def main():
 if len(sys.argv) != 4:
 print("Usage: python upload_to_hf.py ")
 print("Example: python upload_to_hf.py ./pytorch_model ZoeYou/patentbert-pytorch hf_xxx...")
 sys.exit(1)
 
 model_dir = sys.argv[1]
 repo_name = sys.argv[2]
 token = sys.argv[3]
 
 if not os.path.exists(model_dir):
 print(f"โŒ Directory not found: {model_dir}")
 sys.exit(1)
 
 success = upload_to_huggingface(model_dir, repo_name, token, private=False)
 
 if success:
 print("\\nโœ… UPLOAD SUCCESSFUL!")
 else:
 print("\\nโŒ UPLOAD FAILED!")
 sys.exit(1)

if __name__ == "__main__":
 # Import torch for loading test
 try:
 import torch
 except ImportError:
 print("โš ๏ธ torch not available, loading test skipped")
 
 main()
"""

# Save the corrected upload script
with open('/tmp/upload_to_hf_corrected.py', 'w', encoding='utf-8') as f:
 f.write(corrected_upload_script)

# Also overwrite the original script
with open('/tmp/upload_to_hf.py', 'w', encoding='utf-8') as f:
 f.write(corrected_upload_script)

print("โœ… CORRECTED upload script created!")
print("\n๐Ÿ”ง Key corrections:")
print(" โœ… Accepts BOTH model.safetensors AND pytorch_model.bin")
print(" โœ… Automatically detects model format")
print(" โœ… Improved error messages")
print(" โœ… Better commit message with format info")
print(" โœ… Proper torch import for testing")

print("\n๐Ÿš€ NOW RUN THIS CORRECTED COMMAND:")
print(" python /tmp/upload_to_hf.py patentbert_conversion/pytorch_model ZoeYou/patentbert-pytorch xxxxx")

print("\n๐Ÿ’ก Or use the new corrected script:")
print(" python /tmp/upload_to_hf_corrected.py patentbert_conversion/pytorch_model ZoeYou/patentbert-pytorch xxxxx")

๐ŸŽฏ Converting TensorFlow weights to PyTorch format...
โœ… CORRECTED upload script created!

๐Ÿ”ง Key corrections:
 โœ… Accepts BOTH model.safetensors AND pytorch_model.bin
 โœ… Automatically detects model format
 โœ… Improved error messages
 โœ… Better commit message with format info
 โœ… Proper torch import for testing

๐Ÿš€ NOW RUN THIS CORRECTED COMMAND:
 python /tmp/upload_to_hf.py patentbert_conversion/pytorch_model ZoeYou/patentbert-pytorch xxxxx

๐Ÿ’ก Or use the new corrected script:
 python /tmp/upload_to_hf_corrected.py patentbert_conversion/pytorch_model ZoeYou/patentbert-pytorch xxxxx


In [None]:
# ๐ŸŽ‰ UPLOAD SUCCESS! Let's test the uploaded model

print("๐ŸŽ‰ Upload successful! Testing the uploaded model from Hugging Face...")

# Test the uploaded model

from transformers import BertForSequenceClassification, BertTokenizer
import torch

print("๐Ÿ” Testing uploaded PatentBERT model from Hugging Face...")

try:
 # Load model and tokenizer from Hugging Face Hub
 model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
 tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')
 
 print(f"โœ… Model loaded: {model.config.num_labels} classes")
 print(f"โœ… Tokenizer loaded: {len(tokenizer)} tokens")
 
 # Test inference
 text = "A method for producing synthetic materials with enhanced properties"
 inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
 
 with torch.no_grad():
 outputs = model(**inputs)
 predictions = outputs.logits.softmax(dim=-1)
 
 # Get top prediction
 predicted_class_id = predictions.argmax().item()
 confidence = predictions.max().item()
 
 # Use real CPC labels if available
 if hasattr(model.config, 'id2label') and model.config.id2label:
 predicted_label = model.config.id2label[predicted_class_id]
 print(f"โœ… Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
 else:
 print(f"โœ… Predicted class ID: {predicted_class_id}")
 
 print(f"โœ… Confidence: {confidence:.2%}")
 print("๐ŸŽ‰ Model works perfectly from Hugging Face!")
 
except Exception as e:
 print(f"โŒ Error: {e}")


print("๐Ÿ“ Model test code ready. Your model is now live at:")
print("๐ŸŒ https://huggingface.co/ZoeYou/patentbert-pytorch")

print("\\n๐Ÿ“‹ Quick usage example:")


In [2]:
# step 4: Provide usage example for the uploaded model

# ๐ŸŽ‰ CONVERSION SUCCESS! Upload script correction

print("๐ŸŽ‰ CONVERSION SUCCESSFUL! Upload script correction...")

upload_script = """#!/usr/bin/env python3
import os
import sys
from huggingface_hub import HfApi, create_repo, upload_folder
from transformers import BertForSequenceClassification, BertTokenizer

def check_model_files(model_dir):
 \"\"\"Check for required model files.\"\"\"
 
 # Required base files
 required_base = ['config.json', 'vocab.txt', 'tokenizer_config.json']
 
 # Model files (at least one of these)
 model_files = ['model.safetensors', 'pytorch_model.bin']
 
 missing_base = []
 for file in required_base:
 if not os.path.exists(os.path.join(model_dir, file)):
 missing_base.append(file)
 
 # Check for at least one model file
 has_model_file = any(os.path.exists(os.path.join(model_dir, f)) for f in model_files)
 
 if missing_base:
 print(f"โŒ Missing required files: {missing_base}")
 return False
 
 if not has_model_file:
 print(f"โŒ No model file found. Expected: {model_files}")
 return False
 
 # Show found files
 found_files = []
 for file in os.listdir(model_dir):
 if os.path.isfile(os.path.join(model_dir, file)):
 found_files.append(file)
 
 print(f"โœ… Model files found: {found_files}")
 return True

def test_model_loading(model_dir):
 \"\"\"Test model loading to verify it works.\"\"\"
 try:
 print("๐Ÿงช Model loading test...")
 
 # Load model and tokenizer
 model = BertForSequenceClassification.from_pretrained(model_dir)
 tokenizer = BertTokenizer.from_pretrained(model_dir)
 
 print(f"โœ… Model loaded: {model.config.num_labels} classes, {model.config.hidden_size} hidden")
 print(f"โœ… Tokenizer loaded: {len(tokenizer)} tokens")
 
 # Quick inference test
 text = "A method for producing synthetic materials"
 inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
 
 with torch.no_grad():
 outputs = model(**inputs)
 predictions = outputs.logits.softmax(dim=-1)
 
 print(f"โœ… Inference test successful: shape {predictions.shape}")
 return True
 
 except Exception as e:
 print(f"โŒ Test error: {e}")
 return False

def upload_to_huggingface(model_dir, repo_name, token, private=False):
 \"\"\"Upload model to Hugging Face Hub.\"\"\"
 
 print("๐Ÿš€ Upload to Hugging Face Hub")
 print(f"๐Ÿ“‚ Model: {model_dir}")
 print(f"๐Ÿท๏ธ Repository: {repo_name}")
 print(f"๐Ÿ”’ Private: {private}")
 
 # File verification
 if not check_model_files(model_dir):
 return False
 
 # Loading test
 if not test_model_loading(model_dir):
 print("โš ๏ธ Warning: Model doesn't load correctly, but continuing upload...")
 
 try:
 # Initialize API
 api = HfApi(token=token)
 
 # Check connection
 user_info = api.whoami()
 print(f"โœ… Connected as: {user_info['name']}")
 
 # Create or verify repository
 try:
 create_repo(repo_name, token=token, private=private, exist_ok=True)
 print(f"โœ… Repository created/verified: https://huggingface.co/{repo_name}")
 except Exception as e:
 print(f"โš ๏ธ Repository warning: {e}")
 
 # Upload complete folder
 print("๐Ÿ“ค Uploading files...")
 
 # Create informative commit message
 commit_message = f\"\"\"Upload PatentBERT PyTorch model

BERT model fine-tuned for patent classification, converted from TensorFlow to PyTorch.

Specifications:
- Format: {'SafeTensors' if os.path.exists(os.path.join(model_dir, 'model.safetensors')) else 'PyTorch'}
- Classes: Auto-detected from config.json
- Conversion: TensorFlow 1.15 โ†’ PyTorch via transformers

Included files:
{', '.join(os.listdir(model_dir))}
\"\"\"
 
 upload_folder(
 folder_path=model_dir,
 repo_id=repo_name,
 token=token,
 commit_message=commit_message,
 ignore_patterns=[".git", ".gitattributes", "*.tmp"]
 )
 
 print("๐ŸŽ‰ Upload completed successfully!")
 print(f"๐ŸŒ Model available at: https://huggingface.co/{repo_name}")
 
 # Usage instructions
 print("\\n๐Ÿ“‹ Usage instructions:")
 print(f"from transformers import BertForSequenceClassification, BertTokenizer")
 print(f"model = BertForSequenceClassification.from_pretrained('{repo_name}')")
 print(f"tokenizer = BertTokenizer.from_pretrained('{repo_name}')")
 
 return True
 
 except Exception as e:
 print(f"โŒ Upload error: {e}")
 return False

def main():
 if len(sys.argv) != 4:
 print("Usage: python upload_to_hf.py ")
 print("Example: python upload_to_hf.py ./pytorch_model ZoeYou/patentbert-pytorch hf_xxx...")
 sys.exit(1)
 
 model_dir = sys.argv[1]
 repo_name = sys.argv[2]
 token = sys.argv[3]
 
 if not os.path.exists(model_dir):
 print(f"โŒ Directory not found: {model_dir}")
 sys.exit(1)
 
 success = upload_to_huggingface(model_dir, repo_name, token, private=False)
 
 if success:
 print("\\nโœ… UPLOAD SUCCESSFUL!")
 else:
 print("\\nโŒ UPLOAD FAILED!")
 sys.exit(1)

if __name__ == "__main__":
 # Import torch for loading test
 try:
 import torch
 except ImportError:
 print("โš ๏ธ torch not available, loading test skipped")
 
 main()
"""

# Save corrected upload script
with open('/tmp/upload_to_hf.py', 'w', encoding='utf-8') as f:
 f.write(upload_script)

print("โœ… CORRECTED upload script created!")
print("\n๐Ÿ”ง Applied corrections:")
print(" โœ… Accepts model.safetensors AND pytorch_model.bin")
print(" โœ… Model loading test before upload")
print(" โœ… Robust file verification")
print(" โœ… Informative commit message")
print(" โœ… Usage instructions included")

print("\n๐Ÿš€ CORRECTED COMMAND:")
print(" python upload_to_hf.py patentbert_conversion/pytorch_model ZoeYou/patentbert-pytorch xxxxx")

๐ŸŽ‰ CONVERSION SUCCESSFUL! Upload script correction...
โœ… CORRECTED upload script created!

๐Ÿ”ง Applied corrections:
 โœ… Accepts model.safetensors AND pytorch_model.bin
 โœ… Model loading test before upload
 โœ… Robust file verification
 โœ… Informative commit message
 โœ… Usage instructions included

๐Ÿš€ CORRECTED COMMAND:
 python upload_to_hf.py patentbert_conversion/pytorch_model ZoeYou/patentbert-pytorch xxxxx


๐ŸŽฏ COMPLETE TENSORFLOW โ†’ PYTORCH CONVERSION GUIDE

๐Ÿ“‹ 4-step process:

1๏ธโƒฃ **DOWNLOAD** (in this notebook)
 โ€ข Run previous cells to download PatentBERT
 โ€ข Model will be in ./

2๏ธโƒฃ **EXTRACTION** (in this notebook)
 โ€ข Run TensorFlow weight extraction cell
 โ€ข Weights will be extracted to /tmp/patentbert_conversion/tf_weights/

3๏ธโƒฃ **CONVERSION** (Python 3.8+ environment)
 ```
 bash /tmp/install_pytorch_env.sh
 source patentbert_pytorch/bin/activate
 python /tmp/convert_patentbert.py /tmp/patentbert_conversion/tf_weights /tmp/patentbert_conversion/pytorch_model
 ```

4๏ธโƒฃ **TEST AND UPLOAD**

 `python /tmp/test_patentbert.py /tmp/patentbert_conversion/pytorch_model`

 `python /tmp/upload_to_hf.py /tmp/patentbert_conversion/pytorch_model username/patentbert-pytorch your_hf_token`

๐ŸŽ‰ RESULT:
โ€ข PyTorch model ready for production
โ€ข Compatible with Hugging Face Transformers
โ€ข Publicly available on Hub
โ€ข Documentation and examples included

๐Ÿ’ก TIP:
First create an account at https://huggingface.co/ and get your access token
from https://huggingface.co/settings/tokens


In [None]:
# ๐Ÿท๏ธ ADDING CLASS LABELS - Essential for prediction interpretation

print("๐Ÿท๏ธ Creating and adding CPC class labels...")

# Load the REAL CPC labels from the original PatentBERT label file
import pandas as pd
import json
import os

# Load the real CPC labels
label_file_path = "./labels_group_id.tsv"
cpc_df = pd.read_csv(label_file_path, sep='\t')

print(f"โœ… Loaded {len(cpc_df)} real CPC labels from PatentBERT")
print(f"๐Ÿ“ Example labels from the real data:")
for i in [0, 50, 100, 200, 300, 400, 500, 600, 655]:
 if i < len(cpc_df):
 row = cpc_df.iloc[i]
 print(f" {i:3d}: {row['id']} - {row['title'][:80]}...")

# Extract labels and descriptions
cpc_labels = cpc_df['id'].tolist()
cpc_descriptions = [f"{row['id']}: {row['title']}" for _, row in cpc_df.iterrows()]

print(f"\nโœ… Real CPC system structure:")
print(f" ๐Ÿ“Š Total classes: {len(cpc_labels)}")

# Analyze the actual distribution by section
section_counts = {}
for label in cpc_labels:
 section = label[0]
 section_counts[section] = section_counts.get(section, 0) + 1

print(f" ๐Ÿ“ˆ Distribution by section:")
for section, count in sorted(section_counts.items()):
 print(f" {section}: {count} classes")

# Create label configuration file
label_config = {
 "id2label": {str(i): label for i, label in enumerate(cpc_labels)},
 "label2id": {label: i for i, label in enumerate(cpc_labels)},
 "num_labels": len(cpc_labels),
 "classification_type": "CPC",
 "description": "Real Cooperative Patent Classification (CPC) labels from PatentBERT training data"
}

# Save to model directory
model_dir = "/tmp/patentbert_conversion/pytorch_model"
labels_file = os.path.join(model_dir, "labels.json")

with open(labels_file, 'w', encoding='utf-8') as f:
 json.dump(label_config, f, indent=2, ensure_ascii=False)

print(f"โœ… Labels saved to: {labels_file}")

# Update model configuration to include labels
config_file = os.path.join(model_dir, "config.json")

if os.path.exists(config_file):
 with open(config_file, 'r') as f:
 config = json.load(f)
 
 # Add labels to config
 config["id2label"] = label_config["id2label"]
 config["label2id"] = label_config["label2id"]
 
 # Save updated config
 with open(config_file, 'w', encoding='utf-8') as f:
 json.dump(config, f, indent=2, ensure_ascii=False)
 
 print("โœ… Configuration updated with real CPC labels")
else:
 print("โš ๏ธ config.json file not found")

# Create detailed README with REAL CPC labels and distribution
section_descriptions = {
 'A': 'Human Necessities - Agriculture, Food, Health, Sports',
 'B': 'Performing Operations; Transporting - Manufacturing, Transport',
 'C': 'Chemistry; Metallurgy - Chemical processes, Materials',
 'D': 'Textiles; Paper - Fibers, Fabrics, Paper-making',
 'E': 'Fixed Constructions - Building, Mining, Roads',
 'F': 'Mechanical Engineering; Lightning; Heating; Weapons; Blasting',
 'G': 'Physics - Optics, Acoustics, Computing, Measuring',
 'H': 'Electricity - Electronics, Power generation, Communication',
 'Y': 'General Tagging of New Technological Developments'
}

readme_with_labels = f"""# PatentBERT - PyTorch

BERT model specialized for patent classification using the **real CPC (Cooperative Patent Classification) system** from the original PatentBERT training data.

## ๐Ÿ“Š Specifications

- **Output classes**: {len(cpc_labels)} (real CPC labels)
- **Classification system**: CPC (Cooperative Patent Classification)
- **Architecture**: BERT-base (768 hidden, 12 layers, 12 attention heads)
- **Vocabulary**: 30,522 tokens
- **Format**: SafeTensors

## ๐Ÿท๏ธ CPC Classes (Real Distribution)

The model predicts classes according to the authentic CPC system used in PatentBERT training:

### Main Sections (Actual Counts)
"""

# Add real distribution to README
for section in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Y']:
 if section in section_counts:
 count = section_counts[section]
 desc = section_descriptions.get(section, f'Section {section}')
 readme_with_labels += f"- **{section} ({count} classes)**: {desc}\n"

readme_with_labels += f"""
### Example Real Classes

- `A01B`: SOIL WORKING IN AGRICULTURE OR FORESTRY
- `B25J`: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- `C07D`: HETEROCYCLIC COMPOUNDS
- `G06F`: ELECTRIC DIGITAL DATA PROCESSING
- `H04L`: TRANSMISSION OF DIGITAL INFORMATION

## ๐Ÿš€ Usage

```python
from transformers import BertForSequenceClassification, BertTokenizer
import json
import torch

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')

# Inference example
text = "A method for producing synthetic materials with enhanced thermal properties..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

with torch.no_grad():
 outputs = model(**inputs)
 predictions = outputs.logits.softmax(dim=-1)

# Get prediction
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()

# Use model labels (real CPC codes)
predicted_label = model.config.id2label[predicted_class_id]


print(f"Predicted CPC class: {{predicted_label}} (ID: {{predicted_class_id}})")
print(f"Confidence: {{confidence:.2%}}")
```

## ๐Ÿ“ Included Files

- `model.safetensors`: Model weights (420 MB)
- `config.json`: Configuration with integrated real CPC labels
- `vocab.txt`: Tokenizer vocabulary
- `tokenizer_config.json`: Tokenizer configuration
- `labels.json`: Complete real CPC label mapping ({len(cpc_labels)} authentic labels)
- `README.md`: This documentation

## ๐Ÿ”ฌ Performance

This model was trained on a large patent corpus to automatically classify documents according to the real CPC system, using the exact same {len(cpc_labels)} CPC codes from the original PatentBERT training data.

## ๐Ÿ“– References

- [Cooperative Patent Classification (CPC)](https://www.cooperativepatentclassification.org/)
- [Original PatentBERT Paper](https://arxiv.org/abs/2103.02557)

## ๐Ÿ“ Citation

If you use this model, please cite the original PatentBERT work and mention this PyTorch conversion.
"""

# Save updated README
readme_file = os.path.join(model_dir, "README.md")
with open(readme_file, 'w', encoding='utf-8') as f:
 f.write(readme_with_labels)

print("โœ… README updated with REAL CPC label documentation")

# Summary of created/updated files
print("\n๐Ÿ“ Added/updated files:")
print(f" โ€ข labels.json - Complete mapping of {len(cpc_labels)} REAL CPC labels")
print(f" โ€ข config.json - Updated configuration with authentic id2label/label2id")
print(f" โ€ข README.md - Complete documentation with real CPC distribution")

print("\n๐ŸŽฏ Model is now ready for upload with AUTHENTIC CPC labels!")

๐Ÿท๏ธ Creating and adding CPC class labels...
โœ… Loaded 656 real CPC labels from PatentBERT
๐Ÿ“ Example labels from the real data:
 0: A01B - SOIL WORKING IN AGRICULTURE OR FORESTRY; PARTS, DETAILS, OR ACCESSORIES OF AGRIC...
 50: A46B - BRUSHES ...
 100: B07B - SEPERATING SOLIDS FROM SOLIDS BY SIEVING, SCREENING, OR SIFTING OR BY USING GAS ...
 200: B60Q - ARRANGEMENT OF SIGNALLING OR LIGHTING DEVICES, THE MOUNTING OR SUPPORTING THEREO...
 300: C10F - DRYING OR WORKING-UP OF PEAT...
 400: E04G - SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR OTHER BUILDING AIDS, OR T...
 500: F28B - STEAM OR VAPOUR CONDENSERS ...
 600: H01H - ELECTRIC SWITCHES; RELAYS; SELECTORS...
 655: Y10T - TECHNICAL SUBJECTS COVERED BY FORMER US CLASSIFICATION...

โœ… Real CPC system structure:
 ๐Ÿ“Š Total classes: 656
 ๐Ÿ“ˆ Distribution by section:
 A: 84 classes
 B: 171 classes
 C: 88 classes
 D: 40 classes
 E: 31 classes
 F: 101 classes
 G: 81 classes
 H: 51 classes
 Y: 9 classes
โœ… Labels saved t

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')

# Inference example
text = "A device designed to spin in a user's hands may include a body with a centrally mounted ball bearing positioned within a center orifice of the body, wherein an outer race of the ball bearing is attached to the frame; a button made of a pair of bearing caps attached to one another through the ball bearing and clamped against an inner race of the ball bearing, such that when the button is held between a user's thumb and finger, the body freely rotates about the ball bearing; and a plurality of weights distributed at opposite ends of the body, creating at least a bipolar weight distribution."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

with torch.no_grad():
 outputs = model(**inputs)
 predictions = outputs.logits.softmax(dim=-1)

# Get prediction
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()

# Use model labels (real CPC codes)
predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
print(f"Confidence: {confidence:.2%}")


Predicted CPC class: A63B (ID: 76)
Confidence: 99.51%


In [7]:
model.config.id2label[76]

'A63B'