# bert-base-multilingual-uncased-finetuned-postal-can
This model is a fine-tuned version of google-bert/bert-base-multilingual-uncased, trained on more than 15 million Canadian postal addresses from OpenAddresses.io.
## Model description

- The model performs token classification: it parses a string representing a Canadian address into its constituent address components, such as street number, street name, apartment/suite/unit number, etc.
- Output labels (address components):
  - O, STREET_NB, STREET_NAME, UNIT, CITY, REGION, POSTCODE
- Demo: Canadian postal address parsing
- Code: didierguillevic/postal_address_canada_parsing
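For fine-tuning or decoding raw logits without the pipeline, a label-to-id mapping is needed. The card lists the labels but not the id order or whether a B-/I- (BIO) scheme is used, so the mapping below is a minimal sketch under the assumption of plain (non-BIO) labels in the order shown above:

```python
# Hypothetical id mapping: the label set comes from the card, but the id
# order and tagging scheme are assumptions, not confirmed by the card.
LABELS = ["O", "STREET_NB", "STREET_NAME", "UNIT", "CITY", "REGION", "POSTCODE"]

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
```

Such dictionaries are what `AutoModelForTokenClassification` expects in its config (`label2id` / `id2label`) when training a model like this one.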
## Usage

Sample usage:

```python
from transformers import pipeline

model_checkpoint = "Didier/bert-base-multilingual-uncased-finetuned-postal-can"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

text = "405-200 René Lévesque Blvd W, Montreal, Quebec H2Z 1X4"
text = text.lower()  # the base model is uncased
results = token_classifier(text)
```
Results:

- Input: "405-200 René Lévesque Blvd W, Montreal, Quebec H2Z 1X4"
- Output:
  - UNIT: 405
  - STREET_NB: 200
  - STREET_NAME: rene levesque blvd w
  - CITY: montreal
  - REGION: quebec
  - POSTCODE: h2z 1x4
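With `aggregation_strategy="simple"`, the pipeline returns a list of entity dicts (with `entity_group` and `word` keys) rather than the flat component view shown above. A small helper — hypothetical, not part of the model's own code — can fold that list into a component dictionary; the sample data below is illustrative:

```python
def group_components(entities):
    """Fold token-classification pipeline output (aggregation_strategy="simple")
    into a dict mapping each address component label to its text span.
    Spans that share a label are joined with a space."""
    components = {}
    for ent in entities:
        label = ent["entity_group"]
        word = ent["word"].strip()
        if label in components:
            components[label] += " " + word
        else:
            components[label] = word
    return components

# Illustrative data shaped like the pipeline's output (scores are made up):
sample = [
    {"entity_group": "UNIT", "word": "405", "score": 0.99},
    {"entity_group": "STREET_NB", "word": "200", "score": 0.99},
    {"entity_group": "STREET_NAME", "word": "rene levesque blvd w", "score": 0.98},
    {"entity_group": "CITY", "word": "montreal", "score": 0.99},
]
print(group_components(sample))
```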
## Intended uses & limitations

Usage:

- Given a string representing a Canadian postal address, the model classifies each token into one of the address component labels.

(Current) limitations:

- No label for person or company names (no training data for them).
- Trained on the normalized addresses published by OpenAddresses.io, so un-normalized forms are missing from the training data: e.g. "ST" (for street) appears, but "street", "str.", etc. do not.

Possible enhancements:

- Additional de-normalization of the training data
- Addition of person / company names to the training data
- Post-processing of the results
## Training and evaluation data

More than 15 million Canadian postal addresses from OpenAddresses.io.
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
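These hyperparameters match the defaults of a standard Hugging Face `Trainer` setup. The card does not include the training script, so the following is only a sketch of how the listed values map onto `TrainingArguments`; the `output_dir` is an assumption:

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; output_dir is a placeholder.
args = TrainingArguments(
    output_dir="bert-base-multilingual-uncased-finetuned-postal-can",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1,
)
```

The Adam betas and epsilon listed above are the `Trainer` defaults, so they need no explicit arguments here.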
### Framework versions
- Transformers 4.44.0
- Pytorch 2.3.1
- Datasets 2.20.0
- Tokenizers 0.19.1