bert-base-multilingual-uncased-finetuned-postal-can

This model is a fine-tuned version of google-bert/bert-base-multilingual-uncased, trained on 15+ million Canadian postal addresses from OpenAddresses.io.

Model description

  • The model performs token classification, i.e. it parses a string representing a Canadian address into its constituent address components, such as street number, street name, apartment/suite/unit number, etc.
  • Output labels (address components):
    • O, STREET_NB, STREET_NAME, UNIT, CITY, REGION, POSTCODE
  • Demo: Canadian postal address parsing
  • Code: didierguillevic/postal_address_canada_parsing

Usage

Sample usage:

from transformers import pipeline

model_checkpoint = "Didier/bert-base-multilingual-uncased-finetuned-postal-can"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

text = "405-200 René Lévesque Blvd W,  Montreal, Quebec H2Z 1X4"
# The base model is uncased: lowercase the input to match the training data.
text = text.lower()
results = token_classifier(text)

Results:

- Input: "405-200 René Lévesque Blvd W,  Montreal, Quebec H2Z 1X4"
- Output:
  - UNIT: 405
  - STREET_NB: 200
  - STREET_NAME: rene levesque blvd w
  - CITY: montreal
  - REGION: quebec
  - POSTCODE: h2z 1x4
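With aggregation_strategy="simple", the pipeline returns a list of dicts (with entity_group, word, score, start and end keys). A minimal post-processing sketch, assuming that aggregated output format, which collapses the list into one value per address component (the helper name group_components is illustrative, not part of the model):

def group_components(results):
    """Collapse aggregated pipeline output into one string per label."""
    components = {}
    for entity in results:
        label = entity["entity_group"]
        # Concatenate spans that share a label (e.g. a street name split into pieces).
        components[label] = (components.get(label, "") + " " + entity["word"]).strip()
    return components

print(group_components(results))
# e.g. {'UNIT': '405', 'STREET_NB': '200', 'STREET_NAME': 'rene levesque blvd w', ...}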

Intended uses & limitations

Usage:

  • Given a string representing a Canadian postal address, the model classifies each token into one of the address component labels.

Current limitations:

  • No label for person_name / company_name (no training data available for these).
  • Trained on post-normalized addresses from OpenAddresses.io, so un-normalized forms are missing: the training data contains e.g. "ST" (for street) but not "street" or "str." (see the sketch below for a possible inference-time workaround).
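One possible inference-time workaround, sketched below under the assumption that spelled-out street types are the main source of mismatch: abbreviate them before calling the pipeline so inputs resemble the normalized training data. The mapping is illustrative, not exhaustive:

import re

# Illustrative mapping only; extend for your data.
STREET_TYPE_ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "boulevard": "blvd",
    "road": "rd",
    "drive": "dr",
}

def abbreviate_street_types(address: str) -> str:
    """Replace spelled-out street types with the abbreviations seen in training."""
    for long_form, abbrev in STREET_TYPE_ABBREVIATIONS.items():
        address = re.sub(rf"\b{long_form}\b", abbrev, address, flags=re.IGNORECASE)
    return address

print(abbreviate_street_types("200 rené lévesque boulevard w"))
# -> "200 rené lévesque blvd w"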

Potential enhancements:

  • Additional de-normalization of the training data
  • Addition of person / company names to the training data
  • Post-processing of results (e.g. grouping per label, as sketched under Usage above)

Training and evaluation data

15+ million Canadian postal addresses from OpenAddresses.io.
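The exact preprocessing is not documented here; the sketch below only illustrates how labeled token sequences could be assembled from an OpenAddresses CSV export. The column names follow the common OpenAddresses schema and are an assumption:

import csv

def rows_to_examples(path):
    """Yield (tokens, labels) pairs from an OpenAddresses CSV export.

    Column names (UNIT, NUMBER, STREET, CITY, REGION, POSTCODE) follow the
    common OpenAddresses schema; the preprocessing actually used may differ.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            parts = [
                (row.get("UNIT", ""), "UNIT"),
                (row.get("NUMBER", ""), "STREET_NB"),
                (row.get("STREET", ""), "STREET_NAME"),
                (row.get("CITY", ""), "CITY"),
                (row.get("REGION", ""), "REGION"),
                (row.get("POSTCODE", ""), "POSTCODE"),
            ]
            tokens, labels = [], []
            for value, label in parts:
                for token in value.lower().split():
                    tokens.append(token)
                    labels.append(label)
            if tokens:
                yield tokens, labels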

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
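For reference, these settings map onto transformers.TrainingArguments roughly as follows (a sketch, not the actual training script; output_dir is a placeholder, and the Adam settings shown are the library defaults):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-uncased-finetuned-postal-can",  # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    # The Adam betas/epsilon below are the Transformers defaults, listed for clarity.
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)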

Framework versions

  • Transformers 4.44.0
  • Pytorch 2.3.1
  • Datasets 2.20.0
  • Tokenizers 0.19.1