bert-base-multilingual-uncased-finetuned-postal-can

This model is a fine-tuned version of google-bert/bert-base-multilingual-uncased, trained on 15+ million Canadian postal addresses from OpenAddresses.io.

Model description

  • The model performs token classification, i.e. it parses a string representing a Canadian address into its constituent address components, such as street number, street name, apartment/suite/unit number, etc.
  • Output labels (address components):
    • O, STREET_NB, STREET_NAME, UNIT, CITY, REGION, POSTCODE
  • Demo: Canadian postal address parsing
  • Code: didierguillevic/postal_address_canada_parsing

Usage

Sample usage:

from transformers import pipeline

model_checkpoint = "Didier/bert-base-multilingual-uncased-finetuned-postal-can"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

text = "405-200 René Lévesque Blvd W,  Montreal, Quebec H2Z 1X4"
# The base model is uncased: lowercase the input to match the training data.
text = text.lower()
results = token_classifier(text)

Results:

- Input: "405-200 René Lévesque Blvd W,  Montreal, Quebec H2Z 1X4"
- Output:
  - UNIT: 405
  - STREET_NB: 200
  - STREET_NAME: rene levesque blvd w
  - CITY: montreal
  - REGION: quebec
  - POSTCODE: h2z 1x4
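With aggregation_strategy="simple", the pipeline returns a list of dicts (with entity_group, word, score, start and end keys). A minimal post-processing sketch, assuming that aggregated output format, which collapses the list into one value per address component (the helper name group_components is illustrative, not part of the model):

def group_components(results):
    """Collapse aggregated pipeline output into one string per label."""
    components = {}
    for entity in results:
        label = entity["entity_group"]
        # Concatenate spans that share a label (e.g. a street name split into pieces).
        components[label] = (components.get(label, "") + " " + entity["word"]).strip()
    return components

print(group_components(results))
# e.g. {'UNIT': '405', 'STREET_NB': '200', 'STREET_NAME': 'rene levesque blvd w', ...}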

Intended uses & limitations

Usage:

  • Given a string representing a Canadian postal address, the model classifies each token into one of the address component labels.

Current limitations:

  • No label for person_name / company_name (no training data available for these).
  • Trained on post-normalized addresses from OpenAddresses.io, so un-normalized forms are missing: the training data contains e.g. "ST" (for street) but not "street" or "str." (see the sketch below for a possible inference-time workaround).
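One possible inference-time workaround, sketched below under the assumption that spelled-out street types are the main source of mismatch: abbreviate them before calling the pipeline so inputs resemble the normalized training data. The mapping is illustrative, not exhaustive:

import re

# Illustrative mapping only; extend for your data.
STREET_TYPE_ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "boulevard": "blvd",
    "road": "rd",
    "drive": "dr",
}

def abbreviate_street_types(address: str) -> str:
    """Replace spelled-out street types with the abbreviations seen in training."""
    for long_form, abbrev in STREET_TYPE_ABBREVIATIONS.items():
        address = re.sub(rf"\b{long_form}\b", abbrev, address, flags=re.IGNORECASE)
    return address

print(abbreviate_street_types("200 rené lévesque boulevard w"))
# -> "200 rené lévesque blvd w"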

Potential enhancements:

  • Additional de-normalization of the training data
  • Addition of person / company names to the training data
  • Post-processing of results (e.g. grouping per label, as sketched under Usage above)

Training and evaluation data

15+ million Canadian postal addresses from OpenAddresses.io.
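The exact preprocessing is not documented here; the sketch below only illustrates how labeled token sequences could be assembled from an OpenAddresses CSV export. The column names follow the common OpenAddresses schema and are an assumption:

import csv

def rows_to_examples(path):
    """Yield (tokens, labels) pairs from an OpenAddresses CSV export.

    Column names (UNIT, NUMBER, STREET, CITY, REGION, POSTCODE) follow the
    common OpenAddresses schema; the preprocessing actually used may differ.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            parts = [
                (row.get("UNIT", ""), "UNIT"),
                (row.get("NUMBER", ""), "STREET_NB"),
                (row.get("STREET", ""), "STREET_NAME"),
                (row.get("CITY", ""), "CITY"),
                (row.get("REGION", ""), "REGION"),
                (row.get("POSTCODE", ""), "POSTCODE"),
            ]
            tokens, labels = [], []
            for value, label in parts:
                for token in value.lower().split():
                    tokens.append(token)
                    labels.append(label)
            if tokens:
                yield tokens, labels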

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
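For reference, these settings map onto transformers.TrainingArguments roughly as follows (a sketch, not the actual training script; output_dir is a placeholder, and the Adam settings shown are the library defaults):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-multilingual-uncased-finetuned-postal-can",  # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    # The Adam betas/epsilon below are the Transformers defaults, listed for clarity.
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)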

Framework versions

  • Transformers 4.44.0
  • Pytorch 2.3.1
  • Datasets 2.20.0
  • Tokenizers 0.19.1