Finance BPE Tokenizer (Fine-tuned on Finance-Instruct-500k)

Model Overview

This repository contains a Byte-Pair Encoding (BPE) tokenizer fine-tuned on the Finance-Instruct-500k dataset, starting from the base tokenizer yakul259/english-bpe-tokenizer-60k.
It is tailored for financial domain text processing, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.

Key Features:

  • Custom <cls> and <sep> special tokens.
  • BPE subword segmentation optimized for financial vocabulary.
  • Template-based post-processing for both single and paired sequences.
  • Configured decoding using the BPE decoder for accurate reconstruction of financial text.

Training Details

Dataset

  • Name: Finance-Instruct-500k
  • Source: Financial domain prompts, completions, and instructions.
  • Split Used: train
  • Size: 500,000 instruction-based samples
  • Loading Method: Streaming mode for efficient processing.

Tokenizer Configuration

  • Model Type: Byte-Pair Encoding (BPE)
  • Vocabulary Size: 30,000 (optimized for finance-specific tasks)
  • Lowercasing: Enabled
  • Special Tokens:
    • <cls> – Classification token
    • <sep> – Separator token
    • <unk> – Unknown token
    • <pad> – Padding token
    • <mask> – Masking token (MLM tasks)
  • Post-Processing Template:
    • Single Sequence: $A:0 <sep>:0 <cls>:2
    • Paired Sequences: $A:0 <sep>:0 $B:1 <sep>:1 <cls>:2
  • Decoder: BPE decoder for reconstructing original text.
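To show how these pieces fit together, here is a self-contained sketch that rebuilds the same pipeline (lowercasing, special tokens, post-processing template, BPE decoder) with the `tokenizers` library on a toy two-sentence corpus; the real tokenizer was of course trained on the full dataset, so its vocabulary differs.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers
from tokenizers import trainers, decoders, processors

# Minimal reconstruction of this card's configuration on a toy corpus.
tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.normalizer = normalizers.Lowercase()
tok.pre_tokenizer = pre_tokenizers.Whitespace()
tok.decoder = decoders.BPEDecoder()

trainer = trainers.BpeTrainer(
    vocab_size=300,  # tiny; the card's real target is 30,000
    special_tokens=["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"],
)
corpus = [
    "The central bank raised interest rates by 25 basis points.",
    "Quarterly revenue beat analyst estimates.",
]
tok.train_from_iterator(corpus, trainer)

# Post-processing template from the card: <sep> closes each segment and a
# single <cls> is appended at the end with segment id 2.
tok.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[
        ("<sep>", tok.token_to_id("<sep>")),
        ("<cls>", tok.token_to_id("<cls>")),
    ],
)

enc = tok.encode("Revenue rose", "Rates fell")
print(enc.tokens[-1])    # <cls>
print(enc.type_ids[-1])  # 2
```

Note the unusual placement of `<cls>` at the *end* of the sequence (with segment id 2), rather than the BERT-style position at the front.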

Training Method


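A minimal sketch of how such a tokenizer can be trained with the `tokenizers` library, assuming the streamed split is exposed as a plain text iterator; the generator below is a stand-in for the streamed Finance-Instruct-500k split, and the dataset/field names in the comment are placeholders, not verified against the dataset.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

def text_stream():
    # Stand-in for the streamed dataset, e.g. (field name is a placeholder):
    #   ds = load_dataset("Finance-Instruct-500k", split="train", streaming=True)
    #   for row in ds:
    #       yield row["text"]
    yield "hedge funds increased their exposure to commodities"
    yield "the bond yield curve inverted in the last quarter"

# Same model type and normalization as the configuration above.
tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.normalizer = normalizers.Lowercase()
tok.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # target size from this card; a toy stream yields far fewer merges
    special_tokens=["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"],
)
tok.train_from_iterator(text_stream(), trainer)
print(tok.get_vocab_size())
```

Streaming keeps memory usage flat: the trainer consumes the iterator batch by batch instead of materializing all 500k samples at once.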
Intended Uses & Limitations

Intended Uses

  • Pre-tokenization for financial LLMs.
  • Downstream financial NLP tasks:
    • Financial question answering
    • Document parsing
    • Financial news summarization
    • Risk assessment chatbots

Limitations

  • Optimized for English financial text; performance may degrade on general-domain or non-English input.
  • May reflect biases present in the financial data used for training.

License

This tokenizer is released under the MIT License.


Citation

If you use this tokenizer, please cite:

@misc{yakul259_finance_bpe_tokenizer,
  title     = {Finance BPE Tokenizer Fine-tuned on Finance-Instruct-500k},
  author    = {yakul259},
  year      = {2025},
  publisher = {Hugging Face}
}
