# Finance BPE Tokenizer (Fine-tuned on Finance-Instruct-500k)

## Model Overview
This repository contains a Byte-Pair Encoding (BPE) tokenizer fine-tuned on the Finance-Instruct-500k dataset, starting from the base model yakul259/english-bpe-tokenizer-60k.
It is tailored for financial domain text processing, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.
**Key Features:**
- Custom `<cls>` and `<sep>` special tokens.
- BPE subword segmentation optimized for financial vocabulary.
- Template-based post-processing for both single and paired sequences.
- Decoding configured with the BPE decoder for accurate reconstruction of financial text.
## Training Details

### Dataset
- Name: Finance-Instruct-500k
- Source: financial-domain prompts, completions, and instructions.
- Split Used: `train`
- Size: 500,000 instruction-based samples
- Loading Method: streaming mode for efficient processing (see the sketch below this list).
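As a sketch, the streaming load might look like the following; the dataset repo id is an assumption, so substitute the actual source:

```python
from datasets import load_dataset

# Stream the corpus so all 500k samples never need to fit in memory.
# The repo id below is an assumption; point it at the actual dataset source.
ds = load_dataset("Josephgflowers/Finance-Instruct-500k", split="train", streaming=True)

# Peek at a few streamed records without downloading the full dataset.
for example in ds.take(3):
    print(example)
```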
### Tokenizer Configuration
- Model Type: Byte-Pair Encoding (BPE)
- Vocabulary Size: 30,000 (optimized for finance-specific tasks)
- Lowercasing: enabled
- Special Tokens:
  - `<cls>`: classification token
  - `<sep>`: separator token
  - `<unk>`: unknown token
  - `<pad>`: padding token
  - `<mask>`: masking token (MLM tasks)
- Post-Processing Template:
  - Single sequence: `$A:0 <sep>:0 <cls>:2`
  - Paired sequences: `$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2`
- Decoder: BPE decoder for reconstructing the original text (a configuration sketch follows below).
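The template and decoder above can be expressed with the `tokenizers` API. A minimal sketch, assuming the repository ships a `tokenizer.json` loadable via `Tokenizer.from_pretrained`, with special-token ids looked up from the vocabulary rather than hard-coded:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.processors import TemplateProcessing

# Load the tokenizer (repo id taken from this model card).
tokenizer = Tokenizer.from_pretrained("yakul259/finance-bpe-tokenizer-30k")

# Attach the post-processing template from the configuration above.
# Special-token ids are read from the vocabulary instead of being hard-coded.
tokenizer.post_processor = TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[
        ("<sep>", tokenizer.token_to_id("<sep>")),
        ("<cls>", tokenizer.token_to_id("<cls>")),
    ],
)

# Use the BPE decoder so decode() reconstructs the original surface text.
# (Whether the model uses the default "</w>" end-of-word suffix depends on
# how the base tokenizer was built.)
tokenizer.decoder = decoders.BPEDecoder()
```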
### Training Method
- Base Model: `yakul259/english-bpe-tokenizer-60k`
- Corpus Source: Finance-Instruct-500k
- Batch Size: 1,000 lines per batch
- Trainer: `BpeTrainer` from the Hugging Face `tokenizers` library
- Special Tokens Added: `<cls>`, `<sep>`, `<unk>`, `<pad>`, `<mask>` (see the training sketch below)
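A minimal sketch of that retraining loop, assuming `ds` is the streaming dataset from the earlier snippet and that the raw text lives in a `text` field (the field name is an assumption; adjust it to the dataset schema):

```python
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

# Load the base tokenizer; train_from_iterator retrains the BPE vocab/merges
# while keeping the base tokenizer's normalizer and pre-tokenizer pipeline.
tokenizer = Tokenizer.from_pretrained("yakul259/english-bpe-tokenizer-60k")

trainer = BpeTrainer(
    vocab_size=30_000,
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
)

def batch_iterator(dataset, batch_size=1000):
    """Yield batches of 1,000 raw text lines from the streaming dataset."""
    batch = []
    for example in dataset:
        batch.append(example["text"])  # field name is an assumption
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

tokenizer.train_from_iterator(batch_iterator(ds), trainer=trainer)
tokenizer.save("finance-bpe-tokenizer-30k.json")
```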
## Intended Uses & Limitations

### Intended Uses
- Pre-tokenization for financial LLMs (see the usage sketch below this list).
- Downstream financial NLP tasks:
  - Financial question answering
  - Document parsing
  - Financial news summarization
  - Risk assessment chatbots
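For instance, a usage sketch that loads the tokenizer from the Hub (this assumes the repository ships a `tokenizer.json` readable by `Tokenizer.from_pretrained`; the example texts are illustrative):

```python
from tokenizers import Tokenizer

# Load the tokenizer from the Hub and encode a question/context pair.
tok = Tokenizer.from_pretrained("yakul259/finance-bpe-tokenizer-30k")

enc = tok.encode(
    "What drove Q3 revenue growth?",
    "Revenue rose 12% on higher net interest income.",
)
print(enc.tokens)    # subword pieces plus the <sep>/<cls> tokens from the template
print(enc.type_ids)  # 0 for the first sequence, 1 for the second, 2 for <cls>
```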
### Limitations
- Optimized for English financial text; performance may drop outside the finance domain.
- May reflect biases present in the financial data used for training.
## License

This tokenizer is released under the MIT License.
## Citation

If you use this tokenizer, please cite:

    @misc{finance_bpe_tokenizer_30k,
      title     = {Finance BPE Tokenizer Fine-tuned on Finance-Instruct-500k},
      author    = {yakul259},
      year      = {2025},
      publisher = {Hugging Face}
    }