# Finance BPE Tokenizer (Fine-tuned on Finance-Instruct-500k)

## Model Overview
This repository contains a Byte-Pair Encoding (BPE) tokenizer fine-tuned on the Finance-Instruct-500k dataset, starting from the base model yakul259/english-bpe-tokenizer-60k.
It is tailored for financial domain text processing, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.
**Key Features:**
- Custom `<cls>` and `<sep>` special tokens.
- BPE subword segmentation optimized for financial vocabulary.
- Template-based post-processing for both single and paired sequences.
- Decoding configured with the BPE decoder for accurate reconstruction of financial text.
## Training Details

### Dataset
- Name: Finance-Instruct-500k
- Source: financial-domain prompts, completions, and instructions.
- Split Used: `train`
- Size: 500,000 instruction-based samples
- Loading Method: streaming mode for efficient processing (see the sketch below this list).
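As a sketch, the streaming load might look like the following; the dataset repo id is an assumption, so substitute the actual source:

```python
from datasets import load_dataset

# Stream the corpus so all 500k samples never need to fit in memory.
# The repo id below is an assumption; point it at the actual dataset source.
ds = load_dataset("Josephgflowers/Finance-Instruct-500k", split="train", streaming=True)

# Peek at a few streamed records without downloading the full dataset.
for example in ds.take(3):
    print(example)
```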
### Tokenizer Configuration
- Model Type: Byte-Pair Encoding (BPE)
- Vocabulary Size: 30,000 (optimized for finance-specific tasks)
- Lowercasing: enabled
- Special Tokens:
  - `<cls>`: classification token
  - `<sep>`: separator token
  - `<unk>`: unknown token
  - `<pad>`: padding token
  - `<mask>`: masking token (MLM tasks)
- Post-Processing Template:
  - Single sequence: `$A:0 <sep>:0 <cls>:2`
  - Paired sequences: `$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2`
- Decoder: BPE decoder for reconstructing the original text (a configuration sketch follows below).
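The template and decoder above can be expressed with the `tokenizers` API. A minimal sketch, assuming the repository ships a `tokenizer.json` loadable via `Tokenizer.from_pretrained`, with special-token ids looked up from the vocabulary rather than hard-coded:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.processors import TemplateProcessing

# Load the tokenizer (repo id taken from this model card).
tokenizer = Tokenizer.from_pretrained("yakul259/finance-bpe-tokenizer-30k")

# Attach the post-processing template from the configuration above.
# Special-token ids are read from the vocabulary instead of being hard-coded.
tokenizer.post_processor = TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[
        ("<sep>", tokenizer.token_to_id("<sep>")),
        ("<cls>", tokenizer.token_to_id("<cls>")),
    ],
)

# Use the BPE decoder so decode() reconstructs the original surface text.
# (Whether the model uses the default "</w>" end-of-word suffix depends on
# how the base tokenizer was built.)
tokenizer.decoder = decoders.BPEDecoder()
```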
### Training Method
- Base Model: `yakul259/english-bpe-tokenizer-60k`
- Corpus Source: Finance-Instruct-500k
- Batch Size: 1,000 lines per batch
- Trainer: `BpeTrainer` from the Hugging Face `tokenizers` library
- Special Tokens Added: `<cls>`, `<sep>`, `<unk>`, `<pad>`, `<mask>` (see the training sketch below)
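A minimal sketch of that retraining loop, assuming `ds` is the streaming dataset from the earlier snippet and that the raw text lives in a `text` field (the field name is an assumption; adjust it to the dataset schema):

```python
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

# Load the base tokenizer; train_from_iterator retrains the BPE vocab/merges
# while keeping the base tokenizer's normalizer and pre-tokenizer pipeline.
tokenizer = Tokenizer.from_pretrained("yakul259/english-bpe-tokenizer-60k")

trainer = BpeTrainer(
    vocab_size=30_000,
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
)

def batch_iterator(dataset, batch_size=1000):
    """Yield batches of 1,000 raw text lines from the streaming dataset."""
    batch = []
    for example in dataset:
        batch.append(example["text"])  # field name is an assumption
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

tokenizer.train_from_iterator(batch_iterator(ds), trainer=trainer)
tokenizer.save("finance-bpe-tokenizer-30k.json")
```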
## Intended Uses & Limitations

### Intended Uses
- Pre-tokenization for financial LLMs (see the usage sketch below this list).
- Downstream financial NLP tasks:
  - Financial question answering
  - Document parsing
  - Financial news summarization
  - Risk assessment chatbots
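For instance, a usage sketch that loads the tokenizer from the Hub (this assumes the repository ships a `tokenizer.json` readable by `Tokenizer.from_pretrained`; the example texts are illustrative):

```python
from tokenizers import Tokenizer

# Load the tokenizer from the Hub and encode a question/context pair.
tok = Tokenizer.from_pretrained("yakul259/finance-bpe-tokenizer-30k")

enc = tok.encode(
    "What drove Q3 revenue growth?",
    "Revenue rose 12% on higher net interest income.",
)
print(enc.tokens)    # subword pieces plus the <sep>/<cls> tokens from the template
print(enc.type_ids)  # 0 for the first sequence, 1 for the second, 2 for <cls>
```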
### Limitations
- Optimized for English financial text; performance may drop outside the finance domain.
- May reflect biases present in the financial data used for training.
## License

This tokenizer is released under the MIT License.
## Citation

If you use this tokenizer, please cite:

    @misc{finance_bpe_tokenizer_30k,
      title     = {Finance BPE Tokenizer Fine-tuned on Finance-Instruct-500k},
      author    = {yakul259},
      year      = {2025},
      publisher = {Hugging Face}
    }