---
language: en
tags:
- tokenizer
- bpe
- NLP
- finance
license: mit
datasets:
- Josephgflowers/Finance-Instruct-500k
library_name: tokenizers
base_model:
- yakul259/english-bpe-tokenizer-60k
---
# **Finance BPE Tokenizer (Fine-tuned on Finance-Instruct-500k)**
## **Model Overview**
This repository contains a **Byte-Pair Encoding (BPE) tokenizer** fine-tuned on the **Finance-Instruct-500k** dataset, starting from the base model **yakul259/english-bpe-tokenizer-60k**.
It is tailored for financial domain text processing, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.
**Key Features:**
- Custom `<cls>` and `<sep>` special tokens.
- BPE subword segmentation optimized for financial vocabulary.
- Template-based post-processing for both single and paired sequences.
- Configured decoding using the BPE decoder for accurate reconstruction of financial text.
---
## **Training Details**
### **Dataset**
- **Name:** [Finance-Instruct-500k](https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k)
- **Source:** Financial domain prompts, completions, and instructions.
- **Split Used:** `train`
- **Size:** 500,000 instruction-based samples
- **Loading Method:** Streaming mode, so the full dataset is never held in memory at once.
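The streaming-plus-batching pattern described above can be sketched as follows. This is a minimal, self-contained sketch: the dataset field name in the commented-out Hub call is an assumption, and a small in-memory generator stands in for the streamed dataset so the example runs offline.

```python
from itertools import islice

def batch_iterator(lines, batch_size=1000):
    """Yield lists of up to `batch_size` lines from any (possibly streamed) iterable."""
    it = iter(lines)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# In the real pipeline the iterable would come from the Hub, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("Josephgflowers/Finance-Instruct-500k", split="train", streaming=True)
#   lines = (row["output"] for row in ds)  # the field name here is an assumption
# A small in-memory stand-in keeps this sketch self-contained:
lines = (f"sample financial line {i}" for i in range(2500))
sizes = [len(b) for b in batch_iterator(lines, batch_size=1000)]
print(sizes)  # [1000, 1000, 500]
```

Because the source iterable is consumed lazily, this pattern works identically whether the lines come from memory or from a streamed Hub dataset.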
### **Tokenizer Configuration**
- **Model Type:** Byte-Pair Encoding (BPE)
- **Vocabulary Size:** *30,000* (optimized for finance-specific tasks)
- **Lowercasing:** Enabled
- **Special Tokens:**
  - `<cls>` – Classification token
  - `<sep>` – Separator token
  - `<unk>` – Unknown token
  - `<pad>` – Padding token
  - `<mask>` – Masking token (MLM tasks)
- **Post-Processing Template:**
- **Single Sequence:** `$A:0 <sep>:0 <cls>:2`
- **Paired Sequences:** `$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2`
- **Decoder:** BPE decoder for reconstructing original text.
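The configuration above can be reproduced with the Hugging Face `tokenizers` API. The sketch below is illustrative, not the exact training script: the toy corpus and the shrunken vocabulary size are assumptions made so it runs in seconds, while the special tokens and post-processing templates are taken verbatim from this card.

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors, trainers

# Rebuild the documented configuration on a toy corpus (vocab shrunk for the demo).
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=500,  # the real tokenizer uses 30,000
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
)
corpus = ["Net income rose 12% year over year.", "The bond yield curve inverted."] * 50
tokenizer.train_from_iterator(corpus, trainer)

# Post-processing templates exactly as documented in this card.
tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[
        ("<sep>", tokenizer.token_to_id("<sep>")),
        ("<cls>", tokenizer.token_to_id("<cls>")),
    ],
)
tokenizer.decoder = decoders.BPEDecoder()

enc = tokenizer.encode("net income rose")
print(enc.tokens[-1])  # the template appends <cls> last, with type id 2
```

Note the slightly unusual template: `<sep>` and `<cls>` are appended after the sequence rather than wrapping it, and `<cls>` carries type id 2, so downstream models must be configured to expect this layout.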
### **Training Method**
- **Base Model:** [yakul259/english-bpe-tokenizer-60k](https://huggingface.co/yakul259/english-bpe-tokenizer-60k)
- **Corpus Source:** [Finance-Instruct-500k](https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k)
- **Batch Size:** 1000 lines per batch
- **Trainer:** `BpeTrainer` from Hugging Face `tokenizers` library
- **Special Tokens Added:** `<cls>`, `<sep>`, `<unk>`, `<pad>`, `<mask>`
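Putting the training steps together, the sketch below shows a batched `BpeTrainer` run followed by a save/reload round-trip. The corpus, batch helper, and file path are illustrative assumptions; `train_from_iterator` accepts an iterator that yields either individual strings or whole batches of strings, which is what makes the 1000-lines-per-batch setup work.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in corpus; the real run streams Finance-Instruct-500k instead.
corpus = ["Quarterly revenue beat analyst estimates."] * 3000

def batches(lines, size=1000):
    """Yield the corpus in fixed-size batches, mirroring the setup above."""
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=300,  # shrunk for the demo; the card documents 30,000
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
)
tokenizer.train_from_iterator(batches(corpus), trainer)

# Persist and reload to verify the tokenizer round-trips.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tokenizer.save(path)
reloaded = Tokenizer.from_file(path)
```

The saved `tokenizer.json` is the single artifact this repository distributes, so the reload step is the same call a downstream user would make.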
---
## **Intended Uses & Limitations**
### Intended Uses
- Tokenization for financial LLM training and inference pipelines.
- Downstream financial NLP tasks:
- Financial question answering
- Document parsing
- Financial news summarization
- Risk assessment chatbots
### Limitations
- Optimized for English financial text; performance may drop outside the finance domain.
- May reflect biases present in the financial data used for training.
---
## **License**
This tokenizer is released under the **MIT License**.
---
## **Citation**
If you use this tokenizer, please cite:
- **Title:** Finance BPE Tokenizer Fine-tuned on Finance-Instruct-500k
- **Author:** yakul259
- **Year:** 2025
- **Publisher:** Hugging Face