---
language: en
tags:
- tokenizer
- bpe
- NLP
- finance
license: mit
datasets:
- Josephgflowers/Finance-Instruct-500k
library_name: tokenizers
base_model:
- yakul259/english-bpe-tokenizer-60k
---

# **Finance BPE Tokenizer (Fine-tuned on Finance-Instruct-500k)**

## **Model Overview**

This repository contains a **Byte-Pair Encoding (BPE) tokenizer** fine-tuned on the **Finance-Instruct-500k** dataset, starting from the base tokenizer **yakul259/english-bpe-tokenizer-60k**. It is tailored to financial text, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.

**Key Features:**
- Custom `<cls>` and `<sep>` special tokens.
- BPE subword segmentation optimized for financial vocabulary.
- Template-based post-processing for both single and paired sequences.
- BPE decoder configured for accurate reconstruction of financial text (see the usage sketch below).
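
A minimal usage sketch, assuming the `tokenizers` library is installed; the repository id below is a placeholder for wherever this tokenizer is hosted:

```python
from tokenizers import Tokenizer

# Load the fine-tuned tokenizer from the Hub (placeholder repo id).
tok = Tokenizer.from_pretrained("yakul259/finance-bpe-tokenizer-30k")

enc = tok.encode("Q3 EBITDA rose 12% year-over-year.")
print(enc.tokens)    # lowercased subword tokens plus <sep>/<cls>
print(enc.type_ids)  # sequence ids assigned by the post-processing template

# The BPE decoder reconstructs the (lowercased) input text.
print(tok.decode(enc.ids))
```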

---

## **Training Details**

### **Dataset**
- **Name:** [Finance-Instruct-500k](https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k)
- **Source:** Financial-domain prompts, completions, and instructions.
- **Split Used:** `train`
- **Size:** 500,000 instruction-based samples
- **Loading Method:** Streaming mode for efficient processing (sketched below).
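
A sketch of the streaming setup using the `datasets` library; the `"text"` field name is an assumption and should be adjusted to the dataset's actual columns:

```python
from datasets import load_dataset

# Stream the split instead of downloading the full dataset to disk.
ds = load_dataset(
    "Josephgflowers/Finance-Instruct-500k",
    split="train",
    streaming=True,
)

def batch_iterator(dataset, batch_size=1000):
    """Yield lists of raw text samples, 1000 per batch."""
    batch = []
    for example in dataset:
        batch.append(example["text"])  # hypothetical column name
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```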

### **Tokenizer Configuration**
- **Model Type:** Byte-Pair Encoding (BPE)
- **Vocabulary Size:** 30,000 (optimized for finance-specific tasks)
- **Lowercasing:** Enabled
- **Special Tokens:**
  - `<cls>` → classification token
  - `<sep>` → separator token
  - `<unk>` → unknown token
  - `<pad>` → padding token
  - `<mask>` → masking token (for MLM tasks)
- **Post-Processing Template:**
  - **Single sequence:** `$A:0 <sep>:0 <cls>:2`
  - **Paired sequences:** `$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2`
- **Decoder:** BPE decoder for reconstructing the original text (see the sketch below).
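
A sketch of how this configuration maps onto the `tokenizers` API, here applied to a freshly loaded tokenizer; the repo id is again a placeholder, and the special-token ids are resolved from the vocabulary rather than hard-coded:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.processors import TemplateProcessing

tok = Tokenizer.from_pretrained("yakul259/finance-bpe-tokenizer-30k")  # placeholder

# Resolve special-token ids from this tokenizer's vocabulary.
cls_id = tok.token_to_id("<cls>")
sep_id = tok.token_to_id("<sep>")

# XLNet-style template: <cls> comes last and carries type id 2.
tok.post_processor = TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<cls>", cls_id), ("<sep>", sep_id)],
)

# The BPE decoder merges subwords back into continuous text.
tok.decoder = decoders.BPEDecoder()
```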

### **Training Method**
- **Base Model:** [yakul259/english-bpe-tokenizer-60k](https://huggingface.co/yakul259/english-bpe-tokenizer-60k)
- **Corpus Source:** [Finance-Instruct-500k](https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k)
- **Batch Size:** 1,000 lines per batch
- **Trainer:** `BpeTrainer` from the Hugging Face `tokenizers` library (see the training sketch below)
- **Special Tokens Added:** `<cls>`, `<sep>`, `<unk>`, `<pad>`, `<mask>`
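
A minimal end-to-end training sketch under the settings above, reusing the `ds` stream and `batch_iterator` helper from the Dataset section:

```python
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

# Start from the base tokenizer so its normalizer (lowercasing) and
# pre-tokenizer carry over; training replaces the vocabulary and merges.
tokenizer = Tokenizer.from_pretrained("yakul259/english-bpe-tokenizer-60k")

trainer = BpeTrainer(
    vocab_size=30_000,
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
)

# Retrain on the finance corpus in 1000-line batches.
tokenizer.train_from_iterator(batch_iterator(ds), trainer=trainer)
tokenizer.save("finance-bpe-tokenizer.json")
```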

---

## **Intended Uses & Limitations**

### Intended Uses
- Pre-tokenization for financial LLMs.
- Downstream financial NLP tasks:
  - Financial question answering
  - Document parsing
  - Financial news summarization
  - Risk assessment chatbots

### Limitations
- Optimized for English financial text; performance may drop outside the finance domain.
- May reflect biases present in the financial data used for training.

---

## **License**

This tokenizer is released under the **MIT License**.

---

## **Citation**

If you use this tokenizer, please cite:

```bibtex
@misc{finance_bpe_tokenizer_2025,
  title     = {Finance BPE Tokenizer Fine-tuned on Finance-Instruct-500k},
  author    = {yakul259},
  year      = {2025},
  publisher = {Hugging Face}
}
```