---
language: en
tags:
- tokenizer
- bpe
- NLP
- finance
license: mit
datasets:
- Josephgflowers/Finance-Instruct-500k
library_name: tokenizers
base_model:
- yakul259/english-bpe-tokenizer-60k
---

# **Finance BPE Tokenizer (Fine-tuned on Finance-Instruct-500k)**

## **Model Overview**

This repository contains a **Byte-Pair Encoding (BPE) tokenizer** fine-tuned on the **Finance-Instruct-500k** dataset, starting from the base tokenizer **yakul259/english-bpe-tokenizer-60k**. It is tailored to financial text, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.

**Key Features:**
- Custom `<cls>` and `<sep>` special tokens.
- BPE subword segmentation optimized for financial vocabulary.
- Template-based post-processing for both single and paired sequences.
- BPE decoder configured for accurate reconstruction of financial text (see the usage sketch below).
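
A minimal usage sketch, assuming the `tokenizers` library is installed; the repository id below is a placeholder for wherever this tokenizer is hosted:

```python
from tokenizers import Tokenizer

# Load the fine-tuned tokenizer from the Hub (placeholder repo id).
tok = Tokenizer.from_pretrained("yakul259/finance-bpe-tokenizer-30k")

enc = tok.encode("Q3 EBITDA rose 12% year-over-year.")
print(enc.tokens)    # lowercased subword tokens plus <sep>/<cls>
print(enc.type_ids)  # sequence ids assigned by the post-processing template

# The BPE decoder reconstructs the (lowercased) input text.
print(tok.decode(enc.ids))
```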

---

## **Training Details**

### **Dataset**
- **Name:** [Finance-Instruct-500k](https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k)
- **Source:** Financial-domain prompts, completions, and instructions.
- **Split Used:** `train`
- **Size:** 500,000 instruction-based samples
- **Loading Method:** Streaming mode for efficient processing (sketched below).
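
A sketch of the streaming setup using the `datasets` library; the `"text"` field name is an assumption and should be adjusted to the dataset's actual columns:

```python
from datasets import load_dataset

# Stream the split instead of downloading the full dataset to disk.
ds = load_dataset(
    "Josephgflowers/Finance-Instruct-500k",
    split="train",
    streaming=True,
)

def batch_iterator(dataset, batch_size=1000):
    """Yield lists of raw text samples, 1000 per batch."""
    batch = []
    for example in dataset:
        batch.append(example["text"])  # hypothetical column name
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```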

### **Tokenizer Configuration**
- **Model Type:** Byte-Pair Encoding (BPE)
- **Vocabulary Size:** 30,000 (optimized for finance-specific tasks)
- **Lowercasing:** Enabled
- **Special Tokens:**
  - `<cls>` → classification token
  - `<sep>` → separator token
  - `<unk>` → unknown token
  - `<pad>` → padding token
  - `<mask>` → masking token (for MLM tasks)
- **Post-Processing Template:**
  - **Single sequence:** `$A:0 <sep>:0 <cls>:2`
  - **Paired sequences:** `$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2`
- **Decoder:** BPE decoder for reconstructing the original text (see the sketch below).
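
A sketch of how this configuration maps onto the `tokenizers` API, here applied to a freshly loaded tokenizer; the repo id is again a placeholder, and the special-token ids are resolved from the vocabulary rather than hard-coded:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.processors import TemplateProcessing

tok = Tokenizer.from_pretrained("yakul259/finance-bpe-tokenizer-30k")  # placeholder

# Resolve special-token ids from this tokenizer's vocabulary.
cls_id = tok.token_to_id("<cls>")
sep_id = tok.token_to_id("<sep>")

# XLNet-style template: <cls> comes last and carries type id 2.
tok.post_processor = TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<cls>", cls_id), ("<sep>", sep_id)],
)

# The BPE decoder merges subwords back into continuous text.
tok.decoder = decoders.BPEDecoder()
```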

### **Training Method**
- **Base Model:** [yakul259/english-bpe-tokenizer-60k](https://huggingface.co/yakul259/english-bpe-tokenizer-60k)
- **Corpus Source:** [Finance-Instruct-500k](https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k)
- **Batch Size:** 1,000 lines per batch
- **Trainer:** `BpeTrainer` from the Hugging Face `tokenizers` library (see the training sketch below)
- **Special Tokens Added:** `<cls>`, `<sep>`, `<unk>`, `<pad>`, `<mask>`
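
A minimal end-to-end training sketch under the settings above, reusing the `ds` stream and `batch_iterator` helper from the Dataset section:

```python
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

# Start from the base tokenizer so its normalizer (lowercasing) and
# pre-tokenizer carry over; training replaces the vocabulary and merges.
tokenizer = Tokenizer.from_pretrained("yakul259/english-bpe-tokenizer-60k")

trainer = BpeTrainer(
    vocab_size=30_000,
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
)

# Retrain on the finance corpus in 1000-line batches.
tokenizer.train_from_iterator(batch_iterator(ds), trainer=trainer)
tokenizer.save("finance-bpe-tokenizer.json")
```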

---

## **Intended Uses & Limitations**

### Intended Uses
- Pre-tokenization for financial LLMs.
- Downstream financial NLP tasks:
  - Financial question answering
  - Document parsing
  - Financial news summarization
  - Risk assessment chatbots

### Limitations
- Optimized for English financial text; performance may drop outside the finance domain.
- May reflect biases present in the financial data used for training.

---

## **License**

This tokenizer is released under the **MIT License**.

---

## **Citation**

If you use this tokenizer, please cite:

```bibtex
@misc{finance_bpe_tokenizer_2025,
  title     = {Finance BPE Tokenizer Fine-tuned on Finance-Instruct-500k},
  author    = {yakul259},
  year      = {2025},
  publisher = {Hugging Face}
}
```