---
license: mit
language:
- en
- ja
tags:
- causal-lm
- llama
- accounting
- tiktoken
- business
- multilingual
library_name: transformers
model-index:
- name: soroban_3.8B_instruct_fp32
results: []
---
# 📎 Fast Accounting LLM (soroban3.8Bv0.0)
**soroban_3.8B_instruct_fp32** is a 3.8B-parameter LLaMA-style language model trained from scratch on multilingual corpora in **English and Japanese**, with a strong emphasis on **accounting**, **financial data**, and **business language**. It uses a custom tokenizer built on top of `tiktoken` (`o200k_base`), supports special tokens for dialogue, and is optimized for instructional and analytical text generation. This checkpoint is the pretrained Soroban base model further fine-tuned on instruction datasets.
## Model Details
### Model Description
- **Developed by:** FastAccounting Japan
- **Model type:** Causal Language Model (LLaMA-style architecture)
- **Language(s):** English and Japanese
- **License:** MIT
- **Finetuned from model:** Instruction-tuned from the pretrained Soroban base model (itself trained from scratch)
- **Tokenizer:** Custom BPE tokenizer built on tiktoken's `o200k_base` encoding (see the sketch below)
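The released tokenizer extends the `o200k_base` encoding with model-specific special tokens. As a rough illustration of the underlying base encoding (the actual special-token set lives in this repository's tokenizer files), it can be inspected directly with `tiktoken`:

```python
import tiktoken

# Base encoding the custom tokenizer builds on; the released tokenizer adds
# dialogue special tokens on top of this (see the repository files).
enc = tiktoken.get_encoding("o200k_base")

ids = enc.encode("売上原価 (cost of goods sold)")
print(ids)
print(enc.decode(ids))
```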
### Model Sources
- **Repository:** https://huggingface.co/FastAccounting/soroban_3.8B_instruct_fp32
## Uses
### Direct Use
- Chat-style generation for accounting and business tasks
- Financial Q&A and report summarization
- Instructional document parsing (in Japanese & English)
### Downstream Use
- Fine-tuning for audit compliance or domain-specific accounting QA (see the sketch below)
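A common starting point for such downstream adaptation is parameter-efficient fine-tuning. The sketch below uses the `peft` library; the rank, target modules, and other hyperparameters are illustrative assumptions, not values published with this model, and `target_modules` should be checked against the model's actual module names.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")

# Hypothetical LoRA setup; target_modules follow common LLaMA-style naming.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters train; base weights stay frozen
```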
### Out-of-Scope Use
- Unfiltered open-domain conversation
- Real-time decision-making in regulated financial environments
## Bias, Risks, and Limitations
- **Domain Bias:** The model has heavy exposure to accounting-style language and may underperform on open-domain or creative tasks.
- **Language Balance:** Although trained on both Japanese and English, performance may vary between the two languages depending on prompt structure.
### Recommendations
- Fine-tune or prompt carefully for non-accounting use cases.
- Always validate financial output before applying in business or legal settings.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the custom tiktoken-based tokenizer and the fp32 model weights
tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")

prompt = "<|begin_of_text|>Explain cash flow in Japanese."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # drop a key the model's forward() does not accept

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
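The model supports special tokens for dialogue. If the repository ships a chat template (an assumption; check `tokenizer_config.json`), multi-turn prompts can be formatted with `apply_chat_template`:

```python
# Reuses `tokenizer` and `model` from the snippet above.
# Assumes a chat template is defined; verify in tokenizer_config.json.
messages = [
    {"role": "user", "content": "仕訳とは何ですか？ (What is a journal entry?)"},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(chat_inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```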
## Training Details
### Training Data
- Internal multilingual accounting corpus (~160B tokens)
- Custom curated Japanese financial documents
- Publicly available English accounting datasets
### Training Procedure
- **Precision:** fp32
- **Context length:** 4096 tokens
- **Optimizer:** AdamW with weight decay
- **Learning-rate schedule:** Cosine decay with warmup (sketched below)
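The optimizer and schedule can be reproduced in outline with standard PyTorch and Transformers utilities. In the sketch below, the learning rate, weight decay, and step counts are placeholders, not the values used in training:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder hyperparameters; `model` is any loaded causal LM.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,
    num_training_steps=100_000,
)

# Each training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```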
## Evaluation
- Evaluation is ongoing, using domain-specific metrics such as perplexity on financial text and accuracy on Japanese bookkeeping test questions (a perplexity sketch follows).
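Perplexity on held-out financial text can be computed from the model's causal-LM loss. The sample sentence below is illustrative, not drawn from any evaluation set:

```python
import torch

# Reuses `tokenizer` and `model` from the quick-start example.
text = "当期純利益は前年比で10%増加した。"  # "Net income rose 10% year over year." (illustrative)
enc = tokenizer(text, return_tensors="pt")
enc.pop("token_type_ids", None)

with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # mean cross-entropy per token

print(f"perplexity: {torch.exp(loss).item():.2f}")
```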
## Environmental Impact
- **Hardware Type:** 24× NVIDIA H100 GPUs
- **Hours Used:** ~4,560 GPU hours
- **Provider:** On-premises H100 cluster
- **Carbon Emitted:** ~1,250 kg CO₂eq (estimated; see the note below)
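This figure is consistent with a back-of-envelope estimate. The average power draw and grid carbon intensity below are assumptions, not measured values:

```python
gpu_hours = 4_560            # from the table above
avg_power_kw = 0.7           # assumed ~700 W average draw per H100
grid_kg_co2_per_kwh = 0.39   # assumed grid carbon intensity

energy_kwh = gpu_hours * avg_power_kw            # ≈ 3,192 kWh
emissions_kg = energy_kwh * grid_kg_co2_per_kwh  # ≈ 1,245 kg CO2eq
print(round(emissions_kg))
```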
## Technical Specifications
### Model Architecture and Objective
- 3.8B parameter LLaMA-style Transformer
- 32 layers, 16 attention heads, RoPE, SwiGLU, FlashAttention2
- Untied input/output embeddings (verifiable via the config check below)
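These values can be checked against the released configuration. The field names below follow the standard LLaMA config in Transformers, which this repository is assumed to use:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")
print(config.num_hidden_layers)    # expected: 32
print(config.num_attention_heads)  # expected: 16
print(config.tie_word_embeddings)  # expected: False (untied embeddings)
```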
### Compute Infrastructure
- **Framework:** PyTorch with Hugging Face Transformers
- **Libraries:** DeepSpeed with Megatron-DeepSpeed integration (a hypothetical config sketch follows)
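The actual training configuration was not published; the sketch below shows a minimal DeepSpeed setup of the kind typically paired with this stack. All values are placeholders, and the absence of an `fp16`/`bf16` section leaves DeepSpeed in fp32, matching this model card:

```python
import deepspeed

# Hypothetical configuration; `model` is any loaded causal LM.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4, "weight_decay": 0.1}},
    "zero_optimization": {"stage": 2},
    # No fp16/bf16 section: DeepSpeed trains in fp32, as stated above.
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```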
## Citation
```
@misc{fa_llm_2025,
  author = {FastAccounting LLM},
  title  = {soroban_3.8B: A Multilingual Accounting Language Model},
  year   = {2025},
  url    = {https://huggingface.co/FastAccounting/soroban_3.8B_instruct_fp32},
}
```
## Contact
- **Lead Developers:** Keshav Singh & Fujitake Masato
- **Organization:** FastAccounting Japan
- **Hugging Face Profile:** https://huggingface.co/FastAccounting