---
license: mit
language:
- en
- ja
tags:
- causal-lm
- llama
- accounting
- tiktoken
- business
- multilingual
library_name: transformers
model-index:
- name: soroban_3.8B_instruct_fp32
  results: []
---

# 📎 Fast Accounting LLM (soroban3.8Bv0.0)

**soroban_3.8B_instruct_fp32** is a 3.8B-parameter LLaMA-style language model trained from scratch on multilingual corpora in **English and Japanese**, with a strong emphasis on **accounting**, **financial data**, and **business language**. It uses a custom tokenizer built on top of `tiktoken` (`o200k_base`), supports special tokens for dialogue, and is optimized for instructional and analytical text generation. This release is a checkpoint of the pretrained soroban base model that has been further fine-tuned on instruction datasets.

## Model Details

### Model Description

- **Developed by:** FastAccounting Japan
- **Model type:** Causal Language Model (LLaMA-style architecture)
- **Language(s):** English and Japanese
- **License:** MIT
- **Finetuned from model:** soroban base model (itself pretrained from scratch); this checkpoint adds instruction tuning
- **Tokenizer:** Custom BPE tokenizer based on tiktoken `o200k_base` (see the sketch below)
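
The tokenizer builds on the `o200k_base` encoding that ships with `tiktoken`. Below is a minimal sketch of inspecting that base encoding; it is an illustration only, since the released tokenizer adds its own special tokens for dialogue and token counts can therefore differ slightly.

```python
import tiktoken

# Base encoding that the custom BPE tokenizer builds on (illustration only;
# the released tokenizer adds dialogue special tokens on top of it).
enc = tiktoken.get_encoding("o200k_base")

text = "売掛金と買掛金の違いを説明してください。"  # "Explain the difference between accounts receivable and accounts payable."
ids = enc.encode(text)
print(len(ids), ids[:10])       # token count and the first few ids
print(enc.decode(ids) == text)  # the encoding round-trips losslessly
```
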
### Model Sources

- **Repository:** https://huggingface.co/FastAccounting/soroban_3.8B_instruct_fp32

## Uses (Coming Soon)

### Direct Use

- Chat-style generation for accounting and business tasks (see the example below)
- Financial Q&A and report summarization
- Instructional document parsing (in Japanese and English)
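
As a concrete example of chat-style accounting generation, the sketch below follows the loading recipe shown under *How to Get Started with the Model* further down. The Japanese Q&A prompt format is an assumption for illustration, not a documented chat template.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "FastAccounting/soroban_3.8B_instruct_fp32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical accounting Q&A prompt ("Question: What is depreciation expense? Answer:");
# adapt it to the prompt style used during instruction tuning.
prompt = "<|begin_of_text|>質問: 減価償却費とは何ですか？\n回答:"
inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # drop the key the model does not use

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
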
### Downstream Use

- Fine-tuning for audit compliance or domain-specific accounting QA (see the fine-tuning sketch below)
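
A minimal fine-tuning sketch for such a downstream use, assuming a LoRA setup with `peft` and the standard `transformers` Trainer; the dataset file `audit_qa.jsonl`, the hyperparameters, and the target module names are placeholders, not values recommended by the model authors.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "FastAccounting/soroban_3.8B_instruct_fp32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common fallback for causal LMs
model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA adapters on the attention projections; module names assume the LLaMA layout.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# "audit_qa.jsonl" is a placeholder for your own instruction-style dataset with a "text" field.
dataset = load_dataset("json", data_files="audit_qa.jsonl", split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="soroban-audit-qa", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1, learning_rate=1e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
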
### Out-of-Scope Use

- Unfiltered open-domain conversation
- Real-time decision-making in regulated financial environments

## Bias, Risks, and Limitations

- **Domain Bias:** The model has heavy exposure to accounting-style language and may underperform on open-domain or creative tasks.
- **Language Balance:** Although trained on both Japanese and English, performance may vary between the two languages depending on prompt structure.

### Recommendations

- Fine-tune or prompt carefully for non-accounting use cases.
- Always validate financial output before applying it in business or legal settings.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")

prompt = "<|begin_of_text|>Explain cash flow in Japanese."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # drop token_type_ids, which the model does not use
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
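
The checkpoint is stored in fp32, so the weights alone are roughly 15 GB (3.8B parameters × 4 bytes). For single-GPU inference you will likely want to downcast at load time; a minimal sketch, assuming a CUDA device and that reduced precision is acceptable for your use case:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "FastAccounting/soroban_3.8B_instruct_fp32",
    torch_dtype=torch.bfloat16,  # downcast the fp32 weights on load
).to("cuda")
```
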
## Training Details

### Training Data

- Internal multilingual accounting corpus (~160B tokens)
- Custom curated Japanese financial documents
- Publicly available English accounting datasets

### Training Procedure

- **Precision:** fp32
- **Context length:** 4096 tokens
- **Optimizer:** AdamW with weight decay
- **Learning schedule:** Cosine decay with warmup (see the sketch below)
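
The optimizer and schedule above correspond to a standard setup; the sketch below shows how such a combination can be reproduced with `torch` and `transformers`. The learning rate, weight decay, and step counts are illustrative placeholders, not the values used for the actual run.

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")

# Hyperparameter values below are illustrative, not the ones used in training.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=2_000, num_training_steps=100_000)

# Typical training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```
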
## Evaluation

- Ongoing evaluation with domain-specific metrics (financial text perplexity, accuracy on Japanese bookkeeping test questions)

## Environmental Impact

- **Hardware Type:** 24× NVIDIA H100 GPUs
- **Hours used:** ~4,560 GPU-hours
- **Cloud Provider:** None (on-premises H100 cluster)
- **Carbon Emitted:** Estimated 1,250 kg CO₂eq

## Technical Specifications

### Model Architecture and Objective

- 3.8B-parameter LLaMA-style Transformer
- 32 layers, 16 attention heads, RoPE, SwiGLU, FlashAttention-2
- Untied input/output embeddings (see the illustrative config below)
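
For readers mapping these points onto Hugging Face's configuration classes, the sketch below shows an illustrative `LlamaConfig`. Only the layer count, head count, context length, and embedding tying come from this card; the hidden size, FFN width, and vocabulary size are placeholders and will not reproduce the exact 3.8B parameter count.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Only 32 layers, 16 heads, 4096 context, and untied embeddings are taken from this card;
# the remaining dimensions are placeholders for illustration.
config = LlamaConfig(
    vocab_size=200_000,            # placeholder, sized for an o200k_base-style vocabulary
    hidden_size=3072,              # placeholder
    intermediate_size=8192,        # placeholder SwiGLU FFN width
    num_hidden_layers=32,
    num_attention_heads=16,
    max_position_embeddings=4096,
    tie_word_embeddings=False,     # untied input/output embeddings
)
model = LlamaForCausalLM(config)   # random init; parameter count reflects the placeholders, not exactly 3.8B
```
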
### Compute Infrastructure

- **Framework:** PyTorch with Hugging Face Transformers
- **Libraries:** DeepSpeed, Megatron-DeepSpeed integration

## Citation

```
@misc{fa_llm_2025,
  author = {FastAccounting LLM},
  title  = {soroban_3.8B: A Multilingual Accounting Language Model},
  year   = {2025},
  url    = {https://huggingface.co/FastAccounting/fa_llm_1B},
}
```

## Contact

**Lead Developer:** Keshav Singh & Fujitake Masato

**Organization:** FastAccounting Japan

**Hugging Face Profile:** https://huggingface.co/FastAccounting