Fast Accounting LLM (soroban1Bv0.0)
soroban_1B is a 1B-parameter LLaMA-style language model trained from scratch on multilingual corpora in English and Japanese, with a strong emphasis on accounting, financial data, and business language. It uses a custom tokenizer built on top of tiktoken (o200k_base), supports special tokens for dialogue, and is optimized for instructional and analytical text generation.
Model Details
Model Description
- Developed by: FastAccounting Japan
- Model type: Causal Language Model (LLaMA-style architecture)
- Language(s): English and Japanese
- License: MIT
- Finetuned from model: Trained from scratch
- Tokenizer: Custom BPE tokenizer based on tiktoken o200k_base (see the inspection sketch below)
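Because the tokenizer is a custom BPE built on tiktoken's o200k_base with additional dialogue special tokens, it is worth inspecting those tokens before constructing prompts. The minimal sketch below assumes the tokenizer ships with the FastAccounting/soroban_untrained_base checkpoint referenced in the getting-started section; the sample sentence is illustrative only.

from transformers import AutoTokenizer

# Load the custom BPE tokenizer (repo id taken from the getting-started snippet).
tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")

# Inspect the special tokens registered for dialogue and text boundaries.
print(tokenizer.special_tokens_map)
print(tokenizer.additional_special_tokens)

# Round-trip a bilingual sample to see how accounting terms are segmented.
ids = tokenizer("売掛金 (accounts receivable) is a current asset.")["input_ids"]
print(len(ids), tokenizer.convert_ids_to_tokens(ids))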
Model Sources
Uses (Coming Soon)
Direct Use
- Chat-style generation for accounting and business tasks
- Financial Q&A and report summarization (see the sketch after this list)
- Instructional document parsing (in Japanese & English)
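As an illustration of the Q&A use case above, the following minimal sketch uses the Transformers text-generation pipeline. The repository id is taken from the getting-started section below; the plain question/answer prompt format is an assumption, since this card does not document a fixed chat template.

from transformers import pipeline

# Build a text-generation pipeline around the published checkpoint.
generator = pipeline("text-generation", model="FastAccounting/soroban_untrained_base")

# Hypothetical Q&A prompt; adapt it to the model's actual dialogue tokens if they differ.
prompt = "<|begin_of_text|>Question: What is the difference between accounts payable and accrued expenses?\nAnswer:"
result = generator(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])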
Downstream Use
- Fine-tuning for audit compliance or domain-specific accounting QA
Out-of-Scope Use
- Unfiltered open-domain conversation
- Real-time decision-making in regulated financial environments
Bias, Risks, and Limitations
- Domain Bias: The model has strong exposure to accounting-style language and may underperform on open-domain or creative tasks.
- Language Balance: While trained on both Japanese and English, performance may vary between them depending on prompt structure.
Recommendations
- Fine-tune or prompt carefully for non-accounting use cases.
- Always validate financial output before applying in business or legal settings.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights.
tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")

# Build a prompt that starts with the begin-of-text special token.
prompt = "<|begin_of_text|>Explain cash flow in Japanese."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # drop token_type_ids, which generate() does not accept

# Generate up to 100 new tokens and decode, skipping special tokens.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
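For longer analytical or Japanese-language output, you can move the model to a GPU and enable sampling. Continuing from the snippet above, the settings below are illustrative defaults, not tuned recommendations.

import torch

# Move the model to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Japanese prompt: "Explain the three sections of the cash flow statement."
prompt = "<|begin_of_text|>キャッシュフロー計算書の三つの区分を説明してください。"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)

# Sampling tends to read more naturally for explanatory text.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))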
Training Details
Training Data
- Internal multilingual accounting corpus (~3T tokens)
- Custom curated Japanese financial documents
- Publicly available English accounting datasets
Training Procedure
- Precision: fp16
- Context length: 2048 tokens
- Optimizer: AdamW with weight decay
- Learning schedule: Cosine decay with warmup (a configuration sketch follows this list)
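The sketch below restates the hyperparameters above as a single fp16 training step with AdamW and a cosine schedule with warmup. The learning rate, weight decay, warmup steps, and total steps are illustrative assumptions, not the values used to train soroban_1B, and real training iterates over the full corpus rather than one sample batch.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, get_cosine_schedule_with_warmup

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base").cuda()

# One toy batch, truncated to the 2048-token context length.
batch = tokenizer(
    ["Depreciation is a non-cash expense recognized over an asset's useful life."],
    return_tensors="pt", truncation=True, max_length=2048,
).to("cuda")
batch.pop("token_type_ids", None)

# Assumed hyperparameter values, for illustration only.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=2000, num_training_steps=200_000)
scaler = torch.cuda.amp.GradScaler()

# A single fp16 training step with gradient scaling.
with torch.cuda.amp.autocast(dtype=torch.float16):
    loss = model(**batch, labels=batch["input_ids"]).loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()
optimizer.zero_grad()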
Evaluation
- Ongoing evaluation with domain-specific metrics, such as perplexity on financial text and accuracy on Japanese bookkeeping test questions (a perplexity sketch follows)
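For the financial-text perplexity metric, a minimal sketch is to average the model's cross-entropy loss over held-out accounting text and exponentiate it. The sample sentences below are placeholders, not part of the actual evaluation set, and the per-sentence average is a simplification of a token-weighted perplexity.

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")
model.eval()

# Placeholder held-out financial sentences (English and Japanese).
texts = [
    "Deferred revenue is recognized as a liability until the service is delivered.",
    "減価償却費は損益計算書に費用として計上されます。",  # "Depreciation is recorded as an expense on the income statement."
]

losses = []
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
        enc.pop("token_type_ids", None)
        # The model returns the mean token cross-entropy when labels are provided.
        losses.append(model(**enc, labels=enc["input_ids"]).loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"Perplexity on the sample set: {perplexity:.2f}")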
Environmental Impact
- Hardware Type: 32x NVIDIA H100 GPUs
- Hours used: ~4800 GPU hours
- Cloud Provider: On-prem H100 cluster
- Carbon Emitted: Estimated 350 kg CO2eq
Technical Specifications
Model Architecture and Objective
- 1B parameter LLaMA-style Transformer
- 16 layers, 32 attention heads, RoPE, SwiGLU, FlashAttention2
- Tied input/output embeddings (see the configuration sketch below)
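Restated as a Hugging Face configuration, the architecture above might look like the sketch below. The hidden size, intermediate size, and vocabulary size are assumptions chosen to be roughly consistent with a 1B-parameter model and the o200k_base vocabulary; they are not published values.

from transformers import LlamaConfig, LlamaForCausalLM

# Layer count, head count, context length, and tied embeddings come from this card;
# the remaining dimensions are assumed for illustration only.
config = LlamaConfig(
    vocab_size=200_000,        # assumption: roughly the o200k_base vocabulary size
    hidden_size=2048,          # assumption
    intermediate_size=5632,    # assumption (SwiGLU MLP width)
    num_hidden_layers=16,
    num_attention_heads=32,
    max_position_embeddings=2048,
    tie_word_embeddings=True,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")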
Compute Infrastructure
- Framework: PyTorch with Hugging Face Transformers
- Libraries: DeepSpeed, Megatron-DS integration
Citation
@misc{fa_llm_2025,
  author = {FastAccounting LLM},
  title  = {soroban_1B: A Multilingual Accounting Language Model},
  year   = {2025},
  url    = {https://huggingface.co/FastAccounting/fa_llm_1B},
}
Contact
Lead Developers: Keshav Singh & Fujitake Masato
Organization: FastAccounting Japan
Hugging Face Profile: https://huggingface.co/FastAccounting