Fast Accounting LLM (soroban1Bv0.0)
soroban_1B is a 1B-parameter LLaMA-style language model trained from scratch on multilingual corpora in English and Japanese, with a strong emphasis on accounting, financial data, and business language. It uses a custom tokenizer built on top of tiktoken (o200k_base), supports special tokens for dialogue, and is optimized for instructional and analytical text generation.
Model Details
Model Description
- Developed by: FastAccounting Japan
- Model type: Causal Language Model (LLaMA-style architecture)
- Language(s): English and Japanese
- License: MIT
- Finetuned from model: Trained from scratch
- Tokenizer: Custom BPE tokenizer based on tiktoken o200k_base (see the inspection sketch below)
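Because the tokenizer is a custom BPE built on tiktoken's o200k_base with additional dialogue special tokens, it is worth inspecting those tokens before constructing prompts. The minimal sketch below assumes the tokenizer ships with the FastAccounting/soroban_untrained_base checkpoint referenced in the getting-started section; the sample sentence is illustrative only.

from transformers import AutoTokenizer

# Load the custom BPE tokenizer (repo id taken from the getting-started snippet).
tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")

# Inspect the special tokens registered for dialogue and text boundaries.
print(tokenizer.special_tokens_map)
print(tokenizer.additional_special_tokens)

# Round-trip a bilingual sample to see how accounting terms are segmented.
ids = tokenizer("売掛金 (accounts receivable) is a current asset.")["input_ids"]
print(len(ids), tokenizer.convert_ids_to_tokens(ids))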
Model Sources
Uses (Coming Soon)
Direct Use
- Chat-style generation for accounting and business tasks
- Financial Q&A and report summarization (see the sketch after this list)
- Instructional document parsing (in Japanese & English)
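As an illustration of the Q&A use case above, the following minimal sketch uses the Transformers text-generation pipeline. The repository id is taken from the getting-started section below; the plain question/answer prompt format is an assumption, since this card does not document a fixed chat template.

from transformers import pipeline

# Build a text-generation pipeline around the published checkpoint.
generator = pipeline("text-generation", model="FastAccounting/soroban_untrained_base")

# Hypothetical Q&A prompt; adapt it to the model's actual dialogue tokens if they differ.
prompt = "<|begin_of_text|>Question: What is the difference between accounts payable and accrued expenses?\nAnswer:"
result = generator(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])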
Downstream Use
- Fine-tuning for audit compliance or domain-specific accounting QA
Out-of-Scope Use
- Unfiltered open-domain conversation
- Real-time decision-making in regulated financial environments
Bias, Risks, and Limitations
- Domain Bias: The model has strong exposure to accounting-style language and may underperform on open-domain or creative tasks.
- Language Balance: While trained on both Japanese and English, performance may vary between them depending on prompt structure.
Recommendations
- Fine-tune or prompt carefully for non-accounting use cases.
- Always validate financial output before applying in business or legal settings.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights.
tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")

# Build a prompt that starts with the begin-of-text special token.
prompt = "<|begin_of_text|>Explain cash flow in Japanese."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # drop token_type_ids, which generate() does not accept

# Generate up to 100 new tokens and decode, skipping special tokens.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
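For longer analytical or Japanese-language output, you can move the model to a GPU and enable sampling. Continuing from the snippet above, the settings below are illustrative defaults, not tuned recommendations.

import torch

# Move the model to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Japanese prompt: "Explain the three sections of the cash flow statement."
prompt = "<|begin_of_text|>キャッシュフロー計算書の三つの区分を説明してください。"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)

# Sampling tends to read more naturally for explanatory text.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))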
Training Details
Training Data
- Internal multilingual accounting corpus (~3T tokens)
- Custom curated Japanese financial documents
- Publicly available English accounting datasets
Training Procedure
- Precision: fp16
- Context length: 2048 tokens
- Optimizer: AdamW with weight decay
- Learning schedule: Cosine decay with warmup (a configuration sketch follows this list)
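The sketch below restates the hyperparameters above as a single fp16 training step with AdamW and a cosine schedule with warmup. The learning rate, weight decay, warmup steps, and total steps are illustrative assumptions, not the values used to train soroban_1B, and real training iterates over the full corpus rather than one sample batch.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, get_cosine_schedule_with_warmup

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base").cuda()

# One toy batch, truncated to the 2048-token context length.
batch = tokenizer(
    ["Depreciation is a non-cash expense recognized over an asset's useful life."],
    return_tensors="pt", truncation=True, max_length=2048,
).to("cuda")
batch.pop("token_type_ids", None)

# Assumed hyperparameter values, for illustration only.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=2000, num_training_steps=200_000)
scaler = torch.cuda.amp.GradScaler()

# A single fp16 training step with gradient scaling.
with torch.cuda.amp.autocast(dtype=torch.float16):
    loss = model(**batch, labels=batch["input_ids"]).loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()
optimizer.zero_grad()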
Evaluation
- Ongoing evaluation with domain-specific metrics, such as perplexity on financial text and accuracy on Japanese bookkeeping test questions (a perplexity sketch follows)
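For the financial-text perplexity metric, a minimal sketch is to average the model's cross-entropy loss over held-out accounting text and exponentiate it. The sample sentences below are placeholders, not part of the actual evaluation set, and the per-sentence average is a simplification of a token-weighted perplexity.

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")
model.eval()

# Placeholder held-out financial sentences (English and Japanese).
texts = [
    "Deferred revenue is recognized as a liability until the service is delivered.",
    "減価償却費は損益計算書に費用として計上されます。",  # "Depreciation is recorded as an expense on the income statement."
]

losses = []
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
        enc.pop("token_type_ids", None)
        # The model returns the mean token cross-entropy when labels are provided.
        losses.append(model(**enc, labels=enc["input_ids"]).loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"Perplexity on the sample set: {perplexity:.2f}")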
Environmental Impact
- Hardware Type: 32x NVIDIA H100 GPUs
- Hours used: ~4800 GPU hours
- Cloud Provider: On-prem H100 cluster
- Carbon Emitted: Estimated 350 kg CO2eq
Technical Specifications
Model Architecture and Objective
- 1B parameter LLaMA-style Transformer
- 16 layers, 32 attention heads, RoPE, SwiGLU, FlashAttention2
- Tied input/output embeddings (see the configuration sketch below)
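Restated as a Hugging Face configuration, the architecture above might look like the sketch below. The hidden size, intermediate size, and vocabulary size are assumptions chosen to be roughly consistent with a 1B-parameter model and the o200k_base vocabulary; they are not published values.

from transformers import LlamaConfig, LlamaForCausalLM

# Layer count, head count, context length, and tied embeddings come from this card;
# the remaining dimensions are assumed for illustration only.
config = LlamaConfig(
    vocab_size=200_000,        # assumption: roughly the o200k_base vocabulary size
    hidden_size=2048,          # assumption
    intermediate_size=5632,    # assumption (SwiGLU MLP width)
    num_hidden_layers=16,
    num_attention_heads=32,
    max_position_embeddings=2048,
    tie_word_embeddings=True,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")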
Compute Infrastructure
- Framework: PyTorch with Hugging Face Transformers
- Libraries: DeepSpeed, Megatron-DS integration
Citation
@misc{fa_llm_2025,
  author = {FastAccounting LLM},
  title  = {soroban_1B: A Multilingual Accounting Language Model},
  year   = {2025},
  url    = {https://huggingface.co/FastAccounting/fa_llm_1B},
}
Contact
Lead Developers: Keshav Singh & Fujitake Masato
Organization: FastAccounting Japan
Hugging Face Profile: https://huggingface.co/FastAccounting