
📎 Fast Accounting LLM (soroban1Bv0.0)

soroban_1B is a 1B parameter LLaMA-style language model trained from scratch using multilingual corpora in English and Japanese, with a strong emphasis on accounting, financial data, and business language. It uses a custom tokenizer built on top of tiktoken (o200k_base), supports special tokens for dialogue, and is optimized for instructional and analytical text generation.
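The custom tokenizer follows tiktoken's standard extension pattern: start from the public o200k_base encoding and register extra special tokens on top of it. The snippet below is a minimal sketch of that idea; the special-token names and ids shown are illustrative assumptions, not the released vocabulary.

import tiktoken

# Base encoding the custom tokenizer builds on.
base = tiktoken.get_encoding("o200k_base")

# Extend with dialogue special tokens. Names and ids are placeholders for
# illustration only; the shipped tokenizer defines its own set.
enc = tiktoken.Encoding(
    name="o200k_soroban",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<|begin_of_text|>": base.n_vocab,
        "<|end_of_text|>": base.n_vocab + 1,
    },
)

print(enc.encode("What are accounts receivable?"))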

Model Details

Model Description

  • Developed by: FastAccounting Japan
  • Model type: Causal Language Model (LLaMA-style architecture)
  • Language(s): English and Japanese
  • License: MIT
  • Finetuned from model: None (trained from scratch)
  • Tokenizer: Custom BPE tokenizer (based on tiktoken o200k_base)

Model Sources

Uses (Coming Soon)

Direct Use

  • Chat-style generation for accounting and business tasks (see the chat example after this list)
  • Financial Q&A and report summarization
  • Instructional document parsing (in Japanese & English)
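If the repository's tokenizer_config defines a chat template, the chat-style use above can go through the standard Transformers chat API. The sketch below assumes such a template exists; the message content is a placeholder, and a plain prompt (as in "How to Get Started" below) works otherwise.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")

messages = [
    {"role": "user", "content": "Explain the difference between a balance sheet and an income statement in Japanese."},
]

# apply_chat_template requires a chat template in tokenizer_config.json;
# fall back to a plain prompt if none is defined.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))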

Downstream Use

  • Fine-tuning for audit compliance or domain-specific accounting QA
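As one possible starting point for such a fine-tune, the sketch below runs the standard Transformers Trainer over an instruction-style JSONL file. The file name, data format, and hyperparameters are illustrative assumptions, not an official recipe.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation

# Hypothetical JSONL file with {"text": ...} records of accounting QA pairs.
dataset = load_dataset("json", data_files="accounting_qa.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="soroban_1B-accounting-qa",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()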

Out-of-Scope Use

  • Unfiltered open-domain conversation
  • Real-time decision-making in regulated financial environments

Bias, Risks, and Limitations

  • Domain Bias: The model has heavy exposure to accounting-style language and may underperform on open-domain or creative tasks.
  • Language Balance: Although trained on both Japanese and English, performance may vary between the two languages depending on prompt structure.

Recommendations

  • Fine-tune or prompt carefully for non-accounting use cases.
  • Always validate financial output before applying in business or legal settings.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")

prompt = "<|begin_of_text|>Explain cash flow in Japanese."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # drop token_type_ids, which this model does not use
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

  • Internal multilingual accounting corpus (~3T tokens)
  • Custom curated Japanese financial documents
  • Publicly available English accounting datasets

Training Procedure

  • Precision: fp16 (see the setup sketch after this list)
  • Context length: 2048 tokens
  • Optimizer: AdamW with weight decay
  • Learning schedule: Cosine decay with warmup
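The sketch below shows this optimizer and schedule in plain PyTorch/Transformers terms. The learning rate, betas, weight decay, and step counts are illustrative values, not the ones used for the released run.

import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")

# AdamW with weight decay (illustrative hyperparameters).
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

# Cosine decay with warmup (illustrative step counts).
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=100_000
)

# Each training step: forward/backward under fp16 autocast (with a GradScaler),
# then optimizer.step(), scheduler.step(), optimizer.zero_grad().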

Evaluation

  • Ongoing evaluation with domain-specific metrics (financial text perplexity, accuracy on Japanese bookkeeping test questions)
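One simple way to track the financial-text perplexity mentioned above is sketched below; the evaluation sentences are placeholders, and this is not the project's actual evaluation harness.

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_untrained_base")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_untrained_base")
model.eval()

# Placeholder held-out financial sentences; substitute a real evaluation set.
eval_texts = [
    "Accounts receivable are reported as current assets on the balance sheet.",
    "Depreciation expense is recognized on the income statement.",
]

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in eval_texts:
        enc = tokenizer(text, return_tensors="pt")
        enc.pop("token_type_ids", None)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].size(1)
        total_nll += out.loss.item() * n  # loss is the mean NLL over predicted tokens
        total_tokens += n

print(f"perplexity: {math.exp(total_nll / total_tokens):.2f}")  # token-weighted, approximate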

Environmental Impact

  • Hardware Type: 32x NVIDIA H100 GPUs
  • Hours used: ~4800 GPU hours
  • Cloud Provider: None (on-prem H100 cluster)
  • Carbon Emitted: Estimated 350kg CO2eq
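The carbon figure is in line with a back-of-the-envelope estimate from the GPU hours above. The power draw and grid intensity used below are assumed, illustrative values, not measurements.

# Rough CO2eq estimate from GPU hours; power draw and grid intensity are assumptions.
gpu_hours = 4800            # from this card
avg_power_kw = 0.70         # assumed average per-H100 draw, including server overhead
grid_kg_co2_per_kwh = 0.10  # assumed grid carbon intensity

energy_kwh = gpu_hours * avg_power_kw      # 3360 kWh
co2_kg = energy_kwh * grid_kg_co2_per_kwh  # ~336 kg CO2eq
print(f"~{energy_kwh:.0f} kWh, ~{co2_kg:.0f} kg CO2eq")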

Technical Specifications

Model Architecture and Objective

  • 1B parameter LLaMA-style Transformer
  • 16 layers, 32 attention heads, RoPE, SwiGLU, FlashAttention2
  • Tied input/output embeddings
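In Hugging Face Transformers terms, those bullets roughly map onto a LlamaConfig like the one below. The hidden size, MLP width, and vocabulary size are inferred, illustrative values rather than the exact released configuration.

from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative config consistent with the bullets above (16 layers, 32 heads,
# RoPE, SwiGLU, tied embeddings). Exact sizes are assumptions.
config = LlamaConfig(
    vocab_size=200_000,          # o200k_base-derived tokenizer (approximate)
    hidden_size=2048,
    intermediate_size=4096,      # SwiGLU MLP width (assumed)
    num_hidden_layers=16,
    num_attention_heads=32,
    max_position_embeddings=2048,
    tie_word_embeddings=True,
    # attn_implementation="flash_attention_2",  # enable when flash-attn is installed
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e9:.2f}B parameters for this illustrative config")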

Compute Infrastructure

  • Framework: PyTorch with Hugging Face Transformers
  • Libraries: DeepSpeed, Megatron-DS integration
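A minimal DeepSpeed configuration consistent with the fp16/AdamW setup above might look like the following. Batch sizes, ZeRO stage, and optimizer values are illustrative assumptions; the cosine-with-warmup schedule can be driven by the training framework rather than DeepSpeed itself.

# Illustrative DeepSpeed config; every value here is an assumption, not the actual run's.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 3e-4, "betas": [0.9, 0.95], "weight_decay": 0.1},
    },
    # The cosine-with-warmup schedule can be supplied by the training framework
    # (e.g., Transformers/Megatron) instead of a DeepSpeed "scheduler" block.
}

With the Transformers Trainer, such a dict can be passed via TrainingArguments(deepspeed=ds_config).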

Citation

@misc{fa_llm_2025,
  author = {FastAccounting LLM},
  title = {soroban_1B: A Multilingual Accounting Language Model},
  year = {2025},
  url = {https://huggingface.co/FastAccounting/fa_llm_1B},
}

Contact

Lead Developers: Keshav Singh & Fujitake Masato

Organization: FastAccounting Japan

Hugging Face Profile: https://huggingface.co/FastAccounting
