license: mit
language:
  - en
  - ja
tags:
  - causal-lm
  - llama
  - accounting
  - tiktoken
  - business
  - multilingual
library_name: transformers
model-index:
  - name: soroban_3.8B_instruct_fp32
    results: []

📎 Fast Accounting LLM (soroban3.8Bv0.0)

soroban_3.8B_instruct_fp32 is a 3.8B-parameter LLaMA-style language model trained from scratch on multilingual corpora in English and Japanese, with a strong emphasis on accounting, financial data, and business language. It uses a custom tokenizer built on top of tiktoken (o200k_base), supports special tokens for dialogue, and is optimized for instructional and analytical text generation. This checkpoint is the pretrained soroban base model further fine-tuned on instruction datasets.

Model Details

Model Description

  • Developed by: FastAccounting Japan
  • Model type: Causal Language Model (LLaMA-style architecture)
  • Language(s): English and Japanese
  • License: MIT
  • Finetuned from model: the pretrained soroban base model (itself trained from scratch by FastAccounting)
  • Tokenizer: Custom BPE tokenizer (based on tiktoken o200k_base)
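
The custom tokenizer ships with this repository and loads via AutoTokenizer as shown later in this card. Purely for illustration, a tokenizer of this kind can be built by extending tiktoken's o200k_base encoding with extra special tokens; the token names and ids below are assumptions, not the shipped vocabulary.

# Illustrative sketch only: the repository ships its own tokenizer files.
# Token names and ids here are assumptions, not the shipped vocabulary.
import tiktoken

base = tiktoken.get_encoding("o200k_base")
enc = tiktoken.Encoding(
    name="soroban_o200k",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<|begin_of_text|>": base.n_vocab,      # hypothetical id
        "<|end_of_text|>": base.n_vocab + 1,    # hypothetical id
    },
)

ids = enc.encode("<|begin_of_text|>Gross profit equals revenue minus cost of goods sold.", allowed_special="all")
print(len(ids), enc.decode(ids))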

Model Sources (Coming Soon)

Uses

Direct Use

  • Chat-style generation for accounting and business tasks (see the sketch after this list)
  • Financial Q&A and report summarization
  • Instructional document parsing (in Japanese & English)
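
The sketch below illustrates chat-style generation. Whether this checkpoint's tokenizer ships a chat template is an assumption not confirmed by the card; if it does not, use the plain-prompt example in "How to Get Started with the Model" below.

# Chat-style generation sketch. Whether the tokenizer defines a chat template
# is an assumption; if not, fall back to the plain-prompt example below.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FastAccounting/soroban_3.8B_instruct_fp32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize the difference between accrual and cash accounting."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))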

Downstream Use

  • Fine-tuning for audit compliance or domain-specific accounting QA
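
One possible LoRA fine-tuning setup for such downstream use is sketched below; the dataset file, column name, and hyperparameters are placeholders, not values used or recommended by FastAccounting.

# Downstream LoRA fine-tuning sketch; dataset path, "text" column, and
# hyperparameters are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model_id = "FastAccounting/soroban_3.8B_instruct_fp32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:               # assumption: reuse EOS for padding
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

def tokenize(batch):                          # assumes a "text" column in the data
    return tokenizer(batch["text"], truncation=True, max_length=1024)

dataset = load_dataset("json", data_files="audit_qa.jsonl")["train"]
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="soroban-audit-qa",
                           per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()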

Out-of-Scope Use

  • Unfiltered open-domain conversation
  • Real-time decision-making in regulated financial environments

Bias, Risks, and Limitations

  • Domain Bias: The model has heavy exposure to accounting-style language and may underperform on open-domain or creative tasks.
  • Language Balance: Although trained on both Japanese and English, performance may vary between the two languages depending on prompt structure.

Recommendations

  • Fine-tune or prompt carefully for non-accounting use cases.
  • Always validate financial output before applying in business or legal settings.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")

prompt = "<|begin_of_text|>Explain cash flow in Japanese."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # remove unused key
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
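
The checkpoint is stored in fp32, so the full model needs roughly 15 GB of memory at load time (3.8B parameters × 4 bytes). If that is too large, a lower-precision load is possible with standard Transformers options; this is a general technique, not something specific to this model:

# Optional: load in bfloat16 to roughly halve memory; device_map="auto" places
# the weights on an available GPU and requires the accelerate package.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "FastAccounting/soroban_3.8B_instruct_fp32",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)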

Training Details

Training Data

  • Internal multilingual accounting corpus (~160B tokens)
  • Custom curated Japanese financial documents
  • Publicly available English accounting datasets

Training Procedure

  • Precision: fp32
  • Context length: 4096 tokens
  • Optimizer: AdamW with weight decay
  • Learning schedule: Cosine decay with warmup
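
A minimal sketch of the optimizer and schedule named above; the learning rate, weight decay, warmup steps, and total steps are placeholder values, not the ones used for soroban's training run.

# Placeholder hyperparameters; only the optimizer/schedule choice mirrors the card.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=100_000
)
# In the training loop, call optimizer.step() and then scheduler.step() after
# each backward pass over a batch of 4096-token sequences.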

Evaluation

  • Ongoing evaluation with domain-specific metrics (financial text perplexity, accuracy on Japanese bookkeeping test questions)
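
One way to run the perplexity part of that evaluation on a held-out financial sentence is sketched below; the sample text is illustrative and not drawn from the actual evaluation set.

# Perplexity check on a single financial sentence (illustrative text).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FastAccounting/soroban_3.8B_instruct_fp32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

text = "The company recognizes revenue when control of the goods transfers to the customer."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    loss = model(input_ids=enc["input_ids"], labels=enc["input_ids"]).loss
print(f"perplexity: {math.exp(loss.item()):.2f}")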

Environmental Impact

  • Hardware Type: 24x NVIDIA H100 GPUs
  • Hours used: ~4,560 GPU-hours (roughly 190 wall-clock hours across 24 GPUs)
  • Cloud Provider: none (on-premise H100 cluster)
  • Carbon Emitted: estimated 1,250 kg CO2eq

Technical Specifications

Model Architecture and Objective

  • 3.8B parameter LLaMA-style Transformer
  • 32 layers, 16 attention heads, RoPE, SwiGLU, FlashAttention2
  • Untied input/output embeddings
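
For orientation, a Hugging Face LlamaConfig matching the stated layer and head counts is sketched below. The hidden size, MLP width, and vocabulary size are guesses chosen to land near 3.8B parameters, not published values, and FlashAttention2 is a runtime attention implementation rather than a config field.

# Illustrative config; hidden_size, intermediate_size, and vocab_size are guesses.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=200_064,            # o200k_base-sized vocabulary (assumed)
    hidden_size=2_688,             # assumed
    intermediate_size=7_168,       # SwiGLU MLP width (assumed)
    num_hidden_layers=32,
    num_attention_heads=16,
    max_position_embeddings=4_096,
    tie_word_embeddings=False,     # untied input/output embeddings
    rope_theta=10_000.0,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")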

Compute Infrastructure

  • Framework: PyTorch with Hugging Face Transformers
  • Libraries: DeepSpeed with Megatron-DeepSpeed integration
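
A hedged sketch of how a DeepSpeed run of this kind can be wired up is shown below; the ZeRO stage, optimizer settings, and batch sizes are placeholders, not FastAccounting's actual configuration.

# Placeholder DeepSpeed config; only the fp32 setting mirrors the card.
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_3.8B_instruct_fp32")
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4, "weight_decay": 0.1}},
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": False},   # fp32 training, per the card
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)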

Citation

@misc{fa_llm_2025,
  author = {FastAccounting LLM},
  title = {soroban_3.8B: A Multilingual Accounting Language Model},
  year = {2025},
  url = {https://huggingface.co/FastAccounting/fa_llm_1B},
}

Contact

Lead Developers: Keshav Singh & Fujitake Masato

Organization: FastAccounting Japan

Hugging Face Profile: https://huggingface.co/FastAccounting