---
license: mit
language:
- en
- ja
tags:
- causal-lm
- llama
- accounting
- tiktoken
- business
- multilingual
library_name: transformers
model-index:
- name: soroban_1B_instruct_105_ckp
  results: []
---

# 📎 Fast Accounting LLM (soroban1Bv0.0)

**soroban_1B_instruct_105_ckp** is a 1B-parameter LLaMA-style language model trained from scratch on multilingual corpora in **English and Japanese**, with a strong emphasis on **accounting**, **financial data**, and **business language**.

It uses a custom tokenizer built on top of `tiktoken` (`o200k_base`), supports special tokens for dialogue, and is optimized for instructional and analytical text generation. This model is a checkpoint of the pretrained soroban base model, further fine-tuned on instruction datasets.

## Model Details

### Model Description

- **Developed by:** FastAccounting Japan
- **Model type:** Causal Language Model (LLaMA-style architecture)
- **Language(s):** English and Japanese
- **License:** MIT
- **Finetuned from model:** Trained from scratch
- **Tokenizer:** Custom BPE tokenizer (based on tiktoken `o200k_base`)

### Model Sources

- **Repository:** https://huggingface.co/FastAccounting/soroban_1B_instruct_105_ckp

## Uses

### Direct Use

- Chat-style generation for accounting and business tasks
- Financial Q&A and report summarization
- Instructional document parsing (in Japanese & English)

### Downstream Use

- Fine-tuning for audit compliance or domain-specific accounting QA

### Out-of-Scope Use

- Unfiltered open-domain conversation
- Real-time decision-making in regulated financial environments

## Bias, Risks, and Limitations

- **Domain Bias:** The model has strong exposure to accounting-style language; it may underperform on open-domain or creative tasks.
- **Language Balance:** While trained on both Japanese and English, performance may vary between the two languages depending on prompt structure.

### Recommendations

- Fine-tune or prompt carefully for non-accounting use cases.
- Always validate financial output before applying it in business or legal settings.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_1B_instruct_105_ckp")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_1B_instruct_105_ckp")

prompt = "<|begin_of_text|>Explain cash flow in Japanese."
inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # 💥 drop token_type_ids, which the model does not use

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
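This checkpoint is instruction-tuned and the tokenizer defines special tokens for dialogue, but the exact chat format is not documented in this card. The sketch below is therefore an assumption: it uses the tokenizer's bundled chat template if one is available and otherwise falls back to the plain `<|begin_of_text|>` prompt shown above; the Japanese question is only an example.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FastAccounting/soroban_1B_instruct_105_ckp")
model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_1B_instruct_105_ckp")

# Example Japanese prompt: "Explain the difference between accounts receivable and accounts payable."
messages = [{"role": "user", "content": "売掛金と買掛金の違いを説明してください。"}]

try:
    # Use the tokenizer's chat template if one is bundled with the checkpoint (assumption).
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
except Exception:
    # Otherwise fall back to the plain prompt format from the example above.
    prompt = "<|begin_of_text|>" + messages[0]["content"]

inputs = tokenizer(prompt, return_tensors="pt")
inputs.pop("token_type_ids", None)  # drop token_type_ids, which the model does not use
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```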
## Training Details

### Training Data

- Internal multilingual accounting corpus (~3T tokens)
- Custom curated Japanese financial documents
- Publicly available English accounting datasets

### Training Procedure

- **Precision:** fp16
- **Context length:** 2048 tokens
- **Optimizer:** AdamW with weight decay
- **Learning schedule:** Cosine decay with warmup (an illustrative sketch of this setup appears at the end of this card)

## Evaluation

- Ongoing evaluation with domain-specific metrics (financial-text perplexity, accuracy on Japanese bookkeeping test questions)

## Environmental Impact

- **Hardware Type:** 32x NVIDIA H100 GPUs
- **Hours used:** ~4800 GPU hours
- **Cloud Provider:** On-prem H100 cluster
- **Carbon Emitted:** Estimated 350 kg CO2eq

## Technical Specifications

### Model Architecture and Objective

- 1B-parameter LLaMA-style Transformer
- 16 layers, 32 attention heads, RoPE, SwiGLU, FlashAttention2
- Tied input/output embeddings

### Compute Infrastructure

- **Framework:** PyTorch with Hugging Face Transformers
- **Libraries:** DeepSpeed, Megatron-DS integration

## Citation

```bibtex
@misc{fa_llm_2025,
  author = {FastAccounting LLM},
  title  = {soroban_1B: A Multilingual Accounting Language Model},
  year   = {2025},
  url    = {https://huggingface.co/FastAccounting/fa_llm_1B},
}
```

## Contact

- **Lead Developer:** Keshav Singh & Fujitake Masato
- **Organization:** FastAccounting Japan
- **Hugging Face Profile:** https://huggingface.co/FastAccounting
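## Appendix: Training Setup Sketch

The Training Procedure section lists AdamW with weight decay and a cosine learning-rate schedule with warmup. The snippet below is a minimal, illustrative sketch of that setup only; the learning rate, weight decay, and step counts are placeholders, not the values used to train this model.

```python
# Illustrative sketch only: hyperparameter values are placeholders, not the
# settings used to train soroban_1B_instruct_105_ckp.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("FastAccounting/soroban_1B_instruct_105_ckp")

# AdamW with weight decay, as listed under Training Procedure.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)

# Cosine decay with warmup, as listed under Training Procedure.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,      # placeholder
    num_training_steps=100_000,  # placeholder
)

# Inside a training loop, each step would call:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```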