BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
Abstract
BizFinBench is a benchmark for evaluating large language models in financial applications, revealing distinct performance patterns across various tasks.
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
Community
In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate).
Large language models excel across general tasks, yet judging their reliability in logic-heavy, precision-critical domains such as finance, law and healthcare is still difficult. To address this challenge, we propose BizFinBench, the first benchmark grounded in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, covering five dimensions: numerical calculation, reasoning, information extraction, prediction recognition and knowledge-based question answering, which are mapped to nine fine-grained categories.
News
- [16/05/2025] We released the BizFinBench benchmark (V1), the first benchmark grounded in real-world financial applications.
Highlights
- Benchmark: We propose BizFinBench, the first evaluation benchmark in the financial domain that integrates business-oriented tasks, covering 5 dimensions and 9 categories. It is designed to assess the capacity of LLMs in real-world financial scenarios.
- Judge model: We design a novel evaluation method, IteraJudge, which enhances the capability of LLMs as a judge by refining their decision boundaries in specific financial evaluation tasks.
- Key insights: We conduct a comprehensive evaluation of 25 LLMs on BizFinBench, uncovering key insights into their strengths and limitations in financial applications.
Data Distribution
This dataset contains multiple subtasks, each focusing on a different financial understanding and reasoning ability, as follows:
Dataset | Description | Evaluation Dimensions | Volume |
---|---|---|---|
Anomalous Event Attribution | A financial anomaly attribution evaluation dataset assessing models' ability to trace stock fluctuations based on given information (e.g., timestamps, news articles, financial reports, and stock movements). | Causal consistency, information relevance, noise resistance | 1,064 |
Financial Numerical Computation | A financial numerical computation dataset evaluating models' ability to perform accurate numerical calculations in financial scenarios, including interest rate calculations, gain/loss computations, etc. | Calculation accuracy, unit consistency | 581 |
Financial Time Reasoning | A financial temporal reasoning evaluation dataset assessing models' ability to comprehend and reason about time-based financial events, such as "the previous trading day" or "the first trading day of the year." | Temporal reasoning correctness | 514 |
Financial Data Description | A financial data description evaluation dataset measuring models' ability to analyze and describe structured/unstructured financial data, e.g., "the stock price first rose to XX before falling to XX." | Trend accuracy, data consistency | 1,461 |
Stock Price Prediction | A stock price movement prediction dataset evaluating models' ability to forecast future stock price trends based on historical data, financial indicators, and market news. | Trend judgment, causal rationality | 497 |
Financial Named Entity Recognition | A financial named entity recognition dataset assessing models' ability to identify entities (Person, Organization, Market, Location, Financial Products, Date/Time) in short/long financial news. | Recognition accuracy, entity category correctness | 433 |
Emotion Recognition | A financial sentiment recognition dataset evaluating models' ability to discern nuanced user emotions in complex financial market environments. Inputs include multi-dimensional data such as market conditions, news, research reports, user holdings, and queries, covering six emotion categories: optimism, anxiety, pessimism, excitement, calmness, and regret. | Emotion classification accuracy, implicit information extraction and reasoning correctness | 600 |
Financial Tool Usage | A financial tool usage dataset evaluating models' ability to understand user queries and appropriately utilize various financial tools (investment analysis, market research, information retrieval, etc.) to solve real-world problems. Tools include calculators, financial encyclopedia queries, search engines, data queries, news queries, economic calendars, and company lookups. Models must accurately interpret user intent, select appropriate tools, input correct parameters, and coordinate multiple tools when necessary. | Tool selection rationality, parameter input accuracy, multi-tool coordination capability | 641 |
Financial Knowledge QA | A financial encyclopedia QA dataset assessing models' understanding and response accuracy regarding core financial knowledge, covering key domains: financial fundamentals, markets, investment theories, macroeconomics, etc. | Query comprehension accuracy, knowledge coverage breadth, answer accuracy and professionalism | 990 |
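As a quick sanity check on the table above, the per-dataset volumes can be tallied and compared with the 6,781 queries stated for the benchmark. This is a minimal sketch: the dictionary and variable names are illustrative and not part of the BizFinBench code, and the numbers are copied by hand from the Volume column.

```python
# Volumes copied by hand from the "Volume" column of the table above.
# The dictionary itself is illustrative, not an official BizFinBench artifact.
volumes = {
    "Anomalous Event Attribution": 1064,
    "Financial Numerical Computation": 581,
    "Financial Time Reasoning": 514,
    "Financial Data Description": 1461,
    "Stock Price Prediction": 497,
    "Financial Named Entity Recognition": 433,
    "Emotion Recognition": 600,
    "Financial Tool Usage": 641,
    "Financial Knowledge QA": 990,
}

# Total number of queries across all nine subtasks.
total = sum(volumes.values())

# Share of each subtask as a percentage of the total, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in volumes.items()}

# Subtask contributing the most queries.
largest = max(volumes, key=volumes.get)

if __name__ == "__main__":
    print(f"total queries: {total}")
    for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {volumes[name]} ({pct}%)")
```

The nine volumes add up to the 6,781 queries quoted in the abstract, so the table and the headline figure agree.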
Results
The models are evaluated across nine tasks, with medals marking the top three performers for each task:
- 🥇 indicates the top-performing model.
- 🥈 represents the second-best result.
- 🥉 denotes the third-best performance.
Column abbreviations follow the datasets listed above: AEA = Anomalous Event Attribution, FNC = Financial Numerical Computation, FTR = Financial Time Reasoning, FTU = Financial Tool Usage, FQA = Financial Knowledge QA, FDD = Financial Data Description, ER = Emotion Recognition, SP = Stock Price Prediction, FNER = Financial Named Entity Recognition.
Model | AEA | FNC | FTR | FTU | FQA | FDD | ER | SP | FNER | Average |
---|---|---|---|---|---|---|---|---|---|---|
Proprietary LLMs | ||||||||||
ChatGPT-o3 | 🥈 86.23 | 61.30 | 🥈 75.36 | 🥇 89.15 | 🥈 91.25 | 🥉 98.55 | 🥉 44.48 | 53.27 | 65.13 | 🥇 73.86 |
ChatGPT-o4-mini | 🥉 85.62 | 60.10 | 71.23 | 74.40 | 90.27 | 95.73 | 🥇 47.67 | 52.32 | 64.24 | 71.29 |
GPT-4o | 79.42 | 56.51 | 🥇 76.20 | 82.37 | 87.79 | 🥇 98.84 | 🥈 45.33 | 54.33 | 65.37 | 🥉 71.80 |
Gemini-2.0-Flash | 🥇 86.94 | 🥉 62.67 | 73.97 | 82.55 | 90.29 | 🥈 98.62 | 22.17 | 🥉 56.14 | 54.43 | 69.75 |
Claude-3.5-Sonnet | 84.68 | 🥈 63.18 | 42.81 | 🥈 88.05 | 87.35 | 96.85 | 16.67 | 47.60 | 63.09 | 65.59 |
Open Source LLMs | ||||||||||
Qwen2.5-7B-Instruct | 73.87 | 32.88 | 39.38 | 79.03 | 83.34 | 78.93 | 37.50 | 51.91 | 30.31 | 56.35 |
Qwen2.5-72B-Instruct | 69.27 | 54.28 | 70.72 | 85.29 | 87.79 | 97.43 | 35.33 | 55.13 | 54.02 | 67.70 |
Qwen2.5-VL-3B | 53.85 | 15.92 | 17.29 | 8.95 | 81.60 | 59.44 | 39.50 | 52.49 | 21.57 | 38.96 |
Qwen2.5-VL-7B | 73.87 | 32.71 | 40.24 | 77.85 | 83.94 | 77.41 | 38.83 | 51.91 | 33.40 | 56.68 |
Qwen2.5-VL-14B | 37.12 | 41.44 | 53.08 | 82.07 | 84.23 | 7.97 | 37.33 | 54.93 | 47.47 | 49.52 |
Qwen2.5-VL-32B | 76.79 | 50.00 | 62.16 | 83.57 | 85.30 | 95.95 | 40.50 | 54.93 | 🥉 68.36 | 68.62 |
Qwen2.5-VL-72B | 69.55 | 54.11 | 69.86 | 85.18 | 87.37 | 97.34 | 35.00 | 54.94 | 54.41 | 67.53 |
Qwen3-1.7B | 77.40 | 35.80 | 33.40 | 75.82 | 73.81 | 78.62 | 22.40 | 48.53 | 11.23 | 50.78 |
Qwen3-4B | 83.60 | 47.40 | 50.00 | 78.19 | 82.24 | 80.16 | 42.20 | 50.51 | 25.19 | 59.94 |
Qwen3-14B | 84.20 | 58.20 | 65.80 | 82.19 | 84.12 | 92.91 | 33.00 | 52.31 | 50.70 | 67.05 |
Qwen3-32B | 83.80 | 59.60 | 64.60 | 85.12 | 85.43 | 95.37 | 39.00 | 52.26 | 49.19 | 68.26 |
Xuanyuan3-70B | 12.14 | 19.69 | 15.41 | 80.89 | 86.51 | 83.90 | 29.83 | 52.62 | 37.33 | 46.48 |
Llama-3.1-8B-Instruct | 73.12 | 22.09 | 2.91 | 77.42 | 76.18 | 69.09 | 29.00 | 54.21 | 36.56 | 48.95 |
Llama-3.1-70B-Instruct | 16.26 | 34.25 | 56.34 | 80.64 | 79.97 | 86.90 | 33.33 | 🥇 62.16 | 45.95 | 55.09 |
Llama 4 Scout | 73.60 | 45.80 | 44.20 | 85.02 | 85.21 | 92.32 | 25.60 | 55.76 | 43.00 | 61.17 |
DeepSeek-V3 (671B) | 74.34 | 61.82 | 72.60 | 🥉 86.54 | 🥉 91.07 | 98.11 | 32.67 | 55.73 | 🥈 71.24 | 71.57 |
DeepSeek-R1 (671B) | 80.36 | 🥇 64.04 | 🥉 75.00 | 81.96 | 🥇 91.44 | 98.41 | 39.67 | 55.13 | 🥇 71.46 | 🥈 73.05 |
QwQ-32B | 84.02 | 52.91 | 64.90 | 84.81 | 89.60 | 94.20 | 34.50 | 🥈 56.68 | 30.27 | 65.77 |
DeepSeek-R1-Distill-Qwen-14B | 71.33 | 44.35 | 16.95 | 81.96 | 85.52 | 92.81 | 39.50 | 50.20 | 52.76 | 59.49 |
DeepSeek-R1-Distill-Qwen-32B | 73.68 | 51.20 | 50.86 | 83.27 | 87.54 | 97.81 | 41.50 | 53.92 | 56.80 | 66.29 |
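The Average column appears to be the unweighted mean of the nine task scores. The source does not state this explicitly; it is an assumption inferred from the numbers. The sketch below checks it for three rows copied from the table and ranks those rows on a single task. The helper names `mean_score` and `top_k` are illustrative and do not come from the BizFinBench code, and the ranking covers only the three rows included here, not all 25 models.

```python
# Task columns in the order used by the results table.
TASKS = ["AEA", "FNC", "FTR", "FTU", "FQA", "FDD", "ER", "SP", "FNER"]

# Three rows copied by hand from the results table (a subset, for illustration).
scores = {
    "ChatGPT-o3": [86.23, 61.30, 75.36, 89.15, 91.25, 98.55, 44.48, 53.27, 65.13],
    "GPT-4o": [79.42, 56.51, 76.20, 82.37, 87.79, 98.84, 45.33, 54.33, 65.37],
    "DeepSeek-R1 (671B)": [80.36, 64.04, 75.00, 81.96, 91.44, 98.41, 39.67, 55.13, 71.46],
}


def mean_score(row):
    """Unweighted mean of the per-task scores, rounded to two decimals."""
    return round(sum(row) / len(row), 2)


def top_k(task, k=3):
    """Model names sorted by descending score on one task (within `scores` only)."""
    idx = TASKS.index(task)
    return sorted(scores, key=lambda m: scores[m][idx], reverse=True)[:k]


# Recomputed averages, to compare against the table's Average column.
averages = {model: mean_score(row) for model, row in scores.items()}

if __name__ == "__main__":
    for model, avg in averages.items():
        print(f"{model}: {avg:.2f}")
    print("FTR ranking (subset):", top_k("FTR"))
```

For these three rows the recomputed means match the reported averages (73.86, 71.80 and 73.05), which is consistent with the Average column weighting each task equally regardless of dataset size.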
Citation
Coming soon.
License
Usage and License Notices: The data and code are intended and licensed for research use only.
License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Use should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use
Acknowledgement
- We would like to thank Weijie Zhang for his contribution to the development of the inference engine.
- This work leverages vLLM as the backend model server for evaluation purposes.