Model Card for CyberSolve LinAlg 1.2
Welcome to the 🤖🧮CyberSolve LinAlg 1.2🧠📐 model card!
We introduce CyberSolve LinAlg 1.2, a text-to-text large language model trained to solve linear equations. Specifically, CyberSolve LinAlg 1.2 is a downstream version of the FLAN-T5 large model, google/flan-t5-large, fine-tuned on the one-dimensional linear algebra split of the Google DeepMind mathematics dataset. The model weights of CyberSolve LinAlg 1.2 are a further downstream checkpoint from the original CyberSolve LinAlg 1.1 checkpoint, trained for additional epochs to improve model capability.
Note: This is currently the most capable version of CyberSolve LinAlg. See this model demoed in the CyberSolve LinAlg 1.2 🤖 Space.
Model Details
Model Description and Overview
To construct CyberSolve LinAlg 1.2, the FLAN-T5 large model is fine-tuned using a custom PyTorch training loop optimized for multiple Nvidia A100 GPUs. We supervise training of FLAN-T5 large on the algebra__linear_1d split of the Google DeepMind mathematics dataset, an open-source dataset from Google DeepMind available through the 🤗 hub at deepmind/math_dataset. This large dataset consists of programmatically generated mathematical problems and their solutions, spanning a variety of tasks across distinct mathematical disciplines.
In this preliminary family of CyberSolve models, we are specifically interested in understanding the ability of neural models to solve non-trivial mathematical tasks. As such, the CyberSolve LinAlg 1.x family of models is trained on a set of 2M simpler, one-dimensional linear equations. We preprocessed the data and simulated training on a smaller, downsampled subset of the dataset before training for multiple epochs over the dataset's entirety. This model in particular was trained for 2 additional epochs, limited only by funds, beyond the original CyberSolve LinAlg 1.1 checkpoint.
Version 1.2 is the most capable version of CyberSolve LinAlg, scoring a 90.75 exact match score on the evaluation set of 10k linear equations from the DeepMind algebra__linear_1d split. This is a non-trivial improvement from the exact match score of 86.56 attained by CyberSolve LinAlg 1.1.
- Developed by: John Graham Reynolds
- Funded by: Vanderbilt University
- Model type: Text-to-Text Generation
- Language(s) (NLP): English
- Finetuned from model: google/flan-t5-large
Model Source
- Repository: TODO
Uses
Direct Use
In order to effectively query the model's ability to solve linear equations, a string of the format "Solve <any one-dimensional linear equation of variable x> for x." should be tokenized and passed to the model's generate method. An example input string is input_text = "Solve 24 = 1601*c - 1605*c for c.". The model will attempt to solve the equation, outputting its prediction in a simple numeric format. See the example below.
How to Use and Query the Model
Use the code below to get started with the model. Users pass an input_text string (again, of the form input_text = "Solve 24 = 1601*c - 1605*c for c.") which prompts the model to solve a one-dimensional linear equation.

Model prediction is significantly faster on a GPU, so using the .to('cuda') calls to ensure both the model and all input ids are on the GPU is best practice.

Furthermore, the FLAN-T5 model architecture makes use of many normalization layers, as is common in the transformer architecture. By default, CyberSolve uses the T5 model's T5LayerNorm Python class; it is highly recommended that users install the Nvidia Apex package for Nvidia GPUs or the ROCm Apex package for AMD GPUs. Once Apex is installed, the model will default to using the apex.normalization.FusedRMSNorm class when computing the normalization layers. The FusedRMSNorm class from apex makes use of an optimized fused kernel that is much faster than the standard T5LayerNorm class, thereby significantly speeding up both inference and training.

The base FLAN-T5 model is capable of answering a variety of prompts, but the domain-adapted CyberSolve LinAlg model is designed specifically for solving linear equations. As such, users should take care in their prompt engineering to issue a coherent, relevant query as outlined above and below.
# import apex  # optional: when Nvidia/ROCm Apex is installed, the T5 implementation picks up apex.normalization.FusedRMSNorm automatically
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained("MarioBarbeque/CyberSolve-LinAlg-1.2").to("cuda")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large") # CyberSolve uses the same tokenizer as the base FLAN-T5 model
# Pass the model instruction to solve a linear equation in the following simple format
input_text = "Solve 24 = 1601*c - 1605*c for c."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
This code outputs the following:
-6
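If Apex is installed, one can spot-check that the fused normalization kernel was actually picked up by inspecting one of the model's normalization layers. This is a minimal sketch; the attribute path assumes the standard T5 module layout in 🤗 Transformers.

# Inspect the layer-norm class of the first encoder block
print(type(model.encoder.block[0].layer[0].layer_norm))
# Expected: apex.normalization.fused_layer_norm.FusedRMSNorm when Apex was discovered,
# otherwise transformers.models.t5.modeling_t5.T5LayerNorm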
Training Details
Training Data / Preprocessing
The data used comes from Google DeepMind and the 🤗 hub; the dataset card can be found at deepmind/math_dataset. The DeepMind Mathematics DatasetDict object is composed of a vast variety of underlying mathematics datasets. Each of the underlying datasets contains a specific class of mathematical problems and their solutions. For the CyberSolve LinAlg 1.x family of models, we are interested specifically in solving one-dimensional linear equations, so we use the algebra__linear_1d split.
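As a rough sketch, the split can be loaded directly from the 🤗 hub. The config name algebra__linear_1d and the question/answer columns follow the dataset card; depending on your datasets version, this script-based dataset may also require trust_remote_code=True.

from datasets import load_dataset

# Load only the 1D linear algebra configuration of the DeepMind mathematics dataset
raw = load_dataset("deepmind/math_dataset", "algebra__linear_1d")
print(raw)               # DatasetDict with 'train' (~2M records) and 'test' (10k records) splits
print(raw["train"][0])   # e.g. {'question': "b'Solve ...\\n'", 'answer': "b'...\\n'"}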
The training and evaluation splits of the 1D linear algebra dataset are preprocessed in the following way: we format the raw problems and their solutions of the form "b'Solve 65*l - 361 + 881 = 0 for l.\\n'" and "b'-8\\n'" into the much cleaner "Solve 65*l - 361 + 881 = 0 for l." and "-8".
All inputs and labels are then tokenized. We subsequently evaluate the length of each input_ids vector and each labels vector to ensure there are no outliers and no inputs that need to be truncated. For later ease of loading, we upload these preprocessed and tokenized training and evaluation datasets
to the 🤗 hub at the following locations: MarioBarbeque/DeepMind-LinAlg-1D-train and MarioBarbeque/DeepMind-LinAlg-1D-eval.
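A minimal sketch of this cleaning and tokenization step is below. The clean_text helper is a hypothetical name, the question/answer columns follow the raw examples above, and raw is the DatasetDict loaded in the earlier sketch.

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")

def clean_text(text: str) -> str:
    # Strip the "b'...'" wrapper and the trailing newline escape from the raw fields
    return text[2:-1].replace("\\n", "").strip()

def preprocess(batch):
    inputs = [clean_text(q) for q in batch["question"]]
    targets = [clean_text(a) for a in batch["answer"]]
    model_inputs = tokenizer(inputs)
    model_inputs["labels"] = tokenizer(targets).input_ids
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=["question", "answer"])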
Training Procedure
The model was trained locally on a single node with multiple Nvidia A100 GPUs using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of both Nvidia Apex and 🤗 Accelerate. A rough sketch of such a loop is shown after the hyperparameters below.
Training Hyperparameters
- Precision: We use FP32 precision, the same precision as the base google/flan-t5-large model.
- Optimizer: apex.optimizers.FusedAdam, a fused-kernel version of the AdamW optimizer from Nvidia Apex
- Learning Rate: We use a linear learning rate scheduler with an initial learning rate of 1e-4 to further adjust the CyberSolve LinAlg 1.1 weights
- Batch Size: 64
- Number of Training Steps: 1918 training steps over 2 additional epochs (CyberSolve LinAlg 1.2), beyond the original 2877 total steps over 3 epochs (CyberSolve LinAlg 1.1)
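The following is an illustrative sketch of an Accelerate-based loop of this kind, not our exact training code: it reuses the model, tokenizer, and tokenized datasets from the sketches above, checkpointing and logging are omitted, the warmup setting is a placeholder, and the optimizer, scheduler, batch size, and step count mirror the hyperparameters listed above.

from torch.utils.data import DataLoader
from accelerate import Accelerator
from apex.optimizers import FusedAdam
from transformers import DataCollatorForSeq2Seq, get_linear_schedule_with_warmup

collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # pads inputs and labels per batch
train_dataloader = DataLoader(tokenized["train"], batch_size=64, shuffle=True, collate_fn=collator)

accelerator = Accelerator()
optimizer = FusedAdam(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=1918)
model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)

model.train()
for epoch in range(2):
    for batch in train_dataloader:
        loss = model(**batch).loss       # seq2seq LM loss against the tokenized labels
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()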
Evaluation / Metrics
We evaluate our text-to-text linear equation solver by using the exact_match metric to compare the model's decoded predicted tokens with their numeric labels. CyberSolve LinAlg 1.2 scores a 90.75 exact match score on the evaluation set of 10k linear equations from the DeepMind algebra__linear_1d split. This is a non-trivial improvement over the exact match score of 86.56 attained by CyberSolve LinAlg 1.1.
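A hedged sketch of this evaluation, using the 🤗 Evaluate implementation of exact_match and reusing the model, tokenizer, and tokenized datasets from the sketches above; generation settings and batching are simplified relative to our actual evaluation code.

import torch
import evaluate

exact_match = evaluate.load("exact_match")

model.eval()
predictions, references = [], []
for example in tokenized["test"]:
    input_ids = torch.tensor([example["input_ids"]]).to("cuda")
    with torch.no_grad():
        generated = model.generate(input_ids)
    predictions.append(tokenizer.decode(generated[0], skip_special_tokens=True))
    references.append(tokenizer.decode(example["labels"], skip_special_tokens=True))

print(exact_match.compute(predictions=predictions, references=references))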
Additionally, we construct a partial correctness dataset, available on the 🤗 hub at MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark. This dataset was created with the goal of analyzing both the token-to-token and decoded-sequence-to-decoded-sequence partial correctness of CyberSolve's predictions in detail, beyond just its ability to get answers flat-out right or wrong. Similar partial correctness benchmark datasets were created for the initial FLAN-T5 model, the zeroth-generation downsampled training of CyberSolve, and the 1.1 version of the model. We have yet to complete partial correctness analysis of the various model versions and their predictions, but we look forward to better understanding the mathematical reasoning capabilities of models and publishing our results when complete!
Testing Data, Factors & Metrics
Testing Data
The 1D Linear Algebra split of the Google DeepMind Mathematics dataset comes pre-split into training and evaluation datasets of 2M and 10k records, respectively. Before training CyberSolve LinAlg 1.1, we trained a zeroth-generation, downsampled version of CyberSolve by train_test_split-ing (à la scikit-learn) the set of 2M training records into much smaller training and evaluation datasets. We used this smaller set to evaluate the less interesting zeroth-generation model, while we used the standard set of 10k evaluation records for evaluating both CyberSolve LinAlg 1.1 and CyberSolve LinAlg 1.2.
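As an illustrative sketch of that downsampling step (the 200k/10k sizes and seed below are hypothetical placeholders, not the exact values we used):

# Carve a smaller train/eval pair out of the 2M-record training split
downsampled = tokenized["train"].train_test_split(train_size=200_000, test_size=10_000, seed=42)
small_train, small_eval = downsampled["train"], downsampled["test"]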
Results
We find the following benchmark scores for each of our neural models after the corresponding epoch of training.
| model | epoch | exact_match score |
|---|---|---|
| CyberSolve LinAlg 1.2 | 1 | 90.75 |
| CyberSolve LinAlg 1.2 | 0 | 83.12 |
| CyberSolve LinAlg 1.1 | 2 | 86.56 |
| CyberSolve LinAlg 1.1 | 1 | 73.80 |
| CyberSolve LinAlg 1.1 | 0 | 55.35 |
| CyberSolve LinAlg Downsample | 2 | 44.99 |
| CyberSolve LinAlg Downsample | 1 | 39.69 |
| CyberSolve LinAlg Downsample | 0 | 32.21 |
Summary
We train this model for the purpose of researching the mathematical reasoning abilities of transformer-based neural models, in terms of both full correctness and partial correctness. Our efforts made use of the 🤗 ecosystem, a system of parallelized Nvidia A100 GPUs in an Azure Databricks environment, custom PyTorch training and evaluation code, novel high-performance computing and deep learning libraries like Nvidia Apex, and more.
We learned a great deal and look forward to finalizing our research on the partial correctness reasoning abilities of these preliminary models. We also eagerly plan to further improve the CyberSolve family of models to tackle more difficult mathematical tasks. Looking forward, CyberSolve LinAlg 2.x will likely incorporate knowledge of systems of composed one-dimensional linear equations and more general multi-variable linear equations. Finally, methods related to reinforcement learning are equally enticing for improving neural reasoning abilities; the future is bright for teaching mathematics to AI!
We look forward to taking part in this great and worthy endeavor.
Environmental Impact
- Hardware Type: Nvidia Ampere A100 80GB
- Hours used: 21.5
- Cloud Provider: Microsoft Azure
- Compute Region: EastUS
- Carbon Emitted: 3.18 kgCO2
Experiments were conducted using Azure in region eastus, which has a carbon efficiency of 0.37 kgCO$_2$eq/kWh. A cumulative 21.5 hours of computation was performed on hardware of type A100 SXM4 80GB (TDP of 400W).
Total emissions are estimated to be 3.18 kgCO$_2$eq, of which 100 percent was directly offset by the cloud provider.
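This estimate follows the standard power-draw arithmetic: 21.5 h × 0.400 kW × 0.37 kgCO$_2$eq/kWh ≈ 3.18 kgCO$_2$eq.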
Estimations were conducted using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
Hardware
The model was trained locally in an Azure Databricks workspace using a single-node cloud compute instance with 2 Nvidia A100 80GB GPUs for 21.5 GPU hours.
Software
Training utilized PyTorch, Nvidia Apex, 🤗 Transformers, 🤗 Tokenizers, 🤗 Datasets, 🤗 Accelerate, and more in an Azure Databricks execution environment.
Citations
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}