Model Card for CyberSolve LinAlg 1.2
Welcome to the 🤖🧮CyberSolve LinAlg 1.2🧠📐 model card!
We introduce CyberSolve LinAlg 1.2, a text-to-text large language model trained to solve linear equations. Specifically, CyberSolve LinAlg 1.2 is a downstream version of the FLAN-T5 large model, google/flan-t5-large, fine-tuned on the one-dimensional linear algebra split of the Google DeepMind mathematics dataset. The model weights of CyberSolve LinAlg 1.2 are a further downstream checkpoint from the original CyberSolve LinAlg 1.1 checkpoint, trained for additional epochs to improve model capability.
Note: This is currently the most capable version of CyberSolve LinAlg. See this model demoed in the CyberSolve LinAlg 1.2 🤖 Space.
Model Details
Model Description and Overview
To construct CyberSolve LinAlg 1.2, the FLAN-T5 large model is fine-tuned using a custom PyTorch training loop optimized for multiple Nvidia A100 GPUs. We supervise training of FLAN-T5 large on the algebra__linear_1d split of the Google DeepMind mathematics dataset, an open-source dataset from Google DeepMind available through the 🤗 hub at deepmind/math_dataset. This large dataset consists of programmatically generated mathematical problems and their solutions, spanning a variety of tasks across distinct mathematical disciplines.
In this preliminary family of CyberSolve models, we are specifically interested in understanding the ability of neural models to solve non-trivial mathematical tasks. As such, the CyberSolve LinAlg 1.x family of models is trained on a set of 2M simpler, one-dimensional linear equations. We preprocessed the data and simulated training on a smaller, downsampled subset of the dataset before training for multiple epochs over the dataset's entirety. This model in particular was trained for 2 additional epochs, limited only by funds, beyond the original CyberSolve LinAlg 1.1 checkpoint.
Version 1.2 is the most capable version of CyberSolve LinAlg, scoring a 90.75 exact match score on the evaluation set of 10k linear equations from the DeepMind algebra__linear_1d split. This is a non-trivial improvement from the exact match score of 86.56 attained by CyberSolve LinAlg 1.1.
- Developed by: John Graham Reynolds
- Funded by: Vanderbilt University
- Model type: Text-to-Text Generation
- Language(s) (NLP): English
- Finetuned from model: google/flan-t5-large
Model Source
- Repository: TODO
Uses
Direct Use
In order to effectively query the model's ability to solve linear equations, a string of the format "Solve <any one-dimensional linear equation of variable x> for x." should be tokenized and passed to the model's generate method. An example input string is input_text = "Solve 24 = 1601*c - 1605*c for c.". The model will attempt to solve the equation, outputting its prediction in a simple numeric format. See the example below.
How to Use and Query the Model
Use the code below to get started with the model. Users pass an input_text string (again, of the form input_text = "Solve 24 = 1601*c - 1605*c for c.") which prompts the model to solve a one-dimensional linear equation.

Model prediction is significantly faster on a GPU, so using the .to('cuda') calls to ensure both the model and all input ids are on the GPU is best practice.

Furthermore, the FLAN-T5 model architecture makes use of many normalization layers, as is common in the transformer architecture. By default, CyberSolve uses the T5 model's T5LayerNorm Python class; it is highly recommended that users install the Nvidia Apex package for Nvidia GPUs or the ROCm Apex package for AMD GPUs. Once Apex is installed, the model will default to using the apex.normalization.FusedRMSNorm class when computing the normalization layers. The FusedRMSNorm class from apex makes use of an optimized fused kernel that is much faster than the standard T5LayerNorm class, thereby significantly speeding up both inference and training.

The base FLAN-T5 model is capable of answering a variety of prompts, but the domain-adapted CyberSolve LinAlg model is designed specifically for solving linear equations. As such, users should take care in their prompt engineering to issue a coherent, relevant query as outlined above and below.
# import apex  # optional: when Nvidia/ROCm Apex is installed, the T5 implementation picks up apex.normalization.FusedRMSNorm automatically
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained("MarioBarbeque/CyberSolve-LinAlg-1.2").to("cuda")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large") # CyberSolve uses the same tokenizer as the base FLAN-T5 model
# Pass the model instruction to solve a linear equation in the following simple format
input_text = "Solve 24 = 1601*c - 1605*c for c."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
This code outputs the following:
-6
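If Apex is installed, one can spot-check that the fused normalization kernel was actually picked up by inspecting one of the model's normalization layers. This is a minimal sketch; the attribute path assumes the standard T5 module layout in 🤗 Transformers.

# Inspect the layer-norm class of the first encoder block
print(type(model.encoder.block[0].layer[0].layer_norm))
# Expected: apex.normalization.fused_layer_norm.FusedRMSNorm when Apex was discovered,
# otherwise transformers.models.t5.modeling_t5.T5LayerNorm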
Training Details
Training Data / Preprocessing
The data used comes from Google DeepMind and the 🤗 hub; the dataset card can be found at deepmind/math_dataset. The DeepMind Mathematics DatasetDict object is composed of a vast variety of underlying mathematics datasets. Each of the underlying datasets contains a specific class of mathematical problems and their solutions. For the CyberSolve LinAlg 1.x family of models, we are interested specifically in solving one-dimensional linear equations, so we use the algebra__linear_1d split.
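As a rough sketch, the split can be loaded directly from the 🤗 hub. The config name algebra__linear_1d and the question/answer columns follow the dataset card; depending on your datasets version, this script-based dataset may also require trust_remote_code=True.

from datasets import load_dataset

# Load only the 1D linear algebra configuration of the DeepMind mathematics dataset
raw = load_dataset("deepmind/math_dataset", "algebra__linear_1d")
print(raw)               # DatasetDict with 'train' (~2M records) and 'test' (10k records) splits
print(raw["train"][0])   # e.g. {'question': "b'Solve ...\\n'", 'answer': "b'...\\n'"}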
The training and evaluation splits of the 1D linear algebra dataset are preprocessed in the following way: we format the raw problems and their solutions of the form "b'Solve 65*l - 361 + 881 = 0 for l.\\n'" and "b'-8\\n'" into the much cleaner "Solve 65*l - 361 + 881 = 0 for l." and "-8".
All inputs and labels are then tokenized. We subsequently evaluate the length of each input_ids vector and each labels vector to ensure there are no outliers and no inputs that need to be truncated. For later ease of loading, we upload these preprocessed and tokenized training and evaluation datasets
to the 🤗 hub at the following locations: MarioBarbeque/DeepMind-LinAlg-1D-train and MarioBarbeque/DeepMind-LinAlg-1D-eval.
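A minimal sketch of this cleaning and tokenization step is below. The clean_text helper is a hypothetical name, the question/answer columns follow the raw examples above, and raw is the DatasetDict loaded in the earlier sketch.

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")

def clean_text(text: str) -> str:
    # Strip the "b'...'" wrapper and the trailing newline escape from the raw fields
    return text[2:-1].replace("\\n", "").strip()

def preprocess(batch):
    inputs = [clean_text(q) for q in batch["question"]]
    targets = [clean_text(a) for a in batch["answer"]]
    model_inputs = tokenizer(inputs)
    model_inputs["labels"] = tokenizer(targets).input_ids
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=["question", "answer"])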
Training Procedure
The model was trained locally on a single node with multiple Nvidia A100 GPUs using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of both Nvidia Apex and 🤗 Accelerate. A rough sketch of such a loop is shown after the hyperparameters below.
Training Hyperparameters
- Precision: We use FP32 precision, the same precision as the base google/flan-t5-large model.
- Optimizer: apex.optimizers.FusedAdam, a fused-kernel version of the AdamW optimizer from Nvidia Apex
- Learning Rate: We use a linear learning rate scheduler with an initial learning rate of 1e-4 to further adjust the CyberSolve LinAlg 1.1 weights
- Batch Size: 64
- Number of Training Steps: 1918 training steps over 2 additional epochs (CyberSolve LinAlg 1.2), beyond the original 2877 total steps over 3 epochs (CyberSolve LinAlg 1.1)
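The following is an illustrative sketch of an Accelerate-based loop of this kind, not our exact training code: it reuses the model, tokenizer, and tokenized datasets from the sketches above, checkpointing and logging are omitted, the warmup setting is a placeholder, and the optimizer, scheduler, batch size, and step count mirror the hyperparameters listed above.

from torch.utils.data import DataLoader
from accelerate import Accelerator
from apex.optimizers import FusedAdam
from transformers import DataCollatorForSeq2Seq, get_linear_schedule_with_warmup

collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # pads inputs and labels per batch
train_dataloader = DataLoader(tokenized["train"], batch_size=64, shuffle=True, collate_fn=collator)

accelerator = Accelerator()
optimizer = FusedAdam(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=1918)
model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)

model.train()
for epoch in range(2):
    for batch in train_dataloader:
        loss = model(**batch).loss       # seq2seq LM loss against the tokenized labels
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()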
Evaluation / Metrics
We evaluate our text-to-text linear equation solver by using the exact_match metric to compare the model's decoded predicted tokens with their numeric labels. CyberSolve LinAlg 1.2 scores a 90.75 exact match score on the evaluation set of 10k linear equations from the DeepMind algebra__linear_1d split. This is a non-trivial improvement over the exact match score of 86.56 attained by CyberSolve LinAlg 1.1.
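A hedged sketch of this evaluation, using the 🤗 Evaluate implementation of exact_match and reusing the model, tokenizer, and tokenized datasets from the sketches above; generation settings and batching are simplified relative to our actual evaluation code.

import torch
import evaluate

exact_match = evaluate.load("exact_match")

model.eval()
predictions, references = [], []
for example in tokenized["test"]:
    input_ids = torch.tensor([example["input_ids"]]).to("cuda")
    with torch.no_grad():
        generated = model.generate(input_ids)
    predictions.append(tokenizer.decode(generated[0], skip_special_tokens=True))
    references.append(tokenizer.decode(example["labels"], skip_special_tokens=True))

print(exact_match.compute(predictions=predictions, references=references))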
Additionally, we construct a partial correctness dataset, available on the 🤗 hub at MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark. This dataset was created with the goal of analyzing both the token-to-token and decoded-sequence-to-decoded-sequence partial correctness of CyberSolve's predictions in detail, beyond just its ability to get answers flat-out right or wrong. Similar partial correctness benchmark datasets were created for the initial FLAN-T5 model, the zeroth-generation downsampled training of CyberSolve, and the 1.1 version of the model. We have yet to complete partial correctness analysis of the various model versions and their predictions, but we look forward to better understanding the mathematical reasoning capabilities of models and publishing our results when complete!
Testing Data, Factors & Metrics
Testing Data
The 1D Linear Algebra split of the Google DeepMind Mathematics dataset comes pre-split into training and evaluation datasets of 2M and 10k records, respectively. Before training CyberSolve LinAlg 1.1, we trained a zeroth-generation, downsampled version of CyberSolve by train_test_split-ing (à la scikit-learn) the set of 2M training records into much smaller training and evaluation datasets. We used this smaller set to evaluate the less interesting zeroth-generation model, while we used the standard set of 10k evaluation records for evaluating both CyberSolve LinAlg 1.1 and CyberSolve LinAlg 1.2.
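As an illustrative sketch of that downsampling step (the 200k/10k sizes and seed below are hypothetical placeholders, not the exact values we used):

# Carve a smaller train/eval pair out of the 2M-record training split
downsampled = tokenized["train"].train_test_split(train_size=200_000, test_size=10_000, seed=42)
small_train, small_eval = downsampled["train"], downsampled["test"]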
Results
We find the following benchmark scores for each of our neural models after the corresponding epoch of training.
| model | epoch | exact_match score |
|---|---|---|
| CyberSolve LinAlg 1.2 | 1 | 90.75 |
| CyberSolve LinAlg 1.2 | 0 | 83.12 |
| CyberSolve LinAlg 1.1 | 2 | 86.56 |
| CyberSolve LinAlg 1.1 | 1 | 73.80 |
| CyberSolve LinAlg 1.1 | 0 | 55.35 |
| CyberSolve LinAlg Downsample | 2 | 44.99 |
| CyberSolve LinAlg Downsample | 1 | 39.69 |
| CyberSolve LinAlg Downsample | 0 | 32.21 |
Summary
We train this model for the purpose of researching the mathematical reasoning abilities of transformer-based neural models, in terms of both full correctness and partial correctness. Our efforts made use of the 🤗 ecosystem, a system of parallelized Nvidia A100 GPUs in an Azure Databricks environment, custom PyTorch training and evaluation code, novel high-performance computing and deep learning libraries like Nvidia Apex, and more.
We learned a great deal and look forward to finalizing our research on the partial correctness reasoning abilities of these preliminary models. We also eagerly plan to further improve the CyberSolve family of models to tackle more difficult mathematical tasks. Looking forward, CyberSolve LinAlg 2.x will likely incorporate knowledge of systems of composed one-dimensional linear equations and more general multi-variable linear equations. Finally, methods related to reinforcement learning are equally enticing for improving neural reasoning abilities; the future is bright for teaching mathematics to AI!
We look forward to taking part in this great and worthy endeavor.
Environmental Impact
- Hardware Type: Nvidia Ampere A100 80GB
- Hours used: 21.5
- Cloud Provider: Microsoft Azure
- Compute Region: EastUS
- Carbon Emitted: 3.18 kgCO2
Experiments were conducted using Azure in region eastus, which has a carbon efficiency of 0.37 kgCO$_2$eq/kWh. A cumulative 21.5 hours of computation was performed on hardware of type A100 SXM4 80GB (TDP of 400W).
Total emissions are estimated to be 3.18 kgCO$_2$eq, of which 100 percent was directly offset by the cloud provider.
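This estimate follows the standard power-draw arithmetic: 21.5 h × 0.400 kW × 0.37 kgCO$_2$eq/kWh ≈ 3.18 kgCO$_2$eq.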
Estimations were conducted using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
Hardware
The model was trained locally in an Azure Databricks workspace using a single-node cloud compute instance with 2 Nvidia A100 80GB GPUs for 21.5 GPU hours.
Software
Training utilized PyTorch, Nvidia Apex, 🤗 Transformers, 🤗 Tokenizers, 🤗 Datasets, 🤗 Accelerate, and more in an Azure Databricks execution environment.
Citations
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}