This is a one-layer base model with the Llama 2 architecture, trained on 6B tokens from the AlgebraicStack part of the Proof-Pile-2 dataset.
Its output distribution is therefore mostly concerned with code. The tokenizer is the Llama 2 tokenizer. I used the following hyperparameters (a matching config sketch follows the list):
d_model = 512
d_ff = 2048
n_heads = 4
n_ctx = 1024
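
For reference, these roughly correspond to the following `transformers` config. This is a sketch, not the actual training code: values not stated above, such as the vocabulary size (taken from the Llama 2 tokenizer) and the RoPE settings, are assumptions or library defaults.

```python
from transformers import LlamaConfig

# Approximate config matching the hyperparameters above.
config = LlamaConfig(
    hidden_size=512,               # d_model
    intermediate_size=2048,        # d_ff
    num_hidden_layers=1,           # one transformer layer
    num_attention_heads=4,         # n_heads
    max_position_embeddings=1024,  # n_ctx
    vocab_size=32000,              # Llama 2 tokenizer vocabulary (assumed)
)
```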

For training I used AdamW with weight decay 0.05 and a cosine-annealing learning-rate schedule with 5000 warmup steps and a maximum learning rate of 1e-4, in BF16 precision. A sketch of this setup follows below.
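
A minimal sketch of that optimizer and schedule using PyTorch and `transformers`; `model` and `total_steps` are placeholders, since the batch size (and hence the number of optimizer steps for 6B tokens) is not stated here.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholders: `model` is the 1-layer Llama model, `total_steps` depends on
# the (unstated) batch size needed to cover 6B tokens.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5000,
    num_training_steps=total_steps,
)
# Forward/backward passes would run in BF16, e.g. under
# torch.autocast(device_type="cuda", dtype=torch.bfloat16).
```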

Train loss: 2.6228
Test loss: 2.7490

Model size: 37M parameters (safetensors, BF16).
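
If the safetensors checkpoint and config follow the standard Llama layout, the model should load with the usual `transformers` API. This is an untested sketch based only on the information on this page.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Untested: assumes the repo ships standard config and tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("Ffohturk/Mila_1L")
model = AutoModelForCausalLM.from_pretrained("Ffohturk/Mila_1L", torch_dtype=torch.bfloat16)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```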