# ANE-Hash — Double SHA-256 on Apple Neural Engine (bit-sliced FP16)

Proving a point about emulation: making the ANE do pure bit-twiddling with nothing but fp16 arithmetic and conv permutations.
## What this is
- A faithful double SHA-256 (SHA-256 → SHA-256) implemented entirely in fp16 and compiled to a Core ML `mlprogram` that maps to the Apple Neural Engine.
- Bitwise ops are emulated with fp16 arithmetic; rotates/shifts are 1×1 conv permutations, so everything lands on the ANE (no `gather`).
- Verified byte-for-byte against NIST test vectors via `hashlib`.
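For context, the ground truth the harness compares against is nothing more exotic than two chained `hashlib` calls:

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    # SHA-256 applied twice: exactly what the Core ML graph emulates in fp16.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

print(double_sha256(b"abc").hex())  # the reference the on-device output must match byte-for-byte
```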
This is not a miner. It’s an ANE stress test that shows you can emulate integer crypto on a matrix engine…and that you probably shouldn’t, performance-wise. That contrast is the point.
## Files

- `model.py` — builds and converts the Core ML package (`bitlot.mlpackage`).
- `test.py` — correctness + throughput harness (H/s, MH/s reporting).

Repo name: ANE-Hash. The model artifact name in the scripts is `bitlot.mlpackage`; keep it or rename to taste.
## Quickstart

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install coremltools numpy
```

Build:

```bash
python model.py   # produces bitlot.mlpackage
```

Verify on NIST-style strings (runs entirely on device, compares to `hashlib`):

```bash
python test.py --verify --cu CPU_AND_NE
```

Benchmark (prints mean/p50/p90 and hash rate):

```bash
python test.py --bench --batches 1,2,4,8,16,32,64,128,256,512,1024 \
    --iters 10 --warmup 3 --cu CPU_AND_NE --verbose
```
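The hash-rate arithmetic is simple. Here is a minimal sketch of such a timing loop (not `test.py` itself; the input name `"bits"` and the trailing dimensions are placeholder assumptions):

```python
import time
import numpy as np
import coremltools as ct

mlmodel = ct.models.MLModel("bitlot.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)

batch, iters, warmup = 64, 10, 3
# Placeholder input: the name "bits" and shape (batch, 32, 1, 16) are assumptions,
# not the model's actual interface.
x = np.random.randint(0, 2, size=(batch, 32, 1, 16)).astype(np.float16)

for _ in range(warmup):                 # warm up so compilation/caching is excluded
    mlmodel.predict({"bits": x})

t0 = time.perf_counter()
for _ in range(iters):
    mlmodel.predict({"bits": x})
dt = time.perf_counter() - t0

print(f"mean {dt / iters * 1e3:.2f} ms/pred  ->  {batch * iters / dt:.1f} H/s")
```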
## What you should see

- In Xcode › Metrics, Compute Unit Mapping should be 100% Neural Engine.
- Median prediction time around 2 ms at batch=1 on recent M-series hardware and macOS 15.x (your silicon/OS will vary).
- Measured hash rate from the harness is typically ~0.5–0.6 kH/s at the best batch size (often 16–64).

You are bottlenecked by the emulated boolean math and the massive adders, not by the ANE's MACs.
## Why it runs on ANE
- All fp16: constants, weights, ops.
- No `gather`: rotations and shifts are implemented as 1×1 convs with fixed fp16 0/1 weights.
- Core ML target: `mlprogram`, opset targeting iOS 18 / macOS 15.
- Compute units: `CPU_AND_NE` (or `CPU_ONLY` for sanity checks).
If you want to nudge flexible-shape models toward ANE on newer OS builds, load with:

```python
import coremltools as ct

mlmodel = ct.models.MLModel(
    "bitlot.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    optimization_hints={"reshapeFrequency": ct.ReshapeFrequency.Infrequent},
)
```
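As best I can tell, the `reshapeFrequency` hint (coremltools 8+) tells the runtime that input shapes change rarely, which lets it specialize the flexible-shape graph for the ANE rather than fall back to a more general path.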
## How it works (ANE-centric)
- Bit-slicing: each 32-bit word is a `(32,1,1)` column of {0,1} values in fp16, batched as `(N,32,1,L)`.
- Boolean algebra in fp16: `AND = mul`, `XOR = abs(a-b)`, `OR = maximum`; `CH` and `MAJ` are expressed via those primitives (see the sketch after this list).
- Rotates/shifts: fused 1×1 conv permutations (0/1 fp16 weights). This is why the ops map to the ANE.
- Adders: carry-save plus prefix carry-propagate (log-depth), built from the same primitives.
- Double hash: the first compression runs from a provided midstate plus the last block; the second compression runs over the 256-bit result with correct padding.

Everything stays fp16; there are no fp32 constants and nothing that triggers batch-norm folding.
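If the fp16 boolean algebra feels like sleight of hand, here is a minimal numpy sketch of the primitives, plus a rotate expressed as a fixed 0/1 permutation matrix (the same matrix a 1×1 conv carries as weights). Illustrative only, not the code in `model.py`:

```python
import numpy as np

# --- fp16 boolean algebra on {0,1} lanes ----------------------------------
a = np.random.randint(0, 2, size=32).astype(np.float16)
b = np.random.randint(0, 2, size=32).astype(np.float16)
c = np.random.randint(0, 2, size=32).astype(np.float16)

AND = a * b                     # only 1*1 produces 1
XOR = np.abs(a - b)             # |0-1| = |1-0| = 1, else 0
OR  = np.maximum(a, b)
NOT = 1 - a
CH  = np.abs(a * b - (1 - a) * c)            # (e&f) ^ (~e&g); the two terms never overlap
MAJ = np.abs(np.abs(a * b - a * c) - b * c)  # (a&b) ^ (a&c) ^ (b&c)

# Check against integer ground truth.
ai, bi, ci = a.astype(np.uint8), b.astype(np.uint8), c.astype(np.uint8)
assert np.array_equal(XOR.astype(np.uint8), ai ^ bi)
assert np.array_equal(CH.astype(np.uint8), (ai & bi) ^ ((1 - ai) & ci))
assert np.array_equal(MAJ.astype(np.uint8), (ai & bi) ^ (ai & ci) ^ (bi & ci))

# --- rotr as a fixed 0/1 permutation matrix (what a 1x1 conv encodes) ------
def rotr_matrix(r: int, n: int = 32) -> np.ndarray:
    # Bits stored MSB-first: rotating right by r moves bit i to position (i + r) mod n.
    P = np.zeros((n, n), dtype=np.float16)
    for i in range(n):
        P[(i + r) % n, i] = 1.0
    return P

x = np.random.randint(0, 2, size=32).astype(np.float16)
y = rotr_matrix(7) @ x

xi = int("".join(str(int(v)) for v in x), 2)
yi = int("".join(str(int(v)) for v in y), 2)
assert yi == ((xi >> 7) | (xi << (32 - 7))) & 0xFFFFFFFF
```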
## Intended use
- Research / demo: show the ANE can run faithful bit-level crypto via fp16 emulation.
- Education: bit-slicing, permutation-as-conv tricks, Core ML graph shaping for ANE.

Not intended for mining or production crypto. If you want real speed, use:

- CPU with SHA-2 instructions (CryptoKit/OpenSSL) — tens of MH/s on modern M-series.
- Metal with `uint` math on the GPU.
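For a sense of scale, here is a quick way to measure a single-threaded Python baseline. Python call overhead dominates this loop, so it understates what CryptoKit/OpenSSL can do:

```python
import hashlib
import time

data = b"\x00" * 80          # header-sized message; contents are irrelevant for timing
n = 1_000_000

t0 = time.perf_counter()
for _ in range(n):
    hashlib.sha256(hashlib.sha256(data).digest()).digest()
dt = time.perf_counter() - t0

print(f"{n / dt / 1e6:.2f} MH/s double SHA-256, single thread, via hashlib")
```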
## Repro & environment
- Core ML Tools ≥ 7.x recommended.
- Opset target: iOS 18 / macOS 15.
- Device: Apple silicon with a Neural Engine.
- Batch flexibility: `RangeDim(1, 1024)`; best throughput often at batch 16–64.
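A sketch of how the flexible batch dimension might be declared at conversion time. `ct.RangeDim` and the conversion flags are real coremltools APIs; the input name, trailing shape, and the `program` variable are placeholders standing in for whatever `model.py` actually builds:

```python
import numpy as np
import coremltools as ct

# Flexible batch from 1 to 1024, defaulting to 1.
batch = ct.RangeDim(lower_bound=1, upper_bound=1024, default=1)

mlmodel = ct.convert(
    program,                                   # placeholder: the graph model.py constructs
    inputs=[ct.TensorType(name="bits",         # placeholder name and trailing shape
                          shape=(batch, 32, 1, 16),
                          dtype=np.float16)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.macOS15,
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("bitlot.mlpackage")
```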
## License

See `LICENSE` in this repo.
## Shout-out
To everyone who looks at a matrix engine and says: “yeah, let’s make it do bitwise crypto.”
ANE-Hash is that energy.