DABstep


Issues with DABstep v1 -- perhaps time for v2?

#16

by justinlangsethgenesis - opened about 9 hours ago


There are many issues with v1 of this benchmark:

scorer.py is too permissive: its string matcher accepts negatives for positives, and accepts a single letter when the answer is actually supposed to be letter:number
the hard cases are repetitive (there are only 15 actual hard cases, each with small variations)
illogically high fees due to the SUM of fees (which is not specified in the manual, but is industry practice)
large swaths of transactions match no fees and are counted as having a 0 fee
one cluster of hard cases seems improperly limited in which MCCs it expects in the answers (only those in fees.json, rather than those in the MCC lookup table or the payments table)
rounding is expected at a different level than the one specified in the answer format (although there is a possible explanation: net fees charged monthly are rounded to cents, and the "14 decimal precision" format instruction is a confusing red herring)
typo in one of the case clusters (card scheme vs ACI)
fees that scale by volume still accrue for transactions "refused by Adyen", instead of fixed "decline" fees
one cluster expects simply the ACI code instead of ACI:{fee}, as the format instructions require
one cluster ignores the is_credit null wildcard in its expected answers
all submission and scored files are downloadable, which allows derivation of the ground truth and review of other submissions, including their reasoning traces (when present)
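
To make the scorer point concrete, here is a rough sketch of what a stricter comparison could look like. This is not the current scorer.py, and the names `strict_match` and `_values_equal` as well as the tolerance value are assumptions chosen for illustration. The idea is to compare answers part by part, so that a sign flip or a missing `:number` part is rejected rather than fuzzily matched.

```python
import math
import re

# Hypothetical helper names; none of this is taken from the benchmark's scorer.py.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def _values_equal(a: str, b: str, tol: float) -> bool:
    """Compare two answer parts: numerically if both are numbers, else as exact text."""
    a, b = a.strip(), b.strip()
    if _NUMBER.fullmatch(a) and _NUMBER.fullmatch(b):
        # Absolute tolerance only, so -5.0 never matches 5.0.
        return math.isclose(float(a), float(b), rel_tol=0.0, abs_tol=tol)
    # Non-numeric parts must match exactly (ignoring case), not fuzzily.
    return a.lower() == b.lower()


def strict_match(predicted: str, expected: str, tol: float = 1e-6) -> bool:
    """Return True only if both answers have the same shape and every part agrees."""
    pred_parts = predicted.strip().split(":")
    exp_parts = expected.strip().split(":")
    # "E" vs "E:13.57" differ in shape, so they cannot match.
    if len(pred_parts) != len(exp_parts):
        return False
    return all(_values_equal(p, e, tol) for p, e in zip(pred_parts, exp_parts))


if __name__ == "__main__":
    print(strict_match("-5.0", "5.0"))        # sign flip
    print(strict_match("E", "E:13.57"))       # missing number part
    print(strict_match("E:13.57", "E:13.57")) # same shape, same values
```

The tolerance is left as a parameter on purpose, since the right rounding level is itself one of the open questions in the list above.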

Perhaps we could work together to create a new version?
