Issues with Dabstep v1 -- perhaps time for v2?
#16, opened by justinlangsethgenesis
There are many issues with this benchmark v1:
- the scorer.py is too permissive in its string matcher: it accepts negatives for positives, and accepts a single letter when the answer is actually supposed to be letter:number
- the repetitiveness of the hard cases (there are only 15 actual hard cases, with small variations)
- illogically high fees due to taking the SUM of fees (which is not specified in the manual, but is industry practice)
- large swaths of transactions match no fees and are counted as a 0 fee
- one cluster of hard cases seems improperly limited in which MCCs it expects in the answers (only those in fees.json, vs those in the MCC lookup table or in the payments table)
- rounding expected at a different level than specified in answer format (although there is a possible explanation for this in that net fees charged monthly are rounded at cents, and the "14 decimal precision" format instruction is a confusing red herring)
- typo in one of the case clusters (card scheme vs ACI)
- fees that scale by volume still accrue for transactions "refused by Adyen", versus fixed "decline" fees
- one cluster expecting simply ACI code instead of ACI:{fee} as per format instructions
- one cluster ignores is_credit null wildcard in its expected answers
- the downloadability of all submission and scored files, allowing for derivation of ground truth and review of other submissions including their reasoning traces (when present)
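To make the scorer point concrete, here is a minimal sketch of the kind of stricter matching I have in mind. It is not the actual scorer.py; the function name `strict_match`, the `KEYED` pattern and the tolerances are all made up for illustration, and it only covers the three failure modes listed in the first bullet (sign, letter vs letter:number, loose string matching).

```python
import math
import re

# Hypothetical strict matcher; NOT the benchmark's actual scorer.py.
# Assumed answer shapes: "letter:number" (e.g. "E:12.5"), a plain number,
# or a plain string.
KEYED = re.compile(r"^([A-Za-z]+):(-?\d+(?:\.\d+)?)$")


def _as_float(text):
    """Return text as a finite float, or None if it is not one."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def strict_match(predicted, expected, rel_tol=1e-9, abs_tol=1e-9):
    predicted, expected = predicted.strip(), expected.strip()

    # Expected "letter:number": the prediction must have the same shape,
    # so a bare letter is rejected.
    exp_keyed = KEYED.match(expected)
    if exp_keyed:
        pred_keyed = KEYED.match(predicted)
        if not pred_keyed:
            return False
        same_key = pred_keyed.group(1).upper() == exp_keyed.group(1).upper()
        same_value = math.isclose(
            float(pred_keyed.group(2)),
            float(exp_keyed.group(2)),
            rel_tol=rel_tol,
            abs_tol=abs_tol,
        )
        return same_key and same_value

    # Expected a number: compare signed values, so -12.5 never matches 12.5.
    exp_num = _as_float(expected)
    if exp_num is not None:
        pred_num = _as_float(predicted)
        return pred_num is not None and math.isclose(
            pred_num, exp_num, rel_tol=rel_tol, abs_tol=abs_tol
        )

    # Otherwise require the whole string to match, ignoring case only.
    return predicted.casefold() == expected.casefold()
```

With this sketch, `strict_match("E", "E:12.5")` and `strict_match("-12.5", "12.5")` are both rejected, while `strict_match("12.50", "12.5")` is still accepted. A real v2 scorer would also need to handle list answers and the rounding question above, which this leaves out.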
Perhaps we could work together to create a new version?