The neglected art of a 'relevant task' benchmark
And how to approximate one if you can't afford to create one
Long ago, when deep learning was all the rage (circa 2018), you could spend a lot of time and money crunching a lot of data to build a new model and end up with something that was … inconclusively better than what you had before.
Was your model architecture wrong? Could you have picked a better learning rate? Did the extra data you’d worked hard to obtain improve matters, or did it only add information in a region of the input space where you already had plenty of signal? Stuck in a local optimum? How would you know?
The best way out of the swamp of confusion was to know what you were shooting for. What was a reasonable limit for how good an answer you could get?
In many cases, if you were trying to replace a small human-generated decision with a small machine-generated decision, the answer was to estimate a human benchmark - some approximation of how correct and consistent the answers of a small group of skilled humans would be.
For instance, you might have asked ten humans to decide whether to grant credit to a loan applicant, or whether to shortlist a candidate for a given role. The level of agreement between those ten humans was a reasonable guesstimate of the upper limit on how good a model you would be able to build.
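To make that concrete, here is a minimal sketch of the kind of agreement calculation I mean. The reviewer labels below are entirely made up, and pairwise agreement plus agreement-with-the-majority-vote are only the simplest ways to estimate the ceiling (chance-corrected measures like Cohen's or Fleiss' kappa are common refinements).

```python
# A toy sketch of estimating a human benchmark from rater agreement.
# The data is invented: 10 hypothetical reviewers each labelling the same
# 8 loan applications as grant (1) or deny (0).
from itertools import combinations
import numpy as np

ratings = np.array([
    # columns = applications, rows = reviewers
    [1, 0, 1, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1, 0, 1],
])

# Mean pairwise agreement: how often do two randomly chosen reviewers agree?
pair_agreements = [
    np.mean(ratings[a] == ratings[b])
    for a, b in combinations(range(len(ratings)), 2)
]
print(f"mean pairwise agreement: {np.mean(pair_agreements):.2f}")

# Agreement of each reviewer with the majority vote: a rough ceiling on the
# accuracy any model could reach when scored against a single human label.
majority = (ratings.mean(axis=0) >= 0.5).astype(int)
per_rater_vs_majority = (ratings == majority).mean(axis=1)
print(f"mean agreement with majority vote: {per_rater_vs_majority.mean():.2f}")
```

If your skilled humans only agree with each other, say, 85% of the time, chasing a model that matches any one of them 99% of the time is probably a fool's errand.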
Even back then, with the smaller and more precisely scoped tasks we were looking to solve with AI, building a robust human benchmark wasn’t simple or cheap. But compared to chasing an unknown goal and potentially grinding a team into the ground trying to learn signal that just wasn’t there to be learned, it was a smart move.
Unfortunately, a lot of teams working with both traditional and generative AI techniques today aren’t using realistic human benchmarks to draw a line in the sand and say ‘we’re aiming for this’. Why? Two reasons jump out from the conversations I’ve had:
1. A lot of teams working today are brand new to AI techniques, so they're slowly and painfully rediscovering a lot of prior art. They haven't built a similar system before and they're falling into time-wasting traps.
2. With the more amorphous problems we're chasing with natural language or image-generating systems, definitions of 'better' or 'what humans agree on' are getting even harder and more expensive to produce.
Consider, for instance, generating a summary of a document or marking an essay. What does good look like? We're trying to solve hard problems, and hard things are, well, hard to do.
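One way to make 'what does good look like' concrete for a grading-style task is still agreement: collect a few human scores per essay and measure how well the graders agree with each other, for example with quadratic weighted kappa. The scores below are invented purely for illustration; it's a sketch of the idea, not a recipe.

```python
# A sketch of checking how well two hypothetical essay graders agree on a
# 1-5 scale, using quadratic weighted kappa (invented scores).
from sklearn.metrics import cohen_kappa_score

grader_a = [4, 3, 5, 2, 4, 3, 1, 5, 4, 2]
grader_b = [4, 2, 5, 3, 4, 4, 1, 4, 4, 2]

# Quadratic weighting penalises big disagreements (a 1 vs a 5) more than
# near-misses (a 3 vs a 4), which suits ordinal scores like essay grades.
kappa = cohen_kappa_score(grader_a, grader_b, weights="quadratic")
print(f"quadratic weighted kappa between graders: {kappa:.2f}")
```

If skilled markers only reach a kappa of around 0.6 with each other, holding an automated marker to 0.9 against a single human's scores is probably unrealistic.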
But here’s the rub. If you’re chasing something without even taking a stab at figuring out whether that thing is achievable, how will you know when to give up? Or, if you prefer, when to switch tactics? Or when to pivot your business model?
If you are the category definer, out there at the bleeding edge, pushing the boundaries, the following advice isn’t for you.
But a lot of teams right now are hustling / being hustled to integrate AI in general and generative AI in particular into their product or feature set.
For those teams, I encourage you to put a decent chunk of time and, if necessary, money into figuring out what ‘best achievable’ performance is. If that ‘best achievable’ is good enough to make your use case valuable (definitely not always true!), then happy days: you have something to shoot for and, at the very least, a line to measure your progress against.
That's still no definitive map through the trials and tribulations of too many knobs to turn, but it's better than dashing yourselves against a genuinely insurmountable barrier.
If even ‘best achievable’ isn’t going to be good enough to make anyone use your product, then go work on some other problem and maybe check back on the state of the art every six months.
A sneaky but genuinely useful pragmatic proxy for whether some goal is achievable with the best techniques available today is to watch whether the large and established / very well-funded companies are making progress on the same or a closely related goal.
For instance, there are a lot of people chasing the idea of AI tutors across a range of education levels. If you’re trying to do this type of thing for school-aged kids, pay very, very close attention to how Khan Academy is getting along - what features are they releasing and, importantly, what features are they not releasing? With early access to GPT-4 and very deep pockets, they form a reasonable approximation of ‘best achievable’ in this niche market.
Who is the Khan Academy of the niche you’re developing within? It is quite possibly not a direct competitor, as lots of problems cross business domains.
Trust me: think hard about how to construct a relevant benchmark. You will save yourself money, wasted effort, and that one resource none of us can actually get any more of … time.