One thing that periodically niggles me about the very attractive idea of augmenting humans with AI in task completion is how synergistic that partnership will really be in different circumstances. Those pursuing self-driving cars (yes please, and before I hit 70 if you don’t mind) have struggled with this thorny topic in the challenge of maintaining human driver attention when the computer co-pilot is doing most of the heavy lifting.
So it was very interesting, if a little disheartening, to read this discussion paper from economists at Harvard and MIT suggesting that human radiologists may not actually be helped by working with a system that provides AI-generated opinions on X-rays (although the idea that humans are not always perfect Bayesians, which crops up a number of times in the paper, is not exactly going to surprise anyone except perhaps the odd economist).
The results show that, unless the mistakes we document can be corrected, the optimal solution involves delegating cases either to humans or to AI but rarely to a human assisted by AI
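For a concrete sense of what a ‘perfect Bayesian’ reader would do with an AI flag, here’s a toy sketch of that update; the numbers are purely illustrative and not from the paper.

```python
# Toy illustration (my numbers, not the paper's) of the Bayesian update a
# "perfect Bayesian" radiologist would make after seeing an AI flag.

prior = 0.05        # assumed base rate of the pathology in this population
sensitivity = 0.90  # assumed P(AI flags the case | pathology present)
false_pos = 0.10    # assumed P(AI flags the case | pathology absent)

# Bayes' rule: P(pathology | AI flag)
p_flag = sensitivity * prior + false_pos * (1 - prior)
posterior = sensitivity * prior / p_flag

print(f"Posterior probability given an AI flag: {posterior:.2f}")  # ~0.32
```

Part of what the paper documents is that real readers don’t combine their own prior with the AI’s signal anywhere near this cleanly.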
Because of the potential for significant harm and medical misadventure, not to mention the very grey area of liability, augmenting medical professionals with AI rather than replacing them is very prominent in discussions of AI in healthcare. It seems we need to ramp up work on how to make those partnerships successful and ensure we’re training AI on the right bits.
And yes, you do recall correctly: Geoff Hinton confidently proclaimed back in 2016 that we should cease training radiologists because AI would supplant them within five years. Not to poke fun at Hinton, but it’s a timely reminder to us all to ask which of the many current bold claims are similarly underestimating both the subtlety of the task they seek to automate and the synergy, or otherwise, of human and machine.
The exploitative backstory of Worldcoin and the crypto bros
Fresh from his world tour on AI regulation, Sam Altman has been in the news again with his seemingly oh-so-altruistic moonshot of building ‘universal access to the global economy’ via his fledgling new cryptocurrency, Worldcoin. Opinions on this new venture are polarising, to say the least. As an MIT alum, I’m inclined to heavily upweight this bleakly horrifying take from MIT Technology Review.
The startup promises a fairly-distributed, cryptocurrency-based universal basic income. So far all it's done is build a biometric database from the bodies of the poor.
I definitely recommend reading the article in full (Tech Review gives three free articles a month). The chilling idea that this is really all about trying to give a boost to all those ailing VC crypto investments gets a good workout in this episode of Hard Fork with Kevin Roose and Casey Newton.
When LLMs got political leanings
And the final thing to really catch my attention this week was work from way back in May (gasp!) from Stanford University’s Institute for Human-Centered AI on the troubling gaps between ‘average opinion’ and the ‘opinions’, aka the outputs, of some of the popular language models now in wide and increasing use.
High-level findings? All models show wide variation in political and other leanings by income, age, education, etc. For the most part, Santurkar says, models trained on the internet alone tend to be biased toward less educated, lower income, or conservative points of view. Newer models, on the other hand, further refined through curated human feedback, tend to be biased toward more liberal, higher-educated, and higher-income audiences.
If the article sparks your interest, the actual research paper is worth a scan too. Big ups to the authors for publishing their code and data as well.
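For a feel of what a comparison like this involves, here’s a minimal sketch of one way to quantify the gap between a model’s answer distribution on an opinion question and a survey’s, using total variation distance; OpinionQA’s actual alignment metric may be defined differently, and all the numbers below are invented.

```python
# Rough sketch of comparing a model's answer distribution on a multiple-choice
# opinion question against a human survey distribution. The metric here is
# total variation distance; the paper's own alignment measure may differ,
# and these numbers are made up for illustration.

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """0.0 = identical distributions, 1.0 = completely disjoint."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)

# Hypothetical survey responses vs. a model's (normalised) answer frequencies
survey = {"agree": 0.40, "neutral": 0.35, "disagree": 0.25}
model  = {"agree": 0.65, "neutral": 0.20, "disagree": 0.15}

print(f"Divergence from the survey: {total_variation(survey, model):.2f}")  # 0.25
```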
This is an incredibly complex area, as there is definitely no right answer to whose views a language model should represent, or even what an average opinion might look like. I think the lead researcher, Shibani Santurkar, sums it up rather well:
“We’re not saying whether either is good or bad here,” Santurkar says. “But it is important to provide visibility to both developers and users that such biases exist.”
In an era when folks are assessing the capabilities of new language model variants ‘by vibe’, partly because no definitive benchmarks for performance really exist at this point, the OpinionQA dataset sparks a lot of questions worth asking.