A paper, a framework and a court case that caught my eye this week
Usage is a big part of the cost for GenAI products
Training foundation language models is expensive and slow, so there is a fair bit of science behind deciding how much data to use and how many parameters to put in your model architecture to reach peak model performance on a fixed compute budget.
The cost of building and operating a model can essentially be split into the cost to train it and the cost to use it (usually referred to as inference cost). Models with more parameters usually provide higher quality outputs, but more parameters also mean more compute-intensive (read: costly) and slower inference, so bigger models are more expensive to run after training. That’s important if you are building a model for heavy commercial usage.
In December, researchers from MosaicML posted a paper on arXiv, Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws, an update to DeepMind’s popular Chinchilla scaling laws that also parameterises the laws on expected model usage.
In this paper, we modify Chinchilla scaling laws to account for inference costs, calculating the optimal parameter and training token counts—both in terms of compute and dollar costs—to train and deploy a model of any given quality and inference demand. Our principled derivation estimates that LLM practitioners expecting significant demand (~10⁹ inference requests) should train models substantially smaller and longer than Chinchilla-optimal.
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws, Sardana and Frankle
A simple and impactful illustration of how broadly you need to think when building AI products and why cost models of the finished product aren’t something to be overlooked or left to later.
Also a useful heuristic to keep in mind for those bemused by the size of capital raises in the GenAI space, or trying to predict OpenAI’s burn rate and how an eventually cashflow-positive business model might pan out.
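To make the trade-off concrete, here’s a rough back-of-the-envelope sketch. It is not the paper’s actual derivation (which works with the Chinchilla loss formula and real dollar costs); it just uses the common approximations that training costs roughly 6ND FLOPs and inference costs roughly 2N FLOPs per token, and every model size, token count and demand figure below is an assumption of mine purely for illustration.

```python
# Back-of-the-envelope sketch of why expected inference volume shifts the
# optimal model size. Uses the common approximations: training compute
# ~= 6 * N * D FLOPs, inference compute ~= 2 * N FLOPs per generated token.
# All numbers and names below are illustrative assumptions, not figures
# from the paper.

def training_flops(n_params: float, n_train_tokens: float) -> float:
    return 6 * n_params * n_train_tokens

def inference_flops(n_params: float, n_inference_tokens: float) -> float:
    return 2 * n_params * n_inference_tokens

def lifetime_flops(n_params: float, n_train_tokens: float, n_inference_tokens: float) -> float:
    return training_flops(n_params, n_train_tokens) + inference_flops(n_params, n_inference_tokens)

# Two hypothetical models we assume reach similar quality: a Chinchilla-style
# 70B model, and a smaller 30B model trained on more tokens.
models = {
    "70B / 1.4T tokens": (70e9, 1.4e12),
    "30B / 3.0T tokens": (30e9, 3.0e12),
}

# Expected demand: ~10^9 requests at ~1,000 generated tokens each (assumption).
n_inference_tokens = 1e9 * 1_000

for name, (n_params, n_train_tokens) in models.items():
    print(f"{name}: train {training_flops(n_params, n_train_tokens):.2e} FLOPs, "
          f"lifetime {lifetime_flops(n_params, n_train_tokens, n_inference_tokens):.2e} FLOPs")
```

Under these made-up numbers, the smaller-but-longer-trained model comes out cheaper over its lifetime once inference volume is large, which is the paper’s core intuition.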
A 27-page preparedness framework, but where are the senior technical decision makers?
OpenAI got a new board observer this month, a Microsoft veteran who is, I imagine, extremely commercially savvy. They also published a 27-page document outlining their ‘Preparedness Framework’, which will guide their thinking on assessing potential harm (of the catastrophic-risk kind, of course) from OpenAI products as they are developed.
Reading through the doc, you can’t fault their enthusiasm for the subject, and by publishing the framework in full they are giving other research teams the opportunity to pick up a good draft and run with it, which is great.
As Andrew Ng explains, the Preparedness Framework fits into OpenAI’s risk management strategy like so (emphasis is mine):
OpenAI’s Preparedness Team is responsible for evaluating models. The Safety Advisory Group, whose members are appointed by the CEO for year-long terms, reviews the Preparedness Team’s work and recommends approaches to deploying models and mitigating risks, if necessary. The CEO has the authority to approve and oversee recommendations, overriding the Safety Advisory Group if needed. OpenAI’s board of directors can overrule the CEO.
The Preparedness Team scores each model in four categories of risk: enabling or enhancing cybersecurity threats, helping to create weapons of mass destruction, generating outputs that affect users’ beliefs, and operating autonomously without human supervision. The team can modify these risk categories or add new categories in response to emerging research.
The team scores models in each category using four levels: low, medium, high, or critical. Critical indicates a model with superhuman capabilities or, in the autonomy category, one that can resist efforts to shut it down. A model’s score is its highest risk level in any category.
The team scores each model twice: once after training and fine-tuning, and a second time after developers have tried to mitigate risks.
OpenAI will not release models that earn a score of high or critical prior to mitigation, or a medium, high, or critical after mitigation.
OpenAI Revamps Safety Protocol, The Batch Issue 231, Jan 2024
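Boiled down, the gating logic described above is pretty simple. Here’s a tiny sketch of the scoring rules exactly as The Batch summarises them; the category labels and function names are mine, purely for illustration, and are obviously not anything from OpenAI’s actual tooling.

```python
# Minimal sketch of the scoring/gating rules as described above: a model's
# overall score is its highest category score, and release is blocked if the
# pre-mitigation score is high/critical or the post-mitigation score is
# medium or above. Purely illustrative; not OpenAI's implementation.

LEVELS = ["low", "medium", "high", "critical"]

def overall_score(category_scores: dict) -> str:
    # A model's score is its highest risk level in any category
    return max(category_scores.values(), key=LEVELS.index)

def can_release(pre_mitigation: dict, post_mitigation: dict) -> bool:
    pre = LEVELS.index(overall_score(pre_mitigation))
    post = LEVELS.index(overall_score(post_mitigation))
    return pre < LEVELS.index("high") and post < LEVELS.index("medium")

pre = {"cyber": "medium", "cbrn": "low", "persuasion": "medium", "autonomy": "low"}
post = {"cyber": "low", "cbrn": "low", "persuasion": "low", "autonomy": "low"}

print(overall_score(pre), "->", overall_score(post), "| release:", can_release(pre, post))
```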
Now I’m confident that there are fantastically talented and qualified folk working on the Preparedness Team. No concerns there (while they all still have jobs).
I note that the CEO appoints members of the Safety Advisory Group. That gives me a bit of pause as Altman’s track record does nothing to convince me that he’ll be appointing folks with strong technical voices who might oppose him. Then the CEO, Altman again, gets to override the recommendations. And yes, the board can override him (and this one might do so more successfully than the last one).
I’m not saying it’s odd to give the CEO and the board final decision-making rights in any commercially orientated organisation. But. This does mean there are a lot of turtles in the decision-making stack before you get down to the first people likely to really understand how the technology they are all discussing works and why the risk is (or isn’t) so catastrophic.
Compared to the looming, truly existential risk of climate change, the risk of OpenAI developing anything that threatens the stability of humans on this planet (more than the extraordinary concentration of wealth in the hands of a few might be doing already) really doesn’t keep me up at night. But if it did, this framework and decision-making hierarchy wouldn’t be easing my worries any. Just like we see at Boeing today, we take relevant technical skills away from the top-table decision-making loops at our collective peril.
What happens when trust evaporates?
Somewhat tangential to the world of AI (at least at first glance), I was intrigued by a recent Freakonomics podcast on the extent of fraud in academic publications. It’s actually a two-parter - Jan 11 and Jan 18 - and both parts are well worth a listen, as the host seeks to unpack the ‘why’ of the fraud beyond the sensationalism, as well as the ‘what could we do about it’.
Uri Simonsohn, Leif Nelson and Joe Simmons have been publishing the blog Data Colada for just over 10 years; it aims to spot inaccurate or fraudulent research publications and have them amended or retracted. I can’t imagine it makes them universally popular, but they do seem to be widely respected and very considered in their ongoing work. Late last year, they were sued by a Harvard professor they had accused of research fraud.
A few weeks ago, I wrote about Francesca Gino, a researcher on dishonesty who last month was placed on administrative leave from Harvard Business School after allegations of systematic data manipulation in four papers she co-authored. The alleged data manipulation appeared, in a few cases, chillingly blatant. Looking at Microsoft Excel version control (which stores old versions of a current file), various rows in a spreadsheet of data seem manipulated. The data before the apparent manipulation failed to show evidence of the effect the researchers had hoped to find; the data after it did.
In total, three researchers — Joe Simmons, Leif Nelson, and Uri Simonsohn — published four blog posts to their blog Data Colada, pointing out places where the data in these papers shows signs of being manipulated. In 2021, they also privately reported their finding to Harvard, which conducted an investigation before placing Gino on leave and sending retraction notices for the papers in question.
Gino is now suing the three researchers who published the blog posts pointing out the alleged data manipulation, asking for “not less than $25 million.” (She is also suing Harvard.) Her argument is that because of the allegations of fraud, she lost her professional reputation and a lot of income. (Harvard Business School professors can make a lot of money through speaking appearances and book deals). I reached out to Gino for comment earlier this week but did not hear back before publication deadline.
Is it defamation to point out scientific research fraud? Kelsey Piper, Vox, Aug 2023
Now I haven’t spent hundreds of hours poring over this exchange, but on the face of it the suit feels like a pretty blatant act of intimidation. “Think you can speak the truth to power/wealth? Think again.” type of thing. Kudos to the Data Colada researchers for standing their ground, with a well-subscribed GoFundMe campaign to cover their legal costs. It’s also impressive to see their academic institutions standing behind them, both publicly and financially.
Data Colada (and the podcast episodes) focus on fraud in behavioural science and economics. They find that roughly 2% of published, peer-reviewed research is verifiably fraudulent, which they believe is a lower bound on a more substantial problem.
As I reached the end of the second podcast, which had extensively covered why people feel pushed to commit academic fraud, I wondered what the falsification rates would look like in a hyper-competitive field where million-dollar salaries and eye-watering valuations are on offer, and where many ‘publications’ are not peer reviewed at all.
And what sort of legal hell might rain down on anyone who tried a Data Colada-esque rigour test in such an environment?
All the more reason to keep bringing a sceptic’s view to bear on new developments in AI in 2024. Many things will work in the lab / demo / trial but not in the messy real world. Some may never have actually worked at all.
Until next time
Thank you as always for being here and for taking the time to read and ponder the above.
If you enjoy Data Runs Deep, the best thanks you could give me is to share this newsletter with someone else you think would enjoy it. I also really appreciate likes and restacks - each one is a lovely virtual hug to receive and puts a little sparkle in my day.
Wherever your week takes you, I wish you at least a few moments of peace and rejuvenation in amongst the hustle. Smell a rose or two and get some sunshine in your eyes, which is precisely what I am off to do now.