Can AI give us a tutor for every child?
That seemingly impossible dream just got a step closer
Humour me for just a minute. Close your eyes and reflect on everything that education has given you in life so far.
For me that reflection spans decades: from learning to read (which I do not remember not being able to do! Thanks mum, thanks dad!), through school, three university degrees and multiple career changes. I cannot imagine not being able to learn new things. A lifetime of learning, unlocked by literacy, augmented by a layering of techniques, mental models and rubrics, and underpinned by many great human teachers, both formal and informal.
Just like me, the majority of you will have been the beneficiaries of great educational institutions … and libraries. In more recent years, MOOCs and online learning of many forms. Maybe you’ve gone back and filled in some holes in your maths knowledge using Khan Academy. Or learned a new language with the help of Duolingo. I’ve loved it all, while also being aware that access to high quality teachers, a springboard to a life of literacy and continual learning, is an unevenly distributed blessing that is still out of reach for many.
So I was inspired and enchanted late last month by the newest technical report from Google DeepMind and collaborators: Towards Responsible Development of Generative AI for Education. There’s a lot to learn, so I hope the overview below inspires you to read it yourself!
Much more than just a model
While a fair bit of the popular press for this work has centered on the model, LearnLM-Tutor, the report itself focusses much more on the development approach, and specifically on the careful, systematic evaluation applied throughout a development lifecycle that now stretches across four major model versions.
This focus on evaluation is why I’m currently encouraging everyone I meet to read the report, whether or not they have an interest in EdTech. While many Gen AI research papers today offer the reader little of tangible reuse, there’s lots of transferable knowledge in this report: about how challenging it can be to measure improvement in whatever metric of value you are working to optimise for, and how to think deeply about this upfront.
A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher.
The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulties with verbalising pedagogical intuitions into gen AI prompts and the lack of good evaluation practices, reinforced by the challenges in defining excellent pedagogy.
Here we present our work collaborating with learners and educators to translate high level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic and human evaluations.
Towards Responsible Development of Generative AI for Education, Jurenka et al., May 2024
There is a lot of Gen AI tech in use in learning institutions today that is not designed explicitly for use in education. General purpose chatbots are designed to be extremely helpful and to … tell you the answer. Hardly the hallmark of a great teacher!
Understanding the learning experience
Through a series of semi-structured interviews and simulated ‘Wizard of Oz style’ prototyping sessions, the researchers synthesise and propose the following principles for AI tutors:
Do not give away solutions prematurely. Encourage learners to come up with solutions.
Make explanations easy to understand, for example by making connections to the real world.
Be encouraging. Celebrate learner progress and embrace mistakes as learning opportunities.
Recognise when learners are struggling, and proactively check in with them.
Ask questions to determine learner understanding and misunderstanding.
Explain step-by-step, and deconstruct to teach thought processes.
Clearly some of the above will sporadically be exhibited by a general purpose chatbot trained specifically to provide information. Just as clearly, significant customisation will be required to make these behaviours typical.
Challenges with adaptation solely through prompting
One customisation strategy, and probably the one that springs most easily to mind for many, is prompt engineering: iteratively tweaking the input text you provide to the large language model to guide your generative AI system of choice towards specific, high-quality outputs.
So for instance, prompt engineering for AI tutor creation might involve describing in natural language all the properties of high quality tutoring you can think of and providing this as ‘context’, which effectively means including it in every query. This has been, and continues to be, a popular choice (particularly as model input context windows have expanded), but the limits are also obvious.
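To make this concrete, here is a minimal sketch of the prompting approach, with the pedagogy principles above paraphrased into a fixed preamble. It is my own illustration, not anything from the report, and call_llm is a hypothetical stand-in for whichever model API you use.

```python
# A minimal sketch of prompt-based adaptation (an illustration, not the
# report's method): pedagogy rules become a fixed preamble that is
# prepended to every single query sent to the model.

PEDAGOGY_PREAMBLE = """You are a patient tutor. Follow these rules:
- Do not give away solutions prematurely; guide the learner towards them.
- Make explanations easy to understand, with connections to the real world.
- Be encouraging; treat mistakes as learning opportunities.
- Notice when the learner is struggling and proactively check in.
- Ask questions to gauge understanding and surface misunderstandings.
- Explain step by step, deconstructing the thought process.
"""

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client of choice."""
    raise NotImplementedError("wire up a real model client here")

def tutor_reply(history: list[str], learner_message: str) -> str:
    # The preamble plus the whole conversation history is re-sent on every
    # turn, which is why growing context windows made this approach viable.
    prompt = (
        PEDAGOGY_PREAMBLE
        + "\n".join(history)
        + f"\nLearner: {learner_message}\nTutor:"
    )
    return call_llm(prompt)
```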
The prompting approach, however, has a number of limitations. Most importantly, it requires explicit specification of what good tutoring behaviours look like in natural language. This involves enumerating what should be done and when, what should be avoided and when, all the possible exceptions to the rules, etc.
You end up being limited by your ability to write comprehensive enough ‘rules’ of good pedagogy. Appendix D of the report (if you make it to page 56) has a great discussion of the limitations of prompting which currently include:
the inherently multi-turn nature of a tutoring conversation, which is at odds with the Q&A nature of a general purpose chatbot
the tendency of base foundation models to give away the answers, as they are trained to be as helpful as possible in a Q&A scenario
sycophancy (this one made me smile): we’ve created foundation models to have a very strong bias to please us, which makes them consistently err on the side of agreeing with what we say, not a great trait for a tutor
lack of uncertainty signalling, something we’ve all been tripped up by when interacting with an LLM-based chatbot: they sound equally confident about everything they say, including nonsense
The series of four models discussed in the technical report were adapted to tutoring using supervised fine-tuning (SFT) rather than prompt engineering. If this is an approach you are considering, the report goes into an intense amount of detail on the creation of SFT training sets and the associated evaluation approaches. Super valuable stuff if that’s your jam, whether in education scenarios or elsewhere.
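To give a flavour of what such a training set might contain, here is an illustrative sketch of a single SFT record. The schema and the dialogue are invented for illustration rather than taken from the report, but they show the key idea: a multi-turn context paired with a target tutor turn that models good pedagogy instead of simply answering.

```python
import json

# One illustrative SFT training example (a made-up schema, not the
# report's): the conversation so far is the input context, and the final
# tutor turn is the target the model is fine-tuned to produce. Note the
# target probes understanding rather than giving the answer away.
example = {
    "context": [
        {"role": "learner", "text": "Why is 1/2 + 1/3 not 2/5?"},
        {"role": "tutor", "text": "Good question! What do the denominators tell us?"},
        {"role": "learner", "text": "How many parts the whole is split into?"},
    ],
    "target": (
        "Exactly. Halves and thirds are different-sized parts, so we can't "
        "just add the tops and bottoms. Can you think of a part size that "
        "both halves and thirds could be converted into?"
    ),
}

# Append the record to a JSONL file, one example per line.
with open("tutor_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```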
Some additional things I found fascinating
A key complaint from a number of the instructors interviewed as part of the ongoing research effort was that general purpose LLMs aren’t good at working within the context of the specific course material. While this can be partially alleviated with retrieval augmented generation (RAG) using the course material as the corpus, the DeepMind team have gone further, providing timestamp-specific links back to the associated content. As discussed at Google I/O 2024 in mid-May:
On YouTube, a LearnLM assistant will respond to viewer questions under educational videos or generate a quiz for the viewer based on the video's information. This feature is already available to select Android users as Google partners with Columbia Teachers College, Arizona State University, and Khan Academy to test and improve its performance. No word on when this YouTube-specific learning assistant will be available to the more than 2 billion people around the world who use the platform every month.
How Google's LearnLM plans to supercharge education for students and teachers
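For intuition on how those timestamped links might work, here is a toy sketch of retrieval over a lecture transcript where every chunk carries its start time, so a grounded answer can point straight back to the relevant moment in the video. This is my own illustration, not DeepMind’s implementation, and it uses crude word overlap where a real system would use embedding-based retrieval.

```python
# Toy timestamped retrieval over a lecture transcript (an illustration,
# not DeepMind's implementation). Each chunk records where in the video
# it starts, so an answer can link back to the exact moment.

TRANSCRIPT = [
    {"start_s": 312, "text": "A derivative measures the instantaneous rate of change."},
    {"start_s": 540, "text": "The chain rule lets us differentiate composed functions."},
    {"start_s": 765, "text": "Integration can be viewed as the inverse of differentiation."},
]

def retrieve(question: str, chunks: list[dict], k: int = 1) -> list[dict]:
    """Rank chunks by crude word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def timestamp_link(video_id: str, start_s: int) -> str:
    # YouTube's t= parameter jumps playback to the given second.
    return f"https://youtube.com/watch?v={video_id}&t={start_s}"

best = retrieve("What is the chain rule for?", TRANSCRIPT)[0]
print(best["text"], "->", timestamp_link("VIDEO_ID", best["start_s"]))
```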
The researchers also reported that some students found working with an AI tutor (as part of the research and work on evaluation) an advantage from the perspective of not feeling ‘judged’ for asking basic questions or for repeatedly struggling to grasp a concept. This is such a beautiful example of how human and AI educators will be able to work synergistically in the future, with each group playing not only to their own strengths but to the preferences of the learners.
In summary
I’m really bullish about the potential of Gen AI both to give much needed assistant support to overworked educators and struggling learners in developed countries, and to provide a step change in access, quality, breadth and depth of instruction for those for whom even basic education is restricted by poverty, politics, gender and more.
What this technical report highlighted for me is something that is easy to forget in these heady days of AI froth and hype. Any technology, even one as full of promise as this one, is only a fraction of a robust solution, and we need to collectively push forward the creation of its other dimensions.
For those of you thinking about building or investing in Gen AI based educational companies, I hope this overview and a deeper reading of the report itself gives you a good baseline understanding of the level of investment required to make a broadly useful AI tutor and perhaps connects you to some opportunities for collaborative development of key metrics, datasets and evaluation criteria.
Image credit to Eugenio Mazzone on Unsplash
I like the insight that a Q&A-trained LLM is naturally a poor tutor, and needs to be cajoled into performing better. You may be interested in how AERO locally is framing up their thinking about Gen AI:
https://www.edresearch.edu.au/research/discussion-papers/strengthening-evidence-how-genai-can-improve-teaching-and-learning