Can we make learning more enjoyable? As an avid, lifelong learner myself, I have pondered this question long and deep. In this post, I want to tell you about a machine learning algorithm we developed to help self-directed learners stay on track and have more fun. Our paper describing the method, Predictive, Scalable and Interpretable Knowledge Tracing on structured domains (PSI-KT), was recently accepted at ICLR, and you can find freely-licensed code at github.com/mlcolab/psi-kt. This project was joint work by Hanqi Zhou, Robert Bamler, Charley M Wu, and myself, Álvaro Tejero-Cantero.
Time: Tracing knowledge as we practice and forget
Self-directed learning is the kind Duolingo is famous for — an affair between a human and a computer, owls and other cute animals notwithstanding. Suppose you aim to learn Chinese well enough to follow cookbook recipes. To avoid suggesting learning activities that are too easy or too hard, our computer tutor, much like a human one, needs to have a good idea of your current level in Chinese (and cooking). However, instead of putting you through a placement test, we quiz you with basic culinary words and phrases, one at a time. How well you answer these over time progressively informs the system about your current competences in Chinese. Crucially, since you get to see the solution after every answer, each quiz gives you a refresher or a chance to self-correct. A stream of quizzes can thus smoothly combine activity-based learning with performance evaluation, where learning feels more consequential and interactive than attending a lecture, and evaluation less stressful than end-of-term exams.
Keeping tabs on the state of all your knowledge over time is called knowledge tracing (KT), and it must not only reflect the boosts you get at practice times but also track the steady decay of all memories in between. Reliable KT gives a tutor a key ingredient to guide you to your learning goal with appropriate choices of what to quiz you on and when. In this context, educational psychologists speak of a fruitful zone of proximal development: the knowledge and skills within reach with a bit of targeted help. A knowledgeable human tutor can easily give you the right nudge, but a computer tutor has no a priori idea about which concepts are necessary to master, or even helpful, in order to unlock a new skill or avoid forgetting the ones you already have. Thus, if we want accurate and useful knowledge tracing, we need to think also about knowledge mapping.
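As a toy illustration of why tracing decay matters, here is the classic exponential forgetting curve with a per-learner decay rate. This is not PSI-KT's actual memory model (described later in the post); the function and parameter values are made up for illustration:

```python
import math

def recall_probability(t_since_practice, decay_rate):
    """Probability of recall decays exponentially with time since last practice."""
    return math.exp(-decay_rate * t_since_practice)

# Two hypothetical learners quizzed on the same word, 5 days after practicing it:
fast_forgetter = recall_probability(5.0, decay_rate=0.5)  # ≈ 0.08
slow_forgetter = recall_probability(5.0, decay_rate=0.1)  # ≈ 0.61
```

A tutor that knows each learner's decay rate can tell that the first learner needs a refresher much sooner than the second, even though both answered identically last time.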
Relations: Mapping knowledge prerequisites
We have seen that in order to provide adequate scaffolding for learners on their way up the edifice of knowledge, tutors need a sense of how the different pieces relate to each other. You can probably anticipate how this is a humongous problem: bits of knowledge can refer to other pieces in countless ways, and it is not even obvious how to decompose knowledge into small, discrete, testable units in the first place. To make some headway, we clearly need simplifying assumptions. We decided to take the knowledge pieces as given (a.k.a. not our problem) and only care about finding their prerequisite relations, as in “addition is a prerequisite for understanding integer multiplication.”
More generally, we consider A a prerequisite for B insofar as knowing A aids learning B. This is an eminently operational idea of prerequisiteness, and emphatically not an ontological one: we just care about the best times to introduce concepts, not what they ultimately are. It is also a practical approach when the content of A and B is not fully available, which is common in publicly available datasets and desirable when privacy is a concern. In practice, this definition means that seeing many learners perform better on B tasks (e.g. multiplication) when, by design or by chance, A (addition) was freshly practiced beforehand flags a potential prerequisite link addition → multiplication. As our map of the knowledge domain improves with more data, we can, conversely, use the newly-gained map to predict performance on future tasks. Thus, structured knowledge tracing in PSI-KT does not require an existing knowledge map: PSI-KT learns the prerequisite structure of knowledge concurrently with everything else it needs to predict performance. This makes PSI-KT useful for any structured domain, regardless of whether an expensive, expert-made map is available for it. On the flip side, the very first explorers of a domain may get frustrated with clueless suggestions derived from an initial rough map, which puts a lot of pressure on PSI-KT to get good at mapping quickly (and perhaps a future version should use external maps when available). Although we evaluated PSI-KT using public datasets of pre-college math learner performance, we expect structured knowledge tracing to be useful for biology, economics or even cooking. Please drop us a line if you know of suitable non-mathy datasets, so that we can test!
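A back-of-the-envelope sketch of this operational definition might look as follows. The function name and the sliding-window heuristic are my own; PSI-KT's actual inference is Bayesian and joint, not a pairwise comparison like this:

```python
def prerequisite_signal(log, a, b, window=3):
    """Mean correctness on concept `b` when concept `a` appeared among the
    previous `window` interactions, minus the mean when it did not.
    `log` is a time-ordered list of (concept, correct) pairs."""
    with_a, without_a = [], []
    for i, (concept, correct) in enumerate(log):
        if concept != b:
            continue
        recent = {c for c, _ in log[max(0, i - window):i]}
        (with_a if a in recent else without_a).append(correct)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(with_a) - mean(without_a)

# Tiny hypothetical log for one learner:
log = [("addition", 1), ("multiplication", 1),
       ("geometry", 1), ("geometry", 0), ("multiplication", 0)]
signal = prerequisite_signal(log, "addition", "multiplication", window=2)
# positive → addition looks like a prerequisite of multiplication
```

At scale, such signals would be pooled over many learners; PSI-KT instead infers the whole prerequisite graph jointly with the memory dynamics.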
Traits: To each their own
Deep learning is often criticized for trading off predictive performance against interpretability. Case in point: deep learning knowledge tracing systems often represent each learner by a set of numbers (16 in the models we surveyed) that they use to individualize their forecasts. But there is no a priori way to know how these numbers relate to quantities that psychologists can measure and educators can understand! We believe that such interpretability is important for multiple actors: for machine learners and psychologists, to understand and debug the system; for educators, to enable human-in-the-loop tutoring; and, of course, for learners, so that they can get helpful, actionable feedback. Interpretability is also a key ethical concern: if knowledge tracing is used to steer your learning, you should be entitled to understand the rationale behind its recommendations, so that you can challenge them if you find them incorrect or unfair. Accordingly, we designed PSI-KT from the ground up to represent learners through interpretable traits, such as their susceptibility to forgetting or their ability to exploit prerequisite knowledge when learning. In the paper we demonstrate that these cognitive parameters, estimated by PSI-KT from real-world learning data, are truly specific to each learner and, better yet, predictive of future behavior. This operational interpretability of learner traits is what a tutor needs to select which new items to learn and to decide when to schedule them (e.g. later for learners who forget more slowly). The key was to build the model on what experimental psychologists know about how we forget, to leverage neural networks for best-in-class efficiency, and to throw in some Bayesian goodness to estimate traits over time, including our uncertainty about them due to the somewhat limited individual data.
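To see how a single interpretable trait becomes actionable, here is a hypothetical scheduling rule (not from the paper) under the simple exponential-forgetting assumption: quiz again when predicted recall drops to a chosen threshold.

```python
import math

def next_review_delay(decay_rate, recall_threshold=0.7):
    """Time until predicted recall exp(-decay_rate * t) falls to the threshold,
    i.e. solve exp(-decay_rate * t) = recall_threshold for t."""
    return -math.log(recall_threshold) / decay_rate

soon = next_review_delay(0.5)   # fast forgetter: review sooner
later = next_review_delay(0.1)  # slow forgetter: review later
```

The point is that `decay_rate` is a quantity an educator can read, discuss, and sanity-check, unlike an anonymous coordinate in a 16-dimensional embedding.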
With this, we can spell out the P of predictive and the I of interpretable in PSI-KT. As for the S, it stands for scalability, i.e. keeping up with growing data. Scalability is necessary if structured knowledge tracing is to be useful for real learners out there, so let us give an (algorithmic) thought to the invisible energy-gobbling GPUs heating up data centers somewhere.
Updates: tracking learning as it happens
Learners progress by interacting repeatedly with concepts in the domain. How does PSI-KT take into account the ever-growing performance data in order to keep its predictions current at all times? When new data comes in, machine learning models often need to update by retraining on the entire old dataset and the new information together. As more and more data accumulates, an uncomfortable dilemma appears: either we incur a horrendous electricity bill or we knowingly use a vintage, stale model. The Bayesian nature of PSI-KT lets us update the model’s parameters without retraining from scratch, which is cool news for the data centers and should promote the adoption of structured knowledge tracing in real-world learning contexts. For even more flexibility, future models should offer cheap update paths also when new learners join the group, or when new concepts are added to the domain.
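The general flavor of such updates can be seen in the simplest possible case, a conjugate Gaussian model. This is an illustration of Bayesian online updating, not PSI-KT's actual (variational) machinery: old data never needs to be revisited, because its information lives on in the prior.

```python
def bayes_update(prior_mean, prior_var, obs, obs_var):
    """Fold one noisy Gaussian observation into a Gaussian posterior.
    Working in precision (1/variance) makes the update a simple sum."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

# Stream hypothetical trait measurements one at a time; the running posterior
# replaces retraining on the full history.
mean, var = 0.0, 10.0          # vague prior over some learner trait
for y in [0.8, 1.1, 0.9]:
    mean, var = bayes_update(mean, var, y, obs_var=0.5)
```

Crucially, the sequential result is identical to what batch retraining on all three observations would give, which is what makes streaming updates safe.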
Methods: Variational inference of a Bayesian hierarchical state space model
In this section I want to highlight some of the modeling and evaluation choices we made for PSI-KT. This is a brief technical overview: feel free to skip ahead for the bottom line, or jump to Section 3 in the paper to see the actual math.
The key ingredient for interpretability in PSI-KT is to model the degradation of knowledge with time as a noisy process of exponential decay towards a long-term, consolidated level. This reflects extensive experimental evidence about the dynamics of human memory. Accordingly, an individual memory fades on average following an Ornstein-Uhlenbeck process, similar to how a macroscopic particle submerged in a liquid is twitched around by incessant random collisions while its speed dwindles to a halt due to viscous drag. Introducing a non-zero reversion mean, as in the Vasicek model used in finance, allows us to represent stable knowledge levels, whether initial (baseline) or long-term (consolidated). Mastery of prerequisites modulates the reversion mean of a concept; the linear weights that quantify their influence collectively form the adjacency matrix of the prerequisite graph, which PSI-KT infers from learner performances jointly with the parameters of the dynamical model of memory.
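For intuition, here is a minimal simulation of such a mean-reverting process, with made-up parameter values and my own variable names (not the paper's notation):

```python
import math
import random

def simulate_memory(x0, mu, theta, sigma, dt, steps, rng):
    """Euler-Maruyama simulation of an Ornstein-Uhlenbeck process:
        dx = theta * (mu - x) dt + sigma dW.
    Knowledge x drifts toward the reversion mean mu (rate theta) while being
    buffeted by noise of scale sigma."""
    x, traj = x0, [x0]
    for _ in range(steps):
        x += theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj

# Hypothetical prerequisite modulation: mastery of prerequisites raises the
# level this concept's memory reverts to; the weights would be one row of
# the inferred adjacency matrix.
baseline, weights, prereq_mastery = 0.2, [0.5, 0.3], [0.9, 0.4]
mu = baseline + sum(w * m for w, m in zip(weights, prereq_mastery))  # 0.77
traj = simulate_memory(x0=1.0, mu=mu, theta=1.0, sigma=0.05, dt=0.1,
                       steps=200, rng=random.Random(0))
```

Freshly practiced knowledge (`x0 = 1.0`) decays, but only down to the prerequisite-supported level `mu`, not to zero: well-mastered prerequisites prop up what would otherwise be forgotten.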
The dynamical model can be conveniently marginalized over one time step, which means we obtain discrete updates to knowledge states that depend solely on the elapsed time between queries. The learner-specific parameters that quantify knowledge evolution are not fixed: they are allowed to evolve over time following simple Markovian dynamics. To infer knowledge states as well as learner traits from observed performances, we variationally approximate the corresponding posterior distribution with a neural network. The neural network parameters are shared across learners, which means that we efficiently amortize the inference over all available students. We find it reasonable to assume that the prerequisite structure is static and the same for all learners, which allows us to infer it using interactions from all learners at all times. In PSI-KT we introduced only a simple topological constraint on reciprocal prerequisites, but we expect that introducing additional inductive biases will help learn the structure.
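Concretely, the one-step marginal of an Ornstein-Uhlenbeck process has a closed form (a standard result, written here in my notation), which is what makes irregularly spaced quizzes cheap to handle: no fine-grained simulation between interactions is needed.

```python
import math

def ou_transition(x, mu, theta, sigma, dt):
    """Exact marginal of the OU process over an arbitrary gap dt:
    the knowledge state dt time units later is Gaussian with
        mean = mu + (x - mu) * exp(-theta * dt)
        var  = sigma^2 / (2 * theta) * (1 - exp(-2 * theta * dt))."""
    decay = math.exp(-theta * dt)
    mean = mu + (x - mu) * decay
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, var
```

Because the transition is exact, chaining two half-gap updates agrees with a single full-gap update, so predictions are consistent no matter how a learner's quiz times are spaced.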
Our experimental assessment of the learner representations starts with information-theoretical measures of representational specificity (are they learner-bound?), consistency (do they differ if estimated from different sets of interactions of the same learner?), and disentanglement (are their components individually meaningful?). These evaluations test basic properties we would demand of learned representations, regardless of the application. But faithful representations become truly useful only when they are also operationally interpretable, i.e. when they can predict measurable behaviors and thus help suggest personalized interventions. We carried out a similar analysis for our inferred prerequisite graph, showing that mastering the prerequisites of a concept (as inferred from past data) indeed aids performance (as evaluated on unseen data). We also measured the concordance of the inferred prerequisites with ratings from experts and amateurs. While we noticeably improved on a range of alignment metrics relative to baselines, the picture that emerges from our experiments is that structure inference is a hard problem and more work is needed, including obtaining more labeled data for better validation.
In conclusion
PSI-KT offers joint knowledge tracing and mapping, with interpretable learner traits and prerequisite relations, as a foundation for next-generation tutoring systems. We obtained these desirable qualities while achieving excellent performance even with small learner groups: deep learning baselines required data from at least 60k learners to match PSI-KT trained on just 300! Beyond the raw benchmark results of PSI-KT, which I hope and expect colleagues will soon improve upon, we also present a comprehensive framework to evaluate the interpretability of representations in knowledge tracing, drawing attention to the need for stringent checking of ML models deployed in consequential domains such as education. Our PyTorch code for PSI-KT is shared as free software on GitHub. I am happy to answer questions about the method and keen to discuss practical ways to make recent advances in machine learning available to learners and educators for the broadest benefit.
Thanks to Seth Axen and Vladimir Starostin for insightful comments on a previous version of this piece.