Into the Kernel: Decoding AI’s Hidden Mind
Model Internals Is All You Need
I have a brain wired for the jungle—built to read patterns in rustling leaves and shifting shadows—and yet I navigate modern complexity through abstracted binary switches. Lights go on. Elevators arrive. Email sent,... and now AI. When I interact with a foundation model, I prompt it and I get an output. But what happens in between has been a topic of interest to only researchers. In this note I want to convince you that not only is history repeating itself, but also the immense value created when we get to understand what happens between input and output. I am biased to look at the world in this way because my career started with studying cognitive psychology in my undergrad (which propelled me to pursue AI 40 years ago). Skinner explained behavior via the principle of conditioning, where he showed complex behaviors/outputs could be conditioned on inputs (or what we call today prompt engineering). Behaviorism was eventually dismantled by people like Chomsky and the cognitive school who showed conclusively that intelligence can only be accounted for by internal representations and the operations on them. This is old news now; AI even has a prestigious conference on this topic (ICLR). And today’s AI success has been precisely these “system2” representational machines that embed, attend and transform representations in ways we don't fully understand.
We started Krnel because there is growing evidence that foundation models have thoughts and beliefs, and likely an internal language, not by design but by emergence. These are clearly different to ours but if harnessed and managed correctly can create immense new value. Gaudi, the first ecological architect, decorated his buildings with the construction rubble, creating the exquisite designs we enjoy today. We believe the internal representations of a foundation model, which are almost entirely “dropped on the floor” by applications, are in fact carriers of immense knowledge and insights. We can build really interesting applications that tap into these internal thoughts and beliefs of AI. Keep in mind for now, the keyword is “tapping into”. Three fascinating ideas have shaped this conviction.
The first time my interest piqued on this topic of what models think was when I read an article about two physicists that had trained a neural model on pendular motion data and found the model exhibited very high predictive powers. However, they did not publish the results because they could not understand and explain the model.
Next was the Financial Times article on the transformer which stated “Polosukhin, one of the coauthors of the seminal paper “Attention is All You Need” paper, is a science fiction fan from Kharkiv in Ukraine. He believed self-attention was a bit like the alien language in the film Arrival, which had just recently been released. The extraterrestrials’ fictional language did not contain linear sequences of words. Instead, they generated entire sentences using a single symbol that represented an idea or a concept, which human linguists had to decode as a whole”. Something like the Glass Bead Game. In our own decoding experiments we’ve witnessed similar outcomes, where models represent concepts in ways we don't fully understand, if at all.
Next was the work of Lisa Schut, a graduate student at Oxford, who in her paper made an even more interesting proposition: “if the hidden knowledge encoded in these highly capable systems can be leveraged, human knowledge and performance can be advanced”. Her agenda is to decode not the knowledge in AI we already know, but rather those that the machine knows and we don't. This is a fascinating agenda. Historians view history as some markov path through time. Today could’ve easily have been very different if any past event had unfolded in an epsilon different way. What we know today is simply one trajectory through the pachinko machine that is life. What if the internal state of AI is not only represented in a dramatically different way, but can also simulate all possible paths that could’ve happened? What could we build with such machines?
This all sounds very academic I hear you think. But let me offer a defense of practicality. We started Krnel because we believe by understanding what models represent and believe, we can harness that towards useful work. This agenda puts us on a similar stage of development in AI as society was at industrial revolution with the invention of the steam engine. History is repeating itself. The first law of thermodynamics by Jules appeared after the French realized the importance of the steam engine by their British counterparts. To use the joke among quantum physicists, the Brits belonged to a “shut up and calculate” school where they didn't care how the steam engine worked, just that it did. Even the museums holding these beautiful artefacts were decorated more ornately than their protestant churches. The French, on the other hand, realizing the importance of the discovery and the need to not be left behind, set up an institute to understand the principles governing the behavior of steam engines. It was here where Jules came up with the first law of thermodynamics that today (together with the other laws of thermodynamics) explains not just steam engines but even the cosmos itself and a lot of things in between. This understanding has been transformational and, in my opinion, probably more profound than the later information revolution. This understanding of energy allowed us to explain and improve not just a steam engine but cars, food, etc. Under this unified perspective all “work” (by steam engine or otherwise) is fundamentally about tapping into (recall the keyword) and harnessing the energy that is dissipated as the universe moves from a state of low to high entropy. Similar to how a waterwheel operates by leveraging the energy of falling water, machines that you and I use today from cars, to planes, to even food we eat are following the same fundamental principals (see this brilliant documentary on this topic). This understanding has allowed us to design new and more efficient machines to tap into the principles that are not observable to my simple monkey brain that only sees bananas. Krnel’s agenda is similar: can we tap into the vast amount of work the model already performs in between input and output?
Importantly, this deeper understanding of mechanical machines in the industrial revolution had an empirical counterpart. Early steam engines were notoriously dangerous and exploded often. Measurement through gages and even dedicated instrumentation cars on trains, became key to understanding and managing system behaviors. Our understanding of these systems have become so complete that we have increasing fly/drive by wire systems, and an exponentially decreasing number of incidents.
The agenda is once again to uncover hidden laws of neural systems and harness them to create more value, this time in information and not mechanical systems. Understanding the mechanisms / principals of how neural systems works is extremely challenging, akin to understanding the full system description of a steam engine, but now in even higher dimensions. To quote Einstein “.. all our science, measured against reality, is primitive and childlike -- and yet it is the most precious thing we have”. Today the field of Mechanistic Interpretability (MI, 1. 2) has emerged whose goal, like Jules, is towards this understanding by discovering the circuits and algorithms that a model has implemented during training. Our mission at Krnel is however more aligned with the measurement branch of the effort. Our vision is less explainability and more measurement, detection and control. Just like the steam engines, our agenda is to place gages on the neural systems that measure and detect states we care about and control AI to that end.
We are focusing these efforts with the goal of securing AI models, whose details and discoveries we will share in the upcoming reports. Understanding model internals will support many new applications but we are starting and focusing on security because today’s guardrails/scanners only attend to the input or output of the model; they are easy to evade, costly to install and run and do not solve the root cause of the problem. Our vision is like the gages of the steam trains; to instrument and observe models at runtime to detect and intervene on risky inputs, all within the model. We will show that models actually form beliefs over their inputs and it is this belief that once discovered can be harnessed to do work for us.
Welcome to Krnel—where we look inside the machine to build the future.
* None of the text below was written or assisted by AI.