This video features a conversation between Andrej Karpathy and Sarah Guo, discussing the impact and implications of AI agents, particularly in coding and research. They delve into the current capabilities and limitations of AI agents, the evolving nature of skills in the AI era, and the potential for autonomous research processes. The discussion also touches on job market impacts, the debate between open- and closed-source models, advancements in robotics, and the concept of "agentic education."
Andrej Karpathy uses the term "AI psychosis" to describe a state of intense focus and near-obsession driven by the rapid advancements and newfound capabilities in AI. He explains that with the emergence of AI agents, individuals can now achieve significantly more, moving beyond the previous bottleneck of typing speed. This has led to a feeling of being in a perpetual state of discovery and an urgent need to explore the limits of what's possible with these new tools, often leading to working for extended periods to experiment and push the boundaries.
[0:00] Code's not even the right verb anymore, right? [laughter] But I have to express my will to my agents for 16 hours a day. Manifest. [music] How can I have not just a single session of Claude Code or Codex or some of these agent harnesses? How can I have more of them? How can I do that appropriately? The agent part is now taken for granted. Now the claw-like entities are taken for granted, and now you can have multiple of them, and now you can have instructions to them, and now you can have optimization over the instructions. But there, [laughter] I mean, this is why it gets to the psychosis: this is like infinite, and everything is a skill issue.

>> Hi listeners, welcome back to No Priors. Today I'm here with Andrej Karpathy, and we have a wide-ranging conversation for you about code agents, the future of engineering and AI research, how more people can contribute to research, what's happening in robotics, his prediction for how agents can reach out into the real world, and education in this next age. Welcome, Andrej. Andrej, thanks for doing this.

>> Yeah, thank you for having me.

>> So it's been a very exciting couple of months in AI.

>> Yeah, you could say that.

>> I remember walking into the office at some point and you were really locked in, and I was asking what you were up to, and you were like: I just have to code for 16 hours a day, or code's not even the right verb anymore, right? But I have to express my will to my agents for 16 hours a day. Manifest. Because there's been a jump in capability. What's happening? Tell me about your experience.

>> Yeah, I kind of feel like I was just in this perpetual, and I still am often in this, state of AI psychosis, just all the time, because there was a huge unlock in what you can achieve as a person, as an individual. You were bottlenecked by your typing speed and so on. But now with these agents, I would say December is when something really flipped, where I kind of went from 80/20 to 20/80 of writing code by myself versus just delegating to agents. And I don't even think it's 20/80 by now; I think it's a lot more than that. I don't think I've typed a line of code probably since December, basically. [laughter] Which is an extremely large change. I was talking about it to, for example, my parents, and I don't think a normal person actually realizes that this happened or how dramatic it was. Literally, if you just find a random software engineer at their desk and look at what they're doing, their default workflow of building software is completely different as of basically December. So I'm just in this state of psychosis of trying to figure out what's possible, trying to push it to the limit. How can I have not just a single session of, you know, Claude Code or Codex or some of these agent harnesses? How can I have more of them? How can I do that appropriately? And then how can I use these claws? What are these claws? So there are a lot of new things. I want to be at the forefront of it, and I'm very antsy that I'm not at the forefront of it, and I see lots of people on Twitter doing all kinds of things, and they all sound like really good ideas, and I need to be at the forefront or I feel extremely nervous.
And so I guess I'm just in this psychosis of what's possible, because it's unexplored, fundamentally.

>> Well, if you're nervous, the rest of us are nervous. We have a team that we work with at Conviction whose setup is: none of the engineers write code by hand, they're all microphoned, and they just whisper to their agents all the time. It's the strangest work setting ever. I thought they were crazy, and now I fully accept it: oh, this was the way. You're just ahead of it. How do you think about your own capacity now to explore or to do projects? What is it limited by?

>> Yeah, what is it limited by? So many things. Even if they don't work, to a large extent you feel like it's a skill issue. It's not that the capability is not there; it's that you just haven't found a way to string together what's available. I didn't give good enough instructions to the agents from the file, or whatever it may be. I don't have a nice enough memory tool that I put in there, or something like that. So it all kind of feels like a skill issue when it doesn't work, to some extent. You want to see how you can parallelize them, etc., and you want to be Peter Steinberger, basically. Peter is famous; he has a funny photo where he's in front of a monitor with lots of Codex agents tiling the monitor (he uses Codex), and they all take about 20 minutes if you prompt them correctly and use the high-effort setting. He has multiple repos checked out, maybe 10. And so he's just going between them and giving them work. You can move in much larger macro actions. It's not just "here's a line of code, here's a new function." It's: here's a new functionality, delegate it to agent one. Here's a new functionality that's not going to interfere with the other one, give it to agent two. And then try to review their work as best as you can, [laughter] depending on how much you care about that code. Where are these macro actions that I can manipulate my software repository by? One agent is doing some research, another agent is writing code, another one is coming up with a plan for some new implementation. So everything just happens in these macro actions over your repository, and you're trying to become really good at it and develop a muscle memory for it. It's very rewarding, number one, because it actually works, but it's also the new thing to learn. Hence the psychosis.

>> Yeah, I do feel like my instinct is, whenever I'm waiting for an agent to complete something, the obvious thing to do is: well, I can do more work, right? If I have access to more tokens, then I should just parallelize tasks. And that's very stressful, because if you don't feel very bounded by your ability to spend on tokens, then you are the bottleneck in the system that is max capability.

>> Yeah, if you're not maximizing your subscription, at least. And ideally for multiple agents. If you run out of the quota on Codex, you should switch to Claude, or whatnot. That's what I've been trying to do a little bit, and I feel nervous when I have subscription left over. That just means I haven't maximized my token throughput.
So I actually kind of experienced this when I was a PhD student. You would feel nervous when your GPUs were not running. You have GPU capacity and you're not maximizing the available flops. But now it's not about flops, it's about tokens. What is your token throughput, and what token throughput do you command?

>> I would actually argue it's very interesting that we had at least 10 years where, in many engineering tasks, people just didn't feel compute-bound. And now the entire industry feels that. They felt resource-bound, and now that you have this big capability jump, you're like: oh, actually it's not my ability to access compute anymore. I'm the binding constraint.

>> Yeah, it's a skill issue. Which is very empowering, because you could be getting better. That's why I think it's very addictive: there are unlocks when you get better.

>> Where do you think it goes? If you just think about it: Andrej is iterating, and everybody else is, for 16 hours a day, getting better at using coding agents. What does it look like in a year, when you've reached mastery? [laughter]

>> Yeah, what does mastery look like, right? At the end of the year, or two, three years, five years, 10 years, etc. Well, I think everyone is basically interested in going up the stack. So it's not about a single session with your agent. Multiple agents, how they collaborate, teams, and so on. Everyone's trying to figure out what that looks like. And then I would say the claw is also kind of an interesting direction, because when I say a claw, I mean this layer that takes persistence to a whole new level. It's something that keeps looping. It's not something that you are interactively in the middle of. It has its own little sandbox; it does stuff on your behalf even if you're not looking. And it also has maybe more sophisticated memory systems, etc., that are not yet implemented in agents. OpenClaw has a lot more sophisticated memory, I would say, than what you would get by default, which is just a memory compaction when your context runs out, right?

>> You think that's the piece that resonated for more users, versus, like, broader tool access?

>> For OpenClaw? I think there are at least five things that are really good ideas in there.

>> Yeah, good job, Peter.

>> I mean, Peter has done a really amazing job. I saw him recently and talked to him about it, and he's very humble about it. But I think he innovated simultaneously in like five different ways and put it all together. So for example, the SOUL.md document. He actually really crafted a personality that is compelling and interesting, and I feel like a lot of the current agents don't get this right. I actually think Claude has a pretty good personality. It feels like a teammate, and it's excited with you, etc. Codex, for example, is a lot more dry, which is kind of interesting. [laughter] It's true.
And the other thing I would say is, for example, with Claude, I think they dialed the sycophancy fairly well, where when Claude gives me praise, I feel like I slightly deserve it. Sometimes I give it not very well formed thoughts, an idea that I don't think is fully baked, and it doesn't react very strongly; it's like, oh yeah, we can implement that. But when it's a really good idea by my own account, it does seem to reward it a bit more. So I kind of feel like I'm trying to earn its praise, which is really weird. I do think the personality matters a lot, and a lot of the other tools maybe don't appreciate it as much. In this aspect Peter really cares about it too, and so that was done correctly. And then the memory system, and then, you know, he's just having fun with this, and then the single WhatsApp portal to all of the automation.

>> Yeah. Is there something that you have done personally with your claws beyond software engineering that you think is fun or interesting?

>> Yeah, so in January I went through a period of claw psychosis. I built a claw that takes care of my home, and I call him Dobby, the elf claw. Basically I used the agents to find all of the smart-home subsystems of my home on the local area network, which I was kind of surprised worked out of the box. I just told it that I think I have Sonos at home: can you try to find it? And it did an IP scan of all of the computers on the local area network and found the Sonos system. It turned out there's no password protection or anything like that; it just logged in and said, "Oh yeah, you have these Sonos systems installed. Let me try to reverse-engineer how it's working." It does some web searches and finds, "Okay, these are the API endpoints." And then it's like, "Do you want to try it?" And I'm like, "Whoa, you just did that. Yeah, can you try to play something in the study?" And it does, and music comes out, and I'm like, "I can't believe it."

>> That's crazy. That's like three prompts.

>> Yeah, I can't believe I just typed in "Can you find my Sonos?" and suddenly it's playing music. And it did the same for lights. So it kind of hacked in, figured out the whole thing, created APIs, created a dashboard so I could see the command center for all of my lights in the home. And then it was switching lights on and off, so I can ask it, "Dobby, it's sleepy time," and when it's sleepy time, that just means all the lights go off, etc. So it controls all of my lights, my HVAC, my shades, the pool and the spa, and also my security system. I have a camera pointed outside of the house, and anytime someone rolls in, I have a Qwen model that looks at the videos. So first of all there's change detection, right? And then based on change detection it goes to Qwen, and then it actually tells me: it sends a text to my WhatsApp, shows an image from the outside, and says, "Hey, a FedEx truck just pulled up, you might want to check it, you got new mail," or something like that. Dobby just texts me this. It's really incredible. So Dobby is in charge of the house.
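A minimal sketch of the two-stage pipeline described here: cheap pixel-level change detection on camera frames, and only when something changes does the frame go to a vision-language model and out to a messaging channel. The helpers `describe_frame` and `send_whatsapp` are hypothetical stand-ins (the transcript names a Qwen model and a WhatsApp bridge, but no specific APIs):

```python
import time
import cv2  # pip install opencv-python

CHANGE_THRESHOLD = 25    # per-pixel intensity delta that counts as "changed"
CHANGED_FRACTION = 0.02  # fraction of changed pixels that triggers stage two

def frame_changed(prev_gray, gray):
    """Stage 1, cheap filter: fraction of pixels whose intensity moved a lot."""
    delta = cv2.absdiff(prev_gray, gray)
    return (delta > CHANGE_THRESHOLD).mean() > CHANGED_FRACTION

def describe_frame(frame) -> str:
    """Hypothetical stage 2: send the frame to a vision-language model
    (e.g. a Qwen-VL endpoint) and return a one-line description."""
    raise NotImplementedError

def send_whatsapp(text: str, frame) -> None:
    """Hypothetical: forward the alert and image through a WhatsApp bridge."""
    raise NotImplementedError

cap = cv2.VideoCapture(0)  # the outdoor camera
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if frame_changed(prev_gray, gray):    # only pay for the model on a change
        caption = describe_frame(frame)   # e.g. "A FedEx truck just pulled up"
        send_whatsapp(f"Heads up: {caption}", frame)
    prev_gray = gray
    time.sleep(1)
```

The design point is cost: the dumb pixel diff runs every second essentially for free, and the expensive model call happens only on the rare frames where something actually changed.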
I text with it through WhatsApp, and it's been really fun to have these macro actions that maintain my house. I haven't really pushed it way beyond that, and I think people are doing a lot more crazy things with it, but for me, even just the home automation setup: I used to use like six completely different apps, and I don't have to use these apps anymore. Dobby controls everything in natural language. It's amazing. So I haven't even pushed the paradigm fully, but already that is so helpful and so inspiring, I would say.

>> Do you think that's indicative of what people want from a user-experience perspective with software? Because I think it's pretty much ignored that it takes humans effort to learn new software, new UI.

>> Yeah, I think to some extent that's right. It's like working backwards from how people think an AI should be, because what people have in their mind of what an AI is, is not actually what an LLM is in the raw sense. An LLM is a token generator, you know; more tokens come out. But what they think of is this persona, an identity that they can tell stuff and it remembers it. It's just an entity behind the WhatsApp; it's a lot more understandable.

>> Mhm.

>> So I think to some extent it's matching the expectations that humans already have for how an AI should behave, but under the hood a lot of technical details go into that. And LLMs are too raw of a primitive to actually type-check as AI, I think, for most people, if that makes sense.

>> Yeah. I think that's how we understand what the AI is, and the description of it as Dobby or some persona obviously resonates with people. I also think the unification that you did across your six different software systems for your home automation speaks to a different question: do people really want all of the software that we have today? Right? Because I would argue, well, you have the hardware, but you've now thrown away the software, or the UX layer of it. Do you think that's what people want?

>> Yeah, I think there's this sense that these apps that are on the App Store for using these smart-home devices, etc., shouldn't even exist, in a certain sense. Shouldn't it just be APIs, and shouldn't agents be using them directly? I can do all kinds of home automation stuff that no individual app would be able to do, right? An LLM can actually drive the tools, call all the right tools, and do pretty complicated things. So in a certain sense it does point to this: maybe there's an overproduction of lots of custom, bespoke apps that shouldn't exist, because agents kind of crumple them up, and everything should be a lot more just exposed API endpoints, with agents as the glue of intelligence that tool-calls all the parts. Another example is my treadmill. There's an app for my treadmill, and I wanted to keep track of how often I do my cardio, but I don't want to log into a web UI and go through a flow, etc. All of this should just be: make APIs available. And this is kind of going towards the agentic web, or agent-first tools, and all this kind of stuff.
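A hedged sketch of what "apps become exposed API endpoints and the agent is the glue" could look like: device actions published as JSON tool schemas (the function-calling format most LLM APIs accept), plus a thin dispatch layer instead of six per-device apps. The `homehub.local` endpoints and function names are hypothetical, not any real smart-home API:

```python
import requests

HOME = "http://homehub.local"  # hypothetical local bridge exposing each subsystem

# Tool schemas the LLM sees; it decides which to call and with what arguments.
TOOLS = [
    {"type": "function", "function": {
        "name": "set_lights",
        "description": "Turn a room's lights on or off.",
        "parameters": {"type": "object", "properties": {
            "room": {"type": "string"},
            "on": {"type": "boolean"}},
            "required": ["room", "on"]}}},
    {"type": "function", "function": {
        "name": "play_music",
        "description": "Play a track or playlist on the Sonos in a room.",
        "parameters": {"type": "object", "properties": {
            "room": {"type": "string"},
            "query": {"type": "string"}},
            "required": ["room", "query"]}}},
]

def set_lights(room: str, on: bool):
    requests.post(f"{HOME}/lights/{room}", json={"on": on})

def play_music(room: str, query: str):
    requests.post(f"{HOME}/sonos/{room}/play", json={"query": query})

# "Dobby, it's sleepy time" reduces to the model emitting tool calls like
# set_lights(room="study", on=False), which this thin layer executes.
DISPATCH = {"set_lights": set_lights, "play_music": play_music}

def run_tool_call(name: str, args: dict):
    DISPATCH[name](**args)
```

No per-device UI is involved anywhere: the natural-language layer sits on top, and each subsystem only needs to expose its endpoints.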
So I think the industry just has to reconfigure in so many ways around the fact that the customer is not the human anymore. It's agents acting on behalf of humans, and this refactoring will probably be substantial in a certain sense. One way that people sometimes push back on this is: do we expect normal people to vibe-code some of these tools, to do this kind of stuff that I described?

>> Mhm.

>> But I think to some extent this is just technology as it exists today. Right now there is some vibe coding involved, and I'm actually watching it and working with the system, but I kind of feel like the kind of stuff I just talked about should be free in a year or two or three. No vibe coding involved. This is trivial. This is table stakes. Any AI, even the open-source models, etc., can do this. You should be able to translate a less technical human's intent very easily to this outcome.

>> Yeah. Today it's vibe coding, and it's involved, and not many people are going to do it. And you still have to make some design decisions, right? We were talking about, like, when we take frames, for example.

>> Yeah. But I kind of feel like the barrier will just come down, and it's just ephemeral software on your behalf. Some kind of claw is handling all the details for you, but you're not involved. The claw has a machine and it will figure it out, and it's just presenting you UIs and you're saying stuff, you know?

>> Mhm. Why haven't you pushed the boundaries of what you can do personally with claws? Is it that you're focusing on more important projects, auto research, etc., or you're climbing the hill to mastery, or something else?

>> Yeah, I just feel like I'm so distracted by everything. I spent like a week on the claw stuff, and I [laughter] have more to do, almost.

>> It's like Jensen told us: we're all just busier, unfortunately.

>> I didn't really take advantage of a lot of the email and calendar and all this other stuff, and I didn't really give it access, because I'm still a little bit suspicious, and it's still very new and rough around the edges. So I didn't want to give it full access to my digital life yet. Part of it is just the security and privacy, and being very cautious in that realm. So some of it is held back by that; maybe that's the dominant factor. But some of it is also that I feel so distracted, because I had a week of claw and then other stuff was happening.

>> What was the, um, I mean, you've talked about being able to train, or at least optimize, a model as a task you want to see agents do for a long time. What was the motivation behind auto research?

>> Auto research, yeah. So I had a tweet earlier where I said something along the lines of: to get the most out of the tools that have become available now, you have to remove yourself as the bottleneck. You can't be there to prompt the next thing. You need to take yourself out of the loop. You have to arrange things such that they're completely autonomous. How can you maximize your token throughput and not be in the loop? This is the goal. And so I kind of mentioned that the name of the game now is to increase your leverage.
I put in very few tokens, just once in a while, and a huge amount of stuff happens on my behalf. So auto research: I tweeted that, and I think people liked it, but they maybe haven't worked through the implications of it, and for me, auto research is an example of an implication. I don't want to be the researcher in the loop, looking at results, etc. I'm holding the system back. So the question is how to refactor all the abstractions so that I arrange it once and hit go. The name of the game is: how can you get more agents running for longer periods of time, without your involvement, doing stuff on your behalf? And auto research is just: here's an objective, here's a metric, here are your boundaries of what you can and cannot do, and go. And, yeah, it worked.

>> I was surprised at its effectiveness.

>> Yeah, I didn't expect it to work. So I have the project nanochat, and fundamentally, I think a lot of people are very confused by my obsession with training GPT-2 models and so on. But for me, training GPT models is just a little harness, a little playground, for training LLMs. What I'm fundamentally more interested in is this idea of recursive self-improvement, and to what extent you can actually have LLMs improving LLMs, because for all the frontier labs this is the thing, for obvious reasons, and they're all trying to recursively self-improve, roughly speaking. And so for me, this is a little playpen for that. And I had already tuned nanochat quite a bit by hand, in the good old-fashioned way that I'm used to. I'm a researcher; I've done this for two decades. I have some amount of... what is the opposite of hubris?

>> [laughter] Earned confidence?

>> Okay. I have two decades of: I've trained this model thousands of times, I've done a bunch of experiments, I've done hyperparameter tuning, I've done all the things I'm very used to and have done for two decades. And I'd gotten to a certain point where I thought it was fairly well tuned. And then I let auto research go overnight, and it came back with tunings that I didn't see. I did forget the weight decay on the value embeddings, and my Adam betas were not sufficiently tuned, and these things jointly interact: once you tune one thing, the other things potentially have to change too. I shouldn't be the bottleneck. I shouldn't be running these hyperparameter optimizations. I shouldn't be looking at the results. There are objective criteria in this case, so you just have to arrange it so that it can go forever. So that's a single version of auto research: a single loop trying to improve. And I was surprised that it found these things; the repo was already fairly well tuned and it still found something. And that's just a single loop. These frontier labs have GPU clusters with tens of thousands of GPUs, so it's very easy to imagine how you would get a lot of this automation on smaller models. And fundamentally, everything around frontier-level intelligence is about extrapolation and scaling laws. So you do a ton of the exploration on the smaller models, and then you try to extrapolate out.
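A minimal sketch of the single-loop auto research described here: an objective metric, boundaries, and go, with no researcher in the loop. `propose_change` and `eval_metric` are hypothetical stand-ins for an agent editing a training config within its allowed bounds and a small training run reporting, say, validation loss; this is not Karpathy's actual harness:

```python
import copy

def auto_research(base_config, eval_metric, propose_change, budget):
    """Greedy hill-climb: keep any proposed change that improves the metric."""
    best_cfg = copy.deepcopy(base_config)
    best_score = eval_metric(best_cfg)   # e.g. val loss of a small run; lower is better
    history = [(best_cfg, best_score)]
    for _ in range(budget):
        candidate = propose_change(best_cfg, history)  # agent edits knobs within bounds
        score = eval_metric(candidate)                 # train the small model, score it
        if score < best_score:                         # objective criterion, no human review
            best_cfg, best_score = candidate, score
        history.append((candidate, score))
    return best_cfg, best_score
```

The knobs it rediscovered overnight (weight decay on the value embeddings, Adam betas) are exactly the kind of entries such a `base_config` would contain, and the joint interactions are why leaving the loop running beats one-at-a-time hand tuning.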
So you're saying our research efforts are going to get more efficient; we're going to have better direction for when we scale, if we can do this experimentation better.

>> Yeah, I would say the most interesting project, and probably what the frontier labs are working on, is: you experiment on the smaller models, you try to make it as autonomous as possible, remove researchers [laughter] from the loop. They have way too much... what is the opposite of too much confidence? They don't know. They shouldn't be touching any of this, really. And so you have to rewrite the whole thing, because right now, certainly they can contribute ideas, but they shouldn't actually be enacting those ideas. There is a queue of ideas, and there's maybe an automated scientist that comes up with ideas based on all the arXiv papers and GitHub repos and funnels ideas in, or researchers can contribute ideas, but it's a single queue, and there are workers that pull items and try them out. And whatever works gets put on the feature branch, and maybe some people monitor the feature branch and merge to the main branch sometimes. So yeah: removing humans from all the processes, automating as much as possible, getting high tokens-per-second throughput. It does require rethinking all the abstractions, and everything has to be reshuffled. So yeah, I think it's very exciting.

>> If we take one more recursive step here: when is the model going to write a better program.md than you?

>> Yeah. And program.md is itself a loop, exactly. So program.md is my crappy attempt at describing how the auto researcher should work: oh, do this, then do that, and then try these kinds of ideas, and here are maybe some ideas, like look at the architecture, look at the optimizer, etc. But I just came up with this in markdown, right?

>> Mhm.

>> And so, yeah, exactly: you want some kind of auto-research loop on top of that, maybe. You can imagine that different program.mds would give you different progress. So basically, every research organization is described by a program.md. A research organization is a set of markdown files that describe all the roles and how the whole thing connects. And you can imagine having a better research organization. Maybe they do fewer stand-ups in the morning because they're useless. And this is all just code, right? So one organization can have fewer stand-ups, one organization can have more. One organization can be very risk-taking, one can be less. You can definitely imagine having multiple research orgs, and they all have code. And once you have code, you can imagine tuning the code. So 100%, there's the meta layer of it.

>> Did you see my text about my contest idea? My contest idea was: let people write different program.mds, right? And for the same hardware, where do you get the most improvement?

>> Oh, I see.

>> And then you can take all that data, give it to the model, and say: write a better program.md.

>> Yes, yes. Yeah, exactly.

>> We're going to get something better. There's no way we don't, right?

>> 100%. Look at where the improvements came from: can I change the program.md such that more of these kinds of things would be done, or fewer of the things that didn't work? You can 100% imagine doing that.
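A sketch of the contest idea under stated assumptions: each entry is a program.md describing a research organization, every entry gets the same hardware and time budget, and entries are ranked by how much they improve the metric over a common baseline. `run_org` is a hypothetical harness that executes one program.md-driven auto-research run and returns its final score:

```python
from pathlib import Path

def tournament(program_mds: list[Path], baseline_score: float, run_org):
    """Rank candidate program.md files by metric improvement on fixed hardware."""
    results = []
    for md in program_mds:
        final_score = run_org(md)  # hypothetical: one full run, same GPU/time budget
        results.append((md.name, baseline_score - final_score))  # improvement
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The ranked (program.md, improvement) pairs are then exactly the data you would show a model when asking it to write a better program.md, which is the meta layer being described.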
So I think this is a great idea, but you can sort of go one step at a time, where you have one process, then a second process, then the next process, and these are all layers of an onion. The LLM part is now taken for granted. The agent part is now taken for granted. Now the claw-like entities are taken for granted, and now you can have multiple of them, and now you can have instructions to them, and now you can have optimization over the instructions. And it's just a little too much, you know? But I mean, this is why it gets to the psychosis: this is like infinite, and everything is a skill issue. And that, coming back to it, is why it's so insane.

>> Okay, well, if [laughter] we're just trying to diagnose the current moment and what is a relevant skill right now, what do you think is the implication, that this is the loop we should be trying to achieve in different areas, and then it works? You know: create the metric, or create the ability for agents to continue working on it without you. Do we still have performance engineering?

>> Yeah. So there are a few caveats that I would put on top of the LLM psychosis. Number one, this is extremely well suited to anything that has objective metrics that are easy to evaluate. So for example, writing more efficient CUDA kernels for various parts of the model, etc., is a perfect fit, because you have inefficient code and you want efficient code that has the exact same behavior but is much faster. Perfect fit. A lot of things like that are a perfect fit for auto research, but many things will not be. If you can't evaluate it, you can't auto-research it, right? So that's caveat number one. And then caveat number two, I would say, is that we're kind of talking about the next steps, and we kind of see what the next steps are, but fundamentally the whole thing is still kind of bursting at the seams a little bit. There are cracks, and it doesn't fully work, and if you try to go too far ahead, the whole thing is actually net not useful, if that makes sense. Because these models have improved a lot, but they're still rough around the edges, is maybe the way I would describe it. I simultaneously feel like I'm talking to an extremely brilliant PhD student who's been a systems programmer for their entire life, and a 10-year-old. And it's so weird, because in humans these things are a lot more coupled.

>> Yes, you wouldn't encounter that combination.

>> This jaggedness is really strange, and humans have a lot less of that kind of jaggedness, although they definitely have some. [laughter] But the agents have a lot more jaggedness, where sometimes I ask for functionality and it comes back with something that's just totally wrong, and then we get into loops that are totally wrong, and I still get so frustrated with the agents all the time, because you feel the power of it, but it also does nonsensical things once in a while, for me as well.

>> I get very annoyed [clears throat] when I feel like the agent wasted a lot of compute on something it should have recognized was an obvious problem.

>> Yeah.
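The CUDA-kernel case above is the "perfect fit" because the evaluation is fully objective, and it is worth seeing how little machinery that takes. A minimal harness sketch, assuming `ref_fn` is the trusted slow implementation and `fast_fn` is the agent's candidate, with NumPy arrays standing in for device tensors (a real harness would compare on-GPU outputs and synchronize around the timers):

```python
import time
import numpy as np

def evaluate_kernel(ref_fn, fast_fn, make_inputs, trials=50, rtol=1e-5, atol=1e-6):
    """Objective metric for auto research: same behavior, measured speedup."""
    xs = make_inputs()
    # 1) Exact-same-behavior check: reject on any numerical mismatch.
    if not np.allclose(ref_fn(*xs), fast_fn(*xs), rtol=rtol, atol=atol):
        return {"correct": False, "speedup": 0.0}
    # 2) Speed check: median wall-clock time over repeated runs.
    def bench(fn):
        times = []
        for _ in range(trials):
            t0 = time.perf_counter()
            fn(*xs)
            times.append(time.perf_counter() - t0)
        return sorted(times)[len(times) // 2]
    return {"correct": True, "speedup": bench(ref_fn) / bench(fast_fn)}
```

Everything the loop needs (a pass/fail gate plus a scalar to maximize) fits in a few lines, which is precisely why this domain auto-researches well and softer domains do not.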
I think some of the bigger things, what's underneath it, if I could hypothesize, is that fundamentally these models are trained via reinforcement learning. So they're actually struggling with the exact same thing we just talked about: the labs can improve the models in anything that is verifiable, anything that [clears throat] has rewards. Did you write the program correctly, and do the unit tests check out? Yes or no. But some of the things they're struggling with: for example, I think they have a tough time with the nuance of what I had in mind or what I intended, and when to ask clarifying questions. Anything that feels softer is worse. So you're either on rails, part of the superintelligence circuits, or you're not on rails, outside of the verifiable domains, and suddenly everything kind of meanders. Maybe another way to put it: if today you go to a state-of-the-art model, ChatGPT, and you ask it to tell you a joke, do you know what joke you're going to get?

>> There's the joke?

>> The joke. I can't tell you the standard form of it, but I do feel like ChatGPT has like three jokes.

>> Yeah, yeah.

>> So the joke that apparently all the LLMs love the most is: why do scientists not trust atoms? Because they make everything up.

>> Okay. They make everything up. So this is still the joke?

>> So this is the joke you would get three or four years ago, and this is the joke you still get today. Even though the models have improved tremendously, and if you give them an agentic task they will go for hours and move mountains for you, you ask for a joke and you get a stupid, crappy joke from five years ago. And it's because it's outside of the RL, outside of the reinforcement learning, outside of what's being improved. And it's part of the jaggedness: shouldn't you expect models, as they get better, to also have better jokes, or more diversity of them? It's just not being optimized, so it's stuck.

>> Do you think that implies that we are not seeing generalization, in the sense of broader intelligence, of joke smartness being attached to code smartness?

>> Yeah, I think there's some decoupling, where some things are verifiable and some things are not, and some things are optimized for arbitrarily by the labs, depending on what data went in, and some things are not.

>> But I mean, there's a premise from some research groups that if you're smarter at code generation, or in these verifiable fields, you should be better at everything. And the joke situation suggests that that's not happening at all.

>> Yeah, I don't think that's happening. Maybe we're seeing a little bit of it, but not a satisfying amount.

>> Yeah, that jaggedness exists in humans. You [laughter] can be very, very good at math and still tell really bad jokes.

>> Yeah, that's true.
Yeah, but it still means that we're not getting... the story is that we get a lot of the intelligence and capabilities in all the domains of society for free as we get better and better models, and that's not exactly what's fundamentally going on. There are blind spots, and some things are not being optimized for, and this is all clustered up in these opaque neural-net models, right? So you're either on the rails of what it was trained for, and you're going at the speed of light, or you're not. Hence the jaggedness. So that's why I think, even though the progression of what should happen is obvious, you can't fully let it go there