Video: How Warp builds self improving agents on Claude | Duration: 2555s | Summary: How Warp builds self improving agents on Claude | Chapters: Welcome and Introduction (3.28s), Agent Deployment Challenges (91.145s), Agent Quality Challenges (257.845s), Skills Framework (474.91s), Agent Feedback Loops (639.35s), Feedback Quality Matters (809.04s), Evaluation Methods (965.15s), Live Agent Demo (1141.655s), Outer Loop Design (1745.765s), Handling Misleading Feedback (1841.31s), Skills vs Memory (1961.325s), Skill Management Strategy (2109.775s), Measuring Agent Goals (2352.37s), Evaluation Metrics Evolution (2418.04s), Closing Remarks (2536.415s)
Transcript for "How Warp builds self improving agents on Claude":
Hi, everyone. Thank you for tuning in. Today, we're gonna be talking about how Warp builds self improving agents on Claude. We'll do intros in a sec, but let's start with some housekeeping. So a recording of this session will be distributed via email within twenty four hours. Don't stress if you need to get up, get some water. We will distribute a recording. Questions can be submitted at any time using the q and a tab in the webinar portal, so I think that way. And we are gonna get to some live questions at the end. I'm really excited. I think that's gonna be the best part. And give us feedback. We will open a survey at the end, and we are, like, really excited and eager to hear any feedback you have about our webinar format, anything that comes to mind. So to kick it off, my name is Carly. I'm on a team in Anthropic called Applied AI. My team sits between research product and go to market. I kinda like to say that the charter of my team is to optimize and figure out how Claude models work in the wild. So at times, that takes the shape of product work, sometimes research work, figuring out what capabilities Claude needs, you know, to to have the best impact in the wild. But most of the time, that takes the shape of working with customers who are using our models to build really amazing products. I've been working with Warp for the past about year, and it's been a total pleasure optimizing Claude with them and in their product. I'll pass it off to Zach, who I'm really excited is here today. Thanks, Carly, for the, for the intro. I'm really excited to be here. I am Zach. I am the founder and CEO of Warp. For folks who might not be familiar, Warp's, mission is helping developers ship better software more quickly. We have a couple products. One product is a agentic development environment. It is born from the terminal. The company has been around for a little while trying to make the command line the best place to build. 
We have a second product, which you'll also see a bit later, in this webinar, which is called Oz, which is our cloud agent infrastructure, product. But today, I'm not gonna focus so much on on our products as, like, trying to talk about how we solve a particular problem where we're trying to build better agents. So with that, I'm gonna hop in, Carly, and get get going here. So what are we gonna talk about today? You know, for folks who are trying to deploy agents and here I'm thinking about agents that are, really, like, automations. So good examples would be an agent that does code review or an agent that triages GitHub issues or even non developer agents, like an agent that does competitive research. It's easy to sort of, like, set up an agent to do something like that. It's it's kinda hard to get it to a spot where it does it that well. And so we're gonna talk about how you can improve your agents over time using a technique called self improvement loops. We're gonna do a live demo. You know, hopefully, that goes okay. And then, as Carly said, I'm I'm excited because we're gonna do audience q and a as well. So let's start with the the problem. The problem is, let's say you want to build a code review agent, and I'm gonna focus on this use case because this is something, that we have actually built at Warp. The way that you might go about doing it might look something like as follows. So you might try to write a prompt. That prompt for code review would probably look something like, look at this open PR, try to identify any bugs, style issues with the code, and then add comments to it, maybe suggest diffs for improving it. And so you'll write that in a in a prompt. And then what you'll do is you'll set up, you'll set up an agent probably in CICD for this particular use case that runs whenever a pull, like, a pull request is opened. That will run your agents. You'll you'll try to, like, scale this across your team, and you have people use code review. 
And that will work so-so. Actually, that won't work that great. And this is the problem. Like, your initial prompt here, your initial stab at creating this agent might be, like, 80% right, but it's not gonna be 100% right, and it could be very annoying. And so this is actual, you know, feedback from our internal Slack from when we first rolled out our code review agent. So Zach Bay was like, this thing is generating a lot of noise. How do we make this thing better? I was like, this thing is making stupid comments on my PRs. And Alok was a little more diplomatic. He's like, you know, a lot of the stuff that's come out of this is low quality. And so I think you'll find this experience with other domains too, not just code review. Like, Carly, we've seen it. We've built agents for all sorts of things, like go-to-market agents, competitive research. The first stab at building one of these agents is often not very good. And unless you have some way of making the agent better, it can actually just be, like, annoying to people on the team. My favorite analogy here is, like, I think of my colleague, Tina, for example. Every day that, you know, Tina comes into work, if I give her feedback on Slack, like, she gets better. And so I think that's kinda, like, a similar problem statement here, and that's amazing. And so this self improvement loop is just, like, so important today to running agents. Zach, I'm just curious. Like, before you guys were implementing self improvement loops, in response to these comments from, you know, you and the other Zach and Alok, what would you do to respond to that? Yeah. We did, I would say, kind of the obvious thing, which was we would go in and try to manually update this prompt based on the feedback.
So, you know, we would have someone on our team look at the, you know, PRs where it was failing and then try to extract something to make the prompt better. And that's fine. I actually don't think that's, like, a bad way to start. It's very coarse. But it doesn't scale, and you run the risk of it not improving quickly enough. And then, you know, we see, like, people just start to ignore the agents. And I bet a lot of folks in here have had this experience where you have people at your company setting up bots, and they're just, like, spewing out stuff that is not that useful. And that's the failure mode. And so we were like, we have to get to a better spot. So, you know, the other things we would try is, like, improving the context in the code base, like improving our AGENTS.md. But, you know, we didn't have a great solution, to be honest. Cool. Makes sense. So we were gonna do a poll. Did you set up that poll? Are people responding to this? I think it might be live. Let's take a look. Oh, it looks like it's live if people click on, like, the poll. Oh, I see some answers. Zach, do you wanna walk through kind of each of the types, or we can let people take what they want? People are voting here. I think, like, these are all not great solutions to me. And so, hopefully, we can go over something that scales better. But, yeah, you could just provide more and more context to go along with that initial prompt; that's certainly one way of doing it. You know, manually fixing the agent's outputs: in the code review case, this means, like, hey, the code review agent might have suggested a diff that's, like, okay, and you take that as a starting point and you make it into something better. I would say ignoring the agent's not really a solution. That's like throwing up your hands and being like, this didn't work. So I don't think any of these things are great. Let's
move on to how you can build something better here. Okay. So what you can do is you could adopt a framework around skills. This is what I wanna spend, you know, most of the time talking about here. So we call this self improvement using skills. For folks who aren't familiar, and I'm guessing most people are, skills are sort of ways of encoding knowledge for agents without putting that knowledge, like, directly in the prompt. They're something that the agent can, like, look up in the course of doing its job. They're file based, which is another really nice thing about them, and I'll explain why. But these are just standard skills. And so this is the framework. It's really simple, actually, which is kind of the beauty of it. Let's talk it through in the code review example. So in the code review example, instead of encoding code review in, like, a prompt, you would move that knowledge or those instructions into a skill. And, like, let's call this the base skill. And, again, it's gonna have the same type of thing where it's like, look for bugs, improve the style, reuse code, whatever you think belongs in a good code review skill. Then when the agent runs, so when a PR is opened, the agent runs that skill. It executes with that skill in context and uses that skill to produce the code review. Then there's the important step where humans will look at these code reviews and give feedback on the output of the agent. And so in the case of code review, this might be something as simple as, like, thumbs up, thumbs down. I think there's higher signal ways of doing this though, which is like a human could be like, this was a good, useful comment, or the human could give explicit feedback on why the code review wasn't good. So for instance, like, okay, you suggested renaming this variable, but actually in our code base, we have a convention that says, you know, these types of global variables should have this name in context. Right?
And so the important thing is, like, humans are giving feedback on the output of the skill. And then the final step... I'm gonna chime in there too. Go ahead. Like, I feel like you also don't want that feedback to go into the void here, because, like, in the way that if I were to give feedback to Tina, Tina would take that feedback. And I think, historically, with agents, any feedback that comes, it's just like it doesn't matter. It just, like, goes away. And so I think this middle piece of the loop is, like, the most expensive, most valuable part that you wanna really leverage. Yes. So humans give feedback on the output of the skill. And then the key, kind of magical thing is you add a second agent in. Think of that agent as, like, an observer agent or an outer loop agent. Instead of running on every PR, this agent might run, like, once a day, once a week, something like that. And what it does is it goes and looks at all of the human feedback around, like, the base skill. So it pulls all of the prior code reviews. It looks at what the agent suggested, and then it looks at, like, the human responses to what the agent suggested. And then it makes a suggested change to the base skill. And this is where it's really cool that these skills are all files. Agents are very, very good at, like, you know, updating files. You could have this all work within code review. And so the agent will update the skill, and then the next time that skill runs, it will have, you know, synthesized feedback from all of these human interactions with it. So that's, like, the general loop. It's very simple, but it's also very powerful. I'll talk a little bit about how to, like, put this into practice before we go into a demo. So a couple thoughts on, like, how to write these skills, and I'd love your input, Carly, as well. But, like, one thing is it's better to write these skills as, like, principles and not rules.
Write the skill as though you're instructing, like, a smart person, not like you're programming a computer. You know, agents have intelligence. They're able to take these principles and generalize, in a way that, like, a computer program looking at a set of rules can't. So, you know, it should be more like, look look for repeated code rather than, like, encoding, you know, a set of rules on, like, what variable names ought to look like in high detail. So that's one, one lesson. A second lesson to get this right and make it work in practice is really about making it very easy for the humans who are participating in the loop to give feedback. So if if you make it too hard, you're not gonna get the feedback and you're not gonna be able to improve the skill. So I suggest something where it's a very, like, simple input mechanism. So, you know, again, responding on a PR makes a ton of sense because that's where developers are already working. So let them work in the tool that they're already using. Make the feedback be captured automatically. Don't make people go through, like, a manual extra step of submitting feedback. And if you do that, and can get it at enough scale and enough signal, then the agent can be very effective at synthesizing that feedback, generalizing it, and improving the skill. Yeah. Zach, I'm curious. Like, do you have a recommended sample size to run this, like, self improvement loop? You know, is it would it be overcorrecting if you run it on, you know, one piece of feedback? How do you think about that? Yeah. So I I think the risk with this approach is, like, the generalization isn't good if you don't have enough data signals. And so and also I think the quality of the data signal matters a ton. Like, it's probably not that useful to use thumbs up, thumb thumbs down data, for instance. It's like, it doesn't tell the agent what was good about this versus what wasn't. 
However, you can get really good signal even from, like, a relatively small sample size if it's very detailed feedback from a person around, like, domain specific knowledge that the agent otherwise would have no way of getting. And so, again, I think the code review example is good here. It's like the people who are most knowledgeable on our team about our code base are, like, some of our senior engineers, and they're gonna be able to give very, very smart feedback that an agent can generalize. They could almost explicitly give instructions, and that will generalize well. But in general, the more feedback, the better. And, like, at Warp, we're using this loop now to manage our whole open source repo. I can show a little bit of this later where, you know, we have hundreds of people contributing and we're doing, you know, thousands of code reviews. And so the bigger the corpus you can get of quality signal, the better. Does that match how you think about it? Absolutely. Makes sense and matches. Cool. And then the third thing is really advice for how to write this, like, improver skill. And so, you know, there's two skills involved here. In case this is confusing at all, it's like there's the domain specific skill and then there's, like, the outer skill that improves the domain specific skill. And so when you're writing that improver skill, you should be, you know, teaching it to, like, ask big questions and generalize and create principles. It should keep the inner skill, like, tidy, and then it should make a, like, a nice reviewable thing for a human to look at. And this skill is very reusable, by the way. Like, you know, the improver skill for improving code review is not that different from the improver skill for improving some other things. There's a little bit of domain specific knowledge, but this is a fairly reusable mechanism. Yep.
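As a rough illustration of what the outer loop hands to the improver skill, the sketch below gathers human replies, drops replies that carry no rationale, and assembles a prompt asking for a small, principle-level edit. It is a minimal sketch, not Warp's implementation: the `Feedback` record, the `LOW_SIGNAL_REPLIES` set, and `build_improver_prompt` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


# Hypothetical record: one human reaction to one output of the inner agent.
@dataclass
class Feedback:
    agent_output: str  # what the inner agent suggested
    human_reply: str   # what the reviewer wrote in response
    author: str


# Replies with no rationale are treated as low signal (an assumption of this sketch).
LOW_SIGNAL_REPLIES = {"", "+1", "-1", "thumbs up", "thumbs down"}


def has_rationale(fb: Feedback) -> bool:
    """Keep only replies that say more than a bare up/down vote."""
    return fb.human_reply.strip().lower() not in LOW_SIGNAL_REPLIES


def build_improver_prompt(skill_text: str, feedback: list[Feedback]) -> str:
    """Assemble the text the outer-loop agent would read before editing the skill."""
    detailed = [fb for fb in feedback if has_rationale(fb)]
    parts = ["Current inner skill:", skill_text, "", "Human feedback on past runs:"]
    for fb in detailed:
        parts.append(f"- Agent said: {fb.agent_output}")
        parts.append(f"  {fb.author} replied: {fb.human_reply}")
    parts.append("")
    parts.append(
        "Propose the smallest edit to the skill that explains this feedback. "
        "Prefer general principles over one-off rules."
    )
    return "\n".join(parts)


if __name__ == "__main__":
    history = [
        Feedback("Rename g_cfg to config", "thumbs down", "dev1"),
        Feedback("Rename g_cfg to config",
                 "Globals keep the g_ prefix in this code base", "dev2"),
    ]
    print(build_improver_prompt("Look for bugs and repeated code.", history))
```

The filter reflects the point made in the talk that a bare thumbs up or down tells the agent little, while a detailed reply with a rationale can generalize even from a small sample.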
I'm gonna turn this over to you, Carly, to talk for a few minutes about, like, what are other options here for doing improvement? Yeah. So this is something we've been thinking about at Anthropic as well, this type of loop. And, luckily for Warp, you guys have your domain experts who are also developing your product. That's pretty tight. Like, Zach and Alok also have really good feedback to give on the agent. But some companies might not have robust human signal. So the pattern actually doesn't require a human in the loop. Any feedback signal works. So here's, like, the same pattern, but kind of in that middle block, you can have any quantitative or qualitative eval there. So I'll take a second to talk about evals. This is, like, really kinda 101 level, but an eval is a test for an AI system. So you give an AI an input, and then you apply a grading logic to its output to measure success. So, you know, you'll have Warp's agent run, you'll have its whole trajectory, and then you'll find some way to grade how did that trajectory do, what was the output like. And so you can actually think of the Slack messages that we saw as maybe not a grading mechanism, but like a feedback mechanism similar to an eval. There are three types of graders that we like to think about. There's code based, which is fast, cheap, deterministic. This is the fastest, the cheapest, but also, like, sometimes the hardest to implement because it doesn't, like, embrace nuance in the way that agents require. So it's unit tests. You can do some string matching, regex parsing. And the way to think about this in agents, like, I think it is very possible to do code based evals even with agents, and so a great example of this is, like, tool triggering. So say at step N of your agent, you know that at the next step, a certain tool has to be triggered. That's something that can be deterministically evaled or checked for.
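The tool-triggering check described above can be written as a small code based grader. This is a minimal sketch under assumed conventions: the trajectory is represented as a list of dicts with a `"tool"` key, and the tool names `fetch_issue` and `apply_label` are made up for illustration, not taken from any real agent framework.

```python
def next_tool_is(trajectory: list[dict], step: int, expected_tool: str) -> bool:
    """Deterministic grader: pass if the step right after `step` called `expected_tool`."""
    nxt = step + 1
    if nxt >= len(trajectory):
        # The agent stopped before making the required call.
        return False
    return trajectory[nxt].get("tool") == expected_tool


if __name__ == "__main__":
    # Hypothetical trajectory of an issue triage run.
    run = [
        {"tool": "fetch_issue"},
        {"tool": "apply_label"},
    ]
    print(next_tool_is(run, 0, "apply_label"))
    print(next_tool_is(run, 1, "apply_label"))
```

Because the check is a plain comparison, it is cheap enough to run on every trajectory, which is the trade-off noted in the talk: fast and deterministic, but blind to nuance.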
The next one is model based eval. This is also known as LLM as a judge. This is where you grade the output or the trajectory using an LLM. So you'll use a really smart model like Opus to assess the output. You need a rubric for this. You need to decide, like, what does success look like and how to grade that. And the final one is human. This is slow and it's used for calibration. This actually often will be, like, backwards feeding into the model based and code based evals, and this is how you can build up a more robust version of the other two. So Warp has done kind of the self improvement loop with a lot of human feedback, but this is something that we're exploring and doing a lot of with both model based feedback and code based feedback in order to get this loop to be easier to run and not require such, like, expensive feedback. Makes sense. Again, any feedback signal works. And let's move back to the Warp framework and walk through a demo. Okay. I am going to do a demo. Let me go ahead and pull that up. Okay. Are you able to see this okay? Yeah. I can see that. Okay. Cool. So I'm gonna do a demo. Wish me luck. We're gonna do this not with code review. We're gonna do a different kind of agent, which is an issue triaging agent. And just to give context here, this is something that we are also doing on our open source repo. It's something we do internally as well where when someone files a new GitHub issue against Warp, and this is a demo repo, but it's the same on our real repo, we want an agent to run and assess that issue for, like, complexity, feasibility, is it something that's ready for our team to work on or not? And so just to take a sample issue here, so this is, like, something, you know, where we're showing a raw HTTP response instead of a friendly error message. Ben filed this. He did a good job filing it.
And then what happened is we have a GitHub action that triggers on issue filed that runs an agent that does an analysis, tries to figure out the root cause, related files, and a suggested fix direction. So, so far so good. We set up this agent using a skill, and so we can sort of share this skill if folks wanna see it. It's open source. This skill kinda gives, like, background on what the job of the agent is. It tells it how to assign labels, like, what the different labels mean in our system. It tells it how to do research on our code before asking follow-up questions. So this is, like, the inner skill in our example that we're trying to improve. So we're trying to improve issue triage over time. Now if you go back and look at this issue that came in, you'll see it did a pretty good job doing the triage, but not perfect. And what we would have liked in this case is actually for another label to have been applied, the ready to spec label. And this is a label that indicates to us that this is something that a contributor might be able to start, you know, building up product and technical specs for implementing. And so what happened here is Ben, who's on our team, went and gave feedback. He did it right on the issue. He did it where the thing was happening, which is the flow I suggest, and he says, I'd expect this to apply ready to spec label since the design requirements are clearly defined. So this is the improvement that we would like to see in the skill going forward. Does everything make sense so far? Yeah. And I just wanna say, like, I think what's really valuable here too is, like, Ben took extra care to make his comment, like, very clear about his expectation and also why he expected that, which just makes it easier later for the agent to ingest. Totally. Just think about it like you're talking to a person. Like, give the agent the rationale, and it will be able to generalize better. Yeah. Hold on.
So the next question is, like, okay. How does this feed back into the system? How do you get it into the loop? And so the way we do the loop is not within GitHub because it's not really CI/CD. It's something that we run within Oz, which is our agent orchestration system, which is what you're looking at here. And these are a bunch of, like, scheduled agents. And so we have this issue triage improvement agent that we schedule. It's on a schedule. I'm just gonna go ahead and run it now. We'll look at this in a second. I'm gonna start it up. It's gonna run with Opus 4.7 fast mode, and we'll get this going. While this is firing up, I wanna show what the update skill looks like. And so this is the outer loop skill. This is what the cloud agent I just started is going to run. And so what this does is it mentions what the inner loop skill is. So the inner loop skill is this skill here that we wanna improve. It tells it, like, how to fetch all the issues that have feedback on them. And one of the cool things about skills here, just to point out, is, like, there is a script that comes with this skill, and this is one of the better things that you can do with skills, where skills can reference resource files as well rather than having to write code on the fly every time. So highly recommend this. It says to get the maintainer feedback, and then propose the smallest edit that explains the signal. And, you know, because this is a demo, we're telling it to generalize based on a pretty small sample size. I think we could, you know, tweak this skill as well. Let me go and take a look at this. So I'm gonna go ahead and pull up this agent here. And this is still firing up, but you can see it's starting to run. And so just so people get what they're looking at here, this is a version of Warp's terminal app running in the cloud, running this outer loop skill.
And I'll kinda narrate what it's doing because it's kind of interesting. So the skill's name is update triage. Again, this is the outer loop skill. The agent here is executing this skill. It's authenticating to GitHub. It's then running this Python script that I mentioned to pull all of the recent issues that have been filed against this repo. It's writing a summary of them into a JSON file that it's then reading into context by catting it and doing jq. Now it's thinking, and, thankfully, this is using fast mode of Opus 4.7, which is awesome. It makes my demo go faster. And then this is where it gets really cool. So it says, now I have clear evidence from three concrete maintainer comments by Ben around things that should be different with the inner skill. So, I expect this to auto apply ready to spec. I'd expect this to auto apply ready to label. And so then it's going to do what we instruct, to make a small, focused edit. And we can see this edit here, instructing it to be more aggressive, essentially, in applying these labels. And then finally, it's going to go ahead and open up a PR for this, which it's already done. Fast mode is amazing. I'll open up this PR and just sort of show what it did. Carly, any questions? This making sense so far? Oh, this makes sense. I guess, actually, what model powers the agent, like, that was just running? Yeah. So for this demo, I set it to Opus 4.7 fast. Oh, Opus fast. In Warp, the default model is AutoGenius, which is also powered by an Opus model. But we don't have fast mode on by default because it's a little expensive for our default. Yeah, it's a little expensive for sure. And, also, this is not, like, necessarily a latency sensitive use case if you're running it weekly on a job. I would not recommend using it; you don't need fast mode for this at all. But for the demo, I was like, you know what? Let's do it. And so yeah.
And it created this PR, and I think it's really cool to see the PR created. And this is why it's great doing this all with skill files, by the way: the actual update can be done through your normal, like, code review workflow. And so we got a great description here. It basically is saying it's updating the skill. It's saying what the signals were that made it want to do the update, and it explains the change. And then if we go ahead, we can look at this skill file change here where it's saying, you know, apply ready to spec when this describes a real problem but the concrete UI/UX shape is not. And so it's extracted basically the right signal. This is, like, what we would have wanted in this use case. And now as a person, to complete this loop, what I would do is I would approve this. I would merge it. And then the next time the issue triage skill runs, it would have this knowledge. So that's cool. And, like, you know, one other thing I will show is, like, this is the process that we use at scale with all of our sort of agents that are working on our open source repo right now. And so we have all of these different types of agents, the spec writing agent, review agents, and, you know, a triage agent, and they all have these self improvement loops. So it's something that generalizes. And as you're building more and more agents, you should kinda just think like, okay. How is this agent gonna get better over time? It's gonna get better if we put in one of these self improvement loops. It could get better from the other things Carly mentioned too, where it's like you have different types of graders. But really, if you're gonna build an agent, you should build it with an improvement strategy. So that wraps up the demo here, but, hope that worked. It was great. It worked really well, actually. I'm very relieved. Yeah. Yeah. I'm glad.
We can jump to audience Q&A. So, thank you so much, Zach, for demoing that. I think one of the top questions that I'm excited to ask, because I asked you this earlier, is why do we use a separate outer loop skill per skill rather than a single generic outer loop for this, like, self improving skill? Yeah. I actually think that the right solution is probably somewhere in the middle where you could do something like a templated outer loop skill or, like, a base outer loop skill, because whether you're improving a triage agent, a code review agent, a go-to-market agent, there's gonna be a lot of similarities in what that skill should look like. But then there are sometimes specific things to the domain that are gonna make the improvement flow go better. So, you know, it's like if you're trying to improve the code review skill, I think, probably, maybe you add, like, weight things from senior engineers more than junior engineers or whatever into that skill. So you could provide it more information that's gonna do a better job of improving. Yeah. I love that. I think you're right that it's, like, mostly gonna be overlap, but then there might be some things on, yeah, like tie breaking or disagreement breaking that you might wanna add to the outer skill. But it's kind of up to everyone. Like, I think if you're building, like, 100 self improvement skills, you might be fine with one. Yeah. But if you are running three weekly for your agent, I think three different ones could make sense. I actually think this is a great segue to the question I was going to ask next, which is what happens if the feedback is misleading? For example, a junior dev's feedback is incorrect. Yeah. This is a problem. I mean, I think, you know, more data is probably a lot of the answer here. And then, I'm curious how you all solve this, but, like, having the agent maybe not blindly accept feedback, double checking it.
Like, the agent has a lot of context as well that it can use when it's making these improvements. And then there is a human in the loop when it comes, you know, to whether or not you want to actually update the core skill. So I think there's, like, a bunch of things you can do to improve it, but it's a great question. And I do think you should expect that not every piece of feedback is gonna be perfect. There's gonna be contradictory feedback. And so, like, you know, you need to plan for that. Yeah. I don't think we have, like, a perfect solution here either. I think a lot of what you just said of, like, how to tell the agents to deal with contradictions, that's something that, like, applies across, I feel like, building agents, not even just here with the feedback space. I think that there's a world where you do maybe filter the feedback that you put in. So, like, if it is this, like, verbose type of detailed feedback, you maybe do try and have it only from, like, certain users. But it's up to everyone to, like, figure it out with their agent. And, also, I like what Zach said about, you know, maybe at the end, that's where you have the human in the loop that reviews kind of that skill modification. But at Anthropic, we still do for, like, some of these things, like, these more cutting edge things, we will try and have intentionally a human in the loop at one of the stages, whether it be at the end, whether it be for filtering the feedback, anywhere in between. But, yeah, we got a few questions actually on quality of feedback. And that's something that I think that there are different ways to approach. There's one that I'll start by answering, Zach, and then I'm curious if you have more. Sure. Someone asked, how do you decide between skills and memory files?
So I think memory can be kind of a buzzword, and so I just wanna walk through how Anthropic thinks about it, because these are both file-based entities. But the way that we think about memory is that, as the agent is running on any given iteration, at any given inference time, the agent will go and modify its memory. So it's truly learning over time things about the user, the organization, the repository, and these are the agent's files to write to and learn from. Whereas skills are something more static or stable, and ideally, over time, if you get it right, a skill is not gonna change. Of course, we're talking about self-improving skills and changing skills, but I think the North Star of skills is that they don't change. Like, this is a skill: something that applies everywhere, no matter who the user is or what the agent run is. And then with memory, the North Star there is that it is ever changing. The agent is constantly editing it and writing to it. Zach, do you guys kinda think about these paradigms similarly? Yeah. Pretty similar. So for us, memory, and we're about to launch a memory primitive into Oz, is about sort of auto creation. I think the power of it is the agent itself discerning what is important and what it might need to know, so it doesn't have to repeat context or repeat tool calls, that type of stuff. It can learn the conventions. I also think our memory, the way I'm thinking about it, is a little bit less domain specific than skills. Skills to me tend to be very much "this is how you do x," almost like a sort of procedural knowledge. Memory can contain things that are a little bit more general.
But overall, it's not that dissimilar. I think also, maybe for us, the lookup pattern for memory is probably gonna be a little bit different from the question of when you pull a skill into context. But they are two flavors of similar things. Yeah. Completely agree. So the next one is: in domains like financial advice, where outputs are subjective. So this is maybe, like, coding is so verifiable, and I can take a stab at this one too. Yes, I'll take a stab. How do you establish a reliable ground truth for evaluation that can be used to validate the feedback? So I think this kinda comes back to the three types of evals that I mentioned, and some of them are code based or deterministic. I think in finance, maybe not financial advice, but just in finance, where there is a golden output, you actually probably will want more deterministic evals to run. It might be complex to tee up or mock up, but you probably will want to build out, you know, these financial projections and then check against the golden output deterministically. It won't necessarily be this human feedback loop. But if you do also want the human feedback loop, which is valuable, I think that would mean really filtering for domain experts when you accept the feedback and put it into the loop, and not necessarily accepting it from everyone. Yeah. There is, by the way, a class of programming things that are like this too, and I actually just wrote about this. So I implemented Mermaid diagram rendering in Warp. And, you know, Warp is a Rust-based app, and there's only a Mermaid JavaScript library. And the way that I did this was basically another kind of feedback loop, where I had an agent create a very large number of sample Mermaid charts using the canonical rendering, basically creating images for all the different chart types.
And then I had our agent work in a loop where it did an initial sort of bootstrapping pass at creating the Mermaid library in Rust, and that was wrong in all sorts of ways. But because there's this huge corpus of known good renderings, I then had it do a loop where it would render, it would use computer vision, through Anthropic actually, to compare the reference image and our generated image, understand where the missing pieces were, and then make a code change and then redo the loop. So if you have a really verifiable outcome, I would certainly focus on building the verification harness first, and then the agent can do a really amazing job trying to tune to that. Yeah. I totally agree. Over time, your skills are going through multiple edits and iterations. How do you keep each of your self-improved skills, one, from exploding in size and context, and two, consistent as a whole? Great question. So I think a good skill file should not be that big. Actually, I think [unclear] is right. One thing that's nice about skills, though, is that they can reference resources. And so you basically wanna avoid something where a huge amount of context comes into the context window in one giant chunk. This is my experience at least: you want progressive loading of the context that is necessary for the thing that you're doing. And so you could have your improvement skill specifically try to write scripts or try to favor reusable components when it's creating and updating the inner skill. I think that's really important. Your improver skill can also be like, "Okay, time for two skills," or whatever. It's intelligent. And so if you prompt the improver in a way that is gonna keep your system coherent, that can help.
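The render, compare, fix loop described for the Mermaid port can be sketched roughly as below. This is a toy under stated assumptions, not Warp's harness: plain string equality stands in for the vision-model image comparison, `propose_fix` stands in for the agent making a code change, and every name here (`mismatched_cases`, `improve_until_verified`, `toy_fix`) is hypothetical.

```python
def mismatched_cases(render, references):
    """Return the names of reference cases the renderer gets wrong."""
    return [name for name, expected in references.items()
            if render(name) != expected]


def improve_until_verified(render, references, propose_fix, max_rounds=10):
    """Verify against the references, request a fix, and repeat."""
    history = []  # number of failing cases seen at the start of each round
    for _ in range(max_rounds):
        failures = mismatched_cases(render, references)
        history.append(len(failures))
        if not failures:
            break
        render = propose_fix(render, failures)
    return render, history


# Toy corpus of "known good renderings", keyed by chart type.
references = {"flowchart": "A-->B", "sequence": "A->>B", "pie": "50/50"}


def initial_render(name):
    """Bootstrapped renderer that is wrong for every chart type."""
    return "?"


def toy_fix(render, failures):
    """Stand-in for the agent: repair the first failing case only."""
    fixed = failures[0]
    return lambda name: references[name] if name == fixed else render(name)


final_render, history = improve_until_verified(initial_render, references, toy_fix)
```

Here `history` ends up as `[3, 2, 1, 0]`: each round removes one failure until the renderer matches the whole corpus. The point of the design is that the reference corpus exists before the loop starts, so progress is measurable on every round.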
Your improver skill can look globally over the other 10 skills that are in your repo, and this is something that we have it do, and figure out what is the right place to put the update. So I would try to bake whatever your desired end state is into that improver skill and make sure that it's working towards that. Yeah. I totally agree. And I think that the person who asked this question has the right thing in mind: these are things we wanna avoid. And I think a lot of the principles that you walked through, Zach, kind of in the middle of your deck as well, contribute to this. Having a why and thinking about general principles is helpful here. I'll do one more. Also, just a side note: I think the survey may have opened for people to rate this webinar. Cool. So, okay. This one's kind of a long one, so we'll be absorbing it as I read. Okay. Self-improvement assumes some kind of target or goal. How do you go about identifying and quantifying the goal so it's measurable quickly, and shortening the feedback loop between a change to the agent or the agent's process and then measuring it automatically, especially on difficult-to-quantify goals? Or is this just a human thing to do? In the domain that I'm looking at here, I'm specifically focusing a lot on our coding and repo management agents, because that's where we're investing a ton right now, having gone open source recently. The way that we do this is we've created a bunch of other metrics around how these agents are performing and how things are moving through the system. I would say, if I'm being totally honest, right now people are looking at those and trying to assess if we're making progress. What would be awesome is to get to a point where you can feed those overall global metrics into how these agents are working.
And the kind of metrics I mean are, like, what's the time to merge, the number of contributors who are able to merge, that type of thing, and costs around this. All of those metrics should feed into this. We're not quite there yet. We're kinda going walk, crawl, run on deploying these, but that's where I would like to get to. Yeah. And I think this kinda goes back to the three types of eval graders, and, for anyone building an agent, just thinking and reasoning through what your eval suite and your agent may be. So, you know, there are online evals; that's one thing. When I talked about those three, a lot of that happens with offline evals, and so you'll have some test set. I think the person asking this question is wondering, how do I build that test set? How do I think about success and measure success? And that's why I mentioned this: your human graders and the human evals should feed backwards into the LLM-graded and the code-based evals. And so your eval suite will be ever evolving, ever improving, getting more robust. And anytime a human does grade something or assess something, that should ideally feed backwards into measuring success more easily and cheaply. I think that will be our last question, but thank you so much for joining us today, Zach. I really appreciate the time and you walking through and demoing. Thanks for having me. Thanks to everyone who took the time to join. I hope this was useful. Amazing. Bye. Cool.
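One way to picture human grades feeding backwards into cheaper, code-based evals is sketched below. It assumes exact-match grading against a golden output, which only suits cases where such an output exists. The `EvalSuite` class and its methods are hypothetical names for illustration, not an Anthropic or Warp API.

```python
from dataclasses import dataclass, field


@dataclass
class EvalSuite:
    # Maps a test prompt to its golden (human-approved) output.
    golden: dict = field(default_factory=dict)

    def record_human_grade(self, prompt, output, approved):
        """Promote a human-approved output to a deterministic golden case."""
        if approved:
            self.golden[prompt] = output

    def run(self, agent):
        """Return the exact-match pass rate, or None if there are no cases."""
        if not self.golden:
            return None
        passed = sum(agent(prompt) == expected
                     for prompt, expected in self.golden.items())
        return passed / len(self.golden)


suite = EvalSuite()
# Each human review either grows the offline test set or leaves it unchanged.
suite.record_human_grade("2+2", "4", approved=True)
suite.record_human_grade("capital of France", "Lyon", approved=False)
suite.record_human_grade("3*3", "9", approved=True)


def toy_agent(prompt):
    """Stand-in agent that gets one of the two golden cases right."""
    return {"2+2": "4", "3*3": "6"}.get(prompt, "")


pass_rate = suite.run(toy_agent)
```

In this toy run, two approved grades become golden cases and the stand-in agent passes one of them, so `pass_rate` is `0.5`. After that, the same check can be rerun automatically on every change to the agent without asking a human again.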