Name: Plan First, Ship Faster: How CodeRabbit Built Agent Orchestration on Claude
Uploaded: 2026-04-08T18:16:09.557Z
Duration: 46 min 56 s
Description: Plan First, Ship Faster: How CodeRabbit Built Agent Orchestration on Claude

Transcript for "Plan First, Ship Faster: How CodeRabbit Built Agent Orchestration on Claude": Hi, everyone. Thank you for attending today's session. I am super excited for today's conversation with CodeRabbit and Anthropic. We will be discussing how CodeRabbit built their agent orchestration on Claude. I think this is especially pertinent as we're seeing the industry shift from more static workflows to these truly agentic systems, and CodeRabbit has seen incredible success with their agent architecture. So, again, thank you for joining us today. A few housekeeping items before we begin. If you're worried about missing something, fear not. This session will be recorded and distributed via email within 24 hours to all attendees. Secondly, if you have any questions, please don't wait until the end to ask them. You can submit them right away through the "submit a question" widget on the right-hand side of your screen, and we'll also save time for a live Q&A at the end of the session. And lastly, please give us feedback on today's session. You can rate this webinar through the survey widget. Awesome. Well, with that out of the way, let me introduce you to the team you'll be hearing from today. My name is Brittney Tong. I've been at Anthropic for a couple of years now working with our fastest-growing startups, and it is a pleasure to be joined by two incredible speakers. We have David Loker, who is the VP of AI at CodeRabbit. He brings nearly two decades of experience building large-scale ML and AI systems, with stints at both Netflix and Amazon. And we have Ethan Dixon, who is one of our amazing applied AI members here at Anthropic. So here's a look at what we'll cover today. First, we'll start with AI's hidden quality tax, the real cost of moving fast without a plan. Then we'll dig into the intent versus execution gap and why code that technically works can still miss the mark. And then from there, David will walk us through CodeRabbit's holistic approach.
Plan first, code later, and review always. We'll get to see it in action with a live demo and then close it out with a live Q&A. So to set the stage a little bit, David's team built CodeRabbit using the Claude ecosystem, or what we like to call the Claude thinking engine. What you're looking at is our full platform, from our foundation models, Opus, Sonnet, and Haiku, to the developer building blocks and tooling that teams use to build production-ready agents. Today's conversation will live in that transformative products layer: how CodeRabbit has used these building blocks, particularly Opus, as the orchestration brain to create a Claude-fueled experience that's changing how engineering teams ship code. Now I'll hand it off to David to dive deeper into the CodeRabbit story.
Thanks very much. So, as I was so warmly introduced, my name is David. I'm VP of AI at CodeRabbit. I've been here for about a year now. For those who don't know, CodeRabbit is an AI code review platform, but I'm gonna be talking about the agent orchestration system that we built on Claude. So I'm sure we're all, at this point, familiar with Claude Code and the massive performance gains that you get from leveraging code generation to improve PR throughput, ship more features, and generally build more with less, with syntax generation being sort of the core focus. But I wanna look at the other side of the coin today, and I wanna discuss the problem as we've been seeing it at CodeRabbit as a result of trying to run too fast. Right? I'll point out and give some concrete examples of different modes of failure, and introduce our proposed solution, or the way that we've been thinking about solving that problem, while still leveraging as much of the modern workflow with Claude and Claude Code as we possibly can to keep moving really quick. Yeah. So with AI, we see great throughput. Right?
Just as a baseline metric, like, 20% more PRs, more shipped code, more features going out. But incidents have increased. Right? So we have more things happening down the road. We have more reverts. We have more missing the mark in terms of what customers are looking for, more feature misses. And so in particular, more issues are being generated as a result. So the AI-generated code has more issues. Readability and maintainability of the code is degrading. Alright? So I'm trying to say that this is not necessarily just a model issue, that this is actually a product of how we leverage the tools. How can we, as practitioners in the space, leverage these tools and increase the odds of success, so that we can make sure that what we get out at the end is what we intended, and reduce some of these downstream negative effects? So now I wanna open a poll up. There's gonna be a poll off to the side that you can go and look at and vote in, in terms of things that you've experienced when you're using any of these AI coding systems and Claude Code to build something, whether it was your prompt today or whatnot. So go and vote off to the side, and I'm gonna talk a little bit more at the same time, but please go vote. I'm curious to see the results and see where people's biggest pain points are. But this is kind of a new failure mode. Right? We can have passing tests, but passing tests doesn't mean we solved the underlying problem. Sometimes the issue is not in the code compilation. It's not necessarily apparent, especially after you've reviewed the code, you've done this validation within your testing systems, and you've iterated and got it working. That doesn't necessarily mean that the intent that you had was fully realized. You might miss a business need or some sort of feature that you had in your head that you thought was gonna come out of this. It doesn't end up happening the way you thought.
Either more things get built, or some things just don't happen, and I'm gonna go into a particular example. But, yeah, sometimes it's the wrong scope. Right? It built a feature that just wasn't required. Or in the core workflow, some piece didn't make it in because maybe we didn't say it. We made some assumptions, and maybe we didn't even realize that we were making those assumptions. And as a result of that, of not clarifying and being super specific, the underlying system has to make up what it thinks is correct, and sometimes it's not what we have in mind. And a lot of times what ends up happening is you build a lot of stuff and you don't find out until much later. That gap appears too late, and it can be costly to go back and rework everything. Alright. So, basically, I'm asking: for our own projects as well as in the companies we work for, how do we increase the likelihood of success? I feel like we need a systematic approach to eliminating these issues. Right? And so I'm gonna walk through a personal story of mine. So I was building something as a side project. I was building a secure infrastructure around a memory system, and I wanted to build a chat interface on top of it to sort of test it out. And I wanted everything to require a login. Everything's tied to a user, and so I'm specifying it. And I felt like I did a decent job, but, ultimately, I didn't spend a lot of time on the specification portion of it. And I iterated and I had this system running in Claude Code for a few hours, actually. And at the end of it, it says it's done. I'm like, alright. Well, how do I use it? And it's giving me instructions to use the user token. I'm like, what do you mean? How do I get that token? And the login page just wasn't there.
There was no way to actually create the user, even though I specified that the whole thing requires the concept of a user being in the system. And that was just something I missed. I missed the fact that that is something I needed to specify. I wasn't clear. So that's kind of, you know, I didn't say something, and as a result, I didn't get what I expected. Right? And so we have to make these things explicit. So hours of work were spent churning out tokens, building what I thought was well defined, and in reality, I should have reviewed upfront the actual thing that went into it. So how can we avoid this particular situation? I'm sure lots of us have run into something very similar. And, you know, code generation and syntax generation is a fairly fast endeavor, but if you go too far down the path and you find that gap too late, it can be costly to go back. And so we're trying to think of what processes or systems we can put in place to increase our chances that Claude Code builds exactly what we want and what we had in our heads. And so there's a few things. Right? We can think about what outcomes we are actually trying to measure. What do we care about, and how do I measure that? And we can read through it and think it through from the perspective of what assumptions are still implicit, and we can ask Claude to actually try and elicit some of those things. What are some of the things that we're missing? What are some of the things that are coming out as assumptions? And make them explicit and iterate, so we can actually think about them and come up with what we actually want. What workflows or edge cases are easy to forget? So, again, thinking about the edge cases and working with Claude also to think of those edge cases. And how will we know if the output matches the intent before rollout?
So thinking about maybe a record of work: what exactly do we want in terms of the MVP, and what are the success criteria, so that we can go and make sure that those things are met and match that original intent by being explicit about what we're looking for. So before I talk in more detail about the system that we built and how we built it on top of Claude to help modernize this process, leverage AI, and facilitate a better building experience, I wanna invite Ethan back to join me. Part of this is gonna be conversational, and he's gonna have some interesting insights into this process as well. Hey, Ethan. Yeah. So we built a system that is for planning, our agent orchestration. So the idea is to sit as a precursor to even the planning portion of Claude, as a way of making some of these issues go away by thinking longer about what goes into the coding system.
David, to cut you off there, it might be interesting to actually take a look at these poll results that I think we got back. I'll see if you wanna share those.
Awesome. Wow. Okay. So it looks like we got a pretty clear winner: important requirements were assumed and not stated. I think a lot of times we don't realize, as we learn more and more, that the things we have in our head are things we didn't know at some point and had to be taught. And then we kinda have these things, and we assume everybody knows them. And we make that assumption of the AI system as well, right, that that knowledge is there. And sometimes we're not even aware that we're assuming those things.
Yeah. This is a really good segue into, obviously, the thing we're gonna talk about next.
Yeah. So I think we looked at it and we're like, okay. Plan first, code later. I think this is a relatively well-known architecture. Like, you know, Claude Code has a really great planning system within it.
So the idea is that planning, executing or generating the code, and then reviewing are all separate parts of the software development life cycle, just as they were pre-AI. And so if I look at the way that a lot of people intuitively use the system, initially, it's this idea of a prompt-only workflow. Right? I'm just gonna type in prompts. Maybe I'll use voice, but, ultimately, I just kind of give instructions directly to Claude to execute and write the syntax and build the system, and I'll just say it. Right? A lot of times that's where I miss my assumptions, because I'm not actively engaging and thinking through and reviewing that process. I'm kinda doing a stream of consciousness, and I might not take the time to really think about where the gaps might be. And that's one thing. That's the ambiguity from the main poll result. Right? But I think there are other aspects too that can be detrimental if you're working with a team. And that's that if you do it, I have no idea. The prompt that you typed out, I have no idea if there were assumptions that you made, if it was something that you wanted to happen versus didn't want to happen. I don't really have a way of getting into that. I don't have a way of helping you in case there's some assumption that you made that might not match up with what I would make, or with what makes sense in the business sense. And those assumptions get scattered. Right? So everybody has their own. Every person who's doing it, if they're not using a planning system, they're all kinda doing their own thing. And so I think the better workflow we're seeing right now is doing that planning and clarifying the scope early. And so we're building a product to help facilitate this better workflow. Right? This allows us to get back to this idea of collaborative planning. Right?
So building something in collaboration with the rest of our team, ensuring the customer needs are met, getting alignment from multiple stakeholders, making sure that we're building something that's appropriate for the business and its constraints. Sometimes, even from a DevOps perspective or infrastructure perspective, there might be constraints that need to be understood and brought into the system so it knows. And this kinda goes back to a story for me. When I was a kid, we used to do a lot of building around the house with my dad. Right? We built fences. We built the deck. We finished the basement in our house in Canada. And he was always saying, measure twice, cut once. Because once you cut that piece of wood, that's it. It's either the right length, or you gotta go back and use something else and start again. And that used to infuriate me as a kid, this idea of always slowing things down. But ultimately, it leads to a faster overall process once you're doing those things more carefully and making sure you're doing the right thing.
Yeah, David. I love that. I think, you know, to your point earlier, it's really important. These tools are great sort of in isolation and when you're working on a solo project. But the second you wanna graduate to working on real production code with the rest of your team, and trying to navigate all of the other complexities that other developers on your team are introducing, that's when this philosophy becomes all the more important.
Yes. That's right. And as I've grown older, I've started to see the wisdom of that particular statement. The other one that used to drive me crazy is less haste, more speed. He used to say that all the time. I would do something really quickly and something bad would happen. I'd have to start over again. He'd always just say that to me and drive me crazy.
And I do now see we can go really fast. Right? With Claude Code, we can build things really quickly. At the end of the day, it's better to understand that we're building the wrong thing earlier than to go through many hours of iterations and then come out at the end and realize, oh, we didn't really build the thing that we wanted to build, or it doesn't match what our teammates or our stakeholders actually need. And so this is kind of how we at CodeRabbit are viewing the software development life cycle now: leveraging AI and the Claude ecosystem as much as we can, while still ensuring that while we're moving fast, we're moving fast in the right direction, that we're doing things that match up with what our customers need and what we're trying to build, and making sure that it makes sense at the scale that we're at and with all the other constraints that we have, with minimal rework. Right? So tokens do have a cost to them. There is a limitation in some sense, either financially or just as part of your plan. There's a limit in terms of how much you can do. And so it's about making sure that you're using them to drive features out in a way that is optimal. So less haste, more speed, I guess, as my dad would say. But, yeah, I think having that planning upfront, it being collaborative, having lots of people be able to look at it and say whether it makes sense to them and offer advice, is getting back to this collaborative way of building things that we had pre-AI. And I think it can leverage AI heavily while still, again, being as fast as we can get without doing the rework. So ultimately faster at the end of it. I would say one other thing that comes from this workflow that has been useful for us is the record of work. So having this ability to iterate on a plan as a team and have people understand what's being built, it creates an understanding of what came up and what led to it. Right?
And so if somebody new comes in and they wanna understand how we built this and why we built it, there's now a record of that. It's not ephemeral. It's being stored and ultimately can be looked at. And that allows for better validation too at the end of it. You get something out. How do I know that my intent was met? I can go and read what my original intent was, and I can see that these are the items that are supposed to happen, these are the features that are supposed to be built, and these are the success criteria. I can validate that, and I can have AI help me validate it if I have the original. And so, yeah, it's kind of a new way of thinking about things, but I'm gonna turn it over to you for a little bit here.
Definitely. David, I think it's a great point. It sort of almost goes back to the fundamentals of software development, which maybe we've strayed away from a little bit in the last year or two because the tools make everything so easy. Yeah. Great. I wanna do a quick aside here and talk a little bit about context engineering, but I promise I'll get back to the main point of what we're talking about in a minute and make it all connect. So folks listening might have seen a tweet from Anthropic, I think it was last year, that talks a little bit about our thinking around how to most effectively elicit the best performance out of our models. And this also kind of extends some of the things that you were talking about a minute ago, David, where, you know, context tokens are really not free. And so we really try to think about context as this finite resource where, in theory, every token you're adding to the context window degrades the attention budget that the model has by some amount.
And this can be both on the micro level and on a macro level, when you're thinking about whether the agent maintains coherence over really long trajectories. So the main takeaways that I think are important for the context of this conversation are both the fact that, again, context is a finite resource, but also enabling progressive disclosure. If you have a really clean mechanism for ensuring that the model only needs to load in certain parts of code files, or other bits of relevant context or team preferences, at the time that they are needed, you're much more likely to have, again, that sort of coherence over these long-context workflows. And so what we really think is important in the context of today is that planning is, in a sense, a really neat way to do context engineering. You know, with our most recent public models, Opus 4.6, the models are generally getting much better at this long-context coherence and at certain things like needle-in-a-haystack tests. They are objectively climbing all of these leaderboards very consistently, but that doesn't mean context engineering is a solved problem. And what's really neat about treating planning as a sort of proactive means of context engineering is you get all of this upfront work sort of preloaded. Right? You do all the exploration. You do the discovery work. You come up with a really coherent plan, and that makes the behavior of the agent in the long run much more effective. You know, you don't have to go and read files and check test cases, and then, halfway down the rollout, uncover some unexpected bug and have to go back and update a whole bunch of prior assumptions. So, again, planning is really this means of context engineering that leads to much more efficient long-run agent behavior.
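To make the progressive disclosure idea concrete, here is a minimal Python sketch, not anything shown in the session. It assumes a crude word-count stand-in for tokens and hypothetical names (`ContextItem`, `build_context`, `needed_now`): every item contributes a short summary, and the full text is loaded only for items the current step needs, all within a fixed budget.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    """A hypothetical unit of context: a file, a preference, a past decision."""
    name: str
    summary: str     # short description, cheap to always include
    full_text: str   # loaded only when the current step needs it


def rough_token_count(text: str) -> int:
    # Assumption: one token per whitespace-separated word (a crude proxy).
    return len(text.split())


def build_context(items, needed_now, budget):
    """Return (parts, tokens_used) without exceeding the budget.

    Pass 1 adds a one-line summary for every item.
    Pass 2 adds full text only for items named in needed_now.
    """
    parts = []
    used = 0
    for item in items:
        line = f"[{item.name}] {item.summary}"
        cost = rough_token_count(line)
        if used + cost > budget:
            continue  # skip summaries that do not fit
        parts.append(line)
        used += cost
    for item in items:
        if item.name not in needed_now:
            continue
        cost = rough_token_count(item.full_text)
        if used + cost <= budget:
            parts.append(item.full_text)
            used += cost
    return parts, used


items = [
    ContextItem("auth.py", "login and token helpers",
                "def login(user): return issue_token(user)"),
    ContextItem("billing.py", "invoice generation",
                "def invoice(order): return total(order) * 1.2"),
]
parts, used = build_context(items, needed_now={"auth.py"}, budget=20)
```

With a tighter budget the same call keeps the summaries and leaves the full text out, which is the trade the speaker describes: the agent always knows what exists, but only pays for detail when it is needed.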
And, hopefully, this is a nice segue into, David, your discussion around how you guys think about some of these different models within the CodeRabbit suite.
Yeah. Great. Awesome. Thanks a lot. So, yeah, one of the things that we think about when thinking about context engineering is, one, as you're saying, what are the right tokens to accomplish something, so that we're using that budget wisely? And the other one, from our perspective, is the efficiency angle. So how do we make it efficient in terms of either latency or cost, at the end of the day? And so you have Opus. It's just this massively intelligent system. We use it as the main brain in the orchestration loop. Right? So it's making the higher-level, very strategic decisions, doing some understanding of what do I need to understand, what don't I know about the problem, and how can I set a strategy up in order to discover that information systematically? And so we're trying to do this efficiency by design. Each model here is matched to the task complexity, so Sonnet works at a slightly lower level with maybe slightly more targeted tasks, but ultimately still not really fine-grained tasks. And then as it discovers the more fine-grained tasks, it hands those things off to Haiku, which can do a really good job at those very specific things, like lower-complexity tasks and context distillation. So here's a big file. I need this function out of it. I need to understand what it does, but I don't really need the code. And I can do that kind of task with Haiku. And that way, I can be efficient in the way that I divvy up my work, and just use the brain of the operation in the spot where it's needed and not everywhere along the way. So that speeds things up, and it actually makes it cheaper. Right?
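The tiering described here can be sketched as a simple routing table. This is an illustrative Python sketch only: the `Complexity` levels, the `Task` type, and the tier labels are assumptions made for the example (the labels are plain strings, not API model identifiers), and CodeRabbit's actual routing logic was not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    STRATEGIC = "strategic"        # high-level decisions, discovery strategy
    TARGETED = "targeted"          # narrower tasks, still not fine-grained
    FINE_GRAINED = "fine_grained"  # e.g. summarize one function in a big file


# Assumption: one model tier per complexity level, as described in the talk.
TIER_FOR_COMPLEXITY = {
    Complexity.STRATEGIC: "opus",
    Complexity.TARGETED: "sonnet",
    Complexity.FINE_GRAINED: "haiku",
}


@dataclass
class Task:
    description: str
    complexity: Complexity


def route(task: Task) -> str:
    """Pick the cheapest tier that matches the task's complexity."""
    return TIER_FOR_COMPLEXITY[task.complexity]


def plan_assignments(tasks):
    """Map each task description to the tier that should handle it."""
    return {task.description: route(task) for task in tasks}


tasks = [
    Task("decide what is unknown and how to discover it", Complexity.STRATEGIC),
    Task("trace how the data model is used by the API", Complexity.TARGETED),
    Task("summarize what one function does", Complexity.FINE_GRAINED),
]
assignments = plan_assignments(tasks)
```

Keeping the mapping in one table makes the cost and latency trade-off easy to adjust: changing which tier handles a class of task is a one-line edit that an evaluation run can then confirm or reject.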
So this is the same approach that we generally use even in our review product, but it comes with some level of complexity. Right? Figuring that out can be tricky. And so that kinda brings me to what I think is the crux of the problem when you're trying to figure out something like that. If I'm trying to make something efficient, how do I tell if I'm succeeding? How do I tell, if I make a change, whether that change makes things better or makes things worse? If I don't have a way to do that well, then I end up potentially going down the wrong path or wasting time, or I'm not able to quickly iterate through a problem and really make those efficiency gains. We have an evaluation harness that's very comprehensive for code review. We did not have one built out for this idea of creating an agent orchestration layer on top of Claude Code. And so we had to come up with something new, and that was a process. Right? So initially, it's hand-tuned. We have to do a lot of manual inspection. We have to slowly build up a good set of LLM judges that can evaluate certain aspects of the plan, and we have to come up with all these examples of good outputs, based on looking at them by hand and manually reviewing them. And then over time, we can start taking some of those and actually having Claude Code go in and build these systems. So now, because we know that at the end of it we're getting code out, we can actually evaluate the code: whether or not that code is functional, whether it has extra things in it, and all these other pieces. Right? And we can see how many tokens it took to build that, and whether our plan is helping with efficiency or not, because we can just run the whole system without the plan as well and see whether that extra planning step is, at some level, useful. And one of the things that CodeRabbit has, if you're using our system, is that we're plugged into a lot of places.
We have a lot of previous knowledge of past PRs and all this other stuff, and a lot of that context can come to bear in helping understand which parts of the code might be involved in something that you're building. And this planning system is not meant to take the place of the Claude Code planning system. It's meant as a higher-level orchestration of that, to point it in a really narrow and right direction, and to be collaborative, so that everything that needs to be explicit is made explicit and we're aware of all the assumptions that are being made. And I'll show a little bit of that in the demo later. But the idea here is to sit above that and then let Claude go and make a fine-grained implementation plan, to then build everything that was made explicit within what is essentially a very collaborative PRD document. It's a little bit more detailed than a PRD.
David, I wanna double click on this, because I think it's really interesting, the idea of not just evaluating the actual output of the code, but the quality of the plan itself. And I think this is something that we're gonna have to start adopting in sort of every product domain. I'm curious if there were other things that you guys found when you were going through the process of trying to define what a good plan actually looks like, things you uncovered that were sort of unexpected, or other course corrections that came about when designing this system.
Yeah. So I think we didn't realize what the right level of detail was gonna be for that plan. Right? So one of the things is that if you make a plan and the code changes, especially if you're too detailed, then that plan kinda becomes invalidated. Right? It can become out of date very quickly.
And so finding the right level of detail and letting Claude do the thing that it can do very well, finding that tipping point, finding the right level there, was difficult. It required a lot of iterations. And I think initially, we didn't think about the idea of evaluating the amount of time and tokens that Claude spent during its exploration phase and whether we could reduce that number. Right? That idea came later. And so, you know, it's a learning process. We were building out a new evaluation suite. So there's a lot of interesting things that come out of it when you're trying to do that.
Yeah. It's really cool. I think that teams building products on top of this type of technology can actually benefit a lot by thinking about what the right level of granularity is for their domain, and where you need to think about potentially giving the model off-ramps and giving it a more structured mechanism to actually go back and update that plan.
Yeah. Right. Yeah. Makes sense. Alright. So I think what we've built using the Claude ecosystem right now is team-wide planning. Right? So we're taking this idea that, for us at least, doing it just at the individual dev level wasn't leading to longer-term success in terms of some of the projects, and things were being built that didn't match expectations. And so we're bringing that up a level: collaboration by design. Right? So the team's involved, and we have that review artifact at the end so people can learn from it and understand what was built. And we've kind of thought of the plan itself as a quality gate. If we can make that really good, and really make sure the quality of that plan upfront is really good, the downstream effect is very pronounced. So you end up with a lot better code at the end of it.
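One way to picture the plan as a quality gate is a check that refuses to hand a plan to the coding agent while assumptions are still unconfirmed or success criteria are missing. The Python sketch below is an illustration under stated assumptions: `Plan`, `Assumption`, `gate_problems`, and `passes_gate` are hypothetical names, and the example plan is modeled on the missing-login-page story from earlier in the session, not on CodeRabbit's actual plan format.

```python
from dataclasses import dataclass, field


@dataclass
class Assumption:
    statement: str
    confirmed: bool = False  # set once a teammate has reviewed it


@dataclass
class Plan:
    goal: str
    assumptions: list = field(default_factory=list)
    success_criteria: list = field(default_factory=list)


def gate_problems(plan: Plan) -> list:
    """List the reasons a plan should not be handed to the coding agent yet."""
    problems = []
    if not plan.success_criteria:
        problems.append("no success criteria")
    for assumption in plan.assumptions:
        if not assumption.confirmed:
            problems.append(f"unconfirmed assumption: {assumption.statement}")
    return problems


def passes_gate(plan: Plan) -> bool:
    return not gate_problems(plan)


plan = Plan(
    goal="Chat interface on top of a per-user memory system",
    assumptions=[
        Assumption("Every request is tied to a logged-in user", confirmed=True),
        Assumption("Users can sign up and log in through a login page"),
    ],
    success_criteria=["A new user can sign up, log in, and send a message"],
)
problems = gate_problems(plan)
```

Here the gate reports the unconfirmed login-page assumption before any code is generated, which is the point at which that gap is cheapest to close.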
And that's what I was saying earlier in the talk, that not all of these issues are about the model. It's about being really explicit and making sure that for the upfront prompt that's going in, the context engineering portion of it, the quality bar is very high.
I really like this idea of treating the plan quality as this new sort of gate, because this is something that generalizes to basically every domain that you would apply LLMs to. Just because we have really powerful tools in our pocket doesn't mean that we can, again, get rid of the fundamentals. You still have to get together with the team, talk through the things you wanna build, and think through the unknown unknowns. So I think this is a really crucial takeaway for, again, teams building all types of products on top of LLMs: the plan quality is this new, distinct moment within LLM-driven knowledge work generally.
Yeah. 100%. I wanna show a little bit of something. I think I have to stop sharing in order to do that, so give me a second here. It's just a little video of a problem. So, setting the stage here, this is a demo repo that's in COBOL, and we're going to have our planning system build a plan around converting it to Python, using Bubble Tea as the framework for the UI portion, and handing off to Claude to do that. It's gonna one-shot it, with quite a lot of tokens, as you'll see. But this is a demo. So you can see most of it's COBOL, obviously very out of date. And then we go to our planning system. We're gonna type in a prompt, using that repository, that's gonna ask it to make a new version in Python and use a TUI, with libraries like Bubble Tea. Okay. So, obviously, this is really sped up, but the idea is it's creating a plan and scanning the whole repository, and then you've got these design choices.
So some of the things that were assumptions, based on what was going on, are now made explicit, and you can go in and validate those assumptions. You can see the options that were considered and which option ended up being chosen, and you can change that, or you can type in something on the side and tell it a different thing to look at. So some of those things are being surfaced for you to go through. So there's a lot going through the assumptions. Now we have the phases of the plan. It's broken down into sections that are easier to digest, and at the end of it, you have a prompt for each of these, right, the agent prompt. And so here's an example of typing something that's going to try and make an update: ensure the project uses Bubble Tea to improve the look and feel of the app. And so then that's gonna go off again and come up with information about what is being done. It's asking a question. Are you looking for actual Go code, or are you looking for just the aesthetic? And so we say we're just looking for the aesthetic and ask it to refine and make sure everything's good, and it makes some updates. But at the end of it, this is the iterative process. I have versions of the plan that I then have on record. And I can just go and copy the entire prompt and then paste that into Anthropic's Claude. Right? And at this point, it's gonna go off and build a bunch of stuff. Obviously, this is gonna be very sped up. Just give it a second here. Alright. And so we have here 1,600,000 lines created, and you can then run it, and we can see that it created a Python version of the app in question. And so I think this is a great example of getting something really quickly out of it that did a lot of work, by spending a little bit of time.
And, obviously, this is a toy example, but the idea behind it is spending time and understanding the assumptions that are being made, so you can make sure that what you want to have happen is actually happening and that the requirements you have are being met. Let's go back. So, improving the odds of success. That's the whole point of what we were trying to do internally: improving the odds of success, and not just at the local level. Creating a prototype is still something that lots of people can do, but we're talking about building in production systems, building something as a group, as a team. How do I improve success across the board and make sure everybody's involved in driving the product forward? Syntax generation is obviously very cheap, but you can run really fast in the wrong direction, and that makes it more expensive to then go back. Planning exposes those assumptions early, and I think that's the whole point: again, less haste, more speed. And that shared record of work, I think, is very valuable now. The more we do this, the faster we build. Understanding what we built, why we built it, and what success is supposed to mean for all of these products helps us be more explicit and more intentional about what we're building and what we mean when we say we're done.
David, I also really like that last one. Having these artifacts, which are effectively the record of work, also means that over time your planning work gets a lot better. You can imagine a team that sits down, does one of these, and sends Claude Code or another tool off to go do a bunch of development work. You come back and find that you actually missed a whole bunch of assumptions upfront. And in theory, this is a really good way to say: this is what our team thought at the outset.
This is what actually happened in reality. And can we go and make that planning process a lot more effective going forward?
Yep. Awesome. So I think this is the portion where we open things up. Welcome back, Brittney, and yeah, we can go through some questions.
Absolutely. Well, thank you, David and Ethan. It's clear that the audience is really excited. I've been getting a ton of questions coming my way. So, David, maybe to start off with you, here's one from an audience member. When planning for requirements, did you try matching them to existing software architecture frameworks and patterns as guidelines? And relatedly, what are some mindset changes we should have when developing our coding plans?
Yeah. So, when planning for requirements. I guess I'm trying to interpret the question a little bit, but the general framework of how the plan gets built is, yes, trying to map to this idea of how software gets built. Right? What phases do we go through? Which portions do we build first? Generally speaking, that means starting from changing the data model and then working your way up to the more front-end systems. But ultimately there's a lot of freedom in the way it does that, and Opus is doing a lot of the driving in terms of making these higher-level decisions. So there are also multiple different options that can take place. But initially in the process, it's more of a discovery mechanism. This is what you're saying you want. First of all, do I have enough information to understand your intent? And if the answer is no, it asks those questions. Right? But then, what information do I need to know about your organization and about the repository to start to formulate even the semblance of a plan? Opus is the brain behind that, and a lot of it is just using the best practices that come into the planning system. And then what was the second half of that question?
The second half of the question was, what are some mindset changes we should have when developing our coding plans?
I think it's the idea that it is a new point of quality: the more effort you put into it, the more explicit you are about the things you want, and the more you double-check whether those assumptions make sense to you, for your company, for the business, or for the other stakeholders involved. The extra time you put into that, whether it's half an hour or whatever, is going to lead to less wasted effort on getting things out of it that people then don't agree with at the end. Otherwise you find that out a day later, or whenever you have your next meeting with that group, rather than being upfront and getting it out of the way early. And this is what used to happen pre-AI. The difference now is that the overall cycle has shrunk, but pre-AI we had this problem too. If I were to just go off and decide everything about a feature that I wanted to build, and then my product manager came in and said, what did you do? That would have been, you know, not great. So these processes existed before. It's just that now we need to figure out how to update them in this era.
Awesome. It's clear that CodeRabbit has obviously built an incredibly comprehensive code review pipeline, with the planning piece as well. Someone wanted to know, how does CodeRabbit's planning-first approach differ from tools that use RAG-based codebase indexing for reviews, like Octopus code review? Do you see these as complementary or competing approaches?
So, I mean, ultimately, the actual implementation choices, whether you use RAG or something else, have benefits and drawbacks depending on how exactly you implement those systems. RAG can have issues in terms of the reranking portion of it.
It can pull in information that, while similar in terms of the embedding, might not actually apply, and then that goes back to that budget when it comes to the context. Right? I want to make sure I'm optimizing that and not filling it with information that might appear related but might not actually be. And so one of the things that we noticed, and it's kind of like the LLM Wiki article that came out recently, is that allowing the system access inside our jailed sandbox environment gives it this ability to do discovery in a real-time, fully up-to-date system, so I don't have to worry about whether or not my RAG system is slightly out of sync. And it's very smart when it comes to writing these shell scripts and pulling in information. That discovery mechanism being dynamic usually leads to, first of all, a lot more information, but also a lot more targeted information that's specific to the thing. Haiku is doing a lot of work in terms of pruning data, so we're not overfilling context for the planning. So there's a lot going on in there that's helping to drive that. And I'm not going to say whether one is better or worse. It kind of depends on your particular problem, and they're just different approaches. At the end of the day, we tried a bunch of them, and this approach worked better for us.
Yeah. Jumping off from that point, we had an audience member ask, could you provide some specific use cases for Haiku? I usually use Sonnet as my default. I'd love to hear from both you, David, and Ethan on this one. We've talked a lot about how both Opus and Sonnet obviously do very well in the broader orchestration components and the more complex tasks. So I would love to hear a bit from you both on where Haiku fits in.
Yeah. Definitely. I mean, for me, it's measurement.
If I measure it and Haiku does just as well as Sonnet when I measure the outcomes, then I'll use Haiku, because it makes sense from an optimization perspective, both for latency and for cost. Right? So it comes down to measurement for me. For our use case, I think of it as information distillation. If I have a file and I just need to take something out of it, and Sonnet knows what it wants from there, then it's just: go and distill this information. The token cost is cheaper with Haiku. It's faster. And that's going to happen a lot. Right? The number of times that happens is large, so you're saving a lot of cost and time. I find Haiku extremely smart when it comes to tool use as well. I mean, Sonnet is really good, and it can be necessary, again, if I have a measurement that says it is. But I'm always trying to optimize, so I'm trying Haiku out in the situations where I think it's going to do well and then evaluating it.
Yeah. I think it's a good framing. My personal mental model is effectively treating Haiku as a Sonnet lite model. If you can scope down a problem such that it's really well defined and you know roughly what you expect the model to do, it still has to go and be a smart model, but if the problem is well scoped enough, Haiku usually does a pretty good job. The thing that I think is clear is what happens once you get to those thresholds of stressing the model, whether that's loading in a lot of context or asking it to maintain coherence over a long task. Once you traverse many files, or you are halfway through a very complex plan, that is where I think you start to see the degradation of a Haiku-class model versus Sonnet. Again, that's not to say it's not a very capable model. It is. It just has to be more handheld in terms of the things you are asking it to do.
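The measurement-first rule described here can be sketched as a small selection function: run the same evaluation on each model tier and use the cheapest one whose score stays within a tolerance of the best. This is an illustration only. The function name, the tolerance, and every number below are made-up placeholders, not measurements from CodeRabbit or Anthropic, and "haiku" and "sonnet" are plain labels rather than API model identifiers.

```python
def pick_model(results: dict[str, dict[str, float]], tolerance: float = 2.0) -> str:
    """Return the cheapest model whose eval score is within `tolerance`
    points of the best score measured on this task."""
    best_score = max(r["score"] for r in results.values())
    good_enough = [
        name for name, r in results.items()
        if best_score - r["score"] <= tolerance
    ]
    # Among the models that measure as good enough, prefer the cheapest.
    return min(good_enough, key=lambda name: results[name]["cost"])


# Placeholder eval results: score out of 100, cost in arbitrary relative units.
distillation_task = {
    "haiku": {"score": 93.0, "cost": 1.0},
    "sonnet": {"score": 94.0, "cost": 3.0},
}
long_plan_task = {
    "haiku": {"score": 70.0, "cost": 1.0},
    "sonnet": {"score": 92.0, "cost": 3.0},
}

print(pick_model(distillation_task))  # smaller model measures as good enough
print(pick_model(long_plan_task))     # larger model is needed on this task
```

The design choice is that the decision is made per task from measured outcomes, so the same pipeline can route well-scoped distillation work to the smaller model and long, context-heavy work to the larger one.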
But in terms of the capability profile, I treat it as Sonnet lite for most use cases.
Yeah. That's great. David, a bit of a spicier one coming your way. It's clear from your experience, and also from industry experience such as the Amazon outage with vibe coding, that these AI code generators have limited value for enterprise systems. We may have to use additional tools on top of them, and this makes it a more complex process. Do you agree?
I mean, at the moment, yes, I would agree that the process has to be more complex. I think the systems you leverage, especially when it comes to the validation portion of it, have to be thorough. Right? Each validation step is kind of like Swiss cheese: there are holes in each of those validation layers, but if I stack them on top of each other, I can fill the gaps. And so when it comes to enterprise systems, you need a security check, you need the code review, you need the validation, you need your tests. You need all these layers to still be able to run fast with AI. And I do think that, again, making sure that what I'm building and why I'm building it is clear upfront also helps. When you're talking about very large codebases, we have seen issues because there is only so much context available, and so those validation steps become even more critical. Seeing those issues and how they come about, and then building in validation workflows that leverage AI, is going to help you in the long run. So I think it is very possible for these systems to be leveraged well in the enterprise environment. I just think it requires more work, yes, like you said.
Yep. Okay. We'll take a couple more questions as we're nearing the end here. I'd be curious to hear from you both.
Does the collaborative planning model with AI, e.g., spec, plan, code, reduce the need for traditional human code review, or does it just shift where human oversight matters? For which low-level details, such as security edge cases, performance implications, or architectural drift, is human intervention still critical even when the high-level plan was already agreed upon?
Yeah. So we're getting into the validation portion of it. The way I see the space moving, especially as models like Opus and Sonnet are getting so much better at syntax generation and at being correct in their syntax, is this. With the code review, the validation portion, coming out of that you're saying your code is correct. I can see more and more people not necessarily looking in detail at the code, and being more at the level of: what decisions were made, and do I agree with those decisions? Because maybe I forgot something. Right? So it's about being able to validate whether or not it's working for my infrastructure and for the scale of my company, and whether the actual changes to the system were done in a way that didn't make sense. Say, for example, there are two decisions that are both valid and there are trade-offs to each. I need to make that decision. Maybe I forgot that, and it made one, and I realize, oh, that doesn't make sense in this instance. I don't have access to that. AWS versus GCP, I forgot to tell it, and so it's using something that doesn't make sense for me. Or I already have a Redis server, and I should leverage that and not this other system that it's using. So there are a lot of choices that are still going to come up. And I think for now, understanding systems and system design, and how everything plays together, especially at larger scales, is still going to be extremely valuable for validating and making sure these systems are tackling the right problems.
And I think there's another thing: if you do have a very large codebase, being able to visually see what's happening, to understand whether duplication is happening and whether there are things going on that could lead to maintainability and long-term issues. We still need to do that. But I think we're heading towards the point where the syntax and the functionality, after these validation steps, are going to be correct. It's the other parts that we, as human beings, then need to check on, because we may not have given full context.
The other thing I think is kind of interesting to note, and I totally agree with your point there, David, is how much this has shifted even in the last two and a half years. If you were asking this question two years ago, it would have been a very different answer. But we've collectively moved to a point where the threshold of acceptable use cases is moving more in the direction of high-level system review. And I think the way you keep familiarity with the frontier at any given point is just by spending a lot of time with the models. It's really important to get a sense of where the places are that you can trust outright that the syntax is probably going to be right and the design spec is probably going to be followed. And there's the expectation that if you do lay out a good plan, the model will actually do a pretty good job of implementing said plan.
Awesome. Well, one last question, so that folks can walk away with some best practices here. What is the biggest architectural mistake you see developers make when integrating LLMs into real products, and how would you design it differently from day one?
The ones that I've seen most frequently are the ones where what got built at the end was not what we wanted to have built.
Yeah.
That was the reason why we had this idea in the first place: we saw that happening very quickly, more than a few times. And it also comes up when people are onboarding. Right? They don't necessarily know all the details they need to know to make those systems very effective. So that's something I've seen quite frequently, and that's why I think this shift needs to happen, at least within organizations that are building as a team. Right?
Yeah. I think in a similar vein, more from the organizational perspective: if you're going to give people, on one end, a bunch of really powerful tools to go and generate a bunch of code, and give them the keys to moving a lot of AI-generated code to production, then on the back end you also have to do what CodeRabbit is setting out to do, and think a lot about the filtering mechanisms you've built in on the back end, so that they keep parity with the volume of stuff you're going to generate on the front end.
Awesome. Well, this wraps up today's conversation with CodeRabbit. Thank you, David and Ethan, for the nuggets of wisdom on how we can continue to improve the dev workflow: planning first and catching intent gaps early to ship higher-quality code faster. For the audience, we'll be sending over a few resources by email, including a recording of this webinar. You should have also seen a short survey pop up. If you have thirty seconds, we'd love your feedback on how we can make these sessions even better. Thank you, everyone, for listening in, and I hope you all have a great rest of your day.