Video: Building Products on Claude Opus 4.7 — A Customer Success Story with Solve Intelligence and Blitzy | Duration: 47:59 | Chapters: Welcome and Introduction (0:00), Housekeeping and Introductions (0:53), Session Agenda Overview (3:02), Opus 4.7 Introduction (3:42), Coding and Vision (5:54), Task Budgets (7:53), Migrating to Opus 4.7 (10:10), Solve Intelligence Introduction (12:13), Chemical Structure Processing (17:07), Closing Remarks (23:13), Meet Sid: Blitzy Origins (25:03), Frontend Development Demo (27:33), Model Selection Strategy (34:27), Q&A Session (40:10), Introduction and Housekeeping (47:55), Knowledge Graph Systems (50:40), Opus 4.7 Model (51:28), Blitzy Environment Demo (52:22), Responsive UI Design (52:46), Model 4.7 Capabilities (53:33), Product Walkthrough (54:02), Coding Benchmarks Demo (54:28), Visual Fidelity Performance (55:23), Opus 4.7 Insights (57:43), Product Evolution (1:01:02), Enterprise Quality Focus (1:03:23), Optimization Strategies (1:05:07), Introducing Sid from Blitzy (1:13:25), Closing Remarks (1:17:00), Audience Q&A (1:29:07), Closing Remarks (1:33:25)
Transcript for "Building Products on Claude Opus 4.7 — A Customer Success Story with Solve Intelligence and Blitzy": Alright. Hello, everyone, and good morning if you're on the West Coast in San Francisco. Actually, it'd be pretty cool if everyone could type in the chat window where they're joining from. I'm really excited to have you all tune in and talk about Opus 4.7. We'll get into the specifics in a little bit: we're going to talk about model capabilities and where we see the model really shine. But what I'm really excited about is having two of our favorite builders join me on stage to talk about how they've been using the model in production and the lessons they've learned. Before we start, a few housekeeping notes. First, a recording of this session will go out within 24 hours of the webinar ending. Second, we have a Q&A panel in this webinar portal, so at any given time I encourage you to type your question there. We'll have a ten-minute session at the end where we take questions, and I encourage you to direct them to our builders and guests today, Sanj and Sid — they're the real warriors who have been using the model and have a lot of lessons to share. And lastly, feedback: at the end of the session you'll be prompted to fill in a survey. I'd love it if you took the time to complete it, because our team really looks at every single piece of feedback, and we want to make these webinars better in the future. So without further ado, this is the lineup for today. I'm your host, Marius. I'm part of the Applied AI team here at Anthropic. Before that, I was a YC startup founder, so I was on the other side, building with Claude models. Our team, Applied AI, sits between product research and go-to-market.
We're the first ones to use the early models, like Claude Opus 4.7, and figure out how they perform in the real world. That's how I got to meet both Sanj from Solve Intelligence and Sid from Blitzy, who were early testers and had amazing feedback to share after they tested Opus 4.7. A quick recap of the agenda for today: we'll do a quick intro to Opus 4.7 and talk about its capabilities, the new unlocks, and the latest benchmarks. Then we'll talk about migrating to Opus 4.7 from previous models. Then we'll kick it into gear and get into a conversation with Sanj and Sid, who have some cool demos to share with us. And we'll leave ten minutes at the end for questions. With that, I'd like to introduce the new model, Opus 4.7. This is how it stacks up against our previous Opus 4.6 and the current model, Sonnet 4.6. I'd say Opus 4.7 shines through when tasks are very complex and really need that extra mile of intelligence and sustained autonomy. We often hear the feedback, "Hey, I don't see any difference in my chat window when I talk to Opus 4.7." And that's fair, because the real work is done not in a single one-turn conversation but in sustained, long-term, multi-tool-call, multi-session work. So I encourage you to use the model for those types of tasks. On the practical side, cost stays the same: Opus 4.6 and Opus 4.7 are priced identically, at $5 per million input tokens and $25 per million output tokens. We want to make it easy for you to upgrade with just a model-string change. These are the highlights of the new model, 4.7. As I mentioned before, its long-horizon reliability really shines. It works through ambiguous tasks.
It follows instructions precisely, and it catches its own mistakes, keeping going over a long period of time. It does that by saving important notes along the way through its improved memory capabilities: it writes to and reads from file-based memory, and with that it moves more smoothly across multi-session work. We also reached new peaks when it comes to coding, and we'll talk about benchmarks in a little bit, but we noticed that SWE-bench Pro, which is our hero eval, got a bump of 10 points in accuracy on the task. For vision-specific tasks, there's also a new breakthrough: we've raised the input limit to 2,500 pixels, and we'll talk about that in a little bit. Here's a sneak peek at the benchmarks — you can also read the model card for the full, detailed benchmark table. I'll highlight two things here. One, our hero coding benchmark, SWE-bench Pro, got a bump of 12%, so the model is just much better at coding. Another thing that's really exciting is the performance on visual acuity tasks: as our partner Expo reported, the model scored 54% on Opus 4.6, and now we saturate the task with 98.5% accuracy. So this is a true vision capability unlock, not just a step forward. As I mentioned, we allow larger images to go through the model: the limit has now increased to 3.75 megapixels, which is three times what Opus 4.6 could do. So if you're doing any work related to computer use, screenshot understanding, or pulling data out of dense diagrams, you will really feel the change. There are no API changes — it's just a drop-in replacement. One thing to be aware of: more pixels mean more image tokens, so be mindful of that. With the launch, we introduced a really cool feature called task budgets, which came straight out of feedback from our customers.
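A quick aside to make the "more pixels mean more image tokens" point concrete: Anthropic's docs have historically given a rough rule of thumb of tokens ≈ (width × height) / 750 for images sent to Claude. Treat the divisor as an assumption here — the exact accounting for a given model may differ — but the sketch shows why a 3.75-megapixel image costs meaningfully more than a small screenshot.

```python
def estimate_image_tokens(width_px: int, height_px: int, divisor: int = 750) -> int:
    """Rough image-token estimate: tokens ~ (width * height) / divisor.

    The 750 divisor follows Anthropic's published rule of thumb for Claude
    vision; the true figure for any specific model is an assumption here.
    """
    return (width_px * height_px) // divisor

# A 1920x1080 screenshot (~2.07 MP):
print(estimate_image_tokens(1920, 1080))   # -> 2764
# A full 3.75 MP image at the new limit:
print(estimate_image_tokens(2500, 1500))   # -> 5000
```

So a maximum-size image lands in the ballpark of 5,000 input tokens — worth budgeting for if your agent ingests many dense screenshots per turn.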
We kept hearing the same feedback over and over: people love that the agent runs for hours, but it's really hard to predict the bill at the end of the day. So we gave the model a token target for its entire agentic turn, and the model sees a live countdown. Instead of getting cut off midway through, it prioritizes and wraps the task up cleanly. This is an advisory target, not a hard cap, but it really makes a difference. We also added a new effort level called extra high, which sits between high and max and gives you finer control on really hard problems without paying the full latency cost of the max effort level. I would start here if you're building anything coding- or agent-related; otherwise, the default, high, is probably right for most intelligence-sensitive tasks. And our favorite tool that launched with the model is the advisor tool. Here, you can think of building a main agent with a cheaper model like Sonnet or Haiku, and then, when you hit something difficult, handing off that task with a single API request to Opus 4.7 — we call this the advisor tool. You don't need to implement any of this harness yourself; it's a tool that ships by default, and you can just use it in the API. When it comes to migrating to Opus 4.7, we introduced a command in Claude Code — claude api, with a migrate subcommand — and it helps you with two things. One, the mechanics: swapping in the model ID, removing parameters that no longer make sense, and converting thinking effort levels. It also gives you insights into how to improve your prompts to support the new model. In terms of API-level changes, we removed some parameters: temperature, top_p, and top_k are deprecated now. And, as I mentioned, we introduced a new effort level called extra high.
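The migration mechanics Marius describes — swap the model string, drop the deprecated sampling parameters — can be sketched as a small transform over a request payload. This is an illustrative sketch only: the model IDs and the exact set of removed parameters are assumptions based on what's said in the session, not a reproduction of the actual migrate tool.

```python
# Parameters the talk says are deprecated in the new model's API.
DEPRECATED_PARAMS = ("temperature", "top_p", "top_k")

def migrate_request(payload: dict) -> dict:
    """Sketch of the mechanical half of migration: swap the model ID and
    strip deprecated sampling parameters, leaving everything else intact.
    Model ID strings here are hypothetical."""
    migrated = {k: v for k, v in payload.items() if k not in DEPRECATED_PARAMS}
    if migrated.get("model") == "claude-opus-4-6":
        migrated["model"] = "claude-opus-4-7"
    return migrated

old = {
    "model": "claude-opus-4-6",
    "temperature": 0.2,
    "top_k": 40,
    "messages": [{"role": "user", "content": "Refactor this module."}],
}
print(migrate_request(old))
```

The prompt-rewriting half of the migration (tightening instructions for a more literal model) is the part a script can't do for you, which is why the tool also surfaces prompt-level suggestions.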
There's also a change in how we count API tokens: you'll see a bump in their number of about 35 percent. When it comes to prompting, there are a few changes to keep in mind to make sure the model works as intended. Instruction following is way more precise, so clean up your prompts with that in mind. There are fewer tool calls and more reasoning — 4.7 spawns fewer subagents, and there are ways to mitigate that if you need more tool calls. And the model responds differently to different levels of task complexity: for hard tasks, it will think for longer, which is something to note if you want the same level of output. With that, I'd like to invite over Sanj, cofounder and chief research officer at Solve Intelligence. Sanj, how's it going? Hey, Marius. Thanks for having me. It's really good to have you here, and thanks for making the time. Could you tell us a little more about who you are, what you're building, and how you ended up building your product? Yeah. I'm Sanj, one of the cofounders, as you mentioned. My background is a PhD in AI, and of course understanding patents is a very big intelligence problem, so I'm very excited to work on it. What we do as a company at Solve Intelligence is every part of the patenting process: right from the drafting stage, writing the patent, all the way to litigating it, and everything in the middle, so also prosecuting the patent. And even at the earlier stages, coming up with invention ideas to help inventors communicate with the attorneys they then use to file the patents. So that's a brief overview. That's really fascinating, and it sounds like a very complicated task. We'd love to hear how you're using Claude to do the heavy lifting in your product. Yeah.
We think of Claude as the engine behind much of our product. We're building all of these harnesses around it — specific tools and evals and datasets that really tune whatever model we're using to the tasks we're trying to solve for our customers. And we found, of course, that Claude Opus 4.7 performs extremely well, so we're currently using it as a key engine in our product. That's awesome. And for folks who don't really understand your domain, or why it's hard, what are some specific things people underestimate about the legal and patent space? Yeah. For a patent to be granted, it has to be novel and non-obvious, and it has to be that over all prior patents that ever existed, which is a really quite high bar. So our system has to handle millions of patents: search over them, run agentic loops where it's looking at loads and loads of patents and coming up with the key insights. And the really interesting thing is that a single paragraph in a single patent might totally flip the conclusion as to the decision — whether this is anticipated by another patent or not. So that challenge is extremely complicated. It's also very multimodal, and we'll get to that in the demo. Awesome. Yeah, sounds like a very costly task to get wrong. Could you give us an example where a naive model gives a plausible answer, but it's actually wrong? Yeah, exactly. One example we came across was for a dental implant. The prior art, as it's called — an existing patent — discloses that this dental implant could have protrusions. But then somewhere in the text, they said this feature can be reversed.
What that means is that those protrusions can become recesses. And what's quite challenging for the AI is to have this geometric understanding — building a mental model of not just the invention I see before me, but all the variations on top of that invention that are described in the text. It turned out that the new patent application had been anticipated by this prior patent, according to a big litigation case that went to court and was very, very expensive. But a naive analysis from an AI that's less well trained will say, "Oh, no, this patent isn't relevant because it said protrusions, not recesses." So you can see why you sometimes have massive teams of lawyers poring over patents in detail trying to come up with legal arguments like this, and it's really interesting to think that AI might be able to assist with that process. And I know you've been heavy adopters of really high-intelligence models, so when you got Opus 4.7 you were really excited about it. What did it unlock for you that you couldn't have done before? Yeah. It'll be in the demo, and essentially it's what I've been chatting to you about, Marius: trying to understand chemical structures. We sell into a lot of pharmaceuticals — people doing life-science patents. One of the big challenges is processing images of a chemical structure in a patent and building a textual representation of it, in the form of something called CXSMILES, which I can explain in a little bit. And we found that Opus 4.7 is a massive bump up on this particular task relative to Opus 4.6. There are probably a few reasons for that, potentially the higher-resolution image input as well.
But as I think we've also discussed offline, it's also the reasoning capability — the ability to think through how all of these bonds are interacting with each other and so on. That's awesome. We'd love to see a demo if you have one. Awesome, okay, let's jump to it. So this is our product — I hope you can all see it. Essentially, we have four sub-modules: invention harvesting, which I mentioned; drafting, which is writing the patent; prosecution, which is arguing for the patent at the patent office; and charts, which assist with litigation by doing deep, detailed litigation analyses. For the purpose of this demo, we'll stick to drafting. Let's imagine I'm writing a chemistry patent application — perhaps a very detailed application with many chemical structures embedded. To give you a flavor of the kind of thing we might want to do: you might want to enhance this image representation with a textual representation that the AI can subsequently use in, for example, tool calls. So here I'm just going to extract the chemical structures from the current document, and while that runs, I'll show you some of our evals and how we think about this problem. So I'm just going to reshare now — the evals. Yeah, I will be looking at evals; it's my favorite job. So let me explain a little about the problem. Of course, chemistry experts spend a lot of time thinking about this, and patent attorneys have really gone deep on some of these problems. Here's an example, on the left, of a chemical structure from a patent application. And this isn't an ordinary chemical structure — it's actually quite specific to patents. For example, in patents, they define variables like R1, R2, M1, and M2.
These variables represent placeholders for different chemical groups that can be attached to that chemical compound. The idea is that you patent something quite general, which means you can claim all kinds of chemical structures: the placeholders represent that I'm not just patenting one particular compound, I'm patenting a family of compounds. So the kind of data we deal with might come in this form, as image data. What we do with our in-house attorneys is get a set of ground-truth CXSMILES examples — the textual representation of that formula. So the first step is setting up those datasets. And you can see here that this then compiles to an SVG that is functionally equivalent to the diagram we've got over here. Now that we've got this dataset, we run Opus 4.6 on it. This is a relatively simple example, and I've turned thinking off. And actually, just for the purpose of illustration, you can see that it's got it wrong. The reason is that it's detected an oxygen here when there is, in fact, no oxygen between the R1 and the R2. So this is actually wrong: it has written something which is valid syntax, so it's syntactically correct, but the actual structure has an extra oxygen atom. Now if we look at Opus 4.7, even without thinking, it's already got this perfectly right, and it's an exact match for the ground truth. We run these tests on thousands and thousands of chemical structures, and we've seen across the board, and with different levels of thinking, that Opus 4.7 consistently outperforms Opus 4.6 on this task — and with enough thinking, it can go to even much more complicated chemical structures. These structures are really cool to look at. They're fascinating.
In a way, I feel bad for not paying attention in school during the chemistry lessons, because this is really, really good-looking. I've had to relearn all of my chemistry. And just to close the loop — one second, I'll just switch back to prod. You can imagine that all of this is going on under the hood, so when we go here, it's all now in this tab over here. All of this is hidden from the attorney, but the representations have now been created so that when the AI needs to determine, say, whether this patent is novel and non-obvious over all the other prior-art applications — which have loads of chemical structures — it can reason in this representation space and come up with useful feedback to assist the attorneys who are making these big decisions around what exactly to patent. So, yeah, that's pretty much it. I hope that was interesting. This is super interesting, and it does seem like a very difficult task even for experts. No, absolutely. It's a good case of just going really deep on a problem, and we're still in the early stages of really pushing this to the limit of what's possible. But it's quite exciting to see the evals change with new model releases, so it has been really fun to work on. That's really cool. And looking ahead, what are some big product questions you're still wrestling with right now? Yeah. I'm obviously at the management level, so I'm often watching a team of phenomenally talented AI engineers building the next version of Solve's products.
So the thing I try to do that I guess is useful is take a step back and ask: how can we build the most beautiful, elegant architecture that will scale nicely with the improvements the foundation models are making — Anthropic's in particular? As thinking increases, as we do even more long-running agentic tasks, how can we build a harness, and not spread the work in a messy way, such that all of the good properties emerge — like just better performance? So I'm often stepping back and looking at the big picture, whereas the AI engineers are deep in the weeds of actually getting this to work. Yeah. We love working with you because you're really pushing the models to the next level, so I'm curious to see what we can unlock with your new versions. Well, thank you so much, Sanj — this was fascinating. Thanks for the demo and for the evals. We'll have you back shortly after we talk to Sid. And with that, I'd like to invite Sid on stage. Hey, Marius. Hey, Sid, how's it going? Great. How are you? Really good to see you, and I really love your background. Oh, yeah, thanks — our media team has been hard at work here, so we were able to get this done. I thought for a second that was a virtual background, but I'm told it's real, right? It is. It's a beautiful Cambridge building with exposed brick. The woven fire hose was invented right here in Cambridge, Massachusetts — it's a historic building. And we've invented a ton of stuff for Blitzy here as well. That's absolutely fascinating. We'd love to visit sometime. Yes, that would be amazing. Well, thanks for joining us today. For folks who don't know you, could you tell us a little more about how you came to build Blitzy? Sure. I'm Sid, cofounder and CTO at Blitzy.
I was at NVIDIA from 2016 to 2023, and for the better part of the last decade — since 2017 — I've been on GenAI. I was there at NVIDIA, reading research papers, working as part of the AI team when "Attention Is All You Need" dropped. And I've been inventing solutions for NVIDIA and patenting them, so I'm familiar with the problem Sanj is solving as well. Autonomous software development is the problem that Brian, our CEO, and I chose to solve and dedicate our life's work to. We're really passionate about this because, back in 2023, no one believed that AI could reliably write production-grade code at enterprises, and Blitzy is dedicated to solving exactly that problem. With Claude Code, you work with a human: you go back and forth, build features. Blitzy is designed to build entire products, entire projects, at enterprise scale, working across hundreds of millions of lines of code. We've been partnering with Anthropic since the earliest days, since 2024. We use all of the models, but Claude is primary, and we've had a fantastic experience working with Anthropic. Got it. Yeah, we've come a long way from the beginnings of AI coding, where you'd ask the model to do bubble sort and it would implement bubble sort for you, and we were all super excited — and now we're working through millions of lines of code. That's really cool to see. Cool. Well, we'd love to hear a bit more about what Claude is doing to do the heavy lifting for your product. Yes. So, enterprise-scale software development and enterprise-scale projects are different from your typical hobby projects or typical day-to-day use cases.
Many of our customers have, let's say, thousands of applications, millions of lines of code, and libraries that are used across hundreds of those thousands of applications. When you make a change in one library and it affects all those applications, if you're only discovering bugs when you try to build an application, that's far too late. You need a way to make changes and to leverage the intelligence of the model much more effectively. Blitzy's underlying system uses a hybrid of a graph and a vector database and builds a knowledge base of the entire enterprise's source code. That's why we're able to use models far more effectively. Particularly with Opus 4.7, what we've noticed is that the model is far more intelligent when it comes to visual acuity. Front-end development, for example, is a really hard and underrated use case in enterprise development, especially when you tie front-end development to specs. Getting the correct answer one-shot is a really, really hard problem. To show you what that means and how Opus 4.7 really pushes the boundaries of what's possible in this domain, I have a quick set of demos. The links, by the way, are in the docs section — if you want to follow along, viewers, feel free to do that. I'll quickly share my screen and bring up the window so you can see this. So, what I'm presenting now — Marius, just to make sure you can see this, right? Everything's good? Yeah, we can see it. So we're literally using Blitzy to build Blitzy, and what you see now is literally the environments page on Blitzy. You set up an environment that Blitzy uses to build your code, and you can see the Figma spec that's been detailed out by a designer. You can see there's a sidebar, some elements here — there's Linux and Windows, some setup instructions, some attachments, icons, and all of that stuff.
And this is supposed to be responsive. So you have what it looks like at 1920px and at 2560px. What's more interesting and more important is how it behaves when you crunch the screen down to lower resolutions. What happens when you go to 768 pixels? As you can see here, the sidebar collapses into a dropdown. Then, as you go lower, all of those attachments are stacked vertically, and you have an elegant-looking UI. This is quite often a feature — this is the work you see enterprises using code-generation tools for, trying to build it with all of the tools they have access to. I want to show an example of what happens if you take this spec — this entire box highlighted in orange — and provide it to Claude Code with Opus 4.6, just as a baseline. So this is what you get. Now, to the untrained eye, this might look fabulous. But if you're deep into front-end development, if you've done this before, you can quickly see the gaps. For example, this is not the Blitzy logo — there's some weird logo — and the icons are different. I'll quickly switch back and zoom in on the screen. Pay attention to the font weights this time, pay attention to the icons — look at the Linux icon — and then I'll take you back. The Linux icon here is completely different. It got Windows correct, but the font weights are different, and if you scroll down and look at the attachment images, those are also different. But overall, it seems to have done a pretty decent job for one shot. Again, this is Opus 4.6. But here's what happens when you start resizing. I'm beginning to crunch the screen — if you don't see this, try it yourself — and I've gone all the way down to a small resolution, and I can't go any further.
And as you can see, it's not responsive at all. So this is a challenge, because there's no responsive design, and now you're stuck going back and forth trying to implement it. But if you combine Opus 4.7 with Blitzy, you get the following result — and this is one shot, with Opus 4.7 and Blitzy. You can see it's far more accurate and has high fidelity relative to the implementation. Can you see the screens, or are you guys stuck on Figma? I think we're still seeing Figma. Yeah, I wasn't sure. Let me stop sharing and share again — it should be quite quick. One second. Share screen, then window, and I'll go here. Alright. So, again, this is the same thing — the environments page — and I'm going to crunch the screen. Oh, great, I can see that you can see it. So I'm crunching it, and as you can see, it's not responsive at all. You scroll down, and the attachments are all messed up. You expect it to be responsive, but this is what you got one shot. Now let's look at what happens with Opus 4.7 and Blitzy. As you can see, it's far more high fidelity if I go back: the font weights are correct, the icons are correct, the style is correct — all of this. It preserved the image. If I scroll down, the attachment images are correct. And now, as I crunch this, you can see it hit that breakpoint and transformed the sidebar into the dropdown — and it still preserved those items. I can scroll down and the attachments are correctly stacked. So this is phenomenal. If you think about how much time you just saved getting this kind of result one shot from a tool — it's really phenomenal, because otherwise you would have spent multiple sessions going back and forth paying attention to the small details. That's really hard.
The model has been working through the Figma XML, and quite often the Figma XML has conflicting states: there's a button with a hover state and a stable state; there's a dropdown that isn't present unless you hit a certain breakpoint. All of these visual details are really hard, and even humans struggle to achieve pixel-perfect fidelity — there's this constant battle in the enterprise between the UX designers and the engineers. But now we're at a point, with Opus 4.7 and tools like Blitzy, where you can get phenomenal results one shot. Yeah, that is really impressive. I've done front-end work before, and I know how hard it is to make the UI responsive — and how happy the UX designers are when you put in the effort and actually make it pixel-perfect. So removing that kind of minutiae is something I guess we're all happy to see. Absolutely. Super cool. And I know you folks are using a mix of models — both the Sonnets and the Opuses of the world. How do you decide between each model? That's one question. And how do you build confidence when you switch to new models in production? Yeah. So we have evals, and all of our evals are real-world, just like you saw. They're focused primarily on Blitzy's own code base, as well as open source. For example, we have an eval with a COBOL-to-Java migration that's half done — and done incorrectly. The model has to understand what's wrong, as documented, reverse the incorrect actions, and get to the correct state. These tasks, unfortunately, are not represented adequately in something like SWE-bench Pro. SWE-bench Pro and SWE-bench Verified are very small tasks — models are able to score really high, but that doesn't reflect real-world performance.
At least our research team is constantly thinking about the real world. So that's one lever. But you also talked about Sonnet and Opus, right? And what happens is, the good part about what Anthropic is doing, and even what other players are doing, is that having multiple thinking levels just lets you get so much more mileage out of the model, because there's a difference between Opus 4.7 on medium and Sonnet on high, and there's a difference, fundamentally, with Opus on max. So if you're able, as a builder, to identify complexity (and you can use AI to even estimate complexity), then you can identify areas where you assign models based on their perceived skill. That's a good use of tokens. But ultimately, I think it comes down to having evals where you can evaluate the skills of models. Like, one nuance with Opus 4.7 is clearly that it's so much better at visual work. For frontend tasks, you should be reaching for Opus and leveraging that. For instruction following, again, Opus 4.7 is amazing. People don't realize how different 4.6 and 4.7 are. When we shipped this to production on day one, we had unexpected behaviors just because Opus 4.7 takes instructions so literally. We had to go and rewrite many of our prompts to solve for Opus 4.7. When you don't do that, the perception is that, oh, this is a worse model. That's absolutely not the case, and Opus 4.7 really crushes it in so many domains. Really, really cool. Yeah, this is a really awesome milestone to see where we're at in terms of visual capabilities and software engineering capabilities, and not just coding, but end-to-end, full tasks. What do you think is next? What are some product questions that you're still wrestling with, or hoping for in the future? Yeah. I think we have this fundamental opportunity, you know, as builders, as models are getting better.
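The complexity-based model assignment Sid describes can be sketched roughly as follows. This is a hypothetical illustration, not Blitzy's actual router: the model names, thinking levels, thresholds, and the scoring heuristic are all assumptions made for the sake of the example.

```python
# Hypothetical sketch: estimate a task's complexity, then assign a
# (model, thinking-level) pair based on that estimate. In a real system
# the estimate might itself come from an LLM; here it is a toy heuristic.

def estimate_complexity(task: dict) -> int:
    """Score a task by the number of files it touches, with a bump
    for cross-cutting changes (illustrative heuristic only)."""
    score = len(task.get("files", []))
    if task.get("cross_cutting"):
        score += 10
    return score

def route_model(task: dict) -> tuple[str, str]:
    """Map a complexity score to an assumed (model, thinking-level) pair."""
    score = estimate_complexity(task)
    if score >= 10:
        return ("opus-4.7", "max")      # hardest tasks: strongest model
    if score >= 4:
        return ("opus-4.7", "medium")   # mid-tier: Opus, less thinking
    return ("sonnet", "high")           # simple edits: cheaper model

print(route_model({"files": ["a.py"], "cross_cutting": False}))
# → ('sonnet', 'high')
```

The point is not the specific thresholds but the shape: spend expensive tokens only where the estimated complexity justifies them.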
We're a product that gets better exponentially as models get better, so we absolutely love the speed and the rate of progress that Anthropic is able to produce. We believe we can have proactive autonomy. There are tons of features that we're anticipating being able to build just as models get better. Like Mythos, for example, is a fantastic model. It can detect vulnerabilities; it can detect bugs and issues that people have ignored and missed. And it's fantastic that Anthropic is giving the world the time it needs to brace for Mythos. But we believe that once we have that next frontier of intelligence, we're going to be able to do so much more, get so much more out of software, and build products that truly, fundamentally change our expectations. For example, for the first time, you can now fix vulnerabilities that have sat unaddressed for such a long while. And because computer use is so good with all of Anthropic's models, you can profile applications at runtime and proactively identify opportunities: hey, your e-commerce website takes three seconds to load; if you bring that down from three seconds to one second, your conversions could increase by x percent, and here's the research to back that. So those kinds of opportunities, you used to have to hire experts for. But now, because even startups have access to such intelligence, you can have LLMs do that kind of work. You can really crack the next frontier for products and for business. That sounds like a really awesome future, and I'm excited for it. Just scaling the engineering team, scaling the UX design team. Thank you so much, Sid. Appreciate you walking us through your product and through how you think of building the frontier when it comes to software engineering, AI software engineering. Absolutely.
With that, I would love to invite Sanj back on stage, and then we'll take a few questions from the audience. And I think, Sanj, we have one for you, from Francisco: what benchmarks does Solve Intelligence use to make sure that the search results for prior-art analysis are valid or not? In terms of search, what are you guys doing there? Yes. So we like to ground this very heavily in what actually happened in reality. There are a number of datasets; I can't remember all of them off the top of my head. But basically, we have European attorneys and, for example, US attorneys who are looking at what actually got litigated. And if these things are actually brought up in the opposition, then we can use those as a set to calibrate our models. So, yeah, we tend to just really ground it on the actual data rather than trying to guess what the correct search is. Really cool, thank you for that. A question for Sid: how do you decide to balance Opus versus other models when it comes to performance versus cost, reliability at 80% versus 99%? What plays into that decision-making process? Yeah. So for us, the primary factor we optimize for is quality. Ultimately, we are an enterprise-scale product, and what matters most to the enterprise is reliability and quality, so we prioritize that above all else. In terms of the models themselves, if you're using Sonnet, right, like you guys came out with the advisor tool, approaches like that are fundamentally really helpful in making sure that the model is indeed achieving its goals. For example, you can define a task for a model, and then at the end of it, you can ask an advisor to decide if that goal has been met. We've been using techniques like that as a grader.
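The advisor pattern Sid describes (define a task, then ask a second model whether the goal was met) might look something like this in outline. The prompt wording and the PASS/FAIL protocol are illustrative assumptions, and the actual call to the advisor model is left out; only the prompt construction and verdict parsing are shown.

```python
# Sketch of an "advisor as grader" loop: a worker model produces a
# result, and a second model judges whether the stated goal was met.
# The PASS/FAIL convention here is an assumption for illustration.

def build_advisor_prompt(goal: str, result: str) -> str:
    """Assemble the judging prompt sent to the advisor model."""
    return (
        "You are an advisor reviewing another model's work.\n"
        f"Goal: {goal}\n"
        f"Result:\n{result}\n"
        "Reply with exactly PASS if the goal is fully met, "
        "otherwise FAIL followed by a one-line reason."
    )

def parse_verdict(reply: str) -> bool:
    """True only when the advisor's reply starts with PASS."""
    return reply.strip().upper().startswith("PASS")

# In production, `reply` would come from a call to the advisor model.
print(parse_verdict("PASS"))                  # → True
print(parse_verdict("FAIL: tests missing"))   # → False
```

If the verdict is FAIL, the worker can be re-prompted with the advisor's reason appended, giving a simple self-correcting loop.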
Anthropic is also now introducing these techniques in the API itself, which makes it a lot easier for developers and builders to adopt them and make sure they create high-quality experiences. And I guess, maybe to double down on that, are there any tricks that you use to get good results with smaller models? Yeah, that's a great question. I think one of the most underrated tricks that people don't realize is that even though the model has, let's say, a million tokens of context, it's still based on attention, and sparse attention. Attention is a quadratic problem. What happens is every token in the context window has to be mapped against every other token. As you increase the number of tokens you're asking the model to deal with, you're degrading intelligence, just because it has to stack much more compute to come up with a response. Instead, if you're able to be more efficient with your prompts, if you're able to design systems that give the model only what it needs, just-in-time context (which is what we've spent so many years building and perfecting), then what you can do is this: let's say a request takes 300K tokens for whatever reason; it's a large code base, long instructions. If you're able to compress that to 200K, or lower, you're now operating in a layer where the model is more intelligent. And that's underrated. And this is backed by research. You can look at the MRCR leaderboards; there's Graphwalks; there are others. And Claude, by the way, crushes those leaderboards. The leaderboards are representative of the real world, but not always. But there's a lot you can get out of smaller models if you're able to optimize how much information you're stuffing into the context window. Pretty cool. Thank you for that. And, I guess, a question for both.
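The just-in-time context idea above can be sketched as a token-budgeted file selector. Everything here is a toy assumption: the word-overlap relevance score and the rough 4-characters-per-token estimate stand in for what a real system would do with embeddings or agentic search and a proper tokenizer.

```python
# Toy illustration of "just-in-time context": rather than stuffing an
# entire code base into the window, rank candidate files by a crude
# relevance score and stop adding them once a token budget is reached.

def approx_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token)."""
    return max(1, len(text) // 4)

def select_context(query: str, files: dict[str, str], budget: int) -> list[str]:
    """Return file names, most relevant first, that fit the budget."""
    words = set(query.lower().split())
    # Rank by word overlap with the query (toy relevance signal).
    ranked = sorted(
        files,
        key=lambda name: -len(words & set(files[name].lower().split())),
    )
    chosen, used = [], 0
    for name in ranked:
        cost = approx_tokens(files[name])
        if used + cost > budget:
            continue  # skip files that would blow the budget
        chosen.append(name)
        used += cost
    return chosen

files = {
    "auth.py": "def login password check user",
    "ui.py": "render button css",
}
print(select_context("fix login password bug", files, budget=8))
# → ['auth.py']
```

The mechanism matters more than the heuristic: by capping what goes into the window, the model sees only what the task needs, which is the compression effect Sid is pointing at.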
Can you share some best practices with Opus 4.7 when it comes to the CLAUDE.md file, or skills, or prompting? Anything that you love to use. I'll let you go first, Sanj. I think the principles of prompting still remain roughly similar between 4.6 and 4.7. We haven't seen dramatic differences, maybe not in the same way you have, Sid. But I think, yeah, as you say, providing a lot of that specific detail that is potentially missing in the first version of a prompt. I think what's been fun for us is that we've had to shift models so many times at this point that every time we do a model shift, our prompts need to provide more and more detail and specificity, which means they're now protected a little bit from the next model shift, in a way that wasn't necessarily true at the start of the company. So I would just say: speak to it like a human you're trying to explain something to, and often imagine that they are not a subject-matter expert. For a lot of the chemistry stuff, actually, the models don't really know a ton about it. But they can. They're smart, so they have the ability to learn about it if you provide a lot of very specific information in the prompt, and they can then soak that all up and actually reason about it, just like I could, even though I'm not a chemistry expert. Right? If someone explained it to me over a few hours, I'd be able to soak that all up and then actually apply the rules, and they're very good at doing that. You know, one trick I'd love to share with everyone here: go to the Anthropic Console or create a Claude Code project, give Claude Anthropic's prompting guidelines, and ask it to review all of your prompts and suggest corrections and improvements that are tuned to the newer model. You'll be surprised by what you might find. I'll give you an example.
With 4.6, one of our prompts said: you have to retrieve every file, you have to look across every dependent file, and make sure that whatever you're writing in terms of code solves for all of the dependencies. Very quickly, we had an enterprise job where that file imported things from 17 other files. The change Claude was making was a one-line change; it didn't need to read all 17 files. Now you have a situation where you're spoon-feeding the model, but the model has gotten more intelligent, and what you're doing to spoon-feed is actually hurting the process. So as models get smarter, it's important to stay on top of things like that and optimize your prompts to get the most out of the model. Awesome. Well, I think we're at the end of our time. So I want to give a huge thanks to both of you for taking the time to be with us this morning. Thank you, Sid, and thank you, Sanj. And, I guess, one more housekeeping thing before we end: there will be a survey, so please go ahead and fill that in, and see you next time. Thank you, Marius. Thanks.