Video: Building Products on Claude Opus 4.6 — A Customer Success Story with Shortcut and Hex | Duration: 5400s | Summary: Building Products on Claude Opus 4.6 — A Customer Success Story with Shortcut and Hex | Chapters: Welcome and Introduction (16.095s), Dropbox Model Classes (216.655s), Introducing Hex Analytics (750.22s), Opus 4.6 Improvements (1066.45s), Model Skepticism Importance (1407.27s), Evaluating Model Performance (1572.255s), Product Strategy Insights (1762.92s), Shortcut Excel Agent (1954.77s), Model Migration Strategy (2687.885s), Model Evaluation Strategies (3202.41s), Concluding Thoughts (3606.125s)
Transcript for "Building Products on Claude Opus 4.6 — A Customer Success Story with Shortcut and Hex":
Awesome. Welcome, everyone. I'm psyched to have everyone tuned in here today to talk about building on OPUS 4.6. My name is Carly, and I'll get into a little bit more about what I do at Anthropic in a bit. But we can start with some housekeeping. So first off, a recording of this session will be shared within twenty four hours. So if you need to step away, don't worry about it. We'll have a recording. Second thing is questions can be submitted anytime using the q and a button, and I will go through them and ask some of them at the end. I think that's gonna be kind of the best part about all this is asking Nico and Olivia some questions about building from you guys. And then finally, give us feedback. This is a new webinar style. It was a fun idea to have customers come on and demonstrate, for this model launch how OPUS 4.6 lands for them. So we'd love to hear your feedback. So a little bit about me. My name is Carly Ryan, and I work on a team and a topic called Applied AI. The way we like to describe it is Applied AI sits between product, research, and go to market. My team actually used to be under research. We now officially are under go to market. But kind of the charter, the way I like to think about it, is we are looking to optimize clock models in the wild. We are trying to figure out how they best work out in the world, doing real things. And so sometimes that takes the shape of working with research. Sometimes that takes the shape of being deployed to internal product teams and working with anthropic product people. But most of the time, the way that that looks is working with customers, really smart people looking to do really cool things with our models out in the world. So another cool thing about my team is when a new model comes out, we are some of the first people to get our hands on it. And we internally will bash it a ton. We'll build demos. We'll see what it can do, which is super fun. We also will bring it to some of our customers and ask them what they think, ask them what it does well in. And so I offered to do this webinar for building on Opus 4.6. And, originally, my idea was to build my own slides, build a demo, and talk about it just from my voice and just from Anthropix's voice. And I was, candidly, was, like, a little uninspired. I love Opus 4.6, but I thought that the biggest and coolest parts about launching this model were working with customers and hearing from them, like, wow. This model really changes the game in my product. And that was, like, just the most electric part. So today, we're bringing on Olivia and Nico, from Hex and Shortcut. They'll tell you a little bit about themselves in a little bit. But I'm super excited because they, you know, were some of the early testers of our model and had really positive and cool feedback about it. So a little bit of an agenda. I'm gonna go through an introduction to OPUS 4.6, what it's good at. I'm gonna quickly talk about migrating to OPUS 4.6. But if you see the docs panel, that has some links that are are really detailed, a prompting guide and a migration guide that I think are gonna really be your golden resource for migrating. And then we're gonna go into the exciting part, which will be a demo with Hex, with Olivia, and then a demo with Shortcut with Nico, and then some audience q and a. Cool. So let's start with the basics. And Dropbox has three model classes, haiku, sonnet, and opus. Most of you probably know this, but I've had some customers be like, oh, I never put together that those are actually, like, poem types. And so haiku is, you know, a short and sweet poem. It's three lines. But it still, like, is beautiful. And so that's kind of our smallest class of model. It's the most affordable, and it's the fastest. But I think, like, the important note there is that, like, a haiku is still a worthwhile poem, and haiku is still, like, holds the bar pretty high for intelligence. And then conversely, we have opus, which is, you know, generally in poetry and music, it's like a great work of art. It's multiple volumes. And this is where we're really looking to push the frontier. This is our biggest model. And then somewhere in between, we have sonnet, which is, like, a very popular model, which is neither Haiku nor an Opus. I have a personal, like, soft spot for for Opus, I think or, actually, a lot of people in Nythropic love Opus because this is where we really are pushing the boundaries. This is where we're really pushing the frontier and looking to change the game. This slide is, like, little information because the next slide has a lot of information. But this one is just which model to use when. And the thing I wanna highlight is Opus is recommended for. Basically, it's recommended for highest intelligence and highest agency use cases. I think some people, when they encounter one of our new Opus or one of our new models, and they're doing, like, some single shot thing in Cloud dot ai, they maybe don't notice. Like, oh, I don't feel like it's smarter. Like, I I don't, like, see the texture of why it's different. But the thing that Anthropic is really looking to push and the boundaries that we're looking to push are on these, like, long horizon agentic tasks, where it's not it doesn't show through in a single turn, and where it shows through is through tool calling through, doing things that take a long time. Let me get my slides back real quick. Cool. So what's improved since OPUS 4.5 is another question you might ask. So one thing is improved situational awareness and adaptive reasoning. So OPUS better tracks where problems stand, updates approach based on new evidence rather than doubling down, and provides more grounded progress reporting. It's also better at planning code review and debugging. So more polish around scoping work, revisiting assumptions, and self checking output before calling tasks done with fewer error loops and runtime surprises. So both of these kind of relate to what I said earlier about, like, agency and long horizon work. And that's, you know, long running agents, strong multitasking agent orchestration. Something that we noticed a lot internally and customers have noticed is OPUS 4.6 markedly spins out more sub agents and more parallel tools. It can handle more concurrent work streams, manage complex tool call sequence, and delegate effectively in multi agent setups. This has, like, a very meaningful effect in a lot of customers' products. And then finally, one of the things that I'm really excited about and we'll hear about from Nico is we see a meaningful step change for finance, life science, and cybersecurity use cases. We're seeing that OPUS 4.6 is the best model for knowledge work. When we think about knowledge work or knowledge work capabilities, we think about three pillars. So the first one is search, which is to find the right information. The second one is analyze, to make sense of that information. And the final one is to create, to produce valuable assets from that information. And so you can think about the skills that a very skilled, well trained knowledge worker has, an auditor after ten years of doing that work. A lot of the capabilities fall into those three things, like can you search, can you analyze, and can you create. And, again, we'll hear more about this shortly, which I'm excited about. So one thing we introduced with Opus is Opus 4.6 is adaptive thinking. This is where we let Claude decide when and how much to reason based on the effort level you set. So rather than token budgets, we basically have this, like, self modulating thinking, that Claude does itself. So the benefits, cost performance optimization, and then also latency. And then one thing that I would, like, would wanna shout out too is, like, developer experience. Like, you don't have to think about token budgets and max out your token budgets. So before with, like, traditional extended thinking, customers enabled thinking for all requests and had to set a token budget. How they would pick the token budget was kind of an art and a science. And Claude would think on every query up to that limit whether it needed to or not. And now with adaptive thinking, Claude decides when and if to reason based on the effort level you set as well. On a higher effort, Claude thinks on most queries. At lower effort, it skips the thinking on simpler requests. And, hopefully, also, Nico and Olivia will tell us a little bit more about how adaptive thinking works in their product as well because people are excited about this one. The next thing we're doing with OPUS 4.6 that we're excited about is context compaction in beta. This increases effective context window length by automatically summarizing all their context when approaching context limits, and the benefit is context optimization. So a lot of us have maybe encountered this in Cloud Code with Compact. That might be kinda like the first time we've thought about compaction of context windows. And the current solution is, like, most people do this client side or in their product, and Anthropic is is offering now that we will do it on our side if you provide the strategy. So I'm gonna go through a little bit on migrating to OPUS 4.6. Again, I think the links are gonna be, like, your best resource here. But yeah. Cool. So the first thing these three things are more like, API headers that you can think about. So the first one is effort. Effort allows control over how eager Cloud is to spend tokens, giving the ability to trade off between response thoroughness and token efficiency. It's important to note that this contributes to, like, all the types of tokens, so not just thinking. It also contributes to tool calling and other output types. Next one is adaptive thinking. So OPUS 4.6 uses thinking type adaptive. Cloud decides when and how much to think on its own, and so we're moving away from this manual token budget. And the effort parameter can be used in tandem with adaptive thinking to, like, to try and tune that from the developer side, but you also can can trust Claude to to know when. And then finally, we're deprecating pre fills. Starting with OPUS 4.6, pre filled responses on the last assistant turn are no longer supported. So this is something to think about if you're if you're using pre fills today. And then these are more prompting tips. So the others were, you know, API headers. This is more tips on prompting. So as I said, have trained OPUS 4.6 to be a lot more autonomous and have more agency. But that means that OPUS 4.6 does significantly more up front up front exploration than previous models. So if your old prompts used aggressive language to force their ownness, you might wanna rethink that. I would try OPUS 4.6, see what how it works and how it looks in your product. But we've seen that people will have to revisit some of their old prompts that had to, like, push the models to have more agency and to do more exploration. So the next one is subagent use. OPUS 4.6 has a strong bias towards spawning subagents, sometimes for tasks where a direct rep or a single tool call would be faster. What you wanna do here is add explicit guidance about when subagents are and aren't warranted if you see this. Another thing that I've seen people have to do is if they already have explicit guidance to, like, trigger as many sub agents as possible, you might wanna peel that back because OPUS 4.6 will go a little wild. Finally, autonomy and irreversible actions. Without guidance, OPUS 4.6 may take hard to reverse actions. So you wanna add explicit confirmation prompts for destructive operations if you want a human in the loop. This is a really good, like, product tip anyways to think about, like, where do you want human in the loop? Where should where should actions not be allowed? But we are trying to push OPUS 4.6 to just take more action and have more autonomy. Amazing. Okay. Well, so now we're gonna get into the meat of it, and Olivia will join us. Hi, Olivia. It looks like you might be muted. Can you. hear me? Yeah. I'm. listening. Hello? Would love if you could give our watchers just a quick intro about you, and then we can dive into Hex and Opus 4.6. Yeah. Absolutely. So my name is Olivia. I'm a product lead at HEX where I focus, a ton on all of our agentic analytics surface areas. We're gonna get into a bunch of what that is. But I spent a lot of time talking to Carly on how we can best take advantage of anthropic models. And I'm just really excited to be here and share more about what we've learned. Amazing. So why don't you tell me a little bit more about what you're building at HEX? Yeah. Absolutely. So some quick background here. Hex is the AI analytics platform that really just makes it easy for anyone to ask and answer questions with data. There's kind of three pillars of our product. We have this really incredible AgenTic notebook surface area where you can go really deep using SQL, Python, and native charts, input parameters, kind of everything you would need to do a deep dive analysis. And we have this app surface, so you can take that analysis and turn it into dashboards. But more and more, what I'm most excited about is all of our conversational q and a features. So it kind of brings the best of our notebooks and our data apps, but brings it to this really cool q and a surface area. And that's where we're seeing a huge amount of growth right now, which is really exciting. But underlying all of this is this really kind of, like, important context layer. And this is what we spent a lot of our time talking about, Carly, is what is the context that empowers the agents to actually have good answers? And so at Hex, we kinda think of this as figuring out how to build this really awesome context loop where we can take deep answers or deep insights, you know, data teams have found, turn those into artifacts that are scalable and trustable and can be acknowledged for the model to go reuse. And then keep iterating on those, and you have this, like, really cool compound and context loop where as you work longer in hex, the agent actually can get smarter, and use more of your previous work to actually answer questions. Yeah. I love that, like, continuous learning aspect as well. So a lot of, like, what you and Nico do, I think, like, surface level might seem a lot like cogen, and that's where anthropic models are are most well known. Can you tell us like, I think on the first time we met, you kind of were describing to me by why it it's actually very different from CodeGen. You know, there's some similarities and some differences. But what what makes it still hard to do data analytics today, and potentially harder than CodeGen? Yeah. Totally. So I think cogen is a part of data analytics. And so we get to benefit every time the models get better at Python and SQL. We get a benefit. But a lot of what we do is a lot more than just generating the code. Here's actually a really good example, from one of our evals. Carly is familiar with this, I think, benchmarks. We always send, this specific example. The models always trip up on this question, and it's so funny. If you look at the question, what is the top country for fraud? The it's this is actually considered an easy question, and it should be pretty simple. And what ends up happening is the dataset, as you can see, is a multiseries dataset where there is both fraud volume and fraud rate. And models by default love to just look at fraud volume. But if you're a true data analyst data analyst, you wanna actually look at both those values. And the kind of thing that you'd want the agent to respond with is something more like this answer where you see a chart that shows you both and calls out that, hey. Your organization prefers measuring by fraud rate, but you should also take a look at fraud volume. And so the answers here are not necessarily as black and white, and they're a lot harder to test for. And we find it's actually a lot harder to evaluate models. And, because of this, like, the nature of these questions are actually far less black and white. You can't unit test them. And there's a lot of taste, and judgment required for a lot of data analyst work. Yeah. Yeah. I feel like it does boil down a lot to both, like, this taste. I think that you learn through, like, looking at data yourself or, like, in, like, high school math, what what looks right, what looks wrong, how to think about trade offs, how to how to be a good communicator when you're thinking about data, and also how to how to make the trade offs yourself. And then I think the second thing that's, like, really kind of clicked for me when you highlighted was this verification. If you make, like, a small error, it's just it's really hard to verify in data, and it's much easier in, like, software engineering style code gen. Exactly. There just really isn't always a correct answer, and a lot of it has to do with just what does your organization think? And we'll actually get into that in the demo. We kind of think about that a lot in the way we build our product. Yeah. Awesome. Well, I guess, like, we can get into the demo, but maybe you can voice over. Just to start, like, what does OPUS 4.6 unlock for HEX specifically? I wanted you to come on because I think that when you guys encountered OPUS 4.6, you, you know, you mentioned that it changed the game. Yeah. It was really awesome. I think there's a couple of big things, and we'll touch on in the demo. But adaptive reasoning for us is huge because we get all types of questions, easy and hard. And I can tell you a little you can go into more depth about them. But the ability for the model to just figure out how much to think is a game changer for us. It makes a much better product experience. The other thing is just much better context reasoning. And this this eval in particular is one of the many that Opus did a lot better on. A lot of the work of our agents is actually looking really deeply at the context and not over fixating on what the user asked for since often users can ask very, vague prompts. Yeah. Awesome. Love it. Should we dump it into the the demo? Yeah. Let's do it. Alright. Okay. Amazing. Carly, can you see my screen? Yeah. Sweet. Alright. So we are here in Slack. As many of you probably know and love, most questions, or conversations happen in Slack. So one of the features that we've seen growing a lot is our Slack integration. And so if you have a data channel or kind of just anywhere you're con you wanna ask conversations, you can start in Slack. And so, here's actually a quick little example. Help me understand how each segment is performing right now for closing deals. So what's gonna happen is we actually kick off a thread in Hex. You can actually click in and view in Hex, but we'll also bring back all the answers to you directly in Slack, which is awesome. So I'm gonna actually just go ahead and open this up in Hex. Let me go over here, actually. And so what this is gonna gonna do is actually kick off the Hex agent, which is our conversational q and a feature. And you can actually see it's gonna go and look through all of your context and hopefully get us an awesome answer where I can just keep going from here. While this is going, I'm gonna talk a little bit about our context layers, so everything that really powers this this agent. So what we have is something called the context studio, and this is where data teams, have a lot of observability features to understand what are the types of conversations and questions folks are asking so they know where to put effort and time into curating context. So you can actually have see what kinds of questions folks are asking. And one of the things that I really think is really cool is the ability to look at conversation topics and figure out, like, oh, hey. You know, do we see a lot of warnings in certain areas? And so one of the things that we're actually doing offline is, like, looking through the conversations to try and figure out, did the agent have the right context it needed, and does it need more, issues? And so you can actually see we have some here, with missing context or data limitations. And this gives data teams a lot more power to understand, hey. Where do we need to go to put more, where do we need to put more effort into our context curation? Very cool. I love. that you can kick it off in Slack too. That's awesome. I mean, that's where we've seen a ton conversation happening. So we wanna make sure we're where folks are. And, also, you can kick it off from Claude as well with our MCP. Love it. Love it. So maybe we can walk through, like, where or or maybe is this what's running on the agent? Like, where we're seeing Opus four point six do things that, you know, the other the other models couldn't. Yeah. Absolutely. So while this is going, one thing that we're seeing is it does just a really good job of being very thorough with looking at the context before coming up with a conversation. And I'll show you an example that's prerun-in a sec. But when you're working with data, you have a lot of context. You have usually a full warehouse. We have, you know, hex assets like projects. You also have things like model data, which are different asset types that we have for curation and hex. And you'll often wanna run a lot of SQL queries to understand what the shape of the data looks like. And so the agent can do a lot here to understand how to answer something. And so, just to give you an example here, some charts that it's created so I can actually dive in to better understand what my answers are here. But all of this really matters of, like, can, can the agent pay attention to all of the context? So here's actually a really good example. So we have OPUS 4.6 on the left, and we have a previous model that we were using on the right. And on the right was actually a pretty common failure point for models that we were seeing and had been seeing for a long time in the data analysis world. A user tends to ask a somewhat ambiguous question or a question that maybe is not perfectly worded to how the data is mapped in in at the warehouse level. So in this case, how many subscribers are on my pro plan? And the agent did a pretty good job. It got the answer in terms of it gave the user the number of subscribers on the pro plan. But what we can actually see is that there's a gotcha here. At some point, they changed the pro name plan to professional. And the agent really locked on to what the user asked for, which is the pro plan. And it didn't actually call out that there's this other category, professional. And so with Opus four point six on the left, you can actually see it did answer the same question. It went a step further and actually told you, hey. You should look at the active subscribers, not just all of the subscribers. And it also called out the Nuance here that there's another plan called professional, which the user might have meant. These are kind of, like, the small differences in model behavior that make a huge difference for a product like ours where, like we were saying earlier, there's a lot of ambiguity in the way folks interact with our product, and there's not always a right answer. Yeah. Yeah. I think that's awesome. I think a lot of your I like your guys' email set because I think it's a lot of these kinda, like, gotchas, which is funny that we're just, like, looking to trick the model and see where it has good judgment. But even a human make this mistake. You know? It's like, oh, I just wanna see what's on Pro plan. My colleague asked me. I'm gonna look up Pro. But I think, like, the good taste and the good practicing would be to explore the data a little bit before jumping to conclusions about, you know, potentially the bias in the question. And you guys and and other people have seen that Opus is just, like has more of that, like, exploration. And I and this is what I talked about in the prompting as well. It has this, like, exploratory kind of predilection. And so it it might find these gotchas faster or better. Yeah. Absolutely. And that kind of thing makes a huge difference for our product. One of the things that we always say is, you know, data teams tend to be skeptical of the data that they're looking at always, and that's something that models were never skeptical of. And so that's something that's, like, really exciting to us in OPUS 4.6 that we're starting to see more of. Yeah. I think this is actually a great segue. Like, I feel like we've talked about this. And, even internally with data analytics, like, outside of HEX, we see this. Can you illustrate, like, some of the ways that models used to flop or didn't have enough skepticism? I think there's, like, some funny anecdotes there. I can share some as well. Yeah. One that I can think of is I remember, like so I use HEX literally nonstop to kinda look at feature adoption. I'm always trying to understand how our people are using our features and what habits are changing. And I remember it was, like, the week after Thanksgiving, and I I was, like, talking with, you know, the HEX agent. It gave me this really big, like you know, like, you should be really concerned. There's been a huge drop in usage. Like, this is very concerning. And it was just the week after Thanksgiving. It was really not a big deal. Yep. Because I'd asked, you know, about feature adoption, the agent was really focused on what I had asked for and had a hard time taking a step back and reasoning through kind of broader context. And we're actually seeing this even now, just like a couple months later, we're seeing that models are a lot better at this now, which I think is a is a really awesome one because it's always really funny when it's like, you know, this is so this is so concerning, but it's. just a hollow. Yeah. It's also, like, the drama, Claude. Like, it'll be, like, alert alert emoji. Yeah. Revenue has dropped. And it's like, no. It's actually just one day into December. Yeah. Cool. So I guess, actually, I think this is a good segue to potentially talking about evals a little bit. Yeah. And some maybe some more of the gotchas, I think, like, just even examples. But can I guess I think you maybe were gonna have a visual about just some of your evals? Is that possible? Yeah. Okay. Yeah. Okay. So at least a few queued up here. And I think one thing that matters for us a lot is, like, paying attention, to every single model and trying to understand what's what's happening and what's going wrong. So we have this pretty extensive eval set, and we're consistently trying to build on it and make it better. But this is how we kind of understand how a model performs. And so we always run evals on a new model. And there's a couple things we look at. One of them is, objective met, and that's kind of what this, shows. But there's also a couple other things. Tool efficiency is one as well as, like, the ability for the model to collaborate with you. And, generally, one of the reasons we are always really excited about the Anthropic models is, you know, more often than not, we want our product to feel like a collaborator and not just something that takes a long time and gives you necessarily the most accurate answer. We care about that collaboration aspect a lot more. But in 4.6, we have a couple evals here. So this is actually a little behind the scenes of what our evals look like. But here, this is actually, the eval that I was kinda pointing out earlier, the fraud rate versus fraud volume. So OPUS 4.6, it actually did a really good job going through the manual or, like, the, you know, like, kind of, like, curated context to realize that this organization defied fraud as, fraud rate, not just fraud volume. And so it was able to capture it correctly. So kinda like, Carly, you said earlier, going back to looking deeply at the context is something that we've consistently seen in 4.6. Another example of this, that we saw, there was a question here. The question's at the top. What is the average transaction volume? Example and I'm sorry. You're kinda really seeing behind the scenes the raw answers and everything. In this case, the agent was actually it basically hit a wall. And, usually, when agents hit a wall with SQL, it'll kinda, like they often kinda just, like, give up. And they're like, actually, I can't find the data, or they kind of just, like, overpower forward. And in this case, Opus 4.6 kinda took a step back and looked at other, tables to find, the rights where Swift like, what was SwiftCharge, which was, like, a type of card scheme, not an acquirer in this case. And it was able to actually pivot correctly to find the new table that was needed even though the answer the user's question was extremely ambiguous. Yeah. So I guess this is, like, resilience too. Kinda like not. lazy. Yeah. Exactly. Cool. Whereas in previous models, we found that, like, they also, like, tend to just, like, make assumptions if they hit a wall. And so this is actually a good example where the model just kind of assumed that the swift charge was an acquirer, not a card scheme. Awesome. Is this the empty array one? then this is the empty array one. This was yeah. So this is another awesome one. So this was a case. The question was basically asking for I will scroll up here. For certain account types, what was the average fee for the card scheme for that swift charge? We're charged for transaction volume. And in this case, this one was really interesting. OPUS 4.6 was properly looked at all the different data and realized that there wasn't an empty array meant to wildcard, which meant actually, that counted for what it wanted to include, to compute this. Whereas previous models would, again, over fixate on what the user was asking for and actually not properly look through all of the data to understand the nuances. And so OPUS 4.6 was able to get this correct, whereas previous models had a much limiter limited set of data it was looking at, so, fundamentally, it was getting the wrong answer. Yeah. I love getting in the weeds of your emails, so I'm glad you you brought these up. And I love the empty array one because I think this was one of the ones when we come out with a new model, we ask you what it's doing well, but then we also ask you, like, where is it still failing? Like, how can we think about that? And I feel like last December, this was one that was still failing, and then I was super psyched when it was like, no. Opus 4.6, like, can get this one now. And so. that's just awesome. Also, I like your eval suite. I think you guys, like, vibe coded it all in house. Right? All of our eval tools are vibe coded in house, which is really fun. And we're working on vibe coding for evals right now, which has been really cool. Awesome. Okay. I have, like I could ask you a million more questions, Yeah. but I think maybe we go to Nico, and then I save the questions for the end. I guess let's do one more. You talked about MCP, and just, like, how you're thinking about different product services. So I'm curious, like, what are some of the big product questions Hex is working through right now? And and, like, how are you thinking about this ever changing landscape? Yeah. Great question. I think the big thing, and I'm I'm sure everyone feels this right now, is where's the best place to put our time? Right? And, like, where if, you know, the the puck keeps moving and how can you stay on top of that. And I think one of the conversations we were just having, for example, is, like, should we go all in on MCP? Should we continue building this in house agent experience? And I think if I fundamentally take a step back, the thing I'm always thinking about is what is HEX's strategic advantage? What can we do better than anyone else? And that's where we should be putting our time in. We shouldn't do things that we think, you know, Anthropic is just gonna go do. And so that's a huge question for us. And one of the things that we've I think we believe right now is that working with data is fundamentally very hard. There's a lot of gotchas. It can be very finicky. And so you really do want a dedicated experience for it. That entry point might be anywhere like Claude, or any other tool. But for really hard questions, you're gonna want a dedicated agent surface that really understands the nuance of working with data. I love it. Yeah. That's awesome. Okay. Well, you'll come back for q and a from audience, and I have some some saved questions for that as well. Thanks for chatting and demoing. your Yeah. Nico. Hi. Hey. What's up, Carly? How are you? Good. How about you give our watchers, like, a quick intro of yourself? Yeah. Sure. So I'm happy to be here. My name is Nico. I'm the cofounder of a research lab called Fundamental. We spun out about two years ago from MIT where my cofounder was a professor in computational neuroscience, and we set out to to do very broad, vague, and ambitious research of just giving machines a lot of the fundamental human qualities that they didn't currently have. We explored a lot of things, giving them collaboration, giving them, like, long term and working memory, and we ended up building agents that could play Minecraft, which is pretty cool. And then somewhere along the mess, we gave them access to a computer. And we had set the state of the art in OS world, which is the most important benchmark for computer use. Actually, think it was 4.6 like, 70, which was, like, over human level. But this was about a year ago, and we decided to just, like, pursue building products on that technology. So I led a small team and built Shortcut. Shortcut is a superhuman Excel agent, that can do just about anything a human can do on Excel, and it's meant to be used as a tool to make, you know, doing Excel far far easier the same way we use Cloud Code or whatever to to do our software engineering. Awesome. Yeah. I feel like one of the through lines for, like, what you've built and worked on is just, like, pushing the models past what, you know, maybe they're currently capable of and just, like, thinking about the frontier, which is really cool. Because I think that, you know, when you started Shortcut even seven months ago, it was like model people were still so dubious with models being able to understand finance. And I think that that that maybe has changed with newer models. But maybe we can kick off your demo, and then we can, like, discuss it while it runs. Yeah. Yeah. Totally. So I'll flip it up here. And then people can see what Shortcut looks like too. Personally, I've learned a lot about, like, finance and Excel through just, like, looking at your product and learning about your evals, because this is not my area of expertise. And there were times when it was not Claude's area of expertise either. Sure. And I can talk about that time. I believe, Carly, you might have to give me permission to to share my screen or maybe you could stop sharing. Okay. Yeah. Let's see. Perfect. So I'm gonna do a variety of demos all live here, and I wanted this to be conversational. Most of the task can take somewhere between, like, two and fifteen minutes, and some of these are longer term. So I will fire these off and then answer the question that you have pointed to. And to tee them off just a little bit, first of all, I actually send them, and then I'll tee them off. So I have these two very finance specific tasks. This first one is this, like, professional grade institutional FP and A, model that you would do, like, ten year financial projections over. Extremely built out, extremely, rich. There's a lot of dependencies. Pull back here. There's a lot of dependencies across linked sheet references that you'd to get right. And what I'm gonna ask it to do is to build out this twelve month projection sheet entirely, and it has to be done correctly. So this is a part of our internal eval suite, something that, you know, actually just until four OPUS 4.6, this thing wasn't even doing remotely close. I can talk about how we do these evals. And then another task I have here is we are going to be pricing a bond. I know it's part of sort of deep finance, but it's it is important. I think the global bond market's, like, bigger than the global equity market's, like, a 100 plus trillion dollars and something that investment bankers will be doing many, many times a day even when they're working with with bonds. So this is a task that is specifically bottle bottlenecked by, like, the complexity of really, really difficult finance. It's kind of fun to think that we're doing these tasks when was building shortcut, like, a year ago, and we were, like, just barely adding three columns together. Your big question for your big question before was, like, you know, what was it like building when we weren't there, and how did you know it kinda get there? For us, it's is the things solvable by machine learning? Meaning, is it verifiable? And then is it in the best interests of the lab to solve these things? So the moment that you feel like the capability overhang allows you to do anything that would be valuable in that space and that it is machine verifiable and that you want to pursue it, for us, it was a no brainer. We had one or two very powerful experiences back when we were building with Sonic 2.7, I believe. And it felt like a no brainer that the same thing was about to happen to knowledge work, specifically Excel that had just happened to all of us with software engineering. And so for us, that's that's how we made that decision. So I will, also sort of talk over what's happening in these tasks. Sort of like Carly alluded to, OPUS 4.6 is very, almost aggressive is the word for, like, spinning off sub agents. Now in our context, that's a good thing, because the bottleneck on this task isn't necessarily financial complexity, is that this is a massive freaking model. Right? Like, charts of every type, dependencies with thousands and thousands of formulas. So the the the bottleneck is that a lot of these these really complex big models don't have, like, very obvious scaffolding. So you have to be able to use and build, like, sort of ergonomic tools for your agents to to access the things that they need to access to make the right decisions over. So in that context, here specifically, sub agents are fantastic because we spin off sub agents to explore different sheets in parallel. Meaning, each of the sub agents can provide a summary of it. It won't, you know, contribute to the contact to context raw, and it's going to give Opus the ability to to take the information it needs to make the decision that it has to do. So I will, again, voice over, but it's built a to do list. It feels like it has a good understanding of the task here, and it's gonna start chugging away. So I'll flip to the twelve month projection. At the same time, it looks like it's ready to start making changes. It will preview the changes to me to approve or accept. For the sake of the demo, I'm just gonna accept all. It's like bypass permissions in Cloud Code. And I know that I can roll back to any point in the chain as well. Awesome. So it sounds like, I guess, the bottlenecks just to, like, paraphrase, and then curious if you have more to add. And the bottlenecks that maybe Opus four point six, you know, got passed, and I also wanna hear about the bottlenecks that we're still working on. But one of them is just, like, spawning sub agents and figuring out how to handle, like, massive amounts of context, and then also finance knowledge. Does it feel like there's any more that, you know, OPUS 4.6 maybe was a step change? But is there anything else that that OPUS 4.6? Yeah. Yeah. I don't think the, like, the special the special answer is that there were these other things that that we know that it does better than other people know. I think the real thing is that 4.6 as a class is a massive step change specific to knowledge work and intelligence. It feels like it's smarter than 4.5 in coding. Right? But, like, it feels like the jump from OPUS 4.1 to 4.5 in finance work. It's significantly more intelligent. You even see this in the the model cards for SONNET 4.6. It's smarter than OPUS 4.6 in specific to finance requirements. It has its other constraints. But to me, the story is that, like, it's very obvious that Anthropic is highly, highly prioritizing intelligence in these domains now. And things that weren't quite possible are possible. And you actually do have to do what you said earlier, which is, like, you have to actually, you know, constrain some of the prompting because Opus is Opus 4.6 is far more aggressive than 4.5 is. Yeah. It's also less verbose, which can make it feel a little a little weird in personality sense. But, for us, like a no brainer, and I can tell you why again in our eval suite. But to look through here, we are starting to fill this out. This entire this was entirely empty. If you flip through it, it's all formula driven. It's all in the standard account or standard standard accounting slash finance, formatting. The sparklines are populating. And I'm noticing that the current liabilities are empty here, and it's done. And, again, I wanna reiterate, you're doing spreadsheet work and you're not using AI, I do think you're falling behind, but you're gonna use it the same way that you're gonna use it in coding. Like, you don't expect to one shot all your stuff. Yep. You're gonna wanna paralyze as much planning before and as much auditing afterwards as possible and probably never hand anything off that you have a new AI to audit. So, like, the very natural thing for me to ask here is, hey. Why are current liabilities empty? And, like, I need to know that. Another thing so they're not empty. Said, So, look, the the assumption is that monthly working capital balance was zero, which now makes sense to me. I'll do a couple of checks. I even will oftentimes, and this is like I won't go through all of these in in length. Mhmm. But when I talk about parallelizing auditing, this is the same thing we do in coding. Yeah. I will say audit this in full. Grade it. And I also have another independent agent say this. And one thing we know about agents is that they're not too proud. Meaning that if you have many, many independent agents critique each other, they will converge on the optimal critique. And then you can use that to make the the most well educated, you know, edits that you have to make. This is a it's just it's very it's a very similar flow to developing now, and it's gonna be in probably most of knowledge work. I will flip back to this task. Actually, I saw some questions in the q and a. I wanna be able to make sure that we can answer these specific questions about, like, vision. As of last night, we pushed an update. I'm actually giving Opus 4.6 vision or, like, a screenshot tool over the spreadsheet, which we find is, like, a material improvement in the formatting accuracy. So it's actually looking at, like, what it's doing here. It's like, okay. I kinda like what we're doing. Here's a clean bond price. Here's I can double check on my work. I also repaired all the broken name ranges. So if I click here, like, you can kinda get a feel for the complexity of even just the formula syntax in the structure. Like, this is a very complicated sort of if else equivalent and very specific finance math, which was the other bottleneck. It's like the right way to craft bond pricing. It just was not possible until this 4.6 in a consistent manner. Yeah. Awesome. How do you, like, have confidence when to switch to a new model? Or I guess, like, my my lead up there is to talk about your evals a little bit because you guys have awesome evals, and that's, like, something I I value personally a lot about our partnership because I love to see sure. love your eval screen, which is. same. I know you guys are big fans of our devals. So for us, I think if you're building in a verifiable domain, meaning that you can prove you're right or wrong. For example, PowerPoint is hard to verify. Word is impossible. Writing is impossible to verify in most ways. But if you're building a verifiable domain, you should be my take is the most important thing you can possibly be doing is building a good benchmark. And when I say that, I mean, is it sufficiently difficult and sufficiently in distribution? Meaning, does it perfectly match what your users need to get accomplished? And is there a big delta between where you are and where you need to get to? So we spend most of our research engineering effort on benchmark in benchmark infrastructure. So we run a benchmark every hour on production. We run experiments, dozens in parallel with different prompt changes, new tools that we built for the agent, and we see what works and what doesn't. We have a leaderboard, and we also have a history of this benchmark. So for us, if you see over the last three months or so, we were scoring six out of 10. That means 60% of the cells were exactly accurate in comparison to a handcrafted expert validated file for that task that, again, is perfectly in distribution that we got it from the real world. We started off at six in December. We were actually we're at four or less than four when we launched shortcut. So we're now at a little north of eight, and you can tell very specifically when we when OPUS 4.6 launched. Right? We went from about, I think, low sevens to low eights, which in our our customers felt immediately. They're like, this thing is significantly smarter. And when I tested OPUS I mean, so we we we had we ran the the model in our eval suite before before launch. It was a shocking jump. I wasn't sure if the same was gonna be true for other people's eval, and I think it was disproportionately strong in Excel. But for me, it was like I couldn't sleep. Like, I felt like this is the moment that's gonna change, and it's obvious that we're at a heart attack for Excel. And I'm just I'm certain that, like, PowerPoint people thought the same. That's amazing. Yeah. Curious to hear more about you said your evals just showed it, and I think maybe you felt it when you were using your product. But it sounds like the customer reception has also felt it. Yeah. Yeah. Absolutely. So, you know, I'm a I'm a builder. Like, built shortcut, grown, made a lot of mistakes as a founder for sure. I think one of them might have been like, I spent too much time just, like, religiously focused on improving the product. But in the pursuit of that, I kind of realized that, like, enterprises are kinda banging the door down to buy this thing. So we started having some conversations, and I realized that a failure of mine that I don't wanna make, and I hope others learn from, but I think you have learn it yourself, is that the bottleneck to enterprise sales really isn't like, oh, you have to go through security handholding and, like, go through a long process. It's not true. If your product is good enough, like, that's not necessarily the bottleneck. And in fact, the bottleneck actually might still be product quality, and their constraints and their feedback might actually be more in distribution than anybody else's. So that feedback fly like, flywheel was huge for us. And the moment the 4.6 launched, you know, of course, we're doing some other things in in the background as well. But our enterprise motion just, like, took a took a very serious step forward. Nice. That's awesome. Okay. Well, I wanna have Olivia come back on and so we can go through some of the audience questions. I have two to actually kick us off, but then I'll make sure to get to. Oh, wait. Do you wanna see are we showing the end? Oh, yeah. Well, sure. Amazing. You know, obviously, what you would do is audit every single number, and you're gonna ask multiple of these agents to do it before you hand it off. But, yes, it built the model in its entirety. Love it. Cool. So I'm gonna start with, like, a flop moment. Like, what would you warn other product builders about don't do what we did moments so that people watching can, like, avoid that? Whoever wants? to is fine. Can you guys hear me okay? Okay. I think it's really I mean, I think this goes back to what we were talking about earlier. But really thinking about what your product can do uniquely well and finding your strategic focus. I think it's really easy to want to hill climb on quality, but what ends up happening is the models just get better on their own. And I feel like a good example internally is we try to actually build this, like, question routing system where if it was a simple question, it would use, like, a faster model. And if it was, like, a more complex question, it would use, like, a you know, like, something like Opus. But what we found is it was very hard, to actually build something like that. And we actually immediately ended up cutting the project. But then, you know, adaptive reasoning came out, it kinda just solved that for us. And so I always think back to, like, where were these where were there times we spent engineering effort and time where, like, fundamentally, the model was just gonna get better and solve it for us? Yeah. Yeah. I have to always, like, hark back to benchmarks for us. So when we launched, we did this demonstration that it was like it solved this financial modeling World Cup demonstration. It did the case right, which was shocking and very cool and, like, a symbol of power. But that's actually not how users use the product. Meaning that, like, the demonstration, even some of our internal evals, and that was a part of the eval suite, were just, like, not perfectly aligned, with the real use cases. Specifically, what was news to me is that people are trying to use shortcut for, like, every incremental prompt. They're not just, like, one shotting, like, something from scratch, for example, or, like, the lovable bolt equivalent to coding. Like, they're not doing that. They're trying to do the clock code, meaning that your benchmarks actually have to capture multi turn. And multi turn is this, like, extremely hairy problem of, like, everything forks in every way. And then how do you, like, really verify that each independent multi turn is correct? And it it's a real science problem. So for us, that was like a quick learned lesson to make sure. And I I don't think your your benchmark could ever be in distribution enough, but you should probably plan these things ahead. Yeah. Makes a lot of sense. What do you guys credit for being able to migrate models so quickly in production? I know that you get confidence from your evals. Is that the main thing, or, are there other things like culture, drop everything, and hustle for a new model? Curious what you give credit to. I mean, I'll take it. Yeah. I mean, we have seen this benchmark really mirror, retention and, like, revenue, meaning that the same point Olivia is making. We Like, can do all the clever context engineering we want, and we're gonna have to be able to, like, you know, make sure that the tools acceptable to agent really do up the performance the same way Cloud Code does. But the step function change you get in the model jump is, like, one of the biggest things that can happen to your agent. So the moment we get a notice that, like, there might be a model for us to benchmark, we will do it and drop things. Thankfully, we haven't had to have too much, like, prompt engineering. So we run the benchmarks, and then we do what we call, like, dog food, and we do the entire team 20 people in, like, one room. We all use it, and we're like, okay. This is like it definitely has, a weird spunk or pizzazz that's different, and we have to address this in the prompts. Or, like, there are some subtle things that aren't even fully captured in the benchmarking, like, some formatting differences. Like, it has a different taste for stuff. This is where benchmark plus subjective taste. Usually, one day of dog fooding is enough. That's not true always for different model providers. It becomes a bit of the, like, the moat you kinda wanna play, like, each provider a little bit specifically. Yeah. That makes a lot of sense. And we have it's the same thing for us. We have a ton of benchmarks and emails that we run to get quantitative, you know, insight. But then it's awesome because someone puts, like, a thirty minute block on the entire company's calendar to, like, go try the model on whatever surface that that the person. chooses. So I think, like, company wide dog fooding is just, like, a great way to build conviction and get to know a model. Olivia, do you have anything to add there about, like, Hex's culture around model launches? would just, like, double down around the culture piece. Like, this was something in the last four years we've actually really intentionally tried to prioritize, and I will literally push out our own product deadlines or feature deadlines if we a new model comes out and we need to put resourcing there. And so we definitely now have a culture. Carly messages us. Alright. We're on it, immediately. And whatever we're working on, we'll just get paused. Because, yeah, like Nico was saying, every time a new model comes out, and there's such a huge delta for us, and we think that is, like, the most valuable thing that we can be paying attention to and working on. Yeah. Yeah. I think it's, like, it's really hard to stay, like, in that cadence because, like, models are coming out so often. I mean, we've launched two in the last two weeks. But I love that, and, like, I do think it it's a win win when, like, we can all launch at the same time. That, like, that is amazing also, like, for this, like, marketing feeling, of having, like, Hex and Shortcut be able to say, like, we now have Opus 4.6 as well. Okay, Nico. This is kind of a a spicy one, but it seems like there was, like, a few different questions that kind of address this. Yeah. But Anthropic has recently released some Excel features. How does Shortcut think about this tension? What they tell you is to build in a space where you stand to benefit a lot if the models get better. What they don't tell you is that if you do that really, really well, the model companies will directly compete with you. Now that's fine for us. I figured as much. I figured the same from Microsoft and Google. It depends on the ambition you have. I fundamentally believe that I love Carly. I love the Anthropic team. They're my favorite of the model providers. I don't think one company could be everything to everybody. Yeah. And for us, Excel sounds narrow. We're also a research lab. But for you, it's, like, extremely among one of the many, many things you do. So I believe we are gonna beat you. And and if we don't, I deserve to lose. So, that's what we've signed up for. But I would really recommend to people who are building at this overhang. If you really do believe that drop is gonna compete with you, be ready for the knife fight of your life, for us, what we signed up for. So, I mean, I have no regrets. And then, yeah, we have a we have a Wall Street prep just, like, released yesterday that we ranked ahead of it. Like, we like where we are, but it's gonna be a never ending sprint. Yeah. Yeah. I mean, it it is a point of tension. I think that, like, we are looking to train the best model at this so you can use it, and then we also, like, have have launched a product here. And it is interesting, like, for me personally working with customers where I have no involvement in, like, the Cloud for Excel. And so, like, actually, my stake is to, you know, work with you to make it best for Shortcut because we want our cloud models to be in other people's products too. But I like the way you put that. Okay. Olivia. So when a new model like Opus 4.6 gets released, how does Hex perform testing and validation of the new model in Hex's business context to make sure your agents still perform the same or better and doesn't regress? Yeah. Absolutely. This kinda goes goes back to our evals. That's first and foremost. We have a number of different eval sets. We have some, you know, external data analysis eval sets, but we also have a bunch of internal ones that we've created that are both specific to the HEX environment, specific to our own data. So we actually ask it real data questions that we internally have, as well as specific to our agent harness setup. So, if you know about HEX, there's a lot of things that are very specific to HEX since we wanna make sure that our agent harness continues to perform really well. And so we look at all of those different eval suites and determine, hey. Do we overall see a performance and no net regressions? And then exactly what Nico said, immediately, we all start testing it out and using it internally. And we're looking at a bunch of things. It's not just, hey. Does the model get the answer correct? We also look at things like latency. We want the product to feel really interactive and collaborative, and so that's something we care a lot about. And, on the product team, like, that's something that we think about is not just over like, not just optimizing for accuracy, but also for the feeling in the product. Yeah. That's awesome. I like that strategy. And then I guess as a follow-up to just your demo, did the harness or system prompt change between the two versions of the Pro Plan question? Are those differences purely due to the model swap? Yes. I literally, like, swapped them out under the hood and asked the same question. Like, those were, like, literally asked within five minutes of each other. There was nothing that changed, which is really interesting and cool to see. And that was, like, fun for me. I actually wasn't expecting it to work, and so I was expecting to have to, you know, look through a bunch of questions, but it immediately got that one, which was really cool. Yeah. Yeah. I think that, like, when we are working with people to, like, test our models and we wanna understand how it performs against, you know, Sonnet or, the older version of Opus. Like, what we love to see is when you just switch out the model and don't change the harness because that gives us a lot of insight. But then, of course, like, product builders, I think you both have maybe made some minor changes to your prompt when you actually launched Opus 4.6 because a new model just has a new personality. And, like, there there might be you might wanna have two movable pieces, both your prompts and the model. So, Nico, Shortcuts multi agent architecture for Excel tasks is fascinating. When multiple agents collaborate on something like a DCF model, how do you handle disagreements between agents? Say, one agent's output contradicts another's assumption. Is there a controller agent that resolves conflicts, or is it more of a pipeline? I don't know what DCF models are, so so you're the expert here. But I think, like, handling disagreement is, like, is a really interesting one just, like, generally outside of. For sure. For sure. I will say we don't have, like, a super advanced, like, multi agent at the action at the action sequence framework yet. So so what we try to do is, like, use multi agent both in the planning and in the auditing. And then we think if we do both of those things, well, you can probably have, like, a single agent that spins off sub agents for exploring, but not for acting. Now that probably won't be true for very long. I don't I know it's not, like, everyone's probably starting to, you know, to use these parallel action agents for all things that can be parallelizable. So for in the specific instance of the DCF, you're gonna have your drivers, but then you might have the three standard model, all of which can be done in parallel. You might have the DCF which can done in parallel. For both field and sensitivity analysis. Like, none of these are sequential in, like, nature, and they can and should be done. And now your limitations are actually, like, what does the Excel API actually allow you to do in terms of concurrency, which there are software engineering problems and agent questions. Now, specifically, I think the most the coolest thing, and my cofounder taught me this, there's, like, an economic theorem called the Almond's agreement theorem, which is that, like, given the same priors, the same data, any two rational human beings like, are the it's impossible for two rational human beings to agree to disagree. Like, they will find the right outcome. Now humans are not rational. We know this. I'm not rational. Agents are rational, and they're super intelligent, probably. So, what you do is you can give all of these agents plans to each other, and they will convert on the best plan. They will convert on the best on the best verification. So now you need some sort of multi agent architecture to analyze agents to communicate with each other, which is, like, sort of difficult. So one way we've done it, which is kinda cool, is is we've built this, like, OpenClaw inspired memory system in which our agent has access access to the entire trace history that it's produced for you. Meaning, it's aware of these other traces that are happening in parallel. And you can ask any of the agents like, hey. What do think of that other agent summary? And I was like, yeah. That guy's smarter than me. And and then you play around with, okay. Well, you know, I could have done some subtle different prompting, or maybe you just wanna, you know, standardize the promptings so they're all the same. So, there's a lot of, like, human craft in it, but, you have to give them at least the bones to to do this. Yeah. That's awesome. I like that guy smarter than me. And it's funny when agents do critique each other. Like, Claude critiques Claude. So final question. We're we're almost out of time. And I this is for both of you. I also am gonna take stab at it. But how should we be thinking about when to use Opus 4.6 versus Sonic 4.6? Does Anthropic recommend using other models ever or always SOTA? State of the art. So the way that I advise customers is, like, find out where the ceiling is. Like, make your product magical and awesome and, like, do intelligent things that people didn't know was possible with LLMs, and then you can pull it back. But, like, discover what the maximum is. And we release Sonnet such that, like, the things you could do with Opus, maybe you can still do that with Sonnet in your product. And that's kind of our goal with Sonnet. And so, definitely, we wanna be able to, like, allow people to to latency and cost optimize. But the way that I would advise product builders, and I spend a lot of time with people building products, is to just, like, discover where the ceiling is. But curious how you guys both think about it in practice. Yep. This is a really interesting one because it's I'm literally having this conversation this week internally. Yeah. I mean, I think it's actually very similar to you. We always wanna pick the best model that is gonna have the best user experience, and that's what we'll pick first and foremost. But it is really cool when you all release things like Sonnet where you can almost pretty much have the best experience, but usually just faster and cheaper. And so it's something that we think of, like, as, like, an if we can get it to be just as good kind of thing. And we also kinda look a lot, like I said, at we don't care just about pure accuracy number. It's also a lot about what does the the experience feel like in product. And so latency actually does matter a lot to us. So it's something that we weigh pretty heavily. And so in cases where we think we can get a faster experience, again, we want it to feel like a collaborator to you. We're gonna optimize a faster model. Yeah. We we feel similar, but but, Sana, it's a funny a funny thing too. It's, like, disproportionately better at finance. Though on the model card, like, not quite as bright as all us. Right? Which makes sense. But on our internal benchmark suite, it was actually, like there was no statistically significant difference. So then it becomes about behavior, and then you're like, okay. Well, you know, it's cheaper and it's faster. But there's also a lot of things you might do in the product that go beyond finance. Right? So, like, it's a general purpose spreadsheet agent. Finance is not the only thing people do spreadsheets for. So it what we probably are converging on is, a toggle to allow users to configure that so that they can get, like, you know, more more beneficial to other consumption. But we do wanna be opinionated that we still haven't made the call because it's so close. Yeah. I think it's a tough call, and you guys you guys are thinking about it well. I think the toggle too just felt like, should you be opinionated about model, or should you let, like, your user pick, what they wanna do for the trade off? Well, we're out of time, but I really appreciate you guys coming on today, demoing your products, and also, like, taking the the the audience questions. Yeah. Our pleasure. Thanks for having us. Awesome. was some fun. Bye, y'all. See you guys. Bye. Bye. Bye, everyone.