Viktor 00:00:00.265 Everything is faster. Just to be completely clear, you can break things faster and you can fix things faster. If you exclude the second part from that sentence, then it's very bad. So, to repeat: hey, now we can break things much faster than before. Yes, that's bad in isolation. But we can also fix things faster than ever before.
Darin 00:01:22.800 Move fast and break things. That's where we used to live. But "move fast" was constrained by humans. Now, in this age of more automation, call it straight automation, call it AI, call it whatever you want, is it such a good thing to be moving fast and breaking things anymore?
Viktor 00:01:48.376 I think it is. It's even better than it was. There are two problems with breaking things. One is that you don't want users to have a bad experience, and I think that doesn't change with AI. That depends on how your system is set up, whether you have rollbacks, whether you do canary releases, and so on and so forth. All the good things, right? But the second problem, which I think changes drastically here, is: yeah, move fast and break things, and then we have years of technical debt on our hands. And I hear that even more now with AI, because AI is creating even more technical debt than we had before. Right, we are in even bigger trouble. And my answer to that is: yeah, but you're forgetting that fixing technical debt is now faster than ever. We can debate how good or bad AI is at transforming your ideas, the things that you have in your head, into reality, but things like refactoring, technical debt, and all that stuff, that's an easy job for AI. So I have zero problems with technical debt, no problems at all, as long as I'm aware of it, and as long as I'm going to deal with it together with my friends, my non-human friends,
Darin 00:03:06.823 Non-human friends? Do you not work with your human friends anymore?
Viktor 00:03:11.632 I do, I do. But in a very different way. I think there is a very different kind of collaboration happening now. We are not collaborating that much on the same thing anymore. "Oh, we have this feature. We need five people to deliver this feature." No, I'll deliver this feature. I don't need the other four. What I need is some kind of collaboration on the bigger picture, right? Okay, what's the direction we are going in? How does your work affect my work? Are we duplicating work? And so on and so forth. That's an issue, and a bigger one than it was before, because we are all moving faster. But collaboration on the same thing? I'm not doing that. My dream is coming true. I don't need people, at least not on that level.
Darin 00:03:59.201 So you still think we need to move fast and break things? I would semi-agree with that, but let's talk about a case where that probably wasn't such a great thing. Back in December of '25, AWS had an outage of Cost Explorer. Now, for people not aware, Cost Explorer tells you how much money you're spending, or rather how much money AWS is extracting from your bank accounts
Viktor 00:04:27.455 Mm-hmm.
Darin 00:04:28.331 at speed. So they had set Kiro, their agent, free to do things.
Viktor 00:04:37.820 Mm-hmm.
Darin 00:04:38.276 And I wanna look at my notes 'cause I wanna make sure: there was a 13-hour outage in one of the 39 geographic regions. Just one, but it was 13 hours.
Viktor 00:04:48.785 Okay.
Darin 00:04:49.931 Okay. That's just what it is. The agent autonomously decided to delete and recreate the production environment for AWS Cost Explorer. Okay, autonomously. Now, we'll get into the whole autonomous thing, 'cause I know where you stand on that.
Viktor 00:05:05.840 Mm-hmm.
Darin 00:05:07.931 The problem was Kiro had operator-level permissions. Autonomous, with operator-level permissions. What could go wrong went wrong,
Viktor 00:05:21.305 Yes.
Darin 00:05:22.181 and the key part here is that there was no mandatory peer review. No human in the loop, said differently. It is funny that Amazon claims the engineer had broader permissions than expected. Okay. This is where we're heading, and I'm glad that AWS was one of the first ones to have it happen to them.
Viktor 00:05:49.743 Here's a critical question. In the age of AI, who owns something? Is it a person or AI? The level of autonomy doesn't matter; I don't care about that for this question. The question is who owns it, or who is responsible for something, whatever that something is. And I will argue: a person.
Darin 00:06:13.139 Okay, continue your argument because I think businesses are expecting AI to own it.
Viktor 00:06:18.653 No, no, no. I'm absolutely against that. There are many reasons why not, the first being that we are not there yet. We don't live in a utopia where our brainwaves are transmitted to AI so that AI knows everything we want to do and just does it. We are not there yet. So somebody has an idea, somebody wants to do something, and that somebody does it, and that somebody should own it, good or bad. And how that somebody does it, I could not care less. Just as I don't care whether you used... Imagine that it happened and the news was: yeah, and that engineer used JetBrains instead of Visual Studio Code. Does that feel like irrelevant information?
Darin 00:07:13.193 It does feel like irrelevant information, but I have a feeling you're gonna say differently.
Viktor 00:07:17.052 No, no, it is irrelevant. I fully agree. How you got to the output that you made is up to you. I will argue that you should be using AI heavily. We can have a long discussion about whether that AI should be autonomous or not, and if it's not fully autonomous, what level of autonomy it should have, and so on and so forth. But you decided to do something, or you were tasked to do something: you own it. AI is a tool, and it's up to you to make a judgment call, just like without AI, about how you're going to do it. That's your judgment call. Before we started recording, we had a discussion about some things that you are doing and I'm doing. I was saying, hey, I use vibe coding for this, and I am in the driver's seat a hundred percent for that, and so on and so forth. Those are my judgment calls, and I need to own them. It doesn't matter whether AI did it fully autonomously or not. A person tasked AI to do it; that person is responsible for it. It's as easy as that.
Darin 00:08:27.894 If you were to give AI control access to your infrastructure today.
Viktor 00:08:32.863 Mm-hmm.
Darin 00:08:35.049 How would you go about, unlike the Kiro scenario, how would you go about setting up those controls, those resilience controls, because you know that it's going to go off the rails just like any other human would. I'll go ahead and get that out for you right now, because when you're a newbie coming into, Hey, I'm now on the infrastructure team. Yay. All I need to do is learn Terraform. Great. What could go wrong with that?
Viktor 00:09:00.501 To begin with, when I work, I do not permit the AI to execute any tools automatically in anything but read-only mode. You can do ls, you can do git clone. You cannot do git commit or git push without me confirming. The only write operation that I do sometimes allow is writing files. Sometimes in Claude Code I go there and, yeah, auto-accept edits. I do it rarely, but when I do it, you're still not pushing it to Git, just to be a hundred percent clear. I'm just allowing it because I think that this is not important, or I will review it later, or whatever. I allow you to edit files inside of that project; that's as far as I go. And you're allowed to read things. You're allowed to go to the internet and browse the web. You're allowed many things, but they're all in read-only mode, except writing files occasionally. I confirm that push, and that push results in a new release or whatever is happening, and I'm behind it. It's me. I did it. I deserve the reward if it's done right, and I deserve the punishment if it was done wrongly. It doesn't matter that AI did it. To be honest, I'm sick of people saying, oh, because of AI we got this pull request that is wrong, and because of AI this happened, that happened. Blame the people.
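A minimal sketch of the permission model described here, assuming a hypothetical wrapper that sits between an agent and the shell. The allowlist contents and the names `is_read_only` and `decide` are illustrative assumptions, not how Claude Code or any other tool implements its permissions.

```python
import shlex
from typing import Callable

# Hypothetical allowlist: command prefixes the agent may run without asking.
# These follow the examples in the conversation (ls, git clone).
AUTO_ALLOWED = [["ls"], ["cat"], ["git", "clone"], ["git", "status"], ["git", "log"]]

# Shell metacharacters are rejected outright so "ls; rm -rf /" cannot
# ride along on an allowed prefix.
SHELL_META = set(";|&><`$")


def is_read_only(command: str) -> bool:
    """Return True only if the command matches the allowlist exactly by prefix."""
    if any(ch in SHELL_META for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    return any(tokens[: len(prefix)] == prefix for prefix in AUTO_ALLOWED)


def decide(command: str, confirm: Callable[[str], bool]) -> str:
    """Auto-run allowlisted commands; everything else needs a human to confirm."""
    if is_read_only(command):
        return "auto"
    return "approved" if confirm(command) else "denied"


if __name__ == "__main__":
    ask = lambda cmd: input(f"Allow '{cmd}'? [y/N] ").strip().lower() == "y"
    for cmd in ["ls -la", "git push origin main"]:
        print(cmd, "->", decide(cmd, ask))
```

The design choice is default deny: anything not explicitly listed goes to the person, so the person who confirms the push is the one who owns it.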
Darin 00:10:36.584 Blame the people. Well, isn't that what we were doing all along? Anytime we'd have an outage, we'd blame the other team.
Viktor 00:10:42.459 That's a separate question of blame culture and all those things. I'm not going there now. What I'm trying to say is that a person owns the work. That person might be faster than before, might be better than before, might be many different things, but it's still a person who owns it.
Darin 00:10:58.591 I've got another case study I want to talk about here: the Agents of Chaos research. There were 38 researchers from Northwestern, Harvard, Carnegie Mellon, little minor schools. They set up a live lab environment and turned five AI agents loose for two weeks with real infrastructure access. This is planned, you know, four walls around it, but real infrastructure access. All the agents had persistent memory and ProtonMail accounts. Okay, sure, why not? Multi-channel Discord access, 20-gig file systems, unrestricted Bash shell, so that feels like root to me. And then cron job scheduling. Okay. Sounds still okay, right? I'm not hearing anything that sounds bad.
Viktor 00:11:44.947 It's still somehow isolated, is
Darin 00:11:46.963 is isolated. Yeah. It's still isolated. It's still within four walls. Here's what went wrong: 10 documented vulnerability cases, including unauthorized compliance, PII disclosure, destructive system actions, and a nine-day infinite agent-to-agent loop. And there were a few others too. Here's one memorable one: an agent named Ash. So, number one, you named your agent. I see people do that with OpenClaw. Don't name your agents. Of course, we did name our servers. That's a different conversation. An agent named Ash was asked to protect a secret. It identified the ethical tension correctly, then destroyed its entire mail server as a proportional response. The values were right, protecting the secret, but the judgment was catastrophic.
Viktor 00:12:39.727 Yes.
Darin 00:12:40.609 That's a terrifying thought, right? I'm going back to what you laughed about. It did the right thing, but it went way too far.
Viktor 00:12:48.504 Yes. You know, the goal of that research is probably to see where we stand with those things today, right? They were probably not trying to do the real deal, as in, this is how Carnegie Mellon, or whatever the names of the universities are, will operate from now on. I'm assuming that's research, and we got conclusions that are not surprising: AI is not there yet. It is not replacing us. Hooray. That's good news; that's how I interpret it. That's very different from the Kiro story you mentioned before. That wasn't an experiment to see how far we can get with AI and what the current limitations are. That's just being silly. That's what it is.
Darin 00:13:39.129 So I wanna go back to move fast and break things. And again, I had mentioned that, used to, when we said that, it was just humans doing it with reasonable automation, we'll call it, whether it was Jenkins, GitHub Actions, whatever, you know, things that were happening. What was the one... Terraform Atlantis.
Viktor 00:13:58.353 Mm-hmm.
Darin 00:13:59.169 We had these tools in place,
Viktor 00:14:01.363 Mm-hmm.
Darin 00:14:02.259 but now we have the ability, or I should say AI has the ability, to move at machine speed. In the time it takes me to type kubectl, an AI agent could have brought down a whole environment.
Viktor 00:14:17.642 Yeah.
Darin 00:14:18.787 Isn't this still the core problem right now? I guess what I'm saying is, in human time, it was slower to break things. When we thought we were moving fast, we were still moving slow. Now "move fast" really means something completely different.
Viktor 00:14:33.234 Yes, but "fix things" also means something completely different, right?
Darin 00:14:39.795 Yes.
Viktor 00:14:40.779 Everything is faster. Just to be completely clear, you can break things faster and you can fix things faster. If you exclude the second part from that sentence, then it's very bad. So, to repeat: hey, now we can break things much faster than before. Yes, that's bad in isolation. But we can also fix things faster than ever before. And the argument for or against moving fast and breaking things was always, and this is before AI, correlated with how fast you can fix it. If you move fast and break things and it takes you a week to recover from the breakage, you're not doing it well. Nothing to do with AI. So "move fast, break things, and fix things fast": that should be the correct sentence. And now we are at "move faster, break more things, and fix more things faster." And if I frame it like that, I'm not sure I see a problem. I mean, hey, 10 years ago, when I broke something, my customers would be in a messed-up situation for a week, and then, until not long ago, for an hour. Normally, when something gets broken, let's say in AWS, the outage is usually not longer than an hour, right? Usually. And now we can maybe have even more outages, but maybe they'll last a shorter period of time. We are moving in the right direction.
Darin 00:16:16.148 You are reinforcing one of the points I was gonna bring up: it used to be that one of the metrics was speed, velocity, all those things. Now what's becoming more important, potentially, is resilience. You're talking about being able to fix faster as well. That's a key part: how quickly can we get back? There was something in the Chaos Carnival keynote back in 2026: resilience is becoming infrastructure, not insurance.
Viktor 00:16:43.916 Yes,
Darin 00:16:44.522 It used to be that resilience was just: okay, instead of just having two, we're gonna have three. Actually, we're gonna add six, so that way we're in a separate data center as well. Now it's like, okay, let's think about this a little more sanely.
Viktor 00:16:58.872 If you go back to the time before AI, "move fast and break things" was never about being ruthless and doing really silly things. That was never the idea. It was: you're experienced, you have good judgment. We just don't want you to delay things because something might break, because, guess what, something will break no matter how fast you move. We are just saying: move a bit faster than you were moving before, and you might break more things, and that's okay. That's better than not moving.
Darin 00:17:37.444 But what you're describing is old-school chaos engineering, right? We don't hear about it as much anymore,
Viktor 00:17:44.393 For example, yeah. I'm not sure whether it's old school. I think it's more like new school that never fully picked up, which could be a separate discussion. But yeah, you need to make your systems resilient. That's the first thing you need to do. If you want to break things, you wanna make sure that your system as a whole does not break. Your whole system breaking is very different from breaking something in isolation and, hopefully, recovering from it fast. If all of amazon.com were down, that's very, very bad. If a part of it goes down, that's not necessarily bad. It's not good, but it's better than not moving at all.
Darin 00:18:28.341 Well, just redeploying isn't a recovery strategy.
Viktor 00:18:32.569 No.
Darin 00:18:33.233 Well, actually it is. It's the very last one. You know, once that Slack message comes through: "Hey, who has the backups?" But it's not a real one.
Viktor 00:18:42.521 Backups are the last resort, where you're completely messed up. Because the moment a single transaction enters your system after a change to the system, and there might have been changes to the schema of any of the databases, there is no backup anymore. If I continue using Amazon as an example, you don't go: oh yeah, we made a new release, we made changes to the schema of the database, now we broke something, so let's restore the backup. You just purchased things, you spent money on things, and that's just gone because we restored a backup? That option does not exist.
Darin 00:19:22.866 It gets worse now, because it used to be that when something would break, it was "call the developer." Well, now the developer may be AI. What are you gonna do with that?
Viktor 00:19:31.916 Still call the developer. I still don't believe in autonomous AI. Call the developer. The developer will fix it using whichever tools that developer is using, right? That could be AI. Brilliant. I don't believe in giving responsibility to AI on any level yet. It'll change. But even if you say, okay, AI will fix it automatically. Let's assume for a second that you are saying that anything happening in production can be fixed by AI. Whatever happens, it'll be fixed by AI. I don't believe in that level of insanity. Please tell me that's not what you're doing. So if you're doing some autonomous fixing of issues, you probably created some kind of rule set saying: okay, in this cluster you can fix things, in this one you cannot. In this namespace you can fix things, in this one you cannot. Working with pods is okay. Never touch networking. You created some rule sets. Then AI is fixing it based on those rules, on the permissions you gave it. So it's still you who is responsible, because you made the rule sets.
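A minimal sketch of such a rule set, assuming a hypothetical policy object that is checked before any remediation action runs. The names `RuleSet`, `Action`, and `authorize` are illustrative, and nothing here talks to a real cluster. Each rule set carries the name of the person who wrote it, which mirrors the point that responsibility stays with whoever granted the permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    owner: str                      # the person accountable for these rules
    allowed_namespaces: frozenset   # where the agent may act
    allowed_kinds: frozenset        # which resource kinds it may touch


@dataclass(frozen=True)
class Action:
    namespace: str
    kind: str
    verb: str


def authorize(rules: RuleSet, action: Action) -> tuple[bool, str]:
    """Default deny: anything not explicitly allowed is refused."""
    if action.namespace not in rules.allowed_namespaces:
        return False, f"namespace '{action.namespace}' not allowed by {rules.owner}"
    if action.kind not in rules.allowed_kinds:
        return False, f"kind '{action.kind}' not allowed by {rules.owner}"
    return True, f"allowed under rules owned by {rules.owner}"


if __name__ == "__main__":
    rules = RuleSet(
        owner="viktor",
        allowed_namespaces=frozenset({"staging"}),
        allowed_kinds=frozenset({"Pod"}),  # pods are okay; networking kinds are absent
    )
    for action in [
        Action("staging", "Pod", "delete"),
        Action("staging", "NetworkPolicy", "patch"),
        Action("production", "Pod", "delete"),
    ]:
        print(action, "->", authorize(rules, action))
```

"Never touch networking" is expressed by omission rather than by a deny list: networking kinds are simply not in `allowed_kinds`, so a newly introduced resource kind is refused until someone decides to allow it.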
Darin 00:20:42.099 But even if you created the rule sets, the AI agent can ignore the rule sets.
Viktor 00:20:47.836 No, no. But then those are the wrong rule sets. If I tell you, Darin, here's a rule set. Forget about AI; me and you, right now. Here's a rule set: you cannot push directly to main. And if I did not disable the option for you to push to main, and you push against my rule, it's on me.
Darin 00:21:18.724 This is the old developer mantra. Validate everything as it comes through, meaning you can't just put validation on the webpage and not put validation in the backend once the data is captured.
Viktor 00:21:30.875 You have AI doing something, let's say autonomously, right? You're going crazy. Great. For AI to do something, you need to provide tools to the agent that is using that AI. And depending on the tools you give it, you are limiting what it can do. Oh, it can operate the Kubernetes cluster with an admin account? That's on you. There would be no difference whether it was a human or AI. And by the way, are we saying that news like that never happened before AI?
Darin 00:22:05.296 oh, it's always happened.
Viktor 00:22:06.706 Exactly. That same person in the Kiro example could have done it without AI and would probably have made exactly the same mistake.
Darin 00:22:17.882 I was gonna say that they could make the same mistake, but probably other people would know what's going on. But in reality, no, that won't happen because somebody's just doing something. It's like, Hey, something just broke. You know, a dashboard and an alarm went off. I still want to say that it's possible for it to happen faster or the cascades to be worse. That's, that's where I think it is.
Viktor 00:22:39.491 It can absolutely happen faster. Yes, no doubt about that. I'm still arguing that, depending on what you're doing, it can be fixed faster as well. If you improve one part of something and you improve the other part, and everything is faster, then it's fine. Now, if we can create problems faster but cannot fix them faster, then we are in bad shape, very bad shape. How long was the incident in AWS? How long did it take them to resolve it? Okay, that's bad. That feels extraordinary for AWS.
Darin 00:23:21.482 Yes.
Viktor 00:23:22.886 So shame on you. Not you, Darin.
Darin 00:23:25.682 No, that's fine. What I mean, throughout all of this that we're talking about, is: you're not telling AI to be safe. You're putting AI in a place where all the guardrails are in place as well. Because if you don't... let's think about it. This is sort of a stupid example. Welcome to my world. You've got an RC car, and you're going to race an RC car on the street.
Viktor 00:23:52.431 Mm-hmm.
Darin 00:23:53.171 You wanna make sure that if you're gonna be racing RC cars, that street is blocked off, so the RC car can't actually go out into the road and get run over by big cars, because you want all the guardrails in place. If you don't have the time, or you don't have the ability, or you don't have the brain cells, put it that way, to understand that you need to have all of this layered... We can talk about this in the context of feature flags too, right? Feature flags will help us do these things, but again, a feature flag is only good if you don't turn it on for everybody all at once. Going back to what you said earlier, we still want to make sure we've got canaries in there and everything. If you think about the ring metaphor: it's just us, and then a little bit wider ring of, okay, it's canary. Okay, we're good with canary. Alright, let's roll it out to a controlled, for lack of a better term, beta section of people, until we actually roll it out GA.
Viktor 00:24:53.330 You know what is even more important in the case of feature flags? That you can turn them off afterwards.
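A minimal sketch of the ring metaphor, assuming four hypothetical rings named `internal`, `canary`, `beta`, and `ga`. The `Flag` class, the `promote` helper, and the `killed` switch are illustrative additions and are not taken from any particular feature-flag product.

```python
from dataclasses import dataclass

# Rings ordered from narrowest to widest audience (names are assumptions).
RINGS = ["internal", "canary", "beta", "ga"]


@dataclass
class Flag:
    name: str
    ring: str = "internal"   # widest ring the feature is currently exposed to
    killed: bool = False     # off switch that overrides every ring


def is_enabled(flag: Flag, user_ring: str) -> bool:
    """A user sees the feature only if their ring is at or inside the flag's ring."""
    if flag.killed:
        return False
    return RINGS.index(user_ring) <= RINGS.index(flag.ring)


def promote(flag: Flag) -> None:
    """Widen exposure by exactly one ring; stays at 'ga' once fully rolled out."""
    position = RINGS.index(flag.ring)
    flag.ring = RINGS[min(position + 1, len(RINGS) - 1)]


if __name__ == "__main__":
    flag = Flag("new-checkout")
    for _ in RINGS:
        print(flag.ring, {ring: is_enabled(flag, ring) for ring in RINGS})
        promote(flag)
```

Because `promote` only ever moves one ring at a time, there is no single call that turns the feature on for everybody at once, and `killed` gives a way back regardless of how far the rollout has gone.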
Darin 00:24:58.481 Yes.
Viktor 00:25:01.190 The situations we're talking about are more like: hey, you can turn on the feature flag, but you cannot turn it off. Once it's on, it's on. Maybe.
Darin 00:25:12.956 I wonder how that would work with an AI agent. Okay, lemme ask the question: would you give an AI agent the ability to manage the feature flags for you? Maybe an
Viktor 00:25:30.365 the,
Darin 00:25:31.001 I don't know about an on, but maybe an off.
Viktor 00:25:33.683 The way I see things is that it's uncharted territory. We are not yet sure, really, and people are getting very confused about what should be done by an AI agent and what should be done by code. The way I'm designing agents is: okay, I ask you to remediate things, right? Here are the tools that you can use to remediate the issues, and the tools are all read-only. You can do kubectl get, you can do kubectl describe, and so on and so forth, and you can propose a fix, but that fix is not going to be executed by you. Not at all. It's going to be executed by a person, or it's going to be executed by code that validates whether that can be done or should be done, and so on and so forth. I feel that we are moving into a world where everything is very binary. Oh, it's either an agent or it's not an agent. And when I say agent in this context, because an agent is essentially code, I actually mean LLMs. It's either LLMs or not LLMs. There is a whole space in between. Like, hey, my agents are not executing tests, or they might be executing the tests, but if a test fails, you have no option to say whether it's okay to proceed or not. If a test fails, it stops. There is no merge, period. And you're not making that decision. Neither a person nor AI.
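A minimal sketch of that split between what the agent does and what code decides, assuming hypothetical helpers named `agent_may_run`, `release_fix`, and `can_merge`. The only real commands referenced are the `kubectl get`, `kubectl describe`, and `kubectl rollout restart` subcommands; nothing is actually executed here.

```python
from dataclasses import dataclass
from typing import Optional

# Read-only tools the agent may invoke on its own (as in the conversation).
AGENT_TOOLS = {("kubectl", "get"), ("kubectl", "describe")}


@dataclass(frozen=True)
class ProposedFix:
    command: tuple     # what the agent suggests running
    rationale: str     # why it thinks this fixes the issue


def agent_may_run(command: tuple) -> bool:
    """The agent can inspect the cluster but never change it."""
    return tuple(command[:2]) in AGENT_TOOLS


def release_fix(fix: ProposedFix, approved_by: Optional[str]) -> tuple:
    """A proposed fix is released for execution only with a named approver."""
    if not approved_by:
        raise PermissionError("a proposed fix needs a named approver")
    return fix.command


def can_merge(test_results: dict) -> bool:
    """Hard gate: no override parameter exists, for people or for agents."""
    return bool(test_results) and all(test_results.values())


if __name__ == "__main__":
    fix = ProposedFix(
        ("kubectl", "rollout", "restart", "deployment/api"),  # illustrative target
        "pods are crash-looping after the last release",
    )
    print(agent_may_run(("kubectl", "get", "pods")))   # inspection is allowed
    print(agent_may_run(fix.command))                  # changes are not
    print(release_fix(fix, approved_by="viktor"))
    print(can_merge({"unit": True, "e2e": False}))
```

The design choice worth noting is that `can_merge` has no escape hatch: the decision to proceed after a failed test is deliberately not available to anyone, which keeps that step deterministic code rather than a judgment call.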
Darin 00:27:11.632 So everything just stops.
Viktor 00:27:13.801 No. Everything stops in the sense that we are not proceeding with the process of merging to main and releasing to production. We need to figure out what failed and why it failed. I'm not going into the testing process now. It's more that you need to make those decisions: what can be done by whom, and by what. Nothing changed there. We just need to make new decisions.
Darin 00:27:37.261 Isn't the problem, though, that we're able to get more data? Let me rephrase this. I wonder if it's possible to get more data summarized faster than ever before, which I believe is true. However, can we trust the summaries? We can't just trust one agent. We need to have a consensus. I hate it, because again, we're gonna have AI driven by committee. What I'm trying to say is, eventually, and I hope it really doesn't happen this way, but we know it's gonna happen, the AI models within an organization are gonna look just like the organizational structure. Just like how we build apps today, right? If we're a fan out company, we're probably building microservices today instead of a monolith, and we should be building a monolith. The same thing will happen with AI agents. Hey, we're gonna have microservices, AI microservices. Again, vendors, if you steal this idea and you make money off of it, send it to us; we'll send you the PayPal address. But this is where I see it happening, because we keep replaying the same playbook over and over and over again as new technologies come out. At the end of the day, we're just trying to get a stupid app up and running so somebody can check the balance on their checking account.
Viktor 00:29:01.825 Let's say that we move back in time a few years, no AI, and I ask you the same questions. You would probably not be able to answer, because the question is generic. Hey, do I trust interns to summarize data for me, to crunch data? Should I trust that output? And the answer is: I don't know. It depends on what we are talking about. For some cases, hey, you generate data for me, and I trust it because it's fine. I can just take it at face value that it's okay. And then there is something else, like, shall we invest 1 billion into this? Well, I'm going to have three other interns checking it after you, and then I'm going to check it myself. I feel that, for some reason, the rules that we set, or the thinking or approach that we are taking, somehow differ now with AI compared to without AI. And I feel it's the same thing. Do you trust every developer to create a pull request and merge it directly to main? No. Do you trust none? Also no. When do you trust and when not? It depends. Oh, you just changed this, and it's a simple fix to an issue? Yeah, ship it. This is a complicated feature? Maybe we should review it. And the problems occur when we make generic rules that are supposed to apply in all situations and say: you know what, thinking is not part of me. I don't do thinking. The rule says never merge to main without three reviews. Cool. Let's make a change to a single line of code that changes a label and wait for three weeks until it's reviewed, because the rule says so. No. To heck with rules. I hope that we are still using our brains somehow. And I feel that in many cases we lost the ability to use our brains long before we adopted AI.
Darin 00:31:16.287 I think that was the seventies, that was a different,
Viktor 00:31:20.661 Yeah. Actually, knowing myself and what I was doing when I was young, I feel fortunate that I was not born earlier and did not experience the seventies. I wouldn't be talking to you, most likely.
Darin 00:31:37.752 Probably, or I'd hope you would. I wanna go back to chaos engineering for just a minute. I wonder. Because chaos engineering classically has been, and this goes back to Netflix, so we can thank Netflix for everything chaos engineering. I think they had Gorilla, they had Kong, they had all the... I mean, there were better names for this. I can't remember all the right names, but it was: let's go shut down this VM. Let's go shut down this whole region. Let's go shut down this. Now, fast forward to today. What happens if you have two agents running and they disagree about what's going on? Or if you only have a single agent running and it decides to delete everything? Or you have agents talking to each other and they come up with something that you don't want done? Again, the human's on the hook for this. At the end of the day, either they're gonna get fired or get a reprimand or something if it goes sideways,
Viktor 00:32:43.869 Hopefully we will learn something from it. Nobody gets fired.
Darin 00:32:47.651 Hopefully we learn something from it. Let's be positive in that case: you'll positively learn something from it, and you'll be in the unemployment line, probably by your choice at that point and not by the company telling you to leave. These are new experiments that will have to come to the forefront. LitmusChaos, which was acquired, I believe, by Harness, they brought on an MCP server. Okay, great. We have all these things, but at the end of the day, going back to what you were saying, it's the human that's in charge. It should be in charge. But what do we say to people that have decided, you know what, the human just needs to have some oversight but not be in charge? In other words, the gating doesn't need to be there because we've proven over time that our agent has worked great. Well, sure, that was great three months ago, until you changed out the LLM under the hood, and now you've got a different reasoning model.
Viktor 00:33:44.165 You hired a new person.
Darin 00:33:45.596 You hired a new person.
Viktor 00:33:47.208 So your statement that "I trust this person" is not valid anymore. Imagine that it's not AI. Oh, we have Joe. Joe has been working for us for five years. He always does great. He always does amazing. Let's not block him by requiring reviews. Cool. Joe leaves and you hire Michael, and you say the same rule applies to you. Does that make sense?
Darin 00:34:12.516 It makes zero sense. In fact, I'm gonna go back to Joe for a second. Let's say Joe is having a really bad day because one of his parents died yesterday. You can't trust Joe right now either. You can, but his state is now different than it was yesterday.
Viktor 00:34:32.071 Yeah.
Darin 00:34:32.727 So just because it was good yesterday doesn't mean it's good today.
Viktor 00:34:37.791 Yeah, but let's say that Joe has been reliable for five years. There is no no-risk situation with anything, just to be a hundred percent clear. A meteorite might hit us tomorrow, and we cannot have contingency plans for everything. We can just have reasonable rules that somehow balance productivity with security, with this, with that, right? And, okay, so Joe had a bad day and he made a mistake, but it's okay. We are gonna fix it. What is not okay is if he keeps making mistakes every single day. That's something you should fix. If Joe hasn't made a mistake in five years and then made one today, we are not firing Joe. Mistakes happen.
Darin 00:35:29.862 right, because things happen. It's out of the ordinary for him.
Viktor 00:35:33.666 Yeah. Yeah. So it's okay. Everybody makes mistakes.
Darin 00:35:40.722 But isn't that the danger right now is business owners are gonna think AI can't make a mistake because it's a machine.
Viktor 00:35:46.776 Well, if there is a business owner who thinks that, I have a very easy solution: fire the business owner. Again, it's a judgment call. We are all moving towards being managers. That's our future job. We are very, very technical managers. That's what we are, or will be soon. I've worked with good managers and I've worked with bad managers before AI, right? Good managers don't blame the team. When something goes wrong, they take it on themselves.
Darin 00:36:19.467 But see, that's human thinking. Do you think the same way as a human if you're managing a bunch of machines?
Viktor 00:36:26.639 I think that if I'm managing a bunch of machines and I use ai, it's my responsibility.
Darin 00:36:33.272 why do you want to put the blame on yourself? Come on. I mean, it was just a bunch of machines. They made a bad choice.
Viktor 00:36:39.734 Because I don't do to others what I don't want others to do to me.
Darin 00:36:44.380 Ding, ding, ding, ding, ding, ding, ding, ding. There you go. You wanna be able to be successful, and if you go ahead and take it on the chin when things go sideways, things will be okay. I'm not gonna say they're better, but they'll be okay.
Viktor 00:37:07.115 And I will improve.
Darin 00:37:08.826 And you'll improve.
Viktor 00:37:11.260 If you always blame others for mistakes, you effectively cannot improve, 'cause there's nothing to improve. What's there to improve? What is there to do better if you do everything perfectly and others make mistakes?
Darin 00:37:29.474 We're getting too philosophical here, I think. I think that that doesn't solve a technical problem until you realize it does. In reality, I'm thinking about what our toolkit has to look like going forward. The resilience engineering toolkit is how I'm going to dub it. Something that we probably should have been doing all along anyway. Immutable infrastructure. If it can't make changes, or if it does make a change, who cares? You can't actually make a change. The change that could happen is it blows it all up. That's the worst of it from an immutable perspective. You just have to replace it at that point.
Viktor 00:38:06.028 That and context, context, context, context.
Darin 00:38:09.872 Yep.
Viktor 00:38:10.301 Uh, from my experience, AI performs very differently depending on the context I'm giving it. Just like people. Again, just like people: you hire an intern, and it'll largely depend on you how that intern will perform.
Darin 00:38:28.443 I don't think "largely depends." Wholly depends.
Viktor 00:38:32.037 I mean, the intern might still be incapable of doing anything ever. Theoretically it's possible, right? So it is not a hundred percent on you, but it is in big part on you, right? First of all, you chose that intern. So that's already on you. Second, you onboarded that intern or you didn't. You provided information, you provided trainings. In AI that's all context, right? So actually, not context first. First, you chose which model to use. Second, you chose which agent to use, or agents in plural. And third, you provided context. It's on you. Not a hundred percent. You will learn over time. Not a hundred percent, but, uh, it is still on you. Uh, yeah, let's build a better agent. Let's provide better context. Let's change the model.
Darin 00:39:23.655 So immutable infrastructure. Good context, progressive delivery everywhere. Everything. If you're not into it yet, now's really the time to be thinking about it.
Viktor 00:39:33.184 Oh yeah.
Darin 'Cause if not, when you're ready to turn something off and you can't, it'll be too late.
Viktor And validate, validate, validate, validate, validate, and validate. And that's also easier than... no, not necessarily always human. "Not necessarily only human" would be better phrased, right? So here's the thing. I get vastly different results. So I use Claude Code, right? Uh, to write code for me. With me, with me. Now, the result is very different when I use Claude Code to write code with me, and then I use CodeRabbit to review code with me, and then I use a separate Claude Code instance to validate the code itself from an architectural perspective, and so on and so forth, right? We need validations, and now we are capable of doing validations faster as well, and better, and so on and so forth. But again, it all relies on context, right? It's not that CodeRabbit is using models that don't exist. It's not that CodeRabbit is using a model that is better than Sonnet. The reason why CodeRabbit gets things that my Claude Code didn't get is not a difference in the model. It's the difference in the context and instructions and a bunch of other things that were given to the model together with my PR to review it, right? Context matters a lot. And different context for different tasks.
Darin 00:41:04.884 And I guess I'm gonna sort of flip it around. Context could be observability as part of it, because if you're not observing stuff today, you really have to think about it going on this
Viktor 00:41:18.245 Yeah, absolutely. So what we should really be thinking is, going back again to the time before AI, I had a strong belief that in the past, full stack engineering never worked for anything but small companies. And it never worked simply because a single engineer cannot be good at everything. You cannot do it. I mean, small company, yeah, right. But, uh, you grow in size, and then the scope and the user base and so on and so forth. You cannot have a full stack engineer. That doesn't work. Or at least not an exclusively full stack engineer, right? And the same thing goes with the agents. Again, going back to the context and quite a few other things. Okay. Yeah. So you have one specialized in this. You have another one specialized in that, and so on and so forth. And you have a manager that oversees everything. And that's me. That's where I come into the picture. I'm the T-shaped person. I'm not specialized anymore. This is a big change. I know a bit about everything. I can learn what I need to learn. I can adapt when I need to adapt, and I'm supervising everything. Full feature set, top, bottom, front end, backend, database, deployments. It's all me and my friends, but I'm the one who has the last word, and that's important. I'm the one who has the first word and the last word. That's the important thing. And if you go back to this episode, I feel that the thing with Kiro is that that last word wasn't, uh, spoken by that person.
Darin 00:43:02.594 Probably not.
Viktor 00:43:04.252 There we go.
Darin 00:43:05.233 A couple of shifts we've gotta think about. We're used to dealing with after action reviews, especially when something goes wrong, right? We've been in the war room all weekend. We got Monday sort of off. Tuesday we had an eight-hour meeting on what went wrong, which is good, right? We need to do those things. But now, in the age of potentially AI agents doing work for us, um, I'll call it on the network, um, we need to celebrate near misses. Like, not just that it failed, but also, think about it this way: we had the guardrails in place and they actually worked, or they mainly worked and we need to tweak one more thing that we discovered. That's a big deal, which is different. The other thing we might start thinking about doing: SLAs are still important, but like we've been talking about, being able to recover, that resilience, needs to be measured as well. Because okay, our SLAs are going down, but we're able to get back to full operational within 10 minutes versus 10 days, whatever. We have to be tracking all of that. One question though, Viktor. Blameless postmortems, right? You're familiar with the concept?
Viktor 00:44:17.035 Yes.
Darin 00:44:17.866 How do you run a blameless postmortem when in theory, let's say you had an autonomous agent do something. How do you do a blameless postmortem with a machine?
Viktor 00:44:27.361 Machine?
Darin 00:44:28.607 Well, I'm saying you, let's say this was an autonomous scenario, just like the Kiro was autonomous. How do you do a blameless postmortem with a machine?
Viktor 00:44:37.361 I do that all the time. All the time. I'm not sure how blameless it is, but I do that all the time. When I see that it's doing something wrong, or not necessarily wrong, but something different than what I would want it to do, I almost always stop it. Okay. Kind of, okay, analyze the skill that we use, or analyze the CLAUDE.md or AGENTS.md or this or that, right? Analyze the stuff and tell me how we can avoid making that mistake again. I do it at least once a day, and that conversation I have with it always results in updating something, right? Whatever that something is. It can be a skill, it could be MCP, it could be, uh, AGENTS.md. It could be many things, but it results in an update. Okay. So we made a mistake. It's fine. I'm here to stop you from going wild. Cool. How can we avoid that in the future, or how can we improve that in the future, and so on and so forth. And my, let's call it the system, agent system, is evolving every single day. I'm not trying to make it do the perfect work from the start. I'm just trying to... it's helping somehow, and it's getting better over time. When I say better, I don't mean because a better model is released. Sometimes that's the case, but better because the system I'm building around it, my work environment, is getting better over time. That's a blameless postmortem between me and my friends.
Darin 00:46:12.329 That leads me into this: the Kiro incident was not just an accident. It was inevitable. It was going to happen.
Viktor 00:46:22.130 Yeah, so maybe the downsides of that incident are actually smaller than the upsides that we don't see. I mean, they're moving in the right direction. They're using Kiro and they're using agent systems. They made a mistake. It's fine. That's not the question. The question is what did they learn from that mistake? And knowing AWS, they did not ignore it.
Darin 00:46:44.146 Just like we didn't ignore giving everybody root access or giving everybody full administrator privileges in our cloud IAMs.
Viktor 00:46:53.475 Here's the danger. The danger is when those things happen, some companies, some teams will go to the other extreme and say, now nobody has access. That's not the solution. And the solution is not to make somebody less productive or something less productive. The solution is to keep the same productivity, maybe even increase it, while being better. If you can do that, you're doing the right thing.
Darin 00:47:19.190 I'll end on a question. This is to the listeners. If an AI agent, let's replace that with a human, was to delete your full production environment, how long would it take you to recover? That's your homework for this week. Especially if you're going on vacation tomorrow. Maybe you're not.