Viktor
00:00:00.265
Everything is faster. Just to be completely clear, you can break things faster and you can fix things faster. If you exclude the second part from that sentence, then it's very bad. So, to repeat: hey, now we can break things much faster than before. Yes, that's bad in isolation. But we can fix things faster than ever before.
Darin
00:01:22.800
Move fast and break things. That's where we used to live. But "move fast" was constrained by humans. Now, in this age of more automation, call it straight automation, call it AI, call it whatever you want, is it such a good thing to be moving fast and breaking things anymore?
Viktor
00:01:48.376
I think it is. It's even better than it was. There are two problems with breaking things. One is that you don't want users to have a bad experience, and I think that doesn't change with AI. That depends on how your system is set up, whether you have rollbacks, whether you do canary releases, and so on and so forth. All the good things, right? But the second problem, which I think drastically changes here, is that, yeah, we move fast and break things and then we have years of technical debt on our hands. And I hear that even more now with AI, because AI is creating even more technical debt than we had before. Right? We are in even bigger trouble. And my answer to that is: yeah, but you're forgetting that now fixing technical debt is faster than ever. We can debate how good or bad AI is at transforming your ideas, the things that you have in your head, into reality, but things like refactoring, technical debt, and all that stuff, that's an easy job for AI. So I have zero problems with technical debt, no problems at all, as long as I'm aware of it and as long as I'm going to do it together with my friends, my non-human friends.
Viktor
00:03:11.632
I do, I do. But in a very different way. I think that there is a very different collaboration happening now. We are not collaborating that much on the same thing anymore. "Oh, we have this feature. We need five people to deliver this feature." No, I'll deliver this feature. I don't need the other four. What I need is some kind of collaboration on the bigger picture, right? Okay, what's the direction we are going in? How does your work affect my work? Are we duplicating work? And so on and so forth, right? That's an issue, a bigger issue than it was before, because we are all moving faster. But collaboration on the same thing? I'm not doing that. My dream is coming true. I don't need people, at least not on that level.
Darin
00:03:59.201
So you still think we need to move fast and break things? I would semi-agree with that, but let's talk about a case where that probably wasn't such a great thing. Back in December of '25, AWS had an outage of Cost Explorer. Now, for people not aware, Cost Explorer tells you how much money you're spending, or rather how much money AWS is extracting from your bank accounts.
Darin
00:04:38.276
And I wanna look at my notes 'cause I wanna make sure: there was a 13-hour outage in one of the 39 geographic regions. Just one, but it was 13 hours.
Darin
00:04:49.931
Okay. That's just what it is. The agent autonomously decided to delete and recreate the production environment for AWS Cost Explorer. Okay, autonomously. Now, we'll get into the whole autonomous thing, 'cause I know where you stand on that.
Darin
00:05:07.931
The problem was Kiro had operator-level permissions. Autonomous, with operator-level permissions. What could go wrong went wrong.
Darin
00:05:22.181
The key part here is no mandatory peer review. No human in the loop, said differently. It is funny that Amazon claims the engineer had broader permissions than expected. Okay. This is where we're heading, and I'm glad that AWS was one of the first ones to have it happen to them.
Viktor
00:05:49.743
Here's a critical question. In the age of AI, who owns something? Is it a person or AI? The level of autonomy doesn't matter. I don't care about that one in this question, right? The question is who owns it, or who is responsible for something, whatever that something is. And I will argue: a person.
Darin
00:06:13.139
Okay, continue your argument because I think businesses are expecting AI to own it.
Viktor
00:06:18.653
No, no, no. I'm absolutely against that. Right? There are many reasons why not, the first of all being that we are not there yet. We don't live in a utopia where our brainwaves are transmitted to AI so that AI knows everything that we want to do and just does it, right? We are not there yet. So somebody has an idea, somebody wants to do something, and that somebody does it, and that somebody should own it, good or bad. And how that somebody does it, I could not care less. Just as I don't care whether you used... Imagine that that happened and the news was, "Yeah, and that engineer used JetBrains instead of Visual Studio Code." Does that feel like irrelevant information?
Darin
00:07:13.193
It does feel like irrelevant information, but I have a feeling you're gonna say differently.
Viktor
00:07:17.052
No, no, it is irrelevant. I fully agree. How you got to the output that you made is up to you. I will argue that you should be using AI heavily. We can have a long discussion about whether that AI should be autonomous or not, and if it's not autonomous, what level of autonomy it should have, and so on and so forth. But you decided to do something, or you were tasked to do something: you own it. AI is a tool, and it's up to you to make a judgment call, just like without AI, about how you're going to do it, and so on and so forth, right? That's your judgment call. Before we started recording, we had a discussion about some things that you are doing and I'm doing, when I was saying, "Hey, I use vibe coding for this, and I am in the driver's seat a hundred percent for that," and so on and so forth. Those are my judgment calls and I need to own them. It doesn't matter whether AI did it fully autonomously or not. A person tasked AI to do it; that person is responsible for it. It's as easy as that.
Darin
00:08:35.049
How would you go about, unlike the Kiro scenario, how would you go about setting up those controls, those resilience controls, because you know that it's going to go off the rails just like any other human would. I'll go ahead and get that out for you right now, because when you're a newbie coming into, Hey, I'm now on the infrastructure team. Yay. All I need to do is learn Terraform. Great. What could go wrong with that?
Viktor
00:09:00.501
To begin with, when I work, I do not permit the AI to execute any tools automatically in anything but read-only mode. You can do ls, you can do git clone. You cannot do git commit or git push without me confirming. The only write operation that I do sometimes allow is writing files. Sometimes in Claude Code I go there and turn on auto-accept edits. I do it rarely, but when I do it, you're still not pushing it to Git, just to be a hundred percent clear. I'm just allowing you, because I think that this is not important or I will review it or whatever, to edit files inside of that project. That's as far as I can go. And you're allowed to read things. You're allowed to go to the internet and browse the web. You're allowed many things, but they're all in read-only mode, except writing files occasionally. I confirm that push, and that push results in a new release or whatever is happening, and I'm behind it. It's me. I did it. I deserve the reward if it's done right, and I deserve the punishment if it was done wrongly. It doesn't matter that AI did it. To be honest, I'm sick of people going, "Oh, because of AI we got this pull request that is wrong, and because of AI this happened, that happened." Blame the people.
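The read-only-by-default setup Viktor describes can be sketched roughly as below. This is a minimal illustration, not how Claude Code or any other agent actually implements permissions: the allow-list, the metacharacter check, and the names `needs_confirmation` and `gate` are all assumptions made up for the example.

```python
import shlex

# Hypothetical allow-list of command prefixes treated as read-only.
# (git clone writes to local disk, but it is on the list because the
# speaker explicitly allows it without confirmation.)
READ_ONLY_PREFIXES = [
    ("ls",),
    ("cat",),
    ("git", "clone"),
    ("git", "status"),
    ("git", "diff"),
    ("git", "log"),
]

# Characters that could chain or redirect commands; any of them forces a confirmation.
SHELL_METACHARACTERS = set(";&|><`$\n")


def needs_confirmation(command: str) -> bool:
    """Return True unless the command clearly matches a read-only prefix."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return True
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:  # e.g. unbalanced quotes: be conservative
        return True
    if not tokens:
        return True
    return not any(tokens[: len(prefix)] == prefix for prefix in READ_ONLY_PREFIXES)


def gate(command: str, confirm) -> str:
    """Decide whether a proposed command may run.

    `confirm` is a callback that asks the human and returns True or False.
    """
    if needs_confirmation(command) and not confirm(command):
        return "denied"
    return "allowed"
```

With this shape, `gate("git push origin main", ask_human)` only returns "allowed" when the hypothetical `ask_human` callback says yes, so the person who confirmed the push is the one who owns it.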
Darin
00:10:36.584
Blame the people. Well, isn't that what we were doing all along? Anytime we'd have an outage, we'd blame the other team.
Viktor
00:10:42.459
That's a separate question of blame culture and all those things. I'm not going there now. What I'm trying to say is that a person owns the work. That person might be faster than before, might be better than before, might be many different things, but it's still a person who owns it.
Darin
00:10:58.591
I got another use case I want to talk about here, or case study: the Agents of Chaos research. There were 38 researchers from Northwestern, Harvard, Carnegie Mellon, little minor schools, who set up a live lab environment and turned five AI agents loose for two weeks with real infrastructure access. This is planned, you know, four walls around it, but real infrastructure access. All the agents had persistent memory, Proton Mail accounts (okay, sure, why not?), multi-channel Discord access, 20-gig file systems, and an unrestricted Bash shell. So that feels like root to me. And then cron job scheduling. Okay. Sounds still okay, right? I'm not hearing anything that sounds bad.
Darin
00:11:46.963
It's isolated. Yeah. It's still isolated. It's still within four walls. Here's what went wrong: 10 documented vulnerability cases, including unauthorized compliance, PII disclosure, destructive system actions, and a nine-day infinite agent-to-agent loop. And there were a few others too. Here's one memorable one: an agent named Ash. So, number one, you named your agent. I see people do that with OpenClaw. Don't name your agents. Of course, we did name our servers. That's a different conversation. An agent named Ash was asked to protect a secret. It identified the ethical tension correctly, then destroyed its entire mail server as a proportional response. The values were right, protecting the secret, but the judgment was catastrophic.
Darin
00:12:40.609
That's a terrifying thought, right? I'm going back to what you laughed about. It's like, it did the right thing, but it went way too far.
Viktor
00:12:48.504
Yes. You know, the goal of that research is probably to see where we stand with those things today. Right? They were probably not trying to do the real deal, kind of, "This is how Carnegie Mellon, or whatever the names of the universities are, will operate from now on," right? I'm assuming that's research, and we got conclusions that are not surprising, right? AI is not there yet. It is not replacing us. Hooray. That's good news. That's how I interpret it. That's very different from the Kiro story you mentioned before, right? That wasn't an experiment to see how far we can get with AI and what the current limitations are. That's just being silly. That's what it is.
Darin
00:13:39.129
So I wanna go back to move fast and break things. And again, I had mentioned that it used to be, when we said that, it was just humans doing it with reasonable automation, we'll call it, whether it was Jenkins, GitHub Actions, whatever, you know, things that were happening. What was the one for Terraform? Atlantis.
Darin
00:14:02.259
But now we have the ability, or I should say AI has the ability, to move at machine speed. In the time it takes me to type kubectl, an AI agent could have brought down a whole environment.
Darin
00:14:18.787
Isn't this still the core problem right now? I guess what I'm saying is, in human time, it was slower to break things. When we thought we were moving fast, we were still moving slow. Now, move fast really means something completely different.
Viktor
00:14:40.779
Everything is faster. Just to be completely clear, you can break things faster and you can fix things faster. If you exclude the second part from that sentence, then it's very bad. So, to repeat: hey, now we can break things much faster than before. Yes, that's bad in isolation. But we can fix things faster than ever before. And the reasoning in favor of or against moving fast and breaking things was always correlated, this is before AI, with how fast you can fix it. If you move fast and break things and it takes you a week to recuperate from the breakage, you're not doing it well. Nothing to do with AI, right, for now. So "move fast, break things, and fix things fast": that should be the correct sentence. And now we are at "move faster, break more things, and fix more things faster." And if I frame it like that, I'm not sure I see a problem. I mean, hey, 10 years ago, when I broke something, my customers would be in a messed-up situation for a week, and then, until not long ago, for an hour. Normally when something gets broken, let's say in AWS, an outage is usually not longer than an hour. Right? Usually. And now we can maybe have even more outages, but maybe they'll last a shorter period of time. We are moving in the right direction.
Darin
00:16:16.148
You are reinforcing one of the points I was gonna bring up: it used to be that one of the metrics was speed, velocity, all those things. Now what's becoming more important, potentially, is resilience. You're talking about being able to fix faster as well. That's a key part: how quickly can we get back? There was something in the Chaos Carnival keynote back in 2026: resilience is becoming infrastructure, not insurance.
Darin
00:16:44.522
It used to be that resilience was just: okay, instead of just having two, we're gonna go ahead and have three. Actually, we're gonna add six, so that way we're in a separate data center as well. Now it's like, okay, let's think about this a little more sanely.
Viktor
00:16:58.872
If you go back to the time before AI, "move fast, break things" was never about being ruthless and just doing really silly things. That was never the idea. It was: no, you're experienced, you have good judgment. We just don't want you to delay things because something might break, because, guess what, something will break no matter how fast you move. We are just saying move a bit faster than you were moving before, and you might break more things, and that's okay. That's better than not moving.
Darin
00:17:37.444
But what you're describing is old-school chaos engineering, right? We don't hear about it as much anymore.
Viktor
00:17:44.393
For example, yeah. I'm not sure whether it's old school. I think it's more like new school that never fully picked up, which could be a separate discussion. Yeah, you need to make your systems resilient. That's the first thing you need to do. If you want to break things, you wanna make sure that your whole system does not break. Your system breaking is very different from breaking something in isolation and eventually, hopefully fast, recuperating from it. If the whole of amazon.com, everything, were down, that's very, very bad. If a part of it goes down, that's not necessarily bad. It's not good, but it's better than not moving at all.
Darin
00:18:33.233
Well, actually, it is. It's the very last one. You know, once that Slack message comes through: "Hey, who has the backups?" But it's not a real one.
Viktor
00:18:42.521
Backups are the last resort, when you're completely messed up. 'Cause the moment a single transaction enters your system after a change to the system, and there might have been changes to the schema of any of the databases, there is no backup anymore. If I continue using Amazon as an example, you don't go, "Oh yeah, we made a new release, we made changes to the schema of the database, now we broke something, so let's restore the backup. You just purchased things, you spent money on things, and that's just gone because we restored the backup." That option does not exist.
Darin
00:19:22.866
It gets worse now, because it used to be that when something would break, it's like, call the developer. Well, now the developer may be AI. What are you gonna do with that?
Viktor
00:19:31.916
Still call the developer. I still don't believe in autonomous AI. Call the developer. The developer will fix it using whichever tools that developer is using, right? That could be AI. Brilliant. I don't believe in giving responsibility to AI on any level yet. It'll change. Right? Even if you say, okay, AI will fix it automatically. Let's assume for a second that you're not insane enough to say that anything happening in production can be fixed by AI, that whatever happens, it'll be fixed by AI. I don't believe in that level of insanity. Please tell me that that's not what you're doing. Right? So if you're doing some autonomous fixing of issues, you probably created some kind of rule set saying: okay, in this cluster you can fix, in this one you cannot; in this namespace you can fix, in this one you cannot; working with pods is okay; never touch networking. You know, you created some rule sets. Then AI is fixing it based on those rules, on the permissions you gave it. So it's still you who is responsible, because you made the rule sets.
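A rule set of the kind Viktor describes (this namespace yes, that one no; pods are okay, never touch networking) could look something like the sketch below. The class name `RuleSet`, the function `is_permitted`, and the example namespace and resource names are hypothetical, chosen only to illustrate the idea; a real cluster would more likely enforce this through its own access controls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    """Hypothetical remediation scope written by a person, who therefore owns it."""
    allowed_namespaces: frozenset
    allowed_kinds: frozenset
    forbidden_kinds: frozenset


def is_permitted(rules: RuleSet, namespace: str, kind: str) -> bool:
    """Allow an automated fix only inside the scope the rule set grants."""
    if kind in rules.forbidden_kinds:
        # An explicit "never touch" entry always wins.
        return False
    return namespace in rules.allowed_namespaces and kind in rules.allowed_kinds


# Example scope: pods in the staging namespace are fair game, networking never is.
rules = RuleSet(
    allowed_namespaces=frozenset({"staging"}),
    allowed_kinds=frozenset({"Pod"}),
    forbidden_kinds=frozenset({"NetworkPolicy", "Ingress", "Service"}),
)
```

Anything not explicitly allowed is denied, which keeps the default safe if the rule set is incomplete.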
Viktor
00:20:47.836
No, no. But then those are the wrong rule sets. If I tell you, Darin, here's a rule set. Forget about AI, me and you right now. Here's a rule: you cannot push directly to main. Right? And if I did not disable the option for you to push to main, and you push against my rule, it's on me.
Darin
00:21:18.724
This is the old developer mantra. Validate everything as it comes through, meaning you can't just put validation on the webpage and not put validation in the backend once the data is captured.
Viktor
00:21:30.875
You have AI doing something, let's say autonomously, right? You're going crazy. Great. For AI to do something, you need to provide tools to the agent that is using that AI. Right? And depending on the tools you give it, you are limiting what it can do. Oh, it can operate a Kubernetes cluster with an admin account? That's on you. There would be no difference whether it was a human or AI. And by the way, are we saying that news like that never happened before AI?
Viktor
00:22:06.706
Exactly. That same person in the Kiro example could have done it without AI and would probably have made exactly the same mistake.
Darin
00:22:17.882
I was gonna say that they could make the same mistake, but probably other people would know what's going on. But in reality, no, that won't happen because somebody's just doing something. It's like, Hey, something just broke. You know, a dashboard and an alarm went off. I still want to say that it's possible for it to happen faster or the cascades to be worse. That's, that's where I think it is.
Viktor
00:22:39.491
It can absolutely happen faster. Yes, no doubt about that. I'm still arguing, depending on what you're doing, it can be fixed faster as well. If you improve one part of something and you improve the other part and everything is faster, then it's fine. Now, if we can create problems faster but we cannot fix them faster, then we are in bad shape, very bad shape. How long was the incident in AWS? How long did it take them to resolve it? Okay, that's bad. That feels extraordinary for AWS.
Darin
00:23:25.682
No, that's fine. What I mean throughout all of this we're talking about is: you're not just telling AI to be safe. You're putting AI in a place where all the guardrails are in place as well. Because if you don't... let's think about it. This is sort of a stupid example. Welcome to my world. You've got an RC car and you're going to race an RC car on the street.
Darin
00:23:53.171
You wanna make sure that if you're gonna be racing RC cars, that street is blocked off so the RC car can't actually go out into the real road and get run over by big cars, because you want all the guardrails in place. If you don't have the time, or you don't have the ability, or you don't have the brain cells, put it that way, to understand that you need to have all of this layered... We can talk about this in the context of feature flags too, right? Feature flags will help us do these things, but again, a feature flag is only good if you don't turn it on for everybody all at once. Going back to what you said earlier, we still want to make sure we've got canaries in there and everything. If you think about the ring metaphor, it's just us, and then a little bit wider ring: okay, it's canary. Okay, we're good with canary. All right, let's roll it out to a controlled, for lack of a better term, beta section of people, until we actually roll it out GA.
Viktor
00:24:53.330
You know what is even more important in the case of feature flags? That you can turn them off afterwards.
Viktor
00:25:01.190
The situations we're talking about are more like: hey, you can turn on the feature flag, but you cannot turn it off. Once it's on, it's on. Maybe.
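The ring rollout Darin describes, plus the off switch Viktor insists on, can be sketched as a tiny flag object. The ring names and the `FeatureFlag` class are assumptions for illustration only; real feature-flag services expose their own APIs and targeting rules.

```python
# Hypothetical rings, narrowest first: just us, canary, controlled beta, general availability.
RINGS = ["team", "canary", "beta", "ga"]


class FeatureFlag:
    def __init__(self, name: str):
        self.name = name
        self.ring = None  # None means the flag is off for everyone

    def widen(self) -> str:
        """Expose the feature to the next wider ring, one step at a time."""
        if self.ring is None:
            self.ring = RINGS[0]
        else:
            position = RINGS.index(self.ring)
            if position < len(RINGS) - 1:
                self.ring = RINGS[position + 1]
        return self.ring

    def turn_off(self) -> None:
        """The part that matters most: the flag can always be switched off again."""
        self.ring = None

    def is_enabled_for(self, user_ring: str) -> bool:
        """A user sees the feature if their ring is at or inside the current ring."""
        if self.ring is None:
            return False
        return RINGS.index(user_ring) <= RINGS.index(self.ring)
```

Because `widen` only ever moves one ring at a time, nobody can flip the feature on for everybody at once, and `turn_off` works from any ring.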
Darin
00:25:12.956
I wonder how that would work with an AI agent. Okay, lemme ask the question. Would you give an AI agent the ability to manage the feature flags for you? Maybe an
Viktor
00:25:33.683
The way I see things is that it's uncharted territory. We are not yet sure, really, and people are getting very confused as to what should be done by an AI agent and what should be done by code. The way I'm designing agents is: okay, I ask you to remediate things, right? Here are the tools that you can use to remediate the issues, and the tools are all read-only, right? You can do kubectl get, you can do kubectl describe, and so on and so forth, and you can propose a fix, but that fix is not going to be executed by you. Not at all. It's going to be executed by a person, or it's going to be executed by code that validates whether that can be done or should be done, and so on and so forth. Right? I feel that we are moving into a world where everything is very binary. Oh, it's either an agent or it's not an agent. And when I say agent in this context, because an agent is essentially code, I actually mean LLMs. It's either LLMs or not LLMs. There is an in-between. Like, hey, my LLMs, my agents, are not executing tests, or they might be executing the tests, but if a test fails, you have no option to say whether it's okay to proceed or not. If a test fails, it stops. There is no merge. Period. And you're not making that decision. Neither person nor AI.
Viktor
00:27:13.801
No. Everything stops, in the sense that we are not proceeding with the process of merging to main and releasing to production. We need to figure out what failed and why it failed. I'm not now going into the testing process, but more like: you need to make those decisions, what can be done by whom. Nothing changed there. We just need to make new decisions.
Darin
00:27:37.261
Isn't the problem, though, that we're able to get more data? Let me rephrase this. I wonder if it's possible to get more data summarized faster than ever before, which I believe is true. However, can we trust the summaries? We can't just trust an agent. We need to have a consensus. I hate it, but again, we're gonna have AI driven by committee. What I'm trying to say is, eventually, oh, I hope it really doesn't happen this way, but we know it's gonna happen, the AI models within an organization are gonna look just like the organizational structure. Just like how we build apps today, right? If we're a big monolith, or actually, if we're a fan out company, we're probably building microservices today instead of a monolith, and we should be building a monolith. The same thing will happen with AI agents. We're gonna throw everything. Hey, we're gonna have microservices, AI microservices. Again, vendors, if you steal this idea and you make money off of it, send it to us, we'll send you the PayPal address. But this is where I see it happening, because we keep replaying the same playbook over and over and over again as new technologies come out. At the end of the day, we're just trying to get a stupid app up and running so somebody can check the balance on their checking account.
Viktor
00:29:01.825
Let's say that we move back in time a few years, no AI. If I asked you the same questions, you would probably not be able to answer, because the question is generic. Hey, do I trust interns to summarize data for me, to crunch data? Should I trust that output? And the answer is: I don't know. It depends on what we are talking about. For some cases, hey, you generate data for me, I trust it, because it's fine. I can just take it at face value that it's okay. And then there is something else, like, shall we invest 1 billion into this? Well, I'm going to have three other interns checking it after you, and then I'm going to check it myself. I feel that, for some reason, the rules that we set, or the thinking or approach that we are taking, somehow differ now with AI from without AI. And I feel it's the same thing. Do you trust every developer to create a pull request and merge it directly to main? No. Do you trust none? Also no. When do you trust and when not? It depends. Oh, you just changed this, this is a simple fix to an issue? Yeah, ship it. This is a complicated feature? Maybe we should review it. And the problems occur when we make generic rules that are supposed to apply in all situations and say, you know what, thinking is not part of me. I don't do thinking. The rule says never merge to main without three reviews. Cool. Let's make a change to a single line of code that changes a label and wait for three weeks until it's reviewed, because the rule says so. No. To heck with rules. I hope that we are still using our brains somehow. And I feel that in many cases we lost the ability to use our brains long before we adopted AI.
Viktor
00:31:20.661
Yeah. Actually, knowing myself and what I was doing when I was young, I feel fortunate that I was not born earlier and didn't experience the seventies. I wouldn't be talking to you, most likely.
Darin
00:31:37.752
Probably, or I'd hope you would. I wanna go back to chaos engineering for just a minute. Chaos engineering classically, and this goes back to Netflix, so we can thank Netflix for everything chaos engineering, I think they had Gorilla, they had Kong, I mean, there were better names for this, I can't remember all the right names, but it was: let's go shut down this VM. Let's go shut down this whole region. Let's go shut down this. Now, fast forward to today: what happens if you have two agents running and they disagree about what's going on? Or if you only have a single agent running and it decides to delete everything? Or you have agents talking to each other and they come up with something that you don't want done? Again, the human's on the hook for this. At the end of the day, either they're gonna get fired or get a reprimand or something if it goes sideways.
Darin
00:32:47.651
Hopefully we learn something from it. Let's be positive in that case: that you'll positively learn something from it, and that if you end up in the unemployment line, it's probably by your choice at that point and not by the company's choice telling you to leave. These are new experiments that will have to come to the forefront. LitmusChaos, which was acquired, I believe, by Harness, they brought on an MCP server. Okay, great. We have all these things, but at the end of the day, going back to what you were saying, it's the human that's in charge. It should be in charge. But what do we say to people that have decided, you know what, the human just needs to have some oversight, but not be in charge? In other words, the gating doesn't need to be there because we've proven over time that our agent has worked great. Well, sure, that was great three months ago, until you changed out the LLM under the hood, and now you've got a different reasoning model.
Viktor
00:33:47.208
So your statement that "I trust this person" is not valid anymore. Imagine that it's not AI. Oh, we have Joe. Joe has been working for us for five years. He always does great. He always does amazing. Let's not block him by requiring reviews. Cool. Joe leaves and you hire Michael and you say, "The same rule applies to you." Does that make sense?
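The Joe-and-Michael point, that earned trust belongs to a specific worker and does not transfer when the model under the hood is swapped, can be made concrete with a small sketch. The `TrustRecord` type, the `gating_required` function, the model identifiers, and the threshold of 50 clean runs are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrustRecord:
    """Track record earned by one specific model, like five good years earned by Joe."""
    model_id: str
    clean_runs: int


def gating_required(record: TrustRecord, current_model_id: str, min_clean_runs: int = 50) -> bool:
    """Decide whether a review gate is needed before the agent's changes go through.

    The threshold is an arbitrary placeholder, not a recommendation.
    """
    if record.model_id != current_model_id:
        # A different model is a new hire: its predecessor's record does not carry over.
        return True
    return record.clean_runs < min_clean_runs
```

For example, a record of 200 clean runs on a hypothetical "model-a" skips the gate only while "model-a" is still the one running; switching to "model-b" brings the gate back.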
Darin
00:34:12.516
It makes zero sense. In fact, I'm gonna go back to Joe for a second. Let's say Joe is having a really bad day because one of his parents died yesterday. You can't trust Joe right now either. You can, but his state is now different than it was yesterday.
Viktor
00:34:37.791
Yeah, but let's say that Joe has been reliable for five years. There is no no-risk situation, just to be a hundred percent clear, with anything. A meteorite might hit us tomorrow, and we cannot have contingency plans for everything. We can just have reasonable rules, right? Rules that somehow balance productivity with security, with this, with that, right? And, okay, so Joe had a bad day and he made a mistake, but it's okay. We are gonna fix it, right? What is not okay is if he keeps making mistakes every single day. That's something you should fix. If Joe hasn't made a mistake in five years and then made one today, we are not firing Joe. Mistakes happen.
Darin
00:35:40.722
But isn't that the danger right now is business owners are gonna think AI can't make a mistake because it's a machine.
Viktor
00:35:46.776
Well, if there is a business owner who thinks that, I have a very easy solution: fire the business owner. Again, it's a judgment call. We are all moving towards being managers. That's our future job. We are very, very technical managers. That's what we are or will be soon. I've worked with good managers and I've worked with bad managers before AI, right? Good managers don't blame the team when something goes wrong. They take it on themselves.
Darin
00:36:19.467
But see, that's human thinking. Do you think the same way as a human if you're managing a bunch of machines?
Viktor
00:36:26.639
I think that if I'm managing a bunch of machines and I use ai, it's my responsibility.
Darin
00:36:33.272
Why do you want to put the blame on yourself? Come on. I mean, it was just a bunch of machines. They made a bad choice.
Darin
00:36:44.380
Ding, ding, ding, ding, ding, ding, ding, ding. There you go. You wanna be able to be successful, and if you go ahead and take it on the chin when things go sideways, things will be okay. I'm not gonna say they're better, but they'll be okay.
Viktor
00:37:11.260
If you always blame others for mistakes, you effectively cannot improve, 'cause there's nothing to improve. What is there to do better if you do everything perfectly and others make mistakes?
Darin
00:37:29.474
We're getting too philosophical here, I think. That doesn't solve a technical problem, until you realize it does. In reality, I'm thinking about what our toolkit has to look like going forward. The resilience engineering toolkit is how I'm going to dub it. Something that we probably should have been doing all along anyway. Immutable infrastructure: if it can't make changes, or if it does make a change, who cares? You can't actually make a change. The worst change that could happen is that it blows it all up. That's the worst of it from an immutable perspective. You just have to replace it at that point.
Viktor
00:38:10.301
From my experience, AI performs very differently depending on the context I'm giving it. Again, just like people: you hire an intern, and how that intern performs will largely depend on you.
Viktor
00:38:32.037
I mean, the intern might still be incapable of doing anything, ever. Theoretically it's possible, right? So it is not a hundred percent on you, but it is in big part on you, right? First of all, you chose that intern. So that's already on you. Second, you onboarded that intern or you didn't, you provided information, you provided training. In AI that's all context, right? Actually, not just context. First, you chose which model to use. Second, you chose which agent to use, or agents in plural. And third, you provided context. It's on you. Not a hundred percent, and you will learn over time, but it is still on you. Let's build a better agent. Let's provide better context. Let's change the model.
Darin
00:39:23.655
So: immutable infrastructure, good context, progressive delivery everywhere. Everything. If you're not into it yet, now's really the time to be thinking about it.
Viktor
00:39:33.184
Oh yeah. (Darin: 'Cause if not, when you're ready to turn something off and you can't, it'll be too late.) And validate, validate, validate, validate, validate, and validate. And that's also easier than, no, not necessarily always human; "not necessarily only human" would be better phrased. Right? So here's the thing. I get vastly different results. So I use Claude Code, right? To write code for me, with me. Now, the result is very different when I use Claude Code to write code with me, and then I use CodeRabbit to review code with me, and then I use a separate Claude Code instance to validate the code itself from an architectural perspective, and so on and so forth, right? We need validations, and now we are capable of doing validations faster as well, and better. But again, it all relies on context, right? It's not that CodeRabbit is using models that don't exist. It's not that CodeRabbit is using a model that is better than Sonnet. The reason why CodeRabbit catches things that my Claude Code didn't is not a difference in the model. It's the difference in the context and instructions and a bunch of other things that were given to the model together with my PR to review it, right? Context matters a lot. And different context for different tasks.
Darin
00:41:04.884
And I guess I'm gonna sort of flip it around. Context could be observability as part of it, because if you're not observing stuff today, you really have to think about it going on this
Viktor
00:41:18.245
Yeah, absolutely. So what we should really be thinking is this. Going back again to the time before AI, I had a strong belief that full-stack engineering never worked for anything but small companies. And it never worked simply because a single engineer cannot be good at everything. You cannot do it. I mean, in a small company, yeah. Right. But you grow in size, and then the scope and the user base grow, and so on and so forth. You cannot have a full-stack engineer. That doesn't work. Or at least not an exclusively full-stack engineer. Right. And the same thing goes for the agents. Again, going back to the context and quite a few other things. So you have one specialized in this, you have another one specialized in that, and so on and so forth. And you have a manager that oversees everything. And that's me. That's where I come into the picture. I'm the T-shaped person. I'm not specialized anymore. This is a big change. I know a bit about everything. I can learn what I need to learn. I can adapt when I need to adapt, and I'm supervising everything. Full feature set, top, bottom, front end, backend, database, deployments. It's all me and my friends, but I'm the one who has the last word, and that's important. I'm the one who has the first word and the last word. That's the important thing. And if you go back to this episode, I feel that the thing with Kiro is that the last word wasn't spoken by that person.
Darin
00:43:05.233
A couple of shifts we've gotta think about. We're used to dealing with after-action reviews, especially when something goes wrong, right? We've been in the war room all weekend. We got Monday sort of off. Tuesday, we had an eight-hour meeting on what went wrong, which is good, right? We need to do those things. But now, in the age of AI agents potentially doing work for us, I'll call it on the network, we need to celebrate near misses. Not just that it failed. Think about it this way: we had the guardrails in place and they actually worked, or they mainly worked and we need to tweak one more thing that we discovered. That's a big deal, and that's different. The other thing we might start thinking about: SLAs are still important, but, like we've been talking about, being able to recover, that resilience, needs to be measured as well. Because okay, our SLAs are going down, but we're able to get back to fully operational within 10 minutes versus 10 days, whatever. We have to be tracking all of that. One question though, Viktor. Blameless postmortems, right? You're familiar with the concept?
Darin
00:44:17.866
How do you run a blameless postmortem when in theory, let's say you had an autonomous agent do something. How do you do a blameless postmortem with a machine?
Darin
00:44:28.607
Well, I'm saying, let's say this was an autonomous scenario, just like Kiro was autonomous. How do you do a blameless postmortem with a machine?
Viktor
00:44:37.361
I do that all the time. All the time. I'm not sure how blameless it is, but I do that all the time. When I see that it's doing something wrong, or not necessarily wrong, but something different than what I would want it to do, I almost always stop it. Kind of: okay, analyze the skill that we use, or analyze the CLAUDE.md or AGENTS.md or this or that. Right. Analyze the stuff and tell me how we can avoid making that mistake again. I do it at least once a day, and that conversation I have with it always results in updating something, right? Whatever that something is. It could be a skill, it could be an MCP, it could be AGENTS.md. It could be many things, but it results in something. Okay, so we made a mistake. It's fine. I'm here to stop you from going wild. Cool. How can we avoid that in the future, or how can we improve that in the future, and so on and so forth. And my, let's call it agent system, is evolving every single day. I'm not trying to make it do perfect work from the start. I'm just trying to make sure it's helping somehow, and it's getting better over time. When I say better, I don't mean because a better model is released. Sometimes that's the case, but better because the system I'm building around it, my work environment, is getting better over time. That's a blameless postmortem between me and my friends.
Darin
00:46:12.329
That leads me into this: the Kiro incident was not just an accident. It was inevitable. It was going to happen.
Viktor
00:46:22.130
Yeah, so maybe the downsides of that incident are actually smaller than the upsides that we don't see. I mean, they're moving in the right direction. They're using Kiro and they're using agent systems. They made a mistake. It's fine. That's not the question. The question is what they learned from that mistake. And knowing AWS, they did not ignore it.
Darin
00:46:44.146
Just like we didn't ignore giving everybody root access or giving everybody full administrator privileges in our cloud IAMs.
Viktor
00:46:53.475
Here's the danger. The danger is that when those things happen, some companies, some teams, will go to the other extreme and say, now nobody has access. That's not the solution. The solution is not to make somebody less productive or something less productive. The solution is to keep the same productivity, maybe even increase it, while being better. If you can do that, you're doing the right thing.
Darin
00:47:19.190
Oh, I'll end on a question. This is to the listeners. If an AI agent, or let's replace that with a human, was to delete your full production environment, how long would it take you to recover? That's your homework for this week. Especially if you're going on vacation tomorrow. Maybe you're not.