Jim 00:00:00.000 Look, I was in IT operations 15, 20 years ago, and I still talk to people in IT operations, and the stories are all the same as my experiences from 15 to 20 years ago.
Darin 00:01:16.712 Viktor, when was the last time you crashed a website?
Viktor 00:01:19.232 I crashed a website, somebody else's website. I'm not that powerful, man.
Darin 00:01:24.807 Are you sure? Are you really sure?
Viktor 00:01:27.252 I mean, it's not that I haven't tried. I tried to crash Google, but it doesn't crash because of me.
Darin 00:01:33.672 Okay, well, let's flip it around. What if you were the one sitting in the chair having to deal with somebody like you trying to crash somebody's site? How would you feel?
Viktor 00:01:42.562 Oh, change the job.
Darin 00:01:44.052 You'd change the job? How quickly would you change the job?
Viktor 00:01:50.547 Immediately. Or I would claim that I'm hit by a bus or something like that. Incapacitated.
Darin 00:01:56.972 Okay, hit by a bus. We don't wanna be hit by a bus. On today's show, we have Jim Hirschauer from Xurrent. Jim, how you doing?
Jim 00:02:04.228 doing great. How about you, Darin?
Darin 00:02:06.043 Very well, thank you. Did my story there make any kind of sense, at least the second half of the story? Me being the grunt, a fire breaks out, and I just quit.
Jim 00:02:15.988 very much so. I mean, I've been that person on the other side of that rage clicking a website while it's not working properly. It's not a good feeling.
Darin 00:02:23.416 Not a good feeling. So that's the consumer side of it. But as an SRE, or whatever the title is of the month, it feels like... we have an incident happen, right? So all of a sudden we get a big DDoS, or fill in the blank, it doesn't matter. And now we have the fire drill. We get everybody in the office. No, they're not paying for pizza today, because the budget...
Viktor 00:02:46.975 ...is spent on AI.
Darin 00:02:48.595 I'm sure
Jim 00:02:49.775 is,
Viktor 00:02:50.905 That's why there is no budget.
Jim 00:02:54.605 We got there way too quickly.
Darin 00:02:56.805 Right, that was way too fast. If this was a drinking game, there was your first one.
Jim 00:03:03.930 right.
Darin 00:03:04.685 Nothing has changed in incident management, has it? I mean, other than, okay, AI is starting to mix in, but let's ignore that for just a moment, or maybe a few moments.
Jim 00:03:13.597 Look, I was in IT operations 15, 20 years ago, and I still talk to people in IT operations, and the stories are all the same as my experiences from 15 to 20 years ago. There are still war rooms, there are still disconnects, there are still silos, no matter how much we try to break down those silos. They're still there as far as I can tell. So yeah, I think we're still doing a lot of the same things during incident management that we've been doing for a very long time.
Darin 00:03:43.882 Will that ever change? Well, we all hope so. I mean, some of the tools have gotten better over time, right? Because 15, 20 years ago, we didn't have HipChat, rest in peace. We didn't have Slack. We didn't have all these. We might've had Skype. We might've had AOL Messenger or Yahoo Messenger, but that was about it.
Jim 00:04:05.782 Yeah.
Darin 00:04:06.652 And a telephone. Do you remember what a telephone is?
Jim 00:04:09.412 Oh, I remember the tragedy of rotary dialing on the telephone, when you would mess it up and have to start all over again. I still have nightmares about that.
Viktor 00:04:20.332 Let me play devil's advocate. I think that many of the problems we had back then are solved today. It's just that the problems are very different now. Whatever issues you had in production 20 years ago, those are easy to solve now. The issues that we are having now are very, very different. So it's not that we haven't moved on. It's just that some influencer's hobby site these days has more traffic than whatever you were working on 20 years ago.
Jim 00:04:53.639 Yeah, for sure. No doubt the problem space is bigger. Trying to solve the problems has gotten much more complicated. We have much more complexity. You guys know this well, right? There's all these virtualization layers that factor in now that people have to troubleshoot. But one of the things that I think is still very similar across this is the process that's used, and how there are various stakeholders across every company that need to be updated, and they all need to be updated in different ways. I remember being in war rooms and having executives drop in and just stop everything that was going on, and, sometimes nicely and other times not very nicely, demand an update as to where everyone was. And so we all had to stop what we were doing and provide that update one by one to those executives. Everyone's under tremendous pressure, so I get the reasoning, but we should be able to fix those things. And again, in the conversations I have when I talk to people about this, they're still saying, yeah, we still experience all of these things. We experience the siloing, the problematic communication, even though we have more communication tools than we've ever had in the past, and they're a lot better. It's still not a fixed problem when you consider the end-to-end workflow of incident management. So I wholeheartedly agree with you, Viktor, but I think there are also some things where we're just kind of stuck in our old ways.
Viktor 00:06:26.736 So is the problem then, realistically, human nature?
Jim 00:06:30.786 part of it.
Viktor 00:06:31.566 Kind of not being able to, you know... A long time ago, and I'm not going to use the real name, let's call her Nina, I wrote in my incident report that this took much longer because of her coming to ask for the status,
Jim 00:06:48.046 yeah,
Viktor 00:06:48.342 that was the real problem.
Jim 00:06:49.366 Yeah, absolutely it is. It is a cultural thing, so for sure there are cultural aspects to this. Tooling can help. There's authority: during an incident, an incident commander has a tremendous amount of authority to make sure that the service gets restored as quickly as possible, whatever's going on. But then typically, after that incident is over, when everybody high-fives after we've restored service, we all have a mountain of work to go back to. We have our email queue or our Slack messages, or we're still behind. We got even further behind than we were before this war room exercise, and we've got to go and try and catch up now. And so a lot of times we don't go back and fix a bunch of the things that we really know we should go fix. The things that the postmortem has said: hey, these are root causes, these are things that you need to circle back on and make sure this doesn't happen again. And so we see a lot of these repeat incidents. I ask about this every time I go and talk. I still talk at conferences, and 75% of the room raises their hand when I ask, do you have the same incidents happening over and over again? So I believe this is still a pervasive problem throughout the industry.
Darin 00:08:08.293 If people are having the same incidents over and over again, we're at that definition of insanity. Is the problem that the people who had the incident before have all left? So now we have a new set of people, and it's new to them, other than the history that it's happened before.
Jim 00:08:26.393 Yeah, that's a great question, right? Tribal knowledge. Again, I think back to 15, 20 years ago, and tribal knowledge was so important, and being able to capture that tribal knowledge and pass it along. So now we have tooling where we should be able to capture that tribal knowledge. I fear bringing up the term AI here because of the conversation we just had about the drinking game, but that should help us, right? We're kind of moving into a new frontier here where tribal knowledge doesn't have to leave a company when a human being leaves a company. Now, hopefully, we can do a much better job of maintaining that tribal knowledge, and ultimately the goal is to make systems as reliable and resilient as possible. I think there have been huge strides over the last 15, 20 years, I'm not saying there haven't, but we still really have a lot of work to do in this area. And I think that's a combination of culture, it's a combination of tooling, it's getting our processes right, it's getting our technology utilized in the right way and aligned in the right way to really focus on solving some of these challenges.
Darin 00:09:39.623 So all those after-action reviews that people have written after every incident,
Jim 00:09:44.428 Yeah,
Darin 00:09:45.048 they always get stacked up in a directory somewhere, stacked up in a wiki somewhere, and were never analyzed to figure out what's going on. We didn't have the tooling then to do that. Again, take a drink: AI can help us with that.
Jim 00:09:59.508 yeah.
Darin 00:10:00.348 Right. That's gonna be one of the things, that summarization stuff. Forget all the gen AI stuff. I couldn't care less about that, because that doesn't help me out a lot.
Jim 00:10:11.791 Mm-hmm.
Darin 00:10:12.496 But being able to feed it text and have it correlate without me having to think about it? Hmm, that's pretty darn useful.
Jim 00:10:20.926 Yeah, definitely. I think that's where we're headed. Again, I think we still have a long way to go. Everything AI, in my opinion, is really in its infancy still, and I don't get much disagreement when I say that. Things are very simplistic right now. We see a really great future forming in front of us, and a lot of companies are doing a lot of good things to push that along. But we need to focus on problems, right? That's the key here: okay, we have this problem that we've identified, and now we're gonna focus on solving that challenge. That's something we're doing at my company. We're heavily focused in this area, and of course it's near and dear to my heart. That's why I'm a part of it, because I was a part of IT operations for so long. And these problems need to be completely eradicated going forward. Incidents will still occur, but wouldn't it be amazing if we could, from every single incident, stop that same type of incident from happening ever again for that service, so you don't have to worry about that one?
Darin 00:11:20.462 I think back to AWS in the great 2012, '13, '14 EC2 outage. I can't remember the exact year anymore. It happened once, and it happened twice,
Jim 00:11:32.137 Hmm.
Darin 00:11:33.037 but that's been very few and far between. I think about what probably happened under the hood for them. I have no real knowledge of this, but they had some runbooks, whatever a runbook looked like for them, and they probably broke them out when the second time happened and started playing them out to make sure nothing else occurred that way. I think back to runbooks in the eighties and nineties. It was a Word document, or excuse me, a WordPerfect document, 'cause Word didn't exist.
Jim 00:12:02.020 Yeah.
Darin 00:12:02.741 But it would get out a date we'd think to update it maybe. Because we'd always have the same issues over and over again. 'cause we were in our own data centers. Now with this cloud thing, it's like things moving all over the place. Microservices on top of that. It's like, come on, the sprawl of our estates have become just probably our biggest problem and we've done it to ourselves.
Jim 00:12:27.072 It's interesting when you bring up runbooks. Again, thinking back to my IT ops days, the development team was usually running behind in development. And it wasn't the development team's fault; the goalposts kept getting moved by the business. So naturally, as the business keeps changing what they want in their application, the development team would get further behind. And so when applications would launch, runbooks would be very sparse. There would be some level of information in those runbooks about troubleshooting procedures and what to do in certain circumstances, but they were pretty sparse in almost all cases to start out with. And then as the application got changed and upgraded, those runbooks didn't get updated very frequently, so the runbook information would get outdated as well. I think that's another great example of an area where we can improve today. AWS is an interesting example because they have a lot of resources, right? Most companies don't have the level of resources that an AWS or any of these gigantic providers have, so most companies need some level of assistance there. That's another real problem where, again, I think we're on the cusp of seeing some great solutions. There are definitely already some attempts at creating runbooks automatically based upon the information that's gathered during an incident. There are multiple different software vendors out there that do that, so I think that's a huge step in the right direction. And that's one of the bookends of this problem that we're really talking about. When I think about incident management and incident response overall, it's pretty easy to break it down into four big phases: you have the pre-incident work, runbooks essentially; you've got detection, with observability systems and your alerting, just identifying that there's some sort of issue going on; then you've got the resolution step; and then you've got the post-incident work to really do the break-fix and all of the post-mortem tasks to make sure this doesn't happen again. And the industry has gotten really good at those two middle steps, right? Detect it and restore service. We're really good at that; we've gotten good over many years. But we've sort of neglected the steps that bookend them. We've neglected the pre-work, the runbooks, and we've definitely neglected the post-incident work. Again, not because we don't want to make it better, but because the business is demanding of our time and we're all essentially overworked.
Darin 00:15:14.365 So can't we just shut the business down for a week and get all caught up?
Jim 00:15:18.745 Well, ask the business leaders how they would feel about that.
Darin 00:15:22.675 Just tell them they can go play golf for a week.
Jim 00:15:25.135 Hmm.
Darin 00:15:25.555 us alone.
Jim 00:15:26.815 Yeah. I'd take it.
Darin 00:15:28.555 So pre-incident and post-incident, that's where you're saying we're falling down.
Jim 00:15:33.870 Yes, I think there are huge issues there. Of course, during those two middle steps there are always opportunities to improve, right? We're constantly getting better at that. Again, we're good there. So yes, we're falling down in those other two steps, the pre-incident and the post-incident, and we're falling down during the incident, especially with communicating out, even though many companies have status pages now, which are great. I love status pages; I use them all the time. A great example of this: the website vendor for my company. I tried to go on and make a change, and I was having trouble. The user interface was slow to respond, so I had to go look at their status page and get some status updates from them. So that's great, I got to see the status update for the end users. What I started to wonder about is how they were updating their executive staff. Viktor, you mentioned there was a person who slowed everything down because they wanted updates, and so I started to wonder, how did they update their internal executive staff? 'Cause I guarantee they were asking about it, right? They wanted to find out, when is this gonna be restored? Because nobody could go in and change their websites.
Darin 00:16:43.208 But in that case, and I had my variations of Nina as well, those were always the funny ones. Okay, we'd just gone through, we'd done a dump, and they walk in two minutes later. It's like, hey, I'm here now, let's go ahead and have the meeting. It's like, we just had it. Do you want us to get back to work and fix it, or do you want us to tell you again? No, I need you to tell me what's going on. Okay, well then we can't work on it.
Viktor 00:17:08.788 I feel that part of the problem is that tools are generally created to speak a single language. When I say language, I don't mean literally English, but, hey, this is the tool that ops people use, right? They understand the inputs and outputs of that tool perfectly, and you cannot give it to anybody else. So Nina has no use for whatever you are using, whatever you're doing. It's not the language that Nina speaks, so of course she's going to come and ask you, so what's the status? I mean, I see those 500 pages of something, I have no idea what's going on. And what I'm saying now is not even directly related to incident management, right? We see that over and over again. Different roles speak different languages, and they don't understand each other. That's okay, because, hey, I don't understand all languages either. But tools tend to be very focused on the buyer, almost exclusively, not on all the potential users of that tool. I'm being philosophical now. I'm not sure how much this made sense, though.
Jim 00:18:17.304 No, that makes perfect sense. And that's exactly one of the core issues that we still have today. And again, I think this is a great opportunity for us to utilize that two-letter acronym, AI. It's a great opportunity to take the things that IT operations understands really well and convert them into information and an update for the various stakeholders, because yes, there are distinctly different stakeholders across the enterprise: the executive level, the application owner level, the practitioner level, the end user level. At our company, we have software that the service desk operates off of, and so they get end users calling in, whether that be internal users from the company, or external-facing actual end users that are consumers or customers. There are people actively looking for updates: hey, my stuff's broken and I need to know what's going on. I need to be able to trust that you're going to give me an estimate as to when this will come back up. Maybe it's bill pay, and I have a bill that's due and I've gotta pay it today. Am I gonna be able to pay it today? So everybody needs an update from their own perspective. I think this is what becomes really powerful once you start to put all of this data together across incident management: from within the platform where the incident is being worked, and often that'll be through Slack or Teams, because the modern tools can work just through those. So Slack or Teams will be an interface into modern tooling that is doing the actual incident response, across into the service desk software that is getting a lot of information about how many users are impacted, how many people are calling in, how many people are in the chat bot, how many people are sending emails, whatever communication method they have. If you put all that data together, now you can actually create much more meaningful updates for the various levels. The executives can really get a much better understanding of how upset our customers are right now, because the service desk bears the brunt of that, not the team working the war room at that moment.
Darin 00:20:39.653 So it's a sliver of time. Let's lay out the four again: pre-incident, detect, restore, post-incident.
Jim 00:20:46.313 Mm-hmm.
Darin 00:20:46.733 The actual incident is in between detect and restore. It's a small sliver of what's actually going on. It seems like detect should always be going.
Jim 00:20:56.903 Yes.
Darin 00:20:57.488 what we detect should hopefully, well, that, that's my point here. Detect should always be going on and als always should be feeding back into pre-incident because to me, just thinking out loud, I should be able to see what my, if it's a, an application where it's web or client server, guess those still exist. We should be able to at least see standard patterns emerge. And that should feed back into pre-incident. And that way if that curve changes, pre-incident, should start yellow flagging things. If something goes awry.
Jim 00:21:35.102 Yes. In an ideal world, absolutely.
Darin 00:21:38.322 Why do we keep saying "ideal world"?
Jim 00:21:41.847 It's because I have battle scars from trying to implement these solutions at so many companies. There are so many companies still today that have certain levels of monitoring and observability in place, but they're still completely underserved in many of the important applications across the business. Many years ago, and it's been decades here, I was working at major financial services companies, trying to get them to monitor their important applications and have the right alerting set up. A war room would start, people would come into the war room and say, what monitoring do we have? And everyone would start looking around. That's the wrong first question to ask in the war room. It shouldn't be, what do we have? We should know what we have already. It should be, what is the information telling us? What are the initial indicators here? We were starting from the wrong perspective. And I've worked for many different companies over time, and one of the companies I worked for was a monitoring and observability company, or a couple of them, actually. So even over all of these years, I've gotten to see the lack of monitoring, observability, proper alerting, and then proper alert management beyond that, that still exists out there today. It's really incredible.
Darin 00:23:13.499 Is it just that we think we've got the data, so that's all we need? And we never really come up with a way to truly interpret the data that's coming in.
Jim 00:23:23.559 Yeah, I fell into that same exact situation. I had, I'll say, instrumented, for lack of a better term, but this was a really long time ago, my IBM systems, where I had CPU utilization, disk utilization, memory utilization, basic server metrics flowing into spreadsheets, and charts were being created. And so I felt really good about all the data that I had. And then we had a massive incident, and I started looking at all this data, and I'm like, what does this chart even mean? What is my CPU normally? I don't know, but I see what it is right now. So then I had to start looking back through this history of charts. We've gotten to this culture of relying on dashboards, but we don't necessarily know how to interpret those dashboards, and I think they give us a very false sense of security. So again, another great opportunity to feed data into systems that can make sense of that for us.
Viktor 00:24:15.309 I'm kind of negative towards dashboards, for the simple reason that I feel that dashboards represent knowns. Kind of like, we know that this might happen,
Jim 00:24:28.979 Yeah.
Viktor 00:24:29.719 and for whatever reason, I choose not to fix it. for good and they would utterly fail when with unknowns, which are the real problems, can we see? The real problem is this never happened before. unless you have very good imagination or you're clairvoyant, you haven't designed a dashboard for something that never happened before. Sure. I mean, you will see the spike in CPU and memory and things like that. It's not that, I'm not saying it's useless, right, but it's not really covering the real incidents. And by real, I mean this never happened before.
Jim 00:25:04.894 And what does that mean, though? What does a spike in CPU utilization really mean? Did we just get busier, and it's good, like we're doing more business and the systems are handling it just fine? Or is it a problem, and our customers are unhappy? Walk into any operations center anywhere and you're gonna see lots of dashboards, right? All the monitors are gonna be plastered with dashboards. So we definitely need to do a better job of interpreting that in the systems that receive alerts, and hopefully we've got alerts configured. That was a huge part of my job back at these financial services companies: going to each application team and working with them to figure out what we should be alerting on. The alerting should go into an alert management system, and hopefully that can make sense of this.
Viktor 00:25:53.359 That's another question I have, and it's truly a question. When you design a dashboard, you usually design it with, you know, some thresholds that say, hey, this is probably not an okay level of CPU usage, right? So you already know what the threshold is,
Jim 00:26:09.244 Yeah,
Viktor 00:26:09.889 so that means that you can just as easily create an alert, right? For, for, for the same thing that you're. Yeah, you can. And then, if you already have alert value watching at the dashboard,
Jim 00:26:21.924 I disagree with most static thresholds, by the way. I really dislike static thresholds in most circumstances, except for things like disk utilization. Once the disk fills up to a certain point, you want to get ahead of that and make sure you're cleaning out your disks and/or adding more capacity. But in most cases, and this is what I started to implement in the companies I used to work for, I did dynamic thresholding, where the software would do machine learning and build bands of normalcy. And when things got too far outside of normal, it would trigger an alert. That didn't necessarily mean there was a problem; it meant you should pay attention. It's kind of like, hey, put the antennas up, because something really bad could happen. And then maybe if you combine some of those together, they really start to mean a problem. And Darin, I think this gets back to what you were talking about a little earlier with starting to see patterns in the data. This is nearly impossible for humans to do, especially at scale, with so many applications, so many servers, and virtual environments processing all of this. So again, another great opportunity for our friend AI. There are many different technologies that could be at play here, but something to look at that and make more sense of it. And then that should carry into the war room, right? That initial triage should be available as soon as the humans need to get involved, and that becomes part of this incident resolution process, so that the incident response software becomes an assistant. It's gonna help SREs, or whoever is doing the troubleshooting and firefighting, come to decisions faster. And if they don't have information, they should just be able to ask for it: I need information on this. And the system should just go get it from the appropriate system of record for that information, then correlate it with what's going on, so that it can help them come to intelligent decisions, and maybe start to make guesses about how to restore service immediately, and then what the root cause might be.
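[Editor's note: the "bands of normalcy" dynamic thresholding Jim describes can be sketched very simply: learn a band from recent history, flag samples that drift too far outside it, and keep adapting the band. The window size and the 3-sigma width below are arbitrary illustrative choices, not any vendor's actual implementation.]

```python
# Hypothetical sketch of dynamic thresholding: a rolling band of "normal"
# built from the last N samples, with alerts only on large deviations.
from collections import deque
from statistics import mean, stdev

def make_band_detector(window=60, k=3.0):
    history = deque(maxlen=window)  # recent samples define "normal"

    def check(sample):
        if len(history) < window:   # still learning: never alert during warmup
            history.append(sample)
            return False
        lo = mean(history) - k * stdev(history)
        hi = mean(history) + k * stdev(history)
        anomalous = not (lo <= sample <= hi)
        history.append(sample)      # the band keeps adapting over time
        return anomalous

    return check

check = make_band_detector(window=30, k=3.0)
flags = [check(v) for v in [50.0] * 30]  # steady load: no alerts
spike = check(500.0)                     # sudden 10x spike: "antennas up"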
Viktor 00:28:37.380 I'm not sure whether this is my limited experience speaking now, but I have the impression that very often the problem is not that you don't have information, but that you have too much of it. And you have to, because you don't know what will happen, so you're going to collect a lot of data, and that feels okay. But then an incident happens, and what do I do? Do I go through those zillions of lines of metrics and logs and what not? If some type of software, whatever it is, can actually at least give me a clue, look over there, kind of like, look to the left, that feels...
Jim 00:29:14.145 what's happening now. Yes, so. Yes. And, and that's, that's exactly what's going on right now, by the way. Like that's, you're, you've hit the nail on the head. That is a huge problem. It's part one of these problems in this incident management process. And, yes, software is being designed to do exactly that, right? And so they're early versions of that that are out right now, and those are getting better and better. Really helping you just focus, right? Cut through the noise. here are the things that are important. Here are the things that are deviating, right? Me. Metrics that are, that look great, that look the same as they always looked. When there are no issues, you can probably throw most of those away. You probably don't need to look at those. But these ones that are skewed very out of the norm right now, these are the ones you wanna pay attention to because some, they're gonna point you to, how to restore that service as quickly as possible.
Darin 00:30:08.201 It seems like in that case, when we're trying to restore, the way I've been sitting over here thinking is: this data that I'm being given, rather, the summary that I'm being given from the data, is either gonna be spot on, or it's gonna send me down a really bad rabbit trail and keep me off the real cause for an hour or two, 'cause we're just chasing a bunny.
Jim 00:30:33.990 Yeah.
Darin 00:30:34.770 is that, I mean, AI is, AI is probably not going to fix that
Jim 00:30:39.600 Right. Not,
Darin 00:30:41.130 midterm.
Jim 00:30:42.180 Yeah, not soon. But that's the expertise, right? There's that old expression: there's no substitute for experience. And so the human portion of this equation is absolutely critical. I shudder when I hear the stories. When I'm working at a conference, people will come up and ask, hey, can I use your software to replace people? I don't want to replace people. People are absolutely vital in this process. You can use the software to gain tremendous productivity and efficiency, and so you can do a lot more with the people that you have, but you still need those people. We need human beings in the loop here, for sure, in the near term, and probably the midterm as well. And the long term? Who knows what's gonna happen with all of this, but I don't want to live in that world yet. I don't have the trust in the systems yet, and hopefully very few people have the trust in these systems yet, where they feel comfortable letting these systems go and make decisions and make configuration changes and try to do very difficult tasks within our environments on their own, because I think that would be disastrous.
Viktor 00:31:56.306 There are, of course, too many people saying, ah, yeah, AI is going to replace people. I completely disagree with that. But what I do feel is that we are on the edge of replacing the mundane tasks or chores that people need to do, so they can get to do what really matters. If I use the previous example: if there is an incident, me going through thousands of records is a chore.
Jim 00:32:27.551 Yep.
Viktor 00:32:28.241 Me actually deciding what should be done when I get to the meat of the problem, that's the value work.
Jim 00:32:39.216 Absolutely. Yeah, productivity gain, right? So cut out the stuff that takes us a long time and that machines are really good at doing really fast, and then it lets us do what we do best.
Viktor 00:32:53.241 Exactly.
Darin 00:32:54.261 I will say it a little bit differently. Let the machines be the ones that watch the dashboards instead of the humans,
Jim 00:32:59.896 Yeah.
Darin 00:33:01.055 because I mean, other than having a pretty graph in front of me, that's about all it is,
Jim 00:33:05.645 oh, they are pretty sometimes. I mean, it's nice to look at those dashboards, but maybe an occasional glance from a human is fine, just to admire the work that's been done.
Viktor 00:33:15.665 Are people so badly paid in this industry that they cannot afford Netflix? Is that what we're saying?
Jim 00:33:23.055 I hope not.
Darin 00:33:25.965 Well, I think the chart that you don't wanna see is a Christmas tree farm. That's what you don't want to see. And if you don't understand what a Christmas tree farm is, just think up, down, up, down, up, down, up, down. That's what you don't want to see.
Jim 00:33:37.155 Yeah.
Darin 00:33:38.000 Uh, for most things.
Jim 00:33:39.620 Although I think Netflix has gotten pretty expensive, Viktor, to your point.
Darin 00:33:44.345 Netflix is more expensive than my AI subscription now. So guess which one got dropped?
Jim 00:33:48.890 Yeah. Uh.
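[Editor's note: Darin's "Christmas tree farm" pattern, a metric that keeps whipping up and down, is exactly the kind of thing a machine watching the dashboard can flag mechanically. A minimal sketch, assuming nothing about any particular monitoring product; the function name and threshold are purely illustrative:]

```python
def is_flapping(values, min_flips=6):
    """Detect a 'Christmas tree' pattern: a metric that keeps
    reversing direction instead of trending or holding steady."""
    flips = 0
    prev_dir = 0
    for a, b in zip(values, values[1:]):
        direction = (b > a) - (b < a)  # +1 rising, -1 falling, 0 flat
        if direction and prev_dir and direction != prev_dir:
            flips += 1
        if direction:
            prev_dir = direction
    return flips >= min_flips

# A spiky up/down series flaps; a steady climb does not.
print(is_flapping([1, 9, 2, 8, 1, 9, 2, 8]))  # True
print(is_flapping([1, 2, 3, 4, 5, 6, 7, 8]))  # False
```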
Darin 00:33:51.166 I'm still sitting here thinking. Again, going back to pre-incident, detect, restore, post-incident: in that sliver between detect and restore is the incident itself.
Jim 00:34:01.467 Mm-hmm.
Darin 00:34:02.487 In a perfect world, pre-incident and detect are 90%, incident is 1%. Restore is call it 5% and then everything else goes to post-incident.
Jim 00:34:15.852 Yeah.
Darin 00:34:16.841 That's what it feels like it should be, right? When an incident happens, it should be able to be resolved quickly, especially if it's something we have seen, or similar to what we've seen, before.
Jim 00:34:28.186 Mm-hmm.
Darin 00:34:29.561 If it's a brand new, like that first EC2 outage that happened,
Jim 00:34:33.006 Yeah.
Darin 00:34:34.091 uh, okay, great. Anytime you have a first, it's fine for the window to, to slide way open and I would even say a second or third time, it's okay too because especially if they're happening in rapid succession, it's not good, but it's okay. It's understandable. And we've already talked about it. Lots of monitoring. Doesn't mean we have a solution. It's more of a problem to me just 'cause I have a lot of monitoring. That just means I'm paying a, observability vendor way too much money to store data that I don't need
Jim 00:35:04.121 a chance
Darin 00:35:05.036 yet. You've gotta, you, you do have a chance. Well, you, you've given yourself a false secu, a false security of a chance. Because if all you've been doing is stuffing the data back, you still don't know what's going on.
Jim 00:35:17.111 Right? Yep. Mm-hmm.
Darin 00:35:19.256 Or at least you'd have the data.
Jim 00:35:20.711 Yeah.
Darin 00:35:21.056 what do I do with it? I don't
Jim 00:35:22.841 Well, think about that small sliver though that you're talking about. That small sliver in the middle is outrageously expensive, right? If that's the
Darin 00:35:31.020 It's the most expensive.
Jim 00:35:32.850 Yeah. And, and, and,
Darin 00:35:34.290 expensive of the, of the five now,
Jim 00:35:36.150 and so the whole, the whole goal should be, avoid that sliver in the middle. Do everything you can. To avoid that small sliver. And the way, again, like I'm pushing this out there as my opinion, the way to avoid that is for with better run books, because those runbooks, hopefully, hopefully the operation center is getting some early indicators with some alerts starting to fire. another lesson I learned from, from many years in IT operations is things didn't usually just boom, they just didn't go down. Hard immediately. There was a spiral. Things spiral out of control. And in every postmortem I ever did, there were indicators there, things started to slow down. You know, there were occasionally, you know, like a full network outage, which is like boom, weird, strange things, but that almost was never the case, right? So 95% of the time there was, there were indicators there. And if you have the right run books in place. Hopefully your operations team can get ahead of it and start and test the right things or supplement that operations team with really good automation here where you have automated playbooks that, and run books that kick in and do the testing and flag things to the right people and say, Hey, you like that result of this test indicates that this is an issue. If you go fix this, then we, we should not have this outage that we're headed towards. Right. So that to me, that's a huge part of avoiding that little tiny sliver. And then when you do hit that sliver, because it's gonna happen for sure, that portion on the back end, after you've already hit it, you know there's gonna be a bunch of things that you can do to make sure or to try your best to make sure that that's not gonna happen again. So fulfill those things. Make sure you've really identified all the right tasks to follow up on, and then give people that ability. give someone authority. Give a team authority to follow up on it. Track these things. 
Make sure executives have the right view into the post-incident activities that are supposed to happen, and incent people to fix things so that that little sliver in the middle doesn't happen again. So hopefully, if you get the first part and the last part right, you take care of that middle significantly more often than it's happening now. Right? So significantly fewer incidents is the goal.
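[Editor's note: the "automated runbook that kicks in on early indicators" idea Jim describes can be sketched very simply: an alert fires, tied diagnostics run, and unhealthy results get surfaced to a human before the spiral becomes an outage. This is a minimal illustrative sketch, not any product's API; `check_disk_usage`, the alert name, and the paging mechanism are all invented for the example:]

```python
# Illustrative only: a tiny alert-driven runbook runner.
# A real system would use a workflow engine and real probes.

from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str

def check_disk_usage() -> CheckResult:
    # Stand-in for a real probe (df output, an exporter query, etc.).
    used_pct = 91
    return CheckResult("disk_usage", used_pct < 85, f"{used_pct}% used")

# Map each early-warning alert to the diagnostics its runbook calls for.
RUNBOOKS: dict[str, list[Callable[[], CheckResult]]] = {
    "latency_rising": [check_disk_usage],
}

def handle_alert(alert_name: str) -> list[CheckResult]:
    """Run every diagnostic tied to the alert; flag failures to a human."""
    findings = []
    for check in RUNBOOKS.get(alert_name, []):
        result = check()
        findings.append(result)
        if not result.healthy:
            print(f"[page] {result.name} unhealthy: {result.detail}")
    return findings

findings = handle_alert("latency_rising")
```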
Darin 00:38:02.933 So it sounds like you're proposing, for 2025: runbooks are automated. Yes, that makes total sense. If it can be automated, it should be automated.
Jim 00:38:15.289 The updating and creation of runbooks should be automated
Darin 00:38:19.339 as well. Yeah, for sure. But I don't know how we're going to automate away telling the business that we need to not do what's been promised for the next quarter so we can make sure that this doesn't happen again.
Jim 00:38:30.410 Hmm. So that's, again, a top-down cultural issue, and so that's a business case. Just as with everything that we do, there's a business case. These operation centers know how many incidents they have. When I was in these financial services companies, we tracked how much these outages cost us. Even though we're describing the typical outage as a small sliver, that small sliver can be anywhere from minutes to, in some cases, days. The head of product at my company, in his previous company, worked for an online retailer, and they had a significant issue that went on for 72 hours. This is a major retailer, right? They do a massive amount of business, so that was a major impact to them because they lost those sales; that revenue went to some other company. So that's the business case, and that business case should drive the change. And there has to be a cultural change in this respect, to either leave room for the people who need to make those fixes that work there today, or to spend a little bit more to get a little bit more capacity to plan for this. Right? And so the business case is simple; it just depends on how exposed you are and how good your current teams are. Some companies have a lot of resources, and the business case won't be there for them, okay? But many companies that don't have that many resources should see that they can make this business case very easily.
Darin 00:40:20.649 What should be the utilization of an operations person today?
Jim 00:40:25.179 Ooh, that's a tough question. What should it be? According to whom? From whose perspective? From the, the ops person or their, their executive staff.
Darin 00:40:35.034 the executive staff, I believe, is they're gonna say, well there are three floors underneath the streets in the data center, so it doesn't really matter. We'll just keep 'em going a hundred percent of the time.
Jim 00:40:47.754 Mm-hmm.
Darin 00:40:48.797 or 99% of the time. 'cause they'll get a day off sometime 'cause they have to go to the dentist.
Jim 00:40:53.512 Yeah.
Darin 00:40:54.652 From the management of those operations people, they're probably gonna think 90%. The peers of the operations people is like, I do so much unplanned work. I probably spend 75% of my week doing unplanned work,
Jim 00:41:09.832 Yeah,
Darin 00:41:10.792 if not more. And sometimes especially, you're not gonna be able to have. An ops person that can do everything. Yes, there are a few that can do that, but that's not typical. I think that's the biggest problem is we, we are over allocating what the operations people can actually get done.
Jim 00:41:28.847 we are, and we're, we're making it worse today. There's. definitely the, do more with less mentality. it's definitely there. Uh, we see it, I've seen it for multiple years at this point in our industry. Uh. There's a lot of conversations about burnout and it's a real problem. and that has to be factored in here as well. We have to take care of our people. We have human beings that, you know, just think of a car. You can't run a car at a hundred percent all the time. You're gonna burn out the engine, right? It has to, rev up and rev down run at partial power. If you're just at full power all the time, it's gonna be a very short-lived engine. going to break down. The human is very similar to that. and we have, we're a lot more complex, right? So we have a lot more things going on in our lives. So this is another topic that is, is really near and dear to my heart. I know many people in this industry who have, burned out and they've just had to quit the industry because it just was too much of a grind and it chewed them up and it spit them out. And I don't, you know, when we have this conversation. I don't want to even imply that executives don't care. I believe executives really do care, and I think this is a huge challenge that they want to solve, but at the same time, there are these business demands and so there's a lot of factors here, and I don't think people have ill will. I don't think anyone wants to see another human being burn out because they're overworked, but at the same time. The business needs to deliver. And so there has to be some sort of balance in there. And I think that's what every company struggles to find.
Darin 00:43:06.827 Well, I mean, if the executives don't hit their numbers, they don't get their bonuses, or their golden parachute doesn't open.
Jim 00:43:15.017 Yeah.
Darin 00:43:15.632 sorry to all the executives, but that's the way I see it from bottom up, and I think that's the way that most people from bottom up see it.
Jim 00:43:22.928 and I would say at the same time I've known many executives and, and they work a lot.
Darin 00:43:28.325 Absolutely.
Jim 00:43:29.330 yeah. So
Darin 00:43:30.455 That us grunts that don't see from the bottom. Yeah, it's, it's even worse. But you know, I typically don't see the executives in when something's hit the
Jim 00:43:40.285 I.
Darin 00:43:40.445 2:00 AM on a Sunday morning,
Jim 00:43:41.894 You hope, not actually. Right. you know, that's the, that's the last thing you wanna see at that point in time. You just want to get it fixed and then go back to sleep as
Darin 00:43:52.034 Yes,
Jim 00:43:53.144 Yeah.
Darin 00:43:54.014 that's true.
Jim 00:43:55.064 Yeah.
Darin 00:43:55.634 So, where does Xurrent fit into all this now? Xurrent, okay, you're doing your smart Web 2.0 naming here; that's what this is. It's X-U-R-R-E-N-T. So, like "current" except with an X, and it's pronounced like Xerox: Xurrent, if you
Jim 00:44:13.904 Yes. Yes. Yep. So, Zern is across all of these things that we've been talking about. you, unsurprisingly, right? think of Zern as, a, a platform for IT operations, for IT service management for enterprise service management. So. The service desk would, would utilize our software. And so we have this ticketing and request software with workflow automations, uh, that are designed to improve, collaborative productivity. Right? And so, think of a workflow as a process that. Kicks off from person to person and everybody knows what they're supposed to do at the right times. There's a self-service portal where end users would come in if, let's say this, there is an incident going on. End users might come into the self-service portal and report an incident, and look for some sort of status update on that incident. And so you tend to start to see many tickets for the same thing when an incident occurs. So that's why status pages become. So important, which is another part of Xurrent overall. We have this product called Status Cast that does status pages as well, so it gets that communication out, but also to various layers because. Based on your role, you can see different things. And so there can be an executive update in in there. And so the executive can just read it within status cast and another person in another role logs in and they'll see a different update. Right. So that's the key here. We're trying to. Communicate to the various stakeholders in the way that they need to be communicated with. And then the third part of what we have is this automated incident response where we understand the, you know, who is responsible for what services, and we take in alerts and we make sense of those alerts. and we assist in solving the problem as quickly as possible. Updating these runbooks and playbooks, sending the post incident activities, those tasks that need to be performed. 
It does an automated postmortem, sending those back into the request system for that service, the service management system, and making sure that there are requests in there that are automatically generated so that teams can go ahead and track those and start working on those. So we're working on solving this end-to-end issue that we've talked about. We're not there all the way yet. We have many pieces of this in place right now, and we're working on getting more of the pieces, more of the analytics layer, in place to help and assist as much as possible throughout the process. So that's what we've been working on. And again, I joined the company because I had a background in this, and when I heard the vision for where the company was headed, the problem resonated with me, because I'd seen the problem throughout my career in IT operations. And then after my IT ops career, when I was working for various vendors in the IT operations space, I kept seeing the same problems over and over again. So that was a really worthwhile thing to participate in, and hopefully we can help solve a very big challenge that many companies face.
Darin 00:47:20.651 We've already talked about how we sort of see AI today: it's in its infancy, for lack of a better term.
Jim 00:47:28.131 Yeah,
Darin 00:47:29.061 Is it being helpful now for y'all within Xurrent? think back to beginning of 2025 compared to October of 2025, how big of a jump has it been for you?
Jim 00:47:41.781 Yeah, so things have really coming together. So over time here what we've done is we've identified where are the places where AI can be a productivity assistant, where it's a productivity enabler for folks that need to use the system, whether that be an end user coming in, in a self-service portal, or whether that be within the system, you know, doing. Postmortems in, the automated incident response, which is called Zen Duty, by the way, then, you know, we have, within the service management portion, the agents or specialists, there are plenty of opportunities to assist them also with summarization of, of long tickets with lots of information in them. So quickly summarize that. Let the AI handle that with writing documentation. This is a huge one. This is like. Something I hated doing when I was in IT. Operations. I don't know many IT people that really like writing documentation. so AI is great at that. So from any note, in any request, you can say, Hey ai, here's from this note. I need you to create a knowledge article. It will look at the context of this request and all of the notes contained within. And this is for solving some sort of problem. And it will create a knowledge article on solving this problem. Right? It'll automatically route and classify tickets so that there's no ticket queues to arbitrage, where tickets get lost in there for days at a time and nobody gets the help that they're looking for. And so this is all culminated in, in, in an AI fabric. For us. It's a mesh of capabilities that are all productivity, enhancers. So that we could launch our virtual agent. And the virtual agent is like the first point of contact for someone coming in from a self-service perspective so they can chat with an AI who will, look through these knowledge articles that AI helped write or AI helped improve. it will recommend ways for the person who's having the problem to fix the problem on their own. 
So hopefully they can just self-serve, and that's why... you know, we had that conversation about do more with less. You don't have to hire more and more people in the service desk; you can do more with the staff you have. And then, let's say you can't get to a resolution on your own: it will even submit the request on your behalf and then route it to the right people. And then one of my favorite things is, within that same interface, if you don't want to talk to AI, which many people still don't, you can just click the "Hey, I want to talk to a human" button, and then you're chatting with the human being on the other side of that service desk instead of the AI system. So there's so much potential for productivity gains with some of these basic capabilities, right? And that doesn't even start to touch some of the things we talked about. I'll get into a buzzword here with agentic AI. I think that's a huge buzzword right now, way overused. But when I think about agentic AI, it's about taking actions, and I made my feelings clear earlier that I think it's super dangerous to set a system free within your environment and let it make configuration changes and do anything complex. But there may be some very simple things where you say, "You know what? Hey, this is fine. We're gonna allow AI to do these few little things on its own," because it's the repetitive things that, Viktor, you mentioned. You just don't want to do all of those repetitive things, and that's fine to push off to a machine. It's not gonna hurt anything. Right? So that's where these productivity enhancements are really coming in.
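[Editor's note: the automatic ticket routing Jim mentions can be illustrated without any AI at all. This is a deliberately simplified keyword-based stand-in for the ML classification a real system would use; the team names and keywords are invented for the example:]

```python
# Simplified stand-in for AI ticket classification: route by keyword.
# A production system would use a trained model; this shows only the shape.

ROUTES = {
    "network": ["vpn", "dns", "latency", "firewall"],
    "identity": ["password", "login", "mfa", "sso"],
}

def route_ticket(text: str) -> str:
    """Pick a team queue so tickets never sit unowned in a shared queue."""
    lowered = text.lower()
    for team, keywords in ROUTES.items():
        if any(keyword in lowered for keyword in keywords):
            return team
    return "service_desk"  # default queue when nothing matches

print(route_ticket("Cannot log in after MFA reset"))    # identity
print(route_ticket("VPN latency spiked this morning"))  # network
```

The point is the workflow, not the classifier: every ticket gets an owner immediately, so nothing waits unclassified in a shared queue for days.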
Darin 00:51:18.447 So Xurrent can be found at xurrent.com. That's X-U-R-R-E-N-T dot com. Jim, thanks for being with us today.
Jim 00:51:26.362 Yeah. Guys, thank you so much. I appreciate the conversation.