Jim
00:00:00.000
Look, I was in IT operations 15, 20 years ago, and I still talk to people in IT operations, and the stories are all the same as the ones I went through 15 to 20 years ago.
Viktor
00:01:27.252
I mean, it's not that I haven't tried. I tried to crash Google, but it doesn't crash because of me.
Darin
00:01:33.672
Okay, well, let's flip it around. What if you were the one sitting in the chair having to deal with somebody like you trying to crash somebody's site? How would you feel?
Viktor
00:01:50.547
Immediately, I would claim that I was hit by a bus or something like that. Incapacitated.
Darin
00:01:56.972
Okay, hit by a bus. We don't wanna be hit by a bus today. On today's show, we have Jim Hirschauer on from Xurrent. Jim, how you doing?
Darin
00:02:06.043
Very well, thank you. Did my story there make any kind of sense, at least the second half of the story, me being the grunt, the fire drill kicks off, and I just quit?
Jim
00:02:15.988
Very much so. I mean, I've been that person on the other side of that, rage-clicking a website while it's not working properly. It's not a good feeling.
Darin
00:02:23.416
Not a good feeling. So that's the consumer side of it. But as an SRE, or whatever the title of the month is, it feels like... We have an incident happen, right? All of a sudden we get a big DDoS, or fill in the blank, it doesn't matter. And now we have the fire drill. We get everybody in the office. No, they're not paying for pizza today, because budget, and we're just having to deal with it.
Darin
00:02:56.805
Right, that was way too fast to do that. If this was a drinking game, there was your first one.
Darin
00:03:04.685
Nothing has changed in incident management, has it? I mean, other than, okay, AI is starting to mix in, but let's ignore that for just a moment, or maybe a few moments.
Jim
00:03:13.597
Look, I was in IT operations 15, 20 years ago, and I still talk to people in IT operations, and the stories are all the same as the ones I went through 15 to 20 years ago. There are still war rooms, there are still disconnects, there are still silos, no matter how much we try to break down those silos. They're still there as far as I can tell. So yeah, I think we're still doing a lot of the same things during incident management that we've been doing for a very long time.
Darin
00:03:43.882
Will that ever change? Well, we all hope so. I mean, some of the tools have gotten better over time, right? Because 15, 20 years ago, we didn't have HipChat (rest in peace). We didn't have Slack, we didn't have all these. We might've had Skype. We might've had AOL Messenger or Yahoo Messenger, but that was about it.
Jim
00:04:09.412
Oh, I remember the tragedy of the rotary-dial telephone, when you would mess that up and have to start all over again. I still have nightmares about that.
Viktor
00:04:20.332
Let me play devil's advocate. I think that many of the problems we had back then are solved today. It's just that the problems are very different than back then, right? Whatever issues you had 20 years ago in production, those are easy to solve now. The issues that we are having now are very, very different. So it's not that we haven't moved on. It's just that somebody's hobby site is bigger these days. An influencer has a bigger personal site with more traffic than whatever you were working on 20 years ago.
Jim
00:04:53.639
Yeah, for sure. No doubt the problem space is bigger. Trying to solve the problems has gotten much more complicated. We have much more complexity. You guys know this well, right? There are all these virtualization layers that factor in now that people have to troubleshoot. But one of the things that I think is still very similar is the process that's used, and how there are various stakeholders across every company that need to be updated, and they all need to be updated in different ways. I remember being in war rooms and having executives drop in and just stop everything that was going on and, sometimes nicely and other times not very nicely, demand an update as to where everyone was. And so we all had to stop what we were doing and provide that update one by one to those executives. Everyone's under tremendous pressure, so I get the reasoning, but we should be able to fix those things. And again, in the conversations I have when I talk to people about this, they're still saying, yeah, we still experience all of these things. We experience the siloing, the problematic communication, even though we have more communication tools than we've ever had in the past, and they're a lot better. It's still not a fixed problem when you consider the end-to-end workflow of incident management. So I wholeheartedly agree with you, Viktor, but I think there are also some things that we're just kind of stuck in our old ways with.
Viktor
00:06:31.566
Kind of not being able to, you know... A long time ago, and I'm not going to use the real name, let's call her Nina, right? I wrote in my incident report that this took much longer because of her coming to ask for the status.
Jim
00:06:49.366
Yeah, absolutely, it is a cultural thing, so for sure there are cultural aspects to this. Tooling can help. There's authority. During an incident, an incident commander has a tremendous amount of authority to make sure that the service gets restored, whatever's going on, as quickly as possible. But typically after that incident is over, when everybody high-fives after we've restored service, we all have a mountain of work to go back to. We have our email queue or our Slack messages, or we're still behind. We got even further behind than we were before this war room exercise, and we've got to go try and catch up now. And so a lot of times we don't go back and fix a bunch of the things that we really know we should go fix: the things the postmortem has said, hey, these are root causes, these are things you need to circle back on and make sure this doesn't happen again. And so we see a lot of these repeat incidents. I ask about this every time I go and talk, I still speak at conferences, and 75% of the room raises their hand when I ask, do you have the same incidents happening over and over again? So I believe this is still a pervasive problem throughout the industry.
Darin
00:08:08.293
If people are having the same incidents over and over again, we're at that definition of insanity. Is the problem that the people who had the incident before have all left? So now we have a new set of people and it's new to them, other than having the history that it's happened before.
Jim
00:08:26.393
Yeah, that's a great question, right? Tribal knowledge. Again, I think back to 15, 20 years ago, and tribal knowledge was so important, being able to capture that tribal knowledge and pass it along. So now we have tooling where we should be able to capture that tribal knowledge. I fear bringing up the term AI here because of the conversation we just had about the drinking game, but that should help us, right? We're moving into a new frontier where tribal knowledge doesn't have to leave a company when a human being leaves the company. Now, hopefully, we can do a much better job of maintaining that tribal knowledge, and ultimately the goal is to make systems as reliable and resilient as possible. I think there have been huge strides over the last 15, 20 years, I'm not saying there haven't, but we still have a lot of work to do in this area. And I think that's a combination of culture, it's a combination of tooling, it's getting our processes right, it's getting our technology utilized in the right way and aligned in the right way to really focus on solving some of these challenges.
Darin
00:09:39.623
So all those after-action reviews that people have written after every incident,
Darin
00:09:45.048
they always get stacked up in a directory somewhere, stacked up in a wiki somewhere, and were never analyzed to figure out what's going on. We didn't have the tooling then to do that. Again, take a drink, AI can help us with that.
Darin
00:10:00.348
Right. That's gonna be one of the things, that summarization stuff. Forget all the gen AI stuff. I couldn't care less about that because that doesn't help me out a lot.
Darin
00:10:12.496
But being able to feed it text and have it correlate without me having to think about it? Hmm, that's pretty darn useful.
Jim
00:10:20.926
Yeah, definitely. I think that's where we're headed. Again, I think we still have a long way to go. Everything AI, in my opinion, is really in its infancy still, and I don't get much disagreement when I say that. Things are very simplistic right now. We see a really great future forming in front of us, and a lot of companies are doing a lot of good things to push that along. But we need to focus on problems, right? That's the key here: okay, we have this problem that we've identified, and now we're gonna focus on solving that challenge. That's something we're doing at my company. We're heavily focused in this area, and of course it's near and dear to my heart. That's why I'm a part of it, because I was a part of IT operations for so long. And these problems need to be completely eradicated going forward. Incidents will still occur, but wouldn't it be amazing if, from every single incident, we could stop that same type of incident from happening ever again for that service, so you don't have to worry about that one?
Darin
00:11:20.462
I think back to AWS and the great 2012, '13, '14 EC2 outage. I can't remember the exact year anymore. It happened once, and it happened twice,
Darin
00:11:33.037
but those have been very few and far between. I think what probably happened under the hood for them, and I have no real knowledge of this, is that they had some runbooks, whatever a runbook looked like for them, and they probably broke them out when the second one happened and started playing them out to make sure nothing else occurred that way. I think back to runbooks in the eighties and nineties. It was a Word document, or excuse me, a WordPerfect document, 'cause Word didn't exist.
Darin
00:12:02.741
But it would get out of date. We'd think to update it, maybe, because we'd always have the same issues over and over again, 'cause we were in our own data centers. Now with this cloud thing, things are moving all over the place, microservices on top of that. It's like, come on, the sprawl of our estates has become probably our biggest problem, and we've done it to ourselves.
Jim
00:12:27.072
It's interesting when you bring up runbooks. Again, thinking back to my IT ops days, the development team was usually running behind in development. And it wasn't the development team's fault; the goalposts kept getting moved by the business. So naturally, as the business keeps changing what they want in their application, the development team gets further behind. And so when applications would launch, runbooks would be very sparse. There would be some level of information in those runbooks about troubleshooting procedures and what to do in certain circumstances, but they were pretty sparse in almost all cases to start out with. And then as the application got changed and upgraded, those runbooks didn't get updated very frequently, so the runbook information would get outdated as well. I think that's another great example of an area where we can improve today. AWS is an interesting example because they have a lot of resources, right? Most companies don't have the level of resources that an AWS or any of these gigantic providers have, so most companies need some level of assistance there. That's another real problem where, again, I think we're on the cusp of seeing some great solutions. There are definitely already attempts at creating runbooks automatically based upon the information that's gathered during an incident, and there are multiple software vendors out there that do that. So I think that's a huge step in the right direction, and it's one of the bookends of the problem we're really talking about. When I think about incident management and incident response overall, it's pretty easy to break it down into four big phases. You have the pre-incident work, the runbooks essentially. You've got detection, with observability systems and alerting, just identifying that there's some sort of issue going on. Then you've got the resolution step. And then you've got the post-incident work to really do the break-fix and all of the post-mortem tasks to make sure this doesn't happen again. And the industry has gotten really good at those two middle steps, right? Detect it and restore service. We're really good at that; we've gotten good over many years. But we've sort of neglected the other steps that bookend it. We've neglected the pre-work, the runbooks, and we've definitely neglected the post-incident work. Again, not because we don't want to make it better, but because the business is demanding of our time and we're all essentially overworked.
Darin
00:15:28.555
So pre-incident and post-incident, that's where you're saying we're falling down.
Jim
00:15:33.870
Yes, I think there are huge issues there. Of course, during those two middle steps there are always opportunities to improve, right? We're constantly getting better at that. Again, we're good there. So yes, we're falling down in those other two steps, the pre-incident and the post-incident, and we're falling down during the incident, especially with communicating out, even though many companies have status pages now, which are great. I love status pages, I use them all the time. A great example of this: the website for my company. I tried to go on and make a change, and I was having trouble; the user interface was slow to respond. So I had to go look at their status page and get some status updates from them. That's great, I got to see the status update for the end users. What I started to wonder about is how they were updating their executive staff. You know, Viktor, you mentioned there was a person who slowed everything down because they wanted updates, and so I started to wonder how they updated their internal executive staff, 'cause I guarantee they were asking about it, right? They wanted to find out when this was going to be restored, because nobody could go in and change their websites.
Darin
00:16:43.208
But in that case, and I had my variations of Nina as well, those were always the funny ones. Okay, we'd just done a dump, and they walk in two minutes later: hey, I'm here now, let's go ahead and have the meeting. It's like, we just had it. Do you want us to get back to work fixing it, or do you want us to tell you again? No, I need you to tell me what's going on. Okay, well then we can't work on it.
Viktor
00:17:08.788
I feel that part of the problem is that tools are generally created to speak a single language. When I say language, I don't mean literally English, but hey, this is the tool that ops people use, right? They understand the inputs and outputs of that tool perfectly, and you cannot give it to anybody else. So Nina has no use for whatever you are using, whatever you're recording, whatever you're doing. It's not the language that Nina speaks, so of course she's going to come and ask, so what's the status? I mean, I see those 500 pages of something, I have no idea what's going on. And what I'm saying now is not even directly related to incident management, right? We see that over and over again. Different roles speak different languages and they don't understand each other. That's okay, because, hey, I don't understand all languages either. But tools tend to be very focused on the buyer, almost exclusively, not on all the potential users of that tool. I'm getting philosophical now. I'm not sure how much sense that made.
Jim
00:18:17.304
No, that makes perfect sense, and that's exactly one of the core issues that we still have today. And again, I think this is a great opportunity for us to utilize that two-letter acronym, AI. It's a great opportunity to take the things that IT operations understands really well and convert that into information and an update for the various stakeholders, because yes, there are distinctly different stakeholders across the enterprise: the executive level, the application owner level, the practitioner level, the end user level. At our company, we have software where the service desk operates off of our platform, and so they get end users calling in, whether those are internal users from the company or external-facing consumers or customers. There are people actively looking for updates: hey, my stuff's broken, and I need to know what's going on. I need to be able to trust that you're going to give me an estimate as to when this will come back up. Maybe it's bill pay and I have a bill that's due today. Am I gonna be able to pay it today? So everybody needs an update from their own perspective. And I think this is what becomes really powerful once you start to put all of this data together across incident management: from within the platform where the incident is being worked, which is often Slack or Teams, since modern tools can work right through those, so Slack or Teams becomes an interface into the modern tooling doing the actual incident response, across into the service desk software that is getting a lot of information about how many users are impacted, how many people are calling in, how many people are in the chat bot, how many people are sending emails, whatever communication method they have. If you put all that data together, now you can actually create much more meaningful updates for the various levels. The executives can really get a much better understanding of how upset our customers are right now, because the service desk bears the brunt of that, not the team working the war room at that moment.
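To make the idea concrete, here is a minimal sketch of rendering one incident snapshot differently for each audience. The class, field names, numbers, and wording are hypothetical illustrations, not Xurrent's data model or API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentSnapshot:
    service: str
    impacted_users: int
    open_tickets: int
    eta_minutes: Optional[int]  # None while restoration time is still unknown

def update_for(role: str, snap: IncidentSnapshot) -> str:
    """Render the same incident data differently per audience."""
    eta = f"about {snap.eta_minutes} minutes" if snap.eta_minutes else "not yet known"
    if role == "executive":
        return (f"{snap.service} degraded: ~{snap.impacted_users} customers affected, "
                f"{snap.open_tickets} tickets open. Estimated restoration: {eta}.")
    if role == "end_user":
        return f"We are aware of an issue with {snap.service} and are working on it. ETA: {eta}."
    # Practitioners get the raw numbers and go look at the telemetry themselves.
    return f"{snap.service}: {snap.open_tickets} tickets open, {snap.impacted_users} users impacted."

snap = IncidentSnapshot("bill-pay", impacted_users=4200, open_tickets=37, eta_minutes=45)
print(update_for("executive", snap))
print(update_for("end_user", snap))
```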
Darin
00:20:39.653
So it's a sliver of time. Let's lay out the four again: pre-incident, detect, restore, post-incident.
Darin
00:20:46.733
The actual incident is in between detect and restore. It's a small sliver of what's actually going on. It seems like detect should always be going.
Darin
00:20:57.488
What we detect should hopefully... well, that's my point here. Detect should always be going on and should always be feeding back into pre-incident, because to me, just thinking out loud, if it's an application, whether it's web or client-server (I guess those still exist), we should be able to at least see standard patterns emerge. And that should feed back into pre-incident. That way, if that curve changes, pre-incident should start yellow-flagging things if something goes awry.
Jim
00:21:41.847
It's because I have battle scars from trying to implement these solutions at so many companies. There are so many companies still today that have certain levels of monitoring and observability in place, but they're still completely underserved for many of the important applications across the business. Many years ago, and it's been decades since I was working in major financial services companies, I was trying to get them to monitor their important applications and have the right alerting set up. A war room would start, people would come into the war room and say, what monitoring do we have? And everyone would start looking around. That's the wrong first question to ask in the war room. It shouldn't be, what do we have? We should know what we have already. It should be, what is the information telling us? What are the initial indicators here? We were starting from the wrong perspective. And I've worked for many different companies over time, and one of the companies I worked for was a monitoring and observability company, or a couple of them, actually. So even over all of these years, I've gotten to see the lack of monitoring, observability, proper alerting, and proper alert management beyond that, which still exists out there today. It's really incredible.
Darin
00:23:13.499
Is it just that we think we've got the data, so that's all we need, and we never really interpret it, never come up with a way to truly interpret the data that's coming in?
Jim
00:23:23.559
Yeah, I fell into that same exact situation. I had, I'll say, instrumented, for lack of a better term, and this was a really long time ago, my IBM systems, where I had CPU utilization, disk utilization, memory utilization, basic server metrics flowing into spreadsheets, and charts were being created. And so I felt really good about all the data that I had. Then we had a massive incident, and I started looking at all this data and I'm like, what does this chart even mean? What is my CPU normally? I don't know, but I see what it is right now. So then I had to start looking back through this history of charts. We've gotten into this culture of relying on dashboards, but we don't necessarily know how to interpret those dashboards, and I think they give us a very false sense of security. So again, another great opportunity to feed data into systems that can make sense of that for us.
Viktor
00:24:15.309
I'm kind of negative towards dashboards, for the simple reason that I feel dashboards represent knowns. Kind of like, we know that this might happen,
Viktor
00:24:29.719
and for whatever reason I choose not to fix it for good. And they utterly fail with unknowns, which are the real problems we see. The real problem is the one that never happened before. Unless you have a very good imagination or you're clairvoyant, you haven't designed a dashboard for something that never happened before. Sure, you will see the spike in CPU and memory and things like that. I'm not saying it's useless, right, but it's not really covering the real incidents. And by real, I mean the ones that never happened before.
Jim
00:25:04.894
And what does that mean, though? What does a spike in CPU utilization really mean? Did we just get busier, and it's good, like we're doing more business and the systems are handling it just fine? Or is it a problem, and our customers are unhappy? Walk into any operations center anywhere and you're gonna see lots of dashboards, right? All the monitors are gonna be plastered with dashboards. So we definitely need to do a better job of interpreting that in the systems that receive alerts. Hopefully we've got alerts configured. Again, that was a huge part of my job back in these financial services companies: going to each application and working with them to figure out what we should be alerting on. So the alerting should go into this alert management system, and hopefully that can make sense of it.
Viktor
00:25:53.359
That's another question I have, and it's truly a question. When you design a dashboard, you usually design it with, you know, some thresholds that say, hey, this is probably not an okay level of CPU usage, right? So you already know what the threshold is,
Viktor
00:26:09.889
so that means you can just as easily create an alert, right, for the same thing. Yeah, you can. And then, if you already have the alert, what's the value of watching the dashboard?
Jim
00:26:21.924
I disagree with most static thresholds, by the way. I really dislike static thresholds in most circumstances, except for things like disk utilization. Once the disk fills up to a certain point, you want to get ahead of that and make sure you're cleaning out your disks or adding more capacity. But in most cases, and this is what I started to implement in the companies I used to work for, I did dynamic thresholding, where the software would do machine learning and build bands of normalcy, and when things got too far outside of normal, it would trigger an alert. That didn't necessarily mean there was a problem; it meant you should pay attention. Kind of like, hey, put the antennas up, because something really bad could happen. And then maybe if you combine some of those together, they really start to mean a problem. And Darin, I think this gets back to what you were talking about a little earlier with starting to see patterns in the data. This is really nearly impossible for humans to do, especially at scale, with so many applications, so many servers, and virtual environments processing all of this. So again, another great opportunity for our friend AI. There are many different technologies that could be at play here, but something needs to look at that and make more sense of it. And then that should carry into the war room, right? That initial triage should be available as soon as the humans need to get involved, and it becomes part of this incident resolution process, so that the incident response software becomes an assistant. It's gonna help SREs, or whoever is doing the troubleshooting and firefighting, come to decisions faster. And if they don't have information, they should just be able to ask for it: I need information on this. And the system should go get it from the appropriate system of record, then correlate it with what's going on, so it can help them come to intelligent decisions and maybe start to make guesses about how to restore service immediately, and then what the root cause might be.
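The dynamic thresholding Jim describes is, at its simplest, a rolling baseline with a deviation band. Here is a minimal sketch of that idea; the class name, window size, and sensitivity are made-up illustrations, not any particular vendor's implementation.

```python
import statistics
from collections import deque

class DynamicThreshold:
    """Rolling 'band of normalcy': flag values that sit more than k standard
    deviations away from the recent mean, instead of using a fixed limit."""

    def __init__(self, window: int = 288, k: float = 3.0):
        self.history = deque(maxlen=window)  # e.g. one day of 5-minute samples
        self.k = k

    def observe(self, value: float) -> bool:
        """Return True when the value deviates enough to warrant attention."""
        out_of_band = False
        if len(self.history) >= 30:  # need a minimal baseline first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            out_of_band = abs(value - mean) > self.k * stdev
        self.history.append(value)
        return out_of_band

cpu = DynamicThreshold(window=288, k=3.0)
for sample in (42.0,) * 50 + (91.5,):
    if cpu.observe(sample):
        print(f"CPU {sample}% is outside the normal band: pay attention")
```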
Viktor
00:28:37.380
I'm not sure whether this is my limited experience speaking now, but I have the impression that very often the problem is not that you don't have information, but that you have too much of it. And you have to, because you don't know what will happen, so you collect a lot of data, and that feels okay. But then an incident happens, and what do I do? Do I go through those zillions of lines of metrics and logs and whatnot? If some type of software, whatever it is, can actually at least give me a clue, look over there, kind of like look to the left, that feels...
Jim
00:29:14.145
That's what's happening now. Yes, and that's exactly what's going on right now, by the way. You've hit the nail on the head. That is a huge problem; it's one of these problems in the incident management process. And yes, software is being designed to do exactly that. There are early versions of it out right now, and they're getting better and better at helping you focus, cutting through the noise: here are the things that are important, here are the things that are deviating. Metrics that look great, that look the same as they always look when there are no issues, you can probably throw most of those away; you probably don't need to look at them. But the ones that are skewed way out of the norm right now, those are the ones you want to pay attention to, because they're gonna point you to how to restore that service as quickly as possible.
Darin
00:30:08.201
It seems like in that case, when we're trying to restore, the way I've been sitting over here thinking is: this data I'm being given, or rather, the summary I'm being given from the data, is either gonna be spot on or it's gonna send me down a really bad rabbit trail and keep me off the real cause for an hour or two, 'cause we're just chasing a bunny.
Jim
00:30:42.180
Yeah, not soon. But that's the expertise, right? There's that old expression: there's no substitute for experience. And so the human portion of this equation is absolutely critical. I shudder when I hear the stories. People come up when I'm working at a conference and ask, hey, can I use your software to replace people? I don't want to replace people. People are absolutely vital in this process. You can use the software to gain tremendous productivity and efficiency, so you can do a lot more with the people that you have, but you still need those people. We need human beings in the loop here, for sure in the near term, and probably the midterm as well. The long term? Who knows what's gonna happen with all of this, but I don't want to live in that world yet. I don't have the trust in the systems yet, and hopefully very few people have the trust in these systems yet, where they feel comfortable letting them go make decisions, make configuration changes, and try to do very difficult tasks within our environments on their own, because I think that would be disastrous.
Viktor
00:31:56.306
There are, of course, too many people saying, ah yeah, AI is going to replace people. I completely disagree with that. But what I do feel is that we are on the edge of replacing the mundane tasks or chores that people need to do to get to what really matters. If I use the previous example, right, if there is an incident, me going through thousands of records is a chore, right?
Viktor
00:32:28.241
Me actually deciding what should be done when I get to the meat of the problem, that's the valuable work.
Jim
00:32:39.216
Absolutely, yeah. Productivity gain, right? Cut out the stuff that takes us a long time and that machines are really good at doing really fast, and then it lets us do what we do best.
Darin
00:32:54.261
I will say it a little bit differently. Let the machines be the ones that watch the dashboards instead of the humans,
Darin
00:33:01.055
because I mean, other than having a pretty graph in front of me, that's about all it is,
Jim
00:33:05.645
Oh, they are pretty sometimes. I mean, it's nice to look at those dashboards, but maybe an occasional glance from a human is fine, just to admire the work that's been done.
Viktor
00:33:15.665
Are people so badly paid in this industry that they cannot afford Netflix? Is that what we're saying?
Darin
00:33:25.965
Well, I think the chart that you don't wanna see is a Christmas tree farm. That's what you don't want to see. And if you don't understand what a Christmas tree farm is, just think up, down, up, down, up, down, up, down. That's what you don't want to see.
Jim
00:33:39.620
Although I think Netflix has gotten pretty expensive, Viktor, to your point.
Darin
00:33:44.345
Netflix is more expensive than my AI subscription now. So guess which one got dropped?
Darin
00:33:51.166
I'm still sitting here and I'm thinking about time again. Going back to pre-incident, detect, restore, post-incident. In that sliver between detect and restore is the incident itself.
Darin
00:34:02.487
In a perfect world, pre-incident and detect are 90%, the incident is 1%, restore is, call it, 5%, and then everything else goes to post-incident.
Darin
00:34:16.841
That's what it feels like it should be, right? When an incident happens, it should be able to be resolved quickly, especially if it's something we have seen before, or similar to what we've seen before.
Darin
00:34:34.091
Okay, great. Anytime you have a first, it's fine for the window to slide way open, and I would even say a second or third time is okay too, especially if they're happening in rapid succession. It's not good, but it's okay, it's understandable, and we've already talked about it. Lots of monitoring doesn't mean we have a solution. It's more of a problem to me just because I have a lot of monitoring. That just means I'm paying an observability vendor way too much money to store data that I don't need
Darin
00:35:05.036
yet. You do have a chance. Well, you've given yourself a false sense of security about having a chance, because if all you've been doing is stuffing the data away, you still don't know what's going on.
Jim
00:35:22.841
Well, think about that small sliver you're talking about, though. That small sliver in the middle is outrageously expensive, right? If that's the...
Jim
00:35:36.150
And so the whole goal should be to avoid that sliver in the middle. Do everything you can to avoid that small sliver. And the way to do that, and I'm pushing this out there as my opinion, is with better runbooks, because hopefully the operations center is getting some early indicators with alerts starting to fire. Another lesson I learned from many years in IT operations is that things didn't usually just go boom; they didn't go down hard immediately. There was a spiral; things spiral out of control. And in every postmortem I ever did, there were indicators there: things started to slow down. Occasionally there was something like a full network outage, which is, boom, weird, strange things, but that was almost never the case. So 95% of the time there were indicators there. And if you have the right runbooks in place, hopefully your operations team can get ahead of it and start testing the right things, or you supplement that operations team with really good automation, where automated playbooks and runbooks kick in, do the testing, flag things to the right people, and say, hey, the result of this test indicates that this is an issue; if you go fix this, then we should not have this outage that we're headed towards. To me, that's a huge part of avoiding that little tiny sliver. And then when you do hit that sliver, because it's gonna happen for sure, on the back end, after you've already hit it, there are a bunch of things you can do to make sure, or to try your best to make sure, that it's not gonna happen again. So fulfill those things. Make sure you've really identified all the right tasks to follow up on, and then give people the ability. Give someone authority, give a team authority, to follow up on it. Track these things. Make sure executives have the right view into the post-incident activities that are supposed to happen, and incent people to fix them so that that little sliver in the middle doesn't happen again. Hopefully, if you get the first part and the last part right, you take care of that middle significantly more often than is happening now. Significantly fewer incidents is the goal.
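As a rough sketch of the kind of automated runbook check described here: run a few health checks when early indicators fire and surface the failures to a human. The commands, service name, and contact address are hypothetical placeholders; real playbook tooling would pull these from the runbook system rather than a hard-coded dictionary.

```python
import subprocess

# Hypothetical runbook: which checks to run for a service and who to flag.
RUNBOOK = {
    "checkout-api": {
        "checks": [
            ["curl", "-sf", "-o", "/dev/null", "https://example.com/healthz"],
            ["df", "-h", "/var/log"],
        ],
        "notify": "oncall-checkout@example.com",
    }
}

def run_playbook(service: str) -> list:
    """Run each check for the service; collect the ones that fail."""
    failures = []
    for cmd in RUNBOOK[service]["checks"]:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"{' '.join(cmd)} -> exit {result.returncode}")
    return failures

if __name__ == "__main__":
    failed = run_playbook("checkout-api")
    if failed:
        # A real system would page RUNBOOK["checkout-api"]["notify"] here.
        print("Early indicators found:", *failed, sep="\n  ")
```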
Darin
00:38:02.933
So it sounds like you're proposing that, for a 2025 look, runbooks are automated. Yes, that makes total sense. If it can be automated, it should be automated.
Darin
00:38:19.339
As well, yeah, for sure. But I don't know how we're going to automate away telling the business that we need to not do what's been promised for the next quarter so we can make sure this doesn't happen again.
Jim
00:38:30.410
Hmm. So that's, again, a top-down cultural issue, and it's a business case. As with everything we do, there's a business case. These operations centers know how many incidents they have. When I was in these financial services companies, we tracked how much these outages cost us. Even though we're describing the typical outage as a small sliver, that small sliver can be anywhere from minutes to, in some cases, days. The head of product at my company, in his previous role, worked for an online retailer, and they had a significant issue that went on for 72 hours. This is a major retailer, right? They do a massive amount of business, so that was a major impact, because they lost those sales; that revenue went to some other company. So that's the business case, and that business case should drive the change. And there has to be a cultural change in this respect, to either leave room for the people who need to make those fixes who work there today, or to spend a little bit more to get a little more capacity to plan for this. The business case is simple, right? It just depends on how exposed you are and how good your current teams are. Again, some companies have a lot of resources and the business case won't be there for them, okay. But many companies who don't have that many resources should see that they can make this business case very easily.
Jim
00:40:25.179
Ooh, that's a tough question. What should it be, according to whom? From whose perspective? The ops person, or their executive staff?
Darin
00:40:35.034
The executive staff, I believe, are gonna say, well, they're three floors underneath the street in the data center, so it doesn't really matter. We'll just keep 'em going a hundred percent of the time.
Darin
00:40:48.797
or 99% of the time. 'cause they'll get a day off sometime 'cause they have to go to the dentist.
Darin
00:40:54.652
The management of those operations people are probably gonna think 90%. The peers of the operations people are like, I do so much unplanned work, I probably spend 75% of my week doing unplanned work,
Darin
00:41:10.792
if not more. And you're not gonna be able to have an ops person that can do everything. Yes, there are a few that can, but that's not typical. I think the biggest problem is that we are over-allocating what the operations people can actually get done.
Jim
00:41:28.847
We are, and we're making it worse today. There's definitely the do-more-with-less mentality; it's definitely there. I've seen it for multiple years at this point in our industry. There are a lot of conversations about burnout, and it's a real problem, and that has to be factored in here as well. We have to take care of our people. We have human beings. Think of a car: you can't run a car at a hundred percent all the time. You're gonna burn out the engine, right? It has to rev up and rev down, run at partial power. If you're at full power all the time, it's gonna be a very short-lived engine; it's going to break down. A human is very similar to that, and we're a lot more complex, right? We have a lot more things going on in our lives. So this is another topic that is really near and dear to my heart. I know many people in this industry who have burned out, and they've just had to quit the industry because it was too much of a grind, and it chewed them up and spit them out. And when we have this conversation, I don't want to even imply that executives don't care. I believe executives really do care, and I think this is a huge challenge that they want to solve. But at the same time, there are these business demands, so there are a lot of factors here, and I don't think people have ill will. I don't think anyone wants to see another human being burn out because they're overworked. But at the same time, the business needs to deliver, and so there has to be some sort of balance in there, and I think that's what every company struggles to find.
Darin
00:43:06.827
Well, if they don't, I mean, if the executives don't hit their numbers, they don't get their bonuses. Or their golden parachute doesn't open,
Darin
00:43:15.632
Sorry to all the executives, but that's the way I see it from the bottom up, and I think that's the way most people at the bottom see it.
Jim
00:43:22.928
And I would say, at the same time, I've known many executives, and they work a lot.
Darin
00:43:30.455
That's what us grunts don't see from the bottom. Yeah, it's even worse. But you know, I typically don't see the executives when something's hit the...
Jim
00:43:41.894
You hope not, actually, right? You know, that's the last thing you wanna see at that point in time. You just want to get it fixed and then go back to sleep as...
Darin
00:43:55.634
So where does Xurrent fit into all this? Now, Xurrent. Okay, you're doing your clever Web 2.0 naming here, that's what this is. It's X-U-R-R-E-N-T. So like "current" except with an X, and it's pronounced like Xerox: Xurrent, if you...
Jim
00:44:13.904
Yes, yes, yep. So Xurrent is across all of these things that we've been talking about, unsurprisingly, right? Think of Xurrent as a platform for IT operations, for IT service management, for enterprise service management. The service desk would utilize our software, and so we have this ticketing and request software with workflow automations that are designed to improve collaborative productivity. Think of a workflow as a process that kicks off from person to person, and everybody knows what they're supposed to do at the right times. There's a self-service portal where end users come in; let's say there's an incident going on, end users might come into the self-service portal, report an incident, and look for some sort of status update on it. You tend to see many tickets for the same thing when an incident occurs. That's why status pages become so important, which is another part of Xurrent overall. We have a product called StatusCast that does status pages as well, so it gets that communication out, but also to various layers, because based on your role, you can see different things. There can be an executive update in there, and the executive can just read it within StatusCast, while another person in another role logs in and sees a different update. So that's the key here: we're trying to communicate to the various stakeholders in the way that they need to be communicated with. And then the third part of what we have is this automated incident response, where we understand who is responsible for what services, we take in alerts and make sense of them, and we assist in solving the problem as quickly as possible: updating those runbooks and playbooks, sending the post-incident activities, the tasks that need to be performed. It does an automated postmortem and sends those back into the request system for that service, the service management system, making sure that there are automatically generated requests in there so that teams can track them and start working on them. So we're working on solving this end-to-end issue that we've talked about. We're not all the way there yet; we have many pieces of this in place right now, and we're working on getting more of the pieces, more of the analytics layer, in place to help and assist as much as possible throughout the process. So that's what we've been working on. And again, I joined the company because I had a background in this, and when I heard the vision for where the company was headed, the problem resonated with me, because I'd seen it throughout my career in IT operations, and then after my IT ops career, when I was working for various vendors in the IT operations space, I kept seeing the same problems over and over again. So that was a really worthwhile thing to participate in, and hopefully we can help solve a very big challenge that many companies face.
Darin
00:47:20.651
We've already talked about how we sort of see AI today. It's in its infancy, for lack of a better term.
Darin
00:47:29.061
Is it being helpful now for y'all within Xurrent? Think back to the beginning of 2025 compared to October of 2025. How big of a jump has it been for you?
Jim
00:47:41.781
Yeah, so things have really been coming together. Over time, what we've done is identify the places where AI can be a productivity assistant, where it's a productivity enabler for folks that need to use the system, whether that's an end user coming in through the self-service portal, or within the system doing postmortems in the automated incident response, which is called Zenduty, by the way. Then within the service management portion, for the agents or specialists, there are plenty of opportunities to assist them too: summarization of long tickets with lots of information in them, so quickly summarize that, let the AI handle it; and writing documentation, which is a huge one. That's something I hated doing when I was in IT operations, and I don't know many IT people who really like writing documentation, so AI is great at that. From any note in any request, you can say, hey AI, from this note I need you to create a knowledge article. It will look at the context of the request and all of the notes contained within, and if this is about solving some sort of problem, it will create a knowledge article on solving that problem. It'll automatically route and classify tickets so that there are no ticket queues to arbitrage, where tickets get lost for days at a time and nobody gets the help they're looking for. And this has all culminated in an AI fabric for us. It's a mesh of capabilities that are all productivity enhancers, so that we could launch our virtual agent. The virtual agent is the first point of contact for someone coming in from a self-service perspective, so they can chat with an AI that will look through the knowledge articles that AI helped write or improve, and it will recommend ways for the person who's having the problem to fix it on their own. So hopefully they can just self-serve, and that's why we had that conversation about doing more with less: you don't have to hire more and more people on the service desk, you can do more with the staff you have. And then, let's say you can't get to a resolution on your own, it will even submit the request on your behalf and route it to the right people. One of my favorite things is that within that same interface, if you don't want to talk to AI, which many people still don't, you can just click the hey-I-want-to-talk-to-a-human button, and then you're chatting with a human being on the other side of that service desk instead of the AI system. So there's so much potential for productivity gains with some of these basic capabilities, and that doesn't even start to touch some of the things we talked about. I'll get into a buzzword here with agentic AI. I think that's a huge buzzword right now, way overused. But when I think about agentic AI, it's about taking actions, and I made my feelings clear earlier that I think it's super dangerous to set a system free within your environment and let it make configuration changes or do anything complex. But there may be some very simple things where you say, you know what, this is fine, we're gonna allow AI to do these few little things on its own, because it's the repetitive things that, Viktor, you mentioned. You just don't want to do all of those repetitive things, and it's fine to push that off to a machine.
It's not gonna hurt anything, right? So that's where these productivity enhancements are really coming in.
Darin
00:51:18.447
So Xurrent can be found at xurrent.com. That's X-U-R-R-E-N-T dot com. Jim, thanks for being with us today.