DOP 54: Achieving Continuous Verification Using Chaos Engineering

Transcript

Darin Pope 0:00
This is episode number 54 of DevOps Paradox with Darin Pope and Viktor Farcic. I am Darin

Viktor Farcic 0:06
and I am Viktor.

Darin Pope 0:08
And today we have more guests. Welcome to our neighborhood.

Russ Miles 0:14
Thank you.

Sylvain Hellegouarch 0:15
Thank you.

Darin Pope 0:17
This neighborhood may be a little...how shall we say today? Chaotic? Viktor, do you want to go ahead and explain the backstory to why our guests are here today?

Viktor Farcic 0:34
Yes. So I was working on and off on different experiments and ways of using chaos in the past. And then, not long ago, I wanted to refresh my toolset, let's say. One of the reasons for that need to refresh is that Darin and I decided to write a book about chaos engineering in Kubernetes and do a course on Udemy. And when I explored what's out there right now, I stumbled upon something called ChaosToolkit. We used it together with a few others and decided that's the one we like the most. Created a course, wrote a book, blah, blah, blah, all that stuff, right? And so I thought, hey, it would be cool, instead of me or Darin talking about chaos engineering in one of the episodes, why not invite the guys that made it? I mean, not made chaos, but made ChaosToolkit. And that turned out to be, and I will probably butcher your name, Sylvain. Sylvain. Does that sound right?

Sylvain Hellegouarch 1:48
Yes, it sounds good. Sylvain.

Viktor Farcic 1:51
Russ

Russ Miles 1:52
Russ is easier, I think slightly.

Viktor Farcic 1:56
So I'll let you introduce yourself, guys.

Russ Miles 2:01
Okay, well, for a podcast like this, I think the tech expert should do the introductions. And he's looking at me with such glee at that thought. So I'm gonna hand over to you, Sylvain, to do the first honors, and then I'll just come in and correct you later.

Sylvain Hellegouarch 2:19
I appreciate that. Thank you so much. Well, as you can tell, Russ is not my friend. Yeah, we've known each other for a long time. And I think ChaosToolkit came out of so many of our discussions over the years, probably more on the design part of software in the past until, I guess, three or four years ago, something like that, where we saw the other side, the upside of that, and that's where, you know, we ended up discussing chaos in general and ChaosToolkit as our answer to that. So my name is Sylvain. I'll say my full name for once, but I'll keep to my first name otherwise: Sylvain Hellegouarch. You know, believe it or not, even in France I struggle with it. So that's fine if you call me Sylvain. I've been in software development for a very long time, and I'm still at the core a software developer. That's where I really, really love being, but I've actually had new opportunities to work in various aspects, which led me to conclude that to become a good software developer, you really need to have a systemic view of your system. And once you have a systemic view of your system, you realize it's all a castle of cards, and you really need to test how the castle reacts to a variety of signals. And that's why, you know, we ended up with chaos engineering in general and having a lot of fun recently, with that leading to reliability, as we'll probably discuss later on. So it's been a long journey coming to that conclusion, but it's actually been a very fun journey. I'm sure we'll speak more about the tech itself later on, so I won't dive into that right now. Russ, on to you.

Russ Miles 4:22
Thank you. So now I get an opportunity to correct Sylvain, but you can't really correct him on his own perspective on himself. So, yeah, I think the only thing I can add to that is, as I explain to most people who are introduced to us for the first time, that it's a minor miracle. He's French, I'm English, and we get on so well. We ignore many, many hundreds of years of problems to ensure that we're best friends, and we have been for as long as we've known each other. And yeah, my background is that I've now written and contributed to, I think, six or seven books on software development, delivery, and different operational models. And chaos engineering has been a big part of that, with Learning Chaos Engineering out last year and Chaos Engineering out in the last month, both from O'Reilly, so it's been an interesting journey on that side of things. But that's more like the documented stuff. The interesting stuff is the practice of what we've done and what we're doing together as a team. And so our journey has been going on now for probably four years, five years. And it's now really taking things to the next level by focusing on what people are actually doing and what they want out of doing these things. So the tech is really fun. It's really, really awesome. But the piece that always gets me the most excited is what people are actually learning from doing any of these things, beyond the testing approach. What they are doing to learn things is where my head's always at. And, yeah, it's a pleasure to be here.

Viktor Farcic 6:09
So I really want to ask, what did they do? What did they learn? But before that very quickly, just in case somebody doesn't know what it is, what's chaos? What's chaos engineering?

Russ Miles 6:23
Okay. I can tell you what it is and then I'll briefly address what it's not. Chaos engineering is, first and foremost, for me anyway, a mindset. It's a way of looking at a system and asking questions about it. And some of those questions are related to its robustness. Robustness is often misunderstood, in my opinion. People invest in robustness. Robustness is for the stuff you know might happen. So when you look at a system you go, okay, we probably need an HA database. We probably need a backup strategy for that database. We probably need more than one instance available so if one goes down, then we still have one left. And we maybe need to start to use circuit breakers, bulkheads, all these wonderful patterns that are out there that we try to introduce into systems to make them more robust. And you have to extend that, of course, to the people involved. It's even more important when you consider the people and how they react. Where do they go? What do they say to one another? What are they feeling as they go through a challenging circumstance? So chaos engineering, for me, is looking at that complicated environment. Firstly, you can do chaos testing and experimentation, where you are looking to prove and improve your robustness. So, for example, you may have invested a huge amount of money in observability. But have you ever explored what happens with observability when a problem occurs? What are people looking for? That sort of practice gives you the ability to, in one circumstance, make sure that that particular part of your system is going to contribute a net benefit to things rather than become a source of obfuscation, maybe. So there's that. And that's all about robustness. But the really interesting edge for me, the piece that's often forgotten, is that there's the desire to be more resilient when we're meeting challenging outcomes, challenging situations.
So what that resilience means, very, very loosely, at a very high level, is that you are better prepared for when things surprise you. Okay? When surprising circumstances happen, the phrase that is often used is that you're poised to adapt. You've invested in your capacity to have that poise. Chaos engineering has a huge part to play there. It's one tool in the toolbox, really. And what it does is give you the opportunity to play out scenarios and practice what happens when something surprising occurs. This is different from proving what you knew was going to happen. This is trying to practice how you react and learn from when something surprising happens. And that's where the real value is. That's where the lessons to be learned really are: when you start to look at, okay, what really does happen when something's going wrong, and maybe we don't know about it at first. When did we first know about it? When do we then start to get to a resolution on what it might be? Who has to be involved? How do we synchronize? How do we respond? Those resilience capacities, capabilities if you like, are one of the things that you can also practice and develop by doing chaos engineering experiments.

Darin Pope 10:11
What is it not? You never got to the not. I was waiting for the not.

Russ Miles 10:15
Yeah, no, sorry. I felt like I was on such a monologue that I decided to stop. Okay, so what is it not? The answer to that sounds a little bit trite, but it's more about trying to address what people mistake it for. It's not just breaking random virtual machines, for example. Chaos Monkey is a great example. This fabulous tool did a great job of causing a bit of a stir, shall we say, and getting people thinking a little bit this way. But I think a lot of the work since then has been trying to tell people, no, it isn't just that. It isn't just about breaking things. Breaking things tends to be the easiest thing to achieve. Let's be honest, it's pretty simple. You know, in my case, just take my daughter into work and give her a keyboard and let's see what happens. That isn't the point, at least I hope it's not, though she may have fun on the way. The point is firmly on the side of what are we going to learn from all this? You know, have we set ourselves up to learn? Breaking things is fine. Stressing performance, exploration, security exploration. What you're looking for is the answers that you may not have right now. And the answers you get from chaos engineering that you haven't had in the past come when you begin to answer the question: how will everything come together when things are difficult? There's a great analogy for this, very, very quickly, that I like to use, which is that when you put a system together and you design it and you put in your continuous delivery, and then you've got your DevOps, you know, silos broken down, and everything looks like it's delivering quickly and qualitatively into production, you're really happy with that. And you should be. That's been the major challenge for the last 20 years. So let's celebrate that.
The difficulty is that what you've done there is build a really good set of players. You bought some good players, but you haven't yet tested those players under match conditions. So you could have fabulous observability, fabulous, you know, other robustness strategies in there, but it's only when you bring it all together and put it under significant strain, not just performance and traffic but the strain of inevitable failures, that you start to learn how it all really plays together. So resilience engineering, chaos engineering, continuous verification: these are all approaches and tools to try to help you explore what will happen when something occurs, whether it's known or unknown. So yeah, the very, very long answer to what it's not is things like just breaking bits. I had a great phrase from a bank that called us and said, you're gonna love us. We do chaos engineering. In fact, we're done with it. We've done it. It's perfect. It's done. And I thought, that's an incredibly interesting perspective. When is it ever done? Are you telling me the system's turned off now? Because that's okay, I get that, that would be done. But if it's a live system, and it's being used, and it's evolving, then you're never going to stop learning how to look after it, how everything plays well together. It's practicing reliability, as we call it. Practicing reliability goes alongside your speed of feature and utility delivery. It's a part of your game, and so you couldn't argue it's ever done until the system's turned off.

Viktor Farcic 14:07
What would be, then, let's say, the prerequisites? So let's say that I know nothing about chaos. I want to start practicing. Can I just do it, or do I need to be at a certain level of maturity in certain things?

Russ Miles 14:26
So I'll have a crack at this and I'll let Sylvain do some talking. Otherwise, he's gonna fall asleep, and it's not fair on me. Very, very quickly. It's a great question, and I've addressed it in an article that I published yesterday. I can get you the link and maybe it can be distributed with the podcast. The short answer is that while it would be helpful if you had things like good observability in order to do chaos engineering well, it's less of a maturity model and more of a mutually beneficial relationship. So as you think like a chaos engineer and you start to look at systems through this lens of how will it react when things go wrong, you can start to practice these things straight away. You wouldn't necessarily practice them immediately in production. Let me just put that out there. If your experiment is very likely to destroy the system, don't do it. Maybe pick something a lot smaller, more contained, and start to learn from that first. But in terms of technical prerequisites, I don't think there are any obvious ones, and it's an interesting perspective. So I heard this one. I'll just give you this quickly. Someone said to me, do I need high quality observability? Does reliability work require high quality observability? The answer to that question is, how do you know it's high quality observability? How do you know? What is the measure of high quality observability? For me, one of the big parts of that measurement is how does it work when something goes wrong? How does it play into those scenarios, those moments when it's most needed? If you don't practice what we talked about, reliability, chaos engineering, system verification, game days, if you don't practice some of these things, you don't know how it's going to play in those circumstances. So my take is you don't know it's high quality yet. What you do know is that you spent a lot of money on it, probably. You've bought the right thing.
You followed all of the great advice to get it working well. You've got a great player. But the thing that matters is, have you got a great team? And practicing reliability is something you can do the moment you think about doing observability, because it helps you fashion the right observability for you, so that you have a great team at the end of it. It's all very well having great players. But the whole point is it's not about a single player. Sorry, Sylvain, I may have covered it all now. Okay, I'm getting thumbs up.

Viktor Farcic 17:24
In a way, I think it boils down to what you said before, along the lines of, yeah, it's about finding out things you don't know rather than things you do, right? Now my kind of dilemma, let's say, is, yeah, you can initially contain it and say, okay, I'm going to practice this only on a single application, right? And I don't have observability and all those things, but it's okay, I'm going to do it on a single application. I'm going to shut down an instance and see what happens. And you know what, I already know what's gonna happen. That's such a simple case that I know what's going to happen. I don't believe I would need an experiment for that, right? Now, what I don't know is, if I do the same thing in the system as a whole, I might actually find out that some seven levels beyond what I actually touched, something got broken. Right? Kind of, you shut down, I don't know, a replica of an application, or do something with the networking of application A, and it turns out that application F is misbehaving, right? Those are the things you don't really know. But my dilemma is then, how do you find out if you are not able to observe the system? I'm not going to repeat the words high quality, but to observe the system one way or another.

Russ Miles 18:48
So, you're absolutely right. And I guess what I'm saying is that most companies, technical groups, think they're trying to do everything right, or if they're not doing everything right, there are other reasons why they're not doing it right, and people are, you know, hiding costs or something, I don't know. But generally speaking, people are trying to be as prepared as possible. What chaos engineering is good at is telling you where your assumptions fail. And so, going back to your point about observability, you should invest in observability. Yeah, absolutely. And just thinking about a system in terms of what will we do when the time comes will get you towards observability. What I have a problem with is when the phrase is used, you must do this first. I see it as going very much in tandem. I'm doing it because I'm thinking about reliability, because I'm thinking about how the system responds in failures. You know, you should invest in robustness of the system, and observability, and other bits and pieces. What we're saying, though, with these reliability practices, like chaos engineering, is that you can do everything as right as you think you can. But you don't know it. You don't know. And knowing can only happen in one of two places. It can happen when the house is on fire, which is not the best time to learn. You know, we've all had that moment where you're in a personal crisis, and one of your friends turns around and says, if you'd only done this differently in the past, you wouldn't be here now. And you know, they're right. But that's not the moment to learn. Okay, so what we do is proactively explore the system. When you think you've got things as right as you want to, or think they can be, perhaps, then these practices tell you where your assumptions are falling down.
And they also tell you where you might want to improve, okay, where you might want to learn from, you know, further exploration. They sometimes give you the evidence of your own ability, or of the lack of mapping between assumptions and reality.
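
Show notes sketch: the exchange above, where shutting down a replica of application A turns out to hurt application F, can be illustrated with a toy experiment loop. This is only a minimal sketch under stated assumptions, not how ChaosToolkit itself is implemented: `ToySystem`, `run_experiment`, `kill_a0`, and `restore_a0` are hypothetical names invented for the illustration, and the "system" is just a dictionary of replica states.

```python
class ToySystem:
    """Hypothetical system: application A has two replicas, application F depends on A."""

    def __init__(self):
        self.a_replicas = {"a-0": True, "a-1": True}
        # Hidden assumption nobody wrote down: F only ever talks to replica a-0.
        self.f_upstream = "a-0"

    def a_is_healthy(self):
        # A is considered healthy while at least one replica is up.
        return any(self.a_replicas.values())

    def f_is_healthy(self):
        # F is healthy only while the one replica it is pinned to is up.
        return self.a_replicas[self.f_upstream]


def run_experiment(steady_state, inject, rollback):
    """Check the hypothesis, inject a failure, check again, and always roll back."""
    if not steady_state():
        return "aborted"  # never experiment on a system that is already unhealthy
    try:
        inject()
        return "assumption held" if steady_state() else "assumption failed"
    finally:
        rollback()


system = ToySystem()


def kill_a0():
    system.a_replicas["a-0"] = False


def restore_a0():
    system.a_replicas["a-0"] = True


# Narrow view: only application A is observed, so the outcome is the one we "knew".
narrow = run_experiment(system.a_is_healthy, kill_a0, restore_a0)

# Wider view: the same failure, but the hypothesis also covers application F.
wide = run_experiment(
    lambda: system.a_is_healthy() and system.f_is_healthy(),
    kill_a0,
    restore_a0,
)

print(narrow)  # assumption held
print(wide)    # assumption failed
```

The same injected failure gives two different answers depending on how much of the system the hypothesis observes, which is the gap between "I know what will happen" and what the wider system actually does.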

Sylvain Hellegouarch 21:27
I think, if I may, on what you said earlier, we are heavily based on hindsight, where, like Russ just said, you should have done that, you should have done that, we should have, you know, achieved this or that. Right now with COVID, you see that. We have a saying, I don't know about other countries, but in France, we say we have 65 million coaches for the French football team. Where basically everyone knows better than the guy who's actually doing it, or the person doing it, sorry. So what I'm seeing is with COVID, we see that again: they should have done this. And you know, I'm not bitter at anyone, though sometimes I feel like that. But at the same time, two months ago, there were signs of COVID, but not to the extent of what we have now. Right. So chaos engineering is not here to fix something you know about. It may have, you know, facets of that. But it is there to prep you to behave. Right? Again, you may have great players, but what if they've never played together? And you see that sometimes in some teams where, individually, they're all in the best, you know, clubs around Europe or elsewhere. But when they're in national teams, suddenly the team is not as good as the individuals that are part of it, because they're not used to practicing together. And the whole point of chaos engineering is not to say, hey, I've done that and I fixed it. First of all, your system the next day is different from yesterday. Because of that, your assumptions are already, likely not necessarily wrong, but, you know, they have an offset. And have you tried that offset? So the idea is not to continuously change. But I think it's the idea that if you practice, you will feel less stressed. You'll have a better understanding of how to react. Last year at ChaosIQ, we had a TLS issue. And that is a shame, because in chaos engineering talks I've given, I've always used that bloody example. And yet we fell into it, right. So there is that.
But it was the third day of my holiday, and I wasn't there. And the team called me and said, we found many things, but we can't figure it out. And of course they couldn't, because up to now so many things have been in my mind. I haven't been very good at letting other people, not even document this, because documentation only goes so far, but practice the issue. And that's really what, at least at ChaosIQ and with ChaosToolkit, we're all about: you are going to practice. Even the little things can have big repercussions, if I may use a cliche. So the goal is in that practice you do. It's not necessarily in the one-off tests you're going to run. And that's what, you know, we're passionate about. Making the teams better through that.

Russ Miles 24:33
Yeah, absolutely. I mean, one other story that I think is worth pointing out is about the fidelity of the information that you get out of practicing the way that Sylvain is talking about. I've had that conversation, and I won't mention the company, but they're a big company, a massive, global, very successful organization, entirely based upon their internet services. And I spoke to an engineer there, a fabulous engineer, who came up to me after one of the talks on chaos engineering. And he said, look, if I know the system's gonna fail, I don't need to do the experiment. I know. And I said, okay, you might be right. I like your confidence, sir. However, that word know is a bit awkward. Because there are levels of know. And what we normally mean when we use the word know in English is that we strongly believe. We really strongly believe. We have no evidence because it's never happened. But we strongly believe it. And I said, it's okay to, you know, have strong beliefs in how your system will respond to things you know about. You've got great observability, etc., you've got all these things that you know, right, you know they're good, but what you're really saying is you strongly believe. And one of the things that you can get out of practicing chaos engineering and system verification, for example, is that you're going from strongly believing into knowledge: knowledge of a given moment in time, knowledge within a given context and a given set of circumstances. It's not generally applicable knowledge about everything that ever might happen. If you can get a really good MTTR over the course of several weeks, that's a good measure. But that doesn't mean that the next instant, that's gonna be your time to recover.
Okay, but you practice these things because one of the things you will get is a greater feeling of knowing more about how the whole system works together, you know, in these difficult circumstances. And in that particular case, with this gentleman who said, I know what will happen, I said, you know what'll happen to the database, but what happens to the rest of the system? He said, I think it will fail. And then one of his friends came over and said, I'm not sure it would. And so you can see the doubts building, right, and it's good. A doubt is a good thing, you know. Chaos engineering is based on a scientific mindset and a scientific approach. And so doubt, skepticism, is a good thing. So we encourage skepticism. You practice because of skepticism. You practice because you know you don't know. Or at least, you know, you realize it's a strong belief.
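
Show notes sketch: the MTTR remark above can be made concrete with a little arithmetic. The incident timestamps below are invented purely for illustration, and `mean_time_to_recovery` is a hypothetical helper, not part of any tool mentioned in the episode. The sketch shows why a mean measured over several weeks describes the past without bounding the next recovery.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (recovered - detected) over a list of (detected, recovered) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((recovered - detected for detected, recovered in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents over a few weeks: (detected, recovered).
incidents = [
    (datetime(2020, 4, 1, 9, 0), datetime(2020, 4, 1, 9, 20)),      # 20 minutes
    (datetime(2020, 4, 8, 14, 0), datetime(2020, 4, 8, 14, 10)),    # 10 minutes
    (datetime(2020, 4, 15, 22, 0), datetime(2020, 4, 15, 23, 30)),  # 90 minutes
]

mttr = mean_time_to_recovery(incidents)
longest = max(recovered - detected for detected, recovered in incidents)

print(mttr)     # 0:40:00
print(longest)  # 1:30:00
```

A 40 minute mean sits alongside a single 90 minute recovery in the same data, so the mean is knowledge about a given window and context rather than a promise about the next incident.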

Viktor Farcic 27:17
I think that you both now mentioned two of the huge issues that companies normally have that are not really directly related to chaos, right? That ability to say, I know, is just horrifyingly dangerous. I mean, it's not really related in my head to chaos. You can say the same thing like, I wrote those 50 lines of code. I know they work. You know, I don't have to write tests because this is simple code. I'm going to write tests when I'm unsure about it. It's that same mindset of using the word know instead of guess, maybe, or something like that, right?

Russ Miles 28:10
Yeah, yeah, you can think of it like TDD for the system under really interesting circumstances. One of the things that we get back as a comment from our customers, actually, is that chaos engineering is really good at helping them prove and improve their robustness, which means they're learning from their assumptions. They've made good assumptions, they've made good investments, they've tried their best, and it helps them prove and improve on those guesses, those beliefs, really. But the other big challenge is, of course, the resilience, which is the preparedness for when these circumstances occur. You know, how good are you at anticipating, synchronizing, and responding to something? How good are you? You don't know until you practice. One of the tools in our toolbox for that area is game days, which I think is a phenomenal exercise. It can be expensive in people's time. But it is singularly the most powerful tool I know of at the moment to proactively try and learn from how the system will play out, and your chaos engineering experiments, the automated ones, can be part of that. They are often the backdrop to it. But in the middle of that game day, you're going to be learning so much. You're observing so much. And the key question, actually, in all of that is, what happened? What is going on in people's minds, in the dashboards they look at, in the observability tooling that's provided in the system, in the elasticity that you can take advantage of? What happened? And that's why the core concept in our recent tooling is the timeline: the ability to take many, many, many sources of information and bring them together to help you answer and explore that question of what happened, in what order. It tries to address what Sylvain alluded to earlier, which is one of the big problems with post mortems, or incident analysis, as I prefer: hindsight bias is a big issue.
And so one of the things we try to do with the timeline is support analysis of what happened in the order it actually happened. Very, very quickly, there's a real problem with analysis in that people tend to look at things from the back end to the front end. They go, well, you know, that should have happened. Oh, they didn't do that. And we see it now with COVID all the time. Why didn't we, you know, isolate two weeks earlier than we did? Well, at the time, we had imperfect information, and it was a complex picture, and a judgment had to be made. And it was always going to be suboptimal, probably suboptimal. So they did the best they could do. Some countries got it much better than others, right? That's okay. That's what happens in those circumstances. It's a complex problem. So the answer isn't going to be, do that. Judgment has to be involved, and judgment with imperfect knowledge. Anyway. So that's the same thing we see when people try to analyze a system that's had a problem, such as your online system, your internet services: unpacking what happened in the order that it happened. Coming from the front end to the back end, even going earlier than the beginning, so you're looking at the system way earlier than that. You can start to understand the many facets that contribute to the complex situation that emerged, and that's the thing we're most interested in now. Chaos engineering gives us a way of deliberately orchestrating these opportunities to explore what happened and learn from them.
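
Show notes sketch: the timeline idea described above, taking events from many sources and reading them front to back in the order they happened, can be sketched in a few lines. This is an illustration only and says nothing about how the ReliabilityToolkit timeline is actually built: `Event`, `build_timeline`, and the sample events are all hypothetical.

```python
from datetime import datetime
from itertools import chain
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime   # when it happened
    source: str    # where the record came from (experiment, alerting, chat, ...)
    note: str      # what was observed or said


def build_timeline(*sources):
    """Merge events from any number of sources into one list ordered by time."""
    return sorted(chain.from_iterable(sources), key=lambda event: event.at)


# Hypothetical records from three different places during a game day.
experiment = [
    Event(datetime(2020, 4, 20, 10, 0), "experiment", "certificate expiry injected"),
]
alerts = [
    Event(datetime(2020, 4, 20, 10, 7), "alerting", "error rate above threshold"),
]
chat = [
    Event(datetime(2020, 4, 20, 10, 3), "chat", "is the API slow for anyone else?"),
    Event(datetime(2020, 4, 20, 10, 15), "chat", "looks like TLS, who knows the cert setup?"),
]

timeline = build_timeline(experiment, alerts, chat)
for event in timeline:
    print(event.at.strftime("%H:%M"), event.source, event.note)
```

Read in this order, the sample shows a person noticing the slowdown four minutes before the alert fired, the kind of detail that a bullet-point summary written in hindsight tends to lose.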

Viktor Farcic 32:17
To me it's also very interesting what Sylvain said before about, what was it, the example with TLS, right? And I feel that is happening also very often. No offense, Sylvain. But a person or a group of people would be doing stuff, and that includes chaos engineering, but it's equally applicable to something else. They learn something from it. Great. But the effect of that learning is extremely limited. Right, kind of like, yes, I learned, and that can help me in my job, but the rest of my company still has no idea how to fix problems with TLS certificates. I'm not attacking you now, I'm just using it as an example. Right? And what might be more important is actually how you propagate the learning experience and the discovery, maybe more than the experiment itself.

Sylvain Hellegouarch 33:21
Exactly. The learning is the key thing here. You don't realize it, but the more you do something, the more automatism you create. We often speak about automation, but we see it as a tooling prospect. The more you practice something, the more your brain lets go and the faster you respond. So we've had that sometimes, where someone is going to say, like he explained earlier, but I've done that once. You know, I fixed it and that works. First of all, it's great you did it once, because at least you're one step up, you know, on the ladder. I think it's fantastic. But the problem is, even if you found and fixed something, the learning doesn't happen once. I've got a baby daughter who is four months old. And soon she'll have to diversify her food, right. And what I've read is that you need to try certain foods up to seven times so that their taste gets used to it. The first time you give that food to the baby, usually they just reject it, likely, unless it's like raw sugar, probably. But otherwise, it's just, you know, I don't want that. So you need to train. And at some point, their body is going to react more appropriately, automatically, and the brain will let go. And I think that's what you don't realize: when you practice, you're actually learning. The learning is not necessarily a concrete artifact at the end of the day. You may never fix the issue found, but you're better at recognizing it. You're better at looking for it. You're better at handling it in general. And that's invaluable for any company. If your team is getting better at it, you've gained so much. So the point we made when I came back from holiday and we had fixed that TLS issue was not just to say, obviously, we don't want that TLS issue to come back again. That's where you come back with monitoring, proper alerting and all the shebang. That's obviously the thing.
But what we really want to work out is how we practice to, you know, panic less, be better as a group of people. It's really people centric, to be honest, that thing. And that's where we are at, right? It's all about the learning, which arrives by practicing, right, and you don't necessarily know it, it just happens. But you have to do it. Unless you have extremely raw talent for that thing, it's not going to happen, you know, on its own. So you want to practice.

Russ Miles 36:01
Actually, just to call out a friend of ours, certainly someone we admire greatly: John Allspaw and his recent work on that. He's done some amazing talks around how you learn from these sorts of activities. And a big deal in that is the concept of the story. So whatever you come out of having invested in doing a game day or some chaos experimentation, the outcome of it is a story that can be told. A compelling story, something that everyone could be interested in learning from. And this is something that's lost when you distill these situations down to just the bullet points. Yeah, here's our post mortem analysis, and it's got five bullets, and actually what it says is, we should never let Sylvain go on holiday again, because, you know, he's the TLS expert, and that's okay. The story is more interesting. I remember being in that incident, and I can remember talking to Sylvain over dubious text messaging mechanisms because he was going through black spots, and lord knows, I don't know where he was in France, but it wasn't an easy place to come back and fix anything. And so it was a hugely challenging time. And all of that is better captured if you have a story to tell. I'm not using the word story in the fictional sense. I'm saying, you know, the actual narrative of it is the most compelling part. A good analogy for this is, you know, many people have watched the recent Chernobyl TV program, and the beauty of that program is it tells it from front to back, so you go through it with them. And you do get this sense of, oh my goodness, they really did think they were doing the right thing then, and they thought they were doing the right thing then, and they thought they were doing the right thing then. That's the right way to tell the story to learn from it.
The wrong way to tell the story is like I did just to begin with. I said, you know, oh, well, TLS problems, Sylvain is on holiday, here's some answers, we're done, right? That's not what the outcome of a good incident analysis is. It might be some findings, it might be some suggestions. But the more important thing is the story. And that's what our tooling is trying to help you with.

Darin Pope 38:26
The tooling is ChaosToolkit.

Russ Miles 38:29
That's one part of the tooling, yeah. That's the chaos engineering part. The ReliabilityToolkit is the piece that we've added around that to do things like game days and verification, which is achievable using ChaosToolkit. The reality of ChaosToolkit, when we first came up with the project, was to enable the individual to do some chaos engineering. And so it's got all the pieces in there to do that, without a doubt, including how you then report on what happened and try and build that narrative. But it's pretty limited in what it does. It doesn't easily support the larger picture of how everyone learns from it. That's not what its purpose was originally. And so what we've done is we've built the ReliabilityToolkit, which is our SaaS that gives you a place to practice this larger gamut of activities. That's what it's there for.
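
The idea described above, an individual running an experiment and then reporting on what happened as a narrative, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Experiment` class, its fields, and the toy `replicas` system are hypothetical names invented for this sketch, and it does not use or reproduce the ChaosToolkit API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical experiment shape: check, act, re-check, roll back."""
    title: str
    steady_state: Callable[[], bool]  # probe: is the system behaving normally?
    inject: Callable[[], None]        # action: introduce the turbulent condition
    rollback: Callable[[], None]      # undo the action afterwards
    journal: List[str] = field(default_factory=list)  # the story, in order

    def run(self) -> bool:
        self.journal.append(f"Experiment: {self.title}")
        if not self.steady_state():
            # Never inject failure into a system that is already unhealthy.
            self.journal.append("Steady state not met before starting; aborted.")
            return False
        self.journal.append("Steady state held before the action.")
        try:
            self.inject()
            self.journal.append("Action applied.")
            held = self.steady_state()
            self.journal.append(
                "Steady state held after the action."
                if held
                else "Steady state deviated: a weakness was found."
            )
            return held
        finally:
            # Roll back whether the hypothesis held or not.
            self.rollback()
            self.journal.append("Rollback applied.")


# Toy system (an assumption for illustration): a service with two replicas.
replicas = {"healthy": 2}

exp = Experiment(
    title="Service survives losing one replica",
    steady_state=lambda: replicas["healthy"] >= 1,
    inject=lambda: replicas.update(healthy=replicas["healthy"] - 1),
    rollback=lambda: replicas.update(healthy=2),
)
result = exp.run()
print("\n".join(exp.journal))
```

The journal is kept in the order things happened, so it reads front to back rather than as a list of conclusions.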

Darin Pope 39:29
So ChaosToolkit, open source,

Russ Miles 39:31
Yes

Darin Pope 39:32
ReliabilityToolkit, pay us money.

Russ Miles 39:36
Yeah, very simply. That sounds about right. Yeah, but don't pay us too much. We're not asking for a mortgage.

Darin Pope 39:45
There's a lot of financial jokes I can throw in there. I'm not going to. Cool. This was a great talk. Where have you seen people, let's say big organizations, because I'm assuming you've got some big organizations whose names you can't say, which is fine. It seems like the biggest disconnect is that people think they know. You start asking around and people use the word know. And once you start probing more, that know becomes think. What was the biggest realization for one of your clients? When did the light bulb go off for one of those clients?

Russ Miles 40:28
Oh, I've got a couple of great stories in there. I'm quite fortunate in that I've been training people in chaos engineering now for a couple of years. So you get to hear some great confessions over, you know, end of day beers. I think one of my favorite stories is of the gentleman that came in on the first day of one of the courses, and he said, I like everything you said, Russ, but our system doesn't fail. And I thought, okay, that's great. You know, well done. He said, yeah, it's a single VM. It's a very simple monolith, there's only a few of us working on it, and it's been running for three years. It's never failed. I thought, fair enough, okay. And I said, you know, maybe there's a complexity argument in this. Maybe the more complex the system is, the more chance for failure, etc. There are provable, number-crunching kinds of arguments you can make there. The beauty of it was, I'm a big believer in Murphy's Law. Because the next day he came in and the first thing he said was, Russ, did you break it? I was like, no, no, no, I haven't. I haven't touched the thing. I couldn't, I don't think. I'm not a hacker. I'm not gonna do anything like that. And yeah, his one VM had failed overnight. Worse than that, it got into interesting corruption issues, and he had a disaster recovery strategy that had never been tried. The database had been backed up and never restored. And, you know, again, with hindsight, you look at these things and go, duh, but then everyone does that. No one wants to explore the what ifs like this. It feels like overkill until it's not. And so we see that over and over again. We get it with some of our bigger customers who've had, you know, wholesale problems, things that have hit the press. In the early days, I used to joke that I feel a bit like an ambulance chaser. I'm watching systems fail and I'm waiting for a phone call. And it doesn't happen immediately.
Because right in the moment, when you're learning and you're trying to patch things back together and you're trying to mitigate things, that's not when we get called. We get called when people go, okay, that can't happen again. Or if it does happen again, we need to be better at this. And so that's always the moment when people find us interesting. Other big stories, well, I know of at least one large company that was very skeptical of chaos engineering, understandably. I mean, pretty much if you'd said to them, you want to do chaos engineering, their instinctive reaction was, no, we've had chaos before. We never want it again. Why would we buy this thing? But one of their people was, I guess, open minded enough and friendly enough to go, let's give this a shot. Let's see what happens. And so they started doing their first chaos engineering experiments at 9am. And by midday, they'd found and proven several very large weaknesses in their system, without causing those weaknesses to fail catastrophically. They found enough evidence of them to go, actually, this is for real. And that was enough for them to then go, you know what, we're not as prepared as we thought we were, even from a robustness perspective. And so that was always great to hear. I mean, it's a quote I've begged them to let us use, and they won't let us quote it on their behalf, but they said, yeah, tell people it happened. Absolutely. Because it's a great story and it did happen, it was a crazy conversation in Slack. But, yeah, that was a massive company that found several weaknesses in a few hours. I'm not saying everyone can do that. I'm not saying that's necessarily the goal. What is interesting is that you only have to do a few small things, practice a few small things, and it will make you far better than you were without them.
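
The backup that had never been restored, from the single VM story above, is the kind of small thing that can be practiced. As a hedged sketch only: the snippet below uses Python's standard `sqlite3` module and in-memory databases as a stand-in for a real database, and the function names and the `orders` table are hypothetical, chosen just for this illustration. A real restore drill would target your own database and backup tooling.

```python
import sqlite3


def take_backup(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the live database into a separate in-memory backup."""
    backup = sqlite3.connect(":memory:")
    source.backup(backup)  # sqlite3.Connection.backup, Python 3.7+
    return backup


def restore_drill(backup: sqlite3.Connection, table: str, expected_rows: int) -> bool:
    """Restore the backup into a fresh database and check that it is usable."""
    restored = sqlite3.connect(":memory:")
    backup.backup(restored)
    # Structural check: SQLite reports "ok" when no corruption is found.
    intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    # Content check: the table name is trusted input in this sketch.
    count = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    restored.close()
    return intact and count == expected_rows


# Toy "production" database (an assumption for illustration).
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
live.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",), ("c",)])
live.commit()

backup = take_backup(live)
drill_passed = restore_drill(backup, "orders", expected_rows=3)
print("restore drill passed:", drill_passed)
```

The point of the drill is the second function: a backup only counts once something has been restored from it and checked.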

Darin Pope 44:27
What is the biggest objection to doing chaos engineering within a company? Give me the top three.

Russ Miles 44:36
So the first objection, the top one, is, this is gonna cost us money, right? This is going to threaten our production system. This is going to mean my developers are not shipping more features. It's the same argument and rationale that people use against putting testing in place. You know, I don't have time to write tests, I've got to ship features. It's the same naivety, I think. Well, you ship nothing until you've proven it works. How would you prove it without tests? Oh, you're gonna have people do that? That's expensive, isn't it? So, you know, the old argument of you will go faster if you have tests. And it's been proven now, in works like Accelerate, by Nicole and Jez, which talks about how continuous delivery will speed you up, testing will speed you up. Doing these things makes you quicker, makes you consistently quicker. Well, it's the same thing when it comes to running your systems and needing them to be reliable. Very few companies say we don't need to be reliable. But then you say, okay, well, what are you doing about reliability? And they'll tell you all of the technical things they bought, and you say, right, so you don't have outages now? Yes, we do. Okay, of course we do. What are you doing about those? We're trying not to have them. And then you say, okay, well, that's a good thing. But maybe we could be good at them when they happen, because we have to look back historically and say they do happen. So that's actually number two: people often say, it never breaks, it won't break. We've done everything right. It won't break. And in those circumstances, you sort of sit there and go, okay, but it has broken, right? Almost everyone has relatively recent stories of when things didn't go right. Sometimes they're related to delivery. Sometimes they're related to execution.
What we're saying is, if you just do a little bit of this work on top, if you balance a little bit of your velocity work, as you see it, against the reliability work, then you will be doing this better and faster in the medium and longer run. Complaint number three, and I used to get this a lot, is, I don't want chaos. I don't want to engineer chaos. And so there's a definite complaint against the phrase. And I've spoken at length about this with other great people in the industry, and a lot of people are frustrated that it's misunderstood this way. And some people never hear someone complain like this, but I spoke to a lot of banks. And they'll tell you, they don't want chaos. No, no, no, we've had that. That was 2008. Do not want that again. Right now, they're probably saying we don't want 2020 again. So when you talk to organizations like that, who say we don't want to engineer chaos, why would we do this insane thing, you do have to reframe it to, well, do you want to practice reliability? Do you want to prove and improve how you're prepared for issues? Do you want to do that instead? It's not insurance, it's practice. Do you want to do that? I used to use a phrase, I haven't used it for a long time now, but the phrase was, you know, do you want to do continuous limited scope disaster recovery? Which is chaos engineering, but I just don't use the phrase. Those are the top three I've seen anyway.

Viktor Farcic 48:13
It could be worse. It could be like serverless. It cannot be more misunderstood than serverless right.

Russ Miles 48:20
Don't get me started. If there's one thing we could stop doing in this industry, it's naming things. We're no good at it. We're terrible at naming things. I mean, NoSQL. We had a beautiful renaissance in data persistence and access technologies, and we named them by what they're not, some of the time. I mean, that should have got us barred from the naming groups immediately, right? Microservices, don't get me started on that one. I train people in microservice based development, but I don't really call it that. I say to them, look, we're gonna do microservices. But first, day one, question one: what is a microservice? You tell me, because it doesn't matter. We get there quite quickly. What matters is what can you do with your system that you couldn't do before? So yeah, I really do think we should stop naming things.

Darin Pope 49:15
And stop is probably a good word to use right now. Because we've gone on for a little bit and I think everybody's got meetings in 10 minutes. Thanks for hanging out today, guys. It feels like we're just getting started at roughly 50 minutes and we'll have to continue this. Maybe you can come on the live stream one of these days and live on the edge and show what true chaos is.

Russ Miles 49:43
Sounds good to me.

Darin Pope 49:45
If you're listening via Apple podcasts, please subscribe and leave a rating and review. All of our contact information including Twitter and LinkedIn can be found at https://www.devopsparadox.com/contact. And if you'd like to be notified by email when a new episode is released, you can sign up at https://www.devopsparadox.com. The signup form is at the top of every page. There are links to the Slack workspace, the Voxer account and how to leave a review in the description of this episode. We'll also have the link to that post you made yesterday, article, whatever you want to call it. Was it an article or a post? Talk about naming problems. And then we'll have both your contact information. I've got your emails right now, but if you want other things, go ahead and send those over as well. If you want to learn more about what's available, if you want to test it out, if you want to take the course that Viktor and I did or get the book, go to https://www.devopstoolkitseries.com/. That's the best place because it's got all the links for all the things we'll put up. We'll make sure there's a link down below. In fact, there's always a link for that because, hey, we're always selling. And we'll also have a link to chaostoolkit.org, which is where ChaosToolkit lives as an open source project. And then also a link over to the company, ChaosIQ. Is that the company name? I'm looking at the domain, but I'm assuming that's

Russ Miles 51:08
Yeah, that's correct. That's the company.

Darin Pope 51:10
So it's chaosiq.io. You're a cool kid because you have .io not .com. Were you just not willing to pay the money that probably somebody had for the .com?

Russ Miles 51:21
I don't think that's entirely wrong.

Darin Pope 51:23
Okay. Well, that's fine. That's what it is. You know,

Russ Miles 51:26
we wanted to spend money on the product.

Darin Pope 51:30
That would be another episode. Let's just call it that way. Again, let's see. Russ, you've talked a lot. I love you, even in this short time, but let's let Sylvain close it out for you guys. Sorry, man. I just sort of threw you to the wolves right now.

Sylvain Hellegouarch 51:52
So, exactly. The only thing I can probably say is just keep safe.

Darin Pope 51:56
Keep safe. It's a good idea. Now, by the time you're listening to this, it's going to be May 6. That's the release date for this episode. Hopefully things are starting to ease. In fact, we're recording this on April 24. And I just saw that Spain (Viktor, you probably haven't seen this, or you may have) is going to start easing the restrictions to let children do things. Now help me understand. Talk about chaos. Okay. The restrictions have been lifted on the children but not the adults.

Viktor Farcic 52:26
So let me clarify

Darin Pope 52:28
It's like Lord of the Flies, right? I mean, it's,

Viktor Farcic 52:31
Yeah, so let me clarify. Yes, children will be able to go out but they will be accompanied by adults.

Sylvain Hellegouarch 52:38
Reminds me of that Stephen King novel, I think Children of the Corn or something like that.

Darin Pope 52:44
Lord of the Flies. All that. There's nothing that can go right with that. So anyway, that's it for today. Check out all the links below in the episode. Be on the watch, because we'll try to get these guys back on a live stream. Maybe we can see a live demo of the pay me things, right? I'm sure you guys will be happy to show. Give me money, right? I do want to call out a couple of phrases that I heard today. So this is sort of like the tags, if nothing else, right? I heard continuous verification and resilience engineering; those two sort of go together. Let me know everything known and unknown. That's always a good consulting phrase. I have a much longer joke. And then the other one was sort of at the very beginning, that Sylvain said, and this is how we'll end it today. Not a house of cards, a castle of cards. If you believe your systems can't fail, think about that last story that Russ just told. One VM. Up for three years. No problem. Comes back in the next day. What did you do?

Viktor Farcic 53:56
I still don't believe that he was not involved somehow, or he would never admit it in public. I don't believe it.

Darin Pope 54:05
Thanks again for listening to episode number 54 of DevOps Paradox.

Show Notes

Links from the episode

Guests

Russ Miles

Sylvain Hellegouarch

Hosts

Darin Pope

Viktor Farcic

Links

Rate, Review, & Subscribe on Apple Podcasts

Signup to receive an email when new content is released