DOP 67: Orchestrating Chaos on Kubernetes using LitmusChaos

Transcript

Uma Mukkara 0:00
It all started with SREs trying to think chaos is a first principle, right? It's a chaos first principle, right? I'm going to start operating a particular application and then chaos should be in the first in terms of you know my preparedness.

Darin Pope 0:20
This is DevOps Paradox episode number 67. Orchestrating Chaos on Kubernetes using LitmusChaos.

Darin Pope 0:31
Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about. Most of the time, we mask our ignorance by putting the word DevOps everywhere we can, and mix it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make it sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often, since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. PS: it's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic.

Darin Pope 1:23
Now, if you've been listening over the past few weeks, we're going to step away from serverless for this episode, and maybe for a couple others. In fact, I know we are based on the calendar I just looked at. Don't worry, we'll come back to serverless. Right, Viktor?

Viktor Farcic 1:42
Oh, yeah.

Darin Pope 1:43
Okay. But today, we're venturing back into chaos. When is our show not chaos? But no, we're talking about the legitimate chaos. Today, we have Uma from MayaData. Did I say that right?

Uma Mukkara 2:04
Yes, Darin, you did.

Darin Pope 2:06
Okay, good. All right, just making sure. He's the co-founder and COO at MayaData. Uma, I'm not even gonna try your last name. Nor am I gonna try your full first name. So why don't you introduce yourself real quick?

Uma Mukkara 2:22
Sure. Thanks, Darin and Viktor. My name is Uma Mukkara. I'm a co-founder and COO at MayaData. We started MayaData almost four years ago to solve problems for SREs around stateful applications and DevOps in general. Today I'm going to talk a little more about Litmus, the chaos engineering project from MayaData, which we recently donated to CNCF as a sandbox project.

Darin Pope 2:57
So it was accepted as a sandbox.

Uma Mukkara 3:00
That's correct.

Darin Pope 3:03
Okay, I've got two things. You can do them back to back, or Viktor may interrupt. I don't know. Why don't you go ahead and explain what Litmus is, because we've talked about Chaos Toolkit. We've talked about PowerfulSeal. One, what is LitmusChaos? What is its role in the whole ecosystem? And I'm also curious, what was the process like for donating it to CNCF?

Uma Mukkara 3:32
Sure, I can cover both topics. So first, LitmusChaos. Why LitmusChaos and why we started it probably gives you the answer, right? LitmusChaos is an end-to-end framework for doing chaos engineering in a Kubernetes-native way. Let me explain what the Kubernetes-native way is and why we call it cloud native chaos engineering, right? Everything about Kubernetes is about being cloud native, right? It has to be completely helpful in automating whatever either the developer or the SRE does, right? So when we were developing the automation framework, or the test framework, for our earlier project, OpenEBS, almost two and a half years ago, we tried to build chaos engineering into our operations and also into our OpenEBS CI/CD. We looked for chaos tools that could be used on Kubernetes. It was not very surprising, because Kubernetes was still early, that we did not find a native chaos toolset that we could use, and so we started writing our own. At the same time, the concepts of custom resources, operators, and lifecycle management of applications were all coming up. So it was the right time, and we said, look, Kubernetes is getting adopted more and more. And the thing that helps speed up this adoption, and also makes sure that the users, the SREs, are comfortable with whatever they've deployed, is a good toolset to practice chaos engineering, right? So that was the intention of starting Litmus almost two years ago, and today it is completely declarative in nature. It's got the set of tools that you require to automate chaos in a Kubernetes-native way, which means that you can pick up a chaos experiment, set a few variables in your YAML file, and off you go. The chaos operator picks it up, does what it's supposed to do, and gives out the results in custom resource form.
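Editor's sketch: the declarative flow Uma describes (declare an experiment, let an operator pick it up, read the verdict back as another resource) can be modeled in a few lines of Python. This is a toy simulation, not Litmus code. The resource kinds, field names, and the `reconcile` function are illustrative assumptions, and no Kubernetes API is called.

```python
# Hypothetical, simplified stand-in for a declarative chaos resource.
# In a real cluster this would be a YAML manifest applied with kubectl;
# every field name here is an assumption made for illustration.
chaos_engine = {
    "kind": "ChaosEngine",
    "metadata": {"name": "nginx-chaos", "namespace": "default"},
    "spec": {
        "appinfo": {"applabel": "app=nginx"},
        "experiments": [
            {"name": "pod-delete", "env": {"TOTAL_CHAOS_DURATION": "30"}},
        ],
    },
}


def reconcile(engine, run_experiment):
    """Toy operator loop: run each declared experiment and
    return one result resource per experiment."""
    results = []
    for exp in engine["spec"]["experiments"]:
        passed = run_experiment(exp["name"], exp.get("env", {}))
        results.append({
            "kind": "ChaosResult",
            "metadata": {"name": f'{engine["metadata"]["name"]}-{exp["name"]}'},
            "status": {"verdict": "Pass" if passed else "Fail"},
        })
    return results


# Stand-in runner that always reports success; a real runner would
# inject the fault and evaluate the exit checks.
results = reconcile(chaos_engine, lambda name, env: True)
for r in results:
    print(r["metadata"]["name"], r["status"]["verdict"])
```

The point of the shape is that the user only edits data (the experiment name and a few variables), while the behavior lives in the operator.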
So if you are an SRE or a developer on Kubernetes, and you know how to develop or operate things there, you don't need to learn anything new to manage chaos. You manage Litmus just the same way as you manage other applications. So that's what we started Litmus with. And also, in the initial days itself, we realized: look, I know about my own set of applications, at that time OpenEBS, and I can write chaos experiments for them. But in our operations, we were using a lot of other microservices in the cloud native environment itself: a lot of databases, the nginx proxy, Kubernetes itself. These were totally new to us in terms of operating them in a proper way. So how do we chaos test them, right? I didn't have chaos experiments for those applications. So what we started doing is putting all these chaos experiments together in a central place, and we call it Chaos Hub. It's called Litmus Chaos Hub, and it's like Helm Hub, right? Developers there have their images and artifacts pushed into a central place. Similarly, if you're doing chaos engineering on Kubernetes, you need to have experiments. You shouldn't need to spend the first 10 weeks developing these experiments even before you start practicing chaos. It should be easy enough for anybody on Kubernetes. So with that intent, we started Chaos Hub. Right now it's got about 29 experiments. We've got good traffic; the experiments are being pulled and being run. Yeah, it's a good story. That's where we are on LitmusChaos, and hopefully that answered your first question. On to your second question, on sandbox. There was a discussion going on about the sandbox process for almost three to four months, or maybe a little more than that. Fortunately for us, we were a late entrant into that process. It was around the time when they decided, hey, this is the new process, and it's much, much simpler. So our entry into sandbox was really seamless.
The special interest group was very, very helpful and made things very clear. They actually put things out on an Excel sheet: these are the 10 conditions, or eight conditions, that you need to satisfy now to get into sandbox. We did satisfy all those conditions, and it was easy enough. We did wait for about three months, but that was purely about getting the process standardized. If we were getting in now, it could take much less time than that. But we are very happy with how the CNCF, the TOC, and the SIG have dealt with this project so far.

Darin Pope 9:32
What is the rest of the process? How do you get graduated?

Uma Mukkara 9:36
Yes, that's a good question. That process is also very clear. Right now, it's about finding a home that is vendor neutral in terms of governance, right? I mean, vendors are still going to put in the resources to develop, but it is about governance of the project, right? We have Chaos Hub and the chaos experiments, as well as the infrastructure to build, manage, and monitor chaos. All that is good. The most important thing for us right now is to go to the community and tell them that this is a project adhering to vendor-neutral governance, so you can bet on this Chaos Hub and chaos infrastructure and adopt them. Also, we should get more contributors coming in and contributing chaos experiments as well as contributing to the core project itself. Actually, on that front, there's good news. We already have maintainers from Intuit, Amazon, and RingCentral. One maintainer recently joined Amazon, so we can say that we have a maintainer from Amazon. It's good to already have four different maintainers from four different companies. And we are in sandbox, so it's about real adoption right now. We need to get another 10 or 12 different companies using Litmus in production, and then we are good to go for graduation. From there, CNCF does due diligence in terms of talking to the end users about how they're using it. Is it being governed neutrally? Are they doing monthly calls in an open way? Is the roadmap being discussed and prioritized as per the community's wishes and not by one vendor? So this is a kind of proven process, and then they open it up for public comments. If all is good, then you are going to get into incubation. Then for graduation, it's more adoption and more people banking on it. It's naturally accepted as one of the projects for chaos engineering. And it may not be the project, right?
For example, Envoy Proxy has equivalent alternatives at the graduation level as well, or at the incubation level. So you can have more choices, but proof is needed in terms of being accepted as a good, well-adopted project, and then we are good. And I'm pretty sure that, with chaos engineering being a need for almost every Kubernetes deployment, we're going to get good adoption, and being part of sandbox helps to start the journey on a very positive path.

Viktor Farcic 12:50
So there is one thing that I'm kind of curious about, because I feel that nobody is really tackling it or solving it. All the tools that I've seen use more or less similar logic: there is some state that you define and check before, you perform some actions, and then you check the state after, right? I believe that you call it entry and exit criteria or something like that, right? Now, what is complicated for me is that the exit criteria are usually not that simple, right? Let's think of an example. Let's say that you destroy a pod, right? You would create an experiment that says the pod is running, then you do some action like destroy the pod, and then you have some exit criteria: the pod is running. It was recreated by Kubernetes or whatnot, right? Now, what I feel we are missing in general, not with Litmus but in general, is that systems are not that simple. If I destroy a pod, and I'm using a simple example right now, it is likely that what I will actually experience is that some other part of the system broke somewhere outside of what I'm directly observing. And I'm curious if you have thoughts on that.

Uma Mukkara 14:27
Yes.

Viktor Farcic 14:28
Is that something that we are looking into observing, maybe connecting with other tools or whatnot?

Uma Mukkara 14:36
Yes, thank you, Viktor, for asking a very important question. I was actually going to speak about it as one of the differentiators of Litmus, but let me speak a little bit about the entry and exit criteria and how architecturally we started attacking that issue. It's not completely solved. If I could say, hey, I solved that problem, then, you know, we would be the most famous one and probably adopted by thousands of customers already, right? So it's not completely solved, but we have the architecture for it. Let me explain what I mean by that. As you rightly said, when you kill a pod, or a pod is lost, it's probably because of something else, right? The resources are not there and the pod is evicted, or there is a software issue and it crashed, or there is a memory issue, or there is some other chaos, right? So that is the entry side: what exact chaos do you introduce? Is it enough to have pod kill, container kill, network loss, network corruption, disk loss, disk fail? All these we call generic experiments, right? Is that enough? Our answer is no. That is not enough, because applications are totally different. These are generic resources, and these faults are usually the end result of some other issue, right? So you need to actually introduce the fault that would otherwise have originated in a given application. Let me give you an example. I can take OpenEBS, right? An OpenEBS fault could be some problem in the iSCSI target, right? Can you introduce that fault? The developers of OpenEBS, or the developers of MySQL, know what could go wrong in their code, so they can start writing that experiment: under these conditions, this fault can happen, so now I know how to simulate that condition, right? That's the reason why we have created application-specific chaos, right? And that's the reason why Chaos Hub is there.
It's not about the 10 to 12 experiments that everyone else is talking about. It is about how to generate application-specific experiments, and no one can write them all, right? Each vendor or each developer knows their application best. If it is an open application, and if it is in the cloud native environment, you can write that negative test case as chaos and then push it up to the hub. So at least you're covering how to inject that fault specifically, right? That's on the entry conditions. Now, on the exit conditions, it's an even more difficult problem. Once you have injected a specific chaos, how do you know that everything is working fine? How do you know what to check, right? That's where we have given a kind of pluggable interface to the experiments in the YAML file. You can go and define, hey, this is what we think is supposed to be checked, right? When a specific chaos takes MongoDB down, the exit conditions cover what has to be checked to see whether MongoDB is behaving properly or not, but they do not cover the exit condition of an application that is using MongoDB, right? You can write that exit condition in a declarative way, right? You can write a script and attach it: here are the regular exit conditions, and let me add another exit condition because my application is using MongoDB, right? That's how we are covering exit conditions as well. That framework is there, and we have a small set of applications. We have developed OpenEBS as an example of infrastructure that can use better exit conditions. But as Litmus gets adopted more and more, people will write what they think are better exit conditions, and they can upstream those conditions to the hub so that they get used by others. So that's our approach, Viktor. Good question.
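Editor's sketch: the entry check, fault injection, and pluggable exit checks discussed above can be illustrated with a small, self-contained Python model. This is not the Litmus interface. The `Experiment` class, the check functions, and the simulated `state` dictionary are all hypothetical, chosen to show how an application-specific exit condition can catch a failure that the generic check misses.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Check = Callable[[], bool]


@dataclass
class Experiment:
    """Hypothetical experiment: entry checks, one fault, pluggable exit checks."""
    name: str
    entry_checks: List[Check]
    inject: Callable[[], None]
    exit_checks: List[Check] = field(default_factory=list)

    def run(self) -> str:
        # Entry criteria: only inject chaos into a system in steady state.
        if not all(check() for check in self.entry_checks):
            return "Skipped"
        self.inject()
        # Exit criteria: every attached check must hold after the fault.
        return "Pass" if all(check() for check in self.exit_checks) else "Fail"


# Simulated system state; a real check would query the cluster instead.
state = {"mongodb_up": True, "app_healthy": True}


def kill_mongodb_pod() -> None:
    # Simulated fault: the pod is recreated, so the database comes back,
    # but the dependent application is left in a broken state.
    state["mongodb_up"] = True
    state["app_healthy"] = False


exp = Experiment(
    name="mongodb-pod-kill",
    entry_checks=[lambda: state["mongodb_up"]],
    inject=kill_mongodb_pod,
    exit_checks=[lambda: state["mongodb_up"]],  # generic exit check only
)
generic_verdict = exp.run()

# Plug in an application-specific exit check and run again.
state["app_healthy"] = True
exp.exit_checks.append(lambda: state["app_healthy"])
extended_verdict = exp.run()

print(generic_verdict, extended_verdict)
```

With only the generic check the run passes, because the database itself recovers. Adding the application-level check turns the same fault into a failure, which is the gap Viktor raises.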

Viktor Farcic 18:58
I do agree up to a point, to be honest. What bothers me more is that when we talk about exit conditions, we're still mostly talking about exit conditions that I predict, right? I'm much more interested in things that we cannot predict easily. What I would really like to see, or think would be very useful, is if tools like that could hook into others, let's say Prometheus or maybe Datadog or something like that, and ask: okay, was there an anomaly in the system that is not something I predicted? Not an anomaly in this application, this pod, this database, but did this experiment create some anomaly that you can detect with those other tools, whatever they are, right? Tools specialized in monitoring the system rather than a specific activity, like Datadog or Prometheus, or, you know, there are many others in the market. I think that would be, you know, the cherry on top of the cake.

Uma Mukkara 20:12
Yeah, no, I totally agree. Metrics are the crux of observation, right? It's all about introducing chaos and then being able to tell through observation whether things are okay or not, right? It also comes with a lot of training on the data. So we are hoping that at some point, when we have enough data coming in, there could be possibilities for machine learning on the metrics, based on what we've been observing. I think you're right. For example, on a well-working system, I have observed these Prometheus metrics to fall within certain criteria. Now on another system, I'm not seeing them under the same conditions, so there is a possibility that there is an anomaly, right? It will take some time to get there, Viktor, and it's a great topic. Chaos engineering is a science, and given the number of applications that we are seeing and the failures that can happen, it's a huge area. So we first covered the basic infrastructure. We are gathering the community together, and we want the community to give us more ideas. With more participation, I'm sure there will be more ideas to tackle this.
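Editor's sketch: the idea of comparing metrics from a well-working system against metrics seen after chaos can be shown with a very simple statistical check. This is a stand-in for the machine learning Uma mentions, not something the speakers describe building. The function name, the threshold, and the latency samples are made up, and the samples would in practice come from a monitoring system such as Prometheus rather than from literals.

```python
from statistics import mean, stdev


def is_anomalous(baseline, observed, threshold=3.0):
    """Flag the observed samples if any of them lies more than
    `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all counts.
        return any(x != mu for x in observed)
    return any(abs(x - mu) / sigma > threshold for x in observed)


# Made-up request latencies in milliseconds.
baseline = [100, 102, 98, 101, 99]   # steady state before chaos
calm = [100, 101]                    # after chaos, system unaffected
spike = [100, 140]                   # after chaos, something else broke

print(is_anomalous(baseline, calm))
print(is_anomalous(baseline, spike))
```

A check like this does not need to know which component broke. It only asks whether the system as a whole still looks like its baseline, which is closer to the unpredicted failures Viktor is asking about.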

Darin Pope 21:40
And what do you think? Maybe you disagree with me on this one, but I feel that chaos engineering is, maybe unjustly, still not mainstream. It's not, you know, the big topic. Actually, let me rephrase that. The way I usually see it is that something becomes mainstream when it sparks the interest of, you know, the big players. We can say that Kubernetes really became mainstream when Google and Microsoft, Azure, sorry, and AWS took sufficient interest in it and then created their own services. What do you think about that? When do you predict chaos will become a big deal, if ever? Or maybe it already is.

Uma Mukkara 22:42
It is to some extent, but it's not super popular. The good news is that there is a lot of positive advocacy coming from Amazon. You can see a lot of people from AWS who are good speakers on chaos engineering, right? Adrian Hornsby or Adrian Cockcroft, who are very active members of CNCF as well, right? And, I mean, the CNCF SIG App Delivery itself unanimously voted for LitmusChaos, primarily because chaos engineering is a need in the process of application delivery, right? So there is good acceptance of the need for chaos engineering, and this is how it all starts, right? Now people will adopt it more. After we announced Litmus being part of sandbox, there have been good pings coming from EKS customers on AWS. That means people are looking for it. And the moment there is an openly governed project, right, that's what cloud nativity is about nowadays, there will be adoption, right? We honestly felt that with a more generic, Kubernetes-native way of doing things, the moment you put it out there, adoption starts slowly and then it could just go viral. And I'm pretty sure that in about a year, one or two vendors will start serving chaos as a service inside Kubernetes. At MayaData, we have already been talking to some large system vendors and managed service providers who are going to include chaos as part of the managed service in their Kubernetes services. So it is happening. It's not mainstream yet, right? But it's only a matter of time. So it gives us a good opportunity to be part of that wave.

Darin Pope 24:47
So although you've got vendors that are looking at embedding, let's call it that, my word, not yours, it's still going to fall on developers. Although there is this library of different experiments that can be run, businesses are going to say, well, I need these other 15 different experiments, but our developers can't do it because they're too busy doing the more important things. How do you see the landscape of chaos experiment development starting to change over the next 12 months?

Uma Mukkara 25:42
12 months? I would actually say that's a short time. I would probably say 24 months is a better horizon to take a shot at, because 12 months you cannot predict. You know, it's a general prediction, right? So again, great question, Darin. People are afraid of introducing chaos, right? It all started with SREs trying to think of chaos as a first principle, right? It's a chaos-first principle, right? I'm going to start operating a particular application, and chaos should come first in terms of, you know, my preparedness, right? And SREs should think, or will think, that my developers should be afraid that I might introduce chaos. It's not just that a fault is going to happen, but that SREs are going to induce faults, right? So that is another theme, or a general phenomenon, that is happening. We've been talking to many SREs, and they're saying: yeah, I'm pretty convinced, my SRE team is pretty convinced, that chaos is needed. Right now, we are getting buy-in from the management on whether we can introduce chaos not only into staging but into production. And most of the resistance comes from the development teams, right, when you talk to the management. So it is going to happen, and it is a slow process. As you get chaos adopted into the staging environments, the pre-production environments, the developers get an idea of the real benefits of chaos engineering: hey, you found an issue even before you got into production, so thank you very much. And it's not my CI scripts that found the issue in my code, but rather the chaos engineering managed by the operations guys, right? So that is going to happen. And on the time required for developing these chaos experiments: it's mostly the SRE team that needs to develop new chaos experiments, not the developers themselves, at least to begin with. And that's the idea of providing these experiments. You don't need to develop 50. You've got 25 or 30 of them already available, and you can start.
And then over a period of a few months, you can add more experiments for your stack, right? It is a process. And the good news is, I mean, that's where we have invested more, because every time we talked to OpenEBS admins or SREs, they were asking: how can I do failure testing on your code and make sure that you're very resilient, right? So that's when we said, okay, we need to practice it. They say: show me the chaos engineering that you've been using in OpenEBS before I deploy your stuff, right? So these are the questions that are coming from SREs. And I'm very hopeful that it is going to go mainstream the moment Kubernetes distributions start adopting chaos as one of the requirements, right? It's a matter of time. If one distribution says, hey, we deploy, and we also provide chaos engineering as an added service in your deployments, right? Just like everybody embraced Helm, operator deployment, and operator lifecycle management as services of a Kubernetes service or distribution, chaos is very much in line with that, and then people will fall in line. And hopefully my prediction is right.

Darin Pope 27:10
So basically, you're saying, you're wanting LitmusChaos to be the next CoreDNS. You can't have a cluster unless you have LitmusChaos running.

Uma Mukkara 29:45
Yes, by the time it gets to graduated. Hopefully, that's where we would be.

Darin Pope 29:54
And to go back to the beginning, you wrote LitmusChaos in order to solve your testing for OpenEBS. So even though we're not going to talk about OpenEBS, how has having that trust, because of the chaos tests you have against OpenEBS, helped you in, I'll say it, pitching OpenEBS to people? How's that level of confidence, instead of saying, hey, I've got this great way of doing StatefulSets, but, you know, sometimes we have some weirdness happen? How did you solve that?

Uma Mukkara 30:42
Yeah, it's about being transparent and being open, right? I mean, you can tell people 100 different things, but until they see you doing it, the community will always have a doubt. Is this a marketing page or a sales page, or are they really doing it, right? So we started breaking OpenEBS through Litmus, through chaos experiments. And we actually put the entire process of our pipelines, the breaking of code inside the pipelines, in the open as well. We call it openebs.ci, and we told the community: this is how, inside the pipelines and in long-running test cycles, we're using Litmus to introduce different negative scenarios inside OpenEBS pods, and we're still able to get outcomes showing that a bug is not there or that things are configured properly, right? So that is well accepted, I would say, and it has served as a good proof point as well. And also for us, right, it's not only the openebs.ci platform, which is an open one. We also have another SaaS platform, which we now call Kubera, and in its operations we use OpenEBS as the stateful storage. So we have a big SaaS platform where OpenEBS is being used, and we use Litmus to break things underneath, right? Once or twice, it has taken a hit. The SaaS platform went down primarily because there was chaos underneath, right? So we're not afraid to do these kinds of experiments. And it's only a matter of time until things get better and better, and Litmus is definitely helping achieve that goal of resilience.

Darin Pope 32:45
So for your SaaS, you are running, running LitmusChaos in production.

Uma Mukkara 32:52
Yes, that's how it started. It's almost third year now.

Darin Pope 32:57
So you went out of the gate with LitmusChaos running in production.

Uma Mukkara 33:03
We started with that, right? I mean, OpenEBS is another project, and we want to use it ourselves, right? We are the guys behind it. We started writing chaos tests, and Litmus was actually born after that. So the first tests were actually put into production. Then we felt, hey, we could actually put it out and make it a project. We saw an opportunity to build some kind of business around it, and the first thing is to build the technology in the open, right? So it all started from that.

Darin Pope 33:36
You're a brave man. But you had to do it.

Uma Mukkara 33:41
Yeah

Darin Pope 33:41
Right. There's no way you could tell other people they need to do it if you're not going to do it yourself. Viktor, do you have anything else for Uma today?

Viktor Farcic 33:54
No, I think I'm good.

Darin Pope 33:58
You think you're good?

Viktor Farcic 34:00
I cannot know for sure whether I'm good. It's a mystery. I will most likely come up with something 15 seconds after we close, like, why didn't I... That's life.

Uma Mukkara 34:15
Well, if you have a Kubernetes cluster, try it out. Induce some chaos and see how it goes. I mean, it's pretty easy to induce chaos, at least the first experiments. There is more complex chaos possible, but the first ones are very easy.

Darin Pope 34:32
Actually, I do have a question. Is the goal of Litmus to run on Kubernetes or to target Kubernetes?

Uma Mukkara 34:43
Both. There are two goals. One is to find weaknesses in the Kubernetes platform itself. It's not about the code inside Kubernetes but about how you configure Kubernetes, right? Kubernetes can scale up to hundreds of nodes easily in real production, so have you properly configured all the services that keep Kubernetes running, right? We call that Kubernetes platform chaos. And then come the applications. There are thousands of applications that run on Kubernetes, and there will be different SREs that manage them. For example, take Kubernetes as a service, right? Amazon EKS. Amazon could use one type of chaos to make sure that their service runs fine, and the users of EKS will also need chaos to make sure that their application is configured properly and that the underlying support system is behaving the way they expect, right? So killing a node, that's good, but killing a kubelet service or killing etcd, one of the pods inside etcd, has to be done by the Kubernetes service providers so that, you know, in the eventuality of a fault happening in EKS or GKE, Kubernetes continues to run fine, right? So there are various levels of chaos.

Darin Pope 36:08
Yeah, but you're not targeting things outside Kubernetes, like, I don't know, messing directly with the storage rather than with persistent volumes, right? We are within the sphere of Kubernetes?

Uma Mukkara 36:22
Take the Kubernetes hardware, let's say, as the example. Inside the generic experiments, there is something we call infrastructure experiments. They can go and kill a physical node, take out an EBS volume, or induce network latency in a physical network, right? That is a layer below Kubernetes itself, right? Yeah, it is there, and we have some examples out there already.

Viktor Farcic 36:52
Nice.

Darin Pope 36:55
Okay. Well Uma, thanks for hanging out with us today. For everybody listening, all of Uma's contact information will be down in the show notes and also specifically a link off to litmuschaos.io. See, .io. We know they must be legit because they are a .io. Right?

Uma Mukkara 37:23
Yeah, we are legit. We are a CNCF project now. Yes.

Darin Pope 37:26
Yeah, you're double super legit because you're a .io and you're a CNCF project.

Uma Mukkara 37:35
Well, thanks, Darin and Viktor. It's nice to be here and looking forward to growing the community in this chaos engineering space. Thank you very much.

Darin Pope 37:47
We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and the link to the Slack workspace are at https://www.devopsparadox.com/contact. If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at https://www.devopsparadox.com/ to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.

Show Notes

Links from the episode

Guests

Umasankar Mukkara

Hosts

Darin Pope

Viktor Farcic
