DOP 98: Kubernetes Troubleshooting Simplified With Komodor

Posted on Wednesday, Mar 10, 2021

Show Notes

#98: How many times have you been put into the situation to debug a production issue and you have no idea where to start? Probably more than you can count. Worse yet, your employer expects that you can troubleshoot the issue without having access to all the tools that you need. Today we speak with Itiel Shwartz, CTO and co-founder of Komodor, a startup developing the next-gen troubleshooting platform for Kubernetes.

Guests

Itiel Shwartz

Itiel is the CTO and co-founder of Komodor, a startup developing the next-gen troubleshooting platform for Kubernetes. Ex-eBay, Forter, and Rookout. He is a DevOps expert and technical leader who loves learning new technologies, a k8s superfan, and constantly tries to push the limits of R&D velocity, speed, and confidence.

Hosts

Darin Pope

Darin Pope is a developer advocate for CloudBees.

Viktor Farcic

Viktor Farcic is a member of the Google Developer Experts and Docker Captains groups, and a published author.

His big passions are DevOps, Containers, Kubernetes, Microservices, Continuous Integration, Delivery and Deployment (CI/CD) and Test-Driven Development (TDD).

He often speaks at community gatherings and conferences (latest can be found here).

He has published The DevOps Toolkit Series, DevOps Paradox and Test-Driven Java Development.

His random thoughts and tutorials can be found in his blog TechnologyConversations.com.

Rate, Review, & Subscribe on Apple Podcasts

If you like our podcast, please consider rating and reviewing our show! Click here, scroll to the bottom, tap to rate with five stars, and select “Write a Review.” Then be sure to let us know what you liked most about the episode!

Also, if you haven’t done so already, subscribe to the podcast. We're adding a bunch of bonus episodes to the feed and, if you’re not subscribed, there’s a good chance you’ll miss out. Subscribe now!

Transcript

Itiel: [00:00:00]
They expect the developer not to shake his head and say it's not my responsibility troubleshooting. I'm a developer. What do you want? They do expect it from the developer to step up, but in the end of the day while they do expect this from them, they don't really invest the time that they invest in the DevOps team or in the training or in making sure the tools are right.

Darin:
This is DevOps Paradox episode number 98. Kubernetes Troubleshooting Simplified With Komodor

Darin:
Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about. Most of the time, we mask our ignorance by putting the word DevOps everywhere we can, and mix it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make it sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often, since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. PS: it's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic.

Darin: [00:01:48]
Shifting left continues to be a very hot topic in many organizations. Agree or disagree, Viktor?

Viktor: [00:01:56]
Oh, absolutely. And it's also one of the big failures, I think in many organizations.

Darin: [00:02:03]
Big failure or huge failure?

Viktor: [00:02:05]
You know, CTO, CEOs hear about, Hey, shifting left. We're going to enable developers and then they somehow all think that developers will automagically understand everything that those on the right of them understand. Here's a Kubernetes cluster. It runs Istio. Something is not working. Go figure it out. That's to me so unrealistic that I'm completely in favor of shifting left. I'm completely in favor and I think that we should all be enabling developers, but enabling developers does not mean, or it shouldn't mean that developers now should possess all the years of experience of everybody else in the industry. I do not understand fully all the intricacies of Kubernetes and I spend a significant portion of my time working exclusively with Kubernetes. I cannot expect a Node.js developer to debug what is the problem with requests going through Istio and not reaching the third service in a chain. That's not going to happen, but I do want to enable developers at the same time. It's a tricky situation. How to enable developers without forcing them to spend years of learning something like Kubernetes. That's my issue. That's why I think it's failing in many cases.

Darin: [00:03:23]
And guess what? We have a guest on with us to talk about these specific problems. We have Itiel on from Komodor. Itiel, thanks for joining us today.

Itiel: [00:03:34]
Hey guys, pleasure to be here. I'm a huge fan of the podcast.

Darin: [00:03:38]
Are you a huge fan or a big fan?

Itiel: [00:03:40]
Hmm. Hmm, I'm not sure. Between, I think in between. Like I listen to most episodes, but not 100%. So, so I can't lie about it.

Darin: [00:03:50]
It's okay. We forgive you. Introduce yourself just a little bit more and talk a little bit about what Komodor is and the problem area that it's trying to help resolve.

Itiel: [00:04:04]
So a little bit about myself. I'm a developer, a DevOps, I'm not sure, like I'm the CTO. I'm not really great with titles, but I know infrastructure and I know code. In my previous life, I worked for eBay as an infrastructure engineer. Afterwards, I joined an Israeli startup named Forter. Then I was the first developer in an Israeli startup that helps developers debug, named Rookout. I was the first employee. About a year ago, I teamed up with Ben Ofiri, who is my friend from the university and worked at Google for the last seven years, and we created Komodor together. A little bit about Komodor and the problem we're trying to solve. Basically, we are focused on Kubernetes troubleshooting and visibility, and what we try to help both developers and DevOps with is, once they have an issue in the middle of the night or in the middle of the day, it doesn't really matter, to understand what has changed across their system and by whom. I'll try to give a very small example. You have a problem now. Let's say that Datadog is shouting at you that the CPU is high or the latency is high or there is an increased error rate. The first thing most developers or DevOps do is ask themselves, okay, what has changed across my system? Five minutes ago, everything was great. What changed? In order to answer this simple question, at the moment, the number one tool that we see across all of our customers and the people we talk with is going over to Slack and asking who changed the system? Who knows why it happened? Or maybe they go to Jenkins plus GitHub plus kubectl describe and try to figure it out themselves. What Komodor basically does is collect all of the changes that happen across your system, and a change can be a Kubernetes deploy. It can be a feature flag in LaunchDarkly that just changed, or it can be another resource that changed across the system. We correlate both the changes and the alerts that happen across the system to give our users a single pane of glass of their Kubernetes cluster status, meaning for each service, when did it change, what was the diff, what changed over Kubernetes, what changed over GitHub, what changed over LaunchDarkly, so they can easily troubleshoot.
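
As a rough illustration of the "what changed, and by whom?" question Itiel describes, the sketch below filters a list of change events down to the ones that touched the alerting service shortly before the alert fired. It is a minimal sketch under stated assumptions, not Komodor's implementation: the `Change` and `Alert` records, the `changes_before_alert` function, the 30-minute window, and all sample data are hypothetical names and values invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One recorded change (hypothetical schema for this sketch)."""
    service: str
    source: str   # e.g. "kubernetes", "github", "launchdarkly"
    author: str
    summary: str
    at: datetime


@dataclass(frozen=True)
class Alert:
    """One alert from a monitoring tool (hypothetical schema)."""
    service: str
    message: str
    at: datetime


def changes_before_alert(alert, changes, window=timedelta(minutes=30)):
    """Return changes to the alerting service made within `window`
    before the alert, newest first."""
    start = alert.at - window
    hits = [
        c for c in changes
        if c.service == alert.service and start <= c.at <= alert.at
    ]
    return sorted(hits, key=lambda c: c.at, reverse=True)


if __name__ == "__main__":
    fired = datetime(2021, 3, 10, 2, 0)
    history = [
        Change("checkout", "kubernetes", "dana", "image tag bumped", fired - timedelta(minutes=5)),
        Change("checkout", "launchdarkly", "lee", "feature flag enabled", fired - timedelta(minutes=20)),
        Change("search", "kubernetes", "ali", "replica count changed", fired - timedelta(minutes=2)),
    ]
    alert = Alert("checkout", "increased error rate", fired)
    for change in changes_before_alert(alert, history):
        print(change.at, change.source, change.author, change.summary)
```

Sorting newest first reflects the heuristic from the conversation: the most recent change to the affected service is usually the first suspect.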

Darin: [00:06:22]
The question is, and Viktor was leaning into it a bit, we cannot expect a Node.js developer to understand the intricacies of a Kubernetes cluster but at the same time they need to understand at least the basics. So, what are the problems that you're seeing because obviously you're building out a solution to help make that simpler, but what sort of drove that need in the market that doesn't, I still don't see why it really matters.

Itiel: [00:06:53]
I'll try to explain from my point of view and from the industry point of view. We are really focused on allowing this Node.js developer to develop faster. In order to deploy a new Express server, you are one YAML away from having it with load balancing and everything in between. We spend a lot of our time basically allowing him to move faster, giving him more superpowers, and empowering him to deploy faster, which is great. I am all in favor of CI/CD, but we somehow failed to invest a little more time in what is going to happen once your server is up and running on Kubernetes and it's going to handle some issue. It can be an issue because he wrote a bug in his code or it can be an issue because the Kubernetes cluster is having a hard time or any other thing. I think most of the time spent over the last couple of years is on allowing the dev team to move a lot faster. I saw a nice quote, like give a baby a tequila and a gun or something like that, and basically we allow him to hurt himself even faster than before. Then we see companies that have moved the one who is in charge of deploying. It was once the DevOps or the sysadmin or some ops guy and now everyone can deploy to production, which is great, but somehow when there is an issue in production, the one who is getting the call most of the time is the DevOps team, and that is because we as an industry spent years building the right tools to help the DevOps troubleshoot faster and more easily, while we didn't spend as much time on allowing the devs and less experienced guys or girls to troubleshoot. We focused really on one area of the shift left movement but somehow we forget that there is the Day 2 operation. My Node.js app is running. Now what? It's not even part of the training or the culture. A lot of companies just don't expect a developer to solve those issues alone.

Viktor: [00:08:51]
It's not even necessary to me, always or often for developers to solve the issues. I think it's more about understanding where the issue is. I can easily now call you, Hey, Itiel, can you help me with this? But I need to know that that's what the problem is. There is a networking issue. Excellent. Then I know who to call, but if I have no idea, it doesn't work.

Itiel: [00:09:17]
I can give a very nice quote from one of the directors of DevOps we talked with. He was like, all I want from the developers, and I'm not sure they can really troubleshoot most of the problems by themselves, is this: just let them know where in general the issue lies so they can talk with the relevant person and not with me. Let them know that there is a connectivity issue or a general Kubernetes problem or some AWS or Azure degradation. Give them enough hints so they can call up the right guy and it will save me so much time. A modern DevOps team is a lot of the time like a phone operator who just routes the phone calls or the alerts to the relevant people. This seems like an application issue. Let's call the app guy. Oh no, this seems like a database issue. Let's call the database guy or girl. Basically this is because we didn't invest a lot of time in helping the people who are in charge of troubleshooting to troubleshoot faster. I'm not talking about automatic root cause analysis. No magic or something like that, but basically giving more context to the person who is in charge of solving the issue.

Viktor: [00:10:23]
What you just said, who is in charge, is the key phrase that I believe many companies are not getting right. I believe in self-sufficient teams. This is a team in charge of this application. That team designs, plans, writes code, writes tests, deploys and is in charge of making sure that that application is running. Full responsibility. I cannot expect somebody else, somebody who is not writing that application, to be responsible for it. I think that that's silly even though the majority of companies do it like that. Now, being responsible for something, and I think that this is the part that is problematic, does not mean that you are alone. It does not mean that you cannot call me, you, Darin, somebody else in your organization and say okay, look, I have a problem. Can you help me solve it? That's perfectly okay. What you shouldn't be able to say is this is not my problem. I write the application but somebody else is running it in production. That's the issue that I believe is causing a lot of problems.

Itiel: [00:11:26]
I completely agree and I will say that a big part of the issue is that the DevOps, the SRE, no matter what you call it, like you Viktor, they expect the developer not to shake his head and say it's not my responsibility troubleshooting. I'm a developer. What do you want? They do expect the developer to step up, but at the end of the day, while they do expect this from them, they don't really invest the time that they invest in the DevOps team or in the training or in making sure the tools are right. They do expect him to take the ownership, but most companies, I won't say everywhere, don't invest the right amount of time in making sure the one that is picking up the page knows the basics of what they should do. We speak with companies where a new DevOps shadows the on-call for three months and afterwards he does it only in the daytime and then at nighttime, and he has the most loved dashboard on Datadog or on New Relic and he has everything in place, and then they are like yeah, but my developers, they just can't handle the issues, the alerts. It's a little bit funny. On one hand, you expect them to solve it and on the other hand you do so much abstraction that they don't really know what is happening, and then the only time that you expect them to know Kubernetes or Istio is when you have the issues. When the Node server stops responding or something like that. I wouldn't say it's a double-edged sword, but it is a little bit problematic from what I see.

Viktor: [00:13:01]
One of the problems there I think is that you cannot expect somebody to be responsible for something and then force the solution on that somebody. You cannot say Hey everybody needs to use Istio and now you go and develop and deploy your application and figure it out. No because maybe I end up choosing no service mesh or maybe I end up choosing Linkerd. If you don't give me that choice, then you cannot hold me responsible for it. It's silly. Another thing is that I have a strong beef whenever somebody mentions, this is not really aimed at you Itiel, but whenever somebody mentions DevOps team, my head translates that into, so your operator, sysadmins, whatever, got bored with their title and now they're a DevOps team. DevOps is not about renaming sysadmins, operators, into DevOps. It is about having development and operational knowledge within a team that is responsible for a service or an application or whatever that is and that means that those ex-operators should start developing services that are consumed by users, you always have a user, and also those application developers need to have operational knowledge to deploy the application. Everybody starts developing and operating within the same team.

Itiel: [00:14:24]
I can't agree more with you and I will say that in the strong teams we meet, they don't have a DevOps team. They have a visibility team or a monitoring team, the name differs, but the sole responsibility of this team is helping the other teams be much more effective when they troubleshoot. Meaning, we ask them if there's a page in the middle of the night, do you respond, and they're like no. Why should we respond? What we try to do is to build the best tools for the developers so they can respond to the page in the middle of the night. It's not my job. I didn't write the code. I didn't deploy it. Why should I handle the issue? I really love those kinds of teams, but you need to spend a lot of time. It's worth it, but you need to say I have four people that are not troubleshooting, they are just building tools for other people to empower them.

Viktor: [00:15:16]
It's in a way about understanding what your product is and who your users are. If I go back to your example of teams in charge of monitoring. Their job is to provide a service. Yes, you can monitor with what I'm giving you and if there is anything to troubleshoot that's whether my monitoring service that I'm giving to you is working. Is it working? Yes. That's my troubleshooting. Your data that you're pouring into my solution that I'm giving to you is giving you information about your application that you don't understand. That's not me. My service is a monitoring service. My service is not your application in that example.

Itiel: [00:15:54]
Exactly and it's so obvious to some teams, like it's not my responsibility. If I have a problem with the monitoring server, then yeah sure. Wake me up in the middle of the night. It's my team's responsibility. But if the other developers are having issues, why should I care, or maybe how can I make their life easier. You had an issue at 2:00 AM. You entered the tool that we developed for you. What was missing? How can I make you more productive? Do you need connectivity logs? Do you need better access to other tools? Do you need to work hand in hand with the people who are responsible, who are troubleshooting? I do see the elite teams are doing this, but in most of the industry it's more like the glorified sysadmins that are angry at the developers that are just deploying all the time and breaking their production. It's not your production. It's everyone's production. It belongs to the one who just deployed it.

Viktor: [00:16:47]
When you mentioned sysadmins, I think that that's misunderstanding who your users are. If you're a sysadmin, you're providing a service to developers. Developers are your users just like the users of developers are the end users, people who shop on your site. Your job is really to provide a better service to those developers. If you have an online shopping site, if I have a need to buy more, you're going to provide me the means to buy more. If I have a need to deploy faster, more often, more frequently, that's your job. It's not your job to prevent me from deploying because it doesn't match some objectives that you have that are not objectives aimed at serving me as a user in a way.

Itiel: [00:17:26]
I think as an industry, we do understand that making the developer deploy faster has become a standard. I do see a lot of companies who are putting the resources in helping the dev team develop faster. I think a couple of years ago it was not that obvious but now I think it is. I do think in a matter of years it will be obvious that all they need to do is to help the developers be more effective once they have issues or bugs or need to troubleshoot. Even internally, we're a small company. We're eight developers and as part of the R&D meetings, one of the things that I keep on asking is what is taking us a lot of time without us even realizing it? What part of the troubleshooting are we spending most of the time on and how can we improve it? I think if you go to most companies, it's not even a question because it's their problem. It's the developers' problem that they don't know how to troubleshoot. They don't know how to SSH into Kubernetes and do kubectl exec -it. It's their problem, so why should I care? In a lot of the cases, you have different people in charge who have different company KPIs. For some organizations, the DevOps team has system uptime as its KPI while the developers' KPI may be to develop more features faster. Once you have those KPIs, they keep on fighting with one another. The DevOps doesn't really want to spend his time on allowing the developers to be more productive because it's not his job. The developers don't want to troubleshoot because why should they troubleshoot? They're not getting paid extra for it. No one says, good job troubleshooting yesterday. No one really cares. In those kinds of organizations, it's super tricky. The culture and the change should not come from only one person. It should be an R&D decision.

Viktor: [00:19:14]
At the end of the day, I think that the problem is that KPIs are often not based on business but on technical things. How stable your system is or how frequently you deploy is irrelevant to business. What matters to the business is how fast do we deliver features to our users. That's actually a combination of those two things. It needs to work and it needs to get to the user. Either of those two KPIs is just silly.

Itiel: [00:19:40]
Yet they are the ones at least for a lot of the companies, those are the north star metrics for a lot of the developers we meet. Once you measure the wrong thing, you can get an excellent team which is super productive cause you're just optimizing the wrong thing.

Viktor: [00:19:56]
Hey put me in charge of infrastructure and I can guarantee a hundred percent uptime. Nobody will ever deploy anything.

Itiel: [00:20:03]
Exactly. It's the easiest solution and it's sad, but we do see it quite a lot. Sometimes we're talking with the DevOps or with the SRE and there are things that all developers can do but there are things that only he can do. Only he can deploy during these hours or only he can do a rollback. I met a lot of companies where the only guy that can roll back is the DevOps and I'm like why? How does it make sense? You let your developers deploy? Yeah, sure. We are fully CI/CD. But when they need a rollback they have to go to you? He's like yeah. I don't really trust them to know how to do a full rollback. So yeah, it's my job. Why? How does it make sense?

Viktor: [00:20:45]
That level of specialization is very problematic because I would guess that if I would go to that company right now and say hey, we can set up, let's say, and I'm inventing a use case, canary deployments that will monitor metrics in Prometheus or Datadog or whatever and, depending on the results and thresholds, roll back at some moment. That person is going to fight to the death for that not to happen because I'm just removing their power in a way. You can be so specialized that a change, any change, puts your imaginary position within a company in real danger.
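
The canary use case Viktor invents here boils down to a threshold check over observed metrics. The sketch below is only an illustration of that decision logic, under stated assumptions: the `CanaryThresholds` fields, their default values, and the `canary_decision` function are hypothetical, and in a real setup the samples would come from a metrics system such as Prometheus or Datadog rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryThresholds:
    """Limits the canary must stay under (hypothetical defaults)."""
    max_error_rate: float = 0.01        # fraction of failed requests
    max_p99_latency_ms: float = 500.0   # 99th percentile latency


def canary_decision(samples, thresholds=CanaryThresholds()):
    """Decide what to do with a canary release.

    `samples` is a list of (error_rate, p99_latency_ms) tuples, one per
    observation interval. Returns "hold" when there is no data yet,
    "rollback" as soon as any sample breaches a threshold, and
    "promote" when every sample stays within the limits.
    """
    if not samples:
        return "hold"
    for error_rate, p99_latency_ms in samples:
        if (error_rate > thresholds.max_error_rate
                or p99_latency_ms > thresholds.max_p99_latency_ms):
            return "rollback"
    return "promote"


if __name__ == "__main__":
    healthy = [(0.001, 120.0), (0.002, 130.0)]
    degraded = [(0.001, 120.0), (0.05, 130.0)]
    print(canary_decision(healthy))
    print(canary_decision(degraded))
```

Encoding the rollback rule as data plus a small function is what removes the need for one person to act as gatekeeper: anyone on the team can read, review, and change the thresholds.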

Itiel: [00:21:19]
Yeah, like super problematic. For some DevOps, they do have good intentions. If you ask them, I'm not sure all of them will strongly disagree. I think they lack the trust in other people who are not them and even if they have the trust, they don't see their job, again, as empowering. They see themselves as the sheriff of production and everything needs to go through them in one way or another. If not, then it's a problem because he's the sheriff. Why should the dev be in charge of it? He's not the sheriff. We do see it a lot. I do think the industry is going in the right direction with the shift left and empowering the developers who are writing most of the code and doing most of the work, but I think this area is still super tricky for companies.

Viktor: [00:22:05]
It's more about making a decision or, even better, it's about not being too clever, meaning most companies are behind the top 1%, the Googles of the world. You just need to see where should we go? There. Wherever Netflix is, that's where you go, and when you get there, then we try to figure out how we can be clever in a way, right?

Itiel: [00:22:31]
I'm just in the middle of writing a blog post. I just read the Netflix book about the culture, No Rules Rules, by the founder of Netflix. I finished the book, listening to it on Audible. Great book. I really understand the DevOps culture, the technical culture, that Netflix has. He explains how they recruit the top people in the industry. They are empowering them to the end. They fire their own people. It's so specific to Netflix and, to be honest, after I heard the book I understand the DevOps and the microsystem behind Netflix, the culture, a lot better. It was literally an eye-opening experience in better understanding the tech decisions just by listening to the cultural decisions. This is the reason it failed in most companies. Most people are not at the level of a senior Netflix engineer and even if they were at the same level, he keeps on talking in the book about the fact that you need to build a lot of internal tools to allow all of this empowerment. It can be on the ops side or the tech side or the marketing side, but he keeps on saying you must have the right tools in place or else it will fail. I think people just see the microservices of Netflix and they're like yeah, yeah. It's cool. It's empowering other people. It lets them move fast. They don't consider all of the advantages that Netflix and the Netflixes of the world have. Super strong people. A very strong internal system and a sense of excellence in everything. Most companies are not Netflix, sad to say.

Viktor: [00:24:02]
In cases of companies like Netflix and others, what is to me more interesting is actually understanding the journey that they had to get where they are. 99% of the companies cannot apply what Netflix is doing. You cannot give half a million to your engineers because you want top talent. I understand that. You're not running at that scale. I understand that as well. But Netflix did not start where they are at today. It's about, hey, why would I reinvent the wheel? Let me do similar steps to what Netflix did 10 years ago because I am where Netflix was 10 years ago. I cannot get the top talent but maybe I don't need to go to extreme and then get the cheapest talent either. Maybe the solution is to get less people that are not Netflix level but still higher level than what I am right now. Maybe I can have half of the people paid double. I'm inventing now the use case.

Itiel: [00:25:01]
I am completely with you Viktor, but I think so many people are just, I'm not sure how to call it, they're like blown away by the microservices and the speed. Everyone wants to move that fast. In my company we also try to move as fast as we can, but it's very hard to know the price you are paying. Sometimes we meet companies that are now doing a migration. They're migrating to Kubernetes or they're migrating from monolith to microservices. We try to help them both with the Komodor product and with advice, because we do see a lot of companies. One of the questions that I really love to ask and almost no one really answers correctly is okay, you are now breaking the monolith. So what do you expect are going to be the new problems that you are going to face? I understand all of the benefits. You can release faster and have smaller teams, but where are you going to pay the price? You can't win win win in everything. Breaking a monolith into microservices, there are some losses. I'll be happy to better understand with you where you see the future bottlenecks or future problems. Some people know how to answer it but it's quite rare. Most of them are like what do you mean? I'm moving to Kubernetes and to a microservice architecture. New problems? I'm trying to solve my old problems by moving to this system. When this is the mentality, then usually what happens is they are migrating. It takes them a lot of time. After a month or two, they understand they are not in a better place than they were before they started. They are currently in a much worse place than ever. We do try to help in any way we can and I think Komodor does provide a lot of benefits for this use case, but they don't understand the pitfalls of the Netflix architecture.

Viktor: [00:26:44]
That's the thing. When you're jumping into something newer, Kubernetes, microservices, this and that, the thing is that the closer you are to the edge, the closer you are to what Netflix and Google do, the harder the problems are to solve. Not easier. Microservices are harder than monoliths. Kubernetes is harder to manage than VMs. The benefits they provide outweigh those problems almost certainly, but still, if you don't have the capacity to solve the problems that Netflix is solving, you shouldn't be doing what Netflix is doing. I'm in full support of microservices, but if you don't know what you're doing, you're going to be worse off. So it's kind of step-by-step. You cannot jump. Let's say that your tech stack right now is from 1999 and then you say hey, I want service mesh and Kubernetes and microservices and functions and all that stuff. No, you cannot, because you don't understand what VMs are in a way. I'm ridiculizing. I'm exaggerating now but

Itiel: [00:27:49]
One of the things that we also see and I see quite a lot is that companies move to Kubernetes and now they have a lot of new issues. I'm trying to help and then sometime they tell me something like no but like next month we're going to implement Istio and it will solve the new problems that we have. I'm like I worked with Istio. It's a good technology but I'm not sure that having more complexity and more abstraction is what you need at this stage of time. It's like drugs. You take the next one and the next one and the next one. If you don't manage Kubernetes, the chances that Istio is going to improve your system are quite low. It's not going to happen. I hear it quite a lot from companies that are looking into Istio and I'm like what problem are you trying to solve using Istio and again what is the price?

Viktor: [00:28:35]
Here's the thing. You know what problems Istio is solving once you become proficient with Kubernetes. Not before. That's going back to your example. That's the thing. You want to jump into Kubernetes. Excellent. Jump into Kubernetes. Forget about the existence of Istio. It doesn't exist. After three months, half a year, using Kubernetes in production, you will understand why we need Istio and then that is the moment when you want to use it. It applies to almost everything. I've been playing some game recently. I can do a double jump now in the game. Whatever the game is, you start with a single jump first and the game teaches you over time how to do a double jump to pass through obstacles.

Itiel: [00:29:18]
Yeah, but my problems are going to go away once I'm going to use Istio. It's something I heard quite a lot. I used Istio in the beta because I really had no choice and it has its benefits. I won't lie, but I do try to tell people to wait at least a couple of months. Maybe let's see if you can solve the issue without using Istio, and worst case, Istio is only going to get better, so try to postpone it. It works sometimes. Again, Komodor is not a service mesh solution or something like that, but it's more of a general concept.

Darin: [00:29:49]
We're bashing Istio again. Can we not have another podcast about bashing Istio?

Viktor: [00:29:54]
No. Istio is amazing. I love Istio. I love service mesh. I just think that a significant percentage of people are not ready for it. That's not the same thing as bashing.

Itiel: [00:30:04]
I tend to agree. I understand the service mesh concept and use it, but you need to understand what you're doing. I think that it is as simple as that and Istio is not that simple to understand.

Darin: [00:30:16]
Itiel, thanks for joining us today. Again, all of Itiel's contact information will be down in the show notes. One more time, tell us what Komodor can do for a company.

Itiel: [00:30:26]
Komodor allows you to see in one place everything that happened across your Kubernetes cluster, and that means deployments, changes, feature flag changes, alerts, everything in one place and in a language that speaks to both developers and DevOps and not only to the top talent.

Darin:
We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and the link to the Slack workspace are at https://www.devopsparadox.com/contact. If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at https://www.devopsparadox.com/ to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.