DOP 40: Continuous Reliability: How to Avoid the Biggest Mistakes Developers Make

Posted on Wednesday, Jan 29, 2020

Show Notes

#40: We’ve heard about Continuous Integration, Continuous Delivery, and Continuous Deployment. Today, with the help of our guest Eric Mizell, we discuss Continuous Reliability.

Guests

Eric Mizell

Hosts

Darin Pope

Darin Pope is a developer advocate for CloudBees.

Viktor Farcic

Viktor Farcic is a member of the Google Developer Experts and Docker Captains groups, and published author.

His big passions are DevOps, Containers, Kubernetes, Microservices, Continuous Integration, Delivery and Deployment (CI/CD) and Test-Driven Development (TDD).

He often speaks at community gatherings and conferences (latest can be found here).

He has published The DevOps Toolkit Series, DevOps Paradox and Test-Driven Java Development.

His random thoughts and tutorials can be found in his blog TechnologyConversations.com.

Rate, Review, & Subscribe on Apple Podcasts

If you like our podcast, please consider rating and reviewing our show! Click here, scroll to the bottom, tap to rate with five stars, and select “Write a Review.” Then be sure to let us know what you liked most about the episode!

Also, if you haven’t done so already, subscribe to the podcast. We're adding a bunch of bonus episodes to the feed and, if you’re not subscribed, there’s a good chance you’ll miss out. Subscribe now!

Transcript

Darin Pope 0:00
This is episode number 40 of DevOps Paradox with Darin Pope and Viktor Farcic. I'm Darin

Viktor Farcic 0:06
and I'm Viktor.

Darin Pope 0:08
Today we have a guest with us. His name is Eric Mizell. Hey, Eric.

Eric Mizell 0:13
Hey, how's it going, Darin and Viktor?

Darin Pope 0:15
Good. So in full transparency, Eric and I used to work together, eight, nine years ago. Doesn't seem that long.

Eric Mizell 0:22
Yeah.

Darin Pope 0:22
We were actually talking yesterday. He's like, "hey, can we reschedule this recording? I've got to take my son to get his permit." And the last time I remember your son, he was maybe four.

Eric Mizell 0:32
Yeah.

Darin Pope 0:33
So four or five. So it's been a while since we've actually seen each other.

Eric Mizell 0:38
Yeah.

Darin Pope 0:38
Eric is the VP of solution engineering at OverOps. We'll talk more about OverOps later. There will be a shameless plug. Fair warning, but not until later. His title isn't as long as yours is Viktor. Your title is still the craziest.

Viktor Farcic 0:56
Yeah. When you're not sure what you're doing, you need to invent something so that nobody really knows what you're doing.

Eric Mizell 1:03
I like it.

Darin Pope 1:05
Maybe that's what we'll talk about on the career episode. Once I get that guest, that's gonna be pretty good. But anyway, Eric is with OverOps. He came up through the world pretty much like I did. I was looking back over your bio, you had PowerBuilder there.

Eric Mizell 1:19
Yeah.

Darin Pope 1:19
Oh my gosh. PowerBuilder.

Eric Mizell 1:20
That's right. Cut my teeth, man. PowerBuilder.

Darin Pope 1:23
Did you cut your teeth or cut your hair afterwards?

Eric Mizell 1:26
Yeah, you can see from my hair, you know. But no, I cut my teeth on DataWindows craziness, fat clients. But thankfully I got into the world of Java, rewrote the whole app in Java, and have never left the Java world since.

Darin Pope 1:41
Which is interesting because we talk a lot about the DevOps stuff here on this podcast. And one of the key words is always "continuous". So of course, we have the new course about canary deployments on Udemy, again, shameless plug. But Eric has coined another continuous term that I think is really interesting, and it's "continuous reliability". Explain to us, Eric, what continuous reliability is, or what it should be.

Eric Mizell 2:13
So first thing, I'll give you a quick backdrop. You know, as a software engineer, the biggest challenge is trying to deliver quality code across a continuum, from development to QA to production, with different types of testing: static, dynamic, load testing, manual testing, etc. And in the new world of CI/CD, we're trying to move faster to keep up with the Amazons of the world, the Googles of the world, where these guys are new school, and they deploy code hourly, minutely, right? Like they're constantly deploying code. How do I move to that world when I'm coming from a monolith or heavier code bases, where I deliver code every six months or every three months, to the notion of delivering every other week or once a month? Some folks are struggling just to get to delivering once a month, because they're used to every six months. How do I get there faster? And so this notion of continuous reliability comes up, where I need to move faster, I need more automation, I need better tooling, I need better feedback loops, and I need more automated processes. And so I do a lot of talks on continuous reliability. Because what's happening is you kind of have this notion of a Venn diagram. I have speed, which is how fast can I move? How fast can developers develop or deploy code? I have quality, which is how good or reliable is my code? And I have complexity, right? So this complexity says, you know, how complex is it? We're moving to microservices, which are by far more complex than a monolith, but yet easier to maintain. So I look at that Venn diagram, and there's no way to really get all three, because if I have a complex product, I'm going to give up speed. If I want higher quality, I give up speed. You get this trade-off with all of these things, right? If I want to go fast, and I have high complexity, quality is going to suffer.
So this notion of continuous reliability is how do I get to the middle of this Venn diagram, where I can deploy code faster, have better feedback loops, and catch errors sooner in a shift left world. I'm sure you guys have seen a lot of shift left. We want to catch the errors before they get to production. There's a lot of talk about, you know, I need more code coverage, I need better testing, but the reality of it is code is never really tested till it gets to production. So you get this vicious cycle of how do I get those test cases back into production or into my QA cycle and move faster? So, you know, the way I see the world changing, first thing is automation. I think we're all seeing that, right? More tooling, more automation, and less reliance on humans. I hate saying that, but it's reality: the more human intervention in this pipeline, the more chances for failure. So think of manual testing. Someone does a test case, they miss a step, they pass it, it fails in production, no one knows, right? Less reliance on logs, if you will. Everything goes in a log. And the problem with a log is it doesn't tell me anything proactively, it's reactive. So I need more things monitoring live running apps, live running tests. I like to call it AI or "code monitoring code", giving me better insight. I guess I'm going on too much. I should probably stop and take a break on this and let you ask some questions. But this notion of automation, getting to a shift left world, catching things sooner before I get to production, and then if it gets to production, having a way to detect issues faster and get it back through the cycle faster. That pipeline helps me move really, really fast.

Viktor Farcic 5:52
You know, I think that the last part is the crucial one. I think that most companies throughout the history of software engineering were focused primarily on how to make something not fail in production. That's why we would have year-long cycles, kind of like, "Oh, I need to spend months and months and months, because when I put it to production it's going to be rock solid." The part that I was always curious about is that you would fail every single time and you would still think that that makes sense.

Eric Mizell 6:26
Mm hmm.

Viktor Farcic 6:27
But if you continue with that attitude of "I'm going to make it work in production no matter what", then whatever you just said is useless.

Eric Mizell 6:37
Right, and to your point, when I used to write software, we would go through the same model you were talking about. You think of every permutation that could break the code, and you spend weeks on this, and months with testing and retrying and experiments. You don't have that time anymore. The product has to get to production. Business doesn't wait. We're competing with so many others. You know, the three of us can start a company today, right, on Amazon, and we can have something spun up and be competing with somebody in a week. And when you're talking about new school development and moving faster versus someone who has legacy, they move very fast. We have to change our mindset: how do I move faster, but have a way to catch the issues in the code or the issues in my pipeline, whether it's database, middleware, whatever, before it impacts my customers, without having to take the months of time it used to take, to your point?

Viktor Farcic 7:35
I don't believe it's only about speed. I would actually say that being faster comes later. First, I think that we need to abandon the idea that, no matter how much time is spent, a day or a year, you're going to make something that works in production 100% from the first attempt. It's not gonna happen. It never did. And then once you accept that, my belief is, yes, we are going to deploy to production and it's going to fail. So why not at least go faster and in shorter iterations? Then at least we can control the scope of that failure better.

Eric Mizell 8:14
And that's interesting. I think that's where people are getting into the microservice world, where I'm making smaller, incremental changes faster, right? To your point, I can move faster, more agile. My biggest piece of this, and this is where I struggle... You know, shameless plug, I've been working as a field engineer for years, and I work with a lot of companies. And most people don't know when a new error happens. That's my problem. Right? So I'm moving fast and I'm going to accept this model that I'm going to break things. It's okay. But if I don't know things are broken, then it's impacting my customers, and that's where the pain starts. How long can I tolerate that as a business? That's why I want something monitoring code, shameless OverOps plug. It detects new errors and will tell me right away, and if I can know right away, I can fix them and I can move faster. I can go, hey, look, I saw the problem, fix it, deploy, fix, deploy. And that model keeps me moving faster and, to your point, helps me not have to spend hours and hours ahead of time.

Viktor Farcic 9:11
You know, one of the big changes that I think about with monitoring production is that, in the past, users were monitoring production. You would find out that something doesn't work because a user would call you, like, "This doesn't work." Then you'd find out. And today, users don't do that anymore. If it doesn't work, they go somewhere else. You never find out from users anymore that there is a problem. They are somewhere else.

Eric Mizell 9:34
That's exactly right. I've got to give you a hilarious story. And I can't make this stuff up. So my wife was trying to buy something on a website, won't name the website, for Cyber Monday. She puts everything in her shopping cart, hits buy, and it disappears, nothing. So she does it again, and nothing. And to your point, there are two types of people in the world. One calls the call center. That's my wife. A little old school here, right? She calls their call center and orders the stuff. The rest of the world goes away and never comes back to that company. Both of them cost you money. Because either you have to have a call center person to take the call, or you lost business permanently. And then they tell 100 people, right? And the sad part is nobody knows. Nobody knows unless someone complains. To your point, nobody knows. It's an abyss.

Darin Pope 10:27
That's always fun. But then, when the failure occurs, when you've learned of that new error, what have you seen? Again, not a pitch for OverOps, but it helps in this space, for people that actually drink that Kool-Aid and do it. Do they actually do the work? Are you seeing people actually go, "Hey, this error just happened. Let's fix it. And let's deploy it"? Do you see that happen enough? Or is it still, "Oh yeah, there's a new error. Okay. I'll go out in three weeks"?

Eric Mizell 10:59
No. So what's interesting about OverOps is we actually give you the business context. Right? So is it happening in a business transaction? I think that's what I've learned heavily in these last few years I've been doing this with OverOps: people in operations need business context. If I have an error, what does that mean? What is it impacting? Nobody knows until someone complains. That's the notion of it. You know, is it a database problem? No one knows. But if I can tell you that a payment transaction is failing, then that becomes super interesting, right? So if you're sitting at the ops desk, and shopping cart purchases are failing, and you get an alert that says that, that's going to catch your attention. That's the type of detail we give you. And that's what gets people's attention, because that becomes a P1 versus the noise of errors that you get. If we think about how people detect errors today, most of it is volume-based thresholds. And they just get bombarded with alert storms, kind of like a car alarm going off: no one cares anymore. So it gets ignored, to your point, Darin. So we want to make sure we give data with high-value context related to the business, so that it's human readable, and so that someone at an ops center can go, "Oh my god, this is shopping cart, or this is payment, or this is open enrollment," or whatever it might be that's important to the business and demands attention. And we give you a root cause screen that you can actually triage in minutes versus sifting through logs for hours.
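
The contrast Eric draws here, between volume-based thresholds and alerts that carry business context, can be illustrated with a small sketch. This is not how OverOps works internally. The transaction names, the `ErrorEvent` type, and the threshold of 1000 are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical set of business transactions the company treats as critical.
CRITICAL_TRANSACTIONS = {"payment", "shopping_cart_purchase", "open_enrollment"}


@dataclass(frozen=True)
class ErrorEvent:
    error_type: str
    transaction: str  # business transaction the error occurred in
    count: int        # occurrences in the current monitoring window


def volume_alert(event: ErrorEvent, threshold: int = 1000) -> bool:
    """Volume-based rule: fire only when the raw count crosses a threshold."""
    return event.count >= threshold


def business_priority(event: ErrorEvent) -> str:
    """Context-based rule: any failure in a critical transaction is a P1."""
    return "P1" if event.transaction in CRITICAL_TRANSACTIONS else "P3"


events = [
    ErrorEvent("TimeoutError", "payment", 3),           # rare but costly
    ErrorEvent("KeyError", "recommendations", 5000),    # noisy but low impact
]

for event in events:
    print(event.transaction, volume_alert(event), business_priority(event))
```

In this sketch the three failed payments never trip the volume rule, yet they are the only P1, while the noisy recommendations error fires the volume alert and stays low priority.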

Viktor Farcic 12:32
Maybe it's also about limiting scope, because that reminds me in a way of testing. You know, when you run automated tests, and you have situations where tens or hundreds of tests are failing, you just say, "Yeah, they're failing. What can I do? There are tens of them." But if you get to the point that a single test fails, at least psychologically it's different. "Oh, okay. So I have a test that needs to be fixed. I can do that."

Eric Mizell 13:04
Right.

Viktor Farcic 13:04
When you have a hundred tests that need to be fixed, you just give up. I would give up.

Eric Mizell 13:09
It's funny you say that. That's one of the things that has been foremost in what we're hearing from our customers, from analysts, and from all the folks that we speak with: everyone has accepted technical debt and legacy debt. Your application has tons of errors. To your point, thousands of these things fail. Oh well, no one is complaining? That's the answer, right? What people don't want is the new issues or the critical ones. So we have a way to tell you what's new, what's critical, and prioritize them. So to your point, when you run a test, probably shift left, what do I care about? I don't care about the tens of thousands of problems I've had for years. I don't have time to fix them. No one's complaining. I reboot. I have to say this. This is sad. And Darin knows this story because we lived it at Terracotta. I reboot my server every night when it has a memory leak. That's how we solve it. We've accepted that. What we don't want is to put a new feature out that caused us to not process a payment, or caused us to lose money, whatever. Those are the types of errors we want to focus on. That's what the focus of continuous reliability needs to be. If I'm going to move fast, I want to know the new features I'm building and the functionality I'm changing don't cause me financial pain or cause me to lose customers. And if I can identify those over the technical debt, life is good.
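
One way to picture separating new errors from accepted technical debt is to fingerprint each error and compare it against what has been seen before. The sketch below is only an illustration under assumed names: `fingerprint`, `classify`, the bucket labels, and the sample exception locations are all hypothetical, and a real tool would use far richer signals than an exception type plus a location string.

```python
import hashlib


def fingerprint(exc_type: str, location: str) -> str:
    """Stable ID for an error: same exception type at the same code location."""
    return hashlib.sha256(f"{exc_type}@{location}".encode()).hexdigest()[:12]


def classify(seen_now: set, known_debt: set, resolved: set) -> dict:
    """Split the fingerprints seen in this release into three buckets."""
    result = {"new": set(), "resurfaced": set(), "known_debt": set()}
    for fp in seen_now:
        if fp in resolved:
            result["resurfaced"].add(fp)   # was fixed once, came back
        elif fp in known_debt:
            result["known_debt"].add(fp)   # accepted, long-standing error
        else:
            result["new"].add(fp)          # never seen before this release
    return result


leak = fingerprint("OutOfMemoryError", "ReportCache.load")
old_bug = fingerprint("NullPointerException", "Cart.total")
payment = fingerprint("IllegalStateException", "PaymentService.charge")

report = classify(
    seen_now={leak, old_bug, payment},
    known_debt={leak},
    resolved={old_bug},
)
print({bucket: len(fps) for bucket, fps in report.items()})
```

Only the `new` and `resurfaced` buckets would need attention before a release. The `known_debt` bucket is the backlog that, as Eric says, nobody is complaining about.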

Viktor Farcic 14:28
It's kind of like, I acknowledge that I have technical debt, but at least I can work on not increasing it.

Eric Mizell 14:36
Correct. Correct. And then maybe burn down if I can troubleshoot faster, because I have a root cause screen that shows me actual, actual problem versus sifting through logs. Then maybe I can start burning down technical debt in a sprint.

Darin Pope 14:56
And we've got a puppy today. Who's the puppy? Is that your puppy?

Eric Mizell 15:00
Yeah, my wife would have one of those. So here we go.

Darin Pope 15:04
Well, mine is sitting over here quietly behind me for right now, but at 70 pounds, he's not such a puppy anymore.

Eric Mizell 15:10
No, I would think not.

Darin Pope 15:12
He thinks he is. Again, we're off on a different topic that has nothing to do with DevOps. We're talking puppies. As long as we don't have kittens. Of course, Viktor is the one that has the cats. I'm surprised I haven't heard him crying today. There was one episode, he was crying a lot, but that's okay. So technical debt. Alright, so let's ask this sort of in a generalized way here. In the people that you've seen, whether through OverOps or other places, is technical debt really a big deal? People say it is. But what do you see? We probably have a big overlap in client sets. But bottom line is, do you see people really caring about technical debt?

Eric Mizell 15:49
No, unfortunately, I don't. And it's funny you say that. When I first joined here two years ago, my mantra, my schtick, since I'm a software engineer by trade, was to not have technical debt. I wrote a blog about changing the status quo and why it is acceptable to have all these errors, and no one cared, right? Because the facts are the facts: business moves fast. These VPs of engineering have, you know, tens or hundreds of developers, and they have feature requests and things they have to build. And the technical debt is technical debt. So what they've chosen to solve it with is, like we mentioned, "reboot my server every night because I have this critical error." I have customers that reboot every four days, because on the fifth day, everything crashes. So to avoid it, they just reboot. We have people that say, you know, "We're going to do canary deploys, because canary helps me roll back." There are all these different ways. I had a customer, and I asked him, "How do you solve for these problems?" "Oh, well, if the container dies, a new one spins up." "Yeah, but your SLAs go for a toss while that container is dying because, you know, memory is churning, threads are locked, whatever." And they're like, "Oh, I never thought of it that way." So to your question, no, people are not solving technical debt. They're very focused on new features, because they have to keep up with business.

Viktor Farcic 17:12
I think that there is another problem. And there are many other reasons, actually, but I think that many people don't have a valid point of reference. It's kind of, you know, "On average, it takes us three months to release a new feature, and that's normal." Because your only point of reference is yourself.

Eric Mizell 17:30
Correct.

Viktor Farcic 17:32
I think that it's our obligation, in a way, to show people that no, that's actually not the reference. The reference is this, and you're there. Once they realize that, then the technical debt becomes a thing. "Oh, yes. Okay. So that's what normal looks like, and that's why technical debt matters, because we cannot be normal." But as long as you think it's normal to release once a year, then why would you fix technical debt? You're within the boundaries of normal.

Eric Mizell 18:07
You're absolutely correct. It's what people are used to. It's what I call the status quo. This is what we've had. It's what we accept. This is how we do stuff, how we've always done it, and it's fine. And to your point, that needs to change. I'm actually starting to hear "error budgets" coming up with companies, and technical debt burndown budgets, bug bounties. They pay people to come in on a weekend, and for every bug they bash, they get a little stipend or a little perk, right? So you have to change the way you think. And with those things, everything is cause and effect. If I fix all this technical debt, guess what? I don't have to reboot every night at midnight because of a memory leak. I fixed it. You know, these deadlock issues or the thread problems or my database index problems go away if I fix them, and guess what? Cause and effect. Everything runs better. My SLAs are met, applications run better, less headache for everyone. Fewer Sev1s. The other thing I love is less in the logs. What kills me is logging has become our hammer, and every problem is a nail. We throw everything in the logs. And when people have issues, they go to the logs. We used to do a query for a timeline, and that timeline now has 10 million entries, because there's so much in there, because we accepted it. Can you imagine if you cleaned it down and there were only 100 entries? Boy, you could find stuff fast, right?
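
For listeners who haven't met the "error budget" idea Eric mentions, here is a rough sketch of the arithmetic, using the common definition of the budget as the downtime an availability target still allows. The 99.9% target, the 30-day window, and the two minutes of unavailability per nightly reboot are assumptions chosen for illustration, not figures from the episode.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime the availability target allows in the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Positive means budget is left; negative means the target was missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# Assumed scenario: a 99.9% target, and a nightly reboot that makes the
# service unavailable for 2 minutes, every night for 30 days.
nightly_reboot_downtime = 30 * 2

print(round(error_budget_minutes(0.999), 1))
print(round(budget_remaining(0.999, nightly_reboot_downtime), 1))
```

Under these assumed numbers the budget is about 43 minutes a month, so the reboots alone would overspend it before any real incident happens. That is the cause and effect Eric is pointing at.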

Darin Pope 19:31
Let's go back to your earlier comment real quick. And I'll say it this way, we are trying to eliminate the humans as much as we can.

Viktor Farcic 19:37
Not eliminate. Sorry. To move humans to do something else.

Darin Pope 19:42
Well, it gets back to budgets. Maybe it is eliminating the humans, because it gets back to the automation, right? How much of this stuff is reallocating people to the correct place? Some people will get scared because, their whole life, the only thing they've dealt with is going through logs and finding errors. Right, that's their job.

Eric Mizell 20:01
Correct.

Darin Pope 20:02
That's what they believe their job is. And they're scared to make a change.

Eric Mizell 20:07
There's always that "Who Moved My Cheese", right, or changing viewpoint. This is my job. And when I said eliminate humans, I meant eliminating the human factor. So let's play this scenario out. You guys release code to me, I'm the QA manager. Today, it's my decision whether to promote the code. We run all of our manual tests, and I'm like, "It's good to go." I made the decision. And the decision is based off of what I know, which is I queried the logs, I grepped, I did this, and it's good. But if I could have automation and, quote unquote, AI tell me, "Hey, in this release, after all your testing is done, you had three new errors, 12 critical, and then 50,000 technical debt errors that we all know about," I have a better lens for making a decision. It's not the human decision of "this is the status quo and it's okay." It's "here's objective information that tells me whether I should promote or not," because it analyzed my code at runtime. That's what I'm talking about: let's give the humans better tooling to tell them, hey, what does that release look like? And I'll give you the two second version, probably more than you want to know. In a past part of my career, I was working with an offshore group, and they kept giving us code that was more broken than fixed. So we decided to back charge them for every error they created. We got the next release from them, and guess what, there were no errors, not one. My QA team was like, "Yay," everyone's applauding, and I'm like, "How can that be?" But I didn't have time to look at it because I had so much to deliver. So we released the code. This went to big insurance companies, and I get the first call probably an hour in, and they're like, "Nothing's saving to the database." I'm like, "What do you mean?" And they're like, "Well, we went through this whole new wizard feature you guys built, entered all this data, and nothing saved."
So what we found out is the offshore team was swallowing the exceptions, because they couldn't get the inserts into the database to work. So instead of turning in code that had errors in it, they turned in code that had none but didn't work. We had no idea. It took us two months to unwind. All the swallowed exceptions were easy to find by searching and going through the code. The problem was we had to fix all the code. Took two months. So that's what I'm talking about: eliminate that human decision factor by giving really good information that detects all these issues and tells me. So it's a big change.
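
For anyone who hasn't run into a swallowed exception, the sketch below shows the pattern Eric is describing and one way to surface the failure instead. It is a made-up illustration, not the code from his story: `FakeDatabase`, `save_swallowing`, and `save_surfacing` are hypothetical names, and the fake database always fails its insert so the difference is visible.

```python
class FakeDatabase:
    """Stand-in for a real database whose inserts always fail."""

    def insert(self, row: dict) -> None:
        raise RuntimeError("insert failed: constraint violation")


def save_swallowing(db: FakeDatabase, row: dict) -> bool:
    """Anti-pattern: the failure is caught and silently discarded."""
    try:
        db.insert(row)
    except Exception:
        pass  # swallowed: no log entry, no error report, nothing saved
    return True  # the caller is told everything went fine


def save_surfacing(db: FakeDatabase, row: dict) -> bool:
    """The failure is re-raised with context so tests and monitoring see it."""
    try:
        db.insert(row)
    except Exception as exc:
        raise RuntimeError(f"could not save row {row!r}") from exc
    return True


db = FakeDatabase()
print("swallowing version reports:", save_swallowing(db, {"id": 1}))

try:
    save_surfacing(db, {"id": 1})
except RuntimeError as exc:
    print("surfacing version raised:", exc)
```

The first function produces a release with "no errors" that saves nothing. The second fails loudly, which is the only version a test suite or a log search has any chance of catching.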

Darin Pope 22:33
And if there was only a tool or service that did that for us.

Eric Mizell 22:39
Wouldn't that be great? OverOps? OverOps? That'd be awesome.

Darin Pope 22:44
So give us a 30... 1 minute pitch on OverOps. What's its scope? Because I've looked at the site. In case you're listening, it's https://www.harness.io/products/service-reliability-management. It looks like it's primarily Java. Is it only Java?

Eric Mizell 23:00
No. So, we're Java and .NET. We just released .NET. So to be fair, that just came out, and we'll be announcing GA of that here in another week or so. But yeah, the scope of OverOps is about identifying, prioritizing and resolving your most critical exceptions before they impact your customers. So we're a runtime agent. With no code changes, no configuration, you turn it on, and it automatically does all this magic I've been talking about. The idea is to be able to automatically detect all these critical issues, even if they're something that doesn't show up in a log, like swallowed exceptions or unhandled exceptions, help you prioritize them, categorize them, tell you what's new, what's critical, what's resurfaced, and give you a proper alert with context. And the context is something that no one's ever seen before. Something I would have loved to have had in that example I gave you of the offshore group with swallowed exceptions. We actually will show you the source code and all the variables that were moving through the system at the time of the event, which helps you triage and troubleshoot problems in a fraction of the time. So you're not diving through logs, looking at stack traces, just trying to figure out what's happening. I'm actually looking at what we call a snapshot of the error at the time it happened, with the source code showing me exactly where it failed, and all the data. And data is usually about 90% of why the errors happen. There was something we weren't expecting. And now that I have all that context, I can solve the issue quickly. So that's that notion of identify it, tell me what's new, critical, whatever; prioritize it, what are the most important ones; and then resolve it. That's what we do.

Darin Pope 24:38
Yep. And it's a SaaS, I would assume.

Eric Mizell 24:40
We're SaaS and on-prem.

Darin Pope 24:41
On-premise. Okay. Either one. Cool. All right. Well, Eric's information will be down in the show notes: his LinkedIn, his email, his bank accounts. And there will also be a video from DevNexus that he did back in 2019. 2019, that's when it was? Okay, 2019, talking more about continuous reliability. Much like Viktor is all about Kubernetes and now canary, another pitch for the Udemy course. Still on sale. Hurry up, that coupon expires in two days, I think, or one day, or did it already expire? It already expired. Sorry. There's probably another one coming though. Continuous reliability is what you're all about right now.

Eric Mizell 25:25
Yeah.

Darin Pope 25:25
So all of his information is there. If you're listening via Apple Podcasts, please go ahead and subscribe, leave a rating and review. All of our contact information, including Twitter and LinkedIn, can be found at https://www.devopsparadox.com/contact. New thing. New thing. Actually, two new things. If you'd like to be notified by email when a new episode is released, you can sign up at https://www.devopsparadox.com/. The signup form is at the top of every page, pretty much every page, and maybe one or two that I missed. And finally, we have transcripts now, thanks to the wonders of AI, because there's no way I could type out everything that we do. They're not perfect, but the transcripts are also going to be on the page at the episode link at https://www.devopsparadox.com/. So, like, this is episode 40. When you go to https://www.devopsparadox.com/40, you'll scroll down and you'll see the transcript from today's episode. And for pretty much every episode from this point forward. Whether or not we go back and do the old ones, I don't know yet. But maybe. Anything else from Viktor? You don't have anything, right? No? Okay. You're good. Eric, do you have any final words you'd like to say before we cut you loose?

Eric Mizell 26:43
Yeah, just thanks for having me, man. It's good. Good. Good chat. Appreciate the time. It's fun.

Darin Pope 26:47
Yeah, man. Appreciate it. And thank you for listening to episode 40 of DevOps Paradox.