DOP 37: 50 Shades of Canary Deployments

Posted on Wednesday, Jan 8, 2020

Show Notes

#37: We take a walk through the continuum of deployment strategies.

Hosts

Darin Pope

Darin Pope is a developer advocate for CloudBees.

Viktor Farcic

Viktor Farcic is a member of the Google Developer Experts and Docker Captains groups, and a published author.

His big passions are DevOps, Containers, Kubernetes, Microservices, Continuous Integration, Delivery and Deployment (CI/CD) and Test-Driven Development (TDD).

He often speaks at community gatherings and conferences (latest can be found here).

He has published The DevOps Toolkit Series, DevOps Paradox and Test-Driven Java Development.

His random thoughts and tutorials can be found in his blog TechnologyConversations.com.

Rate, Review, & Subscribe on Apple Podcasts

If you like our podcast, please consider rating and reviewing our show! Click here, scroll to the bottom, tap to rate with five stars, and select “Write a Review.” Then be sure to let us know what you liked most about the episode!

Also, if you haven’t done so already, subscribe to the podcast. We're adding a bunch of bonus episodes to the feed and, if you’re not subscribed, there’s a good chance you’ll miss out. Subscribe now!

Transcript

Darin Pope 0:00
This is episode number 37 of DevOps Paradox with Darin Pope and Viktor Farcic. I am Darin.

Viktor Farcic 0:06
and I am Viktor.

Darin Pope 0:09
And today we're going to talk about deployments. They could be old school big bang, or we could try to do new kid cool stuff. Right?

Viktor Farcic 0:22
Yeah.

Darin Pope 0:23
We'll talk about the continuum of what we believe are the basics. I mean, there's obviously going to be variations of every one of these. But we think it rolls up into four big ones, right?

Viktor Farcic 0:37
Exactly. Big Bang, rolling deployments, blue/green, and canary deployments. I guess that those are the three you're referring to.

Darin Pope 0:48
Yeah, the four. Yeah, that's it.

Viktor Farcic 0:51
And you previously said something like choosing the right deployment. I don't think that it's a choice. It's mostly forced on you. You can do one or the other depending on how you assemble your applications. So it doesn't scale.

Darin Pope 1:08
Well, it's not so much assemble, it's how it was written.

Viktor Farcic 1:12
Yeah, exactly. I mean, that's architecture of your application and the system in general.

Darin Pope 1:19
So big bang, that's sort of where everybody starts. Right?

Viktor Farcic 1:24
I mean, I wouldn't say that everybody starts. But everybody was starting in the past, let's say.

Darin Pope 1:30
Yeah. This is the history of deployments, if you will.

Viktor Farcic 1:37
Exactly. So we learned a long time ago, in a galaxy far, far away. We were all doing big bang, because most of us had a single replica of our application. A big one, you know, the one that you were scaling by throwing more CPU and memory at the problem. And that application had to be shut down for a new release to be put in its place. We're going to call that big bang, even though there is probably a better name for it. And that means inevitable downtime. There is absolutely no way to avoid downtime if you don't have multiple replicas of your application.

Then we got the idea to do blue/green deployments, which would be running two big bangs in parallel, the old and the new one, and directing traffic to one or the other. That already means that you need to have an application that scales, because you have at least two replicas of your application running. The problem with blue/green is that you have to have double of everything. So if you have 100 servers, you would need to have 200 servers in production, which is a very costly operation, assuming that you really, really want to do it in a way that lets you roll back at any time of the day.

Then we got rolling updates, which upgrade one replica at a time. So if you have 10 replicas, you would have nine replicas of the old release and one replica of the new release, then eight replicas of the old release and two replicas of the new release, and so on and so forth. For that, you definitely need to have a scalable application. Otherwise, it's simply impossible.

And then, and I'm going really, really fast here, we have canary deployments, which would be the same as rolling updates, but you're making a decision whether to proceed or not. 10% of the traffic here and 90% there, make a decision, go forward or go back. And that decision is made, hopefully, by machines running some tests in production, and testing in production means that we are evaluating metrics. Is the error rate sufficiently low? Is the speed of responses of our application sufficiently high? And so on and so forth. That will change, everything changes, but I would say that today canary deployments are the pinnacle of deployment strategies. I don't think that there is anything more sophisticated or more reliable than canary deployments.
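As a rough sketch (not from the episode), the rolling update and canary progressions described above could be modelled like this in Python. The function names, the traffic weights, and the `is_healthy` callback are illustrative assumptions, not part of any real tool.

```python
def rolling_update_steps(replicas):
    """Yield (old, new) replica counts as one replica at a time is replaced."""
    for new in range(1, replicas + 1):
        yield replicas - new, new


def canary_rollout(weights, is_healthy):
    """Shift traffic to the new release step by step.

    weights: percentages of traffic sent to the new release (hypothetical values).
    is_healthy: callable that receives the current weight and returns True
                when the metrics say it is safe to proceed.
    Returns ("promoted", final_weight) or ("rolled_back", weight_at_failure).
    """
    for weight in weights:
        if not is_healthy(weight):
            return "rolled_back", weight
    return "promoted", weights[-1]


# Ten replicas: nine old and one new, then eight old and two new, and so on.
print(list(rolling_update_steps(10))[:2])  # [(9, 1), (8, 2)]
```

The only difference between the two functions is the decision point: the rolling update always continues, while the canary loop asks a question at every step and can go back.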

Darin Pope 4:21
Okay, well, great.

Viktor Farcic 4:23
Actually, if you combine canary deployments with feature toggles or feature switches, then you'll get something even more complex and more sophisticated.

Darin Pope 4:38
So, in doing that, and you know, let's, let's see, you basically just gave the full history over the past 20 years of deployments. Maybe longer, yes, give or take

Viktor Farcic 4:52
There are some, you know, shades of gray in between and all those things, but yes.

Darin Pope 4:58
So if today you're big bang or blue/green, one or the other, maybe you made it far enough in your application re-architecture to get to blue/green, to where that was a safe deployment, or a non-blow-up deployment. How do you get to canary from there? Because it's probably still going to require application re-architecture as you move through.

Viktor Farcic 5:28
It really depends on how you're doing blue/green. For example, you can say, I'm doing blue/green simply for the sake of rolling back fast and in an easy way when a human detects a problem. Then you're kind of far away from canary or whatever else you might be doing. But if you're using blue/green combined with evaluating some metrics, in other words, if you're rolling back automatically or semi-automatically, then you're probably in fairly good shape to go to canary deployments. It all depends on, do you collect metrics? And bear in mind that this is the easiest part. Collecting anything is easy. But do you trust those metrics? And do you trust the formulas or queries sufficiently that you can say, okay, whenever the result is x, or higher than x, or lower than y, or whatever the criteria is, then I'm going to roll back? If you can get there, if you can have that trust in the system, and if your application is scalable and all those nice things that we know about applications, then it's all about understanding the process, understanding the tools around the process, and just doing it. It's not really a big deal if you can trust metrics and if your application is ready.
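To make the "higher than x, then roll back" idea concrete, here is a minimal sketch in Python. The `Thresholds` class, its field names, and the example limits are assumptions chosen for illustration; the episode does not name specific values.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_error_rate: float  # fraction of failed requests, e.g. 0.01 means 1%
    max_latency_ms: float  # slowest acceptable response time in milliseconds


def should_roll_back(error_rate, latency_ms, limits):
    """Return True when either metric crosses its limit."""
    return error_rate > limits.max_error_rate or latency_ms > limits.max_latency_ms


# Hypothetical limits: at most 1% errors and at most 500 ms latency.
limits = Thresholds(max_error_rate=0.01, max_latency_ms=500)
print(should_roll_back(0.002, 320, limits))  # False
print(should_roll_back(0.05, 320, limits))   # True
```

The code is trivial on purpose. The hard part, as discussed above, is trusting the metrics and the limits enough to let a function like this make the call without a human.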

Darin Pope 7:07
But you could apply metrics to any of the four. You could.

Viktor Farcic 7:11
Oh, yeah. Exactly. I always say the same thing, and I always get surprised over and over again by how many people actually have metrics but don't really use them. It is still mostly watching some dashboards and then wondering what's happening and things like that. But yes, metrics are universal. It doesn't matter how you deploy things.

Darin Pope 7:42
Is that really the first part of the re-architecture? If you're still big bang today? Is that the first thing you should do is collect metrics and analyze?

Viktor Farcic 7:50
Yes, collect metrics and analyze. And here comes the difficult part. The easy part is to collect metrics of the system. A bit harder is to collect metrics from inside your application. That's a bit tougher because, and here it doesn't matter how you deploy stuff, right, you want to make a decision to do something based on metrics. Let's say the latency of your requests is not good enough. You need to be able to say, okay, latency of this function, of this endpoint, of this part of my application needs to be this, and that part needs to be that. It's not good enough to ask, "is my application fast?", because that will give you some averages that are not really useful. The response of the signup process might be slower than what you expect from the rest of your application. And if you take all that together, then you're going to have really high thresholds that apply only to part of your application.
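A small Python sketch of why a single average hides problems. All endpoint names, latency samples, and limits below are made up for illustration. The global limit is set high so that the slow signup process passes, which is exactly the trap described above.

```python
from statistics import mean

# Hypothetical latency samples in milliseconds, keyed by endpoint.
samples = {
    "/signup": [900, 1100, 1000],
    "/search": [280, 320, 300],  # degraded: normally around 100 ms
    "/cart": [90, 110, 100],
}

# Hypothetical limits: one global limit versus one limit per endpoint.
GLOBAL_LIMIT_MS = 1200
limits = {"/signup": 1500, "/search": 200, "/cart": 200}


def overall_average(samples):
    """Average over every request, ignoring which endpoint served it."""
    return mean(v for values in samples.values() for v in values)


def endpoints_over_limit(samples, limits):
    """Return the endpoints whose own average exceeds their own limit."""
    return sorted(ep for ep, values in samples.items() if mean(values) > limits[ep])


print(overall_average(samples) <= GLOBAL_LIMIT_MS)  # True, looks healthy
print(endpoints_over_limit(samples, limits))        # ['/search']
```

The global check passes even though search has become three times slower than its usual 100 ms, while the per-endpoint check points straight at the part of the application that needs attention.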

Darin Pope 9:03
So how would you? Do you do cross cutting because people don't want to go in and instrument everything.

Viktor Farcic 9:11
But they should, I think.

Darin Pope 9:16
But that introduces risk in the application too? Right? Done incorrectly, bad metric collection could be worse than user impacting performance.

Viktor Farcic 9:30
Yes. But that's why you have those more general metrics, usually collected from outside of your application, that are going to tell you whether your application is slow in general, kind of as a whole, right? So if you instrument your application with metrics, and the external metrics are telling you it is slow, then you know it's slow, you know that you did something wrong. So the first step is definitely to collect general metrics. Then you can start instrumenting, and you will know whether it's too slow for you.

Darin Pope 10:04
And you called out one thing that I thought was interesting, the signup process. Don't necessarily just instrument for technical but instrument for business reasons,

Viktor Farcic 10:14
Of course, because you need to know, are people giving up? Just to keep with the latency example from before, if you now say, Viktor, there is a problem, our application is too slow, I will tell you to back off. What value is it to me to know that the application is slow? No value. It provides absolutely no tangible value. I need to know which part of the application is slow. Right? That would be the same as opening an issue saying the application doesn't work. Yes, but what doesn't work? It's a huge application. Tell me what's not working. Same thing with metrics, if you don't know what is wrong. If you can just say the cluster is not sufficiently fast, that's silly.

Darin Pope 11:10
Yeah, but that's where most people end up going first. I need bigger, faster. And they're solving the quote-unquote problems by throwing hardware at them. And sometimes, yes, that's right, because they started with an undersized thing to begin with. It's a greenfield deployment. Hey, this is the first time we're doing this. We don't know what we're doing. So this is what we think.

Viktor Farcic 11:43
Yes.

Darin Pope 11:45
And yeah, but you could load test that every way to Sunday. But until you get real human and bot traffic running against it, there's no way to tell.

Viktor Farcic 11:56
Exactly. But I think that the bigger problem, and I think we've been repeating this over and over again in this podcast, is that there's a disconnect between different groups. Because if you have some operations group or sysadmins group that is in charge of those metrics, of course they're going to look at the system from a very high-level perspective, because they have no idea that you have different functions and endpoints and different stuff in your application. How could they know? And you assume that they're going to do all that for you as a developer. So there is that discrepancy: it's not my job, and the person whose job it is cannot go that low because he only knows that there is an application.

Darin Pope 12:45
Yeah, and really this calls out the still huge divide between Dev and Ops.

Viktor Farcic 12:53
Exactly. Not working in a team.

Darin Pope 12:57
Not working in a team. We all want to play in a team? Well, I said we all want to play in a team. I don't believe that's true. But that's another conversation that we don't need to derail ourselves with. The short of it, without derailing too far, is that every little part of the company is building their own little kingdom.

Viktor Farcic 13:18
Exactly.

Darin Pope 13:19
That's all I mean by that. It's not individuals not wanting to work as a team, but it's the teams not wanting to work with teams.

Viktor Farcic 13:28
I would say that, basically, I don't even necessarily have anything against each team playing separately, being a company within a company and all those things. That's not the real issue. That's almost unavoidable. The real issue is how you organize those teams. So are the teams organized around a technology, like, I don't know, a middle layer team, whatever that could mean? Or is it a team around the product?

Darin Pope 14:01
Is it following line of business or is it following technology?

Viktor Farcic 14:04
Exactly. If it follows lines of business, I mean, I have a team in charge of this product, Gmail, right? Then yes, you can be as selfish and as separate from others as you want because you are a business. The problem is that QA is not a business. There is no business motivation behind having QA, or operations, or all of those things in most companies. No matter how much we want it, nobody ever says my business will be to have QA in a company. My business will be to release this application. That's business.

Darin Pope 14:46
Yeah. Okay. And let's look at all your worries somewhere else. That's okay. We'll come back to that another day. But with deployments, big bang, blue/green, rolling, canary, right? Those are sort of the big four buckets. And, you know, you have to instrument your app in order to collect metrics, but then what do you do? Okay, we can pump metrics to fill in the blank. What's today's hotness for filling in that blank? Is it Prometheus/Grafana? It could be anything, right? It doesn't really matter.

Viktor Farcic 15:34
Doesn't really matter. Actually, it does matter. It does matter in terms of what today's applications assume to be a standard. And if we are talking about relatively new stuff, usually running in Kubernetes, then everybody assumes Prometheus. So in that sense, it does matter. It's not a question of whether Prometheus is better or worse, but you're going to pick a tool x that will do a job y, whatever that is, right? And if that tool assumes only one metrics storage backend, it will be Prometheus. So you're going to choose Istio. Istio works natively with Prometheus. You're going to choose Flagger. Again, Prometheus. You're going to choose this, you're going to choose that, and it's Prometheus, Prometheus, Prometheus. We can have a separate session on this, and I'm not getting into whether Prometheus is a better or worse choice from a technical perspective, but it is assumed. So if you don't use it, then you will have to discard parts of the things that you will be adopting.
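For a feel of what a Prometheus-backed check might look like, here is a hedged Python sketch that only builds a query and a URL, without contacting any server. It assumes Istio's standard `istio_requests_total` metric with its `destination_workload` and `response_code` labels, and Prometheus's `/api/v1/query` HTTP endpoint. The workload name and the server address are placeholders.

```python
from urllib.parse import urlencode


def error_rate_query(workload, window="1m"):
    """Build a PromQL expression: share of 5xx responses for one workload."""
    selector = f'destination_workload="{workload}"'
    errors = f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
    total = f'sum(rate(istio_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


def query_url(base_url, promql):
    """Build the URL for an instant query against a Prometheus server."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


# "my-app" and the address below are placeholders, not real endpoints.
promql = error_rate_query("my-app")
print(promql)
print(query_url("http://prometheus:9090", promql))
```

A canary tool would evaluate an expression like this on a schedule and compare the result with a threshold before shifting more traffic.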

Darin Pope 16:52
Right

Viktor Farcic 16:53
and that might be a big problem.

Darin Pope 16:55
Yeah, because it's sort of decided upon that, oh yeah, by default, this is what this sub-tool uses. We'll use Istio for a second. By default, Istio pumps data to Prometheus. Or vice versa, I guess. But what if you're not in Kubernetes yet?

Viktor Farcic 17:18
If you're not in Kubernetes, then it's almost anything. I think that there is no established standard like that.

Darin Pope 17:28
Yeah, I mean, if you've got money, you can use Splunk. Right? I mean, that's an answer. Or you could write your own time series database, because the key part of all this is time series. Well, no, let me rephrase that. Do not write your own time series database, but use a time series database, because that's the key part of metrics. Time is everything.

Viktor Farcic 17:52
Yes, but now if you look at the landscape of similar tools, you will notice that most of them are not really time series. A good example would be Elastic, right? It's a very popular tool to store not only logs but also metrics, and it is not time series. It is very inefficient at that, and I would not recommend it because of that. But I cannot say that time series is the most common option. Splunk, I think, is also not time series. It actually swallows everything.

Darin Pope 18:27
It's ELK-ish under the hood. So it's

Viktor Farcic 18:30
Yeah. But, you know, if I'm hosting it myself, let's say ELK, then I do care a lot because of the performance and maintenance and all those things. But if I use some service, like Datadog or whatever you're using, then I don't really care whether it's time series behind the scenes or not, because I'm not maintaining it. As long as the price is right, I don't really care. It's their problem.

Darin Pope 18:56
Right. And as long as I can get data into it fast enough and query it fast enough to make decisions, that's what I care about.

Viktor Farcic 19:07
Yes. However, what I'm seeing more and more, and when I say more and more I mean with greenfield, usually, you know, because enterprises are slow in adopting anything, is that people, even if they're using something else, are still putting in Prometheus because of that connection with other stuff. I know that we are, for example, using Stackdriver in Google, but we're still going through Prometheus simply because of that connection.

Darin Pope 19:37
Right. So let's pretend that I'm a developer. Well, I am a developer, but I'm not a day-to-day developer anymore. I'm a developer, and the mandate has come down from management saying, hey, we can't do these big bang all-weekend releases anymore. We've got to get to where we're deploying when things are ready to go out. Not necessarily continuous deployment, but just when things are ready to go, even though that implies continuous deployment. I'm on Kubernetes. I'm new to Kubernetes, but I'm comfortable enough with it. I've been working with it for about 8-9 months. I've got my first app out there. I sort of understand rolling updates, because that's the default out of the box with Kubernetes. But now I want to get to canary because, even with my rolling updates, the rollout goes out and I've had to roll back before, and it's just a pain in the butt. And I don't understand how to do this. What options do I have to really learn how to do this correctly?

Viktor Farcic 20:58
Go to the course that we released a few days ago. Actually by the time you listen to this it will be a month ago or something I don't know. It was weeks. Udemy course.

Darin Pope 21:09
There's a Udemy course.

Viktor Farcic 21:11
yes there is and we can put

Darin Pope 21:15
put the link code that

Viktor Farcic 21:17
yeah I think mine I know 13 something I'm not sure whatever discount Udemy gives us yes

Darin Pope 21:24
Yeah, I mean, yes, it's a course, and yes, we're talking about it, but it's not super expensive.

Viktor Farcic 21:35
Here's the thought process behind the decision on how expensive it would be. We put the cheapest price that Udemy allows.

Darin Pope 21:47
Yep. That's it.

Viktor Farcic 21:50
Cannot be cheaper than that because Udemy has fixed prices.

Darin Pope 21:55
Bands of prices, but yes. And it was that simple. Right. So I worked with Viktor on it. Not as much as I should have. But it's an amazing course, if I do say so myself, because it takes you from soup to nuts, from how to take an application, assuming that it's architected correctly, that it's compliant for lack of a better term, all the way to the point where you can do canary deployments the right way. And right way meaning right way as of early 2020, because I can guarantee you this is going to change over time. Maybe not fast, but there will be new things that get introduced into Kubernetes that make canary easier.

Viktor Farcic 22:49
So here's the thing about that course. If you're just interested in making it work, then you can skip straight to the last or second to last section and get it over with. The course really tries to teach you what's happening behind the scenes, so that when things go south, and they will, you understand what's really going on and how to fix it and all those things. So if you're looking for some quick fix, then maybe it's too expensive for you, because you could actually do it in one hour. But it really explains what's behind all of that.

Darin Pope 23:25
Yeah, you could. The interesting thing is that Viktor did, again, most of it. I'm not going to say that I did most of it and he only did a little bit, because that'd be a huge lie. I've been through Viktor's books and I've actually taught courses from them before, and he followed that same process here. So if you've been through any of Viktor's books before, it's little by little, so that by the time you end up at the end, it's like, oh, now I understand how it's all interconnected and why those things are doing those things. That same process is applied. And, goodness, there are zero slides. So if you're a little concerned about that, there are zero slides. It takes about, what, four to five videos before you're actually doing real stuff. There's a big setup at the front, sort of summaries and stuff right up front. There are 74 videos. Don't let that scare you. Most of them are short.

Darin Pope 24:13
Four or five minutes on average.

Darin Pope 24:31
Yeah, yeah. So it's easy enough to do. If you're, you know, taking a break at lunch, you could squeeze through a couple of the videos during lunch. It's doable. You will need a Kubernetes cluster to do the things, but that doesn't mean you need a full-blown one. We even got it to the point where it works on Docker Desktop or Minikube.

Viktor Farcic 24:53
Yes, so there are instructions for how to set it up in Minikube and other places.

Darin Pope 25:00
So down in the show notes, there will be a link off to the course, with the best discount we can give you at the time. Udemy also maintains how they do discounting. So we'll include that link there. So if you're listening on the site or whatnot, just look for that link. It may be a Udemy.com link or it may be a bit.ly link. I don't know what it will be, but it'll be something that's clear about what it is. It'll say... What is the name of the course, by the way? I've already forgotten.

Viktor Farcic 25:37
Me too. I need to check something. It's Canary and Istio and Kubernetes is there.

Darin Pope 25:44
Yep. And then just put Viktor in there too. But anyway, the link will be there, but we'll put the full name to the course as well right above that link in the notes.

Viktor Farcic 25:53
I found it..."Canary deployments to Kubernetes using Istio and Friends"

Darin Pope 25:57
"Istio and Friends" sounds like a TV morning show. But anyway. So now you might be saying, okay, why is this course coming out and not the other course where we're comparing the four, or however many we said, of which Kubernetes service provider sucks the least? Well, I'm still not done yet. And honestly, the canary was more interesting than comparing the four. Right? I mean, let's be real. It was more interesting than comparing the four.

Viktor Farcic 26:38
But I guess depends on the audience.

Darin Pope 26:41
Yeah. Let's put it this way: it was more interesting to you. And I haven't had time to finish up my part of the other one yet. So that's what that is. So, a course, not the course you're expecting. But if you're wanting to understand canary deployments, and again, these are not even what I would call primers on Istio or the other bits and pieces there. It's just, here's how you would use them. So don't consider this to be a comprehensive course on Istio, on Kubernetes, on everything. There are a lot of assumptions there. But Viktor goes into a lot of these with clarity. So that pretty much wraps it up for today. Thanks for listening. If you want to contact us, you can find all of our contact information at https://www.devopsparadox.com/contact. Twitter, LinkedIn, all that stuff's over there. Please leave a rating and review. And now we can sort of add to that: if you purchase the course, leave a rating or review on the course over there. But more than likely, we need you to please leave a review for the podcast. That will help us out greatly, and that's pretty much it. Canary deployments are a good thing. Manual deployments are a horrible thing. So, if you're still doing manual deployments today, maybe this is going to be a little bit of hope for you. If you're doing manual deployments, Viktor, what should be that person's first step?

Viktor Farcic 28:20
Open a Word document with instructions, copy and paste, and then change the characters that Word changes, like double quotes, and then run it. But, you know, as soon as you convert that Word document into a script that you execute, kind of deployme.sh, most of the work is done. It's that easy.
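The episode mentions a shell script, deployme.sh. The same idea, sketched here in Python for illustration only, is a list of commands run in order that stops at the first failure. The `run_steps` helper and the placeholder commands are assumptions. Real steps would be whatever is in your instructions document.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order and stop at the first one that fails.

    Returns 0 when every step succeeds, otherwise the 1-based number
    of the step that failed.
    """
    for number, command in enumerate(steps, start=1):
        print(f"step {number}: {' '.join(command)}")
        result = subprocess.run(command)
        if result.returncode != 0:
            return number
    return 0


# Placeholder steps: each one just prints a message. Replace them with the
# commands you currently copy and paste by hand.
STEPS = [
    [sys.executable, "-c", "print('build the release')"],
    [sys.executable, "-c", "print('apply the release')"],
]
```

Calling `run_steps(STEPS)` runs the placeholders in order. Once the manual steps live in a list like this, they can be reviewed, versioned, and run the same way every time.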

Darin Pope 28:48
If you're copying and pasting commands anyway, it needs to turn into a script, or a set of scripts. Again, another conversation for another day. Alright, thanks again for listening to episode number 37 of DevOps Paradox.