DOP 70: High Availability Does Not Mean 100% Availability

Transcript

Viktor: [00:00:00]
High availability is a worthy goal. It's something that everybody should be doing. It's just understanding that high availability is not about being a hundred percent available. It's about being close to that.

Darin:
This is DevOps Paradox episode number 70. High Availability Does Not Mean 100% Availability.

Darin:
Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about. Most of the time, we mask our ignorance by putting the word DevOps everywhere we can, and mix it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make it sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often, since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. PS: it's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic.

Darin: [00:01:15]
Now as you've listened to the last couple of episodes...you did listen to the last couple of episodes didn't you? If not at least go listen to last week's. One of the key topics that we were talking about is availability and specifically today we're going to talk about what is high availability and how do you get it. So Viktor I'm going to first define high availability. You ready for this?

Viktor: [00:01:55]
I am

Darin: [00:01:56]
A hundred percent uptime. No failures. Without any humans involved.

Viktor: [00:02:04]
You know if you can promise that, I can promise you that I'm going to find you a company within a day that is going to sign you a billion dollar check. Immediately. Are you ready to get rich?

Darin: [00:02:21]
It's always a good thing. Now, let me take you back to 199... when was this? '97 or '98. Probably '98. 22 years ago at the time of this recording. A friend of mine was like, "Hey, we need this website." I'm a web guy. I can do websites. And it was a small company. It's like, "Well, it needs to be up all the time. It cannot be down even for a split second anytime within the year." Okay. At the end of the day, in today's terms, it was basically a Google Form. Let's just put it that way. I went, "Okay." "And our budget is only going to be $5 a month. We can't spend more than $5 a month." "Well, you're not going to get that kind of availability for $5 a month." "Well, these big companies are doing it. It can't be that expensive." Well, let's fast forward back into this decade. What is high availability?

Viktor: [00:03:26]
Being available almost always, and I'm choosing my words: almost. You can never be available always. Here's an example. Actually, first let me give a complaint to you. This is going to be a horrible episode for me because I have real trouble pronouncing availability.

Darin: [00:03:51]
Just call it HA.

Viktor: [00:03:53]
HA. Excellent. Yes. So people usually associate HA with when your application or system is down, and definitely when it is down you lower the percentage of HA. There is no doubt. But the trick is that it's not only that. If a single request produces a 500 response or doesn't get a response, that already means that it's not a hundred percent, and the chances that not a single request will ever be lost are zero, kind of impossible, unless you have traffic of five requests a month. It's simply not going to happen. You will have a 500 sooner or later. Now, you can mitigate that with retries and whatnot, that's a different issue, but sooner or later somebody is going to not receive a response. It's inevitable. And I'm saying that without your application being down. The nature of HTTP, or networking, prevents you from always being a hundred percent HA. But if I had to define it in simple terms, it is the period of time when a service is available. It's as simple as that. If that period of time is a hundred percent, for absolutely every single request, absolutely everything, then you have 100 percent HA. But as I said, nobody tries to get there. Nobody who understands the game is trying to get to a hundred percent, and this is the important part, because of the investment. I'm going to invent figures, but let's say that if you need a $100K investment to get to 99.9, then you need $300K to get to 99.99 and $1MM to get to 99.999. So it's exponentially growing, right? It's not, "Oh, I just need 0.1% to be a hundred." No. No. It's exponentially growing for those last bits and pieces.
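
The retries Viktor mentions can be sketched roughly as follows. This is only an illustration, not anything described in the episode: `TransientError`, `call_with_retries`, and the default values are hypothetical names and numbers chosen for the example, and the backoff-with-jitter strategy is an assumption about how one might space the attempts.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a 5xx response or a timeout."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    The final failure is re-raised: retries lower the number of lost
    requests, but they cannot bring it to zero.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2x, 4x, ... plus random jitter before retrying.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

Even with retries in place, a request that fails on every attempt still surfaces as an error, which matches the point that somebody will eventually not receive a response.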

Darin: [00:06:18]
Let's play this out for a minute. I am a fill-in-the-blank shop. I'm a Microsoft shop. I'm an Oracle shop. I'm a whatever shop. Those systems within themselves may be highly available. Well, I'll give them that, but they're also a Cisco shop. They're also an AT&T shop because they're getting their landlines in from AT&T. You have public internet. People coming in across their wireless phones. You may consider yourself highly available, but the true perception of high availability is at the end user.

Viktor: [00:06:55]
Exactly. At least the perception that they have of your availability. Even if everything works a hundred percent, if your users cannot reach you because of reasons outside your control, again, that means that you're not a hundred percent available.

Darin: [00:07:11]
Well, we'll go back to this. Last year or the year before, back when we used to be able to travel, you were in China for something and we were trying to do the podcast, and you were having to do all sorts of end runs in order to be able to connect up so we could run.

Viktor: [00:07:28]
Yeah

Darin: [00:07:28]
That's not high availability.

Viktor: [00:07:32]
Yeah. I mean, it really depends on how you define it. Right? It's not 100%. It's never going to be 100%. But you can realistically get close to it. Just to give you some idea, and I'm reading this because I would not be capable of memorizing it: if you want to be 99.9% available, that means that you have downtime, and that also includes lost requests, of 8.76 hours a year. Eight hours throughout the whole year. If you want to go with two 9s after the point, 99.99, that's 52 minutes a year, or only about 8.6 seconds a day of you being unresponsive. And now you can only imagine, if I keep adding nines, how low those figures get. If you have two nines, you're amazing. You're doing an amazing job. Absolutely amazing. When you think about it, let's say that we are all striving to do some type of rolling updates or canary deployments for new releases, that they're backwards compatible and whatnot. But it is enough that once a year you have a special release, and sooner or later everybody does, that needs to bring your service down for a very, very short period of time, and that short period of time is already going to remove one of those nines.
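
The figures Viktor is reading out can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming a 365-day year; the function name `downtime_budget` is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: non-leap year


def downtime_budget(availability_percent):
    """Return the allowed downtime, in seconds, per year and per day."""
    unavailable_fraction = 1 - availability_percent / 100
    return {
        "per_year_seconds": unavailable_fraction * SECONDS_PER_YEAR,
        "per_day_seconds": unavailable_fraction * SECONDS_PER_DAY,
    }


for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_budget(target)
    print(
        f"{target}%: {budget['per_year_seconds'] / 3600:.2f} hours/year, "
        f"{budget['per_day_seconds']:.1f} seconds/day"
    )
```

For 99.9% this gives roughly 8.76 hours a year, and for 99.99% roughly 52.6 minutes a year or about 8.6 seconds a day, which is why a single short maintenance window can cost a whole nine.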

Darin: [00:09:11]
The other fallacy that people fall into, or the other ditch people fall into, is believing that they can produce a higher availability than the systems that they depend upon. If I'm trying to build, forget about three nines, I'm trying to build a one nine, but my network only offers 95%, I can't give you 99.9. The best I'm going to potentially give you is 95.
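
Darin's point about dependencies can be put into numbers. The sketch below assumes that every component is required to serve a request and that failures are independent, in which case the combined availability is the product of the individual ones. Both assumptions are simplifications, and `serial_availability` is a name invented for this example.

```python
from math import prod


def serial_availability(*component_percents):
    """Combined availability (in percent) of components that must all be up.

    Assumes independent failures; the result can never exceed the
    availability of the weakest component.
    """
    return prod(percent / 100 for percent in component_percents) * 100


# An application that is 99.9% available on a network that is 95% available.
combined = serial_availability(99.9, 95.0)
print(f"{combined:.3f}%")
```

Under these assumptions the combined figure lands just under 95%, so the 95% network is the ceiling no matter how good the application is.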

Viktor: [00:09:44]
Exactly. It's like at least once a year we hear news from Azure, AWS, or Google that a region stopped being responsive, hopefully for a short period of time, but it happens every year.

Darin: [00:10:01]
Well, recently there was an outage with Cloudflare that caused basically a large part of the internet to go down. Great post-mortem, how they wrote it up. If you haven't read it, I'll put the link to their post-mortem in the show notes. Sometimes there is this unplanned outage. Now when you get to unplanned outages and you fall outside of your guaranteed SLA, there's usually a penalty to pay. For some people, it could be a cashback. For some people, it's just bad press. For some people, it's like, "Okay, why did the human get involved in that?" And the problem with that is the blame game starts very fast, and that just shows weak management. I'll say it out loud. Instead of just saying, "Look, things happen. Don't worry about it. Just don't do it again." That's the kind of response you would hope to get. No. Normally what it is is, "Okay, we're calling this all-hands meeting. We're going to get around the table and we're gonna figure out who broke this."

Viktor: [00:11:10]
You know what the usual result of those post-failure meetings is? Everybody starts going slower because you add a dozen additional restrictions to prevent people from creating an outage. As a result, you deploy less frequently, you make other types of mistakes and whatnot, and then three months later that same management is going to say, "We are not releasing fast enough. Go faster." It's the circle of life.

Darin: [00:11:44]
And when those things happen somebody needs to be able to stand up and say well look you told us we can't. So which is it? Take your pick. We can't do both.

Viktor: [00:11:59]
I'm paraphrasing because I don't even remember the exact quote or who said it, but it's not really about whether you will fall, it's about the capability to stand back up. That's what really matters.

Darin: [00:12:13]
I think the core part to this is high availability isn't just a technical issue. It's a managerial issue. It's a money issue. It involves everything. So just to say that we can never go down, we can never have any outage at all. Do you realize how many zeros that would take on a check?

Viktor: [00:12:41]
Actually, there is a way to be a hundred percent available. Just don't do anything.

Darin: [00:12:46]
Just shut it down?

Viktor: [00:12:47]
Yeah. Shut it down. Do not allow public access to begin with.

Darin: [00:12:51]
Yeah. It's running completely self isolated, not on a network. It's a hundred percent available. Oh wait. We have this one problem called storms where power gets knocked out. Well, then I have to have generators. That means I have to have fuel. Well, what happens during a worldwide global catastrophe when you can't have fuel delivered to your tanks and now you're out? Do you see the idiocy in high availability?

Viktor: [00:13:21]
High availability is a worthy goal. It's something that everybody should be doing. It's just understanding that high availability is not about being a hundred percent available. It's about being close to that. And what is close depends on every company. It's a balance between investment and return on investment. That balancing act will get you to anything from 99% to 99 point some number of nines afterwards.

Darin: [00:13:50]
So recently I had to help my mom renew her driver's license and in the state of North Carolina you can do that online at least for hers. But we went to renew it and it popped up and said Hey we're doing maintenance. Check back after one. Okay. Do you consider that system to be available?

Viktor: [00:14:20]
Well it's a government system. It's not supposed to be available by design.

Darin: [00:14:24]
Okay. But ignoring that, because they told me check back at one, is that still available?

Viktor: [00:14:33]
Of course it's not.

Darin: [00:14:34]
No. See I disagree. I believe it is available because at least it was there to tell me. It's no different than somebody saying that Hey you're at a retail store. They put the sign up be back in five minutes. The store is still open. There's just nobody there to check you out because they had to go to the bathroom.

Viktor: [00:14:54]
Uh depends Are there

Darin: [00:14:55]
So you're queuing, you're queuing. You're queuing, right? That's all this is.

Viktor: [00:15:01]
Some other service is available. So let's say that the service that shows you that message is available, but the service that allows you to renew a driver's license is not available. That one is not available. Something else that tells you to come back later, yes, but that particular service is not. In this specific case it's not a big deal because you're not really going to go somewhere else. In the case of Amazon, actually, it's also not a big deal because there is nowhere else to go. They have 70% of the market. But for most other businesses, not being available means that I'm going somewhere else, because my patience in this day and age is measured in seconds or minutes. Netflix is a good example. I have Netflix and Amazon Prime and HBO. And you know what happens if Netflix is not available? I just switch. It's a button for me to HBO. That's what happens. And then once a year, when I start thinking, "Okay, it's a bit ridiculous that you have three or more services for video streaming, it would make sense to ditch one of them," I choose the one that caused me the most pain.

Darin: [00:16:18]
So that's interesting. You have built in high availability for your video viewing.

Viktor: [00:16:25]
Yes, as long as I have internet.

Darin: [00:16:29]
Oh, but wait. You have to have internet. And this goes back to that: do you have power? Because if power goes out, your internet doesn't matter. If you have power but no internet, you're still stuck and you can't blame any of your three providers.

Viktor: [00:16:52]
Actually, I tend to heavily use the feature in Netflix to always download a couple of extra episodes of the one I'm watching, so actually I have battery on my mobile and downloaded episodes or something. Haha, so.

Darin: [00:17:09]
All right. But see, you as a consumer have planned for your availability. Most people would expect Netflix just to push it to their phone for them so they wouldn't have to worry about doing any kind of work themselves. That's the part I'm getting to. Sometimes you have to do work, because doing that little bit of extra work will help mitigate an outage. Here's the use case from an application development perspective. I've got a service. It needs to interact with a database. But instead of always going to the database, I insert a read-write cache in between my app and the database. Now that gives me the ability to take my database offline if I have to do maintenance and have a minimal impact on my application.
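
A toy version of the read-write cache Darin describes might look like the sketch below. Everything in it is hypothetical: `InMemoryDatabase` stands in for a real database, `ReadWriteCache` keeps its data in plain dictionaries, and buffered writes would be lost if the process died. A real setup would more likely use a dedicated cache service, which then needs its own availability story.

```python
class DatabaseUnavailable(Exception):
    """Raised by the stand-in database while it is offline."""


class InMemoryDatabase:
    """Hypothetical database that can be switched offline for maintenance."""

    def __init__(self):
        self.rows = {}
        self.online = True

    def get(self, key):
        if not self.online:
            raise DatabaseUnavailable()
        return self.rows.get(key)

    def put(self, key, value):
        if not self.online:
            raise DatabaseUnavailable()
        self.rows[key] = value


class ReadWriteCache:
    """Sits between the application and the database."""

    def __init__(self, database):
        self.database = database
        self.cached = {}          # last values seen, served while the DB is down
        self.pending_writes = {}  # writes accepted while the DB is down

    def put(self, key, value):
        self.cached[key] = value
        self.pending_writes[key] = value
        self.flush()

    def flush(self):
        """Push buffered writes to the database; keep them if it is down."""
        for key in list(self.pending_writes):
            try:
                self.database.put(key, self.pending_writes[key])
            except DatabaseUnavailable:
                return False
            del self.pending_writes[key]
        return True

    def get(self, key):
        if key in self.pending_writes:
            return self.pending_writes[key]
        try:
            value = self.database.get(key)
        except DatabaseUnavailable:
            return self.cached.get(key)  # possibly stale, None if never seen
        self.cached[key] = value
        return value
```

While the database is offline, reads are answered from the last known values and writes are held until `flush()` succeeds, so the application keeps working with possibly stale data instead of failing outright.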

Viktor: [00:18:04]
Exactly

Darin: [00:18:06]
It's easy to do, but it's one more spinning wheel that you are going to have to keep spinning. And then that has to have its own availability. We haven't even talked about multi-AZ, multi-region, multi data center, multi anything else.

Viktor: [00:18:21]
Think of it in these terms. There are three major causes of an application not being available. The application can be down, because the application itself is down or because hardware is not available. Many different reasons. You solve that with scaling, horizontal scaling of infrastructure, of the application. We could now go into multi-region, multi-zone, a long discussion, but scaling in general. Then you have loss of availability because of you making new releases. And you solve that by, again, first of all you need to scale, the previous point, and then we were talking about rolling updates, canary...

Darin: [00:19:03]
We actually just had an availability issue. Viktor lost internet. You were saying blue-green, canary, rolling. We've solved these issues. This helps us mitigate availability issues when we're doing a deploy for an application.

Viktor: [00:19:17]
Yeah. So what I'm really trying to say is that we have that unplanned or potential loss of availability not caused by you, you know, the service goes down, and then loss of availability caused by you, if you do a release in a way that makes your application not available. And the worst one, and this is the most complicated one, is when a service you depend on, and this is what Darin was mentioning before, stops being available. Because, hey, you're responsible for application A. Application A depends on application B, which can be a database, can be another application, can be a backend, many different things. And when that one is not available, then how do you keep yourself available? Because I like everybody to focus on their stuff. Right? And that's where caching comes in. If it's a frontend, it could cache. If it's a backend, it could cache. If it's a database, then that's a different story. I mean, you do care about the database, but your application is what counts because that's user facing. The database is not. So it's usually solved with some form of caching. Then we go back to your driver's license example. How can you design your application so that if something is not available, people do not suffer a hundred percent? Like, I dunno, maybe you have a shopping site where purchasing something is not available but adding items to the shopping cart is. That's then that separation: okay, if my user is going to have a negative experience, let that negative experience be limited to something.
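
The shopping cart idea, where one feature degrades while the rest keeps working, can be sketched like this. The `Shop` class, the `charge` callable, and the response dictionaries are all hypothetical and only meant to show the separation Viktor describes.

```python
class PaymentServiceDown(Exception):
    """Raised by the hypothetical payment backend when it is unavailable."""


class Shop:
    def __init__(self, charge):
        # charge(user, items) is an assumed callable that talks to the
        # payment backend and raises PaymentServiceDown when it is down.
        self.charge = charge
        self.carts = {}

    def add_to_cart(self, user, item):
        # Has no dependency on the payment backend, so it keeps working.
        self.carts.setdefault(user, []).append(item)
        return {"status": "ok", "cart": list(self.carts[user])}

    def checkout(self, user):
        items = self.carts.get(user, [])
        try:
            self.charge(user, items)
        except PaymentServiceDown:
            # Degrade instead of failing: the cart is kept for later.
            return {
                "status": "checkout_unavailable",
                "cart": list(items),
                "message": "Checkout is temporarily closed. Your cart is saved.",
            }
        self.carts[user] = []
        return {"status": "purchased", "items": list(items)}
```

The negative experience is limited to checkout: the user can still fill the cart and come back later, much like the "check back after one" message in the driver's license example.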

Darin: [00:21:10]
Well, a specific use case for that one is a retail site called B&H Photo. It's a photo video store in New York. The owners observe the Jewish Sabbath. So from sundown to sundown for Saturday, you can't buy anything. Now, you can put stuff in the cart all you want. You can say, "Hey, let me know when checkout opens back up and I'll come back and buy." Me as a consumer, I know that. "Oh, okay. Look. Oh yeah, I ordered during Sabbath. No big deal. I'll come back and check out on Sunday." No big deal, because I know they're not going to ship on Saturday anyway. I know that they are not going to ship until Monday, so it's not going to get there any faster.

Viktor: [00:22:02]
Exactly, and your user experience was maybe not a hundred percent, but it was 90 something percent. You still had a pleasant experience.

Darin: [00:22:14]
Yeah. I was able to get this stuff in my cart and I'm just queuing up waiting to buy. That's completely reasonable to me. Now for some people, millennials, that's not okay. "I want it now." No, that's not that big of a deal.

Viktor: [00:22:34]
Oh man, I'm not a millennial, but I also got hooked. When I want something, it needs to be shipped to me the same day. Not shipped. Arrive.

Darin: [00:22:46]
It needs to be on my table in the next two minutes.

Viktor: [00:22:50]
Yeah because it's kind of it's excitement Like I've just bought Pandora box you know 3000 arcade games and I'm nervous. It's not coming yet. It's been three hours.

Darin: [00:23:01]
So availability is important. But to believe that you as an application developer, you as an SRE, can provide a hundred percent uptime? If you have been given that task, run far away. Because unless you are the all in all, and you also own the power company, and you also own everything else that goes along with getting your application running, oh, and also the public internet, and you also own all the wireless carriers, globally, if you're a global application, you're not going to have a hundred percent availability. I even question if you're going to even have 95% availability.

Viktor: [00:23:56]
Yes. If you measure it for real. No faking results. Yes. Most companies don't have 99. Many hardly reach 95. Now, if you fake it, nobody will admit that, right?

Darin: [00:24:09]
Yeah, well, and again, I'm defining availability, and I believe we're defining, at least I am, the end user availability.

Viktor: [00:24:19]
Yes. End user availability.

Darin: [00:24:22]
Because at the end of the day that's the one that matters because they are the ones going on Twitter. They're the ones saying I'm going to sue you because I didn't get my blue widget yesterday.

Viktor: [00:24:33]
And that complicates things because, for example, at least in my head, if your application is unresponsive or slow to respond, not unavailable but unresponsive, that still counts as not being available. If I search for something in Google and your site does not load within seconds, and I'm being generous here, I press the back button and go to the next one, because I didn't really go to your site. I just went to the first result in Google. If it's not responsive, it's not available. If it's not fault tolerant, it's not available. If it doesn't recuperate from failures automatically, it is not available. If it's not scaled across servers, clusters, zones, whatnot, it is not available. It's an infinite list, not infinite, but a long list of "if this, then not available." A huge one.
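
One way to measure availability along the lines Viktor describes is to count a slow response as a failed one. The sketch below is an assumption-laden illustration: the 2-second threshold, the sample format of `(status_code, latency_seconds)` pairs, and the function names are all invented for the example.

```python
def is_available(status_code, latency_seconds, max_latency_seconds=2.0):
    """A response counts only if it is not a server error and is fast enough."""
    return status_code < 500 and latency_seconds <= max_latency_seconds


def measured_availability(samples, max_latency_seconds=2.0):
    """Percentage of (status_code, latency_seconds) samples that count as available."""
    good = sum(
        1
        for status_code, latency in samples
        if is_available(status_code, latency, max_latency_seconds)
    )
    return good / len(samples) * 100
```

Measured this way, a service that answers every request with a 200 but takes five seconds to do it would score well below what a plain up-or-down check reports.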

Darin: [00:25:28]
Well, let's go back to the very beginning. If you want high availability you can write us a check because we can solve it for you.

Viktor: [00:25:38]
I mean high availability can be solved. Hundred percent cannot.

Darin: [00:25:42]
How is it solved?

Viktor: [00:25:45]
High availability? By creating a target. I would say that high availability means that the target is anything you define between 99% and a hundred. So greater than 99%, smaller than a hundred, because a hundred you cannot reach. But now it's negotiable how much you want to be above 99%. I think that's why the term is high availability. It's not total availability. It's not complete availability. It's high. I think that unlike many terms in our industry, this one is actually well defined, or sounds right. It's not misleading. It's high, and high is, I would say, anything above 99.

Darin: [00:26:33]
So that's our stab at availability. What do you think? Do you agree with us? Do you disagree with us? Please disagree with us. If you do, send us a message. Maybe we can fight about it on the podcast. We'll be polite. Sort of. And if you have any more questions about availability, let us know because obviously we have some opinions and we've implemented some of these systems. I still believe 99.9 is still a lie. I still think 95 is the most reasonable number to hit but it doesn't look good on paper when you say 95%.

Viktor: [00:27:23]
I do think it is doable, it's just that we are talking now about, you know, the Facebooks and the Googles of the world, which is a relatively small percentage of companies.

Darin: [00:27:37]
Which are managing their own data centers. They're laying their own fiber. They're producing their own power. It's all those kinds of things that most enterprises cannot do.

Viktor: [00:27:48]
Yeah, and it's also about volume and distribution of users. Let's say that you are a Google or Facebook or any of those. Even if you have one hour of not being available in, let's say, the US, that is not a full hour of the full service not being available. That's the US, while the rest of the world is still available. So on a global level it doesn't count like a full hour for everybody. If you're, let's say, a local bank that operates within a single country, then loss of availability in that country or region means that you're fully out. It really depends also on how global you are, how big your traffic is, and so on and so forth. If you lose one request and you have an average of a thousand requests, that's 99.9. If you lose one request and you have a thousand or a million requests, then that's a much higher percentage, right?
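
The request-based view of availability Viktor describes comes down to simple ratios. The sketch below is illustrative only: the function names and the per-region traffic numbers are made up, and it assumes availability is measured as the share of requests that succeeded.

```python
def request_availability(total_requests, failed_requests):
    """Availability in percent, measured as the share of successful requests."""
    return (1 - failed_requests / total_requests) * 100


def global_availability(regions):
    """regions maps a region name to (total_requests, failed_requests)."""
    total = sum(total for total, _ in regions.values())
    failed = sum(failed for _, failed in regions.values())
    return request_availability(total, failed)


# One lost request weighs very differently depending on traffic volume.
print(request_availability(1_000, 1))
print(request_availability(1_000_000, 1))

# Hypothetical hour in which one region is fully down and the rest is fine.
print(global_availability({"us": (400, 400), "rest_of_world": (600, 0)}))
```

With one lost request out of a thousand the result is about 99.9%, while the same single loss out of a million is about 99.9999%. In the made-up regional example, an hour of total outage in one region still leaves the global figure at about 60% for that hour, whereas a single-country business would be at 0%.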

Darin: [00:28:54]
And the question is, can we get high availability of toilet paper during the next worldwide problem?

Viktor: [00:29:03]
Yeah, just don't rush to the store to buy a thousand rolls. That's easy to solve.

Darin: [00:29:12]
Okay. Availability. What do you think? Let us know.

Darin:
We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and the link to the Slack workspace are at https://www.devopsparadox.com/contact. If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at https://www.devopsparadox.com/ to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.

Hosts

Darin Pope

Viktor Farcic
