DOP 71: Observability in the Cloud With CloudWize

Posted on Wednesday, Sep 2, 2020

Show Notes

#71: Observability can be broken down into three layers; software, infrastructure, and knowledge. Which of these things do you think is most important? Today, we discuss these items and more with Yotam Atad and Chen Goldberg from CloudWize.

Guests

Yotam Atad

Yotam Atad is the co-founder and CEO of CloudWize.io, a company that helps R&D and CloudOps teams gain observability and control over their cloud architecture.

Yotam is an entrepreneur and has 15 years of experience leading technical teams and products. Before founding CloudWize, Yotam was the General Manager of Anyday INC., and before that the Co-Founder and Head of Product at O.P.S solutions. Yotam was also a product lead at Voyager Labs and a software team leader at the IDF intelligence unit 8200.

Chen Goldberg

Chen Goldberg is the co-founder and CTO of CloudWize.io, a company that helps R&D and CloudOps teams gain observability and control over their cloud architecture.

Chen has 14 years of experience in development and operations, building infrastructure and implementing best practices in start-ups and enterprise companies, first as a software developer and later as an automation and performance team lead. For the last eight years, he has managed production cloud architecture and the implementation of DevOps methodologies in startups and enterprises, managing their architectures in multi-cloud environments. DevOps tools freak.

Hosts

Darin Pope

Darin Pope is a developer advocate for CloudBees.

Viktor Farcic

Viktor Farcic is a member of the Google Developer Experts and Docker Captains groups, and a published author.

His big passions are DevOps, Containers, Kubernetes, Microservices, Continuous Integration, Delivery and Deployment (CI/CD) and Test-Driven Development (TDD).

He often speaks at community gatherings and conferences (latest can be found here).

He has published The DevOps Toolkit Series, DevOps Paradox and Test-Driven Java Development.

His random thoughts and tutorials can be found in his blog TechnologyConversations.com.

Transcript

Yotam: [00:00:00]
Some people think that if you have a managed services, managed infrastructure, so everything is okay. You're, you're cool. You don't have to do anything, but that's actually not true. The opposite is true. When other providers, other parties are managing your infrastructure, you better watch it very closely.

Darin:
This is DevOps Paradox episode number 71. Observability in the Cloud with CloudWize.

Darin:
Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about. Most of the time, we mask our ignorance by putting the word DevOps everywhere we can, and mix it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make it sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often, since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. PS: it's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic.

Darin: [00:01:21]
Okay, Viktor, in the catalog book, we've recently finished up since we're recording in the past again. Okay. But by the time you're listening to this, we're pretty close to being done with security. And one of the items that we've dealt with in there was probably a bit of observability. You can't do security without observability.

Viktor: [00:01:48]
If you cannot know what's going on, you cannot do anything including being secure.

Darin: [00:01:54]
Exactly. So today we have a couple of guests with us to talk about observability and control in cloud architecture environments. We have Yotam and Chen. Yotam, why don't you introduce yourself first. Chen, then you can introduce yourself and then we'll start talking about observability.

Yotam: [00:02:17]
Sure. Thank you, Darin, Viktor for having us here. So I'm Yotam and I'm the CEO of CloudWize and I'm working with Chen here to make observability possible on all the clouds.

Chen: [00:02:33]
I'm Chen and I'm the CTO and co-founder at CloudWize. I worked before with Viktor in a project called Docker Swarm Proxy Flow. And this is me.

Viktor: [00:02:53]
It's so funny how they sit in the same office and they don't call themselves the same. One says Chen, the other one says Hen. There's confusion going on in CloudWize, I see.

Yotam: [00:03:04]
I was absolutely sure his name is Chen. Wow.

Viktor: [00:03:07]
A short confession, just to clarify. I saw Chen's face for the first time like a week ago. Two weeks. Right. Until then, I was sure that he's Asian.

Chen: [00:03:23]
Oh, you know, when I used to live in North Carolina, they told me that I looked Italian and they called me Tony. So...

Darin: [00:03:33]
I could see that. I could see that.

Viktor: [00:03:37]
Yeah, you definitely look like Tony more than Chen.

Chen: [00:03:41]
Okay.

Darin: [00:03:42]
But that's not what we're here to talk about today. Today, we're here to talk about observability. Interestingly enough, we're talking about control. Why don't we first define what observability is from the way that you guys are taking the angle on it and then how that observability can't stand by itself. There also has to be control.

Yotam: [00:04:10]
So basically there's the core definition of observability, which is a property of your system that lets the company or the team inspect and understand what is going on in their environment. So if you want to know what the status of your environment is, how this status is changing, and how this reflects on different pillars, different aspects of your system, then you need observability. This is what you need. It traditionally used to be logs and metrics and later on, it became tracing, but observability is all of those together. Having the logs and metrics and tracing is not enough. Those are just the different components of observability. We have to have them all under one point of view, one context.
We see observability as something you do on three levels. You can observe your software, which is obviously something everybody does, because when you design your software and when you develop it, you put in logs, metrics and other stuff that helps you understand what's going on. The other level is your architecture. There it can become a little bit more complex because part of the infrastructure is not yours to control and that's sometimes misleading. Some people think that if you have managed services, managed infrastructure, then everything is okay. You're, you're cool. You don't have to do anything, but that's actually not true. The opposite is true. When other providers, other parties are managing your infrastructure, you better watch it very closely. And it becomes harder because you don't have control of how it is implemented and how the different metrics are exposed.
The last and I think most important is something we call the knowledge of observability. The knowledge is basically what you need to know in order to keep the system intact, because things keep changing in a very dynamic environment. Things are changing from your third party providers. Things are changing from your product point of view and things keep changing when your customers have new requirements, new scale. In order to keep this level of observability, you must keep learning every day, every hour, how things are changing. And that's like a flood. That's a data flood. You must be on top of things all the time to keep that level. So that's observability as we see it.
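To illustrate the point about having logs, metrics and tracing "under one context", here is a minimal sketch in Python. It is not CloudWize's implementation. The in-memory lists, the `handle_request` and `context_for` helpers, and the service names are all hypothetical stand-ins for a real log store, metrics backend and tracing backend. The only idea it shows is that every signal carries the same trace ID, so the signals can be pulled back together into one view.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for a log store, a metrics
# backend, and a tracing backend.
LOGS, SPANS = [], []
METRICS = Counter()


def handle_request(service: str, trace_id: str) -> None:
    """Emit a log line, a metric, and a span that all share one trace_id."""
    start = time.monotonic()
    LOGS.append(json.dumps({"trace_id": trace_id, "service": service,
                            "msg": "request handled"}))
    METRICS[(service, "requests_total")] += 1
    SPANS.append({"trace_id": trace_id, "service": service,
                  "duration_s": time.monotonic() - start})


def context_for(trace_id: str) -> dict:
    """Gather the logs and spans recorded for one trace into a single view."""
    logs = [json.loads(line) for line in LOGS]
    return {
        "logs": [entry for entry in logs if entry["trace_id"] == trace_id],
        "spans": [span for span in SPANS if span["trace_id"] == trace_id],
    }


trace = uuid.uuid4().hex
handle_request("checkout", trace)
handle_request("payments", trace)
view = context_for(trace)
```

In a real system the shared identifier would usually be propagated between services in request headers; here it is passed as a plain argument to keep the sketch short.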

Viktor: [00:07:12]
There is at least I have impression, right, and I might be wrong, but observability might be new devops in terms of popularity, not in terms of what it does, right. It's exploding. The amount of talks I'm hearing, the conversations, the solutions, you know, the money also floating around. I could say that observability is the topic of this year, right or probably, hopefully the next one. If I'm right, what changed? We obviously did not learn about observability this year, right? It's not a new thing. It exists in this industry since the dawn of time. And yet it's a focal point or one of the focal points of the time and space we live right now. What changed?

Chen: [00:07:58]
Correct. I think that most of what we're hearing in the last year, there is not only devops around the observability. Also there is dataops, cloudops person, the security person. They all need one language or one point of view to understand each other. When you are talking now about security, so if we're talking about the cloud, so we have the security layers and we need to know exactly if my service, my specific software is actually secure end to end, and most of the providers offer more than one service that give you the observability, the view, the logging, the tracing, the metrics about your cloud security. But the devops, he's looking about the observability from the operation, the performance point of view. And now, when he using the continuous delivery integration services in the cloud, he need to be aligned with the compliance and security policies that the security person security ops is charge of. But most of the time they don't have one language or one point of visibility or a language to talk with each other about what's the trade off. Because when I want to put a new service, that's going to help the R&D team. Okay. What's the influence in the security landscape in the cloud? What we are thinking about it's that you have to build one language that all the stakeholders in the company that consume the cloud, that in charge of the cloud, any part, implement stuff to the cloud needs to understand each other and talk with each other. And that's what I think changed in the last year, because this is not only the devops person. There is the SRE person. There is the cloudops person. Sometime it's the R&D that in a small startup, he don't, he can't afford devops in his company because it's very slow. Yeah, you're very small. And now he needs to do all those stuff from the integration until the deployment into the cloud. 
When you have more and more person and that involved into the cloud, you need to understand each other and sometimes there is a trade off between the stuff and you need one single point to view everything, to understand each other. That's why I think it's the control and the observability in the cloud.

Yotam: [00:10:45]
I would add to that, that new buzzwords, new things that we put as the main thing that happens are not just happen. This is an evolution. So it began with logs and then metrics then tracing. Then we moved to microservices. We had to rewrite rethink of everything because suddenly it's not just one place that you have all the logs and you just write scripts to monitor those. Now we're at the next phase of this evolution when things are getting really hard because of the rapid changes that are around us, around technology. So it's not something new. It's just the point in the evolution that people must have observability. They can't live without.

Viktor: [00:11:33]
Can we live with the toolset that is provided by providers? Let's say that you're using AWS or Google or Azure. Do they provide what you believe we need or people should go some third party or where are we now?

Yotam: [00:11:50]
Okay. So it's a good question, because except for the last level we talked about, the knowledge, where you want to get knowledge from the community in addition to what the cloud providers are giving you, so that you don't have in AWS of course, for the other two levels you do have the tools from the provider, but you need to write scripts in order to gain the information, right? You have the APIs, you have the CLI, sometimes you have the console that you can go in and see, but in order to take all of this knowledge and put it in one place and understand how it affects your system, you need to write code. We found some of our customers maintain a huge, huge repository of code, and they sometimes have teams of 10 people to write and maintain this code in order to keep the information on track, right? To keep getting new information, keep doing all types of analytics. They sometimes have all kinds of algorithms to monitor their information. So yes, they do have the information mostly, but in order to use it they have to work really, really hard.

Chen: [00:13:15]
If I'm looking at the tools that you have today. So AWS has, only in security, something like four or five tools. AWS Analyzer, AWS Security Hub, AWS CloudWatch, CloudTrail, and X-Ray. And you need to go to each tool to understand what you actually have to do to gain the observability. It's not so easy. So you need to learn more and more stuff every day to keep up to date with all the changes. I don't think companies today have the capacity to do that by themselves. They have one person, let's say they have DevOps. Most of the time he has to tackle or deal with the issues that the R&D teams opened in JIRA, Monday and stuff like that. And when you come to him and say, okay, I need you now to implement this service that will be secure in the cloud and be optimized for cost and performance, he says: Wait a minute. I don't know this specific service in the cloud environment. I need to learn it. Maybe I will make mistakes. Now we are talking about only one provider. What about Azure, GCP? You need to learn, for each provider, what services they offer and what specific results or logs or metrics you can gain from them. And you can use third party tools. Okay. Let's say you have Prometheus that you can take all the logs and try to analyze that. And you have the Elastics, you can put there the metrics, but then also you're putting in more and more tools in order to get the observability. And it's kind of hard to do that with more than one, two, three tools at the same time.

Viktor: [00:15:14]
So I guess now that you mentioned third party, this is a entry point into what you guys have, right?

Yotam: [00:15:22]
I would add another thing that we didn't talk about yet. Some people say it's part of observability. We believe that. We believe in this approach and this is the visibility. Visibility is the part where you can see. You can actually see what's going on. You don't just write your code and react to stuff. You can actually see a graph view of your architecture and understand visually what's going on, who is talking with whom, which cluster, which pods, for example, belong to which cluster. And to go back to what you asked about, our solution. So basically what we did is to put all those things inside one solution. So you can go in, you can visualize your architecture. See the relationships. See the metrics. See the configuration. See everything that you have, and you can query that. And by query, I mean, instead of writing very complicated code that will go and look for all kinds of problems that you know of, or best practices that you want to maintain in your policies, you can use external tools such as ours to just drag a few nodes to a canvas and then filter them according to the desired pattern or the desired practice that you want to maintain. So this is where our tool becomes the observability tool that helps you visualize your system and your policies.

Viktor: [00:17:06]
So is it something similar to, let's say Kiali from Istio but on a level of the whole infrastructure, right?

Chen: [00:17:15]
We also trying to give the cloud ops the opportunity to build their own custom policies and share that with the community instead of going and writing blog posts or looking for the best practices and said, okay, when it's going to happen in my architecture. When I have this pattern, I'm going to check that. They actually can build today, the pattern they just heard about it. It's just a read about it and actually simulate the situation, the pattern and said, okay, now run continuously. We want you to monitor this pattern into your architecture graph and when. It's gonna meet the pattern. You're gonna let me know, because I know what I need to do to change it. We let the specific person that write the rule to be proactive and he actually can know what is going to be happening in his architecture. And not only when somebody open an issue, now he needs to go in analyze and start handling this situation and this is something that we think it's going to help company to understand and tackle bugs or analyze issues before it's going to production.
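The idea of writing a rule as a pattern and matching it against an architecture graph can be sketched in a few lines of Python. This is a hedged illustration only, not how CloudWize stores or queries its graph. The `Graph` class, the `match_pattern` function, the node types and the `publicly_accessible` attribute are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny architecture graph: typed nodes plus directed edges."""
    nodes: dict = field(default_factory=dict)  # node id -> attributes
    edges: set = field(default_factory=set)    # (source id, target id)

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, dst):
        self.edges.add((src, dst))


def match_pattern(graph, src_type, dst_type, dst_predicate):
    """Return (src, dst) pairs where a src_type node points at a dst_type
    node whose attributes satisfy dst_predicate."""
    hits = []
    for src, dst in sorted(graph.edges):
        s, d = graph.nodes[src], graph.nodes[dst]
        if s["type"] == src_type and d["type"] == dst_type and dst_predicate(d):
            hits.append((src, dst))
    return hits


g = Graph()
g.add_node("web-1", type="instance")
g.add_node("db-1", type="database", publicly_accessible=True)
g.add_node("db-2", type="database", publicly_accessible=False)
g.add_edge("web-1", "db-1")
g.add_edge("web-1", "db-2")

# Rule: flag instances that talk to a publicly accessible database.
violations = match_pattern(
    g, "instance", "database", lambda d: d.get("publicly_accessible", False)
)
```

Running the same rule on every refresh of the graph is what would turn a one-off check into the continuous, proactive monitoring described above.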

Darin: [00:18:46]
Right now, you're covering the big three, correct? Or do you cover any cloud based provider?

Yotam: [00:18:54]
So right now we're covering both Azure and AWS. We're going to cover GCP in a couple of months. Our solution is basically a very flexible data structure. So we can add new provider when demands come, right, if you want to add some private cloud, we can do that. We can also do that per demand. So, we currently, as you said, we are covering the two big ones and we can cover any, if it is a requirement.

Darin: [00:19:30]
So technically you could do on premise if it was private cloud.

Yotam: [00:19:35]
Yes. Yes we can. Yep.

Chen: [00:19:38]
Yes. Today we build our platform as agentless in order that you won't have to put any agent in the cloud environment. So we actually what we did in the last two weeks we built agent to the Kubernetes okay to the managed services, EKS and AKS, and now when we build that agent, we can put that agent in any data center on prem environment and get all the metadata from that agent. So regarding to your question, yes, we can do it. And, but not right now.

Darin: [00:20:18]
So it's pre-alpha, if you will. At the time of recording, it's pre-alpha. It's just the other side of vaporware, right? It's a joke. It's a joke. It's real. It's real. It's real code. It's just very, very fresh code.

Yotam: [00:20:36]
Good

Darin: [00:20:39]
So you're covering Azure and AWS today. What was the third one that you said you were bringing on soon?

Yotam:
GCP.

Darin:
Okay. That's that's what I thought you said, but it broke up just a little bit. Do you see anyone going into clouds other than those three? Do you see people going into Alibaba or do you see people going into Oracle's solution, which seems to be gaining a lot of steam recently?

Yotam: [00:21:05]
Oh, yeah. Yeah, sure, absolutely. We're currently concentrating on the market in our local, in Israel, in our local country, and we are looking ahead to Europe and USA where AWS and Azure, and then GCP are the biggest. But if you are looking at Far East there, you only have not only, but the biggest is Alibaba, eh, so basically it depends on the market you, you aim for. I believe in a future to come, they will all be, it will take some part of all the markets. So this is why we made our data structure very flexible because we understand that things are changing and are changing fast. So when Alibaba will be a meaningful part of our market we'll support Alibaba too. Same goes for Oracle, Ocean9 and etc every other cloud provider you can think of.

Chen: [00:22:05]
You actually can see, IBM is a big player in that landscape. They have sponsored KubeCon and you can see they have their big commercial stuff and probably the trying to get some of their market share in the cloud. So we do see the five big cloud provider that IBM, Oracle are trying to be in that game.

Viktor: [00:22:35]
It's too big of a game to miss, in a way. And I still have a theory that we are still almost in the early stages of cloud adoption, so this is going to grow. I'm pretty sure that there will be new players coming in very soon as well.

Yotam: [00:22:52]
Yeah, I believe you're right, because today I cannot imagine a company that doesn't use the cloud. When we started, we only saw the SaaS companies or in maybe the software companies that are not SaaS, but using the cloud because they want to be able to scale. But today we found sometimes we find customers from the retail or even from the pharmacy. We have customer from the pharmacy industry, which on the surface has nothing to do with technology or cloud. But then when you drill in, you see that all their marketing, all their test results and all these analysis, everything they do that is on computer, basically. So today they are using the cloud because it doesn't make sense for those companies to keep huge data centers. It costs too much money. It's not their business. So the cloud is the solution. And I think this is something that just began. Future to come more and more companies will go to the cloud, even if they are not technological companies.

Viktor: [00:23:59]
There is no such thing as not a technological company. There is no such thing. Doesn't exist. There are only companies who didn't realize it yet.

Yotam: [00:24:08]
Yeah.

Chen: [00:24:09]
That's that's correct. A few years ago, you write in your book that the DevOps now it's, it was four years ago. Something like that in one of the first books you write, you wrote. So you said there that it's a process. Okay and people that want to gain scale rapidly, they have to implement DevOps, continuous delivery processes. The implementation can't be you can't achieve scale in data centers. It's very hard to do that. You can, but it's very hard. Most of the things are going to the cloud and we can see the Cloud Native Foundation, the CNCF, that have growing not in linear way in, uh, in very exponential way. So, as you said, everybody's going to be on the cloud. Probably they're going to have some workload little workload in the data centers, but everybody's moving to the cloud.

Darin: [00:25:14]
What's been one of the interesting stories from your clients that you can talk about, where they went, wow, if we hadn't used this tool, we would not have found out about X?

Yotam: [00:25:29]
I think the story about S3 is a good story for that. So, we had one customer that had massive instances on AWS that got information from S3 buckets, and eventually it cost them a lot of money. They hadn't realized why until they got an alert that the connection to S3 was supposed to go through a VPC endpoint. And in their case, it went through a NAT gateway, an internet gateway and outside to the world, then took all the information from S3 outside, and then back in. So it was a very good use case because first, they paid a lot of money. That was the trigger, but they also paid in performance. They also paid in security because it went out, and they also paid in compliance because part of this information was private information. So I like this story. It was one mistake, one mistake that happened because at the beginning there wasn't the VPC endpoint, right? It was new back then. So just because they didn't track the knowledge, the new service that changed everything, they suffered from so many problems in so many different layers. So that's why I like the story. It's actually what Chen just said about how the knowledge makes tradeoffs and problems. This is a perfect story of how the knowledge could save this company from making many mistakes together.
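The kind of check described in this story can be approximated with a short Python sketch. It is an assumption-laden illustration, not the rule CloudWize actually wrote. The input is sample data shaped like the route tables returned by the EC2 DescribeRouteTables API, the function name is hypothetical, and the check only looks for any gateway endpoint route (a prefix-list destination pointing at a `vpce-` target), so it does not verify that the endpoint is specifically for S3.

```python
def route_tables_missing_gateway_endpoint(route_tables):
    """Return IDs of route tables that send traffic through a NAT gateway
    but have no route to a VPC gateway endpoint."""
    flagged = []
    for table in route_tables:
        routes = table.get("Routes", [])
        uses_nat = any("NatGatewayId" in route for route in routes)
        has_gateway_endpoint = any(
            route.get("GatewayId", "").startswith("vpce-")
            and "DestinationPrefixListId" in route
            for route in routes
        )
        if uses_nat and not has_gateway_endpoint:
            flagged.append(table["RouteTableId"])
    return flagged


# Sample data with made-up IDs, for illustration only.
SAMPLE_ROUTE_TABLES = [
    {   # Private subnet: everything, including S3 traffic, leaves via NAT.
        "RouteTableId": "rtb-nat-only",
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"},
        ],
    },
    {   # Private subnet with a gateway endpoint route in place.
        "RouteTableId": "rtb-with-endpoint",
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"},
            {"DestinationPrefixListId": "pl-0123", "GatewayId": "vpce-0def"},
        ],
    },
]

flagged = route_tables_missing_gateway_endpoint(SAMPLE_ROUTE_TABLES)
```

A flagged route table is only a hint. Whether it matters depends on how much S3 traffic the instances in the associated subnets actually generate.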

Chen: [00:27:03]
What's funny about that: once we wrote the rule, we saw other customers that also had the same problem. So we just sent them a report: Hey, listen, this is the situation you have right now. First of all, most people, when you tell them, hey, you can optimize your costs by 60%, they say, oh, how can I do that? And yeah, we think that was a great use case.

Yotam: [00:27:36]
Yeah. Chen talked about the knowledge sharing which is the result of this. So after you find it once, we talked earlier about the community that can share this knowledge. So after we found this once, we shared it with other customers we have to help them avoid this.

Viktor: [00:27:56]
So you're, you're nasty. You're ruining the business of AWS and Azure. Your goal is to ruin their income.

Chen: [00:28:06]
I think that once told me one of the AWS architects, it said no matter what you're going to do, AWS is going to keep earning a lot of money because people still making mistake and when you actually find some problems in the architecture and solve that it's gained more confidence to the people to use the cloud because when it's going to be optimized, as we spoke before, He's going to move more of his workload into the cloud. But we do believe that today there is a lot of knowledge out there in the community. A lot of best practices, a lot of companies that did some trip into the cloud architecture, and they have so much knowledge that they can share and we believe our community solution is going to help them to write that and help other people. I think we are aiming to do something like DockerHub that people can share their architecture in the script and share with everybody. And I hope it's going to help the community.

Yotam: [00:29:18]
Not only the architecture, also the insights on the architecture.

Darin: [00:29:23]
If you're listening to this today and you're interested in learning more about CloudWize, you can visit them at cloudwize.io. That is cloud w I Z, or Zed for our Canadian friends, e.io. I'm sure Zed is used everywhere else too, but you know, I'm a dumb American and I only know anything. So it's, it's Z to me. CloudWize.io with a Z or Zed and check out what they have. I'm sure somebody from a sales department will be calling you up within five minutes of landing on the website, right? You're not that aggressive.

Viktor: [00:30:03]
15,

Darin: [00:30:05]
Okay.

Viktor: [00:30:05]
They're not that fast.

Darin: [00:30:07]
One of the stories, so let's stick with that S3 story for just a second and we'll start to wrap up here. That S3 thing. When that VPC endpoint did not exist, you had no choice.

Yotam: [00:30:19]
Correct.

Darin: [00:30:19]
Right? And then once it did become available, how long did it go with many people not even knowing that it existed? Well, that's the thing. So now you look at, if you follow the AWS What's New RSS feed, there's 10 or 15 new things a day. Now, sometimes they're just, okay, it's something in a new region, but maybe that's really important. I just saw last week, week before last, EKS was now available in Sao Paulo or something. Right. Whatever, whatever it was. And if I was working in Sao Paulo and all of my workload was basically in the Sao Paulo area, why would I have an EKS cluster running in US East? That would just be insane because of transit costs. Now I could bring it back closer, but if I'm not watching for that, I'm not going to know it. As an architect. It's like, I'm going to use Viktor's favorite right now, Google Cloud Run. I could build a Kubernetes cluster. I could build GKE, but why? I just have a container. I want to run it. I'm going to use GCR until the price becomes too much. At which point it might make sense for me to run my own cluster. You got to know about the tools and if a community can help tell others about, hey, didn't you know that this was over here? Basically, you're sort of a clearing house of information to keep people from doing stupid things.
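Watching an announcement feed for the handful of items that matter to you is easy to automate. The Python sketch below is only an illustration under stated assumptions: the feed content is a made-up sample embedded in the code (not the real AWS feed), and `matching_titles` is a hypothetical helper that keeps the item titles containing any of the given keywords.

```python
import xml.etree.ElementTree as ET

# Made-up sample in RSS 2.0 shape; a real feed would be downloaded instead.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example announcements</title>
  <item><title>Managed Kubernetes now available in Sao Paulo</title></item>
  <item><title>New instance type in Frankfurt</title></item>
  <item><title>Object storage gateway endpoint launched</title></item>
</channel></rss>"""


def matching_titles(feed_xml, keywords):
    """Return item titles that contain any keyword, case-insensitively."""
    root = ET.fromstring(feed_xml)
    wanted = [keyword.lower() for keyword in keywords]
    titles = []
    for item in root.findall("./channel/item"):
        title = item.findtext("title", default="")
        if any(keyword in title.lower() for keyword in wanted):
            titles.append(title)
    return titles


relevant = matching_titles(SAMPLE_FEED, ["sao paulo", "endpoint"])
```

The keyword list is where the knowledge lives: it has to reflect the regions and services your own architecture depends on.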

Yotam: [00:31:57]
That's what observability is all about I think. You must know what happens and how to keep your stupidity away. Right. So I guess that's what it is.

Darin: [00:32:12]
It's really your third point, you know, listening to the setup today, we were talking about software was the number one architecture. Those two were pretty normal for observability, but it was that third one that I was still scratching my head on until about five minutes ago was knowledge. And really, I believe that that knowledge is probably as important, if not more important than software and architecture combined. Because if I come from a nineties workload where I'm used to using PowerBuilder and SQL server, and that's all that I've been doing for the past 30 years, and I'm now transitioning to cloud, my very first inclination is going to take what I've been doing, PowerBuilder and SQL server and throwing it in the cloud, which may be the dumbest thing in the world.

Viktor: [00:33:02]
In the past, I think architecture and infrastructure and whatnot were greatly influenced by people who had many, many, many, many, many years of experience with something, right? And that is now becoming, it is still extremely important. So there is nothing that can replace real experience, but it's not that important anymore. Right. Because it's more about being up to date and following the trends than anything else. What I really like about the story: it's not about following the trends to be up to date with the latest and greatest, but it's also following the trends so that you don't get screwed by your vendor, who is making changes all the time. That puts a lot of people into kind of a complicated position. What we mentioned a couple of times in previous episodes. Yeah. What do I do with my PowerBuilder experience? Am I still a great expert? Nah.

Darin: [00:33:56]
no, you're not. Well, you may still be an expert.

Viktor: [00:33:59]
Your expertise is spending three hours a day on an RSS feed.

Yotam: [00:34:03]
Yeah.

Darin: [00:34:05]
Okay guys. Thanks for hanging out with us today. Again, if you're interested in finding out more about CloudWize, go visit them at CloudWize, that's cloud, w I Z e.io. See, yet another .io. We keep on having all the cool kids. We may have to change our website from devopsparadox.com to something else.

Viktor: [00:34:30]
We're too lazy. By the time, by the time we do it, io will not be a thing anymore.

Darin: [00:34:35]
I know. Okay. Thanks guys.

Chen: [00:34:40]
Thank you very much.

Darin:
We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and the link to the Slack workspace are at https://www.devopsparadox.com/contact. If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at https://www.devopsparadox.com/ to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.