DOP 77: NOC as a Service With Xiteit

Posted on Wednesday, Oct 14, 2020

Show Notes

#77: The unsung hero of any company: the NOC engineer. But what happens if your most skilled NOC engineer is on vacation and there is no backup for him? Enter NOC as a Service. Today, we talk with Avi Shalisman and Asaf Matyas from Xiteit to understand how NOC as a Service can minimize risk so your favorite NOC engineer can take a vacation.

Rate, Review, & Subscribe on Apple Podcasts

If you like our podcast, please consider rating and reviewing our show! Click here, scroll to the bottom, tap to rate with five stars, and select “Write a Review.” Then be sure to let us know what you liked most about the episode!

Also, if you haven’t done so already, subscribe to the podcast. We're adding a bunch of bonus episodes to the feed and, if you’re not subscribed, there’s a good chance you’ll miss out. Subscribe now!


Guests

Avi Shalisman


Avi brings 30 years of experience in both software and hardware, a career that includes establishing and managing global divisions of support, project management and IT services in segments such as Telecom, Mobile and billing systems. Avi holds an MSc in electrical and computer engineering and an MBA in industrial management, both from the Technion – Israel Institute of Technology.

Asaf Matyas


An experienced sales and business development executive with a strong background in selling software and services, from early stage startups to international organizations. Asaf holds a B.Sc in computer science and mathematics, and leverages his technical knowledge and skills to build winning sales strategies and processes that exceed revenue expectations and increase market share.

Hosts

Darin Pope


Darin Pope is a services consultant for CloudBees.

His passions are DevOps, IoT, and Alexa development.

Viktor Farcic


Viktor Farcic is a Principal DevOps Architect at Codefresh, a member of the Google Developer Experts and Docker Captains groups, and published author.

His big passions are DevOps, Containers, Kubernetes, Microservices, Continuous Integration, Delivery and Deployment (CI/CD) and Test-Driven Development (TDD).

He often speaks at community gatherings and conferences (latest can be found here).

He has published The DevOps Toolkit Series, DevOps Paradox and Test-Driven Java Development.

His random thoughts and tutorials can be found in his blog TechnologyConversations.com.


Transcript

Asaf: [00:00:00]
Imagine that you have a developer who is working on a very important feature and there is a major outage and he needs to go out of his zone and remediate something. So you immediately lose two, three hours. So the cost of unplanned work is also something that companies and startups sometimes fall short of factoring into the equation.

Darin:
This is DevOps Paradox episode number 77. NOC as a Service with Xiteit.

Darin:
Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about. Most of the time, we mask our ignorance by putting the word DevOps everywhere we can, and mix it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make it sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often, since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. PS: it's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic.

Darin: [00:01:20]
Now, last week we were talking to Joe from Pulumi about creating infrastructure, getting things going. We've talked about all the different things about managing infrastructure. We've talked about logging with Loki. We've talked about Prometheus for metrics and we're sending all these things out, but yet somebody has to manage that, right?

Viktor: [00:01:50]
I mean, it doesn't have to, but.

Darin: [00:01:53]
Okay.

Viktor: [00:01:54]
It might be a good addition.

Darin: [00:01:56]
Our alien overlords are not here yet. Skynet has not taken full effect yet. We'll probably see that from the other sides of AWS and Azure and Google in a few years. But in the meantime, I'm old. So I remember working in a network operations center, four stories underground under a parking deck back in the mid eighties. Let's just say it was a very dark, very cold, not humid, thankfully, but very dry. I don't miss those days. But today we have some guests on. We've got Avi and Asaf from Xiteit. Guys, thanks for joining us today.

Viktor: [00:02:44]
Wait, wait. Are you saying that you don't miss those things and then you introduce people? Is this indirectly saying that they might be missing those things?

Darin: [00:02:54]
I am saying that they have found a better way to do it.

Viktor: [00:02:58]
Okay. That sounds better.

Darin: [00:03:01]
So guys, thanks for joining us today.

Avi: [00:03:04]
You're very welcome. Good morning.

Asaf: [00:03:06]
Good morning. Thank you for having us.

Darin: [00:03:08]
Yeah. So why don't you go ahead and introduce yourself and explain a little bit about what Xiteit is and how it came to be.

Avi: [00:03:16]
Xiteit is a product, a SaaS-based product, that we have developed for managing NOCs around the world. It came out of MoovingON. MoovingON is a company that provides NOC as a service or, as Darin said in the introduction, an MSP that provides NOC as a service to customers. So it started as an internal product that we put everything we know into, and then some time later we have all this. So that's the relation between Xiteit and MoovingON.

Asaf: [00:03:53]
My name is Asaf. I have run sales and marketing for Xiteit and MoovingON for almost two years now. Just as Avi mentioned, out of our own growth challenges, Xiteit came to be the product that basically runs our MSP. Again, thank you for having us. We're excited to be speaking with you.

Darin: [00:04:12]
Let's go back to the MoovingON days. Why did you decide to do NOC as a service? How did you even see that as a gap in the market?

Avi: [00:04:22]
That's a good question. Go back to 2011, when all the SaaS started to form. Roy, who is my partner, and myself are veterans in this area, and we noticed that once you go to the cloud, you have to be on all the time. We have managed some NOCs in the past, so we know how much of a hassle that is. So that's where we came up with the idea to create MoovingON and start a company for NOC as a service. So actually, we have the experience and we give each company the share and the resources they need for that. So again, that's the idea behind MoovingON.

Viktor: [00:05:14]
So does that include tools? For example, can I contract your service right now, or do I need to change the tools that I'm using?

Avi: [00:05:26]
From day one, we have understood that we do not want to make the customer's life difficult. Loki, Prometheus, Zabbix, Grafana, all the monitoring tools are the customer's own. He created them, he put a lot of effort into making them work before we came on board. So we are using this as a base. Down the road, for the NOC, we have developed tools, and of course here we developed Xiteit, which takes all the monitoring and manages everything. But the idea for us is that the customer uses whatever he has developed, and we give tools on top of that and services on top of that.

Asaf: [00:06:11]
So to add on top of what Avi just mentioned, a part of our own internal challenges, developing the product, and why MoovingON came to be an MSP that provides NOC as a service: we're based in Israel, and Israel has been perceived for many years now as the startup nation or the innovation nation. And that basically means that the number of startups per capita generated each year is one of the largest globally. And so many of these startups are B2B or they're ad tech. So they're very diversified. There are a lot of markets, a lot of segments, and some of them really do require 24/7 uptime. Now, when you build a startup, you have your own team, your garage days, and then you start to scale out the team and then you start to put a bit more structure and processes in place, such as DevOps and customer success, and you have your R&D and the rest. These teams usually are divided and they do on-call, but it's also very difficult for them to continuously stay alert and keep themselves on call. Not just during the night, but also during the day. Imagine that you have a developer who is working on a very important feature and there is a major outage and he needs to go out of his zone and remediate something. So you immediately lose two, three hours. So the cost of unplanned work is also something that companies and startups sometimes fall short of factoring into the equation. There was a gap in the market and by providing NOC as a service, I think we filled that gap and that allowed us to really grow and support the growing ecosystem here.

Darin: [00:07:43]
So in providing NOC as a service, what? I mean, I know what a physical NOC is. I slept under the desks. What does a virtual NOC look like?

Asaf: [00:07:57]
So basically, it's a combination of both technology and people. So we have 24/7 engineers, analysts that are, well, hopefully they're not sleeping under the tables like yourself, but they're working with various technologies that flow essentially into the technology that we've developed. Through our platform, they basically have access to the customer's environments, and all the customer's monitoring environments, logging, tracing, APMs are flowing into the NOC. As part of the service that we provide, we build and maintain and develop runbooks, which is a concept that is still very viable and valid in our industry. And we remediate. That could be all the way from the very basic tiers up to DevOps services, people who actually are very intimate with our customer's environment. So it's a combination of people, processes, and technology, essentially.

Avi: [00:08:53]
On top of all that stuff you're saying, there is something more fundamental, which is that running a NOC is very expensive and very complex. It's the right way for a large company, which needs one or two people on at any time. But if you go to startups, they have automation. They have teams, and you don't really need to put a dedicated person on it. So that's where NOC as a service comes in. It also came up to answer the need that you don't need someone who is dedicated. We can take one resource and divide it among many customers. So they save money. We give them the service, and besides the shared model, we also provide the experience. Think about a company, an MSP that has thousands of customers. You learn from everyone, you get feedback. So you get one customer's problem, and then you give the benefits to other customers. And that's the concept: NOC as a service is modeled very much like any infrastructure or platform as a service. It's the same concept.

Viktor: [00:10:04]
Are you reactive only or proactive as well? In terms like, for example, do you go to the teams to say, Hey, let me show you how to instrument your applications so that I get more visibility. Do you try to shift parts to the left as well or completely to the right?

Avi: [00:10:20]
I'll go back prior to MoovingON. We were working for an American-Israeli company and we had a very, very big NOC, and what they were doing was what we call a proxy. They got alerts and they woke you during the night. And we were the team that got these calls, and then we said never, never have a NOC like that. So the DNA of our NOC as a service is not just to pass things on. We try to solve the problems and we make it more sophisticated over time. So when we started, it was manual procedures. Now, with the combination of our internal product, Xiteit, we can combine automation, human intelligence, and escalation. What we try to bring is the additional value of resolving the problem before or without involving the customers. We cannot do it all the time, but we're trying to give the 80% and save time for our customers.

Viktor: [00:11:23]
Are you trying to transmit that knowledge as some kind of operators that are running inside of those clusters and trying to do the same thing? You know, kind of like, I don't know, you found the outage, this happened, fix it like this. By the way, from now on, run these things, so that I don't need to fix it for you anymore. I mean, basically I'm talking about something equivalent to Kubernetes operators or whatnot, right?

Avi: [00:11:47]
Yes. We try to advise and consult our customers as much as we can, for two main reasons. One, this is one of our added values. If we don't give good value, at the end of the day, they would find a different solution. On the other hand, the way I look at it, a good NOC, an efficient NOC, is one that is proactive. Let's try to solve the problems before they happen. This is basically helping the customer and helping my team to be more relaxed and more accurate.

Darin: [00:12:22]
A different way to say it is the best outage is no outage.

Avi: [00:12:26]
Exactly. Although we can never get it. Right?

Darin: [00:12:31]
We'll say that's true for now. Um, I think we can at some point. I want to go back to one word that Asaf said that just made my skin crawl, or rather the phrase was, you guys still develop runbooks. What does a runbook look like in your NOC and in your processes? Is it a Word document?

Avi: [00:12:54]
Oh, no.

Darin: [00:12:59]
It's a valid question. Unfortunately,

Viktor: [00:13:01]
Just don't say, just, don't say it's not a Word document. It's Google Doc now.

Asaf: [00:13:06]
Or an Excel or an Excel spreadsheet.

Avi: [00:13:09]
Of course. When we started as a company, we had Excel sheets and Word documents as an MSP. It was not scary. And we were looking for a computerized solution and we never found it at that time. Remember, we started in 2011, nine years ago. So that's why we have developed Xiteit, and Xiteit is basically a computerized NOC, a platform, and then it has a runbook, but eh, is now experience. And we are not the only ones that did it. We are now taking the runbook into a workflow. And when you do a workflow and this is the support system, then you get the ability to automate some of the things and the ability to leave the important things for the human. The way that we computerized the runbooks is helping us to remediate fast, to help our team, and basically to create a workflow.

Asaf: [00:14:12]
I think also the concept of a runbook historically might be a Word document of do this, do that. Here's the script that you need to run. Restart the service. You know, stuff like that. But I think when we look at a runbook or cookbook or playbook, it doesn't matter what you call it, I think the context is a bit wider. It's the entire process of managing an incident. It's the process not just from a technical perspective, but also from a business perspective, and the ability to modularize the entire chain of events, starting from when you have identified the incident all the way to notification of the right individuals who are on call. Running a bunch of automations to try and remediate that. That didn't work. Escalate that. Start a process of notifying your internal CS team so they can be aware of what's going on, so they can basically communicate that back to the internal stakeholders and their own customers. So if you look at a runbook, we look at it from a much higher view, which is both a technical remediation process and a business remediation process, and it really boils down to the fact that it's a governance process to ensure that every time I have an incident, and this is something I know I'm expecting, I know what the processes are and what the workflows are that I need to follow in order to remediate it and close the case.
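
The chain of events Asaf describes (notify on-call, try automated remediation, escalate, notify the CS team) can be pictured as a small workflow. The following is only a minimal sketch under stated assumptions: the `Step`, `RunbookRun`, and `run_runbook` names, the status strings, and the example incident are all hypothetical and are not taken from Xiteit.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One runbook step. `action` is None for steps a human must perform."""
    name: str
    # Hypothetical contract: the action returns True if it fixed the incident.
    action: Optional[Callable[[dict], bool]] = None


@dataclass
class RunbookRun:
    """Record of how far the runbook got, for whoever joins the bridge later."""
    incident: dict
    status: str = "open"
    log: List[str] = field(default_factory=list)


def run_runbook(steps: List[Step], incident: dict) -> RunbookRun:
    run = RunbookRun(incident=incident)
    run.log.append("notify: on-call engineer")
    for step in steps:
        if step.action is None:
            # Manual step: stop here and hand over to a human.
            run.log.append(f"manual: {step.name}")
            run.status = "needs_human"
            return run
        if step.action(incident):
            run.log.append(f"resolved by: {step.name}")
            run.status = "resolved"
            return run
        run.log.append(f"failed: {step.name}")
    # Every automated step failed: escalate and tell customer success.
    run.log.append("escalate: next tier")
    run.log.append("notify: customer success team")
    run.status = "escalated"
    return run


# Hypothetical incident and steps, for illustration only.
steps = [
    Step("restart service", lambda incident: False),
    Step("fail over to replica", lambda incident: incident["replica_healthy"]),
]
run = run_runbook(steps, {"service": "checkout", "replica_healthy": True})
print(run.status)
print(run.log)
```

Because every step is appended to `run.log`, anyone who is paged later can see how much of the runbook has already run and where it stopped.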

Darin: [00:15:35]
And the good thing about that process is if you have a runbook fire off and you see this, especially on the business side, these are called standard operating procedures. That's just yet another way to call it. So you basically have your runbook. It's workflowed. So, using the other word, governance, by the time somebody is actually being woken up and getting on a bridge, you can see exactly how much of that runbook has already been run, what point we're at in the runbook, and why am I here? It's almost like I've done this before, except I didn't have Word documents back then. We actually had dot matrix printers. Yeah, I'm that old. I'm just on this side of punch cards. Let's put it that way. I just missed punch cards.

Avi: [00:16:26]
I started with the punch cards by the way.

Darin: [00:16:27]
Yeah, so you're just a hair older than me. Maybe by a day. How important is that runbook process? I imagine that is critical in everything that you do. It has to be in order to eliminate the humans as much as possible.

Asaf: [00:16:42]
I think, yes, you're right. I think what really defines the success is consistency and uniformity in your processes and runbooks and standard operating procedures. I always use Avi's very well known phrase that when you run a NOC, you're basically being measured on your failures and not on your successes. I think it's a fundamental element in running a very well organized NOC and basically ensuring, either when you do it as an MSP or when you're operating an internal NOC, that the outcome is really protecting your customer expectations and experience. And so you have to have those processes. You have to gather that institutional knowledge. People come and go. They don't necessarily share that knowledge, for various reasons. So when you have those processes documented and automated up to a certain extent, I think it really plays a critical role in the world of NOC.

Avi: [00:17:45]
I have to add here that in our world, when we give services to cloud companies, creating a runbook is a big challenge because of exceptions. We are dealing with startup companies that believe that everything can be automated, that if they know what to do, they would automate it. And the concept of running a well organized NOC is not something very familiar to them, and we have to explain to them the benefits of writing these NOCs. How do they take the silos of knowledge and put them into one place? So for us, in the cloud environment, it is sometimes a challenge convincing the customers to have these organized NOCs.

Darin: [00:18:34]
Let's stay there for just a second. Why is that such a hard pitch to people who say, well, if I could automate it, I'm going to automate it? I mean, that seems like that would be the answer, right?

Avi: [00:18:46]
Right, but let's look at automation. Automation at the end of the day is a piece of code. So someone has to monitor the automation as well. Right? On the other hand, you can only automate what you know, right? So if you do automation and you have things that you never expected, then what do you do? So the best way to handle it is via a runbook. And with a runbook, if you get things that are closed, I think that gives the NOC engineer the ability to look for other things.

Viktor: [00:19:23]
But isn't the runbook also the things you know? A runbook also can't cover the things you don't know. That's kind of contradictory in a way, right?

Avi: [00:19:33]
Yes. It is contradictory in a way, yes. But you know, if you're not sure, then there is a problem. You have to look at it again: we are measured by failures, not by success. What that means is that if you find the system interruption, no one will clap for you. They say, let's go back to work. If you fail to do that, then you will probably have to go CA and things like that. So in our runbook, if you're not sure, then do something. It's better than having nothing. Only just automation.

Darin: [00:20:06]
Well, I think part of it too, from a runbook or SOP perspective, is if you can automate, please automate. But at some point when an incident occurs, it's going to be something that you hadn't seen before. Maybe the outcome of that is more automation, but maybe it's just a line item that here's something we have to check until we can figure out how to automate it.

Avi: [00:20:30]
In our world, when you do agile and ship software all the time, sometimes doing the automation before introducing the features will delay you. So, like Darin said, you can introduce the feature with manual checks, runbooks and so on, and later on automate it. Runbook and automation must live together.

Viktor: [00:20:55]
So then following that line of thought, unknown things happen, unpredictable things happen, right? You figure it out. You find out that something wrong is going on that nobody predicted before. You fix it. What happens after that? What does the postmortem look like in your case? And especially in relationship with the customer.

Avi: [00:21:15]
Okay. So basically, in our world, it's a two-stage post-mortem, because the customer has to do the post-mortem on his application level and his internal processes, because we are just one player in his environment. What happens on our side is that we always look at what we can advise the customer for the next time. Can we do automation? Can we improve the runbook? Did we respond well? So it's actually a two-stage process. One is internal and the second is the customer's post-mortem.

Viktor: [00:21:54]
You're now talking about basically something that I would identify as creating silos, right? When you start talking about developers and you. And, you know, that's what they do. That's what we do. Isn't that kind of dangerous?

Avi: [00:22:11]
First, yes. It may create a silo, but you have to remember that our internal runbooks and product are available for the customers. So he can also use them by himself. So everything we have internally is transparent. So if we change the runbook, if we do anything, it's available for him in real time. So he can use our knowledge, and that's the way that we are merging the knowledge internally and the knowledge of the customer. Transparency is the most important thing in our world, in my opinion.

Asaf: [00:22:46]
I think one of the reasons, and without sounding too markety, is we have customers that have really stayed with us for many years now. They started when they were 10, 15 people, and now they're big and well known brands. At some point, because we've reached that level of intimacy and every customer of our service automatically gets access to the platform, to some extent we are an extension of their internal teams. So one might think that it might be dangerous and silos of knowledge could be created, but in reality, as dangerous and difficult as it is, we don't see that often. It's really about communication and transparency.

Viktor: [00:23:26]
So, in that sense, communication, transparency, do you also perform some similar services to SREs, where you sit with those same teams and, say, go through the code of their application and tell them? Do you do code reviews? Are you involved in pull requests and get back to them saying, hey, look, we just detected that if you do this in your application, then this is potentially going to happen when you run it live, right? Things like that. Do you sit with them? Do you spend a day, a week or something like that, reviewing the application before going live, let's say for the first time or something like that?

Avi: [00:24:08]
We are not doing it only on the onboarding. We are doing it on an ongoing basis. The way that we work, since we have been doing it for many years, we cannot be a silo. Even though we are an extension, we cannot be a silo. So for each customer, we allocate hours of DevOps and customer success, and the requirement in that role is to work with the customer, understand the customer's environment, and advise him. Our NOC is not. Our customers don't want us to monitor infrastructure. Infrastructure today in the cloud environment is seamless. We monitor the business, we monitor the applications. In order to do that, we must understand the customer's environment. We must understand what they are doing or some other standard application. No, we are not the customers' developers. We are not debugging for the customers, but we understand what the customer does. We work with him when he introduces new features. We take them and put them into the right process for the NOC. Also, as a company, MoovingON also serves in DevOps and SRE capabilities, things like that, but these are optional add-on services. The basic NOC service will include professional services in order for us to be part of the customer's team.

Viktor: [00:25:49]
How does the data transfer work? Are you pulling that? Are you pushing data? Are you integrating with other third party services like Datadog, or how does that work? What's the process? Let's say that I'm like, hello, I want to be a customer. What do I do?

Avi: [00:26:07]
Okay. So let's go back to what I discussed like 15 minutes ago. Our perception is that the monitoring environment is the customer's property. So we can build it for the customers, but we never keep it on our premises, because I don't want to lock the customer in. I want the customer to be my customer because I'm good and not because I locked him in with a monitoring platform. Going back to the monitoring infrastructure, it belongs to the customers. We can help you to build it. Okay. Once we understand that, the next thing is how do we integrate it into our NOC services? Five, six years ago, the easiest way when we onboarded a customer was to go to his emails and get the emails into our NOC inbox. At the beginning, we looked at emails. Then later on, when we started Xiteit, we developed a smart parser that took the emails into Xiteit. Emails are obsolete, or on the way out. At the next level, we have an API, so we help the customer to integrate Prometheus, Loki, Datadog, and so on into our system. So they send an API call and we get the event into our product. And one of the things that we have developed internally is that we connect the event directly to the runbook. So our system, Xiteit, knows how to connect the event to the right alert and to the runbook, and that's a different story.
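
The idea of an event arriving through an API call and being connected directly to a runbook can be sketched as a lookup from the event's origin to a runbook identifier. This is an illustration under assumptions, not Xiteit's API: the JSON fields `source` and `alert`, the `RUNBOOKS` table, the runbook ids, and `handle_event` are all hypothetical names.

```python
import json
from typing import Dict, Tuple

# Hypothetical routing table: (monitoring source, alert name) -> runbook id.
RUNBOOKS: Dict[Tuple[str, str], str] = {
    ("prometheus", "HighErrorRate"): "rb-restart-api",
    ("datadog", "DiskAlmostFull"): "rb-clean-disk",
}

# Anything nobody has written a runbook for yet goes to manual triage.
DEFAULT_RUNBOOK = "rb-manual-triage"


def handle_event(raw_body: str) -> dict:
    """Parse one incoming event (a JSON string) and attach a runbook to it."""
    payload = json.loads(raw_body)
    source = str(payload.get("source", "unknown")).lower()
    alert = str(payload.get("alert", "unknown"))
    runbook = RUNBOOKS.get((source, alert), DEFAULT_RUNBOOK)
    return {"source": source, "alert": alert, "runbook": runbook}


known = handle_event('{"source": "Prometheus", "alert": "HighErrorRate"}')
unknown = handle_event('{"source": "custom", "alert": "QueueBacklog"}')
print(known["runbook"])
print(unknown["runbook"])
```

The fallback runbook reflects the point made earlier in the conversation: events nobody anticipated still need a defined path, even if that path is a human looking at them.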

Asaf: [00:27:47]
I think what is important to mention, Viktor, is also to emphasize what Avi mentioned. I think we're really very agnostic when it comes to any monitoring platforms that the customer comes with. Whether it would be that right now he sends his alerts into an email system or to a Slack channel or a Microsoft Teams channel, or even his own proprietary monitoring, either embedded into the code or something that he built internally. We're completely agnostic around that. As you mentioned, you tell us what you have in your monitoring stack, we'll plug it into our system and that's basically it.

Darin: [00:28:20]
So your basic part is what's the hook to the runbook. That's one of the key parts.

Asaf: [00:28:28]
Right. So essentially, as we mentioned, I mean, if we're talking about the NOC, the important thing is really to ensure that there's consistency and uniformity in your standard operating procedures. We talk a lot to customers that maybe already operate a NOC. They say, well, we have a runbook. What is that runbook? It's a wiki somewhere, a SharePoint list, or a document that someone created. We were very smart in bringing the alert already with the link to the runbook. But still, there's always the manual element to it. So someone needs to look at the runbook, they need to read it, go over it and so on and so forth. And we said, well, we want to combine the two worlds. We want to get the alerts, of course. We do all the magic of aggregating and correlating. And that's, you know, that's important right now to really crystallize and distill the important stuff. But on the same note, we said, well, let's surface the entire remediation process together with the alert to the NOC engineer who is currently doing his shift, and we can automate what needs to be automated. So it's easier to actually ensure a very consistent remediation process. So that's the hook. Here's the alert. Automatically, the runbook is attached to it. It could run automatically, so it isn't even really seen by the NOC engineer, or at one point he needs to intervene, maybe do something manual. And if that doesn't work, maybe start an escalation process.

Avi: [00:29:55]
There is another layer to that, which is that, okay, the listeners are here and we get alerts and connect them. But life is more complex than that, because we get information from multiple monitoring environments. Remember, the customers, probably everyone has multiple platforms. They choose best of breed to monitor different types of their applications. So we get the monitoring from different customers, different monitoring platforms. One of the things that we have developed internally is a way to do monitoring across platforms. What we call it internally is smart rules, but this is more like a type of AI to understand the rules that allow us to connect. Take the simplest example. Your router failed and you get 1000 alerts because everything failed. So the NOC engineer will get 1000 alerts. He doesn't know what to do with that. Okay. Hey, what's happening here? So now we know how to connect them into one and tell him, okay, to understand the root cause analysis and then connect it to the right runbook. So running a NOC is not just getting an alert connected to the runbook. It's also understanding the gross environment. What happened? That's the connected one.
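
Avi's router example (one failure producing 1000 alerts that should surface as a single incident) can be illustrated with a simple topology-based grouping. This is a deliberately naive sketch, not the "smart rules" he mentions, whose internals are not described in the conversation. The `upstream` map, the host names, and the `correlate` function are hypothetical, and the topology is assumed to contain no cycles.

```python
from typing import Dict, List


def correlate(alerts: List[dict], upstream: Dict[str, str]) -> Dict[str, List[dict]]:
    """Group alerts under the highest alerting device they depend on.

    `upstream` maps a host to the device it sits behind (assumed acyclic).
    """
    alerting = {alert["host"] for alert in alerts}
    incidents: Dict[str, List[dict]] = {}
    for alert in alerts:
        root = alert["host"]
        # Walk up the topology for as long as the parent is alerting too.
        while upstream.get(root) in alerting:
            root = upstream[root]
        incidents.setdefault(root, []).append(alert)
    return incidents


# Hypothetical data: one router fails and takes 999 hosts down with it.
upstream = {f"app-{i}": "router-1" for i in range(999)}
alerts = [{"host": "router-1", "msg": "link down"}]
alerts += [{"host": f"app-{i}", "msg": "unreachable"} for i in range(999)]

incidents = correlate(alerts, upstream)
for root, grouped in incidents.items():
    print(root, len(grouped))
```

The NOC engineer then sees one incident rooted at the router, which is the thing a runbook should be attached to, rather than a thousand separate alerts.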

Darin: [00:31:18]
I think that's a great place to stop, guys. Thanks for being with us today. If you want some more information about Xiteit as you're listening to this, or you're tired of living under the table in your NOC, both Avi's and Asaf's information is there, and the link off to Xiteit will be there as well. And then if you really don't want to do it yourself, talk to MoovingON and they'll just take care of it all for you, have a nice day type thing. Yeah. Okay. Right. Cause nobody enjoys running a NOC, right? Oh, you do. Somebody has to. Okay. All right. Thanks guys and we'll talk to you again.

Avi: [00:31:57]
Darin, Viktor, thank you for hosting us.

Viktor: [00:32:00]
Cheers guys.

Darin:
We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and the link to the Slack workspace are at https://www.devopsparadox.com/contact. If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at https://www.devopsparadox.com/ to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.