DOP 73: Logging With Loki

Transcript

Viktor: [00:00:00]
I just changed my de facto standard for self managed logging from Elasticsearch, which was more or less the only option, to Loki. And if for no other reason, I love their tagline. If you go to their site, the first sentence you will read, and I'm paraphrasing, so I might not be exact, I'm not reading it right now, is: like Prometheus, but for logs. That solidified my interest in the product before I even tried it. Like Prometheus, but for logs. Brilliant.

Darin:
This is DevOps Paradox episode number 73. Logging with Loki.

Darin:
Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about. Most of the time, we mask our ignorance by putting the word DevOps everywhere we can, and mix it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make it sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often, since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. PS: it's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic.

Darin: [00:01:34]
Now, recently we've had an update to the catalog course and book that covers the item of logging. Who would have thought logging would be so sexy?

Viktor: [00:01:49]
It depends for whom, right? If you split users in general into those who use third party software as a service and those who self host third party software, then it's probably not exciting for the first group of users because they're already shipping logs to some third party, which could be Google Operations. By the way, Google Stackdriver was renamed to Google Operations. I discovered it yesterday. Or CloudWatch or Datadog. If you use any of those, then yeah. The talk about logging is not really exciting, I think, because they all do more or less the same thing, right? They're better or worse in the background in resource consumption, UIs are sometimes horrible, sometimes great, but ultimately what you really want is to ship all your logs somewhere and be able to filter those logs so that you can find the issues you're looking for, or the causes of those issues. Now on the self managed side of the story, basically the ELK stack is the thing, right? It's the de facto solution for logging, among other things, when you self host. And the ELK stack is horrible. I mean, okay. Let me correct myself. It's not horrible. It's great. But administration, operations of the ELK stack are horrible. Elasticsearch is a memory hungry, CPU hungry beast that, no matter how much you feed it, is still yelling at you saying I'm hungry, I'm hungry. Right? I never saw a happy administrator that has Elasticsearch. Now, I haven't met all of them. I might have been unlucky. But what I'm really trying to say is that Elasticsearch to me is a beast that is more or less the only thing that we have right now, or we had until recently, and we had to deal with it. But that, to me, sounds like a hammer that you try to use for everything, which is great because it's an all around database that can store anything that needs real time queries. But for logs, too much. What happened is that I wasn't informed enough about the latest developments in that sphere and then came Vadim.
I don't know if you know Vadim. Vadim is the guy from the Slack channel who wrote the first draft of the chapter on how to do logging and, lo and behold, it uses Loki. And then I dived into Loki. It's awesome. I just changed my de facto standard for self managed logging from Elasticsearch, which was more or less the only option, to Loki. And if for no other reason, I love their tagline. If you go to their site, the first sentence you will read, and I'm paraphrasing, so I might not be exact, I'm not reading it right now, is: like Prometheus, but for logs. That solidified my interest in the product before I even tried it. Like Prometheus, but for logs. Brilliant.

Darin: [00:05:32]
And the reason why it caught your eye, is it because when you saw Prometheus you thought simplicity?

Viktor: [00:05:41]
Yes. I thought simplicity. I thought query language that I like, yeah, but simplicity mostly. My beef with Elasticsearch was never that I'm not getting the features I need. I do. My beef with Elasticsearch is that it does too much. From an operational perspective, it is a huge burden. So as a user, I would not dream about switching from Elasticsearch to something else. It just does what I need it to do. I'm now talking exclusively about logs. I'm not touching other potential usages of Elasticsearch, and there are many. But as an operator, oh my God. Or as a person who needs to pay the bills. Oh, that's even worse. That's horrifyingly high. So simplicity, yes. That's what caught my eye. That's my first association with Prometheus. Actually, simplicity and the way it plays very, very nicely natively with Kubernetes. The ability in Prometheus to say, Hey, I don't need to tell you anything. You're going to figure out about all the applications I was, I will, and I am running in my cluster and you figure out how to get metrics. There is no special need for me to write any specific configuration. I might do that, but that's a bonus. That's not a requirement. Oh, that's brilliant.

Darin: [00:07:09]
So let me ask you this question, because we haven't talked explicitly about this. Is logging the first step to true observability?

Viktor: [00:07:19]
No. I think that metrics are the first and the most important step. Some people think that, you know, with centralized logging and this and that, you will be able to detect errors and you will be able to do many, many, many, many things. I think that is not true. I think that you detect errors through metrics. I think that whichever method you're using to get alerts about problems goes through metrics. When you watch dashboards as a substitute for Netflix, you watch dashboards based on metrics. And so on and so forth. So, metrics are the alpha and the omega of observability. Logs to me are a supplement to metrics. You detect that your application is not responding. Simple example, right? You're going to detect that through metrics. And then you're going to probably go through metrics and figure out, okay, so what is not responding? Oh, this app is not responding. Excellent. And this and that. And then after that, you dive into logs to figure out the details, rather than to find out about an issue, because, okay, let's play this out. Let's say that you have only logs, no metrics. Right? How will we ever find out about the issues? If you're going to be alerted whenever there is the word error in any of the log entries, you will never sleep. There are just so many errors in every single application, every single system, that are not really errors, or not errors that should wake you up in the middle of the night. That's kind of unreliable, right? There are always some issues that are very often not noticeable by users. The same thing is valid for metrics as well, because you're never going to set up your alerting system to say, whenever a request does not receive a response, tell me about it. You're not going to do that because there is a constant flow of unresponded requests. That's why we never have a hundred percent availability.
But if the rate of not receiving a response is higher than whatever the norm is, or if there is an elevated number of 500 responses, that's different. When you aggregate data, then you get something meaningful, and you don't normally aggregate through log entries. If more than 2% of log entries contain the word error, then tell me about that. You can do it, but it's not really the right way to do it, I guess.
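To make the aggregation point concrete, here is a minimal Python sketch of alerting on an elevated share of 500 responses rather than on individual failures. It is only an illustration: the `Window` type, the `baseline` and `factor` parameters, and the sample numbers are assumptions made up for this example, not anything taken from Prometheus or from the episode.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed in one evaluation window (hypothetical type)."""
    total: int          # all responses seen in the window
    server_errors: int  # responses with a 5xx status


def error_ratio(window: Window) -> float:
    """Fraction of responses that were 5xx; 0.0 when there was no traffic."""
    if window.total == 0:
        return 0.0
    return window.server_errors / window.total


def should_alert(window: Window, baseline: float, factor: float = 2.0) -> bool:
    """Alert only when the aggregated ratio is well above the usual level.

    `baseline` is the normal error ratio and `factor` is how far above it
    we tolerate before paging someone; both are assumed tuning knobs.
    """
    return error_ratio(window) > baseline * factor


if __name__ == "__main__":
    normal = Window(total=10_000, server_errors=40)     # 0.4% errors
    incident = Window(total=10_000, server_errors=500)  # 5% errors
    print(should_alert(normal, baseline=0.005))
    print(should_alert(incident, baseline=0.005))
```

A single failed request never triggers anything here; only a ratio that clearly departs from the norm does, which is the behaviour described above.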

Darin: [00:10:11]
We talked with Phillip in episode 29 about Elasticsearch and we talked with Eric from OverOps in episode 40 about continuous reliability. Elasticsearch is open source as well as commercial. OverOps is commercial only. But now what we're seeing here with Loki and Prometheus, that's a good starting point for any company or any project that's based on Kubernetes to actually have those features without going down that deep route. Not saying that you couldn't, it may make sense to go those ways. This is more of the quote unquote cloud native way of doing it. The Loki and Prometheus way of doing things would be the cloud native way of doing this part.

Viktor: [00:11:01]
You know, it all really depends on how you define cloud native. Every person has a different definition. I would rather characterize it as a continuing trend of moving away from do-it-all tools. So what differentiates, let's say, Prometheus and Loki from Elasticsearch is that Prometheus is focused on metrics. Loki is focused on logs. While Elasticsearch is focused on any type of data you can throw at it. I understand that this gives a lot of benefits because, at least at first look, it sounds better to have one tool that does many things, at least from an operational perspective, than to have many tools that each do certain things. But I'm afraid that train has left, in the sense that I don't think that tendency is ever coming back, like having a single tool from a single vendor that does everything for you. Now, Elasticsearch is definitely not a single tool that does everything for you, but it might be doing a bit too much. So it's a great tool.

Darin: [00:12:15]
Yeah, you might be using it too much. That might be the answer: it might not be doing too much, but just because you have Elasticsearch, you might be using it for everything. I have it, so that's what I use. If all I have is a hammer, everything's a nail.

Viktor: [00:12:30]
It's not only about what you use it for. I think it's also about what it is designed to be used for. So, like, if you put the same quantity of metrics in Elasticsearch and in Prometheus, you will see that Prometheus will use, I cannot give you exact figures, but let's say a quarter of the memory to store the same amount of metrics. It will be much faster to retrieve them, and so on and so forth. Now, to clarify, that's not because people working on Elasticsearch are worse engineers than people working on Prometheus. They might or might not be. I don't know. And that's not the point. But if it's designed to do one thing, and if it does it well, then it is arguably more efficient than something that is designed to do many things. It's up to us to decide on that trade off, and there is a trade off between having three tools or one tool to do three things. I personally prefer Prometheus. I think it kind of became the de facto standard, and what I like about Loki is that it applies the same logic. And here's the major difference, what makes Loki very different from Elasticsearch. Elasticsearch indexes in memory all your data, right? It makes huge, massive, gigantic indexes of your stuff. Now, Prometheus doesn't, and neither does Loki, because they basically use the same logic. What they do is label your data when it's coming in. So basically you're not searching through data. You're searching through labels, which are stored in a specific way and are limited in what they are. The total amount of all the labels is much smaller than the total amount of all the data coming in. If labels provide all the information that you might need when querying data, and not more than that, then you have that perfect balance.
We can argue about what is the perfect balance, but let's say when I search for logs, I'm basically searching for an application, maybe a function, and the severity, like, you know, low, high, warning, this and that, and there are probably a few other things. If I get those things attached as labels, then, yeah, I have what I need.
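The size difference between indexing every word and indexing only labels can be sketched in a few lines of Python. This is a toy comparison under stated assumptions: the `LOGS` sample, the label names, and both index functions are invented for illustration and do not reflect how Elasticsearch or Loki are actually implemented.

```python
from collections import defaultdict

# Hypothetical log entries: (labels, message).
LOGS = [
    ({"app": "checkout", "severity": "error"}, "payment gateway timed out"),
    ({"app": "checkout", "severity": "info"}, "order 1001 accepted"),
    ({"app": "checkout", "severity": "info"}, "order 1002 accepted"),
    ({"app": "search", "severity": "warning"}, "slow query took 900 ms"),
    ({"app": "search", "severity": "info"}, "cache warmed"),
]


def full_text_index(logs):
    """Map every word of every message to the entries that contain it."""
    index = defaultdict(set)
    for position, (_labels, message) in enumerate(logs):
        for word in message.lower().split():
            index[word].add(position)
    return index


def label_index(logs):
    """Map each distinct label set to the entries that carry it."""
    index = defaultdict(list)
    for position, (labels, _message) in enumerate(logs):
        key = tuple(sorted(labels.items()))
        index[key].append(position)
    return index


if __name__ == "__main__":
    print("full text index keys:", len(full_text_index(LOGS)))
    print("label index keys:", len(label_index(LOGS)))
```

The full text index grows with every new word that appears in a message, while the label index only grows when a new combination of labels shows up, which is the balance being described.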

Darin: [00:15:05]
Let me ask this question. Does Loki have the same weaknesses as Prometheus? Meaning there can only be one Prometheus today, right?

Viktor: [00:15:16]
You can scale Loki.

Darin: [00:15:19]
You can scale Loki?

Viktor: [00:15:20]
Yes. Yes. Yes, actually. Yeah, it's a bit weird. You can split it into different components, one for querying and one for storing data, and that gives you some level of scalability. So you can split it into more components than what comes by default. And then each of those can be horizontally scaled as well. Because ultimately what they do is that they don't keep things in memory. They read from disk. I mean, they do keep things in memory, let me be more explicit, but they are not creating a full text index in memory.

Darin: [00:15:58]
So they're being very, my word, efficient by using labels. The problem is, if the labels get misapplied, then you've got a bad data set to play with, which is normal. You get that with Elasticsearch too. The thing with Elasticsearch is that in times where I've implemented Elasticsearch the way you've explained Loki, I've followed that label model. I would tag things specifically, and then I would put the raw text in there. I typically would chase after the labels, but then if I needed to do a full text query, I could because it was there. What I'm losing with Loki is that full text query. I can only deal with labels.

Viktor: [00:16:46]
No, you can do a full text query. Let me clarify that. It's just that the full text query might or might not be more efficient than in Elasticsearch. In a simplified way, which is not really how it works under the hood, you can limit through labels, say, give me only the logs of this application, so that it can load them, and then you can search through any text within that subset filtered by labels. You could theoretically search directly through the text and skip labels, but that will be horrifyingly bad.

Darin: [00:17:29]
Right. That would be as compared to Elasticsearch, which has made that part highly efficient.

Viktor: [00:17:34]
Exactly. Exactly. So if you just say, all the logs from the last week in the whole system, find me the word error, please go to Elasticsearch. Now, if you say this app, in this namespace, last day, week, hour, whatever, and then find me the ones that have the word error, you're better off with Loki.

Darin: [00:17:56]
Right, because you will have been able to drill down to that time series point, then you're basically just scraping and that's where you're paying whatever cost that is. But hopefully there's only like five to 10 of those there. Yeah. Okay. It's basically like doing a query on a database without an index. That's what it amounts to.

Viktor: [00:18:20]
Yes, exactly. I mean, doing a query on a database by combining indexed and non-indexed lookups.

Darin: [00:18:28]
Right. I've got an index to get all the way down to the end, but then it's just, okay, now I have to go brute force through it.
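The index first, brute force second idea from this exchange can be sketched as a two-stage query in Python. Everything here is a simplified assumption for illustration (the `Stream` type, integer timestamps, the sample data); it is not how Loki is actually implemented, only the shape of the trade off being discussed.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """All log lines that share one label set (hypothetical structure)."""
    labels: dict
    lines: list = field(default_factory=list)  # (timestamp, text) pairs


def query(streams, selector, start, end, needle):
    """Return matching lines plus how many lines had to be scanned."""
    scanned = 0
    matches = []
    for stream in streams:
        # Stage 1: cheap label match; non-matching streams are never read.
        if any(stream.labels.get(k) != v for k, v in selector.items()):
            continue
        # Stage 2: brute-force scan of the lines that survived stage 1.
        for timestamp, text in stream.lines:
            if not (start <= timestamp < end):
                continue
            scanned += 1
            if needle in text:
                matches.append(text)
    return matches, scanned


if __name__ == "__main__":
    streams = [
        Stream({"app": "checkout", "namespace": "prod"},
               [(1, "ok"), (2, "error: card declined"), (3, "ok")]),
        Stream({"app": "search", "namespace": "prod"},
               [(1, "error: index missing"), (2, "ok")]),
    ]
    # Narrow selector: only one stream is scanned.
    print(query(streams, {"app": "checkout"}, 0, 10, "error"))
    # Empty selector: every line in every stream is scanned.
    print(query(streams, {}, 0, 10, "error"))
```

The `scanned` count is the cost that is being paid: with a tight selector it stays small, and with no selector it grows with the total volume of logs.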

Viktor: [00:18:34]
Yeah. Now, before we proceed, I must make a public disclaimer. I haven't spent as much time with Loki as with Elasticsearch. So consider these preliminary findings. You know, we might have another episode where I say, this is horrible.

Darin: [00:18:56]
But at first blush, and for the time that you have spent with it, Loki appears to have kicked Elasticsearch's butt on the basics of logging. Like if all I need to do is logging, Loki is going to outperform. Excuse me, let me rephrase that. If all I'm doing is logging and that's all I care about, and I'm willing to follow the opinions that Loki has put around logging, of how to analyze and how to query, then Loki is going to be as efficient, if not more efficient, than Elasticsearch, without carrying the Elasticsearch operational overhead.

Viktor: [00:19:53]
Now the real question is where do you store your metrics? Let's say that we are talking about self-hosted only. If you're storing your metrics in Elasticsearch, don't go to Loki. You're already paying that price. Don't be silly and even think about Loki. Now on the other hand, if you're storing metrics today in Prometheus or something similar and logs in Elasticsearch, then you already went down the route of accepting more specialized tools. Then Loki makes perfect sense. Another thing, and this is the part I like the most, is that both Loki and Prometheus are using exactly the same library for generating those labels. So that means that if you are using Prometheus for metrics, you query both with the same labels and you get results from both based on the same application, same component, whatever it is. You have that parity, which is awesome. But again, I think that the main thing that everybody needs to ask is, are you using Elasticsearch or Prometheus or something else for your metrics? That will give you the answer to what makes sense.
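The label parity mentioned here shows up in the query languages too: PromQL and LogQL both accept a `{key="value"}` label matcher. The Python sketch below builds one query of each kind from a single label dictionary. It is an illustration under assumptions: the metric name `http_requests_total` and the label names are hypothetical, values are not escaped, and whether the same labels really exist on both sides depends on how Prometheus and Loki are configured.

```python
def selector(labels: dict) -> str:
    """Render labels as a {key="value",...} matcher, sorted for stable output."""
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + pairs + "}"


def metrics_query(labels: dict) -> str:
    """PromQL: per-second request rate over 5 minutes (metric name is hypothetical)."""
    return f"rate(http_requests_total{selector(labels)}[5m])"


def logs_query(labels: dict, needle: str) -> str:
    """LogQL: lines from the same workload that contain `needle`."""
    return f'{selector(labels)} |= "{needle}"'


if __name__ == "__main__":
    workload = {"namespace": "prod", "app": "checkout"}  # assumed label names
    print(metrics_query(workload))
    print(logs_query(workload, "error"))
```

The point of the sketch is that one description of the workload drives both the metrics lookup and the log lookup.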

Darin: [00:21:01]
So I think the subtext to all of this is if you're not collecting metrics today, start collecting metrics prior to really logging. That's the bigger one to solve first. Because using your phrasing, if you're watching dashboards instead of Netflix, you want to make sure that your metrics are good because that's going to be your leading indicator of any issues, not a log item. The log item, as you said, is a supplement to your metrics. If you're not thinking that way today, that's probably the biggest takeaway today as you've listened to this. Get your metrics right first. Whether that's Prometheus, Datadog, it doesn't matter. Right. Just get your metrics right first and then put the logs on top of those metrics or underneath. However you want to look at it. And then at that point, you're starting to get the beginnings of observability.

Viktor: [00:22:06]
There is one use case that I cannot not mention. I know at least a couple of companies where people told me that they plan to gather metrics from logs. You know, I'll put the metrics into log entries and then put the,

Darin: [00:22:21]
And then scrape that.

Viktor: [00:22:24]
That's stupid. Don't do that. I know that I'm not supposed to swear on this show, but in this case I cannot help myself. That's stupid.

Darin: [00:22:33]
Yeah. Yeah. You just said the S word. Stupid. That's OK. We'll leave that one in. I will allow that one as the editor. It is stupid because that is backwards. Metrics need to be either emitted, published, or scraped, whichever way you want to go from there. It doesn't matter. Do not bury them in logs and then scrape from logs. That's the wrong answer. So I have a feeling there will be more on Loki in the future, just because. Can you think of a good use case of why you would use Loki in a non Kubernetes environment or would you just go Elasticsearch?

Viktor: [00:23:17]
I would probably go Elasticsearch. And what I'm about to say comes from not having sufficient experience with Loki outside Kubernetes, but something tells me that, while it can run outside Kubernetes just as Prometheus can, it is so focused on Kubernetes that it wouldn't be the right choice. I don't know why. It's just kind of a gut feeling, not really an educated response.

Darin: [00:23:47]
So that's Loki for now. And that's what we'll say. Loki for now. Anytime I say the word Loki, I think of Thor. And then I'm sad that the first phase of Marvel is over, but life will go on and Loki does have a show coming on Disney Plus. So at least we'll get to see Loki a little bit more, but that's not the Loki we're speaking of.

Viktor: [00:24:09]
I don't know. I think it's probably based on Loki. Loki the name.

Darin: [00:24:13]
Probably is. And think about it. Part of the second episode will probably be figuring out the history of the word Loki and how they're using it for logging. I'm trying to tie Tom Hiddleston back to logging and it just doesn't work. Can't figure it out, but it doesn't matter. Okay. If you're not doing Prometheus today and you're running Kubernetes, do that first. If you don't have Elasticsearch or you don't have a logging solution today, look at Loki before you go Elasticsearch. It doesn't. No?

Viktor: [00:24:46]
No, no, no, no, no, no, no, no, no, no. The first option. Go with something as a service. Just don't bother with those things altogether. If you have to self host it, then Darin, continue.

Darin: [00:25:02]
Okay. If you have to self host, thank you. Yes. Always use a managed service provider to take care of that for you. If you're just starting out and you can't use a managed service provider, you gotta self host it: get your metrics right. That's Prometheus. Get your logs right. This is the question point that happens. Actually, before that: if you're using Elasticsearch centrally within the company, as a service provided to you by some central team, then that may be the place to start. Maybe kick the tires on Prometheus. Figure out, okay, is this better or not? Okay. We end up on Elasticsearch. If metrics are already going to Elasticsearch, more than likely your logs are going to be going to Elasticsearch, or, we haven't said the other S word today, Splunk. If you're in a company that's got Splunk, you're probably going to be shipping logs to Splunk because that's what happens.

Viktor: [00:25:50]
I always get confused, to be honest, because I have such a strong association with Splunk or Datadog as a service that I continuously forget that that exists as a self hosted option. So maybe we can have three categories: as a service, paid enterprise something something, and then self hosted open source. Now you need to repeat everything you said again, but starting with all those ifs first.

Darin: [00:26:20]
Okay. No, cause I don't do nested ifs, I do quick returns. But anyway, if you're fully open source and that's how you're building it out, Prometheus, Loki. If Loki doesn't work out for you, then go Elasticsearch, right? We've got nothing against Elasticsearch, but there will be an operational tax that you will pay for running Elasticsearch. In my opinion, and this may be a whole nother firestorm, I don't believe you should be running Elasticsearch inside of Kubernetes at all. That needs to be treated like a normal database outside of Kubernetes, with all the care and feeding that you have to do for a real data store. That's important. If you were to run it inside Kubernetes, to me, you're telling me that it's not important. Your data's not important to you. That's my opinion.
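In the spirit of quick returns instead of nested ifs, the advice from the last few minutes can be written as a short Python function. This is one possible reading of the conversation, not an official decision tree: the `Situation` fields and the returned strings are hypothetical names chosen for the example, and the non-Kubernetes branch reflects what Viktor himself called a gut feeling.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical inputs for the decision; the field names are illustrative."""
    can_use_managed_service: bool
    central_team_runs_elasticsearch_or_splunk: bool
    metrics_already_in_elasticsearch: bool
    runs_on_kubernetes: bool


def pick_logging_stack(s: Situation) -> str:
    """Walk the advice top to bottom, returning at the first match."""
    if s.can_use_managed_service:
        return "managed service"
    if s.central_team_runs_elasticsearch_or_splunk:
        return "the central Elasticsearch or Splunk"
    if s.metrics_already_in_elasticsearch:
        return "Elasticsearch"  # that price is already being paid
    if not s.runs_on_kubernetes:
        return "Elasticsearch"  # described in the episode as a gut feeling
    # Fully open source and self hosted on Kubernetes; fall back to
    # Elasticsearch later if Loki does not work out.
    return "Prometheus for metrics, Loki for logs"


if __name__ == "__main__":
    self_hosted_k8s = Situation(
        can_use_managed_service=False,
        central_team_runs_elasticsearch_or_splunk=False,
        metrics_already_in_elasticsearch=False,
        runs_on_kubernetes=True,
    )
    print(pick_logging_stack(self_hosted_k8s))
```

Each condition returns immediately, so the order of the checks is the priority order: managed services first, existing central tooling next, and the Prometheus plus Loki pairing only when nothing earlier applies.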

Viktor: [00:27:13]
If you have to run it inside of Kubernetes, run it on a dedicated node pool. Which kind of defeats the purpose of Kubernetes, but hey.

Darin: [00:27:23]
Exactly. If I can only run a single pod on an instance, that's 128 gig. Actually, yeah, a hundred, cause you need, because you don't want it. And even though we're going down the wrong path here, Elasticsearch says don't run containers larger, or excuse me, don't run a process larger than 32 gig for Elasticsearch. That's their in-writing guidance, if I remember the number correctly. And if you follow the Oracle recommendation of never having a JVM size larger than one quarter of physical memory, that means I would need a 128 gig box to run a 32 gig heap. Okay. If I'm going to run a 32 gig heap as a pod on Kubernetes, why bother? Just run it on the bare metal at that point or virtualized. The math is simple.
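The arithmetic behind that claim is small enough to write down. The Python sketch below takes the two numbers as quoted in the conversation (a 32 gig heap and a heap of at most one quarter of physical memory) as inputs; it does not verify either recommendation, and the function and parameter names are made up for the example.

```python
def min_host_memory_gib(heap_gib: int, heap_share_denominator: int = 4) -> int:
    """Smallest host memory so the heap is at most 1/denominator of it.

    Both the heap size and the one-quarter rule are taken from the
    conversation as stated, not checked against vendor documentation.
    """
    return heap_gib * heap_share_denominator


if __name__ == "__main__":
    print(min_host_memory_gib(32))
```

With a 32 gig heap the function returns 128, matching the 128 gig box mentioned above, which is why a single such pod ends up occupying an entire node.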

Viktor: [00:28:15]
When you get to that point, the only thing that can help you is for everybody to gather in a circle, hold each other's hands and sing kumbaya.

Darin: [00:28:25]
Yeah. Cause it's game over at that point. You're doing silly beyond silly things at that point. But anyway. Okay. Logging with Loki. Hopefully you found this useful today. If you've got questions, you know where to go put the questions. Over in the Slack channel and we'll go ahead and shout him out one more time. Vadim. Thanks for bringing up Loki and sending Viktor down a very dark path. He will soon have a sceptre and take over the world. No, wait. That is Loki.

Darin:
We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and the link to the Slack workspace are at https://www.devopsparadox.com/contact. If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at https://www.devopsparadox.com/ to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.

Show Notes

Hosts

Darin Pope

Viktor Farcic
