DOP 86: Your Internal Developer Platform Sucks

Transcript

Alan: [00:00:00]
Our developers are super excited because the legacy Windows VMs ticket based system is not sustainable. There's a lot of things that you need to know to be effective in your job, and it's not always straightforward. There's these other groups that are in charge of things like load balancing that you don't really see or feel or hear about, but when you try to take your application to production, all of a sudden you're having to learn a bunch of terms that you didn't really know, or you're not exposed to, where in this new environment, you don't have to know so much out of the box.

Darin:
This is DevOps Paradox episode number 86: Your Internal Developer Platform Sucks.

Darin:
Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about. Most of the time, we mask our ignorance by putting the word DevOps everywhere we can, and mix it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make it sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often, since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. PS: it's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic.

Darin: [00:01:37]
Last week we were talking with Yuval about the hidden costs of DevOps. Viktor, have you ever seen any kind of hidden costs in DevOps?

Viktor: [00:01:48]
I'm still not sure what DevOps is, but I've seen a lot of hidden costs.

Darin: [00:01:52]
We've seen lots of hidden costs. Today, we're going to build on top of that story just a little bit more. We have Alan Barr with us today. Thanks for joining us, Alan.

Alan: [00:02:03]
Hey. Yeah. Great to be here.

Darin: [00:02:05]
Alan contacted us a while back and I was like, man, you should just come on the show to talk about this. So his item we'll get to in just a minute, but it falls into the same vein as we had with Yuval. So Alan why don't you introduce yourself just a little bit more, and then we'll get into the meat of the show for today.

Alan: [00:02:33]
Yeah, definitely. First thing I want to say, as a huge fan of the podcast, is that I really love the discussions you guys have. Let me tell you about myself. My name is Alan Barr. I'm a platform product owner at a mortgage broker. We're called Veterans United Home Loans. What we do is we help our veterans exercise their VA benefit. In order to do that, there's a lot of automation that's required as part of mortgages, and we're building a new platform on top of Kubernetes. What we've learned so far is that there are a lot of directions you can go and they're not all good directions. So what I'd like to cover is that blog post that I discussed, so that you can learn what you really need to be building, because I think there's a danger that people could really build the wrong thing.

Darin: [00:03:18]
No, that could never happen. That's not possible.

Viktor: [00:03:22]
I never saw that. No.

Alan: [00:03:24]
Really. Wow.

Darin: [00:03:24]
We will have a link down to this blog post. So the title for this blog post is your internal developer platform sucks. By the way, because you're already listening to this podcast, I am totally stealing that title for this episode as well. Is that okay, Alan? If it's not okay, I'll come. Okay. Good. Thank you. All right. Good. That's one less thing I have to think about. So the title and the link to this blog post will also be down in the descriptions and show notes and all the different places. So what is it that you ran into? So you were building on Kubernetes. It looks like you were building on Rancher. So does that mean you're working completely on premise?

Alan: [00:04:08]
Yeah, we're working completely on premises. I'm in the mortgage industry. We're a little allergic to public cloud. It probably sounds silly in today's day and age, but that's just where we're at. We have a really long-term strategy and trying to make the most out of the money that we have. So for us, it makes a ton of sense to be on premises, but it doesn't make sense for everybody.

Darin: [00:04:26]
That's probably a very debatable subject. I understand your statement. I disagree with it, but that's okay. That's not why we're here to talk about it. As you're working on Rancher, which thumbs up Rancher, right? We're all in favor of Rancher. Great supported product. That's the big thing is you have support for it.

Alan: [00:04:43]
Huge.

Darin: [00:04:44]
But what are these rough edges that you're hitting that you're going, what is going on?

Alan: [00:04:50]
What I see when I talk to people in industry is that they buy Rancher or they set up Kubernetes themselves, and then they just hand it off to their developers and they say, good luck, have fun. You're going to enjoy using these YAMLs and that's your platform and that's not enough. It's not good enough.

Darin: [00:05:09]
Why isn't it good enough?

Alan: [00:05:11]
You really basically lose out on all the benefits of creating a very opinionated platform on top of it and so you really are enforcing a lot of learning for a lot of people that ultimately for a lot of places is not a value add for your business. It's not really helping you solve business problems.

Viktor: [00:05:30]
It is kind of true. What I think many people don't realize, or maybe I'm just inventing things that don't exist, is that there are maybe three or four distinct phases through which Kubernetes is going to go. First there is Kubernetes core, which was not meant to be used by anybody. Ever. Never, ever, ever, ever. Networking is horrible and storage is horrible. It's not designed to be used. Then we have a lot of different companies, people, and projects building components that can be plugged into Kubernetes that make it better, let's say. More like a platform that somebody should use. Then we have the third phase, even though the second and third are mixing, where companies are building platforms based on Kubernetes core and those components. The last one would be, Hey, I don't even know that there is Kubernetes behind that. All of that, I think, goes pretty well with what you just said, that Kubernetes is too complicated for an average person to use. It's not going to happen. I don't understand Kubernetes fully, to be honest, and I spend most of my time in Kubernetes. I cannot imagine how a JavaScript developer could. You give him Kubernetes and say go. Where? Where?

Alan: [00:06:45]
It's super dangerous to get lost in all these choices. I could have spent two weeks, three weeks for every single particular decision that we're choosing to build on top of this platform. You're just going to get lost. You're going to end up six months, one year, a year and a half down the road. You won't have anything built. Nobody's going to be using it and you'll still be chasing down ideas or thoughts of directions you can go. You want to be at that last stage, focusing on that last stage. For your business, maybe there's an earlier stage you need to really optimize, but for the most part, what's the value of getting lost in those details?

Viktor: [00:07:20]
I think that this is the part where you are at a disadvantage. I mean, not only you, but a lot of things are going on that are applicable both on prem and in cloud, let's say, and then there is a whole range of things that are available only to people who are using cloud. GitHub Codespaces just became a thing. I'm going to let you now explain actually, what is a development environment, but I guess it goes pretty much in that direction, right?

Alan: [00:07:49]
Yeah, certainly I don't discount the cloud at all. There's tons of great services that you can use, and it gives you new opportunities to do things, but you can't afford all the tools. At some point you have to commit to something, some type of experience. Codespaces is phenomenal because you can just spin up an environment within the cloud. You have all your tooling available and that familiar GitHub that everybody's really used to. But when you're in a constrained environment, you have a lot fewer choices. That was a big challenge that we ran into with going on premises: many of the CI/CD tools we wanted to use had some dependency on cloud. They were expecting some load balancer in the cloud that we didn't have, so that really constrained a lot of our options. But you have to really think: when you choose to use one of these tools to hook into your process, you're really making a huge cultural decision about how your team is going to be using these tools and what level of value they're going to be adding on top of everything. You might be choosing something very low level, and maybe for a good reason, but from what I've seen, it's mostly because people didn't want to go that next step to provide that outer edge platform.

Viktor: [00:09:01]
It's maybe not in the DNA of companies actually to think in those terms. That's probably the shift that we, as an industry, have not yet fully adopted: that suddenly the job of different teams and departments is no longer to wait for somebody to request some action, like deploy this for me, but to provide the service.

Alan: [00:09:25]
That's a big thing that we're really focused on. I would even take it a step further: not just giving the service, but providing hospitality. I want that experience of when, let's say, you go into a certain hotel that's known for really good hospitality and they predicted your need. You find a towel in the closet and you're like, why would I ever use this? And then when you have that moment, you're like, Oh, wow. I didn't realize I even needed this thing. That's what we need to be providing for our platforms. I thought about your needs ahead of time. But a lot of us are still stuck in just the service where, okay, you asked me for this thing and I'll give it to you. Well, that's not really good enough.

Viktor: [00:10:01]
Because the person asking you is not really, you don't consider him often as a customer. The hotel is a good example because if you didn't find the towel, maybe you wouldn't come again. Your colleagues don't have that option not to come to you. I mean, not you, but somebody, right?

Alan: [00:10:15]
I would rather choose the cloud many times for a lot of things because it's a lot simpler. I don't have to talk to somebody. There's reasons that people would choose one service or the other. So you can get around having to have a difficult conversation or negotiate all the time for things.

Viktor: [00:10:31]
So how does the development platform, environment, whatever you call it, how does it look like in your case?

Alan: [00:10:36]
In my case, it's a really opinionated experience. We have found, and there's a lot of other companies out there like Darklang that are basically trying to cater to this experience, that there are those four fundamental concepts: you have an API, you have a worker process, a scheduled job, and you have some type of storage, cache or whatever. It's just providing those fundamental things out of the box at the click of a button. For now, it's a GUI. It's a web portal you go to for your self service and then in the future, there could be some CLI component or API, but it's still very early in our journey to provide that. But that's what we're targeting: push button, get platform versus here are all the ingredients to assemble your platform.
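
To make the "push button, get platform" idea concrete, here is a minimal sketch of how a small, opinionated request might be expanded into a Kubernetes manifest on the developer's behalf. It is an illustration only, not the platform discussed in the episode: `ServiceRequest`, `render_manifest`, the default of two replicas, and the registry hostname are all hypothetical, and a real API would also need at least a Service and probably an Ingress.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical model: the developer states only what they need,
# and the platform fills in the Kubernetes details.
@dataclass
class ServiceRequest:
    name: str
    kind: str                       # "api", "worker", or "scheduled_job"
    image: str
    schedule: Optional[str] = None  # cron expression, scheduled jobs only


def render_manifest(req: ServiceRequest) -> dict:
    """Expand an opinionated request into a Kubernetes manifest (as a dict)."""
    labels = {"app": req.name}
    pod_spec = {"containers": [{"name": req.name, "image": req.image}]}

    if req.kind in ("api", "worker"):
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": req.name, "labels": labels},
            "spec": {
                "replicas": 2,  # assumed platform default, not from the episode
                "selector": {"matchLabels": labels},
                "template": {"metadata": {"labels": labels}, "spec": pod_spec},
            },
        }

    if req.kind == "scheduled_job":
        if not req.schedule:
            raise ValueError("scheduled_job requires a cron schedule")
        return {
            # Older clusters (such as 1.18) expose CronJob as batch/v1beta1.
            "apiVersion": "batch/v1",
            "kind": "CronJob",
            "metadata": {"name": req.name, "labels": labels},
            "spec": {
                "schedule": req.schedule,
                "jobTemplate": {
                    "spec": {
                        "template": {
                            "spec": {**pod_spec, "restartPolicy": "OnFailure"}
                        }
                    }
                },
            },
        }

    raise ValueError(f"unsupported kind: {req.kind}")
```

A portal button could then call something like `render_manifest(ServiceRequest(name="rates", kind="api", image="registry.example.internal/rates:1.0"))` and hand the result to whatever applies manifests to the cluster. The point of the design is that the developer chooses among a few kinds, and the YAML-level decisions stay with the platform team.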

Viktor: [00:11:18]
How much of it is custom development and how much is picking different pieces from different places?

Alan: [00:11:26]
A lot of this is glue code in between different pieces. In our case, this is a prototype that we're working on that's not fully production ready yet, but we're getting close. We're gluing GitLab CI together with Rancher, as well as with a lot of other automation that we're doing, such as SwaggerHub for code generation projects. So for us, it's very heavily customized, but this is the stuff that the platform team is building. We're not forcing our developers to learn all the tooling around Kubernetes.

Viktor: [00:11:58]
How's the adoption going? Are people happy? Are they confused? Are they rejecting?

Alan: [00:12:06]
So far it's going very well. It's very early stages though. I'm releasing this either at the end of this week or next week to our early focus group and our technical architects. I've been publishing blogs internally. I've been doing a lot of marketing efforts and people are very engaged and ready because in an enterprise you have a captive audience for your platform. So that's one thing you have going against you if you're not using product management principles, is that it's just easy to give anything to people and not have to really go that next step of making it really usable for them, but we've been engaging with them. We've been running trial runs with the team. They've given us great feedback. They're very excited with it because what they have today is .NET Windows servers that are basically provisioned based on tickets through ServiceNow. It's not very transparent. You have to wait for work to be done by other teams before you can continue. So it's just really painful versus this new experience that they've touched and felt. They have total control to get to that environment without having to work with other groups that they don't talk to everyday.

Darin: [00:13:13]
You just mentioned something there. Define product management principles.

Alan: [00:13:19]
It sounds kind of silly, but it's really simple stuff: listening to your customers, getting their feedback, prioritizing things. Making sure that when you're releasing stuff, you're validating that people want it, that they see the value in it. It might be a little foreign for people, but this is something that ThoughtWorks has put on their radar the last couple of years, that it's important to build a product that people like, and I think it was silly for me to say it in that way. It's on the radar for a reason. It doesn't come as simply to most people.

Viktor: [00:13:50]
What's the relation between the work developers do on their laptops and the things that they do in clusters? Are they switching towards using Kubernetes as their workstation or how does it look like?

Alan: [00:14:05]
Right now it looks like remote development on the Kubernetes cluster is what we can offer. In the .NET world, what we found is that there's a new project that's in progress right now called Bridge to Kubernetes that makes it much simpler to run, but we've found a lot of challenges with running on that local development environment. It hasn't been as simple and straightforward. For our databases, we use mostly Microsoft SQL, so we need to go through Kerberos authentication. It's requiring a Kerberos sidecar. The space is evolving very rapidly and an axiom I've been sharing with people that I heard on another podcast, The Ambassador Podcast, was if you can wait six months, you should and we're really taking that to heart. The local laptop developer experience is just not there yet and we're really hoping for that to improve.

Darin: [00:14:51]
So it sounds like you're pretty much fully a .NET shop.

Alan: [00:14:54]
Yes, we are fully a .NET C# shop.

Darin: [00:14:58]
Have you been moving to .NET Core or are you still staying classic .NET even as you move to Kubernetes?

Alan: [00:15:03]
Oh, we're definitely moving to full .NET Core.

Darin: [00:15:07]
Going back to the problem of the local development processes are painful. I'm assuming that's because you're trying to move towards microservices. Is that the reason why it's painful locally or what's causing that friction?

Alan: [00:15:19]
The friction right now is mostly just the tooling available for the teams and Kubernetes. Also, the legacy .NET Framework is just not a fit for anything that's not a Windows VM. The Windows VMs on Kubernetes are not anywhere near close enough to ready for us to play with, but in our business, most of the automation coding that's required is not tied to anything Windows specific other than SQL Server. So we have a lot of flexibility to move forward with .NET Core and to use Kubernetes versus others that might be tied to something very specific.

Darin: [00:15:55]
Are the developers really excited or are they just being compliant?

Alan: [00:16:00]
Our developers are super excited because the legacy Windows VMs ticket based system is not sustainable. There's a lot of things that you need to know to be effective in your job, and it's not always straightforward. There's these other groups that are in charge of things like load balancing that you don't really see or feel or hear about, but when you try to take your application to production, all of a sudden you're having to learn a bunch of terms that you didn't really know, or you're not exposed to, where in this new environment, you don't have to know so much out of the box. So there is a balance there, but we're just trying to lean more towards you don't need to know so much, but also give you that escape hatch that when you do need to debug and find out that information, you can find it easily. We've decided that this paradigm shift is going to require us to all move in sync with one another and the developers I talked to that have seen this, that have played with it, are really excited about this journey.

Darin: [00:17:01]
What about the ones that haven't seen it yet?

Alan: [00:17:05]
Our culture at the company is very like, as long as you are a good fit with us, we'll bring you on and we'll teach you up kind of thing. So I've definitely noticed that there's people that are new to the idea of containers and what that is and even explaining Kubernetes is quite a task. So part of my efforts is like blogging and writing about some of the concepts that they need to know and learn and understand. The ones that I've talked to and seen in our Slack and various conversations, it's a slow journey of educating. We do these monthly sessions called propeller heads and pizza where we might have a speaker talk on a subject. So we'll cover little nuggets of information about this new world of containers versus the virtual machines that people are used to.

Darin: [00:17:51]
Are you shipping pizzas to everybody during these times right now?

Alan: [00:17:54]
We haven't figured that one out yet. Unfortunately, it's just the talks. There's no propeller, no pizza.

Viktor: [00:18:00]
I'm asking this mostly because I'm not really following Windows much. How big is the gap today? I know that, I don't know, three years ago, Windows on Kubernetes was not really a pleasant thing. How different is it today?

Alan: [00:18:19]
The challenge I've seen when I've had our developers look into it has been you needed the same Windows operating system version on your desktop as the one that was going to be running in Kubernetes. It's kind of challenging to test that. For us, that detail was pretty much a non-starter, but it seems like it's advancing quickly every day. So I imagine it's like two to five years out before it's more tenable, but we just don't have any reason to wait on that need with .NET Core targeting multi operating systems. It's just really no reason to wait.

Viktor: [00:18:54]
Okay. This shows my ignorance of not knowing stuff. So .NET Core containers run on Linux nodes just as on Windows, right?

Alan: [00:19:02]
Yeah. Yeah. Microsoft is super invested. They had a previous orchestrator called Service Fabric. They try to play that contest where they have multiple horses in the same race. Three years ago when we were talking about our orchestration platform of the future, what we wanted to do, a lot of people thought, Oh, well, Service Fabric makes the most sense. We're a .NET Microsoft shop. Why wouldn't we do that? But the pushback at the time from infrastructure and some of us was like, well, yeah, this runs Azure, but what kind of community is there for it versus Kubernetes. Microsoft has been all in on Kubernetes, so they run a lot of Azure on Kubernetes as well. It's just the direction they're going.

Viktor: [00:19:43]
So your Kubernetes cluster or clusters, are they all Linux nodes then or, and then, okay. Okay. For a moment, I thought that it's all Windows. Okay.

Darin: [00:19:55]
With that being said, you're Linux-based Rancher, on-premise, all the things and you said you are risk averse to going to public cloud. Have you looked into and assuming since you're Microsoft, if you were to go cloud more than likely, it'd be Azure, have you guys even considered looking into their on-premise hybrid approach?

Alan: [00:20:19]
We've talked about it but it just hasn't made any sense for us to do that. We're just very risk averse, like you said, about lock-in, and it didn't seem like that was a big value add for us, especially in the sense that what we're trying to do is build the platform, and the cloud platforms that are available are still requiring you to know too much. Azure has an app service. It's very similar to a container, or I'm not really sure, a better example to other public cloud providers. But what we're looking to do is not really get too tied down into a specific cloud model. The reason why you would choose Kubernetes, even if you're going to go into public cloud, is that you have that standard API that you can shift across the different clouds. Obviously you might have to customize it a little bit, but ideally that could help you drive your costs down and not be super tied into a specific cloud vendor's provided platform.

Darin: [00:21:17]
What has management's take been on this, like upper management? Because usually they're the laggards of everybody.

Alan: [00:21:23]
They're very excited. We have a really great company culture here at Veterans United Home Loans. They've been excited about the concept, but obviously what they're looking to get is governance. Are we making sure that we're managing our resources appropriately? Is this going to last? Is it going to be supported? What we've been doing is educating them and equipping them with the value of the project. How it makes such a big difference compared to our current platform, which sucks, but it's been solid. It's reliable. It definitely does the job. Equipping them with information about how much can change with this paradigm shift and why it's worth it. It took some time and there was a lot of encouraging and information sharing that we had to do. We've definitely crossed that threshold and have started this project. We barely started in like June and it's almost November now and we have a prototype that we're giving over to our focus group and architects, and we're just trying to ramp towards getting to production readiness.

Darin: [00:22:22]
What is your target readiness time frame, if you had to guess, just round numbers.

Alan: [00:22:27]
End of the year. January would be great. I'm expecting surprises. This is the first time I've gone on this journey, so I imagine there's still stuff I just don't know about.

Darin: [00:22:36]
Let me ask you this question. Which version of Rancher are you running right now? And let me rephrase that. Which version of Kubernetes is Rancher based on that you're running right now.

Alan: [00:22:44]
I believe it's the 1.18.

Darin: [00:22:46]
You're on the latest.

Alan: [00:22:48]
I think that there's 1.19 out.

Darin: [00:22:51]
They did just ship that. That's right.

Alan: [00:22:52]
We ran into some weird stuff when we tried to go to 1.19. I wouldn't recommend just going to new versions super fast. We've also had some challenges. We're using Ubuntu cloud init images as those worker node OSes and we've also seen some security upgrades that broke some things for us. There's unknowns that happen in this, and there's a lot to learn.

Viktor: [00:23:12]
Are you keeping it kind of simple or are you going all in? Service meshes and fancy storage like Rook and whatnot. How complicated or simple are you?

Alan: [00:23:24]
We've definitely been considering our options with storage. We still use VMware and we've looked at some of their options. They're highly pursuing Kubernetes but it just didn't make sense for us because if you go to vSphere 7, they're assuming that you're going to do software defined networking, and a lot of these really big changes that were just too risky for us. So we've been looking into working with their persistent storage. It seems tenable. I've been working with the architects to really overstate that we're just fully stateless. We're not worried about stateful workloads for now. Maybe in a year or two, we might reconsider that. For service meshes, we're still evaluating our options. Linkerd is probably the simplest one that doesn't require a ton of care and feeding. We're really excited about other kinds of service meshes. At first it seemed, Oh, well this is simple. There's a lot of different options on the market. As we've learned about what's available and what they can do and how much complexity there is, we've been really hesitant to make a choice until we really need to. There's Istio, but at what cost are you going to have to learn it? I don't remember exactly. It's one of those food delivery places said that they were going to spend like six months only on one facet of Istio. I'm not in any rush to even pursue that right now, just because it just seems like it's going to be a lot of headaches for me in the future.

Darin: [00:24:43]
So you're running your cluster on VMs.

Alan: [00:24:45]
Yes.

Darin: [00:24:46]
Why?

Alan: [00:24:48]
It's what we have available. We might consider bare metal in the future, but for now it's not been a consideration.

Darin: [00:24:54]
Anything else you want to cover from the post? It seems like you've covered most of the big items from there. Management buy in is a good thing. Developer buy in is a good thing. Well, management is paying the bills. Developers actually have to use it. What about your tangentials from there? Your testers. Your other people that have to interact with it. How are you handling that scenario?

Alan: [00:25:14]
Constant outreach. Not everybody's up-to-date and reading all the new blog posts about Kubernetes. There's been a lot of communication and knowledge sharing and information about how things are going to change. I definitely think about quality assurance engineers, and whether they're going to be impacted. How much do they need to learn? Hopefully, it's minimal. The nice thing about so much of Kubernetes and something like Linkerd is you're just getting all your observability provided to you via some Grafana dashboards and whatnot. So compared to what we have today, where for every single service that we have, we have to talk about, Oh, is this a good time for us to bake in some observability to it? Oh, is this a good time for us to have some Splunk logs for this, like make a dashboard? It's really nice to have that out of the box versus having to make everything all the time. So I think a lot of people are really excited about what's possible.

Viktor: [00:26:09]
Curious about one thing that you said earlier. You said something like, don't worry about it. We are fully stateless. What do you mean by that? Stateful parts are out of Kubernetes or, I mean, there is state always somewhere, right?

Alan: [00:26:23]
Yes. Right, right. We're not going to host a database on the Kubernetes cluster. We'll let a regular virtual machine handle that for now.
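
One common way to keep a service stateless in this sense is to have it read the location of the external database from its environment, so the pod itself holds no data. The sketch below assumes that approach; the function name and the `DB_HOST`, `DB_PORT`, and `DB_NAME` variables are hypothetical, and 1433 is used only because it is the default SQL Server port.

```python
import os
from typing import Mapping, Optional


def database_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read connection details for a database hosted outside the cluster."""
    env = os.environ if env is None else env
    return {
        "host": env["DB_HOST"],                  # e.g. the VM running SQL Server
        "port": int(env.get("DB_PORT", "1433")), # default SQL Server port
        "name": env["DB_NAME"],
    }
```

Because nothing is stored in the container, any replica can be replaced or rescheduled without a data migration; only the virtual machine hosting the database carries state.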

Viktor: [00:26:31]
Good. Good choice.

Alan: [00:26:33]
Yeah, I don't want more pain.

Darin: [00:26:36]
Beyond a good choice. Again, Alan's contact information, the link off to his blog post. That's also the same title as this episode all down in the show notes. Thanks for joining us today Alan.

Alan: [00:26:48]
Thank you very much. Glad to be here.

Darin:
We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and the link to the Slack workspace are at https://www.devopsparadox.com/contact. If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at https://www.devopsparadox.com/ to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.

Show Notes

Guests

Alan Barr

Hosts

Darin Pope

Viktor Farcic
