DOP 81: Making Email Provider Integration Simple With Nylas

Transcript

Christine: [00:00:00]
Especially in a startup, you don't know what's going to take off, how fast you're going to scale, what kinds of different infrastructure services you might need and using something like AWS or Azure or GCP just drastically makes it so you can be much more flexible and adapt to the needs of the company as they change.

Darin:
This is DevOps Paradox episode number 81: Making Email Provider Integration Simple With Nylas.

Darin:
Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about. Most of the time, we mask our ignorance by putting the word DevOps everywhere we can, and mix it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make it sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often, since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. PS: it's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic.

Darin: [00:01:23]
You did listen to last week's episode, didn't you? The one where we talked about why you should outsource to a managed solution or when you should. You did listen to that, right?

Viktor: [00:01:32]
You're asking me?

Darin: [00:01:33]
I'm asking everybody.

Viktor: [00:01:35]
Oh, OK.

Darin: [00:01:37]
I expected you to listen to it. That's your job. Okay. Anyway, today is one of those days. We actually have a guest on with us. We have Christine Spang from Nylas. Christine, thanks for joining us today.

Christine: [00:01:54]
Happy to. Excited to be here.

Darin: [00:01:56]
Now, before we started recording, I asked her permission. Christine is a graduate of MIT. That is not a state sponsored school, or is it? Is it technically a state sponsored school or state school?

Christine: [00:02:14]
I mean, it depends on what you mean by state sponsored. Technically MIT is a private institution. It's a big research facility, so they get tons of grants from the government to do the research. So definitely partially state sponsored.

Darin: [00:02:29]
Yeah, but that's not the same thing as being a state school.

Christine: [00:02:31]
Yeah, I agree.

Darin: [00:02:33]
You're more like, so I live in North Carolina, so you automatically hear Duke and you think, oh, it's, no, Duke is private.

Christine: [00:02:41]
Yeah, no. MIT is the non-Ivy Ivy. It's not Ivy League, but we hang out with the other Ivies.

Darin: [00:02:50]
Oh, okay. You're down a few notches, right? Just in, in the snobbery level, I guess.

Christine: [00:02:56]
You know, mortal enemies, always trying to be that cool, but the Ivies just sort of look down their noses at us.

Darin: [00:03:05]
What were you in school for at MIT? Just out of curiosity.

Christine: [00:03:08]
Yeah, so I went to MIT pretty much from the get-go to study computer science. I really knew what I wanted to study when I arrived as a freshman because I basically got into programming when I was in high school, through the free and open source software communities. So I basically went to MIT because I really saw it as the birthplace of the open source and free software movements.

Darin: [00:03:35]
So was that where Stallman was out of?

Christine: [00:03:40]
Yeah. He worked at the Institute for a while or was sponsored by or something like that.

Darin: [00:03:46]
If you don't know who Richard Stallman is, go back in your history books. So you went to school for computer science and then did you come straight out of school and what'd you do right out of school? What was that like?

Christine: [00:04:00]
When I was at MIT, I spent a lot of time learning about operating systems and I got really into kernel engineering. I had a few friends that I knew from basically the computer club at MIT who started a company that was commercializing some kernel technology for Linux that they'd built as a part of their master's work at MIT. So that was really the way that I was introduced to entrepreneurship. I never really grew up being like, I'm going to start a business someday. I really came to it from the technical side of things. Basically, I had the choice of staying at MIT and doing a master's degree around operating systems engineering, or these folks gave me a job offer to join their startup that was commercializing this kernel technology. I was always very practically minded and I knew that I didn't want to stay in academia. So when given the choice of getting a paycheck and learning how to run a little business or staying in grad school, it was a no-brainer to take the job.

Darin: [00:05:06]
I think that should be, if you're going to go to school, get out. If you want to get a further degree, do it later.

Christine: [00:05:14]
My mother was a bit disappointed because she wanted me to get a master's degree. But my perspective was that unless you're going to work at a really big company, a master's degree is a bit questionably valuable in the software industry because a lot of the experience that you gain is very practical and can be built up in the workplace. So you can often even kind of get ahead a bit by getting into the industry earlier.

Darin: [00:05:38]
Absolutely, and you can always go back and get a degree later, right? For fun, not because you need it to build yourself up. Because, oh, by the way, I've already built five companies, sold them all off, and I think I'm going to get my master's next year.

Christine: [00:05:54]
Yeah. The research will always be there, but the right opportunity in terms of jobs might be something that is fleeting and only available once.

Darin: [00:06:03]
And let's talk about that. So how did the opportunity for Nylas come about? What was the trigger point that caused that to be birthed?

Christine: [00:06:15]
So, when I first joined this startup that some friends of mine had started out of college, I was basically with that company for three years and saw the whole life cycle of a startup there. It was a bootstrapped company, so it never raised external funding. So that was an interesting experience, building up just based on revenue, and we had government research grants as well. But I kind of learned the whole cycle of building a product, shipping it, monetizing it, building up a sales team and things like that. I got a front row seat to all of the aspects of building a business, and eventually the founders of that company sold it to Oracle. So I also got to go through the exit process where Oracle was taking that company and integrating it into their company. We moved into the office of another company that they had bought in Cambridge, Massachusetts, because Oracle didn't have an office there for a long time. We basically hired a whole new set of folks, trained them up on the technology and made it so that it was going to be a sustainable thing in its new home. Then the founders went off and started a new company and that kind of got me thinking about what I was going to do next. Basically, I was a little bored just working at a big company. I felt like there were a lot of barriers to getting things done quickly there. It was a good job, but I was pretty young and energetic and ambitious and wanted to do more. When it got close to the two year mark, when my retention was up, I really started asking around and talking to folks about what other opportunities were out there. I considered joining another company. But at the end of the day, I was not blown away by any of those opportunities.
I had this friend who I also knew from MIT who had essentially been trying to build a tool for his undergraduate thesis that combined data from email with data from the MIT student directory and allowed you to get contextual information about who you were emailing with as you were doing so. Through this experience, he found that the development process of connecting to a mailbox was really difficult. Surprisingly difficult. It took him maybe a few months to just do really basic things like displaying an email in a way that worked all of the time with all the messages you might find in a mailbox. So that's really the idea that sparked Nylas. I decided it was a good time in my life to try something that may or may not pan out. My friend had a background in the startup community as well, so I decided to just start a company based around this idea with him and that's what eventually turned into what Nylas is today.

Darin: [00:09:01]
How long has Nylas been around?

Christine: [00:09:03]
I guess it's been a little bit over seven years at this point. That's kind of broken up into two segments. The first was 2013 to 2017, which I would characterize as the incubation phase, like trying new ideas, trying to find product market fit, super small team during that entire time, 15 or fewer people, kind of the garage mode. Between 2017 and today, we really started scaling out the business because we had something that was working, which is these data APIs that abstract away the complexity of building communications features into your applications.

Darin: [00:09:40]
And I think the key part here is that these are just APIs that any developer could use.

Christine: [00:09:45]
Yeah. So what we ended up building is modern REST APIs that abstract over the email service providers and calendar providers and make it so that you have one single API that's very simple and easy to work with, that allows you to build workflows and automations using all these different providers in a much faster and easier way than if you went out and built a Google integration and an Exchange integration, and there's a whole bunch of different protocols within Exchange, so you might have to build multiple there. Then there's the open source calendar and mailbox providers, which are a whole nother can of worms. So it takes away all of that complexity, as well as the complexity of just the fact that email and calendar have been around for a long time. For email it's 50 plus years, and it's essentially a global distributed infrastructure with lots of different implementations. We have specifications, but any software engineer who's spent some time in the wild knows that if you have more than one implementation of something, they're never going to be exactly the same. They'll have different bugs, different edge cases, and with email and calendar, you just have to kind of deal with what's out there. You have to deal with all sorts of weird time zones and things like that, and that complexity adds up. It turns into time and it makes it much more difficult to just build a simple feature and get it out to market.
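
Sketch (not part of the conversation): the idea of one API sitting in front of several providers can be pictured as an adapter layer. The following is only an illustration under stated assumptions. The class names, payload shapes, and the Message fields are all hypothetical and are not Nylas's API or any provider's real response format.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Message:
    """Provider-neutral message shape (field names are illustrative)."""
    subject: str
    sender: str


class MailProvider(Protocol):
    """Anything that can return messages in the neutral shape."""
    def list_messages(self) -> list[Message]: ...


class FakeProviderA:
    """Hypothetical provider whose payload nests fields under 'headers'."""
    def __init__(self, payloads: list[dict]) -> None:
        self._payloads = payloads

    def list_messages(self) -> list[Message]:
        return [
            Message(p["headers"]["Subject"], p["headers"]["From"])
            for p in self._payloads
        ]


class FakeProviderB:
    """Hypothetical provider with a flatter, differently named payload."""
    def __init__(self, items: list[dict]) -> None:
        self._items = items

    def list_messages(self) -> list[Message]:
        return [Message(i["title"], i["from_address"]) for i in self._items]


def list_all(providers: list[MailProvider]) -> list[Message]:
    """Callers see a single shape, whichever provider the data came from."""
    out: list[Message] = []
    for provider in providers:
        out.extend(provider.list_messages())
    return out
```

Each adapter absorbs one provider's quirks, so a feature written against Message does not need to know which backend a mailbox lives on.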

Darin: [00:11:16]
As a developer, I could use the APIs, but let's flip around on the other side. Where are you running those APIs today? Are you running in public cloud? Do you have your own data centers? What does that look like for you?

Christine: [00:11:28]
We have always been based on top of Amazon Web Services. We started that from the get-go. Even in 2013, it seemed like a no brainer to use the public cloud to basically accelerate our ability to experiment. Especially in a startup, you don't know what's going to take off, how fast you're going to scale, what kinds of different infrastructure services you might need and using something like AWS or Azure or GCP just drastically makes it so you can be much more flexible and adapt to the needs of the company as they change. It's been a huge game changer in the industry.

Darin: [00:12:06]
In dealing with these multiple data centers, you're working with people across all different legal districts. GDPR probably burned you. CCPA now is probably a pain. What's it like operating a company that has to deal with those kind of legal restrictions?

Christine: [00:12:28]
We kind of always knew that the business we were building was dealing with potentially sensitive information. Businesses run on top of email. You often have to archive it for long periods of time. There's all sorts of different kinds of data in there, everything from legal documents, contracts, conversations that are about sensitive business information, password resets, that kind of thing. We really designed the platform from the ground up to keep that security in mind. For example, we don't mix data between different companies that are using our platform. The assumption is this data belongs to that company and we're an infrastructure provider. We're doing various processing and transformations, but we're not taking those and making them available to some other company, or harvesting anonymized data (it's hard to anonymize things, but people say they're anonymizing them) and using that to power other products. That's not the way our business was built, because of how key this data is to those companies. We've had to make some adaptations for new legislation, like GDPR and CCPA, mostly around implementing processes to make sure that end users can request that their data be deleted and purged from our systems. The other piece of that is around data residency, so we've had to invest in data centers in jurisdictions other than the United States. We have data centers available in the EU as well as Canada, with more coming for other regions. I think that's been the biggest investment we've had to make for this privacy legislation, aside from obviously investing in just general good security on our systems.

Darin: [00:14:18]
What is your platform under the hood? You're running everything on bare metal, right? You're racking and stacking servers. No. You're on AWS. We know that's not true.

Christine: [00:14:26]
Yeah. Yeah. So it's AWS and then we are a Python shop.

Darin: [00:14:31]
Oh, I'm so sorry. At least you didn't say we're a Ruby shop. I mean,

Christine: [00:14:35]
Yeah. Python isn't so bad.

Darin: [00:14:38]
Python is great. But are you Django or are you just straight up Python?

Christine: [00:14:43]
We are not Django. The key libraries I would say that we're using: for the API front end, we use Flask, with obviously a lot of extensions and bits and pieces that we've built up over time to power different parts of our system like authentication and things like that. Now we have an OAuth implementation that is used for granting tokens to developers to request data from our API. We also use gevent, which is basically an event loop library for Python. This library is maybe a little bit controversial in that it has an event loop, but also, by default, the way to use it is that it monkey patches the standard library. All IO that happens in the standard library in Python goes through the event loop, which is not typical for other event loop systems for Python. Python is not Node.js. It's not built from the ground up to be doing evented IO, but gevent was one of the first libraries that came about that allowed you to do this in a really scalable, well tested way on Python. We use that because essentially the core of our system is connecting to all of these different backend providers and doing IO. We're leaving connections open to get real-time notifications about when new messages or events or changes to that data come in, so that we can, in as close to real-time as possible, sync that data to our API and push it out to whatever applications need to ingest that data, or even internal services that are doing further data processing. So the evented IO is really key, because the alternative is just not scalable. One, threads are not a great idea in Python, and you can't really have multiple threads that use multiple CPUs anyway, and two, there's just a lot of memory overhead from the interpreter. In order to scale up our ability to have thousands, hundreds of thousands of accounts connected to our system that are doing a ton of IO, we needed to have a way to do that.
We found that using evented IO was the best way to make that happen.
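
Sketch (not part of the conversation): to show why an event loop suits many mostly idle connections, here is a minimal example that waits on a few hundred simulated accounts from a single thread. It deliberately uses the standard library's asyncio rather than gevent, so it runs without extra packages. It is a stand-in for the general technique, not the approach described above, and asyncio.sleep is an assumed placeholder for waiting on a provider connection.

```python
import asyncio
import time


async def watch_account(account_id: str, delay: float) -> str:
    # asyncio.sleep stands in for waiting on an open provider connection;
    # while this coroutine waits, the loop is free to service the others.
    await asyncio.sleep(delay)
    return f"{account_id}: change received"


async def watch_all(account_ids, delay):
    # One thread, one event loop, many concurrent waits.
    return await asyncio.gather(
        *(watch_account(account_id, delay) for account_id in account_ids)
    )


def run_demo(n: int = 200, delay: float = 0.05):
    """Return the results and the wall-clock time spent waiting on n accounts."""
    ids = [f"acct-{i}" for i in range(n)]
    start = time.perf_counter()
    results = asyncio.run(watch_all(ids, delay))
    return results, time.perf_counter() - start
```

Waiting on the accounts one after another would take roughly n times the delay. With the waits overlapped on one loop, the total stays close to a single delay, and no thread per connection is needed.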

Darin: [00:17:07]
Since your application developers are perfect. They've never introduced a bug into the system. That means you've had 0.00% of downtime, and you've never had any issues in operating your systems. Is that correct?

Christine: [00:17:20]
Ah, I think it's a bit of a stretch, but I, I, I take the joke.

Darin: [00:17:27]
Of course nobody has that, right? But what does an operational day look like? We were talking before about one scenario that recently happened, where the call chain sort of failed and you got woken up at two or three in the morning.

Christine: [00:17:45]
We obviously have several teams internally that are maintaining different parts of the systems. We have on-call rotations for those and try to instrument the system such that we know about issues before the customer does, and so that we're also not burning out our engineering team by having alerts going off all the time. It might be illustrative to talk through this particular issue. I think it's really interesting from a learning strategy point of view. So the issue was that one of our larger customers was running into quota issues on their Google integration. We sort of help people manage their applications, but for some things like quota increases, we need involvement from the customer to submit that request to Google, because they want to have a direct relationship with the owner of the application. So essentially our customer success team needs to manage the communications with those developers to make sure that they have everything they need to submit the right requests as they're scaling up, to show Google what their application does, how they use the calendar APIs, et cetera, that they're following best practices and do need more quota. Our on-call engineering team doesn't actually have the ability to fix this issue, but we have monitoring in place that makes sure all of our accounts are syncing properly, and this customer was big enough that it was not just showing up on the graphs we monitor during business hours, it was triggering a high urgency alert around the system. We had tried to fix it using communication around our overnight on-call of like, hey, just don't escalate this, but somehow that communication just didn't quite work out and this issue got escalated, and through a series of coincidences, the folks that were earlier in the chain didn't manage to pick up, so it did bubble up to me. The really interesting part about this is there's a whole bunch of failure cases.
There's one, there's like the human communication piece and then there's this strategy about like, hey, all of these pages should be actionable and in this case it was not an actionable thing. We ended up making a change so that our CS team has separate alerting during business hours that won't wake them up in the middle of the night that allows them to manage the communication around this, but that is not going to page our engineering team at any time about these sorts of quota issues.
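
Sketch (not part of the conversation): the change described here, where pages go only to people who can act on them, amounts to a small routing rule. The alert kinds, destination names, and the nine-to-five weekday window below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Alert:
    kind: str            # hypothetical labels, e.g. "provider_quota"
    raised_at: datetime  # local time at which the alert fired


# Assumed business hours; a real schedule would come from configuration.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(ts: datetime) -> bool:
    """True on Monday to Friday between the assumed start and end times."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.time() < BUSINESS_END


def route(alert: Alert) -> str:
    """Quota issues go to customer success and never page engineering."""
    if alert.kind == "provider_quota":
        # Engineering cannot fix a quota limit, so it is never paged for one.
        if in_business_hours(alert.raised_at):
            return "notify_cs"
        return "queue_for_cs"  # held until the next business day
    return "page_engineering"
```

The point of the rule is that the routing decision depends on who can act on the alert, not on how large the affected customer is.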

Darin: [00:20:26]
Now the one thing that we've sort of been stepping around and I'm just going to ask it bluntly. So you're running on AWS. Are you running in Kubernetes or are you rolling your own on EC2 instances? What does that platform in general look like?

Christine: [00:20:38]
We don't currently use Kubernetes for our base platform, largely because when we were getting started, it was extremely early days on both container tooling, as well as the orchestration fronts. We spent a bit of time looking into it, but it seemed like the tooling was changing too fast, that it wasn't really something that we needed at the time. We use Terraform as well as Ansible to configure and orchestrate our machines just on top of EC2. We have been starting to use containers and Kubernetes for some kind of auxiliary services, but we're not to a point where we have the entire platform on that yet, but it is very interesting, especially as we're kind of looking into how do we, for example, in the future support on-premise installations for large, extremely security sensitive organizations that are averse to using either public or private clouds and Kubernetes is really becoming the de facto way to deploy complex applications into someone's data center environment.

Darin: [00:21:46]
So you're just now getting into containers. You've got AWS. You have your own data centers, quote unquote, in, how shall we say, restricted areas of the world. What's next for Nylas? Are you going to go functions as a service for everything?

Christine: [00:22:10]
We are doing a little bit of Lambdas, actually.

Darin: [00:22:13]
Oh. No, no. Have you considered multi-cloud?

Christine: [00:22:18]
You mean using multiple of the public clouds? Potentially. We have considered it. I cannot say whether we've gone down that path yet, but it's been something we've discussed for sure.

Darin: [00:22:30]
So you had these issues come up. You resolved them. It was a CS problem, not an engineering problem. I mean, it was an engineering problem. You were able to reclassify it as a CS problem. What are other stories like that, whether it was you getting woken up in the middle of the night or somebody else being woken up, where it's like, okay, well, how could we have solved this automatically? How could it have self-healed? Looking back on it now, how did you resolve it? What are those types of scenarios?

Christine: [00:23:07]
Gotcha. So other examples around how we're building an operable, reliable, scalable system? That's a great question. I can give you some examples of how we do, for example, the account loading on our sync fleet, and some ideas that we're experimenting with to make it more scalable and operable and self-healing in the future. One problem that we've had to deal with is that the protocols that we integrate with vary quite a lot. Especially when we were first getting started, some of the protocols we had to integrate with were stateful protocols that weren't HTTP based and that essentially required long running connections to work properly. So the original design of the fleet of machines that keeps our data store up to date, we call it the sync fleet, maps an account's sync to a worker in a persistent fashion. So the worker pretty much picks up the account and syncs it for potentially days. So it's very long running, but this has a number of downsides. One is that it's not so straightforward to scale up and down, because you essentially have to be able to, for example, shed load from existing machines. If the workload is spiky, it's hard for the fleet to adapt to that. We essentially have a system where we can auto-scale that fleet in the up direction, so we can add capacity automatically by tracking an estimate of how much account capacity we have on the existing workers. So we have this concept we call sync slots, where a worker can sync up to so many accounts, which is not a super scientific estimation, but approximate enough to work. So when we are running low on sync slots in a particular AZ, we'll automatically provision more capacity using AWS auto scaling groups, which in the early days we were doing essentially manually. So that got rid of a bunch of toil on the operations side.
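
Sketch (not part of the conversation): the sync slots idea can be read as simple capacity arithmetic. Estimate the free slots across workers, and when they fall below a threshold, work out how many workers to add. The Worker fields, the threshold, and the per-worker slot count are hypothetical, and the step that would actually ask an auto scaling group for more capacity is left out.

```python
import math
from dataclasses import dataclass


@dataclass
class Worker:
    slots: int     # estimated number of accounts this worker can sync
    assigned: int  # accounts it is currently syncing


def free_slots(workers) -> int:
    """Total unused capacity; an overloaded worker contributes zero."""
    return sum(max(w.slots - w.assigned, 0) for w in workers)


def workers_to_add(workers, min_free: int, slots_per_worker: int) -> int:
    """How many workers to add so that at least min_free slots are open.

    This only scales up, matching the one-directional scaling described
    in the conversation.
    """
    shortfall = min_free - free_slots(workers)
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / slots_per_worker)
```

For example, with two workers of 100 slots each holding 95 and 100 accounts, only 5 slots are free. Asking for at least 50 free slots would add one worker of 100 slots.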
An experiment that we're running right now to try to handle spiky workloads as well as allow us to scale things in a more dynamic fashion, as well as potentially use more short-lived workers is splitting out our sync fleets into essentially two components where one is essentially doing the long running connections pieces and figuring out what work needs to be done and then producing tasks and then the other fleet is like a very dynamically sized, scaled based on the queue of work to be done, fleet that consumes those tasks and executes them and updates our data store. This is something that we're really validating out right now and if it ends up working well in production, we think that it could really increase the efficiency of our fleets as well as make this system more resilient. I guess one example is that we, at one point in the past, ran a backfill operation on our database clusters that overloaded the clusters and it was quite difficult to recover from because essentially the fleet couldn't redistribute load on its own. We had to have our on-call team manually going in there and redistributing load because the system was just not designed to be able to do that on its own and prioritize the continuity of syncing these accounts. So that is something that will really go away if we can succeed at this architectural evolution of the system.
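
Sketch (not part of the conversation): the two-fleet split can be pictured as a producer that turns detected changes into tasks, and a consumer pool whose size follows the depth of the task queue. Everything below is an in-memory illustration. The task shape, the sizing rule, and the limits are assumptions, not the system being described.

```python
import math
from collections import deque


def produce_tasks(changed_accounts) -> deque:
    """Producer side: one sync task per account with pending changes."""
    return deque({"account": a, "op": "sync"} for a in changed_accounts)


def desired_consumers(queue_depth: int, tasks_per_consumer: int,
                      min_consumers: int = 1, max_consumers: int = 50) -> int:
    """Size the consumer fleet from the backlog, clamped to assumed limits."""
    wanted = math.ceil(queue_depth / tasks_per_consumer)
    return max(min_consumers, min(wanted, max_consumers))


def consume(tasks: deque, store: dict, batch: int) -> int:
    """Consumer side: run up to `batch` tasks and record them in the store."""
    done = 0
    while tasks and done < batch:
        task = tasks.popleft()
        store[task["account"]] = "synced"
        done += 1
    return done
```

Because consumers hold no long-lived account assignment, any of them can pick up any task, which is what lets the pool grow and shrink with the backlog.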

Darin: [00:26:54]
It's a hard problem. It has hard answers, right? And you're finding out new answers every day.

Christine: [00:27:02]
Yeah. I think this like load distribution problem. We've had a number of different sort of iterations on it and just the activity on accounts is not something that's completely predictable. So we've kind of ended up in this place where we think that we just need to react to the amount of work essentially. There's a surprising amount of details in there that make it fairly complex.

Darin: [00:27:25]
Did you go back to any of your college textbooks for any of the answers?

Christine: [00:27:30]
Uh, no. We did try a few sort of fancy algorithmic things. But I think one thing that certainly I've learned and I've heard the same from others is keep it simple stupid. Often things that are complex, just don't actually work that well in production. So you want to start with the simplest possible thing that could work and when working with distributed systems, it's almost like the system is a living creature. You have to make iterative changes from one point that works to another point that works and if you try to redo everything at the same time, that stable equilibrium can be hard to reach.

Darin: [00:28:15]
Keep it simple for as long as possible. Only make it hard if you have to. Listen to the old guy about this. Trust me, I've been through this one a few times. All right, Christine. Thanks for hanging out with us today. If people want to follow you on the socials or wherever else, where's the best place people can contact you?

Christine: [00:28:35]
Yeah, for sure. You can find me on Twitter at @spang. That's S P A N G. That's probably the best place to find me.

Darin:
We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and the link to the Slack workspace are at https://www.devopsparadox.com/contact. If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at https://www.devopsparadox.com/ to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.

Show Notes

Guests

Christine Spang

Hosts

Darin Pope

Viktor Farcic
