In his talk at the CIVO Navigate NA 2023 conference, Alex Olivier, product lead at Cerbos, delves into the challenge of ever-evolving authorization requirements in software systems. Drawing on his extensive experience in the tech industry, Alex highlights the limitations of traditional if-else-based authorization logic and introduces a more efficient solution: Cerbos.
Olivier's journey in the tech world began with learning BASIC in MS-DOS and led him through roles at Microsoft, various SaaS businesses, and now Cerbos. His experiences have given him a unique perspective on the challenges of authorization.
Authorization requirements are constantly changing. In Olivier's experience, he had to rebuild the authorization framework multiple times in different companies due to evolving client needs.
As companies scale, their authorization logic becomes more complex. Initially, simple checks based on email domains or user roles suffice. But as the business grows, more granular checks based on feature access, data regions, and custom roles become necessary. This leads to scattered and complicated logic across the codebase.
Olivier proposes decoupling authorization logic from the application and centralizing it in an authorization service like Cerbos. This approach keeps policies in one place, lets them evolve independently of the codebase, and produces a consistent audit trail.
Alex Olivier's talk emphasizes the importance of evolving our approach to authorization to meet the dynamic needs of modern software systems. By using tools like Cerbos, businesses can ensure efficient, scalable, and compliant authorization processes.
You can watch the full talk and read the transcript below.
I'm going to be talking to you about solving the never-ending requirements of authorization. This comes from a place of very personal pain. My background - you can date me by the fact that my dad taught me BASIC in MS-DOS when I was five, so you can probably work out how old I am from that. I've always been tinkering with software, went into a software dev role, and actually ended up at Microsoft for a bit doing technical consulting. I saw all the horrible stuff that companies do, lots of banks, did a lot of work with the NHS as well, and that's some legacy stuff, then moved on to the world of startups. I spent 10 years working in various SaaS businesses, a lot of Martech, supply chain, connected fitness believe it or not, some ecom, but always very much focused on the data and DevOps and infrastructure side of things, even as I moved to the dark side of product management. Now I'm working on the open source project called Cerbos, but I'll come back to that at the end. It's free, I'm not here to sell you anything. Just to make sure you're on the same page, I'm going to be talking about authorization rather than authentication. Once you've got something like Teleport or Okta in place, so that you know a user is who they say they are, once they're inside your software platform or inside your system, what can they actually do? What should this person, an admin, be able to do? This person is a manager, what should they be able to do? That's the authorization piece.
So, the classic quote in this world is 'nothing is certain but death and taxes'; my extension of this is the ever-changing requirements of authorization. Across these various businesses, I've had to go and rebuild the authorization framework at least three times in each company as the requirements kept changing. Because they signed a new client, or they had a new user come online who wanted some custom roles to lock down their access. At one of my previous companies, we had Emirates Airlines and Staples, the office supply shop, as two of our clients. And we onboarded them to our system that had roles like admin, viewer, editor, and manager, and they said, 'I can't put 60,000 people into those four roles, you need something more granular.' So I had to solve this problem over and over again.
Scaling a company, you start off nice and early with a small set of roles. Someone logs in, and you might have some code scattered about your codebase where you do a simple check: if the email address they're authenticated as, from their token or some session or something, includes your company's domain name, they're an admin and you should let them do everything. Very simple statement; anyone else is just a regular user. You might be using a scope inside a token or something like this to denote this. But it's pretty simple logic, not too hard to put in, and you move on with your life.
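That early check is tiny. A sketch of the shape being described (Python; the domain and function names are illustrative, not from the talk):

```python
COMPANY_DOMAIN = "example.com"  # illustrative stand-in for your company's domain

def is_admin(email: str) -> bool:
    """Early-stage authorization: anyone on the company domain is an admin."""
    return email.endswith("@" + COMPANY_DOMAIN)

def role_for(email: str) -> str:
    # Admins can do everything; everyone else is a regular user.
    return "admin" if is_admin(email) else "user"
```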
Then you start scaling up. Maybe you've got a sales team, and things are going pretty well, and they want to start locking down feature access so they can package it up and sell it to different people for lots of money, usually unreasonable amounts of money. They want to be able to change those conditions on the fly. So now you start to use concepts like: okay, this person's logged in, they're an end-user, which company do they belong to? If they have a premium package associated with that company, then they should be able to do things. Okay, it's another check, not too bad. But you have to sprinkle this logic out across all your request handlers. You can maybe do this using a feature flag solution based on the client ID, the contract ID, or something like that. But the logic gets a bit more complicated.
Here's a fun one I had to deal with. I was the technical product manager for a big data platform, and we were ingesting end-consumer analytics-style data for e-commerce sites, Staples being one of them. Then things like GDPR: when the Safe Harbor agreement between Europe and the US fell down and GDPR came in, we legally had to split our data stores up between different parts of the world because of these new regimes coming in. And there are only more of these coming through: the California one, there's a Chinese one coming, there's a Russian one as well. Data residency becomes an interesting one. So now you have to have this further authorization check across all your codebase: which region is this data stored in? Or which regions is the user allowed to access in the data store? Again, you can maybe use scopes for this. At this point, you may start looking at some sort of framework or library that's specific to your architecture, so CASL, CanCanCan, these kinds of things, to denote who's allowed to do what. But it makes things more complicated.
And here's the enterprise sales example, where you want to support Staples' 60 different departments, each needing different access controls, etc. You're now going to start putting even more logic in here, where you're checking things like groups and roles, maybe even user-defined ones. Now you want some sort of actual user directory system, and this logic is getting a bit out of hand.
And then another fun one: you've got to the point now where you're selling to these big companies, and they all have hard requirements around regulatory needs, audits, access controls, and those sorts of things. So you go and get your ISO or your SOC 2 compliance. I went through ISO a couple of years ago, and every year after that I got dragged into a dark conference room in the basement by an auditor, and I had to pull up, on demand, our access logs to show who could do what. Which at the time was me grepping through S3 files, which was not fun. So for starters, you actually have to start logging and capturing an audit trail of who's done what. Now in your codebase, wherever you have that if-else logic going on, you need to put in a check and audit-log exactly what happened: where, who tried to do what, and whether the access was allowed or not. You plug it into some sort of logging library, etc.
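Stacked together, the checks described so far (admin domain, premium packages, data regions, departments, plus audit logging) end up as something like this in every request handler. A hedged sketch with illustrative names, not code from the talk:

```python
import logging

logger = logging.getLogger("audit")

def can_read_contact(user: dict, contact: dict) -> bool:
    """The accumulated if-else: domain, plan, data region, and department
    checks all tangled together in one handler."""
    allowed = False
    if user["email"].endswith("@example.com"):            # staff are admins
        allowed = True
    elif user["company_plan"] == "premium":               # feature gating for sales
        if contact["region"] in user["allowed_regions"]:  # GDPR-style data residency
            if "sales" in user["departments"]:            # enterprise custom roles
                allowed = True
    # Compliance: log every decision so the auditors can see who did what.
    logger.info("user=%s action=read resource=%s allowed=%s",
                user["id"], contact["id"], allowed)
    return allowed
```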
Now at some point in this journey, as you scale the business, there's likely going to be a discussion around everyone's favorite term, microservices, because you want to start having specialized components that do different things. Maybe you have a Node service, a React front end, something running in Python to do recommendations, and some legacy components on .NET Framework maybe running in your stack, and you now have to break apart your architecture. What does that mean for that complicated business logic around authorization? It means whenever those requirements change, every code path across all those different languages and frameworks that you're using now needs to be updated.
So now you have some product manager or product owner or security team saying there are these new checks or permissions we need to put in. That then gets handed off to some dev (probably 'yeah, take a JIRA ticket, see you in three months') who has to take that business logic and rewrite it into whatever language the relevant services are in. And it gets a bit messy. The time to push out a change to authorization logic is going to get longer and longer, just because of the amount of code you're going to have to touch, the amount of translation from business requirements to code you're going to have to do, and then also the testing and rollout of all that.
A new approach to this: we've heard a lot through the talks today and yesterday, and generally in what's going on in the world right now, about this shift to having identity really locked down. You have lots of infrastructure tools out there to help you lock down your stack, everything from database-access level to Kubernetes RBAC, all that sort of thing. But the new approach, particularly the one I adopted and we've ended up building a business around, is decoupling that authorization logic out.
So in the world now where we're in this build-versus-buy, everything-as-a-service type system, the best parallel is the authentication space. Go back five, maybe ten years, and we'd all be spinning up databases with a username and password in a table somewhere, and you might remember to salt and hash it. Nowadays, you don't really do that; you just go and pick something off the shelf like Auth0 or Ping Identity [Laughter], one of those tools, and plug it in. So you don't have to waste time building, securing, and maintaining that system.
So, if we were to imagine: what's the authorization equivalent of that? Once someone's authenticated using Auth0, Okta, or some other identity provider, they're now inside of your system. They've got a token or something, they're authenticated, we have some user, and they're trying to access some resource and do some action upon it. Classic example: they're trying to do a CRUD operation on, say, a blog post. Should that action be allowed or not? And rather than writing that big if-else, case-switch style statement that we were building previously, we now have this authorization service.
So, the authorization service takes as input the user, the resource they're trying to access, and the action they're trying to do to it. And it also takes an input of policy: what are the conditions, what's the logic that should be met in order for a particular action to be allowed? And it simply returns back either an allow or a deny in most cases.
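Stripped to its essence, the contract is: given a principal, a resource, an action, and policy, return allow or deny. A toy evaluator to make that concrete (an illustration of the idea only, not how Cerbos is implemented):

```python
def check(principal: dict, resource: dict, action: str, policy: dict) -> str:
    """Toy decision function: the policy maps each action to the roles allowed
    to perform it, plus an optional condition over (principal, resource)."""
    rule = policy.get(action)
    if rule is None:
        return "DENY"  # deny by default: no rule, no access
    if not set(rule["roles"]) & set(principal["roles"]):
        return "DENY"  # RBAC part: principal holds none of the required roles
    cond = rule.get("condition")
    if cond is not None and not cond(principal, resource):
        return "DENY"  # ABAC part: attribute condition not satisfied
    return "ALLOW"
```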
So, what does that look like? We've seen a lot of Kubernetes talk, obviously, so this hopefully looks familiar to you as a Kubernetes-manifest-ish type of thing. The idea is to take that big if-else, case-switch style logic that we were building previously and abstract it out into policy rather than code, so that our authorization logic can now map to, for example, a contact resource inside of a CRM system. It defines the different actions that are possible and the different roles that a user must have, for typical RBAC-style things.
But most important is this attribute-based logic: ABAC. So here we have a condition for reading a particular contact, with a few conditions applied to it. We're not just checking whether someone's a user or not; we're now looking at particular attributes of that user, or particular attributes of the particular contact they're trying to access. So we're looking at whether the user's department is, say, sales, or whether the contact's active attribute is true.
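As a concrete sketch, a contact policy along those lines might look something like the following. The layout follows the Cerbos resource-policy format, but the specific roles and attributes here are illustrative:

```yaml
apiVersion: api.cerbos.dev/v1
resourcePolicy:
  version: "default"
  resource: "contact"
  rules:
    # RBAC: admins can perform any action on a contact.
    - actions: ["*"]
      effect: EFFECT_ALLOW
      roles: ["admin"]
    # ABAC: regular users can read a contact only if they are in the
    # sales department and the contact is active.
    - actions: ["read"]
      effect: EFFECT_ALLOW
      roles: ["user"]
      condition:
        match:
          all:
            of:
              - expr: request.principal.attr.department == "sales"
              - expr: request.resource.attr.active == true
```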
The best test, if you're trying to figure out whether RBAC or ABAC is applicable for your system, is that it's impossible to create an RBAC role of 'owner', because ownership of a resource is contextual: it depends on the resource you're trying to access. So even though you might have nice roles set up, in your request handler for mutating some resource you're still going to be checking whether someone is the owner of that resource by looking at whether the owner ID attribute of the resource matches the ID of the person making the request. If they match, they're the owner and they can do the thing; if not, they should get returned some sort of error. That's an attribute-driven policy decision.
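The ownership check itself is a one-line attribute comparison, which is exactly why it belongs in an ABAC condition rather than a static role (sketch; the `owner_id` attribute name is illustrative):

```python
def is_owner(principal: dict, resource: dict) -> bool:
    # Ownership is contextual: it depends on the resource being checked,
    # so it can't be expressed as a static RBAC role.
    return resource["owner_id"] == principal["id"]
```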
So, the other trend, as well as picking these specialized components off the shelf, is that we're in the world of sidecars, which has gone through a bit of a bumpy ride recently, with all the service meshes adopting sidecars and then getting rid of them. But it's a nice pattern. You have a cluster, and then you have these specialized services that sit inside your pod, or around it, and do particular things: authentication (the OAuth2 Proxy is probably the most famous of these), but also scraping metrics and logs, etc. This is a very common pattern.
And the really important thing here is the performance characteristics. Because sidecars are running alongside your application, speaking to them is one hop; it's right next to your system. When it comes to authorization, that's a really important characteristic. Because with authentication, when someone logs in, you issue a token for, say, 30 minutes or so. That's effectively cached; you don't have to keep checking it over and over again. Authorization, on the other hand, particularly if you're checking attributes of a resource, you actually need to be doing on every single call to your system, in every single API handler or request or whatever's going on.
So, being able to get an authorization decision back as fast as possible is key, because it's in the critical path of every request. It's not something you can cache, because the data is dynamic; the resources involved are dynamic. Using a pattern like sidecars allows you to have the authorization service running right alongside your application to get the fastest possible connection. Cerbos, for example, lets you do this over Unix sockets if you want.
So, what does this all actually look like in practice? Let me explain. We have our end users. They are interacting with your application, so say there's some front-end app or mobile app sending API calls over to your application. The application block represents your system. It could be a monolith, it could be a set of microservices, it could be a load of serverless functions; it doesn't really matter.
A request comes in. On that request, you know who the user is because they've got a JWT token or some sort of authenticated session; basically, from your authentication provider, you know who they are. And from that identity, you can go and fetch other attributes about them from, say, a directory service: what team they're in, what organization they're in, those sorts of things.
And the other thing your application knows is what resource they're trying to access. Take that CRM example from earlier: there's a request coming in, and they're trying to do a PUT or a PATCH to a contact resource. Your application knows from the request that they're trying to access a contact resource with a particular ID, and that they're trying to do a particular action on it.
So, your application is going to be interacting with your underlying database anyway to go and fetch the attributes of the particular resource they're trying to access. And this is where you would have had that big if-else, case-switch logic to work out whether a particular user can do the action or not.
In this authorization service model, you now package up that information about the principal (it could be a user, but equally it could be an API key or a machine token, whatever is relevant for your system), the resource they're trying to access (what kind of resource it is, its ID, the attributes about it that you fetched from your database), and the action they're trying to do: create, read, update, delete, or something more business-specific, approve, deny, mark as spam, whatever's relevant to your system.
That request goes over to the authorization service, and the authorization service is holding policy. Those policies could be in a database, they could be in an S3 bucket, or, our recommended approach, a Git repo. Because it's now configuration rather than code, you can treat it like you would your Kubernetes manifests, using GitHub, Flux, and those fun tools.
So that authorization service, which is running as that sidecar or some other service inside of your system, is going to look at that request of principal, resource, and action, evaluate it against the policies that are defined, and return back, in most cases, either an allow or a deny for the particular request. That comes back, and the complicated logic that was in your application before is now a single 'if' statement. If the authorization service has allowed it, do the thing. If not, return some sort of error.
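From the application side, that flow reduces to packaging up the principal, resource, and action, and branching on the answer. A hedged sketch against the Cerbos REST check endpoint; the endpoint path and payload shape follow the Cerbos HTTP API, but the IDs, attributes, and a running instance on localhost:3592 are assumptions for illustration:

```python
import json
import urllib.request

CERBOS_URL = "http://localhost:3592/api/check/resources"  # assumed local sidecar

def build_check_request(user: dict, contact: dict, action: str) -> dict:
    """Package principal, resource, and action for the authorization service."""
    return {
        "requestId": "demo-1",
        "principal": {
            "id": user["id"],
            "roles": user["roles"],
            "attr": {"department": user["department"]},
        },
        "resources": [{
            "resource": {
                "kind": "contact",
                "id": contact["id"],
                "attr": {"active": contact["active"]},
            },
            "actions": [action],
        }],
    }

def is_allowed(payload: dict) -> bool:
    # Requires a running Cerbos instance. The complicated logic in the app
    # is now just this single branch on the service's answer.
    req = urllib.request.Request(
        CERBOS_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    actions = result["results"][0]["actions"]
    return all(effect == "EFFECT_ALLOW" for effect in actions.values())
```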
This model has two distinct advantages. One is that, because the logic is now extracted out into this repository store, when the requirements change, and they will, I guarantee it, you don't have to go back into your codebase or into your myriad of microservices to change the logic. You just change it in this policy repository.
So, those YAML manifests, in the case of Cerbos, can be updated in one place. You can check them into that Git repo, you can write unit tests against them, you can make sure they're valid, etc. And then all the instances of that authorization service will pull down the change and start serving based upon it. So that now means that whoever owns the requirements for authorization, generally that product person or security team, has one point where they can update the logic. It gets picked up, and, without changing any of your application code or any of your microservices or however you deploy your app, that change in authorization logic gets pushed out and these systems serve based on the new logic.
The other big one is that there's now a single point where all the authorization checks are being made. You get this advantage regardless of where the checks come from: backends, front ends, some async jobs, some batch queue, it doesn't really matter. Because all the checks are going through one of these authorization systems, you're going to get a clean and consistent audit log: at this time, this principal did this action on this resource, and it was allowed or denied by a particular policy. If you're working in any sort of regulated industry, and there are a lot of Cerbos users out there in fintech and insurtech, we've got the largest telecom provider in the UK using Cerbos, and Blockchain.com's trading platform uses Cerbos underneath, that audit log helps you meet the regulatory requirements you're going to have in those verticals.
So, to simplify it down to a single slide: before, you would have had some sort of complicated logic like this, and whenever it changed, you'd have to go and touch that code inside of your system. In the new world, you can use something like Cerbos to do that check. Now, throughout your codebase, you're not replicating that logic over and over. It's all in that service, which you just call out to, ideally deployed right alongside your application, preferably inside the pod, to get the response. It simply returns back 'allow' or 'deny', as well as giving you that nice audit log of the checks.
So, just to wrap it up, the advantages of this approach: your logic is now defined centrally. There's a single place anyone can reference to see how a particular permission or a particular action is controlled and defined. Policies can now evolve independently of the codebase. It's agnostic to any particular language or framework or architecture, so if you want to go and spin up some new services in Rust, or put something in the new language of the month here and there, as long as it can talk to this authorization system (in the case of Cerbos, it's got both REST and gRPC APIs), you're going to be able to check those permissions and get that response back. By having that logic extracted out into configuration policies rather than code, it fits very nicely into the widely adopted, and I think loved, model of the GitOps workflow. And you get that audit trail, which is so important in particular industries and verticals.
Now, it's not all upside, obviously. There's another service you're going to have to run, and deploy, and scale; in the case of Cerbos, we've put a lot of effort into making sure this thing is as small as possible and can run anywhere. Cerbos itself is fully stateless, so you can even run it as an ephemeral Lambda if you really want to, to check permissions. One of our users actually ships physical ATM machines out into the world with Cerbos running on them, which is cool, but also kind of scary for updates, at least. There is going to be a new DSL for writing policies in some cases. For writing those conditions that you saw, where it was checking particular attribute values, with Cerbos we're trying to use an off-the-shelf, open-source solution: it uses Common Expression Language, which is an open-source expression language from Google, basically. And there's always going to be a new component in the critical path, but again, through thoughtful consideration of how you deploy things and making sure it scales with your application (sidecars are a nice model for this), there are ways to mitigate that. From my team's experience, this is something that's worth putting in place compared to having to go and update your code all over the place over time as the requirements change.
So, I've mentioned Cerbos a few times. Cerbos.dev is the website. It's completely open source, Apache 2. You can go off and use it. We've got a few hundred users that we know about, at least, from the telemetry that's in there, but there are definitely more. We just passed a thousand GitHub stars last week, which makes me happy. I've also got a lot of t-shirts with me if you want some swag. We have some case studies with some of the brands you've seen there as well, even just for the open-source version, and we have plans for a managed control plane, which you can find out more about at Cerbos.dev/next. That is me, any questions?
Cerbos is stateless, and there's no persistence in it. All the context about a principal or a resource must be provided in the request. That's an architectural decision we made very early on. Our team's background is in businesses doing high-throughput, low-latency systems. At my last company, we were doing about 25 billion requests a day through our pipeline, and having to manage and synchronize caches, persisting resources and principals, just doesn't really work at that scale, because of distributed systems, etc. So we made the decision early on that the application doing the permission check is responsible for getting the state of the resource or principal before doing a check. So it's a very quick response, because there's no need to fetch from a database or anything like that inside the authorization system.
The alternative model is to have the persistence inside of, say, Cerbos. If it then had to go and fan out and fetch state across different databases when you change a policy, that could lead to some sort of unbounded load across parts of your system that you haven't really planned for, just because of a policy change. It's a decision we made very early on, and it works at the scale of Blockchain.com's trading platform, for example.
The policies themselves can be stored. Yes, they need to be stored somewhere. We recommend using a git repo, and the Cerbos instance will pull that down on whatever basis you want. You can bake them into the image, or you can pull them from an S3 bucket, or even use a database. But in most cases, it's easier to have your policy logic static in something like S3 or a git repo.
Cerbos pulls it down on some schedule and then takes that YAML, converts it into a much more performant format internally, holds it in memory, and evaluates against that. We are building a managed control plane that will handle the distribution of those policies in a synchronized manner rather than having each instance pull it down on their own schedule.
Sure. Cerbos itself exposes a REST and a gRPC interface. We have SDKs that make it nicer to work with for most of the major languages now: Java, Go, etc. The service itself is written in Go.
Book a free Policy Workshop to discuss your requirements and get your first policy written by the Cerbos team