Kubernetes plays a critical role in driving business agility and resilience as enterprises work to modernize applications and infrastructure. Originally designed for stateless workloads, Kubernetes has evolved into a new enterprise control plane that orchestrates both applications and the infrastructure they rely on.
Early architectures using Kubernetes kept the data and state outside of the application clusters. Over time, the industry realized that this meant maintaining multiple control planes: one using Kubernetes for stateless containers, and others to manage databases and state. This diminished the power and agility that Kubernetes offers to developers, namely a self-service, fully automated experience. Running databases on Kubernetes unlocks the full agility of the platform, which is best achieved when it can control all of the infrastructure and application resources. In fact, recent Portworx research cited increasing agility (58%) and increasing resilience (52%) as the biggest drivers behind IT teams' decision to build and deploy stateful applications on Kubernetes.
These benefits aren't going unnoticed, but it is still early days. Despite success among early adopters, there are still few widely known best practices for running databases on Kubernetes. Let's take a look at the current landscape and why more and more enterprises are looking to run databases on Kubernetes.
The current state of Kubernetes
The majority of Kubernetes deployments today start out stateless, with databases outside. Only as a project matures and moves closer to production do the databases start residing inside Kubernetes.
For a smaller project, keeping a lone database outside Kubernetes and treating it as a special resource might make sense. But most modern enterprise projects rely on many different databases, data structures, and data sources. Keeping those databases outside of Kubernetes means that part of your control plane resides in Kubernetes, where your application logic is running, and another control plane is managing the databases. With two different ways of managing the two components, you end up resorting to the lowest common denominator. Whatever you are using to manage those databases becomes your agility fulcrum, and you start losing the power of what Kubernetes can provide.
To unlock the full potential of Kubernetes, including the agility, programmatic infrastructure, and self-service, API-driven control plane that it provides, you need to let Kubernetes manage not just the application logic components (the stateless containers) but also the infrastructure resources that logic depends on: networking, security, and, obviously, databases.
A cloud-native data processing stack example
To put this in context, let's consider a typical data processing pipeline that forms the backbone of many cloud-native application stacks, whether it is for a financial institution doing credit card fraud processing or an IoT application running anomaly detection on sensor data.
A typical pipeline for such a stack involves a messaging system with a publisher and subscriber model, such as Kafka. Here the data is processed and transformed, along with some form of metadata tagging. The transformed data, along with the metadata, is then stored in some form of structured database (SQL or NoSQL). This data is accessed by the nearline (front-end-facing) part of the application stack, which runs some of the business logic (whether that is the actual fraud detection algorithm or anomaly detection in the IoT case). In many cases, residual data ends up in some sort of larger searchable database where machine learning systems, human operators, or both perform analytics.
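To make the stages concrete, here is a minimal sketch of such a pipeline in plain Python. It is an illustration only: the in-memory queue stands in for a message bus such as Kafka, the lists stand in for the structured and searchable databases, and the names `Record`, `transform`, `detect_anomalies`, and `run_pipeline`, along with the two-standard-deviation threshold, are assumptions made for this example rather than part of any real stack.

```python
from collections import deque
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Record:
    sensor_id: str
    value: float
    tags: dict = field(default_factory=dict)


def transform(raw: dict) -> Record:
    """Processing stage: normalize the raw message and attach metadata tags."""
    record = Record(sensor_id=raw["sensor_id"], value=float(raw["value"]))
    record.tags["unit"] = raw.get("unit", "unknown")
    return record


def detect_anomalies(store: list, threshold: float = 2.0) -> list:
    """Nearline stage: flag values more than `threshold` std devs from the mean."""
    values = [r.value for r in store]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [r for r in store if abs(r.value - mu) / sigma > threshold]


def run_pipeline(messages: list) -> tuple:
    queue = deque(messages)   # stand-in for the message bus
    store = []                # stand-in for the structured database
    while queue:
        store.append(transform(queue.popleft()))
    anomalies = detect_anomalies(store)
    archive = list(store)     # stand-in for the searchable analytics store
    return store, anomalies, archive


# Hypothetical sensor readings; the last one is an obvious outlier.
readings = [
    {"sensor_id": "s1", "value": v, "unit": "C"}
    for v in [20, 21, 19, 20, 22, 21, 20, 95]
]
store, anomalies, archive = run_pipeline(readings)
print([r.value for r in anomalies])
```

Even in this toy form, three of the four stages hold state (the queue, the store, and the archive), which is the point of the example.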
As you can see, such a stack involves multiple stateful components. Having these components managed by Kubernetes allows DevOps teams to bring up these services on demand in a programmatic way, without having to resort to other mechanisms to manage and operate the databases. Having them managed by Kubernetes especially eases operational complexity when a large number of these pipelines are running at any one point in time.
Furthermore, in production environments, you have to plan for failures - hardware and software. Servers can fail, and drives can go bad. Having Kubernetes manage the resiliency and availability of the applications and data ensures an agile and robust application and infrastructure stack - one that does not rely on humans to keep things running.
Storage purpose-built for Kubernetes and application awareness
Trying to force traditional storage technologies directly into Kubernetes is generally seen as a mistake. Doing so usually compromises the agility and performance of the stack since traditional storage is typically designed for machine-centric workloads rather than cloud-native deployments. Imagine trying to manage the example pipeline above by manually provisioning storage volumes directly from a SAN or NAS backend - with humans doing the volume provisioning, data placement, and fixing failures.
Instead, you need a storage backend that understands Kubernetes and what the given application composition looks like. Modern, cloud-native databases are typically not singletons; they are composed of a number of different containers. For example, a modest Cassandra deployment may involve at least six different containers. When Kubernetes deploys this Cassandra stack, it must communicate with the storage backend about what it is trying to do and what the application's composition and requirements are. By working tightly with the storage backend, Kubernetes can identify the correct machines to run the database containers on and select the correct type of storage resources to provide. It has to take failure domains into account and ensure that the six containers don't contend for the same resources, because contention would degrade the performance of the overall Cassandra database. All of this decision-making is needed for a performant and robust operational environment for this example Cassandra deployment, and as you can see, it involves bidirectional communication and decision-making between Kubernetes and the storage provider.
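The kind of placement decision described here can be sketched in a few lines of Python. This is a simplified greedy heuristic written for illustration, not how Kubernetes or any particular storage provider actually schedules. The `Node` fields, the `place_replicas` function, and the cluster inventory are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    zone: str           # failure domain
    free_gib: int       # storage capacity still available
    storage_class: str  # e.g. "ssd" or "hdd"


def place_replicas(nodes, replicas, size_gib, storage_class):
    """Greedy placement: one replica per node, spread evenly across zones."""
    # Only nodes with the right storage type and enough capacity qualify.
    eligible = [
        n for n in nodes
        if n.storage_class == storage_class and n.free_gib >= size_gib
    ]
    if len(eligible) < replicas:
        raise RuntimeError("not enough eligible nodes for the requested replicas")
    per_zone = Counter()
    placement = []
    for _ in range(replicas):
        # Prefer the least-loaded zone, then the node with the most free space.
        best = min(eligible, key=lambda n: (per_zone[n.zone], -n.free_gib, n.name))
        eligible.remove(best)  # never put two replicas on the same node
        placement.append(best)
        per_zone[best.zone] += 1
    return placement


# Hypothetical inventory: three zones, one HDD node, one node that is too small.
cluster = [
    Node("a1", "zone-a", 500, "ssd"), Node("a2", "zone-a", 400, "ssd"),
    Node("a3", "zone-a", 900, "hdd"),
    Node("b1", "zone-b", 500, "ssd"), Node("b2", "zone-b", 300, "ssd"),
    Node("b3", "zone-b", 50, "ssd"),
    Node("c1", "zone-c", 600, "ssd"), Node("c2", "zone-c", 200, "ssd"),
    Node("c3", "zone-c", 450, "ssd"),
]
placement = place_replicas(cluster, replicas=6, size_gib=100, storage_class="ssd")
zone_counts = Counter(n.zone for n in placement)
print([n.name for n in placement], dict(zone_counts))
```

The scheduler can only spread replicas across failure domains if the storage layer reports which nodes have the right capacity and media type, which is why the communication has to run in both directions.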
Design for density and velocity
Architecting for density, or volume, is important, and it requires a layer that can provision and dynamically manage storage based on the application. Why? In the cloud-native ecosystem, multiple databases generally run in each of the dev, test, and production environments. Replicating those environments multiplies the number of databases and resources running at any one time.
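As a rough sketch of what programmatic provisioning looks like at this scale, the Python below generates one Kubernetes PersistentVolumeClaim manifest per database per environment. The manifest fields follow the standard PersistentVolumeClaim schema, but the storage class name `replicated-ssd`, the database list, and the per-environment sizes are assumptions made purely for illustration.

```python
import json

# Hypothetical sizing per environment, in GiB.
ENV_SIZES_GIB = {"dev": 10, "test": 50, "prod": 500}
# Hypothetical stateful components from the example pipeline.
DATABASES = ["kafka", "cassandra", "elasticsearch"]


def volume_claim(app: str, env: str, size_gib: int, storage_class: str) -> dict:
    """Build a PersistentVolumeClaim manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{app}-data", "namespace": env, "labels": {"app": app}},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,  # assumed class name, see above
            "resources": {"requests": {"storage": f"{size_gib}Gi"}},
        },
    }


claims = [
    volume_claim(app, env, size, "replicated-ssd")
    for env, size in ENV_SIZES_GIB.items()
    for app in DATABASES
]
print(len(claims))
print(json.dumps(claims[0], indent=2))
```

Three databases across three environments already produce nine claims. Multiply that by the number of pipelines and teams, and it becomes clear why volumes need to be provisioned dynamically rather than by hand.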
Avoid lock-in. Avoid human provisioning
Lock-in to a deployment is a trap worth mentioning. For example, I've seen practitioners take Kubernetes, a cloud-agnostic control plane, and then tie a project's state to cloud-hosted resources. This ultimately locks you into a certain deployment, location, or architecture, so that you can no longer benefit from a seamless cloud experience. It is also important to avoid manually managing how state is allocated to the databases that Kubernetes manages. When humans intervene manually while Kubernetes is trying to do something in an automated fashion, the result is half programmatic and half manual, which in effect makes the whole process manual.
What's next for the stateful database stack?
A common set of patterns is emerging as practitioners converge around data structures for pipelines like my example above (message queue, followed by structured database, followed by search and analytics). Organizations are starting to look for a simple way to deploy and manage an entire database stack in a single push-button manner. And we're just at the beginning of tapping the potential of Kubernetes to provide a seamless and automated experience to developers. Over the next few years, expect to see Kubernetes offer a richer and more native database-as-a-service experience.
Gou Rao is co-founder and CTO of Portworx by Pure Storage.