Kubernetes is rapidly becoming the most important infrastructure platform in the modern IT environment. However, with the power of Kubernetes comes extreme complexity, raising major operational challenges.
DevOps and IT operations teams are scrambling to find ways to operate Kubernetes reliably, detect errors and fix them in a timely manner. The first step in this process is monitoring—teams must establish visibility into containerized environments. Knowing what is running, and gaining access to basic operational metrics, is the first building block for running a robust, enterprise-grade Kubernetes environment.
Why Is Kubernetes Monitoring Important?
There are many benefits to implementing a Kubernetes monitoring strategy, including:
- Troubleshooting and reliability—Kubernetes applications are often complex, especially if they are cloud-based or use a microservices architecture. This makes it difficult to identify the root cause of issues. Monitoring measures can offer you visibility over your Kubernetes deployment, so you can see where issues may arise (or have occurred), allowing you to prevent and remediate problems.
- Performance tuning—understanding the ins and outs of a Kubernetes cluster allows you to make informed decisions about your hardware configurations, to ensure the high performance of your applications.
- Cost management—it is essential to keep track of the resources you consume to ensure that you are not over-provisioned. If your Kubernetes applications run on public cloud infrastructure, you need to know how many nodes you are running, because nodes are typically what you are billed for.
- Chargebacks and showbacks—in certain situations, you may want to know which teams or groups have used specific resources. Kubernetes monitoring provides the necessary usage information for cost analysis and chargeback purposes.
- Security—an essential capability in a modern computing environment is to see what jobs are running and where. This allows you to identify unauthorized or unnecessary jobs that may indicate a breach or a denial-of-service (DoS) attack. While Kubernetes monitoring won’t address every security issue, it can provide crucial information for maintaining security.
All of these benefits depend on having sufficient visibility into both the cluster and the applications running on it.
Top Kubernetes Metrics to Monitor
Kubernetes Cluster Metrics
Monitoring a Kubernetes cluster as a whole helps you understand which components affect its health. For example, you can learn how many resources the cluster uses in total and how many applications run on each node. You can also learn whether your nodes are working well and how close they are to capacity.
Here are several useful metrics to monitor:
- Node resource utilization—metrics such as network bandwidth, memory and CPU utilization, and disk utilization. You can use these metrics to find out if you should decrease or increase the number and size of cluster nodes.
- The number of nodes—this metric can help you learn what resources are being billed by the cloud provider and discover how the cluster is used.
- Running pods—by tracking the number of running pods, you can understand whether the available nodes are sufficient for current workloads, and whether the remaining nodes could absorb those workloads if a node fails.
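The cluster-level metrics above can be combined into a simple capacity check. The following is a minimal sketch, not a production tool: the `Node` class, its field names, and the sample figures are all hypothetical, and in practice the numbers would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical per-node snapshot; values would come from your monitoring tool."""
    name: str
    cpu_capacity: float  # CPU cores the node can offer to pods
    cpu_used: float      # CPU cores currently in use
    pods: int            # pods currently running on the node


def cpu_utilization(node: Node) -> float:
    """Fraction of the node's CPU capacity that is in use."""
    return node.cpu_used / node.cpu_capacity


def survives_node_failure(nodes: list[Node]) -> bool:
    """True if the remaining nodes could carry the whole current CPU load
    after any single node is lost (an aggregate check only)."""
    total_used = sum(n.cpu_used for n in nodes)
    total_capacity = sum(n.cpu_capacity for n in nodes)
    return all(total_capacity - n.cpu_capacity >= total_used for n in nodes)


# Illustrative sample data, not real measurements.
nodes = [
    Node("node-a", cpu_capacity=4.0, cpu_used=3.0, pods=20),
    Node("node-b", cpu_capacity=4.0, cpu_used=2.0, pods=15),
    Node("node-c", cpu_capacity=4.0, cpu_used=1.0, pods=10),
]

for n in nodes:
    print(f"{n.name}: {cpu_utilization(n):.0%} CPU, {n.pods} pods")
print("tolerates one node failure:", survives_node_failure(nodes))
```

This check only compares totals. It ignores memory, scheduling constraints, and how pods actually pack onto nodes, so treat a positive result as a rough signal rather than a guarantee.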
Kubernetes Pod Metrics
The process of monitoring a Kubernetes pod can be divided into three components:
- Kubernetes metrics—these allow you to monitor how an individual pod is being handled and deployed by the orchestrator. You can monitor information such as the number of instances of a pod running at a given moment compared to the expected number of instances (a lower number may indicate the cluster has run out of resources). You can also see in-progress deployments (the number of instances being switched to a newer version), check the health of your pods, and view network data.
- Pod container metrics—these are mostly collected by cAdvisor, which is built into the kubelet on each node and reports on the containers running there. Older clusters exposed them through Heapster, which has since been retired in favor of Metrics Server. Important metrics include network, CPU, and memory usage, which can be compared with the maximum usage permitted.
- Application-specific metrics—these are developed by the actual application itself and relate to specific business rules. A database application, for example, will likely expose metrics on the state of an index, as well as relational statistics, while an eCommerce application might expose the data on the number of customers online and the revenue generated in a given timeframe. The application directly exposes these types of metrics, and you can link the app to a monitoring tool to track them more closely.
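The comparison between running and expected instances described above can be sketched in a few lines. This is an illustration under stated assumptions: the `DeploymentStatus` class and the `describe` function are hypothetical, with fields that loosely mirror the desired, available, and updated replica counts a Deployment reports.

```python
from dataclasses import dataclass


@dataclass
class DeploymentStatus:
    """Hypothetical snapshot of a deployment's replica counts."""
    name: str
    desired: int    # instances requested
    available: int  # instances currently available
    updated: int    # instances already running the newest version


def describe(status: DeploymentStatus) -> str:
    """Summarize the deployment state from its replica counts."""
    if status.available < status.desired:
        missing = status.desired - status.available
        return f"{status.name}: degraded, {missing} replica(s) missing"
    if status.updated < status.desired:
        return (f"{status.name}: rollout in progress "
                f"({status.updated}/{status.desired} updated)")
    return f"{status.name}: healthy"


# Illustrative values only.
print(describe(DeploymentStatus("web", desired=3, available=2, updated=3)))
print(describe(DeploymentStatus("api", desired=4, available=4, updated=1)))
print(describe(DeploymentStatus("db", desired=1, available=1, updated=1)))
```

A persistently degraded state is the case worth alerting on, since it may mean the cluster has run out of resources. A rollout in progress is normally transient.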
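Kubernetes State Metrics
kube-state-metrics is a service that listens to the Kubernetes API server and generates metrics about the state of cluster objects, including pods, nodes, namespaces, and DaemonSets. It serves these metrics as plain text on an HTTP endpoint (/metrics), in a format that Prometheus-compatible scrapers can consume, rather than through the Kubernetes Metrics API.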
Here are several aspects you can monitor using state metrics:
- Persistent Volumes (PVs)—a PV is a storage resource specified on the cluster and made available as persistent storage for any pod that requests it. A pod requests storage through a PersistentVolumeClaim (PVC), and the PV stays bound to that claim during its lifecycle. When the claim is no longer needed and is deleted, the PV is reclaimed according to its reclaim policy. Monitoring PVs can help you learn when reclamation processes fail, which signifies that something is not working properly with your persistent storage.
- Disk pressure—a node condition that occurs when the available disk space on a node falls below a configurable threshold. Monitoring this condition can help you learn whether the application truly requires additional disk space or is filling up the disk prematurely in an unanticipated manner.
- Crash loop—happens when a pod starts, crashes, and then gets stuck in a loop of continuously trying to restart without success (Kubernetes reports this as CrashLoopBackOff). While a crash loop persists, the application cannot run. It may be caused by an application crashing within the pod, a pod misconfiguration, or a deployment issue. Since there are many possible causes, debugging a crash loop can be tricky. You need to learn of the crash immediately so that you can mitigate it quickly or take emergency measures that keep the application available.
- Jobs—components designed to run pods temporarily. A job runs its pods until they complete their task, after which the pods terminate. Sometimes, though, jobs do not complete successfully. This may happen because a node was rebooted or crashed, or because resources were exhausted. Monitoring job failures can help you learn when work your application depends on did not run.
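The state checks above can be sketched as a small scan over scraped metrics text. This is a minimal sketch under several assumptions: the sample is hand-written rather than taken from a real cluster, the metric and label names follow kube-state-metrics naming as I understand it and should be verified against the version you run, and the parser handles only simple lines (no timestamps, no escaped quotes in label values).

```python
import re

# Hand-written sample in the Prometheus text format. Metric and label names
# are assumptions based on kube-state-metrics naming; verify for your version.
SAMPLE = """\
kube_pod_container_status_waiting_reason{namespace="shop",pod="cart-7d9f",container="cart",reason="CrashLoopBackOff"} 1
kube_node_status_condition{node="node-a",condition="DiskPressure",status="true"} 1
kube_node_status_condition{node="node-b",condition="DiskPressure",status="true"} 0
kube_persistentvolume_status_phase{persistentvolume="pv-data",phase="Failed"} 1
kube_job_status_failed{namespace="batch",job_name="nightly-report"} 2
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Yield (metric_name, labels, value) for each simple sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def find_problems(text):
    """Return human-readable findings for the four state checks."""
    problems = []
    for name, labels, value in parse(text):
        if value <= 0:
            continue
        if (name == "kube_pod_container_status_waiting_reason"
                and labels.get("reason") == "CrashLoopBackOff"):
            problems.append(f"crash loop: {labels['namespace']}/{labels['pod']}")
        elif (name == "kube_node_status_condition"
                and labels.get("condition") == "DiskPressure"
                and labels.get("status") == "true"):
            problems.append(f"disk pressure: {labels['node']}")
        elif (name == "kube_persistentvolume_status_phase"
                and labels.get("phase") == "Failed"):
            problems.append(f"failed PV: {labels['persistentvolume']}")
        elif name == "kube_job_status_failed":
            problems.append(f"failed job: {labels['namespace']}/{labels['job_name']}")
    return problems


for problem in find_problems(SAMPLE):
    print(problem)
```

In a real setup you would more likely express these conditions as alert rules in your monitoring system than parse the text yourself. The sketch only shows which signals correspond to each item in the list.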
Kubernetes Container Metrics
You should monitor container metrics to ensure containers are properly utilizing resources. These metrics can help you understand whether you are reaching a predefined resource limit and detect pods that are stuck in CrashLoopBackOff.
Here are several container metrics that you should monitor:
- Container CPU usage—learn how much CPU your containers are using in relation to the resource limits you have defined.
- Container memory utilization—discover how much memory your containers are using in relation to the resource limits you have defined.
- Network usage—detect sent and received data packets as well as how much bandwidth is being used.
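Comparing usage against limits mostly comes down to converting Kubernetes resource quantities into plain numbers. The sketch below assumes a small subset of the quantity suffixes ("m" for millicores, plus Ki, Mi, Gi, k, M, and G for memory). The function names and the 90% threshold are hypothetical choices for illustration.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "500m" or "2" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


# Subset of memory suffixes handled by this sketch.
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi" or "1G" to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def usage_ratio(used: float, limit: float) -> float:
    """Fraction of the limit currently in use."""
    return used / limit


def near_limit(used: float, limit: float, threshold: float = 0.9) -> bool:
    """True if usage has reached the (assumed) warning threshold."""
    return usage_ratio(used, limit) >= threshold


# Illustrative values only.
cpu = usage_ratio(parse_cpu("450m"), parse_cpu("500m"))
memory = usage_ratio(parse_memory("96Mi"), parse_memory("128Mi"))
print(f"CPU at {cpu:.0%} of limit, memory at {memory:.0%} of limit")
```

A container that stays near its memory limit risks being killed when it exceeds it, while one near its CPU limit is throttled, so the two ratios usually deserve different alert thresholds.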
Application Metrics
Application metrics help you measure the availability and performance of the applications running in pods. The business scope of the application determines the type of metrics provided. Here are several important metrics:
- Application availability—can help you measure the uptime and response times of the application. This metric can help you assess whether users are getting the experience and performance you expect.
- Application health and performance—can help you learn about performance issues, latency, responsiveness, and other user experience issues. This metric can surface errors that should be fixed within the application layer.
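Availability and response time can both be derived from a log of health checks or requests. The following sketch uses hypothetical sample data and a nearest-rank percentile, which is one of several common ways to compute percentiles. A monitoring tool may use a different method.

```python
import math


def availability(check_results):
    """Fraction of health checks that succeeded."""
    return sum(1 for ok in check_results if ok) / len(check_results)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative data: 20 health checks and 10 response times in milliseconds.
checks = [True] * 19 + [False]
latencies_ms = [120, 80, 95, 300, 110, 105, 90, 100, 85, 115]

print(f"availability: {availability(checks):.1%}")
print(f"median latency: {percentile(latencies_ms, 50)} ms")
print(f"p95 latency: {percentile(latencies_ms, 95)} ms")
```

High percentiles are often more informative than averages here, because a single slow request (300 ms in the sample) barely moves the mean but is exactly what some users experience.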
Conclusion
In this article, I explained why Kubernetes monitoring is critical for operating this mission-critical system reliably. I discussed five types of metrics teams need access to in order to manage Kubernetes environments successfully:
- Cluster metrics—including the number of nodes, running pods, and node utilization
- Pod metrics—including the number of instances of a pod compared to expected instances, in-progress deployments, and pod health
- State metrics—including persistent volumes, disk pressure, and crash loop metrics
- Container metrics—including CPU and memory utilization and network usage
- Application metrics—including application availability, performance, and business-specific metrics exposed by an application running on Kubernetes
I hope this will be of help as you improve the maturity and reliability of your organization’s Kubernetes deployments.