Overprovisioning Archives

How to overprovision the EKS cluster?

In this article, I will walk you through overprovisioning an EKS cluster, step by step. In a later section, we will test that the overprovisioning actually works.

If you want to understand what overprovisioning is, I recommend reading my previously published article on overprovisioning.

Let’s get started.

Prechecks

Ensure your setup meets the prerequisites below. It should, unless you are running ancient infrastructure 🙂

Ensure you are running Kubernetes 1.14 or later, since pod priority and preemption became generally available (stable) in version 1.14.
Verify that Cluster Autoscaler’s priority cutoff is set to -10. This has been the default since version 1.12.
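
If you want to check or pin the cutoff explicitly, it is controlled by the --expendable-pods-priority-cutoff flag of Cluster Autoscaler. The fragment below is only a sketch of where that flag sits; the container name and the other arguments are illustrative assumptions, and your own Cluster Autoscaler deployment will carry more flags than shown here.

```yaml
# Fragment of a Cluster Autoscaler Deployment (illustrative, not a complete manifest).
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler        # assumed container name
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            # Pods with a priority below this value are treated as expendable:
            # they do not trigger scale-up and do not block scale-down.
            # -10 is already the default, so setting it only makes it explicit.
            - --expendable-pods-priority-cutoff=-10
```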

The manifests in this article contain only the bare minimum specification. Modify them to suit your requirements, such as a non-default namespace or custom labels and annotations. How you deploy them is up to you: the simple way is kubectl apply -f manifest.yaml, and the more involved way is through Helm charts, ArgoCD apps, etc.

Defining the `PriorityClass`

In Kubernetes, you can set a custom priority for pods using a PriorityClass. To configure overprovisioning, you need a PriorityClass with a value lower than zero, because the default pod priority is zero. This gives the pause pods a lower priority than regular workloads and ensures they are preempted when those workloads need the capacity. The value must also stay at or above the Cluster Autoscaler priority cutoff (-10), because pods below the cutoff are treated as expendable and do not trigger a scale-up. To deploy this PriorityClass on your cluster, use the following manifest:

apiVersion: scheduling.k8s.io/v1
description: This priority class is for overprovisioning pods only.
globalDefault: false
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1

Define Autoscaler strategy

A ConfigMap defines the autoscaler policy for the overprovisioning deployment. How the replica count is calculated is explained in the “Pause pod calculations” section of my article on the basics of overprovisioning. Use the manifest below:

apiVersion: v1
data:
  linear: |-
    {
      "coresPerReplica": 1,
      "nodesPerReplica": 1,
      "min": 1,
      "max": 50,
      "preventSinglePointFailure": true,
      "includeUnschedulableNodes": true
    }
kind: ConfigMap
metadata:
  name: overprovisioning-cm

RBAC Config

Next comes the RBAC configuration, which has three components: ServiceAccount, ClusterRole, and ClusterRoleBinding. Together they give the autoscaler deployment the access it needs to resize the pause pod deployment as scaling requires. Use the manifest below:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: overprovisioning-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: overprovisioning-cr
rules:
  - apiGroups:
      - ''
    resources:
      - nodes
    verbs:
      - list
      - watch
  - apiGroups:
      - ''
    resources:
      - replicationcontrollers/scale
    verbs:
      - get
      - update
  - apiGroups:
      - extensions
      - apps
    resources:
      - deployments/scale
      - replicasets/scale
    verbs:
      - get
      - update
  - apiGroups:
      - ''
    resources:
      - configmaps
    verbs:
      - get
      - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: overprovisioning-rb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: overprovisioning-cr
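subjects:
  - kind: ServiceAccount
    name: overprovisioning-sa
    namespace: XYZ  # required for ServiceAccount subjects; use the namespace the ServiceAccount is created in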

Pause pods deployments

Creating pause pods is easy. You can use any image, including a custom one, that gives you a healthy, idle pod to act as a placeholder in the cluster. The size of this pod, i.e. its CPU and memory configuration, can be adjusted to your needs. Calculate the size carefully so that the pause pods reserve the right amount of cluster resources.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: overprovisioning
  name: overprovisioning
spec:
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
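      containers:
        - image: nginx  # any image that yields a healthy, idle pod works
          name: pause
          resources:
            limits:
              cpu: Ym       # placeholder: replace Y with your CPU limit in millicores
              memory: YMi   # placeholder: replace Y with your memory limit in MiB
            requests:
              cpu: Xm       # placeholder: replace X with your CPU request in millicores
              memory: XMi   # placeholder: replace X with your memory request in MiB
      priorityClassName: overprovisioning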

Autoscaler deployment

Next, deploy the autoscaler. Its pods manage the replica count of the pause pod deployment above, following the linear strategy defined in the ConfigMap. As the cluster grows or shrinks, the autoscaler adds or removes pause pod replicas accordingly. Deploy it with the manifest below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-as
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning-as
  template:
    metadata:
      labels:
        app: overprovisioning-as
    spec:
      containers:
        - command:
            - /cluster-proportional-autoscaler
            - '--namespace=XYZ'
            - '--configmap=overprovisioning-cm'
            - '--target=deployment/overprovisioning'
            - '--logtostderr=true'
            - '--v=2'
          image: gcr.io/google_containers/cluster-proportional-autoscaler-amd64:{LATEST_RELEASE}
          name: autoscaler
      serviceAccountName: overprovisioning-sa

You now have an overprovisioning mechanism that keeps more capacity in your cluster than the workloads currently need. To verify that it works correctly, run the test below.

Testing the functionality

To avoid scaling the entire cluster, run the test on a single node by using affinity rules in the pod spec. Identify a node that is running pause pods and direct the new test pods to that node.
Do not set a priorityClassName on this test deployment; Kubernetes will then assign its pods a priority of 0. Our pause pods have a priority of -1, which is lower than that of regular workload pods and of these test pods.
Creating this deployment should trigger the eviction of the pause pods on the chosen node to make room for the higher-priority test pods.

The pause pods should be terminated, and the new test pods should quickly move to the Running state on that node. The terminated pause pods are then recreated in the Pending state by their ReplicaSet and look for another node to run on.
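
A test deployment along these lines might look like the sketch below. It is not the exact manifest used for this article: the deployment name, replica count, and resource requests are hypothetical, NODE_NAME is a placeholder for the node you picked, and the pods are pinned with node affinity on the kubernetes.io/hostname label, which is just one way to target a single node. Size the requests so that the test pods cannot fit on the node unless pause pods are evicted.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-test          # hypothetical name
spec:
  replicas: 2                          # hypothetical; enough to exceed the node's free capacity
  selector:
    matchLabels:
      app: overprovisioning-test
  template:
    metadata:
      labels:
        app: overprovisioning-test
    spec:
      # No priorityClassName here, so the pods get the default priority of 0,
      # which is higher than the pause pods' priority of -1.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - NODE_NAME      # placeholder: a node currently running pause pods
      containers:
        - name: test
          image: nginx
          resources:
            requests:
              cpu: 500m                # hypothetical values; adjust to your node size
              memory: 256Mi
```

While the deployment rolls out, watching the pods on that node (for example with kubectl get pods -o wide -w) should show the pause pods terminating and the test pods starting.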

Basics of Overprovisioning in EKS Cluster

This article covers the fundamental concepts of overprovisioning in a Kubernetes cluster. We will look at what overprovisioning is, why it is needed, and how to calculate the values involved. So, without further delay, let’s dive right in.

Need for Overprovisioning

Overprovisioning is a method of preparing your cluster for future demand from the hosted applications, so that a lack of capacity does not become a bottleneck.

Consider a scenario in which an application hosted on Kubernetes needs to increase its number of pods (horizontal scaling) beyond the cluster’s available resources. The additional pods end up in a Pending state because the cluster does not have enough resources to schedule them. Even if you are using the Cluster Autoscaler (CA) on Elastic Kubernetes Service (EKS), there is a delay of at least 10 seconds before CA recognizes the need for more capacity and communicates it to the Auto Scaling Group (ASG). There is a further delay while the ASG scales out: a new EC2 instance launches, boots, runs its bootstrap scripts, and is finally marked Ready by Kubernetes. The whole process typically takes a minute or two, and the application pods stay Pending for that time.

To avoid these delays and ensure immediate capacity availability for unscheduled pods, overprovisioning can be employed. This is accomplished through the use of pause pods.

Concept of pause pods

Pause pods are non-essential, low-priority pods that are created to reserve cluster resources such as CPU, memory, or IP addresses. When critical pods require this reserved capacity, the scheduler evicts the low-priority pause pods, allowing the critical pods to use the freed-up resources. But what happens to the evicted pause pods?

After being evicted, the pause pods are automatically re-created by their ReplicaSet and start out in a Pending state. At this point, the Cluster Autoscaler (CA) steps in, as explained earlier, to provide the additional capacity. Since pause pods do not serve any application, it is acceptable for them to stay Pending for a while. Once the new capacity becomes available, the pause pods occupy it, effectively reserving it for future requirements.

How does scale-in work with Pause pods?

Now that we’ve seen how pause pods help when the cluster needs to scale out, the next question is: could these pause pods hold on to resources unnecessarily and block the cluster’s scale-in? Here’s what happens: when the Cluster Autoscaler (CA) finds a lightly utilized node (perhaps one running only pause pods), it evicts the low-priority pause pods as part of terminating the node (a scale-in action). The evicted pods are then re-created in a Pending state. By that time, however, the node count has dropped by one, and the cluster-proportional-autoscaler (CPA) recalculates the required number of pause pods. That number is typically lower, so the newly pending pause pods are terminated.

Pause pod calculations

The pause pod deployment should be managed by the cluster-proportional-autoscaler (CPA). Set it to linear mode by defining the configuration below in its ConfigMap:

linear: |-
  {
    "coresPerReplica": 1,
    "nodesPerReplica": 1,
    "min": 1,
    "max": 50,
    "preventSinglePointFailure": true,
    "includeUnschedulableNodes": true
  }

This configuration means:

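coresPerReplica: 1 means one pause pod for every CPU core in the cluster.
nodesPerReplica: 1 means one pause pod for every node in the cluster.
min: At least one pause pod.
max: A maximum of 50 pause pods.
preventSinglePointFailure: When true, at least two pause pods run if the cluster has more than one node.
includeUnschedulableNodes: When true, unschedulable nodes are also counted when calculating the number of replicas.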

When both coresPerReplica and nodesPerReplica are set, the autoscaler calculates both values and selects the greater of the two. Let’s calculate for a cluster with 4 nodes, each using the m7g.xlarge instance type, which has 4 cores (vCPUs) per node:

4 nodes, meaning 4 pause pods (one per node).
16 cores, which equates to 16 pause pods (one per core).

So, in this case, the cluster-proportional-autoscaler will spawn a total of 16 pause pods for the cluster.
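
Written out as a formula, the linear mode calculation for this example looks roughly as follows. This is a sketch based on the cluster-proportional-autoscaler documentation; the symbols are simply the ConfigMap keys, and cores and nodes are the cluster totals.

```latex
\begin{aligned}
\text{replicas} &= \max\left(
  \left\lceil \frac{\text{cores}}{\text{coresPerReplica}} \right\rceil,\;
  \left\lceil \frac{\text{nodes}}{\text{nodesPerReplica}} \right\rceil
\right) \\
&= \max\left(
  \left\lceil \frac{16}{1} \right\rceil,\;
  \left\lceil \frac{4}{1} \right\rceil
\right)
 = \max(16,\; 4) = 16
\end{aligned}
```

The result is then clamped between min and max. Since 16 lies between 1 and 50, it is used as is.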

Now, let’s explore the process of calculating the CPU request configuration for Pause pods and, as a result, determine the overprovisioned capacity of the cluster.

Suppose each pause pod is set to request 200 milliCPU (200m). From the point of view of the cluster’s computing resources, that is 20% of a single CPU core’s capacity. Since we run one pause pod per CPU core, this effectively overprovisions 20% of the entire cluster’s CPU capacity.
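
The 20% figure can be checked with a little arithmetic for the 4-node, 16-core example used earlier. This sketch assumes all 16 cores are fully allocatable, which is a simplification, since part of each node is reserved for system components.

```latex
\begin{aligned}
\text{reserved CPU} &= 16 \text{ pause pods} \times 0.2 \text{ cores} = 3.2 \text{ cores} \\
\text{overprovisioned share} &= \frac{3.2 \text{ cores}}{16 \text{ cores}} = 0.2 = 20\%
\end{aligned}
```

Changing the request per pause pod scales this share directly: 100m per pod would reserve roughly 10% of the cluster’s CPU under the same assumptions.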

Depending on the criticality and frequency of spikes in the applications running on the cluster, you can assess the overprovisioning capacity and compute the corresponding configurations for the pause pods.

Kernel Talks

Unix, Linux, & Cloud!

Tag Archives: Overprovisioning

How to overprovision the EKS cluster?

Basics of Overprovisioning in EKS Cluster
