How to overprovision an EKS cluster?

In this article, I will walk you through overprovisioning an EKS cluster, step by step. A later section covers how to test that overprovisioning actually works.

If you want to understand what overprovisioning is, I recommend my previously published article on the topic.

Let’s get started.

Prechecks

Ensure your setup meets the prerequisites below. It should, unless you are running ancient infrastructure 🙂

Ensure you are running Kubernetes 1.14 or later, since pod priority and preemption became generally available in version 1.14.
Verify that Cluster Autoscaler’s priority cutoff is set to -10. That has been the default since version 1.12.
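
If you would rather not rely on the default, the cutoff can be pinned explicitly. The fragment below is a minimal sketch of the Cluster Autoscaler container spec; the container name and the other arguments are assumptions about a typical AWS setup, and only the last flag is the point here.

```yaml
# Fragment of the Cluster Autoscaler Deployment (sketch, not a full manifest).
# Container name and cloud provider are assumed; adapt to your installation.
containers:
  - name: cluster-autoscaler
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # Pods with a priority below this value are treated as expendable
      # and do not trigger scale-ups. -10 is already the default.
      - --expendable-pods-priority-cutoff=-10
```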

The manifests in this article contain only the bare minimum specification. Modify them to fit your requirements, such as a non-default namespace or custom labels and annotations. How you deploy them is up to you: the simple way is `kubectl apply -f manifest.yaml`; more involved options include Helm charts or ArgoCD apps.

Defining the `PriorityClass`

In Kubernetes, you can assign a custom priority to pods with a PriorityClass. For overprovisioning, you need a PriorityClass with a value below zero, because the default pod priority is zero. The pause pods then rank below regular workloads and are preempted when a regular pod needs their space. Keep the value at or above the Cluster Autoscaler cutoff of -10 so that pending pause pods still trigger a scale-up. To deploy this PriorityClass on your cluster, use the following manifest:

apiVersion: scheduling.k8s.io/v1
description: This priority class is for overprovisioning pods only.
globalDefault: false
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1

Define Autoscaler strategy

A ConfigMap defines the autoscaler policy for the overprovisioning deployment. The calculation is explained here. Refer to the manifest below:

apiVersion: v1
data:
  linear: |-
    {
      "coresPerReplica": 1,
      "nodesPerReplica": 1,
      "min": 1,
      "max": 50,
      "preventSinglePointFailure": true,
      "includeUnschedulableNodes": true
    }
kind: ConfigMap
metadata:
  name: overprovisioning-cm
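
To make the linear policy more concrete, here is a variant of the ConfigMap annotated with how I understand the linear mode of cluster-proportional-autoscaler to compute replicas. The cluster size in the comments (4 nodes with 8 cores each) and the values 4 and 2 are assumptions chosen for illustration, not recommendations.

```yaml
# Linear mode, as I understand it:
#   replicas = max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica))
#   then clamped to the range [min, max].
#
# Assumed cluster: 4 nodes x 8 cores = 32 cores
#   ceil(32 / 4) = 8 and ceil(4 / 2) = 2, so max(8, 2) = 8 pause pods.
# With the article's values (1 and 1) the same cluster would get
#   max(32, 4) = 32 pause pods.
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-cm
data:
  linear: |-
    {
      "coresPerReplica": 4,
      "nodesPerReplica": 2,
      "min": 1,
      "max": 50,
      "preventSinglePointFailure": true,
      "includeUnschedulableNodes": true
    }
```

With preventSinglePointFailure set to true, the autoscaler keeps at least two replicas whenever the cluster has more than one node, even if the formula yields one.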

RBAC Config

Next is the RBAC configuration, which has three components: ServiceAccount, ClusterRole, and ClusterRoleBinding. Together they give the autoscaler deployment the access it needs to resize the pause pod deployment. Refer to the manifest below:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: overprovisioning-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: overprovisioning-cr
rules:
  - apiGroups:
      - ''
    resources:
      - nodes
    verbs:
      - list
      - watch
  - apiGroups:
      - ''
    resources:
      - replicationcontrollers/scale
    verbs:
      - get
      - update
  - apiGroups:
      - extensions
      - apps
    resources:
      - deployments/scale
      - replicasets/scale
    verbs:
      - get
      - update
  - apiGroups:
      - ''
    resources:
      - configmaps
    verbs:
      - get
      - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: overprovisioning-rb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: overprovisioning-cr
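subjects:
  - kind: ServiceAccount
    name: overprovisioning-sa
    # Required for ServiceAccount subjects. XYZ is a placeholder for the
    # namespace where overprovisioning-sa is created.
    namespace: XYZ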

Pause pod deployment

Creating pause pods is an easy task. You can use a custom image to set up a healthy pod that acts as a placeholder in the cluster. The size of this pod, that is, its CPU and memory requests and limits, can be adjusted to your needs. Calculate the size carefully so that the pause pods reserve the amount of cluster capacity you actually want as headroom.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: overprovisioning
  name: overprovisioning
spec:
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      containers:
        - image: nginx  # or any custom image
          name: pause
          resources:
            # Xm/XMi and Ym/YMi are placeholders; replace them with your sizes
            limits:
              cpu: Ym
              memory: YMi
            requests:
              cpu: Xm
              memory: XMi
      priorityClassName: overprovisioning
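
One way to pick the request size is to tie it to the autoscaler policy. The fragment below is only a sketch: it assumes the coresPerReplica value of 1 from the ConfigMap above and a target of roughly 20% CPU headroom. Both the 20% target and the memory figure are illustrative assumptions.

```yaml
# Fragment of the pause container spec (illustrative values only).
# With coresPerReplica: 1 there is about one pause pod per core, so a
# 200m request per pod reserves roughly 20% of the cluster's CPU,
# until the "max" replica count from the ConfigMap caps it.
resources:
  requests:
    cpu: 200m
    memory: 200Mi  # assumed; size this to your nodes' memory-per-core ratio
  limits:
    cpu: 200m
    memory: 200Mi
```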

Autoscaler deployment

Next, deploy the autoscaler. Its pod controls the replica count of the pause pod deployment above, based on the linear strategy defined in the ConfigMap. The number of pause pods grows or shrinks with the cluster, so the capacity they reserve tracks the cluster size. Deploy it with the manifest below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-as
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning-as
  template:
    metadata:
      labels:
        app: overprovisioning-as
    spec:
      containers:
        - command:
            - /cluster-proportional-autoscaler
            - '--namespace=XYZ'
            - '--configmap=overprovisioning-cm'
            - '--target=deployment/overprovisioning'
            - '--logtostderr=true'
            - '--v=2'
          image: gcr.io/google_containers/cluster-proportional-autoscaler-amd64:{LATEST_RELEASE}
          name: autoscaler
      serviceAccountName: overprovisioning-sa

You now have an overprovisioning mechanism that keeps more capacity in your cluster than your workloads currently need. To verify that it works correctly, you can run the test below.

Testing the functionality

To avoid scaling the entire cluster, run the test on a single node by using node affinity. Identify a node that is running pause pods and pin the new test pods to that node. Prefer node affinity over pod affinity to the pause pods here, because the scheduler will not preempt pods that the incoming pod has affinity to.
Do not set a priorityClassName on this test deployment. Kubernetes then assigns its pods a priority of 0 (assuming no PriorityClass is marked as the global default), while our pause pods have a priority of -1, lower than regular workload pods and these test pods.
Once the node has no room left for the test pods, creating this deployment should trigger preemption of the pause pods on that node in favor of the higher-priority test pods.

The pause pods should be terminated, and the new test pods should quickly reach the Running state on the designated node. The terminated pause pods are then recreated by their ReplicaSet in the Pending state and look for room on another node.
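
The test described above could be sketched with a manifest like the following. The deployment name, the NODE_NAME placeholder, and the request sizes are all assumptions for illustration. The sketch pins the pods with node affinity on the kubernetes.io/hostname label, since affinity to the pause pods themselves would stop the scheduler from preempting them.

```yaml
# Test deployment (sketch). Names and sizes are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning-test
  template:
    metadata:
      labels:
        app: overprovisioning-test
    spec:
      # No priorityClassName on purpose: the pods get the default priority.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - NODE_NAME  # placeholder: a node running pause pods
      containers:
        - name: test
          image: nginx
          resources:
            requests:
              # Assumed sizes. Pick requests larger than the node's free
              # capacity so the pod only fits if a pause pod is preempted.
              cpu: 500m
              memory: 512Mi
```

After applying it, `kubectl get pods -o wide` should show the test pod running on the chosen node and the replacement pause pods pending or placed elsewhere.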