EKS cluster autoscaling
One of the most powerful aspects of Kubernetes is its dynamic nature. Deployments watch and replace failed pods, application versions can be rolled forward and back with ease, and both worker nodes and pods can be scaled automatically. This blog will discuss worker node scaling, known as Cluster Autoscaling. We will cover pod autoscaling with the Horizontal Pod Autoscaler in another blog post.
Most customers use the supplied CloudFormation script to create worker nodes for EKS. This script automatically places the worker nodes into an AWS auto-scaling group (ASG). Metrics generated by this ASG are sent to CloudWatch by default, and those metrics can be used to scale the cluster up or down in typical AWS fashion. Unfortunately, these are the wrong metrics. The most common one, CPU utilization, is simply not appropriate for cluster scaling; it might be appropriate for pod scaling, but memory reservation is the right metric for a cluster. AWS also does not collect memory stats on a per-instance basis; for that, you would need to install a monitoring agent, the CloudWatch agent. So we have a choice: install the CloudWatch agent and do things the AWS way, or turn to a Kubernetes solution.
Since we are running Kubernetes, the right answer is to use a Kubernetes solution to this problem. The CloudWatch agent also only sees coarse, per-instance memory statistics, which do not translate well into pod and cluster-wide memory stats. Interestingly enough, this type of statistic is made available for ECS (Elastic Container Service), the AWS-native container solution. Therefore, I would only adjust the Kubernetes worker ASG size manually, and not rely on the ASG's native auto-scaling policies.
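If you do need to resize the worker ASG by hand, a single AWS CLI call is enough. This is just a sketch; the group name below is the NodeGroup from my cluster, so substitute your own:
{% highlight bash %}
# Manually set the desired size of the EKS worker node auto-scaling group.
# The group name is specific to my cluster; use the one created by your
# worker node CloudFormation stack.
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name eks-worker-nodes-NodeGroup-17QA3RV58XDBW \
  --desired-capacity 4
{% endhighlight %}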
Kubernetes Cluster Autoscaling
The documentation for the Cluster Autoscaler on AWS is pretty good. Here are the steps to take:
- The worker node running the cluster autoscaler pod will need access to certain AWS resources and actions. Since we may not take the time to set up pod affinity or taints, we will simply apply a policy with the appropriate permissions to the ASG instance role (see the IAM Role section below). This way, any node running the autoscaler will have the right permissions.
- Edit the cluster autoscaler template file, adjusting for local settings:
  - Update your region information
  - Update the certificate authority path
  - Update your Auto Scaling Group name, unless you are using ASG discovery
- If you are using kube2iam, you will have to add an annotation
- When creating deployments, you should specify CPU and memory requests so that the scheduler knows when a pod cannot fit on the existing nodes. A failed placement is what triggers the autoscaler to add a node, and low utilization is what lets it remove one (see the sketch after this list).
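As an illustration, a deployment along the following lines gives the scheduler that information; the name, image, and sizes are placeholders, not something from the autoscaler project. Scale the replica count past what the current nodes can hold and pods will sit in Pending until a new node joins:
{% highlight yaml %}
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-scale-test          # placeholder name for a throwaway test app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-scale-test
  template:
    metadata:
      labels:
        app: nginx-scale-test
    spec:
      containers:
        - name: nginx
          image: nginx:1.15
          resources:
            requests:             # the scheduler uses requests for placement decisions
              cpu: 500m
              memory: 512Mi
            limits:               # limits cap actual usage on the node
              cpu: 500m
              memory: 512Mi
{% endhighlight %}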
Cluster autoscaling issues
There are several known issues with cluster autoscaling. For example:
- cluster autoscaling will increase the size of the cluster only after a pod fails to schedule. That is, it is reactive, and it will take several minutes after a placement failure before the new worker node is ready and available.
- cluster autoscaling is NOT zone aware (for now)
- In EKS, one must run the autoscaler on a worker node. It should be run in the kube-system namespace, so that the autoscaler does not terminate the worker node it is running on (by default it will not drain nodes hosting non-DaemonSet kube-system pods).
- Cluster Autoscaler decreases the size of the cluster when some nodes are consistently unneeded for a significant amount of time. A node is unneeded when it has low utilization and all of its important pods can be moved elsewhere.
- Pods are given at most 10 minutes to terminate gracefully when the autoscaler decides to remove a node. A PodDisruptionBudget can control how those pods are evicted (see the sketch after this list).
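On the scale-down side, a PodDisruptionBudget is the standard way to protect those "important pods" while a node is drained, and the autoscaler watches PDBs (note the poddisruptionbudgets rule in the RBAC template below). A minimal sketch, with a hypothetical app label:
{% highlight yaml %}
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                # hypothetical name
spec:
  minAvailable: 2                 # keep at least 2 replicas up while nodes are drained
  selector:
    matchLabels:
      app: my-app                 # hypothetical label; match your deployment's pod labels
{% endhighlight %}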
IAM Role
{% highlight json %}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*"
    }
  ]
}
{% endhighlight %}
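One way to attach this to the worker instance role is an inline policy via the AWS CLI; the role and policy names here are only examples, so use whatever your worker CloudFormation stack created:
{% highlight bash %}
# Save the JSON above as asg-policy.json, then attach it inline to the
# instance role used by the worker nodes (example names shown).
aws iam put-role-policy \
  --role-name eks-worker-node-instance-role \
  --policy-name cluster-autoscaler \
  --policy-document file://asg-policy.json
{% endhighlight %}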
Autoscaler template
Other templates can be found in the repo. {% highlight yaml %}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources: ["pods", "services", "replicationcontrollers", "persistentvolumeclaims", "persistentvolumes"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["watch", "list", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status"]
    verbs: ["delete", "get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.2.2
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=1:10:eks-worker-nodes-NodeGroup-17QA3RV58XDBW
          env:
            - name: AWS_REGION
              value: us-west-2
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"
{% endhighlight %}
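With those values adjusted, deploying and checking on the autoscaler is just a couple of kubectl commands (the file name is simply whatever you saved the template as):
{% highlight bash %}
# Deploy the autoscaler into kube-system and confirm it is running
kubectl apply -f cluster-autoscaler.yaml
kubectl -n kube-system get pods -l app=cluster-autoscaler
kubectl -n kube-system logs -f deployment/cluster-autoscaler
{% endhighlight %}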
Testing
In order to test the autoscaler, I simply created a deployment and scaled it up until I broke my cluster. After examining the logs of the cluster-autoscaler pod, I noticed this:
{% highlight bash %}
[scale_up.go:199] Best option to resize: eks-worker-nodes-NodeGroup-17QA3RV58XDBW
[scale_up.go:203] Estimated 1 nodes needed in eks-worker-nodes-NodeGroup-17QA3RV58XDBW
[scale_up.go:292] Final scale-up plan: [{eks-worker-nodes-NodeGroup-17QA3RV58XDBW 3->4 (max: 10)}]
[scale_up.go:344] Scale-up: setting group eks-worker-nodes-NodeGroup-17QA3RV58XDBW size to 4
[aws_manager.go:305] Setting asg eks-worker-nodes-NodeGroup-17QA3RV58XDBW size to 4
{% endhighlight %}
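For reference, the test itself amounts to something like the following; the deployment name matches the placeholder example earlier, and the replica count just needs to exceed what the current nodes can hold:
{% highlight bash %}
# Scale the test deployment well past the cluster's capacity, then watch
# for Pending pods and the resulting scale-up in the autoscaler logs
kubectl scale deployment nginx-scale-test --replicas=30
kubectl get pods | grep Pending
kubectl -n kube-system logs deployment/cluster-autoscaler | grep scale_up
{% endhighlight %}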
Summary
I was able to confirm in the AWS Console that my Kubernetes cluster increased in size from 3 to 4 nodes. After killing off the deployment and waiting 10 minutes, the cluster shrank back to 3 nodes. Success!