
AKS Known Issues

Known issues encountered during AKS cluster upgrade or rebuild

If a cluster is in a failed state and, even after troubleshooting, you are unable to complete the cluster upgrade before the auto-shutdown of AKS clusters, please delete the failed AKS cluster manually in the portal and try again the following morning. This stops active jobs and pods on the failed cluster from trying to reach endpoints that become inactive when the healthy cluster is shut down.
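
If you prefer the command line, a cluster in a failed state can also be deleted with the Azure CLI; this is a minimal sketch, and the resource group and cluster name below are placeholders for your own values:

az aks delete --resource-group <resource-group> --name <cluster-name>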

Upgrade failed in the portal

If the upgrade shows as failed in the portal, you need to restart it by running the following command against the cluster you are trying to upgrade. You can find the full resource ID of the cluster in the portal.

az resource update --ids <aks-resource-id>
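
If you would rather not copy the resource ID from the portal, it can also be looked up with the Azure CLI and passed straight into the update command; the resource group and cluster name below are placeholders:

az resource update --ids $(az aks show --resource-group <resource-group> --name <cluster-name> --query id -o tsv)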

One or more nodes have failed to upgrade

You can track the progress of the upgrade by running:

kubectl get nodes
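
If you would rather monitor progress from the terminal while the upgrade runs, you can watch the node versions change and cross-check the node pools with the Azure CLI; the resource group and cluster name below are placeholders:

kubectl get nodes --watch
az aks nodepool list --resource-group <resource-group> --cluster-name <cluster-name> -o table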

Eventually all nodes will show as running the intended version. In some cases, however, one or more nodes may be stuck on the old AKS version, which will block the upgrade. This happens when a node cannot be drained before the upgrade, the most likely cause of which is a broken application.

To diagnose why a node has failed to upgrade, head to the AKS cluster in the Azure portal and navigate to Node Pools > Linux/System/Other > Nodes > Stuck Node Name > Events. From here you will likely see a failed eviction and the reason for it, after which you can troubleshoot with the relevant team / application to unblock the node draining process. If the cluster is in a failed state, or no longer shows as updating in the portal, you will need to hit the update button again.
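
The same eviction events can usually be surfaced from the command line as well; the stuck node name below is a placeholder:

kubectl describe node <stuck node name>
kubectl get events -A --sort-by=.lastTimestamp | grep -i <stuck node name>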

Quota limit reached

You may encounter a quota limit during an AKS upgrade because new nodes are provisioned during the process. Generally, you can increase these quotas yourself in the portal by selecting the subscription that the AKS cluster is part of and increasing the Compute quota to the recommended level.
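
To see how close the subscription is to its compute limits before retrying the upgrade, you can list current usage per region with the Azure CLI; the region below is a placeholder:

az vm list-usage --location <region> -o table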

DaemonSet Failed Scheduling

For a DaemonSet application (e.g. a CSI driver, OneAgent or Kured), when applying an update or patch that requires a restart, a pod on a specific node may fail scheduling, which blocks the rolling update from restarting the remaining pods.

An error similar to the below may be seen in the cluster events log:

[Screenshot: OneAgent pod scheduling error event]
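
The same FailedScheduling event can usually be read directly from the stuck pod, in the Events section at the end of the output; the pod name and namespace below are placeholders:

kubectl describe pod <pod name> -n <namespace>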

  • This happens when there is a capacity issue on specific cluster nodes
  • The DaemonSet rolling update is blocked, with the pod stuck in a Pending state
  • Deleting the pod does not fix the issue, as the pod restarts and goes into the same Pending state

  • Run kubectl get pod <pod name> -n <namespace> -o yaml | grep -A 10 nodeAffinity to identify the node where pod scheduling has failed (for a Pending DaemonSet pod the target node appears in the node affinity values rather than in spec.nodeName)

  • Run kubectl get pods -A -o wide | grep <node name> to list all pods running on the node (-o wide is needed so the node name appears in the output)

  • Identify and delete one or more non-DaemonSet pods so they restart on a different node and free up capacity on the current node

  • If deleted pods are stuck in a Terminating state, use kubectl delete pod <pod name> --grace-period 0 --force to forcefully delete the pod

  • Confirm the DaemonSet pod now starts successfully (a rollout check is shown below)
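
As a final check that the rolling update has unblocked, you can confirm the DaemonSet rollout has completed; the DaemonSet name and namespace below are placeholders:

kubectl rollout status daemonset/<daemonset name> -n <namespace>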

This page was last reviewed on 26 January 2024. It needs to be reviewed again on 26 January 2025 by the page owner platops-build-notices.