
Upgrading AKS Clusters

This runbook documents some of the tasks we have to perform when upgrading any of the AKS clusters. If you need to rebuild a cluster, please refer to the rebuild guide.

Cluster Order of Upgrading

This is the normal order we use when upgrading clusters; it does not have to be followed exactly:

  • Sbox - (cft-sbox-00-aks, cft-sbox-01-aks/ ss-sbox-00-aks, ss-sbox-01-aks)
  • Ptlsbox - (cft-ptlsbox-00-aks/ ss-ptlsbox-00-aks)
  • ITHC - (cft-ithc-00-aks, cft-ithc-01-aks/ ss-ithc-00-aks, ss-ithc-01-aks)
  • Preview - (cft-preview-00-aks/ ss-dev-01-aks)
  • Demo - (cft-demo-00-aks, cft-demo-01-aks/ ss-demo-00-aks, ss-demo-01-aks)
  • Perftest - (cft-perftest-00-aks, cft-perftest-01-aks/ ss-test-00-aks, ss-test-01-aks)
  • AAT - (cft-aat-00-aks, cft-aat-01-aks/ ss-stg-00-aks, ss-stg-01-aks)
  • Production - (prod-00-aks, prod-01-aks/ ss-prod-00-aks, ss-prod-01-aks)
  • Ptl - (cft-ptl-00-aks/ ss-ptl-00-aks)

Planning

Before starting to upgrade any clusters, we need to find out if the version of kubernetes we are upgrading to has any deprecations in it that might affect our applications.

We can do this using the pluto tool.

If you haven’t got pluto installed on your MacBook, you can install it using brew:

  brew install pluto

Ensure you’re logged into the correct cluster.

kubectl config current-context
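If you are not yet connected to the cluster, you can fetch credentials with the Azure CLI first (a hedged sketch; the resource group and cluster names are placeholders and it assumes you are logged into the correct subscription):

az aks get-credentials --resource-group <cluster resource group> --name <cluster name>
kubectl config current-context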

Run pluto against all the clusters to ensure there are no resources using apiVersions that will be deprecated in the kubernetes version you are upgrading to.

pluto detect-helm -owide --target-versions k8s=v1.29 --only-show-removed

If there are deprecations, create a plan to address them before upgrading any cluster.

If there are no apiVersions being deprecated, then you can proceed with the upgrade, going through the environments above in order.

Patch Version

Don’t worry about patch versions. As long as the clusters are on the same minor version, it should be fine. For example, if one cluster is on version 1.30.2 and another is on 1.30.0, this is acceptable as both are on the 1.30 minor version.
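To check which version a cluster is currently running, something like the following can be used (a hedged sketch; resource group and cluster names are placeholders):

az aks show --resource-group <cluster resource group> --name <cluster name> --query kubernetesVersion -o tsv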

For more information, refer to the Azure Kubernetes Service (AKS) supported Kubernetes versions.

Notify teams

Please post a notification in the #cloud-native-announce channel to let people know which clusters will be upgraded (please refrain from using @here, as the channel is intended for announcements rather than notifications).

Utilize the #sds-cloud-native channel if you are involved in activities concerning SDS clusters.

Pods and HRs Health Checks

Ensure you’re logged into the correct cluster.

kubectl config current-context

Before upgrading any of the clusters, run the following commands on the cluster you are planning to upgrade to capture the statuses of the pods and HelmReleases (HRs):

kubectl get pods -A | wc -l > pods_total_count &&
kubectl get pods -A > pods_all_status &&
kubectl get pods -A | awk '!/(Running|Succeeded|Completed)/' > pods_not_running_or_succeeded &&
kubectl get hr -A > hr_all_status &&
kubectl get pdb -A > pdb_status &&
kubectl get hr --all-namespaces | awk '/(Unknown|False)/' > hr_failed_status

Six files will be saved in the directory from which you ran the above command; use "ll" (or ls -l) in the CLI to display them.

  • Investigate failed helm releases and pods in CrashLoopBackOff or ImagePullBackOff Status.
kubectl get pods --all-namespaces
kubectl get hr --all-namespaces
  • Check if these pods have pod disruption budgets
echo -e "NAMESPACE\t\tDISRUPTIONS_ALLOWED\t\tNAME" && kubectl get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed==0) | {namespace: .metadata.namespace,disruptionsAllowed: .status.disruptionsAllowed,name: .metadata.name} | [.namespace, .disruptionsAllowed, .name] | @tsv' | sed 's/\t\t/ | /g'

Warning If the PDB associated with the application pod has Allowed Disruptions set to 0, do not proceed with the cluster upgrade until that application pod is back up and running, or it will block the upgrade. When Allowed Disruptions is 0, the PDB essentially prohibits any voluntary evictions, meaning no pods can be taken down intentionally during maintenance or disruptions.

  • We can temporarily disable the pod disruption budget policy for a particular application to allow a smoother upgrade of the cluster. Please remember to revert the PR after the upgrade.

  • For failed helm releases where pods are not deployed, test whether rolling back to a previous successful release fixes the issue (this helps narrow down whether the problem is specific to the current release).

  • To list all failed helm releases

kubectl get hr --all-namespaces | awk '!/(True)/'
  • To view helm chart history
helm history <chart name> -n <namespace>
  • To roll back helm chart to a prior version or specific version
helm rollback <chart name> -n <namespace> 
helm rollback <chart name> <version number> -n <namespace>
  • For failed pods, investigate root cause and discuss with teams as required (e.g. pods not starting due to missing keyvault secrets)
  • For pods where images beginning with pr- are not found, this is often because the image no longer exists in the ACR. In these instances, you will need to either reach out to the app team to get it updated, or find the latest prod image for that pod in the ACR and put in a PR to fix it, like this one done previously.

  • To find an app team, search this file for the name of the namespace the pod is in. Under "contact_channel" you will find the app team's Slack channel to reach out on.

  • For a non-production environment (or a non-live application in production), if the team cannot fix the issue quickly and it is not a common component, comment out the application in Flux for the environment to unblock the upgrade, like the following example.

Common Upgrade Steps

The upgrade process begins by removing the cluster you are going to upgrade from the application gateway. This step is not required for single-cluster environments.

  • PR example for SDS here
  • PR example for CFT here

  • Please remember, when switching from the 00 cluster to 01 on the Application Gateway (and from 01 back to 00), the best practice is to have a pipeline build completed with both IP addresses in place before moving on. Each of the steps below should be a separate PR followed by a pipeline build:

        1. Remove 00 Cluster IP
        2. Put back 00 Cluster IP after successful upgrade
        3. Remove 01 Cluster IP
        4. Put back 01 Cluster IP after successful upgrade
    
    • Unsure which IP belongs to which cluster? Check the frontend IP of the kubernetes-internal load balancer:

portal.azure.com > Load Balancer > search for ‘kubernetes-internal’ > filter by the Resource Group > click on the load balancer returned by the search > click on Frontend IP Configuration > the IP Address is the one you need
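Alternatively, the same lookup can be done with the Azure CLI (a hedged sketch; the resource group is a placeholder and should be the cluster's node resource group):

az network lb frontend-ip list --lb-name kubernetes-internal --resource-group <node resource group> -o table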

Environment Specific Steps

Some environments have additional requirements / checks that need to be confirmed prior to upgrading any of their clusters. We upgrade PTL/PTLSbox and Preview/Dev/ITHC (only because we currently have one ITHC cluster) in-place, meaning we aren’t building another cluster or switching in the Application Gateway unless necessary. If one of these environments must instead be rebuilt and swapped over, please refer to the Rebuilding AKS Clusters guide.

Ptlsbox, Ptl, AAT

These environments are best upgraded early in the morning. Announce on the relevant #cloud-native-announce Slack channel that Jenkins may be briefly unavailable. Afterwards, verify the status of the Jenkins instances for CFT (https://build.hmcts.net/) and SDS (https://sds-build.hmcts.net/) to ensure they are operational following the upgrades.
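A quick way to sanity-check the Jenkins instances from the command line (a hedged sketch; a 200 or a redirect to the login page generally means the instance is up):

curl -s -o /dev/null -w "%{http_code}\n" https://build.hmcts.net/
curl -s -o /dev/null -w "%{http_code}\n" https://sds-build.hmcts.net/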

As covered above, these environments are upgraded in-place on the current live cluster rather than by building another cluster and switching in the Application Gateway. If one of them must instead be rebuilt and swapped over, please refer to the Rebuilding AKS Clusters guide.

  • Once the in-place upgrade has been completed, update the code to reflect the new version. For example, see the CFT example

  • If the configuration doesn’t update after your pull request has been merged, you can change it manually in the Jenkins Configuration

AAT

Before you upgrade a cluster in the AAT environment, check which cluster Jenkins is actively using before removing either from the application gateway. This process is detailed in the rebuild guides.

Perftest

If we are upgrading cft-perftest-00-aks and cft-perftest-01-aks, we have to agree on a time beforehand in the perftest-cluster channel.

Production

A Change Request will need to be raised via ServiceNow 2-3 days before the upgrade is due to take place. Ensure you use the following template when raising the Change Request: Update to latest GA AKS Version for Production AKS Clusters.

Upgrading the Cluster

Find the cluster you want to upgrade in the portal; it should already have been removed from the application gateway.
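If you want to confirm which versions are available before going into the portal, the Azure CLI can list them (a hedged sketch; resource group and cluster names are placeholders):

az aks get-upgrades --resource-group <cluster resource group> --name <cluster name> -o table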

Cluster Configuration > Upgrade Version > in the Kubernetes version drop down menu choose the latest GA version > Click Save

This will trigger the upgrade of the cluster, which will take some time. Once this has finished you need to create a pull request to update the code to match.

We do this upgrade manually in the portal so that we don’t end up with Terraform state issues if the upgrade times out, for example because of failing pods or some other failure.

Warning The kubernetes_version value only takes major.minor version values, e.g. 1.25, 1.26 and so on.

Once the changes have been reviewed, approved and merged, run the pipeline for the environment you are upgrading so that Terraform updates the state.

Once completed, compare the HR and pod statuses with the ones you saved before the upgrade took place.
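One simple way to compare is to capture the same outputs again and diff them against the files saved before the upgrade (a hedged sketch reusing the file names from the pre-upgrade step):

kubectl get pods -A > pods_all_status_after && kubectl get hr -A > hr_all_status_after
diff pods_all_status pods_all_status_after
diff hr_all_status hr_all_status_after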

Warning If you run into issues during an AKS upgrade, see Troubleshooting upgrade failures.

After upgrading a cluster

  • Add the cluster back into the application gateway once you have confirmed the upgrade has been successful.

Known Issues

Code="ReconcileVMSSAgentPoolFailed" Message="Code=\"CannotAddAcceleratedNetworkingNicToAnExistingVirtualMachine\""

waiting for update of Node Pool "msnode" (Kubernetes Cluster "ss-sbox-00-aks" / Resource Group "ss-sbox-00-rg"): Code="ReconcileVMSSAgentPoolFailed" Message="Code=\"CannotAddAcceleratedNetworkingNicToAnExistingVirtualMachine\" Message=\"Cannot add network interface '/subscriptions/a8140a9e-f1b0-481f-a4de-09e2ee23f7ab/resourceGroups/ss-sbox-00-aks-node-rg/providers/Microsoft.Network/networkInterfaces/|providers|Microsoft.Compute|virtualMachineScaleSets|aksmsnode|virtualMachines|0|networkInterfaces|aksmsnode' with accelerated networking to an existing virtual machine '/subscriptions/a8140a9e-f1b0-481f-a4de-09e2ee23f7ab/resourceGroups/ss-sbox-00-aks-node-rg/providers/Microsoft.Compute/virtualMachines/|providers|Microsoft.Compute|virtualMachineScaleSets|aksmsnode|virtualMachines|0'

We’ve seen quite a few failures with Windows node pools and attaching NICs to VMs; this generally happens during patching. In the first instance, re-try the pipeline; otherwise, rebuild the cluster.

Old Chart being used

On the cft-demo-01-aks cluster, we noticed some CronJobs were using an older Helm chart despite the latest chart version being set in the Flux config. Pluto reported that those CronJobs were using deprecated Kubernetes APIs, but the underlying problem was that the chart had not been updated.

The only way we managed to fix this and get the updated chart for the CronJobs was to follow this blog post: run helm uninstall and then remove the helm release.
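For reference, the fix roughly amounts to the following (a hedged sketch; release, HelmRelease and namespace names are placeholders, and Flux should recreate the HelmRelease with the latest chart afterwards):

helm uninstall <release name> -n <namespace>
kubectl delete hr <helm release name> -n <namespace>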

cft-preview cluster needed swapping because the subnet did not have enough capacity for 2301 IP addresses

On the cft-preview-01-aks cluster, the below error was thrown while upgrading in the Azure portal.

Failed to save Kubernetes service 'cft-preview-01-aks'. Error: Code="SubnetIsFull" Message="Subnet aks-01 with address prefix 10.101.160.0/19 does not have enough capacity for 2301 IP addresses." Details=[]

The only explanation we can think of for this error is that the cluster had 119 nodes at the time; however, the address space is quite large at 8,192 IPs, so we are not sure why it did not have enough capacity.

The only solution we had for this was to rebuild and swap the cft-preview-01-aks cluster with the cft-preview-00-aks cluster. Please follow the Re-building a cluster guide to find out how to rebuild Preview.

This page was last reviewed on 25 April 2024. It needs to be reviewed again on 25 April 2025 by the page owner platops-build-notices .