Patching AKS Apps Overview
This run book documents some tasks that we have to perform when patching apps on the AKS Clusters. This runbook will use Traefik as an example throughout. Please see bottom of this page for other app examples.
Note: steps may differ depending on the application being patched.
App Patching Schedule
Patching is shared across the Platform Operations team, review the patching rota to view which app is due and who it is assigned to.
Patching order by Environment
Typically app patching is split into 3 segments, based on environment. This is the normal order we use to apply patches:
SDS is a good place to start, as any issues will have a smaller impact due to the size of the platform.
Sandbox
- Sbox - (cft-sbox-00-aks, cft-sbox-01-aks/ ss-sbox-00-aks, ss-sbox-01-aks)
- Ptlsbox - (cft-ptlsbox-00-aks/ ss-ptlsbox-00-aks)
Non Prod
- ITHC - (cft-ithc-00-aks, cft-ithc-01-aks/ ss-ithc-00-aks, ss-ithc-01-aks)
- Preview - (cft-preview-01-aks/ ss-dev-01-aks)
- Demo - (cft-demo-00-aks, cft-demo-01-aks/ ss-demo-00-aks, ss-demo-01-aks)
- Perftest - (cft-perftest-00-aks, cft-perftest-01-aks/ ss-test-00-aks, ss-test-01-aks)
- AAT - (cft-aat-00-aks, cft-aat-01-aks/ ss-stg-00-aks, ss-stg-01-aks)
Production
- Production - (prod-00-aks, prod-01-aks/ ss-prod-00-aks, ss-prod-01-aks)
- Ptl - (cft-ptl-00-aks/ ss-ptl-00-aks)
Research the App version
It is important to understand what changes are in the version upgrade, especially if there are any breaking changes.
Usually, there will be a renovate pull request in the flux repo that will contain release notes that show you the breaking changes.
Check out some of examples for patching apps such as Traefik to see what this means and how to tackle it.
If you are unsure whether a breaking change will affect us or our application teams, shout out on your standup or send a message on slack to #platform-operations.
Check existing configuration
Check the existing configuration is working by making sure the pods are running on the clusters and the versions match what is in code.
kubectl describe pod {pod-name} -n admin | grep helm.sh/chart=traefik
Updating - start with sandbox
When updating an application, depending on how it is set up and how critical it is, you may have to target a certain environment first, such as sandbox.
In our flux repos, we usually have a base file that contains the application or helm release version that applies to all environments.
To target a certain environment, we need to override this value by creating another helm release file and adding it to the environment specific kustomization.
You can see how this has been done in our examples for Traefik and Flux.
Note: this may not apply to all applications, eg. KEDA, but most can have the versions upgraded per environment.
Checks to see if upgrade worked correctly
When updating an application you can monitor the cluster to see if the upgrade completes.
There are a few things you can check to see the progress of the update end-to-end.
You can start checking the flux config to make sure it hasn’t got any error and it has reconciled correctly.
When a change is pushed to the flux repo, the flux controller pods on the cluster sync with GitHub.
This triggers a reconciliation of the kustomizations in the flux-system namespace.
You can see this by running:
kubectl get kustomizations.kustomize.toolkit.fluxcd.io -n flux-system
You can also use ks
which is shorthand for kustomization
.
There will be a kustomization per namespace and once it becomes reconciled it will say Applied revision
followed by the Github branch and commit SHA e.g. master@abcd1234...
.
Generally, if you find any changes are not deploying to the cluster, checking the kustomizations are reconciling properly is one thing you can do to troubleshoot.
Next, check the pods have come back up. Using traefik as an example, you can run:
kubectl get pods -n admin | grep traefik
The uptime should be fairly recent, i.e., the pods should have been redeployed in the last few minutes.
Check pods have the correct chart version:
kubectl describe pod {pod-name} -n admin | grep helm.sh/chart=traefik
Also, you can check on the Helm release to see if it has got correct version in the log.
kubectl get hr -n admin
Review pods for any new errors:
kubectl logs {pod-name} -n admin -f
You can also check if any UIs are reachable.
Further testing
If the application has a UI, open it in a browser. If it’s working then the upgrade worked.
Some applications, like traefik, don’t have a UI exposed but there are other ways to check.
Have a look at the example on Traefik for details.
Repeat these steps in both the CFT and SDS clusters.
Non-prod environments
Now you’ve completed sandbox, move onto the nonprod environments.
Prod environments
You will want to test the PTL environments on their own before upgrading prod as this cluster hosts Jenkins, Start with SDS then move onto CFT.
PR examples for traefik:
When prod is the only environment left, you can either:
- remove the previous patches you created for the other environments and just update the application yaml file
- merge the renovate PR that will probably have been created and remove the previous patches you created
Either of these methods should have the same outcome, i.e., updating the base version of the application or helm release that applies to all environments.
Other notes
AAD Pod Identity has been deprecated hence there is not going to be further patches.
AAD Pod Identity already using latest version on both sds and CFT.
Application specific examples
- Artifactory
- dns-node-cache
- Dynatrace Operator
- Flux
- KEDA
- kured
- pact-broker
- Prometheus/Grafana
- Traefik