KEDA Patching Example
This document covers a high-level example of patching the KEDA app. Please see Patching AKS Apps - Traefik for a lower-level example with more information.
KEDA is deployed globally, rather than starting in sbox and working up through the environments, due to its limited use.
It is possible to patch only a specific environment by creating a patch file within the KEDA namespace in Flux and applying this patch to the specific environment you want to modify.
Example: Flux PR example of creating a patch file and applying to a single environment/cluster.
KEDA is used to autoscale pods on a cluster. A couple of examples of this are:
- Azure DevOps agent pods: we use a KEDA scaled job to monitor for new pipeline builds in Azure DevOps and create a pod on the PTL or PTLSBOX clusters to run the pipeline.
- Jenkins webhook agent: we use a KEDA scaled job to poll Azure Service Bus for new messages when an event is triggered from a repo in GitHub. A pod is then scheduled to process the message and send it to Jenkins to kick off a build.
Patching
Review the KEDA Releases and the Helm chart releases pages to check for breaking changes before updating.
Non-Production
As shown previously it is possible to patch individual environments or multiple environments without patching KEDA in all environments even though it is deployed globally: example PR.
Using a patch file as shown in this sample PR you can patch one environment at a time until all non-production environments are completed.
Testing after the first environment is highly recommended before moving to any other environments.
ITHC or Demo are the best options to start with as both contain KEDA and also contain plum-recipe-receiver which is discussed in the checks section below.
Production
For production you should merge your patch file changes, the updates to KEDA, into the main keda.yaml file and remove the patch file you created for non-production.
You will also need to remove the patches from each non-production environment, as the file will no longer exist: example PR.
SDS
Create a PR in sds-flux-config to patch KEDA in SDS: example PR.
CFT
Create a PR in cnp-flux-config to patch KEDA in CFT: example PR.
Post Patching Checks
All the checks in this section are now automated by the keda-patching-checks AI agent skill, which is available in the skills repository. Please read the skill's README file before first use.
Work through these checks in order: first confirm KEDA itself has upgraded and is healthy, then confirm the
workloads that depend on it (ScaledJobs) are still reconciling.
1. Check the Helm release
KEDA is deployed into the keda namespace (the Helm release is named keda). Confirm the release has reconciled
to the version you applied via Flux:
kubectl get hr keda -n keda
The STATUS column should show the upgrade succeeded with the new version, for example:
Helm upgrade succeeded for release keda/keda.v2 with chart keda@2.20.0.
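The STATUS text is written for people. For a scripted check, the Ready condition on the HelmRelease can be read directly instead. This is a minimal sketch, assuming the HelmRelease exposes the usual Ready entry in status.conditions; hr_ready_status is a hypothetical helper name and not part of any existing tooling.

```shell
# Hypothetical helper: print the Ready condition of a HelmRelease
# ("True", "False" or "Unknown"). Assumes the standard status.conditions layout.
hr_ready_status() {
  # $1 = HelmRelease name, $2 = namespace
  kubectl get hr "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Example (needs cluster access, so it is left commented out):
#   [ "$(hr_ready_status keda keda)" = "True" ] && echo "keda HelmRelease is Ready"
```

A value other than True is a prompt to read the full STATUS message with the kubectl get hr command above, not a diagnosis in itself.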
2. Check the KEDA pods
KEDA runs its own long-lived pods. The chart deploys three of them, so confirm they
are all Running:
kubectl get pods -n keda -l app.kubernetes.io/part-of=keda-operator
You should see one pod for each of the following components:
- keda-operator - watches ScaledObjects/ScaledJobs and performs the scaling
- keda-operator-metrics-apiserver - serves external metrics to the Kubernetes metrics API
- keda-admission-webhooks - validates KEDA resources
If a pod is not healthy, check its logs, for example:
kubectl logs -n keda -l app.kubernetes.io/name=keda-operator -f
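The "one pod per component" check can also be done mechanically. The sketch below assumes pod names follow the usual Deployment pattern of component name, ReplicaSet hash, then pod suffix; missing_keda_components is a hypothetical helper that reads pod names on stdin and prints any of the three components that has no pod.

```shell
# Hypothetical helper: read pod names (with or without a "pod/" prefix) on stdin
# and print each KEDA component that has no matching pod.
# Assumes names look like <component>-<replicaset-hash>-<pod-suffix>.
missing_keda_components() {
  pods="$(cat)"
  for c in keda-operator keda-operator-metrics-apiserver keda-admission-webhooks; do
    printf '%s\n' "$pods" \
      | grep -Eq "^(pod/)?${c}-[a-z0-9]+-[a-z0-9]+\$" \
      || printf '%s\n' "$c"
  done
}

# Example (needs cluster access, so it is left commented out):
#   kubectl get pods -n keda -l app.kubernetes.io/part-of=keda-operator \
#     --field-selector=status.phase=Running -o name | missing_keda_components
```

Empty output means all three components have at least one Running pod. The pattern is anchored on both ends so that a keda-operator-metrics-apiserver pod is not counted as a keda-operator pod.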
3. Check the workloads that use KEDA
Once KEDA itself is healthy, confirm the ScaledJobs that depend on it are still reconciling.
For a ScaledJob, the READY column is the health signal, not ACTIVE. READY=True means KEDA has validated the ScaledJob and it is ready to scale; that is a pass. The ACTIVE column only tells you whether KEDA is scaling right now: ACTIVE=True means there is queued work at that instant, and ACTIVE=False simply means the trigger is idle (no queued jobs/messages). ACTIVE=False is normal and is NOT a failure. Equally, not seeing brand-new pods or recent KEDAJobsCreated events at the moment you check is expected when the queue is idle. Only treat a ScaledJob as failed when READY is not True (or it is missing on a cluster where it is expected).
There are three workloads we use for this:
- Recipe Receiver - deployed across most (but not all) clusters, so the best general check
- Azure DevOps agents - Linux agents run widely in SDS; on CFT they are only on the integration service (intsvc) clusters. Windows agents are intsvc-only on both estates
- Jenkins webhook relay - only on the integration service (intsvc) clusters
Recipe Receiver
Recipe Receiver is an app created for Platform Operations and deployed to both CFT and SDS AKS clusters but under different namespaces:
- CFT - plum-recipe-receiver (namespace cnp)
- SDS - toffee-recipe-receiver (namespace toffee)
These are both deployed the same way and have the same setup, resources and testing steps.
It is not deployed to every cluster, so check the relevant environment before testing. The
authoritative list is the set of environment overlays in the Flux app folders above (each <env>.yaml
is a cluster it is deployed to). At time of writing this is:
| Estate | Namespace | Environments where Recipe Receiver is deployed |
|---|---|---|
| CFT | cnp | sbox, ithc, perftest, demo, aat, prod |
| SDS | toffee | sbox, test, ithc, demo, stg, prod |
Note: this is the standing deployment managed by Flux. The recipe-receiver app repo separately deploys ephemeral, per-PR resources, and only to CFT Preview and SDS Dev — so don’t rely on those clusters for post-patching checks.
If Recipe Receiver is missing from a cluster where it is expected (per the table above), treat that as a
failure to investigate rather than skipping it — a missing ScaledJob can be a symptom of a broken KEDA
upgrade (for example the KEDA CRDs failing to install/upgrade).
The resources that matter for this testing are Azure Service Bus queues. Each environment has a Service Bus with a queue that the recipe-receiver monitors for messages, and it scales out when messages are added to the queue. If KEDA is working correctly, the messages are processed from the queue by the recipe-receiver ScaledJob, which creates pods to do the processing.
The service bus naming convention is <app>-servicebus-<environment> e.g. plum-servicebus-aat and the queue is called recipes.
Within the Recipe Receiver repository there is a Golang script that generates messages for a specific service bus/queue and then monitors the queue until the message count reaches zero. This is ideal for testing both the recipe-receiver app and KEDA.
There are examples of how to use this script within the repository itself but the general usage is:
go run messageGenerator/main.go -service-bus <app>-servicebus-<environment>.servicebus.windows.net -queue recipes -messages 50 -watch
Please note the use of go in the command: you will need Golang installed locally: instructions
This script is very reusable across environments simply by changing the app or environment name to suit whichever AKS cluster you are testing.
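Because only the app and environment change between runs, the Service Bus hostname can be built from the naming convention. The sketch below is illustrative: servicebus_fqdn is a hypothetical helper, and using toffee as the app prefix for SDS is an assumption based on the namespace name, so check the actual Service Bus name before relying on it.

```shell
# Hypothetical helper: build the Service Bus hostname from the
# <app>-servicebus-<environment> naming convention.
servicebus_fqdn() {
  # $1 = app prefix (e.g. plum), $2 = environment (e.g. aat)
  printf '%s-servicebus-%s.servicebus.windows.net\n' "$1" "$2"
}

# Example (needs Go and Azure access, so it is left commented out):
#   go run messageGenerator/main.go \
#     -service-bus "$(servicebus_fqdn plum ithc)" \
#     -queue recipes -messages 50 -watch
```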
Steps to test KEDA updates:
- Run the above script before the update to confirm Recipe Receiver is working.
- Once confirmed, make your Flux updates and raise a PR with the patch to a specific environment e.g. ITHC or Demo.
- Have your PR reviewed and merge when approved.
- When the KEDA release has been updated and new pods are running you can re-run the test.
  - To check the KEDA release has updated you can run: kubectl get hr -n keda, which shows the Helm Release and, if updated correctly, the new version you applied via Flux.
- Run the script again and monitor the script output to see if the queue message count drops as expected.
Once you are happy that KEDA is working across this environment you can deploy the patch to another non-production environment until you have completed all non-production environments. Remember to test each environment after the update.
Azure DevOps / Jenkins agents
These are the most heavily used ScaledJobs in the project: they run our self-hosted Azure DevOps agents and send
webhook events to Jenkins to trigger builds. If they are not working they cause widespread problems for application
teams and result in BAU tickets, so always verify them after patching.
These workloads are not deployed to every cluster KEDA runs on, so use the table below to
know exactly where to run the checks. The Jenkins webhook relay only runs on the integration service (intsvc)
clusters; the Azure DevOps agents run more widely in SDS:
| Workload | Namespace | ScaledJob name(s) | CFT clusters | SDS clusters |
|---|---|---|---|---|
| Jenkins webhook relay | jenkins | jenkins-webhook-relay-function | cft-ptl-00-aks, cft-ptlsbox-00-aks | ss-ptl-00-aks, ss-ptlsbox-00-aks |
| Azure DevOps agents (Linux) | azure-devops | azure-devops-agent-function | cft-ptl-00-aks, cft-ptlsbox-00-aks | All SDS clusters except demo |
| Azure DevOps agents (Windows) | azure-devops | azure-devops-agent-windows-function | cft-ptl-00-aks, cft-ptlsbox-00-aks | ss-ptl-00-aks, ss-ptlsbox-00-aks |
Run the checks below against the relevant cluster. Set the cluster once with:
# pick the cluster you are checking, e.g. cft-ptl-00-aks, cft-ptlsbox-00-aks, ss-ptl-00-aks ...
CTX=cft-ptl-00-aks
First confirm the agents’ Helm releases are healthy (READY should be True):
kubectl --context "$CTX" get hr azure-devops-agent -n azure-devops
kubectl --context "$CTX" get hr jenkins-webhook-relay -n jenkins
List the ScaledJobs and confirm READY is True:
kubectl --context "$CTX" get scaledjob -n azure-devops
kubectl --context "$CTX" get scaledjob -n jenkins
Remember READY=True is the pass criterion here — ACTIVE=False just means the agent pool / relay queue is
currently idle and is not a problem.
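To apply the READY=True rule without reading the table by eye, the Ready condition can be extracted and filtered. This is a sketch under the assumption that each ScaledJob carries a Ready entry in status.conditions (the same value the READY column shows); not_ready_scaledjobs is a hypothetical helper that reads tab-separated namespace, name and Ready status and prints only the failures. ACTIVE is deliberately ignored.

```shell
# Hypothetical helper: read "namespace<TAB>name<TAB>ready" lines on stdin and
# print every ScaledJob whose Ready status is anything other than "True".
not_ready_scaledjobs() {
  awk -F '\t' '$3 != "True" {
    print $1 "/" $2 " READY=" ($3 == "" ? "<unset>" : $3)
  }'
}

# Example (needs cluster access, so it is left commented out):
#   kubectl --context "$CTX" get scaledjob -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
#     | not_ready_scaledjobs
```

Empty output means every ScaledJob found on that cluster is READY=True. It does not detect a ScaledJob that is missing altogether, so still compare the result against the table above.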
Describe a ScaledJob to confirm KEDA is creating jobs for it - look for events with the reason KEDAJobsCreated:
kubectl --context "$CTX" describe scaledjob azure-devops-agent-function -n azure-devops
kubectl --context "$CTX" describe scaledjob jenkins-webhook-relay-function -n jenkins
Recent KEDAJobsCreated events and fresh agent pods are good confirmation that scaling works, but their
absence when the trigger is idle (ACTIVE=False) is expected and does not indicate a failure — a READY=True
ScaledJob with recently completed jobs is healthy.
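If the describe output is long, the same events can be listed on their own. A minimal sketch, assuming events are still retained on the cluster (Kubernetes keeps them only for a limited time); keda_job_events is a hypothetical helper name.

```shell
# Hypothetical helper: list KEDAJobsCreated events in a namespace, oldest first.
keda_job_events() {
  # $1 = kube context, $2 = namespace
  kubectl --context "$1" get events -n "$2" \
    --field-selector reason=KEDAJobsCreated \
    --sort-by=.lastTimestamp
}

# Example (needs cluster access, so it is left commented out):
#   keda_job_events "$CTX" azure-devops
#   keda_job_events "$CTX" jenkins
```

An empty list on an idle queue is expected, for the reasons given above, and is not a failure on its own.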
Finally, confirm new agent pods are being created. Sorting by creation time makes it easy to see fresh pods appearing
after your patch (pods are named after the job, e.g. azure-devops-agent-function-<UNIQUE ID>):
kubectl --context "$CTX" get pods -n azure-devops --sort-by=.metadata.creationTimestamp
kubectl --context "$CTX" get pods -n jenkins --sort-by=.metadata.creationTimestamp
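When several clusters have been patched, the ScaledJob listing can be repeated across them in one pass. The sketch below assumes your kubeconfig context names match the cluster names in the table above, and covers only the four intsvc clusters; check_agent_scaledjobs and INTSVC_CLUSTERS are hypothetical names.

```shell
# Assumption: kubeconfig contexts are named after the clusters in the table above.
INTSVC_CLUSTERS="cft-ptl-00-aks cft-ptlsbox-00-aks ss-ptl-00-aks ss-ptlsbox-00-aks"

# Hypothetical helper: list the ScaledJobs in both namespaces on each intsvc cluster.
check_agent_scaledjobs() {
  for ctx in $INTSVC_CLUSTERS; do
    for ns in azure-devops jenkins; do
      echo "== $ctx / $ns =="
      kubectl --context "$ctx" get scaledjob -n "$ns"
    done
  done
}

# Example (needs cluster access, so it is left commented out):
#   check_agent_scaledjobs
```

The Linux Azure DevOps agents also run on other SDS clusters, so extend the list if those were patched too.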
Some application teams have their own ScaledJobs, but if the Azure DevOps agent and Jenkins webhook relay jobs are
working we can expect the others to be working too.
Keep an eye on #platops-help for any tickets that may come in from application teams using scaled jobs.
Other links
You can find info on how the Jenkins webhook relay works here.