Crime AKS Cluster Switchover
Source Document: This runbook is derived from the original AKS Failover: steps for kubernetes cluster switchover Confluence page.
This guide provides step-by-step instructions for switching over Crime AKS clusters between environments (e.g., from K8-DEV-CS01-CL01 to K8-DEV-CS01-CL02).
Prerequisites
Approval and Scheduling
- Obtain approval for the switchover work for non-live environments
- For PRP and PRD: Raise a Request for Change (RFC) in Halo
- Schedule the switchover during designated maintenance windows
- Create the new AKS cluster one day before go-live date (except PRP/PRD which require RFCs)
Access to DTS Services
Authenticate the new cluster to access DTS services integrated with the Crime platform.
Some Kubernetes workloads (e.g., pods) require access to DTS services. Currently, the PIP service requires OIDC federation, so you must update the OIDC issuer URL in the Federated Identity Credential in Azure Entra ID.
Get the OIDC issuer URL:

```bash
az aks show --name <cluster_name> --resource-group <resource_group> --query "oidcIssuerProfile.issuerUrl" -o tsv
```
The OIDC issuer URL format: https://uksouth.oic.prod-aks.azure.com/{tenant_id}/{uuid}
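Before updating the Key Vault secret, it can help to sanity-check that the retrieved value matches the expected format, including the trailing slash. A minimal sketch (the helper name is illustrative, not part of any official tooling):

```bash
# Validate that an issuer URL matches the expected AKS OIDC format:
# https://uksouth.oic.prod-aks.azure.com/{tenant_id}/{uuid}/ (trailing slash required)
is_valid_issuer_url() {
  local url="$1"
  local uuid='[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
  local re="^https://uksouth\\.oic\\.prod-aks\\.azure\\.com/${uuid}/${uuid}/$"
  [[ "$url" =~ $re ]]
}

# Typical use after fetching the URL (requires az CLI and cluster access):
#   issuer=$(az aks show --name <cluster_name> --resource-group <rg> \
#             --query "oidcIssuerProfile.issuerUrl" -o tsv)
#   is_valid_issuer_url "$issuer" || echo "unexpected issuer URL: $issuer"
```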
Update the Key Vault Secret:
Update the crime-oidc-issuer-config secret in the appropriate Key Vault:
- STE, SIT, PRP: pip-bootstrap-stg-kv
- PRD: pip-bootstrap-prod-kv
The secret value is a JSON object containing an array of connections. Each connection includes:
- name: Environment identifier (e.g., “SIT”, “PRP”, “PRD-NEW”)
- issuer: The OIDC issuer URL obtained from the command above
- subject: The service account in format system:serviceaccount:<namespace>:<service-account-name>
Example format:

```json
{
  "connections": [
    {
      "name": "SIT",
      "issuer": "https://uksouth.oic.prod-aks.azure.com/e2995d11-9947-4e78-9de6-d44e0603518e/059790e1-cb17-43e5-b0b2-8bc4d39736de/",
      "subject": "system:serviceaccount:ns-sit-ccm-01:stagingpubhub-service-wildfly-app"
    },
    {
      "name": "PRP",
      "issuer": "https://uksouth.oic.prod-aks.azure.com/77f54315-6dde-4fe7-9e17-74762c3eb096/e0137a4c-dadb-43a4-b3fb-881df89ddda8/",
      "subject": "system:serviceaccount:ns-prp-ccm-01:stagingpubhub-service-wildfly-app"
    }
  ]
}
```
Reference: The Key Vault configuration is managed in pip-shared-infrastructures/main.tf
Apply the Changes:
After updating the Key Vault secret, re-run the Jenkins master job to apply the configuration:
pip-shared-infrastructures/master
Note: OIDC federation is required in SIT, NFT, PRP, and PRD environments. When upgrading AKS clusters in environments like PRP and PRD, add both the old and new cluster issuer URLs to the connections array. This allows you to keep the old cluster running while testing the new one. Remove the old cluster’s entry after the new one has been verified and tested.
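When adding the new cluster's entry alongside the old one, it can help to compose the new connection object programmatically rather than hand-editing JSON. A minimal sketch (the function name and the jq usage in the comment are illustrative, not part of any official tooling):

```bash
# Build a single "connections" entry as a JSON object.
# Arguments: name, issuer URL, namespace, service account name.
make_connection() {
  local name="$1" issuer="$2" namespace="$3" sa="$4"
  printf '{"name":"%s","issuer":"%s","subject":"system:serviceaccount:%s:%s"}\n' \
    "$name" "$issuer" "$namespace" "$sa"
}

# Example (placeholder values), appending to the secret value with jq:
#   entry=$(make_connection "PRP-NEW" "$issuer" "ns-prp-ccm-01" "stagingpubhub-service-wildfly-app")
#   jq --argjson c "$entry" '.connections += [$c]' secret.json
```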
Preparation
Before beginning the switchover process, prepare the cpp-terraform-azurerm-aks and cpp-terraform-azurerm-aks-config repositories.
AKS Repository
Purpose: Contains code to build the cluster and all related resources (Virtual Network, subnets, private endpoints, etc.)
1. Create Release Branch
Recommended: Create a release branch to isolate all switchover changes across environments. This allows gradual rollout (DEV → SIT → NFT → PRP → PRD) and a final merge to `main` only after all environments are verified.
Clone the repository (if not already cloned) and create a release branch:
```bash
git clone https://github.com/hmcts/cpp-terraform-azurerm-aks.git
cd cpp-terraform-azurerm-aks
git switch main && git pull --rebase
git switch -c release/<aks-version>
# Example: release/1.34
```

Then create your environment-specific branch from this release branch:

```bash
git switch -c [jira-ref]/[env]-aks-<new-version>-upgrade
# Example: EI-2230/dev-aks-1.34-upgrade
```
2. Modify the Relevant .tfvars File
Update the Kubernetes version parameters:
```hcl
kubernetes_version   = "1.28.3" # --> "1.30.3"
orchestrator_version = "1.28.3" # --> "1.30.3"
```

- Change any other references required for your upgrade
- Example PR: cpp-terraform-azurerm-aks/pull/121
- If creating a new `env.tfvars`, refer to the Network and Azure Components documentation
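The version bump above can also be scripted to avoid missing one of the two parameters. A minimal sketch, assuming the standard `key = "value"` tfvars layout (the function name is illustrative):

```bash
# Replace the old Kubernetes version with the new one for both version
# parameters in a .tfvars file. Writes a .bak backup alongside the file.
bump_aks_versions() {
  local file="$1" old="$2" new="$3"
  sed -i.bak -E \
    -e "s/^(kubernetes_version[[:space:]]*=[[:space:]]*)\"${old}\"/\1\"${new}\"/" \
    -e "s/^(orchestrator_version[[:space:]]*=[[:space:]]*)\"${old}\"/\1\"${new}\"/" \
    "$file"
}

# Usage: bump_aks_versions env.tfvars 1.28.3 1.30.3
```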
AKS Config Repository
Purpose: Contains application code to deploy vital system components (Istio, Prometheus, Gatekeeper, etc.) to the AKS cluster.
1. Understand the Branching Strategy
- Module Repository (cpp-module-terraform-azurerm-aks-config): Uses tags to mark specific, immutable versions corresponding to AKS versions (e.g., v1.28.3)
- Root Repository (cpp-terraform-azurerm-aks-config): Uses branches to manage environment-specific configurations (e.g., main, aks-1.28.3)
2. Create Release Branch
Recommended: Use the same release branch strategy as the AKS repository for consistency.
Clone the repository (if not already cloned) and create a release branch:
```bash
git clone https://github.com/hmcts/cpp-terraform-azurerm-aks-config.git
cd cpp-terraform-azurerm-aks-config
git switch main && git pull --rebase
git switch -c release/<aks-version>
# Example: release/1.34
```

Then create your environment-specific branch from this release branch:

```bash
git switch -c [jira-ref]/[env]-aks-<new-version>-upgrade
# Example: EI-2230/dev-aks-1.34-upgrade
```
3. Review tfvars File
Ensure your environment’s .tfvars code is correct and up to date:
- Double-check all versions are updated
- Verify the `user_rbac` map is updated with groups (this allows users to access the cluster; if empty, no one can access it)
- In `main.tf`, ensure the module source points to `main`:

```hcl
module "aks_base_config" {
  source = "git::https://github.com/hmcts/cpp-module-terraform-azurerm-aks-config.git?ref=main"
}
```
Example PR: cpp-terraform-azurerm-aks-config/pull/217
Build Cluster via Pipelines
Once branches are up to date, raise Pull Requests and run the ADO pipelines for aks and aks-config.
1. Raise Pull Requests
- Verify everything is correct
- Raise a pull request and get it peer-reviewed
2. Run the ADO Pipelines
AKS Pipeline (run this first):
- Pipeline: AKS Pipeline
- Branch: Your feature branch (e.g., `EI-2230/dev-aks-1.30.3-upgrade`)
- Environment: Select the cluster you are building
AKS Config Pipeline (run after AKS pipeline succeeds):
- Pipeline: AKS Config Pipeline
- Branch: Your feature branch (e.g., `EI-2230/dev-aks-1.30.3-upgrade`)
- Environment: Select the cluster you are building
For PRP and PRD, run the pipelines against your feature branch and verify the Terraform plan looks sound and error-free.
Cluster Switchover Steps
Understanding Environment Labels
Each step below is tagged with an environment label indicating where it should be performed:
| Label | Description |
| --- | --- |
| (ALL) | Required for all environments (DEV, SIT, NFT, PRP, PRD) |
| (DEV) | DEV environment only |
| (PROD) | PROD environment only |
| (not dev) | All environments except DEV |
| (not dev, not sit) | All environments except DEV and SIT |
1. (PROD) Request Environment Shuttering
Before proceeding with the cluster switchover in PROD, request SRE resources to shutter the environment. This includes:
- Web Application Firewall (WAF)
- Other components that are typically shuttered during production releases
Reference: For detailed shuttering steps, see Run PGBASEBACKUP on Sunday in PROD
Note: Request an SRE resource to perform the shuttering steps rather than executing them yourself. The shuttering procedures are documented in the reference above for awareness and future reference, but this work should be coordinated with and performed by the SRE team.
2. (ALL) Scale Down Source Cluster Workloads
Set replicas to 0 for all workload namespaces on the source cluster. Target namespaces follow the format ns-<env>-<function>-01 (e.g., ns-dev-ccm-01, ns-dev-idam-01). You need to scale both deployments and statefulsets:
```bash
kubectl scale deploy -n <namespace> --replicas=0 --all
kubectl scale statefulset -n <namespace> --replicas=0 --all
```
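To review exactly what will be scaled down before touching anything, the commands can be generated from the live namespace list. A hedged sketch (the helper name is illustrative):

```bash
# Print the scale-down commands for every workload namespace matching
# ns-<env>-*-01, so they can be reviewed before being executed.
emit_scale_down() {
  local env="$1"; shift
  local ns
  for ns in "$@"; do
    [[ "$ns" == ns-${env}-*-01 ]] || continue
    echo "kubectl scale deploy -n $ns --replicas=0 --all"
    echo "kubectl scale statefulset -n $ns --replicas=0 --all"
  done
}

# Real use (requires cluster access): feed live namespaces, review, then pipe to bash:
#   emit_scale_down dev $(kubectl get ns -o jsonpath='{.items[*].metadata.name}')
```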
3. Deploy Workloads on Destination Cluster
Retrieve the latest release branch (ask in the group channel or contact a named person from the How do I get my DevOps / SRE Changes into release (and into Production) document).
Deploy workloads:
- Pipeline: Deploy Stack #333
For detailed instructions on deploying to AKS, refer to the Deploy to AKS guide.
4. (PROD) Deploy Reporting Cronjobs
Retrieve the latest release branch (ask in the group channel or contact a named person).
Deploy reporting cronjobs:
- Pipeline: Deploy Reporting #225
5. (ALL) Update Global DNS
Update DNS to point to the new cluster.
- Repo: cpp-terraform-network
- Example PR: Commit
- Pipeline: Network Pipeline
- Environment: mdv (or mpd)
6. (ALL) Restart HAProxy Service
SSH into the HAProxy servers for the respective environment and restart the service.
Example for DEV:
- Servers: DEVCCM01ACTLB01.cpp.nonlive and DEVCCM01ACTLB02.cpp.nonlive
```bash
systemctl restart haproxy.service
tail -f /var/log/haproxy-traffic.log
```
7. (ALL) Basic Connectivity Testing
Test connectivity from your workstation (over VPN):

```bash
curl https://sitccm01.ingress01.sit.nl.cjscp.org.uk:443/usersgroups-service/internal/metrics/ping
```

Test from a server in the specific environment (e.g., ENVCCM01ACTAP##.cpp.nonlive):

```bash
curl https://sitccm01-api-lb.sit.cpp.nonlive:443/usersgroups-service/internal/metrics/ping
```

Both should return `pong`.
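A small wrapper can make the two checks repeatable during the switchover window. This is an illustrative sketch; the helper names are not part of any existing tooling:

```bash
# True only if the response body is exactly "pong".
is_pong() { [[ "$1" == "pong" ]]; }

# Curl an endpoint and report pass/fail based on the body.
check_endpoint() {
  local url="$1" body
  body=$(curl -sk --max-time 10 "$url") || { echo "FAIL (no response): $url"; return 1; }
  if is_pong "$body"; then
    echo "OK: $url"
  else
    echo "FAIL (got '$body'): $url"
    return 1
  fi
}

# Example:
#   check_endpoint https://sitccm01.ingress01.sit.nl.cjscp.org.uk:443/usersgroups-service/internal/metrics/ping
```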
8. (ALL) Update Redirect URIs for Kiali/Grafana
- Search for the `mdv-k8s-monitor` Application in the Azure portal
- Navigate to Authentication
- Add Web Redirect URIs:
  - Kiali: `https://kiali.mgmt.cs01cl02.nft.nl.cjscp.org.uk/kiali/`
  - Grafana: `https://grafana.mgmt.cs01cl02.nft.nl.cjscp.org.uk/login/azuread`
9. (not dev, not sit) Update Dynatrace
Follow the AKS-Dynatrace Integration guide.
10. (not dev) Update Key Vault Networking
Configure the Key Vault to allow access from the new cluster VNET.
- Go to Key Vault → `KV-<ENV>-CCP01` in the Azure portal
- Click the Networking tab and check the firewall access policy
If “Allow public access from all networks”: No change required
If “Allow public access from specific virtual networks and IP addresses”:
- Click Add virtual network → Add existing virtual network
- Select the new cluster VNET & subnet (select only APP Subnet)
- Click Enable
- Click Save/Apply
- Verify the cluster VNET and subnet appear under Virtual networks section
- After updating the new VNET details, delete the old VNET details
11. (PROD) Apply Progression Scaling CronJob (If Applicable)
Note: Check if this is still relevant. If the job is not on the source cluster, don’t apply it. Delete the cron from the old cluster to prevent it from scaling up services if the cluster isn’t destroyed.
```yaml
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-progression
  namespace: kube-system
spec:
  schedule: "5 5 * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          containers:
            - name: cleanup
              image: crmpdrepo01.azurecr.io/hmcts/jenkins-agent-java11:v1.0.4-jdk11
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - kubectl patch hpa progression-service-wildfly-app -p '{"spec":{"minReplicas":30}}' -n ns-prd-ccm-01
              resources:
                requests:
                  memory: "64Mi"
                limits:
                  memory: "128Mi"
          restartPolicy: Never
          serviceAccountName: jenkins-admin
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-progression
  namespace: kube-system
spec:
  schedule: "5 22 * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          containers:
            - name: cleanup
              image: crmpdrepo01.azurecr.io/hmcts/jenkins-agent-java11:v1.0.4-jdk11
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - kubectl patch hpa progression-service-wildfly-app -p '{"spec":{"minReplicas":8}}' -n ns-prd-ccm-01
              resources:
                requests:
                  memory: "64Mi"
                limits:
                  memory: "128Mi"
          restartPolicy: Never
          serviceAccountName: jenkins-admin
```
12. (DEV) Update mi-ado-agent Managed Identity
Update the mi-ado-agent Managed Identity with the OIDC URL of the new DEV cluster on the HMCTS.net Subscription. The URL must contain a trailing slash.
Portal location: Federated Credentials
Update the existing federated credentials with the new cluster’s OIDC URL:
Credential 1: ADO Agent
- Name: `ado-agent-aks-devX` (where X is the cluster number: 1 or 2)
- Federated credential scenario: Configure a Kubernetes Service Account
- Cluster Issuer URL: Update with the new cluster’s OIDC URL:

```bash
az aks show --name <new_cluster_name> --resource-group <resource_group> --query "oidcIssuerProfile.issuerUrl" -o tsv
```

  Example: `https://uksouth.oic.prod-aks.azure.com/e2995d11-9947-4e78-9de6-d44e0603518e/0dccffda-21f3-4db3-bca7-9f8392cf3587/`
- Namespace: `ado-agent`
- Service Account: `ado-agent`
Credential 2: KEDA Operator
- Name: `keda-operator-aks-devX` (where X is the cluster number: 1 or 2)
- Cluster Issuer URL: Update with the same OIDC URL from the new cluster
- Namespace: `keda`
- Service Account: `keda-operator`
- Format: `system:serviceaccount:keda:keda-operator`
Note: When switching between clusters (e.g., dev1 to dev2), you only need to update the Cluster Issuer URL in the existing federated credentials. The namespace and service account values remain the same.
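Because the federated credential rejects issuer URLs without a trailing slash, it is worth normalising the value before pasting it into the portal. A minimal sketch:

```bash
# Print the URL with exactly one trailing slash, adding one if missing.
ensure_trailing_slash() {
  local url="$1"
  printf '%s/\n' "${url%/}"
}

# Example:
#   issuer=$(az aks show --name <cluster> --resource-group <rg> \
#             --query "oidcIssuerProfile.issuerUrl" -o tsv)
#   issuer=$(ensure_trailing_slash "$issuer")
```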
13. (DEV) Update Jenkins Configuration for AKS Agents
Configure Jenkins to spin up agents on the new AKS cluster.
- Go to Manage Jenkins → Configure System
- Navigate to: https://build.mdv.cpp.nonlive/configure
- Find the Kubernetes section and identify your cluster configuration by name
Update the following:
Kubernetes URL:

```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
```

Certificate key:

```bash
kubectl get secrets/jenkins-admin-token -n kube-system -o jsonpath="{.data['ca\.crt']}" | base64 -d
```

- Keep a note of the Credentials name
- Open another tab: https://build.mdv.cpp.nonlive/credentials/
- Identify your credentials → Open it and click Update
Update Secret:

```bash
kubectl get secrets/jenkins-admin-token -n kube-system --template='{{.data.token | base64decode}}'
```

- Go back to the other tab where you updated the URL and certificate
- Click Test Connection (it should succeed if the URL, certificate, and token are correct)
- Click Apply and Save
Note: There is an issue with the Jenkins plugin where even if the test is successful, Jenkins may fail to spin up agents on AKS. Test by running a verify/validation job to see if it can spin up agents in the Jenkins namespace. If it fails, in the Kubernetes section for the cluster, simply copy and paste the same certificate and apply—this seems to fix the issue.
14. Enable Cost Analysis
Cost analysis must be enabled manually via the CLI until Terraform support is available.

```bash
az aks update --resource-group <cluster-rg> --name <cluster> --enable-cost-analysis
```

Verify the agent deployment is running:

```bash
kubectl get deploy cost-analysis-agent -n kube-system
```
15. Lock Inactive Cluster
Post cluster switchover, lock down the old cluster access so it becomes inactive.
Step 1: Update user_rbac in aks-config Repo
Update user_rbac in the var file for the inactive cluster to an empty list for the following groups:
- `aks_reader_members_ids`
- `aks_contributor_members_ids`
- `aks_cluster_admin_members_ids`
Reference: cpp-terraform-azurerm-aks-config/pull/134
Steps:
- Create a branch from the release branch:

```bash
cd cpp-terraform-azurerm-aks-config
git fetch origin
git checkout release/<aks-version>  # e.g., release/1.34
git pull origin release/<aks-version>
git switch -c [jira-ref]/lock-inactive-[cluster-name]-[env]
```

- Update the inactive cluster vars by clearing the list for the `user_rbac` variable
- Merge the changes to the release branch (e.g., `release/1.34`) and apply (ensure the plan shows only changes with respect to the user resources)
Step 2: Remove Vault Entries
Remove the vault entries for the following service accounts to restrict deployments via Jenkins. Update aks_cluster_name to the cluster that will be inactive:
- `secret/terraform/${var.environment}/${var.aks_cluster_name}/jenkins_deploy_clusterrole_kubeconfig`
- `secret/terraform/${var.environment}/${var.aks_cluster_name}/jenkins_admin_clusterrole_kubeconfig`
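To avoid typos in the two paths, they can be generated from the environment and cluster name. The helper below only prints the paths; the `vault delete` invocation in the comment is an assumption about your Vault mount and CLI setup:

```bash
# Print the two kubeconfig secret paths for a given environment and cluster.
vault_kubeconfig_paths() {
  local environment="$1" aks_cluster_name="$2"
  echo "secret/terraform/${environment}/${aks_cluster_name}/jenkins_deploy_clusterrole_kubeconfig"
  echo "secret/terraform/${environment}/${aks_cluster_name}/jenkins_admin_clusterrole_kubeconfig"
}

# Example removal (assumption: KV v1 mount and an authenticated vault CLI):
#   vault_kubeconfig_paths dev K8-DEV-CS01-CL01 | xargs -n1 vault delete
```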
Step 3: Restrict DevOps User Access
Empty the following group in the aks repo: `aks_cluster_admins_aad_group_ids`
After updating the var files, get them reviewed and run terraform plan/apply on the inactive cluster. After performing all the above steps, verify access is restricted by performing az login.
16. Destroy Old Cluster (After a Few Days)
When running the aks-config-destroy pipeline, some namespaces will get stuck in the finalizer stage. Use the following command to delete those namespaces:
```bash
for ns in $(kubectl get ns --field-selector status.phase=Terminating -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get ns $ns -ojson | jq '.spec.finalizers = []' | kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f -
done
```
17. Merge Release Branch to Main (After All Environments Switched)
Once all environments (DEV, SIT, NFT, PRP, PRD) have been successfully switched to the new clusters and the old clusters are destroyed:
1. Merge the release branch to main for both repositories:
For cpp-terraform-azurerm-aks:

```bash
cd cpp-terraform-azurerm-aks
git checkout main
git pull origin main
git merge release/<aks-version>
git push origin main
```

For cpp-terraform-azurerm-aks-config:

```bash
cd cpp-terraform-azurerm-aks-config
git checkout main
git pull origin main
git merge release/<aks-version>
git push origin main
```
2. Tag the release:

```bash
git tag -a v<aks-version> -m "AKS cluster switchover to version <aks-version>"
git push origin v<aks-version>
```

3. Clean up the release branch (optional):

```bash
git push origin --delete release/<aks-version>
```
Benefits of this approach:
- All switchover changes are now in `main` for future reference
- The release is properly tagged for an audit trail
- Future cluster builds use the updated configuration
- Provides clear separation between in-progress switchovers and the stable main branch
Troubleshooting
When Building Cluster
Issue: Key Vault Data Access Administrator Error
When running the AKS Config pipeline, you may encounter an error related to the Key Vault Data Access Administrator role (which was in preview state when older clusters were built).
Error Message:

```
Error: Provider produced inconsistent final plan

When expanding the plan for
module.aks_base_config.kubectl_manifest.store_azure_info1 to include new
values learned so far during apply, provider
"registry.terraform.io/gavinbunney/kubectl" produced an invalid new value
for .yaml_body_parsed: was cty.StringVal("apiVersion: v1\ndata:\n
AvereContributor:
/providers/Microsoft.Authorization/roleDefinitions/4f8fab4f-1852-4a58-a46a-8eaf358af14a\n
...
```
Solution:
Delete the role-definition ConfigMap from the azure-info namespace and rerun the Terraform pipeline:
```bash
kubectl delete cm role-definition -n azure-info
```
Then rerun the aks-config pipeline.