
Rebuilding AKS Clusters

Rebuilding a cluster

Warning: Rebuild a cluster only when absolutely necessary.

When rebuilding a cluster, first remove the given cluster from the Application Gateway (if there is a 00 and 01 setup), then delete the cluster manually in the Azure portal. If a lock is applied to the resource group, you will have to remove it first.
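If you prefer the CLI to the portal, a minimal sketch of those manual steps, using the cft-preview-00 names from later in this page purely as placeholders (substitute the real names for your environment):

# List any locks on the resource group, then delete the lock by name
az lock list --resource-group cft-preview-00-rg -o table
az lock delete --name <lock-name> --resource-group cft-preview-00-rg

# Delete the cluster itself
az aks delete --resource-group cft-preview-00-rg --name cft-preview-00-aks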

To rebuild the cluster, run the pipeline in the given environment for the cluster you deleted:

Click on Run pipeline (the blue button) in the top right of Azure DevOps. Ensure that Action is set to Apply, Cluster is set to the one you want to build, and that the environment is selected from the drop-down menu.

Click on Run to start the pipeline.

Common Steps

Before deployment of a cluster

  • Merge a PR to remove the cluster from the Application Gateway

After deployment of a cluster

  • Add the cluster back into the Application Gateway once you have confirmed the deployment was successful. PR example here.

Environment Specific

Dev/Preview, AAT/Staging

  • Change Jenkins to use the other cluster that is not going to be rebuilt
  • If the configuration doesn’t update after your pull request has been merged, you can change it manually in the Jenkins Configuration

After deployment of a cluster

  • Change Jenkins to use the other cluster, e.g. cnp-flux-config#19886.
    • After merging the above PR, if Jenkins has not updated (Check Here), you MUST manually change the Jenkins config to use the other cluster.
  • If it is a preview environment, swap the external DNS active cluster, e.g. cnp-flux-config#19889
  • If it is an AAT environment, you will need to update the Private DNS record to point to the new cluster (Private DNS PR). The record should be set to the kubernetes-internal load balancer IP (Traefik). See the common upgrade steps for how to retrieve the IP address, or the sketch just below.
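A minimal sketch for retrieving that IP with kubectl, assuming Traefik is exposed as a service named traefik in the admin namespace (both names are assumptions; adjust them to match the cluster):

# IP of the kubernetes-internal load balancer in front of Traefik
kubectl get service traefik -n admin -o jsonpath='{.status.loadBalancer.ingress[0].ip}'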
  • Create a test PR; normally we just make a README change to rpe-pdf-service (example PR).
  • Send the comms on the #cloud-native-announce channel regarding the swap-over to the new preview cluster; see the example announcement below:

Hi all, the Preview cluster has been swapped to cft-preview-00-aks. Log in using:

az aks get-credentials --resource-group cft-preview-00-rg --name cft-preview-00-aks --subscription DCD-CFTAPPS-DEV --overwrite
  • Delete all ingresses on the old cluster to ensure external-dns deletes their existing records:
kubectl delete ingress --all-namespaces -l '!helm.toolkit.fluxcd.io/name,app.kubernetes.io/managed-by=Helm'
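To preview what that delete will match before running it, the same label selector works with kubectl get:

kubectl get ingress --all-namespaces -l '!helm.toolkit.fluxcd.io/name,app.kubernetes.io/managed-by=Helm'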
  • Delete any orphan A records that external-dns might have missed:

Replace X.X.X.X with the load balancer IP (kubernetes-internal) of the cluster you want to clean up

Private DNS

az network private-dns record-set a list --zone-name service.core-compute-preview.internal -g core-infra-intsvc-rg --subscription DTS-CFTPTL-INTSVC --query "[?aRecords[0].ipv4Address=='X.X.X.X'].[name]" -o tsv | xargs -I {} -n 1 -P 8 az network private-dns record-set a delete --zone-name service.core-compute-preview.internal -g core-infra-intsvc-rg --subscription DTS-CFTPTL-INTSVC --yes --name {}
az network private-dns record-set a list --zone-name preview.platform.hmcts.net -g core-infra-intsvc-rg --subscription DTS-CFTPTL-INTSVC --query "[?aRecords[0].ipv4Address=='X.X.X.X'].[name]" -o tsv | xargs -I {} -n 1 -P 8 az network private-dns record-set a delete --zone-name preview.platform.hmcts.net -g core-infra-intsvc-rg --subscription DTS-CFTPTL-INTSVC --yes --name {}
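To confirm the cleanup worked, re-run just the list half of either command on its own; an empty result means no A records still point at the old IP:

az network private-dns record-set a list --zone-name service.core-compute-preview.internal -g core-infra-intsvc-rg --subscription DTS-CFTPTL-INTSVC --query "[?aRecords[0].ipv4Address=='X.X.X.X'].[name]" -o tsv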

Delete any TXT records that point to the inactive cluster:

# Private DNS
az network private-dns record-set txt list --zone-name preview.platform.hmcts.net -g core-infra-intsvc-rg --subscription DTS-CFTPTL-INTSVC --query "[?contains(txtRecords[0].value[0], 'inactive')].[name]" -o tsv | xargs -I {} -n 1 -P 8 az network private-dns record-set txt delete --zone-name preview.platform.hmcts.net -g core-infra-intsvc-rg --subscription DTS-CFTPTL-INTSVC --yes --name {}
# Public DNS
az network dns record-set txt list --zone-name preview.platform.hmcts.net -g reformmgmtrg --subscription Reform-CFT-Mgmt --query "[?contains(txtRecords[0].value[0], 'inactive')].[name]" -o tsv | xargs -I {} -n 1 -P 8 az network dns record-set txt delete --zone-name preview.platform.hmcts.net -g reformmgmtrg --subscription Reform-CFT-Mgmt --yes --name {}

Once the swap-over is fully complete, delete the old cluster:
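A minimal sketch with the Azure CLI, assuming the old cluster is the hypothetical cft-preview-01-aks in cft-preview-01-rg (remove any resource group lock first, as described at the top of this page):

az aks delete --resource-group cft-preview-01-rg --name cft-preview-01-aks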

Perftest

Before deployment of a cluster

  • Confirm with Nickin Sitaram that the environment is not being used before starting.
  • Use the pertest-cluster Slack channel for communication.
  • Scale up the number of active nodes, increasing by 5 nodes if removing a cluster. The reason for this is that CCD and IDAM will auto-scale the number of running pods when a cluster is taken out of service for an upgrade; see the sketch below.
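A minimal sketch of the node scaling with the Azure CLI, assuming hypothetical perftest cluster and node pool names (use the list command to find the real pool names and current counts):

# Check current node pools and counts
az aks nodepool list --resource-group cft-perftest-00-rg --cluster-name cft-perftest-00-aks -o table
# Scale the pool up by 5 (here from 9 to 14 nodes)
az aks nodepool scale --resource-group cft-perftest-00-rg --cluster-name cft-perftest-00-aks --name linux --node-count 14
# If the cluster autoscaler is enabled on the pool, raise its minimum instead
az aks nodepool update --resource-group cft-perftest-00-rg --cluster-name cft-perftest-00-aks --name linux --update-cluster-autoscaler --min-count 14 --max-count 20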

After deployment of a cluster

  • Check all pods are deployed and running. Compare with the pod status reference taken pre-deployment.
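One way to take and compare that reference with kubectl (a sketch; the file names are arbitrary):

# Before taking the cluster out for rebuild:
kubectl get pods --all-namespaces > pods-before.txt
# After the rebuild:
kubectl get pods --all-namespaces > pods-after.txt
diff pods-before.txt pods-after.txt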

Known issues

Neuvector

  1. admission webhook "neuvector-validating-admission-webhook.neuvector.svc" denied the request: errors; these alerts can be seen on the aks-neuvector-<env> Slack channels.
    • This happens when Neuvector is broken.
    • Check the events and status of the Neuvector Helm release.
    • Delete the Neuvector Helm release to see if it comes back up fine.
  2. Neuvector fails to install.
    • Check if all enforcers come up in time; they can fail to come up if nodes are full.
    • If they keep failing with race conditions, it could be due to corrupt backups.
    • Usually policy.backup and admission_control.backup are the ones you need to delete from the Azure file share if they are corrupt; see the sketch below.
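A minimal sketch of these checks and the backup cleanup, assuming the HelmRelease is named neuvector in a neuvector namespace and with placeholder storage account and share names (all assumptions; substitute the real values):

# Check the status and recent events of the Neuvector Helm release (Flux v2)
kubectl describe helmrelease neuvector -n neuvector
kubectl get events -n neuvector --sort-by=.lastTimestamp

# Delete the corrupt backup files from the Azure file share
az storage file delete --account-name <storage-account> --share-name <share-name> --path policy.backup
az storage file delete --account-name <storage-account> --share-name <share-name> --path admission_control.backup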

Dynatrace oneagent pods not deployed or failing to start

NOTE: The issue described here needs to be validated to confirm whether it still applies once the updated Dynatrace Operator is rolled out.

For a rebuilt or newly deployed cluster, Dynatrace oneagent pods are either not deployed by Flux or, where deployed, fail with a CrashLoopBackOff status.

The Dynatrace Helm chart requires its CRDs to be applied before the chart is installed. The CRDs currently need to be applied manually as they are not part of the existing Flux config.

Run the command below on the cluster. An empty result confirms the CRDs are not installed.

kubectl get crds | grep oneagent

To fix, run the commands below to apply the CRDs to the cluster:

kubectl apply -f https://github.com/Dynatrace/dynatrace-oneagent-operator/releases/latest/download/dynatrace.com_oneagents.yaml

kubectl apply -f https://github.com/Dynatrace/dynatrace-oneagent-operator/releases/latest/download/dynatrace.com_oneagentapms.yaml

Note: A Flux config to apply the CRDs was previously tested (PR14321) but rolled back due to issues in prod. The change will need to be revisited now that the Production AKS cluster has been upgraded to v1.21.7.

This page was last reviewed on 22 April 2024. It needs to be reviewed again on 22 April 2025 by the page owner platops-build-notices.