Crime AKS System Component Post-Upgrade Verification
This guide documents the verification steps required after upgrading Crime AKS system components. These checks ensure that all components are correctly configured before rolling out to additional environments.
Overview
After upgrading system components via Terraform module updates in cpp-module-terraform-azurerm-aks-config, perform these verification steps to validate the configuration and functionality.
System Components Covered
This guide covers verification for the following system components:
- KEDA - Event-driven autoscaling (includes ADO agent federation)
- cert-manager - Certificate lifecycle management (includes VNet peering)
- SonarQube - Code quality and security
- Prometheus - Metrics and monitoring (includes custom metrics adapter)
- Istio - Service mesh sidecar injection and routing
- Azure Service Operator (ASO) - Azure resource provisioning validation
- Gatekeeper - Policy enforcement and admission control
- Kiali - Service mesh observability and monitoring
- pgAdmin - PostgreSQL database management (DEV switchover and above)
- Dynatrace - Application performance monitoring (NFT and above)
- Additional Integration Checks - Deploy, priming, and validation tests
Prerequisites
Required Before Starting
- Non-active Crime AKS cluster deployed (K8-DEV-CS01-CL01 or K8-DEV-CS01-CL02) with:
  - System components installed via Terraform (cpp-module-terraform-azurerm-aks-config)
  - Helm charts deployed to the cluster
  - All Terraform configurations applied
- For NFT/Production testing: NFT non-active cluster must be created first
Access Requirements
Cluster Access:
- Access to Crime AKS clusters (DEV, NFT)
- kubectl configured with cluster credentials
- Appropriate RBAC permissions for verification tasks
Azure Access:
- Azure CLI installed (az login completed)
- Azure Portal access for:
  - VNet peering verification/creation
  - PostgreSQL Flexible Server cloning (DEV only)
  - Managed Identity federated credential updates
  - Private DNS zone record updates (dev.nl.cjscp.org.uk)
- Access to HMCTS tenant (hmcts.net) for managed identity configuration
DevOps Access:
- Azure DevOps access with permissions to:
  - Variable groups (clone and modify)
  - Pipeline execution and monitoring
- GitHub access with write permissions to context repositories
Secrets and Configuration:
- HashiCorp Vault access (secret.mnl.nl.cjscp.org.uk:8200)
  - Paths: secret/mgmt/*, secret/dev/*
Documentation and Coordination:
- Confluence access for:
- QA team contact for release identification and test credentials
KEDA Verification
1. ADO Federation Credential for KEDA Agent Scaling
After cluster creation/recreation, update federated credentials to allow KEDA to scale Azure DevOps agents.
1.1 Get OIDC URL from Cluster
The OIDC URL is stored in the azure-info ConfigMap:
# Get OIDC URL for the cluster
kubectl get configmap azure-info -n azure-info -o jsonpath='{.data.oidc_url}'
# Example output:
# https://uksouth.oic.prod-aks.azure.com/e2995d11-9947-4e78-9de6-d44e0603518e/12345678-1234-1234-1234-123456789abc/
1.2 Update Managed Identity Federation Credentials
Navigate to hmcts.net tenant → Managed Identities → mi-ado-agent → Federated credentials.
Update the OIDC URL (do NOT change the subjects) for these two credentials:
k8-dev-cs01-cl01-ado - For KEDA operator
- Subject: system:serviceaccount:keda:keda-operator
- Update: Issuer URL only
k8-dev-cs01-cl01-ado-agent - For ADO agents
- Subject: system:serviceaccount:ado-agent:ado-agent
- Update: Issuer URL only
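If you prefer the CLI to the Portal, the issuer can be updated with `az identity federated-credential update`. This is a sketch: the identity and credential names come from the steps above, but the resource group holding `mi-ado-agent` is a placeholder you must fill in, and you must be logged in to the hmcts.net tenant.

```shell
# Read the new issuer from the cluster (as in step 1.1)
OIDC_URL=$(kubectl get configmap azure-info -n azure-info -o jsonpath='{.data.oidc_url}')

# Update ONLY the issuer; the subjects stay as documented above.
# <MI_RESOURCE_GROUP> is a placeholder for the managed identity's resource group.
az identity federated-credential update \
  --identity-name mi-ado-agent \
  --resource-group <MI_RESOURCE_GROUP> \
  --name k8-dev-cs01-cl01-ado \
  --issuer "$OIDC_URL" \
  --subject "system:serviceaccount:keda:keda-operator"

az identity federated-credential update \
  --identity-name mi-ado-agent \
  --resource-group <MI_RESOURCE_GROUP> \
  --name k8-dev-cs01-cl01-ado-agent \
  --issuer "$OIDC_URL" \
  --subject "system:serviceaccount:ado-agent:ado-agent"
```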
1.3 Configure Agent with Cluster-Specific Identifier
For testing, configure ONE agent with a cluster-specific identifier to prevent jobs from scheduling on the non-active test cluster.
Example configuration in vars/dev-cs01cl01.tfvars:
ado-agents_config = {
enable = true
namespace = "ado-agent"
sa_name = "ado-agent"
azpurl = "https://dev.azure.com/hmcts-cpp"
poolname = "MDV-ADO-AGENT-AKS-01"
secretname = "azdevops"
secretkey = "AZP_TOKEN"
managed-identity = "52cd0539-fbf7-4e98-9b26-ee6cb4f89688"
tenant-id = "531ff96d-0ae9-462a-8d2d-bec7c0b42082"
subscription-id = "ef8dd153-3fba-47a4-be65-15775bcde240"
agents = [
{
agent_name = "azdevops-agent-centos8-j17"
image_name = "ado-agent-centos8-j17"
image_tag = "v0.0.24-jdk17"
identifier = "centos8-j17-cl01" # Cluster-specific: -cl01 or -cl02
requests_mem = "8Gi"
requests_cpu = "1.5"
limits_mem = "8.5Gi"
limits_cpu = "2"
scaled_min_job = "1"
scaled_max_job = "30"
pollinginterval = "10"
successfuljobshistorylimit = "0"
failedjobshistorylimit = "0"
enable_istio_proxy = true
init_container_config = []
run_as_user = 1000
}
]
}
Note: The identifier field includes the cluster number (e.g., centos8-j17-cl01 for CL01, centos8-j17-cl02 for CL02). This ensures active jobs are not scheduled on the test cluster.
1.4 Verify Configuration
# Check KEDA operator logs for errors
kubectl logs -n keda -l app.kubernetes.io/name=keda-operator --tail=50
# Verify at least one ADO agent pod is running
kubectl get pods -n ado-agent
# Expected: At least 1 pod in Running state
# Check ScaledJob configuration
kubectl get scaledjobs -n ado-agent
cert-manager Verification
2. VNet Peering to Vault Network
Verify that the cluster VNet has peering configured to the vault network (VN-MNL-INT-01) for certificate and secret access.
2.1 Check Peering Status
# List all peerings from cluster VNet
az network vnet peering list \
--resource-group <CLUSTER_VNET_RG> \
--vnet-name <CLUSTER_VNET_NAME> \
--output table
# Check specific peering to vault network
# Example: VP-VN-MNL-INT-01-VN-DEV-CS01-CL01
az network vnet peering show \
--resource-group RG-DEV-CS01-CL01 \
--vnet-name VN-DEV-CS01-CL01 \
--name VP-VN-MNL-INT-01-VN-DEV-CS01-CL01
Expected Output:
Name PeeringState ProvisioningState
-------------------------------------- -------------- -------------------
VP-VN-MNL-INT-01-VN-DEV-CS01-CL01 Connected Succeeded
If PeeringState is “Disconnected”: Manually recreate the peering in Azure Portal or via Azure CLI.
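Recreating the peering from the CLI can be sketched as below. The peering must exist on both sides before the state shows Connected; the subscription IDs, the vault-side resource group, and the cluster-to-vault peering name are placeholders/assumptions, so confirm them against the existing naming convention first.

```shell
# Cluster VNet -> vault VNet (peering name on this side is an assumption)
az network vnet peering create \
  --name VP-VN-DEV-CS01-CL01-VN-MNL-INT-01 \
  --resource-group RG-DEV-CS01-CL01 \
  --vnet-name VN-DEV-CS01-CL01 \
  --remote-vnet "/subscriptions/<VAULT_SUB_ID>/resourceGroups/<VAULT_RG>/providers/Microsoft.Network/virtualNetworks/VN-MNL-INT-01" \
  --allow-vnet-access

# Vault VNet -> cluster VNet (run with access to the vault subscription)
az network vnet peering create \
  --name VP-VN-MNL-INT-01-VN-DEV-CS01-CL01 \
  --resource-group <VAULT_RG> \
  --vnet-name VN-MNL-INT-01 \
  --remote-vnet "/subscriptions/<CLUSTER_SUB_ID>/resourceGroups/RG-DEV-CS01-CL01/providers/Microsoft.Network/virtualNetworks/VN-DEV-CS01-CL01" \
  --allow-vnet-access
```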
2.2 Verify Vault Connectivity
Test that pods can reach the vault endpoint:
# Test DNS resolution
kubectl exec -n cert-manager deployment/cert-manager -- nslookup secret.mnl.nl.cjscp.org.uk
# Test HTTPS connectivity to vault
kubectl exec -n cert-manager deployment/cert-manager -- \
curl -I https://secret.mnl.nl.cjscp.org.uk:8200/v1/sys/health
# Expected: HTTP 200 response
Vault Path: https://secret.mnl.nl.cjscp.org.uk:8200
3. cert-manager Configuration Verification
Verify that cert-manager has the correct node selectors, replicas, and resource configurations.
3.1 Required cert-manager Customizations
When upgrading cert-manager, apply these custom changes to the upstream manifest.
File Locations:
- Terraform: cpp-module-terraform-azurerm-aks-config/cert-manager.tf
- Manifest: cpp-module-terraform-azurerm-aks-config/manifests/cert-manager/cert-manager.yaml
Steps to Apply Custom Changes:
- Download the new upstream manifest from the cert-manager releases page
- Apply template variables for the Docker images: update cert-manager.tf to template the image references:
content = templatefile("${path.module}/manifests/cert-manager/cert-manager.yaml", {
docker_image_certmanager_cainjector = var.docker_image_certmanager_cainjector
docker_tag_certmanager = var.certmanager_version
docker_image_certmanager_controller = var.docker_image_certmanager_controller
docker_image_certmanager_webhook = var.docker_image_certmanager_webhook
})
Then in the manifest file, replace each image reference with template variables:
# For each component, find the image: line and replace with:
image: "${docker_image_certmanager_cainjector}:${docker_tag_certmanager}"
image: "${docker_image_certmanager_controller}:${docker_tag_certmanager}"
image: "${docker_image_certmanager_webhook}:${docker_tag_certmanager}"
- Apply custom changes to the manifest file manifests/cert-manager/cert-manager.yaml:
a. Change replicas (search for name: cert-manager-cainjector, then find replicas:):
# cert-manager-cainjector Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: cert-manager-cainjector
spec:
replicas: 2 # Change from 1
# cert-manager Deployment (controller)
apiVersion: apps/v1
kind: Deployment
metadata:
name: cert-manager
spec:
replicas: 2 # Change from 1
# cert-manager-webhook Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: cert-manager-webhook
spec:
replicas: 3 # Change from 1
b. Change nodeSelector (for all three deployments, search for nodeSelector:):
# FIND:
nodeSelector:
kubernetes.io/os: "linux"
# REPLACE WITH:
nodeSelector:
agentpool: sysagentpool
c. Add tolerations (add BEFORE each nodeSelector: block in all three deployments):
# ADD these lines BEFORE nodeSelector:
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
effect: "NoSchedule"
nodeSelector:
agentpool: sysagentpool
d. Add resources (cert-manager-webhook deployment only, in the container spec):
Search for the cert-manager-webhook container and add resources after the env: section:
containers:
- name: cert-manager-webhook
image: "${docker_image_certmanager_webhook}:${docker_tag_certmanager}"
imagePullPolicy: IfNotPresent
args:
- --v=2
# ... other args
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
resources: # ADD this section
requests:
cpu: 1000m
3.2 Verify cert-manager Configuration
After applying the customizations, verify the configuration:
# Check replicas
kubectl get deployment cert-manager-cainjector -n cert-manager
# Expected: 2/2 READY
kubectl get deployment cert-manager -n cert-manager
# Expected: 2/2 READY
kubectl get deployment cert-manager-webhook -n cert-manager
# Expected: 3/3 READY
# Check node selector for cainjector
kubectl get deployment cert-manager-cainjector -n cert-manager \
-o jsonpath='{.spec.template.spec.nodeSelector}' | jq
# Expected: {"agentpool": "sysagentpool"}
# Check tolerations for cainjector
kubectl get deployment cert-manager-cainjector -n cert-manager \
-o jsonpath='{.spec.template.spec.tolerations}' | jq
# Expected: Includes CriticalAddonsOnly toleration
# Check webhook resources
kubectl get deployment cert-manager-webhook -n cert-manager \
-o jsonpath='{.spec.template.spec.containers[0].resources}' | jq
# Expected: {"requests": {"cpu": "1000m"}}
# Verify pods are on correct node pool
kubectl get pods -n cert-manager -o wide
# Expected: All pods on aks-sysagentpool-* nodes
SonarQube Verification
4. SonarQube Validation Configuration
Configure SonarQube testing to validate code quality checks are working on the upgraded cluster.
4.1 Get Cluster-Specific SonarQube URL
Retrieve the VirtualService to get the cluster-specific URL:
# Get VirtualService hosts
kubectl get virtualservice -n sonarqube sonarqube-sonarqube -o jsonpath='{.spec.hosts}'
# Example output:
# ["sonarqube.mgmt01.dev.nl.cjscp.org.uk","sonarqube.mgmt.cs01cl01.dev.nl.cjscp.org.uk"]
Use the cluster-specific URL (e.g., sonarqube.mgmt.cs01cl01.dev.nl.cjscp.org.uk for CL01).
4.2 Clone PostgreSQL Flex Server for Testing (DEV Only)
For upgrade testing without impacting the live CL02 cluster:
Clone the PSF instance in Azure Portal:
- Source: psf-dev-ccm-sonarqube
- New name: psf-dev-ccm-sonarqube-<TICKET-ID> (e.g., psf-dev-ccm-sonarqube-dtspo-30530)
Update Terraform config in cpp-terraform-azurerm-aks-config/vars/dev-cs01cl01.tfvars:
sonarqube_config = {
enable = true
# TODO <TICKET-ID>: Revert to psf-dev-ccm-sonarqube at time of switchover
jdbcUrl = "jdbc:postgresql://psf-dev-ccm-sonarqube-<TICKET-ID>.postgres.database.azure.com/sonarqube?sslmode=require&socketTimeout=1500"
sonarVaultPath = "/secret/dev/aks_sonarube_config"
sonarqubeUrl = "sonarqube.mgmt01.dev.nl.cjscp.org.uk"
hosts = "sonarqube.mgmt01.dev.nl.cjscp.org.uk;sonarqube.mgmt.cs01cl01.dev.nl.cjscp.org.uk"
community_build_number = "26.3.0.120487"
}
Important: Remember to revert jdbcUrl back to psf-dev-ccm-sonarqube before production switchover.
4.3 Clone Variable Group
Clone the variable group to run validation pipeline for a context repository pointing at the test SonarQube on the non-active test cluster.
- Navigate to Azure DevOps → Library → Variable Groups
- Clone cpp-nonlive-sonarqube-aks to cpp-nonlive-sonarqube-aks-testing
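There is no one-shot "clone" command in the Azure DevOps CLI, but the copy can be scripted as a sketch like the one below. It assumes the azure-devops CLI extension is installed and that `<PROJECT>` is the correct ADO project name (a placeholder here); values for the new group come from section 4.4.

```shell
# Requires: az extension add --name azure-devops
# Find the source group's id for reference
az pipelines variable-group list \
  --organization https://dev.azure.com/hmcts-cpp --project <PROJECT> \
  --query "[?name=='cpp-nonlive-sonarqube-aks'].id"

# Create the testing copy and set the cluster-specific overrides
az pipelines variable-group create \
  --organization https://dev.azure.com/hmcts-cpp --project <PROJECT> \
  --name cpp-nonlive-sonarqube-aks-testing \
  --variables \
    SONARQUBE_URL=https://sonarqube.mgmt.cs01cl01.dev.nl.cjscp.org.uk \
    ADMIN_TOKEN=<FROM_VAULT>
```

Any other variables carried by the source group would still need to be copied across manually or with `az pipelines variable-group variable create`.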
4.4 Update Testing Variable Group
Update the following variables in cpp-nonlive-sonarqube-aks-testing:
| Variable | Value |
|---|---|
| SONARQUBE_URL | https://sonarqube.mgmt.cs01cl01.dev.nl.cjscp.org.uk (cluster-specific) |
| ADMIN_TOKEN | Get from secret/mgmt/sonaraks_admin_token in Vault |
4.5 Verify SonarQube Access
# Check SonarQube pod is running
kubectl get pods -n sonarqube
# Check service and endpoints
kubectl get svc,endpoints -n sonarqube
# Check virtual service for Istio routing
kubectl get virtualservice -n sonarqube -o yaml
# Test access from within cluster
kubectl exec -n istio-ingress-mgmt deployment/istio-ingressgateway-mgmt -- \
curl -I https://sonarqube.mgmt.cs01cl01.dev.nl.cjscp.org.uk
4.6 Test SonarQube Login (Browser)
- Open browser: https://sonarqube.mgmt.cs01cl01.dev.nl.cjscp.org.uk
- Username: admin
- Password: Get from vault path secret/mgmt/sonaraks_admin_password
Expected: Successful login to SonarQube dashboard
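A scripted alternative to the browser check is SonarQube's Web API health endpoint (requires an admin token, e.g. the one from section 4.4; pass it as the basic-auth username with an empty password):

```shell
TOKEN=<ADMIN_TOKEN_FROM_VAULT>
curl -s -u "$TOKEN:" \
  https://sonarqube.mgmt.cs01cl01.dev.nl.cjscp.org.uk/api/system/health
# A healthy instance reports "health": "GREEN"
```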
Prometheus Verification
5. Prometheus Adapter - Custom Metrics HPA Validation
Test Prometheus Adapter’s custom metrics capability by patching and validating autoscaling for a test service.
5.1 Patch HPA for usersgroups-service
Apply custom metrics configuration to test HPA with Istio request metrics:
kubectl patch hpa usersgroups-service-wildfly-app \
-n ns-ste-ccm-91 \
--context K8-DEV-CS01-CL01-admin \
--type='json' \
-p='[
{"op": "replace", "path": "/spec/maxReplicas", "value": 10},
{"op": "replace", "path": "/spec/metrics/0/resource/target/averageUtilization", "value": 60},
{"op": "add", "path": "/spec/metrics/-", "value": {
"type": "Object",
"object": {
"describedObject": {
"apiVersion": "v1",
"kind": "Service",
"name": "usersgroups-service-wildfly-app"
},
"metric": {
"name": "istio_requests_per_second"
},
"target": {
"type": "AverageValue",
"averageValue": "2",
"value": "1"
}
}
}}
]'
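As a sanity check on the numbers in the patch, the HPA scales by the standard formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). The figures below are purely illustrative, not observed values:

```shell
# Illustrative only: 2 replicas, averaging 5 req/s per pod,
# against the averageValue target of 2 from the patch above.
current_replicas=2
current_value=5
target_value=2
# Integer ceiling of current_replicas * current_value / target_value
desired=$(( (current_replicas * current_value + target_value - 1) / target_value ))
echo "desiredReplicas=$desired"
```

With these figures the HPA would scale to 5 replicas (capped by maxReplicas, 10 in the patch).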
5.2 Verify HPA Configuration
# Check HPA status
kubectl get hpa usersgroups-service-wildfly-app -n ns-ste-ccm-91
# Describe HPA to see metrics
kubectl describe hpa usersgroups-service-wildfly-app -n ns-ste-ccm-91
5.3 Generate Load to Test Scaling
Generate requests to trigger autoscaling:
# Generate load from another pod in the namespace
seq 1000 | xargs -P 10 -I {} kubectl exec -n ns-ste-ccm-91 \
defence-service-wildfly-app-7d8d8c75cf-29zlf \
-- curl -s http://localhost:8080/usersgroups-service/internal/metrics/ping
5.4 Verify Custom Metrics API
Check that the custom metrics API (served by the Prometheus adapter) can see the Istio request metrics:
# Query custom metrics API for Istio requests
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/ns-ste-ccm-91/services/usersgroups-service-wildfly-app/istio_requests_per_second" | jq '.'
Expected Output:
{
"kind": "MetricValueList",
"apiVersion": "custom.metrics.k8s.io/v1beta1",
"metadata": {},
"items": [
{
"describedObject": {
"kind": "Service",
"namespace": "ns-ste-ccm-91",
"name": "usersgroups-service-wildfly-app",
"apiVersion": "v1"
},
"metricName": "istio_requests_per_second",
"value": "X" // Should be incrementing, not 0
}
]
}
Note: The value should show an incrementing number (not 0) if metrics are flowing correctly.
5.5 Monitor Pod Scaling
Watch the pods scale based on the metrics:
# Watch pods in the namespace
kubectl get pods -n ns-ste-ccm-91 -l app=usersgroups --watch
# Check HPA events
kubectl get events -n ns-ste-ccm-91 --field-selector involvedObject.name=usersgroups-service-wildfly-app
6. Prometheus Service Name Updates
If the Prometheus Helm chart name is updated with a custom suffix (e.g., -v3), update all Kubernetes service references where Prometheus is consumed.
6.1 Identify Service Name Changes
# Check current Prometheus services
kubectl get svc -n prometheus
# Expected services after upgrade:
# kube-prometheus-stack-v3-prometheus (Prometheus server)
# kube-prometheus-stack-v3-alertmanager (Alertmanager)
# kube-prometheus-stack-v3-operator (Prometheus operator)
# kube-prometheus-stack-v3-kube-state-metrics
6.2 Update Consuming Services in aks-config Module
Search for Prometheus service references in the cpp-module-terraform-azurerm-aks-config repository:
# Search for old service names
grep -r "prometheus-server" .
grep -r "prometheus-kube-prometheus-prometheus" .
grep -r "kube-prometheus-stack-prometheus" .
Common files to update:
- prometheus.tf - Prometheus configurations
- istio.tf - Istio telemetry integration
- alerts.tf - Alert rules and configurations
- Any custom ServiceMonitor or PrometheusRule manifests
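The rename can be sketched as a bulk find-and-replace (GNU sed syntax; the old and new names are the examples from 6.1 — review the git diff before committing, since substring matches can over-replace):

```shell
# From the root of cpp-module-terraform-azurerm-aks-config
grep -rl "kube-prometheus-stack-prometheus" . \
  | xargs sed -i 's/kube-prometheus-stack-prometheus/kube-prometheus-stack-v3-prometheus/g'
```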
6.3 Verify Connectivity
# Test connectivity to Prometheus server
kubectl exec -n cert-manager deployment/cert-manager -- \
curl -I http://kube-prometheus-stack-v3-prometheus.prometheus.svc.cluster.local:9090
# Check Prometheus targets
kubectl port-forward -n prometheus svc/kube-prometheus-stack-v3-prometheus 9090:9090
# Open browser: http://localhost:9090/targets
# Verify metrics are being collected
kubectl port-forward -n prometheus svc/kube-prometheus-stack-v3-prometheus 9090:9090
# Open browser: http://localhost:9090/graph
# Query: up{job="kubernetes-pods"}
6.4 Verify Prometheus Adapter for Custom Metrics
Test that the Prometheus adapter is exposing custom metrics for Istio request rates:
# Check Prometheus adapter pods are running
kubectl get pods -n prometheus -l app.kubernetes.io/name=prometheus-adapter
# Query custom metrics API for Istio requests per second
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/ns-ste-ccm-91/services/usersgroups-service-wildfly-app/istio_requests_per_second" | jq '.'
# List all available custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name' | grep istio
# Check adapter logs for errors
kubectl logs -n prometheus -l app.kubernetes.io/name=prometheus-adapter --tail=50
Expected Output (custom metrics API):
{
"kind": "MetricValueList",
"apiVersion": "custom.metrics.k8s.io/v1beta1",
"metadata": {},
"items": [
{
"describedObject": {
"kind": "Service",
"namespace": "ns-ste-ccm-91",
"name": "usersgroups-service-wildfly-app",
"apiVersion": "v1"
},
"metricName": "istio_requests_per_second",
"value": "X" // Should be incrementing, not 0
}
]
}
Note: The value should show an incrementing number (not 0) if metrics are flowing correctly.
Istio Verification
7. Istio Sidecar Injection and Routing Validation
Verify that Istio sidecar injection is working correctly and validate both internal and external routing through the service mesh.
7.1 Verify Sidecar Injection Status
Check that Istio sidecars are automatically injected into application pods:
# Check namespace label for sidecar injection
kubectl get namespace ns-ste-ccm-91 -o jsonpath='{.metadata.labels.istio-injection}'
# Expected: enabled
# Verify pods have Istio sidecar (should show 2 containers: app + istio-proxy)
kubectl get pods -n ns-ste-ccm-91 -l app=usersgroups
# Expected: READY shows 2/2 (application + sidecar)
# Check sidecar container is present
kubectl get pod <pod-name> -n ns-ste-ccm-91 -o jsonpath='{.spec.containers[*].name}'
# Expected: Should include 'istio-proxy'
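If istioctl is available on your workstation, it gives a mesh-wide view of sidecar health in one command:

```shell
# Lists every sidecar and whether its config is synced with istiod
istioctl proxy-status
# Expected: each proxy shows SYNCED for CDS/LDS/EDS/RDS
```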
7.2 Test Internal Path-Based Routing via Sidecar Proxy
Context: The Istio sidecar binds to port 8080 on localhost within each pod. Applications use this binding to communicate with other services in the namespace via path-based routing through internal VirtualServices.
# Test internal routing from any pod in the namespace
kubectl exec -n ns-ste-ccm-91 <pod-name> -- \
curl -s http://127.0.0.1:8080/usersgroups-service/internal/metrics/ping
# Expected: pong
7.3 Test External Ingress Access
Validate external ingress traffic flow to verify cert-manager certificates, Istio gateway, and end-to-end connectivity:
# Test from your laptop (external access)
curl -s https://steccm91.ingress01.dev.nl.cjscp.org.uk/usersgroups-service/internal/metrics/ping
# Expected: pong
What this validates (end-to-end traffic flow):
| Component | Validation |
|---|---|
| Ingress Controller | Traffic reaches the Istio ingress gateway |
| cert-manager | TLS certificates are correctly provisioned and configured |
| Istio Gateway | External traffic is routed to the internal VirtualService |
| Internal Routing | Path-based routing via sidecar proxy (localhost:8080) |
Azure Service Operator (ASO) Verification
8. ASO Azure Resource Provisioning
Verify that Azure Service Operator can successfully create and manage Azure resources.
8.1 Check Existing ASO Resources
Verify that ASO-managed resources are in a healthy state:
# Check UserAssignedIdentity resources
kubectl get userassignedidentities -A
# Expected: All resources should show STATUS as "Succeeded"
# Check RoleAssignment resources
kubectl get roleassignments -A
# Expected: All resources should show STATUS as "Succeeded"
# Check FederatedIdentityCredential resources
kubectl get federatedidentitycredentials -A
# Expected: All resources should show READY as "True"
8.2 Verify Resource Details
Check detailed status of ASO resources:
# Describe a UserAssignedIdentity to check provisioning state
kubectl describe userassignedidentity <identity-name> -n <namespace>
# Look for:
# - Status: Succeeded
# - Provisioning State: Succeeded
# - No error messages in Events
# Describe a FederatedIdentityCredential
kubectl describe federatedidentitycredential <credential-name> -n <namespace>
# Look for:
# - Ready: True
# - No error conditions
8.3 Test ASO Resource Lifecycle
Validate ASO can create and delete resources by testing with a context release:
# Step 1: Delete a release for one context (this deletes ASO resources)
# In cpp-aks-deploy pipeline, delete the Helm release for a test context
# Example: Delete release for cpp-context-staging-bulkscan in STE environment
# Step 2: Verify ASO resources are deleted
kubectl get userassignedidentities -n <context-namespace>
kubectl get roleassignments -n <context-namespace>
kubectl get federatedidentitycredentials -n <context-namespace>
# Expected: Resources for that context should be removed
# Step 3: Re-run cpp-aks-deploy pipeline for the context
# This will deploy the application and recreate the ASO resources
# Step 4: Verify resources are recreated successfully after deploying the application
kubectl get userassignedidentities -n <context-namespace>
kubectl get roleassignments -n <context-namespace>
kubectl get federatedidentitycredentials -n <context-namespace>
# Expected:
# - UserAssignedIdentities: STATUS = "Succeeded"
# - RoleAssignments: STATUS = "Succeeded"
# - FederatedIdentityCredentials: READY = "True"
8.4 Check ASO Operator Health
Verify the ASO operator itself is running correctly:
# Check ASO controller pod status
kubectl get pods -n azureserviceoperator-system
# Expected: All pods in Running state
# Check ASO operator logs for errors
kubectl logs -n azureserviceoperator-system -l control-plane=controller-manager --tail=50
# Expected: No error messages related to resource provisioning
# Verify ASO can communicate with Azure
kubectl logs -n azureserviceoperator-system -l control-plane=controller-manager --tail=100 | grep -i "error\|failed"
# Expected: No Azure authentication or API errors
Common Issues:
- UserAssignedIdentity stuck in “Provisioning”: Check ASO operator logs for Azure API errors
- RoleAssignment fails: Verify the service principal has sufficient permissions to create role assignments
- FederatedIdentityCredential not Ready: Check OIDC issuer URL is correct and accessible
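A lighter-weight lifecycle check than a full context release is applying a throwaway ASO resource directly. This is a sketch only: the API version below matches recent ASO v2 releases and may differ on your cluster (check with `kubectl api-resources | grep managedidentity`), and the owner reference is a placeholder for an existing ASO ResourceGroup object.

```shell
kubectl apply -f - <<'EOF'
apiVersion: managedidentity.azure.com/v1api20181130
kind: UserAssignedIdentity
metadata:
  name: aso-smoke-test
  namespace: default
spec:
  location: uksouth
  owner:
    name: <ASO_RESOURCEGROUP_OBJECT>   # placeholder: an existing ASO ResourceGroup resource
EOF

# Watch until STATUS reaches Succeeded, then clean up
# (deleting the object also deletes the Azure resource)
kubectl get userassignedidentity aso-smoke-test -n default -w
kubectl delete userassignedidentity aso-smoke-test -n default
```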
Gatekeeper Verification
9. Gatekeeper Policy Enforcement
Verify that Gatekeeper admission controller is correctly enforcing policies, specifically image whitelisting and security contexts.
9.1 Check Gatekeeper Installation
Verify Gatekeeper components are running:
# Check Gatekeeper pods
kubectl get pods -n gatekeeper-system
# Expected:
# - gatekeeper-audit pods running
# - gatekeeper-controller-manager pods running
# Check Gatekeeper constraints
kubectl get constraints -A
# Expected output:
# k8srequiredrunasnonroot.constraints.gatekeeper.sh/enforce-runasnonroot-unless-istio deny 0
# k8swhitelistedimages.constraints.gatekeeper.sh/k8senforcewhitelistedimages deny 0
9.2 Test Image Whitelisting Policy
Test that Gatekeeper blocks deployments with non-whitelisted images:
# Create a test deployment with non-whitelisted image (nginx from Docker Hub)
kubectl create deployment nginx-test --image=nginx:1.14.2 -n default
# Expected: Deployment created but pods will fail to create
# Check deployment events
kubectl describe deployment nginx-test -n default
# Expected error in events:
# Error creating: admission webhook "validation.gatekeeper.sh" denied the request:
# [k8senforcewhitelistedimages] pod "nginx-test-xxx" has invalid image "nginx:1.14.2".
# Please, contact your DevOps. Follow the whitelisted images
# {"crmdvrepo01.azurecr.io/", "crmpdrepo01.azurecr.io/", "mcr.microsoft.com/"}
# Check ReplicaSet events for more details
kubectl get events -n default --field-selector involvedObject.kind=ReplicaSet | grep nginx-test
# Clean up test deployment
kubectl delete deployment nginx-test -n default
Example of expected blocking behavior:
93s Warning FailedCreate replicaset/nginx-test-xxx Error creating: admission webhook "validation.gatekeeper.sh" denied the request:
[k8senforcewhitelistedimages] pod "nginx-test-xxx" has invalid image "nginx:1.14.2".
Please, contact your DevOps. Follow the whitelisted images {"crmdvrepo01.azurecr.io/", "crmpdrepo01.azurecr.io/", "mcr.microsoft.com/"}
9.3 Verify Gatekeeper Audit Results
Check if Gatekeeper audit has detected any violations:
# Check constraint status for violations
kubectl get k8swhitelistedimages k8senforcewhitelistedimages -o yaml | grep -A 10 "totalViolations"
# Expected: Should show any existing violations in the cluster
# List all violations
kubectl get k8swhitelistedimages k8senforcewhitelistedimages -o jsonpath='{.status.violations}' | jq
Kiali Verification
10. Kiali Service Mesh Observability
Verify that Kiali is accessible and can communicate with Istio and Prometheus.
10.1 Access Kiali Dashboard
Access the cluster-specific Kiali URL:
# Get the cluster-specific Kiali URL from VirtualService
kubectl get virtualservice -n kiali-operator -o jsonpath='{.items[*].spec.hosts}'
# Example output for CL01 test cluster:
# ["kiali.mgmt01.dev.nl.cjscp.org.uk","kiali.mgmt.cs01cl01.dev.nl.cjscp.org.uk"]
Access Kiali:
- Open browser to cluster-specific URL: https://kiali.mgmt.cs01cl01.dev.nl.cjscp.org.uk (for CL01)
- Login using: hmctsnonlive.onmicrosoft.com account credentials
- After successful login, you will be redirected to the Kiali console
10.2 Verify Kiali Health and Component Connectivity
Check that Kiali can communicate with all Istio components:
Navigate to Mesh Overview: Go to https://kiali.mgmt.cs01cl01.dev.nl.cjscp.org.uk/kiali/console/mesh and check the mesh status page.
Expected healthy state:
- Green status indicators for all components:
  - Kiali can talk to Istio
  - Kiali can talk to Prometheus
- Green tick (✓) next to Kubernetes near the Kiali logo (top-left)
- All service mesh components shown in green

Indicators of issues:
- Red status on any component indicates connectivity or configuration problems
- Check component-specific errors displayed on the mesh page
10.3 Verify Kiali Pods and Services
# Check Kiali operator pods
kubectl get pods -n kiali-operator
# Expected: kiali-operator pod in Running state
# Check Kiali instance (if deployed in separate namespace)
kubectl get kiali -A
# Check Kiali service
kubectl get svc -n kiali-operator
# Verify VirtualService configuration
kubectl get virtualservice -n kiali-operator -o yaml
10.4 Common Issues
- Red Prometheus indicator: Check Prometheus service name and connectivity (see Section 6)
- Red istiod indicator: Verify Istio control plane is running (kubectl get pods -n istio-system)
- Login fails: Verify Azure AD authentication configuration for hmctsnonlive.onmicrosoft.com; check the mdv-k8s-monitor app registration for any expired secrets
- Green tick missing next to Kubernetes: Check Kiali's access to the Kubernetes API
pgAdmin Verification
11. pgAdmin PostgreSQL Database Management
Verify pgAdmin is accessible and properly configured to manage PostgreSQL databases.
11.1 Access pgAdmin
Access pgAdmin through the management ingress:
- URL: https://pgadmin.mgmt01.<environment>.nl.cjscp.org.uk (e.g., pgadmin.mgmt01.dev.nl.cjscp.org.uk)
- Authentication: OAuth via hmctsnonlive.onmicrosoft.com
11.2 Verify Server List Population
Test pgAdmin functionality including OAuth login and server list generation:
- Test Login - verify OAuth authentication via hmctsnonlive.onmicrosoft.com works
- Check Server List - verify servers auto-populate from the server_list_sync.py custom script
- Verify server_list_sync.py - confirm the custom server list generation script is working correctly
- Test Connectivity - expand a server to verify database connections
- Test Connectivity - expand a server to verify database connections
For detailed instructions on authentication, server configuration, and the server_list_sync.py script, see:
Reference: How to access Postgres Databases with PGAdmin
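Cluster-side checks can supplement the browser test, assuming pgAdmin runs in the pgadmin namespace (as listed in section 13.1):

```shell
# Verify pgAdmin pods are running
kubectl get pods -n pgadmin

# Verify Istio routing exposes the expected host
kubectl get virtualservice -n pgadmin -o jsonpath='{.items[*].spec.hosts}'
```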
Dynatrace Verification
12. Dynatrace Application Performance Monitoring
Verify that Dynatrace is properly deployed and monitoring the cluster and workloads.
12.1 Prerequisites
Important: You must create the NFT non-active cluster first before performing Dynatrace validation.
Refer to the integration guide: AKS-Dynatrace Integration
12.2 Verify Dynatrace Pods
Check that all Dynatrace components are running:
# Check all Dynatrace pods
kubectl get pods -n dynatrace
# Expected pods:
# - dynatrace-operator (deployment)
# - dynatrace-webhook (deployment)
# - dynatrace-oneagent-csi-driver (DaemonSet)
# - dynatrace-oneagent (DaemonSet - runs on each node)
# All pods should be in Running state
12.3 Verify DynaKube Custom Resource
Check the DynaKube CR status:
# Check DynaKube CR
kubectl get dynakube -n dynatrace
# Expected: Phase should be "Running"
# Get detailed status
kubectl describe dynakube -n dynatrace
12.4 Verify OneAgent Injection
Verify OneAgent init containers are injected into application pods via CSI driver:
# Check for OneAgent CSI driver pods (DaemonSet)
kubectl get pods -n dynatrace -l app.kubernetes.io/name=csi-driver
# Verify init container injection in an application pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Init Containers:"
# Expected: Application pods should have oneagent init container
# injected for Dynatrace metrics collection
12.5 Verify Dynatrace Console
Check that the cluster and workloads appear in the Dynatrace console:
- Login to Dynatrace console for the HMCTS tenant
- Navigate to Infrastructure → Kubernetes
- Verify cluster appears in the list with the correct name (use whatever name was configured when adding the cluster to Dynatrace)
12.6 Verify Metrics and Observability
In Dynatrace console, verify metrics are visible for:
Cluster Level:
- Cluster resource utilization (CPU, memory)
- Cluster health and status
- Kubernetes events

Node Level:
- Individual node metrics
- Node resource usage
- Node health status

Process Level:
- Application processes detected
- Process resource consumption
- Process dependencies

Workload Level:
- Pod metrics and status
- Container resource usage
- Service-to-service communication

Namespace Events:
- Kubernetes events captured
- Pod lifecycle events
- Deployment and scaling events
Note: In NFT, avoid modifying existing dashboards as it may interfere with NFT testing. Dashboard validation should be performed after switchover.
Additional Integration Checks
These checks validate the overall system functionality by testing component integration and cluster readiness.
13. Component Node Pool and Security Context Verification
Verify that all system components are scheduled on the correct node pools with proper security contexts.
13.1 Check Component Node Pool Placement
System components should run on the sysagentpool node pool:
```bash
# Check all system component namespaces
for ns in cert-manager gatekeeper-system keda azureserviceoperator-system \
  dynatrace istio-system kiali-operator sonarqube pgadmin prometheus; do
  echo "=== Namespace: $ns ==="
  kubectl get pods -n $ns -o wide 2>/dev/null | grep -E "NAME|Running"
  echo ""
done

# Verify nodes have the correct labels
kubectl get nodes -L agentpool
# Expected: aks-sysagentpool-* and aks-wrkagentpool-* nodes
```
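The loop above requires eyeballing the NODE column; the check can also be made mechanical. A sketch that filters `kubectl get pods -o wide --no-headers` output with awk (the two sample lines are illustrative, not from a real cluster):

```shell
# Flag pods whose node is not in the sysagentpool.
# Live usage: kubectl get pods -n "$ns" -o wide --no-headers | check_placement
check_placement() {
  # In "-o wide --no-headers" output the NODE name is column 7
  awk '$7 !~ /^aks-sysagentpool-/ { print $1 " -> " $7 }'
}

# Illustrative sample rows: NAME READY STATUS RESTARTS AGE IP NODE NOMINATED READINESS
sample='keda-operator-abc 1/1 Running 0 5d 10.0.0.4 aks-sysagentpool-12345-vmss000000 <none> <none>
keda-metrics-xyz 1/1 Running 0 5d 10.0.0.9 aks-wrkagentpool-12345-vmss000001 <none> <none>'

echo "$sample" | check_placement   # prints only the misplaced pod
```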
13.2 Verify Security Context (runAsNonRoot)
Check that pods have proper security contexts configured:
```bash
# Check security context across system namespaces
for ns in cert-manager gatekeeper-system keda azureserviceoperator-system; do
  echo "=== Namespace: $ns ==="
  kubectl get pods -n $ns --no-headers 2>/dev/null | awk '{print $1}' | while read pod; do
    echo "Pod: $pod"
    kubectl get pod $pod -n $ns \
      -o jsonpath='{range .spec.containers[*]}Container: {.name} | runAsNonRoot: {.securityContext.runAsNonRoot}{"\n"}{end}' 2>/dev/null
    kubectl get pod $pod -n $ns \
      -o jsonpath='Pod-level runAsNonRoot: {.spec.securityContext.runAsNonRoot}{"\n"}' 2>/dev/null
    echo ""
  done
  echo ""
done
```
14. Deploy STE Environment
Deploy a test STE (System Test Environment) namespace to validate end-to-end deployment capabilities.
14.1 Get STE Allocation
Request an STE allocation for testing:
Reference: EA Environment Management
Note the allocated stack number (e.g., steccm91).
14.2 Identify Latest Release
Find the latest release branch for the CPP Pipeline:
- Check with QA team for the recommended STE release branch
- Pipeline: CPP-AKS-Deploy Pipeline
- Branch format: `dev/<version>_ste` (e.g., `dev/2604_ste`)
14.3 Run CPP-AKS-Deploy Pipeline
Execute the deployment pipeline with the following parameters:
Pipeline: CPP-AKS-Deploy
Parameters:
| Parameter | Value | Notes |
|---|---|---|
| Branch | `dev/<version>_ste` | Example: `dev/2604_ste` |
| Environment | `ste` | Fixed value |
| Stack | `steccm<number>` | Example: `steccm91` (use your allocated number) |
| Event Grid | (default) | Leave as default |
| Cluster | `k8-dev-cs01-cl01` OR `k8-dev-cs01-cl02` | Non-active cluster you are testing |
| deploy-service | ✓ Ticked | Required |
| deploy_idam | ✓ Ticked | Required |
| Create DB | ✓ Ticked | Required |
| create_replica_configmap | ✓ Ticked | Required |
Expected outcome: Pipeline completes successfully and STE namespace is deployed to the non-active cluster.
14.4 Update DNS Records
After deployment, update DNS to route traffic to the non-active cluster.
Private DNS Zone: dev.nl.cjscp.org.uk
Add three CNAME records (replace steccm91 with your stack number and cs01cl01 with your cluster):
```
# Record 1: CDNS Web
Name:  steccm91-cdns.web01
Type:  CNAME
Value: web.cs01cl01.dev.nl.cjscp.org.uk.
TTL:   300

# Record 2: Frontend Web
Name:  steccm91-frontend.web01
Type:  CNAME
Value: web.cs01cl01.dev.nl.cjscp.org.uk.
TTL:   300

# Record 3: Ingress
Name:  steccm91.ingress01
Type:  CNAME
Value: ingress.cs01cl01.dev.nl.cjscp.org.uk.
TTL:   300
```
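Since all three records follow a fixed pattern, generating them from the stack number and cluster avoids copy-paste errors; a small helper sketch built from the record shapes listed above:

```shell
# Print the three CNAME records to create for a given stack and cluster.
ste_dns_records() {
  local stack="$1" cluster="$2" zone="dev.nl.cjscp.org.uk"
  printf '%s-cdns.web01      CNAME  web.%s.%s.\n'     "$stack" "$cluster" "$zone"
  printf '%s-frontend.web01  CNAME  web.%s.%s.\n'     "$stack" "$cluster" "$zone"
  printf '%s.ingress01       CNAME  ingress.%s.%s.\n' "$stack" "$cluster" "$zone"
}

ste_dns_records steccm91 cs01cl01
```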
Verification:

- Access `https://steccm91-frontend.web01.dev.nl.cjscp.org.uk` (username: `erica@test.hmcts.net`; for the current password, check with the QA team)
- Verify all pods are running:

  ```bash
  kubectl get pods -n ns-ste-ccm-<number>
  ```
Note: Remember to revert the DNS records after testing is complete to restore traffic to the active cluster.
15. Priming Pipeline
The priming pipeline sets up databases with stub data, creates users, and runs validation checks to ensure the environment is ready for testing.
Pipeline: CPP-AKS-Priming
What Priming Does:
- Populates databases with stub/test data
- Creates test users and credentials
- Sets up initial configuration
- Runs validation checks on cluster components
- Verifies database connectivity and schema
Pipeline Parameters:
| Parameter | Value | Notes |
|---|---|---|
| Environment | `ste` | Fixed value |
| Stack | `steccm<number>` | Example: `steccm91` (use your allocated stack number) |
| Cluster | `K8-DEV-CS01-CL01` OR `K8-DEV-CS01-CL02` | Non-active cluster you are testing |
| quick_clear | ✓ Ticked | Required |
| priming_enable | ✓ Ticked | Required |
| priming_image_tag | (release version) | Check with QA team for the latest release version |
| sitdb_restore_dataset_flag | ✓ Ticked | Required |
| restore_dataset_db | `postgres-postgresql` | Fixed value |
Verification:
- Check pipeline run completes successfully with green status
- Review pipeline logs for any warnings or errors
Expected Outcome: Priming pipeline completes successfully with all validation checks passing.
16. Validation Pipeline
The validation pipeline performs comprehensive system tests by deploying a validation namespace on the non-active cluster and running integration tests. This validates Istio internal routing and SonarQube connectivity.
Auto-Trigger: The validation pipeline is automatically triggered when commits are pushed to team/* or main branches.
16.1 Create Feature Branch
Create a team branch for testing:
```bash
# Example branch name
git checkout -b team/DTSPO-30530
```
16.2 Update Application Version
Update the version in pom.xml to include the ticket ID:
File: pom.xml (e.g., cpp-context-staging-bulkscan/pom.xml)
```xml
<!-- FIND: -->
<version>17.103.43-SNAPSHOT</version>

<!-- REPLACE WITH: -->
<version>17.103.43-DTSPO-30530-SNAPSHOT</version>
```
This creates a unique version for testing on the non-active cluster.
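The version bump can be scripted rather than edited by hand; a sed sketch using the example ticket ID from above (verify the `-SNAPSHOT` match is unique in your pom.xml before using `sed -i`):

```shell
TICKET="DTSPO-30530"

# Append the ticket ID before -SNAPSHOT in a <version> tag.
# Live usage: sed -i "s/-SNAPSHOT</-${TICKET}-SNAPSHOT</" pom.xml
bump() { sed "s/-SNAPSHOT</-${TICKET}-SNAPSHOT</"; }

echo '<version>17.103.43-SNAPSHOT</version>' | bump
# -> <version>17.103.43-DTSPO-30530-SNAPSHOT</version>
```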
16.3 Update Pipeline Configuration
Update the context repository’s azure-pipelines.yaml to target the non-active cluster and test SonarQube instance.
File: azure-pipelines.yaml (e.g., cpp-context-staging-bulkscan/azure-pipelines.yaml)
Change 1: Update Agent Pool Identifier
```yaml
# FIND:
pool:
  name: "MDV-ADO-AGENT-AKS-01"
  demands:
    - identifier -equals centos8-j17

# REPLACE WITH (for CL01 test cluster):
pool:
  name: "MDV-ADO-AGENT-AKS-01"
  demands:
    - identifier -equals centos8-j17-cl01
```
Note: The identifier must match the unique identifier deployed during ADO agent setup in Section 1 (e.g., centos8-j17-cl01 for CL01, centos8-j17-cl02 for CL02).
Change 2: Update Variable Group for Test SonarQube
```yaml
# FIND:
variables:
  - ${{ if eq(parameters.sonarQubeType, 'sonarQubeAKS') }}:
    - group: cpp-nonlive-sonarqube-aks

# REPLACE WITH:
variables:
  - ${{ if eq(parameters.sonarQubeType, 'sonarQubeAKS') }}:
    - group: cpp-nonlive-sonarqube-aks-testing  # Use testing variable group
```
Optional: Update Template Branch (if testing template changes)
```yaml
resources:
  repositories:
    - repository: cppAzureDevOpsTemplates
      type: github
      name: hmcts/cpp-azure-devops-templates
      endpoint: 'hmcts'
      ref: '<FEATURE-BRANCH>'  # Feature branch with variable group changes (if needed)
```
16.4 Merge and Trigger Pipeline
```bash
# Commit changes
git add pom.xml azure-pipelines.yaml
git commit -m "DTSPO-30530: Update for non-active cluster validation"

# Push to team branch (this auto-triggers the pipeline)
git push origin team/DTSPO-30530
```
16.5 Monitor Pipeline Execution
Pipeline: cpp-context-staging-bulkscan (definitionId=319)
The pipeline will:
- Deploy validation namespace on the non-active cluster
- Build and deploy the application
- Run integration tests
- Validate Istio internal routing (localhost:8080)
- Test SonarQube connectivity and code quality scanning
Verification:
- Check the pipeline completes successfully with green status
- Verify the validation namespace is deployed:

  ```bash
  kubectl get ns | grep validation
  ```

- Check pods are running:

  ```bash
  kubectl get pods -n <validation-namespace>
  ```

- Review test results in the Azure DevOps pipeline logs
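Pods can take a while to become Ready after the pipeline finishes, so the checks above may need polling; a generic retry helper sketch (demonstrated with a local command — the kubectl line in the comment is illustrative, not the pipeline's own mechanism):

```shell
# Retry a command until it succeeds or attempts run out.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Illustrative live usage (kubectl wait with --timeout=0 checks once per attempt):
#   retry 30 10 kubectl wait --for=condition=Ready pods --all \
#     -n <validation-namespace> --timeout=0
retry 3 0 true && echo "ready"
```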
What This Validates:
- Istio Internal Routing: Tests service-to-service communication via sidecar proxy (localhost:8080)
- SonarQube Integration: Validates code quality scanning on the non-active cluster
- Component Integration: Verifies all system components work together
- Cluster Readiness: Confirms the cluster is ready for dev switchover
Expected Outcome: All validation tests pass, confirming the cluster is ready for switchover.