Prometheus/Grafana Patching Example
This document is a guide to patching the Prometheus/Grafana stack on AKS.
Prometheus/Grafana patches should be tested in sbox first to avoid downtime in other environments.
It is important to understand what changes are in the version upgrade, especially any breaking changes. Prometheus provides an Upgrade document for each version released, which should help in understanding the changes in major versions.
There should be a Renovate pull request in the Flux repo containing release notes that show the breaking changes. However, many release notes may be included, depending on how long it has been since the last patch and how many versions of the Prometheus chart have been released in that time.
For example, when patching from 79.11.0 → 81.6.3, 59 new versions of the chart had been released.
Be mindful that the PR in this instance did not include the CRD update, so it is not simply a case of merging the PR.
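Because the Upgrade document is most useful for major versions, it can help to list which chart major versions an upgrade passes through before reading the release notes. The following is a minimal sketch; the function name majors_crossed is hypothetical and not part of any tool used in this guide.

```shell
# Print every chart major version entered when moving from one
# version to another. Hypothetical helper, for illustration only.
majors_crossed() {
  from_major="${1%%.*}"   # e.g. 79 from 79.11.0
  to_major="${2%%.*}"     # e.g. 81 from 81.6.3
  m=$((from_major + 1))
  while [ "$m" -le "$to_major" ]; do
    echo "$m"
    m=$((m + 1))
  done
}

# Example usage:
#   majors_crossed 79.11.0 81.6.3
```

For the 79.11.0 → 81.6.3 example this lists majors 80 and 81, which are the entries to look for in the Upgrade document.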
Patching Sandbox
In order to allow patching of sbox only, a new directory can be created for the updated CRD URLs:
apps/monitoring/kube-prometheus-stack-crds-upgrade-v81/kustomization.yaml
which is a copy of the existing file with updated version numbers.
apps/monitoring/kube-prometheus-stack-crds-upgrade-v81/kustomize.yaml
which contains:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: prometheus-crd
  namespace: flux-system
spec:
  path: ./apps/monitoring/kube-prometheus-stack-crds-upgrade-v81
See example PR containing these files
Using this method allows us to target specific environments with the new CRD deployment.
To use these files, patch the desired cluster with the new directory in its base kustomization file, e.g. clusters/sbox/base/kustomization.yaml:
- path: ../../../apps/monitoring/kube-prometheus-stack-crds-upgrade-v81/kustomize.yaml
With the new CRD version now available, a patch can be added to the sbox 00 and 01 config files to update the chart version for sbox only. For example:
chart:
  spec:
    chart: kube-prometheus-stack
    version: 81.6.3
    sourceRef:
      kind: HelmRepository
      name: prometheus
      namespace: monitoring
When ready, raise a pull request for review (see example PR).
CI checks will be carried out to make sure the changes you’ve made are valid and will apply successfully when merged (these can be found in the tests folder).
Review the pipeline checks for errors. If there are no errors and the PR has been approved, merge the PR.
Verify Flux has applied the changes successfully
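Before checking individual pods, it can be useful to confirm that the Flux objects themselves report Ready. The sketch below reads the Ready condition with kubectl and a jsonpath filter. The function name flux_ready is hypothetical, and the resource names in the usage comments are taken from the examples in this guide, so adjust them if your cluster differs.

```shell
# Succeeds only if the named object reports a Ready condition of "True".
# Hypothetical helper; arguments are resource kind, name and namespace.
flux_ready() {
  kind="$1"
  name="$2"
  ns="$3"
  status="$(kubectl get "$kind" "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}

# Example usage (names assumed from this guide):
#   flux_ready kustomizations.kustomize.toolkit.fluxcd.io prometheus-crd flux-system
#   flux_ready helmreleases.helm.toolkit.fluxcd.io kube-prometheus-stack monitoring
```

The fully qualified resource names are used so that the Flux Kustomization is not confused with any other resource kind of the same short name.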
Check Pods are online
Using kubectl, check that the pods are online and have a short uptime, which indicates a newly created pod:
kubectl get pods -n monitoring | grep kube-prometheus-stack
output:
kube-prometheus-stack-operator-6944864b75-2d5nb 1/1 Running 0 19m
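When there are many pods, scanning the list by eye is error-prone. As a rough sketch, the output of kubectl get pods can be filtered so that only pods that are not fully ready are printed. The function name unhealthy_pods is hypothetical, and it assumes the default column layout (NAME, READY, STATUS, ...).

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the
# names of pods that are not Running or do not have all containers ready.
# Note: pods from finished jobs (STATUS "Completed") are also listed.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column, e.g. 1/1
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example usage:
#   kubectl get pods -n monitoring --no-headers | unhealthy_pods
```

No output means every listed pod is Running with all of its containers ready.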
Check Pod versions
To confirm that the update took place, check that the new pods have the correct chart version:
kubectl get pod {pod-name} -o yaml -n monitoring | grep -E "app.kubernetes.io/version|chart"
which should return the relevant fields:
app.kubernetes.io/version: 81.6.3
chart: kube-prometheus-stack-81.6.3
You can also check the HelmRelease to confirm that it reports the correct version:
kubectl get hr -n monitoring
This returns all HelmReleases in the monitoring namespace. Look for kube-prometheus-stack, which should show the chart version now deployed:
kube-prometheus-stack 14d True Helm upgrade succeeded for release monitoring/kube-prometheus-stack.v2 with chart kube-prometheus-stack@81.6.3
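To compare the deployed version against the expected one without reading the whole status message, the version can be extracted from that line. This is a minimal sketch that assumes the status message has the "kube-prometheus-stack@<version>" form shown above; the function name deployed_chart_version is hypothetical.

```shell
# Reads HelmRelease status lines on stdin and prints the chart version
# that follows "kube-prometheus-stack@" in the status message.
deployed_chart_version() {
  sed -n 's/.*kube-prometheus-stack@\([0-9][0-9.]*[0-9]\).*/\1/p'
}

# Example usage:
#   kubectl get hr kube-prometheus-stack -n monitoring --no-headers | deployed_chart_version
```

If nothing is printed, the message did not match the assumed format and the HelmRelease should be inspected directly.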
Review pod logs for errors
Check the pod logs for any obvious errors:
kubectl logs {pod-name} -n monitoring
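Long logs can be narrowed down to likely problems with a simple filter. The sketch below is only a starting point: the pattern assumes logs that mark severity as level=error or level=fatal, or that contain the word panic, which may not match every component in the stack. The function name error_lines is hypothetical.

```shell
# Reads log lines on stdin and prints only those that look like errors.
# The pattern is an assumption about the log format; extend it as needed.
error_lines() {
  grep -iE 'level=(error|fatal)|panic' || true   # no matches is not a failure
}

# Example usage:
#   kubectl logs {pod-name} -n monitoring | error_lines
```

An empty result does not prove the pod is healthy, so a quick read of the most recent log lines is still worthwhile.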
UI Check
The stack also provides a UI, Grafana, which should be checked.
Ensure the CFT and SDS dashboards are accessible (requires VPN access):
Optional - Delete HR to check it comes back online cleanly
The example commands below are for both CFT/SDS using prometheus:
kubectl get hr -n monitoring
kubectl delete hr kube-prometheus-stack -n monitoring
As the pods come online, you will need to go through the checks again to ensure everything is working as expected:
- Pods are up and running
- Logs show no errors
- Versions are all correct
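After the HelmRelease is deleted, it takes some time for Flux to recreate it and for the release to become Ready, so the checks may fail at first. One way to handle this is a small polling helper, sketched below. The function name retry_until and the attempt and delay values in the usage comment are illustrative assumptions, not values from this guide.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Arguments: number of attempts, delay in seconds, then the command.
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (values are illustrative):
#   retry_until 30 20 kubectl wait --for=condition=Ready \
#     helmrelease/kube-prometheus-stack -n monitoring --timeout=10s
```

The loop is needed because kubectl wait exits with an error straight away while the HelmRelease does not yet exist.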
Other non-prod environments
Carry out the same changes as described in this guide for all non-production environments in both SDS and CFT.
This should only be carried out if sbox was successful in both the SDS and CFT clusters.
Prod environments
In the Flux repos there will be a Renovate PR that can be merged, but first check whether the changes include the CRD updates as described in this guide.
Once the Renovate PR has been merged, remove the previous patches from non-prod environments.
