How to switch PTL Clusters Runbook
We previously moved the AKS cluster creation away from ARM templates to Terraform, replacing the old release pipeline here with this new one here.
This wiki page documents the tasks we had to perform when switching PTL Jenkins over to Terraform and the new infrastructure. We don't expect this to need doing again; however, just in case it does, the following information will likely be useful.
Change Request
If you are destroying / rebuilding the PTL cluster then you will need to raise a Change Request, as teams will be unable to use Jenkins as a Path to Live while it is unavailable. For example, when we recently moved PTL Jenkins to a new cluster we had to raise this CR.
How to move PTL Jenkins to a new cluster
There are several steps to follow to move PTL Jenkins to a new cluster. These are:
Go to this page. As per this support page, this will put Jenkins into Quiet mode in preparation for a restart and will prevent any new jobs from starting.
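If that page is unreachable for any reason, the same thing can be done from the script console. A minimal sketch using the core Jenkins API:

```groovy
// Put the controller into Quiet mode: running builds finish,
// but no new builds are allowed to start.
import jenkins.model.Jenkins

Jenkins.instance.doQuietDown()
println "Quiet mode enabled: ${Jenkins.instance.isQuietingDown()}"
```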
In Jenkins, go to the script console and run the script below to cancel any queued items and abort any currently running builds:
```groovy
// Cancel everything waiting in the build queue (except Extenda jobs).
Jenkins.instance.queue.items.findAll { !it.task.name.contains("Extenda") }.each {
    println "Cancel ${it.task.name}"
    Jenkins.instance.queue.cancel(it.task)
}

// Walk the whole job tree and abort any in-progress builds.
Jenkins.instance.items.each {
    stopJobs(it)
}

def stopJobs(job) {
    if (job in jenkins.branch.OrganizationFolder) {
        // Git behaves well so no need to traverse it.
        return
    } else if (job in com.cloudbees.hudson.plugins.folder.Folder) {
        job.items.each { stopJobs(it) }
    } else if (job in org.jenkinsci.plugins.workflow.multibranch.WorkflowMultiBranchProject) {
        job.items.each { stopJobs(it) }
    } else if (job in org.jenkinsci.plugins.workflow.job.WorkflowJob) {
        if (job.isBuilding() || job.isInQueue() || job.isBuildBlocked()) {
            job.builds.findAll { it.inProgress || it.building }.each { build ->
                println "Kill $build"
                build.finish(hudson.model.Result.ABORTED, new java.io.IOException("Aborted from Script Console"))
            }
        }
    }
}

return true
```
There should now be no jobs running within Jenkins. Next you need to delete all the agents that are within Jenkins here. You can delete them manually, or script it if easier (see the sketch below).
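A minimal script-console sketch for the agent clean-up, assuming every registered agent is disposable and will be recreated against the new cluster:

```groovy
// Remove every agent node registered with the controller.
// Jenkins.instance.nodes excludes the built-in node, so this
// only touches agents.
import jenkins.model.Jenkins

Jenkins.instance.nodes.each { node ->
    println "Removing agent ${node.nodeName}"
    Jenkins.instance.removeNode(node)
}
return true
```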
Now you can shut Jenkins down via this page. As per this support page, this will shut down Jenkins.
The disk that Jenkins uses currently lives here. If the resource group (RG) that the Jenkins disk is stored in is going to change, take a snapshot of the disk and then create a new disk from that snapshot in the new RG.
If the Jenkins disk location has changed as per the previous step, you will need to update Flux to point to the new location here. If the disk location isn't changing, this step can be skipped.
Update the database subnet whitelisting to the new infrastructure, example here.
Update DNS to point to the new PTL load balancer IP, example here.
Update the dbrule for PTL, example here.
Update the Palos to add the new VNet, example here.
Now we can destroy the cluster via this pipeline. Just ensure you set the environment to PTL.
Once destroyed, use the same pipeline, select the PTL environment and run Apply to rebuild the cluster.
If any of the CIDR ranges have changed for the environment then you will need to update the F5 VPN routing; instructions are located within Confluence here.
If the load balancer IP has changed, you will also need to update local traffic within the F5 VPN. Within the F5 VPN, browse to Local Traffic on the main menu, then select Pools and Pool List. There are currently six pools listed, but you only need to update the following: pool_build-beta.platform.hmcts.net, pool_response-api.platform.hmcts.net and pool_response.platform.hmcts.net. Within each of these pools, click on Members and add a new member; we recommend copying everything from the existing member except the Address, which should be substituted with the new load balancer IP. Finally, ensure that you Disable the old member and Enable the newly added member before clicking the Update button.
Now log back into Jenkins (it may take a while to come back) and, once logged in, run the cnp-plum-recipes-service job and confirm it is successful.
Providing the cnp-plum-recipes-service job was successful, Jenkins should now be fine.
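If you would rather trigger the smoke-test job from the script console than the UI, a minimal sketch is below; the full job name is hypothetical, so substitute the path of cnp-plum-recipes-service as it appears in your folder structure:

```groovy
import jenkins.model.Jenkins
import org.jenkinsci.plugins.workflow.job.WorkflowJob

// Hypothetical full name - adjust to match your folder layout.
def job = Jenkins.instance.getItemByFullName('cnp-plum-recipes-service/master', WorkflowJob)
if (job != null) {
    job.scheduleBuild2(0)
    println "Triggered ${job.fullName}"
} else {
    println "Job not found - check the full name"
}
```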
Gotchas
On the last switchover, we noticed the next day that a few Jenkins jobs were failing to connect to several storage accounts. To get around this, go into the Networking section of each affected storage account and add the new VNet that is associated with the PTL environment, covering the subnets named aks-00, aks-01 and iaas.
Also on storage accounts: there were three storage accounts, named reformscanaat, reformscanstaging and reformscanprod, for which we had to add DNS records into this private DNS zone, as those storage accounts were using private endpoints.
There were agents within Jenkins which were randomly going offline and being deleted while jobs were still using them. Thankfully Tim found the cause of the issue here. It was caused by the old PTL cluster not being fully stopped before switching over, which meant the old Jenkins cluster kept deleting agents because they weren't being used by it. Once the old cluster was fully stopped, agents were no longer deleted while in use.
Troubleshooting
- Once the cluster is up and running, if you see that the Jenkins pods keep rebooting, you can kubectl exec into the pod (or use the Lens app to do this) and put a script in place that starts Jenkins in Quiet mode, preventing any jobs from starting at startup. Steps to do this are here. When I encountered this issue previously I found it quicker to exec into the pod via the Lens app, but doing it via the command line is fine too. Once Jenkins is back up with that script in place, you should be able to look at the logs and troubleshoot further if needed. A sketch of such a script follows below.
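A minimal sketch of that startup script, assuming the standard Jenkins home layout (the exact path depends on how the Jenkins image is configured, so treat the location below as an assumption; the linked steps are authoritative). Jenkins runs everything in init.groovy.d at startup, so a hook like this puts the controller straight into Quiet mode before any jobs can schedule:

```groovy
// Hypothetical file: /var/jenkins_home/init.groovy.d/quiet-mode.groovy
// Executed automatically on startup; puts Jenkins into Quiet mode
// so no jobs start while you troubleshoot.
import jenkins.model.Jenkins

Jenkins.instance.doQuietDown()
println "Jenkins started in Quiet mode - remove this file and restart (or cancel Quiet mode in the UI) once troubleshooting is done."
```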