TCA 2.1 What's new - Kubernetes Cluster Diagnosis

Share on:

Telco Cloud Automation 2.1 introduced a new feature to be able to perform better health checks for deployed Kubernetes clusters. This post will explore that functionality.

The diagnosis can be run against management clusters, workload clusters or specific nodepools. When going into the details of each of those objects you will find a Diagnosis tab like in the picture below.

Cluster Diagnosis

When you first run this you obviously would not have results listed, but still have the button to "Run Diagnosis". When that button is clicked a new wizard pops up which lets you choose if you only want to run a subset of tests (you can simply multi-select by clicking on each of the options you want) or if you leave the field blank (it does read optional) all of the tests at once as shown in the figure below.

Diagnosis Test Selection

The diagnosis itself will be run through a small helper pod which gets downloaded, the default timeout for this is 60 seconds, depending on the connection to the repository the first run may error out if by that timeout the pod is not fully downloaded or up and running as shown in the picture below. Simply repeat the run and everything should just work fine though.

Diagnosis run timeout

Once the run completes you will see a summary of the run as well as the conducted tests with their respective pass or failure status.

Successfully completed diagnosis run

If a test fails the overall status and the respective test will be marked as fail.

Failed diagnosis results

To get further insight into what made a test fail you can download the diagnosis report, which will download a tarball with a much more detailed html report, for this example one of the hosts in the cluster seems to be missing proper NTP configuration.

Diagnosis test report

When run against a management cluster the following test cases will be run if the whole test suite is selected:

  • Management Cluster VSphereVM Status Diagnosis
  • Management Cluster Machine Status Diagnosis
  • Cluster Control-Plane Nodes Diagnosis
  • Cluster Worker Nodes Diagnosis
  • Node IP address Diagnosis
  • CAPI System Pod Diagnosis
  • CAPV System Pod Diagnosis
  • Cert Manager Pod Diagnosis
  • VMConfig Operator Diagnosis
  • NodeConfig Operator Diagnosis
  • VSphere Controller Manager Diagnosis
  • Etcd Pods Status Diagnosis
  • Kube-apiserver Pods Status Diagnosis
  • Kube-scheduler Pods Status Diagnosis
  • Kube-controller-manager Pods Status Diagnosis
  • Kube-proxy Pods Status Diagnosis
  • CoreDNS Pods Status Diagnosis
  • CNI Antrea Pods Status Diagnosis
  • CSI Status Diagnosis
  • Kapp-controller Pods Status Diagnosis
  • Tanzu-addon-manager Pods Status Diagnosis
  • Tkr Status Diagnosis
  • NodeConfig certificate Diagnosis
  • Host Time Configuration Diagnosis On Prime VC
  • VIP address conflict detection
  • Etcd Status Diagnosis

When run against a worklaod cluster the following test cases will be run if the whole test suite is selected:

  • VSphereVM Status Diagnosis
  • Machine Status Diagnosis
  • Cluster Control-Plane Nodes Diagnosis
  • Cluster Worker Nodes Diagnosis
  • Node IP address Diagnosis
  • VMConfig Diagnosis
  • NodeConfig Operator Diagnosis
  • VSphere Controller Manager Diagnosis
  • Etcd Pods Status Diagnosis
  • Kube-apiserver Pods Status Diagnosis
  • Kube-scheduler Pods Status Diagnosis
  • Kube-controller-manager Pods Status Diagnosis
  • Kube-proxy Pods Status Diagnosis
  • CoreDNS Pods Status Diagnosis
  • CNI Status Diagnosis
  • CSI Status Diagnosis
  • Kapp-controller Pods Status Diagnosis
  • NodeConfig certificate Diagnosis
  • VIP address conflict detection
  • Etcd Status Diagnosis
  • NodePolicy Diagnosis
  • NodePolicyMachineStatus Diagnosis
  • NodeProfileStatus Diagnosis

Overall this for me is one of the best features which Telco Cloud Automation added. It's an extremely common ask to be able to perform a general health check on a cluster, especially prior to performing an upgrade, or validating the functional status after an onboarding or upgrade operation. Also having dealt with a high scale RAN deployment the last couple of weeks, simply being able to check the health of a cluster within a few seconds before diving deeper into troubleshooting something is an amazing time saver, as usually it's very simple things being broken.