Kubernetes Cluster Add-On Bootstrapping: Part 1, TKGI

If you've ever been responsible for building or maintaining Kubernetes clusters at your company (or perhaps this is a role you'll soon be asked to fill), you'll understand when I say that no matter what solution you use to pave those clusters, you will have to add or change certain things about them before putting them to use. This could range from deploying new applications to setting up RBAC for certain groups, creating new namespaces, setting up different policies, or a combination of all these things plus more. The reality is that simply clicking the button and getting a shiny new K8s cluster only gets you part of the way. In this article, I'm going to illustrate a method I've developed that allows you to easily and automatically deploy whatever add-ons you want to K8s clusters built by VMware Tanzu Kubernetes Grid Integrated Edition (TKGI), formerly known as PKS.

The idea is simple: Whenever a cluster gets built, we need to customize it to our liking. As I said, this could be lots of things, but it will at least be some things. The challenge is doing that without burdening ourselves with either manual, error-prone work or complicated automation that becomes fragile and cumbersome to change once it's in use.

TKGI is hands down the most mature Kubernetes solution from VMware. I've been fortunate enough to have deployed it in production for a few companies over the course of almost two years, and one of the functionalities that has been extremely valuable is the ability to deploy add-ons. This is configured at the plan level inside Ops Manager.

The add-ons box inside a TKGI plan shown through the Ops Manager UI. Although it is marked experimental, it works perfectly.

In this box, which is only accessible through Ops Manager, one can paste in valid Kubernetes manifests which, as the final stage of building a cluster, will be automatically applied to each cluster being built from this plan. Seems simple, and it really is! And maybe for your needs, this is adequate. But there are problems with this solution as the list of add-ons grows and is regularly altered.

  1. Manifests stored in these add-ons are locked into Ops Manager, not in more established locations like version control.
  2. Updating/deleting anything in the add-ons list requires applying those changes through Ops Manager, which temporarily brings down TKGI while BOSH applies the changes.
  3. Add-ons are defined at the plan level, so any changes you make to one plan don't impact others. This can be either good or bad, depending on your needs.

Although I've used these add-ons for a while, I wanted a better way: one that would make add-ons easier to alter and extend without going into Ops Manager all the time, would not take down TKGI while a plan gets updated, and would let me track everything in version control in true GitOps fashion. I'm glad to share this little method with you here.

Overall, the architecture is quite simple: Use the add-ons section of a plan to run a K8s Job which downloads and then applies all the manifests stored in a Git repo. This is simple yet elegant and allows us to

  1. Store and track all our add-ons inside Git as we do with application code. All the normal Git workflows can apply to these add-ons including things like pull requests and reviews.
  2. Ensure TKGI remains available for cluster administrators and managers despite system changes happening in the background.
  3. Keep Ops Manager add-on modifications inside plans to a bare minimum. Only a few bootstrap manifests that point to a Git repo are required.
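To make the architecture concrete, the bootstrap pieces pasted into the plan amount to roughly the following pair of resources. This is only a hypothetical illustration on my part; the resource names, namespace, service account, and image shown here are assumptions, and the real thing lives in the manifest.yaml covered below.

```yaml
# Rough sketch only -- the actual manifest is in the k8s-addon-bootstrap repo.
apiVersion: v1
kind: Secret
metadata:
  name: gitrepo
  namespace: kube-system
data:
  gitrepo: aHR0cHM6Ly9naXRodWIuY29tL2NoaXB6b2xsZXIvYmlhcml0eg==  # base64-encoded repo URL
---
apiVersion: batch/v1
kind: Job
metadata:
  name: k8s-addon-bootstrap
  namespace: kube-system
spec:
  template:
    spec:
      serviceAccountName: addon-bootstrap  # assumed; needs RBAC to apply the add-ons
      restartPolicy: OnFailure
      containers:
      - name: bootstrap
        image: chipzoller/k8s-addon-bootstrap  # or your own locally built image
        env:
        - name: GITREPO                        # hypothetical variable name
          valueFrom:
            secretKeyRef:
              name: gitrepo
              key: gitrepo
```

Because the Job only points at the repo, everything that actually gets applied to the cluster lives in Git, not in the plan.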

Let's walk through the process of using this method.

Firstly, if you do not already have one, create a Git repo somewhere which stores the Kubernetes manifests you wish to have automatically applied. This can be something like GitHub, which I'll use for this example, or an internal GitLab system. In this demo, I've created one called biaritz you can use if you'd like. This repo has three files inside: a sample ConfigMap resource, a ClusterRoleBinding to simulate RBAC permissions, and my pks-rancher-reg utility that allows all clusters to be immediately registered under Rancher for management. I want all of these manifests applied automatically whenever a new TKGI cluster is built from a given plan.
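To give a feel for what belongs in such a repo, here are two illustrative manifests of the same kinds. The names and the binding below are made up for this sketch, not the exact contents of biaritz:

```yaml
# Illustrative add-on manifests; the real files live in your Git repo.
apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-config
  namespace: default
data:
  environment: lab
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: team-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view        # grant read-only access cluster-wide
subjects:
- kind: Group
  name: developers  # assumed group name from your identity provider
  apiGroup: rbac.authorization.k8s.io
```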

Once you've got the Git repo situated, head over to my other repo called k8s-addon-bootstrap and grab the manifest.yaml file. You can either clone this repo or simply copy the contents out.

Go into Ops Manager and click on the plan you wish to use to apply these resources. If you're using the Management Console (formerly EPMC), remember that the add-ons section isn't available there. You'll actually need to go into Ops Manager, and for that you'll need the password listed in the Deployment Metadata section. Once in Ops Manager, find the plan and scroll to the bottom. Paste in the contents of the manifest.yaml file.

Now, grab the URL to your Git repo and base64 encode it. We're storing this as a Secret just for good measure, although keep in mind for production use that security through obscurity is not a real control: don't assume that your end users simply not knowing the repo address makes it secure.

$ echo -n 'https://github.com/chipzoller/biaritz' | base64
aHR0cHM6Ly9naXRodWIuY29tL2NoaXB6b2xsZXIvYmlhcml0eg==

Replace the base64 string under the gitrepo key in the Secret resource with your own.
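If you want to sanity-check the value before pasting it in, you can round-trip the encoding like this (GNU coreutils flags shown; decode flags may differ slightly on BSD/macOS):

```shell
# Encode the repo URL without a trailing newline, then decode it back to verify.
enc=$(printf '%s' 'https://github.com/chipzoller/biaritz' | base64)
echo "$enc"
printf '%s' "$enc" | base64 -d   # -> https://github.com/chipzoller/biaritz
```

Using `printf '%s'` (or `echo -n`) matters here: an accidental trailing newline would be encoded into the Secret and corrupt the URL the container receives.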

This manifest defines a container which does two very simple things: Clones from your Git repo and uses kubectl to apply those manifests. As with other container authoring best practices, I recommend you to grab my Dockerfile and build the image locally ensuring it conforms to your enterprise security policy. However, for home labs, if you'd like to pull my pre-built image from Docker Hub, go right ahead. If you do build and push your own, ensure to replace the image key inside the Job resource with your own.
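As a rough idea of what building such an image involves, a minimal Dockerfile might look like the following. This is a hypothetical sketch only, with an assumed base image, kubectl version, and entrypoint script name; grab the actual Dockerfile from the k8s-addon-bootstrap repo:

```dockerfile
# Hypothetical sketch -- see the k8s-addon-bootstrap repo for the real Dockerfile.
FROM alpine:3.12
# git to clone the add-on repo; curl to fetch a kubectl binary
RUN apk add --no-cache git curl \
 && curl -Lo /usr/local/bin/kubectl \
      https://storage.googleapis.com/kubernetes-release/release/v1.17.0/bin/linux/amd64/kubectl \
 && chmod +x /usr/local/bin/kubectl
# Entrypoint clones the repo from the Secret-provided URL and applies everything in it.
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```

Building it yourself also lets you pin the kubectl version to match your TKGI-provisioned clusters.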

With everything done, save your plan and apply the changes inside Ops Manager. When you do so, remember that TKGI will go down for a couple of minutes as BOSH applies the changes.

The add-ons section of your plan should now contain the sample manifest with your replaced values.

Once TKGI is available again, build a new cluster from your plan and check the result. If your YAML is good and the container was able to reach your Git repo, all the manifests inside should now be applied. Verify this is the case; if so, you're good! And as a final precaution, the Job is configured so that the Secret resource containing the base64-encoded URL to your repository is deleted once it completes.

So, to recap, we've now got the minimal scaffolding needed inside our TKGI plan to apply any add-ons we see fit. Because we've decoupled the plan from the manifests, we no longer have to keep altering the plan, which means TKGI stays available longer, changes are tracked centrally, and add-ons can easily be extended. If we want to dedicate a Git repo to each plan, all we have to do is update a single line once, and from then on everything is done in Git.

Using this method has made it much easier for me to develop code faster and to keep CI/CD pipelines streamlined. I hope you find this useful, and if you have any feedback, please hit me up on Twitter or LinkedIn!