Using Custom Registries with Tanzu Kubernetes Grid

I've had several requests from people who want to use Tanzu Kubernetes Grid (TKG) with their own registries and have had problems doing so. This could be in a lab environment or in a production environment where the TLS certificates have been replaced with ones signed by an internal, enterprise certificate authority. If you've ever seen the message

Failed to pull image "<registry_name>/<image_name>:": rpc error: code = Unknown desc = failed to pull and unpack image "<registry_name>/<image_name>:": failed to resolve reference "<registry_name>/<image_name>:": failed to do request: Head https://<registry_name>/<image_name>:: x509: certificate signed by unknown authority

then you know what I mean. The problem is, most of the solutions out there basically tell you, "screw security, just turn it off!" in order to get it to work, and I always take issue with that. While it does demonstrate the functionality in question, I think it teaches bad habits, even when it comes with warnings like "don't do this in production." But what if you are in production, you're using an existing registry, and you do have custom certificates (whether self-signed or otherwise)? You need to be able to pull images, but you also want to do it securely. That's what this post is about: integrating TKG with your own registries and doing it in a SECURE manner.

I'm going to assume a deployment to vSphere here, using Photon OS, and on TKG v1.2, which was released a couple of weeks ago. Make sure to have your certificate authority's root certificate available (the one used to sign the certificate assigned to your registry). Although I'm in a lab, I try to model things like real production environments so I can pass along practical info to you. I've got a Harbor registry established (see my article here if you'd like help installing it standalone) which is using a custom TLS certificate signed by my internal CA. Make sure your CA cert is in PEM format.
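If you're not sure whether your CA certificate is already in PEM format, openssl can tell you quickly. The file names here (ca.crt and myca.pem) are just placeholders for wherever your CA certificate lives.

# If this prints certificate details, the file is already PEM-encoded.
openssl x509 -in ca.crt -noout -text

# If it errors out, the file is likely DER-encoded; convert it to PEM.
openssl x509 -inform der -in ca.crt -out myca.pem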

We need to modify a file which contains the base Cluster API manifest used when deploying to vSphere. Open the file at ~/.tkg/providers/infrastructure-vsphere/v0.7.1/ytt/base-template.yaml and jump down to the KubeadmControlPlane resource. If you read my article on Behind the Scenes with Cluster API Provider vSphere you'll be familiar with this resource type. The KubeadmControlPlane is a resource which describes how Cluster API will provision the control plane nodes using the kubeadm bootstrap provider. There is a similar resource called KubeadmConfigTemplate which provides the same thing but for the worker nodes. We will have to modify both of these resources in the base-template.yaml file if we want all of our nodes to trust our internal CA.
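To orient yourself, here is a stripped-down sketch (my own abbreviation, not the full contents of base-template.yaml) showing where the preKubeadmCommands lists live in each of the two resources. Note the worker-node list is nested one level deeper.

kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    preKubeadmCommands:        # commands run on control plane nodes before kubeadm
    - echo "existing and added commands go here"
---
kind: KubeadmConfigTemplate
spec:
  template:
    spec:
      preKubeadmCommands:      # commands run on worker nodes before kubeadm
      - echo "existing and added commands go here"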

Scroll down to where you see the preKubeadmCommands section. It should have this by default.

preKubeadmCommands:
- hostname "{{ ds.meta_data.hostname }}"
- echo "::1         ipv6-localhost ipv6-loopback" >/etc/hosts
- echo "127.0.0.1   localhost" >>/etc/hosts
- echo "127.0.0.1   {{ ds.meta_data.hostname }}" >>/etc/hosts
- echo "{{ ds.meta_data.hostname }}" >/etc/hostname

This section of the file contains a list of commands that should be run on the deployed machine prior to running kubeadm. As you can see, the default entries are just performing operations on the hostname and /etc/hosts. But this is also the time when we want to inject our CA certificate so that when the CRI (containerd in this case) comes online, there is trust in place. Add the following lines immediately under the last line of what already exists. Ensure you get the indentation just right.

- |
  cat <<EOF > /etc/ssl/certs/myca.pem
  -----BEGIN CERTIFICATE-----
  MIIDWzCCAkOgAwIBAgIQeXc+Qv+ngYZDNaSsPUI7kDANBgkqhkiG9w0BAQUFADBA
  <snip>
  mSddDjR+db3N3XEpThE4AyFnJYErRSnZdQSROmKQGgvXx/qk+TEvzpQa/6oqebE=
  -----END CERTIFICATE-----
  EOF
- openssl x509 -in /etc/ssl/certs/myca.pem -text >> /etc/pki/tls/certs/ca-bundle.crt
- c_rehash

The full section should then look similar to this.

preKubeadmCommands:
- hostname "{{ ds.meta_data.hostname }}"
- echo "::1         ipv6-localhost ipv6-loopback" >/etc/hosts
- echo "127.0.0.1   localhost" >>/etc/hosts
- echo "127.0.0.1   {{ ds.meta_data.hostname }}" >>/etc/hosts
- echo "{{ ds.meta_data.hostname }}" >/etc/hostname
- |
  cat <<EOF > /etc/ssl/certs/myca.pem
  -----BEGIN CERTIFICATE-----
  MIIDWzCCAkOgAwIBAgIQeXc+Qv+ngYZDNaSsPUI7kDANBgkqhkiG9w0BAQUFADBA
  <snip>
  mSddDjR+db3N3XEpThE4AyFnJYErRSnZdQSROmKQGgvXx/qk+TEvzpQa/6oqebE=
  -----END CERTIFICATE-----
  EOF
- openssl x509 -in /etc/ssl/certs/myca.pem -text >> /etc/pki/tls/certs/ca-bundle.crt
- c_rehash

What we're doing here is writing our CA root or signing certificate into a file called myca.pem and then adding it to the list of trusted CA certificates for the entire system. We are not telling containerd or anything else to treat this as an insecure registry; our registry will be fully trusted, just as any other resource would be.
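If you want to confirm that trust end to end, once the test cluster we create below is up you can SSH to any node and exercise the system bundle directly against the registry. I'm assuming the default capv SSH user for TKG nodes on vSphere and using my registry's name here; substitute your own node IP and registry.

ssh capv@<node-ip>
# A successful chain ends with "Verify return code: 0 (ok)".
openssl s_client -connect harbor2.zoller.com:443 -CAfile /etc/pki/tls/certs/ca-bundle.crt </dev/null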

Now repeat the process for the KubeadmConfigTemplate resource near the bottom of the file (a rough sketch of the result follows). Once again, ensure your indentation is correct. Finally, save the file.
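Abbreviated, the worker-node resource should end up looking something like this (certificate body snipped, and only the commands we're adding shown below whatever default entries are already present):

kind: KubeadmConfigTemplate
spec:
  template:
    spec:
      preKubeadmCommands:
      # ...the existing default entries...
      - |
        cat <<EOF > /etc/ssl/certs/myca.pem
        -----BEGIN CERTIFICATE-----
        <snip>
        -----END CERTIFICATE-----
        EOF
      - openssl x509 -in /etc/ssl/certs/myca.pem -text >> /etc/pki/tls/certs/ca-bundle.crt
      - c_rehash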

Once complete, let's deploy a test cluster to ensure it works.

tkg create cluster cz08 --plan=dev --vsphere-controlplane-endpoint-ip=192.168.1.223

If the tkg command errors out before printing the line "Creating workload cluster" then you've foobar'd your base-template.yaml file, probably by running afoul of YAML's infernal indentation rules. So check that and straighten it out as needed.
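If you want to catch those mistakes before tkg does, one option is a quick parse check. This assumes Python 3 with PyYAML is installed; ytt templates are required to be valid YAML, so the #@ annotations shouldn't trip the parser.

python3 -c 'import sys, yaml; list(yaml.safe_load_all(open(sys.argv[1]))); print("YAML parses cleanly")' \
  ~/.tkg/providers/infrastructure-vsphere/v0.7.1/ytt/base-template.yaml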

Once we've gotten our credentials and are in the cluster, let's test and ensure we can pull an image from our internal registry.

$ k run util --image harbor2.zoller.com/library/chipzoller/util:latest -- "echo hello"
pod/util created

We'll check the logs and see if we have our message.

$ k logs util
hello

And after describing the pod once it has completed, we can see the message Successfully pulled image "harbor2.zoller.com/library/chipzoller/util:latest" in 242.971197ms, so we know it did indeed work.
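If you want to see that for yourself, the message shows up in the pod's events:

# The Events section at the bottom of the describe output contains the image pull messages.
k describe pod util | grep -i pulled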

So, there you go, it worked as intended. We are now able to securely address any container image registry we wish, whether deployed as a TKG shared service or not, and whether signed by a public CA or not.

I hope this helps you to feel more confident (and secure) in being able to operate TKG in your vSphere environment.


EDIT (11/2/20): It was pointed out by a couple of folks that the better way to accomplish this goal is not to modify the base-template.yaml file but rather to use ytt, the templating tool TKG uses by default, to make the changes in an overlay file. Thank you to those who suggested this improvement. Here are the steps for using the overlay method.

Inside your providers folder (for vSphere this would be ~/.tkg/providers/infrastructure-vsphere/ytt/) there is a file called vsphere-overlay.yaml that exists but is almost empty. Rather than modifying base-template.yaml directly, we will add our preKubeadmCommands to this overlay file which, when rendered by the tkg CLI tool, will add these commands to the finalized manifest that gets used to create the clusters. This power is afforded by the ytt tool, which was formerly part of k14s and is now part of Carvel. I haven't yet really dug into ytt, so a more thorough article will have to wait until then.
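If you'd like to see the append mechanism in isolation first, here's a tiny self-contained example. The Example kind and the file names are made up purely for illustration, and ytt must be installed on its own (e.g., from the Carvel releases).

# base.yaml
kind: Example
spec:
  commands:
  - echo one

# overlay.yaml
#@ load("@ytt:overlay", "overlay")
#@overlay/match by=overlay.subset({"kind":"Example"})
---
spec:
  commands:
  #@overlay/append
  - echo two

# Rendering both files together prints the Example document with "echo two" appended to the list.
ytt -f base.yaml -f overlay.yaml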

Open the vsphere-overlay.yaml file and insert the following contents.

#! Please add any overlays specific to vSphere provider under this file.
#@ load("@ytt:overlay", "overlay")

#! Add and trust your custom CA certificate on all Control Plane nodes.
#@overlay/match by=overlay.subset({"kind":"KubeadmControlPlane"})
---
spec:
  kubeadmConfigSpec:
    preKubeadmCommands:
    #@overlay/append
    - |
      cat <<EOF > /etc/ssl/certs/myca.pem
      -----BEGIN CERTIFICATE-----
      MIIDWzCCAkOgAwIBAgIQeXc+Qv+ngYZDNaSsPUI7kDANBgkqhkiG9w0BAQUFADBA
      <snip>
      mSddDjR+db3N3XEpThE4AyFnJYErRSnZdQSROmKQGgvXx/qk+TEvzpQa/6oqebE=
      -----END CERTIFICATE-----
      EOF
    #@overlay/append
    - openssl x509 -in /etc/ssl/certs/myca.pem -text >> /etc/pki/tls/certs/ca-bundle.crt
    #@overlay/append
    - c_rehash


#! Add and trust your custom CA certificate on all worker nodes.
#@overlay/match by=overlay.subset({"kind":"KubeadmConfigTemplate"})
---
spec:
  template:
    spec:
      preKubeadmCommands:
      #@overlay/append
      - |
        cat <<EOF > /etc/ssl/certs/myca.pem
        -----BEGIN CERTIFICATE-----
        MIIDWzCCAkOgAwIBAgIQeXc+Qv+ngYZDNaSsPUI7kDANBgkqhkiG9w0BAQUFADBA
        <snip>
        mSddDjR+db3N3XEpThE4AyFnJYErRSnZdQSROmKQGgvXx/qk+TEvzpQa/6oqebE=
        -----END CERTIFICATE-----
        EOF
      #@overlay/append
      - openssl x509 -in /etc/ssl/certs/myca.pem -text >> /etc/pki/tls/certs/ca-bundle.crt
      #@overlay/append
      - c_rehash

Once again, ensure that your spacing and indentation are correct. The #@overlay/append statements which appear before each additional command we're adding must be aligned with the dash marker of the list item. This tells ytt to append the next command in the array to the existing commands.

Provision a TKG cluster and ensure there are no errors thrown. If everything in the overlay file is correct, your workload clusters should be built just as they were previously, except now there is a cleaner separation between the default commands and the ones you've added.
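One way to double-check that the overlay actually landed in the rendered cluster spec, without SSHing into any nodes, is to inspect the KubeadmControlPlane object from the management cluster context. The object name will vary (it includes your cluster's name), so list the objects first.

# Run against the management cluster context.
kubectl get kubeadmcontrolplane
kubectl get kubeadmcontrolplane <name> -o jsonpath='{.spec.kubeadmConfigSpec.preKubeadmCommands}'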