Gafaelfawr 12.3 and Kubernetes certificates

Hi,

We are currently unable to upgrade gafaelfawr on the UK IDAC from 12.2 to 12.3. When we try we get errors from the gafaelfawr-operator pod along the lines of the following:

"Request attempt #1/9 failed; will retry: GET https://kubernetes.default.svc/api -> ClientConnectorCertificateError(ConnectionKey(host='kubernetes.default.svc', port=443, is_ssl=True, ssl=True, proxy=None, proxy_auth=None, proxy_headers_hash=None), SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Missing Authority Key Identifier (_ssl.c:1018)'))"

I understand that this is due to gafaelfawr being upgraded to use Python 3.13.

We have investigated whether there is any easy way to add the missing fields to the Kubernetes certificate, but it appears not. While we can continue to investigate this angle I do not anticipate a quick fix.

Is there any way that we can disable strict certificate checking for this?

So, for context: when you deploy a Kubernetes cluster, you set up a Certificate Authority (CA) that signs (and validates that it has signed) a cluster certificate, which is then used to generate the certificates used by k8s components such as the Kubernetes API server. Information provided by your cluster template(s) is also used in this process.

The bump to Python 3.13 (and the underlying third-party libraries) means that TLS certificate verification now requires the presence of the Authority Key Identifier field (arguably a bug fix, since OpenSSL has considered it part of a strict check for some time). You are seeing this error reported through gafaelfawr, but it is raised by the kubernetes-asyncio third-party library.
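To make the change concrete, here is a minimal sketch using only Python’s standard ssl module. It assumes the client builds its TLS context with ssl.create_default_context(), which from Python 3.13 onward turns on the VERIFY_X509_STRICT flag by default. The helper name strict_checking_enabled is made up for illustration and is not part of gafaelfawr or kubernetes-asyncio.

```python
import ssl
import sys


def strict_checking_enabled(ctx: ssl.SSLContext) -> bool:
    """Return True if the context asks OpenSSL for strict X.509 checks."""
    return bool(ctx.verify_flags & ssl.VERIFY_X509_STRICT)


# Build a context the same way most client libraries do (an assumption here)
# and report whether strict checking is on for this interpreter.
ctx = ssl.create_default_context()
print(sys.version_info[:2], strict_checking_enabled(ctx))
```

Running this under the old and new gafaelfawr base images should show the flag flipping between interpreter versions, which is what turns a previously tolerated certificate into a verification failure.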

Changing gafaelfawr would mean figuring out whether kubernetes-asyncio can be convinced to disable strict checking mode (throwing the other advantages of strict checking out with the bathwater) and then plumbing that option up the call stack so that gafaelfawr honors it. It’s a lot of work just to avoid doing the right thing, so I am not inclined to schedule it (though if you want to do the work and send us a PR, we would consider it). But again, it should be easier to just do the right thing.

The right thing, in this case, is to make sure that whatever cert is misbehaving here (your cluster certificate or your CA certificate) is compliant. It should be straightforward to do this: rotate your CA certificate and/or your cluster certificate, and/or double-check the template used in your cluster’s certificate generation. This should result in a valid certificate being produced, and all will be well.
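To confirm which certificate lacks the field, and to verify the fix after a rotation, something like the following rough sketch may help. It uses only the Python standard library and simply searches the DER-encoded certificate for the object identifier of the Authority Key Identifier extension (2.5.29.35). This is a heuristic, not a real X.509 parser, and the function names are hypothetical. The default host and port are taken from the error message above.

```python
import ssl

# DER encoding of OID 2.5.29.35 (authorityKeyIdentifier):
# tag 0x06, length 3, then 0x55 (2*40 + 5), 0x1D (29), 0x23 (35).
AKI_OID_DER = bytes([0x06, 0x03, 0x55, 0x1D, 0x23])


def has_authority_key_identifier(der_cert: bytes) -> bool:
    """Heuristic: True if the AKI extension OID appears in the DER bytes."""
    return AKI_OID_DER in der_cert


def server_cert_has_aki(host: str = "kubernetes.default.svc", port: int = 443) -> bool:
    """Fetch the server's leaf certificate (without verifying it) and check it."""
    pem = ssl.get_server_certificate((host, port))
    return has_authority_key_identifier(ssl.PEM_cert_to_DER_cert(pem))


# Example, to be run from inside a pod in the affected cluster:
#   print(server_cert_has_aki())
```

If this reports that the API server certificate has no Authority Key Identifier, the certificate generation (rather than gafaelfawr) is the place to look. A tool such as openssl x509 -noout -text gives a more reliable view of the extensions if it is available.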

I understand the UK IDAC is using Magnum to deploy Kubernetes on top of OpenStack. I know nothing about this, but going by the documentation here you should be able to rotate your certificate (though I am not sure what “only supported by the Fedora CoreOS driver” means…).

If (and only if) you have tried this and it did not work, some things to try would be:

  • Check your certificate template
  • Follow the steps in the Magnum documentation to start over with a new CSR maybe?
  • Redeploy your cluster with a newer version of Magnum (if you’re not running the latest version)

Before you go down that road, you might just want to hit a Magnum support forum (and/or your OpenStack provider, if that is not something you do yourselves) and ask what to do when your cluster cert is missing the Authority Key Identifier, and/or why your certificate rotation doesn’t work. Again, this doesn’t really originate with gafaelfawr; it’s really a Magnum/OpenStack question about how to ensure that the certificates it uses have that field.