We are currently unable to upgrade gafaelfawr on the UK IDAC from 12.2 to 12.3. When we try, we get errors from the gafaelfawr-operator pod along the lines of the following:
I understand that this is due to upgrading gafaelfawr to use python 3.13.
We have investigated whether there is any easy way to add the missing fields to the Kubernetes certificate, but it appears not. While we can continue to investigate this angle I do not anticipate a quick fix.
Is there any way that we can disable strict certificate checking for this?
So for context: when you deploy a Kubernetes cluster, you set up a Certificate Authority (CA) that signs (and validates that it has signed) a cluster certificate, which is then used to generate the certificates used by k8s components, such as the Kubernetes API server. Information provided by your cluster template(s) is also used in this process.
The bump to python 3.13 (and underlying third-party libraries) means that TLS certificate verification now requires the presence of the Authority Key Identifier field (arguably a bug fix, since OpenSSL has considered it part of a strict check for some time). You are getting this error reported through gafaelfawr, but it is raised by the kubernetes-asyncio third-party library.
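(If anyone wants to confirm whether a given certificate is missing that extension, here is a minimal sketch using the Python cryptography package; the file path is just an example.)

```python
# Minimal check, assuming the "cryptography" package is installed and you
# have a PEM copy of the certificate to hand (the path is illustrative).
from cryptography import x509
from cryptography.x509.oid import ExtensionOID

with open("apiserver.crt", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

try:
    cert.extensions.get_extension_for_oid(ExtensionOID.AUTHORITY_KEY_IDENTIFIER)
    print("Authority Key Identifier present")
except x509.ExtensionNotFound:
    print("Authority Key Identifier missing - strict verification will fail")
```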
Changing gafaelfawr would involve figuring out whether you can convince kubernetes-asyncio to disable strict checking mode (throwing all its other advantages out with the bathwater) and plumbing that up through the calling stack to get gafaelfawr to honor it. It’s a lot of work just to avoid doing the right thing, so I am not inclined to schedule it (though if you want to do the work and send us a PR we would consider it). But again, it should be easier to just do the right thing.
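(For the curious, the “strict checking mode” in question is the standard-library verify flag that Python 3.13 now enables by default. The sketch below only shows what would have to be plumbed through; it is not an option that gafaelfawr or kubernetes-asyncio currently expose.)

```python
# Illustration only: Python 3.13 enables VERIFY_X509_STRICT by default on
# contexts from ssl.create_default_context(). "Disabling strict checking"
# would mean clearing that flag on whatever SSL context the client builds.
import ssl

ctx = ssl.create_default_context()
print(ssl.VERIFY_X509_STRICT in ctx.verify_flags)  # True on Python 3.13+
ctx.verify_flags &= ~ssl.VERIFY_X509_STRICT        # opt back out of strict checks
```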
The right thing, in this case, is to make sure that whatever cert is misbehaving here (your cluster certificate or your CA certificate) is compliant. It should be straightforward to do this: rotate your CA certificate and/or your cluster certificate, and/or double-check the template used in your cluster’s certificate generation; this should result in a valid certificate being produced and all will be well.
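(To illustrate what “a valid certificate” means here: the Authority Key Identifier is something the signer adds when it issues the certificate, derived from the issuing CA’s key. A rough sketch with the Python cryptography package, using made-up names, would look like this.)

```python
# Rough sketch (made-up names): a compliant issuer adds the Authority Key
# Identifier when it signs, deriving it from the CA's own key.
from datetime import datetime, timedelta, timezone
from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.x509.oid import NameOID

ca_key = ec.generate_private_key(ec.SECP256R1())
leaf_key = ec.generate_private_key(ec.SECP256R1())

cert = (
    x509.CertificateBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "kube-apiserver")]))
    .issuer_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "example-cluster-ca")]))
    .public_key(leaf_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(datetime.now(timezone.utc))
    .not_valid_after(datetime.now(timezone.utc) + timedelta(days=365))
    # The key point: the AKI is derived from the issuing CA's public key
    # at signing time, so it has to come from the signer's side.
    .add_extension(
        x509.AuthorityKeyIdentifier.from_issuer_public_key(ca_key.public_key()),
        critical=False,
    )
    .sign(ca_key, hashes.SHA256())
)
print(cert.extensions.get_extension_for_class(x509.AuthorityKeyIdentifier))
```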
I understand the UK IDAC is using Magnum to deploy Kubernetes on top of Openstack. I know nothing about this, but going by the documentation here you should be able to rotate your certificate (though I am not sure what “only supported by the Fedora CoreOS driver” means…).
If (and only if) you have tried this and it did not work, some things to try would be:
- Check your certificate template
- Follow the steps in the Magnum documentation to start over with a new CSR, maybe?
- Redeploy your cluster with a newer version of Magnum (if you’re not running the latest version)
Before you go down that road, you might just want to hit a Magnum support forum (and/or your OpenStack provider, if that is not something you run yourselves) and ask what to do when your cluster cert is missing the Authority Key Identifier, and/or why your certificate rotation doesn’t work. Again, this doesn’t really originate with gafaelfawr; it’s really a Magnum/OpenStack question about how to ensure that the certificates it uses have that field.
I had already tried the command to rotate the CA certificate; however, it appears that it is not supported by the driver in use. I also reached out to the systems team that manages OpenStack but did not get a particularly useful response.
As you suggested, I tried the instructions to regenerate the client key and certificate with a new CSR. This did not result in a working certificate, either with or without the Subject Key Identifier extension in the CSR. I am not sure what the problem is here, but I am not optimistic that this would fix the problem even if we got it working and got the SKID extension added, because we would still be missing the AKID extension, which I don’t think can be added purely from the client side without changing the server configuration.
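(For what it’s worth, requesting the Subject Key Identifier is the part a client can do in a CSR; a hedged sketch with the Python cryptography package, using illustrative names, is below. The Authority Key Identifier, by contrast, is derived from the issuing CA’s key when the server signs, which matches the conclusion above that it can’t be supplied from the client side.)

```python
# Sketch only (illustrative names): the SKID can be requested in the CSR,
# but the AKID is added by the signing CA, not by the requester.
from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.x509.oid import NameOID

key = ec.generate_private_key(ec.SECP256R1())
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "kubernetes-admin")]))
    .add_extension(
        x509.SubjectKeyIdentifier.from_public_key(key.public_key()),
        critical=False,
    )
    .sign(key, hashes.SHA256())
)
```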
I cannot see anything in the cluster template that would relate to certificates, assuming that’s what you mean.
Unfortunately the versions of Magnum/ClusterAPI/Kubernetes in use/available are not directly under our control. I can request that an upgrade be prioritised/newer version made available, but I can’t say when the team that manages the OpenStack environment will be able to get around to it.
It would appear that our only option is to escalate our support request to get the problem fixed at source and hope that it doesn’t take too long. From past experience I am not particularly optimistic though. I had hoped that there might be a workaround so that we could avoid falling too far behind in the meantime.
I’ll raise a support request through our CAPI/ Magnum support contract. Hopefully, we’ll have an idea of way forward and timeline in the near future.
Wow, that sounds frustrating; having no reliable ETA for an upgrade is not a nice place to be. I think your best choice for now is to revert Gafaelfawr to the last version that worked for you [1], and not sync it until you can get your Magnum upgrade.
There is a major planned feature on the gafaelfawr roadmap that will result in the rest of phalanx requiring a newer version of gafaelfawr than the one you are running [2], but we are at least a couple of months away from releasing that, so hopefully that gives your infrastructure provider time to do your update.
This problem is only going to get worse as more services uprev to newer versions of python and third-party libraries, so I don’t think there’s any way to avoid tackling the root cause. These types of strict checks seem annoying, but with the increasing profile of the project we have to stay on top of them.
And on a complete tangent, if anyone else finds the default Discourse handling of footnotes annoying (popping them up on mouseover), you can change it in your personal settings so that they render at the foot of the post with this setting:
Quick clarification: you want the commit before 2a5b9a18fc2b1657cd04b27f333646dcc490e3ab (so, the commit 9e2503bff187e4023ef6c5f99e616e5953ee187b). The commit Frossie listed is the commit that switched to a version of Gafaelfawr built on top of Python 3.13.
I forwarded this thread to the Canadian IDAC infrastructure team and Ryan Taylor from that group suggested the following:
(keel-dev is our k8s development cluster)
I checked on keel-dev (on EL8, built some time ago) and the API server cert does not have an Authority Key Identifier.
However, on a different cluster that was built more recently on EL9, openssl x509 -in apiserver.crt -text -noout shows:
X509v3 Authority Key Identifier:
59:0A:EC:17:D0:7D:15:C1:43:C3:C1:31:18:7A:DF:0C:A6:B5:5F:1A
So it might relate to the OpenSSL version on the OS of the cluster nodes.
If their cluster is not already on EL9 it might be worth upgrading/rebuilding to a newer OS before rotating the CA cert.
I provide the above with a complete absence of knowledge on the topic (i.e., apologies if the above message is noise).
Thank you. It does indeed look to be related to the version of OpenSSL (and Python) that you run, though I am also not an expert. In our case, we use OpenStack Magnum to handle everything Kubernetes, and our problem seems to be related to the fact that Magnum does not yet support these attributes.
On this occasion, we are optimistic that OpenStack Magnum can be patched to address the issue: https://review.opendev.org/c/openstack/magnum/+/940510. We’re hoping to test this patch in the coming days and will report back on whether it resolves the issue.