Unable to refresh certificates with reissue-certificates
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Neutron API OVN Plugin Charm | Fix Committed | High | Unassigned | |
| charm-ovn-central | Fix Committed | High | Unassigned | |
| charm-ovn-chassis | Fix Committed | High | Unassigned | |
| vault-charm | Fix Released | High | Unassigned | |
| 1.5 | Triaged | Undecided | Unassigned | |
| 1.6 | Triaged | Undecided | Unassigned | |
| 1.7 | Triaged | Undecided | Unassigned | |
| 1.8 | Triaged | Undecided | Unassigned | |
Bug Description
While attempting to refresh certificates for a k8s installation, no units other than the client leaders updated.
Steps to replicate:
1. Deploy the k8s stack and vault with a replication count of 3 (HA).
2. Delete the vault unit which is the leader and add another.
3. Execute the reissue-certificates action.
4. Confirm that the k8s client.crt is actually updated, or fails to update:
   juju ssh kubernetes-worker/0 sudo openssl x509 -in /root/cdk/
5. Repeat a few times.
At issue is that multiple units each share their own copy of the relation data with other applications, rather than there being one source of truth (the leader).
We have one vault leader which provides the correct data when we re-issue certificates.
However, older vault units that may have been leader at some point still retain stale certificate data, which is shared with all the clients.
That stale data conflicts with the newly provided certificates: the clients think nothing has changed (the stale data holds the original certs), and thus they do not write the new client certificates to disk.
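The conflict described above can be modelled with a minimal sketch. The unit names and data keys below are hypothetical stand-ins for what a client sees on the certificates relation; the point is that a naive client cannot tell which provider unit's bucket is authoritative:

```python
# Hypothetical relation data as a client sees it: one bucket per
# provider unit. Only vault/1 (the current leader) has the fresh cert;
# vault/0 and vault/2 still publish data written when they were leader.
relation_data = {
    "vault/0": {"client.cert": "OLD-CERT"},   # stale ex-leader data
    "vault/1": {"client.cert": "NEW-CERT"},   # current leader, fresh
    "vault/2": {"client.cert": "OLD-CERT"},   # stale ex-leader data
}

def observed_cert(buckets):
    """Naive client behaviour: take the first unit bucket that carries
    a cert. The client has no way to know which unit is the leader, so
    what it sees depends purely on iteration order."""
    for unit in sorted(buckets):
        cert = buckets[unit].get("client.cert")
        if cert:
            return cert
    return None

print(observed_cert(relation_data))  # vault/0 sorts first -> "OLD-CERT"
```

Because a stale bucket can shadow the leader's, the client concludes nothing has changed and never rewrites the certs on disk.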
== work-around below ==
We cleared data from the non-leaders to solve the issue.
For example, here is vault/0, which is a non-leader (vault/1 is the current leader):
juju run -u vault/0 "relation-set -r certificates:61
"
Once the stale data was cleared the clients saw the new certificates and updated correctly.
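The workaround can be generalised roughly as below. These commands require a live Juju model; the relation id and key names (shown as placeholders, since the exact keys were elided above) must be read from the actual deployment:

```
# Find the certificates relation id as seen by the stale unit
juju run -u vault/0 'relation-ids certificates'
# Inspect the stale unit's own data bucket on that relation
juju run -u vault/0 'relation-get -r certificates:<rid> - vault/0'
# Blank each stale key (repeat per key reported above)
juju run -u vault/0 'relation-set -r certificates:<rid> <key>='
```

Setting a relation key to an empty value with relation-set removes it from the unit's bucket, which is what allows the clients to fall through to the leader's fresh data.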
no longer affects: charm-kubernetes-master
no longer affects: charm-kubernetes-worker
Changed in vault-charm:
  status: New → Triaged
  importance: Undecided → Medium
Changed in vault-charm:
  importance: Medium → High
Changed in charm-ovn-chassis:
  status: New → Triaged
  importance: Undecided → High
Changed in charm-ovn-central:
  status: New → Triaged
  importance: Undecided → High
Changed in charm-neutron-api-plugin-ovn:
  status: New → Triaged
  importance: Undecided → High
Changed in vault-charm:
  assignee: nobody → Martin Kalcok (martin-kalcok)
  status: Triaged → In Progress
no longer affects: charm-easyrsa
Changed in vault-charm:
  status: Fix Committed → Fix Released
tags: added: bseng-344
Changed in vault-charm:
  assignee: Martin Kalcok (martin-kalcok) → nobody
tags: added: bseng-1021
Changed in vault-charm:
  status: New → Triaged
Changed in vault-charm:
  status: Triaged → Fix Released
Changed in charm-neutron-api-plugin-ovn:
  status: Fix Committed → Fix Released
Changed in charm-ovn-central:
  status: Fix Committed → Fix Released
  status: Fix Released → Fix Committed
Changed in charm-neutron-api-plugin-ovn:
  status: Fix Released → Fix Committed
As mentioned, this arises because the interface protocol in question was created before application-level relation data was available, so the leader has no choice but to write the response data in its unit data bucket, potentially leading to conflicting data being presented on the relation. The requesting side has no way to know which unit is the leader and thus which data is authoritative, but it could perhaps parse the cert data and pick the best one based on the effective and expiration dates. However, there are many more clients than providers for this relation and this issue impacts all of them, not just Kubernetes.
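The "pick the best one" idea above can be sketched as follows. This is an illustration only, with hypothetical unit names and with datetime stand-ins for the parsed notBefore/notAfter fields (a real charm would first parse the PEM data from the relation, e.g. with the cryptography library):

```python
from datetime import datetime, timezone

# Hypothetical parsed certificates: (unit, not_before, not_after).
candidates = [
    ("vault/0", datetime(2022, 1, 1, tzinfo=timezone.utc),
                datetime(2023, 1, 1, tzinfo=timezone.utc)),
    ("vault/1", datetime(2023, 6, 1, tzinfo=timezone.utc),
                datetime(2024, 6, 1, tzinfo=timezone.utc)),
]

def best_cert(certs, now):
    """Prefer certs that are currently valid; among those, take the
    most recently issued one (latest not_before)."""
    valid = [c for c in certs if c[1] <= now <= c[2]]
    pool = valid or certs
    return max(pool, key=lambda c: c[1])

now = datetime(2023, 7, 1, tzinfo=timezone.utc)
print(best_cert(candidates, now)[0])  # -> vault/1 (newest valid cert)
```

The obvious drawback, as noted, is that this heuristic would have to be replicated in every client charm that consumes the interface.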
Possible solutions:
1) Migrate the interface protocol to app-level rel data. This would be the best solution for a new interface, but migrating to it now would require updating every charm which uses this interface on either side of the relation. It might be possible to do incrementally by writing the data in both buckets and applying one of the other fixes and then gradually updating the client charms to prefer the app-level data.
2) Make provider units clear their relation data whenever they see that they are not the leader. This requires no updates to the clients, and possibly no communication between the leader and non-leaders of the provider; however, there is a chance for the non-leaders to wipe their relation data before the leader has written the new data, so that should probably be coordinated using a leader data field.
3) Make provider units all write the latest data as soon as it's available. This should be possible for Vault if the non-leader units can read the secrets out, but they'll need some trigger to know when the leader has generated the initial or updated data. For EasyRSA, the cert data will need to be copied to leader data if it isn't already. This is a bit more complex than option 2, but it ensures that the correct data is always available on the relation.
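Option 2's leader-gated clearing can be sketched with a minimal in-memory model. The state stores below are hypothetical stand-ins for Juju's hook tools (leader-set/leader-get and relation-set): the leader bumps a serial in leader data after writing fresh certs, and non-leaders only clear their own stale bucket once they see that serial, so they never wipe data before the leader has written the replacement:

```python
leader_data = {}                      # shared; writable by leader only
unit_relation_data = {                # each unit's own relation bucket
    "vault/0": {"client.cert": "OLD-CERT"},
    "vault/1": {"client.cert": "OLD-CERT"},
}

def leader_reissue(unit, new_cert):
    """Leader hook: publish the new cert, then signal non-leaders."""
    unit_relation_data[unit] = {"client.cert": new_cert}
    leader_data["reissue-serial"] = leader_data.get("reissue-serial", 0) + 1

def non_leader_hook(unit):
    """Non-leader hook: clear stale data only after the leader has
    published new certs, avoiding the wipe-before-write race."""
    if leader_data.get("reissue-serial", 0) > 0:
        unit_relation_data[unit] = {}

leader_reissue("vault/1", "NEW-CERT")   # vault/1 is the leader
non_leader_hook("vault/0")              # vault/0 clears its stale bucket
print(unit_relation_data)
```

After this sequence only the leader's bucket carries a cert, so clients see a single source of truth without any client-side changes.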