Unable to refresh certificates with reissue-certificates
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Neutron API OVN Plugin Charm | Fix Committed | High | Unassigned | |
| charm-ovn-central | Fix Committed | High | Unassigned | |
| charm-ovn-chassis | Fix Committed | High | Unassigned | |
| vault-charm | Fix Released | High | Unassigned | |
| 1.5 | Triaged | Undecided | Unassigned | |
| 1.6 | Triaged | Undecided | Unassigned | |
| 1.7 | Triaged | Undecided | Unassigned | |
| 1.8 | Triaged | Undecided | Unassigned | |
Bug Description
While attempting to refresh certificates for a k8s installation, no units other than the client leaders updated.
Steps to replicate:
1. Deploy the k8s stack and vault with a replication count of 3 (HA).
2. Delete the vault unit which is the leader and add another.
3. Execute the reissue-certificates action.
4. Confirm that the k8s client.crt is actually updated, or fails to update:
   juju ssh kubernetes-worker/0 sudo openssl x509 -in /root/cdk/
5. Repeat a few times.
At issue is that multiple units each share their own copy of the relation data with other applications, rather than there being one source of truth (the leader).
We have one vault leader which provides the correct data when we re-issue certificates.
However, older vault units that may have been leader at some point still retain stale certificate data, which is shared with all the clients.
That stale data conflicts with the newly provided certificates: the clients think nothing has changed (the stale data holds the original certs), and thus they do not write the new client certificates to disk.
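The conflict described above can be modelled with a minimal sketch. The unit names and data keys below are hypothetical stand-ins for what a client sees on the certificates relation; the point is that a naive client cannot tell which provider unit's bucket is authoritative:

```python
# Hypothetical relation data as a client sees it: one bucket per
# provider unit. Only vault/1 (the current leader) has the fresh cert;
# vault/0 and vault/2 still publish data written when they were leader.
relation_data = {
    "vault/0": {"client.cert": "OLD-CERT"},   # stale ex-leader data
    "vault/1": {"client.cert": "NEW-CERT"},   # current leader, fresh
    "vault/2": {"client.cert": "OLD-CERT"},   # stale ex-leader data
}

def observed_cert(buckets):
    """Naive client behaviour: take the first unit bucket that carries
    a cert. The client has no way to know which unit is the leader, so
    what it sees depends purely on iteration order."""
    for unit in sorted(buckets):
        cert = buckets[unit].get("client.cert")
        if cert:
            return cert
    return None

print(observed_cert(relation_data))  # vault/0 sorts first -> "OLD-CERT"
```

Because a stale bucket can shadow the leader's, the client concludes nothing has changed and never rewrites the certs on disk.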
== work-around below ==
We cleared data from the non-leaders to solve the issue.
For example, here is vault/0, which is a non-leader (vault/1 is the current leader):
juju run -u vault/0 "relation-set -r certificates:61
"
Once the stale data was cleared the clients saw the new certificates and updated correctly.
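The workaround can be generalised roughly as below. These commands require a live Juju model; the relation id and key names (shown as placeholders, since the exact keys were elided above) must be read from the actual deployment:

```
# Find the certificates relation id as seen by the stale unit
juju run -u vault/0 'relation-ids certificates'
# Inspect the stale unit's own data bucket on that relation
juju run -u vault/0 'relation-get -r certificates:<rid> - vault/0'
# Blank each stale key (repeat per key reported above)
juju run -u vault/0 'relation-set -r certificates:<rid> <key>='
```

Setting a relation key to an empty value with relation-set removes it from the unit's bucket, which is what allows the clients to fall through to the leader's fresh data.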
no longer affects: charm-kubernetes-master
no longer affects: charm-kubernetes-worker
Changed in vault-charm:
  status: New → Triaged
  importance: Undecided → Medium
Changed in vault-charm:
  importance: Medium → High
Changed in charm-ovn-chassis:
  status: New → Triaged
  importance: Undecided → High
Changed in charm-ovn-central:
  status: New → Triaged
  importance: Undecided → High
Changed in charm-neutron-api-plugin-ovn:
  status: New → Triaged
  importance: Undecided → High
Changed in vault-charm:
  assignee: nobody → Martin Kalcok (martin-kalcok)
  status: Triaged → In Progress
no longer affects: charm-easyrsa
Changed in vault-charm:
  status: Fix Committed → Fix Released
tags: added: bseng-344
Changed in vault-charm:
  assignee: Martin Kalcok (martin-kalcok) → nobody
tags: added: bseng-1021
Changed in vault-charm:
  status: New → Triaged
Changed in vault-charm:
  status: Triaged → Fix Released
Changed in charm-neutron-api-plugin-ovn:
  status: Fix Committed → Fix Released
Changed in charm-ovn-central:
  status: Fix Committed → Fix Released
  status: Fix Released → Fix Committed
Changed in charm-neutron-api-plugin-ovn:
  status: Fix Released → Fix Committed
As mentioned, this arises because the interface protocol in question was created before application-level relation data was available, so the leader has no choice but to write the response data in its unit data bucket, potentially leading to conflicting data being presented on the relation. The requesting side has no way to know which unit is the leader and thus which data is authoritative, but it could perhaps parse the cert data and pick the best one based on the effective and expiration dates. However, there are many more clients than providers for this relation and this issue impacts all of them, not just Kubernetes.
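The "pick the best one" idea above can be sketched as follows. This is an illustration only, with hypothetical unit names and with datetime stand-ins for the parsed notBefore/notAfter fields (a real charm would first parse the PEM data from the relation, e.g. with the cryptography library):

```python
from datetime import datetime, timezone

# Hypothetical parsed certificates: (unit, not_before, not_after).
candidates = [
    ("vault/0", datetime(2022, 1, 1, tzinfo=timezone.utc),
                datetime(2023, 1, 1, tzinfo=timezone.utc)),
    ("vault/1", datetime(2023, 6, 1, tzinfo=timezone.utc),
                datetime(2024, 6, 1, tzinfo=timezone.utc)),
]

def best_cert(certs, now):
    """Prefer certs that are currently valid; among those, take the
    most recently issued one (latest not_before)."""
    valid = [c for c in certs if c[1] <= now <= c[2]]
    pool = valid or certs
    return max(pool, key=lambda c: c[1])

now = datetime(2023, 7, 1, tzinfo=timezone.utc)
print(best_cert(candidates, now)[0])  # -> vault/1 (newest valid cert)
```

The obvious drawback, as noted, is that this heuristic would have to be replicated in every client charm that consumes the interface.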
Possible solutions:
1) Migrate the interface protocol to app-level rel data. This would be the best solution for a new interface, but migrating to it now would require updating every charm which uses this interface on either side of the relation. It might be possible to do incrementally by writing the data in both buckets and applying one of the other fixes and then gradually updating the client charms to prefer the app-level data.
2) Make provider units clear their relation data whenever they see that they are not the leader. This requires no updates to the clients, and possibly no communication between the leader and non-leaders of the provider; however, there is a chance for the non-leaders to wipe their relation data before the leader has written the new data, so that should probably be coordinated using a leader data field.
3) Make provider units all write the latest data as soon as it's available. This should be possible for Vault if the non-leader units can read the secrets out, but they'll need some trigger to know when the leader has generated the initial or updated data. For EasyRSA, the cert data will need to be copied to leader data if it isn't already. This is a bit more complex than option 2, but it ensures that the correct data is always available on the relation.
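Option 2's leader-gated clearing can be sketched with a minimal in-memory model. The state stores below are hypothetical stand-ins for Juju's hook tools (leader-set/leader-get and relation-set): the leader bumps a serial in leader data after writing fresh certs, and non-leaders only clear their own stale bucket once they see that serial, so they never wipe data before the leader has written the replacement:

```python
leader_data = {}                      # shared; writable by leader only
unit_relation_data = {                # each unit's own relation bucket
    "vault/0": {"client.cert": "OLD-CERT"},
    "vault/1": {"client.cert": "OLD-CERT"},
}

def leader_reissue(unit, new_cert):
    """Leader hook: publish the new cert, then signal non-leaders."""
    unit_relation_data[unit] = {"client.cert": new_cert}
    leader_data["reissue-serial"] = leader_data.get("reissue-serial", 0) + 1

def non_leader_hook(unit):
    """Non-leader hook: clear stale data only after the leader has
    published new certs, avoiding the wipe-before-write race."""
    if leader_data.get("reissue-serial", 0) > 0:
        unit_relation_data[unit] = {}

leader_reissue("vault/1", "NEW-CERT")   # vault/1 is the leader
non_leader_hook("vault/0")              # vault/0 clears its stale bucket
print(unit_relation_data)
```

After this sequence only the leader's bucket carries a cert, so clients see a single source of truth without any client-side changes.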