Losing k8s-master causes prometheus manual-jobs duplicate

Bug #1899706 reported by Chris Johnston
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Charm Helpers
Fix Released
Undecided
Unassigned
Kubernetes Control Plane Charm
Triaged
Medium
Unassigned
Prometheus2 charm
Triaged
Wishlist
Unassigned

Bug Description

If the kubernetes-master leader unit is lost, the new leader provides an additional set of relation data to prometheus without removing the data from the original leader. This causes duplicate manual-jobs in prometheus, in turn causing queries to produce duplicate results. This appears to be in how prometheus attempts to prevent job names from clashing by using the request_id in the job name [1], but the new leader is using a new request_id, thus not being marked as a duplicate of the original job.

kubernetes-master 1.17.12 active 3 kubernetes-master local 0 ubuntu
kubernetes-worker 1.17.12 active 4 kubernetes-worker local 0 ubuntu exposed
prometheus active 1 prometheus2 local 0 ubuntu

1) Deploy kubernetes-master in HA with 3 units and add relation with prometheus as documented. I used hacluster, I haven't tested with other methods.

2) Restart prometheus to get around bug #1891942 [2]:
juju ssh prometheus/0 "sudo systemctl restart snap.prometheus.prometheus.service"

3) grep 'job_name' from the prometheus config:
juju ssh prometheus/0 "sudo grep 'job_name' /var/snap/prometheus/current/prometheus.yml"

Note that there is one instance of each of:
"job_name": "kube-state-metrics-UUID"
"job_name": "k8s-api-endpoints-UUID"
"job_name": "kubernetes-nodes-UUID"
"job_name": "kube-state-telemetry-UUID"
"job_name": "kubernetes-cadvisor-UUID"

Record the UUID for each.

3) Run this command and notice that only the kubernetes-master leader unit provides data:
for rid in $(juju run -u prometheus/0 'relation-ids manual-jobs'); do echo $rid; for runit in $(juju run -u prometheus/0 "relation-list -r $rid"); do echo $rid - $runit; juju run -u prometheus/0 "relation-get -r $rid - $runit"; done; done

4) Stop the jujud kubernetes-master service on the leader:
juju ssh kubernetes-master/1 "sudo systemctl stop jujud-unit-kubernetes-master-1"

5) Allow things to settle and a new leader to be elected

6) Restart prometheuse to get around bug #1891942 [1]:
juju ssh prometheus/0 "sudo systemctl restart snap.prometheus.prometheus.service"

7) grep 'job_name' from the prometheus config:
juju ssh prometheus/0 "sudo grep 'job_name' /var/snap/prometheus/current/prometheus.yml"

Note that there are now two instances of each of:
"job_name": "kube-state-metrics-UUID"
"job_name": "k8s-api-endpoints-UUID"
"job_name": "kubernetes-nodes-UUID"
"job_name": "kube-state-telemetry-UUID"
"job_name": "kubernetes-cadvisor-UUID"

8) Run this command and notice that there are now two kubernetes-master units which provide data:
for rid in $(juju run -u prometheus/0 'relation-ids manual-jobs'); do echo $rid; for runit in $(juju run -u prometheus/0 "relation-list -r $rid"); do echo $rid - $runit; juju run -u prometheus/0 "relation-get -r $rid - $runit"; done; done

This causes a number of queries in prometheus to return 2x results since there are now 2x scrapes.

[1] https://github.com/juju-solutions/interface-prometheus-manual/blob/master/common.py#L22
[2] https://bugs.launchpad.net/charm-prometheus2/+bug/1891942

Tags: sts
Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the detailed report and reproduction steps!

I think this will need to be fixed in the prometheus charm. The code that injects the UUID[1] is part of the to_json function that's invoked by prometheus[2]. This is supposed to "ensure uniqueness" i.e. prevent multiple requests from clashing. A side effect of this is that when different units request the same job, the prometheus charm will generate multiple copies of that job.

The request ID that's used is part of the request-response pattern from charms.reactive. It is stored in relation data, and relation data is unit-scoped. Updating that pattern to allow re-use of a request ID across units would be doable but challenging. I strongly believe that, instead, this is a case where the request ID is being misused by the receiving side. Perhaps the relation ID should be used instead, which seems to be unique per application?

I am adding prometheus to this issue, but also leaving it open against kubernetes-master, since I would like another engineer from our team to look at this.

[1]: https://github.com/juju-solutions/interface-prometheus-manual/blob/3f775242c16d53243c993d7ba0c896169ad1639e/common.py#L36
[2]: https://git.launchpad.net/charm-prometheus2/tree/src/reactive/prometheus.py#n748

Changed in charm-kubernetes-master:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

When I looked at the code from a charm-prometheus2 support perspective, my take was that the kubernetes-master charm will set the prometheus manual jobs up on leader-elected[1], but, the demoted leader does not stop advertising it's jobs.

Because the charm-prometheus2 is architected to monitor each unit of a related application, this is why the tagging of the request UUID to ensure each unit that is announcing to be monitored is monitored individually. Imagine 100 units of telegraf with an additional manual-job metric for a per-unit counter.

In the instance of kubernetes-master charm providing details for monitoring a single pod (per manual-job) behind the 3 HA k8s-master, there is only need for one monitoring endpoint for kubernetes-master (announced to be monitored through the kubeapi-load-balancer, IIRC), but multiple units advertising to prometheus to be monitored, the data is duplicated.

I believe the solution is to add a function to remove the prometheus manual scrape job details from the relation @when_not('leadership.is_leader'). basically, the opposite of this function when leadership is lost:

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/master/reactive/kubernetes_master.py#L3128

Changed in charm-prometheus2:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Drew Freiberger (afreiberger) wrote :

I've marked this bug with wishlist against charm-prometheus2, as we could implement a scope variable in the relation that would allow for unit or app based monitors to be defined. This likely won't make it into the charm until the rewrites into the operator framework which is not yet on the roadmap.

Revision history for this message
Chris Johnston (cjohnston) wrote :

Removing and re-adding the kubernetes-master/prometheus relationship cleared the issue for now.

Seyeong Kim (seyeongkim)
tags: added: sts
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

how about this one? Please give me any advice

put timestamp from register_prometheus_jobs in kubernetes-master charm

https://pastebin.ubuntu.com/p/btZxVBHxn8/

and relation provider

https://pastebin.ubuntu.com/p/hD7Jh5x3qf/

then take that timestamp in prometheus charm like below, skip data with old timestamp, take newer one.

then we can take manual_jobs from later leader.

https://pastebin.ubuntu.com/p/Rthx7yWsSz/

I think it is not possible we can detect trigger when leader is down ( maybe im wrong )

Revision history for this message
Cory Johns (johnsca) wrote :

A better solution for this would be to use application-level relation data, but that does require newer version of Juju to support, and currently doesn't have support available in charmhelpers or reactive.

Revision history for this message
Cory Johns (johnsca) wrote :

Having the non-leader K8s unit clear its request from the relation data could still end up with duplicate data in the case of a stuck unit, but it would work in other cases without changes to the interface or promteheus2 charm.

Adding a timestamp would lead to complications when one side or the other is an older version of the charm which doesn't know about the timestamp, but could probably be handled gracefully enough to at least end up with the current behavior.

However, it looks like the interface could be changed to use the relation ID as the request ID instead of relying on the default UUID creation, as George suggested. That way, the existing de-duplication should work as-is and should even handle stuck non-leaders since the rest of the request body should be identical.

Revision history for this message
Cory Johns (johnsca) wrote :

To add to my last comment, in theory changing the request ID to be the relation ID should be as simple as adding a `request_id=relation.relation_id` param to [1] and I *think* it might even handle upgrades fine, too, since it selects the request to update based on the `job_name` field rather than the request ID. However, I'm probably missing something in how that would actually be handled on one side or the other if the request ID for an existing request is changed, so that would definitely need some testing.

[1]: https://github.com/juju-solutions/interface-prometheus-manual/blob/master/provides.py#L37-L41

Revision history for this message
Seyeong Kim (seyeongkim) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :

Note that pull/3 has been superseded by pull/6
https://github.com/juju-solutions/interface-prometheus-manual/pull/6

Changed in charm-helpers:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.