Elasticsearch reports elasticsearch.ready and elasticsearch.available before all hosts have finished stabilizing

Bug #2009212 reported by Alexander Balderson
This bug affects 2 people
Affects: Elasticsearch Charm
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

While deploying elasticsearch rev 67 and graylog rev 64, graylog went into an error state after a connection-refused error from elasticsearch, resulting in a beats-relation-changed hook error. The graylog unit waits until the elasticsearch.ready and elasticsearch.available flags are both set, and then tries to connect to all 3 units [1]. Looking through the logs, however, graylog is trying to connect before all 3 units have been fully added:

from the unit-graylog log:

2023-03-03 08:59:57 INFO unit.graylog/1.juju-log server.go:316 beats:21: Error configuring ES: HTTPConnectionPool(host='10.246.64.211', port=9200):

But the elasticsearch.log on units 0 (this unit) and 2 (the master) shows that 10.246.64.211 is added to the cluster one second later, and unit 1 doesn't even know about .211 yet.

[2023-03-03T08:59:58,250][INFO ][o.e.c.s.ClusterApplierService] [yJR_g8D] added {{0tRUaVL}{0tRUaVLtQZ-Wiq-hHggZXQ}{bEB-5RVBRj-Rl1mODdbUlw}{10.246.64.211}{10.246.64.211:9300}{ml.machine_memory=8343527424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {BztmWJn}{BztmWJn7ROGFwqpiUf8CGw}{0orAbs74TSWrN5fCsKDMOg}{10.246.64.206}{10.246.64.206:9300}{ml.machine_memory=8343531520, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [16]])

Elasticsearch probably didn't have time to fully come online before graylog tried to make the request. The other 2 units, one of which is the leader, are both active/idle; it's just the third unit that is still being configured at the time graylog tries to connect.

In this case:
graylog 1 gets the error
elastic 0 is executing
elastic 1 is the leader, but doesn't know about 0
elastic 2 is the master, and knows about 0
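
One possible way to close this race, sketched here as an assumption rather than as what either charm does today, would be to gate the ready/available flags (or graylog's connection attempt) on the Elasticsearch cluster health API, which accepts wait_for_nodes and timeout parameters. The function names below are hypothetical and are not taken from charm-elasticsearch or charm-graylog:

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def health_url(host, expected_nodes, port=9200, timeout_s=30):
    """Build a _cluster/health URL that blocks until enough nodes have joined."""
    query = urllib.parse.urlencode({
        "wait_for_nodes": expected_nodes,
        "timeout": f"{timeout_s}s",
    })
    return f"http://{host}:{port}/_cluster/health?{query}"


def cluster_is_ready(health, expected_nodes):
    """Decide readiness from a parsed _cluster/health response body."""
    return (
        not health.get("timed_out", True)
        and health.get("number_of_nodes", 0) >= expected_nodes
        and health.get("status") in ("yellow", "green")
    )


def unit_is_ready(host, expected_nodes, port=9200, timeout_s=30):
    """Return True only if this unit answers and sees the whole cluster.

    Hypothetical helper: a connection refused (as in the graylog log above),
    an HTTP error, or an unparsable body all count as "not ready yet".
    """
    url = health_url(host, expected_nodes, port, timeout_s)
    try:
        with urllib.request.urlopen(url, timeout=timeout_s + 5) as resp:
            health = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return False
    return cluster_is_ready(health, expected_nodes)
```

With a check like this, a caller would only proceed once unit_is_ready(address, 3) returns True for every unit address it intends to contact, instead of relying on the flags alone.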

You can view the full test run at:
https://solutions.qa.canonical.com/v2/testruns/aacbcaef-efba-45e9-9fff-38f5129f8bc8
The crashdump for this run can be found at:
https://oil-jenkins.canonical.com/artifacts/aacbcaef-efba-45e9-9fff-38f5129f8bc8/generated/generated/lma-maas/juju-crashdump-lma-maas-2023-03-03-09.01.14.tar.gz

[1] https://git.launchpad.net/charm-graylog/tree/src/reactive/graylog.py#n758

Eric Chen (eric-chen)
tags: added: bseng-967
Changed in charm-elasticsearch:
importance: Undecided → Medium
status: New → Triaged
Eric Chen (eric-chen)
Changed in charm-elasticsearch:
importance: Medium → Low