kolla-ansible

fluentd not reconnecting to ES on failures

Bug #1830724 reported by Krzysztof Klimonda on 2019-05-28

This bug affects 3 people

Affects	Status	Importance	Assigned to	Milestone
kolla-ansible	Fix Released	Medium	Doug Szumski	kolla-ansible 10.0.0 "ussuri"
Rocky	New	Medium	Unassigned	kolla-ansible 7.1.1 "rocky"
Stein	Fix Released	Medium	Radosław Piliszek	kolla-ansible 8.0.0 "Stein"
Train	Fix Released	Medium	Radosław Piliszek	kolla-ansible 9.0.1 "Train"
Ussuri	Fix Released	Medium	Doug Szumski	kolla-ansible 10.0.0 "ussuri"

Bug Description

According to the fluent-plugin-elasticsearch documentation, the plugin by default reconnects to the ES cluster only when it receives a "host unreachable" exception. This can be changed by setting `reconnect_on_error` to `true`. This is even more strongly recommended when connecting to ES clusters running security guard.
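
For reference, a minimal sketch of an Elasticsearch output section with this option enabled might look like the following. The `<match **>` pattern and the host name `es.example.org` are placeholders, not values from this deployment; the port and scheme are the ones visible in the log excerpt in this report. This is not the configuration that kolla-ansible generates.

```
<match **>
  @type elasticsearch
  # Hypothetical host name; replace with the address of your ES cluster.
  host es.example.org
  port 9200
  scheme https
  # Reset the connection on any error, not only on "host unreachable".
  reconnect_on_error true
</match>
```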

What I'm currently experiencing in my deployment seems to be related: once the fluentd ES plugin loses connectivity to the ES cluster, it never recovers and logs are no longer sent:

```
2019-05-22 21:47:32 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2019-05-22 21:47:33 +0000 chunk="58980e875da18f46c6c1030714d07a5d" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"monitor.region1.\", :port=>9200, :scheme=>\"https\", :user=>\"logstash\", :password=>\"obfuscated\"}): read timeout reached"
2019-05-23 19:04:44 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2019-05-23 19:04:45 +0000 chunk="58992c060e9445fe909cb4dadc1751ab" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"monitor.region1.\", :port=>9200, :scheme=>\"https\", :user=>\"logstash\", :password=>\"obfuscated\"}): end of file reached (EOFError)"
2019-05-23 19:04:45 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2019-05-23 19:04:46 +0000 chunk="58992c060e9445fe909cb4dadc1751ab" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"monitor.region1.\", :port=>9200, :scheme=>\"https\", :user=>\"logstash\", :password=>\"obfuscated\"}): end of file reached (EOFError)"
2019-05-23 19:04:46 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2019-05-23 19:04:48 +0000 chunk="58992c060e9445fe909cb4dadc1751ab" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"monitor.region1.\", :port=>9200, :scheme=>\"https\", :user=>\"logstash\", :password=>\"obfuscated\"}): end of file reached (EOFError)"
[...]
```

If I wait long enough, I can see that fluentd gives up on pushing chunks and drops them.

I'll open a review with a proposed configuration change that I've just deployed on one of my controller nodes to see if it helps.

OpenStack Infra (hudson-openstack) wrote on 2019-05-28: Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661747

Changed in kolla-ansible:
assignee:	nobody → Krzysztof Klimonda (kklimonda)
status:	New → In Progress

Mark Goddard (mgoddard) wrote on 2019-05-29:

We have seen this on Rocky-based clouds. Jack Heskett and Doug Szumski spent some time on it, so they may be able to help. I added them as reviewers.

OpenStack Infra (hudson-openstack) on 2019-06-19

Changed in kolla-ansible:
assignee:	Krzysztof Klimonda (kklimonda) → Doug Szumski (dszumski)

OpenStack Infra (hudson-openstack) on 2019-06-26

Changed in kolla-ansible:
assignee:	Doug Szumski (dszumski) → Krzysztof Klimonda (kklimonda)

OpenStack Infra (hudson-openstack) wrote on 2019-07-16: Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/671080

OpenStack Infra (hudson-openstack) wrote on 2019-09-20: Change abandoned on kolla-ansible (stable/stein)

Change abandoned by Will Szumski (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/671080
Reason: Not merged in master

OpenStack Infra (hudson-openstack) on 2019-11-12

Changed in kolla-ansible:
assignee:	Krzysztof Klimonda (kklimonda) → Michal Nasiadka (mnasiadka)

OpenStack Infra (hudson-openstack) on 2019-11-13

Changed in kolla-ansible:
assignee:	Michal Nasiadka (mnasiadka) → Doug Szumski (dszumski)

OpenStack Infra (hudson-openstack) on 2019-11-15

Changed in kolla-ansible:
assignee:	Doug Szumski (dszumski) → Michal Nasiadka (mnasiadka)

Radosław Piliszek (yoctozepto) wrote on 2019-12-07:

Duplicate: https://bugs.launchpad.net/kolla-ansible/+bug/1855528

It seems to give up after some time and works again for a bit.
I suspect there is also some bug in pooling because there is no other indication that there was an issue with connectivity between fluentd and ES - could be some intermittent load at most.

OpenStack Infra (hudson-openstack) on 2019-12-07

Changed in kolla-ansible:
assignee:	Michal Nasiadka (mnasiadka) → Radosław Piliszek (yoctozepto)

OpenStack Infra (hudson-openstack) on 2019-12-09

Changed in kolla-ansible:
assignee:	Radosław Piliszek (yoctozepto) → Doug Szumski (dszumski)

OpenStack Infra (hudson-openstack) wrote on 2019-12-10: Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/661747
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=0c573062fc25e208bfa1206146fb31b401c8b7e5
Submitter: Zuul
Branch: master

commit 0c573062fc25e208bfa1206146fb31b401c8b7e5
Author: Krzysztof Klimonda <email address hidden>
Date: Tue May 28 12:05:48 2019 +0000

Make fluentd-elasticsearch configuration more robust

    Enable reconnect_on_error option so that ES plugin re-establishes
    a new session to the ES cluster on errors. Also, enable buffering
    to the file, so that the buffer survives container restarts.

    Co-Authored-By: Michal Nasiadka <email address hidden>
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Co-Authored-By: Doug Szumski <email address hidden>
    Closes-Bug: #1830724
    Change-Id: Ia40685b9d4fc02194e03c8791ddeb3d29d7f07f6
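
As an illustration of the file buffering this commit message describes, a file-backed buffer in a fluentd output section can be sketched roughly as follows. The match pattern and the buffer path are hypothetical and are not taken from the merged change; the actual kolla-ansible template may differ.

```
<match **>
  @type elasticsearch
  # Connection settings (host, port, scheme, credentials) omitted.
  reconnect_on_error true
  <buffer>
    # Keep queued chunks on disk instead of in memory.
    @type file
    # Hypothetical path; it should be on storage that outlives the container.
    path /var/lib/fluentd/data/elasticsearch.buffer
  </buffer>
</match>
```

The buffer only survives a container restart if the directory is on a volume that persists across restarts.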

Changed in kolla-ansible:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2020-01-02: Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/700927

OpenStack Infra (hudson-openstack) wrote on 2020-01-07: Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/671080
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=7b3b1def82262dd44ba8e3865b53855a7e3a3143
Submitter: Zuul
Branch: stable/stein

commit 7b3b1def82262dd44ba8e3865b53855a7e3a3143
Author: Krzysztof Klimonda <email address hidden>
Date: Tue May 28 12:05:48 2019 +0000

Make fluentd-elasticsearch configuration more robust

OpenStack Infra (hudson-openstack) wrote on 2020-01-07: Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/700927
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=51adfd0100e00353daacc032f155919a818c0289
Submitter: Zuul
Branch: stable/train

commit 51adfd0100e00353daacc032f155919a818c0289
Author: Krzysztof Klimonda <email address hidden>
Date: Tue May 28 12:05:48 2019 +0000

Make fluentd-elasticsearch configuration more robust

OpenStack Infra (hudson-openstack) wrote on 2020-01-30: Fix included in openstack/kolla-ansible 8.1.0

This issue was fixed in the openstack/kolla-ansible 8.1.0 release.

OpenStack Infra (hudson-openstack) wrote on 2020-01-30: Fix included in openstack/kolla-ansible 9.0.1

This issue was fixed in the openstack/kolla-ansible 9.0.1 release.

Duplicates of this bug

Bug #1855528