OpenStack-Gate

File copies fail with "rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]" due to broken node networking (typically these jobs are restarted if the initial failure happens in the pre-run playbook)

Bug #1793370 reported by Matt Riedemann on 2018-09-19

This bug affects 1 person

Affects         Status     Importance  Assigned to  Milestone
OpenStack-Gate  Confirmed  Undecided   Unassigned

Bug Description

Seen here:

http://logs.openstack.org/17/595317/1/gate/build-openstack-sphinx-docs/b8849f2/ara-report/result/524daf9f-0c90-495c-9542-997cd2176063/

ssh: connect to host 2607:ff68:100:54:f816:3eff:fe02:1637 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
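
The three lines above suggest that ssh never connected and that rsync merely reported the resulting failure. As a rough sketch of how such output could be triaged automatically, the Python below extracts the ssh failure reason and the rsync exit code. The helper name `classify_copy_failure` and the idea of flagging a "network failure" whenever an ssh connect error is present are assumptions made for illustration; they are not part of Zuul, rsync, or elastic-recheck.

```python
import re

# OpenSSH client connection-failure line, in the form seen in the log above.
SSH_CONNECT_FAILURE = re.compile(
    r"^ssh: connect to host (?P<host>\S+) port (?P<port>\d+): (?P<reason>.+)$"
)
# rsync's final error line; the exit code appears as "(code N)".
RSYNC_ERROR = re.compile(r"^rsync error: .*\(code (?P<code>\d+)\)")


def classify_copy_failure(log_text):
    """Summarize an rsync-over-ssh failure log (hypothetical helper)."""
    summary = {"ssh_host": None, "ssh_reason": None, "rsync_code": None}
    for line in log_text.splitlines():
        line = line.strip()
        ssh_match = SSH_CONNECT_FAILURE.match(line)
        if ssh_match:
            summary["ssh_host"] = ssh_match.group("host")
            summary["ssh_reason"] = ssh_match.group("reason")
            continue
        rsync_match = RSYNC_ERROR.match(line)
        if rsync_match:
            summary["rsync_code"] = int(rsync_match.group("code"))
    # Assumption: an ssh connect error means the copy failed for network reasons.
    summary["network_failure"] = summary["ssh_host"] is not None
    return summary


SAMPLE_LOG = """\
ssh: connect to host 2607:ff68:100:54:f816:3eff:fe02:1637 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
"""

result = classify_copy_failure(SAMPLE_LOG)
print(result)
```

Run against the excerpt from this bug, the sketch reports the IPv6 host, the "Connection timed out" reason, and rsync code 255.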

This affects more than just docs builds, though it occurs primarily on limestone nodes.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22rsync%20error%3A%20unexplained%20error%20(code%20255)%20at%5C%22&from=7d
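
The logstash link above is hard to read because the search string is percent-encoded inside the URL fragment. A minimal sketch using only the Python standard library shows how to recover the query; the function name `logstash_params` is made up for this example.

```python
from urllib.parse import parse_qs, urlsplit

LOGSTASH_URL = (
    "http://logstash.openstack.org/#dashboard/file/logstash.json"
    "?query=message%3A%5C%22rsync%20error%3A%20unexplained%20error"
    "%20(code%20255)%20at%5C%22&from=7d"
)


def logstash_params(url):
    """Return the decoded parameters carried in the URL fragment."""
    fragment = urlsplit(url).fragment
    # Everything after '#' is the fragment, so the '?' must be split by hand.
    _, _, query_string = fragment.partition("?")
    return {key: values[0] for key, values in parse_qs(query_string).items()}


params = logstash_params(LOGSTASH_URL)
print(params["query"])  # message:\"rsync error: unexplained error (code 255) at\"
print(params["from"])   # 7d
```

The decoded query matches the rsync error text in the `message` field over the last seven days.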

Matt Riedemann (mriedem) wrote on 2018-09-19:

(2:12:32 PM) mriedem: clarkb: is this a known issue? http://logs.openstack.org/17/595317/1/gate/build-openstack-sphinx-docs/b8849f2/job-output.txt.gz#_2018-09-18_18_33_45_353493
(2:12:42 PM) mriedem: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22rsync%20error%3A%20unexplained%20error%20(code%20255)%20at%5C%22&from=7d
(2:13:40 PM) clarkb: mriedem: I think that is a limestone node (based on ipv6 usage) and I think logan- said he was fixing some stuff there?
(2:13:43 PM) openstackgerrit: Andreas Jaeger proposed openstack-infra/elastic-recheck master: convert docs to PTI https://review.openstack.org/559396
(2:13:49 PM) clarkb: mriedem: it is odd that it timed out then failed to reconnect
(2:13:50 PM) clarkb: logan-: ^

Changed in openstack-gate:
status:	New → Confirmed

Clark Boylan (cboylan) wrote on 2018-09-28:

Reading through a lot of these cases, the jobs are actually failing very early in the pre steps due to broken networking. They then fail again later in the post steps when trying to copy logs, as described in this bug. I think the impact of this issue is actually lower than elastic-recheck (e-r) may imply, because failing in pre playbooks like this causes the job to be retried. There is a cost associated with the retry, but it is lower than the cost of failing entirely.
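
The retry behaviour described in this comment can be sketched as a small decision function. This is a simplified model and not Zuul's actual implementation: the outcome strings, the function name `job_outcome`, and the default of three attempts are all assumptions chosen for illustration.

```python
def job_outcome(first_failed_phase, attempt, max_attempts=3):
    """Model the result of one job attempt.

    first_failed_phase: "pre", "run", "post", or None if nothing failed.
    attempt: 1-based number of the current attempt.
    max_attempts: assumed retry budget (illustrative value).
    """
    if first_failed_phase is None:
        return "success"
    if first_failed_phase == "pre":
        # A pre-playbook failure is treated as an infrastructure problem,
        # so the job is rescheduled while attempts remain.
        return "retry" if attempt < max_attempts else "give_up"
    # Failures that first appear in the run or post phase fail the job outright.
    return "failure"


# Broken networking hits the pre playbook first, so the later rsync failure
# in the post playbook does not decide the outcome.
print(job_outcome("pre", attempt=1))   # retry
print(job_outcome("post", attempt=1))  # failure
print(job_outcome("pre", attempt=3))   # give_up
```

Under this model, the rsync error in the post phase is only a symptom: the job was already going to be retried because of the earlier pre-phase failure.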

Clark Boylan (cboylan) on 2018-09-28

summary:

- "Collect sphinx build html" fails with "rsync error: unexplained error
- (code 255) at io.c(226) [Receiver=3.1.1]" on limestone nodes
+ File copies fail with "rsync error: unexplained error (code 255) at
+ io.c(226) [Receiver=3.1.1]" due to broken node networking (typically
+ these jobs are restarted if initial failure happens in the pre-run
+ playbook.

Matt Riedemann (mriedem) wrote on 2018-09-28:

If we're retrying the job run on this, then we could just mark the fingerprint / query in elastic-recheck so it doesn't show up in the graph, i.e. if it's just going to be persistent but not really a problem that needs a lot of attention.
