File copies fail with "rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]" due to broken node networking (typically these jobs are restarted if initial failure happens in the pre-run playbook).

Bug #1793370 reported by Matt Riedemann
Affects          Status      Importance   Assigned to   Milestone
OpenStack-Gate   Confirmed   Undecided    Unassigned

Bug Description

Seen here:

http://logs.openstack.org/17/595317/1/gate/build-openstack-sphinx-docs/b8849f2/ara-report/result/524daf9f-0c90-495c-9542-997cd2176063/

ssh: connect to host 2607:ff68:100:54:f816:3eff:fe02:1637 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
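
The excerpt above pairs an ssh connection timeout with the rsync failure, which suggests the 255 is likely the exit status of the failed ssh transport rather than an rsync-specific problem. A minimal sketch of how such logs could be triaged, assuming the job output is available as plain text; the function name `classify` and the labels it returns are hypothetical and not part of elastic-recheck or Zuul:

```python
import re

# Patterns taken from the log lines quoted in this bug.
SSH_TIMEOUT = re.compile(r"ssh: connect to host \S+ port \d+: Connection timed out")
RSYNC_255 = re.compile(r"rsync error: unexplained error \(code 255\)")

# Sample text copied from the bug description.
SAMPLE = """\
ssh: connect to host 2607:ff68:100:54:f816:3eff:fe02:1637 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
"""


def classify(log_text):
    """Return a hypothetical triage label for a chunk of job output."""
    has_ssh_timeout = bool(SSH_TIMEOUT.search(log_text))
    has_rsync_255 = bool(RSYNC_255.search(log_text))
    if has_rsync_255 and has_ssh_timeout:
        # The node could not be reached at all, so the copy never started.
        return "node-unreachable"
    if has_rsync_255:
        # Same rsync error, but no ssh timeout seen in this text.
        return "rsync-255-other"
    return "no-match"


if __name__ == "__main__":
    print(classify(SAMPLE))
```

A log that contains the rsync error without the ssh timeout falls into the second label, which keeps the networking case separate from any other cause of code 255.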

This affects more than just docs builds, though it occurs primarily on limestone nodes.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22rsync%20error%3A%20unexplained%20error%20(code%20255)%20at%5C%22&from=7d
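
The logstash link above carries its search in the URL fragment, percent-encoded. A small sketch of decoding it to see the actual query string, using only the Python standard library; the helper name `dashboard_params` is made up for illustration:

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the bug description.
URL = (
    "http://logstash.openstack.org/#dashboard/file/logstash.json"
    "?query=message%3A%5C%22rsync%20error%3A%20unexplained%20error"
    "%20(code%20255)%20at%5C%22&from=7d"
)


def dashboard_params(url):
    """Decode the parameters that follow '?' inside the URL fragment."""
    fragment = urlsplit(url).fragment
    # Everything after the first '?' in the fragment is the parameter string.
    _, _, param_string = fragment.partition("?")
    return {key: values[0] for key, values in parse_qs(param_string).items()}


params = dashboard_params(URL)

if __name__ == "__main__":
    print(params["query"])
    print(params["from"])
```

The decoded `query` value is the message match on the rsync error text, and `from` is the 7-day window.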

Matt Riedemann (mriedem) wrote :

(2:12:32 PM) mriedem: clarkb: is this a known issue? http://logs.openstack.org/17/595317/1/gate/build-openstack-sphinx-docs/b8849f2/job-output.txt.gz#_2018-09-18_18_33_45_353493
(2:12:42 PM) mriedem: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22rsync%20error%3A%20unexplained%20error%20(code%20255)%20at%5C%22&from=7d
(2:13:40 PM) clarkb: mriedem: I think that is a limestone node (based on ipv6 usage) and I think logan- said he was fixing some stuff there?
(2:13:43 PM) openstackgerrit: Andreas Jaeger proposed openstack-infra/elastic-recheck master: convert docs to PTI https://review.openstack.org/559396
(2:13:49 PM) clarkb: mriedem: it is odd that it timed out then failed to reconnect
(2:13:50 PM) clarkb: logan-: ^

Changed in openstack-gate:
status: New → Confirmed
Clark Boylan (cboylan) wrote :

Reading through a lot of these cases, the jobs are actually failing very early in the pre steps due to broken networking. They then fail again later in the post steps when trying to copy logs, as described in this bug. I think the impact of this issue is actually lower than e-r (elastic-recheck) may imply, because failing in pre playbooks like this will cause the job to be retried. There is a cost associated with the retry, but it is lower than the cost of failing entirely.

Clark Boylan (cboylan)
summary: - "Collect sphinx build html" fails with "rsync error: unexplained error
- (code 255) at io.c(226) [Receiver=3.1.1]" on limestone nodes
+ File copies fail with "rsync error: unexplained error (code 255) at
+ io.c(226) [Receiver=3.1.1]" due to broken node networking (typically
+ these jobs are restarted if initial failure happens in the pre-run
+ playbook.
Matt Riedemann (mriedem) wrote :

If we're retrying the job run on this, then we could just mark the fingerprint/query in elastic-recheck so it doesn't show up in the graph, i.e. if it's going to be persistent but not really a problem that needs a lot of attention.
