when an idle ssh transport is interrupted, bzrlib errors; should reconnect instead
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Bazaar |
Fix Released
|
High
|
John A Meinel | ||
2.1 |
In Progress
|
High
|
John A Meinel | ||
2.2 |
In Progress
|
High
|
John A Meinel | ||
2.3 |
In Progress
|
High
|
John A Meinel | ||
2.4 |
In Progress
|
High
|
John A Meinel |
Bug Description
Recent haproxy rollout on launchpad has illuminated a bzrlib bug:
20:29 <lifeless> mtaylor: wgrant: so, this is a bzrlib bug.
20:29 <lifeless> not the holding open
20:29 <lifeless> the handling of the disconnects.
20:29 <lifeless> transports are defined stateless.
20:29 <lifeless> so are RPC calls.
20:31 <lifeless> the haproxy stuff is showing this up very visibly, but the bug pre-existed - bzr didn't seen keepalives, and if ssh wasn't configured to do so it could naturally happen anyhow
The following traceback shows the carnage which happens when tarmac holds open its connection around the post-merge hook:
Traceback (most recent call last):
File "/usr/bin/tarmac", line 6, in <module>
main()
File "/usr/lib/
registry.
File "/usr/lib/
self._run(args)
File "/usr/lib/
run_bzr(args)
File "/usr/lib/
ret = run(*run_argv)
File "/usr/lib/
return self.run(
File "/usr/lib/
return self._operation
File "/usr/lib/
self.cleanups, self.func, *args, **kwargs)
File "/usr/lib/
result = func(*args, **kwargs)
File "/usr/lib/
self.
File "/usr/lib/
authors=
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
repo.
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
orig_
File "/usr/lib/
for index, key, value in self._iter_
File "/usr/lib/
if not self.key_count():
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
return self._get_
File "/usr/lib/
found.
File "/usr/lib/
for node_pos, node in self._read_
File "/usr/lib/
bytes = self._transport
File "/usr/lib/
resp, response_handler = self._client.
File "/usr/lib/
method, args, expect_
File "/usr/lib/
readv_
File "/usr/lib/
encoder.
File "/usr/lib/
self.
File "/usr/lib/
self.flush()
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
osutils.
File "/usr/lib/
sent = sock.send(
socket.error: [Errno 32] Broken pipe
implementation plan:
* catch the socket.error and change it into a higher-level "connection closed" error
* this might need care to cover both internal/paramiko and external ssh clients; external clients might give different errors on Windows
* in the client per-call method, catch this error if it happens while sending the initial request, log a message, and build a new medium(?) object
To test it automatically:
* make a special medium that raises this error through monkeypatching
* use existing ssh tests that start a real server to check that the right error is raised when the
Interactive tests are probably a good idea considering the interaction with external systems
* start a bzr ssh client from a python shell
* kill the server process
* send a new request and check that it logs and reconnects
Related branches
- bzr-core: Pending requested
-
Diff: 1035 lines (+589/-143)7 files modifiedbzrlib/help_topics/en/debug-flags.txt (+2/-0)
bzrlib/osutils.py (+40/-9)
bzrlib/smart/client.py (+170/-86)
bzrlib/smart/medium.py (+43/-24)
bzrlib/smart/protocol.py (+0/-3)
bzrlib/tests/test_osutils.py (+39/-0)
bzrlib/tests/test_smart_transport.py (+295/-21)
- bzr-core: Pending requested
-
Diff: 138 lines (+79/-5)3 files modifiedNEWS (+5/-0)
bzrlib/smart/client.py (+36/-2)
bzrlib/tests/test_smart_transport.py (+38/-3)
- bzr-core: Pending requested
-
Diff: 1397 lines (+810/-219)10 files modifiedNEWS (+10/-0)
bzrlib/help_topics/en/debug-flags.txt (+2/-0)
bzrlib/osutils.py (+4/-1)
bzrlib/smart/client.py (+208/-86)
bzrlib/smart/medium.py (+22/-6)
bzrlib/smart/protocol.py (+5/-3)
bzrlib/smart/request.py (+144/-99)
bzrlib/tests/test_smart.py (+5/-2)
bzrlib/tests/test_smart_request.py (+10/-0)
bzrlib/tests/test_smart_transport.py (+400/-22)
- bzr-core: Pending requested
-
Diff: 1387 lines (+810/-217)11 files modifiedbzrlib/help_topics/en/debug-flags.txt (+2/-0)
bzrlib/smart/client.py (+208/-86)
bzrlib/smart/medium.py (+22/-6)
bzrlib/smart/protocol.py (+5/-3)
bzrlib/smart/request.py (+144/-99)
bzrlib/tests/test_smart.py (+5/-2)
bzrlib/tests/test_smart_request.py (+10/-0)
bzrlib/tests/test_smart_transport.py (+399/-21)
doc/en/release-notes/bzr-2.1.txt (+5/-0)
doc/en/release-notes/bzr-2.2.txt (+5/-0)
doc/en/release-notes/bzr-2.3.txt (+5/-0)
- Richard Wilbur: Approve
-
Diff: 1576 lines (+891/-230)14 files modifiedbzrlib/help_topics/en/debug-flags.txt (+2/-0)
bzrlib/osutils.py (+12/-4)
bzrlib/smart/client.py (+208/-86)
bzrlib/smart/medium.py (+46/-6)
bzrlib/smart/protocol.py (+5/-3)
bzrlib/smart/request.py (+145/-100)
bzrlib/tests/__init__.py (+3/-1)
bzrlib/tests/test_bundle.py (+6/-3)
bzrlib/tests/test_osutils.py (+40/-1)
bzrlib/tests/test_remote.py (+4/-2)
bzrlib/tests/test_smart.py (+5/-2)
bzrlib/tests/test_smart_request.py (+10/-0)
bzrlib/tests/test_smart_transport.py (+399/-21)
doc/en/release-notes/bzr-2.4.txt (+6/-1)
- Jelmer Vernooij (community): Approve
- Martin Pool: Approve
-
Diff: 1324 lines (+799/-199)13 files modifiedbzrlib/help_topics/en/debug-flags.txt (+2/-0)
bzrlib/smart/client.py (+208/-86)
bzrlib/smart/medium.py (+20/-5)
bzrlib/smart/protocol.py (+5/-3)
bzrlib/smart/request.py (+145/-100)
bzrlib/tests/test_smart.py (+5/-2)
bzrlib/tests/test_smart_request.py (+10/-0)
bzrlib/tests/test_smart_transport.py (+376/-0)
doc/en/release-notes/bzr-2.1.txt (+5/-0)
doc/en/release-notes/bzr-2.2.txt (+5/-0)
doc/en/release-notes/bzr-2.3.txt (+5/-0)
doc/en/release-notes/bzr-2.4.txt (+8/-3)
doc/en/release-notes/bzr-2.5.txt (+5/-0)
Changed in bzr: | |
status: | New → Triaged |
status: | Triaged → Confirmed |
importance: | Undecided → High |
Changed in bzr: | |
importance: | High → Critical |
assignee: | nobody → canonical-bazaar (canonical-bazaar) |
Changed in bzr: | |
importance: | Critical → High |
assignee: | canonical-bazaar (canonical-bazaar) → Martin Pool (mbp) |
description: | updated |
summary: |
- when an idle ssh transport is interrupted, bzrlib errors + when an idle ssh transport is interrupted, bzrlib errors; should + reconnect instead |
Changed in bzr: | |
status: | Confirmed → Triaged |
status: | Triaged → In Progress |
Changed in bzr: | |
milestone: | none → 2.5b3 |
Changed in bzr: | |
milestone: | 2.5b3 → 2.5b4 |
Changed in bzr: | |
status: | In Progress → Fix Released |
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 8/2/2011 12:45 PM, Jelmer Vernooij wrote:
> ** Changed in: bzr Importance: High => Critical
>
> ** Changed in: bzr Assignee: (unassigned) => canonical-bazaar
> (canonical-bazaar)
>
Is this actually critical?
Also, the initial report was "idle ssh transport". Monty's traceback is
certainly not 'idle' given that we are actively fetching content.
And while the transfer-of-content is roughly stateless, I'm not 100%
sure we want to default-reconnect SSH connections.
I suppose we could use a reasonable try-again- but-not- forever, or but-fail- if-last- connect- failed. People with flaky connections
try-once-
would get slow-but-useful connections. Restarting codehosting would
allow things to continue roughly interrupted, but the network going down
wouldn't cause us to spin indefinitely.
John
=:->
-----BEGIN PGP SIGNATURE----- enigmail. mozdev. org/
8AXUACgkQJdeBCY SNAANUhQCgqCGM8 zIhGfqife6E2TIa LwQx GM6t7/uy+ SmXgQlXt
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://
iEYEARECAAYFAk4
Nc4AnRq18lIiBiy
=8wVS
-----END PGP SIGNATURE-----