destroy-environment errors and hangs forever

Bug #863510 reported by Gustavo Niemeyer
Affects: pyjuju
Status: Fix Released
Importance: Medium
Assigned to: Jim Baker

Bug Description

As seen in ftests:

+++ juju destroy-environment
2011-09-29 22:46:08,163 INFO Destroying environment 'sample' (type: ec2)...
2011-09-29 22:46:09,966 INFO Waiting on 3 EC2 instances to transition to terminated state, this may take a while
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: Getting https://ec2.us-east-1.amazonaws.com/?AWSAccessKeyId=AKIAJK3WYEWHCTYQBF7A&Action=DeleteSecurityGroup&GroupName=juju-sample-1&Signature=hIEGPhSpPtdir4raEbDPwUjoyAdTtrZaePnYrqu0urU%3D&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2011-09-30T02%3A46%3A45Z&Version=2008-12-01 took longer than 30 seconds.

Jim Baker (jimbaker) wrote :

The destroy-environment command currently does not attempt any
retries. The question is whether the command itself should retry, or
whether the user of the command should, when necessary.

It's not clear that the specific exception reported in this bug,
TimeoutError, is more relevant than other possible errors. The general
expectation of our code using txaws is that it wraps exceptions in
EC2Error (which might be ignored or further wrapped as
ProviderInteractionError). But regardless, in this code path, if there
is an error, environment destruction is stopped. This includes taking
too long. This suggests at least that the command itself must be
retried.

Any such error results in an exit code of 1. (This is contrary to
bug #697093; as I mention there, that is a problem in our testing,
not in the code itself.) So a script can retry this command
automatically whenever it returns 1.
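
A minimal sketch of such a scripted retry, for illustration only. The helper names (run_destroy, retry), the attempt count, the delay, and the timeout are all assumptions, as is answering the "Continue [y/N]" prompt by piping "y" on stdin. The timeout treats a hung command as a failure, so the wrapper itself cannot wait forever.

```python
import subprocess
import time


def run_destroy(timeout=300):
    """Run `juju destroy-environment` once and return its exit code.

    Assumption: the confirmation prompt accepts "y" piped on stdin.
    A run that exceeds `timeout` seconds is killed and reported as 1.
    """
    try:
        proc = subprocess.run(
            ["juju", "destroy-environment"],
            input="y\n",
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 1
    return proc.returncode


def retry(run, attempts=3, delay=30.0, sleep=time.sleep):
    """Call run() until it returns 0, at most `attempts` times.

    Returns the number of the attempt that succeeded, or None if
    every attempt returned a non-zero exit code.
    """
    for attempt in range(1, attempts + 1):
        if run() == 0:
            return attempt
        if attempt < attempts:
            sleep(delay)  # give EC2 time to finish terminating instances
    return None
```

A script would call retry(run_destroy) and treat a None result as a failure that needs manual attention.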

The last point to consider is whether environment destruction can
become wedged. Trying to provoke this suggests that it might be
possible, but only for a short period of time: re-running
destroy-environment eventually succeeds. Machines are always linked
to their security group, so it's always possible (eventually) to
iterate over them and destroy them. This doesn't guarantee that all
security groups will be deleted, but so long as the link to the
machine is gone, a subsequent bootstrap process will succeed.

Hence this analysis suggests that the proper way to resolve this bug
report is to better document assumptions, especially for when this
command is scripted. We may also want to look at adding a test for
this scenario to the test suite.

Jim Baker (jimbaker) wrote :

The relevant WTF output can be seen here:

http://wtf.labix.org/370/ec2-wordpress.out.FAILED

+++ juju destroy-environment
2011-09-29 22:46:08,163 INFO Destroying environment 'sample' (type: ec2)...
2011-09-29 22:46:09,966 INFO Waiting on 3 EC2 instances to transition to terminated state, this may take a while
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: Getting https://ec2.us-east-1.amazonaws.com/?AWSAccessKeyId=AKIAJK3WYEWHCTYQBF7A&Action=DeleteSecurityGroup&GroupName=juju-sample-1&Signature=hIEGPhSpPtdir4raEbDPwUjoyAdTtrZaePnYrqu0urU%3D&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2011-09-30T02%3A46%3A45Z&Version=2008-12-01 took longer than 30 seconds.
WARNING: this command will destroy the 'sample' environment (type: ec2).
This includes all machines, services, data, and other resources. Continue [y/N]Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object destroy_environment at 0x35430f0> ignored

It looks like this issue is more problematic than the snippet reported with the bug suggests: the Twisted reactor ignored the error here, and thus the command hangs forever. Making the control Commander more robust should fix this and other issues.

Jim Baker (jimbaker)
Changed in juju:
status: New → In Progress
assignee: nobody → Jim Baker (jimbaker)
milestone: none → eureka
importance: Undecided → Medium
Jim Baker (jimbaker) wrote :

Related to bug 846055

The problem is not in the Commander, but in the background security group deletion. So that each deletion can be properly waited for, a Deferred ("completed") is constructed for it. However, the errback on completed is only called if the txaws delete_security_group call fails with an EC2Error. In this bug and in #846055 the failure is a different exception, so neither the errback nor the callback is called. The error is instead handled by the reactor, which logs the exception and then ignores it, and anything waiting on completed waits forever.

Arguably #846055 should be fixed by making txaws more robust. However, gather_results should still not wait indefinitely.
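
One way to bound such a wait is sketched below. This is not juju's gather_results: it swaps in asyncio from the standard library in place of Twisted so that the example is self-contained, and the name gather_with_deadline and the timeout value are hypothetical.

```python
import asyncio


async def gather_with_deadline(awaitables, timeout):
    """Wait for all awaitables, but for no longer than `timeout` seconds.

    On expiry asyncio.wait_for cancels the pending work and raises
    asyncio.TimeoutError instead of blocking indefinitely.
    """
    return await asyncio.wait_for(asyncio.gather(*awaitables), timeout)
```

With a deadline in place, a deletion whose result never arrives surfaces as an error the caller can report, rather than as a hang.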

To fix, catch the more general Exception. This can be wrapped in ProviderInteractionError, or just passed through the completed Deferred errback, but it should not be ignored.
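
The shape of that fix can be sketched as follows. This is an illustration and not the committed change: it uses asyncio futures in place of Twisted Deferreds so that it runs with the standard library alone, the EC2Error and ProviderInteractionError classes are local stand-ins for the txaws and juju types, and delete_group and destroy_groups are hypothetical names.

```python
import asyncio


class EC2Error(Exception):
    """Local stand-in for the txaws error type."""


class ProviderInteractionError(Exception):
    """Local stand-in for juju's wrapper exception."""


async def delete_group(name, delete, completed):
    """Delete one security group in the background and always resolve
    `completed`, whatever kind of exception the delete raises."""
    try:
        await delete(name)
    except Exception as e:  # catching only EC2Error would leave `completed` unresolved
        completed.set_exception(
            ProviderInteractionError(f"deleting {name} failed: {e!r}"))
    else:
        completed.set_result(name)


async def destroy_groups(names, delete):
    """Start one background deletion per group, then wait on every
    `completed` future. Failures come back as exception objects."""
    loop = asyncio.get_running_loop()
    tasks, pending = [], []
    for name in names:
        completed = loop.create_future()
        # Keep a reference to each task so it is not garbage collected.
        tasks.append(asyncio.ensure_future(delete_group(name, delete, completed)))
        pending.append(completed)
    return await asyncio.gather(*pending, return_exceptions=True)
```

Because every code path resolves completed, the waiter always finishes. A timeout raised inside the delete call shows up in the results as a ProviderInteractionError rather than being logged and dropped.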

Jim Baker (jimbaker)
Changed in juju:
status: In Progress → Fix Released