The destroy-environment command currently does not attempt any
retries. The question is whether the command itself should retry, or
whether that should be left to the user of the command when necessary.
It's not clear that the specific exception reported in this bug,
TimeoutError, is more relevant than other possible errors. The general
expectation of our code using txaws is that it wraps exceptions in
EC2Error (which might be ignored or further wrapped as
ProviderInteractionError). But regardless, in this code path any
error, including taking too long, stops environment destruction. This
suggests that, at a minimum, the command as a whole must be re-run.
Any such error results in an exit code of 1. (This is contrary to bug
report #697093; as I mention there, that is a problem in our testing,
not in the code itself.) So a script can retry this command
automatically whenever it returns that exit code.
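As a minimal sketch of such a scripted retry, assuming only what is
stated above (any non-zero exit code means destruction stopped early and
the command should be run again): the helper names, the attempt count,
the delay, and the "juju destroy-environment" command line are all
illustrative assumptions, not part of the existing code.

```python
import subprocess
import time


def retry_until_success(run, attempts=5, delay=10.0, sleep=time.sleep):
    """Call run() until it returns exit code 0 or attempts run out.

    run is any callable returning a process exit code. The last exit
    code seen is returned, so the caller can still report failure.
    """
    code = 1
    for attempt in range(1, attempts + 1):
        code = run()
        if code == 0:
            return 0
        # Pause between attempts, but not after the final one.
        if attempt < attempts:
            sleep(delay)
    return code


def destroy_environment():
    # Hypothetical invocation: the executable name, and whether the
    # command needs extra input, depend on the installation.
    return subprocess.call(["juju", "destroy-environment"])
```

A wrapper script would call retry_until_success(destroy_environment)
and exit with the returned code. The number of attempts is bounded so
that a persistent provider failure still surfaces as a non-zero exit
rather than looping forever.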
The last point to consider is whether environment destruction can
become wedged. Trying to provoke this suggests that it might be
possible, but only for a short period of time: re-running
destroy-environment eventually succeeds. Machines are always linked to
their security group, so it's always possible (eventually) to iterate
over them and destroy them. Although this doesn't guarantee that all
security groups will be deleted, so long as the link to the machine is
gone, a subsequent bootstrap process will succeed.
Hence this analysis suggests that the proper way to resolve this bug
report is to better document these assumptions, especially for the case
where this command is scripted. We may also want to look at adding
tests for this scenario to the test suite.