Personally, I don't like the policy rule on the fault message field. It's okay if we want that in general, but I don't think it's a reasonable solution to this problem, because it means nobody gets any information about why things failed anymore (not even NoValidHost).
It sucks to have to whack-a-mole the exceptions, but I think addressing it in compute/utils, where we stringify unknown exceptions, is a good plan. However, I don't think we can do that conditionally based on context.is_admin, because an admin doing an operation would record this same information in the fault, which another user could then see.
So I think we should change the behavior that stringifies any unknown (i.e. non-NovaException) exception to always just grab exception.__class__.__name__ (which we already do if the exception stringifies to something falsy) instead of the full message, and record that. The details will still be there for admin viewers, but we treat any non-NovaException as could-be-sensitive and only record the name.
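To make the proposal concrete, here is a rough sketch of the message selection I have in mind. It is not the actual compute/utils code: the NovaException class below is a simplified stand-in and fault_message is a hypothetical helper name used only for illustration.

```python
class NovaException(Exception):
    """Simplified stand-in for nova.exception.NovaException (illustration only)."""

    def format_message(self):
        return str(self)


def fault_message(exc):
    """Pick the user-visible fault message for an exception (hypothetical helper)."""
    if isinstance(exc, NovaException):
        # Known exception: the message is one we wrote, so it is safe to show.
        # Fall back to the class name if the message is empty.
        return exc.format_message() or exc.__class__.__name__
    # Unknown exception: the message could contain sensitive data,
    # so record only the class name.
    return exc.__class__.__name__
```

With this, a NoValidHost-style NovaException keeps its message, while something like a ValueError carrying a connection string is recorded as just "ValueError".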
We'll need to backport it, it won't be fully fixed until all computes are upgraded, and people may have sensitive things in their databases now that need scrubbing. However, this is the right solution, IMHO. If we want a message policy toggle as well (in general, or to mitigate exposure while scrubbing and upgrading) then that's fine I guess, although it does seem unfortunate for admins to turn the message off such that failed instance boots just go to ERROR with no explanation.