@christian-rohmann The problem essentially boils down to the exception at [1] being raised because, prior to that, [2] gets called as a result of a timeout exception that the code does not actually catch. This was traced to a privileged call being passed as an argument to [3] from [4] (which is in the patch we reverted).
So the *real* problem with the privsep code is that if an unexpected exception is raised, it does not get caught, which kills the reader thread and/or leaves the lock unreleased. A separate bug [5] was raised about the same issue and led to the fix [6] being added to privsep. Crucially, that fix replaces the raised AttributeError with a continue, which stops it from killing the reader thread. I have not yet tested whether this actually fixes all the agent issues we have seen, and we should do so. Even then, there is still room for improvement in the privsep code, namely [7], which should have an except clause that, if nothing else, logs a message to say that the message timed out.
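To make the suggestion concrete, here is a minimal sketch of the two behaviours I have in mind. It is not the oslo.privsep code: `Channel`, `handle_reply` and `send_recv` are hypothetical names used only for illustration, and the real reader loop and locking differ. It assumes a reply can arrive after its caller has already timed out; in that case the reader side skips the message rather than raising, and the caller side logs the timeout and always unregisters itself.

```python
import logging
import queue
import threading

LOG = logging.getLogger(__name__)


class Channel:
    """Toy stand-in for a privsep-style client channel (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # msgid -> queue.Queue that will hold the reply

    def handle_reply(self, msgid, data):
        """What a reader thread would call for each incoming message."""
        with self._lock:
            waiter = self._pending.get(msgid)
        if waiter is None:
            # The caller already timed out and unregistered itself.
            # Skip the message instead of raising, so the reader
            # thread stays alive.
            LOG.warning("dropping reply for unknown msgid %s", msgid)
            return False
        waiter.put(data)
        return True

    def send_recv(self, msgid, timeout):
        """Register a waiter for msgid and block until a reply or timeout."""
        waiter = queue.Queue(maxsize=1)
        with self._lock:
            self._pending[msgid] = waiter
        try:
            return waiter.get(timeout=timeout)
        except queue.Empty:
            # If nothing else, leave a trace that the message timed out.
            LOG.error("message %s timed out after %s seconds", msgid, timeout)
            raise TimeoutError(f"message {msgid} timed out")
        finally:
            # Always unregister, whatever exception was raised.
            with self._lock:
                self._pending.pop(msgid, None)
```

With something along these lines, a late reply for a timed-out call is logged and dropped, and later calls on the same channel still work because the reader never died and no state was left behind.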
[1] https://github.com/openstack/oslo.privsep/blob/6d41ef9f91b297091aa37721ba10456142fc5107/oslo_privsep/comm.py#L141
[2] https://github.com/openstack/oslo.privsep/blob/6d41ef9f91b297091aa37721ba10456142fc5107/oslo_privsep/comm.py#L174
[3] https://github.com/openstack/neutron/blob/d4b1b4a0729c187551e1fa2b2855db136456d496/neutron/common/utils.py#L689
[4] https://github.com/openstack/neutron/blob/d8f1f1118d3cde0b5264220836a250f14687893e/neutron/agent/linux/interface.py#L328
[5] https://bugs.launchpad.net/neutron/+bug/1930401
[6] https://github.com/openstack/oslo.privsep/commit/f7f3349d6a4def52f810ab1728879521c12fe2d0
[7] https://github.com/openstack/oslo.privsep/blob/f7f3349d6a4def52f810ab1728879521c12fe2d0/oslo_privsep/comm.py#L189