In some cases rpc_loop or _sync_routers_task blocks. From log observations this always happens while executing subprocess.communicate, and the root cause could be this: https://github.com/eventlet/eventlet/pull/24
This is a bit strange, since popen.communicate is also used in common.processutils and no other blocking issue has been reported there. Perhaps neutron.agent.linux.utils.execute should leverage openstack.common instead.
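For illustration only, a minimal sketch of what such a delegation might look like. The wrapper signature below is an assumption for this example, not the actual neutron.agent.linux.utils.execute signature, and it presumes the oslo-incubator processutils module is synced into the tree:

    # Hedged sketch only: let the agent-side execute() delegate to the
    # shared processutils helper instead of driving Popen.communicate()
    # itself. Import path and keyword arguments are assumptions.
    import shlex

    from neutron.openstack.common import processutils


    def execute(cmd, root_helper=None, process_input=None,
                check_exit_code=True):
        """Hypothetical thin wrapper around processutils.execute()."""
        cmd = list(map(str, cmd))
        if root_helper:
            cmd = shlex.split(root_helper) + cmd
        stdout, stderr = processutils.execute(
            *cmd, process_input=process_input,
            check_exit_code=check_exit_code)
        return stdout

This would not by itself fix the eventlet issue (processutils drives communicate the same way), but it would at least put all subprocess handling on the one code path that is known not to misbehave.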
Another thing that is hard to explain at the moment is why this would not affect the DHCP agent.
In other cases the following exception is raised instead (and probably it should not be):
2013-10-04 12:28:21.360 1259 ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent Traceback (most recent call last):
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent   File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 730, in _rpc_loop
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent     self._process_router_delete()
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent   File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 739, in _process_router_delete
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent     self._router_removed(router_id)
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent   File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 313, in _router_removed
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent     ri = self.router_info[router_id]
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent KeyError: u'b17f5fe6-8354-4af7-b271-a4ab0896dcb7'
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent
This triggers a full synchronization (see the sketch after the list below), which has the following effects:
- it blocks the rpc loop, so the update for the floating IP is delayed. With many routers (and the tenant isolation jobs add many routers) this might mean that the floating IP is applied only after the tempest timeout has elapsed.
- performing many execute operations increases the chance of the thread blocking.
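Paraphrasing the traceback above and the agent's observed behavior, the relevant control flow is roughly the following. This is a heavily simplified, partly assumed sketch rather than the actual l3_agent code:

    import logging

    LOG = logging.getLogger(__name__)


    class L3AgentSketch(object):
        """Hypothetical, heavily simplified stand-in for the real agent."""

        def __init__(self):
            self.router_info = {}         # router_id -> per-router state
            self.removed_routers = set()  # ids queued for deletion
            self.fullsync = False

        def _rpc_loop(self):
            try:
                self._process_router_delete()
                # ... process added/updated routers ...
            except Exception:
                LOG.exception("Failed synchronizing routers")
                # This flag is what makes the next sync pass walk every
                # router again, with all the execute() calls that implies.
                self.fullsync = True

        def _process_router_delete(self):
            for router_id in list(self.removed_routers):
                self.removed_routers.discard(router_id)
                self._router_removed(router_id)

        def _router_removed(self, router_id):
            ri = self.router_info[router_id]  # KeyError if already gone
            # ... tear down namespace, ports and floating IPs using ri ...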
The current approach is to 'blindly' ignore the router_removed error so that it does not trigger the full router synchronization (sketched below).
If it works, this should be regarded only as the first step of a more complex fix aimed at getting the gate going again.
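Continuing the simplified sketch above, the 'blindly ignore' approach would amount to something like the following in _router_removed; the exact placement and log message are assumptions, not the actual patch:

    # Sketch of the proposed workaround, reusing the simplified class above.
    def _router_removed(self, router_id):
        ri = self.router_info.get(router_id)
        if ri is None:
            # The router is already gone from the agent's cache: log and
            # return instead of raising KeyError, so _rpc_loop does not
            # set fullsync and kick off a full resynchronization.
            LOG.warning("Info for router %s not found, skipping removal",
                        router_id)
            return
        # ... existing teardown logic continues here using ri ...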