Neutron L3 agent doesn't reschedule routers when MQ is down

Bug #1493755 reported by Dmitry Nikishov
This bug affects 2 people
Affects             Status   Importance  Assigned to      Milestone
Mirantis OpenStack  Invalid  High        MOS Neutron
6.0.x               Invalid  High        MOS Maintenance
6.1.x               Invalid  High        MOS Maintenance
7.0.x               Invalid  High        MOS Neutron
8.0.x               Invalid  High        MOS Neutron

Bug Description

HA, Neutron/VLAN, Ubuntu, MOS 6.0 (Juno); Kilo is likely to be affected as well.

Easiest way to reproduce:
- deploy HA/Neutron+VLAN/Ubuntu
- reboot a controller that has a "qrouter" network namespace

Expected result:
- router is rescheduled to another controller, along with qrouter namespace

Actual result:
- router is not rescheduled

Additional analysis:
When the L3 agent starts, it runs periodic_sync_routers_task with fullsync = True to fetch the list of routers from the server and determine whether any routers are not hosted anywhere [1]. The agent then starts hosting those routers (creating network namespaces, etc.).
Such a sync is performed on any router/agent update, and after a successful sync the "fullsync" flag is set to False [2]. However, if the controller node hosting a router is rebooted, the other nodes do not perform a fullsync, so routers are not rescheduled away from the dead L3 agent. The likely cause is that failover in MOS 6.0 requires shutting down and reconfiguring the RabbitMQ cluster, during which RPC calls are unavailable. The server tries to auto-reschedule routers from the dead agent (because "allow_automatic_l3agent_failover=True" is set in neutron.conf), but RabbitMQ is down, so the attempt fails and is not retried. As a result, the live L3 agents never learn that routers need rescheduling, and they do not re-sync the router list because fullsync was set to False before the failover.

[1] https://github.com/openstack/neutron/blob/stable/juno/neutron/agent/l3_agent.py#L1911
[2] https://github.com/openstack/neutron/blob/stable/juno/neutron/agent/l3_agent.py#L1939
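
For illustration, here is a minimal, self-contained Python sketch of the fullsync behaviour described above. Apart from periodic_sync_routers_task and the fullsync flag, all names are placeholders, not Neutron's actual code:

class FakeServerRpc(object):
    """Stands in for the agent's RPC client to the Neutron server."""
    def get_routers(self):
        # In the real agent this is an RPC call over RabbitMQ [1].
        return [{'id': 'router-1'}]

class L3AgentSketch(object):
    def __init__(self, rpc):
        self.rpc = rpc
        self.fullsync = True                 # set once at agent startup

    def periodic_sync_routers_task(self):
        if not self.fullsync:
            # Nothing re-arms the flag later, so routers orphaned while
            # RabbitMQ was down are never picked up by this agent.
            return
        try:
            routers = self.rpc.get_routers()
            for router in routers:
                self._process_router(router)
            self.fullsync = False            # cleared after a successful sync [2]
        except Exception:
            self.fullsync = True             # RPC failed; retry next period

    def _process_router(self, router):
        print('hosting router %s (namespace qrouter-%s)'
              % (router['id'], router['id']))

if __name__ == '__main__':
    agent = L3AgentSketch(FakeServerRpc())
    agent.periodic_sync_routers_task()       # fullsync happens only once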

Revision history for this message
Dmitry Nikishov (nikishov-da) wrote :

Update 1:
The server gives up because there are no alive L3 agents left for the duration of the HA failover (the RabbitMQ cluster is down, which causes the agents to be marked as dead):

2015-09-09 07:40:33.247 15781 WARNING neutron.scheduler.l3_agent_scheduler [-] No active L3 agents
2015-09-09 07:40:33.252 15781 ERROR neutron.db.l3_agentschedulers_db [-] Failed to reschedule router 5924f1d2-a47c-4085-af02-79fa381cfe5d
2015-09-09 07:40:33.252 15781 TRACE neutron.db.l3_agentschedulers_db Traceback (most recent call last):
2015-09-09 07:40:33.252 15781 TRACE neutron.db.l3_agentschedulers_db File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 136, in reschedule_routers_from_down_agents
2015-09-09 07:40:33.252 15781 TRACE neutron.db.l3_agentschedulers_db self.reschedule_router(context, binding.router_id)
2015-09-09 07:40:33.252 15781 TRACE neutron.db.l3_agentschedulers_db File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 273, in reschedule_router
2015-09-09 07:40:33.252 15781 TRACE neutron.db.l3_agentschedulers_db router_id=router_id)
2015-09-09 07:40:33.252 15781 TRACE neutron.db.l3_agentschedulers_db RouterReschedulingFailed: Failed rescheduling router 5924f1d2-a47c-4085-af02-79fa381cfe5d: no eligible l3 agent found.
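
A rough, hedged sketch of the server-side rescheduling loop implied by the traceback above; the exception name matches the log, everything else is illustrative:

class RouterReschedulingFailed(Exception):
    """Raised when no eligible L3 agent can take over a router."""

def reschedule_routers_from_down_agents(down_bindings, reschedule_router, warn):
    # One reschedule attempt per router bound to a dead agent.
    for binding in down_bindings:
        try:
            reschedule_router(binding['router_id'])
        except RouterReschedulingFailed:
            # During the Rabbit failover every L3 agent looks dead, so the
            # attempt fails with "no eligible l3 agent found"; the failure
            # is only logged and nothing retries it once agents recover.
            warn('Failed to reschedule router %s' % binding['router_id'])

if __name__ == '__main__':
    def always_fail(router_id):
        raise RouterReschedulingFailed(router_id)

    bindings = [{'router_id': '5924f1d2-a47c-4085-af02-79fa381cfe5d'}]
    reschedule_routers_from_down_agents(bindings, always_fail, print)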

Update 2:
There seems to be the same or a similar bug reported upstream in Neutron: https://bugs.launchpad.net/neutron/+bug/1403921
However, it has been marked as Expired.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

It is expected that a router is not rescheduled during a RabbitMQ failover while the agents are 'dead'; it can only be rescheduled once the agents are back in a normal state. Marking as Invalid.
