Bug #1671504 “l3 agent downtime can cause tenant VM outages duri...” : Bugs : tripleo

Steven Hardy (shardy) on 2017-03-09

Changed in tripleo:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → pike-1

OpenStack Infra (hudson-openstack) on 2017-03-15

Changed in tripleo:
assignee:	nobody → Marios Andreou (marios-b)
status:	Triaged → In Progress

Marios Andreou (marios-b) wrote on 2017-03-21:

#1

WIP: https://review.openstack.org/445494, for possibly getting this into the newton to ocata upgrade workflow. Still being tested.

Marios Andreou (marios-b) wrote on 2017-03-24:

#2

FYI, this is also discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1419751. Copy/pasting from a comment I just left there describing the latest status:

Hi, update on progress (tl;dr: we know what breaks the pingtest, but we are still blocked on an ovs-related issue). I reached out to jlibosva from the network team for help and he responded immediately (my email is copy/pasted at [0] for context).

So Jakub quickly confirmed it was openvswitch that causes the neutron-openvswitch-agent to be started (even though we had stopped it). He found an issue in the neutron-openvswitch-agent service file and posted https://review.rdoproject.org/r/#/c/5951/ to fix it. The idea is that if we first upgrade to the openstack-neutron packages containing Jakub's fix, the subsequent openvswitch upgrade should no longer cause the neutron-openvswitch-agent to try to start prematurely (see [0] for more info on why this is a problem).

Unfortunately, upgrading this way did not work in my testing. I first upgraded the openstack-neutron packages to the ones with Jakub's fix (he made a repo with builds containing the fix, which I enabled as part of my upgrade-init.yaml environment file) and then upgraded openvswitch and everything else. As soon as openvswitch is upgraded to ovs 2.6, I lose connectivity to all nodes (all 3 controllers). I tried doing the openvswitch upgrade both via 'yum update' and by including https://review.openstack.org/#/c/434346/ (i.e. the 'special case upgrade with flags' discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1424945#c16 ) and had the same result both times.

I think we still need the fix at https://review.rdoproject.org/r/#/c/5951/ (adding to the trackers), but I'm not sure what is going on with openvswitch. We can pick this up next week, and/or perhaps someone has more ideas.

thanks, marios

[0] (via email marios->jlibosva):
"
https://review.openstack.org/#/c/445494/ is the review (the 'rolling one node at a time' mechanism is already in place; we are just using it and adding to the l3 agent service here). It does what it's meant to: that code is executed on one node at a time, so only one l3 agent is down at any time. Lose like 1/2 ping. Great.

However, the upgrade continues and at this point all services are down (cluster, neutron-* except l3, and everything else). Then _something_ starts the neutron-openvswitch-agent. I am fairly confident it is openvswitch itself (I am going from openvswitch-2.5.0-14 to 2.6, so there is an openvswitch restart?). Someone suggested it may even be python-openvswitch, but I'm not sure at this point. In other words, as these packages are updated as part of the workflow, the neutron-openvswitch-agent is started.

The problem is that neutron-openvswitch-agent cannot start at this point because rabbit is still down. And the fact that n-ovs-a starts (or tries to start) kills the ping, which stays down (even though the l3 agents are running) until puppet reconfigures and starts everything again.
"
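
As a rough illustration of the 'rolling one node at a time' idea described in [0], here is a minimal Python sketch. It is not the code from review 445494: the function names, the node names in the usage note, and the stop/upgrade/start callables are all hypothetical stand-ins for the real service actions. It only shows the property that matters here, namely that at most one l3 agent is down at any time.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    stop: Callable[[str], None],
    upgrade: Callable[[str], None],
    start: Callable[[str], None],
) -> List[str]:
    """Run stop/upgrade/start on each node in turn and return an event log.

    The callables are placeholders (assumptions) for whatever really stops,
    upgrades and restarts the agent on a node.
    """
    log: List[str] = []
    for node in nodes:
        # Finish all three steps on this node before touching the next one.
        for name, action in (("stop", stop), ("upgrade", upgrade), ("start", start)):
            action(node)
            log.append(f"{name}:{node}")
    return log


def max_agents_down(log: List[str]) -> int:
    """Return the largest number of agents that were stopped at the same time."""
    down = 0
    worst = 0
    for event in log:
        name, _node = event.split(":", 1)
        if name == "stop":
            down += 1
        elif name == "start":
            down -= 1
        worst = max(worst, down)
    return worst
```

Calling `rolling_restart(["controller-0", "controller-1", "controller-2"], ...)` with no-op callables and passing the log to `max_agents_down` should report that no more than one agent was ever down at once.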


Marios Andreou (marios-b) wrote on 2017-04-03:

#3

After a call with ajo today I think the premise behind this bug is wrong. Apparently we don't even need the l3 agents for the tenant VM IPs to stay reachable. Sure, if all l3 agents are down you won't be able to _create_ new IPs, for example, but the existing ones should be reachable OK.

If they aren't, then there is a bug, like the one discovered while testing https://review.openstack.org/#/c/445494/9: neutron-openvswitch-agent is started when openvswitch is updated during the upgrade's package update. Another is an issue with 'ryu', which is likely the one I hit on Friday (see review comments @ /#/c/445494/). Apparently there are newer neutron-* package builds from Friday afternoon that might solve some of these.

So, the plan today is to test without this change, using those latest packages, and see where the ping fails, so we can be clearer about any outstanding bugs.
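
To make "see where the ping fails" concrete, one option is to record a timestamped ping result throughout the upgrade and then look for runs of lost pings. The Python sketch below is only an assumption about how such results might be summarised; it is not the tripleo pingtest, and the `(timestamp, ok)` input format is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def outage_windows(results: Sequence[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Group consecutive lost pings into (first_lost, last_lost) windows.

    `results` is a time-ordered sequence of (timestamp, ok) pairs, where
    ok is False when the ping to the tenant VM got no reply.
    """
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last: Optional[float] = None
    for ts, ok in results:
        if not ok:
            if start is None:
                start = ts  # first lost ping of a new outage
            last = ts
        elif start is not None:
            windows.append((start, last))  # outage ended with this reply
            start = None
    if start is not None:
        windows.append((start, last))  # outage still ongoing at end of data
    return windows
```

A short window would correspond to the brief loss seen when a single l3 agent restarts, while a long window that only ends once puppet restarts the services would point at the neutron-openvswitch-agent problem described in comment #2.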

Emilien Macchi (emilienm) on 2017-04-11

Changed in tripleo:
milestone:	pike-1 → pike-2

OpenStack Infra (hudson-openstack) wrote on 2017-04-26: Change abandoned on tripleo-heat-templates (master)

#4

Change abandoned by Marios Andreou (<email address hidden>) on branch: master
Review: https://review.openstack.org/445494

Marios Andreou (marios-b) wrote on 2017-04-26:

#5

See https://bugzilla.redhat.com/show_bug.cgi?id=1419751#c11 for testing info and more context, but essentially we no longer need the agents to access the floating IPs (though we won't be able to manage them during the upgrade). Marking the bug as invalid.

Changed in tripleo:
status:	In Progress → Invalid

l3 agent downtime can cause tenant VM outages during upgrade
