Cannot ssh into an instance after reboot

Bug #1240849 reported by Attila Fazekas
This bug affects 17 people
Affects           Status        Importance  Assigned to            Milestone
neutron           Fix Released  High        Yves-Gwenael Bourhis
neutron (Havana)  Fix Released  High        Ihar Hrachyshka

Bug Description

I was able to ssh into the instance via its floating IP before the reboot, but it failed after a hard reboot.

According to this log it was still working at 2013-10-13_17_48_45_702, so something added after that date is probably related to the issue:
http://logs.openstack.org/37/50337/4/check/check-tempest-devstack-vm-neutron/9aeca12/logs/tempest.txt.gz#_2013-10-13_17_48_45_702

Revision history for this message
Attila Fazekas (afazekas) wrote :
description: updated
Revision history for this message
yong sheng gong (gongysh) wrote :

Hard reboot of the VM or of the compute host?

Revision history for this message
Attila Fazekas (afazekas) wrote :

I rebooted the guest VM, as is done in the attached script.

http://logs.openstack.org/37/50337/8/check/check-tempest-devstack-vm-full/3bb6902/console.html
Looks like the issue can also happen with nova-network, but it is a random event. (Only the command changed between the working and the non-working version: https://review.openstack.org/#/c/50337/ )

Revision history for this message
Attila Fazekas (afazekas) wrote :

I rebooted the guest VM, as is done in the attached script.

http://logs.openstack.org/37/50337/8/check/check-tempest-devstack-vm-full/3bb6902/console.html
Looks like the issue can also happen with nova-network, but it is a random event. (Only a comment changed between the two patch sets: https://review.openstack.org/#/c/50337/ ; one run works and the other does not.)

Revision history for this message
Sean Dague (sdague) wrote :

This race condition with reestablishing networks after a guest reboot is bad enough that we're going to have to skip the test in tempest.

This seems like an important race to get addressed.

Changed in nova:
status: New → Confirmed
Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Changed in nova:
importance: Undecided → High
Changed in neutron:
assignee: nobody → Yves-Gwenael Bourhis (yves-gwenael-bourhis)
Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

Config to reproduce:
=================
Neutron master branch with the ml2 plugin, the OVS agent, and interface_driver = neutron.agent.linux.interface.OVSInterfaceDriver.

Steps to reproduce (a minimal scripted sketch follows the list):
================
- Create a private virtual network and subnet.
- Boot a VM attached to this network.
- From the dhcp netns or the router netns, ping the VM's IP and confirm it responds.
- Perform a "soft" reboot of the VM.
- Confirm the pings no longer get a response.
- Perform a VM stop followed by a VM start.
- Confirm the pings come back.
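
For anyone who wants to script this, here is a minimal sketch of the reproduction loop. It is not from the bug report: it assumes a devstack-style host with the standard nova CLI, and the VM name, fixed IP and dhcp namespace below are placeholders that must be adjusted to your environment.

    # Minimal reproduction sketch: ping the VM from the dhcp namespace,
    # soft-reboot it with the nova CLI, then ping again.
    import subprocess
    import time

    VM_NAME = "MyVM"
    VM_IP = "10.0.0.5"                     # fixed IP of the test VM (assumption)
    DHCP_NETNS = "qdhcp-<network-id>"      # dhcp namespace of the private network (assumption)

    def ping_from_netns():
        """Return True if the VM answers pings from inside the dhcp namespace."""
        return subprocess.call(
            ["sudo", "ip", "netns", "exec", DHCP_NETNS,
             "ping", "-c", "3", "-W", "2", VM_IP]) == 0

    print("before reboot, ping ok:", ping_from_netns())
    subprocess.check_call(["nova", "reboot", VM_NAME])   # soft reboot
    time.sleep(60)                                       # give the guest time to come back up
    print("after soft reboot, ping ok:", ping_from_netns())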

Analysis:
=======

A simple "nova reboot MyVM" (soft reboot) reproduces the bug in 99% of cases (less often with a hard reboot).
"nova stop MyVM && nova start MyVM" does not reproduce the bug (and clears the issue after a reboot).

During a soft reboot, either nova or libvirt removes the VM's TAP device from the OVS bridge (br-int) and reinserts it, but too quickly for the neutron agent to notice, so the local VLAN tag is not re-associated with the VM's TAP. (This can be seen with "ovs-vsctl show": after a reboot the tag is gone.)
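
As a quick check of the "tag disappeared" symptom, one could query the tag of the VM's tap port directly instead of scanning "ovs-vsctl show" output. A small sketch (not part of neutron; the tap device name is hypothetical):

    # Check whether a given OVS port still carries a VLAN tag.
    # ovs-vsctl prints an unset tag as "[]".
    import subprocess

    def port_has_tag(port_name):
        out = subprocess.check_output(
            ["ovs-vsctl", "get", "Port", port_name, "tag"]).decode().strip()
        return out != "[]"

    print(port_has_tag("tapdeadbeef-12"))  # hypothetical tap device name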

Adding "minimize_polling = True" under the [agent] section of /etc/neutron/plugins/ml2/ml2_conf.ini works around the issue.

TODO:
=====

Neutron:
-------------
Fix the agent to put the tag back on the OVS bridge port.

Nova:
--------
Determine whether Nova or libvirt removes and reintroduces the tap interface.
If it's libvirt, we can't do much; if it's Nova telling libvirt to do so, why do we remove the tap for a reboot?

Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

ERRATA:
"minimize_polling = True" actually doesn't solve the issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/66375

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Attila Fazekas (afazekas) wrote :

I debugged the root cause yesterday and reached a similar conclusion.

I modified the tempest ssh_floating stress action a little so that it only reboots; patch attached.
You may also want to set log_config_append=etc/logging.conf.sample in etc/tempest.conf.

./tempest/stress/run_stress.py -t tempest/stress/etc/ssh_floating.json -n 128

tags: added: havana-backport-potential
Changed in neutron:
milestone: none → icehouse-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/66375
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=60cb0911712ad11688b4d09e5c01ac39c49f5aea
Submitter: Jenkins
Branch: master

commit 60cb0911712ad11688b4d09e5c01ac39c49f5aea
Author: Yves-Gwenael Bourhis <email address hidden>
Date: Mon Jan 13 18:27:27 2014 +0100

    Fixing lost vlan ids on interfaces

    Sometimes a vm gets its tap interface unset and reset too fast to be caught in
    an agent loop, and its vlan tag was not reset.

    We now detect if an interface loses its vlan tag, and if it happens the
    interface will be reconfigured.

    Since the TAG ID is only available via the "Port" table (in the 'tag' column),
    we couldn't reuse the get_vif_port_set() method's run_vsctl call which queries
    the "Interface" table, and needed a specific run_vsct call to the "Port" table
    in the new get_port_tag_dict() method.

    Change-Id: I7f59e2c1e757c28dae35c44ebfad9d764ae1d3c5
    Closes-Bug: 1240849
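
For readers following along, a rough standalone approximation of the get_port_tag_dict() idea described in this commit (a sketch under assumptions, not the actual neutron implementation) could look like the following; it queries the OVS "Port" table for name/tag pairs so that ports whose VLAN tag was lost can be detected:

    # Sketch of the "query the Port table for tags" approach (not neutron's code).
    # ovs-vsctl --format=json returns {"headings": [...], "data": [[name, tag], ...]};
    # an unset tag comes back as the OVSDB empty set ["set", []].
    import json
    import subprocess

    def get_port_tag_dict():
        out = subprocess.check_output(
            ["ovs-vsctl", "--format=json", "--", "--columns=name,tag", "list", "Port"])
        result = json.loads(out)
        port_tags = {}
        for name, tag in result["data"]:
            if isinstance(tag, list):   # ["set", []] means no tag is set
                tag = None
            port_tags[name] = tag
        return port_tags

    if __name__ == "__main__":
        for port, tag in get_port_tag_dict().items():
            print(port, tag)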

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-rc1 → 2014.1
Revision history for this message
Andrew Kinney (andykinney) wrote :

I see the tag "havana-backport-potential" was added a little over two months ago. How do we make the actual backport happen? We have a Havana installation in production that is pretty much hosed because of lost VLAN tags ( https://bugs.launchpad.net/neutron/+bug/1268955 ). If we could get *this* fix in Havana, it would likely fix our issue, too. Alternately, if there's a functional upgrade path between Havana and Icehouse, that might be acceptable.

Revision history for this message
George Shuklin (george-shuklin) wrote :

Andrew Kinney (andykinney), the problem is that this patch cannot be 'just applied' to Havana. We've hit that issue too, and I'll try to play with the code, but no promises (I'm not a developer, so attention from someone with OpenStack development experience is welcome).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/93343

Revision history for this message
George Shuklin (george-shuklin) wrote :

Note about the bugfix above: it can be applied to stable/havana but will not work with the Ubuntu Cloud Archive version.

The Ubuntu version is a little older than the stable/havana head and requires an additional patch on top of this one. Yes, I know, it sucks.

Ubuntu patch (apply after the 1st patch):

diff --git a/neutron/agent/linux/ovs_lib.py b/neutron/agent/linux/ovs_lib.py
index e592f67..0860219 100644
--- a/neutron/agent/linux/ovs_lib.py
+++ b/neutron/agent/linux/ovs_lib.py
@@ -374,7 +374,7 @@ class OVSBridge(BaseOVS):
         """
         port_names = self.get_port_name_list()
         args = ['--format=json', '--', '--columns=name,tag', 'list', 'Port']
-        result = self.run_vsctl(args, check_error=True)
+        result = self.run_vsctl(args)
         port_tag_dict = {}
         if not result:
             return port_tag_dict

andykinney, please check whether it works for you.

Revision history for this message
Andrew Kinney (andykinney) wrote :

george-shuklin, I'm running Havana from the Debian Wheezy apt packages. As soon as I get a chance this week, I'll try applying the patches and observing interface tagging.

Revision history for this message
Andrew Kinney (andykinney) wrote :

George,
I finally had an opportunity to examine this more closely. I'm not sure I understand. I thought I did at first, but, upon closer examination, I realized I hadn't fully digested your instructions.

Are you saying to download and install the new files referenced for patch set 20 at https://review.openstack.org/93343 and then apply your second patch on top of that?

Revision history for this message
Andrew Kinney (andykinney) wrote :

George,
When attempting to apply the second patch, it gets rejected:
Hunk #1 FAILED at 374.
1 out of 1 hunk FAILED -- saving rejects to file neutron-agent-linux-ovs_lib.py-patchedV2.rej

Revision history for this message
George Shuklin (george-shuklin) wrote :

andykinney, I'll look at the Debian source code on Monday to see how to apply the patch.

It works for us, but we maintain a separate repository with our own set of patches. Different distros are stuck at different commits of stable/havana, so the patch may need to be adjusted slightly.

Revision history for this message
Andrew Kinney (andykinney) wrote :

George,
I manually made the change referenced in your patch. When starting the openvswitch-agent, it just silently dies with no log output, which is the same behavior it had before manually applying your patch changes.

I'll see if I can stick an strace into the mix somewhere to figure out why it's dying.

Revision history for this message
Andrew Kinney (andykinney) wrote :

George,
No luck getting any output, even with strace. I'm not a programmer by trade (trained in networking and systems admin), so I get lost in the bushes with some of these more complex coding problems. Give me bash and perl scripts and I do fine, but python is a slightly different beast and openstack is particularly confounding to debug.

Revision history for this message
Andrew Kinney (andykinney) wrote :

George,
Any chance of finishing up the review at https://review.openstack.org/#/c/93343/ so it can be merged and released? It looks like it has stalled.

Upgrading to Icehouse won't be an option for this installation until the client puts up funds for a duplicate cluster, given the issues with upgrading from Havana to Icehouse in place. That might be quite a while.

Revision history for this message
George Shuklin (george-shuklin) wrote :

My backport was reviewed, checked and so on, but as far as I understand from the IRC chat, Havana will receive only one more update, in the autumn. Basically, OpenStack abandons the old stable release as soon as a new stable one appears (which means everyone is going to abandon Icehouse as soon as Juno is released).

Btw, someone pushed changes to stable/havana, so my change no longer merges cleanly. I'll rebase it and ask for the patch to be re-reviewed.

But basically, that's it. Bug found, fixed, but not ported. Even if you backport it yourself, nobody cares to merge it to stable/havana, and even then end users will not receive the fix until the distro maintainers pull a new 'fix release' into their own repositories.

Sad story...

Revision history for this message
Andrew Kinney (andykinney) wrote :

George,
Yes, that is sad indeed. The upgrade path between releases is not friendly or practical for a production environment. You just end up jumping from one fire to the next, with no point at which important bugs are fixed in an environment where new bugs are not introduced by feature creep. I don't know how anyone expects OpenStack to ever be usable in a commercial environment (versus just a hobbyist one) for as long as this paradigm for new releases is in place.

If the existing expectation that you upgrade between releases to get bug fixes continues, then the upgrade path has to become much better and more transparent.

We're already reconsidering our decision to make this the platform of choice after our experiences with our initial deployment. I don't know what the alternatives are, but this situation has started us looking for one.

Revision history for this message
George Shuklin (george-shuklin) wrote :

I've checked the changes in stable/havana; they are rather significant. I will try to merge them with the fix, but it will take some time.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/108453

Revision history for this message
George Shuklin (george-shuklin) wrote :

I've re-backported the old fix; it applied without any changes (compared to Icehouse).

Revision history for this message
Andrew Kinney (andykinney) wrote :

George,
Thank you for your efforts. If I had the skills, I would be right there with you helping get this backported.

Alan Pevec (apevec)
tags: removed: havana-backport-potential
Sean Dague (sdague)
no longer affects: nova
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/108453
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7a199566bcc6b1d159f01f35790e58852256cc77
Submitter: Jenkins
Branch: stable/havana

commit 7a199566bcc6b1d159f01f35790e58852256cc77
Author: Yves-Gwenael Bourhis <email address hidden>
Date: Mon Jan 13 18:27:27 2014 +0100

    Fixing lost vlan ids on interfaces

    Sometimes a vm gets its tap interface unset and reset too fast to be caught in
    an agent loop, and its vlan tag was not reset.

    We now detect if an interface loses its vlan tag, and if it happens the
    interface will be reconfigured.

    Since the TAG ID is only available via the "Port" table (in the 'tag' column),
    we couldn't reuse the get_vif_port_set() method's run_vsctl call which queries
    the "Interface" table, and needed a specific run_vsct call to the "Port" table
    in the new get_port_tag_dict() method.

    Conflicts:
     neutron/tests/unit/openvswitch/test_ovs_lib.py

    Change-Id: I7f59e2c1e757c28dae35c44ebfad9d764ae1d3c5
    Closes-Bug: 1240849
    (cherry picked from commit 60cb0911712ad11688b4d09e5c01ac39c49f5aea)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/havana)

Change abandoned by Alan Pevec (<email address hidden>) on branch: stable/havana
Review: https://review.openstack.org/93343
Reason: Final Havana release 2013.2.4 has been cut and stable/havana is going to be removed in a week.
