Tempest tests failing(SSH timeout to cirros VM) in packstack scenario001

Bug #1771500 reported by yatin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Packstack
New
Undecided
Unassigned

Bug Description

Some tempest tests are failing with following SSH timed out Error in packstack scenario 001 in RDO Phase1 Job.

Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/tempest/api/compute/servers/test_server_actions.py", line 147, in test_reboot_server_hard
        self._test_reboot_server('HARD')
      File "/usr/lib/python2.7/site-packages/tempest/api/compute/servers/test_server_actions.py", line 121, in _test_reboot_server
        boot_time = linux_client.get_boot_time()
      File "/usr/lib/python2.7/site-packages/tempest/common/utils/linux/remote_client.py", line 78, in get_boot_time
        boot_secs = self.exec_command(cmd)
      File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 60, in wrapper
        six.reraise(*original_exception)
      File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper
        return function(self, *args, **kwargs)
      File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 108, in exec_command
        return self.ssh_client.exec_command(cmd)
      File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 151, in exec_command
        ssh = self._get_ssh_connection()
      File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
        password=self.password)
    tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.4.9 via SSH timed out.
    User: cirros, Password: E8@kPD~#8i=G1K9

Logs:-
Master:-
https://centos.logs.rdoproject.org/weirdo-master-promote-packstack-scenario001/1242/weirdo-project/logs/latest/tempest.log.txt.gz
https://centos.logs.rdoproject.org/weirdo-master-promote-packstack-scenario001/1243/weirdo-project/logs/latest/tempest.log.txt.gz

Queens:-
https://centos.logs.rdoproject.org/weirdo-queens-promote-packstack-scenario001/48/weirdo-project/logs/latest/tempest.log.txt.gz
https://centos.logs.rdoproject.org/weirdo-queens-promote-packstack-scenario001/46/weirdo-project/logs/latest/tempest.log.txt.gz

In queens seen only 1 test failed and in master 3 tests are failing with same Traceback.

Expected result:-
Tests should pass

Actual Result:-
Tests are failing

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :
Download full text (6.1 KiB)

I have attached a tcpdump to br-ex on the host and then opened a netcat tcp connection to the FIP of the guest VM. When the connection got established, I sent "DDDDDDDDDDDDDDDDDDDD" from the VM. This is what is being received on br-ex in the host:
08:40:08.503759 fa:16:3e:1f:24:8a > 6e:6d:7f:9a:fa:4e, ethertype IPv4 (0x0800), length 87: (tos 0x0, ttl 63, id 11684, offset 0, flags [DF], proto TCP (6), length 73)
    172.24.4.3.distinct > 172.24.4.1.59512: Flags [P.], cksum 0x6070 (incorrect -> 0xcb31), seq 21:42, ack 1, win 3475, options [nop,nop,TS val 20043541 ecr 431651366], length 21
E..I-.@.?...........'..x....L"......`p.....
.1....z&DDDDDDDDDDDDDDDDDDDD

08:40:09.319720 fa:16:3e:1f:24:8a > 6e:6d:7f:9a:fa:4e, ethertype IPv4 (0x0800), length 87: (tos 0x0, ttl 63, id 11685, offset 0, flags [DF], proto TCP (6), length 73)
    172.24.4.3.distinct > 172.24.4.1.59512: Flags [P.], cksum 0x6070 (incorrect -> 0xca65), seq 21:42, ack 1, win 3475, options [nop,nop,TS val 20043745 ecr 431651366], length 21
E..I-.@.?...........'..x....L"......`p.....
.1....z&DDDDDDDDDDDDDDDDDDDD

08:40:10.955818 fa:16:3e:1f:24:8a > 6e:6d:7f:9a:fa:4e, ethertype IPv4 (0x0800), length 87: (tos 0x0, ttl 63, id 11686, offset 0, flags [DF], proto TCP (6), length 73)
    172.24.4.3.distinct > 172.24.4.1.59512: Flags [P.], cksum 0x6070 (incorrect -> 0xc8cc), seq 21:42, ack 1, win 3475, options [nop,nop,TS val 20044154 ecr 431651366], length 21
E..I-.@.?...........'..x....L"......`p.....
.1.z..z&DDDDDDDDDDDDDDDDDDDD

08:40:14.227779 fa:16:3e:1f:24:8a > 6e:6d:7f:9a:fa:4e, ethertype IPv4 (0x0800), length 87: (tos 0x0, ttl 63, id 11687, offset 0, flags [DF], proto TCP (6), length 73)
    172.24.4.3.distinct > 172.24.4.1.59512: Flags [P.], cksum 0x6070 (incorrect -> 0xc59a), seq 21:42, ack 1, win 3475, options [nop,nop,TS val 20044972 ecr 431651366], length 21
E..I-.@.?...........'..x....L"......`p.....
.1....z&DDDDDDDDDDDDDDDDDDDD

08:40:20.771667 fa:16:3e:1f:24:8a > 6e:6d:7f:9a:fa:4e, ethertype IPv4 (0x0800), length 87: (tos 0x0, ttl 63, id 11688, offset 0, flags [DF], proto TCP (6), length 73)
    172.24.4.3.distinct > 172.24.4.1.59512: Flags [P.], cksum 0x6070 (incorrect -> 0xbf36), seq 21:42, ack 1, win 3475, options [nop,nop,TS val 20046608 ecr 431651366], length 21
E..I-.@.?...........'..x....L"......`p.....
.1....z&DDDDDDDDDDDDDDDDDDDD

08:40:33.860176 fa:16:3e:1f:24:8a > 6e:6d:7f:9a:fa:4e, ethertype IPv4 (0x0800), length 87: (tos 0x0, ttl 63, id 11689, offset 0, flags [DF], proto TCP (6), length 73)
    172.24.4.3.distinct > 172.24.4.1.59512: Flags [P.], cksum 0xb26e (correct), seq 21:42, ack 1, win 3475, options [nop,nop,TS val 20049880 ecr 431651366], length 21
E..I-.@.?...........'..x....L".......n.....
.1....z&DDDDDDDDDDDDDDDDDDDD

08:40:33.860218 6e:6d:7f:9a:fa:4e > fa:16:3e:1f:24:8a, ethertype IPv4 (0x0800), length 66: (tos 0x10, ttl 64, id 1170, offset 0, flags [DF], proto TCP (6), length 52)
    172.24.4.1.59512 > 172.24.4.3.distinct: Flags [.], cksum 0xf8a2 (correct), seq 1, ack 42, win 229, options [nop,nop,TS val 431682898 ecr 20049880], length 0
E..4..@.@............x'.L".................
...R.1..

Only at this point (when checksum is correct, I got the DDDDD...

Read more...

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

Numan found out that adding an OVS port to br-ex and trying to connect to the instance from there he didn't find the issue.
I repeated the tcpdump test doing the same thing, ie. created a network namespace and added an internal port to br-ex. This is the result.

tcpdump on ovs port:

08:59:58.310245 fa:16:3e:1f:24:8a > 6e:6d:7f:9a:fa:37, ethertype IPv4 (0x0800), length 87: (tos 0x0, ttl 63, id 53038, offset 0, flags [DF], proto TCP (6), length 73)
    172.24.4.3.distinct > 172.24.4.99.52106: Flags [P.], cksum 0x4938 (correct), seq 21:42, ack 1, win 3475, options [nop,nop,TS val 20340995 ecr 432827922], length 21
E..I..@.?..........c'....6=.S.......I8.....
.6a...n.BBBBBBBBBBBBBBBBBBBB

08:59:58.310273 6e:6d:7f:9a:fa:37 > fa:16:3e:1f:24:8a, ethertype IPv4 (0x0800), length 66: (tos 0x10, ttl 64, id 45701, offset 0, flags [DF], proto TCP (6), length 52)
    172.24.4.99.52106 > 172.24.4.3.distinct: Flags [.], cksum 0x60bd (incorrect -> 0xaaa2), seq 1, ack 42, win 229, options [nop,nop,TS val 432847348 ecr 20340995], length 0
E..4..@.@.'....c......'.S....6> ....`......
.....6a.

tap device:

08:59:58.309795 fa:16:3e:e9:09:9c > fa:16:3e:32:f1:55, ethertype IPv4 (0x0800), length 87: (tos 0x0, ttl 64, id 53038, offset 0, flags [DF], proto TCP (6), length 73)
    10.0.0.11.distinct > 172.24.4.99.52106: Flags [P.], cksum 0xbac1 (incorrect -> 0xef48), seq 43:64, ack 1, win 3475, options [nop,nop,TS val 20340995 ecr 432827922], length 21
E..I..@.@...
......c'....6=.S..............
.6a...n.BBBBBBBBBBBBBBBBBBBB

08:59:58.310615 fa:16:3e:32:f1:55 > fa:16:3e:e9:09:9c, ethertype IPv4 (0x0800), length 66: (tos 0x10, ttl 63, id 45701, offset 0, flags [DF], proto TCP (6), length 52)
    172.24.4.99.52106 > 10.0.0.11.distinct: Flags [.], cksum 0x50b3 (correct), seq 1, ack 64, win 229, options [nop,nop,TS val 432847348 ecr 20340995], length 0
E..4..@.?......c
.....'.S....6> ....P......
.....6a.

The checksum is received correctly in the namespace so the ACK is sent right away. The communication works as expected and I think that we can confirm that it's a checksum issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to packstack (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/569012

Revision history for this message
yatin (yatinkarel) wrote :
Download full text (3.9 KiB)

Some additional info, which may help in drill down the issue further.

- The issue happens with new centos 7.5 kernel. With centos 7.4 kernel issue is not seen.
- Ping succeeds, SSH Fails.

With the environment where issue reproduced:-
After flushing the iptables rules(iptables -F), ssh to vm succeeds.

Before flushing the rules output of iptables -L: https://review.rdoproject.org/paste/show/114/, also pasting here:-
[root@packstackovs-2 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT tcp -- packstackovs-2.rdocloud anywhere multiport dports amqps,amqp /* 001 amqp incoming amqp_192.168.100.21 */
ACCEPT tcp -- packstackovs-2.rdocloud anywhere multiport dports iscsi-target /* 001 cinder incoming cinder_192.168.100.21 */
ACCEPT tcp -- anywhere anywhere multiport dports 8776 /* 001 cinder-api incoming cinder_api */
ACCEPT tcp -- anywhere anywhere multiport dports armtechdaemon /* 001 glance incoming glance_api */
ACCEPT tcp -- anywhere anywhere multiport dports https /* 001 horizon 443 incoming */
ACCEPT tcp -- anywhere anywhere multiport dports commplex-main,openstack-id /* 001 keystone incoming keystone */
ACCEPT tcp -- anywhere anywhere multiport dports msgclnt /* 001 manila-api incoming manila_api */
ACCEPT tcp -- packstackovs-2.rdocloud anywhere multiport dports mysql /* 001 mariadb incoming mariadb_192.168.100.21 */
ACCEPT tcp -- anywhere anywhere multiport dports 9696 /* 001 neutron server incoming neutron_server_192.168.100.21 */
ACCEPT udp -- packstackovs-2.rdocloud anywhere multiport dports 6081 /* 001 neutron tunnel port incoming neutron_tunnel_192.168.100.21_192.168.100.21 */
ACCEPT tcp -- anywhere anywhere multiport dports 8773,8774,8775,8778 /* 001 nova api incoming nova_api */
ACCEPT tcp -- packstackovs-2.rdocloud anywhere multiport dports rfb:cvsup /* 001 nova compute incoming nova_compute */
ACCEPT tcp -- packstackovs-2.rdocloud anywhere multiport dports 16509,49152:49215 /* 001 nova qemu migration incoming nova_qemu_migration_192.168.100.21_192.168.100.21 */
ACCEPT tcp -- anywhere anywhere multiport dports 6080 /* 001 novncproxy incoming */
ACCEPT tcp -- anywhere anywhere multiport dports 6641 /* 001 ovn northd incoming ovn_northd_192.168.100.21 */
ACCEPT tcp -- anywhere anywhere multiport dports 6642 /* 001 ovn southd incoming ovn_southd_192.168.100.21 */
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT icmp -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh
REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

Chain FORWARD (poli...

Read more...

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

Correct, if we delete the following rule from INPUT chain, then everything works.

REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

The checksum thing was misleading as we disabled the hw offloading and then the checksum was always correct.

We have tried with OVS datapath shipped with CentOS 7.4 kernel and it fails. However, upgrading to kernel-alt works even with that iptables rule:

[root@test4 ~]# modinfo openvswitch
filename: /lib/modules/4.14.0-67.el7a.x86_64/kernel/net/openvswitch/openvswitch.ko
alias: net-pf-16-proto-16-family-ovs_packet
alias: net-pf-16-proto-16-family-ovs_flow
alias: net-pf-16-proto-16-family-ovs_vport
alias: net-pf-16-proto-16-family-ovs_datapath
license: GPL
description: Open vSwitch switching datapath
rhelversion: 7.6
srcversion: 9E8F3F0B3A07A375083C2AE
depends: nf_conntrack,nf_nat,libcrc32c,nf_nat_ipv6,nf_nat_ipv4,nf_defrag_ipv6
intree: Y
name: openvswitch
vermagic: 4.14.0-67.el7a.x86_64 SMP mod_unload modversions

With this module loaded we don't have any issues. Now need to find why.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to packstack (master)

Reviewed: https://review.openstack.org/569012
Committed: https://git.openstack.org/cgit/openstack/packstack/commit/?id=e9e6ad87a0237908236c96b49932afa2e96a52c5
Submitter: Zuul
Branch: master

commit e9e6ad87a0237908236c96b49932afa2e96a52c5
Author: yatin <email address hidden>
Date: Thu May 17 10:48:51 2018 +0530

    Temporary switch to ovs in scenario 001, fix gate

    Until https://bugs.launchpad.net/packstack/+bug/1771500
    is fixed, temporary switch to ovs in scenario 001.

    Also, set inject_partition to the default value of -2 in nova.conf.
    We are having problems in the gate in OVH nodes (see [1]), and
    doing this prevents us from using the problematic libguestfs code.

    [1] - https://bugs.launchpad.net/packstack/+bug/1701032

    Change-Id: Ifed85df073104684cd415792785acd12e2f89aa4
    Related-Bug: #1771500
    Co-Authored-By: Javier Peña <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to packstack (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/569463

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to packstack (stable/queens)

Reviewed: https://review.openstack.org/569463
Committed: https://git.openstack.org/cgit/openstack/packstack/commit/?id=1b97ae48de202b49079c579218bd4288ae649161
Submitter: Zuul
Branch: stable/queens

commit 1b97ae48de202b49079c579218bd4288ae649161
Author: yatin <email address hidden>
Date: Thu May 17 10:48:51 2018 +0530

    Temporary switch to ovs in scenario 001, fix gate

    Until https://bugs.launchpad.net/packstack/+bug/1771500
    is fixed, temporary switch to ovs in scenario 001.

    Also, set inject_partition to the default value of -2 in nova.conf.
    We are having problems in the gate in OVH nodes (see [1]), and
    doing this prevents us from using the problematic libguestfs code.

    [1] - https://bugs.launchpad.net/packstack/+bug/1701032

    Change-Id: Ifed85df073104684cd415792785acd12e2f89aa4
    Related-Bug: #1771500
    Co-Authored-By: Javier Peña <email address hidden>
    (cherry picked from commit e9e6ad87a0237908236c96b49932afa2e96a52c5)

tags: added: in-stable-queens
Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

Ok, we have the RCA for this. Great job by @tredaelli who joined me in debugging the issue and dug into the kernel changes spotting the possible culprit. He reverted this patch [0] on latest kernel for 7.5 and the problem goes away. The BZ got reassigned and we need to decide whether we want to revert this patch (it's not present in kernel-alt) or fix it on the OVS datapath as the reason is that we're clearing conntrack for the SNAT action [1] and then the packets hit a default REJECT iptables rule installed by iptables.service.

For non 7.5 kernels, ct_clear is a noop but for 7.5, we mark the packets as untracked in OVN and therefore we hit the default REJECT iptables rule.

[0] https://patchwork.ozlabs.org/patch/824054/
[1] http://www.openvswitch.org/support/dist-docs/ovn-northd.8.html (Egress Table 2: Egress Loopback)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to packstack (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/582286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to packstack (stable/pike)

Reviewed: https://review.openstack.org/582286
Committed: https://git.openstack.org/cgit/openstack/packstack/commit/?id=1574a38aa97e9dac9bc2848faccef13fe10a285e
Submitter: Zuul
Branch: stable/pike

commit 1574a38aa97e9dac9bc2848faccef13fe10a285e
Author: yatin <email address hidden>
Date: Thu May 17 10:48:51 2018 +0530

    Temporary switch to ovs in scenario 001, fix gate

    Until https://bugs.launchpad.net/packstack/+bug/1771500
    is fixed, temporary switch to ovs in scenario 001.

    Also, set inject_partition to the default value of -2 in nova.conf.
    We are having problems in the gate in OVH nodes (see [1]), and
    doing this prevents us from using the problematic libguestfs code.

    [1] - https://bugs.launchpad.net/packstack/+bug/1701032

    Change-Id: Ifed85df073104684cd415792785acd12e2f89aa4
    Related-Bug: #1771500
    Co-Authored-By: Javier Peña <email address hidden>
    (cherry picked from commit e9e6ad87a0237908236c96b49932afa2e96a52c5)

tags: added: in-stable-pike
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.