OVB CI - Virtual baremetals boot fail to bring up network - Ordering cycle found, skipping Network Manager

Bug #1930849 reported by Harald Jensås
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned

Bug Description

OVB Virtual Baremetal nodes are failing erratically with systemd ordering cycle issues:

[ 6.814178] systemd[1]: network.target: Found ordering cycle on NetworkManager.service/start
[ 6.817148] systemd[1]: network.target: Found dependency on cloud-init-local.service/start
[ 6.820105] systemd[1]: network.target: Found dependency on dbus.socket/start
[ 6.822403] systemd[1]: network.target: Found dependency on sysinit.target/start
[ SKIP ] Ordering cycle found, skipping Network Manager
[ SKIP ] Ordering cycle found, skipping Network (Pre)
[ SKIP ] Ordering cycle found, skipping Init…nit job (metadata service crawler)
[ SKIP ] Ordering cycle found, skipping Initial cloud-init job (pre-networking)
[ SKIP ] Ordering cycle found, skipping D-Bus System Message Bus Socket
[ SKIP ] Ordering cycle found, skipping SSSD…ros Cache Manager responder socket
[ SKIP ] Ordering cycle found, skipping Open vSwitch
[ SKIP ] Ordering cycle found, skipping Read…inname from /etc/sysconfig/network

https://logserver.rdoproject.org/87/790287/6/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/8521185/logs/baremetal_6_62657_2-console.log
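
For triaging other console logs from the same job, the skipped units can be pulled out with a small filter. This is only a sketch: the function name skipped_units is made up for illustration, and it assumes the log contains the plain "Ordering cycle found, skipping" message shown above.

```shell
# Hypothetical helper, not part of any tool mentioned in this bug.
# Reads a console log on stdin and prints the description of every unit
# that systemd skipped because of an ordering cycle, one per line.
skipped_units() {
    # Drop carriage returns first, since serial console logs often contain them.
    tr -d '\r' | sed -n 's/.*Ordering cycle found, skipping //p'
}
```

It could be fed a downloaded console log, for example `skipped_units < baremetal-console.log` (the file name here is a placeholder).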

summary: - Ordering cycle found, skipping Network Manager
+ OVB CI - Virtual baremetals boot fail to bring up network - Ordering
+ cycle found, skipping Network Manager
wes hayutin (weshayutin)
Changed in tripleo:
importance: High → Critical
tags: added: promotion-blocker
Michele Baldessari (michele) wrote :

I suspect I am also hitting this issue. In my case, provisioning cannot ssh to one node (out of 9) because the heat-admin user does not exist on that node.

The user is missing because cloud-init.service did not even start.

Broken node:
[root@ctrl-2-0 mnt]# journalctl |grep -i ordering
Jun 07 08:02:09 localhost.localdomain systemd[1]: network-online.target: Found ordering cycle on network.target/start

Working node:
Jun 07 08:02:03 localhost.localdomain systemd[1]: sysinit.target: Found ordering cycle on nis-domainname.service/start
Jun 07 08:02:03 localhost.localdomain systemd[1]: sysinit.target: Found ordering cycle on nis-domainname.service/start
Jun 07 08:02:03 localhost.localdomain systemd[1]: sysinit.target: Job nis-domainname.service/start deleted to break ordering cycle starting with sysinit.target/start

Michele Baldessari (michele) wrote :

Working:
[root@ctrl-1-0 ~]# systemd-analyze verify multi-user.target
sysinit.target: Found ordering cycle on nis-domainname.service/start
sysinit.target: Found dependency on network-online.target/start
sysinit.target: Found dependency on network.target/start
sysinit.target: Found dependency on openvswitch.service/start
sysinit.target: Found dependency on ovs-vswitchd.service/start
sysinit.target: Found dependency on sysinit.target/start
sysinit.target: Job nis-domainname.service/start deleted to break ordering cycle starting with sysinit.target/start

Broken:
[root@ctrl-2-0 mnt]# systemd-analyze verify multi-user.target
network-online.target: Found ordering cycle on network.target/start
network-online.target: Found dependency on openvswitch.service/start
network-online.target: Found dependency on ovs-vswitchd.service/start
network-online.target: Found dependency on ovs-delete-transient-ports.service/start
network-online.target: Found dependency on sysinit.target/start
network-online.target: Found dependency on nis-domainname.service/start
network-online.target: Found dependency on network-online.target/start
network-online.target: Job network.target/start deleted to break ordering cycle starting with network-online.target/start
basic.target: Found ordering cycle on sockets.target/start
basic.target: Found dependency on sssd-kcm.socket/start
basic.target: Found dependency on sysinit.target/start
basic.target: Found dependency on nis-domainname.service/start
basic.target: Found dependency on network-online.target/start
basic.target: Found dependency on NetworkManager-wait-online.service/start
basic.target: Found dependency on NetworkManager.service/start
basic.target: Found dependency on basic.target/start
basic.target: Job sockets.target/start deleted to break ordering cycle starting with basic.target/start
NetworkManager-wait-online.service: Found ordering cycle on sysinit.target/start
NetworkManager-wait-online.service: Found dependency on nis-domainname.service/start
NetworkManager-wait-online.service: Found dependency on network-online.target/start
NetworkManager-wait-online.service: Found dependency on NetworkManager-wait-online.service/start
NetworkManager-wait-online.service: Job nis-domainname.service/start deleted to break ordering cycle starting with NetworkManager-wait-online.service/start
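
The difference between the two nodes is which job systemd deleted to break the cycle: on the working node only nis-domainname.service was dropped, while on the broken node network.target and sockets.target were dropped as well. A minimal sketch for listing just those decisions follows; the function name deleted_jobs is hypothetical, and it assumes the "Job ... deleted to break ordering cycle" wording shown in the output above.

```shell
# Hypothetical helper: reads "systemd-analyze verify" or journal output on
# stdin and prints, for each ordering cycle, the job that systemd deleted
# and the unit the cycle was reported against.
deleted_jobs() {
    sed -n -E 's#.*Job ([^ ]+)/start deleted to break ordering cycle starting with ([^ ]+)/start.*#\1 (cycle from \2)#p'
}
```

For example, `systemd-analyze verify multi-user.target 2>&1 | deleted_jobs` or `journalctl -b | deleted_jobs`. The `2>&1` is there on the assumption that the diagnostics may be written to stderr.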

[root@ctrl-2-0 mnt]# rpm -q --changelog hostname |head -n2
* Thu May 06 2021 Pavel Zhukov <email address hidden> - 3.20-7
- Nisdomainname service depends on network

I'll try downgrading hostname on my envs and see if that improves things. The bug against hostname is already filed at https://bugzilla.redhat.com/show_bug.cgi?id=1959720 and is potentially the culprit here.

Michele Baldessari (michele) wrote :

There is a new build on the Stream koji, hostname-3.20-7.el8.0.1 (https://koji.mbox.centos.org/koji/buildinfo?buildID=17875), which seems to fix it for me. No idea when or if it will reach the Stream repos.

Sandeep Yadav (sandeepyadav93) wrote :

@yatin and I were debugging this. We got a local reproducer, and from testing it looks like the new hostname-3.20-7.el8.x86_64 is our culprit.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1959720

From train, the RPM difference between working (24th) and affected (27th):

http://pastebin.test.redhat.com/969401

hostname was one of the updated packages:
~~~
hostname-3.20-6.el8.x86_64 | hostname-3.20-7.el8.x86_64
~~~
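
A comparison like the one above can be reproduced from two package lists. The sketch below assumes each list was saved with something like `rpm -qa` on the working and affected images; the function name changed_packages and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: show packages that differ between two package lists.
# $1 = package list from the working run, $2 = list from the affected run.
# Lines only in the working list are printed unindented; lines only in the
# affected list are printed with a leading tab (standard "comm -3" layout).
changed_packages() {
    old_sorted=$(mktemp) || return 1
    new_sorted=$(mktemp) || return 1
    # comm requires sorted input, so sort both lists into temporary files.
    sort "$1" > "$old_sorted"
    sort "$2" > "$new_sorted"
    comm -3 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}
```

For example: `changed_packages working-rpms.txt affected-rpms.txt`.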

Fix is available: https://git.centos.org/rpms/hostname/c/e097d2aac3e76eebbaac3ee4c2b95f575f3798fa?branch=c8s

From local testing, it looks like downgrading hostname solves the issue.

Trying with https://review.opendev.org/c/openstack/tripleo-quickstart/+/794636 to confirm in CI. Awaiting results

The upstream C8 Stream mirror has already moved back from hostname-3.20-7.el8.x86_64 to hostname-3.20-6.el8.x86_64; waiting on mirror sync.
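
Until the mirrors sync, a node or image can be checked for the affected build. This is a rough sketch based only on the version strings mentioned in this bug (3.20-7.el8 affected, 3.20-6.el8 and 3.20-7.el8.0.1 not); the function name is_affected_hostname is hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) if the given package string is the
# hostname build identified as the culprit in this bug, fails otherwise.
# The version strings are taken from the comments in this report.
is_affected_hostname() {
    case "$1" in
        hostname-3.20-7.el8|hostname-3.20-7.el8.x86_64) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: `is_affected_hostname "$(rpm -q hostname)" && echo "affected hostname build installed"`.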

wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by "Sandeep Yadav <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart/+/794636
Reason: This is not needed anymore. Infra rolled out the previous version of the hostname package and removed the new, affected version.
