c9 scen001 and scen004 standalone test are failing - "Non-zero exit code 1 from systemctl start ceph" - selinux enforcing

Bug #1998954 reported by Ronelle Landy
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

scenario001 and scenario004 standalone tests are failing on Zed and Wallaby c9 with the following error:

2022-12-06 11:35:33.518833 | primary | TASK [tripleo.operator.tripleo_ceph_deploy : Run Ceph Deploy] ******************
2022-12-06 11:35:33.518865 | primary | Tuesday 06 December 2022 11:35:33 -0500 (0:00:01.933) 0:09:24.968 ******

"Extracting ceph user uid/gid from container image...\", \"Creating initial keys...\", \"Creating initial monmap...\", \"Creating mon...\", \"Non-zero exit code 1 from systemctl start <email address hidden>\", \"systemctl: stderr Job for <email address hidden> failed because the control process exited with error code.

Related logs:

https://logserver.rdoproject.org/openstack-periodic-integration-zed-centos9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario004-standalone-zed/199e294/logs/_ceph-faa3ddab-0222-5036-88ae-af10554b63d4@mon_standalone_localdomain_service_failed_because_the_cont.log

https://logserver.rdoproject.org/openstack-periodic-integration-zed-centos9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario004-standalone-zed/199e294/job-output.txt

https://logserver.rdoproject.org/b6/b63515f8d458cd4599a36d0f60a050c09b8fbfce/openstack-periodic-integration-stable1/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/b62e851/job-output.txt

https://logserver.rdoproject.org/b6/b63515f8d458cd4599a36d0f60a050c09b8fbfce/openstack-periodic-integration-stable1/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/b62e851/logs/_ceph-b03d550a-899e-57f6-ae90-b3f8ed340e63@mon_standalone_localdomain_service_failed_because_the_cont.log

Failure started 12/06:

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-scenario001-standalone-zed&skip=0

Earlier run that day was green:
periodic-tripleo-ci-centos-9-scenario001-standalone-zed openstack/tripleo-ci master openstack-periodic-integration-zed-centos9 1 hr 24 mins 29 secs 2022-12-06 03:54:13 SUCCESS

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → antelope-1
importance: Undecided → Critical
status: New → Triaged
tags: added: promotion-blocker
Francesco Pantano (fmount) wrote :

Looking at the log [1] I see that the mon process starts, but at some point it fails because of:

"""
Dec 06 11:37:59 standalone.localdomain bash[68635]: Error: open /etc/containers/networks/netavark.lock: permission denied
"""

Still investigating the issue, but it's not coming from the Ceph version (which has not changed).

[1] https://logserver.rdoproject.org/b6/b63515f8d458cd4599a36d0f60a050c09b8fbfce/openstack-periodic-integration-stable1/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/b62e851/logs/undercloud/var/log/extra/journal.txt.gz

Ronelle Landy (rlandy) wrote :
Francesco Pantano (fmount) wrote :

The root cause of the issue is SELinux, which denies access to /etc/containers/networks/netavark.lock [1].

This means that w/ "setenforce 0" we're able to get the Ceph cluster deployed.

Investigating to see what changed on that front.

[1] https://logserver.rdoproject.org/b6/b63515f8d458cd4599a36d0f60a050c09b8fbfce/openstack-periodic-integration-stable1/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/b62e851/logs/undercloud/var/log/extra/selinux_denials.txt.gz
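
To check a denial like this one locally, something along these lines could filter the collected SELinux denials for the lock file. This is a minimal sketch, not the tooling used in the CI job: `find_denials` and its two arguments are hypothetical names, and it assumes the log has already been downloaded and decompressed and that each denial record contains the word `denied` together with the file name.

```shell
# Sketch only: list SELinux denial records that mention a given file name.
# find_denials and its arguments are hypothetical, not part of tripleo-ci.
find_denials() {
    local log_file="$1" needle="$2"
    # -F treats both patterns as fixed strings, so dots in file names are literal.
    grep -F 'denied' "$log_file" | grep -F -- "$needle"
}
```

For example, `find_denials selinux_denials.txt netavark.lock` would print only the records that mention the lock file; if nothing matches, the output is empty and the exit status is non-zero.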

Ronelle Landy (rlandy) wrote :

So two things here:

1 - Jobs are running enforcing on centos 9. iirc, we don't usually do that.
2 - Probably related to https://github.com/containers/container-selinux/issues/198 and https://bugzilla.redhat.com/show_bug.cgi?id=2150283 as owalsh points out

Ronelle Landy (rlandy) wrote :
summary: - scenario001 and scenario004 standalone test are failng - "Non-zero exit
- code 1 from systemctl start ceph"
+ c9 scen001 and scen004 standalone test are failing - "Non-zero exit
+ code 1 from systemctl start ceph" - selinux enforcing
Ronelle Landy (rlandy) wrote :

The c9 node seems to start in enforcing mode:

[zuul@node-0003339614 ~]$ getenforce
Enforcing

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/866806

Ronelle Landy (rlandy) wrote :
Cédric Jeanneret (cjeanner) wrote :

Wondering why it's trying to access the netavark lock in the first place. Aren't the containers running with "net=host" or something like that? Or is the service running on the host now?

I'd be against switching the CI to permissive. Running in enforcing mode lets us actually find issues before they bite us in enforcing environments.

Marios Andreou (marios-b) wrote :

@Cedric fine goal to have enforcing but not OK to block the production chain while we do it.

So I think we should go with https://review.opendev.org/c/openstack/tripleo-ci/+/866806 if there are no better suggestions in order to unblock us.

Then if we want to use enforcing then we need to make a plan to work towards it.

Marios Andreou (marios-b) wrote :

noting the test result for /tripleo-ci/+/866806 at [1]

we are sending the workaround to gates

[1] https://review.rdoproject.org/r/c/testproject/+/36254/204#message-1cc1e9ebcb562014a45cd2c9456c1e5e4bcdb0fa

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ci (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ci/+/866806
Committed: https://opendev.org/openstack/tripleo-ci/commit/e5ea2013f7349b212507ea454b0969b49f9cfc1a
Submitter: "Zuul (22348)"
Branch: master

commit e5ea2013f7349b212507ea454b0969b49f9cfc1a
Author: Ronelle Landy <email address hidden>
Date: Tue Dec 6 16:07:28 2022 -0500

    Ensure CentOS nodes are selinux Permissive

    CentOS 9 nodes are starting with selinux in
    Enforcing mode. This is not the expected
    configuration to run RDO/CentOS tests.

    This patch sets selinux to permissive in pre
    when running on CentOS.

    Change-Id: I14dff0c0bb2d793ef4cd52ddcc2ff5ca4f870b97
    Related-Bug: #1998954
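
The behaviour described in the commit message can be sketched as a small shell function. This is an illustration only, not the actual change merged in 866806: `ensure_permissive` and the `OS_RELEASE_FILE`, `GETENFORCE_CMD` and `SETENFORCE_CMD` override variables are hypothetical names added so the logic can be dry-run, and the sketch assumes CentOS identifies itself with `ID="centos"` in /etc/os-release.

```shell
# Sketch only: switch a CentOS node to SELinux permissive mode before a job runs.
# The *_CMD and OS_RELEASE_FILE variables are hypothetical hooks for dry runs.
ensure_permissive() {
    local os_release="${OS_RELEASE_FILE:-/etc/os-release}"
    # Only act on CentOS, mirroring the scope stated in the commit message.
    grep -Eq '^ID="?centos"?$' "$os_release" || return 0
    # Nothing to do unless the node is currently enforcing.
    if [ "$(${GETENFORCE_CMD:-getenforce})" = "Enforcing" ]; then
        ${SETENFORCE_CMD:-sudo setenforce} 0
    fi
}
```

Note that `setenforce 0` only changes the running mode; it does not persist across a reboot, which is sufficient for a short-lived CI node.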

Alan Pevec (apevec) wrote :

It is unclear where or what was setting selinux to permissive before: CS9 qcows have always had selinux enforcing, and the nodepool image history is gone now, since only the last two images are kept in Glance.
In any case, explicit env setup in prep is the best approach.

Changed in tripleo:
status: Triaged → In Progress
status: In Progress → Fix Committed
Alan Pevec (apevec)
Changed in tripleo:
status: Fix Committed → Fix Released