Etcd units go into error state and require full restart due to inability to hit freezer CG

Bug #1933128 reported by Boris Lukashev
This bug affects 15 people
Affects      Status   Importance   Assigned to   Milestone
Etcd Charm   New      Undecided    Unassigned
lxd          New      Undecided    Unassigned
snapd        New      Undecided    Unassigned

Bug Description

Doing burn-in testing of Wallaby deployed via Juju with Vault and etcd, I'm seeing all three etcd units (all atop Focal controller VMs) fail every day or two with the following recorded in their unit logs:
```
2021-06-21 16:40:36 DEBUG juju.worker.uniter.runner runner.go:725 starting jujuc server {unix @/var/lib/juju/agents/unit-etcd-1/agent.socket <nil>}
2021-06-21 16:40:38 DEBUG unit.etcd/1.update-status logger.go:60 lxc
2021-06-21 16:40:39 WARNING unit.etcd/1.update-status logger.go:60 cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory
2021-06-21 16:40:39 WARNING unit.etcd/1.update-status logger.go:60 cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory
2021-06-21 16:40:39 WARNING unit.etcd/1.update-status logger.go:60 cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 Traceback (most recent call last):
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "/var/lib/juju/agents/unit-etcd-1/charm/hooks/update-status", line 22, in <module>
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 main()
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 bus.dispatch(restricted=restricted_mode)
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 _invoke(other_handlers)
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 handler.invoke()
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 self._action(*args)
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "/var/lib/juju/agents/unit-etcd-1/charm/reactive/etcd.py", line 279, in send_cluster_connection_details
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 db.set_connection_string(connection_string, version=etcdctl.version())
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "lib/etcdctl.py", line 193, in version
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 out = check_output(
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 File "/usr/lib/python3.8/subprocess.py", line 512, in run
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 raise CalledProcessError(retcode, process.args,
2021-06-21 16:40:40 WARNING unit.etcd/1.update-status logger.go:60 subprocess.CalledProcessError: Command '['/snap/bin/etcd.etcdctl', 'version']' returned non-zero exit status 1.
2021-06-21 16:40:40 ERROR juju.worker.uniter.operation runhook.go:139 hook "update-status" (via explicit, bespoke hook script) failed: exit status 1
2021-06-21 16:40:40 DEBUG juju.machinelock machinelock.go:186 machine lock released for etcd/1 uniter (run update-status hook)

```
The LXD container has to be stopped and started again for the unit to return to normal. After some time, however, it just goes back into this broken state.

Tags: sts
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

Hello,

What versions of etcd, Juju, OpenStack, etc. were you running?

Thanks,
Heather Lemon

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

etcd unit erroring out with "cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory"

https://bugs.launchpad.net/charm-etcd/+bug/1946010 - duplicate bug

I believe it's coming from snapd here: https://github.com/snapcore/snapd/blob/36a8f0a616978fd1d2e0793c1ce4bae038e5b326/cmd/libsnap-confine-private/cgroup-freezer-support.c#L42

etcd charm: latest (cs:etcd-554, cs:etcd-634)
snapd version: 2.51+
series: focal
etcdctl version: 3.4.5
API version: 3.4

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

From cgroup-support.c:

void sc_cgroup_create_and_join(const char *parent, const char *name, pid_t pid) {
    int parent_fd SC_CLEANUP(sc_cleanup_close) = -1;
    parent_fd = open(parent, O_PATH | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
    if (parent_fd < 0) {
        /* This die() is the source of the "cannot open cgroup hierarchy
           /sys/fs/cgroup/freezer" warnings seen in the unit logs above. */
        die("cannot open cgroup hierarchy %s", parent);
    }
    /* ... (snippet truncated; full function at the link below) */

https://github.com/snapcore/snapd/blob/36a8f0a616978fd1d2e0793c1ce4bae038e5b326/cmd/libsnap-confine-private/cgroup-support.c#L40

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

As of right now, I have been unable to reproduce the issue.

Revision history for this message
Boris Lukashev (rageltman) wrote :

Unfortunately we do not have the Juju OpenStack deployment anymore to reproduce or get these logs - the snap store failed to download correctly yet again, which broke etcd and Vault, which broke the entire stack, and that was the final straw. We've since elected to use Kolla for OpenStack (which also gives us access to a bunch of pieces not available in the Juju stack, such as Zun) while leveraging Juju for Ceph only. The dependency of private cloud components on a public resource (the snap store) sort of defeats the purpose of building private in the first place (independence from vendor/provider shenanigans).

Anecdotally, however, I have found that cgroup issues on systemd hosts are plentiful and irritating. We use Arch internally, so we get to "play with" all of the newest changes to systemd and the underlying OS, and it's been a roller-coaster of kernel command-line options for the hierarchy, container runtime configs, and scripted cgroup mappings to keep older containers running on newer systems (the insanity in LXC/LXD hasn't helped, nor has the general variety of OCI environments). Since you guys are a versioned distro (vs rolling), it shouldn't be quite as bad in-version, but I'm guessing some of these breaking changes get backported once in a while and then fixed once folks realize they can't run previously created containers correctly.

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

From the other LP bug's juju-crashdump, the first error in /var/log/juju/unit-etcd-2.log:

tracer: hooks phase, 1 handlers queued
tracer: ++ queue handler hooks/relations/etcd/provides.py:22:joined_or_changed
2021-10-03 18:35:22 INFO unit.etcd/2.juju-log server.go:325 db:113: Invoking reactive handler: hooks/relations/etcd/provides.py:22:joined_or_changed
2021-10-03 18:35:22 DEBUG unit.etcd/2.juju-log server.go:325 db:113: tracer: set flag db.connected
2021-10-03 18:35:22 DEBUG unit.etcd/2.juju-log server.go:325 db:113: tracer>
tracer: main dispatch loop, 6 handlers queued
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:79:joined:certificates
tracer: ++ queue handler reactive/etcd.py:122:set_app_version
tracer: ++ queue handler reactive/etcd.py:136:prepare_tls_certificates
tracer: ++ queue handler reactive/etcd.py:247:set_db_ingress_address
tracer: ++ queue handler reactive/tls_client.py:18:store_ca
tracer: ++ queue handler reactive/tls_client.py:71:store_client
2021-10-03 18:35:22 INFO unit.etcd/2.juju-log server.go:325 db:113: Invoking reactive handler: reactive/etcd.py:122:set_app_version
2021-10-03 18:35:22 WARNING unit.etcd/2.db-relation-joined logger.go:60 cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory
2021-10-03 18:35:22 ERROR unit.etcd/2.juju-log server.go:325 db:113: Failed to get etcd version:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-etcd-2/charm/reactive/etcd.py", line 893, in etcd_version
    raw_output = check_output(
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/snap/bin/etcd.etcdctl', 'version']' returned non-zero exit status 1.

Revision history for this message
Maciej Borzecki (maciek-borzecki) wrote (last edit):

Can you ensure that the freezer controller is enabled and its hierarchy is mounted under /sys/fs/cgroup/freezer?
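
For reference, here is a minimal standalone check along those lines (a hypothetical helper, not part of snapd; it assumes cgroup v1 and just verifies that a cgroup filesystem is mounted at the path snap-confine expects, much like `mount | grep freezer` does):

```c
/* freezer_check.c - hypothetical helper, not snapd code. */
#include <stdio.h>
#include <sys/vfs.h>
#include <linux/magic.h>

int main(void) {
    const char *path = "/sys/fs/cgroup/freezer";
    struct statfs sfs;

    if (statfs(path, &sfs) != 0) {
        /* ENOENT here corresponds to the "cannot open cgroup hierarchy" case. */
        perror(path);
        return 1;
    }
    if (sfs.f_type == CGROUP_SUPER_MAGIC)
        printf("%s is a cgroup v1 mount\n", path);
    else
        printf("%s exists but is not a cgroup mount (f_type=0x%lx)\n",
               path, (unsigned long)sfs.f_type);
    return 0;
}
```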

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

Hi Maciej,

It's enabled; we can see logs and tasks under /sys/fs/cgroup/freezer/ in the sosreport.

Running `mount | grep freezer` results in:

    cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)

Revision history for this message
Maciej Borzecki (maciek-borzecki) wrote :

The attached logs indicate pretty clearly that /sys/fs/cgroup/freezer isn't available in the mount namespace snap-confine runs in. Perhaps the controller is mounted on the host, but the bug report also mentions an LXD container. Is that the namespace in which the snaps are being run? Is /sys/fs/cgroup/freezer mounted inside the container?

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

I'm trying to run a setup similar to someone else who has the issue.

When I exec into the LXC container and run `mount | grep cgroup`,
I see this:
     cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

This error about LXD not being able to create cgroup mounts is suspicious:

Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328421]: Running constructor lxcfs_init to reload liblxcfs
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328421]: bindings.c: 810: cgfs_setup_controllers: Failed to set up private lxcfs cgroup mounts
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328421]: Failed to setup private cgroup mounts for lxcfs
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328537]: cat: /var/snap/lxd/common/lxcfs.pid: No such file or directory
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328096]: => Starting LXD
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328549]: t=2021-11-25T03:44:00+0000 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, disk priority will be ignored"
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328549]: t=2021-11-25T03:44:00+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328549]: t=2021-11-25T03:44:00+0000 lvl=warn msg="Instance type not operational" driver=qemu err="KVM support is missing" type=virtual-machine
Nov 25 03:44:03 juju-32004e-4-lxd-1 lxd.daemon[2328096]: => LXD is ready
Nov 25 03:44:03 juju-32004e-4-lxd-1 lxd.daemon[2328549]: t=2021-11-25T03:44:03+0000 lvl=warn msg="Failed to update instance types: Get \"https://images.linuxcontainers.org/meta/instance-types/.yaml\": x509: certificate signed by unknown authority"
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328096]: ==> Setting up ZFS (0.8)
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328096]: ==> Escaping the systemd cgroups
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328096]: ====> Detected cgroup V1
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328096]: /snap/lxd/21545/commands/daemon.start: 362: cannot create /sys/fs/cgroup/cpuset/cgroup.clone_children: Permission denied
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328096]: ==> Disabling shiftfs on this kernel (auto)
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328096]: => Starting LXCFS
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328421]: Running constructor lxcfs_init to reload liblxcfs
Nov 25 03:43:59 juju-32004e-4-lxd-1 lxd.daemon[2328421]: bindings.c: 810: cgfs_setup_controllers: Failed to set up private lxcfs cgroup mounts
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328421]: Failed to setup private cgroup mounts for lxcfs
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328537]: cat: /var/snap/lxd/common/lxcfs.pid: No such file or directory
Nov 25 03:44:00 juju-32004e-4-lxd-1 lxd.daemon[2328096]: => Starting LXD

Revision history for this message
Ian Johnson (anonymouse67) wrote :

The most recent logs from @hypothetical-lemon seem to pretty clearly indicate that the problem is in LXD; we should add an lxd task to this bug.

Revision history for this message
Boris Lukashev (rageltman) wrote :

@Ian Johnson: it may not be the case that this is something inherent to LXD, as I've seen similar cgroup issues with LXC itself and with Docker nested in it. These cgroup issues often stem from the host system's systemd setting up its unified cgroup hierarchy on boot and locking namespaces out of creating their respective mounts at startup. For example, as kernel and systemd versions have updated on our Arch systems, it became impossible to start certain containers after rebooting (like CentOS 8 and Ubuntu 20.04 with nesting) without providing a `systemd.unified_cgroup_hierarchy=0` kernel command-line parameter, as they would fail with a variety of mount errors on freezer, cpu, and other cgroups.

Revision history for this message
Maciej Borzecki (maciek-borzecki) wrote :

I left a note in the internal #lxd channel, but if this can be reproduced on demand then please collect the outputs of:

- SNAPD_DEBUG=1 snap run <snap>.<app>
- snap run --strace='--raw -vf -s 256' <snap>.<app>

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

SNAPD_DEBUG=1 snap run etcd.etcdctl version
https://pastebin.canonical.com/p/y3W3BCSyrZ/

sudo snap run --strace='--raw -s 256 -vf' etcd.etcdctl version
execve("/snap/snapd/14066/usr/lib/snapd/snap-confine", ["/snap/snapd/14066/usr/lib/snapd/snap-confine", "--base", "core18", "snap.etcd.etcdctl", "/usr/lib/snapd/snap-exec", "etcd.etcdctl", "version"], ["HOME=/root/snap/etcd/230", "LANG=C.UTF-8", "LOGNAME=root", "LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*"..., "MAIL=/var/mail/root", "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin", "SHELL=/bin/bash", "SNAP=/snap/etcd/230", "SNAP_ARCH=amd64", "SNAP_COMMON=/var/snap/etcd/common", "SNAP_DATA=/var/snap/etcd/230", "SNAP_INSTANCE_KEY=", "SNAP_INSTANCE_NAME=etcd", "SNAP_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void", "SNAP_NAME=etcd", "SNAP_REAL_HOME=/root", "SNAP_REEXEC=", "SNAP_REVISION=230", "SNAP_USER_COMMON=/root/snap/etcd/common", "SNAP_USER_DATA=/root/snap/etcd/230", "SNAP_VERSION=3.4.5", "SUDO_COMMAND=/usr/bin/strace -u root -f -e !select,pselect6,_newselect,clock_gettime,sigaltstack,gettid,gettimeofday,nanosleep -s 256 -vf /snap/snapd/14066/usr/lib/snapd/snap-confine --base core18 snap.etcd.etcdctl /usr/lib/snapd/snap-exec etcd.etcdctl ver"..., "SUDO_GID=0", "SUDO_UID=0", "SUDO_USER=root", "TERM=xterm-256color", "USER=root", "XDG_RUNTIME_DIR=/run/user/0/snap.etcd"]) = -1 EACCES (Permission denied)
/usr/bin/strace: exec: Permission denied
+++ exited with 1 +++
error: exit status 1

Revision history for this message
Maciej Borzecki (maciek-borzecki) wrote :

I did a bit more digging. When a snap is started, snap-confine will prepare and capture the snap's mount namespace.

In the process it does a recursive bind mount of /sys inside the new mount namespace of the snap (MS_BIND|MS_REC), and then recursively changes the propagation to slave (MS_REC|MS_SLAVE). This way, should /sys/fs/cgroup/freezer appear later in the parent namespace, it would also appear in the snap mount namespace.

On a 20.04 VM, `findmnt -o +PROPAGATION` indicates that /sys/fs/cgroup is a tmpfs mount point with propagation set to shared:
/sys/fs/cgroup tmpfs tmpfs ro,nosuid,nodev,noexec,mode=755 shared

Thus in the snap mount namespace it becomes:
/sys/fs/cgroup tmpfs tmpfs ro,nosuid,nodev,noexec,mode=755 private,slave

However, inside the LXD container, it is already private (so mount and unmount events do not propagate):
/sys/fs/cgroup tmpfs tmpfs ro,nosuid,nodev,noexec,mode=755,uid=1000000,gid=1000000 private

Inside the snap mount namespace it is then:
/sys/fs/cgroup tmpfs tmpfs ro,nosuid,nodev,noexec,mode=755,uid=1000000,gid=1000000 private

Trying to change the propagation of a private mount has no effect, so even if the freezer hierarchy is mounted later it will not become visible.

That state of the mount namespace is captured and reused for subsequent attempts to start the snap. Since the freezer is unavailable, snap-confine cannot set up the cgroup and thus fails.

This code has not changed since late 2018/early 2019, so some other change must have made this behavior visible just now. It is still unclear why the freezer is unavailable at the time the snap is started; perhaps the system is still booting (as much as containers can boot) but the shell is already accessible and it is possible to try to `snap install`.
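
To make the propagation behaviour above concrete, here is a rough, self-contained sketch (my own illustration, not snap-confine's actual code) of the two mount() calls in question. Run it as root in a throwaway VM or container; the isolation step on / is only there to keep the demo from leaking mounts back to the host:

```c
/* propagation_demo.c - illustrative only, not snap-confine source. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>

int main(void) {
    /* New mount namespace, roughly what snap-confine prepares for the snap. */
    if (unshare(CLONE_NEWNS) != 0) {
        perror("unshare");
        return 1;
    }
    /* Demo-only isolation so the following mounts do not propagate back out. */
    if (mount("none", "/", NULL, MS_REC | MS_SLAVE, NULL) != 0) {
        perror("make / rslave");
        return 1;
    }
    /* Recursive bind mount of /sys (MS_BIND|MS_REC), as described above ... */
    if (mount("/sys", "/sys", NULL, MS_BIND | MS_REC, NULL) != 0) {
        perror("bind-mount /sys");
        return 1;
    }
    /* ... then recursively switch its propagation to slave (MS_REC|MS_SLAVE).
     * If the original /sys/fs/cgroup mount is shared (as on the 20.04 VM), a
     * freezer hierarchy mounted later on the host still propagates into this
     * namespace.  If it is private (as inside the LXD container), this call
     * changes nothing and a later freezer mount never becomes visible here. */
    if (mount("none", "/sys", NULL, MS_REC | MS_SLAVE, NULL) != 0) {
        perror("make /sys rslave");
        return 1;
    }
    /* Drop into a shell inside the prepared namespace for inspection. */
    execlp("sh", "sh", (char *)NULL);
    perror("execlp");
    return 1;
}
```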

Revision history for this message
Alex (borsik) wrote :

In my case I was able to solve it yesterday with (inside lxd container):

sudo snap refresh
sudo snap remove etcd
sudo snap install etcd
sudo reboot

I had the same issue with kube-apiserver (in my case we are using k8s 1.21):

snap refresh
sudo snap remove kubectl
sudo snap install kubectl --channel=1.21/stable --classic
snap remove kube-apiserver
sudo snap install kube-apiserver --channel=1.21/stable
sudo reboot

Apparently fixed versions of the snaps/core had been published in the meantime.

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

This was resolved by running snap-discard-ns and then re-running the command:

/usr/lib/snapd/snap-discard-ns etcd

SNAPD_DEBUG=1 snap run etcd.etcdctl version

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

I just hit this bug at a client installation. It happened after a power outage. Running snap-discard-ns solved the issue.

Revision history for this message
Илия Анастасов (ianastasov) wrote :

I have several etcd instances running in lxd and this error happens from time to time. Running snap-discard-ns solved the issue for me too.

Chuan Li (lccn)
tags: added: sts
Revision history for this message
Stéphane Graber (stgraber) wrote :

Speaking for LXD: LXD itself doesn't perform any cgroup mounts; it just sets up an empty cgroup namespace.

The cgroup setup you see in the container is the result of what systemd and other userspace tools have set up. So it's technically impossible for this to be an LXD issue, as LXD doesn't have any logic related to those mounts ;)

As the host appears to be 20.04, it's unlikely that we're dealing with a mixed cgroup2/cgroup1 situation here. Instead, it looks more like the cgroup mounts just disappear from the child mount namespace.

I don't know exactly what would cause that, but we've observed in the past that systemd will perform a full cgroup reset (delete all cgroups) whenever it believes none of the scope/slice units require specific controllers. Maybe it's that same logic which is triggering a reset on those mounts and subsequent modification of the mount namespaces?
