Cinder-Backup stuck in creating after RBD export-diff existed and turned into zombie / <defunct>

Bug #2031897 reported by Christian Rohmann
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Cinder
In Progress
Medium
Unassigned

Bug Description

I just observed a volume backup (RBD driver) which is in creating for DAYS already.
I went to the host running cinder-backup and looked at the RBD processes there:

```
[...]
cinder 3396548 1 3396548 3396548 0 Jul31 ? 03:40:51 /usr/bin/python3 /usr/bin/cinder-backup --config-file=/etc/cinder/cinder.conf --log-file=/var/log/cinder/cinder-backup.log
cinder 3396576 3396548 3396548 3396548 0 Jul31 ? 01:05:29 /usr/bin/python3 /usr/bin/cinder-backup --config-file=/etc/cinder/cinder.conf --log-file=/var/log/cinder/cinder-backup.log
cinder 3098464 3396576 3396548 3396548 0 Aug13 ? 00:00:01 [rbd] <defunct>
cinder 3098465 3396576 3396548 3396548 0 Aug13 ? 00:01:49 rbd import-diff --id cinder-backup --conf /etc/ceph/backup.conf --pool backups - backups/volume-fcc9390e-1b6d-48f1-858f-5ed6557cf7ec.backup.54fcb969-87e7-4594-becc-bd8c855a1cbf
root 3398716 1 3396548 3396548 0 Jul31 ? 00:01:14 /usr/bin/python3 /bin/privsep-helper --config-file /etc/cinder/cinder.conf --privsep_context os_brick.privileged.default --privsep_sock_path /tmp/tmpssrnnwgk/privsep.sock
[...]
```

Looking at the piece of code doing the piping of rbd export-diff into rbd import-idff at https://opendev.org/openstack/cinder/src/commit/3b9cc13a94668018d8e0aec2dc997262c9c654a8/cinder/backup/drivers/ceph.py#L685, it seems that this case is not handled properly.

Achieving something the likes of "-o pipefail" seems to be a little more challenging.
See e.g.

* https://stackoverflow.com/questions/68228199/make-a-shell-pipeline-started-from-subprocess-popen-fail-if-the-left-hand-side-o

summary: Cinder-Backup stuck in creating after RBD export-diff existed and turned
- into zombie / <defunct> properly
+ into zombie / <defunct>
Eric Harney (eharney)
tags: added: backup ceph drivers rbd
Changed in cinder:
importance: Undecided → Medium
Revision history for this message
Tony Liu (tonyliu0592) wrote :

Could you run ps to check if that rbd is zombie? And parent process may also be stuck.
Typically, rbd will return, success or fail, but in this case rbd is stuck. Need to find out how.
"-o pipefail" only affects the rc of whole command. I don't think it will help in this case,
where rbd doesn't return at all.

Revision history for this message
Tony Liu (tonyliu0592) wrote :

I had the same issue and figured out it's a bug in parent process, which caused infinite loop, which leaves sub-process zombie. Once that bug was fixed, zombie rbd doesn't happen any more.

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Tony,

1) If you read my initial bug report, there you'll find the ps output clearly showing that the rbd export-diff (left-hand side cmd) was defunct. This was due to this command exiting without causing the cmd2 (rbd import-diff) to fail. This case is simply not handled properly in the current code.

2) You are referring to a bug in the "parent process", so I suspect cinder-backup? Could you please share a reference to that bug?

Revision history for this message
Simon Hensel (shensel) wrote :

Christian's observation is right, the pipeline gets set up and both processes get started, but the code never looks at the returncode of the first process (p1) and only returns the returncode of the second process (p2): https://opendev.org/openstack/cinder/src/commit/3b9cc13a94668018d8e0aec2dc997262c9c654a8/cinder/backup/drivers/ceph.py#L616

A simple solution would be to check that p1 actually exited (whether successful or because of a timeout) and then stop the pipeline to avoid defunct processes. The subprocess module offers the wait() function for this: https://docs.python.org/3/library/subprocess.html#subprocess.Popen.wait

Revision history for this message
Tony Liu (tonyliu0592) wrote :

I'm not using Cinder for this.
Here is my code.

    p = await asyncio.create_subprocess_shell(cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE)
    stdout, stderr = await p.communicate()
    rc = p.returncode

When cmd is "p1 | p2", and p1 exits error, rc will be that error.
It works fine for me.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/897245

Changed in cinder:
status: New → In Progress
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.