instance deletion takes a while and blocks nova-compute
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Confirmed | Undecided | Unassigned |
nova (Ubuntu) | New | Medium | Unassigned |
Bug Description
Hi,
I have a cloud running xenial/mitaka (with 18.02 charms).
Sometimes an instance takes minutes to delete. I tracked the time down to the file deletion step:
Apr 23 07:23:00 hostname nova-compute[
Apr 23 07:27:33 hostname nova-compute[
As you can see, 4 minutes and 33 seconds elapsed between the two lines. nova-compute logs absolutely _nothing_ during this time, periodic tasks are not run, and so on. Normally a deletion takes a few seconds at most.
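For anyone checking similar log pairs, the gap can be computed with a small script. This is a minimal sketch: the helper name `elapsed_between` is made up for illustration, and the year is an assumption (syslog timestamps carry none; 2018 is taken from the nova-scheduler log line in this report). Month names are parsed with `%b`, which assumes an English locale.

```python
from datetime import datetime, timedelta


def elapsed_between(start: str, end: str, year: int = 2018) -> timedelta:
    """Return the time between two syslog-style timestamps.

    Syslog timestamps such as "Apr 23 07:23:00" have no year, so one is
    supplied explicitly (assumed to be the same for both lines).
    """
    fmt = "%Y %b %d %H:%M:%S"
    t_start = datetime.strptime(f"{year} {start}", fmt)
    t_end = datetime.strptime(f"{year} {end}", fmt)
    return t_end - t_start


if __name__ == "__main__":
    gap = elapsed_between("Apr 23 07:23:00", "Apr 23 07:27:33")
    print(gap.total_seconds())
```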
The log lines above are usually followed immediately by:
Apr 23 07:27:33 hostname nova-compute[
(which is error: [Errno 104] Connection reset by peer)
because nova-compute does not even keep its RabbitMQ connection alive (on the RabbitMQ server I can see errors about "Missed heartbeats from client, timeout: 60s").
So nova-compute appears to be "frozen" for several minutes. This can cause problems because events can be missed, etc.
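One mechanism that would produce this kind of silence is a long synchronous call inside a cooperatively scheduled process: while it runs, nothing else in that process gets a turn, including heartbeats. The sketch below only illustrates that general effect, using `asyncio` from the standard library. It is not nova's code, the function names and timings are invented, and it is not a confirmed diagnosis of this bug.

```python
import asyncio
import time


async def heartbeat(ticks: list, interval: float) -> None:
    """Record a timestamp every `interval` seconds, like a periodic heartbeat."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_delete(duration: float) -> None:
    """Stand-in for a slow synchronous operation (hypothetical).

    time.sleep() never yields to the event loop, so every other task
    is starved until it returns.
    """
    time.sleep(duration)


async def longest_heartbeat_gap(interval: float = 0.05, block: float = 0.5) -> float:
    """Run a heartbeat alongside a blocking call and return the longest gap seen."""
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks, interval))
    await asyncio.sleep(interval * 3)   # a few normal heartbeats
    await blocking_delete(block)        # the loop is frozen here
    await asyncio.sleep(interval * 3)   # heartbeats resume
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    gap = asyncio.run(longest_heartbeat_gap())
    print(f"longest gap between heartbeats: {gap:.2f}s")
```

With these illustrative values the longest gap is at least the length of the blocking call, even though the heartbeat is meant to fire every 0.05 seconds.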
We have telegraf on this host, and there's little to no CPU, disk, network or memory activity at that time. Nothing relevant in kern.log either. And this is happening on 3 different architectures, so this is all very puzzling.
Is nova-compute supposed to be completely stuck while deleting instance files? Have you ever seen something similar?
I'm going to try to reproduce this on queens.
Thanks
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Jun 23 14:23 seq
crw-rw---- 1 root audio 116, 33 Jun 23 14:23 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Package: nova (not installed)
PciMultimedia:
ProcEnviron:
TERM=screen-
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB:
ProcKernelCmdLine: root=UUID=
ProcLoadAvg: 10.43 10.79 9.76 11/2237 15123
ProcSwaps:
Filename Type Size Used Priority
/swap.img file 8388544 0 -1
ProcVersion: Linux version 4.4.0-128-generic (buildd@
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 1.157.20
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial uec-images xenial uec-images xenial uec-images
Uname: Linux 4.4.0-128-generic ppc64le
UnreportableReason: The report belongs to a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
_MarkForUpload: False
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_dscr: DSCR is 0
cpu_freq:
min: 1.201 GHz (cpu 48)
max: 3.710 GHz (cpu 112)
avg: 2.615 GHz
cpu_runmode:
Could not retrieve current diagnostics mode,
No kernel interface to firmware
cpu_smt: SMT is off
Also, nova-scheduler or nova-api-os-compute will log the following lines (a few times per minute) while this is happening:
Apr 23 07:24:47 juju-8c74e6-4-lxd-7 nova-scheduler[15786]: 2018-04-23 07:24:47.785 15786 DEBUG nova.servicegroup.drivers.db [req-1573c400-116c-4825-b108-3291a014b0e9 bc0ab055427645aca4ed09266e85b1db 1cb457a8302543fea067e5f14b5241e7 - - -] Seems service nova-compute on host hostname is down. Last heartbeat was 2018-04-23 07:22:56. Elapsed time is 111.785844 is_up /usr/lib/python2.7/dist-packages/nova/servicegroup/drivers/db.py:82
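The "Elapsed time" in that message is the difference between the time of the check and the last recorded heartbeat. A minimal sketch of such a liveness check follows, assuming the service is considered down once the elapsed time exceeds a threshold. The 60-second threshold and the `is_up` signature here are assumptions for illustration; the real value comes from the deployment's configuration.

```python
from datetime import datetime

# Assumed threshold in seconds; the actual value is deployment configuration.
SERVICE_DOWN_TIME = 60


def is_up(last_heartbeat: datetime, now: datetime,
          service_down_time: float = SERVICE_DOWN_TIME) -> bool:
    """Return True if the last heartbeat is recent enough."""
    elapsed = (now - last_heartbeat).total_seconds()
    return abs(elapsed) <= service_down_time


# Timestamps taken from the log line above.
last_heartbeat = datetime(2018, 4, 23, 7, 22, 56)
checked_at = datetime(2018, 4, 23, 7, 24, 47, 785844)

if __name__ == "__main__":
    print((checked_at - last_heartbeat).total_seconds())
    print(is_up(last_heartbeat, checked_at))
```

Under that assumed threshold, the compute service is reported down because its last heartbeat predates the start of the deletion and none arrived while nova-compute was unresponsive.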