New instance is always in "spawning" status

Bug #1880828 reported by Taihsiang Ho
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Bundles
Invalid
Undecided
Unassigned
OpenStack Compute (nova)
Triaged
Wishlist
Unassigned
OpenStack Nova Compute Charm
Invalid
Undecided
Unassigned

Bug Description

bundle: openstack-base-bionic-train https://github.com/openstack-charmers/openstack-bundles/blob/master/development/openstack-base-bionic-train/bundle.yaml
hardware: 2 d05 and 2 d06 (the log of the compute node is from one of the d06. Please note they are arm64 arch.)

When trying to create new instances on the deployed openstack, the instance is always in the status of "spawning"

[Steps to Reproduce]
1. Deploy with the above bundle and hardware by following the instruction of https://jaas.ai/openstack-base/bundle/67
2. Wait about 1.5 until the deployment is ready. By ready it means every unit shows its message as "ready" e.g. https://paste.ubuntu.com/p/k48YVnPyVZ/
3. Follow the instruction of https://jaas.ai/openstack-base/bundle/67 until the step of "openstack server create" to create new instance. This step is also summarized in details in this gist code snippet https://gist.github.com/tai271828/b0c00a611e703046dd52da12a66226b0#file-02-basic-test-just-deployed-sh

[Expected Behavior]
An instance is created a few seconds later

[Actual Behavior]
The status of the instance is always (> 20 minutes) "spawning"

[Additional Information]

1. [workaround] Use `ps aux | grep qemu-img` to check if a qemu-img image converting process exists or not. The process should complete within ~20 sec. If the process exists for more than 1 minutes, use `pkill -f qemu-img` to terminate the process and re-create instances again.

The image converting process looks like this one:

```
qemu-img convert -t none -O raw -f qcow2 /var/lib/nova/instance s/_base/9b8156fbecaa194804a637226c8ffded93a57489.part /var/lib/nova/instances/_base/9b8156fbecaa194804a637226c8ffded93a57489.converted
```

2. By investing in more details, this issue is a coupled issue of 1) nova should timeout instance process (comment#21) 2) qemu does not terminate the process to convert the image successfully (comment#20)

Tags: tai-radar
Revision history for this message
Taihsiang Ho (taihsiangho) wrote :
description: updated
summary: - nova-compute Traceback for "qemu-img convert"
+ New instance is always in "spawning" status
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

It seems that the image download produced a file '/var/lib/nova/instance s/_base/9b8156fbecaa194804a637226c8ffded93a57489.part' that is not really usable by qemu-img convert. If you still have that file (or can reproduce the problem) then it would be nice to compare hash of the image in glance and the hash of the downloaded .part file. Also if you reproduce could you please set debug=True in the [DEFAULT] section in the nova.conf of your nova-compute service and attach the logs?

Setting this to Invalid until the requested input is provided, please set it back to New when you did that.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I mean Incomplete not Invalid.

Changed in nova:
status: New → Incomplete
Revision history for this message
dann frazier (dannf) wrote :

Is it possible that qemu-img is timing out? If so, seems like this could be bug 1805256. We are working on a fix for that - the bionic MP is at:
  https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/qemu/+git/qemu/+merge/383566
And Rafael is staging fixes in:
  https://launchpad.net/~rafaeldtinoco/+archive/ubuntu/lp1805256

Note that the qemu package version 1:2.11+dfsg-1ubuntu7.25 there right now is known to still hang for a different reason. You'll want to wait until he prepares a newer test version, which should be 1:2.11+dfsg-1ubuntu7.27.

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

Attached var-log-nova-kreiken-200529-enable-debug.tar.gz for log collection of /var/log/nova when the debug mode is enabled.

The log was collected in a circumstance that this issue is still reproducible even the nova-compute service has restarted several times.

In the meantime, one of the image hash shows in the following when creating instance:

ubuntu@kreiken:/var/lib/nova/instances/_base$ ls -alh
total 848M
drwxr-xr-x 2 nova nova 4.0K May 29 06:39 .
drwxr-xr-x 6 nova nova 4.0K May 29 06:39 ..
-rw-r--r-- 1 nova nova 2.2G May 29 06:40 c5d6cf094c33ec732f70e014b57fcb3c4c953895.converted
-rw-r--r-- 1 nova nova 330M May 29 06:39 c5d6cf094c33ec732f70e014b57fcb3c4c953895.part
ubuntu@kreiken:/var/lib/nova/instances/_base$ md5sum c5d6cf094c33ec732f70e014b57fcb3c4c953895.part ; md5sum c5d6cf094c33ec732f70e014b57fcb3c4c953895.converted
30d459e0dc0562e629d3683970e55efa c5d6cf094c33ec732f70e014b57fcb3c4c953895.part
0d5c8ef961aadef06b51cb37e270c2ad c5d6cf094c33ec732f70e014b57fcb3c4c953895.converted

Changed in nova:
status: Incomplete → New
description: updated
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Taihsiang: The attached nova-compute log does not show any DEBUG logs. Also it does not show the reproduction of the reported problem. I only saw to instances spawned successfully.

Also please compare the hash of the downloaded .part file with the hash of the image stored in glance to rule out the case when the downloaded file is corrupted in transfer.

Changed in nova:
status: New → Incomplete
Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

@Balazs Sorry for the mistake, here is the expected debug enabled mode log files var-log-nova-kreiken-200529-enable-debug.2.tar.gz regarding comment#5

I will check the hash of the image in glance later. Thank you for your help.

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

Regarding comment#4, because rafael's ppa for LP: #1805256 does not build the package for bionic successfully, I tried Ike's PPA instead https://launchpad.net/~ikepanhc/+archive/ubuntu/lp1805256 but no luck. By installing the qemu package from Ike's PPA https://launchpad.net/~ikepanhc/+archive/ubuntu/lp1805256 I could still reproduce this issue.

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

More information for comment#8

 ubuntu@kreiken:~$ apt-cache policy qemu
 qemu:
   Installed: 1:2.11+dfsg-1ubuntu7.25
   Candidate: 1:4.0+dfsg-0ubuntu9.4~cloud0
   Version table:
   1:4.0+dfsg-0ubuntu9.4~cloud0 500
   500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/train/main arm64 Packages
   1:2.11+dfsg-1ubuntu7.26 500
   500 http://ports.ubuntu.com/ubuntu-ports bionic-updates/universe arm64 Packages
   500 http://ports.ubuntu.com/ubuntu-ports bionic-security/universe arm64 Packages
  *** 1:2.11+dfsg-1ubuntu7.25 500
   500 http://ppa.launchpad.net/ikepanhc/lp1805256/ubuntu bionic/main arm64 Packages
   100 /var/lib/dpkg/status
   1:2.11+dfsg-1ubuntu7 500
   500 http://ports.ubuntu.com/ubuntu-ports bionic/universe arm64 Packages
 ubuntu@kreiken:~$ uname -a; lsb_release -a
 Linux kreiken 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:03:46 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
 No LSB modules are available.
 Distributor ID: Ubuntu
 Description: Ubuntu 18.04.4 LTS
 Release: 18.04
 Codename: bionic
 ubuntu@kreiken:~$

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

I did not manage to find out the image in glance yet in the expected location:

root@juju-0b1ade-2-lxd-1:/var/lib/glance/images# grep filesystem_store_datadir /etc/glance/glance-api.conf
filesystem_store_datadir = /var/lib/glance/images/
root@juju-0b1ade-2-lxd-1:/var/lib/glance/images# ls -alh
total 8.0K
drwxr-xr-x 2 glance glance 4.0K Apr 7 20:44 .
drwxr-xr-x 4 glance glance 4.0K May 28 14:17 ..
root@juju-0b1ade-2-lxd-1:/var/lib/glance/images#

but "glance image-list" shows something:

$ glance image-list
+--------------------------------------+----------------------------+
| ID | Name |
+--------------------------------------+----------------------------+
| 2c4451d3-f607-43d4-bb42-d9e66fa9f98f | bionic-arm64-200528-233637 |
+--------------------------------------+----------------------------+

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

My glance is using mysql as a backend, so I verified the hash of the image by "glance image-download" rather than finding out its real location stored in the db. By comparing the hash of the image I uploaded, the part image, and the downloaded image via "glance image-download", their md5sum shows the same.

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

Regarding comment#4, thanks dann for reminding me there is a new patch to try and I need to build the qemu on my arm machine. I am trying the patch.

Changed in nova:
status: Incomplete → New
Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

@Balazs , I have uploaded the nova log with debug mode enabled, you may refer to the attachment var-log-nova-kreiken-200529-enable-debug.2.tar.gz in comment#7 if you are still interested in it. Besides, comment#11 shows the hash of the .part and the image in db are the same. Thank you for your help.

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

I tried the qemu in proposed to include the fixes of LP: #1805256 with the openstack built by same machines. I could still reproduce this issue.

===

ubuntu@kreiken:~$ apt-cache policy qemu
qemu:
  Installed: 1:2.11+dfsg-1ubuntu7.27
  Candidate: 1:4.0+dfsg-0ubuntu9.4~cloud0
  Version table:
     1:4.0+dfsg-0ubuntu9.4~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/train/main arm64 Packages
 *** 1:2.11+dfsg-1ubuntu7.27 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-proposed/universe arm64 Packages
        100 /var/lib/dpkg/status
     1:2.11+dfsg-1ubuntu7.26 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-updates/universe arm64 Packages
        500 http://ports.ubuntu.com/ubuntu-ports bionic-security/universe arm64 Packages
     1:2.11+dfsg-1ubuntu7 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic/universe arm64 Packages
ubuntu@kreiken:~$ uname -a
Linux kreiken 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:03:46 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@kreiken:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
ubuntu@kreiken:~$

Revision history for this message
Andrew McLeod (admcleod) wrote :

I have access to these machines and have used the same bundle with very similar config and managed to launch ~100 bionic instances and ~30 xenial instances without any being stuck in spawning state

Can we confirm that the images being used to launch the instances are current?

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

Thank @admcleod for re-testing with the current bionic image.

The image being used to launch the instances is downloaded from http://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-arm64.img

However, the current one bionic-server-cloudimg-arm64.img (10-Jun-2020, f73471e3ea879e4fd77f4811673581f5) differs from what I have downloaded earlier bionic-server-cloudimg-arm64.img (May 24 00:59 UTC+8, 30d459e0dc0562e629d3683970e55efa).

I tried both today, and the current image (10-Jun-2020) won't reproduce this issue when launching ~10 bionic instances. The previous image (May 24 2020) is still able to reproduce this issue.

I am keeping testing by launching more instances to see if this issue is invalid now.

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

After a bit "stress" testing by re-deploying openstack on these ARM machines and retried with the corresponding images[1], the issue is still able to reproduce.

I am aware once the compute instance was created successfully, the following creation will be created successfully as well. Maybe comment#16 and comment#15 were just lucky enough to create the 1st instance successfully and so on the following instances. : (

[1] The script to download the images

#!/bin/bash
CURRENT_DAILY_DATE="200617"

function dl {
    mkdir -p ./$2
    wget $1/$2/$3 -O ./$2/$3
    wget $1/$2/MD5SUMS -O ./$2/MD5SUMS.orig
    echo
    #set -x
    grep $3 ./$2/MD5SUMS.orig > ./$2/MD5SUMS
    pushd ./$2
    md5sum -c ./MD5SUMS
    popd
    #set +x
    echo
}

rm -rf ./daily-* ./release-*
ls -alh
echo
echo

dl http://cloud-images.ubuntu.com/bionic current bionic-server-cloudimg-arm64.img
mv ./current ./daily-${CURRENT_DAILY_DATE}

dl http://cloud-images.ubuntu.com/releases/bionic release-20200518.1 ubuntu-18.04-server-cloudimg-arm64.img
dl http://cloud-images.ubuntu.com/releases/bionic release-20200519.1 ubuntu-18.04-server-cloudimg-arm64.img
dl http://cloud-images.ubuntu.com/releases/bionic release-20200610 ubuntu-18.04-server-cloudimg-arm64.img
dl http://cloud-images.ubuntu.com/releases/bionic release-20200610.1 ubuntu-18.04-server-cloudimg-arm64.img

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

The symptom is similar to this one on top of redhat openstack https://access.redhat.com/solutions/3315611

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

I nailed down this issue more precisely. By "pkill -f qemu-img" we could reproduce and apply the workaround quicker:

1. Deploy the openstack ( with the latest bionic train bundle and 4 arm servers)
2. Upload the ubuntu cloud image. "openstack image create --public --container-format=bare --disk-format=qcow2 --property hw_firmware_type=uefi --file ${IMAGE_TO_UPLOAD} ${IMAGE_UPLOADED_NAME}". Bionic and xenil are both fine. Both of them could reproduce this issue.
3. Reproduce this issue by "openstack server create --image ${IMAGE_UPLOADED_NAME} --flavor ${FLAVOR_NAME} --key-name ${KEY_NAME} --nic net-id=${net_id} ${INSTANCE_NAME}" (You should prepare the flavor, key and network first. Please refer to the README instruction of the corresponding bundle, or refer to the description of this bug.). Please issue several times of this command with different INSTANCE_NAME. (more than 5 is suggested because the reproducing rate is ~80%)
4. "nova list" to check the instance creation status.
5. If the endless spawaning status shows up (e.g. wait for more than 1 minute), issue "pkill -f qemu-img" on the target nova-compute node to terminate the qemu-img converting process. Repeat this step until you get the first successful created instance.

Several highlights:
1. qemu image converting process used by the openstack is "sequential". If one qemu image converting process does not terminate, the following re-creating proceseses won't be started.
2. nova does not timeout such endless qemu image converting process. (see comment#18)
3. By mixing item#1 and item#2, it results in all instance creation processes are always in "spawning" status.

Next:
1. We should go back to have a look of the issue mentioned in comment#4. The root cause looks like "qemu on arm machine". comment#9 shows a quick trial and shows no help. However, I could reproduce similar issue on an arm machine with the qemu package version used by the openstack. It is worthy to have a look why the SRUed deb does not work. (note: bug 1805256 is re-opened recently for regression.)
2. Trial this bug. For example, once we identifed the issue is exactly caused by bug 1805256. We should trace the bug instead. Besides, I would suggest to propose a feature enhancement like comment#18 (nova should timeout endless qemu image converting process)

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

qemu-utils (1:4.0+dfsg-0ubuntu9.8~cloud0) is the latest qemu-utils at this moment provided by ubuntu-cloud archive used by the bundle https://github.com/openstack-charmers/openstack-bundles/blob/master/development/openstack-base-bionic-train/bundle.yaml won't reproduce this issue[1].

By comparing the qemu-utils that we have tested, the bisecting result shows this issue was fixed in version 1:4.0+dfsg-0ubuntu9.4~cloud0 (will reproduce this issue) and 1:4.0+dfsg-0ubuntu9.8~cloud0 (won't reproduce this issue) (that is, 9.4 - 9.8)

[1]

ubuntu@kreiken:~$ apt-cache policy qemu qemu-utils
qemu:
  Installed: (none)
  Candidate: 1:4.0+dfsg-0ubuntu9.8~cloud0
  Version table:
     1:4.0+dfsg-0ubuntu9.8~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/train/main arm64 Packages
     1:2.11+dfsg-1ubuntu7.28 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-updates/universe arm64 Packages
     1:2.11+dfsg-1ubuntu7.26 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-security/universe arm64 Packages
     1:2.11+dfsg-1ubuntu7 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic/universe arm64 Packages
qemu-utils:
  Installed: 1:4.0+dfsg-0ubuntu9.8~cloud0
  Candidate: 1:4.0+dfsg-0ubuntu9.8~cloud0
  Version table:
 *** 1:4.0+dfsg-0ubuntu9.8~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/train/main arm64 Packages
        100 /var/lib/dpkg/status
     1:2.11+dfsg-1ubuntu7.28 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-updates/main arm64 Packages
     1:2.11+dfsg-1ubuntu7.26 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-security/main arm64 Packages
     1:2.11+dfsg-1ubuntu7 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic/main arm64 Packages

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

Acoording comment#20 let's triage this issue by issuing a "Request for Feature Enhancements (RFEs)" with this launchpad blueprint https://blueprints.launchpad.net/nova/+spec/nova-instance-spawning-timeout and https://bugs.launchpad.net/nova/+bug/1887330 (the placeholder of the blueprint).

Revision history for this message
Taihsiang Ho (taihsiangho) wrote :

It seems that I did not have permission to change the status to "triage" in the above bug report summary status. I will leave this action to bug supervisors.

description: updated
tags: added: tai-radar
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Let's handle this in the bp

Changed in nova:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Billy Olsen (billy-olsen) wrote :

Marking charm tasks as invalid on this particular bug as these aren't related to the charms and were chased down to other components.

Changed in charm-nova-compute:
status: New → Invalid
Changed in openstack-bundles:
status: New → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.