This was found while debugging the tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap failures in the (multinode) tempest-slow job here:
https://review.openstack.org/#/c/606981/
The failure ultimately happens during teardown of the test class, when it fails to delete the "old" multiattach volume:
http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz?level=TRACE#_Dec_08_01_45_41_380161
Dec 08 01:45:41.380161 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
Dec 08 01:45:41.380475 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
Dec 08 01:45:41.380783 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Exit code: 22
Dec 08 01:45:41.381094 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stdout: u'Command:\n\ttgtadm -C 0 --mode target --op delete --tid=1\nexited with code: 22.\n'
Dec 08 01:45:41.381386 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stderr: u'tgtadm: this target is still active\n': ProcessExecutionError: Unexpected error while running command.
Dec 08 01:45:49.587081 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [None req-9dddc5b9-7e06-4ab9-83a6-ef211e0d4c48 tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
Dec 08 01:45:49.587355 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
After digging through the logs, I can see at least one set of issues in the ComputeManager.swap_volume code here:
https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5798-L5806
That code updates the "old" BDM with values from the new volume, including changing bdm.volume_id from the old volume to the new one, which likely orphans the old record somehow or otherwise messes up the cleanup.
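For reference, here is a minimal sketch of what that update amounts to, paraphrased from the values that show up in the manager.py:5819 log line further down (the argument names are placeholders, this is not the verbatim Nova source):

# Sketch of the end-of-swap BDM update in ComputeManager.swap_volume,
# paraphrased from the logged values below; not the verbatim Nova code.
from oslo_serialization import jsonutils


def update_old_bdm(bdm, new_volume_id, new_attachment_id, new_cinfo):
    """Rewrite the *old* volume's BDM row so it now describes the *new* volume."""
    bdm.update({
        'volume_id': new_volume_id,                     # fa56e173-... replaces 2a48ae08-...
        'attachment_id': new_attachment_id,             # e.g. 769b75cc-...
        'connection_info': jsonutils.dumps(new_cinfo),  # the new volume's connection_info
        'source_type': 'volume',
        'destination_type': 'volume',
        'snapshot_id': None,
        'no_device': None,
    })
    bdm.save()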
My detailed notes follow:
server1:
{"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links": [{"href": "https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a", "rel": "self"}, {"href": "https://149.202.185.218/compute/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a", "rel": "bookmark"}], "image": {"id": "f55b4d12-f159-453f-86cb-8965684121c8", "links": [{"href": "https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2018-12-08T01:43:11.000000", "flavor": {"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1, "extra_specs": {}, "swap": 0, "disk": 0}, "id": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "description": null, "user_id": "9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2018-12-08T01:43:11Z", "hostId": "f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19", "OS-SRV-USG:terminated_at": null, "tags": [], "key_name": null, "locked": false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-1", "created": "2018-12-08T01:43:04Z", "tenant_id": "5db98765c5d94a73bae8aba2f13c7762", "os-extended-volumes:volumes_attached": [], "metadata": {}}}
server1.hostId = f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19
server1 builds on the subnode (ubuntu-xenial-ovh-gra1-0001066278)
server2:
{"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links": [{"href": "https://149.202.185.218/compute/v2.1/servers/e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "rel": "self"}, {"href": "https://149.202.185.218/compute/servers/e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "rel": "bookmark"}], "image": {"id": "f55b4d12-f159-453f-86cb-8965684121c8", "links": [{"href": "https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2018-12-08T01:43:10.000000", "flavor": {"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1, "extra_specs": {}, "swap": 0, "disk": 0}, "id": "e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "description": null, "user_id": "9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2018-12-08T01:43:11Z", "hostId": "5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df", "OS-SRV-USG:terminated_at": null, "tags": [], "key_name": null, "locked": false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-2", "created": "2018-12-08T01:43:04Z", "tenant_id": "5db98765c5d94a73bae8aba2f13c7762", "os-extended-volumes:volumes_attached": [], "metadata": {}}}
server2.hostId = 5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df
server2 builds on the primary node (ubuntu-xenial-ovh-gra1-0001066161)
So clearly the two servers are on different hosts. Also, volume1 is on the primary node (same as server2).
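To keep the topology straight, here is the placement in this failing run (just restating the records above and the cinder-volume log at the top):

# Placement in the failing run (restated from the notes above; not queried).
placement = {
    'server1 (d46fba31)': 'subnode (ubuntu-xenial-ovh-gra1-0001066278)',
    'server2 (e0b74695)': 'primary (ubuntu-xenial-ovh-gra1-0001066161)',
    'volume1 (2a48ae08)': 'primary (ubuntu-xenial-ovh-gra1-0001066161)',
}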
attach volume1 to server1:
Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "id": "2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId": "2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full tempest/lib/common/rest_client.py:437
attach volume1 to server2:
Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId": "e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "id": "2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId": "2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full tempest/lib/common/rest_client.py:437
swap volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) to volume2 (fa56e173-3bd4-4701-bc71-cb4810c53c51) on server1 (d46fba31-9469-4799-b2bf-1fbad4369a9a):
2018-12-08 01:43:28.336 8815 INFO tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request (TestMultiAttachVolumeSwap:test_volume_swap_with_multiattach): 202 PUT https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8 0.618s
2018-12-08 01:43:28.337 8815 DEBUG tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'X-OpenStack-Nova-API-Version': '2.60'}
Body: {"volumeAttachment": {"volumeId": "fa56e173-3bd4-4701-bc71-cb4810c53c51"}}
Response - Headers: {u'x-openstack-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-length': '0', u'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', u'server': 'Apache/2.4.18 (Ubuntu)', u'x-compute-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-type': 'application/json', 'content-location': 'https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8', u'date': 'Sat, 08 Dec 2018 01:43:27 GMT', 'status': '202', u'openstack-api-version': 'compute 2.60', u'connection': 'close', u'x-openstack-nova-api-version': '2.60'}
Now the test waits for the swap to complete and for the attachments to all be recorded correctly; once the swap is done, volume1 should be attached only to server2 and volume2 should be attached to server1.
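In other words, the expected attachment layout before and after the swap (purely illustrative, restating the test flow above, not pulled from the API) is:

# Expected attachments before/after the swap (illustrative only).
before = {
    'server1 (d46fba31)': ['volume1 (2a48ae08)'],
    'server2 (e0b74695)': ['volume1 (2a48ae08)'],
}
after = {
    'server1 (d46fba31)': ['volume2 (fa56e173)'],
    'server2 (e0b74695)': ['volume1 (2a48ae08)'],
}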
This is the connection_info for the old/new volume during the swap:
Dec 08 01:43:31.321263 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG nova.compute.manager [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] swap_volume: Calling driver volume swap with connection infos:
new: {'status': u'reserved', 'multiattach': True, 'detached_at': u'', u'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'attach_mode': u'null', 'driver_volume_type': u'iscsi', 'instance': u'd46fba31-9469-4799-b2bf-1fbad4369a9a', 'attached_at': u'', 'serial': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'data': {u'access_mode': u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_portal': u'213.32.74.238:3260', u'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_lun': 1, u'auth_password': u'***', u'auth_username': u'oAJ38A88Rom7xrwQ5A4Y', u'auth_method': u'CHAP'}};
old: {u'status': u'reserved', u'instance': u'd46fba31-9469-4799-b2bf-1fbad4369a9a', u'multiattach': True, u'attached_at': u'', u'attach_mode': u'null', u'driver_volume_type': u'iscsi', u'detached_at': u'', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'serial': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'data': {u'device_path': u'/dev/sda', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}} {{(pid=20206) _swap_volume /opt/stack/nova/nova/compute/manager.py:5646}}
When the test cleans up and tries to delete volume1, it fails trying to cleanup iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8 from the subnode (which is the old connection to server1).
volume1 connection is supposed to be detached from server1 here:
http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_36_699753
Dec 08 01:45:36.699753 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] ==> disconnect_volume: call u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object at 0x7fe1aa096d50>, {u'device_path': u'/dev/sda', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}, None), 'kwargs': {}}" {{(pid=20206) trace_logging_wrapper /usr/local/lib/python2.7/dist-packages/os_brick/utils.py:146}}
It says it logged out:
Dec 08 01:45:37.336156 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: Logout of [sid: 1, target: iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal: 149.202.185.218,3260] successful.
This definitely looks wrong (post-swap, updating the old BDM record): note it's updating the BDM for volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) but with the volume2 (fa56e173-3bd4-4701-bc71-cb4810c53c51) connection_info. For the new-style attachment flow the connection_info might not matter much, since we likely don't use it (we just work with the attachment records), but if the attachment_id (769b75cc-4283-4b46-a975-243faae6e263) is for the wrong volume, that could be a problem:
Dec 08 01:45:37.699221 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG nova.compute.manager [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] swap_volume: Updating volume 2a48ae08-2833-488c-8771-eea86f41c7b8 BDM record with {'attachment_id': u'769b75cc-4283-4b46-a975-243faae6e263', 'no_device': None, 'connection_info': '{"status": "reserved", "multiattach": true, "detached_at": "", "volume_id": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "attach_mode": "null", "driver_volume_type": "iscsi", "instance": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "attached_at": "", "serial": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "data": {"access_mode": "rw", "target_discovered": false, "encrypted": false, "qos_specs": null, "target_iqn": "iqn.2010-10.org.openstack:volume-fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_portal": "213.32.74.238:3260", "volume_id": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_lun": 1, "device_path": "/dev/sdb", "auth_password": "Ko6DLxWNfrbqWQ7E", "auth_username": "oAJ38A88Rom7xrwQ5A4Y", "auth_method": "CHAP"}}', 'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'source_type': 'volume', 'snapshot_id': None, 'destination_type': 'volume'} {{(pid=20206) swap_volume /opt/stack/nova/nova/compute/manager.py:5819}}
Probably the bigger problem is that the BDM for volume1 is updated to have volume_id fa56e173-3bd4-4701-bc71-cb4810c53c51, i.e. we changed the volume_id on volume1's BDM. That might have something to do with orphaning volume1's connections when we go to delete it, since deleting server2 (which still has volume1 attached) likely doesn't clean up volume1 properly.
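To spell out the state I think we end up in (my reconstruction from the logs above, not a dump of the actual databases):

# Reconstructed post-swap state in the failing run (not an actual DB dump).
# Nova's BDM for server1 no longer references volume1 at all, while the
# iSCSI target for volume1 still exists on the primary node, which is what
# tgt-admin later refuses to delete ("this target is still active").
nova_bdms = [
    {'instance': 'server1 (d46fba31)', 'volume_id': 'fa56e173 (volume2)',
     'attachment_id': '769b75cc'},
    {'instance': 'server2 (e0b74695)', 'volume_id': '2a48ae08 (volume1)'},
]
leftover_target_on_primary = (
    'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8')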
The other things I'm wondering about are:
1. Is nova-compute deleting the old volume1/server1 attachment record here:
https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5691
I ask because I didn't see anything about an attachment delete in the logs (one way to check is sketched after this list).
2. Is there a periodic task in the cinder-volume service that recreates the target connection for volume1/server1 after nova-compute disconnects it during the swap but before nova-compute deletes the old attachment record?
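For the first question, one way to check (a sketch only, assuming an already-configured keystoneauth1 session and the Cinder attachments API at microversion 3.27+; attribute names may differ between cinderclient versions) would be to list Cinder's attachment records for volume1 and see whether one still points at server1:

# Hypothetical check, not something run in this job: list Cinder's attachment
# records for volume1 and see whether one still references server1. If one
# does after the swap finished, nova-compute never deleted the old record.
from cinderclient import client as cinder_client


def stale_attachments(sess, volume_id, server_id):
    cinder = cinder_client.Client('3.44', session=sess)
    return [att for att in cinder.attachments.list(
                search_opts={'volume_id': volume_id})
            if getattr(att, 'instance', None) == server_id]

# e.g. stale_attachments(sess,
#                        '2a48ae08-2833-488c-8771-eea86f41c7b8',   # volume1
#                        'd46fba31-9469-4799-b2bf-1fbad4369a9a')   # server1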
The first and most obvious thing to do is fix that bdm update code at the end of the ComputeManager.swap_volume method. I'm not exactly sure yet how the multi-host nature of this comes into play, except possibly in the libvirt driver volume disconnect code here:
https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/virt/libvirt/driver.py#L1221
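For context, my rough understanding of that disconnect path (a paraphrase of the intended behavior, not the verbatim driver code) is that for a multiattach volume the host connection should only be torn down when no other instance on the same host still uses the volume, roughly:

# Rough paraphrase of the multiattach guard around disconnecting a volume on
# a compute host; illustrative only, not the verbatim libvirt driver code.
def should_disconnect_target(attachment_connector_hosts, this_host):
    # attachment_connector_hosts: the connector host recorded on each of the
    # volume's attachment records.
    connections_from_this_host = sum(
        1 for host in attachment_connector_hosts if host == this_host)
    # Keep the iSCSI session alive if another instance on this same host
    # still needs the volume; only disconnect when we're the last user here.
    return connections_from_this_host < 2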
I'm not sure multiple hosts are really the main culprit, though, because the test passed on this tempest-slow job:
http://logs.openstack.org/78/606978/4/check/tempest-slow/63b3e0e/
And in that case, the two servers attached to the multiattach volume being swapped are also on different hosts. So maybe there is some timing issue involved when this fails.
In that passing case, these are the two servers:
Body: {"servers": [{"id": "988b38e6- 1556-45b8- 94ba-76d5045ef5 e9", "links": [{"href": "https:/ /10.210. 68.80/compute/ v2.1/servers/ 988b38e6- 1556-45b8- 94ba-76d5045ef5 e9", "rel": "self"}, {"href": "https:/ /10.210. 68.80/compute/ servers/ 988b38e6- 1556-45b8- 94ba-76d5045ef5 e9", "rel": "bookmark"}], "name": "tempest- TestMultiAttach VolumeSwap- server- 1095754502- 2"}, {"id": "2f4078c8- 4911-4154- b24b-ea2ca37fcd d3", "links": [{"href": "https:/ /10.210. 68.80/compute/ v2.1/servers/ 2f4078c8- 4911-4154- b24b-ea2ca37fcd d3", "rel": "self"}, {"href": "https:/ /10.210. 68.80/compute/ servers/ 2f4078c8- 4911-4154- b24b-ea2ca37fcd d3", "rel": "bookmark"}], "name": "tempest- TestMultiAttach VolumeSwap- server- 1095754502- 1"}]} _log_request_full tempest/ lib/common/ rest_client. py:437
988b38e6-1556-45b8-94ba-76d5045ef5e9 (server2) builds on the primary node and 2f4078c8-4911-4154-b24b-ea2ca37fcdd3 (server1) builds on the subnode.
Maybe it depends on where volume1 builds? In that test, volume1 is 2fe7ad08-b928-4139-9f9d-a4ee54ac630d, which builds on the primary node, and if server1 is on the subnode, then that is one difference from the fail scenario where volume1 and server1 are on the same host.