I looked at the logs. Four restores were started in the cinder backup log; three of them finished and one did not. Two of the four started at almost the same time, and of those, one finished and the other did not. The unfinished one did finish restoring the data and started to restore the metadata, but at that point there were connection problems with MySQL, so the metadata restore did not finish and the restore eventually timed out.
From the cinder backup log (http://logs.openstack.org/82/202782/15/gate/gate-tempest-dsvm-neutron-full/372a95f/logs/screen-c-bak.txt.gz#_2015-08-10_21_36_37_525), we can see that the backup restore started:
2015-08-10 21:34:27.541 INFO cinder.backup.manager [req-139ab0ae-d0ed-43d5-a254-19e895c9ff95 tempest-VolumesBackupsV2Test-998958099] Restore backup started, backup: bd79c0a3-5a76-4dc3-a774-d8ef515c46b1 volume: bd776a72-66c0-43a6-8d48-39f541634226.
2015-08-10 21:34:27.825 DEBUG cinder.backup.chunkeddriver [req-0a4871ba-550a-4435-8e15-07421d4f9eef tempest-VolumesBackupsV1Test-1030415370] starting restore of backup volume_8b22ff43-58ed-4148-9824-20db1efd21d2/20150810213315/az_nova_backup_bd79c0a3-5a76-4dc3-a774-d8ef515c46b1 container: volumebackups, to volume bd776a72-66c0-43a6-8d48-39f541634226, backup: bd79c0a3-5a76-4dc3-a774-d8ef515c46b1. restore /opt/stack/new/cinder/cinder/backup/chunkeddriver.py:648
Then the data restore finished:
2015-08-10 21:36:08.861 DEBUG cinder.backup.chunkeddriver [req-3629e4d4-a3b0-4f6f-9f30-7e6140e95e78 None] v1 volume backup restore of bd79c0a3-5a76-4dc3-a774-d8ef515c46b1 finished. _restore_v1 /opt/stack/new/cinder/cinder/backup/chunkeddriver.py:634
Then it started to restore the metadata:
2015-08-10 21:36:08.863 DEBUG cinder.backup.driver [req-3629e4d4-a3b0-4f6f-9f30-7e6140e95e78 None] Restoring volume base metadata _restore_vol_base_meta /opt/stack/new/cinder/cinder/backup/driver.py:168
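For context, the restore happens in two phases: the backed-up data chunks are written onto the volume first, and then the volume's base metadata is written back through the cinder database, which is why a MySQL connection problem can break this last step. Below is a rough, stand-alone sketch of that flow; all of the names are made up here, the real code being _restore_v1() in chunkeddriver.py and _restore_vol_base_meta() in driver.py:

import io

def restore_backup(chunks, metadata, volume_file, db_update):
    # Phase 1: data restore. Stream every backed-up chunk onto the volume;
    # this is the part that logged "v1 volume backup restore of ... finished."
    for chunk in chunks:
        volume_file.write(chunk)
    # Phase 2: metadata restore ("Restoring volume base metadata"). The
    # fields go back into the volume record through the database, so a lost
    # MySQL connection stalls the restore here, before the final
    # "Restore backup finished" message is ever logged.
    db_update(metadata)

# Toy usage: an in-memory "volume" and print() standing in for the DB update.
volume = io.BytesIO()
restore_backup([b"chunk-1", b"chunk-2"], {"display_name": "restored-vol"},
               volume, db_update=print)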
If the restore had finished, we would see "Restore backup finished, backup bd79c0a3-5a76-4dc3-a774-d8ef515c46b1 restored to volume: bd776a72-66c0-43a6-8d48-39f541634226." in the log. However, that message never appeared.
Instead, there is a "Lost connection to MySQL" failure at this time:
2015-08-10 21:36:37.420 ERROR oslo_messaging.rpc.dispatcher [req-1abbfe4f-97f2-4184-b598-522a1e0e2c19 None] Exception during message handling: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
Then after a few minutes, tempest timed out the restore request and sent a delete backup request:
2015-08-10 21:39:55.018 INFO cinder.backup.manager [req-c6888349-c6f3-4a1d-842f-a34df2161aa6 tempest-VolumesBackupsV2Test-998958099] Delete backup started, backup: bd79c0a3-5a76-4dc3-a774-d8ef515c46b1.
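For context, tempest decides the restore has failed by polling the volume status until a deadline passes; here is a minimal, stand-alone sketch of that poll-with-timeout pattern (the function name, default timeout, and statuses below are illustrative, not the actual tempest waiter):

import time

def wait_for_volume_status(get_status, wanted="available",
                           timeout=300, interval=1):
    # Poll until the volume reports the wanted status or the deadline passes.
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status == wanted:
            return status
        if status == "error":
            raise RuntimeError("volume went to error state")
        time.sleep(interval)
    # On timeout the test fails and its cleanup issues the backup delete,
    # which is the "Delete backup started" request seen above at 21:39:55.
    raise RuntimeError("volume never reached %r within %ss" % (wanted, timeout))

# Example run: a fake status source that becomes "available" on the third poll.
statuses = iter(["restoring-backup", "restoring-backup", "available"])
wait_for_volume_status(lambda: next(statuses), interval=0)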
In the cinder volume log (http://logs.openstack.org/82/202782/15/gate/gate-tempest-dsvm-neutron-full/372a95f/logs/screen-c-vol.txt.gz), there is a "Connection pool is full, discarding connection" warning around the same time, which matches the "Lost connection to MySQL" error in the cinder backup log:
2015-08-10 21:36:52.474 WARNING requests.packages.urllib3.connectionpool [req-bd798c22-6983-45e4-b0e7-aa5a8e1dffe4 tempest-VolumesV2ActionsTest-1525165338] Connection pool is full, discarding connection: 127.0.0.1
According to the MySQL documentation (https://dev.mysql.com/doc/refman/5.0/en/error-lost-connection.html), a lost connection usually indicates network problems:
Usually it indicates network connectivity trouble and you should check the condition of your network if this error occurs frequently. If the error message includes “during query,” this is probably the case you are experiencing.
So to make the restore tests more stable, we could look at the suggestions on that page, or increase max_pool_size and max_overflow in cinder.conf. By default, max_pool_size is 5 and max_overflow is 10. For example:
# Changing the following options max_pool_size and max_overflow
# to allow more db connections
[database]
# Maximum number of SQL connections to keep open in a pool
# (integer value)
max_pool_size=20
# If set, use this value for max_overflow with sqlalchemy
# (integer value)
max_overflow=30
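For reference on what these options control: they are handed through to SQLAlchemy's connection pool (max_pool_size maps to SQLAlchemy's pool_size), so the service can keep up to pool_size idle connections and open up to max_overflow extra ones under load. A rough stand-alone sketch of the equivalent direct SQLAlchemy call, with a placeholder connection URL and credentials (requires pymysql installed):

from sqlalchemy import create_engine

# Placeholder URL. With pool_size=20 and max_overflow=30 the engine opens
# at most 20 + 30 = 50 connections, instead of the default 5 + 10 = 15.
engine = create_engine(
    "mysql+pymysql://cinder:secret@127.0.0.1/cinder",
    pool_size=20,      # cinder.conf [database] max_pool_size
    max_overflow=30,   # cinder.conf [database] max_overflow
    pool_timeout=30,   # seconds to wait for a free connection (SQLAlchemy default)
)

With the defaults, a burst of concurrent requests can exhaust the pool, so raising both limits gives the backup and volume services more headroom.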