Database deadlock in ceilometer-collector when ceilometer-expirer is running

Bug #1431986 reported by Ashwin Agate
Affects      Status        Importance  Assigned to    Milestone
Ceilometer   Fix Released  Low         ZhiQiang Fan
Kilo         Fix Released  Undecided   Unassigned

Bug Description

Seeing a database deadlock error in the ceilometer-collector logs when ceilometer-expirer runs at the same time.

See the two errors, 1.) and 2.), below.

1.) Deadlock error:
    Deadlock found when trying to get lock; try restarting transaction

2015-03-12 04:30:03.875 26730 ERROR ceilometer.dispatcher.database [-] Failed to record metering data: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO sample (meter_id, resource_id, volume, timestamp, recorded_at, message_signature, message_id) VALUES (%s, %s, %s, %s, %s, %s, %s)' (23L, 30218L, 1, Decimal('1426134602.661871'), Decimal('1426134602.737072'), '7693ba4aeb7af6583a90381ad8f155827a4d78dad8fe6cb6fa8b18b6ca7a101a', '72ad5a1d-c870-11e4-90cb-00fc1dc4bd2e')
2015-03-12 04:30:03.875 26730 TRACE ceilometer.dispatcher.database Traceback (most recent call last):
2015-03-12 04:30:03.875 26730 TRACE ceilometer.dispatcher.database File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/ceilometer/dispatcher/database.py", line 92, in record_metering_data
2015-03-12 04:30:03.875 26730 TRACE ceilometer.dispatcher.database self.meter_conn.record_metering_data(meter)
2015-03-12 04:30:03.875 26730 TRACE ceilometer.dispatcher.database File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/ceilometer/storage/impl_sqlalchemy.py", line 345, in record_metering_data

[...]

2015-03-12 04:30:03.875 26730 TRACE ceilometer.dispatcher.database self.errorhandler(self, exc, value)
2015-03-12 04:30:03.875 26730 TRACE ceilometer.dispatcher.database File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
2015-03-12 04:30:03.875 26730 TRACE ceilometer.dispatcher.database raise errorclass, errorvalue
2015-03-12 04:30:03.875 26730 TRACE ceilometer.dispatcher.database DBDeadlock: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO sample (meter_id, resource_id, volume, timestamp, recorded_at, message_signature, message_id) VALUES (%s, %s, %s, %s, %s, %s, %s)' (23L, 30218L, 1, Decimal('1426134602.661871'), Decimal('1426134602.737072'), '7693ba4aeb7af6583a90381ad8f155827a4d78dad8fe6cb6fa8b18b6ca7a101a', '72ad5a1d-c870-11e4-90cb-00fc1dc4bd2e')

2.) Savepoint Error
This is followed by a savepoint error ('SAVEPOINT sa_savepoint_2 does not exist'):

2015-03-12 04:30:04.185 26730 ERROR ceilometer.dispatcher.database [-] Failed to record metering data: (OperationalError) (1305, 'SAVEPOINT sa_savepoint_2 does not exist') 'ROLLBACK TO SAVEPOINT sa_savepoint_2' ()
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database Traceback (most recent call last):
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/ceilometer/dispatcher/database.py", line 92, in record_metering_data
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database self.meter_conn.record_metering_data(meter)
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/ceilometer/storage/impl_sqlalchemy.py", line 338, in record_metering_data
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database data['resource_metadata'])
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/ceilometer/storage/impl_sqlalchemy.py", line 311, in _create_resource
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database meta_map[_model])

[...]

2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database raise errorclass, errorvalue
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database OperationalError: (OperationalError) (1305, 'SAVEPOINT sa_savepoint_2 does not exist') 'ROLLBACK TO SAVEPOINT sa_savepoint_2' ()
2015-03-12 04:30:04.185 26730 TRACE ceilometer.dispatcher.database

It looks like ceilometer-expirer is causing the database deadlock.

Possible solution: change the expirer to delete from one table at a time and commit after each delete, instead of deleting from the sample, resource, and meter tables in a single transaction.
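A minimal sketch of that idea, using the stdlib sqlite3 module rather than Ceilometer's actual SQLAlchemy code (the table names mirror the bug's schema, but the columns and logic are simplified for illustration):

```python
import sqlite3

# In-memory stand-in for the metering DB; table names mirror Ceilometer's
# schema, but columns are reduced to what the sketch needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, meter_id INT,
                         resource_id INT, timestamp REAL);
    CREATE TABLE meter (id INTEGER PRIMARY KEY);
    CREATE TABLE resource (internal_id INTEGER PRIMARY KEY);
    CREATE TABLE metadata_text (id INTEGER PRIMARY KEY);
""")

def clear_expired_metering_data(conn, end_ts):
    """Delete expired rows one table per transaction, committing after each
    delete so that no single transaction holds locks on all tables at once."""
    conn.execute("DELETE FROM sample WHERE timestamp < ?", (end_ts,))
    conn.commit()  # release sample locks before touching meter
    conn.execute("DELETE FROM meter WHERE id NOT IN "
                 "(SELECT meter_id FROM sample)")
    conn.commit()
    conn.execute("DELETE FROM metadata_text WHERE id NOT IN "
                 "(SELECT resource_id FROM sample)")
    conn.commit()
    conn.execute("DELETE FROM resource WHERE internal_id NOT IN "
                 "(SELECT resource_id FROM sample)")
    conn.commit()

# Two samples: one expired (timestamp 100), one still live (timestamp 200).
conn.executemany("INSERT INTO sample VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 100.0), (2, 2, 2, 200.0)])
conn.executemany("INSERT INTO meter VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO resource VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO metadata_text VALUES (?)", [(1,), (2,)])
clear_expired_metering_data(conn, 150.0)
remaining = {t: conn.execute("SELECT COUNT(*) FROM %s" % t).fetchone()[0]
             for t in ("sample", "meter", "resource", "metadata_text")}
```

Each table's delete runs in its own short transaction, so the collector's inserts only ever contend with locks on one table at a time instead of a single long transaction spanning all of them.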

description: updated
summary: - ceilometer expirer causes database deadlock when collector is running
+ Database deadlock in ceilometer-collector when ceilometer-expirer is
+ running
Revision history for this message
ZhiQiang Fan (aji-zqfan) wrote :

can you try modifying the clear_expired_metering_data method (https://github.com/openstack/ceilometer/blob/b5cfcb620cf8717e9eeae20c62561b5956ac94a1/ceilometer/storage/impl_sqlalchemy.py#L346 ), changing all synchronize_session=False to synchronize_session="fetch", and verify it?
Thanks
Thanks

Revision history for this message
Ashwin Agate (ashwin-agate) wrote :

Hi ZhiQiang,

Tried setting synchronize_session='fetch' but it did not make any difference; I still see the database deadlock.

Turned on innodb_print_all_deadlocks in MySQL.

Here is the deadlock log

150318 1:36:04
*** (1) TRANSACTION:
TRANSACTION 6D40E8, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 8 lock struct(s), heap size 1248, 3 row lock(s), undo log entries 45
MySQL thread id 189654, OS thread handle 0x7fa8c5aac700, query id 24912403 10.22.152.48 ceilometer update
INSERT INTO sample (meter_id, resource_id, volume, timestamp, recorded_at, message_signature, message_id) VALUES (13, 23152, 1, '1426642564.567148', '1426642564.623249', 'b73bd658f660e85614d32906aef8c0ecde48c1f0c9fe3ec0f1a826d617dd4bdf', '245609f2-cd0f-11e4-a21b-78e7d1233465\
')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 119 page no 3 n bits 80 index `PRIMARY` of table `ceilometer`.`meter` trx id 6D40E8 lock mode S locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 6D3FB7, ACTIVE 1 sec fetching rows
mysql tables in use 3, locked 3
1913 lock struct(s), heap size 227768, 229943 row lock(s)
MySQL thread id 190673, OS thread handle 0x7fa8c534f700, query id 24911810 10.xx.xx.xx ceilometer Sending data
DELETE FROM metadata_text WHERE metadata_text.id IN (SELECT resource.internal_id
FROM resource
WHERE NOT (EXISTS (SELECT 1
FROM sample
WHERE resource.internal_id = sample.resource_id)))
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 119 page no 3 n bits 80 index `PRIMARY` of table `ceilometer`.`meter` trx id 6D3FB7 lock_mode X
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 159 page no 1633 n bits 328 index `PRIMARY` of table `ceilometer`.`metadata_text` trx id 6D3FB7 lock_mode X waiting
*** WE ROLL BACK TRANSACTION (1)

The 1st transaction ("INSERT INTO sample") is waiting for a lock on the ceilometer.meter PRIMARY index.

The 2nd transaction ("DELETE FROM metadata_text") already holds locks on several ceilometer.meter PRIMARY index rows AND is waiting for a lock on ceilometer.metadata_text.

The 1st transaction gets rolled back.

What is not clear to me from the logs is why the 2nd transaction does not simply keep waiting for the 1st to complete rather than treating it as a deadlock, unless the 1st transaction grabs some other locks that are not printed in the logs.

Anyway, I think that changing the expirer to delete in steps, each in its own transaction, would avoid locking MySQL indexes for a long time, reducing the possibility of this happening.

Another related problem we noticed is that the insert into the sample table is not retried when it encounters a rollback exception, so we lose the data whose transaction was rolled back.
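One generic way to avoid losing samples (a sketch, not Ceilometer's actual code; DBDeadlock here is a stand-in for the driver's error) is to wrap the insert in a small retry loop, since MySQL's advice for error 1213 is simply to restart the transaction:

```python
import time

class DBDeadlock(Exception):
    """Stand-in for the deadlock error raised by the DB layer (MySQL 1213)."""

def retry_on_deadlock(fn, retries=3, delay=0.01):
    """Re-run fn() when a deadlock rollback occurs, with a brief backoff;
    re-raise if the deadlock persists after the last attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except DBDeadlock:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # back off before retrying

# Demo: an insert that deadlocks once, then succeeds on the retry.
attempts = {"n": 0}
def flaky_insert():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise DBDeadlock("1213: Deadlock found when trying to get lock")
    return "inserted"

result = retry_on_deadlock(flaky_insert)
```

With such a wrapper, a sample whose transaction is chosen as the deadlock victim is re-inserted instead of silently dropped.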

Changed in ceilometer:
assignee: nobody → Ashwin Agate (ashwin-agate)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/165566

Changed in ceilometer:
status: New → In Progress
Revision history for this message
Ashwin Agate (ashwin-agate) wrote :

Tried the proposed fix of changing ceilometer-expirer to delete from all the tables in separate transactions on our test system. Did not see any more deadlock errors in the collector logs. The load test has been running for over 7 hours with more than 300k samples inserted.

Revision history for this message
gordon chung (chungg) wrote :

marking low priority, as you can always run the expirer more frequently

Changed in ceilometer:
importance: Undecided → Low
Revision history for this message
ZhiQiang Fan (aji-zqfan) wrote :

I don't know how frequently ceilometer-expirer should be run; each time we run it, the DB performance will be affected.

I have tested this issue with devstack on the master branch; it shows that even when ceilometer-expirer finishes in less than 1 second, the deadlock will happen as long as ceilometer-collector is receiving samples at a high rate.

It is because of the way we insert samples and the way we clean the tables.

When we clean the tables:
1.1) delete expired samples
1.2) delete related meters
1.3) delete related resources' metadata
1.4) delete related resources

When we insert a sample:
2.1) check if the related meter exists; if not, insert a new meter
2.2) check if the related resource exists; if not, insert new metadata rows and a new resource
2.3) insert into the sample table
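The insert path above can be sketched as a get-or-create sequence (illustrative sqlite3 code with simplified tables, not the real impl_sqlalchemy logic, and with the metadata insert of step 2.2 omitted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meter (id INTEGER PRIMARY KEY AUTOINCREMENT,
                        name TEXT UNIQUE);
    CREATE TABLE resource (internal_id INTEGER PRIMARY KEY AUTOINCREMENT,
                           resource_id TEXT UNIQUE);
    CREATE TABLE sample (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         meter_id INT, resource_id INT, volume REAL);
""")

def record_sample(conn, meter_name, resource_uuid, volume):
    # 2.1) check if the related meter exists; if not, insert a new meter
    row = conn.execute("SELECT id FROM meter WHERE name = ?",
                       (meter_name,)).fetchone()
    meter_id = row[0] if row else conn.execute(
        "INSERT INTO meter (name) VALUES (?)", (meter_name,)).lastrowid
    # 2.2) check if the related resource exists; if not, insert a new one
    row = conn.execute("SELECT internal_id FROM resource WHERE resource_id = ?",
                       (resource_uuid,)).fetchone()
    res_id = row[0] if row else conn.execute(
        "INSERT INTO resource (resource_id) VALUES (?)",
        (resource_uuid,)).lastrowid
    # 2.3) insert into the sample table
    conn.execute("INSERT INTO sample (meter_id, resource_id, volume) "
                 "VALUES (?, ?, ?)", (meter_id, res_id, volume))
    conn.commit()

record_sample(conn, "cpu", "res-1", 1.0)
record_sample(conn, "cpu", "res-1", 2.0)  # meter and resource already exist
n_meters = conn.execute("SELECT COUNT(*) FROM meter").fetchone()[0]
n_resources = conn.execute("SELECT COUNT(*) FROM resource").fetchone()[0]
n_samples = conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
```

The key point for the deadlock is the lock order: the insert path touches metadata/resource (2.2) before meter-referencing sample rows (2.3), the reverse of the expirer's ordering.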

So in process 1, while ceilometer-expirer is running, some rows of the meter table are locked, and it then requires a lock on the metadata table.
Then in process 2, ceilometer-collector receives a new sample for an existing meter but a new resource: step 2.2 locks the metadata table, and step 2.3 requires a lock on the meter table, because there is a meter id foreign key.

So in such a case the deadlock happens no matter how quickly ceilometer-expirer runs (and it actually cannot be that quick in a real environment): in the time gap from when ceilometer-expirer starts to lock the rows of the meter table until it actually gets the lock on the metadata tables, all samples on a new resource but with an existing meter (whether or not that meter is going to be deleted) will deadlock.

To solve it, we can separate the whole long session in ceilometer-expirer, especially the part that cleans the meter table. The side effect is that in some cases, when we insert a sample, the meter id exists at step 2.1 but may already have been deleted by step 2.3; an exception will then be raised, but a retry will solve it. This case happens less often than the DB deadlock IMHO, because it requires a meter that has not had a new sample for a long time, where the new sample arrives and hits the clean-meter-table window.

Changed in ceilometer:
assignee: Ashwin Agate (ashwin-agate) → ZhiQiang Fan (aji-zqfan)
gordon chung (chungg)
Changed in ceilometer:
milestone: none → liberty-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/165566
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=7bdd0bf2c5ea97f5dd500b0e4fdaaadf6ab00889
Submitter: Jenkins
Branch: master

commit 7bdd0bf2c5ea97f5dd500b0e4fdaaadf6ab00889
Author: Ashwin Agate <email address hidden>
Date: Wed Mar 18 18:55:59 2015 +0000

    Prevent ceilometer expirer from causing deadlocks

    Prevents deadlocks from occurring by deleting
    from each of sample, meter, resource and metadata_text
    metadata_bigint, metadata_bool and metadata_float tables
    in separate transactions

    Closes-Bug: #1431986
    Change-Id: I8670af172796e116e9c17a964063e6d86264efc6

Changed in ceilometer:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ceilometer:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/228103

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (stable/kilo)

Reviewed: https://review.openstack.org/228103
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=10305c0eeba67c0605857d426aedde883996d408
Submitter: Jenkins
Branch: stable/kilo

commit 10305c0eeba67c0605857d426aedde883996d408
Author: Ashwin Agate <email address hidden>
Date: Wed Mar 18 18:55:59 2015 +0000

    Prevent ceilometer expirer from causing deadlocks

    Prevents deadlocks from occurring by deleting
    from each of sample, meter, resource and metadata_text
    metadata_bigint, metadata_bool and metadata_float tables
    in separate transactions

    Closes-Bug: #1431986
    Change-Id: I8670af172796e116e9c17a964063e6d86264efc6
    (cherry picked from commit 7bdd0bf2c5ea97f5dd500b0e4fdaaadf6ab00889)

tags: added: in-stable-kilo
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: liberty-rc1 → 5.0.0