I don't know how frequently ceilometer-expirer should be run, but each time we run it the DB performance is affected.
I have tested this issue with devstack on the master branch. Even when ceilometer-expirer finishes in less than 1 second, the deadlock happens as long as ceilometer-collector is receiving samples at a high rate.
It is caused by the way we insert samples and the way we clean the tables.
When we clean the tables (sketched in code after this list):
1.1) delete expired samples
1.2) delete the related meters (those left with no samples)
1.3) delete the related resources' metadata
1.4) delete the related resources
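For illustration, here is a rough sketch of that cleanup sequence in SQLAlchemy, run as one long transaction the way it is done today. The connection URL and the table/column names (sample, meter, metadata_text, resource, internal_id, ...) are assumptions for the example, not the exact ceilometer schema.

from datetime import datetime, timedelta

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://ceilometer:secret@localhost/ceilometer")


def clear_expired_data(ttl_seconds):
    cutoff = datetime.utcnow() - timedelta(seconds=ttl_seconds)
    # One connection, one transaction: every delete below keeps its row locks
    # until the final commit.
    with engine.begin() as conn:
        # 1.1) delete expired samples
        conn.execute(text("DELETE FROM sample WHERE timestamp < :cutoff"),
                     {"cutoff": cutoff})
        # 1.2) delete meters that are left with no samples
        conn.execute(text("DELETE FROM meter "
                          "WHERE id NOT IN (SELECT meter_id FROM sample)"))
        # 1.3) delete metadata of resources that are left with no samples
        conn.execute(text("DELETE FROM metadata_text "
                          "WHERE resource_id NOT IN (SELECT resource_id FROM sample)"))
        # 1.4) delete those resources
        conn.execute(text("DELETE FROM resource "
                          "WHERE internal_id NOT IN (SELECT resource_id FROM sample)"))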
When we insert a sample (also sketched after this list):
2.1) check whether the related meter exists; if not, insert a new meter
2.2) check whether the related resource exists; if not, insert the new metadata rows and insert a new resource
2.3) insert the row into the sample table
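And a matching sketch of the insert path, using the same assumed schema. What matters for the bug is the order of operations, not the exact SQL: the resource and metadata rows are written in 2.2 before the sample insert in 2.3 takes a shared lock on the referenced meter row for the foreign-key check.

from sqlalchemy import text


def record_sample(conn, s):
    # 2.1) check whether the meter exists; if not, insert it
    meter_id = conn.execute(
        text("SELECT id FROM meter WHERE name = :n"),
        {"n": s["counter_name"]}).scalar()
    if meter_id is None:
        meter_id = conn.execute(
            text("INSERT INTO meter (name, type, unit) VALUES (:n, :t, :u)"),
            {"n": s["counter_name"], "t": s["counter_type"],
             "u": s["counter_unit"]}).lastrowid

    # 2.2) check whether the resource exists; if not, insert it and its metadata
    exists = conn.execute(
        text("SELECT 1 FROM resource WHERE internal_id = :r"),
        {"r": s["resource_id"]}).scalar()
    if exists is None:
        conn.execute(text("INSERT INTO resource (internal_id) VALUES (:r)"),
                     {"r": s["resource_id"]})
        for key, value in s.get("resource_metadata", {}).items():
            conn.execute(
                text("INSERT INTO metadata_text (resource_id, meta_key, value) "
                     "VALUES (:r, :k, :v)"),
                {"r": s["resource_id"], "k": key, "v": str(value)})

    # 2.3) insert the sample; the meter_id foreign key makes InnoDB take a
    #      shared lock on the referenced meter row at this point
    conn.execute(
        text("INSERT INTO sample (meter_id, resource_id, volume, timestamp) "
             "VALUES (:m, :r, :v, :ts)"),
        {"m": meter_id, "r": s["resource_id"], "v": s["counter_volume"],
         "ts": s["timestamp"]})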
So while process 1 (ceilometer-expirer) is running, some rows of the meter table are locked, and then it requests locks on the metadata table.
Meanwhile process 2 (ceilometer-collector) has received a new sample for an existing meter but a new resource: step 2.2 locks the metadata table, and step 2.3 then requires a lock on the meter table because of the meter id foreign key.
So in this case the deadlock happens no matter how quickly ceilometer-expirer runs (and it cannot be that quick in a real environment): in the window between ceilometer-expirer locking the rows of the meter table and it actually getting the locks on the metadata tables, every sample for a new resource but an existing meter (whether or not that meter is going to be deleted) will deadlock.
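To make the opposite lock order concrete, here is a rough standalone reproduction under the same assumed schema. It is only a sketch (whether InnoDB actually reports a deadlock depends on the isolation level and the query plans), but the interleaving matches the description above: one transaction plays ceilometer-expirer (meter rows first, metadata second), the other plays ceilometer-collector (metadata first, then a sample insert that needs the meter row through its foreign key), and one of the two gets rolled back with MySQL error 1213.

import threading
import time

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine("mysql+pymysql://ceilometer:secret@localhost/ceilometer")


def expirer_txn():
    with engine.begin() as conn:
        # step 1.2: takes exclusive locks on rows of the meter table
        conn.execute(text("DELETE FROM meter "
                          "WHERE id NOT IN (SELECT meter_id FROM sample)"))
        time.sleep(2)   # widen the window for the demo
        # step 1.3: now needs locks on the metadata table
        conn.execute(text("DELETE FROM metadata_text "
                          "WHERE resource_id NOT IN (SELECT resource_id FROM sample)"))


def collector_txn():
    time.sleep(1)       # start after the expirer has locked the meter rows
    try:
        with engine.begin() as conn:
            # step 2.2: new resource, so the resource/metadata rows get locked first
            conn.execute(text("INSERT INTO resource (internal_id) VALUES ('r-new')"))
            conn.execute(text("INSERT INTO metadata_text (resource_id, meta_key, value) "
                              "VALUES ('r-new', 'k', 'v')"))
            # step 2.3: the FK check needs a shared lock on meter row 1,
            # which the expirer still holds
            conn.execute(text("INSERT INTO sample (meter_id, resource_id, volume, timestamp) "
                              "VALUES (1, 'r-new', 1.0, NOW())"))
    except OperationalError as exc:
        print("collector transaction failed: %s" % exc)  # typically error 1213


a = threading.Thread(target=expirer_txn)
b = threading.Thread(target=collector_txn)
a.start(); b.start()
a.join(); b.join()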
To solve it, we can split the single long session in ceilometer-expirer into separate transactions, especially the part that cleans the meter table. The side effect is that in some cases, when we insert a sample, the meter id found in 2.1 may already have been deleted by the time we reach 2.3; an exception will then be raised, but a retry will solve it. IMHO this happens less often than the deadlock, because it only occurs when a meter has had no new samples for a long time and the next sample arrives exactly within the meter-cleanup window.
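A minimal sketch of what I have in mind, again under the assumed schema from the sketches above: run each cleanup step in its own short transaction, and let the collector retry once when the sample insert hits the "meter was just deleted" window (which surfaces as an IntegrityError on the meter_id foreign key).

from sqlalchemy import text
from sqlalchemy.exc import IntegrityError


def clear_expired_data_stepwise(engine, cutoff):
    steps = [
        "DELETE FROM sample WHERE timestamp < :cutoff",
        "DELETE FROM meter WHERE id NOT IN (SELECT meter_id FROM sample)",
        "DELETE FROM metadata_text "
        "WHERE resource_id NOT IN (SELECT resource_id FROM sample)",
        "DELETE FROM resource "
        "WHERE internal_id NOT IN (SELECT resource_id FROM sample)",
    ]
    for stmt in steps:
        params = {"cutoff": cutoff} if ":cutoff" in stmt else {}
        # one short transaction per step: the locks taken by each delete are
        # released at its commit, so the collector never waits across the
        # whole cleanup and the opposite-order lock cycle cannot form
        with engine.begin() as conn:
            conn.execute(text(stmt), params)


def record_sample_with_retry(engine, sample_row, retries=1):
    for attempt in range(retries + 1):
        try:
            with engine.begin() as conn:
                record_sample(conn, sample_row)   # the insert path sketched above
            return
        except IntegrityError:
            # the meter found in 2.1 was deleted before 2.3; rerunning the
            # whole path recreates it, so the retry succeeds
            if attempt == retries:
                raise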