Comment 0 for bug 1496409

Revision history for this message
Jorge Niedbalski (niedbalski) wrote : Race condition in mnesia_locker after nodedown

[Description]

On a 3 nodes rabbitMQ cluster after a failure on one of the nodes (kernel crash) the following succession
of events has been registered:

===
This initial crash report

$ ack-grep -i mnesia_locker | wc -l
16

60-=CRASH REPORT==== 7-Sep-2015::14:48:45 ===
61- crasher:
--
66: {mnesia_locker,'rabbit@10-10-34-2',granted}}
67- in function gen_server2:terminate/3 (src/gen_server2.erl, line 1045)
68- ancestors: [worker_pool_sup,rabbit_sup,<0.110.0>]
69- messages: []
70- links: [<0.114.0>]
71- dictionary: [{{xtype_to_module,direct},rabbit_exchange_type_direct},
72- {{xtype_to_module,topic},rabbit_exchange_type_topic},
73- {random_seed,{5643,25632,27953}},
74- {{xtype_to_module,fanout},rabbit_exchange_type_fanout},
75- {worker_pool_worker,true},
76- {fhc_age_tree,{0,nil}},
--
88: Reason: {unexpected_info,{mnesia_locker,'rabbit@10-10-34-2',granted}}
89- Offender: [{pid,<0.142.0>},
90- {name,25},
91- {mfargs,{worker_pool_worker,start_link,[25]}},
92- {restart_type,transient},
93- {shutdown,4294967295},
94- {child_type,worker}]

sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1.log
5166:Mnesia('rabbit@10-10-34-1'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@10-10-34-2'}

sosreport-svz-op-fdc-os-3.00087142-20150909140820/var/log/rabbitmq/rabbit@10-10-34-3.log.1
545:Mnesia('rabbit@10-10-34-3'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@10-10-34-1'}

After this I started seeing this traces:

sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log:572:=CRASH REPORT==== 7-Sep-2015::14:48:45 ===
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log:573: crasher:
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-574- initial call: gen:init_it/6
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-575- pid: <0.2346.189>
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-576- registered_name: []
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-577- exception exit: {{unexpected_cast,{next_job_from,<0.4960.181>}},
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-578- {gen_server2,call,
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-579- [<0.22014.255>,
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-580- {submit,#Fun<rabbit_misc.6.25154013>,<0.2862.189>},
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-581- infinity]}}
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-582- in function gen_server2:terminate/3 (src/gen_server2.erl, line 1045)
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-583- ancestors: [rabbit_mirror_queue_slave_sup,rabbit_sup,<0.110.0>]
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-584- messages: []
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-585- links: [<0.259.0>]
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-586- dictionary: [{guid,{{3735536962,2437967587,3023977752,60868675},0}}]
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-587- trap_exit: true

Related Bugs:

- https://bugs.launchpad.net/mos/+bug/1401948
- https://github.com/erlang/otp/compare/maint...dgud:dgud%3Bmnesia%3Bsticky-race%3BOTP-11375

Possible related upstream commit:

- http://hg.rabbitmq.com/rabbitmq-server/rev/5a63c9e273cc