MMM could not fail over when a server hangs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
mysql-mmm | New | Undecided | Unassigned |
Bug Description
Hi, I'm testing MMM version 2.2.1.
I simulated a server hang by consuming all memory (or by forking processes indefinitely).
However, MMM could not fail over. Moreover, the MMM monitor froze, so I could not do anything.
@ MMM Monitor
# ./bin/mmm_control show
... no response for more than 20 minutes ...
@ MMM Agent (debug mode)
2010/06/07 12:08:46 DEBUG Listener: Connect!
2010/06/07 12:08:46 DEBUG Daemon: Command = 'SET_STATUS|
2010/06/07 12:08:46 DEBUG Received Command SET_STATUS|
2010/06/07 12:08:46 DEBUG Fetching uptime from /proc/uptime
2010/06/07 12:08:46 DEBUG Uptime is 20283763.54
2010/06/07 12:08:46 DEBUG Daemon: Answer = 'OK: Status applied successfully!'
2010/06/07 12:08:46 DEBUG Listener: Disconnect!
2010/06/07 12:08:46 DEBUG Executing /home1/
2010/06/07 12:08:46 DEBUG Listener: Waiting for connection...
2010/06/07 12:08:49 DEBUG Listener: Waiting for connection...
2010/06/07 12:08:52 DEBUG Listener: Waiting for connection...
2010/06/07 12:08:55 DEBUG Listener: Waiting for connection...
2010/06/07 12:08:58 DEBUG Listener: Waiting for connection...
2010/06/07 12:09:01 DEBUG Listener: Waiting for connection...
2010/06/07 12:09:04 DEBUG Listener: Waiting for connection...
2010/06/07 12:09:07 DEBUG Listener: Waiting for connection...
2010/06/07 12:09:10 DEBUG Listener: Waiting for connection...
...
...
This is my test environment.
* tdev01 (10.25.
* tdev04 (10.25.131.54) -> MMM Monitor
<role writer>
hosts tdev01,tdev02
ips 10.25.131.200
mode exclusive
</role>
<role reader>
hosts tdev01,
ips 10.25.131.
mode balanced
</role>
What should I do in this situation?
Let me know if you need more information.
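One possible explanation for the frozen `mmm_control show` (not confirmed in this report) is a blocking call to a host that still accepts connections but never answers. As an illustration only, here is a minimal sketch of bounding a blocking call in Perl with `alarm`. This is not MMM code; the helper name `with_timeout` and the simulated agent call are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MMM: run $code, give up after $seconds.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        # The handler aborts the eval block when the alarm fires.
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($seconds);
        $result = $code->();
        alarm(0);
        1;
    };
    my $err = $@;
    alarm(0);    # make sure no alarm is left pending
    return $result if $ok;
    die $err unless $err eq "timeout\n";    # re-throw unrelated errors
    return;                                 # undef signals a timeout
}

# Stand-in for an agent that accepts the connection but never answers:
# the 4-argument select simply blocks for 5 seconds.
my $answer = with_timeout(1, sub { select(undef, undef, undef, 5); 'OK' });
print defined $answer ? "agent answered: $answer\n" : "agent timed out\n";
```

With a guard like this the caller gets control back after the timeout instead of waiting indefinitely. Whether the monitor or agent code paths involved here lack such a guard is an assumption that would need to be checked against the MMM source.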
I have hit this problem twice. I think it is related to the fact that after a crash, the system can still be responsive.
In source code (Monitor.pm):
if ($state eq 'ONLINE') {
# ONLINE -> HARD_OFFLINE
unless ($ping && $mysql) {
Shouldn't it be OR instead of AND?
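For reference, a minimal sketch of how that condition evaluates. In Perl, `unless ($ping && $mysql)` is equivalent to `if (!$ping || !$mysql)`, so the block already runs when either check fails. The helper name `goes_hard_offline` below is hypothetical and only mirrors the quoted condition; it is not taken from Monitor.pm.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the quoted condition; not MMM code.
sub goes_hard_offline {
    my ($ping, $mysql) = @_;
    my $offline = 0;
    # Same shape as the snippet from Monitor.pm.
    unless ($ping && $mysql) {
        $offline = 1;
    }
    return $offline;
}

# Print the full truth table for the two checks.
for my $ping (1, 0) {
    for my $mysql (1, 0) {
        printf "ping=%d mysql=%d -> %s\n", $ping, $mysql,
            goes_hard_offline($ping, $mysql) ? 'HARD_OFFLINE' : 'ONLINE';
    }
}
```

Only the combination where both checks pass stays ONLINE. This matches a transition logged as "(ping: not OK, mysql: OK)", where a failed ping alone is enough to trigger HARD_OFFLINE.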
And one more thing:
sub move_role()
# Assign role to new host
$roles->set_role($role, $ip, $host);
# Notify old host (if is_active_master_role($role) this will make the host non writable)
$monitor->send_agent_status($old_owner);
# Notify slaves (this will make them switch the master)
$monitor->notify_slaves($host) if ($roles->is_active_master_role($role));
# Notify new host (if is_active_master_role($role) this will make the host writable)
$monitor->send_agent_status($host);
and:
# Finally send command
my $ret = $agent->cmd_set_status($master);
unless ($ret) {
    # If ping is down, nothing will be send to agent. So this doesn't indicate that the agent is down.
    my $checks = $self->checks_status;
    if ($checks->ping($host) && !$agent->agent_down()) {
        FATAL "Can't reach agent on host '$host'";
        $agent->agent_down(1);
    }
}
elsif ($agent->agent_down) {
    FATAL "Agent on host '$host' is reachable again";
    $agent->agent_down(0);
}
return $ret;
}
This is another possible cause (the comments in the code above are self-explanatory).
Logs from the problematic period:
System arch:
- MONITORING SERVER
- 2 machines in master-master pair (db12/db13)
- additional slave (db14)
MONITORING SERVER:
2010/08/13 13:16:35 FATAL Can't reach agent on host 'db12'
2010/08/13 13:20:27 ERROR Check 'ping' on 'db12' has failed for 242 seconds! Message: ERROR: Could not ping 10.16.214.132
2010/08/13 13:20:32 FATAL State of host 'db12' changed from ONLINE to HARD_OFFLINE (ping: not OK, mysql: OK)
2010/08/13 13:20:32 INFO Removing all roles from host 'db12':
2010/08/13 13:20:32 INFO Removed role 'writer(10.17.88.140)' from host 'db12'
2010/08/13 13:20:32 INFO Orphaned role 'writer(10.17.88.140)' has been assigned to 'db13'
..
(nothing here, waiting for db12 to become online)
2010/08/13 14:16:27 FATAL Agent on host 'db12' is reachable again
db12:
--
(this one crashed, no MMM logs)
db13:
2010/08/13 14:16:24 INFO We have some new roles added or old rules deleted!
2010/08/13 14:16:24 INFO Added: writer(10.17.88.140)
(this is exactly when db12 appeared online)
db14:
2010/08/13 13:20:32 INFO Changing active master to 'db13'
I didn't dig into the source code, but "(ping: not OK, mysql: OK)" and "# If ping is down, nothing will be send to agent. So this doesn't indicate that the agent is down." sound like a possible cause.