MMM could not fail-over when server hang

Bug #590623 reported by SeWoong Jeon
Affects: mysql-mmm · Status: New · Importance: Undecided · Assigned to: Unassigned

Bug Description

Hi, I'm testing MMM version 2.2.1.
I created a server-hang situation by consuming all memory (or by forking processes infinitely).
But MMM could not fail over; moreover, the MMM monitor froze up, so I could not do anything.

@ MMM Monitor
# ./bin/mmm_control show
... no response for more than 20 minutes ...

@ MMM Agent (debug mode)
2010/06/07 12:08:46 DEBUG Listener: Connect!
2010/06/07 12:08:46 DEBUG Daemon: Command = 'SET_STATUS|1|tdev03|ONLINE|reader(10.25.131.201)|tdev01'
2010/06/07 12:08:46 DEBUG Received Command SET_STATUS|1|tdev03|ONLINE|reader(10.25.131.201)|tdev01
2010/06/07 12:08:46 DEBUG Fetching uptime from /proc/uptime
2010/06/07 12:08:46 DEBUG Uptime is 20283763.54
2010/06/07 12:08:46 DEBUG Daemon: Answer = 'OK: Status applied successfully!'
2010/06/07 12:08:46 DEBUG Listener: Disconnect!
2010/06/07 12:08:46 DEBUG Executing /home1/irteam/db/mmm/bin/agent/configure_ip eth1 10.25.131.201
2010/06/07 12:08:46 DEBUG Listener: Waiting for connection...
2010/06/07 12:08:49 DEBUG Listener: Waiting for connection...
2010/06/07 12:08:52 DEBUG Listener: Waiting for connection...
2010/06/07 12:08:55 DEBUG Listener: Waiting for connection...
2010/06/07 12:08:58 DEBUG Listener: Waiting for connection...
2010/06/07 12:09:01 DEBUG Listener: Waiting for connection...
2010/06/07 12:09:04 DEBUG Listener: Waiting for connection...
2010/06/07 12:09:07 DEBUG Listener: Waiting for connection...
2010/06/07 12:09:10 DEBUG Listener: Waiting for connection...
...
...

This is my test environment.
* tdev01 (10.25.131.51:3306), tdev02 (10.25.131.52:3306), tdev03 (10.25.131.53:3306) -> MMM Agent & MySQL
* tdev04 (10.25.131.54) -> MMM Monitor

<role writer>
        hosts tdev01,tdev02
        ips 10.25.131.200
        mode exclusive
</role>
<role reader>
        hosts tdev01,tdev02,tdev03
        ips 10.25.131.201,10.25.131.202
        mode balanced
</role>

What should I do in this situation?
Let me know if you need more information.

Revision history for this message
Piotr Biel (piotr-biel) wrote :

I hit this problem twice. I think it is related to the fact that, after a crash, the system can still be partially responsive.

In source code (Monitor.pm):

                if ($state eq 'ONLINE') {
                        # ONLINE -> HARD_OFFLINE
                        unless ($ping && $mysql) {

Shouldn't it be OR instead of AND?
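(Editorial note: the condition can be sanity-checked with a short script. By De Morgan's law, `unless ($ping && $mysql)` is equivalent to `if (!$ping || !$mysql)`, so the ONLINE -> HARD_OFFLINE transition already fires when *either* check fails:)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Truth table for the ONLINE -> HARD_OFFLINE condition in Monitor.pm.
# "unless ($ping && $mysql)" == "if (!$ping || !$mysql)": the host is
# taken offline as soon as EITHER check fails, not only when both fail.
for my $ping (0, 1) {
    for my $mysql (0, 1) {
        my $offline = !($ping && $mysql) ? 1 : 0;
        printf "ping=%d mysql=%d -> HARD_OFFLINE=%d\n", $ping, $mysql, $offline;
    }
}
```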

And one more thing:

sub move_role()

        # Assign role to new host
        $roles->set_role($role, $ip, $host);

        # Notify old host (if is_active_master_role($role) this will make the host non writable)
        $monitor->send_agent_status($old_owner);

        # Notify slaves (this will make them switch the master)
        $monitor->notify_slaves($host) if ($roles->is_active_master_role($role));

        # Notify new host (if is_active_master_role($role) this will make the host writable)
        $monitor->send_agent_status($host);

and:

        # Finally send command
        my $ret = $agent->cmd_set_status($master);

        unless ($ret) {
                # If ping is down, nothing will be send to agent. So this doesn't indicate that the agent is down.
                my $checks = $self->checks_status;
                if ($checks->ping($host) && !$agent->agent_down()) {
                        FATAL "Can't reach agent on host '$host'";
                        $agent->agent_down(1);
                }
        }
        elsif ($agent->agent_down) {
                FATAL "Agent on host '$host' is reachable again";
                $agent->agent_down(0);
        }
        return $ret;
}

This is another possible cause (explained by the code comments above).

Logs from the problematic period:

System arch:
- MONITORING SERVER
- 2 machines in master-master pair (db12/db13)
- additional slave (db14)

MONITORING SERVER:
2010/08/13 13:16:35 FATAL Can't reach agent on host 'db12'
2010/08/13 13:20:27 ERROR Check 'ping' on 'db12' has failed for 242 seconds! Message: ERROR: Could not ping 10.16.214.132
2010/08/13 13:20:32 FATAL State of host 'db12' changed from ONLINE to HARD_OFFLINE (ping: not OK, mysql: OK)
2010/08/13 13:20:32 INFO Removing all roles from host 'db12':
2010/08/13 13:20:32 INFO Removed role 'writer(10.17.88.140)' from host 'db12'
2010/08/13 13:20:32 INFO Orphaned role 'writer(10.17.88.140)' has been assigned to 'db13'
..
(nothing here, waiting for db12 to become online)

2010/08/13 14:16:27 FATAL Agent on host 'db12' is reachable again

db12:
--
(this one crashed, no MMM logs)

db13:

2010/08/13 14:16:24 INFO We have some new roles added or old rules deleted!
2010/08/13 14:16:24 INFO Added: writer(10.17.88.140)
(this is exactly when db12 appeared online)

db14:
2010/08/13 13:20:32 INFO Changing active master to 'db13'

I didn't drill into the source code, but "(ping: not OK, mysql: OK)" and the comment "# If ping is down, nothing will be send to agent. So this doesn't indicate that the agent is down." sound like a possible cause.

Revision history for this message
SeWoong Jeon (seuis398) wrote : RE: [Bug 590623] Re: MMM could not fail-over when server hang

Hi, I found the cause of this situation.
You can find the following code in Monitor.pm:
...
- # Finally send command
- my $ret = $agent->cmd_set_status($master);
...

Periodically, the MMM monitor sends the status to every live agent, but there is no timeout logic.
Even after a node crashes, the MMM monitor keeps trying to send the status to the crashed node and eventually blocks.
So I added timeout logic.
I attached a patch file; check it out. :-)
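(The attached patch is not reproduced inline. A minimal sketch of the usual Perl idiom for such a timeout — wrapping the blocking call in `eval` with a local SIGALRM handler; `with_timeout` and the sleeping "agent call" below are hypothetical stand-ins for `$agent->cmd_set_status($master)`:)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: bound a blocking call with alarm(), the standard Perl idiom.
# If the call does not finish within $seconds, SIGALRM fires, the
# handler dies out of the eval, and we return undef instead of hanging.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };  # "\n" keeps $@ exact
        alarm($seconds);
        $result = $code->();
        alarm(0);                                    # cancel pending alarm
    };
    alarm(0);                                        # safety net after die
    return ($@ && $@ eq "timeout\n") ? undef : $result;
}

# A hung agent is simulated by sleeping longer than the timeout.
my $ret = with_timeout(1, sub { sleep 5; return 'OK' });
print defined $ret ? "got: $ret\n" : "agent call timed out\n";
```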

Revision history for this message
Yves Trudeau (y-trudeau) wrote :

Hi,
   I like Jeon's idea, but I'd prefer the alarm code to be in the _send_command function. I'll write something about this in the branch lp:~y-trudeau/mysql-mmm/NonBlocking.

Regards,

Yves

Revision history for this message
Sasha Pachev (sasha-pachev) wrote :

An easy way to simulate the problem: on the database host, firewall the outgoing TCP PSH packets on the agent port like this:

iptables -A OUTPUT -p tcp --sport 9989 --tcp-flags PSH PSH -j DROP

This allows the connect to succeed, but when the agent tries to reply with data, it will block. Network-wise, the behavior is the same as if the agent/database host had run out of memory and could not proceed beyond acknowledging the TCP handshake.

The monitor does have a timeout on the connect operation, but not on network I/O.

Attached is a proof-of-concept patch that fixes this problem for a likely practical scenario: a host running MySQL and the agent runs out of memory or is otherwise overloaded, entering a state where the connection to the agent succeeds but the actual data response never arrives. It enables timeouts on reads. This fix still needs to be improved:

- take care of the perpetually blocking write operation
- apply the fix to other network I/O in the code

This would require some refactoring of the code to put network I/O in wrappers.
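(The attached patch is not reproduced here, but a read timeout on the agent socket can be sketched with SO_RCVTIMEO. The silent local server below is a hypothetical stand-in for a hung agent: the connect succeeds, yet no data ever comes back, which is exactly the state the iptables rule above produces. Note that IO::Socket::INET's `Timeout` parameter only bounds connect():)

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;
use Socket qw(SOL_SOCKET SO_RCVTIMEO);

# A local server that accepts connections but never replies,
# standing in for an agent host that has hung after the handshake.
my $server = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    Listen    => 1,
    Proto     => 'tcp',
) or die "listen: $!";

my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $server->sockport,
    Proto    => 'tcp',
    Timeout  => 5,                      # bounds connect() only
) or die "connect: $!";

# SO_RCVTIMEO bounds each read as well; the option value is a
# struct timeval { tv_sec; tv_usec; } packed as two native longs.
$client->setsockopt(SOL_SOCKET, SO_RCVTIMEO, pack('l!l!', 1, 0))
    or die "setsockopt: $!";

$server->accept;                        # accept, then stay silent
my $line = <$client>;                   # would block forever without SO_RCVTIMEO
print defined $line ? "got: $line" : "read timed out\n";
```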
