An easy way to simulate the problem: on the database host, use the firewall to drop outgoing TCP packets that have the PSH flag set and originate from the agent port, like this:
iptables -A OUTPUT -p tcp --sport 9989 --tcp-flags PSH PSH -j DROP
This allows the connect to succeed, but when the agent tries to reply with data, the data never gets through and the monitor blocks waiting for it. Network-wise, the behavior is the same as if the agent/database host had run out of memory and could not proceed beyond acknowledging a TCP handshake.
The monitor does have a timeout on the connect operation, but not on network I/O.
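To illustrate the difference between a connect timeout and a read timeout, here is a minimal Python sketch. It is not the monitor's code and is not the attached patch; the function names `read_with_timeout` and `silent_listener` are hypothetical. The listener stands in for a stuck agent: the kernel completes the TCP handshake, but no data is ever sent back.

```python
import socket


def read_with_timeout(host, port, connect_timeout, read_timeout):
    """Connect, then try to read one chunk.

    Returns the bytes received, or None if the peer stays silent
    for longer than read_timeout seconds.
    """
    # This timeout bounds the connect. In Python it also stays set on
    # the socket afterwards, so we override it explicitly below to keep
    # the two limits independent.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Without a timeout on the socket, recv() could block forever
        # when the peer accepts the connection but never replies.
        sock.settimeout(read_timeout)
        try:
            return sock.recv(4096)
        except socket.timeout:
            return None
    finally:
        sock.close()


def silent_listener():
    """Hypothetical stand-in for a stuck agent.

    The socket listens, so the kernel completes incoming handshakes,
    but nothing ever calls accept() or sends data.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    return srv
```

Against such a listener the connect succeeds immediately, and the read gives up after `read_timeout` seconds instead of hanging.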
Attached is a proof-of-concept patch that fixes this problem for a likely practical scenario: the host running MySQL and the agent has run out of memory or is otherwise overloaded, and has entered a state where the connection to the agent succeeds but the actual data response never arrives. The patch enables timeouts on reads. The fix still needs to be improved:
- take care of perpetually blocking write operations
- apply the fix to other network I/O in the code
This would require some refactoring of the code to put network I/O in wrappers.
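As a rough idea of what such a wrapper could look like, here is a minimal Python sketch. It is not based on the monitor's actual code. The class `TimedConnection`, the exception `NetTimeout`, and the newline-terminated message format are all assumptions made for illustration. Both reads and writes go through one object, so the timeout is applied in a single place.

```python
import socket


class NetTimeout(Exception):
    """Raised when a wrapped network operation exceeds its time limit."""


class TimedConnection:
    """Hypothetical wrapper that puts a timeout on every socket call."""

    def __init__(self, sock, io_timeout):
        self._sock = sock
        # The timeout applies to each individual send/recv call,
        # not to a whole multi-call exchange.
        self._sock.settimeout(io_timeout)

    def send(self, data):
        """Write all of data, or raise NetTimeout if the write stalls."""
        try:
            self._sock.sendall(data)
        except socket.timeout:
            raise NetTimeout("write timed out")

    def recv_line(self, max_bytes=65536):
        """Read up to and including a newline (assumed message format).

        Reads one byte at a time for simplicity; a real wrapper would
        buffer. Returns what was read if the peer closes early.
        """
        buf = bytearray()
        while not buf.endswith(b"\n"):
            if len(buf) >= max_bytes:
                raise ValueError("line too long")
            try:
                chunk = self._sock.recv(1)
            except socket.timeout:
                raise NetTimeout("read timed out")
            if not chunk:  # peer closed the connection
                break
            buf += chunk
        return bytes(buf)

    def close(self):
        self._sock.close()
```

With a wrapper like this, callers never touch the raw socket, so a newly added code path cannot accidentally perform an unbounded read or write.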