Comment 28 for bug 1457404

Roman Podoliaka (rpodolyaka) wrote :

The workaround we proposed is obviously not a silver bullet: it only returns affected IPs to normal use after $lease seconds. So by decreasing the lease time, we shorten the period during which those IP addresses can't be used for new instances.

I understand your frustration, as this affects your CI. For testing purposes we could decrease the lease timeout even further, to something like 30-60s. For production, 10min or the default 24h should be fine.
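To make the trade-off concrete, here is a rough sketch of the worst-case time an address stays unusable when the DHCPRELEASE is ignored. It assumes the address is freed only when the lease expires, so the worst case is simply the full lease time. The setting names are hypothetical labels for the values mentioned above, not real config options.

```python
from datetime import timedelta

# Hypothetical labels for the lease values discussed in this comment.
LEASE_SETTINGS = {
    "ci (low)": timedelta(seconds=30),
    "ci (high)": timedelta(seconds=60),
    "production (reduced)": timedelta(minutes=10),
    "production (default)": timedelta(hours=24),
}


def worst_case_unavailable(lease: timedelta) -> timedelta:
    """Assumption: if the release is ignored, the IP is only reusable
    once the lease expires, so the worst case is the whole lease."""
    return lease


def format_report(settings: dict) -> list:
    """Return one human-readable line per lease setting."""
    lines = []
    for name, lease in settings.items():
        seconds = int(worst_case_unavailable(lease).total_seconds())
        lines.append(f"{name}: IP unusable for up to {seconds}s")
    return lines


if __name__ == "__main__":
    for line in format_report(LEASE_SETTINGS):
        print(line)
```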

My point is still the same: from what I see, this looks very much like a nasty race condition in dnsmasq that can only be reproduced once per deployment, right after the deployment completes (we can't reproduce the issue on the same env after that, nor on any other deployed env). For some reason dnsmasq ignores the first DHCPRELEASE packet it receives (strace'ing the dnsmasq daemon showed that it actually received the UDP packet but did nothing with it). So this has little impact on production envs.

I'm now wondering whether this has something to do with virtio, as we've already seen cases where dnsmasq ignored packets with bad checksums (this can only be reproduced with virtio; e1000 works just fine).
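For reference, this is roughly what a "bad checksum" check amounts to. It is a simplified sketch of the 16-bit ones' complement Internet checksum (RFC 1071) over a raw byte string. A real UDP checksum also covers a pseudo-header with the IP addresses, which is left out here, and the function names are made up for illustration. This is not how dnsmasq or the kernel implements it.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_is_valid(data_with_checksum: bytes) -> bool:
    """A buffer that already contains a correct checksum sums to zero."""
    return internet_checksum(data_with_checksum) == 0


if __name__ == "__main__":
    # Toy buffer with a zeroed 2-byte checksum field at offset 4.
    payload = b"\x01\x02\x03\x04\x00\x00\x05\x06"
    csum = internet_checksum(payload)
    filled = payload[:4] + csum.to_bytes(2, "big") + payload[6:]
    print(checksum_is_valid(filled))            # checksum filled in
    print(checksum_is_valid(filled[:-1] + b"\x07"))  # last byte corrupted
```

A receiver that sees the second kind of buffer would be expected to drop it, which would look a lot like "received the packet but did nothing".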