MAAS becomes unstable after rack controller restart
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Released | Critical | Blake Rouse |
2.2 | Fix Released | Critical | Unassigned |
Bug Description
Problem
=======
We have an HA setup - 3 region API nodes and 2 rack controllers. When we restart a rack controller, the MAAS API becomes unresponsive.
Here's a 'zones read' call that fails once and then succeeds. This is done immediately after restarting both rack controllers:
http://
The amount of time it stays this way varies. We currently sleep for 5 minutes after restarting maas-rackd before trying to set up networks through the API, and that isn't always long enough: we sometimes get API calls disconnected without a response.
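As a stopgap until the underlying bug is fixed, the fixed 5 minute sleep could be replaced by polling until the API answers several times in a row, since a single success is not a reliable signal (the 'zones read' call above fails once and then succeeds). The following is only a minimal sketch: `wait_until_ready` and its parameters are hypothetical names, not part of MAAS, and `probe` stands for whatever cheap read-only API call is available, such as the 'zones read' call.

```python
import time


def wait_until_ready(probe, timeout=600.0, interval=5.0, required_successes=3,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it succeeds `required_successes` times in a row.

    `probe` is any zero-argument callable that raises an exception on
    failure (hypothetical: e.g. a wrapper around a 'zones read' call).
    Returns the elapsed time in seconds, or raises TimeoutError if the
    deadline passes before the API looks stable.
    """
    start = clock()
    deadline = start + timeout
    streak = 0  # number of consecutive successful probes so far
    while True:
        try:
            probe()
            streak += 1
        except Exception:
            # Any failure resets the streak: the API is still flapping.
            streak = 0
        if streak >= required_successes:
            return clock() - start
        if clock() >= deadline:
            raise TimeoutError(
                "API not stable after %.0f seconds" % (clock() - start))
        sleep(interval)
```

Requiring consecutive successes is a design choice for this symptom: it avoids proceeding on one lucky response while the region is still re-registering racks. It does not address racks that stay at 8% connected, which a simple API probe would not detect.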
Also, the racks sometimes never show up as fully connected again. They show up as 8% connected here:
http://
The logs are full of questionable stuff, "Successfully configured DNS" is repeated over and over:
2017-08-01 16:35:28 maasserver.
2017-08-01 16:35:30 maasserver.
2017-08-01 16:35:32 maasserver.
So are errors like this:
Failed to register rack controller '4shpr4' into the database. Connection will be dropped.
And repeated messages like this:
Aug 1 16:37:05 infra1 maas.rpc.
Aug 1 16:37:12 infra1 maas.rpc.
Aug 1 16:37:22 infra1 maas.service_
Aug 1 16:37:52 infra1 maas.service_
And this:
2017-08-01 16:37:39 provisioningser
Expected Behavior
=================
- Restarting a rack controller should not affect region controller API availability. We should be able to restart rack controllers and immediately use the API.
- Restarted rack controllers should not remain in a 'degraded' 8% connected state.
We're using 2.2.2 (6099-g8751f91-
Related branches
- Blake Rouse (community): Approve
  Diff: 337 lines (+182/-16), 4 files modified
  src/maasserver/rpc/regionservice.py (+10/-9)
  src/maasserver/rpc/tests/test_regionservice.py (+28/-7)
  src/provisioningserver/utils/network.py (+68/-0)
  src/provisioningserver/utils/tests/test_network.py (+76/-0)
- Mike Pontillo (community): Approve
  Diff: 257 lines (+120/-17), 4 files modified
  src/maasserver/rpc/regionservice.py (+10/-9)
  src/maasserver/rpc/tests/test_regionservice.py (+28/-7)
  src/provisioningserver/utils/network.py (+33/-0)
  src/provisioningserver/utils/tests/test_network.py (+49/-1)
Changed in maas:
status: Incomplete → New
tags: added: foundations-engine, removed: foundation-engine
Changed in maas:
milestone: none → 2.2.3
assignee: nobody → Blake Rouse (blake-rouse)
importance: Undecided → High
status: New → Triaged
Changed in maas:
status: Triaged → In Progress
Changed in maas:
importance: High → Critical
milestone: 2.2.3 → 2.3.0
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.3.0 → 2.3.0alpha2
Changed in maas:
status: Fix Committed → Fix Released
Logs and config files from the 3 maas nodes.