After taking a closer look at a failing system, the issue is that the rack controllers are not able to update their interfaces. The underlying cause seems to be that the networking monitoring service uses a lock file to ensure that only one process updates the networking information. If that process gets killed, the lock file stays behind, pointing to the PID the killed regiond process had.
What normally happens next is that another process tries to acquire the lock, sees that the lock points to a dead PID, and recreates the lock.
This normally works, but the killed PID can get recycled, so that the lock now points to a PID which the maas user isn't allowed to kill. In that case a PermissionError is raised, the lock file implementation doesn't handle it, and the networking monitoring service can never start.
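To illustrate the failure mode, here is a minimal sketch of a stale-lock check that tolerates a recycled PID. It is not the actual MAAS lock file code or the fix being worked on: lock_is_stale is a hypothetical helper, and treating PermissionError as "stale" rests on the assumption that every legitimate lock holder runs as the same user as the caller, so a PID we may not signal cannot be one of ours.

```python
import os


def lock_is_stale(pid, kill=os.kill):
    """Hypothetical helper: decide whether a lock file's PID is stale.

    `kill` is injectable only so the error paths can be exercised in tests.
    """
    try:
        # Signal 0 delivers nothing; it only checks existence and permission.
        kill(pid, 0)
    except ProcessLookupError:
        # No process has this PID any more: the previous holder is gone.
        return True
    except PermissionError:
        # A process with this PID exists but belongs to another user.
        # ASSUMPTION: lock holders always run as our own user, so this
        # must be a recycled PID and the lock can be recreated.
        return True
    # The process exists and we may signal it: treat the lock as held.
    return False
```

With a check along these lines, a recycled PID owned by another user would lead to the lock being recreated instead of an unhandled PermissionError.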
Currently working on a fix for this.