Pod must be on a known host if interfaces are specified

Bug #1847794 reported by Marian Gasparovic
This bug affects 3 people
Affects  Status        Importance  Assigned to      Milestone
MAAS     Fix Released  Medium      Björn Tillenius
2.6      Fix Released  High        Unassigned
2.7      Fix Released  Medium      Unassigned

Bug Description

maas_2.6.1~rc1-7830-gee83011f2-ubuntu1~18.04.1
getting this error now

2019-10-11-06:35:41 root DEBUG maas root pod compose 1 hostname=vault-1 cores=2 memory=4096 storage=20.0 zone=1 interfaces=eth0:space=oam-space;eth1:space=internal-space,type=bridge
2019-10-11-06:35:42 root ERROR Command failed: pod compose 1 hostname=vault-1 cores=2 memory=4096 storage=20.0 zone=1 interfaces=''eth0:space=oam-space;eth1:space=internal-space,type=bridge''
2019-10-11-06:35:42 root ERROR b'Pod must be on a known host if interfaces are specified.'
Traceback (most recent call last):
  File "/usr/local/bin/fce", line 11, in <module>
    load_entry_point('foundationcloudengine', 'console_scripts', 'fce')()
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/main.py", line 141, in entry_point
    sys.exit(main(sys.argv[1:]))
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/main.py", line 132, in main
    opts.func(opts)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/build.py", line 73, in build_main
    args.steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/build.py", line 47, in build_and_validate_if_needed
    layer.build_outer(only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/baselayer.py", line 119, in build_outer
    self.build(only_steps=only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/maaslayer.py", line 2662, in build
    super(MaasLayer, self).run_steps(only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/steppedbaselayer.py", line 51, in run_steps
    step.build()
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/maaslayer.py", line 1368, in build
    zone['id'],
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/maas_cli.py", line 515, in add_pod_vm
    return cmd(maas_profile, command)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/maas_cli.py", line 117, in cmd
    raise error
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/maas_cli.py", line 112, in cmd
    output = raw_cmd(maas_profile, split_command)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/maas_cli.py", line 106, in raw_cmd
    return subprocess.check_output(maas_cmd)
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['maas', 'root', 'pod', 'compose', '1', 'hostname=vault-1', 'cores=2', 'memory=4096', 'storage=20.0', 'zone=1', 'interfaces=eth0:space=oam-space;eth1:space=internal-space,type=bridge']' returned non-zero exit status 2.

Marian Gasparovic (marosg) wrote :
tags: added: cdo-qa-blocker
tags: added: field-high
tags: removed: cdo-qa-blocker
Marian Gasparovic (marosg) wrote :

Also subscribed field-high.

Blake Rouse (blake-rouse) wrote :

Is the Pod created on a machine that was deployed with "Install as KVM host"? If not then you cannot use interface constraints.

Changed in maas:
status: New → Incomplete
Marian Gasparovic (marosg) wrote :

Blake, when I look at the history of our test runs, I see this sometimes working and sometimes failing on the same MAAS revision:

2019-10-13-13:09:30 foundationcloudengine.layers.maaslayer INFO Creating vault-1 in leafeon
2019-10-13-13:09:30 root DEBUG maas root pod compose 1 hostname=vault-1 cores=2 memory=4096 storage=20.0 zone=1 interfaces=eth0:space=oam-space;eth1:space=internal-space,type=bridge
2019-10-13-13:09:37 root DEBUG maas root tag read vault
2019-10-13-13:09:38 root DEBUG maas root tags create name=vault
2019-10-13-13:09:39 root DEBUG maas root tag update-nodes vault add=fepak3

Jason Hobbs (jason-hobbs) wrote :

Some more background: we've been running this way on MAAS 2.6.0 for some time now and never hit this issue. FCB's design depends on using pods to run on the MAAS rack/region controllers, and they need to be on multiple interfaces.

We've always done that before and it's worked; we don't expect a behavior change like this in a point release.

Blake Rouse (blake-rouse) wrote :

Nothing has changed in behavior; it has always been this way. Either you install the Pod using "Install as KVM host", or, if the machine was deployed by MAAS, MAAS should be able to determine that the machine is the host of the Pod.

I assume you are in the latter case?

Jason Hobbs (jason-hobbs) wrote :

The behavior has certainly changed in 2.6.1; we ran hundreds of test cases prior to that and never saw this error.

The machine wasn't deployed by MAAS - it is the machine hosting MAAS, in fact. We add it to MAAS as a pod host.

I'm not sure what this error even means. What is it trying to tell us?

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → High
milestone: none → 2.7.0alpha1
Jason Hobbs (jason-hobbs) wrote :

Some more background:

We're hosting pods on the infra nodes, which also host the MAAS rack/region controllers.
After MAAS is deployed, we add the three infra nodes as pod hosts.
The MAAS rack controller has interfaces on two networks: OAM for PXE, and internal to talk to OpenStack services.
We need the pod KVM to have interfaces on both networks, so we pass two NIC constraints during compose, one per interface.
Due to bug 1830690, the interface is unconfigured, so we link it to a subnet via the API.
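
The two NIC constraints mentioned above end up in the interfaces= argument seen in the logs. As a minimal sketch of how such an argument could be assembled: the helper names build_interfaces_arg and build_compose_argv and the dictionary layout are hypothetical, and only the resulting string format is taken from the commands quoted in this report.

```python
from typing import Dict, List


def build_interfaces_arg(nics: Dict[str, Dict[str, str]]) -> str:
    """Join per-NIC constraints as label:key=value,...;label:key=value,..."""
    parts = []
    for label, constraints in nics.items():
        pairs = ",".join(f"{key}={value}" for key, value in constraints.items())
        parts.append(f"{label}:{pairs}")
    return ";".join(parts)


def build_compose_argv(profile: str, pod_id: int, hostname: str,
                       nics: Dict[str, Dict[str, str]]) -> List[str]:
    """Build an argument list shaped like the 'pod compose' calls in the logs."""
    return [
        "maas", profile, "pod", "compose", str(pod_id),
        f"hostname={hostname}",
        f"interfaces={build_interfaces_arg(nics)}",
    ]


# One constraint per wanted interface, as in the failing command.
nics = {
    "eth0": {"space": "oam-space"},
    "eth1": {"space": "internal-space", "type": "bridge"},
}
argv = build_compose_argv("root", 1, "vault-1", nics)
print(argv[-1])
```

Passing the arguments as a list (rather than a single shell string) keeps the ";" separator from being interpreted by a shell.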

Newell Jensen (newell-jensen) wrote :

As this is failing randomly and has to do with interface constraints, I have reverted commit dbbc280d218fb67cbaf6e1acb30b1549e7c110a5 and added a 2.6.1 package to ppa:maas-maintainers/experimental3.

This will get you on your way while I continue to look into what is causing this regression.

Jason Hobbs (jason-hobbs) wrote :

Update: we hit the issue on the new package from the experimental3 ppa. Newell has access to the repro system and is investigating.

Alexander Balderson (asbalderson) wrote :

Update:
Using the newest package from the experimental3 PPA, I had 3 successful builds in a row.

Newell Jensen (newell-jensen) wrote :

Update:

solutions-qa has not seen any regressions with 2.6.0, but has seen regressions with 2.6.1, as well as with 2.6.1 with commit dbbc280d218fb67cbaf6e1acb30b1549e7c110a5 reverted. Currently testing 2.6.1-alpha1 (aka 2.6.0-rc3).

Adam Collard (adam-collard) wrote :

After many more runs in testing, we cannot reproduce the issue. If it comes back please alert us immediately, and reopen the bug.

Changed in maas:
status: Triaged → Invalid
Changed in maas:
status: Invalid → Incomplete
Marian Gasparovic (marosg) wrote :

We just hit it again last night with maas_2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1

2019-10-31-08:28:28 root DEBUG maas root pod compose 1 hostname=vault-1 cores=2 memory=4096 storage=20.0 zone=2 interfaces=eth0:space=oam-space;eth1:space=internal-space,type=bridge
2019-10-31-08:28:29 root ERROR Command failed: pod compose 1 hostname=vault-1 cores=2 memory=4096 storage=20.0 zone=2 interfaces=''eth0:space=oam-space;eth1:space=internal-space,type=bridge''
2019-10-31-08:28:29 root ERROR b'Pod must be on a known host if interfaces are specified.'

Marian Gasparovic (marosg) wrote :

And it passed the day before with 2.6.1~rc1-7830-gee83011f2-ubuntu1~18.04.1

Changed in maas:
status: Incomplete → New
Changed in maas:
status: New → In Progress
assignee: nobody → Newell Jensen (newell-jensen)
Newell Jensen (newell-jensen) wrote :

After taking a closer look at a failing system, the issue is that the rack controllers are not able to update their interfaces. The underlying issue seems to be that the network monitoring service uses a lock file to ensure that only one process updates the networking information. If that process gets killed, the lock file stays, pointing to the PID the killed regiond process had.

What normally happens is that another process tries to acquire the lock, sees that the lock points to a killed PID, and recreates the lock.

This normally works, but the killed PID can get recycled, so that the lock now points to a PID which the maas user isn't allowed to kill. A PermissionError is then raised, which the lock file implementation doesn't handle, and the network monitoring service can never start.

Currently working on a fix for this.
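
To illustrate the failure mode described above, here is a minimal sketch of a stale-lock check that also covers the recycled-PID case. It is not the MAAS lock file implementation: probe_pid and lock_is_stale are hypothetical names, the code assumes a POSIX system, and treating an unsignalable process as "not the lock holder" is an assumption that only holds if the real holder always runs as the same user.

```python
import os
from typing import Callable


def probe_pid(pid: int, kill: Callable[[int, int], None] = os.kill) -> str:
    """Classify the PID recorded in a lock file as dead, foreign or alive."""
    try:
        kill(pid, 0)  # signal 0 sends nothing; it only checks existence/permission
    except ProcessLookupError:
        return "dead"      # no such process: the lock is left over
    except PermissionError:
        return "foreign"   # PID exists but belongs to another user (recycled PID)
    return "alive"


def lock_is_stale(pid: int, kill: Callable[[int, int], None] = os.kill) -> bool:
    """Assumption: a process we may not signal cannot be our own lock holder."""
    return probe_pid(pid, kill) != "alive"


# The current process is always signalable by itself.
print(probe_pid(os.getpid()))
```

The kill parameter exists only so the PermissionError path can be exercised without needing a process owned by another user.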

Changed in maas:
status: In Progress → Fix Committed
Michael Skalka (mskalka) wrote :

We have encountered this error again in testing using maas_2.6.2-7841-ga10625be3-0ubuntu1~18.04.1

MAAS logs attached.

Changed in maas:
status: Fix Committed → New
Michael Skalka (mskalka) wrote :

Foundation log from the failed run above.

Changed in maas:
status: New → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Alexander Balderson (asbalderson) wrote :

I'm reopening this, as we're seeing it on maas_2.7.0~rc1-8204-g.d93c8433c-0ubuntu1~18.04.1

Changed in maas:
status: Fix Released → New
Michael Skalka (mskalka) wrote :
Alberto Donato (ack)
Changed in maas:
milestone: 2.7.0b1 → 2.7.0rc2
Changed in maas:
status: New → In Progress
Björn Tillenius (bjornt) wrote :

I think we've found the real issue now. In Controller.update_interfaces(), we loop over the detected interfaces, first the physical interfaces, and then the vlans, bonds, and bridges. In the loop we get the interface details via mac address.

However, only physical interfaces have extra details. If a bridge, bond or vlan has the same mac address as a physical interface, it will get the physical interface's extra details as well, including the name.

The fix is to only update extra information if it's a physical interface.

The code in question has changed between 2.6.2 and 2.7rc1, but the underlying issue exists in both. So both 2.6 and 2.7 need to be fixed.

I was able to reproduce this with a physical machine (a container won't do) where a single rack and region was installed. Stop maas-rackd and restart maas-regiond, and you should see the issue.

The reason it didn't happen on every test run is that the code works fine the first time it runs, when all the interfaces are created. Then either rackd or regiond will have the responsibility to keep the network information up-to-date, and if the service that doesn't have the responsibility is restarted, the issue won't show, since it won't try to update the interfaces.
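
The mac-address lookup described in this comment can be sketched roughly as follows. This is not the code of Controller.update_interfaces(); the function merge_details, the dictionary layout, the sample addresses and the physical_only switch are all made up for illustration.

```python
from typing import Dict


def merge_details(interfaces: Dict[str, dict],
                  details_by_mac: Dict[str, dict],
                  physical_only: bool = True) -> Dict[str, dict]:
    """Attach extra details, looked up by mac address, to each interface."""
    merged = {}
    for name, iface in interfaces.items():
        entry = dict(iface, name=name)
        extra = details_by_mac.get(iface["mac_address"])
        # The guard on the interface type is the fix described above.
        if extra and (not physical_only or iface["type"] == "physical"):
            entry.update(extra)
        merged[name] = entry
    return merged


# A bridge commonly shares the mac address of the NIC it is built on.
interfaces = {
    "eth0": {"type": "physical", "mac_address": "52:54:00:aa:bb:cc"},
    "br0": {"type": "bridge", "mac_address": "52:54:00:aa:bb:cc"},
}
details_by_mac = {"52:54:00:aa:bb:cc": {"name": "eth0", "vendor": "ExampleVendor"}}

buggy = merge_details(interfaces, details_by_mac, physical_only=False)
fixed = merge_details(interfaces, details_by_mac, physical_only=True)
print(buggy["br0"]["name"], fixed["br0"]["name"])
```

Without the type check the bridge entry is overwritten with the physical interface's name, which matches the symptom described in the comment.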

Changed in maas:
milestone: 2.7.0rc2 → none
Changed in maas:
milestone: none → next
status: In Progress → Fix Committed
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released
milestone: next → 2.8.0b1
Lee Trager (ltrager)
Changed in maas:
status: Fix Released → New
assignee: Newell Jensen (newell-jensen) → Björn Tillenius (bjornt)
milestone: 2.8.0b1 → 2.8.0
John George (jog) wrote :

Solutions QA is continuing to hit this bug. This link has all the test runs that have experienced the failure: https://solutions.qa.canonical.com/#/qa/bug/1847794

Individual test runs are linked from this page. Once on the test run page the artifacts and logs are linked at the bottom.

Here's a direct link to one of the recent failures:
https://solutions.qa.canonical.com/#/qa/testRun/e04fb3be-07f4-4e65-ad3d-eab2a1d34b17

Adam Collard (adam-collard) wrote :
Michael Skalka (mskalka) wrote :

Tagging this as a release blocker for 2.8rc5 as we have not been able to get past pod composition for the full test run due to this issue.

tags: added: cdo-release-blocker
Michael Skalka (mskalka) wrote :

Subscribing field-critical: we have hit this 11 times in the last ~24 hours. https://solutions.qa.canonical.com/#/qa/bug/1847794

tags: added: field-critical
removed: field-high
Changed in maas:
importance: High → Critical
milestone: 2.8.0 → 2.9.0b1
Björn Tillenius (bjornt) wrote :

Ok, we found the issue. The problem happens when the maas deb packages are purged and then reinstalled. That might leave maas lock files in /run/lock/ that are now owned by a numeric user.

If maas is reinstalled directly, things should work, since maas should get the same uid as before.

But if you install some other deb package that adds a user, and then reinstall maas, the maas uid will now be different than before, and it won't have permissions to remove the stale lock files.

So I would say that the original bug that this was about was actually fixed already.

This is a separate bug with the same symptoms as the original one. It probably started to show up when you changed your systems in some way (e.g. by installing an extra package).

That explains why you started to see it only recently, after not seeing it for a long while.
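
One way to check a system for the leftover lock files described above is to list files whose owner is not the current maas uid. This is only a sketch: foreign_lock_files is a hypothetical helper, and the "maas" file name prefix is an assumption, since the actual lock file names are not given in this report.

```python
import os
from pathlib import Path
from typing import List, Tuple


def foreign_lock_files(directory: str, expected_uid: int,
                       prefix: str = "maas") -> List[Tuple[str, int]]:
    """Return (path, owner uid) for matching files not owned by expected_uid."""
    found = []
    for path in sorted(Path(directory).iterdir()):
        if not path.name.startswith(prefix):
            continue  # only look at files matching the assumed name prefix
        uid = path.lstat().st_uid
        if uid != expected_uid:
            found.append((str(path), uid))
    return found
```

On an affected host this could be called as foreign_lock_files("/run/lock", pwd.getpwnam("maas").pw_uid) after importing pwd; any result would point at files the reinstalled maas user does not own.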

Alberto Donato (ack) wrote :

Filed LP:#1883735 to track the issue mentioned above.

Michael Skalka (mskalka) wrote :

Suggested workaround confirmed. Removing tags and subscriptions. We'll hold the fix until the package bug is resolved.

tags: removed: cdo-release-blocker field-critical
Adam Collard (adam-collard) wrote :

Bug 1883735 was the root cause of this

Changed in maas:
importance: Critical → Medium
status: New → Fix Released
no longer affects: maas/2.8
Changed in maas:
milestone: 2.9.0b1 → 2.8.0
Camille Rodriguez (camille.rodriguez) wrote :

Hi, I'm currently hitting this issue on a new install of maas 2.8 (2.8.2-8577-g.a3e674063).

2021-02-26-22:40:41 root ERROR [localhost] Command failed: maas root pod compose 2 hostname=vault-2 cores=2 memory=4096 storage=40.0 zone=1 'interfaces=eth0:space=oam-space,mode=auto;eth1:space=internal-space,type=bridge,mode=auto'
2021-02-26-22:40:41 root ERROR [localhost] STDOUT follows:
Pod must be on a known host if interfaces are specified

I don't think bug 1883735 is the root cause in my case, since this is a fresh MAAS install on a fresh Ubuntu install.

Marian Gasparovic (marosg) wrote :

I hit it today when testing 3.3.0~rc2-13144-g.b2d51cb8c

2023-01-17-08:35:24 root ERROR [localhost] Command failed: maas root vm-host compose 1 hostname=vault-1 cores=2 memory=4096 storage=40.0 zone=2 'interfaces=eth0:space=oam-space;eth1:space=internal-space'
2023-01-17-08:35:24 root ERROR [localhost] STDOUT follows:
Pod must be on a known host if interfaces are specified.

Marian Gasparovic (marosg) wrote :
Bartłomiej Poniecki-Klotz (barteus) wrote :

We hit the same issue in a customer environment with the same VM (vault-1). Retrying the step did not help; the solution was to clean the whole layer in FCE and redeploy.

OS
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"

MAAS version: 3.1.1-10918-g.9cbd96fd2
