VLAN with the specified VID already exists error when updating the fabric attribute

Bug #1853047 reported by John George
This bug affects 3 people
Affects: MAAS
Status: Fix Released
Importance: High
Assigned to: Björn Tillenius

Bug Description

With MAAS 2.6.2-7841-ga10625be3-0ubuntu1~18.04.1, Solutions QA tests failed when updating VLANs.

Two separate test runs failed, with details and artifacts available at the following URLs:

https://solutions.qa.canonical.com/#/qa/testRun/a1f9f79e-ed99-48f2-a376-e6e999c51ce5
https://solutions.qa.canonical.com/#/qa/testRun/42edc0da-6bd6-462f-a41b-2b68346981a4

The error output can be seen in the fce_build console log:

2019-11-16-08:30:53 foundationcloudengine.maas_config_networks DEBUG Setting vlan: untagged - fabric to default
Traceback (most recent call last):
  File "/usr/local/bin/fce", line 11, in <module>
    load_entry_point('foundationcloudengine', 'console_scripts', 'fce')()
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/main.py", line 141, in entry_point
    sys.exit(main(sys.argv[1:]))
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/main.py", line 132, in main
    opts.func(opts)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/build.py", line 77, in build_main
    args.steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/build.py", line 51, in build_and_validate_if_needed
    layer.build_outer(only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/baselayer.py", line 120, in build_outer
    self.build(only_steps=only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/maaslayer.py", line 2671, in build
    super(MaasLayer, self).run_steps(only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/steppedbaselayer.py", line 113, in run_steps
    step.build()
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/maaslayer.py", line 1276, in build
    configure_networks(self._network_config, self._api_url, self._apikey)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/maas_config_networks.py", line 521, in configure_networks
    fabric_vlans=fabric_vlans)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/maas_config_networks.py", line 355, in apply_vlans_to_fabric
    set_vlan_attr(vlan, attr='fabric', value=fabric) # Move the vlan
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/maas_config_networks.py", line 326, in set_vlan_attr
    vlan.save()
  File "/usr/lib/python3/dist-packages/maas/client/utils/async.py", line 49, in wrapper
    result = eventloop.run_until_complete(result)
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/lib/python3/dist-packages/maas/client/viscera/vlans.py", line 108, in save
    self._data = await self._handler.update(**update_data)
  File "/usr/lib/python3/dist-packages/maas/client/bones/__init__.py", line 302, in __call__
    response = await self.bind(**params).call(**data)
  File "/usr/lib/python3/dist-packages/maas/client/bones/__init__.py", line 463, in dispatch
    raise CallError(request, response, content, self)
maas.client.bones.CallError: PUT http://10.246.64.33/MAAS/api/2.0/fabrics/1/vlans/2696/ -> HTTP 400 Bad Request ({"__all__": ["A VLAN with the specified VID alrea…)


Alberto Donato (ack) wrote :

Could you share what the code is trying to do?
It seems the client is trying to set the VID of a VLAN to one that is already in use by another VLAN.

Changed in maas:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
Changed in maas:
status: Expired → New
Jason Hobbs (jason-hobbs) wrote :

Here is the code:

https://git.launchpad.net/cpe-foundation/tree/foundationcloudengine/foundationcloudengine/maas_config_networks.py

Here is the networks.yaml it uses:
https://git.launchpad.net/cpe-deployments/tree/config/networks.yaml?h=solutionsqa/fcb/project/stable-stein-bionic-production

This code has been stable for a long time; this is a race condition. If I had to guess, maas is doing something like discovering the vlan on the destination fabric between when we read what vlans are on what fabrics and when we try to put vlans on the correct fabrics.

John A Meinel (jameinel) wrote :

I can't speak to the underlying issue, but Solutions QA has been doing test runs for Juju releases, and seems to run into this issue about 1 in 4 runs. So it definitely still seems to be an issue.

Ian Johnson (ijoh) wrote :

Hi Jason,
Is that the correct code in #3? I don't see the error message in that code (maas_config_networks.py).
I ask because I hit this issue last week, and it was repeatable every time I ran the "fce build --layer maas --steps maas:configure_networks" step; I had to run "fce clean" to move forward.
Thanks
Ian

Björn Tillenius (bjornt) wrote :

Let's try to get this fixed for 2.9, since it happens so often in the QA runs.

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.9.0b4
Jason Hobbs (jason-hobbs) wrote :
Lee Trager (ltrager)
Changed in maas:
milestone: 2.9.0b4 → 2.9.0b7
Michael Skalka (mskalka) wrote :

Given the history of this bug and the impact on release testing I am escalating this to field-high.

Changed in maas:
assignee: nobody → Alberto Donato (ack)
Björn Tillenius (bjornt) wrote :

I took a look at a failed test run:

  https://solutions.qa.canonical.com/testruns/testRun/327f0727-1ab6-4e45-9e19-6b38da4eefa8

I can see that it fails for VLAN 2696. But I also see that all the MAAS hosts have a VLAN interface eno2.2696.

This means that MAAS will detect that interface and create the VLAN, if it doesn't already exist.

So this is indeed a race condition, but I don't think there's anything that can be fixed in MAAS. I think that FCE needs to expect this situation, and either create the VLANs in MAAS before the VLAN interface gets set up, or expect that the PUT might result in 400 and update the existing VLAN instead of creating it.

Does that sound reasonable?

Changed in maas:
status: Triaged → Incomplete
Jason Hobbs (jason-hobbs) wrote :

Why is MAAS creating a new vlan for 2696 when we already have a vlan with id 2696, just on another fabric?

We can detect this condition and work around it, it just seems like MAAS should be saying "hey, we already have this vlan id, we don't need to make a new vlan."

Jason Hobbs (jason-hobbs) wrote :

Here's our attempt at a workaround:

        try:
            set_vlan_attr(client, vlan, attr="fabric", value=fabric) # Move the vlan
            return
        except CallError as error:
            error_dict = yaml.safe_load(error.content)
            if (
                error_dict.get("__all__")[0]
                == "A VLAN with the specified VID already exists in the destination fabric."
            ):
                [surprise_vlan] = [
                    maybe_vlan
                    for maybe_vlan in fabric.vlans
                    if maybe_vlan.vid == vlan.vid
                ]
                surprise_vlan.delete()
                continue # retries

Is the error string stable and considered part of the API? Does this seem reasonable?

Changed in maas:
status: Incomplete → New
Björn Tillenius (bjornt) wrote :

I'm not sure I follow. VLAN 2696 exists in multiple fabrics?

The way I understand it, MAAS detects VLAN 2696 and creates it. Then you try to create it, in the same fabric, and get the error above. In which case, instead of creating a new VLAN, you should update that VLAN with the information you have from networks.yaml.

If that's not the case, please provide API output explaining the situation.

As for the error message... ideally we should improve that, since it's very hard to use it programmatically. I think what you could do is to catch any error, but then issue a GET request to see whether the VLAN was already there. That way we can improve the error message without breaking FCE, and you can improve the check when we have something better.
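
The "catch any error, then re-read the state" approach could look roughly like the sketch below. It is only an illustration under assumptions: `move_vlan_or_detect_existing`, `move` and `list_destination_vids` are hypothetical names, with the two callables standing in for the real PUT and the follow-up GET; none of them are libmaas or FCE APIs.

```python
def move_vlan_or_detect_existing(move, list_destination_vids, vid):
    """Try to move a VLAN; on any client error, re-read the destination fabric.

    `move` and `list_destination_vids` are hypothetical callables wrapping the
    real API calls (the PUT and a follow-up GET). Returns "moved" if the move
    succeeded, "already-present" if the move failed but the VID is found in
    the destination fabric, and re-raises the original error otherwise.
    """
    try:
        move()
        return "moved"
    except Exception:
        # Do not parse the error text; ask the server what is there now.
        if vid in list_destination_vids():
            return "already-present"
        raise
```

Because the decision is based on the state returned by the server rather than on the wording of the 400 response, a check like this would keep working if the error message changes.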

Changed in maas:
status: New → Incomplete
Alberto Donato (ack) wrote :

@Jason why do you need to create the vlan?

Given that maas detects existing interfaces, could you just find the existing one (possibly looping until it shows up), and update it as needed?
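
The "looping until it shows up" idea could be sketched as a small polling helper. This is only an illustration: `wait_for` and the `find` callable are hypothetical names, and `find` would wrap whatever API lookup the client actually uses to find the VLAN, returning None while the VLAN is not there yet.

```python
import time


def wait_for(find, timeout=60.0, interval=2.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call `find()` until it returns something other than None.

    Raises TimeoutError if nothing shows up within `timeout` seconds.
    `sleep` and `clock` are injectable so the helper can be tested
    without actually waiting.
    """
    deadline = clock() + timeout
    while True:
        found = find()
        if found is not None:
            return found
        if clock() >= deadline:
            raise TimeoutError("VLAN did not show up in time")
        sleep(interval)
```

The timeout matters here: if MAAS never detects the interface, the caller gets a clear error instead of hanging.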

Björn Tillenius (bjornt) wrote :

Sorry, I think I misunderstood how FCE did things. I think it would be good to get a detailed explanation about what happens with regards to VLAN 2696 and its related subnet. I.e. whether VLANs/Fabrics/Subnets are created, or gotten from MAAS.

It's important to know that MAAS will most likely have seen the eno2.2696 interfaces and will have created a Fabric, a VLAN, and a Subnet. I see that in the logs, you create your own fabric, and I think that's where things go wrong. I wonder if you shouldn't try to detect whether MAAS already has a fabric for any of the vlans and subnets in networks.yaml and try to reuse that.

I think that would cause fewer problems, since if you take a VLAN that MAAS has connected to a subnet and interface and move it to another fabric, it's not clear what will happen when MAAS goes over the rack network interfaces again.

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :

re #13, that's exactly what we do. We find the existing vlan and we update it. While we're doing this, MAAS adds a new vlan with the same VID as the existing one.

re #14, we create a new fabric and move all of the vlans to that fabric, because MAAS discovers each vlan on a separate fabric to start with, which is wrong; see bug 1754484.

Changed in maas:
status: Incomplete → New
Canonical Solutions QA Bot (oil-ci-bot) wrote :

This bug is fixed with commit 87b03995 to cpe-foundation on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cpe-foundation/commit/?id=87b03995

Björn Tillenius (bjornt) wrote :

Jason and I have now reproduced it with FCE.

The issue seems to be that there are two bridges, looking something like this:

  br0:
    eth0.1234 (fabric0, vlan 1234)

  br1:
    eth0 (fabric0, vlan 2345)

Then FCE moves the 2345 vlan to fabric1, which results in this setup:

  br0:
    eth0.1234 (fabric0, vlan 1234)

  br1:
    eth0 (fabric1, vlan 2345)

And in the process, MAAS creates an empty 1234 vlan in fabric1.

So this is a bug in MAAS. The fabric of the physical interfaces and its vlan interfaces can't really be in the same fabric. But at the very least, MAAS shouldn't create an empty 1234 vlan in fabric1.

I'm not sure what the correct fix is, though. We'll have to reproduce this locally and see what MAAS is actually doing.

Alberto Donato (ack)
Changed in maas:
status: New → Triaged
Björn Tillenius (bjornt) wrote :

> So this is a bug in MAAS. The fabric of the physical interfaces and its
> vlan interfaces can't really be in the same fabric. But at the very least,

Oops, I meant they can't be in *different* fabrics, of course.

Alberto Donato (ack)
Changed in maas:
status: Triaged → In Progress
Lee Trager (ltrager)
Changed in maas:
milestone: 2.9.0b7 → 2.9.0b8
Changed in maas:
status: In Progress → Fix Committed
Lee Trager (ltrager)
Changed in maas:
status: Fix Committed → Fix Released
Michael Skalka (mskalka) wrote :

We have seen this again using MAAS 2.9.2 installed from the snapstore: https://solutions.qa.canonical.com/testruns/testRun/c51dec2f-3e86-48c9-828c-ce072dc4a80d

Changed in maas:
status: Fix Released → New
Bill Wear (billwear) wrote :

Can you please explain the test setup and what you see in MAAS that causes you to identify this particular bug as a regression?

Changed in maas:
status: New → Triaged
status: Triaged → Won't Fix
Bill Wear (billwear)
Changed in maas:
status: Won't Fix → Incomplete
Michael Skalka (mskalka) wrote :

Bill,

The test setup is the same as the rest of our MAAS-based solution runs: Three MAAS hosts running region & rack controllers on baremetal deployed with snaps using an HA postgresql database also hosted on the same units. The pgsql database floats a VIP using haproxy.

MAAS controls six baremetal hosts as well as multiple KVMs, which deploy openstack or kubernetes.

In this instance of failure we were using the MAAS python library to update the fabric attribute of a VLAN in order to move it from one fabric to another. This is an operation we do in CI frequently while configuring networking for OpenStack or Kubernetes deployments, many many times a day.

As always, logs for the test runs we execute can be found at the bottom of the test-plan page I linked: https://oil-jenkins.canonical.com/artifacts/c51dec2f-3e86-48c9-828c-ce072dc4a80d/index.html and more specifically the maas logs are here: https://oil-jenkins.canonical.com/artifacts/c51dec2f-3e86-48c9-828c-ce072dc4a80d/generated/generated/maas/logs-2021-07-03-03.10.34.tar

Changed in maas:
status: Incomplete → New
Christian Grabowski (cgrabowski) wrote :

I don't happen to see any corresponding error in the logs regarding this request. However, I am able to update the fabric of a given VLAN with a VID with both 3.0 and 2.9 successfully using the web UI. Would you mind sharing what this client code using the MAAS python library is doing?

Changed in maas:
status: New → Incomplete
Alexander Balderson (asbalderson) wrote :

On focal we're using the library from ppa:maas/python-libmaas until LP# 1899187 gets fixed. We rarely run on bionic infra any longer.

Changed in maas:
status: Incomplete → New
Christian Grabowski (cgrabowski) wrote :

What is that code that is using that lib doing? Is it just something along the lines of:

fabrics = client.fabrics_list()
default_fabric = fabrics.get_default()
vlan = default_fabric.vlans.create(0)
vlan.vid = 20
vlan.save()

Changed in maas:
status: New → Incomplete
Alberto Donato (ack)
Changed in maas:
milestone: 2.9.0b8 → none
Alexander Balderson (asbalderson) wrote :

It took some digging to work out when we get to this path.

1) we create the fabric (fabric = client.fabrics.create())
2) we know what cidr the vlan we want is on, so we look to see if maas knows about the cidr already. If it does, we grab the vlan for that cidr (client.subnets.get(cidr).vlan)
3) we found the vlan but it's on the wrong fabric, so we go to move it to the right fabric (setattr(vlan, "fabric", fabric))

so something like

fabric = client.fabrics.create()
vlan = client.subnets.get(cidr_for_vlan).vlan
if vlan.fabric.id != fabric.id:
    setattr(vlan, "fabric", fabric) # error here
vlan.save()
fabric.save()

There are some extra checks and retries which go on there, but that's the gist of the operation. It seems like the untagged vlan (in this case) we want is on the wrong fabric, and when we go to move it to the right fabric, it's already there.

We have some extra code that looks through the fabric we want to move to, and tries to delete it:
https://git.launchpad.net/cpe-foundation/tree/foundationcloudengine/foundationcloudengine/maas_config_networks.py#n321

Is it possible that since we don't save after the delete, the move operation is failing?

Changed in maas:
status: Incomplete → New
Björn Tillenius (bjornt) wrote :

That code should be fine.

What's happening is that you have interfaces like:

  eth0
  eth0.2696

Then you move VLAN 2696 to another fabric, which means that eth0 and eth0.2696 are on different fabrics. When the controllers update their network configuration, they try to fix that situation and create a new 2696 VLAN in eth0's fabric.

I've fixed it, so that it won't do that anymore. I think that should fix it.
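
As a toy illustration of the mismatch described above (this is not MAAS code; the `Interface` class and `split_fabric_pairs` function are made up for the example), the situation where a physical interface and its VLAN interface end up on different fabrics can be detected like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interface:
    name: str
    fabric: str
    parent: Optional[str] = None  # physical parent name, for VLAN interfaces


def split_fabric_pairs(interfaces):
    """Return (parent, child) name pairs whose fabrics differ."""
    by_name = {iface.name: iface for iface in interfaces}
    return [
        (iface.parent, iface.name)
        for iface in interfaces
        if iface.parent is not None
        and by_name[iface.parent].fabric != iface.fabric
    ]


eth0 = Interface("eth0", fabric="fabric0")
vlan_if = Interface("eth0.2696", fabric="fabric0", parent="eth0")
before = split_fabric_pairs([eth0, vlan_if])

# Moving VLAN 2696 to another fabric leaves the pair split across fabrics.
vlan_if.fabric = "fabric1"
after = split_fabric_pairs([eth0, vlan_if])
```

Here `before` is empty and `after` contains the eth0 / eth0.2696 pair, which is the state in which the controllers used to create the extra 2696 VLAN.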

Changed in maas:
status: New → In Progress
assignee: Alberto Donato (ack) → Björn Tillenius (bjornt)
milestone: none → next
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: next → 3.2.0-beta1
Changed in maas:
status: Fix Committed → Fix Released