(3.3) RackD constantly restarting: MulticastJoinError - Address already in use

Bug #2020798 reported by Linus Walther
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
MAAS
Triaged
Medium
Unassigned

Bug Description

We are having a problem with MAAS 3.3.3 (Using 3 RackDs and 3 RegionDs. The RegionDs are not running stable, as they are flapping in and out of functional service.

Our RackDs have 2 interfaces each. One for the untagged net and one for the OOB-Net to reach the BMCs.

2023-05-22 08:45:38 twisted.internet.defer: [critical] Unhandled error in Deferred:
2023-05-22 08:45:38 twisted.internet.defer: [critical]
 Traceback (most recent call last):
 Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x04\x00\x0c', 98, 'Address already in use')

Something similar or the same has happened before and it seems to not have been a problem anymore. So this might be a potential regression.

Linus Walther (linwalth)
no longer affects: maas (Ubuntu)
summary: - RackD constantly restarting: MulticastJoinError - Address already in use
+ (3.3) RackD constantly restarting: MulticastJoinError - Address already
+ in use
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Could you provide the logs from the system when the rackds are restarting?

Changed in maas:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Linus Walther (linwalth) wrote :
Download full text (16.4 KiB)

Sure. I sought out the logs i thought most pertaining to the issue:

################

Preliminary information:

At 9:29:35 the rackD reports not being able to report test results. This corresponds with the following entry in kern.log:

Jun 8 09:29:35 maas-rackd001 kernel: [749024.447228] audit: type=1400 audit(1686216575.729:26065): apparmor="DENIED" operation="open" profile="snap.maas.supervisor" name="/etc/wgetrc" pid=3025042 comm="wget" requested_mask="r" denied_mask="r" fsuid=0 ouid=0

ens18 is the interface to which RACKD001_ADDRESS is bound.

################

MAAS RackD logs:

2023-06-08 09:28:29 provisioningserver.rpc.clusterservice: [info] Rack controller 'gqhhy7' registered (via maas-regiond-ha003:pid=2832963) with MAAS version 3.3.3-13184-g.3e9972c19.
2023-06-08 09:28:29 provisioningserver.rpc.clusterservice: [info] Rack controller 'gqhhy7' registered (via maas-regiond-ha003:pid=2832963) with MAAS version 3.3.3-13184-g.3e9972c19.
2023-06-08 09:28:29 provisioningserver.rpc.clusterservice: [info] Rack controller 'gqhhy7' registered (via maas-regiond-ha003:pid=2832963) with MAAS version 3.3.3-13184-g.3e9972c19.
2023-06-08 09:28:36 provisioningserver.utils.services: [warn] Couldn't report test results: HTTP error [500]
2023-06-08 09:28:39 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:RACKD001_ADDRESS', port=36090, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:REGIOND002_ADDRESS', port=5252, flowInfo=0, scopeID=0))
2023-06-08 09:28:39 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:RACKD001_ADDRESS', port=36100, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:REGIOND002_ADDRESS', port=5252, flowInfo=0, scopeID=0))
2023-06-08 09:28:39 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:RACKD001_ADDRESS', port=36112, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:REGIOND002_ADDRESS', port=5252, flowInfo=0, scopeID=0))
2023-06-08 09:28:59 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:RACKD001_ADDRESS', port=57688, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:REGIOND002_ADDRESS', port=5252, flowInfo=0, scopeID=0))
2023-06-08 09:28:59 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:RACKD001_ADDRESS', port=57702, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:REGIOND002_ADDRESS', port=5252, flowInfo=0, scopeID=0))
2023-06-08 09:28:59 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:RACKD001_ADDRESS', port=57708, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:REGIOND002_ADDRESS', port=5252, flowInfo=0, scopeID=0))
2023-06-08 09:28:59 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas-regiond-ha002:pid=2854011' authenticated.
2023-06-08 09:28:59 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas-regiond-ha002:pid=2854011' authenticated.
2023-06-08 09:28:59 provisioningserver.rpc.clusterservice...

Revision history for this message
Igor Brovtsin (igor-brovtsin) wrote :

Hi Linus!

Could you please share the contents of `/var/snap/maas/common/maas/maas_id` files for each rack controller? Also, regiond log would also be helpful. Thanks!

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Linus Walther (linwalth) wrote (last edit ):

Sure thing, Igor!

RackDs:

rackd001 = gqhhy7
rackd002 = gbeh6f
rackd003 = w4ffg4

RegionDs:

regiond001 = 8cge7s
regiond002 = 4hnrsk
regiond003 = rfx4sa

Revision history for this message
Alberto Donato (ack) wrote :

can you please also attach regiond logs?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Linus Walther (linwalth) wrote :
Download full text (18.4 KiB)

Surely. Here are the RegionD logs:

################

Preliminary information:

ens18 is the interface to which REGIOND001_ADDRESS is bound.
RegionDs only have ONE interface (ens18)

################

MAAS RegionD logs:

2023-07-05 13:10:37 regiond: [info] 127.0.0.1 POST /MAAS/metadata/2012-03-01/ HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: Python-urllib/3.10)
2023-07-05 13:10:37 provisioningserver.utils.services: [warn] Couldn't report test results: HTTP error [500]
2023-07-05 13:10:43 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:REGIOND001_ADDRESS', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:RACKD001_ADDRESS', port=45342, flowInfo=0, scopeID=0))
2023-07-05 13:10:43 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:REGIOND001_ADDRESS', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:RACKD001_ADDRESS', port=45356, flowInfo=0, scopeID=0))
2023-07-05 13:10:44 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:RACKD001_ADDRESS:45342'.
2023-07-05 13:10:44 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:RACKD001_ADDRESS:45356'.
2023-07-05 13:10:44 maasserver.ipc: [info] Worker pid:119857 registered RPC connection to ('gqhhy7', 'REGIOND001_ADDRESS', 5252).
2023-07-05 13:10:44 maasserver.ipc: [info] Worker pid:119857 registered RPC connection to ('gqhhy7', 'REGIOND001_ADDRESS', 5252).
2023-07-05 13:10:50 maasserver: [error] ################################ Exception: get() returned more than one ScriptResult -- it returned 2! ################################
2023-07-05 13:10:50 maasserver: [error] Traceback (most recent call last):
  File "/snap/maas/28521/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/utils/views.py", line 293, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/api/support.py", line 62, in __call__
    response = super().__call__(request, *args, **kwargs)
  File "/snap/maas/28521/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
    response = func(*args, **kwargs)
  File "/snap/maas/28521/usr/lib/python3.10/dist-packages/piston3/resource.py", line 197, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/snap/maas/28521/usr/lib/python3.10/dist-packages/piston3/resource.py", line 195, in __call__
    result = meth(request, *args, **kwargs)
  File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/api/support.py", line 370, in dispatch
    return function(self, request, *args, **kwargs)
  File "/snap/maas/28521/lib/python3.10/site-packages/metadataserver/api.py", line 860, in signal
    target_status = process(node, ...

Revision history for this message
Alberto Donato (ack) wrote :

From the errors in regiond log itseems the cause of the issue might be LP:1988976

Revision history for this message
Linus Walther (linwalth) wrote :

I beg to differ. The scripts not returning is only ONE of the problems. I do not think that this is a duplicate of #1988976.

My main problem is the unstable RackD connection based on MAAS not being able to handle 2 interfaces, crashlooping and breaking off connections with this multicast error - unable to form a stable connection.

2023-06-08 09:29:29 twisted.internet.defer: [critical] Unhandled error in Deferred:
2023-06-08 09:29:29 twisted.internet.defer: [critical]
 Traceback (most recent call last):
 Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x1e\x00\x0c', 98, 'Address already in use')

Of course the RackD, not being able to build a stable connection, will not be able to send test results.

Revision history for this message
Linus Walther (linwalth) wrote :

I ought to correct my assesments as i just realized, looking at the above regionD logs, that the regionD also has this multicast joining error:

2023-07-10 08:48:07 twisted.internet.defer: [critical]
 Traceback (most recent call last):
 Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x04\x00\x0b', 98, 'Address already in use')

Since the RegionD only has one address, I don't know if i can stand by my earlier assessment that this is a problem of Servers with multiple interfaces.

Another piece of information would be that all our MAAS machines are virtualized machines. Might that play into the situation with the multicast join not working correctly?

Revision history for this message
Anton Troyanov (troyanov) wrote :

It happened to me several times last night after restarting MAAS multiple times in a row.

I was running MAAS edge inside container with 2 interfaces

8: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:81:67:ef brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.62/24 metric 100 brd 10.0.0.255 scope global dynamic eth0
       valid_lft 2981sec preferred_lft 2981sec
    inet6 fe80::216:3eff:fe81:67ef/64 scope link
       valid_lft forever preferred_lft forever
10: eth1@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:c9:20:fd brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.10.10.1/24 brd 10.10.10.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fec9:20fd/64 scope link
       valid_lft forever preferred_lft forever

Jul 21 15:34:50 maas maas.pebble[153069]: 2023-07-21T15:34:50.027Z [regiond] Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x00\x00>'>
Jul 21 15:34:50 maas maas.pebble[153069]: 2023-07-21T15:34:50.027Z [regiond] Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\n\n\x01',>

Jul 21 15:43:12 maas maas.pebble[154713]: 2023-07-21T15:43:12.017Z [rackd] Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x00\x00>', >
Jul 21 15:43:12 maas maas.pebble[154713]: 2023-07-21T15:43:12.017Z [rackd] Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\n\n\x01', 9>

Jul 21 15:46:31 maas maas.pebble[155490]: 2023-07-21T15:46:31.718Z [regiond] Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x00\x00>'>
Jul 21 15:46:31 maas maas.pebble[155490]: 2023-07-21T15:46:31.718Z [regiond] Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\n\n\x01',>

Jul 21 16:03:30 maas maas.pebble[157560]: 2023-07-21T16:03:30.877Z [regiond] Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x00\x00>'>
Jul 21 16:03:30 maas maas.pebble[157560]: 2023-07-21T16:03:30.877Z [regiond] Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\n\n\x01',>

Using container with 3 interfaces, errors will come in a group of 3, if there are 4 interfaces in a group of 4. I didn't notice this error when I was running MAAS inside container with 1 interface.

However I didn't notice any issues related to degradation of functionality

According to some code comments such behaviour is expected?
src/provisioningserver/utils/services.py:498

# Use the first IP address on each interface. Since the
# underlying interface is the same, joining using a
# secondary IP address on the same interface will produce
# an "Address already in use" error.

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Muhammad Ahmad (ahmadfsbd) wrote (last edit ):
Download full text (7.1 KiB)

We are hitting a similar error on a customer environment after upgrading from Maas 2.9 to 3.2.

2023-09-12 06:55:19 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: brappformix, brext, broam.
2023-09-12 06:55:20 twisted.internet.defer: [critical] Unhandled error in Deferred:
2023-09-12 06:55:20 twisted.internet.defer: [critical]
        Traceback (most recent call last):
        Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'666\t', 98, 'Address already in use')

2023-09-12 06:55:20 twisted.internet.defer: [critical] Unhandled error in Deferred:
2023-09-12 06:55:20 twisted.internet.defer: [critical]
        Traceback (most recent call last):
        Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x0f\xa1\x05', 98, 'Address already in use')

2023-09-12 06:55:20 twisted.internet.defer: [critical] Unhandled error in Deferred:
2023-09-12 06:55:20 twisted.internet.defer: [critical]
        Traceback (most recent call last):
        Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n\x0f\xa8\n', 98, 'Address already in use')

2023-09-12 06:48:26 provisioningserver.utils.services: [critical] Failed to update and/or record network interface configuration: Expecting value: line 1 column 1 (char 0); interfaces: {'bond0': {'type': 'bond', 'mac_address': 'Redacted', 'links': [], 'enabled': True, 'parents': ['eno12409np1', 'eno12399np0'], 'source': 'machine-resources', 'monitored': True}, 'bond0.402': {'type': 'vlan', 'mac_address': 'Redacted', 'links': [], 'enabled': True, 'parents': ['bond0'], 'source': 'machine-resources', 'vid': 402, 'monitored': False}, 'bond0.403': {'type': 'vlan', 'mac_address': 'Redacted', 'links': [], 'enabled': True, 'parents': ['bond0'], 'source': 'machine-resources', 'vid': 403, 'monitored': False}, 'brappformix': {'type': 'bridge', 'mac_address': 'Redacted', 'links': [{'mode': 'static', 'address': 'Redacted'}], 'enabled': True, 'parents': ['bond0.403'], 'source': 'machine-resources', 'monitored': False}, 'brext': {'type': 'bridge', 'mac_address': 'Redacted', 'links': [{'mode': 'static', 'address': 'Redacted'}], 'enabled': True, 'parents': ['bond0.402'], 'source': 'machine-resources', 'monitored': False}, 'broam': {'type': 'bridge', 'mac_address': 'Redacted', 'links': [{'mode': 'static', 'address': 'Redacted', 'gateway': 'Redacted'}, {'mode': 'static', 'address': 'Redacted', 'gateway': 'Redacted'}], 'enabled': True, 'parents': ['bond0'], 'source': 'machine-resources', 'monitored': False}, 'eno12399np0': {'type': 'physical', 'mac_address': 'Redacted', 'links': [], 'enabled': True, 'parents': [], 'source': 'machine-resources', 'monitored': False}, 'eno12409np1': {'type': 'physical', 'mac_address': 'Redacted', 'links': [], 'enabled': True, 'parents': [], 'source': 'machine-resources', 'monitored': False}, 'eno8303': {'type': 'physical', 'mac_address': 'Redacted', 'links': [], 'enabled': False, 'parents': [], 'source': 'machine-resources', 'monitored': False}, 'eno8403': {'type': 'physical', 'mac_address': 'Redacted', 'links': [], 'enabled': False, 'parents': [], 'source': 'machine-resour...

Read more...

Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

As comment #10 states, this appears to be expected behaviour that unnecessarily produces a scary-looking log message. MAAS should revise the emitted log level and gracefully handle expected exceptions in this case.

Changed in maas:
milestone: none → 3.5.0
status: Incomplete → Triaged
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.