DNS Forward failures

Bug #2024625 reported by Jeffrey Chang
Affects        Status                 Importance  Assigned to          Milestone
MAAS           Status tracked in 3.5
  3.3          Fix Committed          High        Christian Grabowski
  3.4          Fix Released           High        Christian Grabowski
  3.5          Fix Committed          High        Christian Grabowski
charm-magpie   Fix Released           Undecided   Adam Collard

Bug Description

While testing MAAS 3.3.4, SQA ran into DNS forward failures in the magpie layer.
https://solutions.qa.canonical.com/testruns/326dcd1e-9572-4253-887d-4881c38bfb19
We see this in 1 out of 5 runs.
MAAS logs are here: https://oil-jenkins.canonical.com/artifacts/326dcd1e-9572-4253-887d-4881c38bfb19/generated/generated/maas/logs-2023-06-20-19.14.47.tgz

App Version Status Scale Charm Channel Rev Exposed Message
magpie-ceph-access-space blocked 6 magpie latest/edge 13 no ports ok, bonds ok, icmp ok, local hostname ok (meowth), fwd dns failed: ['3'], local mtu ok, required: 9000
magpie-ceph-replica-space blocked 6 magpie latest/edge 13 no ports ok, bonds ok, icmp ok, local hostname ok (meowth), fwd dns failed: ['3'], local mtu ok, required: 9000
magpie-internal-space blocked 6 magpie latest/edge 13 no ports ok, bonds ok, icmp ok, local hostname ok (meowth), fwd dns failed: ['3'], local mtu ok, required: 1500
magpie-oam-space active 6 magpie latest/edge 13 no ports ok, bonds ok, icmp ok, local hostname ok (meowth), dns ok, local mtu ok, required: 1500
magpie-public-space active 6 magpie latest/edge 13 no ports ok, bonds ok, icmp ok, local hostname ok (meowth), dns ok, local mtu ok, required: 1500

In juju crashdump, https://oil-jenkins.canonical.com/artifacts/326dcd1e-9572-4253-887d-4881c38bfb19/generated/generated/magpie/juju-crashdump-magpie-2023-06-20-19.13.29.tar.gz
2023-06-20 19:05:16 DEBUG unit.magpie-public-space/0.juju-log server.go:316 magpie:3: DNS Reverse command: /usr/bin/dig -x 10.244.8.134 +short +tries=1 +time=3
2023-06-20 19:05:16 INFO unit.magpie-public-space/0.juju-log server.go:316 magpie:3: Reverse result for unit_id: 3, hostname: pangoro.silo2.lab0.solutionsqa.
eth1.2678.pangoro.silo2.lab0.solutionsqa., exitcode: 0
2023-06-20 19:05:16 INFO unit.magpie-public-space/0.juju-log server.go:316 magpie:3: Reverse OK for unit_id: 3
2023-06-20 19:05:16 INFO unit.magpie-public-space/0.juju-log server.go:316 magpie:3: Forward lookup for hostname: pangoro.silo2.lab0.solutionsqa.
eth1.2678.pangoro.silo2.lab0.solutionsqa., node: magpie-public-space/3, unit_id: 3
2023-06-20 19:05:16 DEBUG unit.magpie-public-space/0.juju-log server.go:316 magpie:3: DNS Forward command: /usr/bin/dig pangoro.silo2.lab0.solutionsqa. +short +tries=1 +time=3
2023-06-20 19:05:16 INFO unit.magpie-public-space/0.juju-log server.go:316 magpie:3: Forward result for unit_id: 3, ip: 10.244.8.134
10.246.64.161
10.246.64.210
192.168.33.165
192.168.36.71
192.168.35.71, exitcode: 0
2023-06-20 19:05:16 INFO unit.magpie-public-space/0.juju-log server.go:316 magpie:3: Forward OK for unit_id: 3
2023-06-20 19:05:16 ERROR unit.magpie-public-space/0.juju-log server.go:316 magpie:3: Original IP and Forward MATCH FAILED for unit_id: 3, Original: 10.244.8.134, Forward: Can not resolve hostname to IP '10.244.8.134\n10.246.64.161\n10.246.64.210\n192.168.33.165\n192.168.36.71\n192.168.35.71'

Alberto Donato (ack) wrote :

It looks like this could be a duplicate of LP:2012801

Alberto Donato (ack)
Changed in maas:
milestone: none → 3.4.0
importance: Undecided → High
Alberto Donato (ack)
Changed in maas:
assignee: nobody → Christian Grabowski (cgrabowski)
Alberto Donato (ack) wrote :

From what I see, the error comes from the magpie charm: the forward-resolution check in forward_dns() expects a single IP to be returned. See https://opendev.org/openstack/charm-magpie/src/branch/master/src/lib/charms/layer/magpie_tools.py
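
A minimal sketch of the failure mode and the fix (illustrative only, not the charm's actual code; forward_lookup is a hypothetical helper). dig +short prints one address per line when a hostname has several A records, so comparing the original IP against the joined output fails even though the IP is present:

import subprocess

def forward_lookup(hostname: str) -> list[str]:
    """Run dig the way the magpie logs show, returning one IP per line."""
    out = subprocess.run(
        ["/usr/bin/dig", hostname, "+short", "+tries=1", "+time=3"],
        capture_output=True, text=True, check=False,
    ).stdout
    return out.strip().splitlines()

original_ip = "10.244.8.134"
answers = forward_lookup("pangoro.silo2.lab0.solutionsqa.")

# Buggy pattern: expects the whole answer to be a single matching IP.
buggy_ok = "\n".join(answers) == original_ip

# RFC-conformant pattern: the original IP only has to be among the answers.
fixed_ok = original_ip in answers
print(f"buggy={buggy_ok} fixed={fixed_ok} answers={answers}")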

Adam Collard (adam-collard) wrote :
Changed in charm-magpie:
status: New → Incomplete
status: Incomplete → Confirmed
Changed in maas:
status: New → Incomplete
status: Incomplete → In Progress
Changed in charm-magpie:
status: Confirmed → In Progress
assignee: nobody → Adam Collard (adam-collard)
Nobuto Murata (nobuto) wrote :

Under what circumstances does MAAS return multiple IPs for one DNS forward resolution? And why do we now have to update the logic in Magpie, if it hasn't changed for a while?

Adam Collard (adam-collard) wrote :

Magpie and MAAS both need to conform to DNS RFCs and norms, which allow more than one IP to exist for a given hostname lookup.

We are investigating why SolQA runs seem to hit this only 20% of the time, but in the meantime we have landed fixes to Magpie for correctness.

Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.0-rc1
Changed in charm-magpie:
status: In Progress → Fix Committed
Alberto Donato (ack) wrote :

Looking at the logs from one of the recent failed runs after the magpie charm fix, it seems the DNS sometimes does not respond:

unit-magpie-ceph-replica-space-0: 2023-07-26 08:28:48 DEBUG unit.magpie-ceph-replica-space/0.juju-log magpie:4: DNS Forward command: /usr/bin/dig 69.64-26.35.168.192.in-addr.arpa. +short +tries=1 +time=3
...
unit-magpie-ceph-replica-space-0: 2023-07-26 08:28:48 INFO unit.magpie-ceph-replica-space/0.juju-log magpie:4: Forward result for unit_id: 2, ip: No forward response, exitcode: 1

Adam Collard (adam-collard) wrote :

Extracted logs from magpie-ceph-access-space-0.log in the juju crashdump

https://pastebin.canonical.com/p/6q2hPxPfJv/

Note that magpie does a reverse lookup first, and gets an odd-looking .in-addr.arpa name.

Alberto Donato (ack) wrote :

When the subnet is smaller than a /24, MAAS generates glue records (see https://github.com/maas/maas/blob/master/src/maasserver/dns/zonegenerator.py#L482-L509): the /24 reverse zone returns a record pointing to another PTR in the zone for the smaller subnet.

Clients usually automatically resolve PTRs until they get a meaningful record (A, AAAA, CNAME).

The magpie charm should probably do the same, checking what kind of result dig returns and resolving again if it's a PTR.
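
A hedged sketch of that approach (illustrative names, not the charm's code; it assumes dig is available on the unit): keep following the reverse chain until the answer is no longer an in-addr.arpa name.

import subprocess

def dig_short(*args: str) -> list[str]:
    """Run dig +short with the same flags magpie uses."""
    out = subprocess.run(
        ["/usr/bin/dig", *args, "+short", "+tries=1", "+time=3"],
        capture_output=True, text=True, check=False,
    ).stdout
    return out.strip().splitlines()

def reverse_lookup(ip: str, max_hops: int = 5) -> str | None:
    """Resolve ip to a hostname, chasing in-addr.arpa indirections."""
    answers = dig_short("-x", ip)
    for _ in range(max_hops):
        if not answers:
            return None  # the chain dead-ends, as in this bug
        name = answers[-1]
        if name.rstrip(".").endswith("in-addr.arpa"):
            # Still a glue name, not a hostname: query its PTR directly.
            answers = dig_short(name, "PTR")
        else:
            return name
    return None

print(reverse_lookup("192.168.36.71"))  # None when the chain dead-ends, as seen above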

Changed in charm-magpie:
status: Fix Committed → New
Alberto Donato (ack) wrote :

Actually, the reverse lookup returns a CNAME pointing to the subnet's section of the reverse zone, but that record has no address associated with it.

As an example:

$ dig -x 192.168.36.71 +nocomments

; <<>> DiG 9.18.12-0ubuntu0.22.04.2-Ubuntu <<>> -x 192.168.36.71 +nocomments
;; global options: +cmd
;71.36.168.192.in-addr.arpa. IN PTR
71.36.168.192.in-addr.arpa. 17 IN CNAME 71.64-26.36.168.192.in-addr.arpa.
64-26.36.168.192.in-addr.arpa. 17 IN SOA silo2.lab0.solutionsqa. nobody.example.com. 180 600 1800 604800 30
64-26.36.168.192.in-addr.arpa. 30 IN SOA silo2.lab0.solutionsqa. nobody.example.com. 180 600 1800 604800 30
;; Query time: 3 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Tue Aug 01 07:02:31 UTC 2023
;; MSG SIZE rcvd: 190

$ dig 71.64-26.36.168.192.in-addr.arpa. +nocomments

; <<>> DiG 9.18.12-0ubuntu0.22.04.2-Ubuntu <<>> 71.64-26.36.168.192.in-addr.arpa. +nocomments
;; global options: +cmd
;71.64-26.36.168.192.in-addr.arpa. IN A
64-26.36.168.192.in-addr.arpa. 30 IN SOA silo2.lab0.solutionsqa. nobody.example.com. 180 600 1800 604800 30
;; Query time: 3 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Tue Aug 01 07:02:49 UTC 2023
;; MSG SIZE rcvd: 137

Christian Grabowski (cgrabowski) wrote :

So this does seem to be a MAAS issue. When a forward answer is added to BIND dynamically by MAAS (i.e. when a host deploys), a reverse record is generated from the same update. Currently, MAAS generates the reverse update using the forward update's zone, leaving BIND to figure out which reverse zone the record should fall under. Where there is no directive to generate CNAMEs for glue records, this works. However, when the glue-record directive exists for a given reverse zone, the reverse update is ignored, on the assumption that it should use answers from the smaller subnet's reverse zone.

We will need to add logic to look up the correct reverse zone, rather than just deriving it from the forward update, and target the reverse update at that zone instead.
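
A rough sketch of that idea with hypothetical names (this is not MAAS code): keep a table mapping reverse zones to the networks they serve, and pick the most specific zone containing the IP, so the PTR update lands in the RFC 2317-style glue zone rather than the /24.

from ipaddress import ip_address, ip_network

# Hypothetical zone table: reverse zone name -> the network it serves.
REVERSE_ZONES = {
    "36.168.192.in-addr.arpa": ip_network("192.168.36.0/24"),
    "64-26.36.168.192.in-addr.arpa": ip_network("192.168.36.64/26"),
}

def reverse_zone_for(ip: str) -> str | None:
    """Pick the longest-prefix (most specific) reverse zone holding ip."""
    addr = ip_address(ip)
    candidates = [
        (net.prefixlen, zone)
        for zone, net in REVERSE_ZONES.items()
        if addr in net
    ]
    return max(candidates)[1] if candidates else None

# The PTR for 192.168.36.71 belongs in the /26 glue zone, not the /24:
print(reverse_zone_for("192.168.36.71"))  # 64-26.36.168.192.in-addr.arpa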

no longer affects: maas/3.2
Alberto Donato (ack)
Changed in charm-magpie:
status: New → Fix Released