Neutron fails to respawn radvd due to corrupt pid file

Bug #2033980 reported by Nate Bill
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
kolla-ansible
Invalid
Undecided
Unassigned
neutron
In Progress
Medium
Brian Haley

Bug Description

**Bug Report**

What happened:

I have had issues periodically where radvd seems to die and neutron is not able to respawn it. I'm not sure why it dies.

In my neutron-l3-agent.log, the following error occurs once per minute:

```
2023-09-03 14:37:07.514 16 ERROR neutron.agent.linux.utils [-] Unable to convert value in /var/lib/neutron/external/pids/ea759c71-0f4d-4be9-a761-83843ce04d9a.pid.radvd
2023-09-03 14:37:07.514 16 ERROR neutron.agent.linux.external_process [-] radvd for router with uuid ea759c71-0f4d-4be9-a761-83843ce04d9a not found. The process should not have died
2023-09-03 14:37:07.514 16 WARNING neutron.agent.linux.external_process [-] Respawning radvd for uuid ea759c71-0f4d-4be9-a761-83843ce04d9a
2023-09-03 14:37:07.514 16 ERROR neutron.agent.linux.utils [-] Unable to convert value in /var/lib/neutron/external/pids/ea759c71-0f4d-4be9-a761-83843ce04d9a.pid.radvd
2023-09-03 14:37:07.762 16 ERROR neutron.agent.linux.utils [-] Exit code: 255; Cmd: ['ip', 'netns', 'exec', 'qrouter-ea759c71-0f4d-4be9-a761-83843ce04d9a', 'env', 'PROCESS_TAG=radvd-ea759c71-0f4d-4be9-a761-83843ce04d9a', 'radvd', '-C', '/var/lib/neutron/ra/ea759c71-0f4d-4be9-a761-83843ce04d9a.radvd.conf', '-p', '/var/lib/neutron/external/pids/ea759c71-0f4d-4be9-a761-83843ce04d9a.pid.radvd', '-m', 'syslog', '-u', 'neutron']; Stdin: ; Stdout: ; Stderr:
```

Inspecting the pid file, it appears to have 2 pids, one on each line:

```
$ docker exec -it neutron_l3_agent cat /var/lib/neutron/external/pids/ea759c71-0f4d-4be9-a761-83843ce04d9a.pid.radvd
853
1161
```

Deleting the file then properly respawns radvd:

```
2023-09-03 14:38:07.515 16 ERROR neutron.agent.linux.external_process [-] radvd for router with uuid ea759c71-0f4d-4be9-a761-83843ce04d9a not found. The process should not have died
2023-09-03 14:38:07.516 16 WARNING neutron.agent.linux.external_process [-] Respawning radvd for uuid ea759c71-0f4d-4be9-a761-83843ce04d9a
```

What you expected to happen:

Radvd is respawned without needing manual intervention. Likely what is meant to happen is neutron should write the pid to the file, whereas instead it appends it. I'm not sure if this is a kolla issue or a neutron issue.

How to reproduce it (minimal and precise): Unsure, I'm not sure how radvd ends up dying in the first place. You could likely reproduce this by deploying kolla-ansible and then manually killing radvd.

**Environment**:
* OS (e.g. from /etc/os-release):
NAME="Rocky Linux"
VERSION="9.2 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.2"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.2 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.2"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.2"

* Kernel (e.g. `uname -a`):
Linux lon1 5.14.0-284.25.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 2 14:53:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

* Docker version if applicable (e.g. `docker version`):
Client: Docker Engine - Community
 Version: 24.0.5
 API version: 1.43
 Go version: go1.20.6
 Git commit: ced0996
 Built: Fri Jul 21 20:36:54 2023
 OS/Arch: linux/amd64
 Context: default

Server: Docker Engine - Community
 Engine:
  Version: 24.0.5
  API version: 1.43 (minimum version 1.12)
  Go version: go1.20.6
  Git commit: a61e2b4
  Built: Fri Jul 21 20:35:17 2023
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.6.22
  GitCommit: 8165feabfdfe38c65b599c4993d227328c231fca
 runc:
  Version: 1.1.8
  GitCommit: v1.1.8-0-g82f18fe
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

* Kolla-Ansible version (e.g. `git head or tag or stable branch` or pip package version if using release):
16.1.0 (stable/2023.1)

* Docker image Install type (source/binary): Default installed by kolla-ansible
* Docker image distribution: rocky
* Are you using official images from Docker Hub or self built? official
* If self built - Kolla version and environment used to build: not applicable
* Share your inventory file, globals.yml and other configuration files if relevant: Likely not relevant.

Revision history for this message
Sven Kieske (s-kieske) wrote :

an operator just reported this to us. It seems to happen if you provide 5 DNS Servers to a neutron IPv6 network.

Figuring out if this is really a kolla-ansible bug or should be reported to neutron instead.

Revision history for this message
Nate Bill (nbill) wrote :

In our case, we have a network with several subnets, each one with a DNS set. So I believe this would line up with what your operator has reported.

I've also noticed some other odd behaviour from the L3 agent where it seems to have a high loss and latency when handling udp traffic periodically. This started happening around the same time as the radvd issues, but I haven't fully narrowed down the cause yet.

Revision history for this message
Sven Kieske (s-kieske) wrote :
Revision history for this message
Sven Kieske (s-kieske) wrote :

so, this error message appears at exactly two locations in the neutron codebase:

"Unable to convert value in"

https://github.com/search?q=repo%3Aopenstack%2Fneutron+%22Unable+to+convert+value+in%22&type=code

1. in the "get_value_from_file" function in utils.py:

https://github.com/openstack/neutron/blob/c33805919eb89f5ea6d8b54d4142d4c829d9a2a0/neutron/agent/linux/utils.py#L250

2. in the "_get_value_from_conf_file" function in dhcp.py:

https://github.com/openstack/neutron/blob/c33805919eb89f5ea6d8b54d4142d4c829d9a2a0/neutron/agent/linux/dhcp.py#L401

the function in 1. is being called in:

https://github.com/openstack/neutron/blob/c33805919eb89f5ea6d8b54d4142d4c829d9a2a0/neutron/agent/linux/external_process.py#L154

in the function "pid" from class "ProcessManager".

This Logic was implemented in:

https://review.opendev.org/c/openstack/neutron/+/115935

Blueprint: https://blueprints.launchpad.net/openstack/?searchtext=agent-child-processes-status

the commit message reads:

"Implements ProcessMonitor in the dhcp_agent

The ProcessMonitor class will watch over spawned external processes,
taking the administrator configured action in the case of any
of the external processes die unexpectedly.

It covers both the neutron-ns-metadata-proxy for isolated metadata
and dnsmasq in the dnsmasq driver.

ProcessMonitor has been extended to allow specific pid files
for backwards-compatible dnsmasq pid file location.
"

sadly there are no other docs added in this changeset which explain the process of how to configure this
administrative action.

Revision history for this message
Sven Kieske (s-kieske) wrote :
Revision history for this message
Sven Kieske (s-kieske) wrote :
Revision history for this message
Sven Kieske (s-kieske) wrote :

I _think_ the bug is, that **only** on receiving signal 9 the pid file get's cleaned up:

That is, if radvd crashes for example, the pid file does not get cleaned up, at least when I search neutrons codebase for the function I can't find any other code that calls it in the context of radvd.

So, if radvd does not clean up after itself properly when crashing - which just might not be possible - I think this is a bug in neutron, but it's possible I have some misunderstanding, because this involves multiple layers of process management (python, Neutron ProcessManager, eventlet..).

Revision history for this message
Anton Dollmaier (antondollmaier) wrote :

Hi folks,

Thanks for the efforts you've put into this issue!

I can share another insight on why radvd might have crashed:

Before cleaning the PID file, I did take a look at the config of radvd:

```
$ cat /var/lib/neutron/ra/aee91f41-1945-40b4-b72f-8be2eb369b44.radvd.conf
interface qr-caa16d7e-26
{
   AdvSendAdvert on;
   MinRtrAdvInterval 30;
   MaxRtrAdvInterval 100;
   AdvLinkMTU 1450;

   RDNSS 2a02:74a0:x:0::53 10.40.3.53 2a02:74a0:x:0::54 {};

   prefix 2a02:74a0:x:y::/64
   {
        AdvOnLink on;
        AdvAutonomous on;
   };

   route fe80::a9fe:a9fe/128 {
   };
};
```

We've been configuring the router with terraform, assigning the ipv4 resolvers to the IPv4 subnet and the IPv6 resolvers to the IPv6 subnet.

After deleting the router, adjusting the subnets (no resolvers on v4, only ipv6 resolvers on ipv6), and recreating the router, radvd is now active and everything's fine.

It seems that due to misconfiguration (and incomplete template parsing), IPv4 nameservers ended up in the config of radvd, which failed to start. Neutron was then unable to clean up the pidfile, thus failing to start radvd again.

Aside from the missing cleanup, I've been asking myself where I could have found logs from radvd. It does log to syslog (I even found an old discussion introducing the change away from logfile/stderr) - but where do those logs end up?

Please let me/us know if additional information or testing is desired (not "required", obviously).

Best,
Anton

Changed in neutron:
importance: Undecided → Medium
tags: added: ipv6
tags: added: low-hanging-fruit
Revision history for this message
Brian Haley (brian-haley) wrote :

So if radvd is stopped by the neutron l3-agent, it doesn't specify a signal and defaults to '9', so should always cleanup the pid file. But if it dies unexpectedly, or I believe if the l3-agent is restarted and restarts radvd for a network, there could be a stale pid file left behind.

Seems changing this code to be more like the keepalived code would solve the problem - when spawning the first thing it does is try and remove the pid file. I'll send a patch that does that which is pretty small, not sure if anyone has the means to test it to see if it helps in their environment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/895832

Changed in neutron:
status: New → In Progress
Revision history for this message
Sven Kieske (s-kieske) wrote :

thanks for all the input and work.

should a separate bug be raised against radvd regarding feeding ipv4 dns servers to ipv6 subnets?

one could argue that radvd should at least not crash on that, if that's what it did. it's not entirely clear to me tbh.

Revision history for this message
Sven Kieske (s-kieske) wrote :

This is not a bug in kolla-ansible

Changed in kolla-ansible:
status: New → Invalid
Revision history for this message
Brian Haley (brian-haley) wrote :

Hi Sven,

Yes, please open another bug for the DNS server issue, neutron should probably filter by address family if it's crashing radvd. I ran into a similar bug in another project recently and that was the fix we adopted.

Revision history for this message
Sven Kieske (s-kieske) wrote :
Changed in neutron:
assignee: nobody → Brian Haley (brian-haley)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.