Services not running that should be: apache2

Bug #1952414 reported by Aurelien Lourot
32
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Ceph RADOS Gateway Charm
New
Undecided
Unassigned
Charm Helpers
New
Undecided
Unassigned
Gnocchi Charm
New
Undecided
Unassigned
OpenStack Cinder Charm
New
Undecided
Unassigned
OpenStack Dashboard Charm
New
Undecided
Unassigned
OpenStack Glance Charm
New
Undecided
Unassigned
OpenStack Heat Charm
New
Undecided
Unassigned
OpenStack Neutron API Charm
New
Undecided
Unassigned
OpenStack Nova Cloud Controller Charm
New
Undecided
Unassigned
OpenStack Swift Proxy Charm
New
Undecided
Unassigned

Bug Description

We have been seeing this in various charm gates since the beginning of November 2021.

For example on Focal-Victoria in a 21.10 backport on the neutron-api charm in the octavia gate:
https://review.opendev.org/c/openstack/charm-octavia/+/819246
https://openstack-ci-reports.ubuntu.com/artifacts/d39/819246/2/check/focal-victoria-no-amphora/d39cb11/

For example on Bionic-Queens on `master` on the neutron-api charm in a trilio gate:
https://review.opendev.org/c/openstack/charm-trilio-horizon-plugin/+/816064
https://openstack-ci-reports.ubuntu.com/artifacts/524/816064/11/check/bionic-queens-41_trilio-horizon/524a9a9/

For example on Bionic-Queens on `master` on the nova-cloud-controller charm in a trilio gate:
https://review.opendev.org/c/openstack/charm-trilio-wlm/+/816065
https://openstack-ci-reports.ubuntu.com/artifacts/5c0/816065/9/check/bionic-queens-41_trilio-wlm/5c0535f/

For example on Bionic-Train on `master` on the swift-proxy charm in the gnocchi gate: (juju crashdump attached)
https://review.opendev.org/c/openstack/charm-gnocchi/+/817785
https://openstack-ci-reports.ubuntu.com/artifacts/0aa/817785/3/check/test-s3-bionic-train-s3/0aa151a/

Maybe caused by lp:1795918

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
Revision history for this message
Felipe Reyes (freyes) wrote :
Download full text (3.2 KiB)

this is not looking right, I'm troubleshooting a ci job failure https://openstack-ci-reports.ubuntu.com/artifacts/5d4/819438/2/check/bionic-ussuri-ha-ovn/5d44828/

```
2021-11-26 17:04:54.656762 | focal-medium | 2021-11-26 17:04:54 [INFO] Timed out waiting for 'neutron-api/0'. The workload status message is 'Services not running that should be: apache2' which is not one of '['ready', 'Ready', 'Unit is ready']'
```

```
neutron-api/0* blocked idle 4 172.16.0.254 9696/tcp Services not running that should be: apache2
```

Looking at it closer I found this:

In syslog:
```
Nov 26 16:38:34 juju-98a11d-zaza-4d85c92c4d57-4 systemd[1]: Starting The Apache HTTP Server...
Nov 26 16:38:34 juju-98a11d-zaza-4d85c92c4d57-4 apachectl[5947]: AH00526: Syntax error on line 13 of /etc/apache2/sites-enabled/openstack_https_frontend.conf:
Nov 26 16:38:34 juju-98a11d-zaza-4d85c92c4d57-4 apachectl[5947]: SSLCertificateFile: file '/etc/apache2/ssl/neutron/cert_172.16.0.254' does not exist or is empty
Nov 26 16:38:34 juju-98a11d-zaza-4d85c92c4d57-4 apachectl[5947]: Action 'start' failed.
Nov 26 16:38:34 juju-98a11d-zaza-4d85c92c4d57-4 apachectl[5947]: The Apache error log may have more information.
Nov 26 16:38:34 juju-98a11d-zaza-4d85c92c4d57-4 systemd[1]: apache2.service: Control process exited, code=exited status=1
Nov 26 16:38:34 juju-98a11d-zaza-4d85c92c4d57-4 systemd[1]: apache2.service: Failed with result 'exit-code'.
Nov 26 16:38:34 juju-98a11d-zaza-4d85c92c4d57-4 systemd[1]: Failed to start The Apache HTTP Server.
```

Then in the apache config `/etc/apache2/sites-enabled/openstack_https_frontend.conf`:
```
    SSLCertificateFile /etc/apache2/ssl/neutron/cert_172.16.0.254
    # See LP 1484489 - this is to support <= 2.4.7 and >= 2.4.8
    SSLCertificateChainFile /etc/apache2/ssl/neutron/cert_172.16.0.254
    SSLCertificateKeyFile /etc/apache2/ssl/neutron/key_172.16.0.254
```

Then in `/var/log/juju/unit-neutron-api-0.log`:

```
2021-11-26 15:56:29 DEBUG juju-log certificates:41: Writing file /usr/local/share/ca-certificates/vault_juju_ca_cert.crt root:root 444
2021-11-26 15:56:29 DEBUG certificates-relation-changed Clearing symlinks in /etc/ssl/certs...
2021-11-26 15:56:29 DEBUG certificates-relation-changed done.
2021-11-26 15:56:29 DEBUG certificates-relation-changed Updating certificates in /etc/ssl/certs...
2021-11-26 15:56:30 DEBUG certificates-relation-changed 129 added, 0 removed; done.
2021-11-26 15:56:30 DEBUG certificates-relation-changed Running hooks in /etc/ca-certificates/update.d...
2021-11-26 15:56:30 DEBUG certificates-relation-changed done.
2021-11-26 15:56:30 DEBUG jujuc server.go:211 running hook tool "juju-log" for neutron-api/0-certificates-relation-changed-4815572091935494528
2021-11-26 15:56:30 DEBUG juju-log certificates:41: Writing file /etc/apache2/ssl/neutron/cert_host-172-16-0-254.project.serverstack root:root 640
2021-11-26 15:56:31 DEBUG jujuc server.go:211 running hook tool "juju-log" for neutron-api/0-certificates-relation-changed-4815572091935494528
2021-11-26 15:56:31 DEBUG juju-log certificates:41: Writing file /etc/apache2/ssl/neutron/key_host-172-16-0-254.project.serverstack roo...

Read more...

tags: added: unstable-test
Revision history for this message
Felipe Reyes (freyes) wrote :

I think everyhing boils down to get_network_addresses()[0] , according to the docstring it may return any of these lists:

[(address_in_net_a, hostname_in_net_a),...]
[(address_in_net_a, vip_in_net_a),...]
[(address_in_net_a, address_in_net_a),...]

For the job I am troubleshooting neutron-api has a single unit (non-HA), so we have security that there is no vip set which rules out the possibility of return a list like the second one. So what if one run of get_network_addressess() returned a list like the first one while other runs returned a list like the third one?, I have no evidence, but it would explain the behavior.

[0] https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/context.py#L1071

Revision history for this message
Felipe Reyes (freyes) wrote :
Revision history for this message
Corey Bryant (corey.bryant) wrote :
Revision history for this message
Felipe Reyes (freyes) wrote :
Revision history for this message
Felipe Reyes (freyes) wrote :

https://github.com/juju/charm-helpers/pull/661 - the get_network_addresses() is now logging what the returned set contains, so when this bug pops up again we need to grep for "get_network_addresses" in /var/log/juju/unit-*.log to confirm or discard my hypothesis in comment #3 .

Revision history for this message
Nobuto Murata (nobuto) wrote :
Download full text (6.7 KiB)

> we need to grep for "get_network_addresses" in /var/log/juju/unit-*.log
> to confirm or discard my hypothesis in comment #3 .

Hope this helps:

"get_network_addresses: [('192.168.151.123', '192.168.151.123')]"

$ ll /etc/apache2/ssl/horizon/
total 16
dr-xr-xr-x 2 root root 4096 Jun 22 03:45 ./
drwxr-xr-x 3 root root 4096 Jun 22 03:29 ../
-rw-r----- 1 root root 1492 Jun 22 03:45 cert_juju-31eb60-1-lxd-5.maas
-rw-r----- 1 root root 1678 Jun 22 03:45 key_juju-31eb60-1-lxd-5.maas

$ journalctl -u apache2 -n 30 --no-pager
-- Logs begin at Wed 2022-06-22 03:08:02 UTC, end at Wed 2022-06-22 04:24:33 UTC. --
Jun 22 03:37:32 juju-31eb60-1-lxd-5 systemd[1]: Stopped The Apache HTTP Server.
Jun 22 03:37:38 juju-31eb60-1-lxd-5 systemd[1]: Starting The Apache HTTP Server...
Jun 22 03:37:39 juju-31eb60-1-lxd-5 apachectl[30701]: AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 192.168.151.123. Set the 'ServerName' directive globally to suppress this message
Jun 22 03:37:39 juju-31eb60-1-lxd-5 systemd[1]: Started The Apache HTTP Server.
Jun 22 03:46:02 juju-31eb60-1-lxd-5 systemd[1]: Reloading The Apache HTTP Server.
Jun 22 03:46:02 juju-31eb60-1-lxd-5 apachectl[34513]: AH00548: NameVirtualHost has no effect and will be removed in the next release /etc/apache2/sites-enabled/default-ssl.conf:3
Jun 22 03:46:02 juju-31eb60-1-lxd-5 apachectl[34513]: AH00526: Syntax error on line 43 of /etc/apache2/sites-enabled/default-ssl.conf:
Jun 22 03:46:02 juju-31eb60-1-lxd-5 apachectl[34513]: SSLCertificateFile: file '/etc/apache2/ssl/horizon/cert_192.168.151.123' does not exist or is empty
Jun 22 03:46:02 juju-31eb60-1-lxd-5 apachectl[34510]: Action 'graceful' failed.
Jun 22 03:46:02 juju-31eb60-1-lxd-5 apachectl[34510]: The Apache error log may have more information.
Jun 22 03:46:02 juju-31eb60-1-lxd-5 systemd[1]: apache2.service: Control process exited, code=exited, status=1/FAILURE
Jun 22 03:46:02 juju-31eb60-1-lxd-5 systemd[1]: Reload failed for The Apache HTTP Server.
Jun 22 03:50:30 juju-31eb60-1-lxd-5 systemd[1]: Stopping The Apache HTTP Server...
Jun 22 03:50:30 juju-31eb60-1-lxd-5 apachectl[36531]: AH00548: NameVirtualHost has no effect and will be removed in the next release /etc/apache2/sites-enabled/default-ssl.conf:3
Jun 22 03:50:30 juju-31eb60-1-lxd-5 apachectl[36531]: AH00526: Syntax error on line 43 of /etc/apache2/sites-enabled/default-ssl.conf:
Jun 22 03:50:30 juju-31eb60-1-lxd-5 apachectl[36531]: SSLCertificateFile: file '/etc/apache2/ssl/horizon/cert_192.168.151.123' does not exist or is empty
Jun 22 03:50:30 juju-31eb60-1-lxd-5 apachectl[36529]: Action 'stop' failed.
Jun 22 03:50:30 juju-31eb60-1-lxd-5 apachectl[36529]: The Apache error log may have more information.
Jun 22 03:50:30 juju-31eb60-1-lxd-5 systemd[1]: apache2.service: Control process exited, code=exited, status=1/FAILURE
Jun 22 03:50:30 juju-31eb60-1-lxd-5 systemd[1]: apache2.service: Failed with result 'exit-code'.
Jun 22 03:50:30 juju-31eb60-1-lxd-5 systemd[1]: Stopped The Apache HTTP Server.
Jun 22 03:50:37 juju-31eb60-1-lxd-5 systemd[1]: Starting The Apache HTTP Server...
Jun 22 03:50:37 juju-31eb60-1-lxd-5 apachectl[36545]: AH00548...

Read more...

Revision history for this message
Nobuto Murata (nobuto) wrote :

Another unit with "Services not running that should be: apache2" in the same deployment.

tags: added: cdo-qa
Revision history for this message
Moises Emilio Benzan Mora (moisesbenzan) wrote :
Revision history for this message
Nobuto Murata (nobuto) wrote (last edit ):

Subscribing ~field-high.

"Services not running that should be: apache2"
with
"SSLCertificateFile: file '/etc/apache2/ssl/*/cert_*' does not exist or is empty"
is happening frequently and unnecessarily blocking an unattended deployment.

[previous effort]
- https://launchpad.net/bugs/1822952
- https://launchpad.net/bugs/1889665
- https://launchpad.net/bugs/1893847

[similar open bugs]
- https://launchpad.net/bugs/1966489
- https://launchpad.net/bugs/1935990

Revision history for this message
Nobuto Murata (nobuto) wrote :

This is easy to trigger with the attached minimized bundle with the following steps.

$ juju deploy ./keystone-ha.yaml && time juju-wait -w --exclude vault

$ juju ssh vault/leader '
    sudo apt install -y jq
    export VAULT_ADDR="http://localhost:8200"
    output="$(vault operator init -key-shares=1 -key-threshold=1 -format json)"

    vault operator unseal "$(echo "$output" | jq -r .unseal_keys_b64[])"

    export VAULT_TOKEN="$(echo "$output" | jq -r .root_token)"
    vault token create -ttl=10m
'

$ juju run-action --wait vault/leader authorize-charm token="<TOKEN>"

$ juju run-action vault/leader --wait generate-root-ca

Revision history for this message
Nobuto Murata (nobuto) wrote :

crashdump with the step above. I will try vault 1.8/stable when I have time.

Revision history for this message
Nobuto Murata (nobuto) wrote :

vault 1.8/stable didn't change the situation. It still fails.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Just for the record, it's reproducible with vault latest/edge charm.

Revision history for this message
Nobuto Murata (nobuto) wrote :

One thing I noticed was juju-f74f34-0-lxd-0.maas (FQDN) vs juju-f74f34-1-lxd-0 (hostname) in "cert_requests". I will run more testing to see if there is a correlation between cert_requests data and the failed units.

$ juju show-unit vault/0 --endpoint certificates
vault/0:
  workload-version: 1.8.8
  machine: 0/lxd/2
  opened-ports:
  - 8200/tcp
  public-address: 192.168.151.190
  charm: ch:amd64/jammy/vault-93
  leader: true
  life: alive
  relation-info:
  - relation-id: 10
    endpoint: certificates
    related-endpoint: certificates
    application-data: {}
    related-units:
      keystone/0:
        in-scope: true
        data:
          cert_requests: '{"juju-f74f34-0-lxd-0.maas": {"sans": ["192.168.151.193",
            "192.168.151.99"]}}'
          egress-subnets: 192.168.151.193/32
          ingress-address: 192.168.151.193
          private-address: 192.168.151.193
          unit_name: keystone_0
      keystone/1:
        in-scope: true
        data:
          cert_requests: '{"juju-f74f34-1-lxd-0": {"sans": ["192.168.151.187", "192.168.151.99"]}}'
          egress-subnets: 192.168.151.187/32
          ingress-address: 192.168.151.187
          private-address: 192.168.151.187
          unit_name: keystone_1
      keystone/2:
        in-scope: true
        data:
          cert_requests: '{"juju-f74f34-2-lxd-0": {"sans": ["192.168.151.188", "192.168.151.99"]}}'
          egress-subnets: 192.168.151.188/32
          ingress-address: 192.168.151.188
          private-address: 192.168.151.188
          unit_name: keystone_2

Revision history for this message
Nobuto Murata (nobuto) wrote :

There *is* a correlation. At the end of the day, it might be boiled down to the known issue to other charms than API services...
https://bugs.launchpad.net/charm-layer-ovn/+bug/1896630

In another run, there are two units that failed to put /etc/apache2/ssl/keystone/cert_<vip> and those two had FQDN compared to the succeeded one with hostname.

$ for i in {0..2}; do echo keystone/$i; juju ssh keystone/$i 'ls -l /etc/apache2/ssl/keystone/cert_192.168.151.99' 2>/dev/null ; done
keystone/0
ls: cannot access '/etc/apache2/ssl/keystone/cert_192.168.151.99': No such file or directory
keystone/1
lrwxrwxrwx 1 root root 50 Mar 23 07:13 /etc/apache2/ssl/keystone/cert_192.168.151.99 -> /etc/apache2/ssl/keystone/cert_juju-3f0619-1-lxd-0
keystone/2
ls: cannot access '/etc/apache2/ssl/keystone/cert_192.168.151.99': No such file or directory

$ juju show-unit vault/0 --endpoint certificates
vault/0:
  workload-version: 1.8.8
  machine: 0/lxd/2
  opened-ports:
  - 8200/tcp
  public-address: 192.168.151.200
  charm: ch:amd64/jammy/vault-93
  leader: true
  life: alive
  relation-info:
  - relation-id: 10
    endpoint: certificates
    related-endpoint: certificates
    application-data: {}
    related-units:
      keystone/0:
        in-scope: true
        data:
          cert_requests: '{"juju-3f0619-0-lxd-0.maas": {"sans": ["192.168.151.122",
            "192.168.151.99"]}}'
          egress-subnets: 192.168.151.122/32
          ingress-address: 192.168.151.122
          private-address: 192.168.151.122
          unit_name: keystone_0
      keystone/1:
        in-scope: true
        data:
          cert_requests: '{"juju-3f0619-1-lxd-0": {"sans": ["192.168.151.125", "192.168.151.99"]}}'
          egress-subnets: 192.168.151.125/32
          ingress-address: 192.168.151.125
          private-address: 192.168.151.125
          unit_name: keystone_1
      keystone/2:
        in-scope: true
        data:
          cert_requests: '{"juju-3f0619-2-lxd-0.maas": {"sans": ["192.168.151.197",
            "192.168.151.99"]}}'
          egress-subnets: 192.168.151.197/32
          ingress-address: 192.168.151.197
          private-address: 192.168.151.197
          unit_name: keystone_2

Revision history for this message
Nobuto Murata (nobuto) wrote :
Revision history for this message
Nobuto Murata (nobuto) wrote :

This FQDN vs hostname may not be the root source of the issue. I did see the tendency multiple times in a row but there were some cases where one of some services failed even without having the difference on FQDN vs hostname.

In the following case, cinder failed to create a symlink of a cert while glance worked. But I don't see much difference in the relation data.

Anyway, hope I offered enough data by now to reproduce the issue so someone familiar with the logic of this HTTPS setup code can take a look.

  - relation-id: 73
    endpoint: certificates
    related-endpoint: certificates
    application-data: {}
    related-units:
      cinder/0:
        in-scope: true
        data:
          cert_requests: '{"juju-fe03b8-1-lxd-2.maas": {"sans": ["192.168.151.123"]}}'
          egress-subnets: 192.168.151.123/32
          ingress-address: 192.168.151.123
          private-address: 192.168.151.123
          unit_name: cinder_0
  - relation-id: 74
    endpoint: certificates
    related-endpoint: certificates
    application-data: {}
    related-units:
      glance/0:
        in-scope: true
        data:
          cert_requests: '{"juju-fe03b8-2-lxd-2.maas": {"sans": ["192.168.151.108"]}}'
          egress-subnets: 192.168.151.108/32
          ingress-address: 192.168.151.108
          private-address: 192.168.151.108
          unit_name: glance_0

ubuntu@quick-maas:~$ juju ssh cinder/0 -- ls -alF /etc/apache2/ssl/cinder/
total 16
dr-xr-xr-x 2 root root 4096 Mar 24 00:51 ./
drwxr-xr-x 3 root root 4096 Mar 24 00:32 ../
-rw-r----- 1 root root 1492 Mar 24 00:51 cert_juju-fe03b8-1-lxd-2.maas
-rw-r----- 1 root root 1678 Mar 24 00:51 key_juju-fe03b8-1-lxd-2.maas
Connection to 192.168.151.123 closed.

ubuntu@quick-maas:~$ juju ssh glance/0 -- ls -alF /etc/apache2/ssl/glance/
total 16
dr-xr-xr-x 2 root root 4096 Mar 24 00:51 ./
drwxr-xr-x 3 root root 4096 Mar 24 00:32 ../
lrwxrwxrwx 1 root root 53 Mar 24 00:51 cert_192.168.151.108 -> /etc/apache2/ssl/glance/cert_juju-fe03b8-2-lxd-2.maas
-rw-r----- 1 root root 1492 Mar 24 00:51 cert_juju-fe03b8-2-lxd-2.maas
lrwxrwxrwx 1 root root 52 Mar 24 00:51 key_192.168.151.108 -> /etc/apache2/ssl/glance/key_juju-fe03b8-2-lxd-2.maas
-rw-r----- 1 root root 1674 Mar 24 00:51 key_juju-fe03b8-2-lxd-2.maas
Connection to 192.168.151.108 closed.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Not to mix up many issues, I filed a separate bug focusing on LXD + MAAS rDNS and will unsubscribe ~field-high from this one.
https://bugs.launchpad.net/charm-helpers/+bug/2012801

Revision history for this message
Peter Matulis (petermatulis) wrote :

I got bit by this today on Jammy Antelope with the glance charm.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.