OSDs are not starting after upgrade from Mimic to Nautilus
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | New | Undecided | Unassigned |
Bug Description
ubuntu@
Model Controller Cloud/Region Version SLA Timestamp
ceph-test foundations-maas maas_cloud 2.6.10 unsupported 10:56:56Z
App Version Status Scale Charm Store Rev OS Notes
ceph-mon 13.2.7 active 3 ceph-mon jujucharms 45 ubuntu
ceph-osd 13.2.7 active 4 ceph-osd jujucharms 299 ubuntu
ceph-radosgw 13.2.7 active 1 ceph-radosgw jujucharms 285 ubuntu
ntp 3.2 active 4 ntp jujucharms 39 ubuntu
Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active idle 0/lxd/0 172.27.85.191 Unit is ready and clustered
ceph-mon/1 active idle 1/lxd/0 172.27.85.192 Unit is ready and clustered
ceph-mon/2* active idle 2/lxd/0 172.27.85.190 Unit is ready and clustered
ceph-osd/0* active idle 0 172.27.85.186 Unit is ready (1 OSD)
ntp/2 active idle 172.27.85.186 123/udp chrony: Ready
ceph-osd/1 active idle 1 172.27.85.187 Unit is ready (1 OSD)
ntp/1* active idle 172.27.85.187 123/udp chrony: Ready
ceph-osd/2 active idle 2 172.27.85.188 Unit is ready (1 OSD)
ntp/0 active idle 172.27.85.188 123/udp chrony: Ready
ceph-osd/3 active idle 3 172.27.85.189 Unit is ready (1 OSD)
ntp/3 active idle 172.27.85.189 123/udp chrony: Ready
ceph-radosgw/0* active idle 3/lxd/0 172.27.85.193 80/tcp Unit is ready
Machine State DNS Inst id Series AZ Message
0 started 172.27.85.186 node01 bionic default Deployed
0/lxd/0 started 172.27.85.191 juju-5a5ca4-0-lxd-0 bionic default Container started
1 started 172.27.85.187 node02 bionic default Deployed
1/lxd/0 started 172.27.85.192 juju-5a5ca4-1-lxd-0 bionic default Container started
2 started 172.27.85.188 node03 bionic default Deployed
2/lxd/0 started 172.27.85.190 juju-5a5ca4-2-lxd-0 bionic default Container started
3 started 172.27.85.189 node04 bionic default Deployed
3/lxd/0 started 172.27.85.193 juju-5a5ca4-3-lxd-0 bionic default Container started
ubuntu@
cloud:bionic-stein
ubuntu@
# wait for the upgrade to finish, everything looks OK
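(The command lines above were stripped when the report was rendered; the monitor upgrade is normally triggered by pointing the ceph-mon charm at the Stein cloud archive pocket, along the lines of the sketch below - the exact invocation is an assumption.)
# assumed commands, mirroring the standard charm upgrade procedure
juju config ceph-mon source=cloud:bionic-stein
# then watch the units roll through the package upgrade
watch juju status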
ubuntu@
Model Controller Cloud/Region Version SLA Timestamp
ceph-test foundations-maas maas_cloud 2.6.10 unsupported 11:03:41Z
App Version Status Scale Charm Store Rev OS Notes
ceph-mon 14.2.4 active 3 ceph-mon jujucharms 45 ubuntu
ceph-osd 13.2.7 active 4 ceph-osd jujucharms 299 ubuntu
ceph-radosgw 13.2.7 active 1 ceph-radosgw jujucharms 285 ubuntu
ntp 3.2 active 4 ntp jujucharms 39 ubuntu
Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active idle 0/lxd/0 172.27.85.191 Unit is ready and clustered
ceph-mon/1 active idle 1/lxd/0 172.27.85.192 Unit is ready and clustered
ceph-mon/2* active idle 2/lxd/0 172.27.85.190 Unit is ready and clustered
ceph-osd/0* active idle 0 172.27.85.186 Unit is ready (1 OSD)
ntp/2 active idle 172.27.85.186 123/udp chrony: Ready
ceph-osd/1 active idle 1 172.27.85.187 Unit is ready (1 OSD)
ntp/1* active idle 172.27.85.187 123/udp chrony: Ready
ceph-osd/2 active idle 2 172.27.85.188 Unit is ready (1 OSD)
ntp/0 active idle 172.27.85.188 123/udp chrony: Ready
ceph-osd/3 active idle 3 172.27.85.189 Unit is ready (1 OSD)
ntp/3 active idle 172.27.85.189 123/udp chrony: Ready
ceph-radosgw/0* active idle 3/lxd/0 172.27.85.193 80/tcp Unit is ready
Machine State DNS Inst id Series AZ Message
0 started 172.27.85.186 node01 bionic default Deployed
0/lxd/0 started 172.27.85.191 juju-5a5ca4-0-lxd-0 bionic default Container started
1 started 172.27.85.187 node02 bionic default Deployed
1/lxd/0 started 172.27.85.192 juju-5a5ca4-1-lxd-0 bionic default Container started
2 started 172.27.85.188 node03 bionic default Deployed
2/lxd/0 started 172.27.85.190 juju-5a5ca4-2-lxd-0 bionic default Container started
3 started 172.27.85.189 node04 bionic default Deployed
3/lxd/0 started 172.27.85.193 juju-5a5ca4-3-lxd-0 bionic default Container started
# check package version
ubuntu@
ceph-mon:
Installed: 14.2.4-
Candidate: 14.2.4-
Version table:
*** 14.2.4-
500 http://
100 /var/lib/
12.
500 http://
500 http://
12.
500 http://
Connection to 172.27.85.191 closed.
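(The prompt and command above were truncated; a minimal equivalent of this check, assuming it was run through juju ssh against the first monitor unit:)
juju ssh ceph-mon/0 sudo apt-cache policy ceph-mon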
# check cluster health, everything looks fine except for a warning, https:/
ubuntu@
cluster:
id: ecb243d2-
health: HEALTH_WARN
3 monitors have not enabled msgr2
services:
mon: 3 daemons, quorum juju-5a5ca4-
mgr: juju-5a5ca4-
osd: 4 osds: 4 up, 4 in
rgw: 1 daemon active (juju-5a5ca4-
data:
pools: 15 pools, 62 pgs
objects: 187 objects, 1.1 KiB
usage: 4.0 GiB used, 104 GiB / 108 GiB avail
pgs: 62 active+clean
Connection to 172.27.85.191 closed.
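The "3 monitors have not enabled msgr2" warning is expected at this stage of a Nautilus upgrade; per the upstream Nautilus upgrade notes it clears after enabling the v2 messenger protocol once every monitor is running 14.2.x, e.g.:
# run once on any monitor after all mons are upgraded
sudo ceph mon enable-msgr2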
# proceed to the OSD upgrade
ubuntu@
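(Command truncated again; presumably the OSD upgrade was triggered the same way as the monitors - an assumption based on the ceph-mon step above.)
# assumed command
juju config ceph-osd source=cloud:bionic-stein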
# wait until agents become idle - OSDs are broken
ubuntu@
Model Controller Cloud/Region Version SLA Timestamp
ceph-test foundations-maas maas_cloud 2.6.10 unsupported 11:11:08Z
App Version Status Scale Charm Store Rev OS Notes
ceph-mon 14.2.4 active 3 ceph-mon jujucharms 45 ubuntu
ceph-osd 14.2.4 blocked 4 ceph-osd jujucharms 299 ubuntu
ceph-radosgw 13.2.7 active 1 ceph-radosgw jujucharms 285 ubuntu
ntp 3.2 active 4 ntp jujucharms 39 ubuntu
Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active executing 0/lxd/0 172.27.85.191 Unit is ready and clustered
ceph-mon/1 active idle 1/lxd/0 172.27.85.192 Unit is ready and clustered
ceph-mon/2* active executing 2/lxd/0 172.27.85.190 Unit is ready and clustered
ceph-osd/0* blocked idle 0 172.27.85.186 No block devices detected using current configuration
ntp/2 active idle 172.27.85.186 123/udp chrony: Ready
ceph-osd/1 blocked idle 1 172.27.85.187 No block devices detected using current configuration
ntp/1* active idle 172.27.85.187 123/udp chrony: Ready
ceph-osd/2 blocked idle 2 172.27.85.188 No block devices detected using current configuration
ntp/0 active idle 172.27.85.188 123/udp chrony: Ready
ceph-osd/3 blocked idle 3 172.27.85.189 No block devices detected using current configuration
ntp/3 active idle 172.27.85.189 123/udp chrony: Ready
ceph-radosgw/0* active idle 3/lxd/0 172.27.85.193 80/tcp Unit is ready
Machine State DNS Inst id Series AZ Message
0 started 172.27.85.186 node01 bionic default Deployed
0/lxd/0 started 172.27.85.191 juju-5a5ca4-0-lxd-0 bionic default Container started
1 started 172.27.85.187 node02 bionic default Deployed
1/lxd/0 started 172.27.85.192 juju-5a5ca4-1-lxd-0 bionic default Container started
2 started 172.27.85.188 node03 bionic default Deployed
2/lxd/0 started 172.27.85.190 juju-5a5ca4-2-lxd-0 bionic default Container started
3 started 172.27.85.189 node04 bionic default Deployed
3/lxd/0 started 172.27.85.193 juju-5a5ca4-3-lxd-0 bionic default Container started
# SSH to the OSD node
ubuntu@
ubuntu@node01:~$ sudo apt-cache policy ceph-osd
ceph-osd:
Installed: 14.2.4-
Candidate: 14.2.4-
Version table:
*** 14.2.4-
500 http://
100 /var/lib/
12.
500 http://
500 http://
12.
500 http://
ubuntu@node01:~$
ubuntu@node01:~$ sudo systemctl status ceph-osd@3.service
● ceph-osd@3.service - Ceph object storage daemon osd.3
Loaded: loaded (/lib/systemd/
Active: failed (Result: core-dump) since Tue 2020-04-07 11:09:12 UTC; 2min 32s ago
Main PID: 58443 (code=dumped, signal=ABRT)
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Main process exited, code=dumped, status=6/ABRT
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Failed with result 'core-dump'.
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Service hold-off time over, scheduling restart.
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Scheduled restart job, restart counter is at 3.
Apr 07 11:09:12 node01 systemd[1]: Stopped Ceph object storage daemon osd.3.
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Start request repeated too quickly.
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Failed with result 'core-dump'.
Apr 07 11:09:12 node01 systemd[1]: Failed to start Ceph object storage daemon osd.3.
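To dig further into the crash on a unit, the standard systemd tooling applies (these commands are illustrative, not from the original session):
# full daemon log for the failed OSD
sudo journalctl -u ceph-osd@3.service
# clear the "start request repeated too quickly" state and retry
sudo systemctl reset-failed ceph-osd@3.service
sudo systemctl start ceph-osd@3.service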
# OSD log
ubuntu@node01:~$ sudo cat /var/log/
http://
# attempt to start OSD manually
ubuntu@node01:~$ sudo /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph 2>&1 | pastebinit
http://
Sounds like this is an upstream bug: https://tracker.ceph.com/issues/42223, with a backport available: https://github.com/ceph/ceph/pull/31644
The fix was released in Nautilus 14.2.5 (https://docs.ceph.com/docs/master/releases/nautilus/).
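Assuming that is the root cause, the likely remediation is to wait for 14.2.5 (or later) to reach the configured cloud archive pocket and then upgrade the OSD packages in place, roughly:
# sketch of a workaround, once 14.2.5+ is published to the pocket
sudo apt update
sudo apt install --only-upgrade ceph-osd
sudo systemctl restart ceph-osd@3.service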