"crash" module is always on but not properly configured
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ceph Monitor Charm | Fix Committed | Undecided | Samuel Walladge | |
| Quincy.2 | Fix Committed | Undecided | Unassigned | |
| Ceph OSD Charm | Fix Committed | Undecided | Samuel Walladge | |
| Quincy.2 | Fix Committed | Undecided | Unassigned | |
Bug Description
cloud:focal-yoga (quincy)
$ juju ssh ceph-mon/leader -- sudo ceph version
ceph version 17.2.0 (43e2e60a7559d3
How to reproduce:
1. make sure "crash" module is on (it's a part of "always on" modules)
https:/
$ juju ssh ceph-mon/leader -- sudo ceph mgr module ls | grep crash
crash on (always on)
2. intentionally crash ceph-osd process (in this example I used SIGSEGV)
$ juju ssh ceph-osd/leader -- sudo pkill --signal SIGSEGV ceph-osd
3. make sure a normal crash file is generated for apport *and* a set of files for the ceph crash module.
# ll -h /var/crash/
total 121M
drwxrwxrwt 2 root root 4.0K Dec 28 10:42 ./
drwxr-xr-x 13 root root 4.0K Dec 12 21:41 ../
-rw-r----- 1 ceph ceph 121M Dec 28 10:42 _usr_bin_
# ll -h /var/lib/
'/var/
total 1.6M
drwx------ 2 ceph ceph 4.0K Dec 28 10:42 ./
drwxr-xr-x 4 ceph ceph 4.0K Dec 28 10:42 ../
-r--r--r-- 1 ceph ceph 0 Dec 28 10:42 done
-rw-r--r-- 1 ceph ceph 1.6M Dec 28 10:42 log
-rw------- 1 ceph ceph 926 Dec 28 10:42 meta
/var/
total 8.0K
drwxr-xr-x 2 root root 4.0K Sep 13 17:47 ./
drwxr-xr-x 4 ceph ceph 4.0K Dec 28 10:42 ../
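Each crash directory's `meta` file is JSON describing the crash. A quick way to see which daemon crashed and when is to pull two fields out of it; the sample `meta` below is illustrative (hypothetical contents and path), not taken from this cluster:

```shell
# Illustrative meta file; a real one also carries a backtrace, versions, etc.
cat > /tmp/meta <<'EOF'
{"crash_id": "2022-12-28T10:42:04.661282Z_0000-example",
 "entity_name": "osd.2",
 "timestamp": "2022-12-28T10:42:04.661282Z"}
EOF
# Pull out the daemon name and crash time.
python3 -c 'import json; m = json.load(open("/tmp/meta")); print(m["entity_name"], m["timestamp"])'
# prints: osd.2 2022-12-28T10:42:04.661282Z
```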
4. check syslog for failures to post crashes to the MON units.
Dec 28 10:51:18 famous-skunk ceph-crash[10667]: WARNING:
Changed in charm-ceph-mon:
assignee: nobody → Samuel Walladge (swalladge)
Changed in charm-ceph-mon:
status: New → In Progress
Changed in charm-ceph-osd:
status: New → In Progress
assignee: nobody → Samuel Walladge (swalladge)
By configuring authentication for the crash module, the OSD nodes were able to post their recent crash to the MONs. Real outages can follow a few crashes of MONs or OSDs, so posting crashes should help give operators a heads-up to diagnose a recent crash.
https://docs.ceph.com/en/quincy/mgr/crash/
$ juju ssh ceph-mon/leader -- sudo ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash'
[client.crash]
    key = AQCRI6xje9HrHxAAU20bKTeL3k2pIlPNazeVfQ==
$ juju run -a ceph-osd '
cat <<EOF | sudo tee /etc/ceph/ceph.client.crash.keyring
[client.crash]
    key = AQCRI6xje9HrHxAAU20bKTeL3k2pIlPNazeVfQ==
EOF
'
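The heredoc above simply drops a standard Ceph keyring on each unit. A local sketch of the same write, with a placeholder key and a hypothetical path under /tmp instead of /etc/ceph:

```shell
# Same pattern as the juju run above, written locally; the key is a placeholder,
# not a real secret.
KEYRING=/tmp/ceph.client.crash.keyring
cat > "$KEYRING" <<'EOF'
[client.crash]
    key = AQ-placeholder-not-a-real-key==
EOF
# ceph-crash authenticates as client.crash with this keyring when posting crashes.
grep -c '^\[client.crash\]' "$KEYRING"   # prints: 1
```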
$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    osd.2 crashed on host famous-skunk at 2022-12-28T10:42:04.661282Z
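Once crashes can be posted, the usual follow-up is to inspect and then archive them so the RECENT_CRASH warning clears. A sketch of that session, using commands from the crash module docs linked above (requires a live cluster and admin credentials; `<id>` is a crash id from the listing):

```
sudo ceph crash ls            # list crash ids, new and archived
sudo ceph crash info <id>     # full metadata/backtrace for one crash
sudo ceph crash archive-all   # acknowledge everything; clears RECENT_CRASH
```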