Failed to start Ceph metadata server daemon

Bug #1961904 reported by kashif nawaz
This bug affects 1 person
Affects: OpenStack Ceph-FS Charm
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have deployed a Ceph cluster via the charms bundle file below:

series: focal
variables:
  oam-space: &oam-space oam-space
  customize-failure-domain: &customize-failure-domain True
machines:
  "0":
    constraints: tags=ceph-node-1
    series: focal
  "1":
    constraints: tags=ceph-node-2
    series: focal
  "2":
    constraints: tags=ceph-node-3
    series: focal
  "3":
    constraints: tags=master
    series: focal
  "4":
    constraints: tags=worker1
    series: focal
  "5":
    constraints: tags=worker2
    series: focal
  "6":
    constraints: tags=ceph-fs-1
    series: focal

applications:
  ceph-fs:
    charm: ceph-fs
    channel: stable
    revision: 36
    num_units: 1
    to:
    - "6"
    bindings:
      "": *oam-space
      ceph-mds: *oam-space
      certificates: *oam-space
      public: *oam-space
  ceph-mon:
    charm: cs:ceph-mon
    num_units: 3
    bindings:
      "": *oam-space
      public: *oam-space
      osd: *oam-space
    options:
      monitor-count: 3
      expected-osd-count: 3
      customize-failure-domain: *customize-failure-domain
      source: cloud:focal-wallaby
    to:
    - lxd:3
    - lxd:4
    - lxd:5
  ceph-osd:
    charm: cs:ceph-osd
    num_units: 3
    bindings:
      "": *oam-space
      public: *oam-space
      cluster: *oam-space
    options:
      osd-devices: /dev/vdb
      source: cloud:focal-wallaby
      aa-profile-mode: complain
      customize-failure-domain: *customize-failure-domain
      autotune: false
      bluestore: true
      osd-encrypt: True
    to:
    - '0'
    - '1'
    - '2'
  ntp:
    charm: "cs:focal/ntp"
    annotations:
      gui-x: '678.6017761230469'
      gui-y: '415.27124759750086'
relations:
  - [ "ceph-osd:mon", "ceph-mon:osd" ]
  - [ "ceph-osd:juju-info", "ntp:juju-info" ]
  - [ "ceph-fs:ceph-mds", "ceph-mon:mds" ]

Inside a ceph-mon LXD container, when I issue the ceph -s command it returns the following:

root@juju-0026d2-3-lxd-0:~# ceph -s
  cluster:
    id: 2efa1500-9435-11ec-8f93-6b9f09615464
    health: HEALTH_ERR
            mons are allowing insecure global_id reclaim
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds
            Reduced data availability: 104 pgs inactive
            Degraded data redundancy: 104 pgs undersized

  services:
    mon: 3 daemons, quorum juju-0026d2-4-lxd-0,juju-0026d2-5-lxd-0,juju-0026d2-3-lxd-0 (age 8h)
    mgr: juju-0026d2-3-lxd-0(active, since 8h), standbys: juju-0026d2-5-lxd-0, juju-0026d2-4-lxd-0
    mds: 0/0 daemons up, 1 standby
    osd: 3 osds: 3 up (since 8h), 3 in (since 8h)

  data:
    volumes: 1/1 healthy
    pools: 3 pools, 104 pgs
    objects: 0 objects, 0 B
    usage: 16 MiB used, 900 GiB / 900 GiB avail
    pgs: 100.000% pgs not active
             104 undersized+peered

  progress:
    Global Recovery Event (8h)
      [............................]
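
With every PG stuck undersized+peered, a usual next step (standard Ceph CLI; a sketch of the checks, not output from this cluster) is to ask Ceph why the PGs cannot go active:

# Expand each health warning into per-item detail
ceph health detail
# List the PGs that are stuck inactive and which OSDs they map to
ceph pg dump_stuck inactive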

and if I try to start an MDS daemon manually on the mon node, it fails:

id=0
mkdir /var/lib/ceph/mds/ceph-${id}
sudo ceph auth get-or-create mds.${id} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-${id}/keyring
sudo systemctl start ceph-mds@${id}

root@juju-0026d2-3-lxd-0:~# sudo systemctl status ceph-mds@0
● ceph-mds@0.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; disabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-02-23 07:35:05 UTC; 10min ago
    Process: 47432 ExecStart=/usr/bin/ceph-mds -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
   Main PID: 47432 (code=exited, status=1/FAILURE)

Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Scheduled restart job, restart counter is at 3.
Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: Stopped Ceph metadata server daemon.
Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Start request repeated too quickly.
Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Failed with result 'exit-code'.
Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: Failed to start Ceph metadata server daemon.
Feb 23 07:39:22 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Start request repeated too quickly.
Feb 23 07:39:22 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Failed with result 'exit-code'.
Feb 23 07:39:22 juju-0026d2-3-lxd-0 systemd[1]: Failed to start Ceph metadata server daemon.
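
"Start request repeated too quickly" hides the underlying error; the actual failure reason printed by /usr/bin/ceph-mds can be recovered with standard systemd tooling (a sketch):

# Show the last 50 journal lines for the failed unit, including
# stderr from the ceph-mds process itself
journalctl -u ceph-mds@0 --no-pager -n 50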

Revision history for this message
kashif nawaz (knawaz) wrote :

root@juju-0026d2-3-lxd-0:~# ceph fs ls
name: ceph-fs, metadata pool: ceph-fs_metadata, data pools: [ceph-fs_data ]
root@juju-0026d2-3-lxd-0:~# ceph fs dump

e4
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'ceph-fs' (1)
fs_name ceph-fs
epoch 4
flags 12
created 2022-02-22T23:16:19.325823+0000
modified 2022-02-23T08:20:31.622066+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 0
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in
up {}
failed
damaged
stopped
data_pools [2]
metadata_pool 3
inline_data disabled
balancer
standby_count_wanted 0

Standby daemons:

[mds.ceph-fs-1{-1:4931} state up:standby seq 1 addr [v2:192.168.24.52:6800/1182135415,v1:192.168.24.52:6801/1182135415] compat {c=[1],r=[1],i=[1]}]
dumped fsmap epoch 4
root@juju-0026d2-3-lxd-0:~#
root@juju-0026d2-3-lxd-0:~# ceph fs status
ceph-fs - 0 clients
=======
POOL              TYPE      USED  AVAIL
ceph-fs_metadata  metadata     0   284G
ceph-fs_data      data         0   284G
STANDBY MDS
ceph-fs-1
MDS version: ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)

Revision history for this message
kashif nawaz (knawaz) wrote :

The ceph-mds daemon is up and running on the ceph-fs machine, but when I issue ceph -s from the ceph-mon nodes it still gives me the errors "1 filesystem is offline" and "1 filesystem is online with fewer MDS than max_mds".

ubuntu@ceph-fs-1:~$ sudo su -
root@ceph-fs-1:~# systemctl status ceph-mds@ceph-fs-1.service
● ceph-mds@ceph-fs-1.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-02-22 23:16:33 UTC; 11h ago
   Main PID: 24397 (ceph-mds)
      Tasks: 15
     Memory: 22.6M
     CGroup: /system.slice/system-ceph\x2dmds.slice/ceph-mds@ceph-fs-1.service
             └─24397 /usr/bin/ceph-mds -f --cluster ceph --id ceph-fs-1 --setuser ceph --setgroup ceph

Feb 22 23:16:33 ceph-fs-1 systemd[1]: Started Ceph metadata server daemon.
Feb 22 23:16:33 ceph-fs-1 ceph-mds[24397]: starting mds.ceph-fs-1 at

Revision history for this message
James Page (james-page) wrote :

The issue with your deployment is that none of the placement groups for the underlying pools are active. I can see that you have enabled the feature to customise the failure domain using the physical zone information provided via Juju: do you have at least three zones defined in your underlying MAAS, and are there servers in each zone?
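
A quick way to confirm this from a ceph-mon unit (standard Ceph CLI; a sketch of the checks, not output from this cluster) is to compare the CRUSH hierarchy against the rules the pools use; note the standby MDS cannot go active while the metadata pool's PGs are inactive:

# Show the bucket hierarchy (zones/racks/hosts) and where the 3 OSDs sit
ceph osd tree
# Show each pool's replica count and the CRUSH rule it references
ceph osd pool ls detail
# Dump the rules; with customize-failure-domain enabled the charms are
# expected to place replicas across separate zones/racks
ceph osd crush rule dump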

Changed in charm-ceph-fs:
status: New → Incomplete
Revision history for this message
kashif nawaz (knawaz) wrote :

Hi James; thanks for looking into the issue and sharing your analysis. I did not have zones defined in MAAS, so in this case I should remove "customize-failure-domain" from the bundle file. Do you suggest adding or removing any other parameters? Thanks.

Revision history for this message
James Page (james-page) wrote :

In this case, yes, set that to false rather than true; Ceph will then just use host-based resilience for PG replica placement.
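
For reference, that option could also be flipped on a deployed model with juju config (a sketch; CRUSH rules already created against zone buckets may still need cleaning up, which is why redeploying, as done below, is the simpler route):

juju config ceph-osd customize-failure-domain=false
juju config ceph-mon customize-failure-domain=false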

Changed in charm-ceph-fs:
status: Incomplete → Invalid
Revision history for this message
kashif nawaz (knawaz) wrote :

Thanks a lot James; it's working now. The working bundle file is appended below:

applications:
  ceph-fs:
    charm: ceph-fs
    channel: stable
    revision: 36
    num_units: 1
    to:
    - "4"
  ceph-mon:
    charm: ceph-mon
    channel: stable
    revision: 73
    num_units: 3
    to:
    - lxd:0
    - lxd:1
    - lxd:2
  ceph-osd:
    charm: cs:ceph-osd
    channel: stable
    revision: 316
    num_units: 3
    to:
    - "0"
    - "1"
    - "2"
    options:
      osd-devices: /dev/vdb
machines:
  "0": {}
  "1": {}
  "2": {}
relations:
- - ceph-mon:osd
  - ceph-osd:mon
- - ceph-fs:ceph-mds
  - ceph-mon:mds
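
After the redeploy, health can be reconfirmed from outside the units with the standard Juju and Ceph CLIs (a sketch; with host-based replica placement all three OSDs count, so the PGs should reach active+clean and the MDS should go active):

juju ssh ceph-mon/0 sudo ceph -s
juju ssh ceph-mon/0 sudo ceph fs status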
