Failed to start Ceph metadata server daemon

Bug #1961904 reported by kashif nawaz
This bug affects 1 person
Affects: OpenStack Ceph-FS Charm
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have deployed a Ceph cluster via the charms bundle file below:

series: focal
variables:
  oam-space: &oam-space oam-space
  customize-failure-domain: &customize-failure-domain True
machines:
  "0":
    constraints: tags=ceph-node-1
    series: focal
  "1":
    constraints: tags=ceph-node-2
    series: focal
  "2":
    constraints: tags=ceph-node-3
    series: focal
  "3":
    constraints: tags=master
    series: focal
  "4":
    constraints: tags=worker1
    series: focal
  "5":
    constraints: tags=worker2
    series: focal
  "6":
    constraints: tags=ceph-fs-1
    series: focal

applications:
  ceph-fs:
    charm: ceph-fs
    channel: stable
    revision: 36
    num_units: 1
    to:
    - "6"
    bindings:
      "": *oam-space
      ceph-mds: *oam-space
      certificates: *oam-space
      public: *oam-space
  ceph-mon:
    charm: cs:ceph-mon
    num_units: 3
    bindings:
      "": *oam-space
      public: *oam-space
      osd: *oam-space
    options:
      monitor-count: 3
      expected-osd-count: 3
      customize-failure-domain: *customize-failure-domain
      source: cloud:focal-wallaby
    to:
    - lxd:3
    - lxd:4
    - lxd:5
  ceph-osd:
    charm: cs:ceph-osd
    num_units: 3
    bindings:
      "": *oam-space
      public: *oam-space
      cluster: *oam-space
    options:
      osd-devices: /dev/vdb
      source: cloud:focal-wallaby
      aa-profile-mode: complain
      customize-failure-domain: *customize-failure-domain
      autotune: false
      bluestore: true
      osd-encrypt: True
    to:
    - '0'
    - '1'
    - '2'
  ntp:
    charm: "cs:focal/ntp"
    annotations:
      gui-x: '678.6017761230469'
      gui-y: '415.27124759750086'
relations:
  - [ "ceph-osd:mon", "ceph-mon:osd" ]
  - [ "ceph-osd:juju-info", "ntp:juju-info" ]
  - [ "ceph-fs:ceph-mds", "ceph-mon:mds" ]

Inside a ceph-mon LXD container, when I issue the ceph -s command it returns the following:

root@juju-0026d2-3-lxd-0:~# ceph -s
  cluster:
    id: 2efa1500-9435-11ec-8f93-6b9f09615464
    health: HEALTH_ERR
            mons are allowing insecure global_id reclaim
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds
            Reduced data availability: 104 pgs inactive
            Degraded data redundancy: 104 pgs undersized

  services:
    mon: 3 daemons, quorum juju-0026d2-4-lxd-0,juju-0026d2-5-lxd-0,juju-0026d2-3-lxd-0 (age 8h)
    mgr: juju-0026d2-3-lxd-0(active, since 8h), standbys: juju-0026d2-5-lxd-0, juju-0026d2-4-lxd-0
    mds: 0/0 daemons up, 1 standby
    osd: 3 osds: 3 up (since 8h), 3 in (since 8h)

  data:
    volumes: 1/1 healthy
    pools: 3 pools, 104 pgs
    objects: 0 objects, 0 B
    usage: 16 MiB used, 900 GiB / 900 GiB avail
    pgs: 100.000% pgs not active
             104 undersized+peered

  progress:
    Global Recovery Event (8h)
      [............................]
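
With every PG stuck undersized+peered, a usual next step (standard Ceph CLI; a sketch of the checks, not output from this cluster) is to ask Ceph why the PGs cannot go active:

# Expand each health warning into per-item detail
ceph health detail
# List the PGs that are stuck inactive and which OSDs they map to
ceph pg dump_stuck inactive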

and if I try to start an MDS daemon manually on the mon node, it fails:

id=0
mkdir /var/lib/ceph/mds/ceph-${id}
sudo ceph auth get-or-create mds.${id} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-${id}/keyring
sudo systemctl start ceph-mds@${id}

root@juju-0026d2-3-lxd-0:~# sudo systemctl status ceph-mds@0
● ceph-mds@0.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; disabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-02-23 07:35:05 UTC; 10min ago
    Process: 47432 ExecStart=/usr/bin/ceph-mds -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
   Main PID: 47432 (code=exited, status=1/FAILURE)

Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Scheduled restart job, restart counter is at 3.
Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: Stopped Ceph metadata server daemon.
Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Start request repeated too quickly.
Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Failed with result 'exit-code'.
Feb 23 07:35:05 juju-0026d2-3-lxd-0 systemd[1]: Failed to start Ceph metadata server daemon.
Feb 23 07:39:22 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Start request repeated too quickly.
Feb 23 07:39:22 juju-0026d2-3-lxd-0 systemd[1]: ceph-mds@0.service: Failed with result 'exit-code'.
Feb 23 07:39:22 juju-0026d2-3-lxd-0 systemd[1]: Failed to start Ceph metadata server daemon.
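
"Start request repeated too quickly" hides the underlying error; the actual failure reason printed by /usr/bin/ceph-mds can be recovered with standard systemd tooling (a sketch):

# Show the last 50 journal lines for the failed unit, including
# stderr from the ceph-mds process itself
journalctl -u ceph-mds@0 --no-pager -n 50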

Revision history for this message
kashif nawaz (knawaz) wrote :

root@juju-0026d2-3-lxd-0:~# ceph fs ls
name: ceph-fs, metadata pool: ceph-fs_metadata, data pools: [ceph-fs_data ]
root@juju-0026d2-3-lxd-0:~# ceph fs dump

e4
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'ceph-fs' (1)
fs_name ceph-fs
epoch 4
flags 12
created 2022-02-22T23:16:19.325823+0000
modified 2022-02-23T08:20:31.622066+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 0
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in
up {}
failed
damaged
stopped
data_pools [2]
metadata_pool 3
inline_data disabled
balancer
standby_count_wanted 0

Standby daemons:

[mds.ceph-fs-1{-1:4931} state up:standby seq 1 addr [v2:192.168.24.52:6800/1182135415,v1:192.168.24.52:6801/1182135415] compat {c=[1],r=[1],i=[1]}]
dumped fsmap epoch 4
root@juju-0026d2-3-lxd-0:~#
root@juju-0026d2-3-lxd-0:~# ceph fs status
ceph-fs - 0 clients
=======
POOL              TYPE      USED  AVAIL
ceph-fs_metadata  metadata     0   284G
ceph-fs_data      data         0   284G
STANDBY MDS
ceph-fs-1
MDS version: ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)

Revision history for this message
kashif nawaz (knawaz) wrote :

The ceph-mds daemon is up and running on the ceph-fs machine, but when I issue ceph -s from the ceph-mon nodes it still gives me the errors "1 filesystem is offline" and "1 filesystem is online with fewer MDS than max_mds".

ubuntu@ceph-fs-1:~$ sudo su -
root@ceph-fs-1:~# systemctl status ceph-mds@ceph-fs-1.service
● ceph-mds@ceph-fs-1.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-02-22 23:16:33 UTC; 11h ago
   Main PID: 24397 (ceph-mds)
      Tasks: 15
     Memory: 22.6M
     CGroup: /system.slice/system-ceph\x2dmds.slice/ceph-mds@ceph-fs-1.service
             └─24397 /usr/bin/ceph-mds -f --cluster ceph --id ceph-fs-1 --setuser ceph --setgroup ceph

Feb 22 23:16:33 ceph-fs-1 systemd[1]: Started Ceph metadata server daemon.
Feb 22 23:16:33 ceph-fs-1 ceph-mds[24397]: starting mds.ceph-fs-1 at

Revision history for this message
James Page (james-page) wrote :

The issue with your deployment is that none of the placement groups for the underlying pools are active. I can see that you have enabled the feature to customise the failure domain using the physical zone information provided via Juju: do you have at least three zones defined in your underlying MAAS, and are there servers in each zone?
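
A quick way to confirm this from a ceph-mon unit (standard Ceph CLI; a sketch of the checks, not output from this cluster) is to compare the CRUSH hierarchy against the rules the pools use; note the standby MDS cannot go active while the metadata pool's PGs are inactive:

# Show the bucket hierarchy (zones/racks/hosts) and where the 3 OSDs sit
ceph osd tree
# Show each pool's replica count and the CRUSH rule it references
ceph osd pool ls detail
# Dump the rules; with customize-failure-domain enabled the charms are
# expected to place replicas across separate zones/racks
ceph osd crush rule dump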

Changed in charm-ceph-fs:
status: New → Incomplete
Revision history for this message
kashif nawaz (knawaz) wrote :

Hi James; thanks for looking into the issue and sharing your analysis. I did not have zones defined in MAAS, so in this case I should remove "customize-failure-domain" from the bundle file. Do you suggest adding or removing any other parameters? Thanks.

Revision history for this message
James Page (james-page) wrote :

In this case, yes, set that to false rather than true; Ceph will then just use host-based resilience for PG replica placement.
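
For reference, that option could also be flipped on a deployed model with juju config (a sketch; CRUSH rules already created against zone buckets may still need cleaning up, which is why redeploying, as done below, is the simpler route):

juju config ceph-osd customize-failure-domain=false
juju config ceph-mon customize-failure-domain=false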

Changed in charm-ceph-fs:
status: Incomplete → Invalid
Revision history for this message
kashif nawaz (knawaz) wrote :

Thanks a lot James; it's working now. The working bundle file is appended below:

applications:
  ceph-fs:
    charm: ceph-fs
    channel: stable
    revision: 36
    num_units: 1
    to:
    - "4"
  ceph-mon:
    charm: ceph-mon
    channel: stable
    revision: 73
    num_units: 3
    to:
    - lxd:0
    - lxd:1
    - lxd:2
  ceph-osd:
    charm: cs:ceph-osd
    channel: stable
    revision: 316
    num_units: 3
    to:
    - "0"
    - "1"
    - "2"
    options:
      osd-devices: /dev/vdb
machines:
  "0": {}
  "1": {}
  "2": {}
relations:
- - ceph-mon:osd
  - ceph-osd:mon
- - ceph-fs:ceph-mds
  - ceph-mon:mds
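
After the redeploy, health can be reconfirmed from outside the units with the standard Juju and Ceph CLIs (a sketch; with host-based replica placement all three OSDs count, so the PGs should reach active+clean and the MDS should go active):

juju ssh ceph-mon/0 sudo ceph -s
juju ssh ceph-mon/0 sudo ceph fs status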
