We are using the following:
- CephFSNative driver
- Ussuri Manila with Nautilus client libraries
- Ceph Octopus 15.2.8 server
What I see is that the manila-share service gets blacklisted by the Ceph MDS on startup, which renders it inoperable, e.g. for share creation.
Here's an example timeline:
The Manila share driver starts and, as part of startup, evicts its previous client sessions:
2021-02-03 15:15:17.186 20 DEBUG ceph_volume_client [-] mds_command: 7138235, ['session', 'evict', 'auth_name=...'] _evict /usr/lib/python3.6/site-packages/ceph_volume_client.py:166
2021-02-03 15:15:18.243 20 DEBUG ceph_volume_client [-] mds_command: complete 0 _evict /usr/lib/python3.6/site-packages/ceph_volume_client.py:174
2021-02-03 15:15:18.244 20 INFO ceph_volume_client [req-2481ee83-1105-4cef-8e55-a9cc340219b3 - - - - -] evict: joined all
2021-02-03 15:15:18.244 20 DEBUG ceph_volume_client [req-2481ee83-1105-4cef-8e55-a9cc340219b3 - - - - -] Premount eviction of manila completes _connect /usr/lib/python3.6/site-packages/ceph_volume_client.py:491
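For reference, the client sessions that this eviction targets can be listed on the MDS with something like the following (the MDS name below is a placeholder for our anonymized one):

# ceph tell mds.<mds-name> session ls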
The Ceph MDS performs the eviction, but also blacklists the client:
2021-02-03T15:15:52.636+0000 7f9203f49700 1 mds.xxx asok_command: session evict {filters=[auth_name=...],prefix=session evict} (starting...)
2021-02-03T15:15:52.636+0000 7f9203f49700 1 mds.0.287 Evicting (and blacklisting) client session 7207716 (p.q.r.s:0/2920427123)
2021-02-03T15:15:52.636+0000 7f9203f49700 0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 7207716 (p.q.r.s:0/2920427123)
2021-02-03T15:15:53.476+0000 7f9204f4b700 0 --2- [v2:a.b.c.d:6800/1138805783,v1:a.b.c.d:6801/1138805783] >> p.q.r.s:0/2920427123 conn(0x555791941800 0x5557917a6800 crc :-1 s=SESSION_ACCEPTING pgs=6 cs=0 l=0 rev1=1 rx=0 tx=0).handle_reconnect no existing connection exists, reseting client
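The same eviction (and, with the default mds_session_blacklist_on_evict=true, presumably the same blacklisting) can be reproduced by hand using the auth_name filter seen in the MDS log above, e.g.:

# ceph tell mds.<mds-name> session evict auth_name=<manila-client-id>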
The manila-share Ceph client log also records the blacklisting:
2021-02-03 15:15:53.475 7f641cff9700 0 client.7207716 ms_handle_remote_reset on v2:a.b.c.d:6800/1138805783
2021-02-03 15:15:53.476 7f641cff9700 -1 client.7207716 I was blacklisted at osd epoch 12942
A subsequent share creation attempt then fails in manila-share:
2021-02-03 15:17:56.314 20 DEBUG manila.share.drivers.cephfs.driver [req-495c5115-5a0a-4465-bc6d-0fdb1caaac92 b2d76137b21489d3fbe0125f36cd8a92ddca0018ef8f265c8f3f9fdc6efcb191 25f96bca327c4136ab28f251203d71a3 - - -] create_share xxxx name=081a69e2-e80b-454c-880b-789ca6f70851 size=10 share_group_id=None create_share /usr/lib/python3.6/site-packages/manila/share/drivers/cephfs/driver.py:262
2021-02-03 15:17:56.324 20 INFO ceph_volume_client [req-495c5115-5a0a-4465-bc6d-0fdb1caaac92 b2d76137b21489d3fbe0125f36cd8a92ddca0018ef8f265c8f3f9fdc6efcb191 25f96bca327c4136ab28f251203d71a3 - - -] create_volume: /volumes/_nogroup/081a69e2-e80b-454c-880b-789ca6f70851
2021-02-03 15:17:56.324 20 ERROR manila.share.manager [req-495c5115-5a0a-4465-bc6d-0fdb1caaac92 b2d76137b21489d3fbe0125f36cd8a92ddca0018ef8f265c8f3f9fdc6efcb191 25f96bca327c4136ab28f251203d71a3 - - -] Share instance 081a69e2-e80b-454c-880b-789ca6f70851 failed on creation.: cephfs.OSError: error in stat: /volumes/_nogroup/081a69e2-e80b-454c-880b-789ca6f70851: Cannot send after transport endpoint shutdown [Errno 108]
2021-02-03 15:17:56.325 20 WARNING manila.share.manager [req-495c5115-5a0a-4465-bc6d-0fdb1caaac92 b2d76137b21489d3fbe0125f36cd8a92ddca0018ef8f265c8f3f9fdc6efcb191 25f96bca327c4136ab28f251203d71a3 - - -] Share instance information in exception can not be written to db because it contains {} and it is not a dictionary.: cephfs.OSError: error in stat: /volumes/_nogroup/081a69e2-e80b-454c-880b-789ca6f70851: Cannot send after transport endpoint shutdown [Errno 108]
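Any share creation attempt while the client is blacklisted fails the same way, e.g. something like the following (the share type name is a placeholder for our deployment-specific one):

# manila create CEPHFS 10 --name test-share --share-type <cephfs-share-type>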
The manila-share servers show up in the Ceph OSD blacklist:
# ceph osd blacklist ls
p.q.r.s:0/2920427123 2021-02-03T16:15:52.637798+0000
[...]
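The individual entries can also be cleared by hand, e.g. with the command below, although presumably the manila-share service still needs a restart afterwards to re-establish its CephFS mount:

# ceph osd blacklist rm p.q.r.s:0/2920427123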
To work around this issue, we need to set a non-default option in our Ceph cluster:
# ceph config set global mds_session_blacklist_on_evict false
(as described in https://docs.ceph.com/en/octopus/cephfs/eviction/#advanced-configuring-blacklisting)
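To double-check that the override is in place, something like:

# ceph config dump | grep mds_session_blacklist_on_evict

should now show the option set to false at the global level.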
With that set, things appear to be working again. Is there a way to handle this without requiring configuration changes to the Ceph cluster, or is it something that will need to be added to the documentation for the CephFSNative driver?