[SRU] the leak in bluestore_cache_other mempool

Bug #1996010 reported by dongdong tao
This bug affects 2 people
Affects               Status         Importance  Assigned to  Milestone
Ubuntu Cloud Archive  New            Undecided   Unassigned
  Ussuri              New            Undecided   Unassigned
  Wallaby             Fix Released   Undecided   Unassigned
  Xena                Fix Released   Undecided   Unassigned
  Yoga                Fix Released   Undecided   Unassigned
ceph (Ubuntu)         Fix Released   Undecided   Unassigned
  Focal               Fix Committed  Undecided   Unassigned
  Jammy               Fix Released   Undecided   Unassigned
  Kinetic             Fix Released   Undecided   Unassigned
  Lunar               Fix Released   Undecided   Unassigned

Bug Description

[Impact]

This issue has been observed in Ceph Octopus since 15.2.16.
BlueStore's onode cache can be effectively disabled by an entry leak in the bluestore_cache_other mempool.

The log below shows that the cache's maximum size had dropped to 0:
------
2022-10-25T00:47:26.562+0000 7f424f78e700 30 bluestore.MempoolThread(0x564a9dae2a68) _resize_shards max_shard_onodes: 0 max_shard_buffer: 8388608
-------
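
To spot this condition across many OSD logs, the line above can be parsed mechanically. The following is a minimal sketch, assuming the debug line keeps exactly the format quoted above; the function name parse_resize_shards is hypothetical and not part of Ceph.

```python
import re

# Matches the _resize_shards debug line quoted above. The field names are
# taken from that line; everything else here is illustrative.
RESIZE_RE = re.compile(
    r"_resize_shards max_shard_onodes: (\d+) max_shard_buffer: (\d+)"
)


def parse_resize_shards(line):
    """Return (max_shard_onodes, max_shard_buffer), or None if no match."""
    match = RESIZE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "2022-10-25T00:47:26.562+0000 7f424f78e700 30 "
    "bluestore.MempoolThread(0x564a9dae2a68) _resize_shards "
    "max_shard_onodes: 0 max_shard_buffer: 8388608"
)

result = parse_resize_shards(sample)
if result is not None and result[0] == 0:
    print("onode cache is effectively disabled on this OSD")
```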

The dump_mempools output shows that bluestore_cache_other had consumed the vast majority of the cache because of the leak, while only 3 onodes (2 of them pinned) remained in the cache:
---------------
"bluestore_cache_onode": {
    "items": 3,
    "bytes": 1848
},
"bluestore_cache_meta": {
    "items": 13973,
    "bytes": 111338
},
"bluestore_cache_other": {
    "items": 5601156,
    "bytes": 224152996
},
"bluestore_Buffer": {
    "items": 1,
    "bytes": 96
},
"bluestore_Extent": {
    "items": 20,
    "bytes": 960
},
"bluestore_Blob": {
    "items": 8,
    "bytes": 832
},
"bluestore_SharedBlob": {
    "items": 8,
    "bytes": 896
},
--------------
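
The imbalance in the dump above can be quantified with a few lines of arithmetic. This is a rough sketch, assuming the three cache pools have already been extracted into a dictionary (the excerpt does not show where they sit in the full JSON output). The helper name mempool_summary and the "bytes per onode" figure are illustrative averages only, not Ceph's internal accounting.

```python
def mempool_summary(pools):
    """Summarise how much of the metadata cache each onode appears to cost.

    `pools` maps pool name -> {"items": int, "bytes": int}, as in the
    dump_mempools excerpt above.
    """
    onode = pools["bluestore_cache_onode"]
    meta = pools["bluestore_cache_meta"]
    other = pools["bluestore_cache_other"]
    total_bytes = onode["bytes"] + meta["bytes"] + other["bytes"]
    return {
        "onodes": onode["items"],
        # Fraction of the three pools taken by bluestore_cache_other.
        "other_share": other["bytes"] / total_bytes,
        # Plain average, guarding against division by zero.
        "bytes_per_onode": total_bytes // max(onode["items"], 1),
    }


# Values copied from the dump above.
pools = {
    "bluestore_cache_onode": {"items": 3, "bytes": 1848},
    "bluestore_cache_meta": {"items": 13973, "bytes": 111338},
    "bluestore_cache_other": {"items": 5601156, "bytes": 224152996},
}

summary = mempool_summary(pools)
print(summary)
```

With these numbers, bluestore_cache_other accounts for more than 99.9% of the three pools, and the average works out to roughly 75 MB per cached onode, which is far beyond what a healthy onode should cost.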

This can cause I/O to experience high latency, because a 0-sized cache significantly increases the need to fetch metadata from RocksDB or even from disk.
Another impact is that it significantly increases the chance of hitting the race condition in Onode::put [2], which crashes the OSDs, especially in large clusters.

[Test Case]

1. Deploy a 15.2.16 ceph cluster

2. Create enough rbd images to spread all over the OSDs

3. Stress them in parallel with a fio 4k randwrite workload until each OSD has enough onodes in its cache (with more than 60k onodes, bluestore_cache_other will exceed 1 GB):

   fio --name=randwrite --rw=randwrite --ioengine=rbd --bs=4k --direct=1 --numjobs=1 --size=100G --iodepth=16 --clientname=admin --pool=bench --rbdname=test

4. Shrink pg_num to a very low number so that there is roughly 1 PG per OSD, and wait for the shrink to finish.

5. Enable debug_bluestore=20/20. A 0-sized onode cache can then be observed by grepping the OSD log for max_shard_onodes, and the leaked bluestore_cache_other mempool can be observed via "ceph daemon osd.id dump_mempools".
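
For step 4, a target pg_num can be estimated with simple arithmetic. This is a sketch under stated assumptions: a single replicated pool, each PG placed on `replica_size` OSDs, and pg_num kept at a power of two. Both function names are hypothetical, and the example cluster size is made up for illustration.

```python
def pgs_per_osd(pg_num, replica_size, num_osds):
    """Average number of PG replicas hosted by each OSD."""
    return pg_num * replica_size / num_osds


def target_pg_num(replica_size, num_osds, max_per_osd=1.0):
    """Largest power-of-two pg_num that keeps PGs per OSD <= max_per_osd."""
    pg_num = 1
    while pgs_per_osd(pg_num * 2, replica_size, num_osds) <= max_per_osd:
        pg_num *= 2
    return pg_num


# Hypothetical cluster: 48 OSDs, 3 replicas.
chosen = target_pg_num(replica_size=3, num_osds=48)
print(chosen, pgs_per_osd(chosen, 3, 48))
```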

[Potential Regression]
The patch corrects an evidently wrong AU calculation for the bluestore_cache_other pool; it is not expected to introduce any regression.

[Other Info]
The patch [1] has been backported to upstream Pacific and Quincy, but not to Octopus.
Pacific will get it in 16.2.11, which is still pending.
Quincy already has it in 17.2.4.

We'll need to backport this fix to Octopus.
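
The release information above can be turned into a small check. This sketch covers only the upstream Pacific and Quincy versions named in this report. It expects a plain upstream "x.y.z" string, returns False for any series not listed here (including newer ones it knows nothing about), and does not detect distro packages that carry the fix as a separate patch. The names FIXED_IN and upstream_has_fix are hypothetical.

```python
# First upstream release per major series that contains the fix,
# as stated in this report.
FIXED_IN = {
    16: (16, 2, 11),  # Pacific
    17: (17, 2, 4),   # Quincy
}


def parse_version(version):
    """Turn 'x.y.z' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split(".")[:3])


def upstream_has_fix(version):
    """True if this upstream release is known (from this report) to be fixed."""
    parsed = parse_version(version)
    fixed = FIXED_IN.get(parsed[0])
    return fixed is not None and parsed >= fixed


for candidate in ("15.2.17", "16.2.10", "16.2.11", "17.2.5"):
    print(candidate, upstream_has_fix(candidate))
```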

[1] https://github.com/ceph/ceph/pull/46911

[2] https://tracker.ceph.com/issues/56382

dongdong tao (taodd)
description: updated
summary: - the leak in bluestore_cache_other mempool
+ [SRU] the leak in bluestore_cache_other mempool
tags: added: sts-sru-needed
tags: added: seg
Revision history for this message
dongdong tao (taodd) wrote :

This is the debdiff based on focal proposed 15.2.17

Revision history for this message
dongdong tao (taodd) wrote :
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "focal-15.2.17-debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Revision history for this message
dongdong tao (taodd) wrote :

pacific debdiff uploaded based on focal-xena

affects: cloud-archive → xena
dongdong tao (taodd)
affects: xena → cloud-archive
no longer affects: cloud-archive/victoria
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu Focal):
status: New → Confirmed
Changed in ceph (Ubuntu):
status: New → Confirmed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

It looks like only Quincy has a point release in flight [1] that will pick up this fix. We should discuss whether it's better to pick up a Pacific point release for this rather than a specific patch, but we should prioritize this fix after the referenced point release.

1: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1998958

Revision history for this message
Robie Basak (racb) wrote :

This patch is in the general sponsorship queue, but it was added automatically and it's not clear to me if this is ready/wanted by the openstack team. Please could you clarify?

Changed in ceph (Ubuntu Lunar):
status: Confirmed → Fix Released
Changed in ceph (Ubuntu Kinetic):
status: New → Fix Released
Changed in ceph (Ubuntu Jammy):
status: New → Fix Released
Revision history for this message
Ponnuvel Palaniyappan (pponnuvel) wrote :

Newer point releases (both Ubuntu and Ubuntu Cloud Archive) have got the fixes for Pacific & Quincy:
Wallaby: 16.2.11-0ubuntu0.21.04.1~cloud0
xena: 16.2.11-0ubuntu0.21.10.1~cloud0
yoga: 17.2.5-0ubuntu0.22.04.3~cloud0
jammy: 17.2.5-0ubuntu0.22.04.3
kinetic: 17.2.5-0ubuntu0.22.10.3
lunar: 17.2.5-0ubuntu2
mantic: 17.2.6-0ubuntu1

SRU needed for Ussuri and Focal.

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

As Robie mentioned in comment #8, it is not clear to me if this SRU to Focal will be handled by the OpenStack team or if you want help to get this landed. Could you please clarify that? In case the OpenStack team is going to handle this, please unsubscribe ~ubuntu-sponsors.

I just took a quick look and your debdiff in comment #1 is outdated; you need to rebase your changes against the latest version in focal-updates, which is 15.2.17-0ubuntu0.20.04.4.

Revision history for this message
dongdong tao (taodd) wrote :

New debdiff file attached

dongdong tao (taodd)
tags: removed: patch
Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Still waiting on the information request on comments #8 and #10.

Revision history for this message
dongdong tao (taodd) wrote :

I've removed the "patch" tag. I believe it should be handled by the OpenStack team as usual.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Unsubscribing sponsors too.

Revision history for this message
James Page (james-page) wrote :

Upload made to UNAPPROVED queue for SRU team review in focal.

Changed in ceph (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Please test proposed package

Hello dongdong, or anyone else affected,

Accepted ceph into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/15.2.17-0ubuntu0.20.04.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in ceph (Ubuntu Focal):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-focal