sysctl values applied by autotune are bad
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | In Progress | High | Brett Milford |
OpenStack Charm Guide | Fix Released | Medium | Peter Matulis |
Bug Description
This is in some sense an extension of bug 1770171, which dealt with removing non-sane sysctl settings from the "sysctl" config option.
The ceph-osd charm has an autotune config option which, if set to True, will:
"attempt to tune your network card sysctls and hard drive settings.
This changes hard drive read ahead settings and max_sectors_kb.
For the network card this will detect the link speed and make
appropriate sysctl changes. Enabling this option should generally
be safe."
This essentially translates (for networking) to the following sysctls being set [1]:
'net.core.
'net.core.
'net.core.
'net.core.
'net.core.
'net.core.
'net.ipv4.
'net.ipv4.
'net.ipv4.tcp_mem': '10000000 10000000 10000000'
[1] https:/
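For context, here is a minimal sketch of how settings like these end up applied on a host. The use of subprocess and the helper name are assumptions for illustration and not the actual charms.ceph implementation; the dict only contains the keys/values named in this bug.

```python
# Hedged sketch: roughly how network sysctls of this kind get applied.
# Only the keys/values named in this bug are listed; the helper name and
# the use of `sysctl -w` are illustrative assumptions, not charms.ceph code.
import subprocess

AUTOTUNE_SYSCTLS = {
    'net.core.netdev_max_backlog': '300000',
    'net.ipv4.tcp_mem': '10000000 10000000 10000000',
    # ... plus the truncated net.core.* / net.ipv4.tcp_*mem keys listed above
}

def apply_sysctls(settings):
    """Apply each key=value with `sysctl -w` (requires root)."""
    for key, value in settings.items():
        subprocess.check_call(['sysctl', '-w', f'{key}={value}'])

if __name__ == '__main__':
    apply_sysctls(AUTOTUNE_SYSCTLS)
```

The point being: nothing in this path moderates the constants against the host's actual memory size or traffic pattern; the same values land on every machine.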
While on the surface these settings seem like they might have the effect of ensuring, e.g., that packets never get dropped, setting netdev_max_backlog to 300000 is high enough that things can time out while packets sit in the enormous backlog queues this value allows to grow. If the system can't keep up with the incoming packet rate, this setting just adds a lot of latency to the mix without making things better. This sysctl is really a "surge" capacity, and it is doubtful that surges of 300,000 packets are commonplace in Ceph deployments. Also, setting 'net.ipv4.tcp_*mem' to the same high value for min/default/max is likely to lead to problems on heavily loaded systems:
comment from @jvosburgh:
"[this] is a limit for TCP as a whole (not a per-socket limit), and is measured in units of pages, not bytes as with tcp_rmem/wmem. Setting all of these to the same value will cause TCP to go from 'all fine here' immediately to 'out of memory' without any attempt to moderate its memory use before simply failing memory requests for socket buffer data. That's likely not what was intended. The value chosen, 10 million pages, which, at 4K per page is 38GB, is probably absurdly too high."
So either we need to completely remove the autotune feature or we need to find saner values for these "optimisations", but it feels unlikely that we will find a magic one-size-fits-all set, and maybe we should just let users manipulate their sysctls via the sysctl config option.
NOTE: removing these settings from the charm will ensure that they are never set for future deployments, but existing deployments will need to manually restore these settings to their prior values, since the charm will no longer have any knowledge of them.
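For operators on existing deployments, a minimal sketch of capturing the current values of the affected keys before reverting them by hand. The key list mirrors the ones named in this bug and is not guaranteed to be exhaustive; reading /proc/sys directly is just one way to do it, and the distro defaults (particularly for tcp_mem) are computed at boot rather than fixed, so comparing against an untouched host is the safest reference.

```python
# Hedged helper sketch: dump the sysctls that autotune may have changed so
# an operator can record them before reverting. The key list is taken from
# this bug report; it is not guaranteed to be exhaustive.
from pathlib import Path

AFFECTED_KEYS = [
    'net.core.netdev_max_backlog',
    'net.ipv4.tcp_mem',
    # ... add the remaining net.core.* / net.ipv4.tcp_*mem keys as needed
]

def read_sysctl(key):
    """Read a sysctl value via its /proc/sys path."""
    return Path('/proc/sys/' + key.replace('.', '/')).read_text().strip()

for key in AFFECTED_KEYS:
    print(f'{key} = {read_sysctl(key)}')
```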
Changed in charm-ceph-osd:
milestone: 18.11 → 19.04
Changed in charm-ceph-osd:
status: New → Triaged
Changed in charm-ceph-osd:
milestone: 19.04 → 19.07
Changed in charm-ceph-osd:
milestone: 19.07 → 19.10
Changed in charm-ceph-osd:
importance: Medium → Low
importance: Low → High
Changed in charm-ceph-osd:
milestone: 19.10 → 20.01
Changed in charm-ceph-osd:
milestone: 20.01 → 20.05
Changed in charm-ceph-osd:
milestone: 20.05 → 20.08
Changed in charm-ceph-osd:
milestone: 20.08 → none
Changed in charm-ceph-osd:
assignee: nobody → Brett Milford (brettmilford)
Changed in charm-guide:
assignee: nobody → Peter Matulis (petermatulis)
status: New → In Progress
importance: Undecided → Medium
Given the marginal usefulness in operations, the potential for highly undesirable impact, and the fact that it is difficult, if not impossible, to autotune with a one-size-fits-all approach, I would be in favor of deprecating this charm config option.
In the meantime, we should adjust (and backport) the config.yaml descriptions to reference this bug and to instill elevated caution in users who might consider turning it on.