Comment 1 for bug 1728742

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Copied from a private bug:

Currently we have no stability in /dev/bcache<n> device names:

  * minor numbers for bcache devices are not guaranteed to stay the same across reboots because there is no guaranteed enumeration order;
  * uevent details for bcache devices do not propagate an underlying disk's serial number;
  * serial numbers of disks are driver-specific device attributes - there is no guarantee that they are exposed at all.
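  The third point can be checked mechanically. The sketch below is illustrative only: the serial_of helper and its fallback string are assumptions, and the sysfs attribute path is driver-dependent, which is exactly the problem being described.

```shell
#!/bin/sh
# Illustrative helper: try to read a driver-specific 'serial' attribute
# for a block device. The attribute may be absent entirely, so callers
# cannot rely on it being there.
serial_of() {
    # $1: sysfs root (normally /sys), $2: block device name, e.g. sda
    attr="$1/block/$2/device/serial"
    if [ -r "$attr" ]; then
        cat "$attr"
    else
        echo "no-serial-attribute"
    fi
}
```

  Parameterizing the sysfs root makes the behaviour easy to exercise against a fake tree; on real hardware one would call `serial_of /sys sda` and often get the fallback rather than a serial.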

  ====

  /dev/disk/by-dname/<device-name> symlinks provided by curtin are not
  reliable because they merely depend on the kernel-provided name, which
  is unstable:

  cat /etc/udev/rules.d/bcache0.rules.rules
  SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVNAME}=="/dev/bcache0", SYMLINK+="disk/by-dname/bcache0"

  dname symlink rules for block devices depend on a partition table UUID - if a device doesn't have a partition table pre-created, the symlink will not be created:

  cat /etc/udev/rules.d/sda.rules.rules
  SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVTYPE}=="disk", ENV{ID_PART_TABLE_UUID}=="5a492040", SYMLINK+="disk/by-dname/sda"

  There is no way in MAAS to pre-create a GUID Partition Table without a
  partition and a file system for a bcache device (no isolated API call
  for partition table creation - only for file systems).

  ====

  Why is this important for bcache usage?

  Raw block devices need to be used by ceph-disk in cases where it needs
  a device without a file system or partition table, namely the ceph
  journal (normally used without a file system) and ceph bluestore (for
  both data and the metadata journal). Bluestore is especially important
  because it was designed to work with a raw block device; using
  bluestore on top of a pre-created file system is an improper usage
  scenario.

  ====

  Ways to mitigate:

  1. Introduce a new udev rule which sets up
  /dev/disk/by-backing/<backing-device-name> symlinks to bcache devices:

  cat /etc/udev/rules.d/bcache-by-backing.rules.rules
  SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVNAME}=="/dev/bcache*", PROGRAM="/lib/udev/bcache-name-helper.sh $kernel", SYMLINK+="disk/by-backing/$result"

  cat /lib/udev/bcache-name-helper.sh
  #!/bin/sh -e
  # The backing device is attached first, so it has the oldest ctime in slaves/.
  logger "Getting the backing device for bcache device $1 by sysfs file creation timestamp"
  ls -c -1t "/sys/block/$1/slaves/" | tail -n1
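  The helper relies on an ordering assumption: the backing device is attached before the cache device, so its entry has the oldest ctime under slaves/. That assumption can be exercised against a fake slaves directory (the temp-directory layout below is a stand-in for sysfs, not real bcache state; the device names are arbitrary):

```shell
#!/bin/sh
# Simulate /sys/block/bcacheN/slaves/ with a temp directory: the backing
# device entry is created first, the cache device entry second.
fake=$(mktemp -d)
mkdir "$fake/slaves"
touch "$fake/slaves/sdc"       # backing device, registered first
sleep 1                        # ensure a distinguishable ctime
touch "$fake/slaves/nvme0n1"   # cache device, registered later
# Same selection logic as the helper: ctime-sorted newest first, oldest last.
backing=$(ls -1tc "$fake/slaves/" | tail -n1)
echo "$backing"                # -> sdc
```

  On a real system the equivalent call would be `ls -1tc /sys/block/bcache0/slaves/ | tail -n1`.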

  tree /dev/disk/by-backing/
  /dev/disk/by-backing/
  ├── sdc -> ../../bcache2
  ├── sdd -> ../../bcache1
  ├── sde -> ../../bcache0
  └── sdf -> ../../bcache3

  lsblk
  NAME      MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
  sdf         8:80   0  64G  0 disk
  └─bcache3 252:48   0  64G  0 disk
  sdd         8:48   0  64G  0 disk
  └─bcache1 252:16   0  64G  0 disk
  sdb         8:16   0  64G  0 disk
  ├─bcache0 252:0    0  64G  0 disk
  ├─bcache3 252:48   0  64G  0 disk
  ├─bcache1 252:16   0  64G  0 disk
  └─bcache2 252:32   0  64G  0 disk
  sde         8:64   0  64G  0 disk
  └─bcache0 252:0    0  64G  0 disk
  sdc         8:32   0  64G  0 disk
  └─bcache2 252:32   0  64G  0 disk
  sda         8:0    0  64G  0 disk
  └─sda1      8:1    0  64G  0 part /

  2. Modify the Linux kernel source code to include a way to identify a
  particular bcache device (bdev UUID) and pass it in the uevent
  environment so that a udev rule in userspace can handle it, or pass an
  underlying device's serial number to a udev rule.
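  To make option 2 concrete, the rule below is a hypothetical sketch of the userspace side: CACHED_UUID does not exist in current kernels and stands for the bdev UUID the proposed kernel change would export in the uevent environment.

```
# Hypothetical /etc/udev/rules.d/bcache-by-uuid.rules - assumes a kernel
# that exports CACHED_UUID (the bdev UUID) in the bcache uevent.
SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVNAME}=="/dev/bcache*", ENV{CACHED_UUID}=="?*", SYMLINK+="disk/by-cached-uuid/$env{CACHED_UUID}"
```

  Today that UUID is only reachable from the backing device's superblock (e.g. via bcache-super-show from bcache-tools), which is not something a udev rule can cheaply invoke on every event.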

  ====

  Problems with the approaches above, respectively:

  1. Doesn't work well with Juju storage because tags are assigned to
  bcache device names visible in MAAS;

  2. Upstream kernel modifications take time and resources.

  ====

  Other:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1705493

  The lack of partitioning support for bcache devices in xenial kernels
  (4.4) leaves us unable to use ceph-disk to partition a block device.

  This will not be a problem in 18.04 or in 4.10 HWE kernels.

  ====

  Bottom line

  Right now the only way to use Ceph OSDs with bcache devices (filestore
  or bluestore) on xenial GA kernel 4.4 is to use the following
  approach:

  * pre-create a file system in MAAS on a bcache device (example bucketsconfig.yaml excerpt: https://paste.ubuntu.com/25787262/)
  * use ceph-disk in directory mode by passing a mount point of that file system to the ceph-osd charm