Copied from a private bug:
Currently we have no stability in /dev/bcache<n> device names:
* minor numbers for bcache devices are not guaranteed to stay the same across reboots because there is no guaranteed enumeration;
* uevent details for bcache devices do not propagate an underlying disk's serial number;
* serial numbers of disks are driver-specific device attributes - there is no guarantee that they are exposed.
====
/dev/disk/by-dname/<device-name> symlinks provided by curtin are not
reliable as they merely depend on kernel-provided name which is
unstable:
cat /etc/udev/rules.d/bcache0.rules
SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVNAME}=="/dev/bcache0", SYMLINK+="disk/by-dname/bcache0"
dname symlink rules for block devices depend on a partition table UUID - if a device doesn't have any partition (and hence no partition table) pre-created, a symlink will not be created:
cat /etc/udev/rules.d/sda.rules
SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVTYPE}=="disk", ENV{ID_PART_TABLE_UUID}=="5a492040", SYMLINK+="disk/by-dname/sda"
There is no way in MAAS to pre-create a GUID Partition Table without a
partition and a file system for a bcache device (no isolated API call
for partition table creation - only for file systems).
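
Whether a dname rule of the kind shown above can match a given disk may be checked by looking for ID_PART_TABLE_UUID among its udev properties. A minimal sketch, assuming the KEY=VALUE property format printed by `udevadm info --query=property`; the function name part_table_uuid is hypothetical:

```shell
#!/bin/sh
# part_table_uuid (hypothetical name): read udev properties (KEY=VALUE lines)
# from stdin and print the value of ID_PART_TABLE_UUID.
# Returns non-zero when the property is absent or empty.
part_table_uuid() {
    uuid=$(sed -n 's/^ID_PART_TABLE_UUID=//p' | head -n1)
    [ -n "$uuid" ] && printf '%s\n' "$uuid"
}
```

For example: udevadm info --query=property --name=/dev/sda | part_table_uuid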
====
Why is this important for bcache usage?
Raw block devices need to be used by ceph-disk in cases where it needs
a device without a file system or partition table, namely, ceph
journal (used without a file system normally) and ceph bluestore (for
both data and metadata journal). Bluestore is especially important
because it was designed to work with a raw block device. Using
bluestore on top of a pre-created file system is an improper usage
scenario.
====
Ways to mitigate:
1. Introduce a new udev rule which sets up
/dev/disk/by-backing/<backing-device-name> symlinks to bcache devices:
cat /etc/udev/rules.d/bcache-by-backing.rules
SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVNAME}=="/dev/bcache*", PROGRAM="/lib/udev/bcache-name-helper.sh $kernel", SYMLINK+="disk/by-backing/$result"
cat /lib/udev/bcache-name-helper.sh
#!/bin/sh -e
logger Getting a backing device for a bcache device $1 by sysfs file creation timestamp
ls -c -1t /sys/block/$1/slaves/ | tail -n1
tree /dev/disk/by-backing/
/dev/disk/by-backing/
├── sdc -> ../../bcache2
├── sdd -> ../../bcache1
├── sde -> ../../bcache0
└── sdf -> ../../bcache3
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdf 8:80 0 64G 0 disk
└─bcache3 252:48 0 64G 0 disk
sdd 8:48 0 64G 0 disk
└─bcache1 252:16 0 64G 0 disk
sdb 8:16 0 64G 0 disk
├─bcache0 252:0 0 64G 0 disk
├─bcache3 252:48 0 64G 0 disk
├─bcache1 252:16 0 64G 0 disk
└─bcache2 252:32 0 64G 0 disk
sde 8:64 0 64G 0 disk
└─bcache0 252:0 0 64G 0 disk
sdc 8:32 0 64G 0 disk
└─bcache2 252:32 0 64G 0 disk
sda 8:0 0 64G 0 disk
└─sda1 8:1 0 64G 0 part /
2. Modify the Linux kernel source code to include a way to identify a particular bcache device (bdev UUID) and pass this in the uevent environment so that a udev rule in userspace can handle it, or pass an underlying device's serial number to a udev rule.
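
The helper script in mitigation 1 tells the backing device from the cache device by sysfs timestamps. As an alternative sketch only, the backing device might be looked up through the bcache symlink of the bcache device. This assumes (it is not stated in this report) that /sys/block/bcache<n>/bcache is a symlink into the backing device's sysfs directory. The function name bcache_backing_name and its optional sysfs root argument are hypothetical; the latter exists so the function can be tried against a fake tree:

```shell
#!/bin/sh
# bcache_backing_name (hypothetical name)
# Usage: bcache_backing_name <bcache-kernel-name> [sysfs-root]
# Prints the kernel name of the backing device, e.g. "sdc" for "bcache2".
bcache_backing_name() {
    dev=$1
    sysfs=${2:-/sys}
    link="$sysfs/block/$dev/bcache"
    # Assumption: this symlink points at <backing-device>/bcache.
    [ -L "$link" ] || return 1
    target=$(readlink -f "$link") || return 1
    # The parent directory of the link target is the backing device.
    basename "$(dirname "$target")"
}
```

In a udev rule this could be called from PROGRAM with $kernel as its argument, in the same way as bcache-name-helper.sh.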
====
Problems with the above respectively:
1. Doesn't work well with Juju storage because tags are assigned to
bcache device names visible in MAAS;
2. Upstream kernel modifications take time and resource allocation.
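
While a kernel change is pending, the backing device UUID can at least be read in userspace from the bcache superblock. A minimal sketch, assuming bcache-super-show from bcache-tools is installed and prints a dev.uuid line; the function name bcache_dev_uuid is hypothetical:

```shell
#!/bin/sh
# bcache_dev_uuid (hypothetical name): read bcache-super-show output from
# stdin and print the value of the dev.uuid field.
# Returns non-zero when no dev.uuid line is present.
bcache_dev_uuid() {
    awk '$1 == "dev.uuid" { print $2; found = 1; exit } END { exit !found }'
}
```

For example: bcache-super-show /dev/sdc | bcache_dev_uuid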
====
Other:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1705493
The lack of partitioning support for bcache devices in xenial kernels
(4.4) leaves us no ability to use ceph-disk to partition a block
device.
This will not be a problem in 18.04 or in 4.10 HWE kernels.
====
Bottom line
Right now the only way to use Ceph OSDs with bcache devices (filestore
or bluestore) on xenial GA kernel 4.4 is to use the following
approach:
* pre-create a file system in MAAS on a bcache device (bucketsconfig.yaml portion example https://paste.ubuntu.com/25787262/)
* use ceph-disk in the directory mode by passing a mount point of that file system to ceph-osd charm