Description:
During Ubuntu installation the "late_command" directive is used for many tasks, e.g. disk partitioning, grub installation, etc.
In practice, though, all of this is passed as a single shell string, and a pretty big one.
In our case we have ceph-osd nodes with 23 disks (1 OS, 4 journals, 18 OSDs). During Ubuntu installation the "late_command" directive fails and the node goes into a reboot loop (the nopxe flag was not set). After some investigation I found that late_command fails immediately with an 'Argument list too long' error. After reducing the number of disks or removing some parts of late_command, it starts working. My assumption is that the string is too long and exceeds some kernel limit (e.g. MAX_ARG_PAGES, see http://www.linuxjournal.com/article/6060). For how "late_command" is executed, see http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/preseed/precise/view/head:/preseed_command#L16
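To check the "string too long" assumption, one could measure the generated late_command string before it goes into the preseed. The following is only a sketch: the 128 KiB per-argument limit (MAX_ARG_STRLEN on recent Linux kernels) is an assumption about the installer kernel, and the function name and the ASSUMED_ARG_LIMIT variable are made up for illustration.

```shell
#!/bin/sh
# Assumption: a single argv string may not exceed 128 KiB
# (MAX_ARG_STRLEN on recent Linux kernels). The debian-installer
# kernel's real limit may differ.
ASSUMED_ARG_LIMIT=131072

# fits_in_one_arg STRING [LIMIT]
# Returns 0 if STRING plus its terminating NUL byte fits within LIMIT bytes.
# printf is a shell builtin in common shells, so measuring the string
# does not itself hit the exec limit.
fits_in_one_arg() {
    len=$(printf '%s' "$1" | wc -c | tr -d ' ')
    limit=${2:-$ASSUMED_ARG_LIMIT}
    [ $((len + 1)) -le "$limit" ]
}
```

With a hypothetical variable holding the generated command, this could be used as `fits_in_one_arg "$late_command" || echo "late_command too long" >&2`.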
Environment:
Fuel 5.0, Ubuntu, HA, Neutron+VLAN, Ceph on dedicated nodes
Steps to reproduce:
- Create an environment (HA or not) with dedicated Ceph nodes
- Assign a large number of disks to ceph-osd (at least 20)
- Deploy
Expected result:
- Ubuntu installed successfully
Actual result:
- Ubuntu installation is stuck at 100% in Fuel (reinstallation loop)
Possible solution:
- Rebuild the debian-installer kernel with an increased limit (needs more research)
- Move the Ceph partitioning out of late_command (to Puppet?)
Details:
In /var/log/remote/node-X.domain.tld/finish-install.log:
2014-07-10T12:42:38.789606+00:00 notice: info: Running /usr/lib/finish-install.d/07preseed
2014-07-10T12:42:38.845174+00:00 notice: /bin/preseed_command: line 23: logger: Argument list too long
2014-07-10T12:42:38.846209+00:00 notice: warning: /usr/lib/finish-install.d/07preseed returned error code 2
To fix this bug we could put this long string into a script somewhere on the master node and then download the script via HTTP during the late preseed stage. That is quite an ugly solution, and my suggestion is to fix this bug by using the image-based provisioning scheme, which will be available starting with 6.0. Since nodes with such a large number of disks are quite rare, this bug is medium rather than high priority.
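For reference, the download-based workaround could look roughly like the sketch below, so that late_command itself stays short. The URL, the destination path and the function names are hypothetical (the report does not say where the script would live on the master node); the sketch only assumes that wget is available in the installer environment.

```shell
#!/bin/sh
# Hypothetical location of the script on the master node.
SCRIPT_URL="http://master.example/late_command.sh"

# fetch URL DEST: download URL to DEST (assumes wget is available).
fetch() {
    wget -q -O "$2" "$1"
}

# run_remote_script URL DEST
# Downloads the script and runs it with /bin/sh. Returns non-zero if
# the download fails, otherwise the exit status of the script.
run_remote_script() {
    url=$1
    dest=$2
    fetch "$url" "$dest" || return 1
    /bin/sh "$dest"
}
```

late_command would then only need a short call such as `run_remote_script "$SCRIPT_URL" /tmp/late_command.sh`, with the long partitioning logic kept in the downloaded file.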