I checked the docs (https://docs.mirantis.com/openstack/fuel/fuel-7.0/user-guide.html#partition-preservation and https://docs.mirantis.com/openstack/fuel/fuel-7.0/user-guide.html#node-reinstallation) and it looks like our approach to this was way too naive and only worked by chance:
1) we don't describe how to put nodes into "maintenance mode": e.g. when node reinstallation is triggered on a compute node, we leave its VMs in the ACTIVE state, even though they will actually be shut off during re-provisioning / re-deployment. Nor do we disable the nova-compute service in the scheduler (nova service-disable), so in busy clouds we may end up with half-provisioned VMs interrupted by node reinstallation (a minimal quiescing sketch follows this list)
2) because of the way our deployment works, we run into multiple problems when re-deploying a compute node with its partitions preserved:
- we lose all OVS bridges and have no control over the order of OVS vs nova-compute initialization (the upstart start order is not enough - we need to make sure the OVS / neutron agent has *completed* its initialization before we attempt to boot VMs, not merely that it has started; a readiness-check sketch follows this list)
- the way our deployment works, nova-compute gets configured and restarted multiple times. Interestingly, the Ceph tasks are executed at a late stage of deployment, so it is possible that nova-compute starts after re-deployment with Ceph not yet configured for ephemeral storage, even though the node *did* use Ceph ephemeral before the reinstallation was triggered. In that case we will mistakenly rewrite the VMs' XML definitions to point at local disks which never existed
And so on. These are just two problems we've run into so far, but it looks like there are more to come.
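
For illustration, here is a minimal sketch of the kind of quiescing step item 1 asks for, using python-novaclient. The credentials, endpoint and host name are placeholders, and this is not something Fuel does today:

# Hypothetical pre-reinstallation quiescing sketch (python-novaclient).
# Credentials, endpoint and host name are placeholders, not Fuel defaults.
from novaclient import client as nova_client

def quiesce_compute_node(nova, host):
    # Take the host out of scheduling, equivalent to
    # `nova service-disable <host> nova-compute`.
    nova.services.disable(host, 'nova-compute')

    # Stop every ACTIVE instance on the host so Nova's view (SHUTOFF)
    # matches reality once re-provisioning wipes the hypervisor.
    servers = nova.servers.list(search_opts={'host': host, 'all_tenants': 1})
    for server in servers:
        if getattr(server, 'OS-EXT-STS:vm_state', '') == 'active':
            server.stop()

if __name__ == '__main__':
    nova = nova_client.Client('2', 'admin', 'secret', 'admin',
                              'http://192.168.0.2:5000/v2.0')
    quiesce_compute_node(nova, 'node-4.domain.tld')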
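
Similarly, a rough sketch of the readiness check mentioned in the OVS bullet above, assuming python-neutronclient and a placeholder endpoint. Polling the agent's 'alive' flag is one possible way to confirm the agent completed initialization; it is not a mechanism Fuel uses today:

# Hypothetical readiness gate: poll Neutron until the OVS agent on the
# re-deployed host reports alive before any instances are started.
# Timeout values, credentials and host name are placeholders.
import time
from neutronclient.v2_0 import client as neutron_client

def wait_for_ovs_agent(neutron, host, timeout=300, interval=5):
    deadline = time.time() + timeout
    while time.time() < deadline:
        agents = neutron.list_agents(
            host=host, binary='neutron-openvswitch-agent')['agents']
        if agents and agents[0].get('alive'):
            return True
        time.sleep(interval)
    raise RuntimeError('OVS agent on %s did not report alive within %s seconds'
                       % (host, timeout))

if __name__ == '__main__':
    neutron = neutron_client.Client(username='admin', password='secret',
                                    tenant_name='admin',
                                    auth_url='http://192.168.0.2:5000/v2.0')
    wait_for_ovs_agent(neutron, 'node-4.domain.tld')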