Update:
- Validated comparable OOMs (though at a lower frequency) on juju 2.1-beta4.
- Logging verbosity is turned up on the juju controller in all failure cases, so the logs collection grows much faster than it would under the default configuration (see the command sketch after this list):
-- "logging-config: <root>=DEBUG;juju.apiserver=TRACE"
When using Juju 2.1-beta3 to deploy a Newton OpenStack HA cloud, I've run into out-of-memory errors where the kernel kills off mongod every 8-15 minutes.
Mongo quickly climbs to > 4 GB of memory on my 16 GB bootstrap node, and the node becomes completely unresponsive to ssh. I can see the mongo service timing out and causing errors in other juju units as their update-status hooks run.
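A couple of generic Linux checks (not Juju-specific) that show this picture on the bootstrap node:

  # watch mongod's resident memory (RSS, in KB) climb
  ps -o pid,rss,vsz,cmd -C mongod

  # confirm the kernel OOM killer is repeatedly targeting mongod
  grep -i 'out of memory' /var/log/syslog | grep -i mongod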
This cloud deployment is a 6-node cluster; each node has 16 GB of memory and runs around 7 LXC containers hosting various OpenStack applications, so there are lots of hooks firing for each highly-available application.
The bootstrap node therefore shares its resources with the 7 other LXC containers running OpenStack services on that machine.
Because of the OOM issues, juju doesn't respond to status or ssh commands, many service endpoints time out, and service status updates are lost, causing update-status hooks to fail.
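To check whether the verbose logging above is what's driving mongo's growth, the logs database can be inspected directly on the bootstrap node (a sketch, not a verified procedure: port 37017 is Juju's mongod port, but the database/collection names are assumptions for this Juju version, and the <user>/<password> placeholders would come from the agent credentials in /var/lib/juju/agents/machine-0/agent.conf):

  # connect to Juju's mongod; credentials are placeholders from agent.conf
  mongo --ssl --sslAllowInvalidCertificates --authenticationDatabase admin \
        -u <user> -p <password> localhost:37017/logs
  > db.stats()        // overall size of the logs database
  > db.logs.stats()   // size and document count of the logs collection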
Note: Initially, the region deployment succeeded and I was able to bootstrap on the deployed cloud. I left the cloud untouched for 3 days and came back to see it completely unresponsive.
Juju status output for the failed OpenStack HA deployment.