juju 1.24 poor performance
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | Triaged | High | Unassigned | |
| 1.24 | Triaged | High | Unassigned | |
Bug Description
Using juju 1.24.3 I'm unable to finish a deployment that works fine with 1.22. I've retried the deployment at least 3 times in the last 24h and had no success. Observed problems (the commands involved are sketched after this list):
- juju debug-log stalls for long periods; the end result is debug-log printing messages up to 10 minutes old
- juju status takes 3-5 minutes to return output
- ran juju add-unit 10 minutes ago; juju has still not requested a new node
- 3 units from the 190-node deployment are marked as pending, but were never requested from MAAS
- forcefully terminating nodes does nothing
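For reference, a rough sketch of the commands behind the observations above (the service and machine names are placeholders, not taken from this deployment):

```
# Tail the environment's consolidated log; here it lags by up to 10 minutes.
juju debug-log

# Report environment state; here it takes 3-5 minutes instead of seconds.
juju status

# Request an additional unit (and therefore a new MAAS node).
juju add-unit <service>

# Forcefully remove a machine; in this environment it appears to do nothing.
juju destroy-machine --force <machine-id>
```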
In general, the whole juju environment seems unusable and I'm not sure how to proceed. The deployment was started at 8:22AM; at 12:03 it's still not done. With 1.22 it took around 2h.
The environment has 192 nodes, of which 174 are used. This seems like a continuation of bug 1474195. While the memory leak is no longer present, the load on the juju state server is still rather high (11-14), and the logs and juju DB have consumed ~8GB of disk in those 4h:
```
root@verifiable
3.0G /var/lib/juju/
root@verifiable
2.3G /var/log/juju/
root@verifiable
2.7G /var/log/syslog
```
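A sketch of how such figures can be gathered on the state server, assuming the truncated prompts above were du invocations (the exact commands aren't preserved in the report):

```
# Load average on the state server (reported as 11-14 above).
uptime

# On-disk size of Juju's database/agent data, Juju logs, and syslog.
du -sh /var/lib/juju/
du -sh /var/log/juju/
du -sh /var/log/syslog
```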
Deployment is done with juju-deployer (version 0.4.3-0ubuntu1~
description: updated
Changed in juju-core:
importance: Undecided → High
assignee: nobody → Menno Smits (menno.smits)
status: New → Triaged
Changed in juju-core:
assignee: Menno Smits (menno.smits) → nobody
tags: added: sts
Changed in juju-core:
milestone: none → 1.25.0
I've been doing controlled performance comparisons between 1.22.6 and 1.24.3 using a MAAS environment today and I haven't been able to find a significant difference. My test deployed 10 machines, each with 10 containers on them (with units being added with some parallelism), and both Juju versions completed in almost exactly the same time. This at least gives me some idea about where the problem /doesn't/ lie.
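For comparison, here is a rough sketch of what such a test might look like, assuming the stock ubuntu charm and LXC placement (the exact charm and placement used in the test aren't stated):

```
#!/bin/bash
# Hypothetical reconstruction of the comparison test: 10 machines,
# each hosting 10 container-based units, added with some parallelism.

# Bring up 10 host machines (numbered 1..10; machine 0 is the state server).
for m in $(seq 1 10); do
    juju add-machine
done

# Deploy a simple charm, placing its first unit in an LXC container on machine 1 ...
juju deploy ubuntu --to lxc:1

# ... then add the remaining units into containers, with some parallelism.
for m in $(seq 1 10); do
    for c in $(seq 1 10); do
        # Skip the slot already taken by the initial unit.
        [ "$m" -eq 1 ] && [ "$c" -eq 1 ] && continue
        juju add-unit ubuntu --to lxc:$m &
    done
    wait
done
```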
It's likely that there's some aspect of the charms or the additional scale of what you're deploying that's triggering the issue.
Can you please give me more detailed instructions on how to reproduce?
The state server logs would also be very helpful.