do_refresh_security_group_rules in nova.virt.firewall is very slow
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) |
Fix Released
|
High
|
Michael Still | ||
Folsom |
Fix Released
|
High
|
Pádraig Brady | ||
nova (Ubuntu) |
Fix Released
|
Undecided
|
Unassigned | ||
Quantal |
Fix Released
|
Undecided
|
Unassigned |
Bug Description
This is a bug against stable essex. I have made no attempt to determine if this is still a problem in Folsom at this stage.
During a sprint this week we took a nova region which was previously relatively idle and started turning up large numbers of instances using juju. We started to experience very slow instance starts, which I dug into. I should note that juju seems to trigger this behaviour by refreshing security groups when ports are exposed, but other openstack users will probably experience problems if they are trying to do non-trivial things with security groups.
It appears that do_refresh_
(logs from scheduler node)
2012-10-05 08:06:28 run_instance
2012-10-05 08:06:29 cast to <<compute node>>
(logs from compute node)
2012-10-05 08:07:21 Starting instance...
2012-10-05 08:07:34 Starting toXML method
2012-10-05 08:07:43 Finished toXML method
2012-10-05 08:07:43 Called setup_basic_
2012-10-05 08:07:43 Ensuring static filters
2012-10-05 08:08:48 Attempting to grab semaphore "iptables" for method "_do_refresh_
2012-10-05 08:24:00 Got semaphore "iptables" for method "_do_refresh_
2012-10-05 08:24:01 Creating image
2012-10-05 08:24:06 Instance is running
2012-10-05 08:25:28 Checking state
2012-10-05 08:25:30 Instance spawned successfully.
I instrumented utils.synchronized to include lock wait and help times like this (patch against essex):
diff --git a/nova/utils.py b/nova/utils.py
index 6535b06..2e01a15 100644
--- a/nova/utils.py
+++ b/nova/utils.py
@@ -926,10 +926,16 @@ def synchronized(name, external=False):
+ started_waiting = time.time()
+
with sem:
- '"%(method)s"...') % {'lock': name,
- 'method': f.__name__})
+ '"%(method)s" after %(wait)f second wait...'),
+ {'lock': name,
+ 'method': f.__name__,
+ 'wait': time.time() - started_waiting})
+ started_working = time.time()
+
if external and not FLAGS.disable_
@@ -945,6 +951,12 @@ def synchronized(name, external=False):
+ LOG.debug(
+ '"%(method)s" after %(wait)f seconds of use...'),
+ {'lock': name,
+ 'method': f.__name__,
+ 'wait': time.time() - started_working})
+
# If no-one else is waiting for it, delete it.
# See note about possible raciness above.
if not sem.balance < 1:
Taking a look at the five longest lock holds in my logs after this patch is applied, I get:
# grep "Released semaphore" /var/log/
192.134270 "do_refresh_
194.140478 "do_refresh_
194.153729 "do_refresh_
201.135854 "do_refresh_
297.725837 "do_refresh_
So I then instrumented do_refresh_
diff --git a/nova/
index f0f1594..99f580a 100644
--- a/nova/
+++ b/nova/
@@ -17,6 +17,8 @@
# License for the specific language governing permissions and limitations
# under the License.
+import time
+
from nova import context
from nova import db
from nova import flags
@@ -167,16 +169,35 @@ class IptablesFirewal
def add_filters_
+ start_time = time.time()
+ LOG.debug(_('Get network info took %f seconds'),
+ time.time() - start_time)
+
+ start_time = time.time()
chain_name = self._instance_
+ LOG.debug(_('Get chain name took %f seconds'),
+ time.time() - start_time)
+
+ start_time = time.time()
if FLAGS.use_ipv6:
+ LOG.debug(_('Add chain took %f seconds'),
+ time.time() - start_time)
+
+ start_time = time.time()
+ LOG.debug(
+ time.time() - start_time)
+
+ start_time = time.time()
+ LOG.debug(_('Add filters took %f seconds'),
+ time.time() - start_time)
def remove_
chain_name = self._instance_
@@ -362,9 +383,17 @@ class IptablesFirewal
@utils.
def do_refresh_
+ # TODO(mikal): why is security group passed in if its not used?
for instance in self.instances.
+ remove_start = time.time()
+ add_start = time.time()
+ LOG.debug(
+ 'took %(remove)f seconds, add took %(add)f seconds)'),
+ {'remove': add_start - remove_start,
+ 'add': time.time() - add_start},
+ instance=instance)
def refresh_
"""See :class:
This gives me log entries like this (not all are this slow):
2012-10-05 13:27:43 DEBUG nova.virt.firewall [req-051d71e4-
2012-10-05 13:27:43 DEBUG nova.virt.firewall [req-051d71e4-
I'm still looking at this, but it looks like nova.virt.
Related branches
CVE References
Changed in nova: | |
assignee: | nobody → Michael Still (mikalstill) |
Changed in nova (Ubuntu): | |
status: | Confirmed → In Progress |
Changed in nova: | |
importance: | Undecided → High |
Changed in nova: | |
milestone: | none → grizzly-1 |
status: | Fix Committed → Fix Released |
Changed in nova (Ubuntu Quantal): | |
status: | New → Confirmed |
Changed in nova: | |
milestone: | grizzly-1 → 2013.1 |
tags: |
added: verification-done removed: verification-needed |
no longer affects: | nova/essex |
Status changed to 'Confirmed' because the bug affects multiple users.