ComputeManager._run_image_cache_manager_pass times out when running on NFS
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | In Progress | Medium | Lee Yarwood |
Ocata | Triaged | Medium | Unassigned |
Pike | Triaged | Medium | Unassigned |
Queens | Triaged | Medium | Unassigned |
Rocky | Triaged | Medium | Unassigned |
Bug Description
Description
===========
Under Pike we are operating a /var/lib/
We are mounting the share with standard NFS options and are considering actimeo as an improvement, unless it is expected to cause metadata consistency issues:
host:/share /var/lib/
But recently we noticed an increase of Error during ComputeManager.
which we mitigated by increasing the rpc_response_
As a result of the increased errors, we saw the nova-compute service flapping, which caused other issues such as volume attachments being delayed or erroring out.
Am I right in assuming that the resource tracker and service updates happen inside the same thread?
What else can we do to prevent these errors?
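To illustrate the concern behind the question above, here is a minimal sketch of how one blocking call can starve other work that shares the same cooperative scheduler. It is not nova code: it uses asyncio from the standard library as a stand-in for the eventlet-based service loop, `time.sleep` as a stand-in for a slow filesystem call on the NFS share, and all function names are hypothetical. Whether nova's resource tracker and service updates really share a thread is exactly the open question, so this only shows the failure mode under that assumption.

```python
import asyncio
import time


async def heartbeat(beats, interval, stop):
    """Hypothetical stand-in for a periodic service status update."""
    while not stop.is_set():
        beats.append(time.monotonic())
        await asyncio.sleep(interval)


async def slow_periodic_task(blocking_seconds):
    """Hypothetical stand-in for a pass that blocks on slow storage.

    time.sleep() never yields to the event loop, so nothing else can run
    until it returns.
    """
    time.sleep(blocking_seconds)


async def _run(blocking_seconds, interval):
    beats = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(beats, interval, stop))

    await asyncio.sleep(interval * 3)           # heartbeats arrive on time
    await slow_periodic_task(blocking_seconds)  # heartbeats stall here
    await asyncio.sleep(interval * 3)           # heartbeats resume

    stop.set()
    await hb
    # Largest delay between two consecutive heartbeats.
    return max(b - a for a, b in zip(beats, beats[1:]))


def largest_heartbeat_gap(blocking_seconds=0.3, interval=0.05):
    """Return the longest gap between heartbeats, in seconds."""
    return asyncio.run(_run(blocking_seconds, interval))


if __name__ == "__main__":
    gap = largest_heartbeat_gap()
    print(f"largest heartbeat gap: {gap:.3f}s")
```

In this sketch the longest gap between heartbeats is at least as long as the blocking call, even though the heartbeat interval is much shorter. If the real service behaves similarly, a long image cache pass over NFS could delay status reports enough for the service to appear down.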
Actual result
=============
2018-11-20 14:09:40.413 4294 ERROR oslo_service.
Expected result
===============
rpc_response_
Environment
===========
Ubuntu 16.04.4 LTS (amd64)
pips:
nova==16.1.5.dev57
nova-lxd=
nova-powervm=
python-
debs:
libvirt-bin 3.6.0-1ubuntu6.
libvirt-clients 3.6.0-1ubuntu6.
libvirt-daemon 3.6.0-1ubuntu6.
libvirt-
libvirt0 3.6.0-1ubuntu6.
python-libvirt 3.5.0-1build1~
description: | updated |
tags: | added: compute rpc |
Changed in nova: | |
assignee: | Matthew Booth (mbooth-9) → Lee Yarwood (lyarwood) |
Part of the problem seems to be that the image cache manager runs on all nodes and checks all instances:
2018-11-20 18:02:13.810 34617 INFO nova.compute.manager [req-b2f865a4-1d07-49d9-aefa-d1e6d0331ff3 - - - - -] Running image cache manager for 705 instances
That's from a debug line I added in manager.py, inside the _run_image_cache_manager_pass method: