We have a new Xen CI run with the more detailed debug logs now, and it basically confirms what I think the problem is, but apparently this is a race, because this time the Xen CI job passed.
Starting with the logs here:
http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/13/597613/2/check/dsvm-tempest-neutron-network/621833d/logs/screen-n-cpu.txt.gz
This is the initial inventory update for the newly created compute node and resource provider on start of nova-compute:
Aug 30 08:02:12.144029 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: DEBUG nova.scheduler.client.report [None req-f4b08178-4b7a-4fba-b57a-91612721f970 None None] Updated inventory for 9c58942c-d183-455a-a760-991e4430e816 at generation 1: {'VCPU': {'allocation_ratio': 16.0, 'total': 4, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 4}, 'MEMORY_MB': {'allocation_ratio': 1.5, 'total': 12795, 'reserved': 512, 'step_size': 1, 'min_unit': 1, 'max_unit': 12795}, 'DISK_GB': {'allocation_ratio': 1.0, 'total': 47, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 47}} {{(pid=24292) _update_inventory_attempt /opt/stack/new/nova/nova/scheduler/client/report.py:967}}
Then when the update_available_resource periodic task runs, we see that _copy_resources updates the in-memory ComputeNode.*_allocation_ratio values to 0.0 from the config:
Aug 30 08:03:09.458344 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] ComputeNode.cpu_allocation_ratio changed from 16.0 to 0.0 in _copy_resources.
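To illustrate the copy step I mean (just a sketch with made-up names, not the actual _copy_resources code): if the allocation ratio options are still at their 0.0 default, that default gets copied over whatever is already on the in-memory ComputeNode.

def copy_allocation_ratios_from_config(conf, compute_node):
    # Illustrative sketch only, not the real nova ResourceTracker code.
    # Assumes conf.cpu/ram/disk_allocation_ratio default to 0.0 when unset.
    for res in ('cpu', 'ram', 'disk'):
        attr = '%s_allocation_ratio' % res
        # If the operator never set the option, this copies the 0.0
        # default over the 16.0/1.5/1.0 already on the node object.
        setattr(compute_node, attr, getattr(conf, attr))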
And the _resource_change method, called from RT._update, confirms the change:
Aug 30 08:03:09.549234 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Compute node resources have changed.
Aug 30 08:03:09.549407 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: Old: ComputeNode(cpu_allocation_ratio=16.0,cpu_info='{"model": "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz", "vendor": "GenuineIntel", "features": ["fpu", "de", "tsc", "msr", "pae", "mce", "cx8", "apic", "sep", "mca", "cmov", "pat", "clflush", "mmx", "fxsr", "sse", "sse2", "syscall", "nx", "lm", "constant_tsc", "rep_good", "nopl", "pni", "pclmulqdq", "ssse3", "cx16", "sse4_1", "sse4_2", "movbe", "popcnt", "aes", "rdrand", "hypervisor", "lahf_lm", "abm", "fsgsbase", "bmi1", "bmi2", "erms"], "topology": {"cores": 1, "threads": 1, "sockets": 4}}',created_at=2018-08-30T15:02:11Z,current_workload=0,deleted=False,deleted_at=None,disk_allocation_ratio=1.0,disk_available_least=27,free_disk_gb=47,free_ram_mb=12283,host='localhost',host_ip=192.168.33.1,hypervisor_hostname='localhost',hypervisor_type='XenServer',hypervisor_version=7001000,id=1,local_gb=47,local_gb_used=0,mapped=0,memory_mb=12795,memory_mb_used=512,metrics='[]',numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=1.5,running_vms=0,service_id=None,stats={failed_builds='0'},supported_hv_specs=[HVSpec,HVSpec],updated_at=None,uuid=9c58942c-d183-455a-a760-991e4430e816,vcpus=4,vcpus_used=0)
Aug 30 08:03:09.549912 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: New: ComputeNode(cpu_allocation_ratio=0.0,cpu_info='{"model": "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz", "vendor": "GenuineIntel", "features": ["fpu", "de", "tsc", "msr", "pae", "mce", "cx8", "apic", "sep", "mca", "cmov", "pat", "clflush", "mmx", "fxsr", "sse", "sse2", "syscall", "nx", "lm", "constant_tsc", "rep_good", "nopl", "pni", "pclmulqdq", "ssse3", "cx16", "sse4_1", "sse4_2", "movbe", "popcnt", "aes", "rdrand", "hypervisor", "lahf_lm", "abm", "fsgsbase", "bmi1", "bmi2", "erms"], "topology": {"cores": 1, "threads": 1, "sockets": 4}}',created_at=2018-08-30T15:02:11Z,current_workload=0,deleted=False,deleted_at=None,disk_allocation_ratio=0.0,disk_available_least=27,free_disk_gb=47,free_ram_mb=12283,host='localhost',host_ip=192.168.33.1,hypervisor_hostname='localhost',hypervisor_type='XenServer',hypervisor_version=7001000,id=1,local_gb=47,local_gb_used=0,mapped=0,memory_mb=12795,memory_mb_used=512,metrics='[]',numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=0.0,running_vms=0,service_id=None,stats={failed_builds='0'},supported_hv_specs=[HVSpec,HVSpec],updated_at=2018-08-30T15:02:11Z,uuid=9c58942c-d183-455a-a760-991e4430e816,vcpus=4,vcpus_used=0)
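The change check itself amounts to roughly this (again just a sketch; the field list and names are illustrative, not the real _resource_change implementation):

TRACKED_FIELDS = ('cpu_allocation_ratio', 'ram_allocation_ratio',
                  'disk_allocation_ratio', 'vcpus', 'memory_mb', 'local_gb')

def resource_change(old_values, new_values):
    # Illustrative sketch only: compare the last-saved snapshot with the
    # freshly copied values and report whether any tracked field differs.
    return any(old_values.get(f) != new_values.get(f) for f in TRACKED_FIELDS)

# In this run cpu_allocation_ratio went from 16.0 to 0.0, so this returns
# True and the tracker goes on to call ComputeNode.save().
assert resource_change({'cpu_allocation_ratio': 16.0},
                       {'cpu_allocation_ratio': 0.0})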
Now while that happens, I see a ProviderTree.update_inventory call, and I'm not sure from where:
Aug 30 08:03:09.614402 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.provider_tree [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Updating inventory in ProviderTree for provider 9c58942c-d183-455a-a760-991e4430e816 with inventory: {u'VCPU': {u'allocation_ratio': 16.0, u'total': 4, u'reserved': 0, u'step_size': 1, u'min_unit': 1, u'max_unit': 4}, u'MEMORY_MB': {u'allocation_ratio': 1.5, u'total': 12795, u'reserved': 512, u'step_size': 1, u'min_unit': 1, u'max_unit': 12795}, u'DISK_GB': {u'allocation_ratio': 1.0, u'total': 47, u'reserved': 0, u'step_size': 1, u'min_unit': 1, u'max_unit': 47}}
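The reason I care about where that call comes from is that the ProviderTree is a shared in-memory cache. A toy model of that sharing (made-up names, not the real ProviderTree API):

class ToyProviderCache(object):
    # Toy stand-in for the shared in-memory provider cache, not the real
    # ProviderTree class. The point is that it is shared: a write from one
    # code path changes what a later "has it changed?" check in another
    # code path compares against.
    def __init__(self):
        self._inventories = {}

    def update_inventory(self, provider_uuid, inventory):
        self._inventories[provider_uuid] = inventory

    def has_inventory_changed(self, provider_uuid, inventory):
        return self._inventories.get(provider_uuid) != inventory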
Back to the update_available_resource periodic thread: because RT._resource_change returned True, we called ComputeNode.save(), which calls ComputeNode._from_db_object, which resets ComputeNode.cpu_allocation_ratio from 0.0 back to 16.0, and then we pass that ComputeNode object to the RT._normalize_inventory_from_cn_obj method:
Aug 30 08:03:10.089986 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Begin _normalize_inventory_from_cn_obj for node 9c58942c-d183-455a-a760-991e4430e816
Aug 30 08:03:10.090256 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Using cpu_allocation_ratio 16.0 for node: 9c58942c-d183-455a-a760-991e4430e816
Aug 30 08:03:10.090512 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] RT: Sending compute node inventory changes back to placement for node: 9c58942c-d183-455a-a760-991e4430e816
And in there we are setting cpu_allocation_ratio in the VCPU inventory dict properly to 16.0.
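Roughly what that normalization does to the VCPU entry (a sketch with assumed names, not the actual _normalize_inventory_from_cn_obj code):

def normalize_vcpu_inventory(inv_data, compute_node):
    # Illustrative sketch only: overwrite the VCPU allocation_ratio with the
    # value on the ComputeNode object, which by this point is 16.0 again
    # because ComputeNode.save()/_from_db_object reset it from 0.0.
    vcpu_inv = inv_data.get('VCPU')
    if vcpu_inv is not None:
        vcpu_inv['allocation_ratio'] = compute_node.cpu_allocation_ratio
    return inv_data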
And because of that, we don't update the inventory, so the cpu_allocation_ratio in placement stays at 16.0 rather than being changed to 0.0:
Aug 30 08:03:10.170157 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.scheduler.client.report [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] (Local) inventory has not changed for provider 9c58942c-d183-455a-a760-991e4430e816 based on inventory data: {'VCPU': {'allocation_ratio': 16.0, 'total': 4, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 4}, 'MEMORY_MB': {'allocation_ratio': 1.5, 'total': 12795, 'reserved': 512, 'step_size': 1, 'min_unit': 1, 'max_unit': 12795}, 'DISK_GB': {'allocation_ratio': 1.0, 'total': 47, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 47}}
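That "(Local) inventory has not changed" message is the local short-circuit; roughly, with made-up names rather than the real report client code:

def maybe_push_inventory(cached_inventory, new_inventory, push_to_placement):
    # Illustrative sketch only: if the locally cached view already matches
    # the newly computed inventory, nothing is sent to placement and
    # whatever placement already has is left alone.
    if cached_inventory == new_inventory:
        return False
    push_to_placement(new_inventory)
    return True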
But there is definitely something going on here with the shared ProviderTree cache, I'm just not sure where.