stx-kubernetes sriov resources suddenly disappear

Bug #1917857 reported by Venkata Veldanda
This bug affects 1 person

Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Steven Webster

Bug Description

Brief Description
We are using STX 4.0.1 to install our Flexran-based 5G solution in AIO-SX mode. We had created VFs on the N3000 device and on some of the NIC interfaces. These resources were reflected in the Kubernetes allocatable resources. During the course of using the system, the allocatable resources for the N3000 and one of the NIC interface cards started coming up as 0. The following is part of the kubectl describe nodes output. The affected resources are intel.com/intel_fpga_fec, intel.com/pci_sriov_net_datanet_c, and intel.com/pci_sriov_net_datanet_u.

We already tried lock/unlock and deleting and re-creating the resources, but none of these helped to recover them.
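For reference, a quick way to watch just these extended resources is to filter the describe output shown below (a generic kubectl query, not StarlingX-specific; the node name controller-0 is assumed):

  kubectl describe node controller-0 | grep intel.com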

[root@controller-0 sysadmin(keystone_admin)]# cat /etc/build.info
###
### StarlingX
### Release 20.06
###

OS="centos"
SW_VERSION="20.06"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.4.0"

JOB="STX_4.0_build_layer_flock"
<email address hidden>"
BUILD_NUMBER="22"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2020-08-05 12:25:52 +0000"

FLOCK_OS="centos"
FLOCK_JOB="STX_4.0_build_layer_flock"
<email address hidden>"
FLOCK_BUILD_NUMBER="22"
FLOCK_BUILD_HOST="starlingx_mirror"
FLOCK_BUILD_DATE="2020-08-05 12:25:52 +0000"

Capacity:
  cpu: 96
  ephemeral-storage: 10190100Ki
  hugepages-1Gi: 46Gi
  hugepages-2Mi: 0
  intel.com/intel_fpga_fec: 0
  intel.com/pci_sriov_net_datanet_c: 0
  intel.com/pci_sriov_net_datanet_u: 0
  intel.com/pci_sriov_net_datanetbh1: 8
  intel.com/pci_sriov_net_datanetdn1: 8
  intel.com/pci_sriov_net_datanetmh1: 8
  memory: 97436728Ki
  pods: 110
Allocatable:
  cpu: 92
  ephemeral-storage: 9391196145
  hugepages-1Gi: 46Gi
  hugepages-2Mi: 0
  intel.com/intel_fpga_fec: 0
  intel.com/pci_sriov_net_datanet_c: 0
  intel.com/pci_sriov_net_datanet_u: 0
  intel.com/pci_sriov_net_datanetbh1: 8
  intel.com/pci_sriov_net_datanetdn1: 8
  intel.com/pci_sriov_net_datanetmh1: 8

It seems like everything is "OK" up to the SR-IOV device plugin, because the device plugin pod logs show that the correct number of resources is being reported to Kubernetes.
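One way to cross-check this from the host side (assuming the default kubelet device-plugin directory) is to confirm that the per-resource plugin sockets mentioned in the logs are actually present, since registration with kubelet happens over these sockets:

  ls -l /var/lib/kubelet/device-plugins/ | grep intel.com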

Here are some of the CLI outputs:

[root@controller-0 sysadmin(keystone_admin)]# system host-device-show controller-0 pci_0000_1d_00_0
+-----------------------+---------------------------------------------------------------------------------------------------------+
| Property | Value |
+-----------------------+---------------------------------------------------------------------------------------------------------+
| name | pci_0000_1d_00_0 |
| address | 0000:1d:00.0 |
| class id | 120000 |
| vendor id | 8086 |
| device id | 0d8f |
| class name | Processing accelerators |
| vendor name | Intel Corporation |
| device name | Device 0d8f |
| numa_node | 0 |
| enabled | True |
| sriov_totalvfs | 8 |
| sriov_numvfs | 8 |
| sriov_vfs_pci_address | 0000:1d:00.1,0000:1d:00.2,0000:1d:00.3,0000:1d:00.4,0000:1d:00.5,0000:1d:00.6,0000:1d:00.7,0000:1d:01.0 |
| sriov_vf_pdevice_id | 0d90 |
| extra_info | |
| created_at | 2021-03-03T13:46:26.363470+00:00 |
| updated_at | 2021-03-03T13:47:12.684827+00:00 |
| root_key | None |
| revoked_key_ids | None |
| boot_page | None |
| bitstream_id | None |
| bmc_build_version | None |
| bmc_fw_version | None |
| driver | igb_uio |
| sriov_vf_driver | igb_uio |
+-----------------------+---------------------------------------------------------------------------------------------------------+
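Since the FEC VFs are bound to igb_uio they will not appear as kernel netdevs; their binding state can be sanity-checked with the dpdk-devbind.py script that shows up later in the puppet logs (script path as it appears in this report, PCI addresses from the output above; treat this as an illustrative check):

  /usr/share/starlingx/scripts/dpdk-devbind.py --status | grep 0000:1d: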
[root@controller-0 sysadmin(keystone_admin)]# system host-if-show controller-0 sriovfh1
+-----------------+--------------------------------------+
| Property | Value |
+-----------------+--------------------------------------+
| ifname | sriovfh1 |
| iftype | ethernet |
| ports | [u'enp177s0f3'] |
| imac | 40:a6:b7:34:e4:a3 |
| imtu | 9216 |
| ifclass | pci-sriov |
| ptp_role | none |
| aemode | None |
| schedpolicy | None |
| txhashpolicy | None |
| uuid | 6f30a690-2414-424f-b5fc-d324d63cc502 |
| ihost_uuid | 8075e0db-4cc5-4d74-8601-849adce97b7e |
| vlan_id | None |
| uses | [] |
| used_by | [] |
| created_at | |
| updated_at | |
| sriov_numvfs | 16 |
| sriov_vf_driver | vfio |
| accelerated | [True] |
+-----------------+--------------------------------------+
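For context, an interface in this state would typically have been provisioned with commands along the following lines (a sketch of the usual StarlingX SR-IOV interface configuration; the flag names are an approximation and the data network names are taken from this report, so treat this as illustrative rather than the exact history):

  system host-if-modify controller-0 sriovfh1 -c pci-sriov -N 16 --vf-driver=vfio
  system interface-datanetwork-assign controller-0 sriovfh1 datanet-c
  system interface-datanetwork-assign controller-0 sriovfh1 datanet-u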

[root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-list controller-0
+--------------+--------------------------------------+----------+------------------+
| hostname | uuid | ifname | datanetwork_name |
+--------------+--------------------------------------+----------+------------------+
| controller-0 | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | sriovfh1 | datanet-c |
| controller-0 | 6aff29d7-cfaf-48b4-9802-b17b8a025efc | sriovdn1 | datanetdn1 |
| controller-0 | 76a2da50-11a6-408e-90b3-3a316cef6557 | sriovmh1 | datanetmh1 |
| controller-0 | e155e1d0-8dec-47e6-ac60-076832698a95 | sriovfh1 | datanet-u |
| controller-0 | e569db46-a31b-4b8f-b7ca-175b1168798f | sriovbh1 | datanetbh1 |
+--------------+--------------------------------------+----------+------------------+
[root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-show 63a44e7b-18f4-4f9b-8504-a950cb8abb86
+------------------+--------------------------------------+
| Property | Value |
+------------------+--------------------------------------+
| hostname | controller-0 |
| uuid | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 |
| ifname | sriovfh1 |
| datanetwork_name | datanet-c |
+------------------+--------------------------------------+
[root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-show e155e1d0-8dec-47e6-ac60-076832698a95
+------------------+--------------------------------------+
| Property | Value |
+------------------+--------------------------------------+
| hostname | controller-0 |
| uuid | e155e1d0-8dec-47e6-ac60-076832698a95 |
| ifname | sriovfh1 |
| datanetwork_name | datanet-u |
+------------------+--------------------------------------+
[root@controller-0 sysadmin(keystone_admin)]#

sriov device plugin logs:
=====================================================================================

controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep intel_fpga_fec
      "resourceName": "intel_fpga_fec",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275709 138581 manager.go:193] validating resource name "intel.com/intel_fpga_fec"
I0303 13:47:07.450400 138581 manager.go:116] Creating new ResourcePool: intel_fpga_fec
I0303 13:47:07.450446 138581 manager.go:145] New resource server is created for intel_fpga_fec ResourcePool
I0303 13:47:07.453772 138581 server.go:191] starting intel_fpga_fec device plugin endpoint at: intel.com_intel_fpga_fec.sock
I0303 13:47:07.454032 138581 server.go:217] intel_fpga_fec device plugin endpoint started serving
I0303 13:47:07.640208 138581 server.go:106] Plugin: intel.com_intel_fpga_fec.sock gets registered successfully at Kubelet
I0303 13:47:07.640225 138581 server.go:131] ListAndWatch(intel_fpga_fec) invoked
I0303 13:47:07.640342 138581 server.go:139] ListAndWatch(intel_fpga_fec): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1d:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
controller-0:/home/sysadmin#
controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_c
      "resourceName": "pci_sriov_net_datanet_c",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275645 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_c"
I0303 13:47:07.449979 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_c
I0303 13:47:07.450122 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_c ResourcePool
I0303 13:47:07.450478 138581 server.go:191] starting pci_sriov_net_datanet_c device plugin endpoint at: intel.com_pci_sriov_net_datanet_c.sock
I0303 13:47:07.451088 138581 server.go:217] pci_sriov_net_datanet_c device plugin endpoint started serving
I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_c) invoked
I0303 13:47:07.640068 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_c.sock gets registered successfully at Kubelet
I0303 13:47:07.639996 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_c): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_u
      "resourceName": "pci_sriov_net_datanet_u",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275700 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_u"
I0303 13:47:07.450306 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_u
I0303 13:47:07.450388 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_u ResourcePool
I0303 13:47:07.453443 138581 server.go:191] starting pci_sriov_net_datanet_u device plugin endpoint at: intel.com_pci_sriov_net_datanet_u.sock
I0303 13:47:07.453750 138581 server.go:217] pci_sriov_net_datanet_u device plugin endpoint started serving
I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_u) invoked
I0303 13:47:07.639945 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_u): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
I0303 13:47:07.640791 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_u.sock gets registered successfully at Kubelet
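If the plugin registers cleanly but the node still reports 0, another thing worth checking is what kubelet itself believes it holds; the device-manager checkpoint file (standard Kubernetes location, assumed here) lists the registered device IDs per resource:

  grep -o 'intel.com/[a-z_]*' /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint | sort | uniq -c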

Severity:
Critical - This is a show stopper and blocks the deployment of Flexran

Steps to Reproduce:
1> created 8 VFs on the FEC device with the igb_uio driver and 16 VFs on a 10G NIC with the vfio driver
2> locked and unlocked the host (system host-lock / system host-unlock)
3> checked the resources of the FEC device and the NIC from k8s:
Allocatable:
intel.com/intel_fpga_fec: 8
intel.com/pci_sriov_net_datanet_c: 16
intel.com/pci_sriov_net_datanet_u: 16

4> locked and unlocked the host multiple times during regular usage
5> observed that the k8s allocatable resources become 0 and never recover, even after multiple further host lock/unlock cycles (a sketch of this loop follows the list)

 Allocatable:
intel.com/intel_fpga_fec: 0
intel.com/pci_sriov_net_datanet_c: 0
intel.com/pci_sriov_net_datanet_u: 0

6> the SR-IOV daemonset pod logs seem to indicate correct processing of the above resource set definition and its registration with kubelet
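A minimal sketch of the lock/unlock loop from steps 4-5 (illustrative only; the sleep is a placeholder for waiting until the host is unlocked/enabled/available again):

  for i in 1 2 3; do
      system host-lock controller-0
      system host-unlock controller-0
      sleep 600   # wait for the host to come back up before checking
      kubectl describe node controller-0 | grep -E 'intel_fpga_fec|pci_sriov_net_datanet_(c|u)'
  done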

Expected Behavior:
All FPGA resources and SRIOV resources properly populated in the output of "kubectl describe nodes controller-0"

Actual Behavior:
Resources are not seen as expected

Reproducibility:
Intermittent

System Configuration
Simplex (AIO)

Branch/Pull Time/Commit
StarlingX 4.0 official ISO from http://mirror.starlingx.cengn.ca/mirror/starlingx/release/4.0.1/
The same issue was observed even with the ISO built on 26-Feb-2021 03:40 from http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/latest_green_build/outputs/

Timestamp/Logs
--------------
Logs are attached. Issue occurred on 03-03-2021
The collect log was captured after we tried several workarounds of removing the VF associations and recreating them (with a lock/unlock). Hence the config.json present in the collect log may not reflect the resources at that point in time.

Test Activity
Evaluation

Workaround
None

Revision history for this message
Venkata Veldanda (vveldanda) wrote :
description: updated
description: updated
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.networking
Revision history for this message
Steven Webster (swebster-wr) wrote :

In terms of SR-IOV I agree that things look OK (apart from the resources not being seen by kubelet). It's a bit strange to have sriovfh1 on two datanetworks with the same driver, but there is nothing stopping this...

In any case,

The final SR-IOV binds in the puppet worker run:

2021-03-03T15:32:58.739 Debug: 2021-03-03 15:32:58 +0000 Exec[sriov-bind-device: 0000:1d:01.0](provider=posix): Executing '/usr/share/starlingx/scripts/dpdk-devbind.py --bind=igb_uio 0000:1d:01.0
2021-03-03T15:32:58.744 Debug: 2021-03-03 15:32:58 +0000 Executing: '/usr/share/starlingx/scripts/dpdk-devbind.py --bind=igb_uio 0000:1d:01.0

And we can see the sriov device plugin started after the final bind, which is a good thing:

./var/extra/containerization.info:2021-03-03T15:33:03Z kube-sriov-device-plugin-amd64-888tg Pod Stopping container kube-sriovdp Killing Normal
./var/extra/containerization.info:%!s(<nil>) kube-sriov-device-plugin-amd64-cnqqq Pod Successfully assigned kube-system/kube-sriov-device-plugin-amd64-cnqqq to controller-0 Scheduled Normal
./var/extra/containerization.info:2021-03-03T15:33:09Z kube-sriov-device-plugin-amd64 DaemonSet Created pod: kube-sriov-device-plugin-amd64-cnqqq SuccessfulCreate Normal
./var/extra/containerization.info:2021-03-03T15:33:10Z kube-sriov-device-plugin-amd64-cnqqq Pod Started container kube-sriovdp Started Normal
./var/extra/containerization.info:2021-03-03T15:33:10Z kube-sriov-device-plugin-amd64-cnqqq Pod Container image "registry.local:9001/docker.io/starlingx/k8s-plugins-sriov-network-device:stx.4.0-v3.2-16-g4e0302ae" already present on machine Pulled Normal

Examining daemon.log for the kubelet logs:

2021-03-03T15:33:10.815 controller-0 kubelet[133031]: info E0303 15:33:10.815867 133031 kubelet_node_status.go:92] Unable to register node "controller-0" with API server: Node "controller-0" is invalid: [status.capacity.hugepages-2Mi: Invalid value: resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.intel.com/pci_sriov_net_datanetbh1: Invalid value: resource.Quantity{i:resource.int64Amount{value:8, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"8", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.intel.com/pci_sriov_net_datanetmh1: Invalid value: resource.Quantity{i:resource.int64Amount{value:8, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"8", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.memory: Invalid value: resource.Quantity{i:resource.int64Amount{value:99775209472, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.hugepages-2Mi: Invalid value: resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, ...
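The "may not have pre-allocated hugepages for multiple page sizes" messages suggest kubelet saw both 1Gi and 2Mi pools at node registration time; what the kernel is actually exposing can be checked with the standard sysfs/procfs paths (not StarlingX-specific):

  grep . /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages
  grep -i huge /proc/meminfo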


Changed in starlingx:
status: New → Triaged
Revision history for this message
Venkata Veldanda (vveldanda) wrote :

1. We only created 1G hugepages using the following commands:
    system host-memory-modify controller-0 0 -1G 38
    system host-memory-modify controller-0 1 -1G 10
   We did not deliberately create any 2M pages.

2. Other than the SR-IOV VF resources, we did not see the count issue with other resources such as CPU, memory, or hugepages.
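A way to cross-check what sysinv has configured against what the node is reporting (system host-memory-list is the standard StarlingX command; the grep is illustrative):

  system host-memory-list controller-0
  kubectl describe node controller-0 | grep -i hugepages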

Changed in starlingx:
assignee: nobody → Steven Webster (swebster-wr)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: marking as low due to lack of activity

Changed in starlingx:
importance: Undecided → Low