Comment 2 for bug 1917857

Steven Webster (swebster-wr) wrote :

In terms of SR-IOV, I agree that things look OK (apart from the resources not being seen by kubelet). It's a bit strange to have sriovfh1 on two datanetworks with the same driver, but there is nothing stopping this...
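
Just for reference, the datanetwork assignment can be cross-checked from the sysinv side with something like the following (assuming controller-0 is the host in question and the standard sysinv CLI is available):

  # show which datanetworks each interface is assigned to
  system interface-datanetwork-list controller-0
  # show the SR-IOV interface details (class, VF count, driver)
  system host-if-show controller-0 sriovfh1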

In any case, the final SR-IOV bind in the puppet worker run:

2021-03-03T15:32:58.739 Debug: 2021-03-03 15:32:58 +0000 Exec[sriov-bind-device: 0000:1d:01.0](provider=posix): Executing '/usr/share/starlingx/scripts/dpdk-devbind.py --bind=igb_uio 0000:1d:01.0
2021-03-03T15:32:58.744 Debug: 2021-03-03 15:32:58 +0000 Executing: '/usr/share/starlingx/scripts/dpdk-devbind.py --bind=igb_uio 0000:1d:01.0
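
If it helps, the resulting binding can be confirmed on the host with the standard DPDK script's --status option, or directly from sysfs:

  # list devices and the drivers they are bound to
  /usr/share/starlingx/scripts/dpdk-devbind.py --status | grep 0000:1d:01.0
  # driver symlink for the VF in question
  readlink /sys/bus/pci/devices/0000:1d:01.0/driver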

And we can see the SR-IOV device plugin was restarted after the final bind, which is a good thing:

./var/extra/containerization.info:2021-03-03T15:33:03Z kube-sriov-device-plugin-amd64-888tg Pod Stopping container kube-sriovdp Killing Normal
./var/extra/containerization.info:%!s(<nil>) kube-sriov-device-plugin-amd64-cnqqq Pod Successfully assigned kube-system/kube-sriov-device-plugin-amd64-cnqqq to controller-0 Scheduled Normal
./var/extra/containerization.info:2021-03-03T15:33:09Z kube-sriov-device-plugin-amd64 DaemonSet Created pod: kube-sriov-device-plugin-amd64-cnqqq SuccessfulCreate Normal
./var/extra/containerization.info:2021-03-03T15:33:10Z kube-sriov-device-plugin-amd64-cnqqq Pod Started container kube-sriovdp Started Normal
./var/extra/containerization.info:2021-03-03T15:33:10Z kube-sriov-device-plugin-amd64-cnqqq Pod Container image "registry.local:9001/docker.io/starlingx/k8s-plugins-sriov-network-device:stx.4.0-v3.2-16-g4e0302ae" already present on machine Pulled Normal
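
It would also be worth looking at the device plugin's own view of the discovered VFs, e.g. something like:

  # pod/container names taken from the events above
  kubectl -n kube-system logs kube-sriov-device-plugin-amd64-cnqqq -c kube-sriovdp | grep -i -e resource -e device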

Examining daemon.log for the kubelet logs:

2021-03-03T15:33:10.815 controller-0 kubelet[133031]: info E0303 15:33:10.815867 133031 kubelet_node_status.go:92] Unable to register node "controller-0" with API server: Node "controller-0" is invalid: [status.capacity.hugepages-2Mi: Invalid value: resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.intel.com/pci_sriov_net_datanetbh1: Invalid value: resource.Quantity{i:resource.int64Amount{value:8, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"8", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.intel.com/pci_sriov_net_datanetmh1: Invalid value: resource.Quantity{i:resource.int64Amount{value:8, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"8", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.memory: Invalid value: resource.Quantity{i:resource.int64Amount{value:99775209472, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.hugepages-2Mi: Invalid value: resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.intel.com/pci_sriov_net_datanetbh1: Invalid value: resource.Quantity{i:resource.int64Amount{value:0, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"0", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes]
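
The "may not have pre-allocated hugepages for multiple page sizes" error is, as far as I know, the API server rejecting a node that reports non-zero capacity for more than one hugepage size (older Kubernetes releases only support a single pre-allocated page size per node unless the HugePageStorageMediumSize feature gate is enabled). A quick way to see what the kernel actually has pre-allocated:

  # per-size pre-allocated hugepage counts
  cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  # summary view
  grep -i huge /proc/meminfo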

From the sysinv database, it looks like you have both 2M and 1G huge pages configured, which is likely the cause of the issue above. I was under the impression that we did not allow this via a semantic check: https://opendev.org/starlingx/config/commit/b180df6a986a6a58a298953b89e7d5d2979adcbf .
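
To confirm from the sysinv side, something like the following should show the configured 2M and 1G page counts per NUMA node (again assuming controller-0):

  source /etc/platform/openrc
  system host-memory-list controller-0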

Are other resources besides the SR-IOV ones reported as allocatable?
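
For example, something along these lines would show what the node is (or is not) reporting, if it managed to register at all:

  # full Capacity/Allocatable sections for the node
  kubectl describe node controller-0 | grep -A 15 -i allocatable
  # or just the allocatable map
  kubectl get node controller-0 -o jsonpath='{.status.allocatable}'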