In terms of SR-IOV I agree that things look OK (apart from the resources not being seen by the kubelet). It's a bit strange to have sriovfh1 on two data networks with the same driver, but there is nothing stopping this...
And we can see the sriov device plugin started after the final bind, which is a good thing:
./var/extra/containerization.info:2021-03-03T15:33:03Z kube-sriov-device-plugin-amd64-888tg Pod Stopping container kube-sriovdp Killing Normal
./var/extra/containerization.info:%!s(<nil>) kube-sriov-device-plugin-amd64-cnqqq Pod Successfully assigned kube-system/kube-sriov-device-plugin-amd64-cnqqq to controller-0 Scheduled Normal
./var/extra/containerization.info:2021-03-03T15:33:09Z kube-sriov-device-plugin-amd64 DaemonSet Created pod: kube-sriov-device-plugin-amd64-cnqqq SuccessfulCreate Normal
./var/extra/containerization.info:2021-03-03T15:33:10Z kube-sriov-device-plugin-amd64-cnqqq Pod Started container kube-sriovdp Started Normal
./var/extra/containerization.info:2021-03-03T15:33:10Z kube-sriov-device-plugin-amd64-cnqqq Pod Container image "registry.local:9001/docker.io/starlingx/k8s-plugins-sriov-network-device:stx.4.0-v3.2-16-g4e0302ae" already present on machine Pulled Normal
Examining daemon.log for the kubelet logs:
2021-03-03T15:33:10.815 controller-0 kubelet[133031]: info E0303 15:33:10.815867 133031 kubelet_node_status.go:92] Unable to register node "controller-0" with API server: Node "controller-0" is invalid: [status.capacity.hugepages-2Mi: Invalid value: resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.intel.com/pci_sriov_net_datanetbh1: Invalid value: resource.Quantity{i:resource.int64Amount{value:8, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"8", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.intel.com/pci_sriov_net_datanetmh1: Invalid value: resource.Quantity{i:resource.int64Amount{value:8, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"8", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes, status.capacity.memory: Invalid value: resource.Quantity{i:resource.int64Amount{value:99775209472, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.hugepages-2Mi: Invalid value: resource.Quantity{i:resource.int64Amount{value:536870912, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.intel.com/pci_sriov_net_datanetbh1: Invalid value: resource.Quantity{i:resource.int64Amount{value:0, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"0", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes]
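The rejection above comes from the API server's node status validation, which in this Kubernetes version refuses to register a node reporting pre-allocated hugepages for more than one page size. A rough illustration of that check (this is my own sketch, not the Kubernetes source; note how every capacity entry gets flagged once the node is deemed invalid, which is why unrelated resources such as memory and the SR-IOV pools carry the same error text in the log):

```python
# Hypothetical re-implementation of the check behind
# "may not have pre-allocated hugepages for multiple page sizes".
def validate_hugepage_capacity(capacity: dict) -> list:
    """Return validation errors if capacity reports more than one hugepage size."""
    hugepage_sizes = [
        name for name, qty in capacity.items()
        if name.startswith("hugepages-") and int(qty) > 0
    ]
    if len(hugepage_sizes) > 1:
        # Once invalid, every capacity entry is reported with the same error,
        # matching the kubelet log above.
        return [
            f"status.capacity.{name}: may not have pre-allocated hugepages "
            f"for multiple page sizes"
            for name in capacity
        ]
    return []

# Values mimic the failing node: both 2Mi and 1Gi pages are non-zero.
errors = validate_hugepage_capacity({
    "hugepages-2Mi": 536870912,
    "hugepages-1Gi": 10737418240,
    "memory": 99775209472,
    "intel.com/pci_sriov_net_datanetbh1": 8,
})
print(len(errors))  # → 4 (one per capacity entry)
```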
In any case, the final SR-IOV binds are in the puppet worker run:
2021-03-03T15:32:58.739 Debug: 2021-03-03 15:32:58 +0000 Exec[sriov-bind-device: 0000:1d:01.0](provider=posix): Executing '/usr/share/starlingx/scripts/dpdk-devbind.py --bind=igb_uio 0000:1d:01.0
2021-03-03T15:32:58.744 Debug: 2021-03-03 15:32:58 +0000 Executing: '/usr/share/starlingx/scripts/dpdk-devbind.py --bind=igb_uio 0000:1d:01.0
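To double-check that a bind like this actually took effect, the bound driver can be read back from the device's sysfs `driver` symlink. A small helper sketch (my own, not part of the StarlingX scripts):

```python
import os

def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the kernel driver a PCI device is bound to, or None if unbound."""
    link = os.path.join(sysfs_root, pci_addr, "driver")
    try:
        # The "driver" entry is a symlink to the bound driver's sysfs node.
        return os.path.basename(os.readlink(link))
    except OSError:
        return None  # device absent, or not bound to any driver

# After the puppet run above, 0000:1d:01.0 should report "igb_uio".
print(bound_driver("0000:1d:01.0"))
```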
It looks from the sysinv database that you have both 2M and 1G pages configured, which is likely causing the issue above. I was under the impression that we did not allow this via a semantic check: https://opendev.org/starlingx/config/commit/b180df6a986a6a58a298953b89e7d5d2979adcbf
Are other resources besides the SR-IOV ones reported as allocatable?
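One quick way to answer that is to pull the node status (e.g. via `kubectl get node controller-0 -o json`) and split the allocatable map into hugepage entries versus everything else. The resource names follow the node status schema; the helper and the sample values are my own illustration:

```python
def split_resources(allocatable):
    """Separate hugepage entries from the other allocatable resources."""
    hugepages = {k: v for k, v in allocatable.items()
                 if k.startswith("hugepages-")}
    others = {k: v for k, v in allocatable.items() if k not in hugepages}
    return hugepages, others

# Sample values mimicking the failing node's status from the kubelet log above.
hugepages, others = split_resources({
    "hugepages-2Mi": "512Mi",
    "hugepages-1Gi": "10Gi",
    "intel.com/pci_sriov_net_datanetbh1": "0",
    "cpu": "36",
    "memory": "97436728Ki",
})
print(sorted(hugepages))  # → ['hugepages-1Gi', 'hugepages-2Mi']
```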