Worker fails reboot recovery due to SRIOV timeout
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Douglas Henrique Koerich |
Bug Description
Brief Description
-----------------
When testing an AIO-SX configuration with modified CPU allocation, SRIOV enabled, and a large number of pods running, it was observed that after unlocking the host the system went into a reboot loop due to a timeout failure while applying the worker manifest.
Severity
--------
Major.
Steps to Reproduce
------------------
- Lock the host;
- Configure at least 16 CPUs for the Platform function;
- Enable and configure an SRIOV interface;
- With an increased pod limit, start 400 pods;
- Unlock the host (illustrative CLI commands are sketched after this list).
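For reference, a rough sketch of the reproduction steps using the StarlingX and kubectl CLIs is shown below. The interface name, VF count, data network, deployment name and exact flag spellings are assumptions for illustration (they vary by lab and release) and are not taken from this report:

  # Illustrative only; names and flag values below are assumed, not from this report.
  system host-lock controller-0
  # Assign at least 16 CPUs to the platform function (processor 0 shown).
  system host-cpu-modify -f platform -p0 16 controller-0
  # Configure an interface for SR-IOV (example interface/VF/datanetwork values).
  system host-if-modify controller-0 enp24s0f0 -c pci-sriov -N 16
  system interface-datanetwork-assign controller-0 enp24s0f0 datanet0
  # Start a large pod load (a "pod-load" deployment is assumed to exist).
  kubectl scale deployment/pod-load --replicas=400
  system host-unlock controller-0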
Expected Behavior
------------------
Reboot after unlock should complete successfully and all pods should be running.
Actual Behavior
----------------
The system went into a reboot loop (two or more reboots).
Reproducibility
---------------
Reproducible.
System Configuration
-------
One-node system (AIO-SX).
Branch/Pull Time/Commit
-------
###
### StarlingX
### Release 20.12
###
### Wind River Systems, Inc.
###
SW_VERSION="20.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID=
SRC_BUILD_ID="883"
JOB="StarlingX_
BUILD_BY="jenkins"
BUILD_NUMBER="883"
BUILD_HOST=
BUILD_DATE=
Last Pass
---------
N/A.
Timestamp/Logs
--------------
From worker's puppet log:
(Puppet log entries from 2021-02 are truncated in the original report.)
Test Activity
-------------
Developer testing.
Workaround
----------
Increase the timeout value for SRIOV device plugin deletion (introduced in bug 1900736) at /usr/share/
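For context only, the deletion wait being tuned is conceptually similar to the timed wait sketched below; the namespace, label selector and timeout value are assumptions for illustration, not the actual manifest code (the exact file path is truncated above):

  # Illustration of a timed wait for SRIOV device plugin pod deletion;
  # namespace, label and timeout are assumed values, not the StarlingX code.
  kubectl -n kube-system wait --for=delete pod -l app=sriovdp --timeout=360s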
Changed in starlingx:
assignee: nobody → Douglas Henrique Koerich (dkoerich-wr)
status: New → In Progress
tags: added: stx.5.0 stx.networking
Changed in starlingx:
importance: Undecided → Critical
importance: Critical → High
importance: High → Medium
Changed in starlingx:
status: In Progress → Fix Released
I recalled past issues that relate to this problem, and I am listing them below for background reference:
Bug 1850438;
Bug 1885229;
Bug 1896631;
(One relevant comment in the last one above is: "There is a race between the kubernetes processes coming up after the controller manifest is applied and the application of the worker manifest. (...) The fix for this would be quite extensive, requiring the creation of a new AIO, or separate kubernetes manifest to coordinate the bring-up of k8s services and the worker configuration.")
Bug 1900736.
While the final solution of avoiding the race condition is not ready yet, the timeout value will be increased to account for the different load imposed by pods. To better size that value, some measurements will be taken considering:
- Different numbers and types of pods;
- Different timeout values.
A rough measurement sketch is given after this list.
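A minimal sketch of how such a measurement could be scripted is shown below. It assumes the pod load runs as a "pod-load" deployment and that the device plugin pod carries an app=sriovdp label and is recreated by its DaemonSet after deletion; these names are assumptions, not part of this report:

  # For each pod count, scale the load, then time how long the SRIOV device
  # plugin pod takes to be recreated and become Ready after deletion; the
  # slowest observed time bounds a safe timeout value.
  for replicas in 100 200 400; do
      kubectl scale deployment/pod-load --replicas=$replicas
      kubectl rollout status deployment/pod-load --timeout=30m
      start=$(date +%s)
      kubectl -n kube-system delete pod -l app=sriovdp
      kubectl -n kube-system wait --for=condition=Ready pod -l app=sriovdp --timeout=30m
      echo "replicas=$replicas sriovdp_restart_seconds=$(( $(date +%s) - start ))"
  done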