Worker fails reboot recovery due to SRIOV timeout
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Douglas Henrique Koerich |
Bug Description
Brief Description
-----------------
When testing an AIO-SX configuration with modified CPU allocation, SRIOV enabled, and a large number of pods running, it was observed that after unlocking the host the system went into a reboot loop due to a timeout failure while applying the worker manifest.
Severity
--------
Major.
Steps to Reproduce
------------------
- Lock the host;
- Configure at least 16 CPUs for the Platform function;
- Enable and configure an SRIOV interface;
- With an increased pod limit, start 400 pods;
- Unlock the host (illustrative CLI commands are sketched after this list).
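For reference, a rough sketch of the reproduction steps using the StarlingX and kubectl CLIs is shown below. The interface name, VF count, data network, deployment name and exact flag spellings are assumptions for illustration (they vary by lab and release) and are not taken from this report:

  # Illustrative only; names and flag values below are assumed, not from this report.
  system host-lock controller-0
  # Assign at least 16 CPUs to the platform function (processor 0 shown).
  system host-cpu-modify -f platform -p0 16 controller-0
  # Configure an interface for SR-IOV (example interface/VF/datanetwork values).
  system host-if-modify controller-0 enp24s0f0 -c pci-sriov -N 16
  system interface-datanetwork-assign controller-0 enp24s0f0 datanet0
  # Start a large pod load (a "pod-load" deployment is assumed to exist).
  kubectl scale deployment/pod-load --replicas=400
  system host-unlock controller-0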
Expected Behavior
------------------
Reboot after unlock should complete successfully and all pods should be running.
Actual Behavior
----------------
The system went into a reboot loop (two or more reboots).
Reproducibility
---------------
Reproducible.
System Configuration
-------
One-node system (AIO-SX).
Branch/Pull Time/Commit
-------
###
### StarlingX
### Release 20.12
###
### Wind River Systems, Inc.
###
SW_VERSION="20.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID=
SRC_BUILD_ID="883"
JOB="StarlingX_
BUILD_BY="jenkins"
BUILD_NUMBER="883"
BUILD_HOST=
BUILD_DATE=
Last Pass
---------
N/A.
Timestamp/Logs
--------------
From worker's puppet log:
(Puppet log entries from 2021-02 are truncated in the original report.)
Test Activity
-------------
Developer testing.
Workaround
----------
Increase the timeout value for SRIOV device plugin deletion (introduced in bug 1900736) at /usr/share/
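For context only, the deletion wait being tuned is conceptually similar to the timed wait sketched below; the namespace, label selector and timeout value are assumptions for illustration, not the actual manifest code (the exact file path is truncated above):

  # Illustration of a timed wait for SRIOV device plugin pod deletion;
  # namespace, label and timeout are assumed values, not the StarlingX code.
  kubectl -n kube-system wait --for=delete pod -l app=sriovdp --timeout=360s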
Changed in starlingx:
assignee: nobody → Douglas Henrique Koerich (dkoerich-wr)
status: New → In Progress
tags: added: stx.5.0 stx.networking
Changed in starlingx:
importance: Undecided → Critical
importance: Critical → High
importance: High → Medium
Changed in starlingx:
status: In Progress → Fix Released
I recalled past issues that relate to this problem, and I am listing them below for background reference:
Bug 1850438;
Bug 1885229;
Bug 1896631;
(One relevant comment in the last one above is: "There is a race between the kubernetes processes coming up after the controller manifest is applied and the application of the worker manifest. (...) The fix for this would be quite extensive, requiring the creation of a new AIO, or separate kubernetes manifest to coordinate the bring-up of k8s services and the worker configuration.")
Bug 1900736.
While the final solution of avoiding the race condition is not ready yet, the timeout value will be increased to account for the different load imposed by pods. To better size that value, some measurements will be taken considering:
- Different numbers and types of pods;
- Different timeout values.
A rough measurement sketch is given after this list.
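A minimal sketch of how such a measurement could be scripted is shown below. It assumes the pod load runs as a "pod-load" deployment and that the device plugin pod carries an app=sriovdp label and is recreated by its DaemonSet after deletion; these names are assumptions, not part of this report:

  # For each pod count, scale the load, then time how long the SRIOV device
  # plugin pod takes to be recreated and become Ready after deletion; the
  # slowest observed time bounds a safe timeout value.
  for replicas in 100 200 400; do
      kubectl scale deployment/pod-load --replicas=$replicas
      kubectl rollout status deployment/pod-load --timeout=30m
      start=$(date +%s)
      kubectl -n kube-system delete pod -l app=sriovdp
      kubectl -n kube-system wait --for=condition=Ready pod -l app=sriovdp --timeout=30m
      echo "replicas=$replicas sriovdp_restart_seconds=$(( $(date +%s) - start ))"
  done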