Bug #1923607 “undercloud (and overcloud nodes) in master became ...” : Bugs : tripleo

Revision history for this message

Michele Baldessari (michele) wrote on 2021-04-13:

#1

Interestingly if we inspect a normal "podman" process we see:
PID: 846098 TASK: ffff9c20fd8ddc40 CPU: 1 COMMAND: "podman"
ARG: /usr/bin/podman --root /var/lib/containers/storage --runroot /var/run/containers/storage --log-level error --cgroup-manager systemd --tmpdir /var/run/libpod --runtime runc --storage-driver overlay --storage-opt overlay.mountopt=nodev,metacopy=on --events-backend file container cleanup 881e8ef19bb5e57de90c1fb2a784f821934707b1075c44452e2e355b9df3aba7
ENV: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
     _OCI_SYNCPIPE=3
     _OCI_STARTPIPE=4
     XDG_RUNTIME_DIR=
     _CONTAINERS_USERNS_CONFIGURED=
     _CONTAINERS_ROOTLESS_UID=

But those '(podman)' processes do not show any arguments nor env variables:
PID: 846103 TASK: ffff9c20f8aa1ec0 CPU: 1 COMMAND: "(podman)"
ARG: (podman)
ENV: HOME=/
     TERM=vt220
     BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-240.10.1.el8_3.x86_64
     crashkernel=auto

Michele Baldessari (michele) on 2021-04-13

description:	updated

Michele Baldessari (michele) wrote on 2021-04-13:

#2

crash> bt 846786
PID: 846786 TASK: ffff9c2030da8000 CPU: 1 COMMAND: "(podman)"
#0 [ffffc0998fe33c60] __schedule at ffffffffbd0d33f6
#1 [ffffc0998fe33cf8] schedule at ffffffffbd0d3888
#2 [ffffc0998fe33d08] schedule_timeout at ffffffffbd0d74a6
#3 [ffffc0998fe33da0] unix_wait_for_peer at ffffffffbd03d44f
#4 [ffffc0998fe33df0] unix_stream_connect at ffffffffbd0404ac
#5 [ffffc0998fe33e70] __sys_connect at ffffffffbcf19c5a
#6 [ffffc0998fe33f30] __x64_sys_connect at ffffffffbcf19ca6
#7 [ffffc0998fe33f38] do_syscall_64 at ffffffffbc80419b
#8 [ffffc0998fe33f50] entry_SYSCALL_64_after_hwframe at ffffffffbd2000ad
    RIP: 00007f8fdc558aa7 RSP: 00007ffdd43c8940 RFLAGS: 00000293
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8fdc558aa7
    RDX: 000000000000001d RSI: 000055d5fc3f1c60 RDI: 0000000000000003
    RBP: 000055d5fc3f1c60 R8: 0000000000000000 R9: 00007ffdd43c8f14
    R10: 00007ffdd43c8bb8 R11: 0000000000000293 R12: 000000000000001d
    R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000003
    ORIG_RAX: 000000000000002a CS: 0033 SS: 002b

All '(podman)' processes seem to be stuck waiting on a unix socket for their peer.
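The saved user-space registers in the backtrace above point the same way. As a rough sketch (the small syscall table and the helper name `describe_syscall` are only illustrative), the x86_64 syscall ABI puts the syscall number in ORIG_RAX and the first three arguments in RDI, RSI and RDX, so the values can be read as a `connect(fd, addr, addrlen)` call:

```python
# Small excerpt of the x86_64 syscall numbering; only the entries needed here.
X86_64_SYSCALLS = {41: "socket", 42: "connect", 43: "accept"}

# Register values copied from the `bt 846786` output above.
REGS = {
    "ORIG_RAX": 0x2A,
    "RDI": 0x3,
    "RSI": 0x000055D5FC3F1C60,
    "RDX": 0x1D,
}


def describe_syscall(regs):
    """Map saved registers onto syscall(arg0, arg1, arg2) per the x86_64 ABI."""
    name = X86_64_SYSCALLS.get(regs["ORIG_RAX"], "unknown")
    return {
        "syscall": name,
        "fd": regs["RDI"],          # first argument: socket file descriptor
        "addr": hex(regs["RSI"]),   # second argument: user-space sockaddr pointer
        "addrlen": regs["RDX"],     # third argument: length of the sockaddr
    }


print(describe_syscall(REGS))
```

Read this way, the process is inside connect() on file descriptor 3 with a 29-byte socket address, which matches the unix_stream_connect frame in the kernel stack.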

Michele Baldessari (michele) wrote on 2021-04-13:

#3

(Thanks to eck who team-tagged some bits). So we know that
1) We called unix_stream_connect()
2) We ended up in unix_wait_for_peer()

So if we look inside unix_stream_connect() we call unix_wait_for_peer() here https://github.com/torvalds/linux/blob/v4.18/net/unix/af_unix.c#L1261-L1273 :

if (unix_recvq_full(other)) {
    err = -EAGAIN;
    if (!timeo)
        goto out_unlock;

    timeo = unix_wait_for_peer(other, timeo);

    err = sock_intr_errno(timeo);
    if (signal_pending(current))
        goto out;
    sock_put(other);
    goto restart;
}

So this, at the very least, probably means that all those '(podman)' processes are stuck because there is no connect backlog space left in the socket:
static inline int unix_recvq_full(struct sock const *sk)
{
    return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
}

In any case, it would seem that the other side of the unix socket is either stuck or not calling accept() quickly enough.
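The backlog behaviour can be reproduced in user space. The following is a minimal sketch, assuming a Linux host; the function name `connects_before_full` and the socket file name are made up for the illustration. A listener binds an AF_UNIX stream socket and never calls accept(), while clients connect in non-blocking mode. Once the receive queue is full, a non-blocking connect() fails with EAGAIN, which is the point where a blocking caller would instead sleep in unix_wait_for_peer():

```python
import os
import socket
import tempfile


def connects_before_full(backlog, limit=64):
    """Count non-blocking connect() calls that succeed while nobody accept()s."""
    succeeded = 0
    clients = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            server.bind(path)
            server.listen(backlog)  # the server never calls accept()
            for _ in range(limit):
                client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
                clients.append(client)
                client.setblocking(False)
                try:
                    client.connect(path)
                except BlockingIOError:
                    # EAGAIN: the listener's queue is full.
                    break
                succeeded += 1
        finally:
            for client in clients:
                client.close()
            server.close()
    return succeeded


print(connects_before_full(backlog=1))
```

Because unix_recvq_full() compares with `>`, a small backlog should let a couple of connections through before the next one is refused; a blocking client with no send timeout would simply hang at that point, as the '(podman)' processes do.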

Michele Baldessari (michele) wrote on 2021-04-13:

#4

So let's take one of these '(podman)' processes:

crash> bt -f 869520
PID: 869520  TASK: ffff9c1e89065c40  CPU: 1   COMMAND: "(podman)"
 #0 [ffffc099a8e97c60] __schedule at ffffffffbd0d33f6
    ffffc099a8e97c68: ffff9c1e89066690 ffff9c22afaa9dc0
    ffffc099a8e97c78: ffff9c1e89061ec0 ffff9c1e89065c40
    ffffc099a8e97c88: ffff9c22afaa9dc0 ffffc099a8e97cf0
    ffffc099a8e97c98: ffffffffbd0d33f6 0000000688340c02
    ffffc099a8e97ca8: 0000000092102035 0000000000000000
    ffffc099a8e97cb8: ffff9c2200000004 0e666d1c6cf1cc00
    ffffc099a8e97cc8: ffff9c1e89065c40 7fffffffffffffff
    ffffc099a8e97cd8: 000000000000001e ffff9c1e43336c00
    ffffc099a8e97ce8: ffff9c1f8dec30f0 ffff9c1f8dec3100
    ffffc099a8e97cf8: ffffffffbd0d3888
 #1 [ffffc099a8e97cf8] schedule at ffffffffbd0d3888
    ffffc099a8e97d00: ffff9c1f8dec2d00 ffffffffbd0d74a6
 #2 [ffffc099a8e97d08] schedule_timeout at ffffffffbd0d74a6
    ffffc099a8e97d10: 7fffffffffffffff ffff9c222e90e110
    ffffc099a8e97d20: ffffc099a8e97d60 ffff9c222e90e110
    ffffc099a8e97d30: ffffc099a8e97d60 ffff9c22afa36920
    ffffc099a8e97d40: ffff9c222e90e110 0000000000000202
    ffffc099a8e97d50: ffffc099a8e97da8 ffffffffbc90255a
    ffffc099a8e97d60: ffff9c1f8dec3108 0000000000000000
    ffffc099a8e97d70: 0e666d1c6cf1cc00 ffff9c1f8dec2d00
    ffffc099a8e97d80: ffff9c1f8dec3100 7fffffffffffffff
    ffffc099a8e97d90: 000000000000001e ffff9c1e43336c00
    ffffc099a8e97da0: ffffffffbd03d44f
 #3 [ffffc099a8e97da0] unix_wait_for_peer at ffffffffbd03d44f
    ffffc099a8e97da8: 0000000000000001 ffff9c1e89065c40
    ffffc099a8e97db8: ffffffffbc902b80 ffffc099a8e9fdc0
    ffffc099a8e97dc8: ffffc099a8ea7dc0 0e666d1c6cf1cc00
    ffffc099a8e97dd8: ffff9c1e761c4900 ffffc099a8e97e68
    ffffc099a8e97de8: ffff9c1f8dec2d00 ffffffffbd0404ac
 #4 [ffffc099a8e97df0] unix_stream_connect at ffffffffbd0404ac
    ffffc099a8e97df8: ffff9c1e761c4cf0 ffff9c20bed20d00
    ffffc099a8e97e08: ffff9c1e46dfa400 ffff9c1f8dec2d80
    ffffc099a8e97e18: 7fffffffffffffff ffffc099a8e97e80
    ffffc099a8e97e28: ffffffffbdb9d1c0 fffffff5a8e97e80
    ffffc099a8e97e38: 0e666d1c6cf1cc00 ffff9c1e9b2e2d00
    ffffc099a8e97e48: ffffc099a8e97e80 0000000000000000
    ffffc099a8e97e58: 000055d5fc3f1c60 0000000000000000
    ffffc099a8e97e68: 000000000000001d ffffffffbcf19c5a
 #5 [ffffc099a8e97e70] __sys_connect at ffffffffbcf19c5a
    ffffc099a8e97e78: 0000000000000002 732f6e75722f0001
    ffffc099a8e97e88: 6a2f646d65747379 732f6c616e72756f
    ffffc099a8e97e98: ffff0074756f6474 ffff9c1f461ed6a0
    ffffc099a8e97ea8: ffffc099a8e97f58 00000000c000003e
    ffffc099a8e97eb8: 0000000000000000 ffffffffbc803d23
    ffffc099a8e97ec8: ffff9c20c5a47698 ffff9c20c5a47698
    ffffc099a8e97ed8: ffffffffbc983b99 0000000000000080
    ffffc099a8e97ee8: ffffc099a8e97f58 ffffc099a8e97f58
    ffffc099a8e97ef8: 0000000000000000 0e666d1c6cf1cc00
    ffffc099a8e97f08: 000000000000002a ffffc099a8e97f58
    ffffc099a8e97f18: 0000000000000000 0000000000000000
    ffffc099a8e97f28: 0000000000000000 ffffffffbcf19ca6
 #6 [ffffc099a8e97f30] __x64_sys_connect at ffffffffbcf19ca6
    ffffc099a8e97f38: ffffffffbc80419b
 #7 [ffffc099a8e97f38] do_syscall_64 at ffffffffbc80419b
    ffffc099a8e97f40: 0000000000000000 0000000000000000
    ffffc099a8e97f50: ffffffffbd2000ad
 #8 [ffffc099a8e97f50] entry_SYSCALL_64_after_hwframe at ffffffffbd2000ad
    RIP: 00007f8fdc558aa7  RSP: 00007ffdd43c8a20  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 0000000000000003  RCX: 00007f8fdc558aa7
    RDX: 000000000000001d  RSI: 000055d5fc3f1c60  RDI: 0000000000000003
    RBP: 000055d5fc3f1c60   R8: 0000000000000000   R9: 00007ffdd43c8ff4
    R10: 00007ffdd43c8c98  R11: 0000000000000293  R12: 000000000000001d
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: 000000000000002a  CS: 0033  SS: 002b

At frame4 (unix_stream_connect) we poked at some random 0xffff addresses until we found the right one:
crash> ptype struct sockaddr_un
type = struct sockaddr_un {
    __kernel_sa_family_t sun_family;
    char sun_path[108];
}

crash> p *(struct sockaddr_un *) 0xffffc099a8e97e80
$8 = {
  sun_family = 1,
  sun_path = "/run/systemd/journal/stdout\000\377\377\240\326\036F\037\234\377\377X\177\351\250\231\300\377\377>\000\000\300\000\000\000\000\000\000\000\000\000\000\000\000#=\200\274\377\377\377\377\230v\244\305 \234\377\377\230v\244\305 \234\377\377\231;\230\274\377\377\377\377\200\000\000\000\000\000\000\000X\177\351\250\231\300"
}
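The same path can be cross-checked without crash's struct printing. As a small sketch (the helper name `decode_sockaddr_un` is only illustrative), the four 64-bit words that `bt -f` shows starting at ffffc099a8e97e80 in frame #5 are packed back into little-endian bytes and read as a sockaddr_un: a 16-bit family followed by a NUL-terminated path.

```python
import struct

# 64-bit words from frame #5 (__sys_connect), starting at ffffc099a8e97e80.
STACK_WORDS = [
    0x732F6E75722F0001,
    0x6A2F646D65747379,
    0x732F6C616E72756F,
    0xFFFF0074756F6474,
]


def decode_sockaddr_un(words):
    """Return (sun_family, sun_path) from little-endian 64-bit stack words."""
    raw = struct.pack("<%dQ" % len(words), *words)
    (family,) = struct.unpack_from("<H", raw, 0)
    # sun_path starts right after the 2-byte family and ends at the first NUL.
    path = raw[2:].split(b"\x00", 1)[0].decode("ascii")
    return family, path


family, path = decode_sockaddr_un(STACK_WORDS)
print(family, path, 2 + len(path))
```

The family decodes to 1 (AF_UNIX), and the two family bytes plus the length of the path add up to 29, the same value as RDX (0x1d) in the register dump, so the address at ffffc099a8e97e80 is consistent with the one passed to connect().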

So all these (podman) processes are trying to talk to /run/systemd/journal/stdout

Which, according to https://unix.stackexchange.com/questions/205883/understand-logging-in-linux/294206#294206 is:
"It listens on the AF_LOCAL stream socket at /run/systemd/journal/stdout for log data coming from systemd-managed services."

So we started looking at the filesystem
crash> mod -s xfs
     MODULE       NAME                     SIZE  OBJECT FILE
ffffffffc0652600  xfs                   1511424  /tmp/kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/kernel/fs/xfs/xfs.ko.debug

crash> mount
     MOUNT           SUPERBLK     TYPE   DEVNAME   DIRNAME
ffff9c1f461ed080 ffff9c1f47c12000 rootfs rootfs    /

crash> struct xfs_sb ffff9c222d5ca000
struct xfs_sb {
  sb_magicnum = 780673024,
  sb_blocksize = 4294941730,
  sb_dblocks = 18446634269336721408,
  sb_rblocks = 51803848705,
  sb_rextents = 4096,
  sb_uuid = {
    b = "\377\377\377\377\377\377\377\177\200'd\300\377\377\377\377"
  },
  sb_logstart = 18446744072641919744,
  sb_rootino = 0,
  sb_rbmino = 18446744072641923296,
  sb_rsumino = 18446744072641915712,
  sb_rextsize = 1879113728,
  sb_agblocks = 0,
  sb_agcount = 1,
  sb_rbmblocks = 0,
  sb_logblocks = 1481003842,
  sb_versionnum = 0,
  sb_sectsize = 0,
  sb_inodesize = 43264,
  sb_inopblock = 25688,
  sb_fname = "\"\234\377\377\000\000\000\000\000\000\000",
  sb_blocklog = 120 'x',
  sb_sectlog = 160 '\240',
  sb_inodelog = 92 '\\',
  sb_inopblog = 45 '-',
  sb_agblklog = 34 '"',
  sb_rextslog = 156 '\234',
  sb_inprogress = 255 '\377',
  sb_imax_pct = 255 '\377',
  sb_icount = 18446634269336707192,
  sb_ifree = 0,
  sb_fdblocks = 18446634270262362113,
  sb_frextents = 974957576193,
  sb_uquotino = 18446634270262931200,
  sb_gquotino = 18446744072642374496,
  sb_qflags = 0,
  sb_flags = 0 '\000',
  sb_shared_vn = 0 '\000',
  sb_inoalignmt = 0,
  sb_unit = 761026672,
  sb_width = 4294941730,
  sb_dirblklog = 112 'p',
  sb_logsectlog = 242 '\362',
  sb_logsectsize = 31460,
  sb_logsunit = 4294941728,
  sb_features2 = 1683461760,
  sb_bad_features2 = 4294941730,
  sb_features_compat = 1687773184,
  sb_features_ro_compat = 4294941730,
  sb_features_incompat = 0,
  sb_features_log_incompat = 0,
  sb_crc = 0,
  sb_spino_align = 0,
  sb_pquotino = 18446744072642373552,
  sb_lsn = 7,
  sb_meta_uuid = {
    b = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  }
}

The key here is likely:
  sb_icount = 18446634269336707192,
  sb_ifree = 0,

I.e. no inodes are free? The inode count seems quite frankly impossibly large, so maybe we need to understand things a bit better here.

https://docs.huihoo.com/doxygen/linux/kernel/3.7/xfs__sb_8h_source.html

/*
 * Superblock - in core version.  Must match the ondisk version below.
 * Must be padded to 64 bit alignment.
 */
typedef struct xfs_sb {
    __uint32_t  sb_magicnum;    /* magic number == XFS_SB_MAGIC */
    __uint32_t  sb_blocksize;   /* logical block size, bytes */
    xfs_drfsbno_t   sb_dblocks; /* number of data blocks */
    xfs_drfsbno_t   sb_rblocks; /* number of realtime blocks */
    xfs_drtbno_t    sb_rextents;    /* number of realtime extents */
    uuid_t      sb_uuid;    /* file system unique id */
    xfs_dfsbno_t    sb_logstart;    /* starting block of log if internal */
    xfs_ino_t   sb_rootino; /* root inode number */
    xfs_ino_t   sb_rbmino;  /* bitmap inode for realtime extents */
    xfs_ino_t   sb_rsumino; /* summary inode for rt bitmap */
    xfs_agblock_t   sb_rextsize;    /* realtime extent size, blocks */
    xfs_agblock_t   sb_agblocks;    /* size of an allocation group */
    xfs_agnumber_t  sb_agcount; /* number of allocation groups */
    xfs_extlen_t    sb_rbmblocks;   /* number of rt bitmap blocks */
    xfs_extlen_t    sb_logblocks;   /* number of log blocks */
    __uint16_t  sb_versionnum;  /* header version == XFS_SB_VERSION */
    __uint16_t  sb_sectsize;    /* volume sector size, bytes */
    __uint16_t  sb_inodesize;   /* inode size, bytes */
    __uint16_t  sb_inopblock;   /* inodes per block */
    char        sb_fname[12];   /* file system name */
    __uint8_t   sb_blocklog;    /* log2 of sb_blocksize */
    __uint8_t   sb_sectlog; /* log2 of sb_sectsize */
    __uint8_t   sb_inodelog;    /* log2 of sb_inodesize */
    __uint8_t   sb_inopblog;    /* log2 of sb_inopblock */
    __uint8_t   sb_agblklog;    /* log2 of sb_agblocks (rounded up) */
    __uint8_t   sb_rextslog;    /* log2 of sb_rextents */
    __uint8_t   sb_inprogress;  /* mkfs is in progress, don't mount */
    __uint8_t   sb_imax_pct;    /* max % of fs for inode space */
                    /* statistics */
    /*
     * These fields must remain contiguous.  If you really
     * want to change their layout, make sure you fix the
     * code in xfs_trans_apply_sb_deltas().
     */
    __uint64_t  sb_icount;  /* allocated inodes */
    __uint64_t  sb_ifree;   /* free inodes */
    __uint64_t  sb_fdblocks;    /* free data blocks */
    __uint64_t  sb_frextents;   /* free realtime extents */
    /*
     * End contiguous fields.
     */
    xfs_ino_t   sb_uquotino;    /* user quota inode */
    xfs_ino_t   sb_gquotino;    /* group quota inode */
    __uint16_t  sb_qflags;  /* quota flags */
    __uint8_t   sb_flags;   /* misc. flags */
    __uint8_t   sb_shared_vn;   /* shared version number */
    xfs_extlen_t    sb_inoalignmt;  /* inode chunk alignment, fsblocks */
    __uint32_t  sb_unit;    /* stripe or raid unit */
    __uint32_t  sb_width;   /* stripe or raid width */
    __uint8_t   sb_dirblklog;   /* log2 of dir block size (fsbs) */
    __uint8_t   sb_logsectlog;  /* log2 of the log sector size */
    __uint16_t  sb_logsectsize; /* sector size for the log, bytes */
    __uint32_t  sb_logsunit;    /* stripe unit size for the log */
    __uint32_t  sb_features2;   /* additional feature bits */

    /*
     * bad features2 field as a result of failing to pad the sb
     * structure to 64 bits. Some machines will be using this field
     * for features2 bits. Easiest just to mark it bad and not use
     * it for anything else.
     */
    __uint32_t  sb_bad_features2;

    /* must be padded to 64 bit alignment */
} xfs_sb_t;

Michele Baldessari (michele) wrote on 2021-04-14:

#5

Adding a redacted version of the previous comment:

#4 [ffffc099a8e97df0] unix_stream_connect at ffffffffbd0404ac
    ffffc099a8e97df8: ffff9c1e761c4cf0 ffff9c20bed20d00
    ffffc099a8e97e08: ffff9c1e46dfa400 ffff9c1f8dec2d80
    ffffc099a8e97e18: 7fffffffffffffff ffffc099a8e97e80
    ffffc099a8e97e28: ffffffffbdb9d1c0 fffffff5a8e97e80
    ffffc099a8e97e38: 0e666d1c6cf1cc00 ffff9c1e9b2e2d00
    ffffc099a8e97e48: ffffc099a8e97e80 0000000000000000
    ffffc099a8e97e58: 000055d5fc3f1c60 0000000000000000
    ffffc099a8e97e68: 000000000000001d ffffffffbcf19c5a
 #5 [ffffc099a8e97e70] __sys_connect at ffffffffbcf19c5a
    ffffc099a8e97e78: 0000000000000002 732f6e75722f0001
    ffffc099a8e97e88: 6a2f646d65747379 732f6c616e72756f
    ffffc099a8e97e98: ffff0074756f6474 ffff9c1f461ed6a0
    ffffc099a8e97ea8: ffffc099a8e97f58 00000000c000003e
    ffffc099a8e97eb8: 0000000000000000 ffffffffbc803d23
    ffffc099a8e97ec8: ffff9c20c5a47698 ffff9c20c5a47698
    ffffc099a8e97ed8: ffffffffbc983b99 0000000000000080
    ffffc099a8e97ee8: ffffc099a8e97f58 ffffc099a8e97f58
    ffffc099a8e97ef8: 0000000000000000 0e666d1c6cf1cc00
    ffffc099a8e97f08: 000000000000002a ffffc099a8e97f58
    ffffc099a8e97f18: 0000000000000000 0000000000000000
    ffffc099a8e97f28: 0000000000000000 ffffffffbcf19ca6
 #6 [ffffc099a8e97f30] __x64_sys_connect at ffffffffbcf19ca6
    ffffc099a8e97f38: ffffffffbc80419b
 #7 [ffffc099a8e97f38] do_syscall_64 at ffffffffbc80419b
    ffffc099a8e97f40: 0000000000000000 0000000000000000
    ffffc099a8e97f50: ffffffffbd2000ad
 #8 [ffffc099a8e97f50] entry_SYSCALL_64_after_hwframe at ffffffffbd2000ad
    RIP: 00007f8fdc558aa7  RSP: 00007ffdd43c8a20  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 0000000000000003  RCX: 00007f8fdc558aa7
    RDX: 000000000000001d  RSI: 000055d5fc3f1c60  RDI: 0000000000000003
    RBP: 000055d5fc3f1c60   R8: 0000000000000000   R9: 00007ffdd43c8ff4
    R10: 00007ffdd43c8c98  R11: 0000000000000293  R12: 000000000000001d
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: 000000000000002a  CS: 0033  SS: 002b

At frame4 (unix_stream_connect) we poked at some random 0xffff addresses until we found the right one:
crash> ptype struct sockaddr_un
type = struct sockaddr_un {
    __kernel_sa_family_t sun_family;
    char sun_path[108];
}

crash> p *(struct sockaddr_un *) 0xffffc099a8e97e80
$8 = {
  sun_family = 1,
  sun_path = "/run/systemd/journal/stdout\000\377\377\240\326\036F\037\234\377\377X\177\351\250\231\300\377\377>\000\000\300\000\000\000\000\000\000\000\000\000\000\000\000#=\200\274\377\377\377\377\230v\244\305 \234\377\377\230v\244\305 \234\377\377\231;\230\274\377\377\377\377\200\000\000\000\000\000\000\000X\177\351\250\231\300"
}

So all these (podman) processes are trying to talk to /run/systemd/journal/stdout

Which, according to https://unix.stackexchange.com/questions/205883/understand-logging-in-linux/294206#294206 is:
"It listens on the AF_LOCAL stream socket at /run/systemd/journal/stdout for log data coming from systemd-managed services."

So we started looking at the filesystem
crash> mod -s xfs
     MODULE       NAME                     SIZE  OBJECT FILE
ffffffffc0652600  xfs                   1511424  /tmp/kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/kernel/fs/xfs/xfs.ko.debug

crash> mount
     MOUNT           SUPERBLK     TYPE   DEVNAME   DIRNAME
ffff9c1f461ed080 ffff9c1f47c12000 rootfs rootfs    /

crash> struct xfs_sb ffff9c222d5ca000
struct xfs_sb {
  sb_magicnum = 780673024,
  sb_blocksize = 4294941730,
  sb_dblocks = 18446634269336721408,
  sb_rblocks = 51803848705,
  sb_rextents = 4096,
  sb_uuid = {
    b = "\377\377\377\377\377\377\377\177\200'd\300\377\377\377\377"
  },
  sb_logstart = 18446744072641919744,
  sb_rootino = 0,
  sb_rbmino = 18446744072641923296,
  sb_rsumino = 18446744072641915712,
  sb_rextsize = 1879113728,
  sb_agblocks = 0,
  sb_agcount = 1,
  sb_rbmblocks = 0,
  sb_logblocks = 1481003842,
  sb_versionnum = 0,
  sb_sectsize = 0,
  sb_inodesize = 43264,
  sb_inopblock = 25688,
  sb_fname = "\"\234\377\377\000\000\000\000\000\000\000",
  sb_blocklog = 120 'x',
  sb_sectlog = 160 '\240',
  sb_inodelog = 92 '\\',
  sb_inopblog = 45 '-',
  sb_agblklog = 34 '"',
  sb_rextslog = 156 '\234',
  sb_inprogress = 255 '\377',
  sb_imax_pct = 255 '\377',
  sb_icount = 18446634269336707192,
  sb_ifree = 0,
  sb_fdblocks = 18446634270262362113,
  sb_frextents = 974957576193,
  sb_uquotino = 18446634270262931200,
  sb_gquotino = 18446744072642374496,
  sb_qflags = 0,
  sb_flags = 0 '\000',
....

The key here is likely:
  sb_icount = 18446634269336707192,
  sb_ifree = 0,

I.e. no inodes are free? The inode count seems quite frankly impossibly large, so maybe we need to understand things a bit better here.

John Eckersberg (jeckersb) wrote on 2021-04-19:

#6

re: the questionable inode numbers above...

I did the same thing on a normal, functional undercloud. It also showed insane inode counts and no free inodes. However inspection of the running system showed that everything was fine.

So we've just done something wrong and made poor assumptions about how we're reading the data out of the xfs superblock struct.
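A quick way to see that the dump in comment #4 is not a real XFS superblock is to look at the numbers in hex. This is only a sketch over values copied from that dump; the helper name `looks_like_kernel_pointer` and the address threshold are illustrative assumptions for x86_64.

```python
# Values copied from the `struct xfs_sb ffff9c222d5ca000` output in comment #4.
SUSPECT = {
    "sb_dblocks": 18446634269336721408,
    "sb_logstart": 18446744072641919744,
    "sb_icount": 18446634269336707192,
    "sb_logblocks": 1481003842,
}


def looks_like_kernel_pointer(value):
    """True if a 64-bit value lies in the upper half of the x86_64 address space."""
    return value >= 0xFFFF800000000000


for name, value in SUSPECT.items():
    print(f"{name:14} {value:#018x} pointer-like={looks_like_kernel_pointer(value)}")

# The 32-bit value found in sb_logblocks spells the XFS superblock magic.
print((SUSPECT["sb_logblocks"]).to_bytes(4, "big"))
```

The "inode count" turns out to be 0xffff9c222d5ca078, i.e. an address 0x78 bytes past the one handed to `struct xfs_sb`, so it is a kernel pointer rather than a counter. The value 1481003842 is 0x58465342, ASCII "XFSB", which is the XFS superblock magic, yet it shows up in sb_logblocks instead of sb_magicnum. One plausible reading (an assumption, not something confirmed in this thread) is that the address came from the SUPERBLK column of `mount`, which refers to a generic VFS superblock rather than the XFS-specific structure, so the fields were simply overlaid on the wrong type.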

Paras Babbar (pbabbar) wrote on 2021-04-21:

#7

I have faced a similar issue in a tripleo deployment. I kept the environment for a few weeks, and eventually I couldn't even ssh to the nodes and the OC just became unresponsive.

Bogdan Dobrelya (bogdando) on 2021-04-28

Changed in tripleo:
importance:	High → Critical

Marios Andreou (marios-b) on 2021-05-06

Changed in tripleo:
milestone:	wallaby-rc1 → xena-1

OpenStack Infra (hudson-openstack) on 2021-06-03

Changed in tripleo:
status:	Triaged → In Progress

Michele Baldessari (michele) wrote on 2021-06-03:

#8

https://review.opendev.org/c/openstack/tripleo-ansible/+/794485

OpenStack Infra (hudson-openstack) wrote on 2021-06-03: Fix proposed to tripleo-ansible (stable/train)

#9

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/794633

wes hayutin (weshayutin) on 2021-06-03

tags:	added: alert

OpenStack Infra (hudson-openstack) wrote on 2021-06-05: Fix merged to tripleo-ansible (master)

#10

Reviewed:  https://review.opendev.org/c/openstack/tripleo-ansible/+/794485
Committed: https://opendev.org/openstack/tripleo-ansible/commit/f31bab878bfd3332c20a10bf9ca26d443028d214
Submitter: "Zuul (22348)"
Branch:    master

commit f31bab878bfd3332c20a10bf9ca26d443028d214
Author: Michele Baldessari <michele@acksyn.org>
Date:   Thu Jun 3 11:07:30 2021 +0200

Add podman's events_logger option by default set to journald
    
    By default podman 3.0.x sets the [engine]/events_logger to "file".
    This causes every exec in podman to create a line of text in
    /run/libpod/events/events.log like the following:
    
      {"ID":"412b6770c0b418e6d49a4801e71a198ddb81bbbefdaf1c9aad4d7948f77910ee","Image":"quay.io/centos/centos:latest","Name":"leak-test-7","Status":"exec","Time":"2021-06-03T08:36:05.237964012Z","Type":"container","Attributes":{"org.label-schema.build-date":"20201204","org.label-schema.license":"GPLv2","org.label-schema.name":"CentOS Base Image","org.label-schema.schema-version":"1.0","org.label-schema.vendor":"CentOS"}}
    
    Since by default /run is mounted on tmpfs, this has the side-effect of
    increasing kernel slab objects over time indefinitely eventually causing
    an OOM of the box.
    
    We initially wanted to switch to the 'none' backend, but the podman
    folks recommended using the journald backend because events logs are
    used by podman in case of a rare race when running "podman run --rm".
    Given that we call run with --rm from in a multithreaded fashion this
    seems to be the safest approach. The drawback of using journald is
    that events won't be logged for rootless containers unless the user
    is part of the 'wheel' group. We believe we're not using those
    containers in tripleo anyways, so this should be safe.
    
    Tested by applying a backport of this patch to Train + podman 3.0.x and
    got the following:
    [root@controller-0 containers]# ls -la /run/libpod/events/
    total 0
    drwx------. 2 root root  40 Jun  3 11:55 .
    drwxr-x--x. 5 root root 140 Jun  3 11:55 ..
    
    [root@controller-0 containers]# more /etc/containers/containers.conf
    [containers]
    pids_limit = 4096
    [engine]
    events_logger = "journald"
    
    Also tested the override via the corresponding THT change in
    Ieffe2852111c3ec8347343a042dd78bbf691d79a.
    
    Closes-Bug: #1923607
    
    Change-Id: I780103e17f1bb42a0546c30bd6c001c642ad88b3
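To check a node for the setting described in the commit message, something like the following could be used. It is a rough sketch only: the function name `events_logger` is made up, and `configparser` is used as a stand-in parser that copes with a flat file like the one shown above but is not a full TOML parser (on Python 3.11+ `tomllib` would be the more faithful choice).

```python
import configparser

# Contents of /etc/containers/containers.conf as shown in the commit message.
CONTAINERS_CONF = """\
[containers]
pids_limit = 4096
[engine]
events_logger = "journald"
"""


def events_logger(conf_text):
    """Return the [engine]/events_logger value, or "" if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # The value is a quoted TOML string, so strip the surrounding quotes.
    return parser.get("engine", "events_logger", fallback="").strip('"')


print(events_logger(CONTAINERS_CONF))
```

An empty result would mean the key is not set in this file, in which case podman's own default applies (reported as "file" for podman 3.0.x in the commit message).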

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2021-06-06: Fix proposed to tripleo-ansible (stable/wallaby)

#11

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/794948

OpenStack Infra (hudson-openstack) wrote on 2021-06-06: Related fix merged to tripleo-heat-templates (master)

#12

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/794592
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/4ea2a6eb7c84682d49b50f1e087c64b7dce13103
Submitter: "Zuul (22348)"
Branch: master

commit 4ea2a6eb7c84682d49b50f1e087c64b7dce13103
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 15:47:06 2021 +0200

Allow customizing podman's [engine]/events_logger

    In I780103e17f1bb42a0546c30bd6c001c642ad88b3 we introduced the
    journald default for the events_logger key. With this change we
    allow to change this new default, in case we do need to change it
    for some reason.

    Related-Bug: #1923607

    Depends-On: https://review.opendev.org/c/openstack/tripleo-ansible/+/794485
    Change-Id: Ieffe2852111c3ec8347343a042dd78bbf691d79a

OpenStack Infra (hudson-openstack) wrote on 2021-06-07: Related fix proposed to tripleo-heat-templates (stable/wallaby)

#13

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795034

OpenStack Infra (hudson-openstack) wrote on 2021-06-07: Fix merged to tripleo-ansible (stable/wallaby)

#14

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/794948
Committed: https://opendev.org/openstack/tripleo-ansible/commit/79be78bba35199c5b26632e51d8bda411a8239c5
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 79be78bba35199c5b26632e51d8bda411a8239c5
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 11:07:30 2021 +0200