[UBUNTU 20.04] Null Pointer issue in nfs code running Ubuntu on IBM Z
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu on IBM z Systems | Fix Released | High | Skipper Bug Screeners |
linux (Ubuntu) | Invalid | Undecided | Frank Heimes |
Focal | Fix Released | Medium | Canonical Kernel Team |
Impish | Fix Released | Medium | Canonical Kernel Team |
Jammy | Fix Released | Medium | Canonical Kernel Team |
Bug Description
SRU Justification:
==================
[Impact]
* The kernel crashed under load with a null pointer issue in nfs code:
[556585.270959] Krnl Code:#000000000
[556585.270967] Call Trace:
[556585.270982] ([<000003ff80d6
[556585.270993] [<000003ff80e11
[556585.271004] [<000003ff80e11
[556585.271014] [<000003ff80dfd
[556585.271016] [<0000002816594
[556585.271017] [<0000002816594
[556585.271019] [<0000002816596
[556585.271021] [<0000002816bb2
* Several dumps were generated and shared with Canonical.
* Analysis (done by kernel and SEG) points to refcount leaks
that are already fixed in the following commit/fix:
[Fix]
* ca05cbae2a0468e
[Test Case]
* There is unfortunately no reproducer or trigger available for this issue.
* It just happens now and then under higher load.
* Patched test kernels (focal 5.4 and bionic 5.4-hwe) were created and
ran for more than a week in a special staging environment (at IBM)
without further crashes.
* Hence the test and verification will be done by the IBM Z team.
[Where problems could occur]
* The inode handling can become broken, in case the changes
on the pointers are erroneous.
* Problems with the authentication and/or the credentials could occur
due to the modifications in put_rpccred, rpc_cred and rpc_auth.
* The expiration of the cached credentials could be harmed as well,
due to the changes in nfs_ctx_
* The different pointer arithmetic may cause further issues - wrong
or null pointer references.
* On the positive side, the original commit was brought upstream by NFS experts.
* A patched test kernel sustained day-long runs under load in a staging
and test environment.
* The author of the upstream commit/patch is well known in the NFS area.
[Other]
* The Salesforce Case Number 00334334 is associated with this bug.
* Commit ca05cbae2a04 was accepted upstream with 5.16-rc1.
* But commit ca05cbae2a04 was unfortunately not tagged as stable,
hence it was not picked automatically.
* Since kinetic's (22.10) target kernel is 5.18,
it will have the patch included,
hence no dedicated PATCH request for kinetic.
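
The reasoning in the last bullets is a plain version comparison: a fix that landed in 5.16-rc1 is contained in any kernel based on mainline 5.16 or later, while older series need the backport. A minimal Python sketch of that check follows. The helper names `parse_upstream` and `contains_fix` are hypothetical, the sketch assumes simple `major.minor` upstream version strings, and it deliberately ignores backports such as the SRU requested here.

```python
from typing import Tuple


def parse_upstream(version: str) -> Tuple[int, int]:
    """Turn '5.16-rc1' or '5.18' into (5, 16) or (5, 18), dropping any -rc suffix."""
    base = version.split("-", 1)[0]
    major, minor = base.split(".")[:2]
    return int(major), int(minor)


def contains_fix(kernel: str, fixed_in: str = "5.16-rc1") -> bool:
    """True if a mainline-based kernel is at or after the release carrying the fix."""
    return parse_upstream(kernel) >= parse_upstream(fixed_in)


print(contains_fix("5.18"))  # kinetic's target kernel
print(contains_fix("5.4"))   # focal's base kernel, relies on the backport
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions such as 5.4 and 5.16.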
__________
State the component where the Bug is occurring:
kernel
Indicate the nature of the problem by answering the below questions:
- Is this problem reproducible? No
No, steps unknown, but we have seen these before
- Is the system sitting at a debugger (kdb, or xmon)? No
- Is the system hung? No
No, dumped and rebooted
- Are there any custom patches installed? Yes
On base system level (CloudAppliance) we are still running with the zfpc_proc module loaded. There have been no recent changes in the module, and it is running absolutely stable in HA (same kernel and userspace, Ubuntu 20.04 LTS).
- Is there any special hardware that may be relevant to this problem? Yes
We are running with mlx (cloud network adapters) installed.
- Is access information for the machine the problem was found on available? Yes
- Is the bug occurring in a userspace application? No
- Was a stack trace produced? Yes
This is what is mentioned in the first comment by @Boris Barth
- Did the system produce an Oops message on the console? Yes
[556585.270902] illegal operation: 0001 ilc:1 [#10] SMP
[556585.270905] Modules linked in: vhost_net macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache veth xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle xt_mark sunrpc nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat cls_cgroup sch_htb act_gact sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy nf_tables ebtable_filter ebtables xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core dm_integrity async_xor async_tx dm_bufio bonding xt_MASQUERADE nf_conntrack_
[556585.270923] scsi_dh_emc s390_trng xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bpfilter sch_fq_codel zFPC_proc(OE) zFPC_diag(OE) vfio_ap vfio_mdev drm vfio_iommu_type1 drm_panel_
[556585.270945] CPU: 28 PID: 217741 Comm: worker Kdump: loaded Tainted: G D OE 5.4.0-90-generic #101-Ubuntu
[556585.270947] Hardware name: IBM 8562 GT2 A00 (LPAR)
[556585.270948] Krnl PSW : 0704d00180000000 0000000000000002 (0x2)
[556585.270951] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[556585.270953] Krnl GPRS: 0000000000000000 0000000000000000 000003e010ebbcf8 00000071c45e1ec0
[556585.270954] 0000000000000000 0000002816f7b18c 00000078dd36a4a0 000000713a62f718
[556585.270955] 0000000000000000 000003e010ebbcf8 0000000000000068 00000071c45e1ec0
[556585.270957] 0000006090a12200 0000000000000c40 000003ff80d6fb54 000003e010ebbbf0
[556585.270959] Krnl Code:#000000000
[556585.270967] Call Trace:
[556585.270982] ([<000003ff80d6
[556585.270993] [<000003ff80e11
[556585.271004] [<000003ff80e11
[556585.271014] [<000003ff80dfd
[556585.271016] [<0000002816594
[556585.271017] [<0000002816594
[556585.271019] [<0000002816596
[556585.271021] [<0000002816bb2
- Was a system dump produced ie kdump, netdump, or LKCD? Yes
That is the kdump the stack trace comes from.
Enter data below to accurately describe the problem:
- Problem description:
Null Pointer issue in nfs code running Ubuntu 18.04 with HWE kernel 5.4 on IBM Z
- Enter uname -a output:
@lon1-qz1-
Linux lon1-qz1-
- Enter failing machine type and model (ie p520 9111-520 lpar, x336 47U-8637):
Manufacturer: IBM
Type: 8562
Model: A00 GT2
Model Capacity: A00 00000000
Capacity Adj. Ind.: 100
LPAR CPUs Total: 16
LPAR CPUs Configured: 16
LPAR CPUs Standby: 0
LPAR CPUs Reserved: 0
LPAR CPUs Dedicated: 0
LPAR CPUs Shared: 16
LPAR CPUs G-MTID: 0
LPAR CPUs S-MTID: 1
LPAR CPUs PS-MTID: 1
- Enter primary and backup contact information (name/email):
Prabhat Ranjan
<email address hidden>
Christoph Schlameu?
<email address hidden>
- Detail the configuration of the additional hardware
- Enter common userspace tool name: N/A
- Enter name of userspace RPM: N/A
- If failing tool is obtained from project website vs RPM install, what is the version/
If from the project's CVS, what is the branch tag and date of checkout (put "na" if not applicable)?
N/A
- Is the failing userspace tool 32-bit, 64-bit, or both? N/A
- Describe how unresponsive the system is. What steps have you taken to reclaim the system:
kernel oops was detected and automatically dumped and restarted
- Is a debugger configured (xmon or kdb enabled)? No
- Enter Oops message from console:
See the Oops message quoted above.
- Detail the steps to reproduce this problem: unknown
- Was the system configured to capture a system dump? Yes
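
When triaging a report like this, the key facts in the Oops header can be pulled out mechanically. The following is a minimal Python sketch that parses two lines copied from the Oops message above. The function name `summarize_oops` and the regular expressions are illustrative assumptions, not part of any kernel or Ubuntu tooling. It treats the second PSW doubleword as the instruction address, which here is 0x2, consistent with a branch through a near-null pointer.

```python
import re

# Two lines copied verbatim from the Oops message in this report.
OOPS = """\
[556585.270945] CPU: 28 PID: 217741 Comm: worker Kdump: loaded Tainted: G D OE 5.4.0-90-generic #101-Ubuntu
[556585.270948] Krnl PSW : 0704d00180000000 0000000000000002 (0x2)
"""

# Hypothetical patterns tailored to the two lines above.
CPU_RE = re.compile(
    r"CPU: (\d+) PID: (\d+) Comm: (\S+).*?(\d+\.\d+\.\d+-\d+-\S+) (#\S+)"
)
PSW_RE = re.compile(r"Krnl PSW : ([0-9a-f]{16}) ([0-9a-f]{16})")


def summarize_oops(text: str) -> dict:
    """Extract CPU, PID, command, kernel release and PSW address from an Oops."""
    info = {}
    cpu = CPU_RE.search(text)
    if cpu:
        info["cpu"] = int(cpu.group(1))
        info["pid"] = int(cpu.group(2))
        info["comm"] = cpu.group(3)
        info["release"] = cpu.group(4)
        info["build"] = cpu.group(5)
    psw = PSW_RE.search(text)
    if psw:
        # Second doubleword of the PSW is the instruction address.
        info["psw_addr"] = int(psw.group(2), 16)
    return info


summary = summarize_oops(OOPS)
print(summary)
# An instruction address inside the first page hints at a near-null branch target.
print(summary["psw_addr"] < 4096)
```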
CVE References
tags: | added: architecture-s39064 bugnameltc-197384 severity-high targetmilestone-inin--- |
Changed in ubuntu: | |
assignee: | nobody → Skipper Bug Screeners (skipper-screen-team) |
affects: | ubuntu → linux (Ubuntu) |
Changed in ubuntu-z-systems: | |
assignee: | nobody → Skipper Bug Screeners (skipper-screen-team) |
importance: | Undecided → High |
Changed in ubuntu-z-systems: | |
status: | Incomplete → In Progress |
Changed in linux (Ubuntu): | |
assignee: | Skipper Bug Screeners (skipper-screen-team) → Frank Heimes (fheimes) |
description: | updated |
Changed in linux (Ubuntu Focal): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Impish): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Jammy): | |
importance: | Undecided → Medium |
description: | updated |
Changed in linux (Ubuntu Focal): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Impish): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Jammy): | |
status: | In Progress → Fix Committed |
Changed in ubuntu-z-systems: | |
status: | In Progress → Fix Committed |
Changed in ubuntu-z-systems: | |
status: | Fix Committed → Fix Released |
Changed in linux (Ubuntu): | |
status: | New → Invalid |
Thanks for raising this. First of all I've noticed that the kernel in use is pretty outdated (package 'linux-meta-hwe-5.4' (5.4.0.90.101~18.04.80), changelog date 22 Oct 2021) and about half a year old - the current one is '5.4.0-107-generic' (package '5.4.0.107.121~18.04.92').
The delta between 5.4.0.107.121 and 5.4.0.90.101 is 12 updated kernels with ~2000 commits, of which more than 20 are NFS related and some are about vfs.
Hence I need to ask you to update the system to the latest 'linux-image-generic-hwe-18.04' in 'bionic-updates' (5.4.0-107-generic) to be on the latest (and supported) level.
(A test with '5.4.0.108.122~18.04.93' from 'bionic-proposed' would be ideal on top.)
It also looks like a kernel dump was created, could you please share this dump (ideally from the current kernel) for further analysis (either via IBM Box or Canonical's anon. ftp 'http://archive.admin.canonical.com/')?
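
The "is this kernel outdated" check from the comment above can be sketched in a few lines of Python. The helper names `abi_key` and `is_outdated` are hypothetical, and the sketch assumes release strings of the form `5.4.0-90-generic` as printed in the Oops message. The ABI number has to be compared numerically, because a plain string comparison would rank `90` above `107`.

```python
import re
from typing import Tuple


def abi_key(release: str) -> Tuple[int, ...]:
    """Map '5.4.0-90-generic' to (5, 4, 0, 90) for numeric comparison."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unexpected kernel release string: {release!r}")
    return tuple(int(part) for part in match.groups())


def is_outdated(running: str, latest: str) -> bool:
    """True if the running kernel's version/ABI tuple is older than the latest one."""
    return abi_key(running) < abi_key(latest)


# Values taken from this report: the crashing kernel and the then-current one.
print(is_outdated("5.4.0-90-generic", "5.4.0-107-generic"))
```

On a live system the running release string could be obtained with `platform.release()` from the Python standard library instead of being hard-coded.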