Loaded previous kernel module breaks during upgrade
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
nvidia-graphics-drivers-535 (Ubuntu) | New | Undecided | Unassigned |
Bug Description
Issue:
During the update of the packages for driver version 535, the previous driver, which is still loaded, breaks in such a way that the GPU(s) become unusable until reboot.
Symptoms:
1. all currently running & newly started processes interacting with the GPU(s) break:
- this affects both of the following APIs individually: CUDA, NVML
- the processes become stuck at 100% (single thread) system CPU load, i.e. they are stuck in an (interruptible) syscall - they can still be stopped (via SIGINT/-TERM/-KILL)
- some NVML executables show an erroneous total user+system time of millions of hours (far beyond the possible "uptime times CPU threads") - this may hint at bad memory accesses/writes within the kernel
2. once no processes use the GPU anymore (i.e. manually stopped) the kernel reports hung tasks in the `nvidia` and `nvidia_uvm` modules (see attachment)
3. the `nvidia_uvm` kernel module cannot be unloaded: `rmmod` becomes stuck until reboot
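The "uptime times CPU threads" bound mentioned in symptom 1 can be written down as a small plausibility check. This is only a sketch: the function names and the example numbers are made up for illustration and are not taken from the affected nodes.

```python
def max_possible_cpu_seconds(uptime_seconds: float, cpu_threads: int) -> float:
    """Upper bound on the CPU time any process can have accumulated since boot."""
    return uptime_seconds * cpu_threads


def is_plausible_cpu_time(cpu_seconds: float, uptime_seconds: float, cpu_threads: int) -> bool:
    """Return True if a reported user+system time fits within that bound."""
    return 0 <= cpu_seconds <= max_possible_cpu_seconds(uptime_seconds, cpu_threads)


# Illustrative numbers only: 30 days of uptime on a 64-thread node.
uptime = 30 * 24 * 3600
reported = 2_000_000 * 3600  # "millions of hours", expressed in seconds
print(is_plausible_cpu_time(reported, uptime, 64))
```

Any value that fails this check cannot be real accounting data, which is why it points towards corrupted state rather than actual load.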
Expected behavior (has been established through previous ~10 driver package upgrades):
1. all current processes can continue to use the GPU(s) without issue
2. once all processes have stopped using the GPU(s), i.e. none of the `/dev/nvidia*` devices is open, all the nvidia kernel modules can be unloaded (in the appropriate order according to their dependencies) via `modprobe -r` or `rmmod` - after this the new driver can be loaded, e.g. by (re)starting nvidia-persistenced
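The "appropriate order according to dependencies" can be derived from `/proc/modules`, where the fourth field of each line lists the modules that use the module on that line. The following is a rough sketch of that ordering; the function name `unload_order` and the sample text are hypothetical, and the sample does not come from one of the affected nodes.

```python
def unload_order(proc_modules_text: str, prefix: str = "nvidia") -> list:
    """Order modules so that each one is removed only after its users.

    Expects text in the format of /proc/modules. Only modules whose name
    starts with `prefix` are considered; users outside that set are ignored.
    """
    users = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].startswith(prefix):
            continue
        # The "used by" field is "-" or a comma-terminated list of names.
        used_by = set() if fields[3] == "-" else {m for m in fields[3].split(",") if m}
        users[fields[0]] = used_by

    order = []
    remaining = dict(users)
    while remaining:
        # A module is removable once none of its users is still loaded.
        ready = sorted(name for name, used_by in remaining.items()
                       if used_by.isdisjoint(remaining))
        if not ready:
            raise ValueError("cyclic or unresolved module dependencies")
        order.extend(ready)
        for name in ready:
            del remaining[name]
    return order


# Made-up sample in /proc/modules format.
sample = """\
nvidia_uvm 1540096 0 - Live 0x0000000000000000 (OE)
nvidia_drm 77824 0 - Live 0x0000000000000000 (OE)
nvidia_modeset 1306624 1 nvidia_drm, Live 0x0000000000000000 (OE)
nvidia 56500224 2 nvidia_uvm,nvidia_modeset, Live 0x0000000000000000 (OE)
"""
print(unload_order(sample))
```

On a real node the text would come from reading `/proc/modules`, and each name in the result would be passed to `rmmod` in turn.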
Partially retained expected behavior:
1. new processes report errors due to version incompatibilities between installed libraries and loaded kernel module
- e.g. `nvidia-smi` reports something like "Driver/library version mismatch"
- the following kernel message is shown:
NVRM: API mismatch: the client has the version 535.86.05, but
NVRM: this kernel module has the version 535.54.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
- this behavior is retained in the affected versions until the kernel hung task messages appear
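To spot this state across many nodes, the two versions can be pulled out of the kernel message shown above. The sketch below assumes the message wording is exactly as quoted; the helper name `parse_api_mismatch` is hypothetical.

```python
import re

# Matches the wording of the NVRM message quoted in this report.
MISMATCH_RE = re.compile(
    r"the client has the version (?P<client>\d+(?:\.\d+)+), but "
    r"this kernel module has the version (?P<module>\d+(?:\.\d+)+)"
)


def parse_api_mismatch(log_text):
    """Return (client_version, module_version), or None if no mismatch is logged."""
    # Drop everything up to "NVRM:" on each line and rejoin the wrapped lines.
    body = " ".join(
        line.split("NVRM:", 1)[1].strip()
        for line in log_text.splitlines()
        if "NVRM:" in line
    )
    match = MISMATCH_RE.search(body)
    return (match["client"], match["module"]) if match else None


message = """\
NVRM: API mismatch: the client has the version 535.86.05, but
NVRM: this kernel module has the version 535.54.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
"""
print(parse_api_mismatch(message))
```

The input could be the output of `dmesg`; a leading timestamp on each line does not matter because everything before `NVRM:` is discarded.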
Affected versions:
- 535.86.
- 535.54.
Environment:
- Ubuntu 20.04.6 LTS (`lsb_release -d`)
- all affected upgrades were automatically installed via unattended-upgrades
- the issue occurred on 15 different nodes with 5 different hardware configurations (Mainboard, CPU, RAM, GPU, etc.) - so it's unlikely to be a hardware issue
- all nodes are operated headless (GPUs not used for graphics output, no Xserver/whatnot installed, access was through SSH)
Related:
The following bugs may be related, since I expect this issue to manifest with the same signature: GPU entirely unusable, thus black screen, until reboot
- https:/
- https:/
FYI, screenshot of the `nvidia-smi` with nonsensical runtime.