ubuntu-core-launcher nvidia driver detection is bogus

Bug #1615248 reported by Oliver Grawert
This bug affects 28 people
Affects                  Status         Importance   Assigned to        Milestone
snap-confine             Fix Released   Critical     Zygmunt Krynicki
snap-confine (Ubuntu)    Fix Released   Undecided    Unassigned
  Xenial                 In Progress    Undecided    Unassigned

Bug Description

[Impact]

Snap-confine contains special support code for the Nvidia proprietary driver. This code used a rather naive approach to detect the driver: it looked for directories matching /usr/lib/nvidia-*.

This worked fine as long as the number of found directories was either zero (nothing to do) or one (we know which driver to use). The problem arises when driver updates leave behind directories (even empty ones) that still match that glob pattern. With more than one match, snap-confine would just bail out and abort.
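
For illustration, the directory-counting behaviour described above can be sketched roughly as follows. This is not snap-confine's actual code (which is written in C); the function name and the lib_root parameter are hypothetical, and only the glob pattern and the error message are taken from this report.

```python
import glob
import os


def detect_driver_dir_naive(lib_root="/usr/lib"):
    """Directory-counting detection (illustrative sketch only).

    Returns None when no nvidia-* directory exists, the directory path when
    exactly one exists, and raises RuntimeError when several exist.
    """
    pattern = os.path.join(lib_root, "nvidia-*")
    matches = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    if not matches:
        return None  # no proprietary driver found: nothing to do
    if len(matches) == 1:
        return matches[0]  # exactly one candidate: assume it is the driver
    # Leftover directories, even empty ones, end up here.
    raise RuntimeError("multiple nvidia drivers detected, this is not supported")
```

With a second, unrelated nvidia-* directory present, the sketch raises even though only one driver is really installed, which is the failure mode reported here.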

Now snap-confine looks at /sys/module/nvidia/version to know which version of the driver to use (if any). This change was recommended by Alberto Milone, who maintains the nvidia proprietary driver packages in Ubuntu.

For more information about the execution environment, please see this article: http://www.zygoon.pl/2016/08/snap-execution-environment.html

[Test Case]

As a test case, install an nvidia proprietary driver package (any version will do), either through the additional software tab in software-properties-gtk or by installing one of the nvidia-* packages (e.g. nvidia-346). Then create an unrelated directory that does not correspond to any actual driver version, e.g. /usr/lib/nvidia-123.

If snap applications continue to work then everything is good. In the past snap-confine would print an error message and bail out.

This test has to be run on a machine that has actual nvidia hardware and has the nvidia proprietary kernel module loaded.

[Regression Potential]

* Regression potential is minimal. The behaviour is the same as before; only the driver detection code is less dumb and actually knows which driver is running by asking the kernel.

[Other Info]

* This bug is a part of a major SRU that brings snap-confine in Ubuntu 16.04 in line with the current upstream release 1.0.41.

* snap-confine is technically an integral part of snapd, which has an SRU exception and is allowed to introduce new features and take advantage of an accelerated procedure. For more information see https://wiki.ubuntu.com/SnapdUpdates

== # Pre-SRU bug description follows # ==

older nvidia drivers used to leave dangling symlinks behind in their dirs in /usr/lib/nvidia-$version which makes dpkg not remove the directory on package removal:

/usr/lib/nvidia-319-updates:
libnvidia-opencl.so.1 libnvidia-wfb.so.1

/usr/lib/nvidia-346:
libnvidia-fbc.so.1 libnvidia-wfb.so.1

/usr/lib/nvidia-352:
libnvidia-fbc.so.1 libnvidia-wfb.so.1
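
To check whether the files left in such a directory really are dangling symlinks, something like the following sketch could be used. The helper name dangling_symlinks is made up for this example.

```python
import os


def dangling_symlinks(directory):
    """Return the names of symlinks in `directory` whose targets are missing."""
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # islink() inspects the link itself; exists() follows it to the target.
        if os.path.islink(path) and not os.path.exists(path):
            result.append(name)
    return result
```

For example, dangling_symlinks("/usr/lib/nvidia-352") would list the leftover entries if they point at files that no longer exist.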

ubuntu-core-launcher seemingly only checks for the existence of multiple nvidia-* dirs to throw the:
“multiple nvidia drivers detected, this is not supported”
error...

there is only the nvidia-361 driver installed on this machine but it was upgraded LTS->LTS and originally installed with 12.04.

instead of checking whether there are multiple directories, the check should look for something like /usr/lib/nvidia-361/libGL.so or libGLX.so, so it does not fall over on leftover cruft but checks for the actual existence of multiple driver libs.
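
A rough sketch of that suggestion, assuming a single marker library is enough to tell a real driver directory from leftover cruft. The function name and its parameters are hypothetical, and this only illustrates the idea proposed in this description.

```python
import glob
import os


def dirs_with_driver_libs(lib_root="/usr/lib", marker="libGL.so"):
    """Return the nvidia-* directories under lib_root that contain `marker`."""
    found = []
    for path in sorted(glob.glob(os.path.join(lib_root, "nvidia-*"))):
        # exists() follows symlinks, so a dangling marker link does not count.
        if os.path.isdir(path) and os.path.exists(os.path.join(path, marker)):
            found.append(path)
    return found
```

Only when this list has more than one entry would there be a real case of multiple installed drivers.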

i seem to not be alone http://askubuntu.com/questions/811479

Abhishek (rawcoder) wrote :

Thanks for your solution, was getting the same error.

It is indeed surprising that they are just checking the dirs to detect multiple drivers.

I guess the latest update (2.12) triggered it. So we should expect more people to hit this bug.

Nate Finch (natefinch) wrote :

Here's a workaround.

I have two nvidia directories under /usr/lib:

$ ls /usr/lib -l | grep " nvidia-"
drwxr-xr-x 6 root root 4096 Apr 25 12:19 nvidia-304
drwxr-xr-x 2 root root 4096 Apr 25 10:32 nvidia-352

Removing the unused one fixes the bug. First, figure out which one you're using:

$ dpkg -l | grep '^ii' | grep 'nvidia-[0-9]'
ii nvidia-304 304.131-0ubuntu3 amd64 NVIDIA legacy binary driver - version 304.131

Now, delete or rename any other nvidia-* directories (any that aren't the one in use):

$ sudo mv /usr/lib/nvidia-352 /usr/lib/old-nvidia-352

(I renamed mine rather than deleting it because I am extremely paranoid when it comes to mucking with video drivers in any way, but deleting it is probably totally fine, too.)

Zygmunt Krynicki (zyga)
affects: snappy → snap-confine
Changed in snap-confine:
importance: Undecided → High
assignee: nobody → Zygmunt Krynicki (zyga)
status: New → Triaged
Zygmunt Krynicki (zyga) wrote :

It would help if we knew how to identify the driver that is actually active.

Alberto Milone (albertomilone) wrote :

Kernel space is much more reliable, as the system can use only one nvidia kernel module at a time.

You can check the following file:

/sys/module/nvidia/version

and you will get something like 367.44 (the format being "$major_version.$minor_version\n"). You can pick the major version, and use that to determine which libraries to load, as the path is /usr/lib/nvidia-$major_version.
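
As a minimal sketch of that approach: read the version file, take the part before the first dot, and build the library path from it. The function names and the version_path and lib_root parameters are made up for illustration, and treating a missing file as "no nvidia module loaded" is an assumption of this sketch.

```python
import os


def parse_major(version_text):
    """Extract the major version from text such as "367.44\\n"."""
    major, dot, _rest = version_text.strip().partition(".")
    if not dot or not major.isdigit():
        raise ValueError("unexpected nvidia version: %r" % version_text)
    return major


def driver_lib_dir(version_path="/sys/module/nvidia/version", lib_root="/usr/lib"):
    """Return lib_root/nvidia-$major_version, or None if the file is absent."""
    try:
        with open(version_path) as f:
            version_text = f.read()
    except FileNotFoundError:
        # Assumption: no version file means the nvidia module is not loaded.
        return None
    return os.path.join(lib_root, "nvidia-" + parse_major(version_text))
```

Because the kernel can only have one nvidia module loaded at a time, this gives at most one answer no matter how many directories exist under /usr/lib.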

Barry Warsaw (barry) wrote :

% cat /sys/module/nvidia/version
367.35
% dpkg-query -W nvidia*
nvidia-352 361.45.11-0ubuntu4
nvidia-361 367.35-0ubuntu1
nvidia-367 367.35-0ubuntu1
nvidia-common
nvidia-driver-binary
nvidia-legacy-340xx-vdpau-driver
nvidia-libopencl1-367
nvidia-libopencl1-dev
nvidia-opencl-icd
nvidia-opencl-icd-352 361.28-0ubuntu1
nvidia-opencl-icd-361 367.35-0ubuntu1
nvidia-opencl-icd-367 367.35-0ubuntu1
nvidia-persistenced
nvidia-prime 0.8.4
nvidia-settings 367.35-0ubuntu1
nvidia-settings-binary
nvidia-vdpau-driver

On IRC, it was recommended to just purge the older nvidia driver versions as a workaround. I'm about to try that and if I don't follow up then that probably wasn't a good idea. :)

Barry Warsaw (barry) wrote :

FWIW I have a GeForce GTX 760/PCIe/SSE2

Barry Warsaw (barry) wrote :

Although:

% ls /usr/lib -l | grep " nvidia-"
drwxr-xr-x 2 root root 4096 Mar 4 23:16 nvidia-352/
drwxr-xr-x 2 root root 12288 Aug 10 16:53 nvidia-361/
drwxr-xr-x 6 root root 4096 Aug 10 16:54 nvidia-367/
drwxr-xr-x 2 root root 4096 Aug 10 16:53 nvidia-367-prime/

Barry Warsaw (barry) wrote :

Purging nvidia-352 and nvidia-361 does not remove any of these subdirs.

Barry Warsaw (barry) wrote :

Neither does purging fix the original bug.

Nate Finch (natefinch) wrote :

Not to be a pain in the butt, but why is a hello-world snap that just does echo "hello world" doing anything with video drivers?

Alberto Milone (albertomilone) wrote :

@Barry: 352 and 361 are transitional packages

Michael Vogt (mvo)
Changed in snap-confine:
importance: High → Critical
Zygmunt Krynicki (zyga)
Changed in snap-confine:
milestone: none → 1.0.41
Zygmunt Krynicki (zyga) wrote :

This is fixed by this pull request: https://github.com/snapcore/snap-confine/pull/129 (still in progress).

Zygmunt Krynicki (zyga) wrote :

Nate: to reply to your question, snap-confine currently has special code for supporting nvidia, and this code runs unconditionally. In the future it may be generalized enough that the regular interface features that snapd can influence will be sufficient to support nvidia. For now that is not the case.

Zygmunt Krynicki (zyga)
Changed in snap-confine:
status: Triaged → Fix Committed
Neil McPhail (njmcphail) wrote :

I think this is a different issue to Bug #1574851 which is marked as a duplicate of this bug, as that happens even with only one nvidia version installed.

Zygmunt Krynicki (zyga)
Changed in snap-confine:
status: Fix Committed → Fix Released
Zygmunt Krynicki (zyga)
description: updated
Changed in snap-confine (Ubuntu):
status: New → Fix Released
Changed in snap-confine (Ubuntu Xenial):
status: New → In Progress
Leo Arias (elopio) wrote :

I have no machine available with an nvidia card, so I can't verify this one.

Luca (zapduke) wrote :

After the fix, Blender and Krita started working, but I receive the same error for other snaps like ubuntu-clock-app, chiche, rubecube. Any idea why some work and some don't?
