pth_init() aborts on armel with "longjmp causes uninitialized stack frame"

Bug #599862 reported by Vincent Stehlé
This bug affects 2 people
Affects          Status        Importance  Assigned to                  Milestone
gnupg2 (Ubuntu)  Fix Released  Undecided   Unassigned
pth (Ubuntu)     Fix Released  High        Linaro Builds and Baselines

Bug Description

Binary package hint: gnupg2

I cannot launch gnupg-agent on my platform:

$ gpg-agent --daemon
*** longjmp causes uninitialized stack frame ***: gpg-agent terminated
Aborted

This is on an armel maverick system.

More information:

$ lsb_release -rd
Description: Ubuntu maverick (development branch)
Release: 10.10

$ apt-cache policy gnupg2
gnupg2:
  Installed: (none)
  Candidate: 2.0.14-1.1ubuntu1
  Version table:
     2.0.14-1.1ubuntu1 0
        500 http://ports.ubuntu.com/ubuntu-ports/ maverick/main Packages

Tags: armel maverick
tags: added: armel maverick
Dave Martin (dave-martin-arm) wrote :

After some fighting against some effective efforts by gdb to prevent debugging this issue, I think I've narrowed this problem down to something bad in pth_init() in libpth.

To demonstrate this, try applying the attached diff in gnupg2 and rebuilding.

Dave Martin (dave-martin-arm) wrote :

It looks like this is definitely a bug in pth (or something pth uses).

When libpth20 2.0.7-16 was built, it bombed out with the same error while running tests, but the build rules silently ignore the failure :(

http://launchpadlibrarian.net/48374229/buildlog_ubuntu-maverick-armel.pth_2.0.7-16_FULLYBUILT.txt.gz

Dave Martin (dave-martin-arm) wrote :

From the log:

=== TESTING BASIC OPERATION ===

Initializing Pth system (spawns scheduler and main thread)
*** longjmp causes uninitialized stack frame ***: /build/buildd/pth-2.0.7/.libs/lt-test_std terminated
Aborted

[...]

make[1]: [test-std] Error 1 (ignored)

summary: - gnupg-agent won't start
+ pth_init() aborts on armel with "longjmp causes uninitialized stack
+ frame"
Changed in gnupg2 (Ubuntu):
status: New → Invalid
Dave Martin (dave-martin-arm) wrote :

Narrowed this down a bit further... the error appears to happen in the call to pth_spawn() at pth_lib.c:pth_init():95.

Building with the attached diff demonstrates this:

=== TESTING GLOBAL LIBRARY API ===

Fetching library version
version = 0x200207

=== TESTING BASIC OPERATION ===

Initializing Pth system (spawns scheduler and main thread)
pth_lib.c: before pth_spawn
*** longjmp causes uninitialized stack frame ***: .libs/test_std terminated

Dave Martin (dave-martin-arm) wrote :

OK, the source of scariness seems to be in pth_mctx.c which does some evil-looking things in order to implement userspace threading.

In order to switch threads with an accompanying switch of stack, setjmp() is called from inside a signal handler which has an alternate signal stack set up. [1]

eglibc barfs when the corresponding longjmp() from the main thread tries to jump back into the signal handler after the signal handler returns (!). The maintainers claim this is portable within POSIX, but I'm less than convinced. POSIX is somewhat vague about the circumstances in which it's safe to longjmp() out of a signal handler, to say nothing of what happens on a modern kernel with interruptible system calls etc., and it expressly prohibits longjmp'ing back into a function which has returned. Behaviour for siglongjmp in this situation appears to be undefined, but my conclusion is that this is not supposed to be supported either [2]. If so, and if I've understood the implications correctly, this would invalidate any claim of strict portability in pth.

I haven't read the full rationale [3] but it contains some interesting caveats:
"Even on operating systems which have working POSIX functions, our approach may theoretically still not work, because longjmp [...] [may branch] to error-handling code if it detects that the caller tries to jump up the stack, i.e., into a stack frame which has already returned"

We may be hitting just such a check here.

Notwithstanding this, the contents of the jmp_buf otherwise look sane when the failing longjmp call occurs. See the attached debug log: the sp and pc values saved in the jmp_buf for pth_mctx_set_trampoline at pth_mctx.c:394 appear to match those in the jmp_buf passed to longjmp at pth_mctx.c:362.

[1] see http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/maverick/pth/maverick/annotate/head%3A/pth_mctx.c

[2] See IEEE Std 1003.1-2008 (System Interfaces: longjmp) http://www.opengroup.org/onlinepubs/9699919799/

"The longjmp() function shall restore the environment saved by the most recent invocation of setjmp() in the same thread, with the corresponding jmp_buf argument. If [...] the function containing the invocation of setjmp() has terminated execution in the interim [...] the behavior is undefined."

(This specification derives directly from ISO C)

[3] "Portable Multithreading - The Signal Stack Trick For User-Space Thread Creation", Ralf S. Engelschall: http://bazaar.launchpad.net/%7Eubuntu-branches/ubuntu/maverick/pth/maverick/annotate/head%3A/rse-pmt.ps
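
For contrast with the situation described above, here is a minimal sketch of the case that the quotation in [2] does define: the function that called setjmp() is still executing when longjmp() is called. The names try_call, fail_deep and last_code are invented for illustration and have nothing to do with the pth sources.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;
static volatile int last_code;   /* carries a value across the jump */

/* Called from try_call(), so try_call()'s frame is still live here. */
static void fail_deep(int code)
{
    last_code = code;
    longjmp(env, 1);             /* defined: the setjmp() caller has not returned */
}

int try_call(int code)
{
    if (setjmp(env) != 0)
        return last_code;        /* reached via longjmp() from fail_deep() */
    fail_deep(code);
    return -1;                   /* not reached: fail_deep() never returns */
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("try_call(7) = %d\n", try_call(7));
    return 0;
}
#endif
```

Calling longjmp(env, 1) after try_call() had already returned would be the undefined case from [2], which is essentially what the signal stack trick relies on.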

Dave Martin (dave-martin-arm) wrote :

Looks like the failure might be restricted to armel because of the unavailability of the ucontext API:

checking for ucontext.h... yes
checking for makecontext... no
checking for swapcontext... no
checking for getcontext... no
checking for setcontext... no
checking for usable SVR4/SUSv2 makecontext(2)/swapcontext(2)... no
checking for signal.h... (cached) yes
checking for sigsetjmp... no
checking for siglongjmp... yes
checking for setjmp... yes
checking for longjmp... yes
checking for _setjmp... yes
checking for _longjmp... yes
checking for sigaltstack... yes
checking for sigstack... yes
checking for signal-mask aware setjmp(3)/longjmp(3)... yes: ssjlj

Whereas i386, amd64:

checking for makecontext... yes
checking for swapcontext... yes
checking for getcontext... yes
checking for setcontext... yes
checking for usable SVR4/SUSv2 makecontext(2)/swapcontext(2)... yes

It could be interesting to trick pth into thinking that that API is unavailable on i386/amd64 and seeing whether we get the same breakage on other platforms.
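
For comparison, a minimal sketch of the ucontext-style switch that the i386/amd64 configure output above makes available. This is not pth's code: run_ucontext_demo, thread_body and STACK_SIZE are names invented for illustration, and the sketch assumes a C library that provides getcontext(), makecontext() and swapcontext().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for the demo stack */

static ucontext_t main_ctx, thread_ctx;
static volatile int steps;

/* Runs on the privately allocated stack, not on the main stack. */
static void thread_body(void)
{
    steps++;
    swapcontext(&thread_ctx, &main_ctx);   /* yield back to the caller */
    steps++;
    /* returning here resumes the context named in uc_link */
}

int run_ucontext_demo(void)
{
    void *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    steps = 0;
    if (getcontext(&thread_ctx) != 0) {
        free(stack);
        return -1;
    }
    thread_ctx.uc_stack.ss_sp = stack;
    thread_ctx.uc_stack.ss_size = STACK_SIZE;
    thread_ctx.uc_link = &main_ctx;        /* where to go when thread_body returns */
    makecontext(&thread_ctx, thread_body, 0);

    swapcontext(&main_ctx, &thread_ctx);   /* run thread_body until it yields */
    swapcontext(&main_ctx, &thread_ctx);   /* resume it until it returns */

    free(stack);
    return steps;
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("steps = %d\n", run_ucontext_demo());
    return 0;
}
#endif
```

No call to longjmp() is involved on this path, so any sanity check inside longjmp() never runs.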

Dave Martin (dave-martin-arm) wrote :

Looks like there's a check for longjmp decreasing the sp, which was merged from eglibc upstream revision 2.11~20100104.

see http://bazaar.launchpad.net/%7Eubuntu-branches/ubuntu/maverick/eglibc/maverick/annotate/head%3A/ports/sysdeps/unix/sysv/linux/arm/____longjmp_chk.S
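
A simplified model of what such a check amounts to, assuming a downward-growing stack. This is not the eglibc code (which is ARM assembly at the URL above): the function name is invented, and the real check may make further allowances, for example for alternate signal stacks, that this model ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: on a downward-growing stack, every frame that is
 * still live has an address at or above the current stack pointer. A
 * saved sp below the current sp therefore looks like a frame that was
 * never set up, and the jump is refused.
 */
bool longjmp_target_rejected(uintptr_t current_sp, uintptr_t saved_sp)
{
    return saved_sp < current_sp;
}
```

If the thread stack is obtained from the heap, which on Linux typically sits well below the main stack, then saved_sp < current_sp would hold for the first switch from the main stack into a new thread. That would be consistent with the abort seen at pth_spawn().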

Dave Martin (dave-martin-arm) wrote :

This patch demonstrates a possible workaround, which enables the pth tests to pass.

This patch modifies a generated file: it should be applied after running configure.

The stack pointer is hacked immediately before calling the longjmp family of functions, so that the sanity-check in the longjmp implementation is not triggered. Thanks to Andrew Stubbs for helping me to understand the problem here.

***Health warning*** At best, this is a very nasty hack; at worst it may be totally unsafe.

Either way I DO NOT recommend attempting to merge this patch as a fix in Ubuntu unless you really know what you're doing, especially since the code is used by gpg-agent.

Steve Langasek (vorlon)
Changed in pth (Ubuntu):
assignee: nobody → Linaro Foundations (linaro-foundations)
importance: Undecided → High
status: New → Triaged
Jani Monoses (jani) wrote :

It happens on ARM only because it does not have the getcontext() API so a different implementation is used for user context switching.

On non-ARM one can select the setjmp-based implementation via configure (all 3 args must be given, or it will be a mix of setjmp/ucontext and it won't build), e.g.:
./configure --with-mctx-mth=sjlj --with-mctx-dsp=sjlj --with-mctx-stk=ss

Then make test exhibits the same error.

Probably this codepath was not much tested after changes to jmpbuf in glibc.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pth - 2.0.7-16ubuntu2

---------------
pth (2.0.7-16ubuntu2) natty; urgency=low

  * debian/rules: disable FORTIFY_SOURCE for armel build, as it breaks
    the sigjmp/longjmp mechanism used on ARM for user space threading.
    (Closes LP: #599862)
 -- Jani Monoses <email address hidden> Mon, 03 Jan 2011 15:04:54 +0200

Changed in pth (Ubuntu):
status: Triaged → Fix Released
Loïc Minier (lool) wrote :

Do we have a bug and/or a test case to track the setjmp() issues (which apparently aren't ARM specific)?

I filed bug #696794 to track the lack of getcontext() on ARM in eglibc, with a task to revert these pth changes once eglibc provides getcontext().

Jani Monoses (jani) wrote :

@Loïc these are not really issues with longjmp itself; pth relies on a behavior which was changed in glibc 2.11.
The failing test case is the one in pth (make test).

Loïc Minier (lool) wrote :

I thought there was a runtime issue as well (not just make test)?! (The original bug mentions some abort when using gnupg.)

I understand that when getcontext() is missing, pth defaults to another pth backend which either is broken in pth itself or exposes a bug in eglibc; I'd like to make sure that we track this issue.

Jani Monoses (jani) wrote :

I overlooked that there was a gpg-related bug here. Anyway, gpg-agent works with the latest pth now; the abort was due to the extra checks that FORTIFY_SOURCE enables in every longjmp call.

There is no bug in glibc/eglibc, but pth relies on some tricks which are not guaranteed to work, and indeed they fail starting with glibc 2.11 when built with FORTIFY_SOURCE. This is documented in the pth sources and in the paper quoted in the comments on this bug.
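
A small sketch for checking whether a given build has fortification active, and so would get the extra longjmp checks. It relies on __USE_FORTIFY_LEVEL, which is a glibc-internal macro set by the headers rather than a documented interface, so treat it as an assumption; fortify_level is a name invented for illustration.

```c
#include <assert.h>
#include <setjmp.h>   /* any glibc header pulls in features.h */
#include <stdio.h>

/*
 * Report the fortification level the headers were compiled with.
 * With glibc this is normally nonzero only when _FORTIFY_SOURCE is
 * defined and optimization is enabled; elsewhere it falls back to 0.
 */
int fortify_level(void)
{
#if defined(__USE_FORTIFY_LEVEL)
    return __USE_FORTIFY_LEVEL;
#else
    return 0;
#endif
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("fortify level: %d\n", fortify_level());
    return 0;
}
#endif
```

Building this once with the distribution's default flags and once with -U_FORTIFY_SOURCE should show whether the default options enable the checks.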

Changed in gnupg2 (Ubuntu):
status: Invalid → Fix Released
Loïc Minier (lool) wrote :

Jani, thanks for the explanation; I wonder why this works on non-ARM arches though.

Should we open a bug upstream so that pth disables FORTIFY_SOURCE on ARM if it's in the default compiler options?

Jani Monoses (jani) wrote :

It only works on non-ARM because they use getcontext() by default. When the setjmp/longjmp mechanism is selected, x86 has the same behaviour as described in comment 10. glibc performs the same sanity check in longjmp for all archs.

I mailed the upstream author a week ago, pointing him to this bug report.
