exim4 autopkgtest failure on ppc64el

Bug #1983605 reported by Jeremy Bícha
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
exim4 (Ubuntu)
Triaged
High
Unassigned
Kinetic
New
Undecided
Unassigned
Lunar
Triaged
High
Unassigned

Bug Description

I'm just filing a tracker bug for this issue but am not working on it myself.

exim4's autopkgtest fails on the ppc64el architecture.

https://autopkgtest.ubuntu.com/packages/e/exim4/kinetic/ppc64el

This test doesn't fail on Debian. Ubuntu's ppc64el is slightly different than Debian's (we enable -O3 for instance only on that architecture by default).

https://ci.debian.net/packages/e/exim4/testing/ppc64el/

=== Connected to localhost.
<- 220 adt-kinetic-ppc64el-exim4-20220806-041227-juju-4d1272-prod-prop.novalocal ESMTP Exim 4.96 Ubuntu Sat, 06 Aug 2022 04:14:54 +0000
 -> EHLO adt-kinetic-ppc64el-exim4-20220806-041227-juju-4d1272-prod-prop.novalocal
<- 250-adt-kinetic-ppc64el-exim4-20220806-041227-juju-4d1272-prod-prop.novalocal Hello localhost [127.0.0.1]
<- 250-SIZE 52428800
<- 250-8BITMIME
<- 250-PIPELINING
<- 250-PIPECONNECT
<- 250-CHUNKING
<- 250-STARTTLS
<- 250-PRDR
<- 250 HELP
 -> STARTTLS
<- 220 TLS go ahead
 -> QUIT
*** TLS startup failed (connect(): error:0A000438:SSL routines::tlsv1 alert internal error)
*** STARTTLS attempted but failed
*** Remote host closed connection unexpectedly.

Jeremy Bícha (jbicha)
tags: added: rls-kk-incoming
Revision history for this message
Jeremy Bícha (jbicha) wrote :

Reassigning since it fails even without libx11 1.8

affects: libx11 (Ubuntu) → exim4 (Ubuntu)
summary: - exim4 autopkgtest failure on ppc64el with libx11 1.8
+ exim4 autopkgtest failure on ppc64el
description: updated
Jeremy Bícha (jbicha)
tags: removed: rls-kk-incoming
Changed in exim4 (Ubuntu):
status: New → Triaged
Bryce Harrington (bryce)
description: updated
Revision history for this message
Bryce Harrington (bryce) wrote :

Looking at the test history, the issue has been cropping up irregularly. We've had a string of failures til now, but prior to that was a string of successes with the same exim4 version.

The specific line in the autopkgtest that is failing is this:

    swaks -s localhost -tlso -q ehlo

Running this manually in an amd64 LXC container, I get:

$ swaks -s localhost -tlso -q ehlo
=== Trying localhost:25...
=== Connected to localhost.
<- 220 triage-kinetic.lxd ESMTP Exim 4.96 Ubuntu Mon, 26 Sep 2022 19:23:28 +0000
 -> EHLO triage-kinetic.lxd
<- 250-triage-kinetic.lxd Hello localhost [127.0.0.1]
<- 250-SIZE 52428800
<- 250-8BITMIME
<- 250-PIPELINING
<- 250-PIPECONNECT
<- 250-CHUNKING
<- 250-STARTTLS
<- 250-PRDR
<- 250 HELP
 -> STARTTLS
<- 220 TLS go ahead
=== TLS started with cipher TLSv1.3:TLS_AES_256_GCM_SHA384:256
=== TLS no local certificate set
=== TLS peer DN="/C=UK/O=Exim Developers/CN=triage-kinetic.lxd"
 ~> EHLO triage-kinetic.lxd
<~ 250-triage-kinetic.lxd Hello localhost [127.0.0.1]
<~ 250-SIZE 52428800
<~ 250-8BITMIME
<~ 250-PIPELINING
<~ 250-PIPECONNECT
<~ 250-CHUNKING
<~ 250-PRDR
<~ 250 HELP
 ~> QUIT
<~ 221 triage-kinetic.lxd closing connection
=== Connection closed with remote host.

So the difference in the failing case is just the already identified error message:

*** TLS startup failed (connect(): error:0A000438:SSL routines::tlsv1 alert internal error)

Perhaps the test should be using tls1.2? Googling turns up a lot of hits, a couple that look pertinent to evaluate further:
  * https://serverfault.com/questions/954489/ssl-routinesssl3-read-bytestlsv1-alert-decrypt-error-on-mutual-authendication
    (Suggests avoiding tls1 as deprecated)
  * https://serverfault.com/questions/1098419/updated-exim4-now-not-using-tls-due-to-validation-failure
    (Suggests configuring exim4 to disable remote verification)

However, I'm a bit skeptical since those sound like things that would either always pass or always fail, yet we see the test case sometimes work and sometimes not (and only on this architecture??) I would expect something more resembling a race condition (e.g. trying to use exim4 while it or one of its dependencies is still starting up, or two things trying to access the same port.)

I wonder if the autopkgtest could be improved by adding some additional basic checks prior to running this command, that verify the network conditions, validates the tls config, and/or checks the exim4 service with no encryption.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We are able to reproduce this somewhat consistently on a ppc64el kinetic VM. When it happens, it will continuously happen, until exim4 is restarted. Then it might happen again or not.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :
Download full text (3.3 KiB)

Andreas and I have been spending time on and off investigating this issue. Until today, we were able to determine that the failure happens even when we build exim4, gnutls28 and nettle with -O2 instead of -O3 (which is the default optimization level on ppc64el). Disabling -O3 was the very first thing that came to my mind when I saw the problem (I've been bitten by this problem a few times before...), so it was a bit surprising to see that, this time, it isn't the reason behind the bug.

We then looked at the autopkgtest logs for exim4 on ppc64el, and were able to determine that, most likely, the issue started happening around the time the package was updated from version 4.94 to 4.95 (4.94.2-7ubuntu2 to 4.95-2ubuntu2, according to https://autopkgtest.ubuntu.com/packages/exim4/jammy/ppc64el). This morning, Andreas did a rebuild of exim4 4.94.2-7ubuntu2 and was able to confirm that the problem did *not* manifest with this version. That's good because now we are able to narrow our investigation.

Today I had more time to work on this issue, and was able to make good progress. Here's what I tried and what I found:

- Knowing that the issue is most likely in exim4, I decided to build the software directly from upstream and verify if I could reproduce the problem there. This served the purpose of making sure that the bug is present in upstream as well, and not in our downstream patches.

- I was able to successfully replicate the failure with upstream's 4.95 tag, and the non-failure with upstream's 4.94.

- This allowed me to do a bisect of the code, which resulted in:

f50a063dc0b96ac95b3a7bc0aebad3b3f2534c02 is the first bad commit
commit f50a063dc0b96ac95b3a7bc0aebad3b3f2534c02
Author: Jeremy Harris <email address hidden>
Date: Tue Jun 22 23:04:59 2021 +0100

    TLS: as server, reject connections with ALPN indicating non-smtp use

- On the one hand, that's great. I double checked and indeed, the problem started happening after the commit above was pushed.

- I wanted to understand better what was going on, so I built exim4 (still from upstream) using "-O0 -g3". I was ready to start debugging it, when...

- ... I wasn't able to reproduce the problem anymore. My attention was once again turned to optimization.

- I then decided to take a closer look at the package again. I confirmed that I could still reproduce the issue using -O2 and -O1. When I tried using -O0, I initially wasn't able to build the package because -O0 interacts badly with "hardening=+all" (which is expected behaviour). And then it dawned on me that the problem could actually be some of the hardening features we use, because they're much more likely to break something than -O0 vs. -O2.

- I had a feeling that the problem could be related to PIE, so I did a build only with PIE disabled, and voilà: I could not reproduce the bug anymore.

So that's it. Where do we go from here?

- On exim4, it's enough to disable PIE on ppc64el only. That's an easy workaround and, IMHO, should not be the cause of much concern from the SRU team.

- Generally speaking, it might be interesting to continue investigating this issue a bit more. For example, I'm still unsure as to why we'...

Read more...

Revision history for this message
Bryce Harrington (bryce) wrote :

Revisiting this because (unsurprisingly) this has cropped up in update-excuses once again, on ppc64el only, and as before is intermittently passing about 20-30% of the time.

For reference, a link to the commit Sergio bisected to:

    https://git.exim.org/exim.git/commitdiff/f50a063dc0b96ac95b3a7bc0aebad3b3f2534c02

If that is indeed the code causing the problems, next question would be if the issue is something to do with the ALPN detection misfiring, or if the code itself is compiling wrong. For the latter, my only idea so far might be to tinker with the instantiation of the gnutls_datum_t protocols array, maybe something's off with the structs? The test output doesn't appear to mention ALPN at all, which makes me wonder if the debug messages could be enabled to help track where in this code things are going wrong?

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.