gcc-10 breaks on armhf (flaky): internal compiler error: Segmentation fault
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gcc | In Progress | Medium | |
gcc-10 (Ubuntu) | Confirmed | Medium | Unassigned |
Bug Description
Hi,
this could be the same as bug 1887557, but as I don't have enough data I'm filing it as an individual issue for now.
I have only seen this happening on armhf so far.
In 2 of 5 groovy builds of qemu 5.0 this week I have hit the issue, but it is flaky.
Flakiness:
1. different file
first occurrence
/<<PKGBUILDDIR>
second occurrence
/<<PKGBUILDDIR>
Being so unreliable, I can't provide much more yet.
I filed it mostly for awareness, and so that it can be duped onto the right bug if there is a better one.
Christian Ehrhardt (paelzer) wrote : | #1 |
Christian Ehrhardt (paelzer) wrote : | #2 |
Christian Ehrhardt (paelzer) wrote : | #3 |
There was another one in Groovy as of yesterday.
https:/
https:/
...
qapi/qapi-
qapi/qapi-
6570 | }
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:/
...
The bug is not reproducible, so it is likely a hardware or OS problem.
So the compiler itself recognizes that it isn't the source code (alone) but some flaky awkwardness.
It seems qemu builds in groovy hit this in about 1 of 3 of the builds we do on armhf - not sure if that is enough for you to debug with?
Matthias Klose (doko) wrote : | #4 |
no, try a local build until you have a reproducer. When DEB_BUILD_OPTIONS is set, the compiler driver retries up to three times to see if it's reproducible.
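Doko's suggestion can be sketched as a manual retry loop. The block below is a hedged, self-contained illustration: `stub-cc` is a stand-in for the real gcc-10 invocation (it deliberately fails on its 3rd call to simulate a flaky ICE), and `flaky.c` is a placeholder file name, not one from this bug.

```shell
# Create a stub compiler that fails on its 3rd call, simulating a flaky ICE.
cat > stub-cc <<'EOF'
#!/bin/sh
n=$(cat cc-count 2>/dev/null || echo 0)
n=$((n + 1))
echo "$n" > cc-count
if [ "$n" -eq 3 ]; then
    echo 'internal compiler error: Segmentation fault' >&2
    exit 1
fi
exit 0
EOF
chmod +x stub-cc
rm -f cc-count

# Retry the same invocation until it fails, keeping stderr of the failing run.
i=1
max=10
while [ "$i" -le "$max" ]; do
    if ! ./stub-cc -O2 -c flaky.c -o /dev/null 2> ice.log; then
        echo "compiler failed on iteration $i: $(cat ice.log)"
        break
    fi
    i=$((i + 1))
done
```

With the real compiler, the stderr kept in `ice.log` plus the preprocessed source of the failing unit would be what the bug report asks for.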
description: | updated |
Balint Reczey (rbalint) wrote : | #5 |
Found it again in glibc 2.32-0ubuntu3 build.
vfscanf-internal.c: In function ‘__vfscanf_
vfscanf-
3057 | }
| ^
Christian Ehrhardt (paelzer) wrote : | #6 |
I'm building qemu (known to be able to trigger it) in an armhf LXD container on a Canonistack arm64 VM (the setup that should be closest to the failing builders).
I also installed whoopsie and apport to catch even a single crash.
But I'm building for quite some hours by now and nothing happened.
I'll let it run in a loop for the rest of the day, but if it doesn't trigger again we need a better approach to corner this bug.
Christian Ehrhardt (paelzer) wrote : | #7 |
I have been compiling for almost 24h now; it just won't crash :-/
Not sure what else I could do to make it more likely to reproduce ...
Christian Ehrhardt (paelzer) wrote : | #8 |
Another breakage at
https:/
I had to retry it; we will see if it works on retry as before.
Christian Ehrhardt (paelzer) wrote : | #9 |
And again on the same :-/
cc -iquote /<<PKGBUILDDIR>
The bug is not reproducible, so it is likely a hardware or OS problem.
There seems to be no pattern to it (e.g. which source file it breaks on), just a chance that probably increases with source size. But I wonder what else I could do on top of the canonistack build that I have tried - maybe concurrency?
Christian Ehrhardt (paelzer) wrote : | #10 |
cc -iquote /<<PKGBUILDDIR>
during RTL pass: reload
/<<PKGBUILDDIR>
/<<PKGBUILDDIR>
2936 | }
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
Now hit at 3/3 retries, which is exactly what we were afraid might happen ...
Changed in gcc-10 (Ubuntu): | |
importance: | Undecided → Critical |
Christian Ehrhardt (paelzer) wrote : | #11 |
Bumping the priority since - as we feared - this is starting to become a service problem (what if we can't rebuild anymore?)
Christian Ehrhardt (paelzer) wrote : | #12 |
I reduced the CPU/Mem of my canonistack system that I try to recreate on (to be more similar).
Also I now do run with DEB_BUILD_
/me hopes this might help to finally trigger it in a debuggable environment.
P.S.: I'm now at 4/4 failed retries for the real build ... :-/ Thankfully it worked on the fifth retry.
P.P.S.: Note to self: 4 CPUs / 8G memory is the real size used (I have 4/4 atm since I set it up before I could reach anyone).
Seth Forshee (sforshee) wrote : | #13 |
We're also seeing this in kernel builds.
Christian Ehrhardt (paelzer) wrote : | #14 |
I got the crash in the repro env.
dmesg holds no OOM which is good - also no other dmesg/journal entry that would be related.
It might depend on concurrent execution, as this was the primary change compared to last time.
And I had not set up apport/whoopsie to catch the crash :-/
I've installed them now and run the formerly breaking command in a loop.
For the sake of "just eating cpu cycles" I have spawned some cpu hogs in the background.
But with all that in place it ran the compile 300 times without a crash :-/
It seems I have to re-run in the build env and hope that apport will catch it into /var/crash this time :-/
Christian Ehrhardt (paelzer) wrote : | #15 |
Finally:
cc -iquote /root/qemu-
during RTL pass: reload
/root/qemu-
/root/qemu-
12479 | }
| ^
...
The bug is not reproducible, so it is likely a hardware or OS problem.
make[2]: *** [/root/
make[2]: Leaving directory '/root/
make[1]: *** [Makefile:527: i386-linux-
make[1]: *** Waiting for unfinished jobs....
Still nothing in /var/crash to report :-/
Why is that - I have apport/whoopsie installed, the kernel is set up
$ sysctl -a | grep core_patt
kernel.
Also I have set
$ cat ~/.config/
[main]
unpackaged=true
This is armhf lxd on arm64 host - maybe apport has a guest/host problem here?
@Doko - do you happen to know if there are any extra hoops to jump through to get a crash report from gcc when it crashes in debuild?
Christian Ehrhardt (paelzer) wrote : | #16 |
Ok, apport through the stack of LXD is ... not working.
I have used a more mundane core pattern and a C test program to ensure I will get crash dumps.
$ cat /proc/sys/
/var/crash/
$ gcc test.c ; ./a.out ; ll /var/crash/
Segmentation fault (core dumped)
total 3
drwxrwsrwt 2 root whoopsie 3 Sep 24 05:48 ./
drwxr-xr-x 13 root root 15 Sep 23 09:40 ../
-rw------- 1 root whoopsie 208896 Sep 24 05:48 core.a.
Trying to run into the real gcc crash again with this ensured ...
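For reference, the "more mundane core pattern" setup above (truncated in the paste) can be reconstructed roughly as follows; the exact pattern string and the listed PID are assumptions, and the commands need root:

```
# echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern
# ulimit -c unlimited
# sh -c 'kill -SEGV $$'        # crash a throwaway shell to test the setup
Segmentation fault (core dumped)
# ls /var/crash                # example listing, PID is illustrative
core.sh.4242
```

Writing a plain file path into kernel.core_pattern bypasses the `|/usr/share/apport/apport` pipe handler entirely, which is why this works where apport inside the LXD container did not.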
Christian Ehrhardt (paelzer) wrote : | #17 |
Three reruns later I got
cc -iquote /root/qemu-
during RTL pass: reload
/root/qemu-
/root/qemu-
12479 | }
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
The bug is not reproducible, so it is likely a hardware or OS problem.
Again no gcc crash dump to be found - how is it disabling that ...?!?
I was reading through /usr/share/
Christian Ehrhardt (paelzer) wrote : | #18 |
Interim Summary:
- Hits armhf compiles of various large source projects; chances are it is completely random
  and just hits those more often because they compile more.
- The build system auto-retries the compiles and they eventually work on retry, reported as
  "The bug is not reproducible, so it is likely a hardware or OS problem."
- The bug always occurs on different source files; retrying a failed one works for hundreds
  of times, so it seems to be somewhat random when it hits and not tied to the source.
- It seems we need concurrency to trigger it, but again that might just have increased the
  likelihood.
- I can now trigger it reliably in ~2-8h of compile time on Canonistack when building qemu
  in an armhf LXD container on an arm64 host (same as the builders).
- Despite my attempts I'm unable to gather a crash dump of the gcc segfault and would be
  happy about a hint/advice on that.
Christian Ehrhardt (paelzer) wrote : | #19 |
Not sure if it is entirely random; it hit for the second time on
/<<PKGBUILDDI
in about 2 of the 8 hits I've had so far. Given how much code the build compiles, that is unlikely to be an accident.
Christian Ehrhardt (paelzer) wrote : | #20 |
I tried to isolate what was running concurrently and found 7 gcc calls.
I have set them up to run concurrently in endless loops each.
That way they reached a lot of iterations without triggering the issue :-/
I don't know how to continue :-/
But I can share a login to this system and show how to trigger the bug.
The following will get you there and trigger the bug usually in 1-2 loops (~4h on average)
$ ssh ubuntu@10.48.130.69
$ lxc exec groovy-gccfail bash
# cd qemu-5.0/
# i=1; export DEB_BUILD_
@Doko could you take over from here as I'd hope you know how to force gcc to give you a dump?
I imported your key to the system mentioned above.
Christian Ehrhardt (paelzer) wrote : | #21 |
This was brought up with Foundations in our sync last week, and it was mentioned that someone would look into it for further guidance on the case. Since nothing has happened I'll add the rls-gg-incoming tag to make sure it is revisited in your bug meetings.
I beg your pardon - I know it is your tag, and please feel free to remove it if it really is incorrect here - but I just want a response (more or less any) from someone able to decide whether this is actually critical (or not) and how to go on.
tags: | added: rls-gg-incoming |
Christian Ehrhardt (paelzer) wrote : | #22 |
There is a new gcc-10 version from two days ago in groovy now.
I was talking with doko and we wanted to try different gcc-10 versions in general, to narrow down when the issue started to appear.
https:/
https:/
https:/
https:/
I usually had the crash within 1-2 runs, so I will consider 4 good runs as the issue not being present. Although there is some randomness to this, I just can't wait much longer without a single test outgrowing a day :-/
I'll update once I've got more results.
Christian Ehrhardt (paelzer) wrote : | #23 |
Christian Ehrhardt (paelzer) wrote : | #24 |
Downloaded the other two as well and running on https:/
Christian Ehrhardt (paelzer) wrote : | #25 |
FYI: This has passed two runs cleanly by now, but that isn't enough. I need to have it running overnight to be sure about 10.1.
Christian Ehrhardt (paelzer) wrote : | #26 |
https:/
Now on https:/
Christian Ehrhardt (paelzer) wrote : | #27 |
So all 10.x that I could get fail:
https:/
https:/
https:/
https:/
Now looking which 9.x I could try ...
Christian Ehrhardt (paelzer) wrote : | #28 |
https:/
So the breakage was between 9.3.0-18ubuntu1 and 10-20200425-
How do we continue from here - will you throw me PPA builds, and/or do you still have debs anywhere that I should try?
Christian Ehrhardt (paelzer) wrote : | #29 |
Trying gcc-snapshot 1:20200917-1ubuntu1 now
Christian Ehrhardt (paelzer) wrote : | #30 |
gcc-snapshot 1:20200917-1ubuntu1 fails in other places.
/root/qemu-
0xf0afc3 internal_error(char const*, ...)
???:0
0x8fa705 verify_
???:0
0x5f644b rest_of_
???:0
0x1f61c7 finish_
???:0
0x246ef9 c_parser_
???:0
0x254d81 c_parse_file()
???:0
0x2a3305 c_common_
???:0
So gcc-snapshot is no good to try this :-/
Christian Ehrhardt (paelzer) wrote : | #31 |
Doko passed me gcc-10 - 10.2.0-14ubuntu0.1 from https:/
Still building on armhf, but I'll give those a try once complete.
Christian Ehrhardt (paelzer) wrote : | #32 |
As expected, the non-stripped build removed the dbgsym package:
The following packages will be REMOVED:
gcc-10-dbgsym
The following packages will be upgraded:
cpp-10 g++-10 gcc-10 gcc-10-base gcc-10-multilib libasan6 libatomic1 libcc1-0 libgcc-10-dev libgcc-s1 libgomp1 libsfasan6 libsfatomic1 libsfgcc-10-dev libsfgcc-s1 libsfgomp1 libsfubsan1
libstdc++-10-dev libstdc++-10-pic libstdc++6 libubsan1
21 upgraded, 0 newly installed, 1 to remove and 0 not upgraded.
This is now running and likely to crash later today.
But since I have failed to get a crash dump so far, how to get one will be the remaining issue we need to solve.
Christian Ehrhardt (paelzer) wrote : | #33 |
With this build the crash still does not leave a .crash file, but it is more verbose:
cc -iquote /root/qemu-
during RTL pass: reload
/root/qemu-
/root/qemu-
12479 | }
| ^
0x532d6b crash_signal
../../
0x523a5b avoid_constant_
../../
0x4f6f9d commutative_
../../
0x4f705b swap_commutativ
../../
0x51deb3 simplify_
../../
0x51df01 simplify_
../../
0x42c191 lra_constraints
../../
0x41f483 lra(_IO_FILE*)
../../
0x3f0915 do_reload
../../
0x3f0915 execute
../../
Does this help you in any way?
Christian Ehrhardt (paelzer) wrote : | #34 |
I'll re-run and dump a few of them just to help you to get to the root cause:
cc -iquote /root/qemu-
0x532d6b crash_signal
../../
0x41d0c7 add_regs_
../../
0x41d1c9 add_regs_
../../
0x41d1c9 add_regs_
../../
0x41e28f lra_update_
../../
0x41e3d5 lra_update_
../../
0x41e3d5 lra_push_insn_1
../../
0x436bb5 spill_pseudos
../../
0x436bb5 lra_spill()
../../
0x41f4ef lra(_IO_FILE*)
../../
0x3f0915 do_reload
../../
0x3f0915 execute
../../
Christian Ehrhardt (paelzer) wrote : | #35 |
cc -iquote /root/qemu-
0x532d6b crash_signal
../../
0x71769f thumb2_
../../
0x717c15 arm_legitimate_
../../
0x717c15 arm_legitimate_
../../
0x427eef valid_address_p
../../
0x427eef simplify_
../../
0x4287ed curr_insn_transform
../../
0x42c133 lra_constraints
../../
0x41f483 lra(_IO_FILE*)
../../
0x3f0915 do_reload
../../
0x3f0915 execute
../../
Christian Ehrhardt (paelzer) wrote : | #36 |
gcc-snapshot still has various issues - but not the crash
/root/qemu-
44 | };
| ^
...
/root/qemu-
Can't continue with gcc-snapshot due to those (even with the newer version).
Christian Ehrhardt (paelzer) wrote : | #37 |
Defaults:
# gcc -Q --help=target | grep -e '-marm' -e '-mthumb'
-marm [disabled]
-mthumb [enabled]
-mthumb-interwork [enabled]
Doko suggested to change that by using -marm.
This has been running for a while, but needs some more time to trigger ...
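One way to force -marm for a test build without patching qemu's build system is dpkg-buildflags' documented environment hooks. This is a hedged sketch - the flag plumbing actually used for this test run isn't recorded in the bug - and the `...` elides the distro's default flags:

```
$ export DEB_CFLAGS_APPEND=-marm
$ dpkg-buildflags --get CFLAGS
-g -O2 ... -marm
$ dpkg-buildpackage -b -uc -us    # rules that consult dpkg-buildflags pick this up
```

DEB_CFLAGS_APPEND appends after the defaults, so -marm wins over any earlier -mthumb in the flag set.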
Christian Ehrhardt (paelzer) wrote : | #38 |
@Doko - I can confirm that with -marm the issue is gone.
I have had 6 full runs yesterday and overnight.
We can conclude that -mthumb is a requirement to trigger the issue.
Christian Ehrhardt (paelzer) wrote : | #39 |
I spoke too soon - after ~7.5 runs I got the following with -marm:
cc -iquote /root/qemu-
during RTL pass: reload
/root/qemu-
/root/qemu-
12519 | }
| ^
cc -iquote /root/qemu-
Christian Ehrhardt (paelzer) wrote : | #40 |
FYI now Testing 10.2.0-14ubuntu0.2 from https:/
I've stopped setting -marm to trigger the issue "faster", please let me know if you want me to continue to use -marm for those tests.
Changed in groovy: | |
importance: | Unknown → Medium |
status: | Unknown → New |
Changed in gcc-10 (Ubuntu): | |
status: | New → Confirmed |
Changed in groovy: | |
status: | New → Confirmed |
Christian Ehrhardt (paelzer) wrote : | #89 |
Failed on #17
during RTL pass: reload
/root/qemu-
/root/qemu-
1535 | }
| ^
cc -iquote /root/qemu-
0x527c2f crash_signal
../../
0x4147bf add_regs_
../../
0x4148b3 add_regs_
../../
0x4148b3 add_regs_
../../
0x4158d9 lra_update_
../../
0x415a29 lra_update_
../../
0x415a29 lra_push_insn_1
../../
0x42dd53 spill_pseudos
../../
0x42dd53 lra_spill()
../../
0x416b1b lra(_IO_FILE*)
../../
0x3e84d1 do_reload
../../
0x3e84d1 execute
../../
Christian Ehrhardt (paelzer) wrote : | #90 |
I'm not yet sure what we should learn from that - do we need 30 runs of each step to be somewhat sure? That makes an already slow bisect even slower ...
Christian Ehrhardt (paelzer) wrote : | #91 |
FYI - another 8 runs without a crash on r10-7093.
My current working theory is that the root cause of the crash might have been added as early as r10-4054 but one or many later changes have increased the chance (think increase the race window or such) for the issue to trigger.
If that assumption is true, then with the current testcase it is nearly impossible to properly bisect the "original root cause", and at the same time it is still hard to find the change that increased the race window - since crashing early does not necessarily imply we are in the high-chance area.
We've had many runs with the base versions so that one is really good.
But any other good result we've had so far could - in theory - be challenged and needs ~30 good runs to be somewhat sure (phew, that will be a lot of time).
I'm marking the old runs that are debatable with good?<count-
Also we might want to look for just the "new" crash signature.
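As a sanity check on the "~30 good runs" figure: if a bad revision fails with probability p per run, the chance that n runs all pass anyway is (1-p)^n. Assuming p = 0.10 (roughly the observed rate on bad revisions - an assumption, not a measured constant), the break-even point can be computed directly:

```shell
# Find the smallest n where the chance of a false "good" verdict drops below 5%,
# assuming a per-run failure probability of 10%.
awk 'BEGIN {
    p = 0.10
    for (n = 1; n <= 60; n++) {
        miss = (1 - p) ^ n   # probability that all n runs pass despite the bug
        if (miss < 0.05) {
            printf "n = %d clean runs: P(all pass despite bug) = %.3f\n", n, miss
            exit
        }
    }
}'
```

With p = 0.10 this lands at 29 clean runs, which matches the ~30-run rule of thumb; a lower real failure rate would need correspondingly more runs.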
20190425 good
r10-1014
r10-2027 good?4
r10-2533
r10-3040 good?4
r10-3220
r10-3400 good?4
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good?5
r10-3727 good?3
r10-4054 other kind of bad - signature different, and rare?
r10-6080 good?10
r10-7093 bad, but slow to trigger
20200507 bad bad bad
Signatures:
r10-4054 arm_legitimate_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Of course it could be that the same root cause surfaces as two different signatures - but it could just as well be a multitude of issues. Therefore - for now - "add_regs_
With some luck (do we have any in this?) the 10 runs on 6080 are sufficient.
Let us try r10-6586 next and plan for 15-30 runs to be sure it is good.
If hitting the issue I'll still re-run it so we can compare multiple signatures.
Christian Ehrhardt (paelzer) wrote : | #92 |
Since this seems to become a reproducibility
Christian Ehrhardt (paelzer) wrote : | #93 |
r10-6586 - passed 27 good runs, no fails
Updated Result Overview:
20190425 good
r10-1014
r10-2027 good?4
r10-2533
r10-3040 good?4
r10-3220
r10-3400 good?4
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good?5
r10-3727 good?3
r10-4054 other kind of bad - signature different, and rare?
r10-6080 good?10
r10-6586 good?27
r10-7093 bad, but slow to trigger (2 of 19)
20200507 bad bad bad
Signatures:
r10-4054 arm_legitimate_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Next I'll run r10-7093 in this new setup.
@Doko - It would be great to have ~r10-6760 built for the likely next step.
Christian Ehrhardt (paelzer) wrote : | #94 |
Add another 1/3 fails to r10-7093
Now I am on the next two
- r10-6760
- r10-6839
Christian Ehrhardt (paelzer) wrote : | #95 |
2/7 runs of r10-6839 failed with
r10-6839 add_regs_
Next will be r10-6760
Christian Ehrhardt (paelzer) wrote : | #96 |
Updated Result Overview:
20190425 good
r10-1014
r10-2027 good?4
r10-2533
r10-3040 good?4
r10-3220
r10-3400 good?4
r10-3450
r10-3475
r10-3478
r10-3593
r10-3622
r10-3657 good?5
r10-3727 good?3
r10-4054 other kind of bad - signature different, and rare?
r10-6080 good?10
r10-6586 good?27
r10-6760 next
r10-6839 bad (2 of 9)
r10-7093 bad, but slow to trigger (2 of 19)
20200507 bad bad bad
Signatures:
r10-4054 arm_legitimate_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Christian Ehrhardt (paelzer) wrote : | #97 |
We'll need more runs to be sure, but so far r10-6760 seems good.
In preparation - could I request builds between r10-6760 and r10-6839, please?
Christian Ehrhardt (paelzer) wrote : | #98 |
Ok, r10-6760 reached 20 good runs and is considered good.
Doko was so kind to build 6779 6799 6819 for me - of which 6799 will be next.
Note: I've aligned the entries to all have the same style and dropped the untested revisions.
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6779 untested
r10-6799 next
r10-6819 untested
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Christian Ehrhardt (paelzer) wrote : | #99 |
FYI: r10-6799 had 14 good runs so far, I'll let it run for a bit longer to be sure.
Then - later today - if nothing changes r10-6819 will be next.
Christian Ehrhardt (paelzer) wrote : | #100 |
Completed 20 good runs on r10-6799, continuing with r10-6819 as planned.
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 next
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 extract_
20200507 avoid_constant_
20200507 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
ubu-10.2 add_regs_
Christian Ehrhardt (paelzer) wrote : | #101 |
r10-6819 had 22 good runs.
r10-6829 will be the next to try.
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6829 next
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #102 |
r10-6829 has 2 fails in 35 runs
Signature matches, both are: add_regs_
r10-6824 = next
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6824 next
r10-6829 bad 2 of 35
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6829 add_regs_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #103 |
r10-6824 bad 1 of 24, signature matches
We have only a few steps to go and need to increase the number of runs to be sure, so I'll let it run for a while longer.
Also - eventually - I'll re-run what we consider to be the last good, quite a few times to be sure.
Most likely I'll later today switch and test r10-6822 next.
Christian Ehrhardt (paelzer) wrote : | #104 |
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6822 next
r10-6824 bad 1 of 33
r10-6829 bad 2 of 35
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6824 add_regs_
r10-6829 add_regs_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #105 |
r10-6822 so far has 0 of 20, but I'll let it run another ~24h
Christian Ehrhardt (paelzer) wrote : | #106 |
r10-6822 seems good.
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6822 good 0 of 37
r10-6823 next
r10-6824 bad 1 of 33
r10-6829 bad 2 of 35
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6824 add_regs_
r10-6829 add_regs_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #107 |
r10-6823 bad 1 of 28
during RTL pass: reload
/root/qemu-
/root/qemu-
3298 | }
| ^
0x524cf3 crash_signal
0x411e07 add_regs_
0x411efb add_regs_
0x411efb add_regs_
0x411efb add_regs_
0x412f21 lra_update_
0x413071 lra_update_
0x413071 lra_push_insn_1
0x42b373 spill_pseudos
0x42b373 lra_spill()
0x414163 lra(_IO_FILE*)
0x3e5b9d do_reload
0x3e5b9d execute
Please submit a full bug report,
with preprocessed source if appropriate
I'll give the hopefully-good r10-6822 another few chances to fail because - as is obvious by now - it seems we can't rely much on these bisect results.
Afterwards I'll give 10.2.1-1 in Hirsute a try (requested by Doko)
Christian Ehrhardt (paelzer) wrote : | #108 |
Updated Result Overview:
20190425 good 0 of 13
r10-2027 good 0 of 4
r10-3040 good 0 of 4
r10-3400 good 0 of 4
r10-3657 good 0 of 5
r10-3727 good 0 of 3
r10-4054 other kind of bad 1 of 18 (signature different)
r10-6080 good 0 of 10
r10-6586 good 0 of 27
r10-6760 good 0 of 20
r10-6799 good 0 of 20
r10-6819 good 0 of 22
r10-6822 good 0 of 37 <- giving this more runs now
r10-6823 bad 1 of 28
r10-6824 bad 1 of 33
r10-6829 bad 2 of 35
r10-6839 bad 2 of 9
r10-7093 bad 2 of 19
20200507 bad 3 of 7
Signatures:
r10-4054 arm_legitimate_
r10-6823 add_regs_
r10-6824 add_regs_
r10-6829 add_regs_
r10-6839 add_regs_
r10-7093 add_regs_
r10-7093 add_regs_
20200507 add_regs_
20200507 avoid_constant_
20200507 extract_
ubu-10.2 add_regs_
ubu-10.2 add_regs_
ubu-10.2 avoid_constant_
ubu-10.2 thumb2_
Christian Ehrhardt (paelzer) wrote : | #109 |
As mentioned before, I didn't trust this result.
And with the likelihood of triggering being so low, we all know that results are unreliable.
Due to that, r10-6822 is now:
r10-6822 - bad 2 of 67
The signature was the same "add_regs_
What to do from here ...
We could bisect again starting with r10-6822 and 20190425 and use at least like 100 runs each.
But that would be a last resort, as I'm at ~1 run/h, which means ~4 days per step.
I have a few "maybe we are lucky" things to try first:
- 10.2.1-1 in hirsute
- trunk gcc-r11-5879.tar.xz
- Doing a run with -O1
Dimitri John Ledkov (xnox) wrote : | #110 |
"just retry the build" is our solution to this issue. It's a bit a waste of time hunting this all down at this point, unfortunately.
maybe we can try reproducing this on some publicly available hardware, i.e. graviton2 on aws. But also not sure how much value there is in doing this.
tags: |
added: rls-gg-notfixing removed: rls-gg-incoming |
Changed in gcc-10 (Ubuntu): | |
status: | Confirmed → Won't Fix |
affects: | groovy → gcc |
Christian Ehrhardt (paelzer) wrote : Re: [Bug 1890435] Re: gcc-10 breaks on armhf (flaky): internal compiler error: Segmentation fault | #112 |
On Thu, Dec 10, 2020 at 5:31 PM Dimitri John Ledkov
<email address hidden> wrote:
>
> "just retry the build" is our solution to this issue.
It is not - in hirsute the builds of the actual package on LP hit 100%
fail-rate.
Unfortunately not in the repro, but due to the above the workaround
currently is to build with gcc-9 on armhf.
But that is not a long term solution.
Therefore also this IMHO can't be won't fix
Changed in gcc-10 (Ubuntu): | |
status: | Won't Fix → New |
Christian Ehrhardt (paelzer) wrote : | #113 |
I'll give things a try in current Hirsute (gcc on 10.2.1, qemu on 5.2) building with gcc-10.
If we are back at a level where retries work I'm ok to lower severity.
I'll let you know about these results in a few days.
But since we have had the case of it reaching 100% breakage (and then being e.g. un-serviceable), I'm unsure if we should - even then - fully close it.
Christian Ehrhardt (paelzer) wrote : | #114 |
In the test env (not LP build infra, but canonistack) I've got 30 good runs on 10.2.1 which gives me some hope ...
Christian Ehrhardt (paelzer) wrote : | #115 |
Indeed, gcc-10.2.1 with qemu 5.2 no longer breaks 100% of the time.
Here is a good build log:
https:/
I'll need a few more builds anyway and will let you know.
As mentioned before that does lower severity, but not close the bug.
Changed in gcc-10 (Ubuntu): | |
status: | New → Confirmed |
importance: | Critical → Medium |
Christian Ehrhardt (paelzer) wrote : | #116 |
r11-5879 - bad 8 of 10
So we know:
a) the bug has not been fixed yet
b) as we've seen with later GCC 10 runs, the chance to trigger has further increased
Christian Ehrhardt (paelzer) wrote : | #117 |
I left r11-5879 running over the weekend and it concluded with 37 of 75 runs failing
That is ~50%
I'll look at -O1 next
Christian Ehrhardt (paelzer) wrote : | #118 |
It fails with -O1 as well, although I have to admit that different -O levels are deeply integrated in qemu's build system, so it is hard to override all of them. Therefore, while I set -O1 and that affected some builds, it doesn't imply that all compiler calls used -O1.
I know dannf has made some bare-metal tests and so far none of those have failed.
Unfortunately our builders are VM based, so that isn't very helpful anyway.
Nevertheless I've transported my test container over to a box to build there.
Trying to maas-deploy a few more chip types didn't work out, but maybe it will eventually with some help from the HWE team.
Christian Ehrhardt (paelzer) wrote : | #119 |
I was unable to trigger the issue on my rpi4 yet, but as you'd imagine it is rather slow.
But (thanks Dannf) I got access to an X-gene - and carrying my known bad setup there (LXD container export FTW) I was able to recreate this on bare-metal as well.
(Host) Kernel: 5.4.0-58-generic
Model: X-Gene - 8 cores
The guest is Hirsute building qemu 5.0 with r11-5879
I got two known bug signatures - once the common one we see most often, and once a different one (that we've seen before with 20200507).
This happened on the first two runs, once it has run some hours I'll post the rate of success-vs-fails as well.
--- ---
during RTL pass: reload
/root/qemu-
/root/qemu-
1535 | }
| ^
0x56715f crash_signal
0x4599ad add_regs_
0x459ab9 add_regs_
0x459ab9 add_regs_
0x45abc7 lra_update_
0x468985 lra_constraints
0x45bc15 lra(_IO_FILE*)
0x42d463 do_reload
0x42d463 execute
Please submit a full bug report,
--- ---
during RTL pass: reload
/root/qemu-
/root/qemu-
12479 | }
| ^
0x56715f crash_signal
0x527e35 extract_
0x52d84b extract_
0x52d84b decompose_
0x52d84b decompose_
0x52dbc3 decompose_
0x463551 process_address_1
0x464c47 process_address
0x464c47 curr_insn_transform
0x468913 lra_constraints
0x45bc15 lra(_IO_FILE*)
0x42d463 do_reload
0x42d463 execute
Please submit a full bug report,
Christian Ehrhardt (paelzer) wrote : | #120 |
The canonistack machines I used to crash it (and likely the LP builders) are X-Gene as well.
So we might have a chance to pin this to specific HW if there are other chip types I could use.
Christian Ehrhardt (paelzer) wrote : | #121 |
So far 2/4 failed of r11-5879 on X-Gene BareMetal.
Doko asked me to try if I could get these to fail with -j1 as well (in the past I was unable to do so, but it is worth a try).
Christian Ehrhardt (paelzer) wrote : | #122 |
On BareMetal now also triggered with -j1 (but there were multiple LXD containers each running -j1 to increase the chance to find it).
/root/qemu-
/root/qemu-
485 | }
| ^
0x56715f crash_signal
0x4599ad add_regs_
0x459ab9 add_regs_
0x459ab9 add_regs_
0x45abc7 lra_update_
0x468985 lra_constraints
0x45bc15 lra(_IO_FILE*)
0x42d463 do_reload
0x42d463 execute
Please submit a full bug report,
with preprocessed source if appropriate.
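The -j1-but-still-parallel setup can be sketched like this (container names are hypothetical; each build itself is -j1, the parallelism comes from running one build per LXD container):

```
$ for n in 1 2 3 4; do
>   lxc exec groovy-gccfail-$n -- sh -c \
>     'cd qemu-5.0 && dpkg-buildpackage -b -uc -us -j1' > build-$n.log 2>&1 &
> done; wait
```

This keeps overall system load high - matching the suspicion that concurrency matters - while ruling out make-level parallelism inside any single build.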
Christian Ehrhardt (paelzer) wrote : | #123 |
Just FYI - as we feared - this now starts to break SRUs and other service actions for qemu in Groovy. https:/
And without a better solution I'll need to trigger retry with fingers crossed.
In GCC Bugzilla #97323, Rguenth (rguenth) wrote : | #124 |
GCC 10.3 is being released, retargeting bugs to GCC 10.4.
Changed in gcc: | |
status: | Confirmed → In Progress |
Oibaf (oibaf) wrote : | #125 |
Is this still an issue? I was only able to reproduce it on groovy, which is now EOL.
In GCC Bugzilla #97323, Jakub-gcc (jakub-gcc) wrote : | #126 |
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
In GCC Bugzilla #97323, Rguenth (rguenth) wrote : | #127 |
GCC 10 branch is being closed.
In GCC Bugzilla #97323, Pinskia (pinskia) wrote : | #128 |
*** Bug 112791 has been marked as a duplicate of this bug. ***
Today I've seen this on DPDK: https://launchpadlibrarian.net/497142982/buildlog_ubuntu-groovy-armhf.dpdk_20.08-1ubuntu1~ppa1_BUILDING.txt.gz
And recently also on qemu again (but that was in the main archive and I couldn't hold back from hitting retry, on which it worked).
Is there anything in the pipeline that could address this and make it worth running a few recompiles?