MongoDB Memory corruption
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
GLibC | Fix Released | Medium | |
glibc (Ubuntu) | Fix Released | Undecided | Adam Conrad |
Xenial | Confirmed | Undecided | Adam Conrad |
Yakkety | Confirmed | Undecided | Adam Conrad |
Bug Description
== Comment: #0 - Calvin L. Sze <email address hidden> - 2016-11-01 23:09:10 ==
The team has moved to bare-metal Ubuntu 16.04. The problem still exists, so it is not related to virtualization.
Since the bug is complicated to reproduce, could we use a set of tools to collect data when it happens?
---Problem Description---
MongoDB has memory corruption issues that occur only on Ubuntu 16.04; they do not occur on Ubuntu 15.
Contact Information =Calvin Sze/Austin/IBM
---uname output---
Linux master 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:00:57 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
Machine Type = Model: 2.1 (pvr 004b 0201) Model name: POWER8E (raw), altivec supported
---System Hang---
the system is still alive
---Debugger---
A debugger is not configured
---Steps to Reproduce---
Unfortunately, not very easily. I had a test case that I was running on ubuntu1604-
About 3.5% of the test runs on ubuntu1604-
Hoping to be able to see the data that was being written and corrupting the stack, I manually injected a guard region into the stack of the failing functions as follows:
+namespace {
+
+class Canary {
+public:
+
+ static constexpr size_t kSize = 1024;
+
+ explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+ ::memset(
+ }
+
+ ~Canary() {
+ _verify();
+ }
+
+private:
+ static constexpr uint8_t kBits = 0xCD;
+ static constexpr size_t kChecksum = kSize * size_t(kBits);
+
+ void _verify() const noexcept {
+ invariant(
+ }
+
+ const volatile unsigned char* const _t;
+};
+
+} // namespace
+
Status bsonExtractFiel
+
+ volatile unsigned char* const cookie = static_
+ const Canary c(cookie);
+
When running with this, the invariant would sometimes fire. Examining the stack cookie under the debugger would show two consecutive bytes, always at an offset ending 0x...e, written as either 0 0, or 0 1, somewhere at random within the middle of the cookie.
This indicated that it was not a conventional stack smash, where we were writing past the end of a contiguous buffer. Instead it appeared that either the currently running thread had reached up some arbitrary and random amount on the stack and done either two one-byte writes, or an unaligned 2-byte write. Another possibility was that a local variable had been transferred to another thread, which had written to it.
However, while looking at the code to find such a thing, I realized that there was another possibility, which was that the bytes had never been written correctly in the first place. I changed the stack canary constructor to be:
+ explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+ ::memset(
+ _verify();
+ }
So that immediately after writing the byte pattern to the stack buffer, we verified the contents we wrote. Amazingly, this *failed*, with the same corruption as seen before. This leaves three possibilities: something overwrote the stack cookie region between the call to memset and the read-back, memset never wrote the bytes correctly, or memset wrote the bytes but the underlying physical memory never took the write.
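The patch above is truncated in this report, so the following is only a minimal self-contained sketch of the same idea, not the code MongoDB ran. The class name `StackCanary`, the `checksum()` and `intact()` helpers, and the use of `assert` in place of `invariant` are assumptions made for illustration. It fills a caller-supplied region with 0xCD, verifies the byte sum immediately after the fill, and verifies it again when the canary goes out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the Canary class in the report.
class StackCanary {
public:
    static constexpr std::size_t kSize = 1024;
    static constexpr std::uint8_t kBits = 0xCD;
    static constexpr std::size_t kChecksum = kSize * std::size_t(kBits);

    explicit StackCanary(unsigned char* region) noexcept : _region(region) {
        std::memset(_region, kBits, kSize);
        verify();  // read back right away, as in the revised constructor
    }

    ~StackCanary() { verify(); }

    // Sum every byte through a volatile pointer so the reads are not elided.
    std::size_t checksum() const noexcept {
        const volatile unsigned char* p = _region;
        std::size_t sum = 0;
        for (std::size_t i = 0; i < kSize; ++i) sum += p[i];
        return sum;
    }

    bool intact() const noexcept { return checksum() == kChecksum; }

private:
    // assert() is used here for brevity; it is compiled out under NDEBUG.
    void verify() const noexcept { assert(intact()); }

    unsigned char* const _region;
};
```

A caller would declare a `kSize`-byte buffer in the frame under test and construct a `StackCanary` over it at the top of the function. Keeping a non-const pointer in the class avoids any cast when calling `memset`.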
Stack trace output:
no
Oops output:
no
Userspace tool common name: MongoDB
Userspace rpm: mongod
The userspace tool has the following bit modes: 64bit
System Dump Info:
The system is not configured to capture a system dump.
Userspace tool obtained from project website: na
*Additional Instructions for Lilian Romero/Austin/IBM:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.
-Attach ltrace and strace of userspace application.
== Comment: #1 - Luciano Chavez <email address hidden> - 2016-11-02 08:41:47 ==
Normally, for userspace memory corruption problems I would recommend Valgrind's memcheck tool. However, if this works on other versions of Linux, one would want to compare the differences, such as whether you are using the same versions of mongodb, gcc, glibc and the kernel.
Has a standalone testcase been produced that shows the issue without mongodb?
== Comment: #2 - Steven J. Munroe <email address hidden> - 2016-11-02 10:27:40 ==
We really need that standalone test case.
Need to look at WHAT c++ is doing with memset. I suspect the compiler is short circuiting the function and inlining. That is what you would want for optimization, but we need to know so we can steer this to the correct team.
== Comment: #3 - Calvin L. Sze <email address hidden> - 2016-11-02 13:17:30 ==
Hi Luciano and Steve, thanks for the advice.
They don't have a standalone test case without MongoDB; I imagine it would take a while and would not be easy to produce. I am seeking your advice on how to approach this. The failure takes at least 24 - 48 hours of running to reproduce. Steve, do you have what you need for the C++ test, or is there something I need to ask the Mongo development team?
Thanks
== Comment: #4 - William J. Schmidt <email address hidden> - 2016-11-02 16:29:26 ==
(In reply to comment #3)
> Hi Luciano and Steve, Thanks for the advice,
>
> They don't have a standalone test case without Mongodb, I could imagine it
> take a while and probably not that easy to produce. I am seeking your
> advice how to approach this. The failure takes at least 24 - 48 hours
> running to reproduce. Steve, do you have what you needed for C++ test, or
> there is something I need to ask Mongo development team?
>
> Thanks
It's not yet clear to me that we have evidence of this being a problem in the toolchain. Does the last experiment (revised Canary constructor) ALWAYS fail, or does it also fail only every 24 - 48 hours? If the latter, then all we know is that stack corruption happens. There's no indication of where the wild pointer is coming from (application problem, compiler problem, etc.). If it does always fail, however, then I question the assertion that they can't provide a standalone test case.
We need something more concrete to work with.
Bill
== Comment: #5 - Calvin L. Sze <email address hidden> - 2016-11-03 18:08:33 ==
Could this ticket be viewed by external customer/ISV?
I am thinking about how to establish direct communication between the MongoDB development team and the experts/owner of the ticket, to bypass the middleman, me :-)
Here are the answers from Andrew, the MongoDB development director, to my 3 questions. He also added some comments.
Basically, there are 3 questions,
> 1. Is the mongoDB binary built with the gcc that came with the Linux distribution or with the IBM Advance Toolchain gcc?
We build our own GCC, but we have reproduced the issue with both our custom GCC, and the builtin linux distribution GCC. We have also reproduced with clang 3.9 built from source on the Ubuntu 16.04 POWER machine, so we do not think that this is a compiler issue (could still be a std library issue).
> 2. Does the last experiment (revised Canary constructor) ALWAYS fail, or does it also fail only every 24 - 48 hours?
No, we have never been able to construct a deterministic repro. We are only able to get it to fail after running the test a very large number of times.
> 3. Is there any way we can have a standalone test case without MongoDB?
We do not have such a repro at this time.
I do understand the position they are taking - it isn't a lot of information to go on, and most of the time the correct response to a mysterious software crash is to blame the software itself, not the surrounding ecosystem. However, we have a lot of *indirect* evidence that has made us skeptical that this is our bug. We would love to be proved wrong!
- The stack corruption has not reproduced on any other systems. We are running these same tests on every commit across dozens of Linux variants, and across four cpu architectures (x86_64, POWER, zSeries, ARMv8).
- We don't see crashes on other POWER, but we do on Ubuntu POWER.
- We don't see crashes on Windows, Solaris, OS X
- We have run the tests under the clang address sanitizer, with no reports.
- We have enabled the clang address sanitizer use-after-return detector, and found no results.
If this were a wild pointer in the MongoDB server process that was writing to the stack of other threads, we would expect to see corruption show up elsewhere, but we simply do not.
However, let's assume that this is a bug in our code that, for whatever reason, only reveals itself on POWER, and only on Ubuntu. We would still be interested in learning from the kernel team whether there are additional POWER-specific debugging techniques that we might be able to apply. In particular, we are interested in the ability to programmatically set/unset hardware watchpoints over the stack canary. Another possibility would be to mprotect the stack canary, but it is not clear to us whether it is valid to mprotect part of the stack, either in general, or on POWER.
We would be happy to hear any suggestions on how to proceed.
Thanks,
Andrew
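For reference, a minimal sketch of the mprotect idea mentioned above. It does not settle whether part of a thread stack may be protected; instead it assumes the guard region can live in a separate page-aligned mapping obtained from `mmap`. The class name `GuardPage` and its `arm()`/`disarm()` methods are hypothetical. `mprotect` works at page granularity, so a 1024-byte canary cannot be protected on its own, which is why the sketch guards a whole page whose size is queried at run time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: one anonymous page filled with 0xCD that can be
// switched to read-only so a stray store faults at the storing instruction.
class GuardPage {
public:
    GuardPage() noexcept
        : _size(static_cast<std::size_t>(::sysconf(_SC_PAGESIZE))),
          _addr(::mmap(nullptr, _size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)) {
        assert(_addr != MAP_FAILED);
        std::memset(_addr, 0xCD, _size);
    }

    ~GuardPage() { ::munmap(_addr, _size); }

    GuardPage(const GuardPage&) = delete;
    GuardPage& operator=(const GuardPage&) = delete;

    // Read-only: any write into the page now raises SIGSEGV.
    bool arm() noexcept { return ::mprotect(_addr, _size, PROT_READ) == 0; }

    // Restore read/write access.
    bool disarm() noexcept {
        return ::mprotect(_addr, _size, PROT_READ | PROT_WRITE) == 0;
    }

    const unsigned char* data() const noexcept {
        return static_cast<const unsigned char*>(_addr);
    }
    std::size_t size() const noexcept { return _size; }

private:
    std::size_t _size;  // declared first so it is initialized before _addr
    void* _addr;
};
```

With the page armed, a core dump taken at the resulting SIGSEGV would show the writer's own backtrace rather than the later checksum failure.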
== Comment: #6 - Steven J. Munroe <email address hidden> - 2016-11-03 18:34:30 ==
You could tell us which specific GCC version you are based on, and the configure options.
You could provide the disassemble of the canary code.
== Comment: #7 - William J. Schmidt <email address hidden> - 2016-11-03 23:01:55 ==
It would be useful to see what the Canary is compiled into, as Steve suggested. Let's make sure it's doing what we think it is.
Given we have multiple compilers producing the same results, we may want to think more about the runtime environment -- are you using the same glibc and libstdc++ in all cases? Clang at least would pick up the distro versions, as it doesn't provide its own.
One reason you see this on Ubuntu 16.04 and not on another linux distro is likely because of glibc level. The other linux's glibc is quite old by comparison. glibc 2.23, which appears on Ubuntu 16.04, is the first version to be compiled with -fstack-
I assume that glibc 2.23 was compiled with Ubuntu's version of gcc 5 that ships with the system, in case that becomes relevant.
I don't personally have a lot of experience with trying to debug something of this nature, in case we don't see something obvious from the disassembly of the canary. CCing Ulrich Weigand in case he has some ideas of other approaches to try.
== Comment: #9 - Ulrich Weigand <email address hidden> - 2016-11-04 12:21:48 ==
I don't really have any other great ideas either. Just two comments:
- Even though the original reporter mentioned they already tried clang's address sanitizer, I'd definitely still also try reproducing the problem under valgrind -- the two are different in what exactly they detect, and using both tools in a complex problem can only help.
- The Canary code sample above has strictly speaking undefined behavior, I think: it is calling memset on a const *. (The const_cast makes the warning go away, but doesn't actually cure the undefined behavior.) I don't *think* this will cause codegen changes in this example, but it cannot hurt to try to fix this and see if anything changes.
== Comment: #12 - Calvin L. Sze <email address hidden> - 2016-11-06 10:32:25 ==
Hi Bill, Thanks
I have asked Andrew, waiting for his confirmation.
== Comment: #14 - Calvin L. Sze <email address hidden> - 2016-11-06 10:56:49 ==
Hi Calvin -
I can provide the assembly of the function that contains the canary (the canary itself gets inlined), but I think it might just be easier if I uploaded a binary and an associated corefile? That way your engineers could disassemble the crashing function themselves in the debugger and see exactly what the state was at the time of the crash.
What is the best way for me to get that information to you?
Thanks,
Andrew
== Comment: #15 - Calvin L. Sze <email address hidden> - 2016-11-06 10:58:54 ==
Provided the binary and core information.
Note from Mongo;
I've uploaded a sample core file and the associated binary to your ftp
server as detailed above. The binary is named `mongod.power` and the core is
named `mongod.
You should expect to see a backtrace on the faulting thread which looks
like this (for the first few frames):
(gdb) bt
#0 0x00003fff997be5d0 in __libc_
at ../sysdeps/
#1 __GI_raise (sig=<optimized out>) at ../sysdeps/
#2 0x00003fff997c0c00 in __GI_abort () at abort.c:89
#3 0x00000000223c33e8 in mongo::
file=0x24131b38 "src/mongo/
line=<optimized out>) at src/mongo/
#4 0x00000000224bbc48 in mongo::(anonymous namespace)
this=<optimized out>) at src/mongo/
The "Canary::_verify" frame (number 4) has a local variable "_t" which is an
on-the-stack array and filled with "0xcd" for a span of 1024 bytes. Near the
end of this block we see two bytes of poisoned memory which were altered:
0x3fff5814c858: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c860: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c868: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0x01 0x00
0x3fff5814c870: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
0x3fff5814c878: 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd 0xcd
Note the two bytes set to values "0x01" and "0x00".
At the time of core-dump all the other threads seemed to be paused on system
calls such as "recv" or "__pthread_
when setting up our software canary, and checks the memory immediately after
its setup. We do not run any other functions on this thread between the
memory poisoning and the verification of the poisoning. All other threads
appear to be paused at this time.
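A small sketch of how the altered bytes in a dump like the one above can be located programmatically instead of by eye. The `Mismatch` struct and `findMismatches` function are hypothetical names, not part of the MongoDB patch. The function walks a buffer and records every offset whose value differs from the fill byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One byte that no longer holds the fill pattern.
struct Mismatch {
    std::size_t offset;  // offset from the start of the scanned buffer
    std::uint8_t value;  // the value found there
};

// Return every position in buf[0, len) whose byte differs from fill.
std::vector<Mismatch> findMismatches(const unsigned char* buf,
                                     std::size_t len,
                                     std::uint8_t fill) {
    assert(buf != nullptr);
    std::vector<Mismatch> out;
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] != fill) {
            out.push_back({i, buf[i]});
        }
    }
    return out;
}
```

Applied to the 40 bytes shown above, starting at 0x3fff5814c858, this reports offsets 0x16 and 0x17, that is addresses 0x3fff5814c86e and 0x3fff5814c86f, which matches the earlier observation that the corruption starts at an address ending in 0x...e.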
== Comment: #16 - Calvin L. Sze <email address hidden> - 2016-11-06 10:59:40 ==
A follow up message from Mongo
The function calling the canary code, which you'll want to possibly
disassemble is in frame 6:
#6 mongo::
out=
The lower numbered frames deal with the canary code itself.
== Comment: #17 - Calvin L. Sze <email address hidden> - 2016-11-06 11:03:46 ==
From Andrew,
>Given we have multiple compilers producing the same results, we may want to
>think more about the runtime environment -- are you using the same glibc and
>libstdc++ in all cases? Clang at least would pick up the distro versions, as
>it doesn't provide its own.
We have repro'd with three compilers:
- The system GCC, using system libstdc++ and system glibc
- Our hand-rolled GCC, using its own libstdc++, and system glibc
- One off clang-3.9 build, using system libstdc++, and system glibc.
Coincidentally, both system and hand-rolled GCC are 5.4.0, so there may not be as much variation there as hoped. We could try building with clang and libc++ to at least rule out libstdc++ as a factor.
>One reason you see this on Ubuntu 16.04 and not on the other linux distro is likely because of
>glibc level. The other linux distro's glibc is quite old by comparison. glibc 2.23, which
>appears on Ubuntu 16.04, is the first version to be compiled with
>-fstack-
I'm not sure I follow. Our software has been built with -fstack-
>So this doesn't necessarily mean that the
>bug doesn't exist elsewhere; it just means that the stack protector code isn't
>enabled to spot the problem. If the stack corruption is benign, then it
>wouldn't be noticed otherwise.
Yeah, still confused. I can definitely make the other linux distro box report a stack corruption:
[<email address hidden> ~]$ cat > boom.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct no_chars {
unsigned int len;
unsigned int data;
};
int main(int argc, char * argv[])
{
struct no_chars info = { };
if (argc < 3) {
return 1;
}
info.len = atoi(argv[1]);
memcpy(
return 0;
}
[<email address hidden> ~]$ gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-
[<email address hidden> ~]$ ./boom 64 AAAAAAAAAAAAAAA
*** stack smashing detected ***: ./boom terminated
Segmentation fault
I assume that glibc 2.23 was compiled with Ubuntu's version of gcc 5 that ships
with the system, in case that becomes relevant.
Correct, we have not made any changes to glibc - we are using the stock version that ships on the system.
== Comment: #18 - Calvin L. Sze <email address hidden> - 2016-11-06 11:04:24 ==
From Andrew
Also, I want to re-iterate that while we have definitely observed cases where the stack protector detects the stack corruption, we have also observed stack corruption within our own hand-rolled stack buffer, per the code posted earlier. The core dump that Adam provided is of this latter sort. So to some extent, this is independent of -fstack-
One thing that I have not yet ruled out is whether -fstack-
Still, it sounds like a worthwhile experiment, so I will see if I can still detect corruption in our hand-rolled stack canary when building without any form of -fstack-protector enabled.
== Comment: #19 - Calvin L. Sze <email address hidden> - 2016-11-06 11:05:58 ==
From Andrew,
I've performed this experiment, replacing our use of -fstack-
I have a core file and executable. Let me know if you would be interested in my providing those in addition to the files provided yesterday by Adam.
== Comment: #21 - William J. Schmidt <email address hidden> - 2016-11-07 11:10:54 ==
Andrew, thanks for all the details, and for the binary and core file! I'll start poking through them this morning. I've just been absorbing all the notes that Calvin dumped into our bug tracking system yesterday.
You can ignore what I was saying about -fstack-
While I'm looking at the binary, there are a couple of other things you might want to try:
- Replace ::memset with __builtin_memset with GCC to see whether that makes any difference;
- Try Ulrich Weigand's suggestions from comment #9;
- As you suggested, try clang + libc++ to try to rule libstdc++ in or out.
A couple of questions that may or may not prove relevant:
- You've mentioned you don't get the crashes on the other linux distro. Have you tried your modified canary on the other linux distro anyway? If we're certain the two systems behave differently with the canary that may help us in narrowing things down.
- Which version of the C++ standard are you compiling against? Is it just the default on all systems, or are you forcing a specific -std=...?
== Comment: #22 - William J. Schmidt <email address hidden> - 2016-11-07 12:18:41 ==
I'm having some difficulties with core file compatibility. I put your files on an Ubuntu 16.04.1 system, but I don't see quite the same results as you report under gdb, with libc and libgcc shared libs not at the correct address and a problem with the stack. There's a transcript below. I'm particularly concerned about the warning that the core file and executable may not match. Note also the report of stack corruption above frame #4, so I can't get to frame #6 to look at the register state. The library frames at #0-#3 are reporting the wrong information, which I assume to be because the libraries are at the wrong address.
For debug purposes it would probably be best to use the system compiler, just in case that wasn't the case here.
$ ls -l
total 1950688
-rw-r--r-- 1 wschmidt wschmidt 700141992 Nov 7 14:37 mongod.power
-rw-r--r-- 1 wschmidt wschmidt 1297350656 Nov 7 14:39 mongod.power.core
$ gdb mongod.power mongod.power.core
GNU gdb (Ubuntu 7.11.1-
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://
Find the GDB manual and other documentation resources online at:
<http://
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mongod.
warning: core file may not match specified executable file.
[New LWP 101461]
[New LWP 100045]
[New LWP 100062]
[New LWP 100056]
[New LWP 99983]
[New LWP 100052]
[New LWP 100054]
[New LWP 99892]
[New LWP 100051]
[New LWP 100048]
[New LWP 100007]
[New LWP 99868]
[New LWP 100059]
[New LWP 101459]
[New LWP 100001]
[New LWP 99986]
[New LWP 101403]
[New LWP 99980]
[New LWP 99882]
[New LWP 99893]
[New LWP 99877]
[New LWP 99872]
[New LWP 101462]
[New LWP 99874]
[New LWP 100058]
[New LWP 100231]
[New LWP 99994]
[New LWP 99873]
[New LWP 100003]
[New LWP 99993]
[New LWP 99879]
[New LWP 101398]
[New LWP 99891]
[New LWP 99880]
[New LWP 99910]
[New LWP 99895]
[New LWP 99901]
[New LWP 100011]
[New LWP 99974]
[New LWP 100049]
[New LWP 99898]
[New LWP 99875]
[New LWP 101460]
[New LWP 99878]
[New LWP 99871]
[New LWP 99896]
[New LWP 101954]
[New LWP 101406]
[New LWP 100015]
[New LWP 100068]
[New LWP 99984]
[New LWP 101519]
[New LWP 100053]
[New LWP 99996]
[New LWP 100050]
[New LWP 100055]
[New LWP 100057]
[New LWP 101807]
[New LWP 99890]
[New LWP 100004]
[New LWP 99884]
[New LWP 101437]
[New LWP 101455]
[New LWP 100013]
[New LWP 99894]
[New LWP 101411]
[New LWP 101457]
[New LWP 101431]
[New LWP 101458]
[New LWP 100443]
[New LWP 101438]
[New LWP 101414]
[New LWP 101433]
[New LWP 101784]
[New LWP 99979]
[New LWP 101397]
[New LWP 101402]
[New LWP 101401]
[New LWP 101435]
[New LWP 101405]
[New LWP 101423]
[New LWP 101425]
[New LWP 99897]
[New LWP 101419]
[New LWP 99989]
[New LWP 101409]
[New LWP 100008]
[New LWP 101410]
[New LWP 99998]
[New LWP 101413]
[New LWP 101469]
[New LWP 101418]
[New LWP 101427]
[New LWP 101399]
[New LWP 101235]
[New LWP 101396]
[New LWP 101421]
[New LWP 99990]
[New LWP 101407]
[New LWP 101480]
[New LWP 100060]
[New LWP 101499]
[New LWP 101506]
[New LWP 101395]
[New LWP 101415]
[New LWP 101400]
[New LWP 101412]
[New LWP 101408]
[New LWP 101420]
[New LWP 101416]
[New LWP 101492]
[New LWP 101513]
[New LWP 101782]
[New LWP 101404]
[New LWP 101481]
[New LWP 101417]
[New LWP 100067]
[New LWP 101429]
[New LWP 99883]
[New LWP 101430]
[New LWP 101436]
[New LWP 101454]
[New LWP 101428]
[New LWP 101422]
[New LWP 100108]
[New LWP 101434]
[New LWP 100064]
[New LWP 101453]
[New LWP 100061]
[New LWP 101426]
[New LWP 100066]
[New LWP 101452]
[New LWP 101439]
[New LWP 101456]
[New LWP 101451]
[New LWP 101450]
[New LWP 101432]
[New LWP 101449]
[New LWP 101424]
[New LWP 100065]
[New LWP 100063]
[New LWP 101448]
[New LWP 101447]
[New LWP 101446]
[New LWP 101445]
[New LWP 101444]
[New LWP 101443]
[New LWP 101442]
[New LWP 101441]
[New LWP 101440]
warning: .dynamic section for "/lib/powerpc64
warning: .dynamic section for "/lib/powerpc64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64
Core was generated by `/home/
Program terminated with signal SIGABRT, Aborted.
#0 0x00003fff997be5d0 in __copysign (y=<optimized out>, x=<optimized out>)
at ../sysdeps/
233 ../sysdeps/
[Current thread is 1 (Thread 0x3fff5814ec20 (LWP 101461))]
(gdb) bt
#0 0x00003fff997be5d0 in __copysign (y=<optimized out>, x=<optimized out>)
at ../sysdeps/
#1 __modf_power5plus (x=-6.277438562
at ../sysdeps/
#2 0x00003fff997be4f0 in ?? () from /lib/powerpc64l
#3 0x00003fff997c0c00 in ?? () at ../signal/
from /lib/powerpc64l
#4 0x00000000223c33e8 in mongo::
file=0x24131b38 "src/mongo/
line=<optimized out>) at src/mongo/
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) quit
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_
Target: powerpc64le-
Configured with: ../src/configure -v --with-
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$
I'll disassemble the binary and see if I can spot anything without the state information.
Oh, still waiting on permission to mirror the bug.
== Comment: #23 - William J. Schmidt <email address hidden> - 2016-11-07 13:39:45 ==
A little more information:
I've been looking at bsonExtractStri
8ebb3c: 71 c9 06 48 bl 9584ac <00000d72.
And later I see the call to invariantFailed:
8ebc44: e9 75 f0 4b bl 7f322c <_ZN5mongo15inv
So we've answered Steve's initial question about which memset we're using. This isn't being inlined by the compiler, but does an out-of-line dynamic call to the GLIBC_2.17 version.
I'm not sure whether GCC would inline a 1024-byte memset using __builtin_memset, or just end up calling out the same way, but it might be worth trying out that replacement, and disassembling bsonExtractStri
== Comment: #24 - William J. Schmidt <email address hidden> - 2016-11-07 13:50:04 ==
I forgot to mention that the ensuing code generation to accumulate the checksum and test it is completely straightforward and looks correct. So this looks like pretty strong evidence that the problem is in the GLIBC memset implementation.
8ebb3c: 71 c9 06 48 bl 9584ac <00000d72.
8ebb40: 18 00 41 e8 ld r2,24(r1)
8ebb44: 00 04 40 39 li r10,1024
8ebb48: 00 00 20 39 li r9,0
8ebb4c: a6 03 49 7d mtctr r10
8ebb50: 00 00 43 89 lbz r10,0(r3)
8ebb54: 01 00 63 38 addi r3,r3,1
8ebb58: 14 52 29 7d add r9,r9,r10
8ebb5c: f4 ff 00 42 bdnz 8ebb50 <_ZN5mongo22bso
8ebb60: 03 00 40 3d lis r10,3
8ebb64: 00 34 4a 61 ori r10,r10,13312
8ebb68: 00 50 a9 7f cmpd cr7,r9,r10
8ebb6c: c4 00 9e 40 bne cr7,8ebc30 <_ZN5mongo22bso
...
8ebc30: 44 ff 82 3c addis r4,r2,-188
8ebc34: 44 ff 62 3c addis r3,r2,-188
8ebc38: 3a 00 a0 38 li r5,58
8ebc3c: 38 aa 84 38 addi r4,r4,-21960
8ebc40: 60 aa 63 38 addi r3,r3,-21920
8ebc44: e9 75 f0 4b bl 7f322c <_ZN5mongo15inv
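As a cross-check on the disassembly above, the constant built in r10 by the `lis`/`ori` pair can be compared against the checksum the Canary class expects. The sketch below restates `kSize`, `kBits` and `kChecksum` from the original 1024-byte canary; `kFromAsm` is a name introduced here for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Values from the original Canary class.
constexpr std::size_t kSize = 1024;
constexpr std::uint8_t kBits = 0xCD;
constexpr std::size_t kChecksum = kSize * std::size_t(kBits);

// lis r10,3 places 3 in the upper halfword (3 << 16);
// ori r10,r10,13312 then ORs 13312 into the lower halfword.
constexpr std::size_t kFromAsm = (std::size_t{3} << 16) | 13312;

static_assert(kChecksum == 209920, "1024 * 0xCD");
static_assert(kFromAsm == kChecksum, "r10 holds the expected checksum");
```

The loop count of 1024 loaded into CTR likewise matches `kSize`, so the generated code sums exactly the bytes the source says it should and compares against the right total.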
== Comment: #28 - William J. Schmidt <email address hidden> - 2016-11-08 11:02:18 ==
Recording some information from email discussions.
(1) The customer is planning to attempt to use valgrind memcheck.
(2) The const cast problem with the canary has been fixed without changing the results.
(3) Prior to that fix, the canary was used on the RHEL system with no corruption detected, so this does seem to be Ubuntu-specific.
(4) -std=c++11 is used everywhere.
(5) The core and binary compatibility issues appear to be that they were generated on 16.10, not 16.04. New ones coming.
(6) The canary code now looks like:
+namespace {
+
+class Canary {
+public:
+
+ static constexpr size_t kSize = 2048;
+
+ explicit Canary(volatile unsigned char* const t) noexcept : _t(t) {
+ __builtin_
+ _verify();
+ }
+
+ ~Canary() {
+ _verify();
+ }
+
+private:
+ static constexpr uint8_t kBits = 0xCD;
+ static constexpr size_t kChecksum = kSize * size_t(kBits);
+
+ void _verify() const noexcept {
+ invariant(
+ }
+
+ const volatile unsigned char* const _t;
+};
+
+} // namespace
+
And its application in bsonExtractType
@@ -47,6 +82,10 @@ Status bsonExtractType
+
+ volatile unsigned char* const cookie = static_
+ const Canary c(cookie);
+
Status status = bsonExtractFiel
(7) Steve Munroe investigated memset and he and Andrew are in agreement that we can rule it out:
I looked at the memset_power8 code (memset is just an IFUNC resolver stub), and I don't see how this problem is caused by memset_power8.
First some observations:
The canary is allocated with alloca for a large power of 2 (1024 bytes).
Alloca returns quadword aligned memory as required to maintain quadword stack alignment.
For this case memset_power8 will quickly jump to the vector store loop (quadword x 8) all from the same register (a vector splat of the fill char).
With this code the failure modes could only be:
Overwrite by N*quadwords,
Underwrite by N*quadwords,
A repeated pattern every quadword.
But we are not seeing this. I also think we are back to a clobber by some other code.
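The shape argument above can be written down as a simple check. This is a sketch under the assumption that a fault in the vector store loop would show up as a corrupted run that starts on a 16-byte boundary and spans a whole number of quadwords; the function name `looksLikeQuadwordFault` is hypothetical, and the repeated-pattern case is not modeled.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kQuadword = 16;  // bytes per vector store

// True when a corrupted run starting at address `start` and spanning
// `length` bytes is quadword-aligned and a whole number of quadwords long.
constexpr bool looksLikeQuadwordFault(std::uint64_t start, std::size_t length) {
    return length != 0 && start % kQuadword == 0 && length % kQuadword == 0;
}

// The run seen in the core from comment #15: 2 bytes at 0x3fff5814c86e.
static_assert(!looksLikeQuadwordFault(0x3fff5814c86eULL, 2),
              "observed corruption is not quadword-shaped");
```

The observed run fails both conditions: it starts 14 bytes into a quadword and is only 2 bytes long.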
== Comment: #29 - William J. Schmidt <email address hidden> - 2016-11-08 11:03:33 ==
From Andrew, difficulties with Valgrind:
I did try the valgrind repro. However, I'm not able to make valgrind work:
The first try resulted in lots of "mismatched free/delete" reports, which is sort of odd, because they all seem to be from within the standard library:
> valgrind --soname-
==17387== Memcheck, a memory error detector
==17387== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17387== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17387== Command: ./mongos
==17387==
==17387== Mismatched free() / delete / delete []
==17387== at 0x4895888: free (in /usr/lib/
==17387== by 0x59514F: deallocate (new_allocator.
==17387== by 0x59514F: deallocate (alloc_
==17387== by 0x59514F: _M_deallocate_
==17387== by 0x59514F: _M_deallocate_
==17387== by 0x59514F: _M_deallocate_
==17387== by 0x59514F: _M_rehash_aux (hashtable.h:1999)
==17387== by 0x59514F: std::_Hashtable
==17387== by 0x595253: std::_Hashtable
==17387== by 0x5954D3: std::__
==17387== by 0x593693: operator[] (unordered_
==17387== by 0x593693: mongo::
==17387== by 0x591057: mongo::
==17387== by 0x52D46F: __static_
==17387== by 0x137FED3: __libc_csu_init (in /home/acm/
==17387== by 0x4F830A7: generic_
==17387== by 0x4F83337: (below main) (libc-start.c:116)
==17387== Address 0x5151fb0 is 0 bytes inside a block of size 16 alloc'd
==17387== at 0x48951D4: operator new[](unsigned long) (in /usr/lib/
==17387== by 0x59328F: allocate (new_allocator.
==17387== by 0x59328F: allocate (alloc_
==17387== by 0x59328F: std::__
==17387== by 0x595093: _M_allocate_buckets (hashtable.h:347)
==17387== by 0x595093: _M_rehash_aux (hashtable.h:1974)
==17387== by 0x595093: std::_Hashtable
==17387== by 0x595253: std::_Hashtable
==17387== by 0x5954D3: std::__
==17387== by 0x59356B: operator[] (unordered_
==17387== by 0x59356B: mongo::
==17387== by 0x591057: mongo::
==17387== by 0x52D46F: __static_
==17387== by 0x137FED3: __libc_csu_init (in /home/acm/
==17387== by 0x4F830A7: generic_
==17387== by 0x4F83337: (below main) (libc-start.c:116)
So, that is a puzzle. However, I can instruct valgrind to ignore that. But it still fails to start, now with something more odd:
$ valgrind --show-
==19834== Memcheck, a memory error detector
==19834== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==19834== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==19834== Command: ./mongos
==19834==
MC_(get_
Memcheck: mc_machine.c:329 (get_otrack_
host stacktrace:
==19834== at 0x3808D9B8: ??? (in /usr/lib/
==19834== by 0x3808DB5F: ??? (in /usr/lib/
==19834== by 0x3808DCDB: ??? (in /usr/lib/
==19834== by 0x38078CE3: ??? (in /usr/lib/
==19834== by 0x38076FAB: ??? (in /usr/lib/
==19834== by 0x380BAA2B: ??? (in /usr/lib/
==19834== by 0x381B9BB7: ??? (in /usr/lib/
==19834== by 0x380BE19F: ??? (in /usr/lib/
==19834== by 0x3810D04F: ??? (in /usr/lib/
==19834== by 0x3810FFEF: ??? (in /usr/lib/
==19834== by 0x3812BB97: ??? (in /usr/lib/
sched status:
running_tid=1
Thread 1: status = VgTs_Runnable (lwpid 19834)
==19834== at 0x4F3AC14: __lll_lock_elision (elision-lock.c:60)
==19834== by 0x4F2BBC7: pthread_mutex_lock (pthread_
==19834== by 0x602753: mongo::
==19834== by 0x5319EB: __static_
==19834== by 0x5319EB: _GLOBAL_
==19834== by 0x137FED3: __libc_csu_init (in /home/acm/
==19834== by 0x4F830A7: generic_
==19834== by 0x4F83337: (below main) (libc-start.c:116)
Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.
If that doesn't help, please report this bug to: www.valgrind.org
In the bug report, send all the above text, the valgrind
version, and what OS and version you are using. Thanks.
I'm not really sure what to make of that, except that I did see something die in the same place, once or twice (__lll_
Anyway, it doesn't seem like I can get this running with valgrind. Happy to try again if anyone is aware of a workaround.
== Comment: #30 - William J. Schmidt <email address hidden> - 2016-11-08 11:06:00 ==
CCing Carl Love. Carl, have you seen this sort of interaction between valgrind and lock elision before? (Comment #29, you can ignore the rest of this bugzilla for now.)
tags: | added: architecture-ppc64le bugnameltc-148069 severity-critical targetmilestone-inin16045 |
Changed in ubuntu: | |
assignee: | nobody → Taco Screen team (taco-screen-team) |
affects: | ubuntu → gcc-4.8 (Ubuntu) |
affects: | gcc-4.8 (Ubuntu) → gcc-5 (Ubuntu) |
Changed in glibc (Ubuntu): | |
assignee: | Taco Screen team (taco-screen-team) → Adam Conrad (adconrad) |
Changed in glibc (Ubuntu Xenial): | |
assignee: | nobody → Adam Conrad (adconrad) |
Changed in glibc (Ubuntu Yakkety): | |
assignee: | nobody → Adam Conrad (adconrad) |
Changed in glibc: | |
importance: | Unknown → Medium |
status: | Unknown → Fix Released |
------- Comment From <email address hidden> 2016-11-09 10:52 EDT-------
Hello Canonical,
Sending this bug to you for awareness and advice.