Comment 1 for bug 1412911

Aruna Sadashiva (aruna-sadashiva) wrote:

Jim's analysis:

What I found appears to be a case of memory corruption … though I cannot really tell exactly what piece of memory was corrupted nor what piece of code might have done the damage. The general nature of the corruption is that there is something wrong with the data structures used by malloc/free to keep track of what chunks of memory are free.

The stack backtrace starts with:
#0 0x00007ffff4a458a5 in raise () from /lib64/libc.so.6
#1 0x00007ffff4a4700d in abort () from /lib64/libc.so.6
#2 0x00007ffff4a837b7 in __libc_message () from /lib64/libc.so.6
#3 0x00007ffff4a890e6 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007ffff4a8bc13 in _int_free () from /lib64/libc.so.6
#5 0x00007ffff1ad244c in llvm::MachineFunction::DeleteMachineInstr(llvm::MachineInstr*) () from /opt/home/trafodion/traf_jan17/export/lib64/libtdm_sqlexp.so

From looking at the CPU registers, I believe I found the error message that was being printed by _int_free() (frame #4 above). It reads:
             free(): invalid next size (normal)

I wish we had the source for libc so I could determine what that message means in this particular case. When I googled the error message, all I found were examples of buggy code (usually by programmers new to C/C++) that produced this error message or similar ones.
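For reference, as far as I understand glibc's allocator, this message is printed when free() looks at the chunk that follows the one being freed and finds a size field that is implausibly small or larger than the arena; "(normal)" marks the non-fastbin path. The usual cause is an earlier write that ran past the end of the preceding buffer. The sketch below is only a toy model of that idea: ToyHeap, its layout, and its thresholds are hypothetical and are not glibc's real data structures or code (the real size field also carries flag bits, which are ignored here).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model only: a flat byte array in which each chunk begins with a
// size_t header holding the total chunk size (header + payload).
// This is an illustration, NOT glibc's actual chunk layout.
struct ToyHeap {
    static constexpr std::size_t kHeader = sizeof(std::size_t);
    std::vector<unsigned char> bytes;

    explicit ToyHeap(std::size_t n) : bytes(n, 0) {}

    // Write a chunk header at `offset`; return the offset of its payload.
    std::size_t place_chunk(std::size_t offset, std::size_t chunk_size) {
        assert(offset + kHeader <= bytes.size());
        std::memcpy(&bytes[offset], &chunk_size, kHeader);
        return offset + kHeader;
    }

    // Read the size field stored in the header at `offset`.
    std::size_t size_at(std::size_t offset) const {
        std::size_t s = 0;
        std::memcpy(&s, &bytes[offset], kHeader);
        return s;
    }

    // Sanity check in the spirit of "invalid next size": the chunk that
    // follows the one being freed must have a plausible size field.
    bool next_size_is_sane(std::size_t chunk_offset) const {
        std::size_t next_offset = chunk_offset + size_at(chunk_offset);
        if (next_offset + kHeader > bytes.size()) return false;
        std::size_t next = size_at(next_offset);
        return next > 2 * kHeader && next < bytes.size();
    }
};

// Chunk A's payload is overrun, which clobbers chunk B's header.
// Nothing is reported at the time of the write; the damage only becomes
// visible later, when A is "freed" and the check on B's size fails.
bool demo_overflow_breaks_next_size() {
    ToyHeap heap(256);
    std::size_t a_payload = heap.place_chunk(0, 32);  // chunk A
    heap.place_chunk(32, 64);                         // chunk B, right after A
    bool sane_before = heap.next_size_is_sane(0);

    // A's payload holds only 32 - kHeader bytes; writing 32 bytes spills
    // into B's header (still inside the vector, so the demo itself is safe).
    std::memset(&heap.bytes[a_payload], 0, 32);

    bool sane_after = heap.next_size_is_sane(0);
    return sane_before && !sane_after;
}
```

Calling demo_overflow_breaks_next_size() returns true: the check passes before the overrun and fails after it. The point for this bug is that the code that detects the damage (the free inside DeleteMachineInstr) need not be the code that caused it.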

Now, when I look further back in the stack backtrace, I find the query that is being compiled:
#31 0x00007fffef7bdc20 in CmpMain::compile (this=0x7fffd14bdb60,
    input_str=0x7fffd09166f8 "select trim(O.catalog_name || '.' || '\"' || O.schema_name || '\"' || '.' || '\"' || O.object_name || '\"' ) constr_name, trim(O2.catalog_name || '.' || '\"' || O2.schema_name || '\"' || '.' || '\"' || O2.ob"..., charset=15, queryExpr=@0x7fffd14bda98, . . . )
    at ../sqlcomp/CmpMain.cpp:2408

That particular select query is one of the standard metadata queries that the Compiler does every time a user query references a table. It is one of the queries that the Compiler uses to determine various attributes of the user table.

That means that the Compiler is working on a query which we have compiled thousands (or millions) of times in all the testing that we have done over the past few months. It is not working on anything strange or unusual.

I looked at the stack backtraces for the other 36 threads in this mxosrvr process, but they all appear to be doing perfectly normal things … mostly in pthread_cond_wait() or otherwise asleep waiting for something. No smoking guns.

Finally, when I look at the stack backtrace, the first 18 frames are in libc.so and libtdm_sqlexp.so, both of which are linked without debug symbols, so none of those 18 stack frames even shows a line number. [Welcome to the wonderful world of Release-Mode builds.] At the next frame up, I find we are in PCodeCfg::layoutNativeCode, where it calls llvm::JIT::runJITOnFunction(), which is the normal entry point into LLVM code that is called by “Native Expressions.” So, nothing out of the ordinary here either.

Unless we can find a way to reliably reproduce this problem OR at least be able to reproduce it with a Debug Mode build, I don’t know how to make any additional progress on this problem.