Random test failures on Ubuntu

Bug #397912 reported by Lee Bieber
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Drizzle
Fix Released
Critical
Stewart Smith

Bug Description

With build 1092, and now using mtr2 we are getting random test failures on Ubuntu platforms

See http://solace.inaugust.com/builders/ubuntu904.32.1
or http://solace.inaugust.com/builders/ubuntu904.64.1

Also seeing this on my VirtualBox Ubuntu image and Jay's laptop

It always seems to be different test failures, most of the time having to do with timeouts.

Changed in drizzle:
assignee: nobody → Stewart Smith (stewart-flamingspork)
importance: Undecided → Critical
milestone: none → aloha
status: New → Confirmed
Revision history for this message
Lee Bieber (kalebral-deactivatedaccount) wrote :

Example failure:

main.year [ pass ] 13
timer 21594: expired after 90 seconds
main.consistent_snapshot [ fail ]
        Test ended at 2009-07-10 10:32:09

CURRENT_TEST: main.consistent_snapshot

Timeout [timer - pid: 21594, exit: 0] for 'check-testcase' expired after 90 seconds

 - saving '/home/lbieber/repos/drizzle-clean-code/tests/var/log/main.consistent_snapshot/' to '/home/lbieber/repos/drizzle-clean-code/tests/var/log/main.consistent_snapshot/'

Retrying test, attempt(2/3)...

main.consistent_snapshot [ retry-fail ]
        Test ended at 2009-07-10 10:32:26

CURRENT_TEST: main.consistent_snapshot
CURRENT_TEST: main.consistent_snapshot

Could not execute 'check-testcase' before testcase 'main.consistent_snapshot' (res: 1):
drizzletest: Could not open connection 'default': 20 drizzle_state_read:lost connection to server (104)
not ok

 - saving '/home/lbieber/repos/drizzle-clean-code/tests/var/log/main.consistent_snapshot/' to '/home/lbieber/repos/drizzle-clean-code/tests/var/log/main.consistent_snapshot/'

Test has failed 2 times, no more retries!

Revision history for this message
Jay Pipes (jaypipes) wrote : Re: [Bug 397912] Re: Random test failures on Ubuntu

This is mtr2 stuff. We are backing mtr2 out of trunk due to these
random test failures. We'll bake this branch some more before trying to
merge with trunk again...

Lee Bieber wrote:
> Example failure:
>
> main.year [ pass ] 13
> timer 21594: expired after 90 seconds
> main.consistent_snapshot [ fail ]
> Test ended at 2009-07-10 10:32:09
>
> CURRENT_TEST: main.consistent_snapshot
>
>
> Timeout [timer - pid: 21594, exit: 0] for 'check-testcase' expired after 90 seconds
>
> - saving '/home/lbieber/repos/drizzle-clean-
> code/tests/var/log/main.consistent_snapshot/' to '/home/lbieber/repos
> /drizzle-clean-code/tests/var/log/main.consistent_snapshot/'
>
> Retrying test, attempt(2/3)...
>
> main.consistent_snapshot [ retry-fail ]
> Test ended at 2009-07-10 10:32:26
>
> CURRENT_TEST: main.consistent_snapshot
> CURRENT_TEST: main.consistent_snapshot
>
>
> Could not execute 'check-testcase' before testcase 'main.consistent_snapshot' (res: 1):
> drizzletest: Could not open connection 'default': 20 drizzle_state_read:lost connection to server (104)
> not ok
>
>
> - saving '/home/lbieber/repos/drizzle-clean-code/tests/var/log/main.consistent_snapshot/' to '/home/lbieber/repos/drizzle-clean-code/tests/var/log/main.consistent_snapshot/'
>
> Test has failed 2 times, no more retries!
>

Revision history for this message
Stewart Smith (stewart) wrote :

could we please let me have a look at this before we back it out?

SSH access to the box would be nice - as have not seen this anywhere else.

At the very least, systems should be running latest (from bzr) libdrizzle.

Revision history for this message
Stewart Smith (stewart) wrote :

 need (at minimum) --parrallel level (printed at start of mtr2 run), libdrizzle version (must be new) and preferably ssh to a box.

Changed in drizzle:
status: Confirmed → Incomplete
Revision history for this message
Stewart Smith (stewart) wrote :

and what IO system do these boxes have behind them? is there disk thrashing during the test run?

other tasks running?

what if setting MTR_PARALLEL=2 or MTR_PARALLEL=1 in environment (and checking it gets passed down okay)? Does that still have any timeouts reported?

Revision history for this message
Monty Taylor (mordred) wrote :

Stewart Smith wrote:
> and what IO system do these boxes have behind them? is there disk
> thrashing during the test run?
>
> other tasks running?
>
> what if setting MTR_PARALLEL=2 or MTR_PARALLEL=1 in environment (and
> checking it gets passed down okay)? Does that still have any timeouts
> reported?
>

So - I hear all of that, and that will be good in tracking it down - but
someone with a slow computer shouldn't need to tune the test suite to be
able to run it. Needing to set env variables to lower the parallel count
to run make test also unacceptable.

How about we lower mtr2's aggressiveness with bailing out due to
timeouts in general? Like, set the timeout to 15 minutes or something?

Monty

Revision history for this message
Lee Bieber (kalebral-deactivatedaccount) wrote :

libdrizzle was out of date on the 3 machines i had tried. Will run tests several times to these machines to see if results are consistent after updating libdrizzle.

For the VirtualBox Ubuntu 904 image parallel=1, the other 2 Ubuntu 904 machines were parallel=4

Revision history for this message
Stewart Smith (stewart) wrote :

still not able to reprocude - even on build8, which Brian says has slow disk

./test-run.pl --parallel=8 --force

Spent 3596.725 of 803 seconds executing testcases

just much slower of course

Revision history for this message
Lee Bieber (kalebral-deactivatedaccount) wrote :

Ran 5 times
VirtualBox Ubuntu 904 - 2 tests failed, drizzledump and consistent_snapshot - see below for the output
Ubuntu 904 32bit - only drizzledump failed
Ubuntu 904 64bit - random failures as followed:
% grep "Failing" /tmp/log.*
/tmp/log.1:Failing test(s): main.drizzledump main.innodb_mysql main.consistent_snapshot
/tmp/log.2:Failing test(s): main.drizzledump main.endspace
/tmp/log.3:Failing test(s): main.drizzledump main.deadlock_innodb main.drizzleslap
/tmp/log.4:Failing test(s): main.drizzledump main.drizzleslap
/tmp/log.5:Failing test(s): main.drizzledump main.rowid_order_innodb main.drizzleslap

CURRENT_TEST: main.consistent_snapshot

Timeout [timer - pid: 25671, exit: 0] for 'check-testcase' expired after 90 seconds

 - saving '/home/lbieber/repos/drizzle-clean-code/tests/var/log/main.consistent_snapshot/' to '/home/lbieber/repos/drizzle-clean-code/tests/var/log/main.consistent_snapshot/'

Retrying test, attempt(2/3)...

main.consistent_snapshot [ retry-fail ]
        Test ended at 2009-07-11 07:18:57

CURRENT_TEST: main.consistent_snapshot
CURRENT_TEST: main.consistent_snapshot

Could not execute 'check-testcase' before testcase 'main.consistent_snapshot' (res: 1):
drizzletest: Could not open connection 'default': 20 drizzle_state_read:lost connection to server (104)
not ok

Revision history for this message
Stewart Smith (stewart) wrote :

drizzledump failure could be old libdrizzle or tree without the test case fix.

as for the random failures... this message:
drizzletest: Could not open connection 'default': 20 drizzle_state_read:lost connection to server (104)

is rather ominous... and wondering if server is disconnecting for some reason (or going away). possible to repeat at all under GDB?

Revision history for this message
Stewart Smith (stewart) wrote :

have been able to reproduce on one of the machines.

./test-run.pl --repeat=100 consistent_snapshot
./test-run.pl --repeat=200 consistent_snapshot

failed to reproduce a test failure.

however... running the full suite can.

Changed in drizzle:
status: Incomplete → Confirmed
Revision history for this message
Stewart Smith (stewart) wrote :

(gdb) bt
#0 0xb7f70424 in __kernel_vsyscall ()
#1 0xb7bd67a6 in nanosleep () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7bd65be in sleep () from /lib/tls/i686/cmov/libc.so.6
#3 0x080e95fd in ListenHandler::bindAll (this=0xbfa8ab48, host=0x0,
    bind_timeout=100) at listen.cc:185
#4 0x08056744 in main (argc=12, argv=0x28ff1eb3) at drizzled.cc:1598
(gdb) up
#1 0xb7bd67a6 in nanosleep () from /lib/tls/i686/cmov/libc.so.6
(gdb)
#2 0xb7bd65be in sleep () from /lib/tls/i686/cmov/libc.so.6
(gdb)
#3 0x080e95fd in ListenHandler::bindAll (this=0xbfa8ab48, host=0x0,
    bind_timeout=100) at listen.cc:185
185 sleep(this_wait);
(gdb) list
180 }
181
182 errmsg_printf(ERRMSG_LVL_INFO, _("Retrying bind() on %u"),
183 (*it)->getPort());
184 this_wait= retry * retry / 3 + 1;
185 sleep(this_wait);
186 }
187
188 if (ret < 0)
189 {
(gdb) c

at the same time, in another screen, "netstat -t tcp -l -p" was showing the port bound, but with no process against it... :/

Revision history for this message
Stewart Smith (stewart) wrote :

the not at all good patch that fixes the problem:

=== modified file 'client/drizzletest.cc'
--- client/drizzletest.cc 2009-07-07 09:06:29 +0000
+++ client/drizzletest.cc 2009-07-15 05:33:33 +0000
@@ -3679,7 +3679,7 @@
   int failed_attempts= 0;
   static uint32_t connection_retry_sleep= 100000; /* Microseconds */
   drizzle_return_t ret;
-
+sleep(2);
   drizzle_con_set_tcp(con, host, port);
   drizzle_con_set_auth(con, user, pass);
   drizzle_con_set_db(con, db);

The comment before the safe_connect function (which the above patches) is rather ominous:

  Sometimes in a test the client starts before
  the server - to solve the problem, we try again
  after some sleep if connection fails the first
  time

We seem to be getting a socket to the old drizzled, and the new drizzled is unable to bind the port.

Revision history for this message
Stewart Smith (stewart) wrote :

the drizzletest backtrace:

#0 0xb7f53424 in __kernel_vsyscall ()
#1 0xb7cedaab in poll () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7f409af in drizzle_con_wait (drizzle=0xa289500) at libdrizzle/conn.c:259
#3 0xb7f41611 in drizzle_state_read (con=0x80ae344) at libdrizzle/conn.c:878
#4 0xb7f47b80 in drizzle_state_loop (con=0x80ae344) at libdrizzle/state.c:24
#5 0xb7f40773 in drizzle_con_connect (con=0x1) at libdrizzle/conn.c:604
#6 0x0804f3d7 in safe_connect (con=0x80ae344, name=0xa289120 "default",
    host=0xa28920f "localhost", user=0xa2891ef "root", pass=0xa289330 "",
    db=0x0, port=<value optimized out>) at client/drizzletest.cc:3686
#7 0x0805b6f8 in main (argc=Cannot access memory at address 0x1
) at client/drizzletest.cc:5675

Revision history for this message
Stewart Smith (stewart) wrote :

netstat -l -t tcp output during problem:

tcp 0 0 *:12500 *:* LISTEN -
tcp6 0 0 [::]:12500 [::]:* LISTEN -

Revision history for this message
Stewart Smith (stewart) wrote :

err.. "netstat -t tcp -l -p" that should read.

Revision history for this message
Stewart Smith (stewart) wrote :
Download full text (9.9 KiB)

smells like:
http://lkml.indiana.edu/hypermail/linux/kernel/0304.0/1563.html

[Bug 546] New: Close notification in poll/select never arrives
From: Martin J. Bligh (<email address hidden>)
Date: Mon Apr 07 2003 - 10:05:13 EST

    * Next message: H. Peter Anvin: "Re: Re[2]: 845GE Chipset severe performance problems"
    * Previous message: Melkor Ainur: "Re: failure due to swapper and inet_sock_destruct"
    * Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

http://bugme.osdl.org/show_bug.cgi?id=546

           Summary: Close notification in poll/select never arrives
    Kernel Version: 2.5.65, in general any 2.4.x, 2.5.x kernel, possibly
                    also earlie
            Status: NEW
          Severity: normal
             Owner: <email address hidden>
         Submitter: <email address hidden>

Distribution: any, problem not sensitive or specific to any distribution
Hardware Environment: reproduces at least on i386, others not tested
Software Environment: glibc-2.2.5, gcc-2.95.3
                       (Many other combinations reproduce the bug, too. The
                        bug is not sensitive to user land config. Its a kernel
                        problem.)
History:

  This bug was originally reported by <email address hidden>
  to linux.kernel under "Subject: PROBLEM: I/O system call never returns
  if file desc is closed in the meantime". Report date was 2001-06-06.

  Thierry tells me that discussion was had back then, but was not constructive
  and died away without the bug being fixed. Not very positive experience
  about open software community's ability to get bugs fixed.

  Thierry resubmitted the bug to comp.os.linux.development.system under
  "Subject: Close notification in poll/select (works on all OS except Linux)"
  on 2003-01-13. He never got an answer. I guess there are a lot of cracks
  and this one is falling through one of them.

  I have hit the very same problems as Thierry and have verified his bug
  report to be factually correct. I used 2.6.65 kernel instead of the
  various 2.4.x kernels he used and got substantially same results.
  I find it incredible that such well researched bug report has not
  lead to resolution of the bug.

Problem Description:

Thierry's report from January:

  I have a problem which has not been resolved in the last few years:
  an interaction problem between close and poll/select.

  Consider a multithreaded program (pthreads). One thread reads from
  a file descriptor (typically a TCP or UDP socket). At some time,
  another thread decides to close(2) the file descriptor while the
  first thread is blocked, waiting for input data.

  If the reading thread uses read(2) or recv(2), the system call returns
  with an error status at the time the file descriptor is closed (expected
  behavior).

  But, since we also need a timeout on the read, we do not use read(2)
  or recv(2). We set the file descriptor in non-blocking mode and we
  use poll(2) with a timeout.

  The problem is that poll(2) does not return when the file descriptor
  is closed. On all other operating systems we have here (Solaris, HP-UX
  and even non-UNIX systems like OpenVMS and Windows), ...

Revision history for this message
Stewart Smith (stewart) wrote :

The above C program passes on Solaris. Not on OSX or Linux (even my box).

On my laptop, the above C program never finishes (were "never" is defined as "at least 7hrs 45mins").

Revision history for this message
Stewart Smith (stewart) wrote :

Never assume your code doesn't fork() and exec() a setuid binary (in this case, libuuid).

Need to set CLOEXEC on server socket.

Changed in drizzle:
status: Confirmed → Fix Committed
Changed in drizzle:
milestone: aloha → bell
Stewart Smith (stewart)
Changed in drizzle:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.