Bug #1777853 “intermittent failure github.com/juju/juju/provider...” : Bugs : Canonical Juju

intermittent failure github.com/juju/juju/provider/common.TestPackage github.com/juju/juju/provider/common_test.(*BootstrapSuite).TestWaitSSHKilledWaitingForDial

Bug #1777853 reported by John A Meinel
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
Medium
John A Meinel
2.5
Fix Released
Medium
John A Meinel

Bug Description

As seen here:
http://ci.jujucharms.com/job/github-merge-juju/615/testReport/junit/github/com_juju_juju_provider_common/TestPackage/

        panic: test timed out after 25m0s

goroutine 1241526 [running]:
testing.(*M).startAlarm.func1()
 /snap/go/2130/src/testing/testing.go:1240 +0xfc
created by time.goFunc
 /snap/go/2130/src/time/sleep.go:172 +0x44

goroutine 1 [chan receive, 25 minutes]:
testing.(*T).Run(0xc4204ca0f0, 0xc77fc2, 0xb, 0xcb85c8, 0x480c96)
 /snap/go/2130/src/testing/testing.go:825 +0x301
testing.runTests.func1(0xc4204ca000)
 /snap/go/2130/src/testing/testing.go:1063 +0x64
testing.tRunner(0xc4204ca000, 0xc4201a1df8)
 /snap/go/2130/src/testing/testing.go:777 +0xd0
testing.runTests(0xc4202370a0, 0x10cc740, 0x1, 0x1, 0x412299)
 /snap/go/2130/src/testing/testing.go:1061 +0x2c4
testing.(*M).Run(0xc4202a6e00, 0x0)
 /snap/go/2130/src/testing/testing.go:978 +0x171
main.main()
 _testmain.go:44 +0x151

goroutine 19 [syscall, 25 minutes]:
os/signal.signal_recv(0x0)
 /snap/go/2130/src/runtime/sigqueue.go:139 +0xa6
os/signal.loop()
 /snap/go/2130/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
 /snap/go/2130/src/os/signal/signal_unix.go:28 +0x41

goroutine 1466 [chan send, 24 minutes]:
github.com/juju/juju/provider/common_test.(*interruptOnDial).Addresses(0xc4206a0340, 0xd181e0, 0xc42067a2f8, 0x0, 0x0, 0x0, 0x0, 0x0)
 /workspace/src/github.com/juju/juju/provider/common/bootstrap_test.go:636 +0xb7
github.com/juju/juju/provider/common.WaitSSH(0xd17c00, 0xc4201872d0, 0xc42010c8a0, 0xd1bae0, 0x1100ed8, 0x0, 0x0, 0xd1ff00, 0xc4206a0340, 0xd181e0, ...)
 /workspace/src/github.com/juju/juju/provider/common/bootstrap.go:707 +0x4e4
github.com/juju/juju/provider/common_test.(*BootstrapSuite).TestWaitSSHKilledWaitingForDial(0xc4200f48f0, 0xc4204ca780)
 /workspace/src/github.com/juju/juju/provider/common/bootstrap_test.go:646 +0x17f
reflect.Value.call(0xc59760, 0xc4200f48f0, 0x5613, 0xc70ecb, 0x4, 0xc4200a8f70, 0x1, 0x1, 0x1103100, 0xc62720, ...)
 /snap/go/2130/src/reflect/value.go:447 +0x969
reflect.Value.Call(0xc59760, 0xc4200f48f0, 0x5613, 0xc42030ef70, 0x1, 0x1, 0xc4204ca870, 0xc42030ef07, 0x1)
 /snap/go/2130/src/reflect/value.go:308 +0xa4
gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1(0xc4204ca780)
 /workspace/src/gopkg.in/check.v1/check.go:772 +0x575
gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1(0xc4202a7180, 0xc4204ca780, 0xc4206a02e0)
 /workspace/src/gopkg.in/check.v1/check.go:666 +0x7c
created by gopkg.in/check%2ev1.(*suiteRunner).forkCall
 /workspace/src/gopkg.in/check.v1/check.go:663 +0x20c

...

At the very least, test code shouldn't hang for 25 minutes, so we need some form of timeout on the test to have it fail early.

John A Meinel (jameinel)
Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Revision history for this message
John A Meinel (jameinel) wrote :
Changed in juju:
status: Expired → Incomplete
Revision history for this message
John A Meinel (jameinel) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (6.1 KiB)

When it fails with that patch, this is the traceback:
PANIC: bootstrap_test.go:652: BootstrapSuite.TestWaitSSHKilledWaitingForDial

[LOG] 0:00.001 INFO juju.test Addresses called:
goroutine 9655 [running]:
runtime/debug.Stack(0x5dbfbe0, 0x9, 0xc0004a4c60)
 /usr/local/Cellar/go/1.11.5/libexec/src/runtime/debug/stack.go:24 +0xb5
github.com/juju/juju/provider/common_test.(*interruptOnDial).Addresses(0xc000607740, 0x608c180, 0xc0000bf1e0, 0x0, 0x0, 0x0, 0x0, 0x0)
 /Users/jameinel/dev/go/src/github.com/juju/juju/provider/common/bootstrap_test.go:639 +0x6b
github.com/juju/juju/provider/common.WaitSSH(0x60851a0, 0xc0000ff2d0, 0xc00051c840, 0x608d500, 0x7a66f38, 0x0, 0x0, 0x60993c0, 0xc000607740, 0x608c180, ...)
 /Users/jameinel/dev/go/src/github.com/juju/juju/provider/common/bootstrap.go:707 +0x546
github.com/juju/juju/provider/common_test.(*BootstrapSuite).TestWaitSSHKilledWaitingForDial(0xc0004d7110, 0xc0004ac2d0)
 /Users/jameinel/dev/go/src/github.com/juju/juju/provider/common/bootstrap_test.go:657 +0x2cd
reflect.Value.call(0x5d7b300, 0xc0004d7110, 0x5a13, 0x5db8a05, 0x4, 0xc000071f70, 0x1, 0x1, 0x18, 0x70b48a0, ...)
 /usr/local/Cellar/go/1.11.5/libexec/src/reflect/value.go:447 +0x635
reflect.Value.Call(0x5d7b300, 0xc0004d7110, 0x5a13, 0xc000071f70, 0x1, 0x1, 0xc0004ac3c0, 0xc00005b695, 0x403158c)
 /usr/local/Cellar/go/1.11.5/libexec/src/reflect/value.go:308 +0xc1
github.com/juju/juju/vendor/gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1(0xc0004ac2d0)
 /Users/jameinel/dev/go/src/github.com/juju/juju/vendor/gopkg.in/check.v1/check.go:772 +0xa05
github.com/juju/juju/vendor/gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1(0xc000240900, 0xc0004ac2d0, 0xc0006076c0)
 /Users/jameinel/dev/go/src/github.com/juju/juju/vendor/gopkg.in/check.v1/check.go:666 +0x8a
created by github.com/juju/juju/vendor/gopkg.in/check%2ev1.(*suiteRunner).forkCall
 /Users/jameinel/dev/go/src/github.com/juju/juju/vendor/gopkg.in/check.v1/check.go:663 +0x44d

[LOG] 0:00.002 DEBUG juju.provider.common connection attempt for 0.1.2.3 failed: mock connection failure to 0.1.2.3
[LOG] 0:00.003 DEBUG juju.provider.common connection attempt for 0.1.2.3 failed: mock connection failure to 0.1.2.3
[LOG] 0:00.004 INFO juju.test Addresses called:
goroutine 9655 [running]:
runtime/debug.Stack(0x5dbfbe0, 0x9, 0xc0004a4c60)
 /usr/local/Cellar/go/1.11.5/libexec/src/runtime/debug/stack.go:24 +0xb5
github.com/juju/juju/provider/common_test.(*interruptOnDial).Addresses(0xc000607740, 0x608c180, 0xc0000bf1e0, 0x0, 0x0, 0x0, 0x0, 0x0)
 /Users/jameinel/dev/go/src/github.com/juju/juju/provider/common/bootstrap_test.go:639 +0x6b
github.com/juju/juju/provider/common.WaitSSH(0x60851a0, 0xc0000ff2d0, 0xc00051c840, 0x608d500, 0x7a66f38, 0x0, 0x0, 0x60993c0, 0xc000607740, 0x608c180, ...)
 /Users/jameinel/dev/go/src/github.com/juju/juju/provider/common/bootstrap.go:707 +0x546
github.com/juju/juju/provider/common_test.(*BootstrapSuite).TestWaitSSHKilledWaitingForDial(0xc0004d7110, 0xc0004ac2d0)
 /Users/jameinel/dev/go/src/github.com/juju/juju/provider/common/bootstrap_test.go:657 +0x2cd
reflect.Value.call(0x5d7b300, 0xc0004d7110, 0x5a13, 0x5db8a05, 0x4, 0xc000071f70, 0x1, 0x1, 0x18, 0x70b48a0, ...)
 /usr/loc...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

All 3 of them were from this code:
  select {
  case <-pollAddresses.C:
   pollAddresses.Reset(opts.AddressesDelay)
   if err := inst.Refresh(ctx); err != nil {
    return "", fmt.Errorf("refreshing addresses: %v", err)
   }
   instanceStatus := inst.Status(ctx)
   if instanceStatus.Status == status.ProvisioningError {
    if instanceStatus.Message != "" {
     return "", errors.Errorf("instance provisioning failed (%v)", instanceStatus.Message)
    }
    return "", errors.Errorf("instance provisioning failed")
   }
   addresses, err := inst.Addresses(ctx)
   if err != nil {
    return "", fmt.Errorf("getting addresses: %v", err)
   }
   checker.UpdateAddresses(addresses)

So there is a question why pollAddresses.C is getting triggered 3 times before <-interrupted is being checked.
It uses "Reset(opts.AddressesDelay)"

We seems to be passing:
var testSSHTimeout = environs.BootstrapDialOpts{
 Timeout: coretesting.ShortWait,
 RetryDelay: 1 * time.Millisecond,
 AddressesDelay: 1 * time.Millisecond,
}

So that is resetting the Addresses to be called again after 1ms from when we just called it.

Adding this:
 func (i *interruptOnDial) Addresses(ctx context.ProviderCallContext) ([]network.Address, error) {
        // kill the tomb the second time Addresses is called
+ loggo.GetLogger("juju.test").Infof("Addresses called:\n%s", string(debug.Stack()))
+ time.Sleep(1*time.Millisecond)

Allows me to trigger the failure very often. Essentially the Addresses timer gets called, and then the timer will reset to wanting to fire again as we get to the point where we *might* look at Interrupt, and if we ever reset the timer and it triggers before we notice the interrupt, we are blocked.

Revision history for this message
John A Meinel (jameinel) wrote :

Just changing the size of the buffered channel to 10 prevents the test from failing after running 4000 times with the 'Sleep(1*ms)' in play.
With a buffer of just 1, the timer just needs to wake up before we notice the interrupt 1 time, with the bigger buffer, it needs to do so 9 times before it would fail.

I did, however, get it to fail if I put a Sleep(2*ms) with a buffer of 10. So we probably need a better mechanism.
Noticing that we don't *do* anything with the interrupt value, we can just close the channel, and then it will always return.

I changed the check to just see how many times Addresses() can get called, and use close(ch) instead of sending a message. With a time.Sleep(3ms), and a loop of 4000 attempts, I was able to get 14 calls in one test, and no failures.
So this seems to be a solid fix.

Revision history for this message
John A Meinel (jameinel) wrote :
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Incomplete → In Progress
milestone: none → 2.6-beta1
Revision history for this message
John A Meinel (jameinel) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Loading subscribers...

Remote bug watches

Bug watches keep track of this bug in other bug trackers.