It seems that if the test fails, the failure then puts the whole package into race issues:
http://10.125.0.203:8080/job/RunUnittests-race-amd64/194/testReport/junit/github/com_juju_juju_worker_caasunitprovisioner/TestPackage/
github.com/juju/juju/worker/caasunitprovisioner/worker_test.go
WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec
(It's possible the race caused the failure, but it looks more like the failure triggers cleanup paths that race.)
FAIL: worker_test.go:716: WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec
worker_test.go:738:
c.Fatal("timed out sending units change")
... Error: timed out sending units change
worker_test.go:738:
c.Fatal("timed out sending units change")
/workspace/src/github.com/juju/juju/worker/workertest/check.go:58:
c.Errorf("timed out waiting for worker to stop")
... Error: timed out waiting for worker to stop
worker_test.go:738:
c.Fatal("timed out sending units change")
/workspace/src/github.com/juju/juju/worker/workertest/check.go:83:
c.Check(err, jc.ErrorIsNil)
... value *errors.Err = &errors.Err{message:"workertest: worker not stopping", cause:error(nil), previous:error(nil), file:"github.com/juju/juju/worker/workertest/check.go", line:59} ("workertest: worker not stopping")
... error stack:
github.com/juju/juju/worker/workertest/check.go:59: workertest: worker not stopping
==================
WARNING: DATA RACE
Write at 0x00c420110af8 by goroutine 138:
github.com/juju/juju/worker/caasunitprovisioner_test.(*WorkerSuite).SetUpTest()
/workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker_test.go:113 +0x94f
runtime.call32()
/snap/go/1473/src/runtime/asm_amd64.s:573 +0x3a
reflect.Value.Call()
/snap/go/1473/src/reflect/value.go:308 +0xc0
gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
/workspace/src/gopkg.in/check.v1/check.go:721 +0x177
gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
/workspace/src/gopkg.in/check.v1/check.go:666 +0x89
Previous write at 0x00c420110af8 by goroutine 140:
sync/atomic.AddInt32()
/snap/go/1473/src/runtime/race_amd64.s:269 +0xb
sync.(*Mutex).Unlock()
/snap/go/1473/src/sync/mutex.go:182 +0x54
github.com/juju/testing.(*Stub).addCall()
/workspace/src/github.com/juju/testing/stub.go:145 +0x2bc
github.com/juju/testing.(*Stub).MethodCall()
/workspace/src/github.com/juju/testing/stub.go:174 +0x88
github.com/juju/juju/worker/caasunitprovisioner_test.(*mockContainerBroker).DeleteUnit()
/workspace/src/github.com/juju/juju/worker/caasunitprovisioner/mock_test.go:95 +0x113
github.com/juju/juju/worker/caasunitprovisioner.(*applicationWorker).loop()
/workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:225 +0xbdf
github.com/juju/juju/worker/caasunitprovisioner.(*applicationWorker).(github.com/juju/juju/worker/caasunitprovisioner.loop)-fm()
/workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:68 +0x41
github.com/juju/juju/worker/catacomb.runSafely()
/workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:289 +0x66
github.com/juju/juju/worker/catacomb.Invoke.func3()
/workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:116 +0x8e
Goroutine 138 (running) created at:
gopkg.in/check%2ev1.(*suiteRunner).forkCall()
/workspace/src/gopkg.in/check.v1/check.go:663 +0x419
gopkg.in/check%2ev1.(*suiteRunner).runFunc()
/workspace/src/gopkg.in/check.v1/check.go:673 +0x7e
gopkg.in/check%2ev1.(*suiteRunner).runFixture()
/workspace/src/gopkg.in/check.v1/check.go:717 +0x7e
gopkg.in/check%2ev1.(*suiteRunner).runFixtureWithPanic()
/workspace/src/gopkg.in/check.v1/check.go:735 +0xa7
gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
/workspace/src/gopkg.in/check.v1/check.go:760 +0x207
gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
/workspace/src/gopkg.in/check.v1/check.go:666 +0x89
Goroutine 140 (running) created at:
github.com/juju/juju/worker/catacomb.Invoke()
/workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:113 +0x273
github.com/juju/juju/worker/caasunitprovisioner.newApplicationWorker()
/workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:66 +0x46d
github.com/juju/juju/worker/caasunitprovisioner.(*provisioner).loop()
/workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker.go:148 +0x61c
github.com/juju/juju/worker/caasunitprovisioner.(*provisioner).(github.com/juju/juju/worker/caasunitprovisioner.loop)-fm()
/workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker.go:68 +0x41
github.com/juju/juju/worker/catacomb.runSafely()
/workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:289 +0x66
github.com/juju/juju/worker/catacomb.Invoke.func3()
/workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:116 +0x8e
==================
Digging through it, this is what I see:
worker/caasunitprovisioner/worker.go:121 is a bare channel send that could easily block forever:

	w, ok := appWorkers[appId]
	if ok {
		// Before stopping the application worker, inform it that
		// the app is gone so it has a chance to clean up.
		// The worker will act on the removed prior to processing the
		// Stop() request.
		p.appRemoved <- struct{}{}
		if err := worker.Stop(w); err != nil {
			logger.Errorf("stopping application worker for %v: %v", appId, err)
		}
		delete(appWorkers, appId)
	}
We are also sending on a channel that is being listened to by *all* application workers. There is no guarantee that the worker responsible for appId will be the one that actually handles it in application_worker.go:134:
case <-aw.appRemoved:
So if you ever have 2 applications running, stopping one of them will randomly kill either of the two, since whichever worker receives that signal forcibly destroys all of its units.
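To make that concrete, here is a small standalone toy program (not Juju code; all names are invented for illustration) showing that a single send on a channel shared by several receivers is delivered to an arbitrary one of them:

	package main

	import (
		"fmt"
		"sync"
		"time"
	)

	func main() {
		// One unbuffered channel shared by two "application workers".
		appRemoved := make(chan struct{})

		var wg sync.WaitGroup
		for _, name := range []string{"app-a", "app-b"} {
			wg.Add(1)
			go func(name string) {
				defer wg.Done()
				select {
				case <-appRemoved:
					fmt.Printf("%s: got appRemoved, destroying all units\n", name)
				case <-time.After(time.Second):
					fmt.Printf("%s: no signal, still running\n", name)
				}
			}(name)
		}

		// The caller only intends to remove app-a, but the runtime is free
		// to deliver the send to whichever worker is ready first.
		appRemoved <- struct{}{}
		wg.Wait()
	}

Run it a few times and either worker can print the "destroying all units" line.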
What *should* happen if we are trying to tear down at the same time? It's entirely plausible that a goroutine would be hitting "for _, appId := range apps" at exactly the same time that an application_worker notices aw.catacomb.Dying(). And if the applicationWorker.loop() notices catacomb.Dying before the caasunitprovisioner worker decides it needs to delete that app, then the caasunitprovisioner worker will block indefinitely.
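Independent of the restructuring suggested below, one narrower way to avoid the indefinite block would be to never do a bare send and instead pair it with the provisioner's own lifetime. A rough, untested sketch against the existing p.appRemoved and p.catacomb fields (it only addresses the blocking, not the fact that the wrong worker can receive the signal):

	// Sketch only: if the receiving side is already dying, bail out
	// instead of blocking forever on the send.
	select {
	case p.appRemoved <- struct{}{}:
	case <-p.catacomb.Dying():
		return p.catacomb.ErrDying()
	}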
*Something like*: delete p.appRemoved (it doesn't belong on the provisioner) and just look directly at the application worker details; a rough sketch of the provisioner side follows after the diff below. Either that, or get rid of appRemoved and *only* ever trigger on Worker.Stop().

diff --git a/worker/caasunitprovisioner/application_worker.go b/worker/caasunitprovisioner/application_worker.go
--- a/worker/caasunitprovisioner/application_worker.go
+++ b/worker/caasunitprovisioner/application_worker.go
@@ -48,7 +48,7 @@ func newApplicationWorker(
 	applicationUpdater ApplicationUpdater,
 	unitGetter UnitGetter,
 	unitUpdater UnitUpdater,
-) (worker.Worker, error) {
+) (*applicationWorker, error) {
 	w := &applicationWorker{
 		application:      application,
 		jujuManagedUnits: jujuManagedUnits,
diff --git a/worker/caasunitprovisioner/worker.go b/worker/caasunitprovisioner/worker.go
index 15f076dbf3..b661245259 100644
--- a/worker/caasunitprovisioner/worker.go
+++ b/worker/caasunitprovisioner/worker.go
@@ -73,8 +73,6 @@ func NewWorker(config Config) (worker.Worker, error) {
 type provisioner struct {
 	catacomb catacomb.Catacomb
 	config   Config
-
-	appRemoved chan struct{}
 }
 
 // Kill is part of the worker.Worker interface.
@@ -98,8 +96,7 @@ func (p *provisioner) loop() error {
-	// The channel is unbuffered to that we block until
-	// requests are processed.
-	p.appRemoved = make(chan struct{})
-	appWorkers := make(map[string]worker.Worker)
+	appWorkers := make(map[string]*applicationWorker)
 	for {
 		select {
 		case <-p.catacomb.Dying():
@@ -124,7 +121,16 @@ func (p *provisioner) loop() error {
-			// the app is gone so it has a chance to clean up.
-			// The worker will act on the r...
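For reference, the removal path on the provisioner side might then look roughly like this, sketched against the diff above (the notifyApplicationRemoved method name is invented here purely for illustration and is not part of the diff):

	// Hypothetical sketch: once appWorkers holds *applicationWorker values,
	// the provisioner can address the specific worker being removed.
	if w, ok := appWorkers[appId]; ok {
		// Tell this worker, and only this one, that its application is
		// gone, then stop it. No shared channel is involved.
		w.notifyApplicationRemoved()
		if err := worker.Stop(w); err != nil {
			logger.Errorf("stopping application worker for %v: %v", appId, err)
		}
		delete(appWorkers, appId)
	}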