pyjuju

local provider machine agent uses 100% CPU after host reboot

Series 0.5
Bug #1006553

Bug #1006553 reported by Māris Fogels on 2012-05-30

This bug affects 8 people

	Status	Importance	Assigned to	Milestone
pyjuju	Fix Released	High	Benjamin Saller	pyjuju 0.6 "honolulu"
0.5	Triaged	High	Unassigned
juju (Ubuntu)	Fix Released	High	Unassigned
Precise	Won't Fix	High	Unassigned

Bug Description

The juju program uses 100% CPU if the host is rebooted while running a LXC environment. The CPU consumption does not happen immediately after reboot, but starts within a day or two of regular system use.

Destroying the environment kills the processes.

Here is the runaway process command:

/usr/bin/python -m juju.agents.machine --nodaemon --logfile /tmp/local-juju/mars-local/machine-agent.log --session-file /var/run/juju/mars-local-machine-agent.zksession

Steps to reproduce:

1. Create an environment with the local (LXC) provider
2. Set up the wordpress+mysql as-per the juju tutorial
3. Reboot the host system
4. Use the system for a day

What happens: juju runs away, taking 100% of a CPU with it

I am using juju version 0.5+bzr537-1juju5~precise1

Related branches

lp://staging/~bcsaller/pyjuju/lxc-repairs

Merged into lp://staging/pyjuju at revision 562

Kapil Thangavelu (community): Needs Fixing on 2012-07-18

Revision history for this message

Māris Fogels (mars) wrote on 2012-05-30:

machine-agent.log Edit (250 bytes, text/plain)

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2012-05-31:

Hi Maris. When I try this with the latest juju on precise, I don't get a runaway process. Mine just reconnects every few seconds...

[pid 1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid 1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 1090] close(10) = 0
[pid 1090] poll([{fd=4, events=POLLIN}], 1, 3333) = 0 (Timeout)
[pid 1090] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
[pid 1090] setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 1090] fcntl(10, F_GETFL) = 0x2 (flags O_RDWR)
[pid 1090] fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 1090] connect(10, {sa_family=AF_INET, sin_port=htons(55429), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid 1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 1090] close(10) = 0
[pid 1090] poll([{fd=4, events=POLLIN}], 1, 3333) = 0 (Timeout)
[pid 1090] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
[pid 1090] setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 1090] fcntl(10, F_GETFL) = 0x2 (flags O_RDWR)
[pid 1090] fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 1090] connect(10, {sa_family=AF_INET, sin_port=htons(55429), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid 1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid 1090] close(10) = 0
[pid 1090] poll([{fd=4, events=POLLIN}], 1, 3333

Can you strace the runaway process for a while with

strace -f -o /tmp/strace-machine-agent.log -p $PID_OF_MACHINE_AGENT

And then attach that as well?

Thanks!

Hi Maris. When I try this with the latest juju on precise, I don't get a runaway process. Mine just reconnects every few seconds...

[pid  1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid  1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid  1090] close(10)                   = 0
[pid  1090] poll([{fd=4, events=POLLIN}], 1, 3333) = 0 (Timeout)
[pid  1090] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
[pid  1090] setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid  1090] fcntl(10, F_GETFL)          = 0x2 (flags O_RDWR)
[pid  1090] fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid  1090] connect(10, {sa_family=AF_INET, sin_port=htons(55429), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid  1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid  1090] close(10)                   = 0
[pid  1090] poll([{fd=4, events=POLLIN}], 1, 3333) = 0 (Timeout)
[pid  1090] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
[pid  1090] setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid  1090] fcntl(10, F_GETFL)          = 0x2 (flags O_RDWR)
[pid  1090] fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid  1090] connect(10, {sa_family=AF_INET, sin_port=htons(55429), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  1090] poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN|POLLOUT}], 2, 3333) = 1 ([{fd=10, revents=POLLIN|POLLOUT|POLLERR|POLLHUP}])
[pid  1090] getsockopt(10, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
[pid  1090] close(10)                   = 0
[pid  1090] poll([{fd=4, events=POLLIN}], 1, 3333

Can you strace the runaway process for a while with

strace -f -o /tmp/strace-machine-agent.log -p $PID_OF_MACHINE_AGENT

And then attach that as well?

Thanks!

Changed in juju:
status:	New → Incomplete
importance:	Undecided → High

Revision history for this message

Māris Fogels (mars) wrote on 2012-06-14:

Runaway juju machine agent Edit (21.3 MiB, text/plain)

Here is the output of strace, as requested.

Changed in juju:
status:	Incomplete → New

Revision history for this message

Māris Fogels (mars) wrote on 2012-06-14:

You said your system reconnects every few seconds? My system reconnects hundreds of times a second. That 22MiB logfile was generated in about ten seconds.

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2012-06-15:

After some IRC debugging with mars, Its clear to me there is something deep within libzookeeper that causes it to reconnect furiously after a long time of zookeeper being unavailable.

I think the right thing to do here then is to just not bring up the agents at boot time, until ZooKeeper and the containers are also brought up at boot time.

Changed in juju:
status:	New → Confirmed
status:	Confirmed → Triaged

Kapil Thangavelu (hazmat) on 2012-06-19

Changed in juju:
milestone:	none → galapagos
assignee:	nobody → Benjamin Saller (bcsaller)

Revision history for this message

Neal McBurnett (nealmcb) wrote on 2012-06-22:

I tried the strace suggested in #2, with the output below - pretty much the same as Māris.

The nice thing is, doing the strace also stopped the 100% cpu usage and the process I traced is no longer there.

The odd thing is that there were two processes at 100% cpu according to htop, but only one according to ps and only one of my 4 cpus was pegged. I traced the one that was not visible in ps, and now it is gone.....

1343 poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
1343 poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
1343 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
1343 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
1343 fcntl(10, F_GETFL) = 0x2 (flags O_RDWR)
1343 fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
1343 connect(10, {sa_family=AF_INET, sin_port=htons(52489), sin_addr=inet_addr("192.168.122.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
1343 close(10) = 0

Revision history for this message

Neal McBurnett (nealmcb) wrote on 2012-06-22:

Another symptom related to this seems to be this output in the machine-agent.log, after each boot, as well as when I did the strace described in comment #2:

$ tail -10 /home/neal/juju/neal-local/machine-agent.log
2012-06-21 10:27:29,961: twisted@ERROR: Unhandled error in Deferred:
2012-06-21 10:27:29,987: twisted@ERROR: Unhandled Error
Traceback (most recent call last):
Failure: txzookeeper.client.ConnectionTimeoutException: could not connect before timeout

2012-06-22 07:03:32,081: twisted@ERROR: Unhandled error in Deferred:
2012-06-22 07:03:32,082: twisted@ERROR: Unhandled Error
Traceback (most recent call last):
Failure: txzookeeper.client.ConnectionTimeoutException: could not connect before timeout

Clint Byrum (clint-fewbar) on 2012-06-22

Changed in juju:
milestone:	galapagos → honolulu

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2012-06-22:

Neal thanks for the extra info. This is pretty well understood now. There are really two bugs here. One is that we are starting the machine agent on reboot, which we shouldn't do since we know there will be no ZooKeeper on reboot, and the other is that libzookeeper seems to loop uncontrollably on reconnecting at some point.

We'll fix the former in juju by not attempting to restart the machine agent. The latter, I don't think we'll get to unfortunately.

Revision history for this message

Neal McBurnett (nealmcb) wrote on 2012-06-23:

Thanks, Clint - makes sense, though I wonder how we'll configure services that need to restart on reboot.

For the record, here are some more kern.log messages, which accompanied the strace that I talked about before, and give some more background on why the cpu calmed down.

Jun 22 07:03:21 rf kernel: [74141.409402] init: juju-neal-local-machine-agent main process (1100) killed by TRAP signal
Jun 22 07:03:21 rf kernel: [74141.429782] init: juju-neal-local-machine-agent main process ended, respawning

Based on the respawn, I guess it may come back, so instructions geared to any newbies who run across this bug on how to most easily destroy the environment would help. I tried to dig up what I had done in my testing via the status command, but it doesn't help:

$ juju status
could not connect before timeout
2012-06-22 18:04:00,125 ERROR could not connect before timeout

I can't run 'juju destroy-service' without knowing what service I started ;) So e.g. how can we figure that out?

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-06-23: Re: [Bug 1006553] Re: Juju uses 100% CPU after host reboot

#10

juju destroy-environment will clean things out and clear out the upstart
jobs. the machine agent is a singleton on the host in a local provider
environment, regardless of the services deployed in containers.

On Fri, Jun 22, 2012 at 8:13 PM, Neal McBurnett <email address hidden>wrote:

> Thanks, Clint - makes sense, though I wonder how we'll configure
> services that need to restart on reboot.
>
> For the record, here are some more kern.log messages, which accompanied
> the strace that I talked about before, and give some more background on
> why the cpu calmed down.
>
> Jun 22 07:03:21 rf kernel: [74141.409402] init:
> juju-neal-local-machine-agent main process (1100) killed by TRAP signal
> Jun 22 07:03:21 rf kernel: [74141.429782] init:
> juju-neal-local-machine-agent main process ended, respawning
>
> Based on the respawn, I guess it may come back, so instructions geared
> to any newbies who run across this bug on how to most easily destroy the
> environment would help. I tried to dig up what I had done in my testing
> via the status command, but it doesn't help:
>
> $ juju status
> could not connect before timeout
> 2012-06-22 18:04:00,125 ERROR could not connect before timeout
>
> I can't run 'juju destroy-service' without knowing what service I
> started ;) So e.g. how can we figure that out?
>
> --
> You received this bug notification because you are subscribed to juju.
> https://bugs.launchpad.net/bugs/1006553
>
> Title:
> Juju uses 100% CPU after host reboot
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1006553/+subscriptions
>

Revision history for this message

Adrien Lamothe (alamo) wrote on 2012-06-23: Re: Juju uses 100% CPU after host reboot

#11

It would be good to allow an environment to remain in place but not run amok, so the user doesn't have to recrete it every time.

Revision history for this message

Neal McBurnett (nealmcb) wrote on 2012-06-26:

#12

Adrien, I agree, something less destructive than juju destroy-environment would be helpful for people that run into this. The "lxc-list" command will show which containers exist, but juju destroy-service fails for me after a reboot also. It's still unclear how to avoid cpu churning but also preserve the environment.

Is there a workaround to cleanly resume after reboot, restarting the zookeeper, agents, etc?

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-06-26: Re: [Bug 1006553] Re: Juju uses 100% CPU after host reboot

#13

restarting zookeeper will stop the cpu load as the machine agent will be
able to connect. However the containers would still need to be started, the
machine agent should do this, but i don't believe its implemented as such
yet (detecting existing container and restarting vs starting new).

-k

On Tue, Jun 26, 2012 at 8:56 AM, Neal McBurnett <email address hidden>wrote:

> Adrien, I agree, something less destructive than juju destroy-
> environment would be helpful for people that run into this. The "lxc-
> list" command will show which containers exist, but juju destroy-service
> fails for me after a reboot also. It's still unclear how to avoid cpu
> churning but also preserve the environment.
>
> Is there a workaround to cleanly resume after reboot, restarting the
> zookeeper, agents, etc?
>
> --
> You received this bug notification because you are subscribed to juju.
> https://bugs.launchpad.net/bugs/1006553
>
> Title:
> Juju uses 100% CPU after host reboot
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1006553/+subscriptions
>

Kapil Thangavelu (hazmat) on 2012-06-26

summary:

- Juju uses 100% CPU after host reboot
+ local provider machine agent uses 100% CPU after host reboot

Revision history for this message

Selene ToyKeeper (toykeeper) wrote on 2012-07-03:

#14

I discovered this shortly after my cpu fan driver stopped working. The two issues combined to make my notebook overheat and shut down.

Kapil Thangavelu (hazmat) on 2012-07-03

Changed in juju:
status:	Triaged → In Progress

Clint Byrum (clint-fewbar) on 2012-07-16

Changed in juju (Ubuntu):
importance:	Undecided → High

Clint Byrum (clint-fewbar) on 2012-07-16

Changed in juju (Ubuntu Precise):
importance:	Undecided → High
Changed in juju (Ubuntu):
status:	New → Triaged
Changed in juju (Ubuntu Precise):
status:	New → Triaged
milestone:	none → ubuntu-12.04.1

Stéphane Graber (stgraber) on 2012-08-02

Changed in juju (Ubuntu Precise):
milestone:	ubuntu-12.04.1 → none

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2012-09-10:

#15

We don't want to backport this to 0.5.x until the issues with lxc requiring root access are addressed.

Changed in juju:
status:	In Progress → Fix Committed

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2012-09-11:

#16

This burns my lap everyday on precise :(

Clint Byrum (clint-fewbar) on 2012-10-09

Changed in juju:
status:	Fix Committed → Fix Released

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2012-10-11:

#17

Fixed in 0.6 upstream.

Changed in juju (Ubuntu):
status:	Triaged → Fix Released

Revision history for this message

Steve Langasek (vorlon) wrote on 2021-10-14:

#18

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release

Changed in juju (Ubuntu Precise):
status:	Triaged → Won't Fix

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #1003138

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.