gnocchi is eating all my CPU :(

Bug #1626473 reported by Steven Hardy
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Carlos Camacho

Bug Description

[root@overcloud-controller-0 ~]# top

top - 10:19:56 up 1:03, 1 user, load average: 19.50, 23.13, 21.59
Tasks: 249 total, 5 running, 243 sleeping, 0 stopped, 1 zombie
%Cpu(s): 93.8 us, 5.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
KiB Mem : 8077620 total, 1820788 free, 4891436 used, 1365396 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 2876484 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 5015 gnocchi 20 0 618416 64940 11416 R 32.9 0.8 0:00.74 /usr/bin/python2 /usr/bin/gnocchi-statsd --logfile /var/lo+
 5005 gnocchi 20 0 853784 65136 3452 S 30.7 0.8 0:00.69 gnocchi-metricd - processing(0)
 5001 gnocchi 20 0 853784 65128 3452 S 28.4 0.8 0:00.70 gnocchi-metricd - processing(1)
 5016 gnocchi 20 0 748956 56652 1420 S 27.1 0.7 0:00.61 gnocchi-metricd - reporting(0)
 5029 gnocchi 20 0 748956 56184 1420 S 20.0 0.7 0:00.45 gnocchi-metricd - janitor(0)
 5037 gnocchi 20 0 684572 55836 1552 R 13.8 0.7 0:00.31 gnocchi-metricd - scheduler(0)
14339 keystone 20 0 622720 89448 7000 S 7.1 1.1 1:15.56 keystone-main -DFOREGROUND

Revision history for this message
Steven Hardy (shardy) wrote :

As you can see from the loadavg, this is killing my controller, and it's an otherwise idle overcloud.

I did openstack overcloud deploy --templates, then ran the same command again to test updating the overcloud, then noticed the CPU on my controller was pinned.

Changed in tripleo:
milestone: none → newton-rc2
status: New → Triaged
importance: Undecided → High
Revision history for this message
Steven Hardy (shardy) wrote :

[root@overcloud-controller-0 ~]# tail -f /var/log/gnocchi/metricd.log
2016-09-22 10:23:58.415 8902 ERROR cotyledon File "/usr/lib/python2.7/site-packages/cotyledon/__init__.py", line 53, in _logged_sys_exit
2016-09-22 10:23:58.415 8902 ERROR cotyledon atexit._run_exitfuncs()
2016-09-22 10:23:58.415 8902 ERROR cotyledon File "/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
2016-09-22 10:23:58.415 8902 ERROR cotyledon func(*targs, **kargs)
2016-09-22 10:23:58.415 8902 ERROR cotyledon File "/usr/lib64/python2.7/multiprocessing/util.py", line 319, in _exit_function
2016-09-22 10:23:58.415 8902 ERROR cotyledon p.join()
2016-09-22 10:23:58.415 8902 ERROR cotyledon File "/usr/lib64/python2.7/multiprocessing/process.py", line 143, in join
2016-09-22 10:23:58.415 8902 ERROR cotyledon assert self._parent_pid == os.getpid(), 'can only join a child process'
2016-09-22 10:23:58.415 8902 ERROR cotyledon AssertionError: can only join a child process
2016-09-22 10:23:58.415 8902 ERROR cotyledon

Possibly related: metricd seems to be stuck repeatedly hitting this error. Perhaps I need an updated overcloud image?
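
For anyone reading the trace above: the assertion comes from multiprocessing's Process.join(), which refuses to run in any process other than the one that started the worker. The following is only a minimal sketch of that check in isolation, not the actual cotyledon or gnocchi code path. It assumes a POSIX system with the "fork" start method, and the function name join_from_forked_sibling is made up for illustration.

```python
import multiprocessing
import os
import time


def join_from_forked_sibling():
    """Return the message raised when a non-parent process calls join()."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    worker = ctx.Process(target=time.sleep, args=(0.2,))
    worker.start()  # the worker records this process as its parent

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Forked child: it inherits the `worker` object but is not its parent,
        # so join() trips the parent-pid assertion.
        os.close(read_fd)
        try:
            worker.join()
            message = "join succeeded"
        except AssertionError as exc:
            message = str(exc)
        os.write(write_fd, message.encode())
        os._exit(0)  # leave without running atexit handlers

    # Original parent: collect the child's report, then clean up.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        message = pipe.read()
    os.waitpid(pid, 0)
    worker.join()  # the real parent can join without error
    return message


if __name__ == "__main__":
    print(join_from_forked_sibling())
```

In the log, the same join() is reached through the atexit handler that multiprocessing registers, which would be consistent with a forked process inheriting the handler and its parent's list of children. That reading is an inference from the trace, not something confirmed in this report.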

Revision history for this message
Carlos Camacho (ccamacho) wrote :

I have been hitting this issue since yesterday, but wasn't able to run more tests because I was testing Liberty deployments.

Pushed these 3 submissions to set the number of metricd workers to 1 in CI and to make it configurable for local deployments.

https://review.openstack.org/#/c/374694/
https://review.openstack.org/#/c/374704/
https://review.openstack.org/#/c/374709/

Changed in tripleo:
assignee: nobody → Carlos Camacho (ccamacho)
status: Triaged → In Progress
Revision history for this message
Pradeep Kilambi (pkilambi) wrote :

Thanks Steve. What version of cotyledon does your install have? The trace you see was fixed in 1.2.7, I believe.

Revision history for this message
Steven Hardy (shardy) wrote :

Ok, so I updated my overcloud image to include the 1.2.7 version mentioned above, and I can no longer reproduce.

If Carlos concurs, I think we can close this as invalid, since an old image may have been to blame (I'd not rebuilt it in quite some time).

Revision history for this message
Carlos Camacho (ccamacho) wrote :

You are right, I wasn't using an updated OC image either. Quick question: do you think it's worth making the metricd workers customizable? If not, I'll abandon the related submissions as well and close the bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/374704
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=3e0694ec1c84b201c44982cbca4e5f662f1b942d
Submitter: Jenkins
Branch: master

commit 3e0694ec1c84b201c44982cbca4e5f662f1b942d
Author: Carlos Camacho <email address hidden>
Date: Thu Sep 22 13:08:58 2016 +0200

    Add metricd workers support in gnocchi

    Depending on the environment, gnocchi workers
    uses several controller resources RAM/CPU,
    this option makes it configurable.

    Also, configured to 1 in environments/low-memory-usage.yaml
    which will reduce the service footprint in i.e. CI

    Change-Id: Ia008b32151f4d8fec586cf89994ac836751b7cce
    Closes-bug: #1626473

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0rc2

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-gnocchi 9.4.1

This issue was fixed in the openstack/puppet-gnocchi 9.4.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-gnocchi 10.0.0

This issue was fixed in the openstack/puppet-gnocchi 10.0.0 release.
