Failed to deploy node with ceph and ceilo on centos: Unknown error

Bug #1398096 reported by Sergey Galkin
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Aleksandr Didenko

Bug Description

api: '1.0'
astute_sha: 1da516b88d1a8d0014d78ab0d796e5b08379a59b
auth_required: true
build_id: 2014-11-30_11-15-26
build_number: '24'
feature_groups:
- mirantis
fuellib_sha: bbf26b499bf47ca41302ba6f62c3ebc5a493013d
fuelmain_sha: f324b592399c544eace2f64cb499564da01ab38c
nailgun_sha: 58e5f47457a0e832c005ce350e01b75a0c01b90a
ostf_sha: dc66fd39d4d035bb972e4c0225591290593c459d
production: docker
release: '6.0'

Steps to reproduce:
1. Start deploying a cluster with 95 compute nodes + 3 controllers in HA + neutron gre + ceph + ceilometer.

Result: the deployment failed with
Failed to deploy node 'compute_4': Unknown error
and all compute nodes went offline.

Traceback from compute_4
2014-12-01 17:03:57 ERR
 (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) change from notrun to 0 failed: ceph-deploy osd prepare node-53:/dev/sdb4 returned 1 instead of one of [0]
2014-12-01 17:03:57 ERR
 /usr/bin/puppet:4
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:91:in `execute'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:137:in `run'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:478:in `exit_on_fail'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:470:in `plugin_hook'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:146:in `run_command'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:218:in `main'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:268:in `apply_catalog'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:192:in `run'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:124:in `apply_catalog'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:160:in `benchmark'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/1.8/benchmark.rb:308:in `realtime'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:161:in `benchmark'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:125:in `apply_catalog'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/resource/catalog.rb:163:in `apply'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/report.rb:108:in `as_logging_destination'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util/log.rb:149:in `with_destination'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/resource/catalog.rb:164:in `apply'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:108:in `evaluate'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/graph/relationship_graph.rb:118:in `traverse'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `evaluate'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:326:in `thinmark'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/1.8/benchmark.rb:308:in `realtime'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:327:in `thinmark'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `evaluate'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `call'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:187:in `eval_resource'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:174:in `apply'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:18:in `evaluate'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:81:in `perform_changes'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:81:in `each'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:82:in `perform_changes'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:130:in `sync_if_needed'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:193:in `sync'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/type/exec.rb:120:in `sync'
2014-12-01 17:03:57 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util/errors.rb:97:in `fail'
2014-12-01 17:03:57 ERR
 ceph-deploy osd prepare node-53:/dev/sdb4 returned 1 instead of one of [0]
2014-12-01 17:03:35 ERR
 (/Stage[main]/Ceph::Conf/Exec[ceph-deploy gatherkeys remote]/returns) change from notrun to 0 failed: ceph-deploy gatherkeys node-46 returned 1 instead of one of [0]
2014-12-01 17:03:35 ERR
 /usr/bin/puppet:4
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:91:in `execute'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:137:in `run'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:478:in `exit_on_fail'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:470:in `plugin_hook'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:364:in `run'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:146:in `run_command'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:218:in `main'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:268:in `apply_catalog'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:192:in `run'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:124:in `apply_catalog'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:160:in `benchmark'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/1.8/benchmark.rb:308:in `realtime'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:161:in `benchmark'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/configurer.rb:125:in `apply_catalog'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/resource/catalog.rb:163:in `apply'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/report.rb:108:in `as_logging_destination'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util/log.rb:149:in `with_destination'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/resource/catalog.rb:164:in `apply'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:108:in `evaluate'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/graph/relationship_graph.rb:118:in `traverse'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `evaluate'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:326:in `thinmark'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/1.8/benchmark.rb:308:in `realtime'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util.rb:327:in `thinmark'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `evaluate'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:117:in `call'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:187:in `eval_resource'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction.rb:174:in `apply'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:18:in `evaluate'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:81:in `perform_changes'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:81:in `each'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:82:in `perform_changes'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:130:in `sync_if_needed'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/type/exec.rb:120:in `sync'
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/util/errors.rb:97:in `fail'
2014-12-01 17:03:35 ERR
 ceph-deploy gatherkeys node-46 returned 1 instead of one of [0]
2014-12-01 17:03:35 ERR
 /usr/lib/ruby/site_ruby/1.8/puppet/transaction/resource_harness.rb:193:in `sync

Tags: scale
Changed in fuel:
milestone: none → 6.0
Sergey Galkin (sgalkin) wrote :
Changed in fuel:
importance: Undecided → High
Matthew Mosesohn (raytrac3r) wrote :

The environment configuration in the UI and API shows the roles as compute, ceph-osd, but the astute and YAML files report compute, cinder.

Changed in fuel:
assignee: nobody → Evgeniy L (rustyrobot)
status: New → Confirmed
Evgeniy L (rustyrobot) wrote :

For some reason Nailgun reports all of the nodes as cinder, with no ceph nodes.
I'll try to investigate.

Matthew Mosesohn (raytrac3r) wrote :

We looked at the wrong Fuel master in the scale lab. I'm starting over and trying to debug ceph-deploy.

Changed in fuel:
assignee: Evgeniy L (rustyrobot) → Matthew Mosesohn (raytrac3r)
Aleksandr Didenko (adidenko) wrote :

I've checked the puppet log on the primary controller node (node-46); here's what I see:

Haproxy started at 16:24:31
2014-12-01T16:24:31.860626+00:00 notice: Proxy Stats started.

Puppet p_haproxy service evaluated at 16:24:39
2014-12-01T16:24:39.347334+00:00 info: (/Stage[main]/Cluster::Haproxy_ocf/Service[p_haproxy]) Evaluated in 9.49 seconds

Puppet Haproxy::Service[nova-api-2] service evaluated at 16:24:59
2014-12-01T16:24:59.964800+00:00 info: (/Stage[main]/Openstack::Ha::Nova/Openstack::Ha::Haproxy_service[nova-api-2]/Haproxy::Listen[nova-api-2]/Haproxy::Service[nova-api-2]/Concat[/etc/haproxy/conf.d/050-nova-api-2.cfg]/File[/var/lib/puppet/concat/_etc_haproxy_conf.d_050-nova-api-2.cfg/fragments.concat.out]) Evaluated in 0.01 seconds

But Exec[wait-for-haproxy-nova-backend] started to evaluate at 16:13:45
2014-12-01T16:13:45.345501+00:00 info: (/Stage[main]/Osnailyfacter::Cluster_ha/Exec[wait-for-haproxy-nova-backend]) Starting to evaluate the resource

And failed at 16:18:55
2014-12-01T16:18:55.551552+00:00 info: (/Stage[main]/Osnailyfacter::Cluster_ha/Exec[wait-for-haproxy-nova-backend]) Evaluated in 310.20 seconds

So we're starting to wait for nova haproxy backend before that backend is configured.
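
The ordering problem can be checked directly from the four timestamps quoted above. The following is a minimal Python sketch, not part of the original analysis; the dictionary keys are labels invented for this example, and only the timestamps come from the node-46 log.

```python
from datetime import datetime

# Timestamps copied from the node-46 puppet log lines quoted above.
# The key names are illustrative labels, not identifiers from Fuel or Puppet.
events = {
    "wait_exec_start": "2014-12-01T16:13:45.345501+00:00",
    "wait_exec_end": "2014-12-01T16:18:55.551552+00:00",
    "haproxy_start": "2014-12-01T16:24:31.860626+00:00",
    "nova_api_backend_configured": "2014-12-01T16:24:59.964800+00:00",
}
times = {name: datetime.fromisoformat(ts) for name, ts in events.items()}

# How long the wait exec ran, and how long after it ended haproxy came up.
wait_seconds = (times["wait_exec_end"] - times["wait_exec_start"]).total_seconds()
gap_seconds = (times["haproxy_start"] - times["wait_exec_end"]).total_seconds()

# Print the events in chronological order to make the inversion visible.
for name, when in sorted(times.items(), key=lambda item: item[1]):
    print(when.time(), name)
print(f"wait exec ran for {wait_seconds:.2f} s")
print(f"haproxy started {gap_seconds:.2f} s after the wait exec finished")
```

Sorted this way, the wait exec both starts and ends before haproxy is even running, which is consistent with the conclusion that the exec was ordered ahead of the resources it depends on.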

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Aleksandr Didenko (adidenko)
importance: High → Critical
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138327

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/138334

Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Matthew Mosesohn (raytrac3r)
Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Aleksandr Didenko (adidenko)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/138334
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5e1b9762400572535c1d8765459d2cfea0b46191
Submitter: Jenkins
Branch: master

commit 5e1b9762400572535c1d8765459d2cfea0b46191
Author: Matthew Mosesohn <email address hidden>
Date: Tue Dec 2 15:16:15 2014 +0400

    Add retries to ceph-deploy gatherkeys

    During heavy load sometimes ceph-deploy gatherkeys
    fails on the first request with connection issues.
    Adding a retry will avoid race conditions from
    breaking deployment.

    Change-Id: I077ef2019d7796ce1fe4510afae806f359e5a0e1
    Partial-Bug: #1398096

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/138327
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5ee154ecd7b95082ad34b5a8057384c889d70881
Submitter: Jenkins
Branch: master

commit 5ee154ecd7b95082ad34b5a8057384c889d70881
Author: Aleksandr Didenko <email address hidden>
Date: Tue Dec 2 13:46:38 2014 +0200

    Fix resource ordering for wait-for-nova-backend

    Fix ordering for wait-for-nova-backend exec and Haproxy services.
    Remove duplications for keystone class -> exec ordering.

    Partial-bug: #1398096
    Change-Id: I8dd6173c8d92ec4da8f2cfc2020b9d94e732bad4

Mike Scherbakov (mihgen) wrote :

Both changesets have landed in master (guys, thanks for such a quick turnaround here!!!) - can we close this bug as Fix Committed, or is it not fully fixed yet?

Changed in fuel:
status: In Progress → Fix Committed
Sergey Galkin (sgalkin) wrote :

Reproduced on
astute_sha: 16b252d93be6aaa73030b8100cf8c5ca6a970a91
auth_required: true
build_id: 2014-12-03_11-32-25
build_number: '36'
feature_groups:
- mirantis
fuellib_sha: 1eb704034c31a7679c6cfbf13579219c7da75e4b
fuelmain_sha: 7ab330b4958ab20955372e85de05e8732e8f6df2
nailgun_sha: d2e732c5f54e35d0ed19f9a17489608dc1d11be8
ostf_sha: 7e79964ddb5092fc4568c6fb08a348bb326df2a8
production: docker
release: '6.0'

with ubuntu+ceilo+ceph

Sergey Galkin (sgalkin) wrote :

/etc/puppet/2014.2-6.0/modules/osnailyfacter/manifests/cluster_ha.pp on build 36 includes the commit https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5ee154ecd7b95082ad34b5a8057384c889d70881 for this issue.

Aleksandr Didenko (adidenko) wrote :

> Reproduced on
> astute_sha: 16b252d93be6aaa73030b8100cf8c5ca6a970a91

Could you please provide a diagnostic snapshot for the broken deployment?

Aleksandr Didenko (adidenko) wrote :

error on node-48:

2014-12-04T20:46:36.923122+00:00 info: (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]) Starting to evaluate the resource
2014-12-04T20:46:36.925777+00:00 debug: (Exec[ceph-deploy config pull](provider=posix)) Executing 'ceph-deploy --overwrite-conf config pull node-29'
2014-12-04T20:46:36.925998+00:00 debug: Executing 'ceph-deploy --overwrite-conf config pull node-29'
2014-12-04T20:46:40.176947+00:00 notice: (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
2014-12-04T20:46:40.177564+00:00 notice: (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) [ceph_deploy.cli][INFO ] Invoked (1.5.9): /usr/bin/ceph-deploy --overwrite-conf config pull node-29
2014-12-04T20:46:40.178119+00:00 notice: (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) [ceph_deploy.config][DEBUG ] Checking node-29 for /etc/ceph/ceph.conf
2014-12-04T20:46:40.178833+00:00 notice: (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) ssh: connect to host node-29 port 22: No route to host
2014-12-04T20:46:40.179315+00:00 notice: (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) [ceph_deploy.config][ERROR ] Unable to pull /etc/ceph/ceph.conf from node-29
2014-12-04T20:46:40.180071+00:00 notice: (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) [ceph_deploy][ERROR ] GenericError: Failed to fetch config from 1 hosts
2014-12-04T20:46:40.194866+00:00 err: ceph-deploy --overwrite-conf config pull node-29 returned 1 instead of one of [0]

And it looks like there's some network issue since some nodes (for example node-48) can't see any VLAN traffic on eth2, while others (for example node-35) can.

But anyway, I think we should wrap the 'ceph-deploy --overwrite-conf config pull node-29' exec in retries, just to avoid intermittent ssh/network/load-spike problems at scale.
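
The retry idea can be sketched outside Puppet as follows. This is a minimal Python illustration, not the change proposed for fuel-library: the helper name `run_with_retries`, its default values, and the injectable `runner` argument are all assumptions made for the example. The parameter names `tries` and `try_sleep` are borrowed from the attributes of Puppet's exec resource.

```python
import subprocess
import time


def run_with_retries(cmd, tries=3, try_sleep=1.0, runner=subprocess.call):
    """Run cmd until it exits 0 or the tries are used up.

    Returns the exit code of the last attempt. `runner` defaults to
    subprocess.call and can be swapped for a stub when testing.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    rc = None
    for attempt in range(1, tries + 1):
        rc = runner(cmd)
        if rc == 0:
            return rc
        # Pause between attempts, but not after the final one.
        if attempt < tries:
            time.sleep(try_sleep)
    return rc


# Hypothetical usage; the retry count and sleep interval are illustrative values:
# rc = run_with_retries(
#     ["ceph-deploy", "--overwrite-conf", "config", "pull", "node-29"],
#     tries=3,
#     try_sleep=5,
# )
```

A retry of this kind only masks transient failures such as a dropped ssh connection under load. It would not help a node that has no route to the target host at all, as in the node-48 log above.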

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/139622

Changed in fuel:
status: Fix Committed → In Progress
importance: Critical → High
Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/139622
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=eab4c8711f260240ba89b50daad213d24ee7bbff
Submitter: Jenkins
Branch: master

commit eab4c8711f260240ba89b50daad213d24ee7bbff
Author: Aleksandr Didenko <email address hidden>
Date: Fri Dec 5 14:54:53 2014 +0200

    Add retries to ceph-deploy config pull

    We can have intermittent ssh connection issues on scale, so it's
    better to run several tries for "ceph-deploy config pull" exec

    Partial-bug: #1398096
    Change-Id: I660d820b70daf78bece5228ece197d65b062df1c

Sergey Galkin (sgalkin) wrote :

The reproduction on
astute_sha: 16b252d93be6aaa73030b8100cf8c5ca6a970a91
auth_required: true
build_id: 2014-12-03_11-32-25
build_number: '36'

was caused by an incorrect lab configuration.
