Excessive Ceph rebalancing can cause deployment failures

Bug #1415954 reported by Leontiy Istomin
This bug affects 5 people
Affects               Status         Importance   Assigned to        Milestone
Fuel for OpenStack    Fix Released   Critical     Stanislav Makar
  5.1.x               Won't Fix      Critical     MOS Maintenance
  6.0.x               Won't Fix      Critical     Stanislav Makar

Bug Description

[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: ed5270bf9c6c1234797e00bd7d4dd3213253a413
auth_required: true
build_id: 2015-01-26_18-15-51
build_number: '79'
feature_groups:
- mirantis
fuellib_sha: 0b6794c7e035847f20fb07e397f9af6932da42b8
fuelmain_sha: ''
nailgun_sha: 223f942908715b5c9c7396e2e71cf7d298c419a7
ostf_sha: c9100263140008abfcc2704732e98fbdfd644068
production: docker
python-fuelclient_sha: 1ad9862e9b3fc8a9f64d01b4548f38e83d2473a2
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: ed5270bf9c6c1234797e00bd7d4dd3213253a413
      build_id: 2015-01-26_18-15-51
      build_number: '79'
      feature_groups:
      - mirantis
      fuellib_sha: 0b6794c7e035847f20fb07e397f9af6932da42b8
      fuelmain_sha: ''
      nailgun_sha: 223f942908715b5c9c7396e2e71cf7d298c419a7
      ostf_sha: c9100263140008abfcc2704732e98fbdfd644068
      production: docker
      python-fuelclient_sha: 1ad9862e9b3fc8a9f64d01b4548f38e83d2473a2
      release: '6.1'

Bare-metal, CentOS, HA, Neutron-GRE, Ceph-for-all, Debug, 6.1_79
Controllers: 3, Computes: 47

The TestVM image is not visible after deployment.
From the astute log:
http://paste.openstack.org/show/163605/
You can compare it with the results from another environment (6.0.1_50): http://paste.openstack.org/show/163616/
It works if I execute it manually:
http://paste.openstack.org/show/163610/

The file /opt/vm/cirros-x86_64-disk.img is placed on the controller (node-1).
I can find the block device in Ceph, and if I use the rbd name as the glance id with the glance image-show command, the image is shown with "killed" status:
http://paste.openstack.org/show/163655/
[root@node-1 ~]# nova boot --flavor 1 --image 7569ae36-68a3-45b8-808e-1eff71814841 listomin-test
ERROR (BadRequest): Image 7569ae36-68a3-45b8-808e-1eff71814841 is not active. (HTTP 400) (Request-ID: req-af2d89e1-bb45-48ec-ac45-5301fef1047e)
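
For reference, the Ceph/Glance cross-check can be done roughly as follows (a sketch; "images" is the default Glance RBD pool name in such deployments and may differ in a given environment):

  # list RBD objects in the Glance pool; Glance stores each image under its UUID
  rbd -p images ls
  # look the same UUID up in Glance and check its status
  glance image-show 7569ae36-68a3-45b8-808e-1eff71814841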

http://paste.openstack.org/show/163678/

diagnostic snapshot: https://drive.google.com/a/mirantis.com/file/d/0Bx4ptZV1Jt7hbmNUT3FIdWJhNjA/view?usp=sharing

Changed in mos:
milestone: none → 6.1
ruhe (ruhe)
Changed in mos:
importance: Undecided → Critical
Revision history for this message
Leontiy Istomin (listomin) wrote :
Revision history for this message
Igor Yozhikov (iyozhikov) wrote :

What was found:
From ceph.log on one of the controller nodes:
...
2015-01-29 23:20:51.670859 mon.0 192.168.0.3:6789/0 1 : [INF] mon.node-1@0 won leader election with quorum 0
...
2015-01-29 23:44:03.273977 mon.0 192.168.0.3:6789/0 31 : [INF] mon.node-1@0 won leader election with quorum 0,1,2
2015-01-29 23:44:03.276078 mon.0 192.168.0.3:6789/0 32 : [INF] monmap e3: 3 mons at {node-1=192.168.0.3:6789/0,node-44=192.168.0.46:6789/0,node-49=192.168.0.51:6789/0}
...
2015-01-29 23:53:01.390172 mon.0 192.168.0.3:6789/0 124 : [INF] osd.42 192.168.0.41:6800/13854 boot
2015-01-29 23:53:01.390511 mon.0 192.168.0.3:6789/0 125 : [INF] osdmap e22: 47 osds: 47 up, 47 in
...
2015-01-29 23:54:59.033482 osd.26 192.168.0.6:6800/14137 1 : [WRN] 3 slow requests, 3 included below; oldest blocked for > 30.781338 secs
...
2015-01-29 23:59:10.623681 osd.0 192.168.0.29:6800/13524 3 : [WRN] 6 slow requests, 6 included below; oldest blocked for > 384.902800 secs
...
2015-01-30 00:00:09.645041 mon.0 192.168.0.3:6789/0 212 : [INF] pgmap v109: 8384 pgs: 8384 active+clean; 694 bytes data, 98193 MB used, 43169 GB / 43265 GB avail

Ceph cluster initialization took about 30 minutes, from 2015-01-29 23:20:51 to 2015-01-29 23:53:01.
The last log record quoted above shows the Ceph health state; we can see that the cluster is healthy.

Conclusion:
I believe the image was imported into Glance too early, before the Ceph cluster became operational.
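
One way to avoid that race is to delay the image upload until Ceph reports a healthy state. A minimal shell sketch of such a guard (the timeout and the HEALTH_OK check are illustrative; grepping for HEALTH_OK may be too strict if there are unrelated warnings such as clock skew):

  # wait up to ~30 minutes for Ceph to settle before uploading the TestVM image
  for i in $(seq 1 180); do
      ceph health | grep -q HEALTH_OK && break
      sleep 10
  done
  ceph health | grep -q HEALTH_OK || { echo "Ceph is still not healthy"; exit 1; }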

Revision history for this message
Victor Ryzhenkin (vryzhenkin) wrote :

Can't reproduce on ISO:
{"build_id": "2015-01-28_22-55-01", "ostf_sha": "c9100263140008abfcc2704732e98fbdfd644068", "build_number": "84", "release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-01-28_22-55-01", "ostf_sha": "c9100263140008abfcc2704732e98fbdfd644068", "build_number": "84", "api": "1.0", "nailgun_sha": "92a85025f0ef9b3c5b42a1ba172573aa0ac54e33", "production": "docker", "python-fuelclient_sha": "cb8928ce34f5ca88c0d6cecc6331488db75362ac", "astute_sha": "ed5270bf9c6c1234797e00bd7d4dd3213253a413", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "", "fuellib_sha": "b2fbaa9ffb74fafe1f5c2c480944a78424e1ae28"}}}, "auth_required": true, "api": "1.0", "nailgun_sha": "92a85025f0ef9b3c5b42a1ba172573aa0ac54e33", "production": "docker", "python-fuelclient_sha": "cb8928ce34f5ca88c0d6cecc6331488db75362ac", "astute_sha": "ed5270bf9c6c1234797e00bd7d4dd3213253a413", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "", "fuellib_sha": "b2fbaa9ffb74fafe1f5c2c480944a78424e1ae28"}

CentOS, HA, Neutron GRE, Murano, Ceph-for-All. (3 virtual controllers, 1 SuperMicro compute)

It seems that the bug is reproducible only with bare-metal configurations.

tags: added: bare-metal
Changed in mos:
status: New → Confirmed
Stanislav Makar (smakar)
Changed in mos:
assignee: MOS Glance (mos-glance) → Stanislav Makar (smakar)
Stanislav Makar (smakar)
Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Stanislav Makar (smakar) wrote :

Proposed the patch https://review.openstack.org/#/c/153338/
Waiting for test results from the scale lab.

Alex Ermolov (aermolov)
Changed in mos:
milestone: 6.1 → 5.1.2
milestone: 5.1.2 → 6.1
Alex Ermolov (aermolov)
no longer affects: mos/6.0.x
Revision history for this message
Dina Belova (dbelova) wrote :

Stanislav, I know you've tried it on the scale lab. Any updates?

Revision history for this message
Stanislav Makar (smakar) wrote :

Yes, I tried it; the patch fixes the problem, but there is a proposal to redo it (move this check to another step).

Revision history for this message
Ryan Moe (rmoe) wrote :

One of the issues I see with the proposed change is that it will cause problems when deploying additional OSDs. If you have a large Ceph cluster with lots of data in it and you want to add more OSDs later via Fuel, the deployment will fail with this change. Each time an OSD is added, Ceph rebalances the cluster. With lots of data this could easily take more than 30 minutes, and Fuel will mark the deployment as failed even though the OSDs deployed (and are working) correctly. While Ceph is rebalancing, the status will be HEALTH_WARN, but the cluster is still usable.

I think moving the cluster health check to a post-deployment task is a better way to do it. Also, rather than checking the status we should write data to the cluster and read it back to verify that it's working correctly. If we want to verify that each individual OSD has deployed correctly we should verify that the OSD we just deployed is marked up and in by Ceph.
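
A post-deployment verification along those lines might look roughly like this (a hedged sketch; the pool name and OSD id are illustrative):

  # write an object and read it back to confirm the cluster actually serves I/O
  echo probe > /tmp/ceph-probe
  rados -p images put deploy-probe /tmp/ceph-probe
  rados -p images get deploy-probe /tmp/ceph-probe.out
  diff /tmp/ceph-probe /tmp/ceph-probe.out && rados -p images rm deploy-probe

  # confirm that the OSD just deployed is marked "up" and "in"
  ceph osd dump | grep -E '^osd\.42 .*up.*in'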

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

I think the solution to this bug should also take into account bug #1424060.

Changed in fuel:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Stanislav Makar (smakar)
milestone: none → 6.1
status: Triaged → In Progress
no longer affects: mos
no longer affects: mos/6.1.x
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Ryan, we should consider setting "ceph osd set noout" (thanks to Bruce Matthew for the idea) at the deployment stage and triggering the rebalance later, at the user's discretion. If the rebalance is deferred until after the deployment has finished, that should obviously also disable the upload-cirros task when Ceph is used as the Glance image backend. Another point was suggested by Chris Clason: "We need to go one step further and gradually ramp up the weight on the OSDs, instead of dumping the drives in at their full weights."

no longer affects: mos/6.0.x
Ryan Moe (rmoe)
tags: added: customer-found
Ryan Moe (rmoe)
summary: - TestVM Image has been added in Glance with "killed" status
+ Excessive Ceph rebalancing can cause deployment failures
Revision history for this message
Mykola Golub (mgolub) wrote :

The rebalancing load can be significantly decreased by limiting recovery activity in ceph.conf:

  osd max backfills = 1
  osd recovery max active = 1

Setting these parameters to decrease I/O load when adding/removing nodes is a widely used practice. The default values are 10 and 15, which look too high. Note that this will very likely increase the rebalancing time.

I think we should try this solution before resorting to more complicated ones.
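
For what it's worth, these settings can also be applied to a running cluster without restarting the OSDs (injectargs is the standard runtime-override mechanism; shown here only as an illustration):

  # apply the recovery throttling to all running OSDs at once
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'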

"ceph osd set noout" is not enough to disable automatic rebalance on deployment stage. To "freez" the cluster one could

  ceph osd set noout
  ceph osd set noin
  ceph osd set noup
  ceph osd set nodown

then unset them when the deployment has finished and you are ready to rebalance. The gradually-increasing-weight technique could be used too, but it might not be necessary with the ceph.conf settings above (this needs to be checked on large deployments, though).
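
For illustration, the "freeze, then rebalance on demand" approach combined with a gradual weight ramp could look roughly like this (a sketch; the OSD id and weight values are made up):

  # release the cluster once the deployment has finished
  for flag in noout noin noup nodown; do ceph osd unset "$flag"; done

  # ramp a new OSD's CRUSH weight up in steps instead of adding it at full weight
  ceph osd crush reweight osd.42 0.2
  # wait for the cluster to return to active+clean between steps
  ceph osd crush reweight osd.42 0.6
  ceph osd crush reweight osd.42 1.0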

Changed in fuel:
assignee: Stanislav Makar (smakar) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Stanislav Makar (smakar)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/163019

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Mykola, is it possible that lowering 'osd max backfills' and 'osd recovery max active' from 10 and 15 to 1 could cause the rebalance to never finish under heavy write load?

Stanislav Makar (smakar)
tags: added: ceph
Revision history for this message
Mikolaj Golub (to-my-trociny) wrote :

I have not seen reports of 'osd max backfills = 1' and 'osd recovery max active = 1' causing a rebalance that never finishes. They may certainly increase recovery time, but the problems caused by high load during recovery/rebalance look much worse, so these are commonly recommended settings. E.g. they were used in a 47-disk-server / 1128-OSD Ceph cluster at CERN:

http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern

Revision history for this message
Mykola Golub (mgolub) wrote :

I sent the previous message (#16) from the wrong account; it is still me.

Changed in fuel:
assignee: Stanislav Makar (smakar) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Stanislav Makar (smakar)
Revision history for this message
Stanislav Makar (smakar) wrote :

SCALE LAB (50 ceph-osd) results:
The patch fixes the problem and performed well even when the cluster was in HEALTH_WARN due to clock skew.

root@node-101:~# ceph -s
    cluster e9df6ff0-ccb8-4a51-9d56-a6f92c85c921
     health HEALTH_WARN clock skew detected on mon.node-102, mon.node-103
     monmap e3: 3 mons at {node-101=192.168.0.4:6789/0,node-102=192.168.0.5:6789/0,node-103=192.168.0.6:6789/0}, election epoch 6, quorum 0,1,2 node-101,node-102,node-103
     osdmap e47: 47 osds: 47 up, 47 in
      pgmap v979: 16576 pgs, 11 pools, 13696 kB data, 51 objects
            98513 MB used, 43401 GB / 43498 GB avail
               16576 active+clean

task logs:

2015-03-13T18:23:26 info: [424] Run hook ---
priority: 300
fail_on_error: true
type: shell
uids:
- '101'
parameters:
  cmd: ruby /etc/puppet/modules/osnailyfacter/modular/astute/ceph_ready_check.rb
  timeout: 1800

2015-03-13T18:26:18 debug: [424] 954bd275-f135-44bf-8cb8-d8f5224d69eb: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"
101", :statuscode=>0, :statusmsg=>"OK", :data=>{:stdout=>"try 0\nThere are PGs which are not in active state!\ntry 1\nThere are PGs which are not in active state!\ntry 2\nThere are PGs which are not in active state!\ntry 3\nThere are PGs which are not in active state!\ntry 4\nThere are PGs which are not in active state!\ntry 5\nThere are PGs which are not in active state!\ntry 6\nThere are PGs which are not in active state!\ntry 7\nThere are PGs which are not in active state!\ntry 8\nThere are PGs which are not in active state!\ntry 9\nThere are PGs which are not in active state!\ntry 10\nThere are PGs which are not in active state!\ntry 11\nThere are PGs which are not in active state!\ntry 12\nThere are PGs which are not in active state!\ntry 13\nThere are PGs which are not in active state!\ntry 14\nThere are PGs which are not in active state!\ntry 15\n", :stderr=>"ok\nok\nok\nok\nok\nok\nok\nok\nok\nok\nok\nok\nok\nok\nok\n", :exit_code=>0}}
2015-03-13T18:26:18 debug: [424] 954bd275-f135-44bf-8cb8-d8f5224d69eb: cmd: cd / && ruby /etc/puppet/modules/osnailyfacter/modular/astute/ceph_ready_check.rb
cwd: /
stdout: try 0
There are PGs which are not in active state!
try 1
There are PGs which are not in active state!
try 2
There are PGs which are not in active state!
try 3
There are PGs which are not in active state!
try 4
There are PGs which are not in active state!
try 5
There are PGs which are not in active state!
try 6
There are PGs which are not in active state!
try 7
There are PGs which are not in active state!
try 8
There are PGs which are not in active state!
try 9
There are PGs which are not in active state!
try 10
There are PGs which are not in active state!
try 11
There are PGs which are not in active state!
try 12
There are PGs which are not in active state!
try 13
There are PGs which are not in active state!
try 14
There are PGs which are not in active state!
try 15

stderr: ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok

exit code: 0
2015-03-13T18:26:18 info: [424] Run hook ---
priority: 400
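
For reference, a rough shell equivalent of the retry loop visible in the log above (the merged fix implements this check in Ruby as ceph_ready_check.rb; the parsing, retry count, and sleep interval below are only illustrative):

  # retry until every placement group is reported active+clean, then exit 0
  for try in $(seq 0 15); do
      echo "try $try"
      # "ceph pg stat" prints e.g. "v979: 16576 pgs: 16576 active+clean; ..."
      total=$(ceph pg stat | sed -n 's/.* \([0-9]*\) pgs:.*/\1/p')
      clean=$(ceph pg stat | sed -n 's/.*pgs: \([0-9]*\) active+clean.*/\1/p')
      [ -n "$total" ] && [ "$total" = "$clean" ] && exit 0
      echo "There are PGs which are not in active state!"
      sleep 60
  done
  exit 1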

Changed in fuel:
assignee: Stanislav Makar (smakar) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Stanislav Makar (smakar)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/153338
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d7be830168abea5ba14b0e36aaedf7e69998b568
Submitter: Jenkins
Branch: master

commit d7be830168abea5ba14b0e36aaedf7e69998b568
Author: Stanislav Makar <email address hidden>
Date: Wed Mar 4 13:59:30 2015 +0000

    Add the Ceph ready checking

    When we deploy environment with big number of Ceph OSDs it takes time
    to get ceph ready to use.

    Docimpact: operations documentation http://ceph.com/docs/master/rados/operations/monitoring/
    Closes-bug: #1415954
    Change-Id: If7ae4948978047e1dd533e300eb1bfdc61b51035

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Dina Belova (dbelova) wrote :

Fix verified for 6.1 on 233 ISO and later.

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 6.0-updates and 5.1.1-updates as we don't expect new 5.1.1 and 6.0 deployments
