tl;dr: as of Ceph 0.94.5 (hammer) radosgw automatically reconnects to monitors and OSDs. No special options or command line switches are necessary; reconnection works out of the box. If no monitor can be found within a certain time (several minutes), radosgw bails out and gets restarted by upstart/systemd.

Perhaps the problem occurs only under special conditions. Please describe the exact steps that trigger the bug so we can reproduce and fix it.

The test environment
--------------------

1 monitor (saceph-mon)
3 OSDs (saceph-osd1, saceph-osd2, saceph-osd3)
1 radosgw node (saceph-rgw)

Ceph configuration details
--------------------------

    # /etc/ceph/ceph.conf
    [global]
    fsid = 17875282-4597-4e4e-805a-a69919dbeb0c
    mon_initial_members = saceph-mon
    mon_host = 10.253.0.20
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    filestore_xattr_use_omap = true

    [client.radosgw.gateway]
    host = saceph-rgw
    keyring = /etc/ceph/ceph.client.radosgw.keyring
    rgw socket path = ""
    log file = /var/log/radosgw/client.radosgw.gateway.log
    rgw frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
    rgw print continue = false

    # /etc/apache2/conf-enabled/rgw.conf
    ServerName localhost
    DocumentRoot /var/www/html
    ErrorLog /var/log/apache2/rgw_error.log
    CustomLog /var/log/apache2/rgw_access.log combined
    RewriteEngine On
    RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
    SetEnv proxy-nokeepalive 1
    ProxyPass / fcgi://localhost:9000/

Tests
=====

Restarting monitor
------------------

    ssh saceph-mon sh -c "'service ceph-mon-all stop && sleep 60 && service ceph-mon-all start'"

radosgw notices that the monitor is no longer available and tries to reconnect:

    2015-12-31 12:58:27.478011 7f47ed54d700 2 -- 10.253.0.254:0/1015759 >> 10.253.0.20:6789/0 pipe(0x7f47c400a050 sd=8 :46741 s=2 pgs=2 cs=1 l=1 c=0x7f47c40068e0).reader couldn't read tag, (0) Success
    2015-12-31 12:58:27.478054 7f47ed54d700 2 -- 10.253.0.254:0/1015759 >> 10.253.0.20:6789/0 pipe(0x7f47c400a050 sd=8 :46741 s=2 pgs=2 cs=1 l=1 c=0x7f47c40068e0).fault (0) Success
    2015-12-31 12:58:27.479242 7f47e114c700 0 monclient: hunting for new mon
    2015-12-31 12:58:27.479245 7f47e114c700 1 -- 10.253.0.254:0/1015759 mark_down 0x7f47c40068e0 -- pipe dne
    2015-12-31 12:58:27.479287 7f47e114c700 1 -- 10.253.0.254:0/1015759 --> 10.253.0.20:6789/0 -- auth(proto 0 40 bytes epoch 1) v1 -- ?+0 0x7f47c4014cf0 con 0x7f47c4013b70
    2015-12-31 12:58:27.479503 7f47edbc4700 2 -- 10.253.0.254:0/1015759 >> 10.253.0.20:6789/0 pipe(0x7f47c4015090 sd=26 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c4013b70).connect error 10.253.0.20:6789/0, (111) Connection refused
    2015-12-31 12:58:27.479550 7f47edbc4700 2 -- 10.253.0.254:0/1015759 >> 10.253.0.20:6789/0 pipe(0x7f47c4015090 sd=26 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c4013b70).fault (111) Connection refused

The connection is restored after the monitor has been restarted:

    2015-12-31 12:59:34.864740 7f47e114c700 1 -- 10.253.0.254:0/1015759 <== mon.0 10.253.0.20:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (3166451580 0 0) 0x7f47d0003dc0 con 0x7f47b401e540
    2015-12-31 12:59:34.865041 7f47e114c700 1 -- 10.253.0.254:0/1015759 --> 10.253.0.20:6789/0 -- auth(proto 2 144 bytes epoch 0) v1 -- ?+0 0x7f47c4013b70 con 0x7f47b401e540
    2015-12-31 12:59:34.867083 7f47e114c700 1 -- 10.253.0.254:0/1015759 <== mon.0 10.253.0.20:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 241+0+0 (3468501187 0 0) 0x7f47d0003ec0 con 0x7f47b401e540
    2015-12-31 12:59:34.867311 7f47e114c700 1 -- 10.253.0.254:0/1015759 --> 10.253.0.20:6789/0 -- mon_subscribe({monmap=2+,osdmap=2039}) v2 -- ?+0 0x7f47b401eb00 con 0x7f47b401e540
    2015-12-31 12:59:34.868186 7f47e114c700 1 -- 10.253.0.254:0/1015759 <== mon.0 10.253.0.20:6789/0 3 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2033331710 0 0) 0x7f47d0003dc0 con 0x7f47b401e540

Restarting OSDs
---------------

    ssh saceph-osd1 sh -c "'service ceph-osd-all stop && sleep 60 && service ceph-osd-all start'"

radosgw detects that the OSD in question went down:

    2015-12-31 13:19:18.637773 7f47d85bb700 2 -- 10.253.0.254:0/1015759 >> 10.253.0.100:6800/20411 pipe(0x7f47efcb8dc0 sd=10 :48591 s=4 pgs=7 cs=1 l=1 c=0x7f47efcbd0b0).reader couldn't read tag, (0) Success
    2015-12-31 13:19:18.637794 7f47d85bb700 2 -- 10.253.0.254:0/1015759 >> 10.253.0.100:6800/20411 pipe(0x7f47efcb8dc0 sd=10 :48591 s=4 pgs=7 cs=1 l=1 c=0x7f47efcbd0b0).fault (0) Success

After the OSD is back online, radosgw restores the connection:

    2015-12-31 13:25:46.065136 7f47e194d700 1 -- 10.253.0.254:0/1015759 --> 10.253.0.100:6800/20851 -- ping magic: 0 v1 -- ?+0 0x7f47c00369e0 con 0x7f47c400cea0
    6 gen 2] v0'0 uv13 ondisk = 0) v6 ==== 175+0+0 (1894677755 0 0) 0x7f47b0021e10 con 0x7f47efcc4e50
    2015-12-31 13:25:46.072131 7f47d86bc700 1 -- 10.253.0.254:0/1015759 <== osd.0 10.253.0.100:6800/20851 197 ==== osd_op_reply(8354 notify.1 [watch ping cookie 139946942556736 gen 2] v0'0 uv13 ondisk = 0) v6 ==== 175+0+0 (3275663798 0 0) 0x7f47b403e130 con 0x7f47c400cea0
    2015-12-31 13:25:46.072276 7f47d86bc700 1 -- 10.253.0.254:0/1015759 <== osd.0 10.253.0.100:6800/20851 198 ==== osd_op_reply(8355 notify.2 [watch ping cookie 139946942558752 gen 2] v0'0 uv13 ondisk = 0) v6 ==== 175+0+0 (576919958 0 0) 0x7f47b403e130 con 0x7f47c400cea0
    2015-12-31 13:25:46.072374 7f47d86bc700 1 -- 10.253.0.254:0/1015759 <== osd.0 10.253.0.100:6800/20851 199 ==== osd_op_reply(8356 notify.6 [watch ping cookie 139946942568288 gen 2] v0'0 uv13 ondisk = 0) v6 ==== 175+0+0 (3101824277 0 0) 0x7f47b403e130 con 0x7f47c400cea0
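
If you want to check your own gateway log for the same pattern, something like the following might help. It is only a rough sketch: the function name rgw_reconnect_summary is made up for illustration, the default path is taken from the "log file" setting in the ceph.conf above, and it assumes the messenger debug level is high enough for the mon_subscribe_ack lines to be logged at all. Those acks also appear during normal subscription renewals, so treat the second count as a hint rather than proof of a reconnect.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): count the log lines that mark a
# lost monitor session and the monitor answering again in a radosgw log.

rgw_reconnect_summary() {
    # Default path is an assumption based on the ceph.conf shown above.
    log="${1:-/var/log/radosgw/client.radosgw.gateway.log}"

    # "hunting for new mon" is logged when the monitor session is lost.
    # grep -c exits non-zero on zero matches, hence the "|| true".
    hunts=$(grep -c 'monclient: hunting for new mon' "$log" || true)

    # mon_subscribe_ack shows the monitor responding; it is also logged on
    # regular renewals, so this is only a rough indicator.
    acks=$(grep -c 'mon_subscribe_ack' "$log" || true)

    echo "hunting=$hunts subscribe_ack=$acks"
}
```

Running rgw_reconnect_summary with no argument after the monitor restart above should report at least one hunting line. A log with hunting lines but no later mon_subscribe_ack would be worth attaching to the bug report.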