DB CA link did not reconnect (is this same as #153 ????)

Bug #541248 reported by Jeff Hill
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
High
Andrew Johnson

Bug Description

From Pam:

Last night after Kay rebooted some LLRF IOCs, scl-hprf-ioc05 was unable to re-connect with some links that should have been okay – caget got the values fine, for example. This morning when I came in, it still hadn’t managed to re-connect. I looked around. The only difference I notice between an IOC that’s making connections and one that isn’t is the second “timerQueue” line: the delay is 1 when it’s working and a big number when it isn’t.

After a reboot, the IOC looks okay.

   Pam.

  NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY

---------- ------------ -------- --- ---------- -------- -------- ------- -----

tExcTask excTask 1dfda20 0 PEND 1fbff8 1dfd900 3006b 0

snsUtilEven19fc1a8 1706cf8 4 PEND+T 1effbc 1706bd8 3d0002 12

syncTime 1a0ec48 1a0e568 5 DELAY 1f51fc 1a0e468 0 12701

tLogTask logTask 1dfb010 10 PEND 1fbff8 1dfaf00 0 0

tNetTask netTask 1cc9a48 50 PEND 1effbc 1cc9958 0 0

tPortmapd portmapd 1c68188 54 PEND 1effbc 1c67f28 3d0002 0

tTelnetd telnetd 1c6a358 55 PEND 1effbc 1c6a1e8 360003 0

NTPTimeSync1a82d78 1a09538 109 DELAY 1f51fc 1a093d8 3d0002 377

EIPxmtr5 1a82d78 1508388 109 PEND+T 1effbc 1507f78 3d0002 299

EIPxmtr7 1a82d78 1505218 109 DELAY 1f51fc 1505098 3d0002 1

EIPmod 1a82d78 15020a8 109 DELAY 1f51fc 1501f28 3d0002 1

cbHigh 1a82d78 1711cf8 128 PEND 1effbc 1711bd8 0 0

timerQueue 1a82d78 1725bd8 129 PEND+T 1effbc 1725648 3d0004 2146458376

scanOnce 1a82d78 14fdf40 129 PEND 1effbc 14fde30 0 0

scan0.1 1a82d78 14c81e0 133 DELAY 1f51fc 14c80a0 0 2

scan0.2 1a82d78 14cda80 134 DELAY 1f51fc 14cd940 0 6

cbMedium 1a82d78 1719448 135 PEND 1effbc 1719328 0 0

Oms58_motor1a82d78 16e5550 135 PEND+T 1effbc 16e52e0 3d0004 12880

scan0.5 1a82d78 14d3320 135 DELAY 1f51fc 14d31e0 0 17

scan1 1a82d78 14d8bc0 136 DELAY 1f51fc 14d8a80 0 22

scan2 1a82d78 14de460 137 DELAY 1f51fc 14de320 0 55

scan5 1a82d78 14e3d00 138 DELAY 1f51fc 14e3bc0 0 82

scan10 1a82d78 14e95a0 139 DELAY 1f51fc 14e9460 0 96

cbLow 1a82d78 1720b98 140 PEND 1effbc 1720a78 0 0

CAC-event 1a82d78 147bdb8 141 PEND 1effbc 147bca8 0 0

asCaTask 1a82d78 149b048 142 PEND 1effbc 149af28 0 0

CAC-UDP 1a82d78 1451530 147 PEND 1effbc 1450d60 0 0

timerQueue 1a82d78 1464d50 148 PEND+T 1effbc 14647c0 3d0004 2143265573

seqAux 1a82d78 132baf0 148 DELAY 1f51fc 132b550 0 138

CAC-event 1a82d78 12e50b8 148 PEND 1effbc 12e4fa8 0 0

CAC-event 1a82d78 fe9330 148 PEND 1effbc fe9220 0 0

dbCaLink 1a82d78 170c4f8 149 PEND 1effbc 170c3d8 0 0

B9K72.109 1a82d78 1705760 149 PEND+T 1fbff8 1705580 3d0002 7

poolPoll 1a82d78 1322f88 149 DELAY 1f51fc 1322e98 0 29

Tune 1a82d78 1310270 149 PEND+T 1effbc 13100b0 3d0002 2144567296

Tune 1a82d78 1309b20 149 PEND+T 1effbc 1309960 3d0002 2144567663

Tune 1a82d78 13033d0 149 PEND+T 1effbc 1303210 3d0002 2144567954

Tune 1a82d78 12fcc80 149 PEND+T 1effbc 12fcac0 3d0002 2144919026

Tune 1a82d78 12dc248 149 PEND+T 1effbc 12dc088 3d0002 2144568608

Tune 1a82d78 126a8d0 149 PEND+T 1effbc 126a710 3d0002 2144568820

Additional information:
IOCs involved were running R3.14.7.
This has been observed to occur only once

Original Mantis Bug: mantis-206
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=206

Tags: ca 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

I have two questions for you.

1) What version of EPICS runs on scl-hprf-ioc05?

2) Is this is a side effect of Kay’s IOCs hanging up during the shutdown part of a soft reboot? If that was occurring then we certainly do understand why clients would not reconnect?

If *all* of the following are true the client might not reconnect on its own.

1) client side is running R3.14.4 (possibly R3.14.5) or earlier

2) the server was unavailable for longer than 10 – 30 min (depending on EPICS version and IP kernel version)

3) the server isn’t configured so the client sees the server’s beacon, or the router does not forward the server’s beacon to the client (the client sees no changes in the frequency of server beacons)

Also, looking in mantis there was a bug fixed in the R3.14.2 – R3.14.4 (inclusive) CA repeater that might cause a problem like you have described.

So if your client side (scl-hprf-ioc05) was running R3.14.6 or later and Kay’s IOC was not hung when this was supposed to reconnect then it sounds like I will need to investigate further. Let me know.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Pam,

All of the IOCs involved are running 3.14.7.

No – I don’t think it’s the same IOC that Kay was having the hangup problem on. I don’t know how long his reboot took.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Pam,

Also, this problem persisted overnight until I came in the following morning - indeed, until the IOC was rebooted. It seems likely in this case that something was messed up so it wasn't making new connections (or remaking old ones.)

It was fixed by a reboot of the IOC with the linking records.

I haven't seen a problem quite like this one before, so maybe we should just hope it doesn't happen again.

However, I have seen, with 3.14.7 soft IOCs, records taking quite long times to reconnect - 5 minutes and more. Also I've seen that after restarting the archiver.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Kay,

I doubt it's related to the LLRF reboot:
This is about rebooting about 40 IOCs in order to load new software in the afternoon by writing to some PV which runs 'reboot()'. There are terminal windows open to most of the IOCs, so I see that one IOC doesn't want to die & reboot without a Ctrl^X. All in all this happens quickly; all LLRF IOCs are back up within 5 minutes.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

This is worrisome. I went ahead and created Mantis entry 206 so that we can track this issue.

Of course the IOC was rebooted and so it's too late now, but next we will need to have a look at the stack trace for that timer queue. This is accomplished on vxWorks by typing "tt <task id>". One possibility might be that the timer queue thread is stuck in one of its callbacks and this is hanging up the reconnect mechanism.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> However, I have seen, with 3.14.7 soft IOCs, records taking quite
> long times to reconnect - 5 minutes and more. Also I've seen that
> after restarting the archiver.

If the client sees the servers beacon it should reconnect within a few seconds (maxing out at 5 seconds as I recall). With old EPICS it was possible for it to not reconnect if the servers beacon was not seen and the IOC was inaccessible for a long time. With recent R3.14, if it does not see the servers beacon, it should always reconnect after a maximum of about 5 minutes.

One possibility is that the IP kernel is not disconnecting the TCP circuit when the server (IOC) was rebooted. When CA notices that there is no activity over the circuit and no beacon for EPICS_CA_CONN_TMO it sends an "are you there" message over the TCP circuit. Given that the "are you there" message is being attempted to be delivered by the IP kernel then the IP kernel should receive a "no circuit at that port" ICMP response immediately when the IOC reboots, and that should cause it to disconnect the circuit, and that should cause the client library to build a new circuit.

What version of vxWorks are the IOCs running? Do we have a mixture of vxWorks 6.0 and other vxWorks versions?

edited on: 2005-06-15 11:15

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Kay,

Maybe related but too late now:
There was work being done on the event link hardware.
As a result, a sequence of LLRF IOCs didn't
receive time-of-day info from the SNS event system
and their "bootup" time showed 1-1-1990.

Their current time was OK because the now famous
General Time aka Colonel Time alias Chairman Time
fell back to NTP and the LLRF IOCs worked OK
until I needed to reboot for software updates.

Maybe that HPRF IOC wasn't so lucky and the timer
queue happened to run into bad time info while
calculating delays.

This is just to hint that there's so much going on
in parallel that it's hard to debug.

edited on: 2005-06-15 11:19

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Yea, we are vulnerable if there are absolute time source discontinuities. I don’t know how to avoid this with timer queues. We have to work in absolute time because we don’t know how much time goes by when users are in their callbacks.

I do know that NTP has many years ago solved the discontinuities problem. They use a software updated PLL to gracefully slew from one time source to another.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

These IOCs are all running vxWorks "SNS07" or Wind version 2.6

Revision history for this message
Andrew Johnson (anj) wrote :

The appears to have been an issue with the time distribution, which hopefully cannot reccur with the integration of generalTime into the IOC.

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.10 released.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.