callbackRequestDelay not waiting for 1/60 sec on vxWorks

Bug #1861612 reported by rivers
Affects: EPICS Base
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

The problem is documented in this tech-talk thread: https://epics.anl.gov/tech-talk/2020/msg00308.php

In 7.0.3.1, but not in 7.0.3, callbackRequestDelay(1/60) results in 100% CPU time, most of it spent in cbHigh.

This happens on vxWorks 6.9.4.1 and was first seen with the scaler record with multiple types of scaler hardware and device support.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Looking at release notes and commits

   https://github.com/epics-base/epics-base/compare/R7.0.3...R7.0.3.1

it looks like this could be a side effect of switching internal timers to monotonic time.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Wasn't 1/60 the default tick time on VxWorks?

The difference between delaying one tick and delaying zero ticks could be callback/reschedule happening every tick or as fast as the CPU can.

Revision history for this message
rivers (rivers) wrote :

> Looking at release notes and commits
> https://github.com/epics-base/epics-base/compare/R7.0.3...R7.0.3.1
> it looks like this could be a side effect of switching internal timers to monotonic time.

> Wasn't 1/60 the default tick time on VxWorks?

Yes, I discussed both of these in the tech-talk thread.

The Application Developers Guide does not document a minimum value for the delay time in callbackRequestDelay(). Given that, the existing behavior should be preserved, which was that 1/60 second did not result in a 0 delay. I believe that has been the behavior since at least the beginning of 3.14.

A reasonable expectation for the minimum value of the delay, given that it is not documented, is that it should be the same as the minimum value of epicsThreadSleep. That is documented as follows:

"Sleep for the specified period of time, i.e. sleep without using the cpu.
If delay is >0 then the thread will sleep at least until the next clock tick.
The exact time is determined by the underlying architecture. If delay is
<= 0 then a delay of 0 is requested of the underlying architecture. What
happens is architecture dependent but often it allows other threads of the
same priority to run."

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Sure. This has been working for a long time, including the 7.0.3 release.

The interesting question is why the behavior has changed with the switch to monotonic time.

BTW, can you verify that it was exactly this commit that changed the behavior?
https://github.com/epics-base/epics-base/commit/4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d

That would make sure that we're talking about the real thing here.

Revision history for this message
rivers (rivers) wrote :

Perhaps my hypothesis about callbackRequestDelay behavior change is wrong.

I looked for where callbackRequestDelay is used in base, and it seems to be used for the HIGH delay of bo records. I thus thought that HIGH delays of 1/60 second or less would result in effectively 0 delay. However, I tested this with camonitor against a vxWorks IOC running 7.0.3.1. This is what I see with HIGH=0.01, which is less than the 1-tick time of 0.016666666, as I repeatedly write 1 to the record with caput.

corvette:~/devel/PY_NTNDA_Viewer>camonitor -tic 13LAB:userTranEnable
13LAB:userTranEnable (2020-02-02 10:41:57.387524) Disable
13LAB:userTranEnable ( +4.306785) Enable
13LAB:userTranEnable ( +0.007119) Disable
13LAB:userTranEnable ( +0.346203) Enable
13LAB:userTranEnable ( +0.003968) Disable
13LAB:userTranEnable ( +0.313715) Enable
13LAB:userTranEnable ( +0.002799) Disable
13LAB:userTranEnable ( +0.236848) Enable
13LAB:userTranEnable ( +0.013807) Disable
13LAB:userTranEnable ( +0.290088) Enable
13LAB:userTranEnable ( +0.009729) Disable
13LAB:userTranEnable ( +0.249017) Enable
13LAB:userTranEnable ( +0.017225) Disable
13LAB:userTranEnable ( +0.290018) Enable
13LAB:userTranEnable ( +0.010037) Disable
13LAB:userTranEnable ( +0.192072) Enable
13LAB:userTranEnable ( +0.008027) Disable
13LAB:userTranEnable ( +0.200641) Enable
13LAB:userTranEnable ( +0.015982) Disable
13LAB:userTranEnable ( +0.208282) Enable
13LAB:userTranEnable ( +0.008714) Disable
13LAB:userTranEnable ( +0.230697) Enable
13LAB:userTranEnable ( +0.002203) Disable
13LAB:userTranEnable ( +0.206095) Enable
13LAB:userTranEnable ( +0.010524) Disable
13LAB:userTranEnable ( +0.197474) Enable
13LAB:userTranEnable ( +0.002590) Disable
13LAB:userTranEnable ( +0.174295) Enable
13LAB:userTranEnable ( +0.009070) Disable

So the transition from 1 (Enable) to 0 (Disable) seems to take a random amount of time between 0 and 0.01666 seconds. This is exactly what I expect if the delay does wait for the next clock tick, just like epicsThreadSleep. The tick rate in my IOC is 60.

However, if I change HIGH from .01 to .005 the behavior changes, and it is only high for 20-40 microseconds (measuring time on a Linux client).

corvette:~/devel/PY_NTNDA_Viewer>camonitor -tic 13LAB:userTranEnable
13LAB:userTranEnable (2020-02-02 10:52:10.095757) Disable

13LAB:userTranEnable ( +11.698305) Enable
13LAB:userTranEnable ( +0.000056) Disable
...


Revision history for this message
rivers (rivers) wrote :

As Ralph suggested, while having 7.0.3.1 checked out I reverted commit https://github.com/epics-base/epics-base/commit/4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d.

That fixed the problem with the scaler record; I can now run at a 60 Hz display rate with no problems.

It did not change the behavior of the bo record HIGH field. I see the same as above: with .01 sec it waits until the next clock tick to go low, with .005 sec it goes low immediately. I don't understand this behavior.

Revision history for this message
Andrew Johnson (anj) wrote :

I don't have any definitive answers (or access to a vxWorks system to experiment with down here at about 18 Degrees North), but I can provide more data.

callbackRequestDelayed() is implemented using an epicsTimer; the timer queue compares the current time (which since commit 4f2228fb1 gets read using epicsTime::getMonotonic() instead of epicsTime::getCurrent()) with a delayed version of the time from the same source to work out when to fire the callback.

On vxWorks-ppc the getMonotonic() time source is the CPU's time-base register, which ticks at some multiple of the CPU's bus clock frequency. Different boards have different bus rates and multipliers, and if the BSP doesn't provide the optional routine sysTimeBaseFreq() to return the nominal tick rate it gets measured in osdMonotonicInit() which prints "osdMonotonicInit: Measuring CPU time-base frequency ... %llu ticks/sec." – I think that's called from a C++ static initializer so it should appear when loading the munch file, but I might be mistaken about that part.

A workaround would be to increase your system clock tick rate by calling sysClkRateSet(100) or higher which I would put towards the top of your startup scripts; you can call it later and our code should adjust (thanks to Dirk IIRC), but before loading the munch file is probably safest. The OS routines and any taskDelay() calls will still be using this tick rate for delay granularity, and that includes the semTake() that gets used when the timer queue needs to sleep. In epicsEventWaitWithTimeout() we calculate how many OS ticks to sleep using the equivalent of
    max(1, (int) (delay * sysClkRateGet()))

I haven't had any revelations over what might be causing the problem yet though.

Revision history for this message
rivers (rivers) wrote :

I do see the message that Andrew referred to when the IOC boots:
osdMonotonicInit: Measuring CPU time-base frequency ... 25000172 ticks/sec.

I just modified the scaler record to print the current value of the first counter in the callback routine. I print with errlogPrintf(), rather than printf(), so it does not slow down the operation very much.

My scaler has a 10 MHz input on scaler[0], so the ticks are 100 ns.

This is what I see if RATE=59. The expected delta ticks is 1/59*1e7=169492. The average delta for the first 6 readings is 163981, which is pretty close.

updateCounts: posting scaler values counts[0]=74
updateCounts: posting scaler values counts[0]=150723
updateCounts: posting scaler values counts[0]=317296
updateCounts: posting scaler values counts[0]=483968
updateCounts: posting scaler values counts[0]=651202
updateCounts: posting scaler values counts[0]=817458
updateCounts: posting scaler values counts[0]=983962

This is what I see if RATE=60
updateCounts: posting scaler values counts[0]=65
updateCounts: posting scaler values counts[0]=186497
updateCounts: posting scaler values counts[0]=188068
updateCounts: posting scaler values counts[0]=188639
updateCounts: posting scaler values counts[0]=189067
updateCounts: posting scaler values counts[0]=189495
updateCounts: posting scaler values counts[0]=189924
updateCounts: posting scaler values counts[0]=190351

So now the average delta from the last 7 readings is 642 ticks, which means the callback is running every 64 microseconds, or 15.6 kHz.

This confirms that the problem is indeed that the callback is being called far too often when RATE=60.

If the problem is that callbackRequestDelay is returning too soon when RATE=60 then I don't understand why the bo record .HIGH field is not showing the same problem.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Well, in both cases (HIGH delays and callbackRequestDelay) we seem to have a threshold for the requested delay with a response switching between a 1 tick delay and a 0 tick delay.

Even with your statement
> ...seems to take a random amount of time between 0 and 0.01666 seconds.
> This is exactly what I expect if the delay does wait for the next
> clock tick, just like epicsThreadSleep.

I have to say that I do not like that behavior. Agreed, it may only play out in a few real-life use cases, but a delay API that cannot promise the delay seems awkward and wrong. Note that for any delay request, a system with a clock tick larger than the requested delay will respond with an arbitrary delay between zero and one clock tick, unrelated to the requested value.

Much like the fundamental difference between a "snooze" timer (don't wake me up before...) and an alarm setting (wake me up no later than...) the use cases for a delay and a timer are different and the API promises should reflect that.

IMHO: A requested delay > 0 shall never call back before the requested delay has passed. If waiting for the next system tick comes back too early, delay shall wait for another system tick.
Immediate call back is acceptable (and somewhat expected) for a requested delay = 0.

Revision history for this message
rivers (rivers) wrote :

> Well, in both cases (HIGH delays and callbackRequestDelay) we seem to have a threshold
> for the requested delay with a response switching between a 1 tick delay and a 0 tick delay.

But HIGH delays use callbackRequestDelay just like the scaler record does. So why does the scaler record appear to get immediate callbacks when the delay=1/60., but the bo record HIGH processing does not? And why does bo HIGH processing have a change in behavior between delay=0.01 and 0.005?

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

The attached patch is a bit of a shot in the dark, but maybe worth testing.

Revision history for this message
rivers (rivers) wrote :

I tried Michael's patch. It does prevent the very fast callbacks at RATE=60, but it slows down the callbacks at all rates.

This is what I now see with RATE=59 when I count for 0.2 seconds.
updateCounts: posting scaler values counts[0]=72
updateCounts: posting scaler values counts[0]=210254
updateCounts: posting scaler values counts[0]=543797
updateCounts: posting scaler values counts[0]=876856
updateCounts: posting scaler values counts[0]=1210206
updateCounts: posting scaler values counts[0]=1543730
updateCounts: posting scaler values counts[0]=1876830
updateCounts: posting scaler values counts[0]=2000000

The average delta of the first 7 readings is (1876830-72)/6.=312793. So the average callback interval is 31.2 ms, while it should be 1/59 = 16.9 ms.

At RATE=60 I see
updateCounts: posting scaler values counts[0]=73
updateCounts: posting scaler values counts[0]=255761
updateCounts: posting scaler values counts[0]=588997
updateCounts: posting scaler values counts[0]=922397
updateCounts: posting scaler values counts[0]=1255750
updateCounts: posting scaler values counts[0]=1589057
updateCounts: posting scaler values counts[0]=1922543
updateCounts: posting scaler values counts[0]=2000000

So the average callback interval is (1922543-73)/6. = 32.04 ms, while it should be 16.67 ms.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Can you test with a second RATE (say 30Hz) to see how this error in delay scales? This may give a hint as to where to focus (looking for more quantum games, or ticksPerSec).

Also, what do epicsThreadSleepQuantum() and epicsMonotonicResolution() return?

Revision history for this message
rivers (rivers) wrote :

This is 30 Hz.

updateCounts: posting scaler values counts[0]=343492
updateCounts: posting scaler values counts[0]=842531
updateCounts: posting scaler values counts[0]=1342611
updateCounts: posting scaler values counts[0]=1842512
updateCounts: posting scaler values counts[0]=2342578
updateCounts: posting scaler values counts[0]=2842581
updateCounts: posting scaler values counts[0]=3342583
updateCounts: posting scaler values counts[0]=3842498
updateCounts: posting scaler values counts[0]=4342557
updateCounts: posting scaler values counts[0]=4842722

So the average delta is (4842722-343492)/9.=499914 = 50 ms, should be 33 ms.

epicsThreadSleepQuantum=0.016667
epicsMonotonicResolution=40

Revision history for this message
rivers (rivers) wrote :

Reverting Michael's patch I see this with RATE=30Hz.

updateCounts: posting scaler values counts[0]=340710
updateCounts: posting scaler values counts[0]=673958
updateCounts: posting scaler values counts[0]=1007853
updateCounts: posting scaler values counts[0]=1340645
updateCounts: posting scaler values counts[0]=1673964
updateCounts: posting scaler values counts[0]=2007397
updateCounts: posting scaler values counts[0]=2340634
updateCounts: posting scaler values counts[0]=2673957
updateCounts: posting scaler values counts[0]=3008932
updateCounts: posting scaler values counts[0]=3341156

So the average delta ticks is (3341156-340710)/9.=333382.8, which is 33.3 ms, as expected.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

The difference between expected and actual seems ~= epicsThreadSleepQuantum() (aka. 1.0/sysClkRateGet() ). This is the granularity of epicsEventWaitWithTimeout() which is ultimately (I think) what the timer queue is using.

The code in epicsEventWaitWithTimeout() may explain what the -quantum()/2 was doing.

> if (timeOut <= 0.0) {
>     ticks = 0;
> } else if (timeOut >= (double) INT_MAX / rate) {
>     ticks = WAIT_FOREVER;
> } else {
>     ticks = timeOut * rate; // MD - rate=sysClkRateGet()
>     if (ticks <= 0)
>         ticks = 1;
> }

Previously, subtracting could result in timeOut <=0. Now, this is never the case so the minimum wait is 1 tick.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :
Revision history for this message
Andrew Johnson (anj) wrote :

libCom's epicsEventTest measures epicsEventWaitWithTimeout() delays from 2.0**0 down to 2.0**-19=0.000001907 and 0.0. However, it isn't particularly careful with the starting point of those timings, and it also drops the sign from the calculated delay error. I just added a call to epicsEventWaitWithTimeout(id, 0.000001) before the start of each delay measurement to synchronize each one to the start of an OS tick, and made the error calculation just measured delay - requested delay. The results on VxWorks are interesting. These are with the default 60 Hz tick rate:

# epicsEventWaitWithTimeout(1.000000) delay error 0.000013 sec
# epicsEventWaitWithTimeout(0.500000) delay error 0.000006 sec
# epicsEventWaitWithTimeout(0.250000) delay error 0.000003 sec
# epicsEventWaitWithTimeout(0.125000) delay error -0.008331 sec
# epicsEventWaitWithTimeout(0.062500) delay error -0.012500 sec
# epicsEventWaitWithTimeout(0.031250) delay error -0.014583 sec
# epicsEventWaitWithTimeout(0.015625) delay error 0.001042 sec
# epicsEventWaitWithTimeout(0.007813) delay error 0.008854 sec
# epicsEventWaitWithTimeout(0.003906) delay error 0.012760 sec
# epicsEventWaitWithTimeout(0.001953) delay error 0.014716 sec
# epicsEventWaitWithTimeout(0.000977) delay error 0.015690 sec
# epicsEventWaitWithTimeout(0.000488) delay error 0.016178 sec
# epicsEventWaitWithTimeout(0.000244) delay error 0.016423 sec
# epicsEventWaitWithTimeout(0.000122) delay error 0.016545 sec
# epicsEventWaitWithTimeout(0.000061) delay error 0.016605 sec
# epicsEventWaitWithTimeout(0.000031) delay error 0.016636 sec
# epicsEventWaitWithTimeout(0.000015) delay error 0.016654 sec
# epicsEventWaitWithTimeout(0.000008) delay error 0.016658 sec
# epicsEventWaitWithTimeout(0.000004) delay error 0.016663 sec
# epicsEventWaitWithTimeout(0.000002) delay error 0.016665 sec
# epicsEventWaitWithTimeout(0.000000) delay error 0.000003 sec

I then set the clock to 1KHz and got these:

# epicsEventWaitWithTimeout(1.000000) delay error 0.000025 sec
# epicsEventWaitWithTimeout(0.500000) delay error -0.000005 sec
# epicsEventWaitWithTimeout(0.250000) delay error -0.000003 sec
# epicsEventWaitWithTimeout(0.125000) delay error -0.000001 sec
# epicsEventWaitWithTimeout(0.062500) delay error -0.000501 sec
# epicsEventWaitWithTimeout(0.031250) delay error -0.000250 sec
# epicsEventWaitWithTimeout(0.015625) delay error -0.000625 sec
# epicsEventWaitWithTimeout(0.007813) delay error -0.000813 sec
# epicsEventWaitWithTimeout(0.003906) delay error -0.000907 sec
# epicsEventWaitWithTimeout(0.001953) delay error -0.000953 sec
# epicsEventWaitWithTimeout(0.000977) delay error 0.000023 sec
# epicsEventWaitWithTimeout(0.000488) delay error 0.000512 sec
# epicsEventWaitWithTimeout(0.000244) delay error 0.000756 sec
# epicsEventWaitWithTimeout(0.000122) delay error 0.000878 sec
# epicsEventWaitWithTimeout(0.000061) delay error 0.000938 sec
# epicsEventWaitWithTimeout(0.000031) delay error 0.000969 sec
# epicsEventWaitWithTimeout(0.000015) delay error 0.000984 sec
# epicsEventWaitWithTimeout(0.000008) delay error 0.000992 sec
# epicsEventWaitWithTimeout(0.000004) delay error 0.000996 sec
# epicsEvent...


Changed in epics-base:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 7.0.3.2
Revision history for this message
Andrew Johnson (anj) wrote :

I propose to turn the individual delay error checks in epicsEventTest.cpp into tests that fail if the error is -ve (using my changes described above). Currently we only fail a test if the mean of the absolute delay errors is 50ms or more.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I happened upon a difference between the float to int calculations of epicsThreadSleep() vs epicsEventWaitWithTimeout() for WIN32.

epicsThreadSleep() has

> seconds *= mSecPerSec;
> seconds += 0.99999999; /* 8 9s here is optimal */

Perhaps Andrew can comment on why 8 9s is optimal.
(cf. 32f556637de8c0b9de626290f0859fc24f7ccf89)

epicsEventWaitWithTimeout() has

> tmo = ( DWORD ) ( ( timeOut * mSecPerSec ) + 0.5 );

Revision history for this message
Andrew Johnson (anj) wrote :

You'd have to ask me that question back in July 2012 for my thinking back then; 0.99999999 is probably the largest double that doesn't get rounded up to 1.0 by the compiler.

I tried changing the RTEMS version of epicsEventWaitWithTimeout() to add 0.99999999 before it casts the delay to an int number of ticks, but I still get one failure on most runs of the normal epicsEventTest program:

not ok 14 - epicsEventWaitWithTimeout(0.500000) delay error -0.000042 sec
not ok 14 - epicsEventWaitWithTimeout(0.500000) delay error -0.000004 sec
not ok 14 - epicsEventWaitWithTimeout(0.500000) delay error -0.000027 sec

The result is better though, previously I was getting a lot more:

ok 12 - epicsEventWaitWithTimeout(0.000000) delay error 0.000284 sec
not ok 13 - epicsEventWaitWithTimeout(1.000000) delay error -0.000013 sec
not ok 14 - epicsEventWaitWithTimeout(0.500000) delay error -0.000023 sec
not ok 15 - epicsEventWaitWithTimeout(0.250000) delay error -0.009993 sec
not ok 16 - epicsEventWaitWithTimeout(0.125000) delay error -0.004983 sec
not ok 17 - epicsEventWaitWithTimeout(0.062500) delay error -0.002499 sec
not ok 18 - epicsEventWaitWithTimeout(0.031250) delay error -0.011321 sec
ok 19 - epicsEventWaitWithTimeout(0.015625) delay error 0.004375 sec

The mean delay error is higher with that change as expected, by about 1 tick:

ok 33 - Mean delay error was 0.014291 sec
ok 33 - Mean delay error was 0.014293 sec
ok 33 - Mean delay error was 0.014296 sec

instead of

ok 33 - Mean delay error was 0.013231 sec
ok 33 - Mean delay error was 0.013224 sec

Requests of less than 1 tick generally insert a delay very close to 2 ticks (~0.019 seconds).

Note though that this test probably starts most of the delays shortly after the previous tick occurred; I wasn't using the busy-wait here that I described in
  https://bugs.launchpad.net/epics-base/+bug/1861612/comments/18

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I'll try to summarize some thoughts and discussions of the past 2 days or so.
Some of this is (I think) reiterating what Andrew has written. I'm repeating
it to hopefully demonstrate that I understand it.

1. epicsEventWaitWithTimeout() doesn't function as specified on RTEMS, vxWorks, or Windows.

With RTEMS and vxWorks I think this can be understood if, given an integer delay of N, the underlying primitives rtems_semaphore_obtain() or semTake() block until the Nth tick of the timer (interrupt). So a call with a delay of one may time out almost immediately if made just before a tick.

So a timeout of N may result in times between T*(N-1) and T*N where T is the tick timer interval.

This functions like a truncation, but it is separate from, and cumulative with, the truncation/rounding done in epicsEventWaitWithTimeout() when converting floating point seconds to ticks.

So the minimal change to ensure epicsEventWaitWithTimeout() behaves as specified (timeout after >=delay) involves a float to int conversion like:

> double delay = ...;
> unsigned ticks = tickRate * delay;
> ticks += 2;

2. There may be more than one issue involved.

epicsEventWaitWithTimeout() was not affected by 4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d.
The effect of this commit on the timer queue that Mark reports is (I think) a change in the delay values passed to epicsEventWaitWithTimeout(). It is not known whether these delay values would produce the correct behavior if epicsEventWaitWithTimeout() functioned as specified.

I think this question could be answered using an RT linux or RTEMS5 system with a software DAC and an oscilloscope.

3. Windows is peculiar

The Windows impl. uses WaitForSingleObject(), which takes a timeout in milliseconds, but it seems to behave as if it were internally doing a similar truncation. The delays I've observed seem to be rounded down to multiples of epicsThreadSleepQuantum().

4. Quantization of monotonic time affecting epicsEventTest

epicsEventTest on Windows sometimes reports measured delays of exactly zero. I think this is due to effective delays being less than the monotonic time quantum, which is reported as 1 ms for the appveyor builds. epicsEventTest tries to test down to tens of microseconds.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

To make a long story short.

I've pushed up a change to epicsEventWaitWithTimeout():

https://github.com/mdavidsaver/epics-base/pull/new/event-timeout

I'd like to see the result of running epicsEventTest with this change on vxWorks and/or RTEMS 4.x before spending further time on this issue.

I'm also uncertain whether the Windows version can be made to meet spec.

Revision history for this message
Ben Franksen (bfrk) wrote : Re: [Bug 1861612] Re: callbackRequestDelay not waiting for 1/60 sec on vxWorks

> So the minimal change to ensure epicsEventWaitWithTimeout() behaves as
> specified (timeout after >=delay) involves a float to int conversion
> like:
>
>> double delay = ...;
>> unsigned ticks = tickRate * delay;
>> ticks += 2;

I'd say it should be

unsigned ticks = ceil(tickRate * delay) + 1

In this way, a delay request of 0.0 seconds results in an actual delay of anywhere between 0.0 and one tick period, which is probably the desired outcome in this case.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I'm attaching an image with a timing diagram which might help to motivate this conversation.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

@Ben, the short answer is that both Andrew and I have tried +=1 and it doesn't work.

I think I can explain the reason by referring to the Timing diagram (delay.png) I've attached to this issue.

Consider a case where D (requested timeout) is slightly less than T (tick period), and A (time from call to next tick) is almost zero.

In this case the float to int conversion floor(D/T)==0. So +=1 gives a delay of 1. This will wait until the next tick (time A) which is almost zero.

The primitives we're working with on RTEMS and vxWorks seem to respond to a requested non-zero timeout of 'd' ticks by waiting A+T*(d-1) seconds.

Revision history for this message
Ben Franksen (bfrk) wrote :

I didn't mean to dispute your analysis. I think it is correct, and I understood it exactly in the way suggested by your diagram.

> The short answer is that both Andrew and I have tried +=1 and it
> doesn't work.

It cannot work if you convert using trunc, which is the default "cast"
conversion between float and int.

> In this case the float to int conversion floor(D/T)==0. So +=1 gives a
> delay of 1. This will wait until the next tick (time A) which is almost
> zero.

I was proposing to use ceil. This will result in 1, except if the input
is exactly 0.0.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

> I didn't mean to dispute your anaysis.

I wasn't reading your comment this way. I hadn't clearly stated an analysis to dispute.

> In this way, a delay of 0.0 seconds

Since delay==0 has a special meaning (wait forever), let's talk about D==T. In this case floor(D/T)+2==3 and ceil(D/T)+1==2.

I chose floor()+2 (aka. truncation) in part because I don't know if vxWorks provides ceil(double), which isn't required by C89. Also, if floor()+2 doesn't pass Andrew's test, then ceil()+1 won't either. So the longer delay seems to me the one to test.

Revision history for this message
Ben Franksen (bfrk) wrote :

Okay, fine. I just wanted to make sure you considered this as a possibility.

(And sorry for the multiple comments, I got errors when I hit the "Post Comment" button and then sent a modified version via email.)

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

@Andrew, the failures in epicsEventTest are confusing all of my appveyor build results for unrelated changes. Would you object to my marking these as TODO when #ifdef _WIN32?

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

After some private discussions and testing, I've given up trying to fix this issue and reverted most of the changes made by 4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d (excluding some in unit tests) with f1cbe93b6c0a8607a7e3cfb732c0eece7c64a029. I've also marked the new tests in epicsEventTest as TODOs. This will have to wait for someone with time and access to a test system to sort out.

Revision history for this message
Freddie Akeroyd (freddie-akeroyd) wrote :

If you need somebody to run tests and try a few things on a Windows system, let me know

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Thanks for the offer Freddie. In this case the test systems I had in mind are principally RTEMS and vxWorks.

Changed in epics-base:
importance: Critical → Low
Changed in epics-base:
milestone: 7.0.3.2 → none
Revision history for this message
Andrew Johnson (anj) wrote :

Towards the end of my tl;dr comment #18 above I wrote:

The solution to this probably has to involve reading the monotonic clock and calling semTake(id, 1) again if the OS returned before the hi-res delay has expired (we might have to do that twice).

So far I don't believe anybody has tried actually coding that fairly obvious solution for any of the OSs where this problem exists.
