callbackRequestDelay not waiting for 1/60 sec on vxWorks

Bug #1861612 reported by rivers
Affects: EPICS Base
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

The problem is documented in this tech-talk thread: https://epics.anl.gov/tech-talk/2020/msg00308.php

In 7.0.3.1, but not in 7.0.3, callbackRequestDelay(1/60) results in 100% CPU time, most of it spent in cbHigh.

This happens on vxWorks 6.9.4.1 and was first seen with the scaler record with multiple types of scaler hardware and device support.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Looking at release notes and commits

   https://github.com/epics-base/epics-base/compare/R7.0.3...R7.0.3.1

it looks like this could be a side effect of switching internal timers to monotonic time.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Wasn't 1/60 the default tick time on VxWorks?

The difference between delaying one tick and delaying zero ticks could be callback/reschedule happening every tick or as fast as the CPU can.

Revision history for this message
rivers (rivers) wrote :

> Looking at release notes and commits
> https://github.com/epics-base/epics-base/compare/R7.0.3...R7.0.3.1
> it looks like this could be a side effect of switching internal timers to monotonic time.

> Wasn't 1/60 the default tick time on VxWorks?

Yes, I discussed both of these in the tech-talk thread.

The Application Developers Guide does not document a minimum value for the delay time in callbackRequestDelay(). Given that, the existing behavior should be preserved, which was that 1/60 second did not result in a 0 delay. I believe that has been the behavior since at least the beginning of 3.14.

A reasonable expectation for the minimum value of the delay, given that it is not documented, is that it should be the same as the minimum value of epicsThreadSleep. That is documented as follows:

"Sleep for the specified period of time, i.e. sleep without using the cpu.
If delay is >0 then the thread will sleep at least until the next clock tick.
The exact time is determined by the underlying architecture. If delay is
<= 0 then a delay of 0 is requested of the underlying architecture. What
happens is architecture dependent but often it allows other threads of the
same priority to run."

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Sure. This has been working for a long time, including the 7.0.3 release.

The interesting question is why the behavior has changed with the switch to monotonic time.

BTW, can you verify that it was exactly this commit that changed the behavior?
https://github.com/epics-base/epics-base/commit/4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d

That would make sure that we're talking about the real thing here.

Revision history for this message
rivers (rivers) wrote :

Perhaps my hypothesis about callbackRequestDelay behavior change is wrong.

I looked for where callbackRequestDelay is used in base, and it seems to be used for the HIGH delay of bo records. I thus thought that HIGH delays of 1/60 second or less would result in effectively 0 delay. However, I tested this with camonitor against a vxWorks IOC running 7.0.3.1. This is what I see with HIGH=0.01, which is less than the 1-tick time of 0.016666666, as I repeatedly write 1 to the record with caput.

corvette:~/devel/PY_NTNDA_Viewer>camonitor -tic 13LAB:userTranEnable
13LAB:userTranEnable (2020-02-02 10:41:57.387524) Disable
13LAB:userTranEnable ( +4.306785) Enable
13LAB:userTranEnable ( +0.007119) Disable
13LAB:userTranEnable ( +0.346203) Enable
13LAB:userTranEnable ( +0.003968) Disable
13LAB:userTranEnable ( +0.313715) Enable
13LAB:userTranEnable ( +0.002799) Disable
13LAB:userTranEnable ( +0.236848) Enable
13LAB:userTranEnable ( +0.013807) Disable
13LAB:userTranEnable ( +0.290088) Enable
13LAB:userTranEnable ( +0.009729) Disable
13LAB:userTranEnable ( +0.249017) Enable
13LAB:userTranEnable ( +0.017225) Disable
13LAB:userTranEnable ( +0.290018) Enable
13LAB:userTranEnable ( +0.010037) Disable
13LAB:userTranEnable ( +0.192072) Enable
13LAB:userTranEnable ( +0.008027) Disable
13LAB:userTranEnable ( +0.200641) Enable
13LAB:userTranEnable ( +0.015982) Disable
13LAB:userTranEnable ( +0.208282) Enable
13LAB:userTranEnable ( +0.008714) Disable
13LAB:userTranEnable ( +0.230697) Enable
13LAB:userTranEnable ( +0.002203) Disable
13LAB:userTranEnable ( +0.206095) Enable
13LAB:userTranEnable ( +0.010524) Disable
13LAB:userTranEnable ( +0.197474) Enable
13LAB:userTranEnable ( +0.002590) Disable
13LAB:userTranEnable ( +0.174295) Enable
13LAB:userTranEnable ( +0.009070) Disable

So the transition from 1 (Enable) to 0 (Disable) seems to take a random amount of time between 0 and 0.01666 seconds. This is exactly what I expect if the delay does wait for the next clock tick, just like epicsThreadSleep. The tick rate in my IOC is 60.

However, if I change HIGH from .01 to .005 the behavior changes, and it is only high for 20-40 microseconds (measuring time on a Linux client).

corvette:~/devel/PY_NTNDA_Viewer>camonitor -tic 13LAB:userTranEnable
13LAB:userTranEnable (2020-02-02 10:52:10.095757) Disable

13LAB:userTranEnable ( +11.698305) Enable
13LAB:userTranEnable ( +0.000056) Disable
...


Revision history for this message
rivers (rivers) wrote :

As Ralph suggested, while having 7.0.3.1 checked out I reverted commit https://github.com/epics-base/epics-base/commit/4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d.

That fixed the problem with the scaler record; I can now run at a 60 Hz display rate with no problems.

It did not change the behavior of the bo record HIGH field. I see the same as above: with .01 sec it waits until the next clock tick to go low, with .005 sec it goes low immediately. I don't understand this behavior.

Revision history for this message
Andrew Johnson (anj) wrote :

I don't have any definitive answers (or access to a vxWorks system to experiment with down here at about 18 Degrees North), but I can provide more data.

callbackRequestDelayed() is implemented using an epicsTimer; the timer queue compares the current time (which since commit 4f2228fb1 gets read using epicsTime::getMonotonic() instead of epicsTime::getCurrent()) with a delayed version of the time from the same source to work out when to fire the callback.

On vxWorks-ppc the getMonotonic() time source is the CPU's time-base register, which ticks at some multiple of the CPU's bus clock frequency. Different boards have different bus rates and multipliers, and if the BSP doesn't provide the optional routine sysTimeBaseFreq() to return the nominal tick rate it gets measured in osdMonotonicInit() which prints "osdMonotonicInit: Measuring CPU time-base frequency ... %llu ticks/sec." – I think that's called from a C++ static initializer so it should appear when loading the munch file, but I might be mistaken about that part.

A workaround would be to increase your system clock tick rate by calling sysClkRateSet(100) or higher which I would put towards the top of your startup scripts; you can call it later and our code should adjust (thanks to Dirk IIRC), but before loading the munch file is probably safest. The OS routines and any taskDelay() calls will still be using this tick rate for delay granularity, and that includes the semTake() that gets used when the timer queue needs to sleep. In epicsEventWaitWithTimeout() we calculate how many OS ticks to sleep using the equivalent of
    max(1, (int) (delay * sysClkRateGet()))

I haven't had any revelations over what might be causing the problem yet though.

Revision history for this message
rivers (rivers) wrote :

I do see the message that Andrew referred to when the IOC boots:
osdMonotonicInit: Measuring CPU time-base frequency ... 25000172 ticks/sec.

I just modified the scaler record to print the current value of the first counter in the callback routine. I print with errlogPrintf(), rather than printf(), so it does not slow down the operation very much.

My scaler has a 10 MHz input on scaler[0], so the ticks are 100 ns.

This is what I see if RATE=59. The expected delta ticks is 1/59*1e7=169492. The average delta for the first 6 readings is 163981, which is pretty close.

updateCounts: posting scaler values counts[0]=74
updateCounts: posting scaler values counts[0]=150723
updateCounts: posting scaler values counts[0]=317296
updateCounts: posting scaler values counts[0]=483968
updateCounts: posting scaler values counts[0]=651202
updateCounts: posting scaler values counts[0]=817458
updateCounts: posting scaler values counts[0]=983962

This is what I see if RATE=60
updateCounts: posting scaler values counts[0]=65
updateCounts: posting scaler values counts[0]=186497
updateCounts: posting scaler values counts[0]=188068
updateCounts: posting scaler values counts[0]=188639
updateCounts: posting scaler values counts[0]=189067
updateCounts: posting scaler values counts[0]=189495
updateCounts: posting scaler values counts[0]=189924
updateCounts: posting scaler values counts[0]=190351

So now the average delta from the last 7 readings is 642 ticks, which means the callback is running every 64 microseconds, or 15.6 kHz.

This confirms that the problem is indeed that the callback is being called far too often when RATE=60.

If the problem is that callbackRequestDelay is returning too soon when RATE=60 then I don't understand why the bo record .HIGH field is not showing the same problem.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Well, in both cases (HIGH delays and callbackRequestDelay) we seem to have a threshold for the requested delay with a response switching between a 1 tick delay and a 0 tick delay.

Even with your statement
> ...seems to take a random amount of time between 0 and 0.01666 seconds.
> This is exactly what I expect if the delay does wait for the next
> clock tick, just like epicsThreadSleep.

I have to say that I do not like that behavior. Agreed, it may only play out in a few real-life use cases, but a delay API that cannot promise the delay seems awkward and wrong. Note that for any delay request, a system with a clock tick larger than the requested delay will respond with an arbitrary delay between zero and one clock tick, unrelated to the requested value.

Much like the fundamental difference between a "snooze" timer (don't wake me up before...) and an alarm setting (wake me up no later than...) the use cases for a delay and a timer are different and the API promises should reflect that.

IMHO: A requested delay > 0 shall never call back before the requested delay has passed. If waiting for the next system tick comes back too early, delay shall wait for another system tick.
Immediate call back is acceptable (and somewhat expected) for a requested delay = 0.

Revision history for this message
rivers (rivers) wrote :

> Well, in both cases (HIGH delays and callbackRequestDelay) we seem to have a threshold
> for the requested delay with a response switching between a 1 tick delay and a 0 tick delay.

But HIGH delays use callbackRequestDelay just like the scaler record does. So why does the scaler record appear to get immediate callbacks when the delay=1/60., but the bo record HIGH processing does not? And why does bo HIGH processing have a change in behavior between delay=0.01 and 0.005?

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

The attached patch is a bit of a shot in the dark, but maybe worth testing.

Revision history for this message
rivers (rivers) wrote :

I tried Michael's patch. It does prevent the very fast callbacks at RATE=60, but it slows down the callbacks at all rates.

This is what I now see with RATE=59 when I count for 0.2 seconds.
updateCounts: posting scaler values counts[0]=72
updateCounts: posting scaler values counts[0]=210254
updateCounts: posting scaler values counts[0]=543797
updateCounts: posting scaler values counts[0]=876856
updateCounts: posting scaler values counts[0]=1210206
updateCounts: posting scaler values counts[0]=1543730
updateCounts: posting scaler values counts[0]=1876830
updateCounts: posting scaler values counts[0]=2000000

The average delta of the first 7 readings is (1876830-72)/6.=312793. So the average callback interval is 31.2 ms, while it should be 1/59 = 16.9 ms.

At RATE=60 I see
updateCounts: posting scaler values counts[0]=73
updateCounts: posting scaler values counts[0]=255761
updateCounts: posting scaler values counts[0]=588997
updateCounts: posting scaler values counts[0]=922397
updateCounts: posting scaler values counts[0]=1255750
updateCounts: posting scaler values counts[0]=1589057
updateCounts: posting scaler values counts[0]=1922543
updateCounts: posting scaler values counts[0]=2000000

So the average callback interval is (1922543-73)/6. = 32.04 ms, while it should be 16.67 ms.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Can you test with a second RATE (say 30Hz) to see how this error in delay scales? This may give a hint as to where to focus (looking for more quantum games, or ticksPerSec).

Also, what do epicsThreadSleepQuantum() and epicsMonotonicResolution() return?

Revision history for this message
rivers (rivers) wrote :

This is 30 Hz.

updateCounts: posting scaler values counts[0]=343492
updateCounts: posting scaler values counts[0]=842531
updateCounts: posting scaler values counts[0]=1342611
updateCounts: posting scaler values counts[0]=1842512
updateCounts: posting scaler values counts[0]=2342578
updateCounts: posting scaler values counts[0]=2842581
updateCounts: posting scaler values counts[0]=3342583
updateCounts: posting scaler values counts[0]=3842498
updateCounts: posting scaler values counts[0]=4342557
updateCounts: posting scaler values counts[0]=4842722

So the average delta is (4842722-343492)/9.=499914 = 50 ms, should be 33 ms.

epicsThreadSleepQuantum=0.016667
epicsMonotonicResolution=40

Revision history for this message
rivers (rivers) wrote :

Reverting Michael's patch I see this with RATE=30Hz.

updateCounts: posting scaler values counts[0]=340710
updateCounts: posting scaler values counts[0]=673958
updateCounts: posting scaler values counts[0]=1007853
updateCounts: posting scaler values counts[0]=1340645
updateCounts: posting scaler values counts[0]=1673964
updateCounts: posting scaler values counts[0]=2007397
updateCounts: posting scaler values counts[0]=2340634
updateCounts: posting scaler values counts[0]=2673957
updateCounts: posting scaler values counts[0]=3008932
updateCounts: posting scaler values counts[0]=3341156

So the average delta ticks is (3341156-340710)/9.=333382.8, which is 33.3 ms, as expected.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

The difference between expected and actual seems ~= epicsThreadSleepQuantum() (aka. 1.0/sysClkRateGet() ). This is the granularity of epicsEventWaitWithTimeout() which is ultimately (I think) what the timer queue is using.

The code in epicsEventWaitWithTimeout() may explain what the -quantum()/2 was doing.

> if (timeOut <= 0.0) {
>     ticks = 0;
> } else if (timeOut >= (double) INT_MAX / rate) {
>     ticks = WAIT_FOREVER;
> } else {
>     ticks = timeOut * rate; // MD - rate=sysClkRateGet()
>     if (ticks <= 0)
>         ticks = 1;
> }

Previously, subtracting could result in timeOut <=0. Now, this is never the case so the minimum wait is 1 tick.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :
Revision history for this message
Andrew Johnson (anj) wrote :

libCom's epicsEventTest measures epicsEventWaitWithTimeout() delays from 2.0**0 down to 2.0**-19=0.000001907 and 0.0. However, it isn't particularly careful with the starting point of those timings, and it also drops the sign from the calculated delay error. I just added a call to epicsEventWaitWithTimeout(id, 0.000001) before the start of each delay measurement to synchronize each one to the start of an OS tick, and made the error calculation just measured delay - requested delay. The results on VxWorks are interesting. These are with the default 60 Hz tick rate:

# epicsEventWaitWithTimeout(1.000000) delay error 0.000013 sec
# epicsEventWaitWithTimeout(0.500000) delay error 0.000006 sec
# epicsEventWaitWithTimeout(0.250000) delay error 0.000003 sec
# epicsEventWaitWithTimeout(0.125000) delay error -0.008331 sec
# epicsEventWaitWithTimeout(0.062500) delay error -0.012500 sec
# epicsEventWaitWithTimeout(0.031250) delay error -0.014583 sec
# epicsEventWaitWithTimeout(0.015625) delay error 0.001042 sec
# epicsEventWaitWithTimeout(0.007813) delay error 0.008854 sec
# epicsEventWaitWithTimeout(0.003906) delay error 0.012760 sec
# epicsEventWaitWithTimeout(0.001953) delay error 0.014716 sec
# epicsEventWaitWithTimeout(0.000977) delay error 0.015690 sec
# epicsEventWaitWithTimeout(0.000488) delay error 0.016178 sec
# epicsEventWaitWithTimeout(0.000244) delay error 0.016423 sec
# epicsEventWaitWithTimeout(0.000122) delay error 0.016545 sec
# epicsEventWaitWithTimeout(0.000061) delay error 0.016605 sec
# epicsEventWaitWithTimeout(0.000031) delay error 0.016636 sec
# epicsEventWaitWithTimeout(0.000015) delay error 0.016654 sec
# epicsEventWaitWithTimeout(0.000008) delay error 0.016658 sec
# epicsEventWaitWithTimeout(0.000004) delay error 0.016663 sec
# epicsEventWaitWithTimeout(0.000002) delay error 0.016665 sec
# epicsEventWaitWithTimeout(0.000000) delay error 0.000003 sec

I then set the clock to 1KHz and got these:

# epicsEventWaitWithTimeout(1.000000) delay error 0.000025 sec
# epicsEventWaitWithTimeout(0.500000) delay error -0.000005 sec
# epicsEventWaitWithTimeout(0.250000) delay error -0.000003 sec
# epicsEventWaitWithTimeout(0.125000) delay error -0.000001 sec
# epicsEventWaitWithTimeout(0.062500) delay error -0.000501 sec
# epicsEventWaitWithTimeout(0.031250) delay error -0.000250 sec
# epicsEventWaitWithTimeout(0.015625) delay error -0.000625 sec
# epicsEventWaitWithTimeout(0.007813) delay error -0.000813 sec
# epicsEventWaitWithTimeout(0.003906) delay error -0.000907 sec
# epicsEventWaitWithTimeout(0.001953) delay error -0.000953 sec
# epicsEventWaitWithTimeout(0.000977) delay error 0.000023 sec
# epicsEventWaitWithTimeout(0.000488) delay error 0.000512 sec
# epicsEventWaitWithTimeout(0.000244) delay error 0.000756 sec
# epicsEventWaitWithTimeout(0.000122) delay error 0.000878 sec
# epicsEventWaitWithTimeout(0.000061) delay error 0.000938 sec
# epicsEventWaitWithTimeout(0.000031) delay error 0.000969 sec
# epicsEventWaitWithTimeout(0.000015) delay error 0.000984 sec
# epicsEventWaitWithTimeout(0.000008) delay error 0.000992 sec
# epicsEventWaitWithTimeout(0.000004) delay error 0.000996 sec
# epicsEvent...


Changed in epics-base:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 7.0.3.2
Revision history for this message
Andrew Johnson (anj) wrote :

I propose to turn the individual delay error checks in epicsEventTest.cpp into tests that fail if the error is -ve (using my changes described above). Currently we only fail a test if the mean of the absolute delay errors is 50ms or more.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I happened upon a difference between the float to int calculations of epicsThreadSleep() vs epicsEventWaitWithTimeout() for WIN32.

epicsThreadSleep() has

> seconds *= mSecPerSec;
> seconds += 0.99999999; /* 8 9s here is optimal */

Perhaps Andrew can comment on why 8 9s is optimal.
(cf. 32f556637de8c0b9de626290f0859fc24f7ccf89)

epicsEventWaitWithTimeout() has

> tmo = ( DWORD ) ( ( timeOut * mSecPerSec ) + 0.5 );

Revision history for this message
Andrew Johnson (anj) wrote :

You'd have to ask me that question back in July 2012 for my thinking back then; 0.99999999 is probably the largest double that doesn't get rounded up to 1.0 by the compiler.

I tried changing the RTEMS version of epicsEventWaitWithTimeout() to add 0.99999999 before it casts the delay to an int number of ticks, but I still get one failure on most runs of the normal epicsEventTest program:

not ok 14 - epicsEventWaitWithTimeout(0.500000) delay error -0.000042 sec
not ok 14 - epicsEventWaitWithTimeout(0.500000) delay error -0.000004 sec
not ok 14 - epicsEventWaitWithTimeout(0.500000) delay error -0.000027 sec

The result is better though, previously I was getting a lot more:

ok 12 - epicsEventWaitWithTimeout(0.000000) delay error 0.000284 sec
not ok 13 - epicsEventWaitWithTimeout(1.000000) delay error -0.000013 sec
not ok 14 - epicsEventWaitWithTimeout(0.500000) delay error -0.000023 sec
not ok 15 - epicsEventWaitWithTimeout(0.250000) delay error -0.009993 sec
not ok 16 - epicsEventWaitWithTimeout(0.125000) delay error -0.004983 sec
not ok 17 - epicsEventWaitWithTimeout(0.062500) delay error -0.002499 sec
not ok 18 - epicsEventWaitWithTimeout(0.031250) delay error -0.011321 sec
ok 19 - epicsEventWaitWithTimeout(0.015625) delay error 0.004375 sec

The mean delay error is higher with that change as expected, by about 1 tick:

ok 33 - Mean delay error was 0.014291 sec
ok 33 - Mean delay error was 0.014293 sec
ok 33 - Mean delay error was 0.014296 sec

instead of

ok 33 - Mean delay error was 0.013231 sec
ok 33 - Mean delay error was 0.013224 sec

Requests of less than 1 tick generally insert a delay very close to 2 ticks (~0.019 seconds).

Note though that this test probably starts most of the delays shortly after the previous tick occurred; I wasn't using the busy-wait here that I described in
  https://bugs.launchpad.net/epics-base/+bug/1861612/comments/18

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I'll try to summarize some thoughts and discussions of the past 2 days or so.
Some of this is (I think) reiterating what Andrew has written. I'm repeating
it to hopefully demonstrate that I understand it.

1. epicsEventWaitWithTimeout() doesn't function as specified on RTEMS, vxWorks, or Windows.

With RTEMS and vxWorks I think this can be understood if, given an integer delay of N, the underlying primitives rtems_semaphore_obtain() or semTake() block until the Nth tick of the timer (interrupt). So a call with a delay of one may time out almost immediately if made just before a tick.

So a timeout of N may result in times between T*(N-1) and T*N where T is the tick timer interval.

This functions like a truncation, but it is separate from, and cumulative with, the truncation/rounding done in epicsEventWaitWithTimeout() when converting floating point seconds to ticks.

So the minimal change to ensure epicsEventWaitWithTimeout() behaves as specified (timeout after >=delay) involves a float to int conversion like:

> double delay = ...;
> unsigned ticks = tickRate * delay;
> ticks += 2;

2. There may be more than one issue involved.

epicsEventWaitWithTimeout() was not affected by 4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d.
The effect of this commit on the timer queue that Mark reports is (I think) a change in the delay values passed to epicsEventWaitWithTimeout(). It is not known whether these delay values would produce the correct behavior if epicsEventWaitWithTimeout() functioned as specified.

I think this question could be answered using an RT linux or RTEMS5 system with a software DAC and an oscilloscope.

3. Windows is peculiar

The Windows impl. uses WaitForSingleObject(), which takes a timeout in milliseconds, but it seems to behave as if it were internally doing a similar truncation. The delays I've observed seem to be rounded down to multiples of epicsThreadSleepQuantum().

4. Quantization of monotonic time affecting epicsEventTest

epicsEventTest on Windows sometimes reports measured delays of exactly zero. I think this is due to effective delays being less than the monotonic time quantum, which is reported as 1 ms for the appveyor builds. epicsEventTest tries to test down to tens of microseconds.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

To make a long story short.

I've pushed up a change to epicsEventWaitWithTimeout():

https://github.com/mdavidsaver/epics-base/pull/new/event-timeout

I'd like to see the result of running epicsEventTest with this change on vxWorks and/or RTEMS 4.x before spending further time on this issue.

I'm also uncertain whether the Windows version can be made to meet spec.

Revision history for this message
Ben Franksen (bfrk) wrote : Re: [Bug 1861612] Re: callbackRequestDelay not waiting for 1/60 sec on vxWorks

> So the minimal change to ensure epicsEventWaitWithTimeout() behaves as
> specified (timeout after >=delay) involves a float to int conversion
> like:
>
>> double delay = ...;
>> unsigned ticks = tickRate * delay;
>> ticks += 2;

I'd say it should be

unsigned ticks = ceil(tickRate * delay) + 1

In this way, a delay request of 0.0 seconds results in an actual delay of anywhere between 0.0 and one tick period, which is probably the desired outcome in this case.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I'm attaching an image with a timing diagram which might help to motivate this conversation.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

@Ben, the short answer is that both Andrew and I have tried +=1 and it doesn't work.

I think I can explain the reason by referring to the Timing diagram (delay.png) I've attached to this issue.

Consider a case where D (requested timeout) is slightly less than T (tick period), and A (time from call to next tick) is almost zero.

In this case the float to int conversion floor(D/T)==0. So +=1 gives a delay of 1. This will wait until the next tick (time A) which is almost zero.

The primitives we're working with on RTEMS and vxWorks seem to respond to a requested non-zero timeout of 'd' ticks by waiting A+T*(d-1) seconds.

Revision history for this message
Ben Franksen (bfrk) wrote :

I didn't mean to dispute your analysis. I think it is correct, and I understood it exactly in the way suggested by your diagram.

> The short answer is that both Andrew and I have tried +=1 and it
> doesn't work.

It cannot work if you convert using trunc, which is the default "cast"
conversion between float and int.

> In this case the float to int conversion floor(D/T)==0. So +=1 gives a
> delay of 1. This will wait until the next tick (time A) which is almost
> zero.

I was proposing to use ceil. This will result in 1, except if the input
is exactly 0.0.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

> I didn't mean to dispute your anaysis.

I wasn't reading your comment this way. I hadn't clearly stated an analysis to dispute.

> In this way, a delay of 0.0 seconds

Since delay==0 has a special meaning (wait forever), let's talk about D==T. In this case floor(D/T)+2==3 and ceil(D/T)+1==2.

I chose floor()+2 (aka. truncation) in part because I don't know if vxWorks provides ceil(double), which isn't required by C89. Also, if floor()+2 doesn't pass Andrew's test, then ceil()+1 won't either. So the longer delay seems to me the one to test.

Revision history for this message
Ben Franksen (bfrk) wrote :

Okay, fine. I just wanted to make sure you considered this as a possibility.

(And sorry for the multiple comments, I got errors when I hit the "Post Comment" button and then sent a modified version via email.)

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

@Andrew, the failures in epicsEventTest are confusing all of my appveyor build results for unrelated changes. Would you object to my marking these as TODO when #ifdef _WIN32?

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

After some private discussions and testing, I've given up trying to fix this issue and reverted most of the changes made by 4f2228fb1d7527fb5ebc8b2d747c309f1dd7698d (excluding some in unit tests) with f1cbe93b6c0a8607a7e3cfb732c0eece7c64a029. I've also marked the new tests in epicsEventTest as TODOs. This will have to wait for someone with time and access to a test system to sort out.

Revision history for this message
Freddie Akeroyd (freddie-akeroyd) wrote :

If you need somebody to run tests and try a few things on a Windows system, let me know

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Thanks for the offer Freddie. In this case the test systems I had in mind are principally RTEMS and vxWorks.

Changed in epics-base:
importance: Critical → Low
Changed in epics-base:
milestone: 7.0.3.2 → none
Revision history for this message
Andrew Johnson (anj) wrote :

Towards the end of my tl;dr comment #18 above I wrote:

The solution to this probably has to involve reading the monotonic clock and calling semTake(id, 1) again if the OS returned before the hi-res delay has expired (we might have to do that twice).

So far I don't believe anybody has tried actually coding that fairly obvious solution for any of the OSs where this problem exists.
