Bug #1841634 “CP link triggers lost when record is async” : Series 7.0 : Bugs : EPICS Base

Revision history for this message

Andrew Johnson (anj) wrote on 2019-08-27:

#1

One way to fix this might be to replicate the scanOnce queue but have the active end check PACT and set RPRO if the record is busy at the time, like dbPutField() does. One question would be whether the resulting queue ("scanForce" maybe?) should become part of dbScan or just live in dbCa.c.

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2019-08-28:

#2

I'm not sure I like the semantics of this proposed scanForce. Seems like it would create a lot of churn for lack of a way to notify dbCa when previous processing has completed. The closest I can imagine to this right now would be dbNotify w/ processRequest, though I haven't gone through dbNotify.c in enough detail to be sure if it handles finding PACT=1 reasonably.

Revision history for this message

Andrew Johnson (anj) wrote on 2019-08-29:

#3

“a lot of churn”? — please explain.

I believe the rpro mechanism does exactly what we need here (but I admit I was looking at the 3.15 version of the code, not 7.0). The scanOnce code is pretty simple and short so duplicating it won’t be hard, the only difference being that the queue processor checks pact and sets rpro (and putf perhaps?) if needs be. The word Force isn’t quite right for the API name though, I’ll keep thinking on that.

Why are you thinking we might have to introduce dbNotify here? Now /that/ would be churn.

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2019-08-30:

#4

I think I need to talk a bit of theory first. What I see here is a queue of queues, or a chain of queues. In such a case, it is optimal to handle overflows at the head of the first queue. Anything else results in work being done by earlier queue(s), only to be discarded when a later queue overflows. This is wasted work is what I meant by "churn".

The dbCa queue is 5 deep, while the RPRO "queue" is only 1. Potentially queuing up 5 scans only to discard 4 of them seems unnecessarily wasteful. Is is "a lot of".

imo. The proper way to handle these queue of queue situations is to couple the queues so that flow control can be applied. A full queue, other than the first, can then apply back-pressure to the previous queue.

This idea played into my adding scanOnceCallback() and using it to provide feedback to dbCa. What I see here is that that this feedback comes too early.

I raise the possibility of using dbNotify because it seems like to only mechanism I know of in Base at present which might provide the additional information which scanOnceCallback() doesn't, when an inprogress async record has finished.

I say "at present" because I don't particularly like dbNotify. It's too narrow in what it can do, and yet has enough options to be confusing. I would rather have a more general mechanism which doesn't try to handle the "put" or "process" part of the sequence, which can be accomplished with other APIs, but rather focuses on providing notification when current (async) processing completes. There is no other way to do this part.

Revision history for this message

Andrew Johnson (anj) wrote on 2019-08-30:

#5

Okay, so that part of dbCa changed in 3.16 without my realizing it, I was looking at the 3.15 version and had forgotten about the scanLinkOnce() addition, sorry my bad.

So now we need to have a discussion on whether updates from CA links with CP/CPP set should queue or cache. Maybe there could be another flag in the link to request queueing, but given that the CA event queues sending the value updates can and do throw away older updates once the queue is full I think we're better off not allowing multiple triggers to get queued up. The dbCa code doesn't queue the values, so I see little point in processing the record more than once after the last trigger has arrived and its value cached. We need to ensure that the record does process once after the last update, but I don't think we could wait for the dbNotify "finished" event. That might not happen until hours later in some cases, so I think it would be the wrong mechanism.

RPRO could do what we need if we had a way to get onceTask to do something different. How about we modify onceTask to give it a more general callback mechanism as well. It should still use the ring buffer, but maybe push variable length items to avoid wasting ring space (the push operation must be atomic, but the pulls don't need to be). A call to scanOnce() with no callback only needs to push a "process this record" command and the record pointer; a call to scanOnceCallback() needs to push "process record and callback" plus the three pointer arguments; we'd add a generic "call this" command with 2 pointer arguments.

With this the dbCa callback would lock the record, process it or set RPRO and unlock with completion checks that we need. The scanningOnce counter could become a boolean but in any case it should be zeroed in dbCaGetLink(), which is the point at which we'll have passed the latest cached value to the record so the trigger from it can expire. We no longer have a queue with an arbitrary length, and we don't process the record more than necessary.

Does that finally make some sense?

Okay, so that part of dbCa changed in 3.16 without my realizing it, I was looking at the 3.15 version and had forgotten about the scanLinkOnce() addition, sorry my bad.

So now we need to have a discussion on whether updates from CA links with CP/CPP set should queue or cache. Maybe there could be another flag in the link to request queueing, but given that the CA event queues sending the value updates can and do throw away older updates once the queue is full I think we're better off not allowing multiple triggers to get queued up. The dbCa code doesn't queue the values, so I see little point in processing the record more than once after the last trigger has arrived and its value cached. We need to ensure that the record does process once after the last update, but I don't think we could wait for the dbNotify "finished" event. That might not happen until hours later in some cases, so I think it would be the wrong mechanism.

RPRO could do what we need if we had a way to get onceTask to do something different. How about we modify onceTask to give it a more general callback mechanism as well. It should still use the ring buffer, but maybe push variable length items to avoid wasting ring space (the push operation must be atomic, but the pulls don't need to be). A call to scanOnce() with no callback only needs to push a "process this record" command and the record pointer; a call to scanOnceCallback() needs to push "process record and callback" plus the three pointer arguments; we'd add a generic "call this" command with 2 pointer arguments.

With this the dbCa callback would lock the record, process it or set RPRO and unlock with completion checks that we need. The scanningOnce counter could become a boolean but in any case it should be zeroed in dbCaGetLink(), which is the point at which we'll have passed the latest cached value to the record so the trigger from it can expire. We no longer have a queue with an arbitrary length, and we don't process the record more than necessary.

Does that finally make some sense?

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2019-09-04:

#6

> Okay, so that part of dbCa changed in 3.16 without my realizing it

cf. https://code.launchpad.net/~epics-core/epics-base/dbscan-update/+merge/245680

I hide it right at the top of the diff?

> I see little point in processing the record more than once

Fine with me. I can't recall where the 5 times came from.

> modify onceTask to give it a more general callback mechanism

In terms of complexity, this seems not far away from callbackRequest().
Why not just use that? This would also have the side effect of making
.PRIO meaningful for CP links.

Revision history for this message

Andrew Johnson (anj) wrote on 2020-01-25:

#7

I starting looking at this again a few weeks ago when Mark Rivers reported a problem that Ben Franksen recognized as being this bug. I'm still deciding how best to fix it; this comment is just to record my conclusions to date, I'm not looking for feedback yet and I don't have code.

--------------------------------------------------------------------------------------------

The scanOnceCallback() routine was added on the 3.16 branch by the dbscan-update branch to fix lp:1362362. However dbCA's scanComplete() routine that is the only user to date doesn't know whether the record was actually processed or not. I'm convinced that we need RPRO to solve this issue.

The only internal code that calls scanOnce() is recGblFwdLink(), which does so if RPRO is set at the end of record processing. In that particular case it doesn't matter if onceTask subsequently finds PACT==TRUE since its purpose has already been met — to have the record processed again, which someone else had already triggered. Thus for this usage the original simple scanOnce() API is fine.

The dbCa case is more complicated because the scanLinkOnce() routine that uses scanOnceCallback() gets called from a CA callback routine in the context of a CA thread. It can't know whether the record is busy or not at the time, and it can't take the record lock to go and look without risking deadlocks and probably affecting performance. However it can set a flag in the caLink object to indicate that a newly arrived value is waiting to be read by the record, so when this scanOnce request gets processed by onceTask it can set RPRO if the record is busy (PACT) and the flag indicates that the value from the link has not been read yet. Using RPRO guarantees that the record will reprocess when the current process cycle completes.

The flag is set in dbCA's scanLinkOnce(), when is it cleared? I originally assumed that would happen when the value gets read, but if the CP link that triggered the processing isn't read when the record processes, the flag could then remain set forever. That's no good, I need it to indicate that a scan is already queued up, so further queuing is unnecessary (CA is allowed to throw away monitor updates, we're just doing the same thing). That means we have to clear the flag in dbCA's scanComplete() routine (or it's replacement), i.e. just before we process the record or set its RPRO field. It must happen before in case a monitor comes in during processing.

I starting looking at this again a few weeks ago when Mark Rivers reported a problem that Ben Franksen recognized as being this bug. I'm still deciding how best to fix it; this comment is just to record my conclusions to date, I'm not looking for feedback yet and I don't have code.

--------------------------------------------------------------------------------------------

The scanOnceCallback() routine was added on the 3.16 branch by the dbscan-update branch to fix lp:1362362. However dbCA's scanComplete() routine that is the only user to date doesn't know whether the record was actually processed or not. I'm convinced that we need RPRO to solve this issue.

The only internal code that calls scanOnce() is recGblFwdLink(), which does so if RPRO is set at the end of record processing. In that particular case it doesn't matter if onceTask subsequently finds PACT==TRUE since its purpose has already been met — to have the record processed again, which someone else had already triggered. Thus for this usage the original simple scanOnce() API is fine.

The dbCa case is more complicated because the scanLinkOnce() routine that uses scanOnceCallback() gets called from a CA callback routine in the context of a CA thread. It can't know whether the record is busy or not at the time, and it can't take the record lock to go and look without risking deadlocks and probably affecting performance. However it can set a flag in the caLink object to indicate that a newly arrived value is waiting to be read by the record, so when this scanOnce request gets processed by onceTask it can set RPRO if the record is busy (PACT) and the flag indicates that the value from the link has not been read yet. Using RPRO guarantees that the record will reprocess when the current process cycle completes.

The flag is set in dbCA's scanLinkOnce(), when is it cleared? I originally assumed that would happen when the value gets read, but if the CP link that triggered the processing isn't read when the record processes, the flag could then remain set forever. That's no good, I need it to indicate that a scan is already queued up, so further queuing is unnecessary (CA is allowed to throw away monitor updates, we're just doing the same thing). That means we have to clear the flag in dbCA's scanComplete() routine (or it's replacement), i.e. just before we process the record or set its RPRO field. It must happen before in case a monitor comes in during processing.

Andrew Johnson (anj) on 2022-05-07

tags:

removed: codeathon

Affects		Status	Importance	Assigned to	Milestone
	EPICS Base	Status tracked in 7.0
	7.0	Confirmed	High	Unassigned

EPICS Base

CP link triggers lost when record is async

Bug Description

Other bug subscribers

Remote bug watches