'assert (size <= ntohs ( pMsg->m_postsize ))' failed in ..caserverio.c line 344

Bug #541330 reported by Jeff Hill
This bug affects 1 person

Affects: EPICS Base
Status: Invalid
Importance: High
Assigned to: mdavidsaver

Bug Description

From Chen Zhi-chu,

I've written an IOC (using TekVISA from Tektronix on a DPO7054 oscilloscope, Windows XP) and it starts okay at first. But after it runs for several minutes, I get an error like:
==============================================================
A call to "assert (size <= ntohs ( pMsg->m_postsize ))" failed in ..caserverio.
c line 344.
EPICS Release EPICS R3.14.5 $R3-14-5$ $2004/02/04 20:45:33$.
Current time Fri Sep 07 2007 09:49:29.518463915 .
Please E-mail this message to the author or to <email address hidden>
Calling epicsThreadSuspendSelf()
==============================================================
then it gets stuck. I suppose the cas_commit_msg function checks pMsg->m_postsize against size and the check fails, but I don't know where these values come from.

I'm sorry, I'm just new to EPICS and have little experience with C programming. Could any of you give me a clue as to what's happening? Any input is welcome.

Original Mantis Bug: mantis-300
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=300

Tags: ca 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

I don't have a stack trace, so I am attempting to infer the cause from the quite different paths through the code that reach this particular assert failure. After some analysis I can identify two possible causes.

1) A string (a scalar DBR_STRING of type dbr_string_t) that somehow isn't null terminated is passing through the database to the CA server.
2) The server is printing a diagnostic message, and a record name, as obtained with RECORD_NAME ( paddr ) from the database API, is unusually long (perhaps it is corrupt, or alternatively it isn't null terminated).

I can easily detect (1) and fix it in the scalar case. Thanks to C99 we also now have functions like vsnprintf, and I can easily make the code robust against (2).

edited on: 2007-09-07 11:42

Jeff Hill (johill-lanl) wrote :

I committed this patch:

Index: camessage.c
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/rsrv/camessage.c,v
retrieving revision 1.105.2.10
diff -c -b -r1.105.2.10 camessage.c
*** camessage.c 17 Aug 2007 22:31:10 -0000 1.105.2.10
--- camessage.c 7 Sep 2007 17:03:09 -0000
***************
*** 24,29 ****
--- 24,30 ----
  #include "osiSock.h"
  #include "osiPoolStatus.h"
  #include "epicsEvent.h"
+ #include "epicsStdio.h"
  #include "epicsThread.h"
  #include "epicsMutex.h"
  #include "epicsTime.h"
***************
*** 134,139 ****
--- 135,141 ----
  va_list args
  )
  {
+     static const ca_uint32_t maxDiagLen = 512;
      struct channel_in_use *pciu;
      caHdr *pReqOut;
      char *pMsgString;
***************
*** 174,180 ****
       * allocate plenty of space for a sprintf() buffer
       */
      localStatus = cas_copy_in_header ( client,
!         CA_PROTO_ERROR, 512, 0, 0, cid, status,
          ( void * ) &pReqOut );
      if ( localStatus != ECA_NORMAL ) {
        errlogPrintf ( "caserver: Unable to deliver err msg \"%s\" to client because \"%s\"\n",
--- 176,182 ----
       * allocate plenty of space for a sprintf() buffer
       */
      localStatus = cas_copy_in_header ( client,
!         CA_PROTO_ERROR, maxDiagLen, 0, 0, cid, status,
          ( void * ) &pReqOut );
      if ( localStatus != ECA_NORMAL ) {
        errlogPrintf ( "caserver: Unable to deliver err msg \"%s\" to client because \"%s\"\n",
***************
*** 215,223 ****
      /*
       * add their context string into the protocol
       */
!     status = vsprintf ( pMsgString, pformat, args );
!     if ( status >= 0 ) {
!         size += ( ( ca_uint32_t ) status ) + 1u;
      }
      cas_commit_msg ( client, size );
  }
--- 217,236 ----
      /*
       * add their context string into the protocol
       */
!     localStatus = epicsVsnprintf ( pMsgString, maxDiagLen, pformat, args );
!     if ( localStatus >= 1 ) {
!         unsigned diagLen = ( unsigned ) localStatus;
!         if ( diagLen < maxDiagLen ) {
!             size += (ca_uint32_t) (diagLen + 1u);
!         }
!         else {
!             errlogPrintf (
!                 "caserver: vsend_err: epicsVsnprintf detected "
!                 "error message truncation, pFormat = \"%s\"\n",
!                 pformat );
!             size += maxDiagLen;
!             pMsgString [ maxDiagLen - 1 ] = '\0';
!         }
      }
      cas_commit_msg ( client, size );
  }
***************
*** 495,501 ****
      struct channel_in_use *pciu = pevext->pciu;
      const int readAccess = asCheckGet ( pciu->asClientPVT );
      int status;
-     int strcnt;
      int v41;

      SEND_LOCK ( pClient );
--- 508,513 ----
***************
*** 575,581 ****
          }
      }
      else {
!         ca_uint32_t msgSize = pevext->size;
          int cacStatus = caNetConvert (
              pevext->msg.m_dataType, pPayload, pPayload,
              TRUE /* host -> net format */, pevext->msg.m_count );
--- 587,593 ----
          }
      }
      else {
!         ca_uint32_t payloadSize = pevext->...


Jeff Hill (johill-lanl) wrote :

From Chen Zhi-chu,

About the patch thing, after I'd restarted the oscilloscope, I kind of lost the error message. I tried everything and the error wouldn't appear again. I'm sorry.

Jeff Hill (johill-lanl) wrote :

Thanks for the update. I'm not sure why you were ending up with what was probably a string lacking its null termination in your system. I think the patch is still a good idea; it will make the software more robust.

Jeff Hill (johill-lanl) wrote :

fixed in R3.14.10

Andrew Johnson (anj) wrote :

R3.14.10 released.

mdavidsaver (mdavidsaver) wrote :

On 10/24/2016 08:46 AM, Vodopivec, Klemen wrote:
> Hi everybody,
>
> in one of our many-PV IOCs we've received the error below. The IOC is still up and running and as far as we can tell there's no functionality degradation. But since the error says to post to tech-talk, I figured I would ask anyway. Has anybody seen this error before? It seems to be on the CA client side inside the IOC, so hopefully the client was able to spawn and use a new thread.
>
> IOC is scheduled for reboot next Tuesday to recover the suspended thread, but we can leave it running a bit longer to gather some info if needed.
>
> [Wed Oct 5 13:35:44 2016]
>
>
> A call to 'assert(size <= ntohs ( pMsg->m_postsize ))'
> by thread 'CAS-client' failed in ../caserverio.c line 360.
> EPICS Release EPICS R3.14.12.4 $Date: Mon 2013-12-16 15:51:45 -0600$.
> Local time is 2016-10-05 13:35:44.419250921 EDT
> Please E-mail this message to the author or to <email address hidden>
> Calling epicsThreadSuspendSelf()
> [Wed Oct 5 13:35:50 2016] Thread CAS-client (0x7f9a1404da20) suspended
>
> Thanks, Klemen

mdavidsaver (mdavidsaver) wrote :

Looks like a logic error during outbound message construction. If you (Klemen)
could attach a debugger before restarting it would be helpful. In
addition to a stack trace, I'd be interested to know 'size' and '*pMsg'.

Changed in epics-base:
status: Fix Released → Incomplete
milestone: 3.14.10 → 3.14.branch
assignee: Jeff Hill (johill-lanl) → mdavidsaver (mdavidsaver)
mdavidsaver (mdavidsaver) wrote :

The likely culprit for the assert() that Klemen sees is read_reply() in camessage.c. This, and vsend_err(), are the places where sizes are changed between a call to cas_copy_in_header() and to cas_commit_msg(). vsend_err() looks right, so I'm inclined to think that this is a new bug introduced since Jeff's fix. read_reply() does some complicated recalculation of sizes to handle count==0 requests. I can't find anything obviously wrong, but may be missing something. Hopefully Klemen can provide a stack trace as this would answer a lot of questions.

Also, it seems like the 'assert(count < 65536)' in cas_set_header_count() could be triggered when a get/monitor on a large array returns an error or is autosized to a count smaller than 64k. I'll have to look more closely at this.

Klemen Vodopivec (klemenv) wrote :

Michael's guess was right; the gdb output *does* point to the read_reply() function:

(gdb) bt
#0 0x00007f9aacaee68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f9aad6fda54 in condWait (condId=0x7f9a14041338, mutexId=0x7f9a14041310) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#2 0x00007f9aad6fddac in epicsEventWait (pevent=0x7f9a14041310) at ../../../src/libCom/osi/os/posix/osdEvent.c:137
#3 0x00007f9aad6fcabe in epicsThreadSuspendSelf () at ../../../src/libCom/osi/os/posix/osdThread.c:556
#4 0x00007f9aad6fae11 in epicsAssert (pFile=0x7f9aae649681 "../caserverio.c", line=360, pExp=0x7f9aae649820 "size <= ntohs ( pMsg->m_postsize )",
    pAuthorName=0x7f9aad70c7fc "the author") at ../../../src/libCom/osi/os/default/osdAssert.c:51
#5 0x00007f9aae640580 in cas_commit_msg (pClient=0x7f9a0c000f20, size=48) at ../caserverio.c:360
#6 0x00007f9aae6440ed in read_reply (pArg=0x7f9aa9dcebf0, paddr=0x7f995c018968, eventsRemaining=1, pfl=0x0) at ../camessage.c:623
#7 0x00007f9aae644654 in read_notify_action (mp=0x7f9aa9dcec70, pPayload=0x7f9a1403a1f0, client=0x7f9a0c000f20) at ../camessage.c:761
#8 0x00007f9aae6484b6 in camessage (client=0x7f9a0c000f20) at ../camessage.c:2574
#9 0x00007f9aae642d01 in camsgtask (pParm=0x7f9a0c000f20) at ../camsgtask.c:123
#10 0x00007f9aad6fc3ce in start_routine (arg=0x7f9a1404da20) at ../../../src/libCom/osi/os/posix/osdThread.c:385
#11 0x00007f9aacaeaaa1 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f9aac837aad in clone () from /lib64/libc.so.6
(gdb) frame 5
#5 0x00007f9aae640580 in cas_commit_msg (pClient=0x7f9a0c000f20, size=48) at ../caserverio.c:360
360 assert ( size <= ntohs ( pMsg->m_postsize ) );
(gdb) print size
$3 = 48
(gdb) print *pMsg
$4 = {m_cmmd = 3840, m_postsize = 10240, m_dataType = 0, m_count = 256, m_cid = 16777216, m_available = 1210253312}

And further looking into read_reply() function:
(gdb) frame 6
#6 0x00007f9aae6440ed in read_reply (pArg=0x7f9aa9dcebf0, paddr=0x7f995c018968, eventsRemaining=1, pfl=0x0) at ../camessage.c:623
623 cas_commit_msg ( pClient, payload_size );
(gdb) print status
$18 = 0
(gdb) print cacStatus
$19 = 1
(gdb) print autosize
$20 = 0
(gdb) print payload_size
$21 = 44
(gdb) print pevext->msg.m_dataType
$22 = 28
(gdb) print dbr_type[28]
No symbol "dbr_type" in current context.
(gdb) print dbr_size[28]
$23 = 44
(gdb) print item_count
$24 = 1

Now trying to understand the m_dataType mismatch: pMsg->m_dataType=0 while pevext->msg.m_dataType=28. Maybe one is the receive and the other the transmit message.

mdavidsaver (mdavidsaver) wrote :

These results confuse me. This looks like a successful read_notify of a scalar w/o dynamic sizing. As I read it, autosize==0 in read_reply() should mean that payload_size doesn't change between cas_copy_in_header() on line 541 and the failing call to cas_commit_msg() on line 623.

Yet cas_copy_in_header() appears to have set pMsg->m_postsize=40 (in LSB) while payload_size==44 when the assert fails. If I read the code correctly, m_postsize should be 48, and indeed in cas_commit_msg() it is. Unless CA_MESSAGE_ALIGN() has rounded down instead of up in one case, I have no explanation.

I have no explanation for the different m_dataType. 28 is DBR_CTRL_STRING, while 0 is DBR_STRING, though this may be a coincidence. The other members of *pMsg look reasonable, so if this is corruption it is very localized.

https://github.com/epics-base/epics-base/blob/30090918754ed694ec3e11bdd268260131f91c70/src/rsrv/camessage.c#L623

Klemen Vodopivec (klemenv) wrote :

Per Michael's request:

(gdb) frame 6
#6 0x00007f9aae6440ed in read_reply (pArg=0x7f9aa9dcebf0, paddr=0x7f995c018968, eventsRemaining=1, pfl=0x0) at ../camessage.c:623
623 cas_commit_msg ( pClient, payload_size );

(gdb) print pevext->msg
$1 = {m_postsize = 0, m_count = 1, m_cid = 57762, m_available = 9037, m_dataType = 28, m_cmmd = 15}

(gdb) print *pClient
$2 = {node = {next = 0x7f9a1404d0b0, previous = 0x7f9a1404cfa0}, send = {buf = 0x7f9a1403cf50 "", stk = 0, maxstk = 16384, cnt = 0, type = mbtSmallTCP}, recv = {buf = 0x7f9a14038f50 "", stk = 4752, maxstk = 16384,
    cnt = 8640, type = mbtSmallTCP}, lock = 0x7f9a1408e2a0, putNotifyLock = 0x7f9a14041810, chanListLock = 0x7f99b8090900, eventqLock = 0x7f9a8c0037f0, chanList = {node = {next = 0x7f995c024a20,
      previous = 0x7f9950036140}, count = 219}, chanPendingUpdateARList = {node = {next = 0x7f995c024720, previous = 0x7f995c0247a0}, count = 976}, putNotifyQue = {node = {next = 0x0, previous = 0x0}, count = 0},
  addr = {sin_family = 2, sin_port = 16622, sin_addr = {s_addr = 2199023370}, sin_zero = "\000\000\000\000\000\000\000"}, time_at_last_send = {secPastEpoch = 844536944, nsec = 317760488}, time_at_last_recv = {
    secPastEpoch = 844536944, nsec = 217504176}, evuser = 0x7f9a500bb940, pUserName = 0x7f99a4000b10 "bl9-user", pHostName = 0x7f99a4000b30 "bl9-dasopi1.sns.gov", blockSem = 0x7f9a14041060, sock = 35, proto = 6,
  tid = 0x7f9a1404da20, minor_version_number = 13, seqNoOfReq = 0, recvBytesToDrain = 0, priority = 0, disconnect = 1 '\001'}

*dbch doesn't exist in our version of base, which is R3.14.12.4.

mdavidsaver (mdavidsaver) wrote :

Looking again at the client state.

> chanPendingUpdateARList = ..., count = 976},

There appears to have been an access security re-compute in progress, though this may have started later and stalled waiting for the client lock held by the suspended thread.

> disconnect = 1
> send = {... stk = 0

As I read it, disconnect==1 only happens from a send() error. This should only be set while the client lock is held, so unless this rule is being violated it shouldn't have changed since the assert() failure.

mdavidsaver (mdavidsaver) wrote :

lp:1707931 triggered the same assertion as this issue, but has a WIN32-specific initiating event, so it cannot be the same issue that Klemen saw. However, the additional locking may fix this as well.

mdavidsaver (mdavidsaver) wrote :

Looks like this issue has come up again. https://epics.anl.gov/tech-talk/2018/msg00044.php

Andrew Johnson (anj)
Changed in epics-base:
milestone: 3.14.branch → 3.15.branch
Andrew Johnson (anj)
Changed in epics-base:
status: Incomplete → Invalid