node down: sandesh fails marshalling a float onto a u32 value

Bug #1540973 reported by Martin Millnert
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
Trunk
Fix Committed
Undecided
Unassigned

Bug Description

Our contrail-api node experiences a problem when trying to marshall a sandesh message into serialized form.
We run daily builds, so this build we currently run is from 2016-02-02 01:47 CET (some 15 hours old by now).

See this log / debug excerpt:

Feb 02 16:22:49 contrail-dev-config1.srv.lab.cloud.ipnett.se contrail-api[14463]: 02/02/2016 04:22:49 PM [TXMLProtocol]: TXML Protocol: Invalid u32 value 0.02490234375
Feb 02 16:22:49 contrail-dev-config1.srv.lab.cloud.ipnett.se contrail-api[14463]: ERROR:TXMLProtocol:TXML Protocol: Invalid u32 value 0.02490234375
Feb 02 16:22:49 contrail-dev-config1.srv.lab.cloud.ipnett.se contrail-api[14463]: ERROR:contrail-api:Failed to send sandesh
Feb 02 16:22:49 contrail-dev-config1.srv.lab.cloud.ipnett.se contrail-api[14463]: ERROR:contrail-api:PFORMAT: ModuleCpuStateTrace(_context='', _scope='', _category='', _send_queue_enabled=True, _seqnum=1, _versionsig=4054510910, _s
ource='contrail-dev-config1.srv.lab.cloud.ipnett.se', _instance_id='0', _client=None, _type=6, _hints=1, _http_server=None, _logger=None, _more=False, _node_type='Config', data=ModuleCpuState(module_cpu_info=[ModuleCpuInfo(instance
_id='0', module_id='contrail-api', cpu_info=CpuLoadInfo(sys_mem_info=SysMemInfo(total=1884124L, buffers=24956L, free=0.02490234375, used=1404096L), meminfo=MemInfo(virt=677196, peakvirt=677196, res=71532), cpu_share=10.0, num_cpu=1
, cpuload=CpuLoadAvg(fifteen_min_avg=0.08, five_min_avg=0.12, one_min_avg=0.14)))], name='contrail-dev-config1.srv.lab.cloud.ipnett.se', deleted=None, config_node_ip=['10.19.30.116'], _table='ObjectConfigNode', build_info='{"build-
info" : [{"build-version" : "3.0", "build-time" : "2016-02-02 02:22:41.454097", "build-user" : "jenkins-remote", "build-hostname" : "jenkinsx86-64-test-rpmcentos7.srv.lab.cloud.ipnett.se", "build-git-ver" : "6caa96c", "build-id" :
"3.0-1602020147.el7", "build-number" : "@ipnett"}]}'), _module='contrail-api', _level=2147483647, _timestamp=1454426569455028, _client_context='', _connect_to_collector=True, _role=0)
Feb 02 16:22:49 contrail-dev-config1.srv.lab.cloud.ipnett.se contrail-api[14463]: ERROR:contrail-api:sandesh type: : ModuleCpuStateTrace

We traced the error itself to https://github.com/Juniper/contrail-sandesh/blob/master/library/python/pysandesh/protocol/TXMLProtocol.py#L255

It happens at https://github.com/Juniper/contrail-sandesh/blob/master/library/python/pysandesh/sandesh_session.py#L177

So, as far as we underestand, this ModuleCpueStateTrace , held in /usr/lib/python2.7/site-packages/cfgm_common/uve/cfgm_cpuinfo/ttypes.py , distributed by the package "python-contrail", is actually generated:
anticimex@galileo ~/code/cloud-services/opencontrail $ grep -r ttypes.py .
./contrail-sandesh/compiler/generate/t_py_generator.cc: string f_types_name = package_dir_+"/"+"ttypes.py";

The error can be said to be that "someone put a float where there must be an int", but shouldn't the marshaller code actually be able to detect whether it's float or int, since python allows you to do that detection, and use the appropriate format when marshalling?

I.e.
>>> type(5).__name__
'int'
>>> type(0.2).__name__
'float'
>>>

Anyway, looking forward to finding a solution. The bug is severe, it appears this prevents nodes from reporting in that they are alive in some sense for contrail; though discovery pub/sub is working unaffected.

information type: Proprietary → Public
Revision history for this message
Martin Millnert (r-martin-5) wrote :

More information, and patch submitted.

The problem is that numeric array references to python-psutil's phymem_usage() are used at
https://github.com/Juniper/contrail-controller/blob/8cc4abe1d8422ce515673a79bc614b27c4fb088e/src/config/common/vnc_cpu_info.py#L132

We have currently python-psutil-2.2.1-el7 installed, and it looks like this:
>>> import psutil
>>> phymem_usage = psutil.phymem_usage()
>>> phymem_usage
svmem(total=1929342976L, available=1342205952L, percent=30.4, used=1854337024L, free=75005952L, active=851570688, inactive=844365824, buffers=85241856L, cached=1181958144)
>>> phymem_usage[2]/1024
0.0296875

Now, this is slightly tricky since psutil is an external library.

First of all, one can note that since commit https://github.com/giampaolo/psutil/commit/13689e21aae3cef779b54a57bc985cab36fdaa9e which was between psutil 0.5.1 and 0.6.0, phymem_usage() has been deprecated in favour of the virtual_memory() call, which provides the data as named tuples, i.e.:
>>> vmem = psutil.virtual_memory()
>>> dir(vmem)
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_fields', '_make', '_replace', 'active', 'available', 'buffers', 'cached', 'count', 'free', 'inactive', 'index', 'percent', 'total', 'used']
>>> vmem.available
1341956096L
>>> vmem.free
109649920L

Three files use this psutil SysMemInfo method in contrail-controller:
anticimex@galileo ~/code/cloud-services/external/opencontrail/contrail-controller [bug] $ grep -lr SysMemInfo --include='*.py' .
./src/vnsw/agent/uve/cpuinfo.py
./src/opserver/cpuinfo.py
./src/config/common/vnc_cpu_info.py

The proposed fix is two-fold:
 1) impose minimum version requirement of python-psutil to 0.6.0 around the board,
 2) use the new and less imprecise python-psutil method virtual_memory(), which has been stable since version 0.6.0, released 2012-07-24.

A proposed fix for contrail-packages comes in a separate bugreport.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/16828
Submitter: Martin Millnert (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16828
Committed: http://github.org/Juniper/contrail-controller/commit/4b9601713e86e90f2ee20f8c33ca5489fdba6a83
Submitter: Zuul
Branch: master

commit 4b9601713e86e90f2ee20f8c33ca5489fdba6a83
Author: Martin Millnert <email address hidden>
Date: Wed Feb 3 16:04:02 2016 +0100

fix usage of psutil

This patch fixes usage of psutil API functions that have been deprecated
since psutil 0.6.0, or commit 13689e21aae3 to be exact.

Basically, the psutil function phymem_usage() and phymem_buffers() have
been deprecated in favour of a combind virtual_memory().
And virtual_memory() exposes named tuples, which are MUCH safer to use
since the size of the array may change at any time.

The specific bug, reported in
https://bugs.launchpad.net/juniperopenstack/+bug/1540973 , was that on
an EL 7.2 system with python-psutil 2.2.1, phymem_usage() returns the
virtual_memory() array, which is unlike whatever version that the three
patched files were coded against.

Specifically, the bug that appears is that phymem_usage.free, is expectd
to sit at phymem_usage[2], but that array position instead carries the
"percent" variable:

>>> import psutil
>>> phymem_usage = psutil.phymem_usage()
>>> phymem_usage
svmem(total=1929342976L, available=1347878912L, percent=30.1,
used=1836646400L, free=92696576L, active=875323392, inactive=802705408,
buffers=87674880L, cached=1167507456)
>>> phymem_usage[2]
30.1
>>> phymem_usage[2]/1024
0.02939453125

Contrail later transfers these values onto a SysMemInfo variable,
that gets marshalled by pysandesh onto a u32 array struct
(
https://github.com/Juniper/contrail-controller/blob/0485424b00ab62cef037020907db18ded6d4fe87/src/base/sandesh/cpuinfo.sandesh#L25 ),
which causes config to not be able to report its cpuinfo correctly:

Feb 02 16:22:49 contrail-dev-config1.srv.lab.cloud.ipnett.se
contrail-api[14463]: 02/02/2016 04:22:49 PM [TXMLProtocol]: TXML
Protocol: Invalid u32 value 0.02490234375

Feb 02 16:22:49 contrail-dev-config1.srv.lab.cloud.ipnett.se
contrail-api[14463]: ERROR:TXMLProtocol:TXML Protocol: Invalid u32 value
0.02490234375

Feb 02 16:22:49 contrail-dev-config1.srv.lab.cloud.ipnett.se
contrail-api[14463]: ERROR:contrail-api:Failed to send sandesh

This patch fixes the problem.
Note: requirements.txt, test-requirements.txt, setup.py and SConscripts
have been updated to reflect these changes. We'll see where that leads
us, given that the origin/master of psutil has fully removed some
deprecated API functions by now.

Change-Id: I125a85425ed1e9c011b33c7b26839781ed12f407
Closes-Bug: 1540973
Signed-off-by: Martin Millnert <email address hidden>

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.