lxml

Comment 8 for bug 583249

scoder (scoder) wrote on 2019-04-06:

Here is the valgrind output, using libxslt 1.1.32, libxml2 2.9.8 and CPython 3.7.0:

==20079== 1 errors in context 1 of 50:
==20079== Invalid free() / delete / delete[] / realloc()
==20079==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20079==    by 0x6F377FE: xmlFreeNodeList (tree.c:3721)
==20079==    by 0x6F3788C: xmlFreeNodeList (tree.c:3692)
==20079==    by 0x6F3788C: xmlFreeNodeList (tree.c:3692)
==20079==    by 0x6F37583: xmlFreeDoc (tree.c:1253)
==20079==    by 0x6415957: __pyx_pf_4lxml_5etree_9_Document___dealloc__ (etree.c:51785)
==20079==    by 0x64157B0: __pyx_pw_4lxml_5etree_9_Document_1__dealloc__ (etree.c:51765)
==20079==    by 0x6744E65: __pyx_tp_dealloc_4lxml_5etree__Document (etree.c:224844)
==20079==    by 0x67457E8: __pyx_tp_dealloc_4lxml_5etree__Element (etree.c:225159)
==20079==    by 0x674667F: __pyx_tp_dealloc_4lxml_5etree__ElementTree (etree.c:226052)
==20079==    by 0x1FE1CA: tupledealloc (tupleobject.c:246)
==20079==    by 0x1688F4: call_function (ceval.c:4615)
[...]
==20079==  Address 0x887dd00 is 0 bytes inside a block of size 120 free'd
==20079==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20079==    by 0x6AB2744: xsltApplyStripSpaces (transform.c:5732)
==20079==    by 0x6AB3733: xsltApplyStylesheetInternal (transform.c:6011)
==20079==    by 0x66D4E1F: __pyx_f_4lxml_5etree_4XSLT__run_transform (etree.c:200006)
==20079==    by 0x66CE938: __pyx_pf_4lxml_5etree_4XSLT_18__call__ (etree.c:198792)
==20079==    by 0x66CC3EF: __pyx_pw_4lxml_5etree_4XSLT_19__call__ (etree.c:198352)
==20079==    by 0x18999E: _PyObject_FastCallKeywords (call.c:199)
==20079==    by 0x16A617: call_function (ceval.c:4605)
[...]
==20079==  Block was alloc'd at
==20079==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20079==    by 0x70154AE: xmlSAX2TextNode (SAX2.c:1863)
==20079==    by 0x70181BC: xmlSAX2Characters (SAX2.c:2557)
==20079==    by 0x6F1C1B3: xmlParseCharData (parser.c:4457)
==20079==    by 0x6F29AB6: xmlParseContent (parser.c:9862)
==20079==    by 0x6F2A492: xmlParseElement (parser.c:10014)
==20079==    by 0x6F29B5A: xmlParseContent (parser.c:9846)
==20079==    by 0x6F2A492: xmlParseElement (parser.c:10014)
==20079==    by 0x6F2AC1A: xmlParseDocument (parser.c:10711)
==20079==    by 0x6F323A0: xmlDoRead (parser.c:15191)
==20079==    by 0x6F323A0: xmlCtxtReadFile (parser.c:15436)
==20079==    by 0x6558BCB: __pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile (etree.c:122932)
[...]

It shows that libxslt frees text nodes in xsltApplyStripSpaces(), and that xmlFreeDoc() later frees the same nodes again. In other words, the nodes are somehow still linked into the document after they have been freed. libxslt clearly corrupts the tree state here, which then leads to a crash when lxml discards the input document.

These nodes are created by the parser in libxml2, freed by the XSLT processor in libxslt, and then freed again by the document disposal in libxml2. All of this is outside of the control of lxml. Honestly, I cannot see what lxml could do to prevent this. It cannot even reliably warn about stylesheets that strip whitespace, because the stripping can also be requested by a transitively imported stylesheet.
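To illustrate why a warning based on inspecting the stylesheet would be unreliable, here is a minimal sketch. The helper `declares_strip_space` and the file name `strips-whitespace.xsl` are made up for this example and are not part of lxml. The helper only looks at the stylesheet document it is given, so it misses anything declared in an imported stylesheet:

```python
from lxml import etree as et

XSL_NS = {"xsl": "http://www.w3.org/1999/XSL/Transform"}


def declares_strip_space(stylesheet_root):
    """Hypothetical helper: True if this stylesheet document itself
    contains a top-level xsl:strip-space declaration.

    It does not follow xsl:import or xsl:include.
    """
    return bool(stylesheet_root.xpath("/*/xsl:strip-space", namespaces=XSL_NS))


# Declares whitespace stripping directly.
direct = et.fromstring('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
</xsl:stylesheet>''')

# Only imports another (hypothetical) stylesheet, which may or may not
# declare xsl:strip-space. The import is not resolved here.
importer = et.fromstring('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:import href="strips-whitespace.xsl"/>
</xsl:stylesheet>''')

print(declares_strip_space(direct))    # True
print(declares_strip_space(importer))  # False, although the import may strip
```

A complete check would have to resolve every import and include the same way libxslt does, which is why this is not a safe basis for a warning.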

It is also not obvious how libxslt can be fixed. That might require a complete rewrite of the strip-space implementation.

Note that it is inherently wrong for libxslt to modify the *input* document in place during an XSLT transformation. If you run two transforms on the same document, first one that strips whitespace and then one that does not, the second one also sees the stripped document, so you get the same result in both cases even though you asked for something else. Here is another nice example:

----------------
from lxml import etree as et

transform = et.XSLT(et.fromstring('''\
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:strip-space elements="*"/>
 <xsl:template match="/">
 <foo><xsl:value-of select="a/b/text()" /></foo>
 </xsl:template>
</xsl:stylesheet>'''))

xml = et.fromstring('''\
<a>
  <b>huhu</b>
</a>
''')

print("BEFORE", et.tostring(xml, encoding='unicode'))
print("XSLT", transform(xml))
print("AFTER", et.tostring(xml, encoding='unicode'))
----------------

Output:

----------------
BEFORE <a>
  <b>huhu</b>
</a>
XSLT <?xml version="1.0"?>
<foo>huhu</foo>

AFTER <a><b>huhu</b></a>
----------------
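If the goal is only to keep your own tree from being modified like this, one user-side option is to hand the transform a deep copy of the input. This is a minimal sketch under that assumption, not a fix for the double free: libxslt still strips the copy in place, and whatever it does to that copy's tree state is unchanged.

```python
import copy

from lxml import etree as et

transform = et.XSLT(et.fromstring('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <foo><xsl:value-of select="a/b/text()" /></foo>
  </xsl:template>
</xsl:stylesheet>'''))

xml = et.fromstring('''\
<a>
  <b>huhu</b>
</a>
''')

before = et.tostring(xml, encoding='unicode')

# The transform only ever sees the copy, so any in-place whitespace
# stripping happens there and not in the caller's tree.
result = transform(copy.deepcopy(xml))

after = et.tostring(xml, encoding='unicode')

print(result)
print("input unchanged:", before == after)
```

The cost is one extra copy of the input document per transform call.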
