DTD resolver resolves from parent directory
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
Python : sys.version_
lxml.etree : (4, 5, 0, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
Hi there,
I have a very weird case where the DTD is not looked up in the right place:
when I use a parameter entity reference inside the declaration subset,
the DTD itself is searched for in the parent(!) directory.
It works if there is no declaration subset.
It also works if the external entity references are declared
directly inside the declaration subset.
--------------- TEST PROGRAM
#!/usr/bin/env python3
import sys
from lxml import etree
doc = open(sys.
parser = etree.XMLParser
tree = etree.fromstring( doc, parser )
res = etree.tostring(
print( res )
--------------- DOCUMENT (rama.xml)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE buch PUBLIC "-//Testing//DTD Buch//DE" "buch.dtd" [
<!ENTITY % parts SYSTEM "parts.ent" >
%parts;
]>
<buch>
<titel>Rendezvous mit Rama</titel>
&kap1;
<kapitel nr="review">
<absatz>
</kapitel>
</buch>
--------------- buch.dtd
<!ELEMENT buch (titel?,(kapitel)*) >
<!ELEMENT kapitel (absatz)* >
<!ATTLIST kapitel nr CDATA #IMPLIED >
<!ENTITY % plaintext "(#PCDATA)*" >
<!ELEMENT titel %plaintext; >
<!ELEMENT absatz %plaintext; >
<!ENTITY auml "ä">
<!ENTITY ouml "ö">
<!ENTITY uuml "ü">
-------------- parts.ent
<!ENTITY kap1 SYSTEM "kapitel1.xml">
-------------- kapitel1.xml
<kapitel nr="1">
<absatz>
</kapitel>
--------------- working without parametric entref (rama2.xml)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE buch PUBLIC "-//Testing//DTD Buch//DE" "buch.dtd" [
<!ENTITY kap1 SYSTEM "kapitel1.xml">
]>
<buch>
<titel>Rendezvous mit Rama</titel>
&kap1;
<kapitel nr="review">
<absatz>
</kapitel>
</buch>
-------------- working with different directory (rama3.xml)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE buch PUBLIC "-//Testing//DTD Buch//DE" "lxmlbug/buch.dtd" [
<!ENTITY % parts SYSTEM "parts.ent" >
%parts;
]>
<buch>
<titel>Rendezvous mit Rama</titel>
&kap1;
<kapitel nr="review">
<absatz>
</kapitel>
</buch>
After some more digging I found out that the DTD entity resolution
mechanism prefixes the system ID with the path of the parent directory,
whereas parameter or general entities do not get that treatment.
from lxml import etree

class DTDResolver(etree.Resolver):
    def resolve(self, system_id, public_id, context):
        # log every resolution request, then fall back to the default behaviour
        print(f"*** SYSTEM {system_id} PUBLIC {public_id}")
        return super().resolve(system_id, public_id, context)

doc = open("rama.xml", "rb").read()
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
parser.resolvers.add(DTDResolver())
tree = etree.fromstring(doc, parser)
/home/em/Workbench/beautifulsoup> ./dtdbug.py
*** SYSTEM parts.ent PUBLIC None
*** SYSTEM /data/home/em/Workbench/buch.dtd PUBLIC -//Testing//DTD Buch//DE
*** SYSTEM kapitel1.xml PUBLIC None
Traceback (most recent call last):
  File "./dtdbug.py", line 16, in <module>
    tree = etree.fromstring( doc, parser )
  File "src/lxml/etree.pyx", line 3235, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1764, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 5
lxml.etree.XMLSyntaxError: failed to load external entity "/data/home/em/Workbench/buch.dtd", line 5, column 3
/home/em/Workbench/beautifulsoup>
However, at least I can work around that using an explicit catalog.xml:
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.0//EN"
  "file:///usr/share/xml/schema/xml-core/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//Testing//DTD Buch//DE" uri="buch.dtd"/>
  <system systemId="parts.ent" uri="parts.ent"/>
  <system systemId="kapitel1.xml" uri="kapitel1.xml"/>
  <system systemId="kapitel2.xml" uri="kapitel2.xml"/>
</catalog>
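A catalog like the one above could also be generated rather than written by hand. The following is only a sketch using Python's standard xml.etree.ElementTree module. The function name build_catalog and the two mapping arguments are assumptions made for illustration, and the DOCTYPE line is left out.

```python
import xml.etree.ElementTree as ET

# Namespace of OASIS XML Catalogs, as used in the catalog.xml above.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(public_ids, system_ids):
    """Hypothetical helper: return catalog XML as a string.

    public_ids maps public ID -> uri, system_ids maps system ID -> uri.
    """
    # Serialize the catalog namespace as the default namespace (no prefix).
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element(f"{{{CATALOG_NS}}}catalog")
    for public_id, uri in public_ids.items():
        ET.SubElement(root, f"{{{CATALOG_NS}}}public",
                      {"publicId": public_id, "uri": uri})
    for system_id, uri in system_ids.items():
        ET.SubElement(root, f"{{{CATALOG_NS}}}system",
                      {"systemId": system_id, "uri": uri})
    return ET.tostring(root, encoding="unicode")


catalog_xml = build_catalog(
    {"-//Testing//DTD Buch//DE": "buch.dtd"},
    {"parts.ent": "parts.ent", "kapitel1.xml": "kapitel1.xml"},
)
```

Writing catalog_xml to catalog.xml should give a file that can be passed through XML_CATALOG_FILES in the same way as the hand-written one.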
> XML_CATALOG_FILES=catalog.xml ./dtdbug.py
*** SYSTEM parts.ent PUBLIC None
*** SYSTEM /data/home/em/Workbench/buch.dtd PUBLIC -//Testing//DTD Buch//DE
*** SYSTEM kapitel1.xml PUBLIC None
>
It still gets the wrong system ID, but no longer throws exceptions.
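Since the public ID reaches the resolver intact even when the system ID is wrong, another possible workaround might be a resolver that keys on the public ID and ignores the system ID altogether. This is a minimal sketch and has not been verified against the lxml 4.5.0 setup from this report. The class name PublicIdResolver, the helper parse_with_local_dtd and the mapping argument are hypothetical; only etree.Resolver, resolve_filename(), XMLParser(load_dtd=True) and parser.resolvers.add() are lxml's own API.

```python
import os

from lxml import etree


class PublicIdResolver(etree.Resolver):
    """Hypothetical helper: serve known public IDs from a fixed directory."""

    def __init__(self, base_dir, public_map):
        self.base_dir = base_dir
        self.public_map = public_map  # public ID -> file name inside base_dir

    def resolve(self, system_url, public_id, context):
        filename = self.public_map.get(public_id)
        if filename is None:
            # Not one of ours: returning None leaves the request
            # to the remaining resolvers and the default lookup.
            return None
        # Ignore system_url entirely and load the file from base_dir.
        return self.resolve_filename(
            os.path.join(self.base_dir, filename), context)


def parse_with_local_dtd(data, base_dir, public_map):
    """Parse bytes with DTD loading, using the resolver above (sketch)."""
    parser = etree.XMLParser(load_dtd=True)
    parser.resolvers.add(PublicIdResolver(base_dir, public_map))
    return etree.fromstring(data, parser)
```

For the files in this report it might be called as parse_with_local_dtd(doc, os.getcwd(), {"-//Testing//DTD Buch//DE": "buch.dtd"}). Requests without a known public ID, such as parts.ent and kapitel1.xml, go through the normal lookup unchanged.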