preserve white space outside root element
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Triaged | Wishlist | Unassigned |
Bug Description
Overview:
lxml doesn't preserve white space outside the document element. This removes the trailing newline at the end of the file, interferes with diffing, and makes the output harder to read.
Steps to reproduce:
parser = etree.XMLParser()
tree = etree.parse(source, parser=parser)
o = html5lib.
Using lxml.etree.
Actual results:
White space outside the document element is stripped.
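The behaviour described here can be reproduced with a few lines. This is only a minimal sketch: the sample document and the variable names are made up for illustration, and the exact serialized bytes may differ between lxml versions, so the checks only look at whether the outer white space survived.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample: blank lines between the comment and the root
# element, and two newlines after the root element.
source = b"<!-- header -->\n\n\n<root><child/></root>\n\n"

parser = etree.XMLParser()
tree = etree.parse(BytesIO(source), parser=parser)

# Serialize the whole tree, including the top-level comment.
output = etree.tostring(tree)

print(repr(source))
print(repr(output))

# The round trip is not byte-identical: the blank lines before the root
# element and the trailing newlines after it are not kept in the tree.
print(output == source)          # False
print(output.endswith(b"\n"))    # False
```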
Expected results:
White space is preserved in the tree so that it can be serialized back out in its original state, including the white space between doctype declarations, PIs, comments, etc.
Other information:
I don't actually know what version of lxml this is, or how to get that information. :(
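For anyone else unsure which version is installed, lxml exposes version tuples on the `etree` module. A short sketch, assuming lxml can be imported in the environment where the problem was seen:

```python
from lxml import etree

# Version of lxml itself, as a tuple of integers.
print("lxml:", etree.LXML_VERSION)

# Version of the libxml2 library in use at run time, and the one
# lxml was compiled against.
print("libxml2 (runtime):", etree.LIBXML_VERSION)
print("libxml2 (compiled):", etree.LIBXML_COMPILED_VERSION)
```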
Changed in lxml: | |
importance: | Undecided → Wishlist |
status: | New → Triaged |
I too would like to see this improvement. I often need to make widespread changes to XML files that are stored in a version control system (e.g., Visual Studio project files, sample data for unit tests, etc.). It is often helpful to write a script that processes the XML content in a structured way rather than doing a dumb search-and-replace. If every file touched by the XML parser is rewritten with its white space changed, it is difficult at best to use common diff tools to see which parts were "actually" changed. The problem gets worse as the number of modified files increases (e.g., hundreds or thousands of files are very difficult to analyse and compare).
I suspect the reason this library (and most, if not all, of the other XML processing libraries I've looked at) works as it does is that it parses the source file and stores its content in some internal structure for efficient XML processing operations. After the modifications are complete, it likely loses the context of how that information was originally formed in the source XML.
However, this doesn't negate or minimize the importance of the use case I have just described. I suspect it is just one of many where people need to preserve the format and style of the original content. If this or any other library were able to satisfy this requirement, I suspect it would open a whole new market for that tool.
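Until something like this exists in the library, a script can work around part of the problem at the text level. The sketch below is not an lxml feature: the helper names are hypothetical, and it only restores the white space before the first and after the last non-blank character of the file (which covers the trailing newline). It does not restore white space between the doctype, PIs, comments and the root element.

```python
from typing import Tuple

# The four characters XML treats as white space.
XML_WS = " \t\r\n"


def split_outer_whitespace(text: str) -> Tuple[str, str]:
    """Return the leading and trailing XML white space of a document."""
    left_stripped = text.lstrip(XML_WS)
    leading = text[: len(text) - len(left_stripped)]
    core = left_stripped.rstrip(XML_WS)
    trailing = left_stripped[len(core):]
    return leading, trailing


def reattach_outer_whitespace(original: str, serialized: str) -> str:
    """Give `serialized` the same outer white space as `original`."""
    leading, trailing = split_outer_whitespace(original)
    return leading + serialized.strip(XML_WS) + trailing


# Hypothetical example: the original file ended with a newline, and the
# serializer returned the modified document without one.
original = "<root><a/></root>\n"
serialized = "<root><a/><b/></root>"
print(repr(reattach_outer_whitespace(original, serialized)))
# prints '<root><a/><b/></root>\n'
```

In a bulk-edit script, `original` would be the text read from the file and `serialized` the decoded output of the XML library after the changes, so that the diff no longer shows a removed final newline on every touched file.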