@John
> > FTP servers are usually encoding agnostic. They treat file names as
> > built of 8-bit characters without any specific encoding. They don't
> > support any encoding explicitly,
>
> If the actual server is agnostic, then it means that Bazaar can chose
> what encoding it wants to use.
Not exactly. Suppose I have two files in my local file system: one whose name contains non-ASCII characters, and another that contains a reference to the first one. After uploading both to an FTP server, I expect that reference to still be valid.
If the FTP client reads the file name, interprets it as iso8859-2, and sends it to the FTP server encoded as UTF-8, but the server does not support UTF-8 communication, treats incoming data as 8-bit characters without a specific encoding, and writes them to disk as received, then the reference between the two files is lost.
In such a case, sending file names with special characters in the local system encoding is better, because it preserves the exact byte representation that the file names have on this system.
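A minimal Python sketch of that byte mismatch. The file name and the choice of iso8859-2 as the local encoding are assumptions picked for illustration, and the server is assumed to store names byte for byte:

```python
# Hypothetical file name (Polish letters z-dot, o-acute, l-stroke + "w.txt")
# as it would be stored on disk in iso8859-2.
name_on_disk = "\u017c\u00f3\u0142w.txt".encode("iso8859-2")

# What a client sends if it decodes the name and re-encodes it as UTF-8.
name_sent_as_utf8 = name_on_disk.decode("iso8859-2").encode("utf-8")

# The second file refers to the first one by its original bytes.
reference_in_other_file = name_on_disk

# A byte-transparent server stores exactly what it receives, so only the
# untouched bytes still match the reference.
reference_survives_utf8 = (name_sent_as_utf8 == reference_in_other_file)
reference_survives_raw = (name_on_disk == reference_in_other_file)
```

With these assumptions the UTF-8 name is a different, longer byte sequence, so only the raw upload keeps the reference intact.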
os.path calls do not return unicode; they return byte strings. This is also the case in other languages with varying Unicode support. System calls return file names as 8-bit strings in whatever encoding they were actually stored in.
> I *believe* that fancy_rename is being called on URL fragments, which
> should *not* be Unicode strings. (In bzr, paths are Unicode, urls are
> url-escaped-utf8-encoded 7-bit ascii strings.)
So 8-bit characters in file names should be URL-escaped at some point (after reading them from disk?) before they are passed to functions such as fancy_rename. This should also fix the problem in a clean way.
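A rough sketch of what such escaping could look like. It uses Python 3's urllib.parse, not the bzr code itself, and it escapes the raw bytes as read from disk rather than their UTF-8 form, which is an assumption on my side; the file name is hypothetical:

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical iso8859-2 bytes as read from disk.
raw_name = b"\xbf\xf3\xb3w.txt"

# quote() accepts bytes and percent-escapes every byte outside its safe set,
# so the result is a pure 7-bit ASCII string.
escaped = quote(raw_name)

# Unescaping gives back the original bytes, so nothing is lost on the way.
restored = unquote_to_bytes(escaped)
```

The escaped form can be joined with other ASCII URL fragments without any implicit encoding or decoding step.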
During my investigation I encountered two errors: one in _remote_path(), the other in the pathjoin() call inside fancy_rename().
@Vincent
> I realized the approach is different, that's why I asked you to test it :)
I'm gonna definitely do that next week.
> Kamil> FTP servers are usually encoding agnostic.
> Not really, they have to obey whatever file system is used underneath.
The FTP servers and clients I have encountered did not seem to care about the difference between characters and bytes. They never converted anything between encodings. Clients read byte streams from system calls and passed them to servers, and servers wrote what they received by passing it directly to system calls. They did not seem to use Unicode at any point or to be aware of the system locale.
Of course, my impression might be wrong. I might also have encountered FTP software that did not follow the RFC as it should, or was just too lenient in what it accepted.
> Many file systems will refuse to create files with arbitrary 8-bits characters.
I imagine this might be the case when the file system uses a multi-byte encoding such as UTF-8 for file names.
For one, I can verify that ext4 is not such a file system. Even though my system locale is UTF-8, I can create a file with invalid UTF-8 sequences in its name by passing an arbitrary string of 8-bit characters to the touch() function in PHP.
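The same check can be sketched in Python instead of PHP. The helper name can_create_raw_name and the test file name are made up for this example, and the result depends on the operating system and file system where it runs:

```python
import os
import tempfile


def can_create_raw_name(directory, raw_name):
    """Try to create a file whose name is the given bytes; report success."""
    path = os.path.join(os.fsencode(directory), raw_name)
    try:
        with open(path, "wb"):
            pass
    except (OSError, UnicodeError):
        # The platform or file system rejected the name.
        return False
    os.remove(path)
    return True


with tempfile.TemporaryDirectory() as tmp:
    # The bytes 0xFF and 0xFE never occur in valid UTF-8.
    accepted = can_create_raw_name(tmp, b"bad-\xff\xfe.txt")
```

A file system that stores names as opaque bytes should give True here, while one that enforces an encoding should give False.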
> Your changes only impact the client side, the problem is on the server side.
Perhaps, but it was the client that threw exceptions at me, and it was the client that I had under my control.
> If the clients uses iso-8859-1, mac-roman or utf8 and the server uses iso-8859-2, then using your changes can break but using utf8 should work.
> Now, it depends on whether the server is handling utf8.
If you could apply my ideas only to servers that report that they do not support UTF-8, it would be great. I don't want to break anything. I just want it to work in my setup, which most likely does not support RFC 2640.
> Now, it depends on whether the server is handling utf8.
> If it doesn't, then if (but only if) the server and all the clients use the same fs encoding, your changes will work.
As I mentioned, the file systems I encounter also tend to be encoding agnostic. The only kind of file system I can imagine actively rejecting arbitrary 8-bit strings as file names (apart from a few bytes that are explicitly forbidden in file names) would have to use 7-bit characters for file names, or a multi-byte encoding with strict checking.
Can you give an example of such a file system that is widely used?
> The ftp transport receives paths that have been processed by higher layers, it should not worry about what the file system encoding is, the higher layers did.
This statement actually supports my claim that my fixes are applied in the right places, because the exceptions I encountered were caused by mixing byte strings and unicode in these functions without regard for the proper encoding.
fancy_rename() gets a byte string from os.path.dirname() and mixes it with unicode without first converting it with the right encoding, and _remote_path() crams unicode into a 7-bit string without regard for the fact that the unicode string might contain national characters.
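Both failures can be imitated in a few lines. This is only an illustration in Python 3 syntax, not the actual bzr code: the paths and the iso8859-2 encoding are assumed, and the explicit ASCII decode stands in for the implicit conversion Python 2 performs when bytes and unicode are mixed:

```python
local_dir = b"/srv/\xbf\xf3\xb3w"        # hypothetical byte string, iso8859-2
new_name = "\u017c\u00f3\u0142w.tmp"      # unicode name from the higher layers

# Python 2 joins bytes with unicode by decoding the bytes as ASCII first.
# Doing that step explicitly fails on any 8-bit character.
try:
    local_dir.decode("ascii")
    implicit_join_ok = True
except UnicodeDecodeError:
    implicit_join_ok = False

# Squeezing unicode into a 7-bit string fails in the opposite direction.
try:
    new_name.encode("ascii")
    ascii_encode_ok = True
except UnicodeEncodeError:
    ascii_encode_ok = False

# Decoding with the right encoding first makes the join well defined.
joined = local_dir.decode("iso8859-2") + "/" + new_name
```

Which encoding counts as the right one is exactly the open question; the sketch only shows that the conversion has to be explicit.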
Maybe the proper way to detect the encoding in fancy_rename should be different (the get_terminal_encoding and get_user_encoding that were mentioned?), and maybe _remote_path should be rephrased somehow or removed altogether (as you did, I'm not sure how :-)).
> Kamil> I wonder how the core of bzr deals with this.
> By using unicode internally and decoding the paths respecting the file system encoding when needed (outside the scope or both your changes and mine).
So, after reading them, the paths in fancy_rename() should be passed through the same functions that the rest of bzr uses for decoding paths with respect to the file system encoding.
> But merge proposals are easier to deal with even at that point as they present a diff of your changes without the need to drill down into each revision or grab a local copy of your branch.
> They also allows us to discuss there instead of here :)
OK. So I'm gonna make a merge proposal. I was under the wrong impression that this is something reserved for mature code.
Thank you for the detailed responses.