From: David Goodger <goodger <at> python.org>
Subject: Re: Re: found a UnicodeDecodeError in a very simple rst file
Newsgroups: gmane.text.docutils.user
Date: Tuesday 9th May 2006 14:21:23 UTC
[David Goodger]
>> I came to the same conclusion, and have prepared a patch to add a
>> --command-line-encoding option/setting to allow for non-ASCII file
>> names.

[Felix Wiemann]
> Well honestly, that'd be pretty ugly, and I hope and think we don't
> need it.

Sorry, I misspoke a bit.  I added a setting, but it doesn't really
make sense as a command line option; it's too late at that point.  So
it would be a set-once-and-forget setting.  Seeing that there doesn't
seem to be a reliable way to know (again, *not* guess, but *know*) the
command-line encoding, a mechanism to specify it explicitly is
necessary and appropriate.

The patch I mentioned on Friday is attached.  I don't think it's the
right way to go now though; more below.

> When I type "rst2html.py heizölrückstoß.txt", I just want it to
> work, because what I type is characters, not bytes!

Yes, I agree.

>> What I'd like to know is, is there a reliable way to *know* (not
>> guess) the encoding of sys.argv?  IOW, a way to know the shell's
>> encoding.
> I'd go for the locale encoding.

As François pointed out, that's not good enough.

[Michael Zheng]
>> >>> import sys
>> >>> sys.stdin.encoding
>> 'cp936'

[Felix Wiemann]
> I didn't know sys.stdin has an encoding attribute -- from my quick
> testing, it seems to coincide with the locale encoding, but I'm not
> sure if that's always the case.

I tested on my system, where the shell is set to use UTF-8, but Python
launched from that shell reports sys.stdin.encoding as 'US-ASCII'.
It's obvious we can't depend on sys.stdin.encoding, but perhaps we can
use it as a clue.
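
To compare the available clues on a given system, something like the
following could be used.  It is only a diagnostic sketch: the function
name encoding_clues is made up for illustration, and it is written for
a current Python, where sys.argv is already decoded, so it shows what
the interpreter reports rather than what the shell actually sent.

```python
import locale
import sys


def encoding_clues():
    """Collect the encodings Python reports; none of them is authoritative."""
    return {
        # May be None when stdin is redirected or replaced.
        "stdin": getattr(sys.stdin, "encoding", None),
        # The locale's preferred encoding, without changing the locale.
        "locale": locale.getpreferredencoding(False),
        # Encoding used to convert Unicode file names to system file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for source, encoding in sorted(encoding_clues().items()):
        print("%-10s %s" % (source, encoding))
```

If the three values disagree, that is a hint that none of them can be
trusted as the command-line encoding.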

[Felix Wiemann]
> IIRC we always treat file names as byte sequences internally.

Yes, and that should always work for opening files.  There is
sys.getfilesystemencoding for converting Unicode file names to system
file names, but it's only in Python 2.3+.  In any case, I'd rather we
didn't convert to Unicode and back again for this purpose -- too much
chance of mis-conversion and data corruption.

But we must convert file names to Unicode to report errors & document
the source.

> So when we convert them to unicode, the best we can do is probably
> to use a reasonable encoding (e.g. the locale -- ASCII is *not*
> reasonable) and use the 'replace' error handler.

I would try UTF-8/strict first, then try the locale's encoding with
strict error handling, then ASCII/replace.  UTF-8 is almost 100% safe
to try first; false positives are rare, and the world is moving toward
UTF-8.  The locale's encoding with strict error handling may be a
reasonable compromise, but any error would indicate that the locale's
encoding is inappropriate.  The only safe fallback is ASCII/replace.
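
The order described above can be sketched as follows.  This is an
illustration, not the attached patch: the function name
decode_command_line_text and its locale_encoding parameter are
assumptions, and the input is taken to be the raw bytes of one
command-line argument.

```python
import locale


def decode_command_line_text(raw, locale_encoding=None):
    """Decode bytes: UTF-8/strict, then locale/strict, then ASCII/replace."""
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
    for encoding in ("utf-8", locale_encoding):
        try:
            return raw.decode(encoding, "strict")
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding; try the next candidate.
            continue
    # Last resort: never fails, non-ASCII bytes become U+FFFD.
    return raw.decode("ascii", "replace")
```

The result is meant for error messages and for documenting the source
only; the fallback is lossy, so it must not be re-encoded to open the
file.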

> It might be a good idea to create a "file name" type (derived from
> str or unicode or so), in order to make sure we *never* implicitly
> convert file names from str to unicode (implicit conversions easily
> slip through unnoticed), and in order to centralize that
> file-name-to-unicode logic.

You may be on to something.  Rather than "file name" though, a class
for "command-line text" would be appropriate, since we accept more
than just file names on the command line (for example, --title).

The attached patch converts command-line text to Unicode early, but I
don't think this is the right approach.  I think the unconverted
values should be used for opening the files themselves rather than
re-encoding the decoded values (see sys.getfilesystemencoding).
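
A rough sketch of such a "command-line text" class, keeping the
unconverted bytes next to a decoded form for display.  The class name
CommandLineText and its attributes are hypothetical, the decoding is
simplified to a caller-supplied list of encodings followed by
ASCII/replace, and it assumes a Python where bytes and text are
distinct types.

```python
class CommandLineText:
    """Hypothetical wrapper: raw bytes for the OS, decoded text for display."""

    def __init__(self, raw, encodings=("utf-8",)):
        # Untouched bytes; use these for opening files, never self.text.
        self.raw = raw
        # Decoded once, centrally; use this for messages and the doctree.
        self.text = self._decode(raw, encodings)

    @staticmethod
    def _decode(raw, encodings):
        for encoding in encodings:
            try:
                return raw.decode(encoding, "strict")
            except (UnicodeDecodeError, LookupError):
                continue
        # Lossy but safe fallback.
        return raw.decode("ascii", "replace")

    def __str__(self):
        return self.text
```

With this split, a file would be opened via the raw attribute while
error reports use the text attribute, so a mis-decoded name can never
cause the wrong file to be opened.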

[François Pinard]
> Just fantasizing here  :-) .
> If docutils was fully Unicodized inside (I mean, working almost
> exclusively 16-bit inside, using ``unicode`` type), then the seldom
> ``str`` type strings could be used to represent file names and in
> fact, any other kind of object for which Unicode conversion should
> be forbidden.  I vaguely remember having read that some trick --
> with ``defaultencoding``? -- guarantees to trap any automatic
> ``str`` to ``unicode`` implicit promotion.  This could be a
> debugging device.

That would require changing the default encoding for the entire Python
session.  But Docutils is a library, not a self-contained application,
and this would surely break applications using Docutils.  And Docutils
uses modules from the standard library, which may also break (for
example, exception messages are typically 8-bit strings containing
US-ASCII; converting them to Unicode would fail).

David Goodger <http://python.net/~goodger>