From: Tom Christiansen <tchrist <at> perl.com>
Subject: Either clear the Unicode air--or make a release-blocker? (was: Unicode cheatsheet for Perl)
Newsgroups: gmane.comp.lang.perl.perl5.porters
Date: Saturday 25th February 2012 22:28:42 UTC
Maybe that's too strong a subject line, but I *really* want to know
whether we can't tell people they can use Perl with Unicode in anything but
the most excruciating of all possible ways -- if this is indeed true.

Leon Timmermans wrote on Tue, 21 Feb 2012 09:44:15 +0100:

> I'm not entirely sure what gets fixed by that and what doesn't, it isn't
> documented at all. Looking at the source makes me feel it's a hack IMO,
> I strongly suspect it is not quite a complete fix: there are just too many
> places that would need to get fixed. I believe the utf8 layer would be the
> right place to do it because that's the only place that almost all Input
> passes through. Fixing it there fixes it almost everywhere (except sysread I
> suppose, but that can be fixed too).

Some folks claim the only "safe" way to use Unicode in Perl is to always
make explicit calls to encode/decode with a bonus FB_CROAK argument.  They
claim that all nine of these perfectly reasonable approaches...

    #1.   $ perl -C...
    #2.   $ export PERL_UNICODE=...

    #3.     use utf8;

    #4.     use open qw[ :std :utf8            ];
    #5.     use open qw[ :std :encoding(UTF-8) ];

    #6.     binmode(FH, ":utf8");
    #7.     binmode(FH, ":encoding(UTF-8)");

    #8.     open(FH,  "< :utf8",            $path);
    #9.     open(FH,  "< :encoding(UTF-8)", $path);

...are all of them flawed in their not raising exceptions on UTF-8
encoding errors of one sort or another, and that somehow not even...

    #0.     use warnings qw(FATAL utf8);

...is good enough to fix it.  
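For readers who have not seen the explicit style those folks advocate, here
is a minimal sketch of it.  The helper name strict_decode_utf8 is made up for
illustration; decode, FB_CROAK and LEAVE_SRC are the real Encode exports.
LEAVE_SRC is added because a non-zero CHECK may otherwise modify the input
buffer in place.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strictly decode UTF-8 octets.
# Returns (characters, undef) on success or (undef, error) on failure.
sub strict_decode_utf8 {
    my ($octets) = @_;
    my $chars = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return (undef, $@) unless defined $chars;
    return ($chars, undef);
}

my ($ok,  $err_ok)  = strict_decode_utf8("caf\xC3\xA9");  # well-formed UTF-8
my ($bad, $err_bad) = strict_decode_utf8("caf\xE9");      # Latin-1 byte, malformed

print defined $ok  ? "decoded, length " . length($ok) . "\n" : "unexpected failure\n";
print defined $bad ? "unexpected success\n" : "rejected malformed input\n";
```

Every read and every write needs a call like this under the explicit-only
regime, which is the repetition being objected to.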

I do not know whether these claims are true.  My own tests suggest this may
not be the whole story, because this behaves as I think it should:

  darwin$ perl -C0 -E 'say for "caf\xE9", "stuff"' | 
	  perl -CS -Mwarnings=FATAL,utf8 -pe 'print "$. "'
  utf8 "\xE9" does not map to Unicode, <> line 1.
  Exit 255


Which seems to say that #0 makes at least #1 safe.  Again, I'm fuzzy on what
the perceived problem actually is.  Maybe they're using autodie or [...],
which is known broken. I'm trying to get more info.
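The same experiment can be sketched in-process, combining #0 with the
:utf8 layer of #8.  This is only an approximation of the pipeline above: it
assumes an in-memory filehandle behaves like STDIN under -CS for this
purpose, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $octets = "caf\xE9\nstuff\n";   # Latin-1 bytes, not valid UTF-8

# Read through the lax :utf8 layer from an in-memory handle.
open(my $fh, '<:utf8', \$octets) or die "open failed: $!";

my $outcome = eval {
    use warnings FATAL => 'utf8';  # the #0 fatalization, lexically scoped
    my $n = 0;
    while (defined(my $line = <$fh>)) {
        $n++;
    }
    "read $n lines without complaint\n";
} // "died: $@";

print $outcome;
```

If the fatalized warning fires on the first readline, as it did in the shell
test, $outcome begins with "died:".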

I also do not know the precise details of these so-called "security" bugs
that Christian references.  I've no reason to disbelieve Christian; I just don't
know the details myself, nor am I asking that they be splatted all over.

What I do know is that telling people that the only "right" or "safe" or
"acceptable" way to use Unicode in Perl is via myriad explicit calls to
&Encode::{utf8_,}{en,de}code(..., FB_CROAK) just doesn't cut it -- full stop.

If there's something so important that it must be done every time to ensure
correct behavior, then that is too important to be left up to the programmer
to forget to do.  It needs to be done for him.

We should not have to endure five more tedious years of people getting
tonguelashed and flamenagged into writing horribly complicated code
all because something deep down in Perl's dwimmer is flawed.

I say five years, not one year, because of how long it takes to get vendors
to get themselves updated.  If this is a legit issue, and we push it off till
2013's v5.18, then it will be a further 2–4 years past that until people
have reasonable Unicode processing in their vendor Perl.  That puts us into
the 2015–2017 (!) time frame, and... that's just not acceptable, eh?

By that time, one or both of two things will have happened.  Either untold
zillions of lines of code will have been written that either conform to
this ridiculous amount of monkeywork, entrenching a bad pattern forever, or
else untold zillions of lines of code will have been written that ignore it
and are themselves open to the kind of spooky catastrophic failure that
people allude to, thereby rendering all Perl a security hole of Chicken
Little proportions.

Since we won't let either of those happen, I figure that either 

 —— all #1 .. #9 of my numbered points above are completely safe and
    proper, preferably always but minimally along with the #0 fatalization
 —— or else they need to be made so before we dare release v5.16.

Why?  Simple: remove those 9 simple ways to approach implicit Unicode
processing in Perl, and you so gut Perl's Unicode dwimmer that nobody
save the very most diligent and élite [sic] of Perl gurus will ever dare
use Perl with Unicode.  That would be tragic, maybe disastrous even.

Possibly this is all well known, and I just haven't been listening.  Maybe
it's even been fixed, assuming it was ever broken in the first place, which
I'm highly fuzzy on and can neither prove nor disprove with my own meagre
poking at the problem.  If so, I apologize for making much ado about nothing.

I do have notions about what should be happening with encoding layers,
including backwards compatibility concerns versus security concerns;
something along the lines that we're under no obligation to leave our
backdoors standing open for eternity.  Also, I strongly feel that all
encoding errors should croak by default.  I hate garbage in files as
things silently fill them with manglings.  Those should croak unless you
explicitly asked to get garbage out.  That shouldn't be a default.
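To make the "silent manglings" concrete, here is a small sketch using Encode
directly rather than an I/O layer (an assumption made to keep the example
self-contained).  With the default CHECK, an unmappable character is quietly
replaced by a substitution character; with FB_CROAK the same call dies.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

my $text = "smile \x{263A}";   # U+263A has no ISO-8859-1 equivalent

# Default CHECK: the unmappable character is silently substituted.
my $lossy = encode('ISO-8859-1', $text);

# FB_CROAK: the same call dies instead; LEAVE_SRC keeps $text intact.
my $strict = eval { encode('ISO-8859-1', $text, FB_CROAK | LEAVE_SRC) };
my $strict_error = $@;

print "lossy result: $lossy\n";
print defined $strict ? "strict encode succeeded\n" : "strict encode croaked\n";
```

The first call is the garbage-by-default behavior; the second is what
croaking by default would look like.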

I'm still troubled that Encode is *not* one of "our" modules, yet a whole lot
of what we do seems dependent on it.  We can't usefully create new warnings
and classes, errors, encoding names, and exceptions of encodings used
internally if they don't sync up with Encode.  But we have and they don't:
already the utf8 warnings subclasses are broken with Encode.  That makes it
even more important that we get the internal stuff "right". There's a bunch
more where that came from, but I'll save the rest till someone tells me where
we actually stand.

Karl, Leon, and Christian, thank you for your time and insights, both past
and future.  Nothing would please me more than to learn that I've had my cage
rattled for nothing, and that these are all non-issues.


PS: Christian Hansen, if you say your name fast enough, 
                      it rather sounds like my surname. :)