From: Thorsten Glaser <tg-vpiyNrvJqjezQB+pC5nmwQ <at> public.gmane.org>
Subject: Unicode PUA mapping F000..F7FF and F81F..F89F
Newsgroups: gmane.os.miros.general
Date: Saturday 15th March 2008 20:27:20 UTC
Hello!

This is a request for both LANANA and the ConScript Unicode Registry to
assign the two ranges to a purpose outlined below.

I am the head developer of the MirOS project, working on a BSD operating
system.


First, regarding the range F000..F7FF, which is already assigned by the
Linux operating system kernel, according to the list at:
	http://www.lanana.org/docs/unicode/unicode.txt

Linux provides a straight-to-font mapping for these. I would very much
like MirOS BSD to use this mapping as well, and hereby ask the ConScript
Unicode Registry to add this to their list.


Second, the range F820..F89F (the F81F mapping comes last in my request).
I need a range of 128 codepoints for the application defined below, and
this range does not seem to be in use yet. I could have started at F800,
but the above LANANA document shows historical use of F800..F804 and
non-recommended use of F810..F813, so I decided to move the request to
directly below the Aiha (Kesh) mapping.

Finally, the codepoint F81F – I would have used the character past the
128-codepoint range, but this is already used, so I decided to just go
back one codepoint for this request.


Proposal for the range U+F81F..F89F:

A new Unicode-compatible character set shall be defined, in which the
PUA code points F820..F89F have an assigned meaning, and in which the
PUA code point F81F is reserved.

This character set shall be encoded in two different ways. One of these
is called -16 (where  is the name of the encoding group, which
we have not yet decided upon, but we will inform you once this is done),
a 16-bit encoding similar to UTF-16 (host, big or little endian). This
will be the internal representation of MirOS’ wchar_t type. The -16
character set shall have a range of 0x0000 to 0xFFFD, representing the
Unicode codepoints U+0000 to U+FFFD and, via UTF-16’s use of surrogates,
the non-BMP codepoints; the values 0xFFFE and 0xFFFF are invalid -16
and can be used by the operating system, e.g. in MirOS, 0xFFFF is WEOF.

The second encoding is called -8, and shall be to -16 what
CESU-8 is to UTF-16.

The difference between  and UTF is as follows:

When converting from -16 to -8, the following wide characters
are mapped DIFFERENTLY from the CESU-8 multibyte character mapping:
• 0xF81F maps to 0xEF 0xA0 0x9F, as expected
• 0xF820 maps to 0x80
• 0xF821 maps to 0x81
• 0xF822 maps to 0x82
• …
• 0xF89E maps to 0xFE
• 0xF89F maps to 0xFF
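
A minimal sketch of this direction for a single 16-bit unit, written in
Python rather than as the eventual libc code. The names encode_unit and
RAW_BASE are made up for illustration. The values 0xFFFE and 0xFFFF are
rejected here because their handling is a separate question.

```python
RAW_BASE = 0xF820  # first unit of the raw-octet range F820..F89F


def encode_unit(wc: int) -> bytes:
    """Encode one 16-bit unit (0x0000..0xFFFD) into its 8-bit form."""
    if not 0 <= wc <= 0xFFFD:
        raise ValueError("unit outside 0x0000..0xFFFD")
    if RAW_BASE <= wc <= 0xF89F:
        # The 128 PUA units stand for the raw octets 0x80..0xFF.
        return bytes([wc - RAW_BASE + 0x80])
    if wc < 0x80:
        return bytes([wc])
    if wc < 0x800:
        return bytes([0xC0 | (wc >> 6), 0x80 | (wc & 0x3F)])
    # Everything else, surrogates included, takes the three-octet form,
    # which is what CESU-8 does for 16-bit units.
    return bytes([0xE0 | (wc >> 12),
                  0x80 | ((wc >> 6) & 0x3F),
                  0x80 | (wc & 0x3F)])
```

Under these assumptions, encode_unit(0xF81F) gives the ordinary
three-octet form, while encode_unit(0xF820) gives the single octet 0x80.
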
To keep wcrtomb(3) from failing with EILSEQ, I would like to find a way to
encode 0xFFFE and 0xFFFF as U+FFFE and U+FFFF, but since these cannot fit
into surrogates, this is going to be difficult. We would probably convert
U+FFFE/U+FFFF to “UTF-16 surrogates” according to the formula “subtract
0x10000 to give a 20-bit value, then pack”, so that they wrap to
0xFFFFE/0xFFFFF and end up as surrogate pair 0xD8FF plus 0xDFFE/0xDFFF,
and then encode these like we would do with CESU-8:
• 0xFFFE maps to 0xED 0xA3 0xBF 0xED 0xBF 0xBE
• 0xFFFF maps to 0xED 0xA3 0xBF 0xED 0xBF 0xBF
This would probably need a PUA assignment in plane 15. If you think that
this mapping is a good idea / needed, please consider this a request to
reserve it.
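
The last step, encoding a surrogate pair the way CESU-8 does, can be
sketched as follows. This takes the pair 0xD8FF plus 0xDFFE/0xDFFF as
quoted above and does not re-derive it from the subtraction formula; the
function names are hypothetical.

```python
def three_octets(unit: int) -> bytes:
    """Three-octet form of one 16-bit unit in 0x0800..0xFFFF."""
    if not 0x0800 <= unit <= 0xFFFF:
        raise ValueError("unit does not take the three-octet form")
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def cesu8_pair(high: int, low: int) -> bytes:
    """Encode a surrogate pair as CESU-8 does: each half on its own."""
    return three_octets(high) + three_octets(low)


# The pair values are the ones given in the text, not computed here.
for low in (0xDFFE, 0xDFFF):
    print(cesu8_pair(0xD8FF, low).hex(" ").upper())
```
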

When converting from -8 to -16, the above scheme is reversed, but
in such a way that the shortest -16 sequence is always created. This
means that, on a round trip from -16 via -8 back to -16, the
initial wchar_t array { 0xF862, 0xF840, 0x0000 } ends up as
{ 0x00A0, 0x0000 }, because of the intermediate mapping
{ 0xC2, 0xA0, 0x00 }.

• 0x00..0x7F map to 0x0000..0x007F
• 0xC0 maps to 0xF860
• 0xC1 maps to 0xF861
• { 0xC2..0xDF, 0x00..0x7F } map to { 0xF862..0xF87F, 0x0000..0x007F }
• { 0xC2..0xDF, 0x80..0xBF } map to 0x0080..0x07FF
• { 0xC2..0xDF, 0xC0..0xFF } map to { 0xF862..0xF87F, 0xF860..0xF89F }
• { 0xE0,       0x00..0x7F } map to { 0xF880, 0x0000..0x007F }
• { 0xE0,       0x80..0x9F } map to { 0xF880, 0xF820..0xF83F }
• { 0xE0,       0xA0..0xBF, 0x00..0x7F } map to { 0xF880, 0xF840..0xF85F, 0x0000..0x007F }
• { 0xE0,       0xA0..0xBF, 0x80..0xBF } map to 0x0800..0x0FFF
• { 0xE0,       0xA0..0xBF, 0xC0..0xFF } map to { 0xF880, 0xF840..0xF85F, 0xF860..0xF89F }
• { 0xE0,       0xC0..0xFF } map to { 0xF880, 0xF860..0xF89F }
• { 0xE1..0xEE, 0x00..0x7F } map to { 0xF881..0xF88E, 0x0000..0x007F }
• { 0xE1..0xEE, 0x80..0xBF, 0x00..0x7F } map to { 0xF881..0xF88E, 0xF820..0xF85F, 0x0000..0x007F }
• { 0xE1..0xEE, 0x80..0xBF, 0x80..0xBF } map to 0x1000..0xEFFF
• { 0xE1..0xEE, 0x80..0xBF, 0xC0..0xFF } map to { 0xF881..0xF88E, 0xF820..0xF85F, 0xF860..0xF89F }
• { 0xE1..0xEE, 0xC0..0xFF } map to { 0xF881..0xF88E, 0xF860..0xF89F }
• { 0xEF,       0x00..0x7F } map to { 0xF88F, 0x0000..0x007F }
• { 0xEF,       0x80..0xBF, 0x00..0x7F } map to { 0xF88F, 0xF820..0xF85F, 0x0000..0x007F }
• { 0xEF,       0x80..0xBE, 0x80..0xBF } map to 0xF000..0xFFBF
• { 0xEF,       0xBF,       0x80..0xBD } map to 0xFFC0..0xFFFD
• { 0xEF,       0xBF,       0xBE       } maps to { 0xF88F, 0xF85F, 0xF85E }
• { 0xEF,       0xBF,       0xBF       } maps to { 0xF88F, 0xF85F, 0xF85F }
• { 0xEF,       0x80..0xBF, 0xC0..0xFF } map to { 0xF88F, 0xF820..0xF85F, 0xF860..0xF89F }
• { 0xEF,       0xC0..0xFF } map to { 0xF88F, 0xF860..0xF89F }
• 0xF0..0xFF map to 0xF890..0xF89F
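
A rough Python sketch of this back-mapping, to make the table easier to
check by hand. It is not the planned mbrtowc(3) code. The helper names
are made up, and one point is an assumption on my side: after an octet
has been mapped into the raw range, decoding simply resumes at the next
octet.

```python
def raw(octet: int) -> int:
    """Map an undecodable octet 0x80..0xFF into F820..F89F."""
    return 0xF820 + (octet - 0x80)


def is_cont(data: bytes, i: int) -> bool:
    """True if data[i] exists and is a continuation octet 0x80..0xBF."""
    return i < len(data) and 0x80 <= data[i] <= 0xBF


def decode(data: bytes) -> list:
    """Decode the 8-bit form into a list of 16-bit units; never fails."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif 0xC2 <= b <= 0xDF and is_cont(data, i + 1):
            out.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif (0xE0 <= b <= 0xEF and is_cont(data, i + 1)
              and is_cont(data, i + 2)):
            wc = (((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6)
                  | (data[i + 2] & 0x3F))
            if wc < 0x0800 or wc > 0xFFFD:
                # Overlong form, or 0xFFFE/0xFFFF: keep the lead octet
                # raw; the following octets are then looked at afresh.
                out.append(raw(b))
                i += 1
            else:
                out.append(wc)
                i += 3
        else:
            # 0x80..0xC1, 0xF0..0xFF, or a lead octet cut short.
            out.append(raw(b))
            i += 1
    return out


# The round trip from the text: { 0xC2, 0xA0, 0x00 } comes back as
# { 0x00A0, 0x0000 }, not as the original { 0xF862, 0xF840, 0x0000 }.
print([hex(u) for u in decode(bytes([0xC2, 0xA0, 0x00]))])
```
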

Regarding the { 0xEF, 0xBF, 0xBE..0xBF } mapping to { 0xF88F, 0xF85F,
0xF85E..0xF85F }: these are only valid if the above 0xFFFE..0xFFFF mapping
to { 0xED, 0xA3, 0xBF, 0xED, 0xBF, 0xBE..0xBF } is valid, *and* if we can
codify this behaviour in wcrtomb(3) and wcsrtombs(3) successfully.
Otherwise, we will probably map 0xFFFE and 0xFFFF like CESU-8 does.
Feedback is welcome, from the MirOS discussion mailing list as well as
from the LANANA or the ConScript Unicode Registry.

Illegal UTF-8/CESU-8, for example { 0xC0, 0x80 }, is mapped back into the
raw-octet range of 0xF820..0xF89F, thus to { 0xF860, 0xF820 }. This means
that mbrtowc(3) can never fail with EILSEQ either, and the security
implications of UTF-8/CESU-8 do not apply to -8 at all.
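
As a small illustration of the { 0xC0, 0x80 } case, here is a Python
sketch that contrasts a strict UTF-8 decoder (Python's built-in one,
used only for comparison) with the raw-octet mapping. The helper names
are hypothetical.

```python
def strict_utf8_fails(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder rejects the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def raw_units(data: bytes) -> list:
    """Map octets 0x80..0xFF one by one into the range F820..F89F."""
    if any(b < 0x80 for b in data):
        raise ValueError("only octets 0x80..0xFF belong to the raw range")
    return [0xF820 + (b - 0x80) for b in data]


overlong_nul = bytes([0xC0, 0x80])
print(strict_utf8_fails(overlong_nul))            # strict decoder rejects it
print([hex(u) for u in raw_units(overlong_nul)])  # raw mapping keeps it
```
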

While we expect -16 and -8 to be used exclusively inside MirOS
BSD, and do not recommend them for data interchange (just as with CESU-8),
we need to assign names to them to enable iconv_open(3) to handle them. On
the other hand, the design goals for -{16,8} make them attractive for
other OSes with a 16-bit wchar_t type.

Design goals: MirOS does not have full locale/I18N support but is fully
UTF-8ised, so applications like 'tr [a-z] [A-Z] <file' would error out
or change unrelated parts of the file if it contained binary data.
Non-ASCII octets would either be removed, autoconverted from latin1 to
UTF-8, or mapped to U+FFFD, since there would be no way to tell tr(1)
to handle its input as an octet stream instead of a multibyte stream.
On locale-enabled GNU/Linux, one could run
'env LC_ALL=C tr [a-z] [A-Z] <file', but this would not be possible in
MirOS BSD. (setlocale(3) always returns "en_US.UTF-8" for LC_CTYPE,
"C" for everything else.)
The -8 encoding would replace the use of UTF-8/CESU-8 (we chose
UTF-8 because our internal encoding as of now is UCS-2, not UTF-16),
allowing raw octets, no matter which, to be preserved in a round trip
from -8 via -16 back to -8. (As MirOS is a Unix, we do
not use -16 except internally as wchar_t, unlike e.g. Microsoft
NT, whose applications often have wchar_t arrays inside them.)

Finally, a third encoding, -0, would be defined. While we do not
have a real use for it, I decided to include its definition here (and
to add its code to our libiconv) since this seems to be the natural place
for it. -0 relates to -8 like Java™’s “Modified UTF-8” relates
to CESU-8. The definition is as follows:

• 0x0000 maps to 0xC0 0xA0
• 0xF81F maps to 0x00
• 0xF820 maps to 0x80
• 0xF821 maps to 0x81
• 0xF822 maps to 0x82
• …
• 0xF89E maps to 0xFE
• 0xF89F maps to 0xFF
• 0xFFFE maps to 0xED 0xA3 0xBF 0xED 0xBF 0xBE
• 0xFFFF maps to 0xED 0xA3 0xBF 0xED 0xBF 0xBF
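
A sketch of this third variant in Python, again only to illustrate the
list above. The names encode0_unit, encode0 and SPECIAL are made up, and
the octets for 0xFFFE/0xFFFF are copied from the list rather than
computed.

```python
RAW_BASE = 0xF820  # first unit of the raw-octet range F820..F89F

# Special cases, quoted from the list above.
SPECIAL = {
    0x0000: bytes([0xC0, 0xA0]),
    0xF81F: bytes([0x00]),
    0xFFFE: bytes([0xED, 0xA3, 0xBF, 0xED, 0xBF, 0xBE]),
    0xFFFF: bytes([0xED, 0xA3, 0xBF, 0xED, 0xBF, 0xBF]),
}


def encode0_unit(wc: int) -> bytes:
    """Encode one 16-bit unit into the NUL-free 8-bit variant."""
    if not 0 <= wc <= 0xFFFF:
        raise ValueError("not a 16-bit unit")
    if wc in SPECIAL:
        return SPECIAL[wc]
    if RAW_BASE <= wc <= 0xF89F:
        return bytes([wc - RAW_BASE + 0x80])
    if wc < 0x80:
        return bytes([wc])
    if wc < 0x800:
        return bytes([0xC0 | (wc >> 6), 0x80 | (wc & 0x3F)])
    return bytes([0xE0 | (wc >> 12),
                  0x80 | ((wc >> 6) & 0x3F),
                  0x80 | (wc & 0x3F)])


def encode0(units) -> bytes:
    """Encode a sequence of 16-bit units."""
    return b"".join(encode0_unit(wc) for wc in units)


# A wide NUL in the middle of the data no longer produces an octet 0x00.
print(encode0([0x41, 0x0000, 0x42]).hex(" ").upper())
```
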

The back-mapping table is modified like this:

• 0x00 maps to 0xF81F
• 0x01..0x7F map to 0x0000..0x007F
• { 0xC0, 0x00..0x7F } map to { 0xF860, 0x0000..0x007F }
• { 0xC0, 0x80..0x9F } map to { 0xF860, 0xF820..0xF83F }
• { 0xC0, 0xA0       } maps to 0x0000
• { 0xC0, 0xA1..0xFF } map to { 0xF860, 0xF841..0xF89F }
• 0xC0 maps to 0xF860
• 0xC1 maps to 0xF861
• { 0xC2..0xDF, 0x00..0x7F } map to { 0xF862..0xF87F, 0x0000..0x007F }
• { 0xC2..0xDF, 0x80..0xBF } map to 0x0080..0x07FF
• { 0xC2..0xDF, 0xC0..0xFF } map to { 0xF862..0xF87F, 0xF860..0xF89F }
• { 0xE0,       0x00..0x7F } map to { 0xF880, 0x0000..0x007F }
• { 0xE0,       0x80..0x9F } map to { 0xF880, 0xF820..0xF83F }
• { 0xE0,       0xA0..0xBF, 0x00..0x7F } map to { 0xF880, 0xF840..0xF85F, 0x0000..0x007F }
• { 0xE0,       0xA0..0xBF, 0x80..0xBF } map to 0x0800..0x0FFF
• { 0xE0,       0xA0..0xBF, 0xC0..0xFF } map to { 0xF880, 0xF840..0xF85F, 0xF860..0xF89F }
• { 0xE0,       0xC0..0xFF } map to { 0xF880, 0xF860..0xF89F }
• { 0xE1..0xEE, 0x00..0x7F } map to { 0xF881..0xF88E, 0x0000..0x007F }
• { 0xE1..0xEE, 0x80..0xBF, 0x00..0x7F } map to { 0xF881..0xF88E, 0xF820..0xF85F, 0x0000..0x007F }
• { 0xE1..0xEE, 0x80..0xBF, 0x80..0xBF } map to 0x1000..0xEFFF
• { 0xE1..0xEE, 0x80..0xBF, 0xC0..0xFF } map to { 0xF881..0xF88E, 0xF820..0xF85F, 0xF860..0xF89F }
• { 0xE1..0xEE, 0xC0..0xFF } map to { 0xF881..0xF88E, 0xF860..0xF89F }
• { 0xEF,       0x00..0x7F } map to { 0xF88F, 0x0000..0x007F }
• { 0xEF,       0x80..0xBF, 0x00..0x7F } map to { 0xF88F, 0xF820..0xF85F, 0x0000..0x007F }
• { 0xEF,       0x80..0xBE, 0x80..0xBF } map to 0xF000..0xFFBF
• { 0xEF,       0xBF,       0x80..0xBD } map to 0xFFC0..0xFFFD
• { 0xEF,       0xBF,       0xBE       } maps to { 0xF88F, 0xF85F, 0xF85E }
• { 0xEF,       0xBF,       0xBF       } maps to { 0xF88F, 0xF85F, 0xF85F }
• { 0xEF,       0x80..0xBF, 0xC0..0xFF } map to { 0xF88F, 0xF820..0xF85F, 0xF860..0xF89F }
• { 0xEF,       0xC0..0xFF } map to { 0xF88F, 0xF860..0xF89F }
• 0xF0..0xFF map to 0xF890..0xF89F

While -8 has normal NUL handling (0x00↔0x0000 or conversion stop in
strings), -0 can only be handled by functions with length arguments,
like iconv(3), and has security implications.

No matter where we decide to map 0xFFFE and 0xFFFF, conversions within
the  encoding suite cannot fail (but conversion between  and
Unicode probably can).


Conclusion:

As  gives meaning to the Unicode PUA, I probably can’t apply for an
IANA registration, but I would if I could. However, I would like both the
ConScript Unicode Registry and the LANANA to recognise the allocation of
the U+F81F..F89F range, and ConScript to recognise LANANA’s U+F000..F7FF
allocation. I welcome feedback on the definition of -{16,8,0} and
the handling of 0xFFFE and 0xFFFF (which are, to my knowledge, not valid
Unicode codepoints anyway). If someone were to help me with IANA, that’d
be welcome anyway.

[email protected] and I will decide on a name to replace that  placeholder
soon. Code for libc and libiconv, a manual page, and possibly an RFC for
IANA will be written once enough feedback has been received and the new
conversion routines have been tested (think of 0xFFFE and 0xFFFF).

Users of MirOS will be able to handle binary data (BLOBs, octet streams)
with the Unicode-enabled tools without destroying them. In MirOS BSD #10,
the newly released version, tr(1) and col(1) would auto-propagate data,
for instance "printf 'Stra\337e\n'", which hexdumps to 53 74 72 61 DF 65
0A, piped through "tr x u", would yield 53 74 72 61 00 65 0A, and piped
through col, it would yield 53 74 72 61 EF BF BD 65 0A. (tr would strip
the eszett-in-latin1 octet, col would convert it to U+FFFD.)
Once the  character set is adopted in MirOS #11, the result of both
operations will still be 53 74 72 61 DF 65 0A.
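
To illustrate the intended behaviour, here is a self-contained Python
sketch of the whole round trip for the printf example. It is a toy model,
not the MirOS tools: encode, decode and tr_units are hypothetical names,
0xFFFE/0xFFFF are left out, and the decoder assumes that decoding resumes
at the next octet after a raw one.

```python
RAW_BASE = 0xF820  # first unit of the raw-octet range F820..F89F


def encode(units) -> bytes:
    """16-bit units (0x0000..0xFFFD) to the 8-bit form."""
    out = bytearray()
    for wc in units:
        if not 0 <= wc <= 0xFFFD:
            raise ValueError("unit outside 0x0000..0xFFFD")
        if RAW_BASE <= wc <= 0xF89F:
            out.append(wc - RAW_BASE + 0x80)
        elif wc < 0x80:
            out.append(wc)
        elif wc < 0x800:
            out += bytes([0xC0 | (wc >> 6), 0x80 | (wc & 0x3F)])
        else:
            out += bytes([0xE0 | (wc >> 12),
                          0x80 | ((wc >> 6) & 0x3F),
                          0x80 | (wc & 0x3F)])
    return bytes(out)


def decode(data: bytes) -> list:
    """8-bit form to 16-bit units; undecodable octets become raw units."""
    def cont(i):
        return i < len(data) and 0x80 <= data[i] <= 0xBF

    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
            continue
        if 0xC2 <= b <= 0xDF and cont(i + 1):
            out.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
            continue
        if 0xE0 <= b <= 0xEF and cont(i + 1) and cont(i + 2):
            wc = (((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6)
                  | (data[i + 2] & 0x3F))
            if 0x0800 <= wc <= 0xFFFD:
                out.append(wc)
                i += 3
                continue
        out.append(RAW_BASE + (b - 0x80))  # raw octet
        i += 1
    return out


def tr_units(units, src: str, dst: str) -> list:
    """Toy stand-in for 'tr src dst' working on wide characters."""
    return [ord(dst) if wc == ord(src) else wc for wc in units]


data = b"Stra\337e\n"  # same octets as printf 'Stra\337e\n'
wide = decode(data)
result = encode(tr_units(wide, "x", "u"))
print(result == data)
```

In this model the latin1 octet travels through the wide-character stage
as a single unit from the raw range and comes out unchanged.
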


Thanks in advance!

bye,
//mirabilos
-- 
I believe no one can invent an algorithm. One just happens to hit upon it
when God enlightens him. Or only God invents algorithms, we merely copy
them.
If you don't believe in God, just consider God as Nature if you won't deny
existence.		-- Coywolf Qi Hunt
 