From: Thorsten Glaser <tg <at> mirbsd.de>
Subject: ruling on mksh locale tracking
Newsgroups: gmane.os.miros.mksh
Date: Saturday 11th April 2015 20:55:01 UTC
Dixi quod…

>Note that XBD 8.2 only specifies the "POSIX" and "C" locales,
>and that support for GNU-style language[_territory][.codeset]
>locales is XSI-optional. For the purpose of POSIX, mksh is an
>operating environment with 32-bit integers and support for only
>the C (= POSIX) locale.
>|   If the locale value is not recognized by the implementation, the
>|   behavior is unspecified.

>accordingly. (Admittedly, this means we should probably track the
>locale-related variables and imply set +U when LC_ALL=C is set…
>I’ll put that on my TODO, though low-priority.)

I have now thought about, and decided on, implementing locale
tracking in the mksh codebase.

Current behaviour is:
– busybox-style builtin calls get set ±U dependent
  on the LC_ALL, LC_CTYPE, LANG environment variables
– script and interactive shells get set -U or set +U
  if one of them is set on the command line, otherwise:
– scripts always get set +U ("C" locale)
– if the shell was compiled with -DMKSH_ASSUME_UTF8=0,
  interactive sessions get set +U ("C" locale)
– if the shell was compiled with -DMKSH_ASSUME_UTF8,
  interactive sessions get set -U ("C.UTF-8" locale)
– if 「setlocale(LC_CTYPE, "")」 is supported and
  returns something that matches /utf-?8/i, or if it is
  supported but does not, yet a subsequent call of
  「nl_langinfo(CODESET)」 is supported and returns something
  matching /utf-?8/i, an interactive session gets set -U
– if ${LC_ALL:-${LC_CTYPE:-$LANG}} matches /utf-?8/i, same
– otherwise, the interactive session gets set +U
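
The startup rules above can be condensed into a small decision
function. This is only a sketch in portable sh, not mksh's actual
C code: the name startup_utf8_mode and its three arguments are
hypothetical, the busybox-style builtin case is not covered, and
the setlocale()/nl_langinfo() probing is left out, so only the
environment-variable fallback is modelled.

```sh
# Hypothetical helper, not part of mksh: prints "-U" or "+U".
# $1: command-line flag ("-U", "+U", or "" if none was given)
# $2: "script" or "interactive"
# $3: compile-time MKSH_ASSUME_UTF8 value ("0", "1", or "" if undefined)
startup_utf8_mode() {
	flag=$1 kind=$2 assume=$3
	# an explicit command-line option always wins
	case $flag in
	-U|+U) printf '%s\n' "$flag"; return ;;
	esac
	# scripts default to the "C" locale
	if [ "$kind" = script ]; then
		printf '%s\n' +U
		return
	fi
	# compile-time assumption for interactive sessions
	case $assume in
	0) printf '%s\n' +U; return ;;
	?*) printf '%s\n' -U; return ;;
	esac
	# fallback: look at the usual environment variables
	case ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} in
	*[Uu][Tt][Ff]8*|*[Uu][Tt][Ff]-8*) printf '%s\n' -U ;;
	*) printf '%s\n' +U ;;
	esac
}
```

For example, 「startup_utf8_mode '' script 1」 prints +U, because
the script rule is checked before the compile-time assumption.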

Recapitulating some constraints and users’ wishes:

– the locale is usually set dependent on the same environment
  variables mksh uses; UTF-8 locales usually have UTF-8 or
  utf8 in their name; legacy encoding locales usually don’t
– POSIX requires locale tracking (so e.g. “LC_ALL=C” inside
  a running shell session must turn off “set -U”, i.e. imply
  “set +U”)
– POSIX only requires support for the "C" locale, though
– mksh only supports UTF-8 and 8-bit modes, not full locales
– users on some systems, e.g. glibc-using GNU/Linux, wish
  for the shell to operate according to system locales for
  mksh; some may wish it for sh (but see below for lksh);
  most may not wish it for mksh-static (MKSH_SMALL) but
  some may; some users may not wish it (e.g. libc5/Linux)
– users on some systems (those without any UTF-8 locale,
  or other traditional ones, specifically MirBSD), as well
  as legacy scripts to be run on other systems (making this
  relevant for lksh) specifically require set +U to be the
  default for scripts, because those scripts typically do
  not, in contrast to many (but by far not all; you see them
  fail often enough in contemporary Debian still) scripts on
  modern GNU systems, begin with 'export LC_ALL=C'
– the user can always run “mksh -Uc 'string'” to force UTF-8 mode
– the builtin rules allow locales to be recognised at start:
  $ ln -s $(whence -p mksh) eval
  $ LC_ALL=C ./eval 'a=$(LC_ALL=C.UTF-8 /usr/bin/printf \\u00e9); echo $a ${#a}'
  é 2
  $ LC_ALL=C.UTF-8 ./eval 'a=$(LC_ALL=C.UTF-8 /usr/bin/printf \\u00e9); echo $a ${#a}'
  é 1
– it is easy to set the shell into the mode of the currently
  active locale using its own rules (set -u safe):
  set -U; [[ ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} = *[Uu][Tt][Ff]?(-)8* ]] || set +U
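
The nested expansion used in that one-liner picks the first of
LC_ALL, LC_CTYPE and LANG that is set and non-empty. A minimal
portable sketch to show the precedence; the function name
effective_ctype is made up for illustration and is not an mksh
builtin:

```sh
# Hypothetical helper: prints the locale name the nested expansion
# selects. An empty LC_ALL or LC_CTYPE is skipped just like an unset
# one, and the trailing ${LANG:-} keeps this safe under "set -u".
effective_ctype() {
	printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}

# Run in subshells so the caller's environment is left untouched.
(unset LC_ALL; LC_CTYPE=de_DE.UTF-8; LANG=C; effective_ctype)	# de_DE.UTF-8
(LC_ALL=C; LC_CTYPE=de_DE.UTF-8; effective_ctype)		# C
```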

My ruling on the issue is therefore:

① POSIX locale tracking requires switching to set +U mode
  if e.g. “LC_ALL=C” is set, but “set -U” can only ever be
  enabled in a script if the (nōn-POSIX) command “set -U”
  is run, making operation unspecified. Therefore, there’s
  no concern wrt. POSIX.

② I have weighed the various user requirements and requests
  and come to the conclusion that implementing locale tracking
  in a manner that fits all users is impossible, and thus would
  require adding compile-time or run-time options, leading to
  the same mess (people having to select that option somehow);
  I have shown a one-liner to make set ±U match the locale
  from the environment, which can already now be used for this
  very purpose.

③ Implementing locale tracking when “set -o posix” is active is declined,
  because POSIX only requires support for the "C" locale, and
  mksh does not implement anything other than UTF-8 anyway,
  which would wake false hopes (“why does LANG=de_DE.UTF-8
  work but [email protected] doesn’t?”).

④ Operations like splitting a string along multibyte character
  boundaries have no place in /bin/sh scripts, due to portability
  concerns, so those scripts need to be tailored for the various
  existing shells anyway; the one-liner from above can easily be
  added to the “if mksh (or lksh)” case, POSIXly:

  case ${KSH_VERSION:-} in
  *"MIRBSD KSH"*|*"LEGACY KSH"*)
	case ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} in
	*[Uu][Tt][Ff]8*|*[Uu][Tt][Ff]-8*)
		set -U ;;
	*)
		set +U ;;
	esac
	# the next line implies: set +o braceexpand
	set -o posix
	;;
  esac

  This code parses fine (mksh part ignored) with Heirloom sh.

⑤ The user is recommended to not rely on the environment locale
  settings for shell behaviour anyway, as that can change, e.g.
  when using ssh without a shell (batch mode), from cronjobs, etc.
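
A minimal sketch of what that means in practice, assuming a script
that wants byte-wise "C" behaviour no matter how it is started
(login shell, cronjob, ssh batch mode): it pins the locale itself
at the top. Scripts already start in set +U mode under mksh, so
this mainly fixes the locale for the utilities the script calls.

```sh
# Pin the locale explicitly rather than inheriting whatever the
# caller (cron, ssh, a terminal) happened to provide.
LC_ALL=C
export LC_ALL
# From here on, LANG and LC_CTYPE from the environment no longer
# matter for this script or for the child processes it starts.
```

A script that needs UTF-8 mode instead would say so explicitly
with the (mksh-specific) “set -U” in its mksh branch.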

That concludes this issue.

Vincent: on a less “ruling” note, sorry for the bunch of recent
disagreements over shell behaviour we had. There are reasons,
some good, some not so good, some legacy, and sometimes it’s just
because nobody had yet put any thought into it, for details on
mksh’s behaviour. I have thought over this issue for a while and
in deep detail – I hope you can see that from this eMail – and
wish to, never mind the outcome, thank you for re-raising this
issue (locales stuff occasionally pops up).

Feel free to ask upstream (IRC, mailing list or eMail) next time.
I’m sorry for the initial harsh “closed” response, but debbugs
is not a discussion forum. I promise to try and listen to your
issues when you bring them up e.g. via IRC to me.

Yay for having to rewrite other people's Bash scripts because bash
suddenly stopped supporting the bash extensions they make use of
	-- Tonnerre Lombard in #nosec