From: Katsumi Yamaoka <yamaoka <at> jpl.org>
Subject: Re: [BUG]What does this mean:"Mention that multibyte characters
Newsgroups: gmane.emacs.gnus.general
Date: 2004-10-20 05:36:48 GMT
>>>>> In <b9yacuji7t6.fsf <at> jpl.org> Katsumi Yamaoka wrote:

> I'll try translation of Kenichi Handa's advice next time.

This is practice in English composition for me.  Whatever mistakes
there may be, the responsibility is mine.

>>>>> Katsumi Yamaoka wrote:

> With the following form, Emacs 21.3.50 returns non-nil, and 22.0.0
> returns nil.  Could you let me know for reference what occurs there?

> (with-temp-buffer
>   (set-buffer-multibyte t)
>   (insert (string-as-unibyte "\200"))
>   (goto-char (point-min))
>   (search-forward (string-as-multibyte "\200") nil t))

;; Annotation by K.Y.:
;; At that time, I didn't yet know that `(insert ?\200)' and
;; `(insert (format "%c" ?\200))' are also possible insertion forms.

>>>>> Kenichi Handa wrote:

Even with Emacs 21.3.50, the above form returns nil in certain
language environments (e.g., Vietnamese).

First, you have to understand that when a unibyte string is inserted
into a multibyte buffer, it is converted into a multibyte string by
`string-make-multibyte'.  Therefore, the form
        ^^^^

(insert (string-as-unibyte "\200"))

is identical to the form

(insert (string-make-multibyte (string-as-unibyte "\200")))

How `string-make-multibyte' converts depends on the language
environment.  In Emacs 21 with the Latin-1 language environment, for
example, the string "\200" is converted into the character that
corresponds to \200 in the eight-bit-control charset, since the
primary charset latin-iso8859-1 doesn't contain \200.

Second, `string-as-multibyte' converts a string into a multibyte
string, keeping its byte sequence as much as possible.  ``As much as
possible'' means that the bytes sometimes change.  For example, the
string "\200" is converted into the byte sequence "\236\240", which
represents a character in the eight-bit-control charset.  This is the
same character that the above program inserted into the buffer.

Consequently, in the Latin-1 language environment, for example, the
above program returns non-nil in Emacs 21.
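
As a minimal sketch, the two conversions described above can be
inspected side by side.  The helper name `my-compare-200-conversions'
is hypothetical and is not part of Emacs or Gnus.  What it returns
depends on the Emacs version and the language environment, so no
particular result is claimed here.

```elisp
;; Hypothetical helper, for illustration only.
;; Returns (CHAR-IN-BUFFER ITS-CHARSET CHAR-FROM-STRING ITS-CHARSET).
(defun my-compare-200-conversions ()
  (with-temp-buffer
    (set-buffer-multibyte t)
    ;; The unibyte string is converted when it is inserted.
    (insert (string-as-unibyte "\200"))
    (let ((in-buffer (char-after (point-min)))
          ;; The first character that `string-as-multibyte' produces.
          (from-string (aref (string-as-multibyte "\200") 0)))
      (list in-buffer (char-charset in-buffer)
            from-string (char-charset from-string)))))
```

If the first and third elements are the same character, the search in
the original form can succeed; if they differ, it cannot.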

In Emacs 22, on the other hand, iso-8859-1, which is the primary
charset for Latin-1, contains \200, so the form

(insert (string-as-unibyte "\200"))

inserts the character U+0080 rather than the character that belongs
to eight-bit-control.  However, `string-as-multibyte' always converts
\200 into the eight-bit-control character.  This is why the program
returns nil.

If you need to search for \200 after inserting it into a buffer, the
following, for example, works in both Emacs 21 and 22:

(with-temp-buffer
  (set-buffer-multibyte t)
  (insert (string-to-multibyte "\200"))
  (goto-char (point-min))
  (search-forward (string-to-multibyte "\200") nil t))

;; Annotation by K.Y.:
;; I didn't use this approach in the
;; `gnus-update-summary-mark-positions' function (which see).

`string-to-multibyte' always converts a string into characters that
belong to eight-bit-control or eight-bit-graphic, so the string it
makes will never match an ordinary string.
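
As a small sketch of that last point, the following searches for a
`string-to-multibyte' result in a buffer that holds an ordinary
Latin-1 character decoded from the same byte.  The helper name
`my-raw-byte-search-test' is hypothetical, and the choice of byte
\351 (e with acute accent in iso-8859-1) is only an example.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-raw-byte-search-test ()
  (with-temp-buffer
    (set-buffer-multibyte t)
    ;; Insert an ordinary character decoded from the byte \351.
    (insert (decode-coding-string "\351" 'iso-8859-1))
    (goto-char (point-min))
    ;; Search for the same byte as converted by `string-to-multibyte'.
    ;; Following the explanation above, this should not match.
    (search-forward (string-to-multibyte "\351") nil t)))
```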

P.S.
In Emacs 21, the form

`(insert (string-to-multibyte "\200"))'

does the same as the form

`(insert ?\200)'

does.  This is because no character corresponding to 128 exists in a
multibyte buffer, so it is treated as the raw byte that belongs to
eight-bit-control.

In Emacs 22, however, the two differ.  Since the character
corresponding to 128 exists as U+0080, `(insert ?\200)' inserts that
character.

;; Annotation by K.Y.:
;; I am deeply grateful to Kenichi Handa.  This was all the knowledge
;; I needed.