This chapter was written by Takumi Doi at CST on March 29, 1996. It is included here verbatim.
Among the significant improvements made to Guile in this release, this appendix describes the internationalization features that allow Guile to manipulate and render characters from various international character sets, including Latin, Japanese Kanji, Chinese Hanzi, Korean Hangul, and others.
To include the internationalization features in Guile, specify the
option --enable-i18n when you configure Guile, for example:
% ./configure --enable-i18n --prefix=`pwd`/=inst
In the current release, the only modules affected by this option are libguile, gls, and gtcltk. In addition, the Tcl/Tk that comes with this release incorporates only the Japanese EUC patches. This also means that the rest of the plugin libraries are yet to be internationalized.
In Guile, each character is identified by a unique internal character
code, a 24-bit integer. The function char->integer can be used
to retrieve the internal character code of a character.
Strings, on the other hand, are represented as sequences of 8-bit bytes, where characters with character codes beyond 255 are split into adjacent bytes to form a multibyte string.
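The relationship between a character code and its multibyte form can be sketched as follows. This is an illustration in Python, not Guile code: ord() stands in for char->integer, and the EUC-JP codec stands in for a multibyte representation. Guile's internal byte layout is not specified here and is assumed to differ. The helper name describe is hypothetical.

```python
def describe(ch):
    """Return (character code, multibyte form) for one character.

    Assumption: EUC-JP is used only as an example of a multibyte
    encoding; it is not Guile's internal encoding.
    """
    return ord(ch), ch.encode("euc_jp")


for ch in "A漢":
    code, raw = describe(ch)
    # Codes up to 255 can fit in one byte; larger codes need several bytes.
    print(f"{ch!r}: code {code:#x}, {len(raw)} byte(s)")
```

A character whose code fits in one byte stays a single byte, while a Kanji character occupies two adjacent bytes in this stand-in encoding.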
The decision to use multibyte strings is motivated by one of the major goals of Guile: to provide an easy-to-integrate underlying implementation for GNU Emacs. The authors also believe that this design allows a single string data type and a compact string representation that can easily support evolving international standards with larger character sets.
While Guile has its own internal character representation, the new internationalization features enable users to easily handle characters from a wide variety of existing coded character sets.
By external character encoding the author means any character encoding method other than the Guile Scheme internal encoding, whose coded characters have to be converted when reading from text streams and writing to external devices for saving or rendering.
In Guile Scheme, such conversion is controlled by optional arguments to several I/O procedures, and by a couple of global variables whose values supply the default conversion method.
The following self-evaluating symbols designate the character encodings available for file I/O as well as process I/O:
*sjis* -- Microsoft Kanji code, or Shift-JIS
*iso-2022-jp*, aka *junet* -- encoding used in Japan
to transfer emails and netnews
*iso-2022-int-1* -- "ISO-2022-INT-1" [So what's this?!]
*old-jis* -- Obsolete JIS encoding
*ctext*, aka *iso-8859-1* -- Compound Text encoding
*euc-japan* -- Japanese version of Extended Unix Code
*euc-korea*, aka *euc-kr* -- Korean version of Extended
Unix Code
*iso-2022-kr*, aka *korean-mail* -- encoding used in Korea
to transfer emails and netnews
*iso-2022-ss2-8* -- ISO-2022 encoding using SS2 for 96-charset
in 8-bit code
*iso-2022-ss2-7* -- ISO-2022 coding system using SS2 for
96-charset in 7-bit code
*iso-2022-lock* -- ISO-2022 coding system using Locking-Shift
for 96-charset
*big5*, aka *big5-eten* -- BIG5, a Chinese encoding.
*internal* -- Mule's representation in buffers.
*utf-8* -- ISO10646 UCS2 (known as Unicode) character set
represented in UTF-8 encoding scheme.
*noconv* -- for "NO CONVersion"
*autoconv* -- for "AUTOmatic CONVersion"
These symbols can be specified as the encoding parameter of the procedures described later in this document.
For the precise meaning of these values, see the online Info manuals included with Mule, the Multilingual Emacs. As of this writing, Guile supports the same external coded character sets as Mule version 2.3, in addition to Unicode.
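To make the idea of an external encoding concrete, the sketch below encodes one string in several encodings and converts it back. It is written in Python and uses Python's codec names (shift_jis, euc_jp, iso2022_jp, utf-8), which are assumed here to be rough counterparts of *sjis*, *euc-japan*, *iso-2022-jp* and *utf-8*. It does not use Guile's conversion routines, and the helper name external_forms is hypothetical.

```python
TEXT = "漢字"

# Python codec names, assumed to correspond roughly to the Guile symbols
# *sjis*, *euc-japan*, *iso-2022-jp* and *utf-8*.
CODECS = ["shift_jis", "euc_jp", "iso2022_jp", "utf-8"]


def external_forms(text, codecs=CODECS):
    """Map each codec name to the external byte form of text."""
    return {name: text.encode(name) for name in codecs}


forms = external_forms(TEXT)
for name, raw in forms.items():
    print(f"{name:12} {len(raw)} bytes  {raw!r}")
    # Converting back from the external form recovers the same characters.
    assert raw.decode(name) == TEXT
```

The same two characters produce different byte sequences, and different byte counts, in each external encoding. This is why a port needs to know which encoding it is reading or writing.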
These variables are used by open-file and related procedures
to determine the default external encoding to associate with newly
opened ports. The value #f, which is the default, means that no
conversion takes place on I/O.
Used by functions to determine the character encoding understood by the operating system environment. The value of this variable affects the behavior of each of the following functions:
open-file, open-input-file,
open-output-file, open-io-file, call-with-input-file,
call-with-output-file.
system.
getenv.
The initial value is #f, meaning that no conversion takes place.
Although the change is not immediately apparent, each string operation now treats a string as a sequence of characters, not as a chunk of bytes. That is, an index value is taken to be a character position rather than a byte position, the length of a string is the number of characters in the string rather than the number of bytes, and so on. This is also the case with the uniform vector operations on multibyte strings.
Users who need to operate on byte sequences are encouraged to use byte-vector extensions instead.
The following procedures have been extended as part of the internationalization features in Guile:
The mode argument to open-file specifies the direction(s)
in which I/O operations are allowed via the returned port. It can be
one of the following values:
OPEN_READ, for input
OPEN_WRITE, for output
OPEN_BOTH, for input and output
The optional argument encoding can be a symbol that names
an external character encoding.
If it is specified, further I/O operations via the opened port convert
the file contents between the specified character encoding and the
Guile Scheme internal character encoding.
If encoding is omitted, the encoding of the file is determined
by the current value of input-coding-system (for input) and
output-coding-system (for output).
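The fallback rule described above (an explicit encoding wins, otherwise the global default for the direction is used, and no default means no conversion) can be sketched as follows. This is Python, not Guile: open_port and the two module-level variables are hypothetical stand-ins for open-file, input-coding-system and output-coding-system, with None playing the role of #f.

```python
import os
import tempfile

# Hypothetical stand-ins for the global defaults; None plays the role of #f.
input_coding_system = None
output_coding_system = None


def open_port(path, mode, encoding=None):
    """Open path for reading ("r") or writing ("w").

    An explicit encoding takes priority; otherwise the default for the
    direction is used. With no encoding at all, bytes pass through
    unconverted.
    """
    if encoding is None:
        encoding = input_coding_system if mode == "r" else output_coding_system
    if encoding is None:
        return open(path, mode + "b")          # no conversion
    return open(path, mode, encoding=encoding, newline="")


path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open_port(path, "w", "euc_jp") as port:   # convert on output
    port.write("漢字\n")
with open_port(path, "r", "euc_jp") as port:   # convert on input
    decoded = port.read()
with open_port(path, "r") as port:             # defaults apply: no conversion
    raw = port.read()

print(repr(decoded), repr(raw))
```

Reading with the matching encoding returns the original characters, while reading with no encoding and no default returns the external bytes untouched.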
The functions open-input-file, open-output-file and
open-io-file are similar to open-file, except that they open
a file for read-only, write-only, and read-write access, respectively.
Both functions close the port after proc returns.
Both return the value of proc.
The optional argument encoding specifies the external character
encoding used in the file str. Default behavior is determined by
the current values of input-coding-system and
output-coding-system, respectively.
input-coding-system is used.
The function port-coding retrieves the character encoding used by
port.
port must be an open port object; otherwise an error is signaled.
It returns a symbol that names the character encoding currently used by
port. Refer to the previous sections for the external character encoding
symbols that are available in this release.
The function set-port-coding! sets the character encoding
attribute of port to encoding.
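Querying and changing the encoding attached to a port can be sketched with Python's io.TextIOWrapper, whose encoding attribute and reconfigure method are used here as loose analogues of port-coding and set-port-coding!. This is an assumption made for illustration. Guile's procedures are not shown, and Python only allows the change before any data has been read from the stream.

```python
import io

# External data that is really Shift-JIS, wrapped with the wrong encoding.
sjis_bytes = "漢字".encode("shift_jis")
port = io.TextIOWrapper(io.BytesIO(sjis_bytes), encoding="euc_jp")

before = port.encoding                   # analogue of (port-coding port)
port.reconfigure(encoding="shift_jis")   # analogue of (set-port-coding! port ...)
after = port.encoding

text = port.read()                       # decoded with the corrected encoding
print(before, after, text)
```

Correcting the encoding attribute before reading lets the same byte stream be decoded properly.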
The following procedures are not directly related to internationalization, but were added in the hope that they will complement string operations when handling raw byte data (such as binary image data and network packet data) that might otherwise have been implemented using strings:
#\nul.
If specified, code conversion between encoding and Guile internal encoding is performed. Otherwise, no conversion takes place.
For uniform-vector->string, programmers must make sure that each
vector element holds a value that is valid as a string element.
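Why the element values matter can be seen in the following Python sketch, where a list of byte values is converted to a string under a given encoding. The function vector_to_string is a hypothetical stand-in for uniform-vector->string, and EUC-JP is an assumed example encoding. Guile's behavior on invalid elements is not specified here.

```python
def vector_to_string(vector, encoding):
    """Convert a sequence of byte values to a string, converting from encoding."""
    return bytes(vector).decode(encoding)


good = [0xB4, 0xC1]   # a complete two-byte EUC-JP character
bad = [0xB4]          # the same character with its second byte missing

converted = vector_to_string(good, "euc_jp")

try:
    vector_to_string(bad, "euc_jp")
    bad_rejected = False
except UnicodeDecodeError:
    # Python rejects the truncated sequence; other systems may not check.
    bad_rejected = True

print(converted, bad_rejected)
```

A vector that ends in the middle of a multibyte character does not form a valid string, so the caller has to validate the elements before converting.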
string-append, but works on any uniform vectors.
Each of args must be a uniform vector with the same element type.
concatenate returns a newly created vector, whereas
concatenate! modifies the original vector.
Note that in this release, the returned vector is a shared vector of the original vector. This implementation is subject to change in future releases.
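The distinction between a concatenation that builds a new vector and one that modifies its first argument can be sketched with Python's array module, used here as a stand-in for uniform vectors. The + operator and the extend method are assumed analogues of concatenate and concatenate!. The shared-vector behavior noted above is specific to Guile and is not modeled.

```python
from array import array

a = array("B", [1, 2])       # unsigned bytes
b = array("B", [3, 4])

joined = a + b               # like concatenate: a new vector, a is unchanged
a_after_concat = a.tolist()

a.extend(b)                  # like concatenate!: a itself is modified
a_after_extend = a.tolist()

try:
    array("B", [1]) + array("d", [1.0])   # different element types
    mixed_types_allowed = True
except TypeError:
    mixed_types_allowed = False

print(joined.tolist(), a_after_concat, a_after_extend, mixed_types_allowed)
```

In this sketch, joining vectors with different element types is rejected. This mirrors the requirement that all arguments share one element type.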
Similar to gscm_str and gscm_str0 respectively, except
that they accept an encoding argument. encoding must be of type SCM and
a valid Scheme symbol representing a character encoding.
This means you may have to scm_intern the encoding name in your
code. This is subject to change.
Similar to gscm_2_string, but accepts an encoding
argument. encoding must be of type SCM and a valid Scheme symbol
representing a character encoding. This means you may have to
scm_intern the encoding name in your code (this is subject to
change).
str_out must be the address of unsigned char * storage,
which need not point to allocated memory.