
Internationalization Features in Guile

This chapter has been written by Takumi Doi at CST, March 29 1996. It is included here verbatim.

Among the significant improvements made to Guile in this release, this appendix describes the internationalization features that allow Guile to manipulate and render characters from various international character sets, including Latin, Japanese Kanji, Chinese Hanzi, and Korean Hangul.

Building Guile with Internationalization Features

To include the internationalization features in Guile, specify the option --enable-i18n when you configure Guile, for example:

% ./configure --enable-i18n --prefix=`pwd`/=inst

In the current release, the only modules affected by this option are libguile, gls, and gtcltk. In addition, the Tcl/Tk that comes with this release incorporates only the Japanese EUC patches. The remaining plugin libraries are yet to be internationalized.

Character and String representations

In Guile, each character is identified by a unique internal character code, a 24-bit integer. The function char->integer can be used to retrieve the internal character code of a character.

Strings, on the other hand, are represented as sequences of 8-bit byte elements, where characters with character codes beyond 255 are split into adjacent bytes to form a multibyte string.

The decision to use multibyte strings is motivated by one of the major goals of Guile: to provide an easy-to-integrate underlying implementation for GNU Emacs. The authors also believe that this design allows a single string data type and a compact string representation that can easily support evolving international standards with larger character sets.
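
The difference between a character code and its multibyte string representation can be illustrated outside Guile. The following Python sketch is an analogy only: Python's code points are Unicode rather than Guile's internal 24-bit codes, EUC-JP is assumed here merely as a familiar multibyte encoding, and the helper name describe_char is hypothetical.

```python
def describe_char(ch, encoding="euc_jp"):
    """Return (character code, list of byte values) for one character.

    Analogy only: ord() plays the role of char->integer, and the encoded
    bytes play the role of the 8-bit elements of a multibyte string.
    """
    code = ord(ch)                    # a single integer per character
    raw = list(ch.encode(encoding))   # one or more 8-bit elements
    return code, raw


ascii_code, ascii_bytes = describe_char("A")
kanji_code, kanji_bytes = describe_char("日")

print(ascii_code, ascii_bytes)   # code fits in a single byte
print(kanji_code, kanji_bytes)   # code above 255, stored as adjacent bytes
```

A character whose code fits in 8 bits occupies one element, while a character with a larger code occupies several adjacent elements of the same string.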

External Character Encodings

While Guile has its own internal character representation, the new internationalization features enable users to easily handle characters from a wide variety of existing coded character sets.

By external character encoding the author means any character encoding method other than the Guile Scheme internal encoding, whose coded characters have to be converted when reading from text streams and writing to external devices for saving or rendering.

In Guile Scheme, such conversion is controlled by optional arguments to several I/O procedures, together with global variables whose values supply the default conversion method.

The following self-evaluating symbols designate the available character encodings for file I/O as well as process I/O:

These symbols can be specified as the encoding parameter of the procedures described later in this document.

For the precise meaning of these values, see the online Info manuals included in Mule, the Multilingual Emacs. As of this writing, Guile supports as many external coded character sets as Mule version 2.3, in addition to Unicode.

New Guile Scheme variables

Variable: input-coding-system
Variable: output-coding-system

These variables are used by open-file and related functions to determine the default external encoding to associate with newly opened ports. The value #f, which is the default, means that no conversion takes place on I/O.

Variable: process-coding-system

Used by functions to determine the character encoding understood by the operating system environment. The value of this variable affects the behavior of each of the following functions:

The initial value is #f, meaning that no conversion takes place.

Changes to existing Guile Scheme commands

Although the change is not outwardly visible, every string operation now treats a string as a sequence of characters rather than a chunk of bytes. In particular, an index value is taken to be a character position rather than a byte position, the length of a string is its number of characters rather than its number of bytes, and so on. The same holds for the uniform vector operations on multibyte strings.

Users who need to operate on byte sequences are encouraged to use byte-vector extensions instead.

The following procedures have been extended as part of the internationalization features in Guile:

Function: open-file str mode &optional encoding
Function: open-input-file str &optional encoding
Function: open-output-file str &optional encoding
Function: open-io-file str &optional encoding
These functions open a file specified by str, and return a port associated with the file. str must be a string or a symbol that names the file to open.

The mode argument to open-file specifies the direction(s) in which I/O operations are allowed via the returned port. It can be one of the following values:

The optional argument encoding can be a symbol that names an external character encoding. If specified, further I/O operations via the opened port convert the file contents between the specified character encoding and the Guile Scheme internal character encoding. If encoding is omitted, the encoding of the file is determined by the current value of input-coding-system (for input) or output-coding-system (for output).

The functions open-input-file, open-output-file and open-io-file are similar to open-file, except that they open a file read-only, write-only, and read-write, respectively.
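
The rule for choosing an encoding when a file is opened can be modelled as follows. This is a Python sketch of the rule as described in this section, not Guile code: the names open_with_coding and DEFAULT_CODING are hypothetical stand-ins, None stands in for #f, and "no conversion" is modelled by opening the file in binary mode.

```python
import os
import tempfile

# Stand-ins for input-coding-system ("r") and output-coding-system ("w").
# None plays the role of #f, meaning no conversion.
DEFAULT_CODING = {"r": None, "w": None}


def open_with_coding(path, direction, encoding=None):
    """Open path for input ("r") or output ("w").

    An explicit encoding wins; otherwise the default for the direction is
    used; if that is None as well, the file is opened without conversion.
    """
    if direction not in ("r", "w"):
        raise ValueError("direction must be 'r' or 'w'")
    if encoding is None:
        encoding = DEFAULT_CODING[direction]
    if encoding is None:
        return open(path, direction + "b")
    return open(path, direction, encoding=encoding)


path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Explicit encoding on output: characters are converted to EUC-JP bytes.
with open_with_coding(path, "w", "euc_jp") as port:
    port.write("日本語")

# No explicit encoding and no default: the bytes come back unconverted.
with open_with_coding(path, "r") as port:
    unconverted = port.read()

# Changing the default affects ports opened afterwards.
DEFAULT_CODING["r"] = "euc_jp"
with open_with_coding(path, "r") as port:
    converted = port.read()
```

The point of the sketch is the order of precedence: an explicit argument first, then the default for the direction, then no conversion at all.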

Function: call-with-input-file str proc &optional encoding
Function: call-with-output-file str proc &optional encoding
Call proc with one argument, a port opened on the file named by str.

Both functions close the port after proc returns.

Both return the value of proc.

The optional argument encoding specifies the external character encoding used in the file str. The default behavior is determined by the current value of input-coding-system or output-coding-system, respectively.

Function: load name &optional encoding
Loads the Scheme source file named by name into core. If encoding is specified, the contents of the file are assumed to be encoded in encoding. Otherwise, the current value of input-coding-system is used.

New Guile Scheme commands

Function: port-coding port
Function: set-port-coding! port encoding &optional modes

The function port-coding retrieves the character encoding used by port. port must be an open port object; otherwise an error is signaled. It returns a symbol that names the character encoding currently used by port. Refer to the previous sections for the external character encoding symbols that are available in this release.

The function set-port-coding! sets the character encoding attribute of port to encoding.

The following procedures are not directly related to internationalization, but were added in the hope that they will effectively complement the handling of raw byte data (such as binary image data and network packet data) that would otherwise have been implemented using strings:

Function: uniform-vector->string uve encoding
Function: string->uniform-vector str encoding
Coerce between a byte-vector and a string. In Guile, a byte-vector is a uniform-vector whose prototype is #\nul.

If encoding is specified, code conversion between encoding and the Guile internal encoding is performed. Otherwise, no conversion takes place.

For uniform-vector->string, programmers must make sure that each vector element holds a value that is valid as a string element.
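
Why the elements need checking can be seen in the following Python analogy, in which a bytes object stands in for a byte-vector and decoding stands in for uniform-vector->string with an encoding. How Guile itself reacts to an invalid element is not specified here; the helper bytes_to_string and its return convention are hypothetical, and EUC-JP is an assumed example encoding.

```python
def bytes_to_string(raw, encoding):
    """Return the decoded string, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


valid = "日".encode("euc_jp")   # a complete two-byte character
truncated = valid[:1]           # lead byte with its second byte missing

print(bytes_to_string(valid, "euc_jp"))
print(bytes_to_string(truncated, "euc_jp"))
```

A byte sequence cut in the middle of a multibyte character does not form a valid string, so a caller assembling a vector by hand has to validate it before converting.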

Function: concatenate &rest args
Function: concatenate! &rest args
Similar to string-append, but works on any uniform vectors. Each of args must be a uniform-vector with the same element type. concatenate returns a newly created vector, whereas concatenate! modifies the original vector.

Function: subvector vec start end
Returns a uniform-vector formed from elements of the uniform-vector vec, beginning from index start (inclusive) and ending with index end (exclusive).

Note that in this release the returned vector is a shared vector that refers to the original vector. This implementation is subject to change in future releases.
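
What it means for a result to be shared with the original can be illustrated with a Python analogy. A bytearray stands in for the uniform vector and a memoryview slice for the shared result; this shows the general behavior of a shared view and is an assumption about, not a description of, Guile's implementation.

```python
original = bytearray(b"abcdef")

# A memoryview slice shares storage with the original buffer. It covers
# index 1 (inclusive) up to index 4 (exclusive).
shared = memoryview(original)[1:4]

# An independent copy of the same range, for comparison.
copied = bytes(original[1:4])

original[2] = ord("X")   # modify the original after taking both

print(bytes(shared))     # reflects the change made to the original
print(copied)            # keeps the old contents
```

Code that must not observe later changes to the original should therefore copy the result, and code that relies on the sharing should expect that behavior to change in a future release.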

New libguile procedures

Function: gscm_foreign_str src len encoding
Function: gscm_foreign_str0 src encoding
Used for converting a string from the encoding used in "foreign" code (anywhere outside GSCM) to the Guile Scheme internal encoding.

Similar to gscm_str and gscm_str0, respectively, except that they accept an encoding argument. encoding must be of type SCM and must be a valid Scheme symbol representing a character encoding. This means you may have to scm_intern the encoding name in your code. This is subject to change.

Function: gscm_2_foreign_str str_out len_out obj_in encoding
Used for converting the contents of an SCM string to a "foreign" string encoded in encoding.

Similar to gscm_2_string, but accepts an encoding argument. encoding must be of type SCM and must be a valid Scheme symbol representing a character encoding. This means you may have to scm_intern the encoding name in your code (this is subject to change).

str_out must be the address of unsigned char * storage, which need not already point to allocated memory.

