TIP #388 Version 1.2: Extending Unicode literals past the BMP

This is not necessarily the current version of this TIP.


TIP:388
Title:Extending Unicode literals past the BMP
Version:$Revision: 1.2 $
Author:Jan Nijtmans <jan dot nijtmans at users dot sf dot net>
State:Draft
Type:Project
Tcl-Version:8.6
Vote:Pending
Created:Wednesday, 10 August 2011
Discussions To:Tcl Core list
Keywords:Tcl

Abstract

This TIP proposes to extend Tcl's syntax in order to be able to cope with quoted forms of Unicode characters outside the Basic Multilingual Plane.

Summary

Tcl provides backslash substitutions of the form \uhhhh for Unicode characters, but this form cannot represent Unicode literals past the BMP. The outcome of the discussion on Tcl-Core was to add the form \Uhhhhhhhh (one to eight hexadecimal digits), but it is still ambiguous how characters > 0x10ffff, Unicode noncharacters and Unicode surrogates need to be handled. This TIP is meant to sort that out; it is not meant to specify how characters outside the BMP are handled. The reference implementation just replaces any character in the range \U010000 - \U10ffff with \ufffd, but as soon as Tcl has support for characters outside the BMP this range is reserved for exactly that.
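The \U form described above can be sketched in C as follows. This is only an illustration, not the reference implementation: the names hex_digit_value and parse_upper_u are hypothetical, and mapping values above 0x10ffff to \ufffd is an assumption taken from one of the alternatives this TIP discusses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Tcl API): value of one hex digit, or -1. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical sketch: parse the text that follows "\U".
 * Reads one to eight hexadecimal digits from src, stores the code point
 * in *out and returns the number of digits consumed.
 * If no hex digit follows, returns 0 and stores a literal 'U', which is
 * how Tcl currently treats \U.
 * Assumption: values above 0x10FFFF are replaced by U+FFFD.
 */
static size_t parse_upper_u(const char *src, unsigned long *out)
{
    unsigned long value = 0;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    while (n < 8) {
        int d = hex_digit_value(src[n]);
        if (d < 0) break;               /* also stops at the terminating NUL */
        value = (value << 4) | (unsigned long)d;
        n++;
    }
    if (n == 0) {
        *out = (unsigned long)'U';      /* no digits: \U stays a literal U */
        return 0;
    }
    if (value > 0x10FFFFUL) {
        value = 0xFFFDUL;               /* assumed replacement policy */
    }
    *out = value;
    return n;
}
```

With this sketch, the digits "0001F600" yield the code point 0x1F600, while "00123456" yields 0xFFFD. A caller that wants to mimic the reference implementation would additionally map 0x10000 - 0x10FFFF to 0xFFFD.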

Currently, Tcl parses the form \U as a literal U, so this change, however small, results in a non-trivial potential incompatibility and therefore requires a TIP.

Considering backslash sequences, there are two other forms which are currently not consistent: \xhh accepts an unlimited number of hex digits, unlike other modern languages, and the form \ooo, where the first octal digit is in the range 4..7, is currently not handled consistently in Tcl. Now is an opportunity to reconsider this.

In tcl.h there is a remark regarding the possible values of TCL_UTF_MAX:

This document proposes to add another value:

Rationale

Consider the string \701: how is that supposed to be interpreted? Tcl specifies octal sequences as 8 bits and silently strips the 9th bit, the same as gcc does. In Tcl's regular expression engine the 9th bit is not stripped; there it is equivalent to \u01c1. Java parses it as \70 - a valid 8-bit octal value - followed by 1, so it is a string of length 2.
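The three readings of \701 can be compared with a small C sketch. The enum and the function parse_octal are hypothetical names used only for illustration; they do not correspond to code in Tcl, gcc or Java.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three interpretations discussed above. */
enum octal_mode {
    OCTAL_STRIP_NINTH_BIT,    /* up to 3 digits, result masked to 8 bits   */
    OCTAL_KEEP_NINTH_BIT,     /* up to 3 digits, 9-bit result kept         */
    OCTAL_TWO_DIGITS_IF_HIGH  /* only 2 digits when the first one is 4..7  */
};

/*
 * Parse the octal digits that follow a backslash. Stores the value in
 * *out and returns the number of digits consumed.
 */
static size_t parse_octal(const char *src, enum octal_mode mode,
                          unsigned int *out)
{
    unsigned int value = 0;
    size_t max_digits = 3;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    if (mode == OCTAL_TWO_DIGITS_IF_HIGH && src[0] >= '4' && src[0] <= '7') {
        max_digits = 2;       /* a third digit would overflow 8 bits */
    }
    while (n < max_digits && src[n] >= '0' && src[n] <= '7') {
        value = (value << 3) | (unsigned int)(src[n] - '0');
        n++;
    }
    if (mode == OCTAL_STRIP_NINTH_BIT) {
        value &= 0xFFu;       /* silently drop the 9th bit */
    }
    *out = value;
    return n;
}
```

For the input "701", the first mode gives 0xc1 after consuming three digits, the second gives 0x1c1, and the third consumes only "70" (0x38) and leaves the 1 as an ordinary character.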

Then consider the string \x1234. Tcl specifies this as 8 bits as well and silently strips all higher bits, so it is equivalent to \u0034. This is the same as gcc does, but Java considers it as \x12 followed by 34, so it is a string of length 3.
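The two readings of \x1234 differ only in how many digits are consumed. A minimal C sketch, with the hypothetical names hex_digit_value and parse_hex_escape, where a limit of 0 stands for "no limit":

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Tcl API): value of one hex digit, or -1. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Parse the hex digits that follow "\x". max_digits == 0 means that any
 * number of digits is consumed. Only the low 8 bits of the value are
 * kept. Returns the number of digits consumed.
 */
static size_t parse_hex_escape(const char *src, size_t max_digits,
                               unsigned int *out)
{
    unsigned int value = 0;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    while (max_digits == 0 || n < max_digits) {
        int d = hex_digit_value(src[n]);
        if (d < 0) break;
        /* Mask on every step so that long digit runs cannot overflow. */
        value = ((value << 4) | (unsigned int)d) & 0xFFu;
        n++;
    }
    *out = value;
    return n;
}
```

With no limit, "1234" is consumed completely and yields 0x34. With a limit of two digits, only "12" is consumed, yielding 0x12, and "34" remains as two ordinary characters.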

Consider the string \U00123456, which would result in an invalid Unicode character. The Tcl parser has no way to flag invalid backslash sequences, whereas Tcl's regexp engine does. Unicode characters higher than \U0010ffff cannot appear in a UTF-8 stream, and Unicode noncharacters and Unicode surrogates are not supposed to appear in one either, so it would be best to handle them as early as possible.
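The three problem categories can be told apart with a few range checks. The sketch below uses hypothetical names (cp_class, classify_code_point, sanitize_code_point). The ranges themselves come from the Unicode standard: surrogates are U+D800 - U+DFFF, and noncharacters are U+FDD0 - U+FDEF plus every code point whose low 16 bits are FFFE or FFFF. Replacing all three categories with \ufffd is an assumption, being one of the options this TIP leaves open.

```c
#include <assert.h>

/* Hypothetical classification of a parsed code point. */
enum cp_class { CP_OK, CP_SURROGATE, CP_NONCHARACTER, CP_OUT_OF_RANGE };

static enum cp_class classify_code_point(unsigned long cp)
{
    if (cp > 0x10FFFFUL) {
        return CP_OUT_OF_RANGE;             /* beyond the last Unicode plane */
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) {
        return CP_SURROGATE;                /* reserved for UTF-16 pairs */
    }
    if ((cp >= 0xFDD0UL && cp <= 0xFDEFUL) || (cp & 0xFFFEUL) == 0xFFFEUL) {
        return CP_NONCHARACTER;             /* U+FDD0..U+FDEF, U+xxFFFE/F */
    }
    return CP_OK;
}

/* Assumed policy: anything that is not CP_OK becomes U+FFFD. */
static unsigned long sanitize_code_point(unsigned long cp)
{
    unsigned long result =
        (classify_code_point(cp) == CP_OK) ? cp : 0xFFFDUL;

    assert(result <= 0x10FFFFUL);           /* result is always in range */
    return result;
}
```

A regexp-style parser could instead report an error for every class other than CP_OK, which is the other alternative mentioned in this TIP.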

In tcl.h, we find Tcl_UniChar to be defined as unsigned int when TCL_UTF_MAX > 3 and as unsigned short otherwise. It would be useful to allow TCL_UTF_MAX to be defined in extensions as 4 and still define Tcl_UniChar as unsigned short. That would shorten the path to full support for Unicode characters outside the BMP, because Unicode surrogate pairs can be used for that.
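To illustrate why a 16-bit Tcl_UniChar would still suffice, here is a minimal sketch of the standard UTF-16 encoding of one code point into 16-bit units. The type unichar16 and the function encode_utf16 are hypothetical stand-ins, not names from tcl.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Tcl_UniChar defined as unsigned short. */
typedef unsigned short unichar16;

/*
 * Encode one code point as UTF-16. Writes one or two units to out and
 * returns the number of units written, or 0 if cp is above 0x10FFFF.
 */
static size_t encode_utf16(unsigned long cp, unichar16 out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFFUL) {
        return 0;                               /* not a Unicode code point */
    }
    if (cp < 0x10000UL) {
        out[0] = (unichar16)cp;                 /* BMP: a single unit */
        return 1;
    }
    cp -= 0x10000UL;                            /* 20 bits remain */
    out[0] = (unichar16)(0xD800UL + (cp >> 10));     /* high surrogate */
    out[1] = (unichar16)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

For example, U+1F600 is stored as the two units 0xD83D and 0xDE00, while any BMP character still occupies a single unit.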

Specification

This document proposes:

Compatibility

In Tcl scripts using the form \ooo where the first digit is in the range 4-7, the sequence will now be interpreted as \oo followed by o. There is no test case in the Tcl test suite for that.

Tcl scripts using the form \uhhhh where it represents a Unicode noncharacter or surrogate will now get a different character, \ufffd. In the Tcl regexp engine, those are flagged as illegal and will generate an exception.

Tcl scripts using \U as a literal U will no longer work when it is followed by at least one hexadecimal digit. There is no test case in the Tcl test suite for this.

Alternatives

How should Unicode sequences bigger than \U0010ffff be handled? The alternatives are replacing them with \ufffd or (in the regexp engine) flagging them as an invalid backslash sequence.

How should Unicode noncharacters be handled? Is flagging them as an invalid sequence or replacing them with \ufffd really a good idea?

How should Unicode surrogates be handled? Should we allow something like \ud800\udc00 (a high surrogate followed by a low surrogate) as equivalent to \U00010000?
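If surrogate pairs were accepted, combining them would follow the standard UTF-16 rule, in which U+10000 corresponds to the high surrogate 0xD800 followed by the low surrogate 0xDC00. The sketch below uses the hypothetical name combine_surrogates and assumes that an unpaired or wrongly ordered surrogate yields \ufffd; this TIP does not decide that question.

```c
#include <assert.h>

/*
 * Combine two 16-bit units into one code point.
 * Returns the code point when first is a high surrogate (D800..DBFF)
 * and second is a low surrogate (DC00..DFFF).
 * Assumption: any other combination yields U+FFFD.
 */
static unsigned long combine_surrogates(unsigned int first,
                                        unsigned int second)
{
    assert(first <= 0xFFFFu && second <= 0xFFFFu);  /* 16-bit units only */
    if (first >= 0xD800u && first <= 0xDBFFu &&
        second >= 0xDC00u && second <= 0xDFFFu) {
        return 0x10000UL
            + ((unsigned long)(first - 0xD800u) << 10)
            + (unsigned long)(second - 0xDC00u);
    }
    return 0xFFFDUL;
}
```

Under these assumptions, the pair 0xD800, 0xDC00 gives 0x10000, whereas the reversed order 0xDC00, 0xD800 gives 0xFFFD.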

Reference Implementation

A reference implementation is available at http://core.tcl.tk/tcl in branch ??? (to be determined)

Copyright

This document has been placed in the public domain.

