<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE TIP SYSTEM "http://tcl.activestate.com/cgi-bin/tct/tip/tipxml.dtd">
<!-- Converted at Thu Feb 09 15:34:03 GMT 2012 -->
<!-- TIP AutoGenerator - written by Donal K. Fellows -->

<TIP number='249'>
<header><title>Unification of Tcl&apos;s Parsing of Numbers</title><author address="mailto:kennykb@acm.org">Kevin B. Kenny</author><author address="mailto:escargo@skypoint.com">David S. Cargo</author><author address="mailto:dgp@users.sf.net">Don Porter</author><status type='informative' state='draft' vote='none'>$Revision: 1.9 $</status><history></history><created day='13' month='jun' year='2005' /></header>
<abstract>This TIP proposes to unify the recognition of all of Tcl&apos;s &quot;numeric&quot; objects into a single parser. The intended effect is to improve performance by eliminating a number of cases where a cached numeric representation may be discarded, and to restore (more accurately, to establish) the &quot;everything is a string&quot; principle in dealing with numbers.</abstract>
<body><section title="Rationale">
<para>Tcl&apos;s handling of numbers has always been problematic and ambiguous. Even in the earliest releases of the <emph style="bold">expr</emph> command, there were issues with the unexpected demotion of floating point numbers to integers, causing subsequent divisions to be interpreted as integer division with incorrect results.</para>
<para>Another trouble spot has been the interpretation of constants with leading zeroes. When these are interpreted as integers, they are octal numbers. They can also be interpreted as floating point constants (at least with <emph style="italic">Tcl_GetDoubleFromObj</emph>), in which case they are decimal. Because of this ambiguity, the <emph style="bold">expr</emph> system cannot make effective use of the internal representation of a floating point number; it needs to refer back to the string to make sure that the number is not an octal integer to which <emph style="italic">Tcl_GetDoubleFromObj</emph> has been applied.</para>
<para>Even more confusing is the treatment of numbers that have leading zeroes but contain the digits 8 or 9. These are rejected by the <emph style="bold">expr</emph> parser as invalid octal but are accepted by <emph style="italic">Tcl_GetDoubleFromObj</emph>.</para>
</section>
<section title="Proposal">
<para>This TIP proposes a strict &quot;everything is a string&quot; interpretation for strings as numeric values. The set of strings that can be interpreted as numbers shall be partitioned into disjoint subsets, with a single &quot;canonical&quot; representation for each.</para>
<para>As a consequence of this change, a few C calls will break compatibility. In particular, <emph style="italic">Tcl_GetDoubleFromObj</emph> may leave an integer internal representation in the object, despite the documentation&apos;s assertion that the object will shimmer. Similarly, <emph style="italic">Tcl_GetDoubleFromObj</emph> will no longer interpret octal integers as decimal; the current behavior causes only surprise and consternation.</para>
<para>The <emph style="italic">Tcl_ConvertToType</emph> call will also no longer force conversion to a specific numeric type. Since it does not do so, it is not reasonable for extensions to use it on the numeric types. For this reason, the numeric types <emph style="italic">shall not be registered;</emph> <emph style="italic">Tcl_GetObjType</emph> will fail when presented with one of their names.</para>
<para>When one of the conversion procedures <emph style="italic">Tcl_GetIntFromObj</emph>, <emph style="italic">Tcl_GetWideIntFromObj</emph>, <emph style="italic">Tcl_GetBignumFromObj</emph> (assuming the eventual approval of <tipref type="text" tip="237"/>), or <emph style="italic">Tcl_GetDoubleFromObj</emph> is called, it will cast any pre-existing numeric internal representation that it finds to the appropriate return type (throwing an error if the number is too large to represent, or a double is used in an integer context). If the procedure finds no pre-existing numeric internal representation, it will extract the string representation, determine its canonical representation as a number, and store that.</para>
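<para>The casting rule above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the reference implementation: the names NumRep, RepKind, GetIntFromRep and GetDoubleFromRep are hypothetical stand-ins for a cached numeric internal representation and its accessors, bignums are omitted, and the fallback that parses the string representation is left out (an object with no cached number simply reports an error here).</para>
<verbatim><![CDATA[
```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached numeric internal representation. */
typedef enum { REP_NONE, REP_INT, REP_DOUBLE } RepKind;

typedef struct {
    RepKind kind;
    long long intValue;   /* valid when kind == REP_INT */
    double doubleValue;   /* valid when kind == REP_DOUBLE */
} NumRep;

enum { CONV_OK = 0, CONV_ERROR = 1 };

/* Integer request: reuse a cached integer if it fits in an int.
 * A cached double in an integer context is an error. */
int GetIntFromRep(const NumRep *rep, int *out)
{
    assert(rep != NULL && out != NULL);
    if (rep->kind != REP_INT) {
        return CONV_ERROR;          /* double, or no cached number */
    }
    if (rep->intValue < INT_MIN || rep->intValue > INT_MAX) {
        return CONV_ERROR;          /* too large to represent */
    }
    *out = (int) rep->intValue;
    return CONV_OK;
}

/* Double request: a cached integer is cast, not discarded. */
int GetDoubleFromRep(const NumRep *rep, double *out)
{
    assert(rep != NULL && out != NULL);
    switch (rep->kind) {
    case REP_INT:
        *out = (double) rep->intValue;
        return CONV_OK;
    case REP_DOUBLE:
        *out = rep->doubleValue;
        return CONV_OK;
    default:
        return CONV_ERROR;          /* string parsing omitted in this sketch */
    }
}
```
]]></verbatim>
<para>Note that in this sketch a request for a double leaves the cached integer untouched, which mirrors the point that the requested type and the stored type need not agree.</para>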
<para>The easiest way to visualize the specific sets of strings that are recognized as numbers is with a diagram of the state machine that implements them.</para>
<image src="249statemachine" caption="State machine that recognizes numbers." />
<para>In the diagram, &quot;Start&quot; represents the start state of the machine. The leading and trailing whitespace that is allowed for all numbers is not diagrammed, for clarity.</para>
<para>Intermediate states of the machine are represented by small ovals. Large rectangles represent final states, and are labeled with the type of number that will result. Note that any number can optionally begin with a &apos;+&apos; or &apos;-&apos; character, which will not be mentioned further. Each of the accepting states, however, merits further discussion.</para>
<enumerate><item.e index='1'><para>The string &quot;0&quot; shall always represent an integer of the smallest type available (<emph style="bold">tclIntType</emph>). It shall never represent a floating point value.</para></item.e><item.e index='2'><para>A leading zero followed by a string of octal digits shall be interpreted as an octal integer. The integer shall be stored in the smallest of <emph style="bold">tclIntType</emph>, <emph style="bold">tclWideIntType</emph> and <emph style="bold">tclBignumType</emph> that will hold it. (Note that storing <emph style="bold">tclBignumType</emph> is possible without accepting <tipref type="text" tip="237"/>, provided that the <emph style="italic">Tcl_Get*FromObj</emph> routines recognize it and convert its value as needed.) The interpretation as an octal integer shall hold even if the string is presented to <emph style="italic">Tcl_GetDoubleFromObj</emph>, which today interprets it as decimal.</para></item.e><item.e index='3'><para>A leading zero, followed by the letter &apos;X&apos; (case insensitive) and a string of hexadecimal digits shall be interpreted as a hexadecimal integer. Again, the smallest representation needed is chosen.</para></item.e><item.e index='4'><para>A string of decimal digits beginning with a nonzero digit is interpreted as a decimal integer and stored in the smallest suitable internal representation.</para></item.e><item.e index='5'><para>A string of digits beginning with a zero but containing the digits <emph style="bold">8</emph> or <emph style="bold">9</emph> is an error; it appears to be bad octal. 
It would be possible to allow this case in <emph style="italic">Tcl_GetDoubleFromObj</emph>, but it seems unwise, since the consequence would be that <emph style="bold">string is double</emph> would accept &quot;double&quot; strings that will fail in <emph style="bold">expr</emph>.</para></item.e><item.e index='6'><para>A string consisting of a nonempty sequence of decimal digits and a single period (which may appear anywhere within the string) is a valid floating point constant in &apos;F&apos; format, even if it begins with &apos;0&apos;. It is interpreted in decimal and stored as <emph style="bold">tclDoubleType</emph>. If the input number is too small to represent, an appropriately signed zero is stored. If the input number is too large to represent, an appropriately signed infinity is stored.</para></item.e><item.e index='7'><para>Floating point numbers in the usual &apos;E&apos; format are accepted and interpreted in decimal. Once again, they are stored as <emph style="bold">tclDoubleType</emph> and are replaced with zero or infinity if they are too small or too large.</para></item.e><item.e index='8'><para>The constants &quot;Inf&quot; and &quot;Infinity&quot; (optionally with a leading sign) are interpreted as infinities. Infinity is represented as <emph style="bold">tclDoubleType</emph>.</para></item.e><item.e index='9'><para>The constant &quot;NaN&quot; is the IEEE &quot;Not a Number&quot; value. It is specifically permitted in the parser so that <emph style="bold">binary format q NaN</emph> and similar calls can produce NaN on an external medium. The presence of NaN in expressions, or in <emph style="italic">Tcl_GetDoubleFromObj</emph>, signals an error. NaN is represented as <emph style="bold">tclDoubleType</emph>.</para></item.e><item.e index='10'><para>IEEE floating point does not have a single unique NaN value, so a NaN may be augmented by a parenthesized string of hexadecimal digits, which will be stored in its least significant bits. 
It shall not be possible to construct signalling NaN by this route; only quiet NaN will be supported. NaN is represented as <emph style="bold">tclDoubleType.</emph></para></item.e></enumerate>
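<para>The accepting states listed above can be approximated by a small classifier. The following C sketch is an illustration only and is not taken from the reference implementation: the names NumKind, ClassifyNumber and EqualsNoCase are hypothetical, surrounding whitespace and the parenthesized NaN payload are not handled, and treating &quot;Inf&quot;, &quot;Infinity&quot; and &quot;NaN&quot; as case insensitive is an assumption. It reports only the category of a string, not its value or storage size.</para>
<verbatim><![CDATA[
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum {
    NUM_INVALID, NUM_ZERO, NUM_OCTAL, NUM_HEX,
    NUM_DECIMAL, NUM_DOUBLE, NUM_INFINITY, NUM_NAN
} NumKind;

/* Compare s against an all-lowercase word, ignoring case in s. */
static int EqualsNoCase(const char *s, const char *word)
{
    while (*s != '\0' && *word != '\0') {
        if (tolower((unsigned char) *s) != *word) {
            return 0;
        }
        s++;
        word++;
    }
    return *s == '\0' && *word == '\0';
}

NumKind ClassifyNumber(const char *s)
{
    size_t digits = 0, expDigits = 0;
    int sawPoint = 0, sawExp = 0, sawHighDigit = 0;
    const char *p;

    assert(s != NULL);
    if (*s == '+' || *s == '-') {
        s++;                                   /* optional sign */
    }
    if (EqualsNoCase(s, "inf") || EqualsNoCase(s, "infinity")) {
        return NUM_INFINITY;                   /* item 8 */
    }
    if (EqualsNoCase(s, "nan")) {
        return NUM_NAN;                        /* item 9 */
    }
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        p = s + 2;                             /* item 3 */
        if (*p == '\0') {
            return NUM_INVALID;
        }
        for (; *p != '\0'; p++) {
            if (!isxdigit((unsigned char) *p)) {
                return NUM_INVALID;
            }
        }
        return NUM_HEX;
    }
    for (p = s; isdigit((unsigned char) *p); p++) {
        digits++;
        if (*p >= '8') {
            sawHighDigit = 1;                  /* 8 or 9 before any period */
        }
    }
    if (*p == '.') {
        sawPoint = 1;
        for (p++; isdigit((unsigned char) *p); p++) {
            digits++;
        }
    }
    if (digits == 0) {
        return NUM_INVALID;                    /* "", "+", "." and the like */
    }
    if (*p == 'e' || *p == 'E') {
        sawExp = 1;
        p++;
        if (*p == '+' || *p == '-') {
            p++;
        }
        for (; isdigit((unsigned char) *p); p++) {
            expDigits++;
        }
        if (expDigits == 0) {
            return NUM_INVALID;
        }
    }
    if (*p != '\0') {
        return NUM_INVALID;                    /* trailing junk */
    }
    if (sawPoint || sawExp) {
        return NUM_DOUBLE;                     /* items 6 and 7 */
    }
    if (s[0] != '0') {
        return NUM_DECIMAL;                    /* item 4 */
    }
    if (s[1] == '\0') {
        return NUM_ZERO;                       /* item 1 */
    }
    return sawHighDigit ? NUM_INVALID : NUM_OCTAL;   /* items 5 and 2 */
}
```
]]></verbatim>
<para>In this sketch the string &quot;09&quot; is rejected as bad octal while &quot;09.5&quot; is classified as a double, which matches the distinction drawn between items 5 and 6.</para>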
</section>
<section title="Additions">
<para>In addition to the base state machine detailed above, the state machine of the reference implementation contains additional states to parse integer values beginning with the <emph style="bold">0b</emph> or <emph style="bold">0o</emph> prefixes as originally proposed in <tipref type="text" tip="114"/>. Getting these prefixes recognized in Tcl 8.5 is an important step toward migrating to whatever version of Tcl drops the &quot;leading <emph style="bold">0</emph> implies octal format&quot; rule.</para>
<para>The parsing routine will also accept a <emph style="italic">flags</emph> value containing the flag bits below, which exert finer control over the parsing. These extra controls were found to be required to permit the [scan] command to use the same parser.</para>
<itemize><item.i><para><emph style="bold">TCL_PARSE_INTEGER_ONLY</emph> -- accept only integer values; reject strings that denote floating point values (or accept only the leading portion of them that is an integer value).</para></item.i><item.i><para><emph style="bold">TCL_PARSE_SCAN_PREFIXES</emph> -- ignore the prefixes <emph style="bold">0b</emph> and <emph style="bold">0o</emph>, which are not part of the [scan] command&apos;s vocabulary. Use only in combination with <emph style="bold">TCL_PARSE_INTEGER_ONLY</emph>.</para></item.i><item.i><para><emph style="bold">TCL_PARSE_OCTAL_ONLY</emph> -- parse only in the octal format, whether or not a prefix is present that would lead to octal parsing. Use only in combination with <emph style="bold">TCL_PARSE_INTEGER_ONLY</emph>.</para></item.i><item.i><para><emph style="bold">TCL_PARSE_HEXADECIMAL_ONLY</emph> -- parse only in the hexadecimal format, whether or not a prefix is present that would lead to hexadecimal parsing. Use only in combination with <emph style="bold">TCL_PARSE_INTEGER_ONLY</emph>.</para></item.i><item.i><para><emph style="bold">TCL_PARSE_DECIMAL_ONLY</emph> -- parse only in the decimal format, even when a <emph style="bold">0</emph> prefix would normally force a different base.</para></item.i></itemize>
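<para>One way to picture how these flags interact is a helper that picks the radix before any digits are consumed. The C sketch below is a guess at the intent rather than the reference implementation: the SK_PARSE_* constants and their values are hypothetical stand-ins for the flags named above, ChooseBase is an invented name, sign handling is omitted, and skipping an optional <emph style="bold">0x</emph> prefix in hexadecimal-only mode is an assumed reading of &quot;whether or not a prefix is present&quot;.</para>
<verbatim><![CDATA[
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical flag values; the TIP names the flags but not their bits. */
enum {
    SK_PARSE_INTEGER_ONLY     = 1 << 0,
    SK_PARSE_SCAN_PREFIXES    = 1 << 1,
    SK_PARSE_OCTAL_ONLY       = 1 << 2,
    SK_PARSE_HEXADECIMAL_ONLY = 1 << 3,
    SK_PARSE_DECIMAL_ONLY     = 1 << 4
};

/* Return the radix to use for s and set *digits to the first digit
 * to be consumed. Validation of the digits is left to the caller. */
int ChooseBase(const char *s, int flags, const char **digits)
{
    assert(s != NULL && digits != NULL);
    *digits = s;
    if (flags & SK_PARSE_DECIMAL_ONLY) {
        return 10;                     /* a leading 0 does not change the base */
    }
    if (flags & SK_PARSE_OCTAL_ONLY) {
        return 8;
    }
    if (flags & SK_PARSE_HEXADECIMAL_ONLY) {
        if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
            *digits = s + 2;           /* assumed: prefix is optional */
        }
        return 16;
    }
    if (s[0] != '0') {
        return 10;
    }
    if (s[1] == 'x' || s[1] == 'X') {
        *digits = s + 2;
        return 16;
    }
    if (!(flags & SK_PARSE_SCAN_PREFIXES)) {
        if (s[1] == 'b') {
            *digits = s + 2;
            return 2;
        }
        if (s[1] == 'o') {
            *digits = s + 2;
            return 8;
        }
    }
    if (isdigit((unsigned char) s[1])) {
        *digits = s + 1;               /* leading 0 implies octal */
        return 8;
    }
    return 10;                         /* a bare "0", possibly followed by junk */
}
```
]]></verbatim>
<para>With the scan-style flags set, the string &quot;0b101&quot; falls through to base 10 in this sketch, so a caller would consume only the leading &quot;0&quot; and stop at the &quot;b&quot;.</para>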
</section>
<section title="Incompatibilities">
<para>With the change described, the Tcl and Tk test suites run with unwanted results only in the detailed format of error messages for integer overflow and in the types reported when the <emph style="bold">testobj</emph> command (not part of the usual distribution) is used to introspect values. Despite this reassurance, several potential incompatibilities are identified.</para>
<para>First, as mentioned above, C extensions will no longer have fine control over Tcl&apos;s built-in numeric types, because the types will not be registered and hence will be unavailable for use with <emph style="italic">Tcl_ConvertToType.</emph> This is actually a good thing, since it means that the rest of Tcl can assume that they are well-behaved, resulting in a considerable simplification. Most of the Tcl Core Team believes that <emph style="italic">Tcl_ConvertToType</emph> has no legitimate use in any case.</para>
<para>Second, it will no longer be correct to assume that <emph style="italic">Tcl_Get*FromObj</emph> will leave an internal representation of precisely the requested type. It is, in any case, a highly questionable practice for callers to assume a specific internal representation (with the possible exception of <emph style="italic">Tcl_Set*Obj</emph> and <emph style="italic">Tcl_New*Obj</emph>). There will no doubt be a few extensions that run afoul of this change, but they can be fixed easily in such a way that they will continue to compile and run on earlier versions of Tcl.</para>
<para>Third, <emph style="italic">Tcl_GetDoubleFromObj</emph> will be both more and less permissive than before. It will no longer accept constants with a leading zero and no decimal point or &apos;E&apos; that are invalid octal numbers. On the other hand, it will accept constants that are too large to fit in a <emph style="bold">Tcl_WideInt</emph>; somewhat surprisingly, <emph style="bold">string repeat 9 50</emph> cannot today be interpreted as a double. <emph style="bold">string is double</emph> will follow <emph style="italic">Tcl_GetDoubleFromObj</emph> in what it considers acceptable. Any string that is accepted as either an integer or a double by <emph style="bold">expr</emph> will be accepted in <emph style="italic">Tcl_GetDoubleFromObj</emph>, and only those strings will be accepted.</para>
<para>Fourth, the recognition of <emph style="bold">0b</emph> and <emph style="bold">0o</emph> as valid prefixes for integer values is itself an incompatibility, since strings with these prefixes were not previously accepted as numbers.</para>
</section>
<section title="Reference Implementation">
<para>See <tipref type="text" tip="237"/> for more implementation details.</para>
</section>
<section title="Copyright">
<para>Copyright (c) 2005 by Kevin B. Kenny. All rights reserved. </para>
<para>This document may be distributed subject to the terms and conditions set forth in the Open Publication License, version 1.0 [<url ref="http://www.opencontent.org/openpub/"/>].</para>
</section>
</body></TIP>

