TIP #29 Version 1.4: Allow array syntax for Tcl lists

This is not necessarily the current version of this TIP.


TIP:29
Title:Allow array syntax for Tcl lists
Version:$Revision: 1.4 $
Authors: Kevin Kenny <kennykb at acm dot org>
Donal K. Fellows <fellowsd at cs dot man dot ac dot uk>
State:Draft
Type:Project
Tcl-Version:9.0
Vote:Pending
Created:Wednesday, 07 March 2001
Discussions To:news:comp.lang.tcl
Discussions To:mailto:kennykb@acm.org

Abstract

Most popular programming languages provide some sort of indexed array construct, where array subscripts are integers. Tcl's lists are, in fact, arrays, but the existing syntax obscures the fact. Moreover, the existing list commands make it difficult to manipulate lists as arrays without running into peculiar performance issues. This TIP proposes that the syntax of variableName(value) be extended to function as an array selector if variableName designates a list. This change is upward compatible with existing Tcl scripts, because the proposed syntax results in a runtime error in every extant Tcl release.

Rationale

The implementation of lists in Tcl has evolved far beyond the original conception. While lists were originally conceived to be strings with a particular syntax that allowed them to be parsed as lists, the internal representation of a list is now an array of pointers to Tcl_Obj structures.

Tcl programmers, for the most part, have not taken advantage of this evolution. Code that uses hash tables for the purpose is still extremely common. Moreover, it is difficult to update lists in place, even if their internal representations are known not to be shared. One example of this difficulty is seen in the discussions (http://purl.org/thecliff/tcl/wiki/941) of how best to shuffle a list of items. The discussion began with a naïve implementation of Jon Bentley's method of performing random swaps:

  proc shuffle1 { list } {
      set n [llength $list]
      for { set i 0 } { $i < $n } { incr i } {
          set j [expr {int(rand()*$n)}]
          set temp [lindex $list $j]
          set list [lreplace $list $j $j [lindex $list $i]]
          set list [lreplace $list $i $i $temp]
      }
      return $list
  }

Aside from the fact that the syntax obscures what the program is doing, the implementation suffers from an obscure performance problem. When the lreplace calls in the shuffle1 procedure are executed, the internal representation of list has two references: the value of the variable, and the parameter passed to lreplace. The multiple references force lreplace to copy the list, leading to quadratic performance when large lists are shuffled.

It is possible, albeit difficult, to alleviate this problem by careful management of the lifetime of Tcl_Obj structures, but this change complicates the code. The simplest way to fix the performance is probably to use Donal Fellows's implementation of the K combinator:

 proc K { x y } { set x }

which allows the caller of lreplace to extract the value of list, change the value of list so that the extracted value is unshared, and then pass the extracted value as a parameter to lreplace:

  proc shuffle1a { list } {
      set n [llength $list]
      for { set i 0 } { $i < $n } { incr i } {
          set j [expr {int(rand()*$n)}]
          set temp1 [lindex $list $j]
          set temp2 [lindex $list $i]
          set list [lreplace [K $list [set list {}]] $j $j $temp2]
          set list [lreplace [K $list [set list {}]] $i $i $temp1]
      }
      return $list
  }

Now the performance of the code is O(n) where n is the length of the list, but the programmer's intent has been seriously obscured!

These drawbacks have led prominent individuals such as Richard Stallman (http://www.vanderburg.org/Tcl/war/0000.html) to assert that Tcl lacks arrays.

Specification

This TIP's proposed change can be stated succinctly:

Wherever the notation a(x) may be used to refer to an array element in the language, allow it also to refer to an element of a list, provided that the variable a is scalar and the value x is a non-negative integer.

Exception: Read traces and upvar calls designating individual list elements shall not be supported.

Note that this change is backward compatible with existing Tcl scripts! If a notation like a(x) is used to refer to a scalar variable in today's Tcl, the result is an error:

 % set a [list foo bar grill]
 foo bar grill
 % set a(2)
 can't read "a(2)": variable isn't array
 % puts $a(2)
 can't read "a(2)": variable isn't array
 % set a(2) zot
 can't set "a(2)": variable isn't array

The default behavior, if a is not set, and a script executes

 set a(2) zot

will still be to create an associative array. If a script wishes to perform such actions on a list, it will be necessary first to initialize the variable:

 set a [list]
 set a(0) foo

Note that in the example above, there is no requirement that the internal representation of a be a list; the line,

 set a [list]

could have been replaced with

 set a {}

with the only impact being the run-time cost of shimmering the empty string into an empty list. Nowhere does this proposal introduce behavior that depends on a specific internal representation for any variable.

In the proposed syntax, negative or non-numeric subscripts shall cause errors, as shall referring to variables that do not contain well-formed lists. Read operations shall support subscripts ranging from zero to one less than the length of the list. Write operations shall support subscripts ranging from zero to the length of the list, that is, it shall be legal to store into the element one space beyond the current last element (which shall have the effect of expanding the list length by one). Unset operations may use only a subscript equal to one less than the length of the list, that is, it shall be legal to unset only the last element of the list. Unsetting this element shall have the effect of shrinking the list by one member.

With the proposed change in syntax, the procedure to shuffle a list becomes much more straightforward:

  proc shuffle1 { list } {
      set n [llength $list]
      for { set i 0 } { $i < $n } { incr i } {
          set j [expr {int(rand()*$n)}]
          set temp $list($j)
          set list($j) $list($i)
          set list($i) $temp
      }
      return $list
  }

The given implementation copies the list only once, the first time that the line:

          set list($j) $list($i)

is executed. Thereafter, the list is an unshared object, and the replacements are performed in place.

It shall be illegal to pass a list element as the parameter to upvar; that is, the following usage:

  proc increment { varName } {
      upvar 1 $varName v
      incr v
  }
  set x [list 1 2 3]
  increment x(0)

will not be supported. However, the commoner form:

  proc incrementElement { arrayName index } {
      upvar 1 $arrayName array
      incr array($index)
  }
  set x [list 1 2 3]
  incrementElement x 0

will, of course, work as expected.

Write and unset traces on list elements shall be supported; it shall be permissible to write:

    trace add variable x(1) write writeCallback

or

    trace add variable x(1) unset unsetCallback

The write callback shall be invoked whenever the given list element changes value; the unset callback shall be invoked whenever the variable is unset or when its length shrinks to the point that it no longer has a member with the given index.

Read traces on list elements shall not be supported. It is too difficult at this point to define what their semantics should be. For instance, if a program executes

   trace add variable x(0) read readCallback
   set x [list foo bar grill]
   set y [string range $x 4 end]

should the callback fire? By one argument, the program has not read element zero of the list; by another, using the list as a string has read every element, and all read traces should fire. In any case, the read trace on a variable fires before its usage is known; it appears impossible in existing code to implement selective read tracing on list elements.

The implementation of write and unset traces on list elements will be done by establishing a C-level write trace on the variable as a whole. The client data of the trace will designate a structure containing the ordinal number of the element being traced, and a Tcl_Obj pointer designating its old value. The reference count of the Tcl_Obj will be incremented when this pointer is stored. Note that this increment operation makes the object shared. Any change to the designated element will thus need to copy the object.

When the write trace fires, the list representation of the variable will be extracted, reconstituting it from the string representation if necessary. If extracting the list representation fails, the trace will be considered to have failed as well, and the trace callback will return TCL_ERROR. If extracting the list representation succeeds, the list length will be compared with the ordinal number of the element being traced. If the element number is no longer within the list, an unset trace fires if one exists. If the element number is within the list, the two Tcl_Obj pointers are compared. If they are identical, the list element in question is unchanged, and nothing need be done. Otherwise, the write trace fires.

This behavior is conservative in that an operation that spoils the list representation of the object is considered to have written every element of the list. This rule is consistent with the rule that write traces on ordinary Tcl variables fire whenever the variable is set, even if it is being set to an identical value.

In any event, after the conclusion of a trace callback, the saved Tcl_Obj will have its reference count decremented and be replaced with the current element of the list (with reference count appropriately incremented, of course).

Discussion

There are several obvious extensions to the proposal that are not addressed, and these omissions are intentional.

These changes are not addressed primarily because the author of this TIP foresees that they are likely to cause undesirable interactions with future attempts to implement other types of indexed data structure, such as the maps, vectors, and matrices of products like BLT and Feather. In a fully developed Tcl 9.0, it would be desirable for all of these objects to support similar interfaces; this TIP proposes a minimal first step.

Lists are a simple enough structure that the full power of the array command is not required to deal with them, and having it work on lists as well as arrays seems like needless effort. Moreover, existing code may well depend on a combination of [array exists] and [info exists] to distinguish associative arrays from scalar variables (including lists).

Extending the syntax in this fashion would make upvar more consistent in its behavior, but appears to be expensive, in terms of both performance (tracking down the linked references if a list is rebuilt) and the effort required for implementation (the author of this TIP is unlikely to have the time required to implement the necessary changes to struct Var and the associated code). In fact, the author of this TIP would much rather deprecate the use of upvar to refer to array elements. A disproportionate amount of code goes into supporting this seldom-used feature.

would be synonymous with

     set x [lreplace $x 2 2]

that is, to allow deletion of interior elements of a list.

would result in a three-element list,

     foo {} bar

The author of this TIP rejects the last two alternatives because they conflict with most users' expectations of how arrays behave. Allowing write and unset operations to create lists having unspecified elements suggests an implementation of sparse arrays. While sparse arrays are a useful concept, and there are efficient and elegant data structures to support them, they have distinctly different semantics from Tcl lists and it seems likely that the conceptual conflict will cause trouble over time.

See Also

TIP #22

Reference Implementation

No reference implementation has yet been developed; the author of this TIP wishes to solicit the opinions of the Tcl community before spending a lot of time implementing a possibly bad idea.

Change history

12 March 2001: Added detailed discussion of the specific subscript ranges supported by read, write and unset operations. Changed the discussion to reject the alternative of padding an array when setting an index beyond the end. Added discussion of the details of write and unset traces, and rejecting read traces as being infeasible to implement. Clarified the example of creating an empty list so as to avoid any misapprehension that these changes depend on list variables' having a particular representation at any given time; in fact, every detail of this proposal is tolerant of shimmering.

13 March 2001: Fixed a copy-and-paste error in the 'incrementElement' example, and added to the discussion the fact that all operations will throw errors in the event of a malformed list.

Copyright

This document has been placed in the public domain.


Powered by TclThis is not necessarily the current version of this TIP.

TIP AutoGenerator - written by Donal K. Fellows