TIP:            29
Title:          Allow array syntax for Tcl lists
Version:        $Revision: 1.8 $
Author:         Kevin Kenny <kennykb@acm.org>
Author:         Donal K. Fellows <fellowsd@cs.man.ac.uk>
State:          Rejected
Type:           Project
Vote:           Done
Created:        07-Mar-2001
Post-History:   
Discussions-To: news:comp.lang.tcl,mailto:kennykb@acm.org
Tcl-Version:    9.0

~ Abstract

Most popular programming languages provide some sort of indexed array
construct, where array subscripts are integers.  Tcl's lists are, in
fact, arrays, but the existing syntax obscures the fact.  Moreover,
the existing list commands make it difficult to manipulate lists as
arrays without running into peculiar performance issues.  This TIP
proposes that the syntax of ''variableName(value)'' be extended to
function as an array selector if ''variableName'' designates a list.
This change is upward compatible with existing Tcl scripts, because
the proposed syntax results in a runtime error in every extant Tcl
release.

~ Rationale

The implementation of lists in Tcl has evolved far beyond the original
conception.  While lists were originally conceived to be strings with
a particular syntax that allowed them to be parsed as lists, the
internal representation of a list is now an array of pointers to
''Tcl_Obj'' structures.

Tcl programmers, for the most part, have not taken advantage of this
evolution.  Code that uses hash tables for the purpose is still
extremely common.  Moreover, it is difficult to update lists in place,
even if their internal representations are known not to be shared.
One example of this difficulty is seen in the discussions
[http://purl.org/thecliff/tcl/wiki/941] of how best to shuffle a list
of items.  The discussion began with a naïve implementation of Jon
Bentley's method of performing random swaps:

|  proc shuffle1 { list } {
|      set n [llength $list]
|      for { set i 0 } { $i < $n } { incr i } {
|          set j [expr {int(rand()*$n)}]
|          set temp [lindex $list $j]
|          set list [lreplace $list $j $j [lindex $list $i]]
|          set list [lreplace $list $i $i $temp]
|      }
|      return $list
|  }

Aside from the fact that the syntax obscures what the program is
doing, the implementation suffers from an obscure performance problem.
When the ''lreplace'' calls in the ''shuffle1'' procedure are
executed, the internal representation of ''list'' has two references:
the value of the variable, and the parameter passed to ''lreplace''.
The multiple references force ''lreplace'' to copy the list, leading
to quadratic performance when large lists are shuffled.

It is possible, albeit difficult, to alleviate this problem by careful
management of the lifetime of ''Tcl_Obj'' structures, but this change
complicates the code.  The simplest way to fix the performance is
probably to use Donal Fellows's implementation of the ''K''
combinator:

| proc K { x y } { set x }

which allows the caller of ''lreplace'' to extract the value of
''list'', change the value of ''list'' so that the extracted value is
unshared, and then pass the extracted value as a parameter to
''lreplace:''

|  proc shuffle1a { list } {
|      set n [llength $list]
|      for { set i 0 } { $i < $n } { incr i } {
|          set j [expr {int(rand()*$n)}]
|          set temp1 [lindex $list $j]
|          set temp2 [lindex $list $i]
|          set list [lreplace [K $list [set list {}]] $j $j $temp2]
|          set list [lreplace [K $list [set list {}]] $i $i $temp1]
|      }
|      return $list
|  }

Now the performance of the code is ''O(n)'' where ''n'' is the length
of the list, but the programmer's intent has been seriously obscured!

These drawbacks have led prominent individuals such as Richard
Stallman [http://www.vanderburg.org/Tcl/war/0000.html] to assert that
Tcl lacks arrays.

''This proposal includes the absolute minimum of functionality needed
to provide array-style indexing for variables containing Tcl list
objects.''  The reason for this limitation is that omitted
functionality can be added later without breaking existing scripts.
On the other hand, ill-considered extensions may turn into something
that we're doomed to support forever.

~ Specification

This TIP's proposed change can be stated succinctly:

 > Wherever the notation ''a(x)'' may be used to refer to an array
   element in the language, allow it also to refer to an element of a
   list, provided that the variable ''a'' is scalar and the value
   ''x'' is an index suitable for the ''lindex'' command.

   ''Exception:'' Traces, ''unset'' and ''upvar'' calls designating
   individual list elements shall not be supported.  (As a consequence
   of this rule, list elements may also not appear as linked variables
   in C code, implying that they also cannot appear as ''-variable''
   or ''-textvariable'' options on Tk widgets.)

Note that this change is backward compatible with existing Tcl
scripts!  If a notation like ''a(x)'' is used to refer to a scalar
variable in today's Tcl, the result is an error:

| % set a [list foo bar grill]
| foo bar grill
| % set a(2)
| can't read "a(2)": variable isn't array
| % puts $a(2)
| can't read "a(2)": variable isn't array
| % set a(2) zot
| can't set "a(2)": variable isn't array

The default behavior, if ''a'' is not set, and a script executes

| set a(2) zot

will still be to create an associative array.  If a script wishes to
perform such actions on a list, it will be necessary first to
initialize the variable:

| set a [list]
| set a(0) foo

Note that in the example above, there is no requirement that the
internal representation of ''a'' be a list; the line,

| set a [list]

could have been replaced with

| set a {}

with the only impact being the run-time cost of shimmering the empty
string into an empty list.  Nowhere does this proposal introduce
behavior that depends on a specific internal representation for any
variable.

This proposal the syntax of the subscript shall be precisely those
values that are accepted as the second argument to the ''lindex'' command.
In other words, the subscript may be an integer ''N'', or the string
''end'' or ''end-N''.  The value of ''N'' may not be less than zero
nor greater than nor equal to the length of the list on any usage that
reads a list element.  

A usage that writes a list element may use an integer equal to
 the length of the list, or the string ''end+1'', to designate
the element one past the end.  In other words,

|  set a(end+1) foo

will have the same effect as:

|  lappend a foo

With the proposed change in syntax, the procedure to shuffle a list
becomes much more straightforward:

|  proc shuffle1 { list } {
|      set n [llength $list]
|      for { set i 0 } { $i < $n } { incr i } {
|          set j [expr {int(rand()*$n)}]
|          set temp $list($j)
|          set list($j) $list($i)
|          set list($i) $temp
|      }
|      return $list
|  }

The given implementation copies the list only once, the first time
that the line:

|          set list($j) $list($i)

is executed.  Thereafter, the list is an unshared object, and the
replacements are performed in place.

It shall be illegal to pass a list element as the parameter to
''upvar;'' that is, the following usage:

|  proc increment { varName } {
|      upvar 1 $varName v
|      incr v
|  }
|  set x [list 1 2 3]
|  increment x(0)

will ''not'' be supported.  However, the commoner form:

|  proc incrementElement { arrayName index } {
|      upvar 1 $arrayName array
|      incr array($index)
|  }
|  set x [list 1 2 3]
|  incrementElement x 0

will, of course, work as expected.

~ Discussion

Several reviewers expressed concern about the reuse of array syntax.
In particular, the alternative syntax $a<$element> was proposed
repeatedly.  Alas, there is no good alternative syntax that will not
break at least some existing scripts.  The proposed syntax using
angle-brackets is a poor choice, because Tcl scripts that generate
Web pages frequently have code like:

|  puts "<$tag>$text</$tag>

that would be broken horribly by such a change.

There are several obvious extensions to the proposal that are not
addressed, and these omissions are intentional.

   * The proposal makes no attempt to deal with multiple subscripts
     as a means of accessing nested lists.

Use of multiple subscripts is closely related to the withdrawn [22]
(which the author of this TIP intends to revive).  If the related TIP
is accepted, the syntax for the subscript could readily be expanded
so that it could be a Tcl list giving the subscripts in lexicographic
sequence.  For example

| set a(2 3) foo

could be used to set the fourth element of the third sublist.

   * The proposal allows the ''set'' command (or any other use of
     ''Tcl_SetVar2Ex'') to set only the elements that are in the list
     already plus the one one beyond the end.

Tcl lists are fundamentally dense arrays.  Allowing non-contiguous
elements, that is, sparse arrays, is a fundamental change to their
semantics.  Such a change is not contemplated at this time.

   * The proposal does not allow the ''unset'' command (or any
     other command that arrives at ''Tcl_UnsetVar2'') to delete members
     of a list.

Earlier versions of the proposal had proposed to permit:

|  unset a([expr { [llength $a] - 1}])

or equivalently:

|  unset a(end)

to reduce the length of the list by one.  In subsequent discussions,
the reviewers found it distasteful that the proposed syntax did not
permit unsetting interior elements of a list.  Alas, the discussion
did not arrive at a consensus on what the precise semantics of such
an operation ought to be.  Some reviewers favored attempting to emulate
sparse arrays (again, a fundamental change to the semantics of
Tcl lists that is not contemplated at this time).  Others preferred
the semantics of shifting the remaining elements, so that

|   unset a($n)

would always be equivalent to

|   set a [lreplace $a $n $n]

except for performance.  Both camps found it overly restrictive to
limit the semantics of ''unset'' to those of the original proposal.
Because the two groups failed to achieve a consensus, the author of
this TIP finds it prudent to forbid ''unset'' altogether in the
initial implementation.

   * The ''array'' command continues to operate only on associative
     arrays.

Lists are a simple enough structure that the full power of the
''array'' command is not required to deal with them, and having it
work on lists as well as arrays seems like needless effort.  Moreover,
existing code may well depend on a combination of ''[array exists]''
and ''[info exists]'' to distinguish associative arrays from scalar
variables (including lists).

   * The ''upvar'' command cannot address individual list elements.

Extending the syntax in this fashion would make ''upvar'' more
consistent in its behavior, but appears to be expensive, in terms of
both performance (tracking down the linked references if a list is
rebuilt) and the effort required for implementation (the author of
this TIP is unlikely to have the time required to implement the
necessary changes to ''struct Var'' and the associated code).

   * No traces on list elements shall be supported.  List elements
     cannot function as linked variables in C code.

The original proposal had specified how write and unset, but not read,
traces could be implemented.  The original proposed functionality is
described in the Appendix.  The author of this TIP had proposed it
primarily so that list elements could function as linked variables
(for instance, in the ''-variable'' and ''-textvariable'' options
of Tk widgets).

Once again, this part of the original proposal failed for lack of
consensus among the reviewers.  Some felt that supporting read traces
in one context but not another would be overly confusing.  Moreover,
the proposal as written would cause write traces on the elements to fire
if the internal representation of a variable shimmered between a list
and something else.  Some reviewers found the excess trace callbacks
to be objectionable.  

At least one reviewer proposed a separate ''trace add element'' syntax
for list-element traces.  This syntax would address some of the
concerns about the lack of read traces (there's no reason that ''trace
add element'' should function the same as ''trace add variable'').
Alas, it would not address the problem of linked variables, which was
the main reason for having the traces in the first place.

Given the lack of consensus, the author of this TIP finds it prudent
to withdraw or postpone this portion of the proposal.

~ See Also

[22] - withdrawn.

~ Reference Implementation

No reference implementation has yet been developed; the author of this
TIP wishes to solicit the opinions of the Tcl community before
spending a lot of time implementing a possibly bad idea.

~ Change history

''12 March 2001:'' Added detailed discussion of the specific subscript
ranges supported by read, write and unset operations.  Changed the discussion
to reject the alternative of padding an array when setting an index beyond
the end.  Added discussion of the details of write and unset traces, and
rejecting read traces as being infeasible to implement.  Clarified the
example of creating an empty list so as to avoid any misapprehension that
these changes depend on list variables' having a particular representation
at any given time; in fact, every detail of this proposal is tolerant of
shimmering.

''13 March 2001:'' Fixed a copy-and-paste error in the 'incrementElement'
example, and added to the discussion the fact that all operations will
throw errors in the event of a malformed list.

''30 March 2001:'' Revised yet again, in an attempt to remove as
much controversial functionality as possible and reduce the TIP to the
minimum useful subset, on the grounds that it is prudent to avoid
supporting functionality that may later prove ill-considered.

~ Summary of objections

''DeJong,'' ''English'' (non-voting), ''Flynt'' (non-voting),
''Harrison,'' ''Ingham,'' ''Lehenbauer,'' ''Polster,'' (non-voting),
''Porter,'' and ''Sofer'' (non-voting), expressed concern that the
proposed syntax is confusing, since the target object could be either
an associative array or a linear array (that is, a Tcl list).
These objections varied in stridency from "yes, it is a risk, and
I'm prepared to accept it," to "this will just be too confusing, and I
can't countenance this proposal."

''Hobbs'' found the original proposal's omission of reverse indexing
distasteful.  The current version of the proposal embraces his
suggested change.

''Cuthbert'' (non-voting), ''Hobbs'', and ''Porter'' expressed concern
over the semantics of ''unset.'' Since consensus was not achieved, the
current version of the proposal defers implementation of ''unset.''

Several reviewers, most notably ''Ousterhout,'' found the proposed
''trace'' semantics distasteful.  The current version of the proposal
eliminates ''trace'' on list elements.

Several reviewers appeared to labor under the misconception that this
TIP introduces behavior that is dependent at run time upon the
internal representation of a Tcl object.  It does not; it is tolerant
of shimmering in all cases.

Several reviewers objected to the proposal on the grounds that it does
not specify a general object system and how such a system would allow
for generic containers with array syntax.  The author's intention in
writing it was not to propose such a system, but only to propose a
small piece of syntactic sugar, implementable here and now, that is
compatible with that broader vision.

~ Appendix: Possible implementation of read and unset traces.

The original proposal contained the following language, which could
be used as a guide if traces on list elements are contemplated at a
future time.

Write and unset traces on list elements shall be supported; it shall
be permissible to write:

|    trace add variable x(1) write writeCallback

or

|    trace add variable x(1) unset unsetCallback

The ''write'' callback shall be invoked whenever the given list
element changes value; the ''unset'' callback shall be invoked
whenever the variable is unset or when its length shrinks to the point
that it no longer has a member with the given index.

Read traces on list elements shall ''not'' be supported.  It is
too difficult at this point to define what their semantics should be.
For instance, if a program executes the following code:

|   trace add variable x(0) read readCallback
|   set x [list foo bar grill]
|   set y [string range $x 4 end]

should the callback fire?  By one argument, the program has not
read element zero of the list; by another, using the list as a string
has read every element, and all read traces should fire.  In any case,
the read trace on a variable fires before its usage is known; it appears
impossible in existing code to implement selective read tracing on list
elements.

The implementation of write and unset traces on list elements will be
done by establishing a C-level write trace on the variable as a whole.
The client data of the trace will designate a structure containing the
ordinal number of the element being traced, and a ''Tcl_Obj'' pointer
designating its old value.  The reference count of the ''Tcl_Obj''
will be incremented when this pointer is stored.  Note that this
increment operation makes the object shared.  Any change to the
designated element will thus need to copy the object.

When the write trace fires, the list representation of the variable
will be extracted, reconstituting it from the string representation if
necessary.  If extracting the list representation fails, the trace
will be considered to have failed as well, and the trace callback will
return ''TCL_ERROR''.  If extracting the list representation succeeds,
the list length will be compared with the ordinal number of the
element being traced.  If the element number is no longer within the
list, an unset trace fires if one exists.  If the element number is
within the list, the two ''Tcl_Obj'' pointers are compared.  If they
are identical, the list element in question is unchanged, and nothing
need be done.  Otherwise, the write trace fires.  

This behavior is conservative in that an operation that spoils the
list representation of the object is considered to have written every
element of the list.  This rule is consistent with the rule that write
traces on ordinary Tcl variables fire whenever the variable is set,
even if it is being set to an identical value.

In any event, after the conclusion of a trace callback, the saved
Tcl_Obj will have its reference count decremented and be replaced with
the current element of the list (with reference count appropriately
incremented, of course).

~ Copyright

This document has been placed in the public domain.

