This is not necessarily the current version of this TIP.
| TIP: | 308 |
| Title: | Tcl Database Connectivity (TDBC) |
| Version: | $Revision: 1.3 $ |
| Author: | Kevin B. Kenny <kennykb at acm dot org> |
| State: | Draft |
| Type: | Informative |
| Vote: | Pending |
| Created: | Thursday, 15 November 2007 |
This TIP defines a common database access interface for Tcl scripts.
There has been a fair amount of discussion, that flares and dies back, regarding the need for a "Tcl database connectivity layer" in the Tcl core. This document specifies what this discussion means. At its present stage of development, it is to be considered very much a draft; discussion is actively solicited.
Parties who are interested in a detailed background of this TIP may a more extensive discussion of motivations and objectives in the author's posting to comp.lang.tcl and the tcl-core newsgroup, obtainable from [1] or [2].
If we look at other database connectivity layers such as ODBC/DAO, JDBC, Perl's DBD/DBI, we find that there really isn't very much, if anything, inside them. Rather than being a body of code, they consist primarily of specifications of the interfaces to which the author of a database connectivity module must conform. The real work of connecting to the databases happens inside the connectivity modules, which are generally speaking under the control of the database teams. In terms of practical politics, there isn't really any other way to do it; the Tcl maintainers are highly unlikely to want to take on the job of connecting to arbitrary database API's.
In other languages, such as C++ and Java, it is often necessary to have interface definitions that are understood by a compiler in order to get the "pluggability" of arbitrary database connectivity. In Tcl, however, an "interface" is best understood as an ensemble implementing a predetermined set of commands. There is no counterpart to a Java or C++ interface definition, nor does there need to be. For this reason, the work product of a "Tcl database connectivity" development effort is likely (at least at the first stage) to consist primarily of a specification document, perhaps with reference implementations for one or a few popular databases. To be considered "in the core", the specification should be included with the Tcl documentation, and be under control of the TIP process. The database implementations should be considered "extensions," and have their own configuration management. This statement doesn't say that we can't choose from among them a set that we will package with releases of the Tcl core. In fact, I hope that this effort will be one driver for the TCT to sort out the management of "bundled extensions."
I write this document in "standards committee prose". (While turgid, it at least is often well-understood; I offer no further defence.) In particular:
the word "MAY" is construed as allowing a given behaviour but imposing no requirement other than that clients be prepared for it;
the word "MUST" (and conversely "MUST NOT") is construed as requiring a given behaviour; implementations that fail one or more requirements given by "must" are non-compliant;
the word "SHOULD" (and conversely "SHOULD NOT") indicates that a given behaviour is expected of an implementation unless there is a compelling reason not to include it; while not formally non-compliant, implementations that fail one or more requirements given by "SHOULD" can be understood to have issues with respect to "quality of implementation."
the future of determination ("SHALL" or "WILL" according to the usual customs of formal written English) is construed as a promise to which the Tcl Core or the Tcl Core Team, as appropriate, shall adhere. It describes requirements of the Tcl Core, rather than of database connection modules.
the term, "integer value" refers to any string acceptable to Tcl_GetBignumFromObj; the term "native integer value" refers to a value acceptable to Tcl_GetIntFromObj, and hence to a value that can be represented by a C int on the target machine.
the term, "boolean value" refers to any string acceptable to Tcl_GetBooleanFromObj and hence includes at least '1', '0', 'on', 'off', 'yes', 'no', 'true', and 'false'.
Obviously the first thing that any connectivity layer has to offer is the ability to select a database. The way databases are named is quite specific to the database manager, as is the way access is negotiated (credentials such as user name and password may be required, session keys may be negotiated for privacy and authentication, and so on). All of this machinery is formally out of scope for this specification. Similarly, the machinery of database administration (at least at the level of creating/deleting entire databases, managing the physical layer, and authorizing clients) is presumed to be already taken care of. We need merely specify that a connectivity layer must provide at least one command that accepts arguments describing the desired connection and returns a database handle - defined to be an ensemble through which interactions with the given database instance will take place. Here, database instance means the database, or databases, that the given handle can access; rather a circular definition. In many SQL systems, it is possible for a single connection to access several "databases" managed by SQL CREATE DATABASE statments, or several "tablespaces" or similar constructs. We presume that database module implementors will know what is appropriate for their systems, and intentionally leave this particular matter somewhat vague.
Database handles are Tcl ensembles, meaning that they are commands that support subcommands. Other ensembles, such as statement handles, may be supported. Any of the ensembles may support abbreviation of its subcommands according to the rules defined by Tcl_GetIndexFromObj; nevertherless, code that uses the database interface should spell out subcommands in full.
Many of the subcommands are expected to take options in Tcl's usual syntax of:
?-option ?value?? ?-option value?...
In all of the places where this syntax is expected, a database module may support abbreviation of options according to the rules of Tcl_GetIndexFromObj(); once again, code that uses the interface should spell out options in full.
Once a handle is returned, there are a number of session-level attributes that may be controllable. Every database handle MUST provide a configure subcommand that takes the form:
dbHandle configure ?-option ?value?? ?-option value?...
This configuration process is analogous to configuring a Tk widget. If there are no arguments presented to configure, the return value MUST be a list of triples; each element of the list MUST comprise the name of an available configuration parameter, its default value, and its current value. If a single argument is presented, it MUST be the name of a configuration parameter, and the return value MUST be the triple of name, default value, and current value for that parameter. Finally, if more than one argument is presented, they MUST be a list of alternating parameter names and values. This last form is an order to set the given parameters to the given values.
The connectivity layer SHOULD implement the following parameters, and MAY implement others:
-autocommit boolean
Requests a change in the 'auto-commit' behaviour of the connection. The value MUST be a boolean value. If the value is true, the connection SHOULD henceforward treat every SQL statement as belonging to a transaction by itself, and commit changes to the database when any statement is successfully concluded. If the value is false, the connection SHOULD give the caller control over transaction isolation by means of the commit and rollback subcommands (q.v.). Single-process database interfaces that do not support transaction isolation MAY return an error if the caller requests -autocommit true; they SHOULD accept -autocommit false silently.
-encoding name
Requests that the encoding to be used in database communication protocol be changed to the one given by name, which MAY be any name acceptable to the [encoding] command. A well-designed database interface SHOULD NOT require this command; however, some backends make it virtually inevitable that mid-stream changes of encodings will be required.
-timeout ms
Requests that operations requested on the database SHOULD time out after the given number of milliseconds, if such an option is supported by the underlying connectivity layer.
-readonly boolean
Notifies that the application will, or will not, limit its activity to operations that do not modify the content of the database. This option MAY have the effect of adjusting the transaction isolation level.
The command that returns a database handle SHOULD also accept these options.
A database handle MUST implement the command:
dbHandle transaction script
In this command, the script argument is a Tcl script to be executed in the calling scope. The script is treated as a single atomic database transaction. If it completes successfully (TCL_OK), the transaction MUST be committed to the database. If it fails, (TCL_ERROR), the transaction MUST be rolled back and not visible to other users of the database. TCL_BREAK, TCL_CONTINUE and TCL_RETURN with a return code other than error result in a commit and subsequently rethrow the same exception status outside the transaction.
Notwithstanding the above, the drivers of databases without transaction semantics MAY choose to ignore [dbHandle transaction] semantics, and simply treat the transaction command as being akin to the built-in eval.
If a [dbHandle transaction] command is executed while another is already in progress, it is requesting nested transaction semantics. A database handle to an engine that supports nested transactions MUST treat this case correctly; a database handle to an engine that does not support nested transactions (including one that does not support transactions at all) MUST throw an error.
A database handle MUST implement the command:
dbHandle close
This command MUST dismiss the connection to the database and is expected to clean up the system resources associated with it. If there is an uncommitted transaction, it SHOULD be rolled back. Any handles to other objects associated with the database SHOULD become invalid.
A database interface also SHOULD perform the same actions if a handle is deleted by means of the rename command. (Interfaces that are implemented in Tcl may be notified of this action by creating a deletion trace with trace add command.) It is recognized that command deletion traces present difficulties in situations like namespace and interpreter deletion; the close subcommand shall therefore be considered the preferred way to terminate connections.
A database interface SHOULD attempt to arrange, if possible, to rollback unfinished transactions and clean up on process exit. In particular, if the underlying database engine supports transactions, it SHOULD be considered an error to commit any work that remains uncommitted on process exit.
A database handle must support the prepare command, which has the syntax:
dbHandle prepare SQL-code
The SQL-code argument is a SQL statement that is to be executed against the given database connection. This command does not execute the statement directly; rather, it prepares to execute the statement, possibly performing tasks such as code compilation and query optimisation.
The database interface MUST support substitutions in SQL-code. Each substitution request has the form :variableName. That is, each substitution request begins with a literal colon (:), followed by a letter or underscore, followed by zero or more letters, digits, or underscores. The database interface is responsible for translating from this syntax to whatever the underlying engine requires. Typical strings required in database interfaces are :name, :number, @name, @number, and ?.
The return value from the prepare command is a statement handle, discussed under "The statement interface" below.
Rationale. The choice of the colon deserves some discussion. It would surely be more natural for Tcl to use a literal dollar sign to introduce a variable name. This choice, however, seems unwise, since several databases (most notably Oracle) allow the use of table and view names that contain dollar signs. While it might be possible to continue to use these while allowing for variable substitution (for instance, by mandating that table or view names with dollar signs be enclosed in double quotes), it seems unnatural. The colon is syntax that is recognized by JDBC, ODBC, and Oracle's native API, and as such will be familiar to most SQL programmers and unlikely to collide with native syntax.
The requirement to support prepared statements is intended to guard against SQL insertion attacks. An interface to a database whose native API does not support prepared statements MUST simulate them. In particular, when the run or execute commands are executed on a statement, substitution must be performed in a safe fashion with whatever magic quoting is required. In any case, magic quoting should be regarded as an infelicitous expedient and avoided if at all possible.
A database handle MUST support the statements command:
dbHandle statements
This command MUST return a list of the statements that have been prepared by means of [dbHandle prepare] but not yet dismissed using [statementHandle close].
Likewise, a database handle MUST support the resultsets command:
dbHandle resultsets
This command MUST return a list of the result sets that have been returned (by executing statements, or by querying metadata) and have not yet been dismissed using [resultSetHandle close].
A database interface SHOULD provide a way of enumerating the tables in the database. The syntax for querying tables MUST be:
dbHandle tables ?matchPattern?
The optional argument matchPattern, if supplied, is a pattern against which the table names are to be matched. The database interface MUST recognize the SQL wildcards % and _ in the pattern.
A database interface SHOULD provide a way of enumerating the columns in a database table. The syntax for querying columns MUST be:
dbHandle columns tableName ?matchPattern?
The return value from the tables and columns commands MUST be a result set handle, and MUST function as all other result set handles do; see "The result set interface" below for the API.
The result set of the tables command must include at least a column called name in its result set; this column must contain the name of a table.
Similarly, the result set of the columns command MUST include at least the columns name, type, scale, precision, and nullable. The name column MUST contain the column name. The type column MUST be the data type of the column, and SHOULD be chosen from among the standard types BIGINT, BINARY, BIT, CHAR, DATE, DECIMAL, DOUBLE, FLOAT, INTEGER, LONGVARBINARY, LONGVARCHAR, NUMERIC, REAL, TIME, TIMESTAMP, SMALLINT, TINYINT, VARBINARY, and ''VARCHAR'. The scale and precision SHOULD give the scale and precision of the column, and nullable SHOULD give a boolean value that represents whether the given column can contain NULL values.
Other columns MAY be included in the result sets of tables and columns, and SHALL be added to this document (as optional columns) on request from the implementors of database interfaces.
The statement handle returned from the prepare command on a database interface must itself be an ensemble, which means that it must be a Tcl command that accepts a subcommand as its first parameter.
The following subcommands MUST be accepted:
statementHandle params
Requests a description of the names and expected data types of the parameters to the given statement. The return value from the params command MUST be a result set (See "The result set interface" below for details of its API). The columns of the result set MUST include name, type, scale, precision, and nullable. They are interpreted in the same way as those of the columns subcommand to a database interface (shown above).
statementHandle execute ?dictionary?
Executes a statement against a database. Any variable substitution present in the SQL that was provided when the statement was created MUST be performed at this time. The variable values MUST be obtained from the given dictionary, if one is supplied. If the dictionary does not contain a key equal to a variable name in the statement, a NULL value MUST be provided.
If the dictionary argument is omitted, the variable values MUST be obtained from the scope in which the execute command was evaluated. Any variable that is undefined in that scope must be replaced with a NULL value. An array variable provided to a substituent MUST result in an error. Read traces against the substituted variables SHOULD fire, in left-to-right order as they appeared in the SQL statement. The result of the execute command SHOULD be a result set, as defined under "The result set interface" below.
statementHandle close
Announces that a statement is no longer required, and frees all system resources associated with it. The close command MAY invalidate any result sets that were obtained by the params and execute commands.
As with database connections, the database interface SHOULD also clean up if a statement handle is removed with [rename $statement {}]. Once again, it is recognized that the strange side effects of namespace and interpreter deletion may make this cleanup impossible in some interfaces, so close SHALL be considered the standard means of discarding statements.
The syntax described so far presumes that the database interface can determine the expected types of the variables that appear in a prepared statement, or at the very least can accept some sort of variant type and perform automatic type coercion. This requirement does not seem horribly onerous at first inspection, since SQLite allows for "everything is a string" parameters; ODBC offers parameter introspection via the SQLDescribeParam call; and JDBC offers it via the getParameterMetaData method of the PreparedStatement interface.
Nevertheless, a deeper examination discovers that in at least ODBC, a driver is allowed to fail to offer SQLDescribeParam. Inspection of the JDBC-ODBC bridge reveals that in this case, JDBC will return a ParameterMetaData object that throws a SQLException on any attempt to query specific data. The result is that, while the APIs to introspect parameter types are available, they may be unusable against a particular database engine. In these cases, a backup is needed.
For this reason, a database interface MUST support allowing the user to specify types of the parameters of a prepared statement. The syntax for doing so MUST be:
statementHandle paramtype paramName type scale precision
Defines that the parameter identified by paramName in the given statement is to be of type type. The type MUST be chosen from among the names BIGINT, BINARY, BIT, CHAR, DATE, DECIMAL, DOUBLE, FLOAT, INTEGER, LONGVARBINARY, LONGVARCHAR, NUMERIC, REAL, TIME, TIMESTAMP, SMALLINT, TINYINT, VARBINARY, and VARCHAR.
(Rationale: These types appear to suffice for ODBC, and we can always come back and extend them later if needed.)
The scale of a parameter defines the number of characters or digits that it requires, and its precision defines the number of digits after the decimal point, if neeeded.
Examples:
$statement paramtype name varchar 40 $statement paramtype balance decimal 10 2 $statement paramtype transactionDate timestamp
Implementors of database APIs SHOULD make every effort to do appropriate type introspection so that programmers can avoid needing to include explicit type information in their SQL statements.
Result sets represent the results of operations performed on the database. A preferred implementation for large result sets is that they be implemented as database cursors, so that it is possible to iterate over result sets that will not fit in memory. A result set MUST be an ensemble, that is to say, a Tcl command whose first parameter is a subcommand. The following subcommands MUST be accepted:
resultSetHandle rows
Determines the number of rows affected by a SQL statement such as INSERT, DELETE or UPDATE. This count MUST be returned as an integer. It SHOULD NOT be confused with the number of rows in the result set. A database interface need not provide any interface to determine that number (often, the only way to determine it is to read all the rows). For this reason, the rows command MAY return an empty string for SELECT operations.
resultSetHandle columns
Determines the set of columns contained in the result set. The set of columns is itself returned as a result set. The columns in this second result set MUST include name (the name of the column in the first result set) and SHOULD include type, scale, and precision. A database interface MAY include other columns if additional information about the columns of the result set is available.
resultSetHandle nextrow variableName
Fetches a row of data from the result set and stores it in the given variable in the caller's context. The row MUST be represented as a dictionary suitable for use with the dict command. The keys in the dictionary SHALL be the column names, and the values SHALL be the values of the cells. If no rows remain, the 'nextrow command MUST store an empty dictionary. The return value of nextrow MUST be 1 if a row has been returned, and 0 if no rows remain in the result set.
In the result set, values of type BIGINT, BIT, DECIMAL, DOUBLE, FLOAT, INTEGER, NUMERIC, REAL, SMALLINT, and TINYINT MUST receive their natural representation as decimal numbers. Ideally, they should be returned as "pure" numbers with their string representations generated only on demand. Values of type CHAR, LONGVARCHAR and VARCHAR MUST be returned as Tcl strings. A database interface implemented in C MUST take care that all strings are well-formed UTF-8. Values of type DATE and TIMESTAMP MUST be returned as a numeric count of seconds from the Tcl epoch; if necessary, this count may have a decimal point and an appropriate number of additional decimal places appended to it. Values of type TIME MUST be returned as a integer count of seconds since midnight, to which MAY be appended a decimal point and a fraction of a second. Values of type BINARY, LONGVARBINARY and VARBINARY MUST be returned as Tcl byte arrays.
If a cell in the row is NULL, the key MUST be omitted from the dictionary. A database interface MUST NOT use a special value of any kind to represent a NULL.
Rationale: As far as the author of this TIP is aware, no existing database API makes use of dictinaries for result sets; they tend to prefer lists to represent them. Nevertheless, this TIP proposes dictionaries because they allow for a ready distinction between NULL values in a database and any other string. With any scheme where values that can include NULLs can appear in Tcl objects, the problem arises that NULL must be distinguished from any other string, particularly including the empty string and the word "NULL". The lack of such a distinction has led to several ill-advised proposals, such as TIP #185, for representing NULLs in Tcl. These alternatives founder on the principle of "everything is a string". The NULL value is not any string.
resultSetHandle close
Dismisses a result set and releases any system resources associated with it.
As with statements and database connections, the database interface SHOULD also clean up if a resut set handle is removed with [rename $statement {}]. Once again, it is recognized that the strange side effects of namespace and interpreter deletion may make this cleanup impossible in some interfaces, so close SHALL be considered the standard means of discarding result sets.
Since the cognitive overhead of preparing a statement, executing the prepared statement, and iterating over a result set is overkill for many applications, the Tcl system SHALL provide convenience procedures. The following procedures are envisioned: they can be obtained from [package require tcl::db].
::tcl::db::foreach variableName statementHandle ?dictionary? script
Runs the given statement (with the given dictionary specifying its substituents, if one is provided), obtaining a result set. Applies the given script to each row of the result set, substituting the given variable with the dictionary returned from $resultSet nextrow prior to each iteration. When no rows remain, closes the result set.
::tcl::db::allrows statementHandle ?dictionary?
Runs the given statement (with the given dictionary, if one is provided, specifying its substituents) obtaining a result set. Applies the nextrow command to the result set repeatedly, and constructs a Tcl list whose elements are the rows returned. When no rows remain, closes the result set and returns the list of rows.
::tcl::db::execute dbHandle SQL-code ?variable? ?script?
::tcl::db::executewithdictionary dbHandle SQL-code dictionary ?variable? ?script?
Prepares a statement with the given SQL code and executes it. In the first form, the substituents of the statement are obtained from variables in the current scope; in the second, they are obtained from the given dictionary. If a script argument is supplied, retrieves rows from the result set one at a time and executes the given script once per row with the row dictionary stored in the given variable. If the script argument is omitted, instead stores in variable a list of all the rows returned. If neither variable nor script is present, the result set is simply discarded. In any case, the result set and the statement are both closed, and the return value from execute is the number of rows affected by the operation (for INSERT, UPDATE and DELETE operations) and may be an empty string for SELECT operations.
It is envisioned that database interfaces will add these conveniences to the API as
dbHandle execute SQL-code ?variableName ?script??
dbHandle executewithdictionary SQL-code dictionary ?variableName ?script??
statementHandle foreach variableName script
statementHandle allrows
either by wrapping the ::tcl::db procedures or by implementing equivalent functionality themselves.
In addition to the convenience commands discussed above, the Tcl system SHALL provide certain commands to aid the job of database implementors.
The task of mapping variable substituions in the form, :varName into whatever form that a native database API can handle is a somewhat tricky one. For instance, substitutions that appear inside quoted strings MUST NOT be mapped. In order to aid in this task, the Tcl system SHALL provide a command, ::tcl::db::tokenize. This command SHALL accept a SQL statement as its sole parameter, and return a list of alternating strings and variable names. Assuming that a native database's lexical structure conforms with standard SQL, the variable names (which SHALL always be at the odd indices in the list) can be substituted with parameter numbers, question marks, or whatever the database needs, to yield the native SQL that must be prepared.
This specification is largely built from studying existing cross-platform database APIs and deriving a comon set of requirements from them. These include both popular offerings in lower-level languages (ODBC and JDBC) and Tcl-level ones (notably the 'nstcl-database' package, the SQLite API and tclodbc).
"ODBC Programmer's Reference." Redmond, Wash.: Microsoft Corporation, 2007. [3].
"Java Platform Standard Edition 6 API Specification." Santa Clara, Calif.: Sun Microsystems, 2007 [4]; in particular the package named, java.sql.
Cleverly, Michael. "nstcl-database Package." [5].
Hipp, D. Richard. "The Tcl interface to the Sqlite library." [6].
Nurmi, Roy. "Tclodbc v 2.3 Reference." Available as part of the Tclodbc distribution at [7], in the file, DOC/REFERENC.HTM.
This file is explicitly released to the public domain and the author explicitly disclaims all rights under copyright law.
Changed the transaction management API from explicit commit and rollback to a model where a script is executed as an atomic operation.
Changed the "execute" API and the convenience procedures that use it to accept an optional dictionary containing substituents, so the substituents need not pollute the local namespace. The version accepting variables is still provided, because it is useful in the case of static queries where the substitutions follow a predetermined pattern.
Added reference to the author's cover letter on tcl-core.
Added missing citation of the nstcl-database API.
This is not necessarily the current version of this TIP.