Gmane
Favicon
From: Darren Duncan <darren@...>
Subject: updating SQLite to implement The Third Manifesto
Newsgroups: gmane.comp.db.sqlite.general
Date: 2006-03-11 00:37:36 GMT (2 years, 9 weeks, 3 days, 19 hours and 11 minutes ago)
All, and particularly Dr. Hipp,

Lately, my own database project in Perl, named Rosetta, has evolved 
to officially be an implementation of Chris Date's and Hugh Darwen's 
proposal for relational databases called "The Third Manifesto", which 
is talked about at http://www.thethirdmanifesto.com/ and in various 
books and papers of theirs.

Rosetta has its own API which defines a "D" relational database 
language, as apposed to a SQL language, and Rosetta's back-ends to 
implement this are interchangeable.  I am making a pure Perl 
reference implementation (called Rosetta::Engine::Example) which is 
coded simply for ease of understanding but that is slow.

Separately, I or third parties would be making alternate back-ends 
that either are self-implemented and better performing, or that 
constitute wrappers over existing database products, usually SQL 
based, since those are fairly mature and plentiful.  SQLite is one of 
the first such back-ends to be used.

Now, I would like to propose, and if necessary I will contribute 
significant targeted funding (when I have the money) to pay Dr. Hipp 
and/or other developers, some significant feature changes (as a fork 
if necessary) to SQLite such that it directly implements and gains 
the efficiencies of The Third Manifesto.  This includes both the 
addition of and the removal of current features, and certain 
behaviours would change.  Hopefully all for the better.

As a result of these changes, not only would SQLite better serve as a 
back-end of Rosetta, but non-Rosetta users of SQLite would get the 
most critical of the same benefits from it directly.

I anticipate that the changes would mainly affect the upper layers, 
which convert user commands into virtual machine code, but that the 
virtual machine and b-tree and OS layers would remain more or less 
unchanged (this depends, of course, on a few details).  Possibly, we 
would add a new command language.

I am hoping that, to keep longer term maintenance easier, these 
changes can be implemented in the trunk and activated using either 
run time pragmas or compile time options or both.  But if they would 
require a fork, then the forked product would have to be named 
something else that doesn't have 'SQL' in its name, since SQL does 
not satisfy The Third Manifesto.  Maybe 'TTMLite' or something that 
sounds better.

Here are some of the changes that I propose the pragma or compile 
time option or fork would have; they all refer to what the user sees, 
not to implementation details that should be hidden:

1.  Add a distinct logical BOOLEAN data type.  It is the data type of 
output from logical expressions like comparisons, and the input to 
'and', 'or', etc.

2.  Have strong and strict data typing for both variables and values.

2.1  Table columns are always declared to be of a specific type (eg: 
BOOLEAN, INTEGER, REAL, TEXT, BLOB) and nothing but values of the 
same type can be stored in them; attempting to do otherwise would 
fail with an exception.

2.2  The plain equality test is supported for all data types.

2.3  All operators/functions have strongly typed parameters and 
return values, and invoking them with arguments that aren't of the 
right type will fail with an exception.  The equality test likewise 
can only compare operands of the same type.

2.4  There is no implicit type conversion; data types must be 
explicitly converted from one type to another.

2.5  INTEGER and REAL data types have separate sets of operators, 
which do the expected thing with their types.  For example, each has 
a separate division operator whose input and output are all of that 
same type.  No worrying about when to round or not.

2.6  SQLite may already be this way, but:  All characters in a string 
are significant, including whitespace, so 'a' and 'a ' are always 
unequal.

3.  There is no such thing as a NULL.

3.1  All logic is 2VL (true, false) not 3VL (true, false, unknown).

3.2  Every variable of a particular type always contains a value that 
is valid for that type, so logic for dealing with it is simpler. 
Likewise, every with every literal value.

3.3  The code to implement operators is a lot simpler.

3.4  Missing data can be either represented with the data type's 
empty value, or a table column that may possibly be unknown can be 
split into a separate related table, that only has records when the 
value is known.

3.5  All variables default to a reasonable valid value for their type 
if not explicitly set, such as the number zero or the empty string.

4.  There is no significant hidden data.  A row id can only be an 
explicitly declared table column.  The implementation of a table can 
use hidden row ids, but the user wouldn't see them.

5.  No duplicate rows in tables or queries are allowed.

5.1  In SQL terms, every table has an implicit unique key constraint 
over all of its columns.  This is ignored if there are any actual 
explicit keys, whether primary or otherwise.  In TTM terms, it is 
impossible by definition to have duplicate rows.

5.2  The results of all stages of queries do not contain duplicate 
rows.  In SQL terms, every query or subquery has an implicit 
'distinct' or 'group by all' clause on it.  No joins produce 
duplicates.  No unions etc do either.

5.3  By doing this and #3, all queries that look like they should 
return the same results actually do, whereas in SQL they may return 
different results in the presence of duplicates or nulls.  Queries 
can also be simpler.

6.  Columns in tables and views and query results have no ordinal 
value; they all have names and are referred to using only those 
names.  Moreover, every column must have a different name from every 
other column.

7.  Rows in tables and views and query results have no ordinal value; 
they are referenced by relational expressions that match on the 
values of columns, like in a SQL where-clause.

7.1  An order-by or limit clause only makes sense in an outer-most 
query, right when results are being returned from the database to the 
application, where it then specifies the order to return otherwise 
order-less rows.

In doing all of the above, SQLite should actually be simpler to 
implement, and it will be easier to use, with more predictable 
results and fewer bugs.

This next one can be implemented separately from all the other suggestions:

8.  Add some standard relational logic operators that can be combined 
and nested to get all the power of selects and more, with less 
effort, such as any of the following you don't already have: 
restrict, project, join, product, union, intersection, difference, 
divide, rename.

8.1  The simplest join syntax, such as an unqualified comma-delimited 
list, would perform a natural join by default.  Or we could more or 
less just have natural joins (and cartesian products, 'product') as 
the only kind of join.

8.2  Using these instead of 'select' should allow for easier 
implementation and optimization; for one thing, the expressions are 
more associative or commutative.

This next one can be implemented separately from all the other suggestions:

9.  Support nested/child transactions, such as a 'begin transaction' 
inside another one, which can make things a lot easier for 
applications; they have to worry less about whether a transaction 
already exists before starting another one.  These are functionally 
sort of like save-points in SQL, in that even if an inner transaction 
commits, it is still thrown away if the outer transaction rolls back. 
To implement this best, you would probably need multiple (cascading?) 
journal files, one per transaction level.

Following are also features of The Third Manifesto, but can possibly 
be left out of SQLite in accordance with its Lite nature:

1.  All views are updateable like they were tables.  From the user's 
point of view, tables and views are the same sort of thing in how 
they can be used.

2.  Tables can be assigned to directly like they were variables, and 
insert/update/delete is actually a short-hand for this.  Eg, an 
insert is equivalent to an assignment to a table of the table's old 
value unioned with the rows being inserted.  Supporting this allows 
users to define arbitrarily flexible updating operations, such as 
"replace or add" and such.

3.  The system catalog tables can be updated directly using data 
definition language, which results in the schema being updated.  Eg, 
you can use insert statements to create a table rather than a create 
statement.

4.  Support definition and use of custom data types.

5.  It should not be necessary to explicitly declare indexes to help 
with speed.

6.  Generally speaking, users should not have to know about 
implementation details, but rather just express what their data 
actually means.

Okay, that's about all for this initial proposal email.

Ultimately, I believe that the core of my proposal involves 
simplifying SQLite, making it leaner and meaner, and also reduces 
possible or actual bugs or difficulty in understanding.

At the very least, I hope that the trunk would have the pragma or 
compile option that essentially strips out current features like 
nulls and other ambiguity, so essentially we have a restricted or 
simplified SQL.

I also bring this up because I would expect that SQLite should be 
able to perform faster when it doesn't handle nulls or duplicates or 
weak data types than if it does.  The conceptual logic is simpler 
when we don't have those, and the implementation code should also be 
simpler, and perform faster, since there are fewer possibilities to 
check at logical decision points.  And it should be easier to 
optimize queries.

So even if no incompatible changes are made, I would hope that it is 
possible to optimize for the simplest case.

-- Darren Duncan