Subject: An alternative syntax for aux-list
Newsgroups: gmane.lisp.scheme.ssax-sxml
Date: 2004-01-05 21:51:56 GMT

Hello!

Kirill and I have briefly discussed a possible, hypothetical alternative syntax for aux-lists in SXML. We haven't come to a conclusion, however. The biggest problem is that the alternative requires modifications to code that explicitly uses aux-lists. It is not clear if the advantages of the proposal are strong enough to offset the breaking of such code. Kirill has suggested I present the argument here. If you currently handle aux-lists, please speak out. If the proposed change breaks a lot of code, the change will be abandoned. OTOH, someone may point out a compelling application that can take advantage of the alternative syntax. The proposal will have to be taken seriously then.

Currently SXML defines an attributes list and an aux-list as:

    [3]  <attributes-list> ::= ( @ <attribute>* )
    [15] <aux-list> ::= ( @@ <namespaces>? <aux-node>* )

See http://pobox.com/~oleg/ftp/Scheme/SXML.html#Grammar

Both lists look alike, as tagged associative lists. Attribute lists are tagged with a distinguished symbol '@' and aux-lists are tagged with a distinguished symbol '@@'. Both lists are "improper" children of their parent SXML element. Both lists are optional. An aux-list contains `auxiliary' associations, e.g., the information about original namespace prefixes or a pointer to the parent SXML element. Here's an example of an SXML element with both attributes-list and aux-list:

    (tag (@ (attr "val")) (@@ (*parent* val)) kid1 kid2)

We will use this example throughout the message. In a normalized SXML (2NF), both lists must be present and appear in the right order among the children of an element.
The empty attributes-list should be coded as '(@)' and the empty aux-list should be coded as '(@@)'.

The proposed hypothetical alternative syntax changes the two SXML grammar productions into the following:

    <attux-list> ::= ( @ <attribute>* <aux-sublist>? )
    <aux-sublist> ::= ( @ <namespaces>? <aux-node>* )

That is, both attribute-list and aux-list are tagged with the same symbol: '@'. The aux-list can no longer appear among the children of an element. Rather, the aux-list can only appear inside the attribute-list. That's why <attributes-list> is renamed <attux-list> and <aux-list> is renamed <aux-sublist>. The running example will then be written as

    (tag (@ (attr "val") (@ (*parent* val))) kid1 kid2)

In a manner of speaking, the proposal makes aux-nodes attributes of the attribute (pseudo-)node. The tag '@' signifies a collection of ancillary information associated with an SXML node. For a proper SXML node, the collection is that of attributes. A nested '@' list is a collection of "second-level" attributes, aux-nodes, such as namespace nodes, parent pointers, etc.

The proposal seems to be in accord with the spirit of the XML Recommendation, which uses XML attributes for two distinct purposes. Genuine, semantic attributes provide ancillary description of the corresponding XML element, e.g.,

    <weight units='kg'>16</weight>

OTOH, attributes such as xmlns, xml:prefix, xml:lang and xml:space are incidental (meta-auxiliary), or are used by XML itself. The XML Recommendation distinguishes 'auxiliary' attributes by their prefix 'xml'. Our proposal groups all such auxiliary attributes into a '@'-tagged list inside the attribute list.

The proposal makes it easy to skip the aux-list when it is not needed, and easy to test for it. Furthermore, the aux-list provides ancillary information -- just like attributes do. An application rarely processes an attr list in its entirety: an application typically looks up the attributes it wants and disregards the rest. Aux-list is handled in the same way.
When aux-list is inside attr list, it does not get in the way.

Kirill has noted his many doubts concerning the same tag, @, for both lists [see below for more discussion]. The first point of contention concerns semantics. <attux-list> is not the list of attributes any more. The semantics of some SXPath queries change. Keeping our running example in mind, the following SXPath query

    ((sxpath '(@)) node)

currently returns a nodelist '((@ (attr "val"))). Under the proposal, the result will be

    ((@ (attr "val") (@ (*parent* val))))

One can argue that the latter result is legitimate. When dealing with an attribute list, the programmer rarely looks up items by their position or by count -- only by their name. The attribute collection is just a dust bin of various stuff. For example, the XSLT Recommendation specifically allows for extra attributes in xsl:template and other elements, provided these attributes are in a non-XSLT namespace. A user may annotate XSLT templates any way he wants to. The XSLT processor will look up only the attributes it needs, and thus tacitly disregard the rest. RELAX/NG explicitly allows a schema author to specify that an element may have more attributes than given in the schema (provided those attributes come from a particular namespace).

Therefore, if an SXML processor looks up attributes by their names and disregards 'extra' attributes, the change in semantics is transparent. Furthermore, if we use "proper" SXML queries to access attribute lists of an element, the change is transparent. Indeed, the SXPath query '(sxpath '(tag @))' is improper: it corresponds to no XPath query. An XPath expression to get the list of attributes of the current (element) node is "attribute::*" or "@*". In SXPath, that would be

    (sxpath '(@ *))

And indeed, this query applied to our running example will return '((attr "val")) -- either now, or under the proposed change. No changes in SXPath are even necessary!
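
The lookup-by-name argument can be tried out without SXPath. What follows is only a minimal sketch in plain Scheme: attux-content, attr-value and proper-attributes are hypothetical helper names made up for this illustration (they are not part of sxpathlib), and proper-attributes is merely a rough stand-in for what (sxpath '(@ *)) selects.

```scheme
;; Hypothetical helpers for illustration only; not part of sxpathlib.

;; The running example in the current and in the proposed syntax.
(define node-current
  '(tag (@ (attr "val")) (@@ (*parent* val)) kid1 kid2))
(define node-proposed
  '(tag (@ (attr "val") (@ (*parent* val))) kid1 kid2))

;; Content of the '@'-tagged list if it is the first child, else '().
(define (attux-content node)
  (if (and (pair? (cdr node))
           (pair? (cadr node))
           (eq? '@ (car (cadr node))))
      (cdr (cadr node))
      '()))

;; Look up one attribute by name; any other entries are ignored.
(define (attr-value node name)
  (let ((binding (assq name (attux-content node))))
    (and binding (cadr binding))))

;; Keep proper attribute nodes only, dropping a nested '@'-tagged sub-list.
(define (proper-attributes node)
  (let loop ((items (attux-content node)))
    (cond ((null? items) '())
          ((and (pair? (car items)) (eq? '@ (car (car items))))
           (loop (cdr items)))
          (else (cons (car items) (loop (cdr items)))))))
```

With these definitions, (attr-value node-current 'attr) and (attr-value node-proposed 'attr) both give "val", and proper-attributes gives ((attr "val")) for either form of the node.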
Here's the reason for that magical transparency: we have seen that, under the proposal, ((sxpath '(@)) node) returns ((@ (attr "val") (@ (*parent* val)))). We then have to apply to that result (sxpath '(*)) -- which is equivalent to (node-typeof? '*). The latter filters out all 'improper' SXML nodes, including the nodes tagged with '@'. Hence we get the desired answer. The advantage of the proposal is that we can filter out aux-sublist automatically, without any change to SXPath. In my view, this feature justifies using the same symbol '@' to tag both <attux-list> and <aux-sublist>.

Likewise, an SXPath query to access a particular attribute,

    (sxpath '(@ attr *text*))

will work as before. These facts seem to suggest that most of the SXML processing code will not be affected by the proposed change.

As before, <aux-sublist> is optional. If we use assq to search for <aux-sublist> among attributes (as we do for any attribute), then there doesn't seem to be any need to require the presence of a dummy aux-list. BTW, we can use SXPath as it is to search for a relevant aux-list element:

    (sxpath '(tag @ @ *parent* *text*))

Again, no changes to SXPath are needed.

The proposal seems to make aux-lists more transparent. If we don't need aux-lists, we won't look for them -- and nothing should be broken, as we have seen. We simply pretend aux-lists aren't there. The same node-typeof? and the rest of SXPath will work regardless of the presence or absence of aux-list. SXPath functions don't even need to check for '@@. Currently, an SXML function that doesn't use aux-lists still should know about aux-list's possible existence and check for @@ nodes. Under the new hypothetical proposal, a function that doesn't care about aux-lists doesn't need to do anything special at all. It doesn't even have to know that aux-lists exist.

The proposal also makes it easier to drop aux-lists when we serialize SXML into XML. In fact, the reason for the new aux-list proposal came from SXSLT.
It seems it would be considerably easier for SXSLT to deal with aux-lists if they were inside the <attux-list>.

Again, it seems that most of the SXML processing code will not be affected by the proposed change. Does someone have corroborating or refuting evidence?

Another doubt about the proposal concerns the aux-sublist access speed. If the source document is in 2NF and the aux-list contains the *parent* node (see the example above), we only need to do

    (assq '*parent* (cdaddr node))

to get to the corresponding association. With the proposal, we have to do more:

    ((sxpath '(@ @ *parent* *text*)) node)

Kirill noted that the *parent* aux-node is being used extensively in STX. STX performance will be notably affected. There are applications, Kirill noted, which rely on fast access to aux-nodes.

Let's consider the access to aux-list in more detail. Currently, sxpathlib provides the following function, which assumes that the source document is in 2NF:

    ; Returns the list of auxiliary nodes for given element or nodeset.
    ; Analogue of ((sxpath '(@@ *)) obj)
    ; Empty list is returned if a list of auxiliary nodes is absent.
    (define (sxml:aux-list obj)
      (if (or (null? (cdr obj))
              (null? (cddr obj))
              (not (pair? (caddr obj)))
              (not (eq? (caaddr obj) '@@)))
          '()
          (cdaddr obj)))

Under the proposal, the function should be re-written as

    ; Returns the list of auxiliary nodes for given element or nodeset.
    ; Analogue of ((sxpath '(@ @ *)) obj)
    ; Empty list is returned if a list of auxiliary nodes is absent.
    (define (sxml:aux-list obj)
      (or (and (pair? obj)
               (pair? (cdr obj))
               (let ((sc (cadr obj)))
                 (and (pair? sc)
                      (eq? '@ (car sc))
                      (let ((aux (assq '@ (cdr sc))))
                        (and aux (cdr aux))))))
          '()))

I chose to introduce local variables rather than use ca..ddr functions. The new code is quite similar to the old one. The only notable change is 'assq' in the latter function. Would it affect the performance to a large extent? It is not clear.
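
To make the comparison concrete, here is a minimal sketch that runs both styles of access on the running example. The names aux-list-current and aux-list-proposed are hypothetical and chosen only for this illustration; they restate the two accessors using car/cdr/cadr/cddr alone and are not the sxpathlib definitions.

```scheme
;; Hypothetical names for illustration only; not the sxpathlib code.

;; The running example in the current and in the proposed syntax.
(define node-current
  '(tag (@ (attr "val")) (@@ (*parent* val)) kid1 kid2))
(define node-proposed
  '(tag (@ (attr "val") (@ (*parent* val))) kid1 kid2))

;; Current syntax, 2NF assumed: the aux-list is the third item of the node.
(define (aux-list-current obj)
  (let ((rest (if (pair? (cdr obj)) (cddr obj) '())))
    (if (and (pair? rest)
             (pair? (car rest))
             (eq? '@@ (car (car rest))))
        (cdr (car rest))
        '())))

;; Proposed syntax: the aux-sublist is the entry tagged '@' inside the
;; '@'-tagged list that immediately follows the element name.
(define (aux-list-proposed obj)
  (let* ((attux (and (pair? (cdr obj)) (cadr obj)))
         (aux (and (pair? attux)
                   (eq? '@ (car attux))
                   (assq '@ (cdr attux)))))
    (if aux (cdr aux) '())))

;; The parent association is then found the same way in both cases.
(define parent-current
  (assq '*parent* (aux-list-current node-current)))
(define parent-proposed
  (assq '*parent* (aux-list-proposed node-proposed)))
```

Both parent-current and parent-proposed evaluate to (*parent* val). Since the grammar places <aux-sublist> after the attributes, the assq in the second function walks past every attribute of the element before it reaches the sublist, so the extra cost grows with the number of attributes rather than being a fixed amount.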
The proposal will lead to some space savings for documents where most elements have neither aux-lists nor attributes. Indeed, an SXML node without attributes and aux-lists has to be written as

    (tag (@) (@@) data)

in 3NF (which is most amenable to efficient processing). Under the proposal, the same node will have to be written as

    (tag (@) data)

or

    (tag (@ (@)) data)

That saves space because '(@) and '(@ (@)) can both be shared. Also, under the proposal, SXPath doesn't need to check for @@-lists. The presence or absence of aux-lists should be transparent to most applications.

Kirill has noted that so far, the upward traversal was the only application critical to aux-list access speed. If we are able to handle the upward traversal using a context, then the aux-list proposal will be less doubtful.