String character range

Fri Aug 1 03:02:32 PDT 2008

Hallo all.

While writing the hub tests, I have come across a problem with the
definition of the SAMP string data type.  Section 3.3 of the SAMP
doc defines a string as:

     "a scalar value consisting of a sequence of characters;
      each character may be in the range 0x01-0x7f"

Section 2.2 of the XML specification meanwhile 
(http://www.w3.org/TR/2006/REC-xml-20060816/#charsets) has the following
BNF production for characters allowed in an XML document:

    [2] Char  ::=  #x9 | #xA | #xD | [#x20-#xD7FF]
                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

                            /* any Unicode character, excluding the
                               surrogate blocks, FFFE, and FFFF. */

(I do not understand the comment here - as far as I can see Unicode
does include the other control characters in the range #x0-#x1f.
Oh well).

What this means is that there are legal SAMP strings (ones containing
any character in the ranges 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F) which
cannot be transmitted as an XML-RPC <string> element.  So either the
definition of a SAMP string, or the prescription for transmitting
SAMP strings in XML-RPC messages in the Standard Profile, must be
modified to avoid inconsistency.
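
The mismatch can be checked mechanically.  The following is only a
minimal sketch in Python (not part of any SAMP implementation): the
function name is_xml_char is made up for illustration, and it simply
transcribes production [2] quoted above and compares it with the
0x01-0x7f range from section 3.3.

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Code points allowed in a SAMP string by the current sec 3.3 wording.
SAMP_RANGE = range(0x01, 0x80)

# Legal SAMP characters that cannot appear in an XML document.
untransmittable = [cp for cp in SAMP_RANGE if not is_xml_char(cp)]

if __name__ == "__main__":
    print(" ".join(f"0x{cp:02X}" for cp in untransmittable))
```

The resulting list is exactly the ranges given above: 0x01-0x08, 0x0B,
0x0C and 0x0E-0x1F.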

I think the possibilities are as follows:

    1. Encode all SAMP strings as <base64> elements when transmitting
       over XML-RPC.

    2. Allow SAMP strings to be transmitted as either <string> or
       <base64> elements when transmitting over XML-RPC (the latter
       case being required only if the string contains un-XML
       characters).

    3. Define some escaping convention for un-XML characters, e.g.
       \u001f for character 31.

    4. Change the SAMP string definition so that only XML-friendly
       characters are allowed.
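
To give an idea of what (3) might involve for a client, here is a
rough Python sketch.  Nothing here is defined by SAMP: the function
names escape_unxml and unescape_unxml are hypothetical, and the choice
to escape the backslash itself (so that the mapping is reversible) is
an assumption of the sketch, not something the option above specifies.

```python
import re

# Characters that are legal in a SAMP string (0x01-0x7f) but not in XML,
# plus the backslash, which must itself be escaped for a clean round trip.
_NEEDS_ESCAPE = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f\\]")

# A backslash, 'u', and exactly four hex digits.
_ESCAPED = re.compile(r"\\u([0-9a-fA-F]{4})")


def escape_unxml(s: str) -> str:
    """Replace un-XML characters (and backslash) with \\uXXXX sequences."""
    return _NEEDS_ESCAPE.sub(lambda m: "\\u%04x" % ord(m.group()), s)


def unescape_unxml(s: str) -> str:
    """Reverse escape_unxml."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), s)


if __name__ == "__main__":
    raw = "a\x1fb"
    print(escape_unxml(raw))
    print(unescape_unxml(escape_unxml(raw)) == raw)
```

Both sender and recipient would have to apply the same convention to
every string value, which is the extra burden on clients referred to
below.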

Both (1) and (2) would entail significant extra complication 
(base64 decoding required) for Standard Profile clients, and (2) would 
additionally make debugging harder (it's nice that you can see what's 
in a SAMP/XML-RPC message just by looking).  (3) would make life a bit 
more complicated than now for clients, but not that much.  The existing
legal range 0x01-0x7f for SAMP string characters was in any case just 
intended to be a range of characters which would be sufficient for 
'normal' strings, while excluding non-printable ones (i.e. ones which 
would likely cause problems for some transport types), and it looks 
like I decided on a range that was too wide for that purpose.

So I suggest that we do (4).  I think we do need at least one line-break
character, though the need for both 0x0A and 0x0D may be moot, as is the
need for 0x09 (tab).  So I suggest that we change the definition of
a SAMP string in sec 3.3 to one of:

   4a. "a scalar value consisting of a sequence of characters;
        each character may be in the range 0x20-0x7f or one of
        the special characters 0x09 (tab), 0x0A (line feed) or
        0x0d (carriage return)"

or

   4b. "a scalar value consisting of a sequence of characters;
        each character may be in the range 0x20-0x7f or the
        line break character 0x0a"

(4b) might be more rigorous since it obviates the possibility of
confusion over line endings when moving text between OSs (Windows
and *nix), but since SAMP usage will probably mostly be intra-OS this
might cause more trouble than it's worth.  Also, I bet that
Windows-based implementations would routinely violate this in any case
(see Goldfarb's First Law of Text Processing), so 4a is probably
better.
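
For comparison, the two candidate definitions could be checked by
something like the following Python sketch.  The function names are
invented for illustration, and the character classes are a direct
reading of the 4a and 4b wording above rather than an agreed
definition.

```python
import re

# 4a: 0x20-0x7f plus tab, line feed and carriage return.
_SAMP_4A = re.compile(r"[\x20-\x7f\x09\x0a\x0d]*")

# 4b: 0x20-0x7f plus line feed only.
_SAMP_4B = re.compile(r"[\x20-\x7f\x0a]*")


def is_samp_string_4a(s: str) -> bool:
    """True if every character of s is allowed under proposal 4a."""
    return _SAMP_4A.fullmatch(s) is not None


def is_samp_string_4b(s: str) -> bool:
    """True if every character of s is allowed under proposal 4b."""
    return _SAMP_4B.fullmatch(s) is not None


if __name__ == "__main__":
    windows_text = "line one\r\nline two"
    print(is_samp_string_4a(windows_text), is_samp_string_4b(windows_text))
```

A string with Windows-style CR LF line endings passes the 4a check but
fails the 4b one, which is the routine violation anticipated above.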

Comments/agreements/disagreements?

Mark

-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/