String character range

Mark Taylor m.b.taylor at bristol.ac.uk
Mon Aug 4 03:09:16 PDT 2008


On Fri, 1 Aug 2008, Luigi Paioro wrote:

> Hi.
>
> I find that your suggestion below is a good compromise. I would split it in 
> two points:
>
> 1. At SAMP protocol definition level we might define that "string" can accept 
> any sequence of 0x01-0x7f characters, adding the escape convention for any 
> printable Unicode char outside the specified range (so it is general).
>
> 2. At Standard Profile level I would put more constraints, limiting the 
> charset to the XML range and introducing the escape convention for the other 
> unsupported chars.
>
> Is it reasonable?

Luigi,

that is a reasonable way to go for permitting transmission of Unicode
characters.  However, any kind of escaping does introduce a fair 
amount of fiddly complication to handle all cases, both in the 
standard and at the client end.

In the standard we have to say exactly what counts as a Unicode
escape, which characters it is permitted/required for, and make sure
that there is some mechanism for escaping the escape (so for 
instance if you want to send a string that looks like the ASCII
"\u001f" rather than the Unicode character at code point 0x1f,
there has to be a way of doing that which will not be misunderstood).
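
For concreteness, a routine implementing such a convention might look
roughly like this in Python (the particular choice of \uXXXX escapes
plus a doubled backslash here is purely illustrative, not proposed
standard text):

    def escape_samp_string(s):
        """Escape a Unicode string into 7-bit-safe ASCII
        (hypothetical convention, illustration only)."""
        out = []
        for ch in s:
            code = ord(ch)
            if ch == '\\':
                out.append('\\\\')            # escape the escape itself
            elif ch in '\t\n\r' or 0x20 <= code <= 0x7e:
                out.append(ch)                # plain ASCII passes through
            else:
                out.append('\\u%04x' % code)  # anything else becomes \uXXXX
        return ''.join(out)

Without the first branch, an input that literally contains the six
ASCII characters "\u001f" could not be told apart from an escaped
control character, which is exactly the ambiguity mentioned above.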

At the client end, for reading strings at least, implementors will have 
to make sure that they take account of all of these things in order
to decode a string acquired from the SAMP transport (XML-RPC in the 
case of the Standard Profile).  Not hard in Unicode-aware languages 
which use the same escaping mechanism as SAMP does for Unicode 
characters (Java, Python); not too hard in languages designed for
text manipulation (Perl); probably quite a drag in certain other
languages which do not fall into these categories (C, FORTRAN, IDL) -
I'd guess at least 10-20 lines of code just for string decoding
(though quite likely many client implementations would just
treat it as normal ASCII and work 99% of the time, behaving
incorrectly, in mostly-not-very-catastrophic ways, the other 1%).  Of course
the best that languages with no Unicode support can do in any case
if they encounter non-ASCII Unicode characters is probably to 
replace them with a "?" or something.
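
The corresponding decoding step, assuming the illustrative convention
sketched above, is short in a language with native Unicode strings
and regular expressions:

    import re

    _ESCAPE = re.compile(r'\\(\\|u[0-9a-fA-F]{4})')

    def unescape_samp_string(s):
        """Decode a string received over the SAMP transport
        (hypothetical convention, illustration only)."""
        def repl(match):
            token = match.group(1)
            if token == '\\':
                return '\\'                   # doubled backslash -> literal one
            return chr(int(token[1:], 16))    # \uXXXX -> Unicode code point
        return _ESCAPE.sub(repl, s)

The equivalent hand-written loop in C or FORTRAN is the sort of
10-20 line chore estimated above.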

If we reckon that transmission of 
(a) control characters (everything between 0x01 and 0x1f) and
(b) non-7-bit-ASCII characters (Unicode beyond 0x7f)
is a requirement for what we're doing here, OK, let's draft a revised
definition of the SAMP string data type which is capable of doing 
all this, and clients will have to do the extra work if they want 
to behave correctly.

My feeling is it would be better to restrict what can be sent in a
SAMP string to something that is going to be easy to implement in all 
sensible languages/transports (probably 0x09, 0x0a, 0x0d, 0x20-0x7f),
so that both the standard, and the requirements on clients, stay 
as simple as possible.  If specific requirements for sending full 
Unicode strings arise, we could mark these on a per-MType basis
and come up with a convention along the lines of the SAMP int and
SAMP float already defined in Section 3.4.
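
Under that restriction the check a client or hub has to make is a
one-liner in more or less any language; in Python, for instance
(taking 0x09, 0x0a, 0x0d and 0x20-0x7f as the permitted set, which is
just my reading of the suggestion above):

    def is_legal_samp_string(s):
        """True if s uses only the proposed restricted character set."""
        return all(c in '\t\n\r' or 0x20 <= ord(c) <= 0x7f for c in s)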

Which of these is best depends on how important the requirement to
be able to send Unicode and control characters is.  My vote is 
not very.  Can we have a show of hands?

-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/


