String character range

Tue Aug 19 04:30:58 PDT 2008

Luigi, Doug and others,

Sorry I've let this one go cold; I got sidetracked by something else.

On Mon, 4 Aug 2008, Luigi Paioro wrote:
> I think that Unicode chars would rarely be sent, and control chars never 
> at all. Probably in 99% of cases the ASCII charset with the limitations 
> you indicated is enough, so I don't have a strong position with respect 
> to Unicode support.
>
> Anyway, I've thought about Doug's suggestion regarding UTF-8 and I've 
> looked here and there at which string encoding mechanisms other RPC 
> systems adopt, such as ZeroC's Ice and DBus (I also looked for the CORBA 
> encoding, but I didn't succeed). Well, DBus and Ice both use UTF-8 (with 
> no limitations). I've not looked at the other RPC systems (there are a 
> plethora) but those are my favourites (along with XML-RPC and SOAP of 
> course) and so I've looked there.
>
> Now, suppose that in the far far future, a perverted guy decides to implement 
> SAMP using a different profile, for instance using Ice as wire protocol (in 
> principle it should be possible) instead of XML-RPC. It would be a shame if 
> such an implementation inherited the limitations coming from the XML limits. 
> In my opinion the limits should be put at implementation and language level, 
> not at protocol level... it should be as general (and flexible) as possible.
>
> So, why not follow Doug's suggestion and specify at SAMP protocol 
> definition level that string serialization is in UTF-8 (in general), and 
> specify at Standard Profile level that not all UTF-8 chars are allowed 
> but only those supported by XML?

This is a coherent suggestion and it could be done.  However, in my 
opinion it's not the best way to go.  While making the protocol as
general and flexible as possible sounds like a good thing, the price
that you pay is a reduction in interoperability.  If the protocol
says that SAMP strings can only ever contain characters 0xA, 0xD and 
0x20-7F (or whatever) then you know that if you can handle those 
characters then you can definitely interoperate with anyone else
speaking the protocol.  If the protocol says that any UTF-8 character
is permitted then someone trying to write middleware that does
translation between the far future perverted Ice-based profile and the
current Standard Profile will have a problem: strings containing
characters that XML cannot carry have no faithful representation on
the Standard Profile side.  Is that kind of 
middleware something we're going to need?  I don't know.  But in 
weighing up how we ought to plan for unknown future evolutions,
I would rather err on the side of safety than of flexibility.
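
As a rough illustration of the restricted-range idea, a check of this
kind might be sketched in Python as below.  The names
ALLOWED_CODE_POINTS, is_restricted_string and first_disallowed are
hypothetical, and the allowed set simply takes the "0xA, 0xD and
0x20-7F (or whatever)" example literally; it is an assumption, not an
agreed SAMP rule.

```python
from typing import Optional

# Assumed allowed set, copied from the example range in the text:
# line feed (0x0A), carriage return (0x0D) and 0x20 through 0x7F.
ALLOWED_CODE_POINTS = frozenset({0x0A, 0x0D}) | frozenset(range(0x20, 0x80))


def is_restricted_string(text: str) -> bool:
    """Return True if every character lies in the assumed allowed set."""
    return all(ord(ch) in ALLOWED_CODE_POINTS for ch in text)


def first_disallowed(text: str) -> Optional[int]:
    """Return the index of the first disallowed character, or None."""
    for index, ch in enumerate(text):
        if ord(ch) not in ALLOWED_CODE_POINTS:
            return index
    return None
```

With a check like this, a sender or a translating hub could reject a
string such as "caf\u00e9" up front, rather than discovering further
down the line that some profile cannot carry it.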

Mark

-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/