String character range

Fri Aug 1 13:53:06 PDT 2008

Hey Mark -

I agree with your sentiment that string data which we want to
manipulate in any language or environment should be kept simple. If
necessary, a separate datatype could be declared to represent, e.g.,
general Unicode-encoded text.

What about UTF-8 though?  It is backwards compatible with ASCII
but allows any Unicode character to be represented using multi-byte
sequences; if the text contains no non-ASCII characters, the encoded
bytes are identical to ASCII.  This is much like your escape sequence
proposal, but it is a widely used standard.  XML processors are required
to support UTF-8 (almost any XML document one sees is UTF-8 encoded), so
there should be no problems there.

I suspect that if some old ASCII-oriented code got a UTF-8 encoded
string containing multi-byte Unicode characters it would print them
oddly, but it would probably still work: UTF-8 never uses a zero byte
inside a multi-byte sequence, so things like the NUL test for end of
string behave normally.  There would be no problem in the usual case
of simple ASCII text.
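
As a rough illustration of what byte-oriented code would see, the sketch below mimics a C-style strlen in Python (the helper c_strlen is hypothetical, written only for this example). The terminator test still finds the end of the string; the only visible effects are that the byte count differs from the character count and that the bytes look odd if misread as Latin-1:

```python
def c_strlen(buf: bytes) -> int:
    """Count bytes before the first zero byte, as C's strlen would."""
    n = 0
    while n < len(buf) and buf[n] != 0:
        n += 1
    return n

text = "Se\u00f1or"                   # 5 characters, one of them non-ASCII
buf = text.encode("utf-8") + b"\x00"  # NUL-terminated, as C code would see it

print(len(text))      # 5
print(c_strlen(buf))  # 6: the n-tilde occupies two bytes

# Every byte of the multi-byte sequence is >= 0x80, so none of them can be
# mistaken for the terminator or for an ASCII character.
multibyte = [b for b in buf if b >= 0x80]
print([hex(b) for b in multibyte])  # ['0xc3', '0xb1']

# Old code that assumes one byte per character (e.g. Latin-1) would show
# two characters where one was intended.
misread = buf[:-1].decode("latin-1")
print(len(misread))   # 6
```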

	- Doug

On Fri, 1 Aug 2008, Mark Taylor wrote:

> On Fri, 1 Aug 2008, Carlos Rodrigo Blanco wrote:
> 
> > Hi
> > 
> > I'm sorry that I don't know much about Unicode encoding and I feel quite
> > ashamed of showing this ignorance, but I wonder what happens with Latin
> > characters and the like.
> > 
> > If I have to write, for instance, an author name in an XML document that
> > includes some Latin character (like ñ), is that allowed?
> 
> Writing it in an XML document - no problem.  XML, and Unicode on which
> it is based, are very capable of representing almost any character
> from almost any language you can think of (and a lot more).
> 
> As far as SAMP goes: that character looks to me like code point 0xf1, from the
> Latin-1 Supplement code block.  So you could not send it using either the
> existing definition for a SAMP string or the proposal (4) that I am
> suggesting.  If we used a variant of my suggestion (3):
> 
>   3. Define some escaping convention for un-XML characters, e.g. \u001f
>      for character 31.
> 
> with the intention that this escaping mechanism could be used for
> any 8-bit character, it would be possible to transmit this kind of non-7-bit
> Latin character.  However, characters with the 8th bit set might cause
> problems for certain other transports and language environments.  I must admit
> that apart from RFC-822 mail-type contexts I can't think of what these might
> be, but I'd be inclined to steer clear of non-7-bit characters just in case.
> However, if others (e.g. with fewer Anglo-Saxon prejudices) think that it's an
> important requirement to permit transmission of characters like this within
> SAMP we could take that on board.  We could even in principle say that this
> escaping mechanism could be used to specify any Unicode character - but I
> think that would definitely be a bad idea as it would effectively restrict use
> of the protocol to languages with Unicode support, which excludes quite a lot.
> 
> Mark
> 
> -- 
> Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
> m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/
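
For comparison, the escaping convention in Mark's suggestion (3) could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of any SAMP definition: the set of directly permitted characters, the choice to escape the backslash itself so that decoding is unambiguous, and the function names escape and unescape are all assumed here.

```python
import re

# Assumption: characters allowed to appear unescaped are tab, LF, CR and
# 7-bit printable ASCII. The exact set is a guess made for this sketch.
ALLOWED = {0x09, 0x0A, 0x0D} | set(range(0x20, 0x80))

def escape(text: str) -> str:
    """Replace characters outside ALLOWED with \\uXXXX (8-bit range only)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            raise ValueError("only 8-bit characters are handled in this sketch")
        # The backslash is escaped too, so every backslash in the output
        # starts an escape sequence (an assumption of this sketch).
        if cp in ALLOWED and ch != "\\":
            out.append(ch)
        else:
            out.append("\\u%04x" % cp)
    return "".join(out)

def unescape(text: str) -> str:
    """Reverse escape(): turn each \\uXXXX back into its character."""
    return re.sub(r"\\u([0-9a-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

print(escape("\x1f"))         # \u001f
print(escape("Espa\u00f1a"))  # Espa\u00f1a (a literal backslash-u sequence)
```

The result is plain 7-bit text, at the cost of every receiver having to implement the decoding step, which is the trade-off against using UTF-8 directly.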