String character range

Mon Aug 4 00:59:15 PDT 2008

On Fri, 1 Aug 2008, Doug Tody wrote:

> Hey Mark -
>
> I agree with your sentiment that string data which we want to
> manipulate in any language or environment should be simple; if
> necessary a separate datatype could be declared for representing
> e.g. general Unicode encoded text.
>
> What about UTF-8 though?  This is backwards compatible with ASCII
> but allows any Unicode character to be represented using multi-byte
> sequences - if there are no funny characters it is the same as ASCII.
> This is much like your escape sequence proposal, but is a widely used
> standard.  XML has mandatory support for UTF-8 (almost any XML document
> one sees is UTF-8 encoded) so there should be no problems there.

Hi Doug,

You're right: UTF-8 does look like a better solution than the \uxxxx
escaping mechanism (borrowed from Java) that I suggested, at least for
transmitting things like accented letters and characters from non-Latin
alphabets.  However, it doesn't solve the problem which started this
thread off, since you still won't be able to include characters in
the ranges excluded by the XML Char definition; those are simply
not permitted in an XML document, regardless of encoding (and in any
case the UTF-8 encoding of the character 0x1f is the single byte 0x1f).
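
To make that concrete, here is a minimal Python sketch, assuming the
XML 1.0 Char production and the expat-based parser in Python's standard
library.  The helper names is_xml_char and parses_as_xml are made up
for illustration and are not part of any proposal.

```python
import xml.etree.ElementTree as ET


def is_xml_char(cp: int) -> bool:
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses_as_xml(doc: bytes) -> bool:
    """True if the standard-library parser accepts doc as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# UTF-8 leaves the control character 0x1f as the same single byte.
encoded = "\x1f".encode("utf-8")

# An accented letter travels fine as UTF-8 ...
good = '<?xml version="1.0" encoding="UTF-8"?><v>caf\u00e9</v>'.encode("utf-8")
# ... but a raw 0x1f in the content is rejected whatever the encoding.
bad = b'<?xml version="1.0" encoding="UTF-8"?><v>a\x1fb</v>'

print(encoded, is_xml_char(0x1F), parses_as_xml(good), parses_as_xml(bad))
```

So the encoding question and the permitted-character question are
separate: UTF-8 settles the first but leaves the second untouched.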

Mark

-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/