<!DOCTYPE html>

<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Markus, Mark, Russ et al.,</p>

    <p>I support the idea of deprecating "unicodeChar" and allowing

      UTF-8 in "char", with<br>

      "arraysize" being the length of the string in bytes.<br>

      (It may also be worth adding comments about the possible meaning

      of "width" in such cases; see below.)<br>

    </p>

    <p>It seems that in (recent versions of?) some databases, 'char(n)'

      does mean 'a string of n bytes'; see e.g.:<br>

      * Microsoft SQL Server doc here:

<a class="moz-txt-link-freetext" href="https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver16">https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver16</a><br>

      * JavaDB:

      <a class="moz-txt-link-freetext" href="https://docs.oracle.com/javadb/10.10.1.2/ref/rrefsqlj13733.html">https://docs.oracle.com/javadb/10.10.1.2/ref/rrefsqlj13733.html</a><br>

    </p>

    <p>PostgreSQL warns against the usage of fixed length strings, see

      the "Tip" in

      <a class="moz-txt-link-freetext" href="https://www.postgresql.org/docs/current/datatype-character.html">https://www.postgresql.org/docs/current/datatype-character.html</a><br>

      A PSQL type CHAR(8) with a character set different from ASCII

      (<a class="moz-txt-link-freetext" href="https://www.postgresql.org/docs/current/multibyte.html">https://www.postgresql.org/docs/current/multibyte.html</a>) <br>

      could be transformed into arraysize="8*" .<br>

      And a fixed arraysize="x" could be transformed into

      VARCHAR((x+3)/4).</p>
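    <p>A minimal sketch of the arithmetic behind these conversions, assuming UTF-8 with at most 4 bytes per code point and no padding. The function names are only illustrative. With integer division, (x+3)/4 is the ceiling of x/4, i.e. the smallest number of code points that can fill exactly x bytes.</p>

<pre>
```python
from typing import Tuple


def utf8_byte_bounds(n_chars: int) -> Tuple[int, int]:
    """Min and max UTF-8 byte length of a string of n_chars code points."""
    # 1 byte per code point for pure ASCII, up to 4 bytes per code point otherwise.
    return n_chars, 4 * n_chars


def char_count_bounds(n_bytes: int) -> Tuple[int, int]:
    """Min and max code-point count of a valid UTF-8 string of exactly n_bytes bytes."""
    # Fewest code points: as many 4-byte code points as possible, i.e. ceil(n_bytes / 4).
    # Most code points: all ASCII, one byte each.
    return (n_bytes + 3) // 4, n_bytes


print(utf8_byte_bounds(8))   # (8, 32)
print(char_count_bounds(8))  # (2, 8)
```
</pre>

    <p>So a CHAR(8) column counted in characters may need anywhere from 8 to 32 bytes, and a string of 8 bytes may hold anywhere from 2 to 8 characters.</p>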

    <p>For ASCII or fixed number of characters, we may enforce the usage

      of the "width" attribute<br>

      since "width" is "the number of characters to be used for input or

      output of the quantity".<br>

      Am I right in saying that <span style="white-space: pre-wrap">datatype="char" + arraysize="x" + width="x" could mean a fixed-length ASCII string?</span></p>

    <p><span style="white-space: pre-wrap">We may find corner cases, but, FWIW, the Rust VOTable library (<a class="moz-txt-link-freetext" href="https://github.com/cds-astro/cds-votable-rust">https://github.com/cds-astro/cds-votable-rust</a>)

and vot-cli (<a class="moz-txt-link-freetext" href="https://github.com/cds-astro/cds-votable-rust/tree/main/crates/cli">https://github.com/cds-astro/cds-votable-rust/tree/main/crates/cli</a>) seem to already support UTF-8.

E.g. the (non-standard) VOTable:</span></p>

    <p><span style="white-space: pre-wrap"><font face="monospace"><?xml version="1.0" encoding="UTF-8"?>

<VOTABLE version="1.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

 xmlns="http://www.ivoa.net/xml/VOTable/v1.3">

  <RESOURCE>

    <TABLE>

      <FIELD name="a" datatype="char" arraysize="3*"/>

      <FIELD name="b" datatype="unicodeChar" arraysize="*"/>

      <FIELD name="c" datatype="char" arraysize="8"/>

      <FIELD name="d" datatype="unicodeChar" arraysize="3"/>

      <DATA><TABLEDATA>

        <TR><TD>é😀è</TD><TD>éàè</TD><TD>é😀è</TD><TD>éàè</TD></TR>

      </TABLEDATA></DATA>

    </TABLE>

  </RESOURCE>

</VOTABLE></font></span></p>

    <p><span style="white-space: pre-wrap">can be converted back and forth between TABLEDATA / BINARY / BINARY2 (+ JSON / TOML / YAML) using the following commands:

</span></p>

    <p><span style="white-space: pre-wrap">vot convert -i ${in} -o ${in}.bin.xml  -f xml-bin

vot convert -i ${in} -o ${in}.bin2.xml -f xml-bin2

vot convert -i ${in} -o ${in}.json  -f json --pretty

vot convert -i ${in} -o ${in}.toml  -f toml --pretty

vot convert -i ${in} -o ${in}.yaml   -f yaml

vot convert -i ${in}.bin.xml  -f xml-td

vot convert -i ${in}.bin2.xml -f xml-td

vot convert -i ${in}.json -f xml-td

vot convert -i ${in}.toml -f xml-td

vot convert -i ${in}.yaml -f xml-td

</span></p>
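    <p>For reference, a small sketch that computes the sizes of the cell values in the example row, assuming "char" is counted in UTF-8 bytes and "unicodeChar" in 16-bit UTF-16 code units as discussed in this thread. It is independent of vot-cli, and the helper names are made up for the illustration.</p>

<pre>
```python
def utf8_bytes(s: str) -> int:
    """Length of s in UTF-8 bytes (the proposed meaning of arraysize for char)."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Length of s in 16-bit UTF-16 code units (2 bytes each in BINARY/BINARY2)."""
    return len(s.encode("utf-16-be")) // 2


# Escapes are used so the values do not depend on the editor's encoding.
E_ACUTE, A_GRAVE, E_GRAVE, GRIN = "\u00e9", "\u00e0", "\u00e8", "\U0001F600"

# (field name, datatype, declared arraysize, cell value) from the example row.
cells = [
    ("a", "char", "3*", E_ACUTE + GRIN + E_GRAVE),
    ("b", "unicodeChar", "*", E_ACUTE + A_GRAVE + E_GRAVE),
    ("c", "char", "8", E_ACUTE + GRIN + E_GRAVE),
    ("d", "unicodeChar", "3", E_ACUTE + A_GRAVE + E_GRAVE),
]

for name, datatype, arraysize, value in cells:
    if datatype == "char":
        size, unit = utf8_bytes(value), "UTF-8 bytes"
    else:
        size, unit = utf16_units(value), "UTF-16 code units"
    print(f"{name}: arraysize={arraysize!r}, {len(value)} code points, {size} {unit}")
```
</pre>

    <p>The value é😀è is 3 code points but 8 UTF-8 bytes, which matches arraysize="8" on field c. Read as a byte count, the "3*" bound on field a would be exceeded by the same value.</p>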


    <p><span style="white-space: pre-wrap">fx

</span></p>


    <div class="moz-cite-prefix">On 11/06/2025 at 18:35, Russ Allbery via

      apps wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:8734c644nw.fsf@hope.eyrie.org">

      <pre wrap="" class="moz-quote-pre">Mark Taylor via apps <a class="moz-txt-link-rfc2396E" href="mailto:apps@ivoa.net"><apps@ivoa.net></a> writes:

</pre>

      <blockquote type="cite">

        <pre wrap="" class="moz-quote-pre"> 1. Redefine datatype="char" to mean UTF-8 in the BINARY/BINARY2 encoding,

    and document-encoded unicode in TABLEDATA.

</pre>

      </blockquote>

      <pre wrap="" class="moz-quote-pre">

</pre>

      <blockquote type="cite">

        <pre wrap="" class="moz-quote-pre">    The arraysize attribute and the BINARY/BINARY2 byte count are both 

    equal to the number of bytes in the UTF-8 encoded value (not the

    number of characters/codepoints in the string).

    This won't break anything which is already correct, since you're only

    supposed to put 7-bit ASCII (whose UTF-8 representation is identical)

    into char fields.

</pre>

      </blockquote>

      <pre wrap="" class="moz-quote-pre">

</pre>

      <blockquote type="cite">

        <pre wrap="" class="moz-quote-pre">    The downside is that a FIELD with datatype="char" arraysize="8"

    can't store an 8-character string if those characters are emojis.

    Personally, I think that's OK, if you want to declare fixed-length

    char fields, you will now have to think in UTF-8 terms not 

    code-point terms.

</pre>

      </blockquote>

      <pre wrap="" class="moz-quote-pre">

I was trying to decide if this would cause a problem for TAP table upload

given that database schemas generally specify limits in terms of

characters, not bytes, for CHAR and VARCHAR data types, but I think I'm

convinced that this isn't a concern in that direction. If one takes the

approach of blindly translating the arraysize parameter of the type to the

length of the field, the result would be a database column that is "too

large" for the VOTable data type for non-ASCII strings, but I don't think

that causes problems as long as the TAP service knows the original type

and can reflect it on query results.

This does feel like it's going to increase the existing schism between

underlying database types and VOTable types because there will be no clear

translation of arraysize for char fields between VOTable semantics and

common database semantics. I'm not sure there's any way to avoid that, but

it feels awkward.

For example, suppose that one has a column in the database that is defined

as CHAR(8) with a Unicode character set. What should the corresponding

arraysize in the TAP_SCHEMA entry be for this column? 8 seems obviously

wrong and will truncate valid data. 48 is safe but seems weird.

("Don't use fixed-width char fields for anything other than

single-character ASCII flags; this is a false optimization for modern

databases" is probably the correct answer in most cases, but we all know

database schemas are hard to change.)

Also, a probably obvious point, but worth stating explicitly: Suppose that

the VOTable schema for a column is datatype="char" arraysize="8" but the

database column value is two Unicode characters whose UTF-8 representation

totals 12 bytes. The TAP server I think needs to truncate at the last

character that fits into the size when converted to UTF-8, and then pad.

It definitely should not take the naive approach of converting to UTF-8

and then truncating at 8 bytes, since that will result in corrupt UTF-8

that should be rejected by any UTF-8 decoder.

</pre>

      <blockquote type="cite">

        <pre wrap="" class="moz-quote-pre"> 2. Deprecate datatype="unicodeChar".  Anybody who wants to write

    non-ASCII text should use UTF-8 in datatype="char" instead.

</pre>

      </blockquote>

      <pre wrap="" class="moz-quote-pre">

While in general I am in favor of using Unicode everywhere, do we lose

anything by no longer having a way of marking fields as containing simple

one-byte-per-character results that don't require any special processing?

I suppose the alternative is to introduce yet another datatype, though,

which seems even more unappealing.

</pre>

      <blockquote type="cite">

        <pre wrap="" class="moz-quote-pre"> 3. Just to remove mention of the obsolete UCS-2 from the standard, 

    change the text to say that BINARY/BINARY2 unicodeChar is to be 

    interpreted as UTF-16, but that behaviour is undefined where it

    contains characters outside of the UCS-2 subset of UTF-16.

    Then the BINARY/BINARY2 byte count for unicodeChar arrays is

    2*arraysize.

</pre>

      </blockquote>

      <pre wrap="" class="moz-quote-pre">

</pre>

      <blockquote type="cite">

        <pre wrap="" class="moz-quote-pre">    That's somewhat nasty, but I claim OK since (a) unicodeChar 

    only used to be allowed for UCS-2 so it won't break any existing 

    code/data[*], and (b) unicodeChar will now be deprecated so nobody

    should write new code/data that encounters this.

</pre>

      </blockquote>

      <pre wrap="" class="moz-quote-pre">

So basically telling implementors that to fully support Unicode you should

ignore both UTF-16 and unicodeChar and just use char with UTF-8, which can

handle the entire character set. This part seems reasonable to me.

</pre>

    </blockquote>

  </body>

</html>