VOResource 1.1 and i18n

Thu Aug 4 11:20:44 CEST 2016

Dear colleagues,

A while ago, I asked for feedback about languages and scripts in
VOResource:

On Wed, Jul 20, 2016 at 04:36:11PM +0200, Markus Demleitner wrote:
> My current plan for VOResource 1.1 is:
> 
> (1) essentially codify the status quo ("natural-language content is
> expected in English"; suggestions for what to say about
> transliteration are welcome).  As long as we have strong cultural
> biases, let's at least be honest about them.
> 
> (2) say that registry extensions (the edu IG will want this) may use
> the xml:lang mechanism from the XML spec, but the elements then have
> to be repeatable, and a version without xml:lang and English content
> should always be provided (that's going to be a tough nut for
> RegTAP, I suppose).

Since there has been no feedback so far, I have gone ahead and added
a section (in volute rev. 3498) saying this much:

  \subsubsection{Language and Transliteration}

  Several VOResource elements contain natural language (e.g.,
  \xmlel{description}, \xmlel{title}, \xmlel{subject}).  In order to
  ensure reliable discovery, in core VOResource, these elements must
  contain English text, with US spelling strongly preferred; technically,
  an \xmlel{xml:lang} value of \texttt{en-US} is implied.

  Registry extensions may allow \xmlel{xml:lang} attributes on elements.
  If they do, such elements must be repeatable, and an element without
  \xmlel{xml:lang} (and hence, \texttt{en-US} implied) should be required
  for global discoverability.  The requirements on \xmlel{xml:lang} from
  the XML specification \citep{std:XML} apply.  Additionally, in
  VOResource documents RFC 3066 language identifiers must be written in
  lowercase.

  Several VOResource elements contain names.  Again, for reliable global
  discoverability, such names must be given in (common) English
  transliteration where their original form uses non-Latin scripts.
  Latin letters with diacritics should also be transliterated.

I feel a bit bad about codifying this amount of cultural bias, but
I'm convinced that for reliable discovery, we'll have to say
something pretty close to that.
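
For illustration, the xml:lang rules from the draft above could be
checked roughly as follows.  This is only a sketch in Python: the record
fragment, the helper names check_language_variants and pick_text, and
the reading of "lowercase" as applying to explicit xml:lang values are
assumptions made for the example, not part of the draft.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment of an extension record; element names are illustrative.
RECORD = """
<resource>
  <description>A catalog of binary stars.</description>
  <description xml:lang="de">Ein Katalog von Doppelsternen.</description>
</resource>
"""


def check_language_variants(parent, tag):
    """Return a list of problems with the xml:lang usage of parent's tag children."""
    problems = []
    variants = parent.findall(tag)
    # The draft asks for an element without xml:lang (English implied).
    if not any(XML_LANG not in el.attrib for el in variants):
        problems.append(f"no {tag} element without xml:lang")
    # Assumed reading: explicit language identifiers must be lowercase.
    for el in variants:
        lang = el.get(XML_LANG)
        if lang is not None and lang != lang.lower():
            problems.append(f"xml:lang {lang!r} is not lowercase")
    return problems


def pick_text(parent, tag, preferred_lang=None):
    """Return the text in preferred_lang, else the text of the element without xml:lang."""
    fallback = None
    for el in parent.findall(tag):
        lang = el.get(XML_LANG)
        if lang is None:
            fallback = el.text
        elif preferred_lang is not None and lang == preferred_lang.lower():
            return el.text
    return fallback


root = ET.fromstring(RECORD)
print(check_language_variants(root, "description"))
print(pick_text(root, "description", "de"))
print(pick_text(root, "description", "ja"))
```

A client asking for a language the record does not offer ends up with
the English default, which is the point of requiring the element without
xml:lang.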

On the question of names in particular, though, I'm really uncertain.
It seems patently wrong to me to have no place for names in, say,
Cyrillic, Chinese, or Japanese script.  At least for elements with
an explicit name element (creator, contributor, contact), it would
not be hard to add an additional element (perhaps originalName?) that
could legally contain non-Latin letters.  I'd be happy to introduce
such an element if people asked for it and volunteered to put out
records using it.
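
To make the idea a little more concrete, here is a minimal Python sketch
of how a record producer might pair a transliterated name with its
original form.  Everything specific in it is an assumption: originalName
is only the tentative element name floated above, the helper names are
invented for the example, and stripping diacritics is a crude stand-in
for a proper transliteration convention.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_plain_latin(name):
    """True if every letter in name is an unaccented Latin (ASCII) letter."""
    return all(ord(ch) < 128 for ch in name if ch.isalpha())


def strip_diacritics(name):
    """Drop combining marks from accented letters.

    Crude: common transliterations may differ (for instance, a German
    u-umlaut is often written "ue" rather than "u").
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_creator(name, original_name=None):
    """Build a creator element; originalName is the hypothetical addition."""
    if not is_plain_latin(name):
        raise ValueError("name must be given in English transliteration")
    creator = ET.Element("creator")
    ET.SubElement(creator, "name").text = name
    if original_name is not None and original_name != name:
        ET.SubElement(creator, "originalName").text = original_name
    return creator


# A Cyrillic surname written with escapes to keep this source ASCII-only.
original = "\u0418\u0432\u0430\u043d\u043e\u0432, \u0418."
creator = make_creator("Ivanov, I.", original)
print(ET.tostring(creator, encoding="unicode"))
print(strip_diacritics("Angstr\u00f6m"))
```

Discovery would keep matching against name only, so existing clients
would not have to change; originalName would be purely additional
information.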

Any takers?

       -- Markus