VOResource 1.1 and i18n

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Fri Aug 5 16:24:37 CEST 2016


Hi Alberto,

On Thu, Aug 04, 2016 at 08:10:01AM -0400, Accomazzi, Alberto wrote:
> On Thu, Aug 4, 2016 at 5:20 AM, Markus Demleitner <
> msdemlei at ari.uni-heidelberg.de> wrote:
> 
> >
> >   Several VOResource elements contain names.  Again, for reliable global
> >   discoverability, such names must be given in (common) English
> >   transliteration where their original form uses non-Latin scripts.
> >   Latin letters with diacritics should also be transliterated.
> >
> 
> The transliteration of Latin letters with diacritics seems a bit harsh to
> me.  If it's there to make sure that searches containing non-diacritic
> terms match the original strings with diacritics, there are other ways to
> do this (downgrade everything to ascii when indexing and searching).  Since
> this is a problem that only affects searchable registries, I would
> investigate if the technologies currently used to host these databases
> allow for that, in which case there should be no extra work involved in
> keeping the diacritics in.

Yes, I suppose that's true.  I've changed that part to

  Several VOResource elements contain names.  Again, for reliable global
  discoverability, such names must be given in (common) English
  transliteration where their original form uses non-Latin scripts.
  Latin letters with diacritics are allowed, but Registry components are
  generally expected to treat them as equivalent to their base letters.

This isn't exactly a strict spec -- e.g., it leaves open whether or
not ligatures like "ß" or "æ" should be transliterated -- but I
suppose we'd be good if people actually adhered to this in spirit.
And anyway, at least for the "ß", my impression is that everyone in
their right mind transliterates it.
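
To illustrate what treating diacritics as equivalent to their base
letters could look like on the Registry side, here is a minimal Python
sketch.  It is not taken from any existing registry implementation: the
function name fold_to_base and the EXTRA_FOLDS table are made up for
illustration, and the replacements chosen for "ß" and "æ" are
assumptions, since Unicode decomposition alone leaves such letters
untouched.

```python
import unicodedata

# Hypothetical explicit mappings for letters that have no Unicode
# decomposition into base letter + combining mark.  Which replacements
# to use is an assumption, not something the spec text prescribes.
EXTRA_FOLDS = {"ß": "ss", "æ": "ae", "Æ": "AE", "ø": "o", "Ø": "O"}


def fold_to_base(text: str) -> str:
    """Return a diacritic-free, case-folded form of text for matching."""
    # First handle letters that NFKD would leave unchanged.
    replaced = "".join(EXTRA_FOLDS.get(ch, ch) for ch in text)
    # Split precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def names_match(stored: str, query: str) -> bool:
    """Compare a stored name and a query term after folding both."""
    return fold_to_base(stored) == fold_to_base(query)
```

Applying the same folding both when indexing and when evaluating a
query would let a search for "Stromgren" find a record containing
"Strömgren" while the record itself keeps the diacritics.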

> > Cyrillic or Chinese or Japanese scripts.  At least for elements with
> > an explicit name element (creator, contributor, contact), it would
> > not be hard to add an additional element (perhaps originalName?) that
> > could legally contain non-Latin letters.  I'd be happy to introduce
> > them if people asked for them and would volunteer to put out records
> > using them.
> >
> 
> We have a field in our bibliographic data that can be used to retain the
> author name in its native script.  Although at the moment we don't do
> anything useful with it the plan is to expose it and use it for indexing to
> help with disambiguation.  I don't think you should worry about
> disambiguation now but it seems like a good idea to capture the faithful
> representation of somebody's name, so I'd vote for that.

So -- would anyone want to playtest it?

> BTW I noticed that DataCite says nothing of the sort and one of their
> examples has a name in Chinese script in the <creator> field:
> https://schema.datacite.org/meta/kernel-3.1/example/datacite-example-complicated-v3.0.xml
> The schema however allows for a title and a translated title (which is in
> English).

Well, perhaps they are not really worried about reliable
searchability in that their main use case is resolving DOIs, not
answering natural-language queries.  Hm.


         -- Markus

