Representing data

Tue Feb 3 03:55:57 PST 2004

Yes, I'll probably get more interest here, because no firm decisions have been 
made here yet :-) but this is not really the ideal place either - I'm after a 
way of representing data instances for transport, rather than modelling data in 
the general case.  As developers we can 'just get on with it' and the astronomers 
will have to put up with whatever we make - but I'd rather get some input!

As you say amongst all your sensible stuff below, the case of the large dataset 
*has* to be dealt with by making use of some binary format, not just for the 
storage space and network bandwidth, but also because of the tremendous amount 
of processing power required to convert to and from ASCII.  Given the current 
binary formats available, a table-based structure is probably the most suitable 
- but let's not assume that is always the case.  Fixed record sizes make editing 
tables easier and faster, but if we are just going to be reading the data then it 
makes little difference whether the records are fixed-size or not, as long as 
there are suitable indexes.
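
To make the fixed-versus-indexed point concrete, here is a minimal Python sketch.  
The row contents, the field layout and the helper names (read_fixed, 
pack_variable, read_variable) are all made up for illustration; it is not any 
existing VO format.  With fixed-size records the position of row n is simple 
arithmetic; with variable-size records a list of byte offsets, built once, 
gives the same random access for reading.

```python
import struct

# Hypothetical catalogue rows: (identifier, RA in degrees, Dec in degrees).
rows = [(1, 10.684, 41.269), (2, 83.822, -5.391), (3, 201.365, -43.019)]

# Fixed-size records: every row packs to the same number of bytes,
# so row n starts at n * RECORD.size and no index is needed.
RECORD = struct.Struct("<idd")  # little-endian int32 + two float64, no padding
fixed = b"".join(RECORD.pack(*row) for row in rows)

def read_fixed(buf, n):
    """Return row n of a buffer of fixed-size records."""
    return RECORD.unpack_from(buf, n * RECORD.size)

# Variable-size records: each record carries a name of arbitrary length,
# so we keep an index of byte offsets to find record n directly.
HEADER = struct.Struct("<iH")  # identifier + length of the encoded name
names = ["M31", "Orion Nebula", "Cen A"]  # illustrative labels only

def pack_variable(ident, name):
    """Pack one variable-length record: header followed by the UTF-8 name."""
    encoded = name.encode("utf-8")
    return HEADER.pack(ident, len(encoded)) + encoded

chunks = [pack_variable(row[0], name) for row, name in zip(rows, names)]
index = []
offset = 0
for chunk in chunks:
    index.append(offset)      # byte offset at which this record starts
    offset += len(chunk)
variable = b"".join(chunks)

def read_variable(buf, index, n):
    """Return (identifier, name) for record n using the offset index."""
    start = index[n]
    ident, length = HEADER.unpack_from(buf, start)
    body = start + HEADER.size
    return ident, buf[body:body + length].decode("utf-8")
```

Reading is a single lookup either way, e.g. read_fixed(fixed, 2) or 
read_variable(variable, index, 1).  Editing is where the two differ: changing 
a name's length in the variable form means rewriting the offsets that follow it.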

I realise that changing a way of thinking from years of tradition can be 
difficult, and so it may be hard to think of catalogue data in forms other than 
tables.  We also don't want to lose existing skills, techniques and data by 
getting carried away with new shiny technology.  But at the same time we don't 
want to handicap our new shiny VO because we couldn't think outside these 
traditions *as well*.

So during the meanwhilst - can anyone suggest an email list for discussing 'data 
representations'?  This should include my favourite bee-in-the-bonnet topic, 
inter-service message formats.  Should we start a new one, or perhaps use 
interop at ivoa.net?

Cheers,

Martin

Quoting Ed Shaya <edward.j.shaya.1 at gsfc.nasa.gov>:

> Martin Hill wrote:
> 
> >
> > > Perhaps you want to drop the table view entirely?
> >
> > I don't want to lose the ability to pass generic tables of data about, 
> > or lose the current toolsets that work with VOTable - which is why I'm 
> > happy to see VOTable stay as it is - but I do want to drop it 
> > (entirely) for the *default* service data exchange format.
> >
>   Well, it is not surprising that there were not many takers for this 
> idea at the VOTable discussion group.  You may find more sympathizers 
> here at DM though.  Personally, I have always advocated passing a hybrid 
> that consists of an XML description of a table plus a file of either 
> fixed width ASCII or binary, perhaps packaged in SOAP or as an SMTP 
> message+attachment.  This is not totally at odds with VOTable schema, 
> although thus far application writers have been pushing the <TD> 
> option.  Now, as we have been discussing, one can incorporate much 
> better semantics and validation if there is a properly modeled view, or 
> layer, of the hybrid container.
>     Perhaps when the exchange is about a few objects then one can use the 
> model view directly.  But I think it is a given that when the number of 
> objects being discussed reaches into the many 1000s, as is typical in 
> astronomy, then we simply must switch over to the hybrid tabular 
> representation.
>   As for the model view, the basic concept of XML is to have information 
> bracketed by start and stop tags that are descriptive of the info. And 
> to allow subsections of this info to be tagged in a nested manner.  When 
> done properly a single XPath request finds the desired object and 
> retrieves the whole twig of nested relevant information.  Tables are 
> missing this property.  We absolutely need this capability to ensure 
> background information (aka metadata) is discoverable and in-depth.
>   Plain tables have served the human eye well for thousands of years 
> because they have always been supplemented with human readable text.  You 
> understand a table in a scientific article because you have read the 
> article.  If you have not read the article, you most likely do not 
> really understand the table.  Although we have no real substitute for 
> reading the literature, much analysis can be automated provided certain 
> key information is entered along with the tabulated numbers, but this 
> information does not neatly fit into canonical tables.  Hence the 
> tabular format must adapt to hold extra metadata about any cell.  It 
> would be very useful if we had a means of making a round trip from model 
> view to tabular view and back to model view with no loss of information.
>   If we can do that, then it does not really matter whether the 
> application writers use one or the other representation for I/O.  You 
> might feel that it is more straightforward for them to start using the 
> model view.  And I agree with you that they would lose no capabilities 
> if they did.  But purely on practical grounds of speed and memory usage, 
> they will probably always prefer the tabular way.
> 
> Cheers,
> Ed
> 
> 
> 
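
As an illustration of Ed's point about a single XPath request retrieving a 
whole twig, here is a small Python sketch using the standard library's 
ElementTree, which supports only a limited XPath subset.  The element names, 
attributes and values are invented for the example and are not taken from any 
VO data model.

```python
import xml.etree.ElementTree as ET

# A made-up "model view" document; every name and value is illustrative.
DOC = """
<catalogue>
  <galaxy name="M31">
    <position frame="ICRS">
      <ra unit="deg">10.684</ra>
      <dec unit="deg">41.269</dec>
    </position>
    <distance unit="kpc" method="example">770</distance>
  </galaxy>
  <galaxy name="Cen A">
    <position frame="ICRS">
      <ra unit="deg">201.365</ra>
      <dec unit="deg">-43.019</dec>
    </position>
  </galaxy>
</catalogue>
""".strip()

root = ET.fromstring(DOC)

# One path expression selects the object; the nested metadata comes with it.
twig = root.find(".//galaxy[@name='M31']")

ra = twig.find("position/ra")
distance = twig.find("distance")
print(ra.text, ra.get("unit"), twig.find("position").get("frame"))
print(distance.text, distance.get("unit"), distance.get("method"))
```

The units, frame and method travel with the values in the selected twig, 
which is the property a plain table row lacks.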

-- 
Martin Hill
07901 55 24 66
www.mchill.net