Definitive version of the VOTable schema for web services

Linde, A.E. ael13 at leicester.ac.uk
Wed Jul 30 12:03:01 PDT 2008


I think one of the failures of the whole VO effort has been the inability to take processing of huge datasets off the user's desktop/laptop and onto servers where it could be performed more efficiently. We've developed great ways of getting results from a diverse range of databases but it all still comes back to the user. Maybe the next phase of the VO ought to focus more on this issue.

t.
________________________________________
From: Dave Morris [dave at ast.cam.ac.uk]
Sent: 30 July 2008 16:34
To: Anita M. S. Richards
Cc: Grid_Ivoa_List; IVOA VOTable
Subject: Re: Definitive version of the VOTable schema for web services

As Guy pointed out, putting the data in the SOAP response causes enough
problems for the software experts, and is setting a trap for astronomers
who just want to write a simple program to connect to a service and get
at the data.

One of the drivers for the VO was to create tools that astronomers could
use to access and process data from large data sets without requiring
expert programming knowledge. Relying on Moore's law is not an option.
However much memory you can pack into a laptop/desktop, it will not be
able to keep up with the rapidly growing data sets held by the data
archives. It is not beyond imagining that a valid science query to a
large data set could return 21G bytes of data.

One of the reasons for using SOAP/WSDL is that non-expert programmers
should be able to use a generic toolkit to connect to a service and
process the response automagically, based on the structure defined in
the service WSDL.
If the WSDL defines the data as a string, then the toolkit will treat it
as just that - a single string. So if Anita used a Python SOAP library
to call one of our services, it would hand back the results as a single
string .... all 21G bytes of it, in one large anonymous BLOB, probably
melting her laptop in the process. In which case, we might as well
return the results as base64 encoded FITS and be done with it.

One of the advantages of using XML is that it should be possible to
process it as a stream of elements, allowing the client to process the
data one row at a time.
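
As a rough illustration of row-at-a-time processing, here is a minimal
Python sketch using the standard library's incremental XML parser. It
assumes the table arrives in the TABLEDATA serialization (TR/TD
elements); the name iter_rows and the SAMPLE document are made up for
the example and are not part of any VO toolkit.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield each TR row as a list of cell strings, one row at a time.

    `source` is a file-like object (or filename) holding a VOTable that
    uses the TABLEDATA serialization. Hypothetical helper, for illustration.
    """
    tabledata = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        # Strip any XML namespace so the tag test works with or without one.
        tag = elem.tag.rsplit("}", 1)[-1]
        if event == "start":
            if tag == "TABLEDATA":
                tabledata = elem  # remember the parent of the rows
        elif tag == "TR":
            cells = [td.text for td in elem]
            if tabledata is not None:
                # Detach rows already seen so the tree does not keep growing.
                tabledata.clear()
            yield cells


# Tiny made-up document standing in for a (much larger) service response.
SAMPLE = b"""<VOTABLE><RESOURCE><TABLE>
<FIELD name="id" datatype="int"/><FIELD name="mag" datatype="float"/>
<DATA><TABLEDATA>
<TR><TD>1</TD><TD>12.5</TD></TR>
<TR><TD>2</TD><TD>19.1</TD></TR>
</TABLEDATA></DATA></TABLE></RESOURCE></VOTABLE>"""

for row in iter_rows(io.BytesIO(SAMPLE)):
    print(row)
```

The caller only ever sees one row at a time, so memory use stays
roughly constant however many rows the response contains.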

Ray mentioned avoiding the WSDL generated classes and treating the
response as a document. Using the SOAP toolkit to return the contents as
an array of DOM elements would still mean that the client would have to
hold all 21G bytes of data in memory, albeit split into a tree of
thousands of tiny DOM elements.  However, the more recent SOAP toolkits
(e.g. Axis2 and XFire in Java) can process the XML one element at a
time, without building the entire tree. These tools would allow the
client to process the data one row at a time, without holding the entire
data set in memory.

Matthew mentioned using XSLT to process the response. A good example of
this would be an XSLT processor that parsed the data one row at a time,
kept the few 'interesting' rows and threw away the rest. If data
contained one 'interesting' row in a thousand, a simple parser on an
ordinary laptop could process the 21G byte stream and return the 21M
bytes of 'interesting' rows (network bandwidth allowing).
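
The same keep-one-in-a-thousand filter can be sketched in Python
rather than XSLT (a deliberate swap, to stay in the scripting language
mentioned elsewhere in this thread). The names interesting_rows and
make_sample are hypothetical, the data is synthetic, and the
TABLEDATA serialization is assumed.

```python
import io
import xml.etree.ElementTree as ET


def interesting_rows(source, keep):
    """Stream TR rows from `source`, yielding only those where keep(cells) is true."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] != "TR":
            continue
        cells = [td.text for td in elem]
        # Free the cells. The empty TR shell stays in the tree, so a
        # production version would also detach it from its parent.
        elem.clear()
        if keep(cells):
            yield cells


def make_sample(n_rows):
    """Build a synthetic VOTable-like document with n_rows rows (illustration only)."""
    rows = "".join(
        f"<TR><TD>{i}</TD><TD>{i * 0.5}</TD></TR>" for i in range(1, n_rows + 1)
    )
    doc = (
        "<VOTABLE><RESOURCE><TABLE><DATA><TABLEDATA>"
        + rows
        + "</TABLEDATA></DATA></TABLE></RESOURCE></VOTABLE>"
    )
    return doc.encode()


# Keep one row in a thousand: here, rows whose first column is a multiple of 1000.
kept = list(
    interesting_rows(
        io.BytesIO(make_sample(3000)),
        lambda cells: int(cells[0]) % 1000 == 0,
    )
)
print(len(kept))
```

Only the rows that pass the test are ever collected, which is what
turns a 21G byte stream into roughly 21M bytes of results.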

Whatever we replace/update VOTable with, it should be easy to process the
service response as a stream of rows, without requiring the client to
hold the entire data set in memory. If not for the
programmer/astronomers, then as a software developer I know it would
make my life a lot easier.
My job is to give Anita a Python package she can use in her scripts that
calls a service, processes the results, and returns a simple Python
object with two methods, hasNext() and getNext(). Without requiring her
to upgrade her laptop with 30G of memory and a multi-core CPU.
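
A minimal sketch of what such an object could look like, assuming the
rows come from some lazy iterator (for example a streaming parser).
The class name RowCursor is invented for this example; only the two
method names come from the description above.

```python
class RowCursor:
    """Wrap any iterator of rows behind hasNext() / getNext().

    Hypothetical class for illustration. It holds at most one
    look-ahead row, so the underlying stream is never read in full.
    """

    _END = object()  # sentinel marking an exhausted stream

    def __init__(self, rows):
        self._rows = iter(rows)
        self._pending = self._END
        self._loaded = False  # True once a look-ahead has been fetched

    def hasNext(self):
        """Return True if another row is available."""
        if not self._loaded:
            self._pending = next(self._rows, self._END)
            self._loaded = True
        return self._pending is not self._END

    def getNext(self):
        """Return the next row, or raise StopIteration if there is none."""
        if not self.hasNext():
            raise StopIteration("no more rows")
        row = self._pending
        self._loaded = False
        return row


# Usage with a made-up generator standing in for a streamed service response.
def fake_service_rows():
    yield ["1", "12.5"]
    yield ["2", "19.1"]


cursor = RowCursor(fake_service_rows())
while cursor.hasNext():
    print(cursor.getNext())
```

Because the wrapper only pulls one row ahead, the memory cost is set by
the size of a row, not by the size of the result set.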

Dave

Anita M. S. Richards wrote:

>
> On Wed, 30 Jul 2008, Ray Plante wrote:
>
>>> astronomer
>>
>>
>> I believe it was implicit in the discussion that by "astronomer" we
>> meant the "scripting astronomer", one who has enough scripting
>> ability to use, say, a Python module to access a web service.
>>
>> cheers,
>> Ray
>>
>
> That is fine for e.g. the VO expert in a large project or students
> whose project has a major fraction working  on these sorts of data,
> but it excludes the majority of astronomers; whereas most astronomers
> do use VOTables although most are not aware of it.  If the minority
> are your target audience, fine.  Regarding the list of packages from
> fortran to IDL... most astronomers will learn one or two, they will
> _not_ learn the whole lot.  Currently, to make best use of VOs, people
> need a bit of SQL plus _one_ scripting language out of python, perl or
> IDL, in most cases. That is about as much as we can expect.
>
>
> cheers
> a



