Definitive version of the VOTable schema for web services

Dave Morris dave at ast.cam.ac.uk
Wed Jul 30 08:34:58 PDT 2008


As Guy pointed out, putting the data in the SOAP response causes enough 
problems for the software experts, and is setting a trap for astronomers 
who just want to write a simple program to connect to a service and get 
at the data.

One of the drivers for the VO was to create tools that astronomers could 
use to access and process data from large data sets without requiring 
expert programming knowledge. Relying on Moore's law is not an option. 
However much memory you can pack into a laptop/desktop, it will not be 
able to keep up with the rapidly growing data sets held by the data 
archives. It is not beyond imagining that a valid science query to a 
large data set could return 21G bytes of data.

One of the reasons for using SOAP/WSDL is that non-expert programmers 
should be able to use a generic toolkit to connect to a service and 
process the response automagically, based on the structure defined in 
the service WSDL.
If the WSDL defines the data as a string, then the toolkit will treat it 
as just that - a single string. So if Anita used a Python SOAP library 
to call one of our services, it would hand back the results as a single 
string .... all 21G bytes of it, in one large anonymous BLOB, probably 
melting her laptop in the process. In which case, we might as well 
return the results as base64 encoded FITS and be done with it.

One of the advantages of using XML is that it should be possible to 
process it as a stream of elements, allowing the client to process the 
data one row at a time.
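
As a rough sketch of what row-at-a-time processing could look like from 
Python, the standard library's xml.etree.ElementTree.iterparse can walk a 
document incrementally. This assumes the rows arrive in VOTable's TABLEDATA 
form (TR rows holding TD cells). The name iter_rows is made up for this 
example and is not part of any existing VO package.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield each table row as a list of cell strings, one row at a time.

    source is a filename or a binary file-like object.
    Hypothetical helper; assumes TABLEDATA/TR/TD markup.
    """
    table = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        # Strip any XML namespace so "{ns}TR" and "TR" both match.
        tag = elem.tag.rsplit("}", 1)[-1]
        if event == "start" and tag == "TABLEDATA":
            # Remember the container so finished rows can be dropped from it.
            table = elem
        elif event == "end" and tag == "TR":
            row = [td.text or "" for td in elem]
            # Detach the finished row from the tree so it can be freed.
            if table is not None:
                table.clear()
            else:
                elem.clear()
            yield row


# Tiny in-memory stand-in for a service response.
sample = b"""<VOTABLE><RESOURCE><TABLE><DATA><TABLEDATA>
<TR><TD>1</TD><TD>a</TD></TR>
<TR><TD>2</TD><TD>b</TD></TR>
</TABLEDATA></DATA></TABLE></RESOURCE></VOTABLE>"""

rows = list(iter_rows(io.BytesIO(sample)))
```

Only one finished row is held at a time, so memory use should not grow 
with the number of rows, provided the caller does not collect them all 
into a list as the toy usage above does.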

Ray mentioned avoiding the WSDL generated classes and treating the 
response as a document. Using the SOAP toolkit to return the contents as 
an array of DOM elements would still mean that the client would have to 
hold all 21G bytes of data in memory, albeit split into a tree of 
thousands of tiny DOM elements.  However, the more recent SOAP toolkits 
(e.g. Axis2 and XFire in Java) can process the XML one element at a 
time, without building the entire tree. These tools would allow the 
client to process the data one row at a time, without holding the entire 
data set in memory.

Matthew mentioned using XSLT to process the response. A good example of 
this would be an XSLT processor that parsed the data one row at a time, 
kept the few 'interesting' rows and threw away the rest. If data 
contained one 'interesting' row in a thousand, a simple parser on an 
ordinary laptop could process the 21G byte stream and return the 21M 
bytes of 'interesting' rows (network bandwidth allowing).
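
The arithmetic is simply 21G bytes / 1000 = 21M bytes. The same kind of 
filter can be sketched in Python instead of XSLT (a swap made here purely 
for illustration). The names keep_interesting and fake_rows are 
hypothetical, and fake_rows only stands in for a real row stream.

```python
def keep_interesting(rows, is_interesting):
    """Lazily yield only the rows for which is_interesting(row) is true."""
    return (row for row in rows if is_interesting(row))


def fake_rows(n):
    """Stand-in for a streamed response: n rows, generated one at a time."""
    for i in range(n):
        yield [str(i), str(i % 1000)]


# Keep one row in a thousand; no more than one row is in memory at once.
kept = keep_interesting(fake_rows(100_000), lambda row: row[1] == "0")
count = sum(1 for _ in kept)  # 100 of the 100,000 rows survive
```

Because both the source and the filter are generators, nothing is read 
until the caller asks for the next row, and rejected rows are discarded 
immediately.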

Whatever we replace/update VOTable with, it should be easy to process the 
service response as a stream of rows, without requiring the client to 
hold the entire data set in memory. If not for the 
programmer/astronomers, then as a software developer I know it would 
make my life a lot easier.
My job is to give Anita a Python package she can use in her scripts that 
calls a service, processes the results, and returns a simple Python 
object with two methods, hasNext() and getNext(). Without requiring her 
to upgrade her laptop with 30G of memory and a multi-core CPU.
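
A minimal sketch of such an object, assuming the rows come from any 
Python iterable or generator. The class name RowCursor is invented for 
this example; only hasNext() and getNext() come from the description 
above.

```python
class RowCursor:
    """Wrap any iterable of rows behind hasNext()/getNext().

    Hypothetical class; buffers at most one row of look-ahead.
    """

    _DONE = object()  # sentinel marking the end of the stream

    def __init__(self, rows):
        self._rows = iter(rows)
        # Read one row ahead so hasNext() can answer without consuming it.
        self._pending = next(self._rows, self._DONE)

    def hasNext(self):
        return self._pending is not self._DONE

    def getNext(self):
        if self._pending is self._DONE:
            raise IndexError("no more rows")
        row = self._pending
        self._pending = next(self._rows, self._DONE)
        return row


# Usage with a small generator standing in for a streamed service response.
cursor = RowCursor([str(i), str(i * i)] for i in range(3))
seen = []
while cursor.hasNext():
    seen.append(cursor.getNext())
```

The look-ahead means the first row is read when the cursor is created, 
which seems an acceptable price for a hasNext() that never blocks on 
parsing more than one row.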

Dave

Anita M. S. Richards wrote:

>
> On Wed, 30 Jul 2008, Ray Plante wrote:
>
>>> astronomer
>>
>>
>> I believe it was implicit in the discussion that by "astronomer" we 
>> meant the "scripting astronomer", one who has enough scripting 
>> ability to use, say, a Python module to access a web service.
>>
>> cheers,
>> Ray
>>
>
> That is fine for e.g. the VO expert in a large project or students 
> whose project has a major fraction working on these sorts of data, 
> but it excludes the majority of astronomers; whereas most astronomers 
> do use VOTables although most are not aware of it.  If the minority 
> are your target audience, fine.  Regarding the list of packages from 
> fortran to IDL... most astronomers will learn one or two, they will 
> _not_ learn the whole lot.  Currently, to make best use of VOs, people 
> need a bit of SQL plus _one_ scripting language out of python, perl or 
> IDL, in most cases. That is about as much as we can expect.
>
>
> cheers
> a



