Collaboration on Source Catalogue DM, ADQL and SkyNodes

Thu Dec 22 05:39:50 PST 2005

Hi,

I'm on "vacations" with a very limited internet connection, so I have tried
to summarize my comments in one single mail. After Dec 28th I will
be back on a fast link in case my comments generate rivers of bits that I
cannot respond to :-)

Cheers

Maria

Legend:

M - Maria
P - Pedro
MT - Mark Taylor
C - Clive

>M - What does the Catalogue Data Model in use look like, especially the
>M   common set of attributes and the associated metadata?

>P The point is in the (Source) Catalogue Data Model, with emphasis on the
>P "Source" part. This one is the one I showed on behalf of the Catalogue
>P DM subgroup at our last interop meeting here at ESAC. I attach a pdf
>P with the initial proposal, but please use it only for temporary
>P reference, as the whole document will be changed (according to
>P requirements from Jonathan after the interop meeting).

>M Unfortunately, I'm on a dial-up connection and I cannot get the 6.6MB
>M pdf, but from Patricio's email and what I remember from the last IVOA
>M I can imagine.
>M Being more specific, what I am interested in knowing is how the mapping
>M "original catalog - SCDM" is done in its two aspects: scientific and
>M technical.

>M By scientific I mean: How did you map USNOB and Tycho-2 columns into
>M the model?
>M I'm very interested in seeing this mapping. This is the very first step
>M towards mechanisms that allow for common queries. If all columns are
>M called the same and represent the same thing, running engines asking
>M the same ADQL question is trivial.

>M By technical: Do the original catalogs remain the same and you compute
>M the new columns on the fly? I assume some "original-model" relationships
>M will not be direct. I personally would create new columns and
>M pre-compute the transformations to make things faster, but probably not
>M all catalog providers are willing to do so.
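
To make the "original catalog - SCDM" mapping question concrete, here is a minimal sketch of a per-catalogue mapping table with optional unit conversion, applied once up front rather than on the fly. Every column name and the milliarcsecond-to-degree conversion are assumptions made for illustration only; none of them is taken from the SCDM proposal or from the actual USNOB or Tycho-2 schemas.

```python
# Hypothetical mapping: native column -> (model column, conversion function).
# All names and units below are placeholders, not the real SCDM or USNOB ones.
NATIVE_TO_MODEL = {
    "RAJ2000": ("ra", lambda v: v),
    "DEJ2000": ("dec", lambda v: v),
    "e_RAJ2000": ("ra_err", lambda v: v / 3.6e6),   # assumed mas -> degrees
    "e_DEJ2000": ("dec_err", lambda v: v / 3.6e6),  # assumed mas -> degrees
}


def to_model(row, mapping):
    """Return a new dict holding only the model columns for one native row."""
    return {new: convert(row[old]) for old, (new, convert) in mapping.items()}


def precompute(rows, mapping):
    """Materialise the model columns once, instead of converting per query."""
    return [to_model(row, mapping) for row in rows]
```

Computing on the fly would call to_model inside every query; precomputing stores the result of precompute as new columns, which costs storage but keeps queries simple.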

>M - What are the plans about registration? Will these nodes (Basic?) be
>M   registered and therefore accessible through Open SkyQuery? How many?

>P yes, they will. How many, I don't know. In Strasbourg, Inaki and
>P Aurelien worked on a couple of them, Tycho-2 and USNOB, but the CDS
>P colleagues will work on more.... Francois will answer this question
>P at some point I presume.

>M This is good but brings two issues:

>M - 1) If many Basic SkyNodes are going to be registered, we need to plan 
>M how to do it.

>M - 2) We would have a second USNOB SkyNode which is not exactly the same
>M USNOB as the one currently working.

>M Both issues, how to deal with many SkyNodes and how to deal with
>M "mirrors", have been "avoided", but it is about time we start attacking
>M the problem.

>P n-catalogue cross-match is what we are trying to get at; it will be a
>P client based cross-match, and therefore the cross-match function will
>P be designed and run at the client side (i.e., servers do not need to
>P worry about implementing one specific cross-match or the other).
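
As an illustration of what a cross-match function run at the client side could look like, here is a minimal positional nearest-neighbour sketch. It is not the function Pedro's group designed and it is not the STILTS algorithm; the tuple layout (id, ra, dec) and the declination-band bucketing are assumptions chosen to keep the example short.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(left, right, radius_deg):
    """Pair each left source with its nearest right source within radius_deg.

    left, right: lists of (id, ra, dec) tuples, coordinates in degrees.
    Right sources are bucketed into declination bands of height radius_deg,
    so only the three neighbouring bands are searched for each left source.
    """
    bands = {}
    for rid, ra, dec in right:
        bands.setdefault(math.floor(dec / radius_deg), []).append((rid, ra, dec))

    matches = []
    for lid, ra, dec in left:
        band = math.floor(dec / radius_deg)
        best = None
        for b in (band - 1, band, band + 1):
            for rid, rra, rdec in bands.get(b, []):
                sep = angular_sep_deg(ra, dec, rra, rdec)
                if sep <= radius_deg and (best is None or sep < best[1]):
                    best = (rid, sep)
        if best is not None:
            matches.append((lid, best[0], best[1]))
    return matches
```

Because the separation is never smaller than the declination difference, any pair within radius_deg lies in the same or an adjacent band, so the bucketing does not lose matches. It only helps if the server-side rows are already at the client, which is exactly the cost discussed further down.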

>M The client based cross-match is a good idea. You cannot be disappointed
>M with your own specific cross-match. However, I wonder what the plan is
>M to cross-match your own "big" source catalog (let's say 700.000 rows,
>M as Mark mentions) against the 1000 million rows of USNOB (if I remember
>M correctly).
>M If your objects are in a region, I can see making 1 query, or a few, and
>M getting all objects inside the region, but without that ... I hope the
>M idea is not to make 700.000 ADQL queries.
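
The "one query per region" alternative can be sketched as follows: compute a bounding box around the client's sources and build a single SQL-style query string. The table name usnob and the column names ra and dec are hypothetical, the WHERE clause uses only plain BETWEEN comparisons rather than any ADQL region construct, and the sketch ignores RA wrap-around at 0/360 degrees and does not widen the RA margin by 1/cos(dec).

```python
def bounding_box(sources):
    """Return (ra_min, ra_max, dec_min, dec_max) for (ra, dec) pairs in degrees.

    Simplification: ignores RA wrap-around at 0/360 and behaviour near the poles.
    """
    ras = [ra for ra, _ in sources]
    decs = [dec for _, dec in sources]
    return min(ras), max(ras), min(decs), max(decs)


def region_query(sources, margin_deg, table="usnob", ra_col="ra", dec_col="dec"):
    """Build one SQL-style query covering every source plus a search margin."""
    ra_min, ra_max, dec_min, dec_max = bounding_box(sources)
    return (
        f"SELECT * FROM {table} "
        f"WHERE {ra_col} BETWEEN {ra_min - margin_deg:.6f} AND {ra_max + margin_deg:.6f} "
        f"AND {dec_col} BETWEEN {dec_min - margin_deg:.6f} AND {dec_max + margin_deg:.6f}"
    )
```

This replaces N per-object queries with one, but only pays off when the sources are clustered; for sources scattered over the whole sky the box degenerates into the full catalogue.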

>P At the current status, the client sends an ADQL to the server to
>P discern which type of cross-match it can do with it (whether only
>P positional, positional with errors, etc.), and takes the corresponding
>P action.

>M Let's see, ADQL is the language. In principle, an ADQL query will not
>M tell you what cross-match can be performed. You can use ADQL to gather
>M the information you are thinking of, like ra, dec, ra_err, dec_err, only
>M if the SkyNodes (databases) contain tables with this type of metadata.
>M I hope the proposal to make this mandatory is successful and publishers
>M actually follow it. In any case, what is mandatory are the Tables and
>M Columns methods, which should give you this information, but that is
>M not ADQL. It is a call to a Web service interface.

>MT STILTS provides this functionality from a command-line
>MT tool (tmatch2), but a public java API is also available for 
>MT programs that want access to it within a JVM.

>M What would be worth a try is using Mark's library to set up a server
>M that does the cross-match when providers don't want to use a DBMS,
>M because as Clive mentioned "if the data are already in a relational DBMS
>M then by far the simplest way to do the cross-match, and in many cases
>M also the fastest, is to use R-tree indexing and a spatial join."

>M I will not get into the R-tree indexing, HTM, Zones, Healpix debate now,
>M but without question, if the data is already in a database then it will
>M probably be less of a burden for the system to do the job than to answer
>M millions of individual queries. This is the MyDATA SkyNode approach
>M which, putting aside the problem of uploading big tables, is much more
>M efficient.

>M However, I'm kind of interested (probably, easier than working on
>M writing my thesis ;-)) in this other debate

>C Support for spatial indexing is now included in or readily available
>C for DB2, Oracle, Informix, Sybase, MySQL, and Postgres, i.e. just about
>C all the DBMS widely used in astronomy (with perhaps just one exception,
>C which Jim can tell you about :-).

>M It would be nice to know what exactly "widely" means.
>M So I volunteer to keep an inventory (catalog :-) ) with information
>M about

> Catalog Name, Access point (URL), Default position, DBMS, Host
> Organization

>M This could give us an excellent test bed to compare data access and
>M cross-match functionalities provided by different DBMS and
>M organizations.
>M So if you guys send me a list with those 5 data points,
>M I will collect the information and make it public. Since I'm a database
>M girl, please send me a file in CSV format if you have many catalogs and
>M I will import the data into a database.
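
For anyone preparing such a file, here is a minimal sketch of the import step using Python's standard csv and sqlite3 modules. The table name inventory, the snake_case column names and the assumption that the file has no header row are all illustrative choices, not an agreed format; SQLite simply stands in for whatever DBMS ends up hosting the list.

```python
import csv
import io
import sqlite3

# Column names derived from the fields listed above; the spelling is assumed.
FIELDS = ["catalog_name", "access_url", "default_position", "dbms",
          "host_organization"]


def load_inventory(csv_text, conn):
    """Create the inventory table if needed and insert one row per CSV line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inventory ("
        + ", ".join(f"{name} TEXT" for name in FIELDS) + ")"
    )
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=FIELDS)
    rows = [tuple(record[name] for name in FIELDS) for record in reader]
    conn.executemany(
        "INSERT INTO inventory VALUES (" + ", ".join("?" for _ in FIELDS) + ")",
        rows,
    )
    conn.commit()
    return len(rows)
```

A default position that itself contains a comma (for example an "ra,dec" pair) needs to be quoted in the file so that it stays in a single field.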

>J But, getting objects into a node dominates all other costs (moving 
>J stuff thru xml is expensive).

>C Indeed that is a very serious problem.  I wonder if we can't solve this
>C by using, instead of XML, some more efficient data format, e.g. one
>C which holds tabular data in binary form with just the metadata in plain
>C text. 
>C There's something called the "FITS table" with exactly these properties
>C which perhaps astronomers should investigate :-)
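
The size argument behind Clive's suggestion can be checked with a small sketch that serialises the same (ra, dec) pairs once as XML-style text and once as packed binary. This is not a FITS writer and not a real VOTable serialiser: the tag names are illustrative and the binary layout is simply two big-endian 64-bit floats per row.

```python
import struct


def as_xml(rows):
    """Serialise (ra, dec) pairs as simple XML table rows (illustrative tags)."""
    body = "".join(f"<TR><TD>{ra!r}</TD><TD>{dec!r}</TD></TR>" for ra, dec in rows)
    return f"<TABLEDATA>{body}</TABLEDATA>".encode("ascii")


def as_binary(rows):
    """Pack the same pairs as big-endian 64-bit floats, 16 bytes per row."""
    return b"".join(struct.pack(">dd", ra, dec) for ra, dec in rows)


def from_binary(blob):
    """Recover the (ra, dec) pairs from the packed representation."""
    return [struct.unpack_from(">dd", blob, offset)
            for offset in range(0, len(blob), 16)]
```

Comparing len(as_xml(rows)) with len(as_binary(rows)) for a few thousand rows shows the markup overhead directly; parsing cost, which is usually the larger part of the problem, is not measured here.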

>M I do agree something needs to be done about this as well.

-- 
------------------------------------------------
Maria A. Nieto-Santisteban (nieto at pha.jhu.edu)
Johns Hopkins University
3400 N. Charles St.
Physics & Astronomy Department
Baltimore, MD 21218 (USA)

Tel: 	1 410 516-7679  Fax: 	1 410 516-5096