a recipe for crumpets

Thu Jan 29 06:16:42 PST 2004

It's best if we think of our data as relational, rather than hierarchical (and, 
as you say, certainly not tabular :-), as hierarchical is just a special case of 
relational.  If we take a sky catalogue for example, an object may be linked to 
from several points, especially if it is considered part of an 'artifact' (as I 
understand things like diffraction spikes may be called) or other group when 
dealing with multi-passband catalogues with different resolutions.  Ideally, we 
should try not to lose any of this relational information, however we represent 
it.  So I think we have a slightly harder problem than flattening/crumbling, but 
there are mechanisms for representing links/pointers in most data 
representations, so it's not impossible.  In fact our main difficulty comes from 
our normal data extraction process, SQL, because it flattens our data and we 
then need to unflatten it again, though normally into a different unflatness :-#.
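
To make the "linked to from several points" case concrete, here is a minimal 
sketch in Python.  The object ids, group names and coordinates are all invented 
for illustration; the only point is that one object can belong to several 
groups, and that flattening to rows (roughly what a SQL join hands back) keeps 
the links only if the ids travel with the rows.

```python
# Hypothetical catalogue: ids, group names and values are invented for illustration.
objects = {
    "obj1": {"ra": 10.684, "dec": 41.269},
    "obj2": {"ra": 10.690, "dec": 41.270},
}

# One object can be linked to from several groups, so this is a graph, not a tree.
groups = {
    "artifact_A": ["obj1", "obj2"],
    "blend_K_band": ["obj1"],
}


def groups_containing(object_id):
    """Return the sorted names of every group that links to the given object."""
    return sorted(name for name, members in groups.items() if object_id in members)


def flatten():
    """Flatten the links to (group, object_id) rows, much as a SQL join would."""
    return sorted((name, oid) for name, members in groups.items() for oid in members)


def unflatten(rows):
    """Rebuild the group -> members mapping from flat (group, object_id) rows."""
    regrouped = {}
    for name, oid in rows:
        regrouped.setdefault(name, []).append(oid)
    return regrouped
```

Here groups_containing("obj1") finds both groups, and unflatten(flatten()) gets 
back the original mapping because the ids were kept as the pointers.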

On the data types: A lot of our discussions here are about organising in our 
minds how astronomical data is arranged and structured, and how we can represent 
it.  Which is fine.  But there seems to be a tendency among astronomers (oooh 
generalisation coming up) to then put a layer between what is already available 
to us and the natural structure of the information (viz my whittering on about 
V2 on another thread!). I know historically this has been necessary, to create 
eg FITS, but nowadays there are many more tools available that we can use 
directly.  I'm saying this because while there is useful stuff here, the 
'typing' below restricts rather than enables (pardon my English).  I'm going to 
get a bit software engineery here, but generally speaking, on the level Ed was 
talking about below, there are only two/three different data types:

1) Primitives (What you call Atomic I think) ie integers, real numbers, strings 
and enumerations.

2) Data Objects/Records (What you call Groups), which are assemblies of 
Primitives and other Data Objects/Records.

3?) Lists might be considered a special case of Objects/Records with only one 
type of thing being grouped.

That's it.  We can go straight from there to building structures of these to 
represent our data (such as position, shape, passbands, etc), without 
restriction.  Generally speaking, then, any search statement will go 
O/O/O/../O/A, ie down through a chain of Objects to a Primitive (your Atomic).
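
As a rough illustration of those two/three types and the O/O/../O/A search 
path, here is a minimal Python sketch.  Everything in it (the Passband 
enumeration, the field names, the resolve helper) is hypothetical and only 
stands in for whatever structures we would actually agree on.

```python
from enum import Enum


class Passband(Enum):
    """Hypothetical enumeration, standing in for any enumerated primitive."""
    J = "J"
    K = "K"


# 1) Primitives: integers, real numbers, strings and enumerations.
PRIMITIVES = (int, float, str, Enum)

# 2) Records are assemblies of primitives and other records (dicts here);
# 3) lists group several things of one type.
source = {
    "position": {"ra": 10.684, "dec": 41.269},
    "passband": Passband.K,
    "detections": [{"flux": 1.2}, {"flux": 1.5}],
}


def resolve(record, path):
    """Walk an O/O/../A style path and return the primitive at its end."""
    node = record
    for step in path.strip("/").split("/"):
        # A list is indexed by position, a record by field name.
        node = node[int(step)] if isinstance(node, list) else node[step]
    if not isinstance(node, PRIMITIVES):
        raise TypeError("path ends on a record or list, not a primitive")
    return node
```

So resolve(source, "position/ra") walks one record and lands on a real number, 
while resolve(source, "position") is rejected because it stops on a record.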

(On a sidenote this is also why I'm not happy with the general concept of 
Quantity: either it's going to have to be all things to all people, or it will 
restrict the things we can represent.  It certainly puts a layer between the 
things we need to represent, such as position, and the primitives we would 
combine to do so).

I believe we've already covered at least in principle how to map between XML and 
existing databases on the votable list (see 
http://ivoa.net/forum/votable/0549.htm).  Automating the mapping process based 
on structure would be tricky - we want (well I want) common XML exchange formats 
for our data, but it is likely everyone's RDBMS datasets are in their own weird, 
er *individual* style.

If we're considering how we might *create* a database from a given XML document, 
say when uploading to a data warehouse, then we can map directly from 
object<->table and primitive<->cell.
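
A minimal sketch of that direct mapping, in Python with the standard sqlite3 
and xml.etree modules.  The document, element names and values are made up, and 
it assumes the easy case only: every object is flat, all objects of a type have 
the same children, and every primitive is stored as text.  Nested objects and 
pointers are exactly the part it leaves out.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical upload document; element names and values are illustrative only.
DOC = """
<Catalogue>
  <Star><name>Vega</name><ra>279.23</ra><dec>38.78</dec></Star>
  <Star><name>Deneb</name><ra>310.36</ra><dec>45.28</dec></Star>
</Catalogue>
"""


def load(xml_text):
    """Create one table per object type and one cell per primitive child."""
    conn = sqlite3.connect(":memory:")
    root = ET.fromstring(xml_text.strip())
    created = set()
    for obj in root:
        columns = [child.tag for child in obj]
        if obj.tag not in created:
            # object <-> table: the element name becomes the table name.
            col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
            conn.execute(f'CREATE TABLE "{obj.tag}" ({col_defs})')
            created.add(obj.tag)
        # primitive <-> cell: each child's text goes into its own column.
        col_names = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(
            f'INSERT INTO "{obj.tag}" ({col_names}) VALUES ({placeholders})',
            [child.text for child in obj],
        )
    return conn
```

After conn = load(DOC), an ordinary SELECT name, ra FROM Star works against 
the generated table.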

I will have a mull over how to map between pointers in databases and pointers in 
XML, but I suspect those of you (eg Ed?) with good archive experience can think 
of an elegant answer.

Cheers,

Martin

Ed Shaya wrote:

> We are trying to synthesize a number of requirements into a consistent 
> model.  We want to be able to make statements about very many different 
> types of objects using a vocabulary of terms from UCDs that is well over 
> 1300 in number (to which we will be adding many more, I bet).  We want 
> to be able to use XML tools, especially XPATH which then permits 
> XQuery.  We need a high level language to express queries independent of 
> any datacenter's organization.  We have extremely large quantities of 
> data that require the speed and compact size of relational databases.
> But, our knowledge is not simply 2-dimensional and so one wants to be 
> able to address the data as if it is hierarchical, even though the 
> internal storage and access MAY be relational.  This means that we 
> need clear rules for flattening and "crumbling".
> Start by noting that a record in a table is usually a list of Quantities 
> about some Object.  So we should have a clear way to identify in our XML 
> which elements are Objects and which are Properties, perhaps by 
> namespacing them.  Along the way we find that there are a few tricks to 
> designing the schemas so that one generates nicer tables and directions 
> for VOTable to develop.
> 
> O=O(id,P*)
> O are Objects.  Statements always begin with an O element.
> 
> Objects take P's, properties, which are of type A, G, M.
> 
> A=A(value,error,units,O*)
> A is an Atomic Quantity, an example is RA, and the child O's are Metadata.
> 
> G=G((A|G)*)
> This is a Group Property of A's, each A typically is different, an 
> example is position with several coordinates.  In fact each A requires a 
> bit of grouping to hold it together also, but I ignore that.
> 
> M=M(O*)
> This is a Membership Property that holds Objects. An example: globular 
> clusters have M=MemberStars, which holds many O=star.  It is probably 
> best if each M is constrained to a certain range of Object type.
> All of this is much like OWL-lite but I am paying special attention to 
> properties which take physical Objects as children.  The OWL 
> objectProperty is a property that takes an Instance, ie not a native 
> number.  We are now working a notch above OWL because our Quantities are 
> quite a bit richer than a common OWL property.
> 
> A basic example that conforms to O then P or M, M then O.
> Telescope
>     name
>     type
>    aperture size
>    location
>    PositionGroup
>             lat
>             long
>    M_hasInstruments
>          Instrument1
>                name
>                  ....
>          Instrument2
>                name
>                   ....
> /Telescope
> 
> We can incorporate an image into this (we may not want to, but it can be 
> done without stretching too far) by simply noticing that each pixel 
> mapped onto the sky is a region of the sky which is an Object.
> We may need to extend our id to include a position Group.
> So an image, spectrum, or time series is
> I=(O*,M)  The first O* is metadata and the M refers to a series of O(id,A)
> as in M=[O(spot1,A), O(spot2,A), O(spot3,A),...., O(spotN,A)]
> But, in this fancy image one can add additional information at any spot.
> So, one can easily add-in O(spot1,A/P1,A2/P2),O(spot2,A,M(O*)...), etc.  
> Why can we do this?
> Because it is XML and so you can do just about anything.
> 
> And in fact we can include spectra and time series in a similar way.  We 
> simply think about a region in coordinate space as an Object.
> 
> The path to any A Quantity starts with an O, passes through 0 or more 
> M/O, then ends with a series of G's and finally the A.  For instance:
> Xpath = /O/M/O/G/G/A
> represents a cluster of galaxies that M_hasGalaxies, and these have 
> velocities measured, and there are radial velocities, and one of them is 
> radio redshift.
> 
> Xpath 
> =/GalaxyCluster@id="343"/MemberStars/Star@id="2323"/Velocities/RadialVelocities/RadioCZ 
> 
> (Actually I am cheating a bit on the Xpath expression just for 
> explanation).
> 
> 
> There is a flattening algorithm that is wonderfully simple:
> At the top level one can make tables of each ObjectType.  Then, whenever 
> there is an M, each M becomes a table and the table id is the Xpath to M.
> So there is a table here:
> TableName='/GalaxyCluster@id="343"/MemberStars'
> In the top level table, each A is 3 or so columns (value, error, units), 
> but for an M property a single column contains the pointer to the 
> "MTable".
> 
> The table consists of stars in GalaxyCluster343 and has all of the A's 
> and G's of A's.
> On the unlikely chance that there are actually several MemberStars at 
> this point one needs to allow for a qualifier attribute.  It does not 
> modify the theory though because this is to be thought of as subclassing 
> the M.
> 
> One thing that I swept under the rug is the metadata in each A. These 
> can go into FIELD/Metadata.  But, if they differ from item to item then 
> we need a column whose cells take XML.  Also note that an Mtable in the 
> Metadata is a likely occurrence, so this has to be transformed into a 
> table and a pointer replaces it in the cell.
> As it turns out Norman Grey has just described how one adds extra 
> branches of XML info into VOTABLE (see the VOTable discussion list, 
> yesterday!).
> 
> My conclusion is that one gets a wonderfully simple but powerful 
> mechanism if one can identify XML elements as one of type O, A, G, or 
> M.  Actually O and A can be detected simply by position.  It is the 
> M element that is difficult to distinguish (for the computer, that is) 
> from A.  So we could name these special properties starting with M: or 
> M_ or whatever.
> 
> This all follows from simply noting that a table is confined to a 
> /O/G/A or /O/G/G/A (or can be cast into this) but that these may be 
> incorporated into a hierarchical pattern by linking properties, M's.
> 
> IF this works, it would mean that with a little bit of simple code to 
> flatten and crumble and to convert XPATH into SQL, any relational 
> database can become an XML ORDB. The price is that schemas need to follow 
> a few rules.
> 
> Ed
> 
> 
> 
> 

-- 
Martin Hill
Software Engineer
AstroGrid @ ROE
Tel: +44 7901 55 24 66
www.astrogrid.org