a recipe for crumpets
Martin Hill
mchill at dial.pipex.com
Thu Jan 29 06:16:42 PST 2004
It's best if we think of our data as relational, rather than hierarchical (and
as you say certainly not tabular :-), as hierarchical is just a special case of
relational. If we take a sky catalogue for example, an object may be linked to
from several points, especially if it is considered part of an 'artifact' (as I
understand things like diffraction spikes may be called) or other group when
dealing with multi-passband catalogues with different resolutions. Ideally, we
should try not to lose any of this relational information, however we represent
it. So I think we have a slightly harder problem than flattening/crumbling, but
there are mechanisms for representing links/pointers in most data
representations, so it's not impossible. In fact our main difficulty comes from
our normal data extraction process, SQL, because it flattens our data and we
then need to unflatten it again, though normally into a different unflatness :-#.
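To make the flatten/unflatten point concrete, here is a minimal Python sketch. The row layout, the identifiers (spike_7, blend_3, obj_1, obj_2) and the unflatten() helper are all made up for illustration; they are not from any real catalogue or schema. It takes rows as a SQL join might return them and rebuilds the links, so that an object referenced from two groups is one shared record rather than two copies.

```python
from collections import defaultdict

# Hypothetical rows as a SQL join might return them: one row per
# (group, object) link, with the object's columns repeated on every row.
rows = [
    ("spike_7", "obj_1", 10.1),
    ("spike_7", "obj_2", 10.3),
    ("blend_3", "obj_2", 10.3),
]


def unflatten(rows):
    """Rebuild shared objects and group membership from flat join rows."""
    objects = {}                 # object id -> single shared record
    groups = defaultdict(list)   # group id -> list of member records
    for group_id, object_id, ra in rows:
        # Reuse the record if this object was already seen in another group.
        obj = objects.setdefault(object_id, {"id": object_id, "ra": ra})
        groups[group_id].append(obj)
    return objects, dict(groups)


objects, groups = unflatten(rows)
```

Here obj_2 appears in both groups but exists only once in objects, which is the relational information the flat rows hide.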
On the data types: A lot of our discussions here are about organising in our
minds how astronomical data is arranged and structured, and how we can represent
it. Which is fine. But there seems to be a tendency among astronomers (oooh
generalisation coming up) to then put a layer between what is already available
to us and the natural structure of the information (viz my whittering on about
V2 on another thread!). I know historically this has been necessary, to create
eg FITS, but nowadays there are many more tools available that we can use
directly. I'm saying this because while there is useful stuff here, the
'typing' below restricts rather than enables (pardon my English). I'm going to
get a bit software engineery here, but generally speaking, on the level Ed was
talking about below, there are only two/three different data types:
1) Primitives (What you call Atomic I think) ie integers, real numbers, strings
and enumerations.
2) Data Objects/Records (What you call Groups), which are assemblies of
Primitives and other Data Objects/Records.
3?) Lists might be considered a special case of Objects/Records with only one
type of thing being grouped.
That's it. We can go straight from there to building structures of these to
represent our data (such as position, shape, passbands, etc), without
restriction. Generally speaking then any search statement will go O/O/O/../O/A
etc.
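As a rough illustration of the two/three types above, here is a small Python sketch. The names Record, Value and lookup(), and the example star data, are my own inventions for this note rather than any agreed data model. It shows primitives, records built from primitives and other records, lists of one kind of thing, and a lookup that follows an O/O/../A style path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Primitive values: integers, real numbers, strings (enumerations could be
# modelled with enum.Enum; omitted here for brevity).
Primitive = Union[int, float, str]


@dataclass
class Record:
    """A data object: named fields holding primitives, records or lists."""
    fields: Dict[str, "Value"] = field(default_factory=dict)


# A list groups any number of values of one kind.
Value = Union[Primitive, Record, List["Value"]]


def lookup(root: Record, path: str) -> Value:
    """Follow a slash-separated O/O/.../A path from a root record."""
    node: Value = root
    for step in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(step)]      # list members are addressed by index
        elif isinstance(node, Record):
            node = node.fields[step]
        else:
            raise KeyError(f"cannot descend into primitive at {step!r}")
    return node


# Hypothetical example data.
star = Record({
    "position": Record({"ra": 10.684, "dec": 41.269}),
    "passbands": [Record({"name": "V", "mag": 3.44})],
})
```

With this, lookup(star, "position/ra") walks record then primitive, and lookup(star, "passbands/0/name") walks record, list, record, primitive.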
(On a sidenote this is also why I'm not happy with the general concept of
Quantity: either it's going to have to be all things to all people, or it will
restrict the things we can represent. It certainly puts a layer between the
things we need to represent, such as position, and the primitives we would
combine to do so).
I believe we've already covered at least in principle how to map between XML and
existing databases on the votable list (see
http://ivoa.net/forum/votable/0549.htm). Automating the mapping process based
on structure would be tricky - we want (well I want) common XML exchange formats
for our data, but it is likely everyone's RDBMS datasets are in their own weird,
er *individual* style.
If we're considering how we might *create* a database from a given XML document,
say when uploading to a data warehouse, then we can map directly from
object<->table and primitive<->cell.
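A minimal sketch of that direct mapping, in Python with the standard xml.etree.ElementTree and sqlite3 modules. The document, the element names and the collect()/load() helpers are hypothetical, and the id/parent_id columns are my own assumption about how to keep the nesting; a leaf element actually called id or parent_id would clash with them. Elements with children become rows in a table named after the element, and leaf elements become cells.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical input document; element names are illustrative only.
DOC = """
<Telescope>
  <name>Example Telescope</name>
  <aperture>4.2</aperture>
  <Instrument><name>CamA</name></Instrument>
  <Instrument><name>SpecB</name></Instrument>
</Telescope>
"""


def collect(elem, parent_id, rows, counter):
    """Turn an element with children into a row; recurse into child objects."""
    counter[0] += 1
    row_id = counter[0]
    row = {"id": row_id, "parent_id": parent_id}
    for child in elem:
        if len(child):                       # has children: another object
            collect(child, row_id, rows, counter)
        else:                                # leaf: a primitive, so a cell
            row[child.tag] = (child.text or "").strip()
    rows[elem.tag].append(row)


def load(xml_text):
    """Create one in-memory SQLite table per object type found in the XML."""
    rows = defaultdict(list)
    collect(ET.fromstring(xml_text), None, rows, [0])
    db = sqlite3.connect(":memory:")
    for table, table_rows in rows.items():
        # Columns are the union of all cells seen for this object type.
        cols = sorted({c for r in table_rows for c in r})
        quoted = ", ".join(f'"{c}"' for c in cols)
        db.execute(f'CREATE TABLE "{table}" ({quoted})')
        marks = ", ".join("?" for _ in cols)
        db.executemany(
            f'INSERT INTO "{table}" VALUES ({marks})',
            [[r.get(c) for c in cols] for r in table_rows],
        )
    return db


db = load(DOC)
```

All cell values are stored as strings here; a real loader would need the schema (or UCDs) to pick column types.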
I will have a mull over how to map between pointers in databases and pointers in
XML, but I suspect those of you (eg Ed?) with good archive experience can think
of an elegant answer.
Cheers,
Martin
Ed Shaya wrote:
> We are trying to synthesize a number of requirements into a consistent
> model. We want to be able to make statements about very many different
> types of objects using a vocabulary of terms from UCDs that is well over
> 1300 in number (to which we will be adding many more, I bet). We want
> to be able to use XML tools, especially XPATH which then permits
> XQuery. We need a high level language to express queries independent of
> any datacenter's organization. We have extremely large quantities of
> data that require the speed and compact size of relational databases.
> But, our knowledge is not simply 2-dimensional and so one wants to be
> able to address the data as if it is hierarchical, even though the
> internal storage and access MAY be relational. This means that we
> need clear rules for flattening and "crumbling".
> Start by noting that a record in a table is usually a list of Quantities
> about some Object. So we should have a clear way to identify in our XML
> which elements are Objects and which are Properties, perhaps by
> namespacing them. Along the way we find that there are a few tricks to
> designing the schemas so that one generates nicer tables and directions
> for VOTable to develop.
>
> O=O(id,P*)
> O are Objects. Statements always begin with an O element.
>
> Objects take P's, properties, which are of type A, G, M.
>
> A=A(value,error,units,O*)
> A is an Atomic Quantity, an example is RA, and the child O's are Metadata.
>
> G=G((A|G)*)
> This is a Group Property of A's, each A typically is different, an
> example is position with several coordinates. In fact each A requires a
> bit of grouping to hold it together also, but I ignore that.
>
> M=M(O*)
> This is a Membership Property that holds Objects. An example is globular
> clusters have M=MembersStars which holds many O=star. It is probably
> best if each M is constrained to a certain range of Object type.
> All of this is much like OWL-lite but I am paying special attention to
> properties which take physical Objects as children. The OWL
> objectProperty is a property that takes an Instance, ie not a native
> number. We are now working a notch above OWL because our Quantities are
> quite a bit richer than a common OWL property.
>
> A basic example that conforms to O then P or M, M then O.
> Telescope
> name
> type
> aperture size
> location
> PositionGroup
> lat
> long
> M_hasInstruments
> Instrument1
> name
> ....
> Instrument2
> name
> ....
> /Telescope
>
> We can incorporate an image into this (we may not want to, but it can be
> done without stretching too far) by simply noticing that each pixel
> mapped onto the sky is a region of the sky which is an Object.
> We may need to extend our id to include a position Group.
> So an image, spectra, or timeseries is
> I=(O*,M) The first O* is metadata and the M refers to a series of O(id,A)
> as in M=[O(spot1,A), O(spot2,A), O(spot3,A),...., O(spotN,A)]
> But, in this fancy image one can add additional information at any spot.
> So, one can easily add-in O(spot1,A/P1,A2/P2),O(spot2,A,M(O*)...), etc.
> Why can we do this?
> Because it is XML and so you can do just about anything.
>
> And in fact we can include spectra and time series in a similar way. We
> simply think about a region in coordinate space as an Object.
>
> The path to any A Quantity starts with an O passes through 0 or more
> M/O, then ends with a series of G's and finally the A. For instance:
> Xpath = /O/M/O/G/G/A
> represents A cluster of galaxies that M_hasGalaxies and these have
> velocities measured and there are radial velocities and one of them is
> radio redshift.
>
> Xpath
> =/GalaxyCluster@id="343"/MemberStars/Star@id="2323"/Velocities/RadialVelocities/RadioCZ
>
> (Actually I am cheating a bit on the Xpath expression just for
> explanation).
>
>
> There is a flattening algorithm that is wonderfully simple:
> At the top level one can make tables of each ObjectType. Then, whenever
> there is an M, each M becomes a table and the table id is the Xpath to M.
> So there is a table here:
> TableName='/GalaxyCluster@id="343"/MemberStars'
> In the top level table, each A is 3 or so columns (value, error, units),
> but for an M property a single column contains the pointer to the
> "MTable".
>
> The table consists of stars in GalaxyCluster343 and has all of the A's
> and G's of A's.
> On the unlikely chance that there are actually several MemberStars at
> this point one needs to allow for a qualifier attribute. It does not
> modify the theory though because this is to be thought of as subclassing
> the M.
>
> One thing that I swept under the rug is the metadata in each A. These
> can go into FIELD/Metadata. But, if they differ from item to item then
> we need a column whose cells take XML. Also note that an Mtable in the
> Metadata is a likely occurrence, so this has to be transformed into a
> table and a pointer replaces it in the cell.
> As it turns out Norman Grey has just described how one adds extra
> branches of XML info into VOTABLE (see the VOTable discussion list,
> yesterday!).
>
> My conclusion is that one gets a wonderfully simple but powerful
> mechanism if one can identify XML elements as one of type O, A, G, or
> M. Actually O and A can be detected simply by position. It is the
> M element that is difficult to distinguish (for the computer, that is)
> from A. So we could name these special properties starting with M: or
> M_ or whatever.
>
> This all follows from simply noting that a table is confined to a
> /O/G/A or O/G/G/A (or can be cast into this) but that these may be
> incorporated into a hierarchical pattern by linking properties, M's.
>
> IF this works, it would mean that with a little bit of simple code to
> flatten and crumble and to convert XPATH into SQL, any relational
> database can become an XML ORDB. The price is that schemas need to follow
> a few rules.
>
> Ed
>
>
>
>
--
Martin Hill
Software Engineer
AstroGrid @ ROE
Tel: +44 7901 55 24 66
www.astrogrid.org