Preserving electronic data

Rob Seaman seaman at noao.edu
Wed Nov 19 09:51:34 PST 2008


Hi Reagan,

I'm copying the list with this message.  I would certainly think your  
message should be approved :-)

When your message arrived just now I was taking a few minutes to scan  
through "The Story of the Westinghouse Time Capsule":

	http://www.archive.org/details/storyofwestingho00pendrich

having read through "The Book of Record of the Time Capsule of  
Cupaloy..." yesterday:

	http://www.archive.org/details/timecapsulecups00westrich

I had come across a physical copy of the latter at some library back  
in the 1980s, but hadn't seen it since.  A truly unique artifact in  
itself.  ("Copies have now been distributed to libraries, museums,  
monasteries, convents, lamaseries, temples and other safe repositories  
throughout the world.")  Perhaps there's a copy in a lamasery near  
you...

Another fertile source of stringent "deep time" requirements is  
nuclear waste storage, e.g.:

	http://www.wipp.energy.gov/library/PermanentMarkersImplementationPlan.pdf

Ongoing curation is the real issue (it seems to me) for astronomical  
data.  One problem with the economic incentive you mention is that  
it's hard to attach a valuation to the practice of astronomy itself.   
On the other hand, long term preservation for astronomical data  
implies centuries or millennia of potentially useful life.  Each wide  
field image is a unique snapshot of phase space for tens of thousands  
of objects.

Rob
--

On Nov 19, 2008, at 9:47 AM, Reagan Moore wrote:

> Rob:
> Please post for me.  I had tried to post, but my message was  
> quarantined for approval by the list administrator.
>
> I agree that long term preservation requires an economic incentive.   
> As long as a group gains benefit from use of the data, they will  
> sustain the collection.  Of course you can manufacture incentives.   
> Thus detecting near-earth objects can be justified as long as we  
> expect to have the technology to respond to the situation.
>
> Reagan
>
>
>> Hi Reagan,
>>
>> Thanks for the thoughtful response.  Since my principal goal was to  
>> get conversation started on the datacp list, perhaps you might  
>> permit me to forward your response there?
>>
>> Regarding all the excellent grid work on persistent storage, I have  
>> nothing to add - other than that such persistence lasts only as  
>> long as the political and economic entities backing it.  For high  
>> visibility data sets such as LSST, that is likely enough to  
>> preserve the data indefinitely - where "indefinitely" is some time  
>> much shorter than the 5000 year goal of the Westinghouse time  
>> capsules.
>>
>> For more ordinary or smaller or more ad hoc collections of data, it  
>> is not obvious that the will to curation and preservation will  
>> survive even the ordinary social interruptions that we can  
>> anticipate over the next few decades, let alone the next century or  
>> two.  Perhaps the Grid will gobble up everything, but will that  
>> include meaningful curation for obscure areas of investigation  
>> (like, for instance, lunar dust measurements)?
>>
>> Rob
>> --
>>
>> On Nov 18, 2008, at 12:07 PM, Reagan Moore wrote:
>>
>>> There are successful preservation projects that already manage  
>>> data at the scale of petabytes in size and hundreds of millions of  
>>> files. Data Grid technology implements the interoperability  
>>> mechanisms needed to manage technology evolution. At the point in  
>>> time when a system becomes obsolete, the new technology is  
>>> available.  The ability to read from the old and write to the new  
>>> makes it possible to migrate a collection onto new storage  
>>> technology.
>>>
>>> There are several communities faced with preservation of future  
>>> collections that will be 150 petabytes in size:
>>> - LSST
>>> - National Climatic Data Center
>>> - US National Archives and Records Administration
>>>
>>> The technology we use to build preservation environments is based  
>>> on the integrated Rule-Oriented Data System (iRODS).  This is data grid  
>>> technology that implements:
>>> - data virtualization (management of properties associated with  
>>> each file such as descriptive metadata, provenance metadata,  
>>> administrative metadata)
>>> - trust virtualization (management of authentication and  
>>> authorization across administrative domains)
>>> - management virtualization (policies and procedures that enforce  
>>> assertions about the collection).
>>>
>>> The software is open source (available at  
>>> http://irods.diceresearch.org), supports interoperability across  
>>> vendor-supplied storage and database technology, and is in  
>>> production use around the world.
>>> The properties that are conserved include authenticity (linking of  
>>> representation information to each file), integrity (use of  
>>> replicas, checksums, synchronization), chain of custody (audit  
>>> trails), and trustworthiness (assessment criteria for validating  
>>> properties).
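[Editorial aside: the integrity mechanism described above — replicas plus checksums plus synchronization — can be illustrated with a minimal sketch.  This is not iRODS code; it is a hypothetical fixity check in Python, with all names and the replica layout invented for illustration:]

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def verify_replicas(replicas: dict[str, bytes], expected: str) -> list[str]:
    """Compare each replica's checksum against the registered value.

    Returns the names of replicas failing the fixity check, i.e.
    candidates for re-synchronization from a known-good copy.
    """
    return [site for site, data in replicas.items()
            if sha256_of(data) != expected]

# Hypothetical example: three replicas, one silently corrupted ("bit rot").
original = b"wide-field image, epoch 2008-11-19"
registered = sha256_of(original)
replicas = {
    "site_a": original,
    "site_b": original,
    "site_c": b"wide-field image, epoch 2008-11-20",  # corrupted copy
}
bad = verify_replicas(replicas, registered)
# `bad` names the replica(s) to repair from an intact copy.
```

[The point of the sketch is only that a registered checksum plus multiple independent copies lets corruption be both detected and repaired, which is what makes replication more than mere redundancy.]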
>>>
>>> The system allows each community to specify their preservation  
>>> policies as computer actionable rules, and their preservation  
>>> procedures as computer executable micro-services.  The state  
>>> information that is generated by the procedures can be  
>>> periodically queried to validate assessment criteria.
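[Editorial aside: the notion of policies expressed as computer-actionable rules, periodically evaluated against harvested state information, might be sketched as follows.  Again, this is a hypothetical illustration, not the actual iRODS rule language; the rule names, state fields, and threshold are all invented:]

```python
# Each policy rule is a predicate over the collection's state information,
# returning the files that violate it.  A periodic assessment sweep
# evaluates every rule and reports violations.

def min_replica_count(state, threshold=3):
    """Policy: every file must have at least `threshold` replicas."""
    return [f for f, info in state.items() if info["replicas"] < threshold]

def audit_trail_present(state):
    """Policy: every file must carry a non-empty audit trail."""
    return [f for f, info in state.items() if not info["audit"]]

RULES = [min_replica_count, audit_trail_present]

def assess(state):
    """Run every policy rule; map rule name -> list of violating files."""
    return {rule.__name__: rule(state) for rule in RULES}

# Hypothetical state information harvested by micro-services.
state = {
    "img_0001.fits": {"replicas": 3, "audit": ["ingested", "checksummed"]},
    "img_0002.fits": {"replicas": 1, "audit": []},
}
report = assess(state)
# report flags img_0002.fits under both policies.
```

[The design point is that the rules are data-driven and queryable, so assessment criteria can be re-validated at any time without re-running the original ingest procedures.]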
>>>
>>> The collaboration with NARA tests preservation concepts on a  
>>> Transcontinental Persistent Archive Prototype that includes  
>>> storage resources at 7 sites within the US.
>>>
>>> The LSST collaboration will demonstrate a data grid at SC08 next  
>>> week.  The focus is on the integration of data processing  
>>> pipelines with data administration workflows.
>>>
>>> Reagan Moore
>>> University of North Carolina at Chapel Hill
>>>
>>>
>>>
>>>
>>>> On Nov 10, 2008, at 5:10 PM, Rob Seaman wrote:
>>>>
>>>>> Here is a cautionary tale of data preservation from the UK:
>>>>>
>>>>> 	http://catless.ncl.ac.uk/Risks/25.44.html#subj7
>>>>
>>>>
>>>> Getting into the meat of the link, I find that the problems with  
>>>> the Domesday project stemmed basically from the fact that funding  
>>>> was given to a selection and recording effort, with no thought  
>>>> given to funding the curation of the stored data.
>>>>
>>>> It is also worrying that in those years the National Library was  
>>>> entrusted with material it wasn't really equipped to handle, and  
>>>> the response was to lose it.
>>>>
>>>> So the main lessons to learn from that project would be:
>>>>
>>>> a) Establish long term commitment to the data assets from the  
>>>> start.  And the keyword here is commitment, not just long term.
>>>>
>>>> b) Try to use mainstream technologies as much as possible, because  
>>>> ad-hoc solutions can die unexpectedly.  Again, Domesday used an  
>>>> ad-hoc solution with players which were not mainstream.  Perhaps  
>>>> they should not have attempted their effort because other  
>>>> technologies were not available?
>>>>
>>>> c) More than one copy of digital assets (preferably three?) should  
>>>> be stored AND maintained at different locations.  In the same way  
>>>> paintings at a museum are kept at given room temperatures and  
>>>> humidity levels, digital assets must be protected from "bit rot"  
>>>> by ALSO copying the assets onto the latest technology while  
>>>> keeping the original.
>>>>
>>>> a) and c) are clearly the most costly, and were not addressed by  
>>>> Domesday.
>>>>
>>>> --
>>>> Juan de Dios Santander Vela
>>>> Graduate in Physical Sciences, Electronics Engineer
>>>> Doctoral candidate in Multimedia Technologies
>>>> Predoctoral Fellow of the Instituto de Astrofísica de Andalucía
>>>>
>>>> Franklin P. Adams: I find that a great part of the information I  
>>>> have was acquired by looking up something and finding something  
>>>> else on the way.
>


