Preserving electronic data
Rob Seaman
seaman at noao.edu
Wed Nov 19 09:51:34 PST 2008
Hi Reagan,
I'm copying the list with this message. I would certainly think your
message should be approved :-)
When your message arrived just now I was taking a few minutes to scan
through "The Story of the Westinghouse Time Capsule":
http://www.archive.org/details/storyofwestingho00pendrich
having read through "The Book of Record of the Time Capsule of
Cupaloy..." yesterday:
http://www.archive.org/details/timecapsulecups00westrich
I had come across a physical copy of the latter at some library back
in the 1980s, but hadn't seen it since. A truly unique artifact in
itself. ("Copies have now been distributed to libraries, museums,
monasteries, convents, lamaseries, temples and other safe repositories
throughout the world.") Perhaps there's a copy in a lamasery near
you...
Another fertile source of stringent "deep time" requirements is
nuclear waste storage, e.g.:
http://www.wipp.energy.gov/library/PermanentMarkersImplementationPlan.pdf
Ongoing curation is the real issue (it seems to me) for astronomical
data. One problem with the economic incentive you mention is that
it's hard to attach a valuation to the practice of astronomy itself.
On the other hand, long term preservation for astronomical data
implies centuries or millennia of potentially useful life. Each wide
field image is a unique snapshot of phase space for tens of thousands
of objects.
Rob
--
On Nov 19, 2008, at 9:47 AM, Reagan Moore wrote:
> Rob:
> Please post for me. I had tried to post, but my message was
> quarantined for approval by the list administrator.
>
> I agree that long term preservation requires an economic incentive.
> As long as a group gains benefit from use of the data, they will
> sustain the collection. Of course you can manufacture incentives.
> Thus detecting near-Earth objects can be justified as long as we
> expect to have the technology to respond to the situation.
>
> Reagan
>
>
>> Hi Reagan,
>>
>> Thanks for the thoughtful response. Since my principal goal was to
>> get conversation started on the datacp list, perhaps you might
>> permit me to forward your response there?
>>
>> Regarding all the excellent grid work on persistent storage, I have
>> nothing to add - other than that such persistence lasts only as
>> long as the political and economic entities backing it. For high
>> visibility data sets such as LSST, that is likely enough to
>> preserve the data indefinitely - where "indefinitely" is some time
>> much shorter than the 5000 year goal of the Westinghouse time
>> capsules.
>>
>> For more ordinary or smaller or more ad hoc collections of data, it
>> is not obvious that the will to curation and preservation will
>> survive even the ordinary social interruptions that we can
>> anticipate over the next few decades, let alone the next century or
>> two. Perhaps the Grid will gobble up everything, but will that
>> include meaningful curation for obscure areas of investigation
>> (like, for instance, lunar dust measurements)?
>>
>> Rob
>> --
>>
>> On Nov 18, 2008, at 12:07 PM, Reagan Moore wrote:
>>
>>> There are successful preservation projects that already manage
>>> data at the scale of petabytes in size and hundreds of millions of
>>> files. Data Grid technology implements the interoperability
>>> mechanisms needed to manage technology evolution. By the time a
>>> system becomes obsolete, its replacement is already available. The
>>> ability to read from the old and write to the new
>>> makes it possible to migrate a collection onto new storage
>>> technology.
>>>
>>> There are several communities faced with preservation of future
>>> collections that will be 150 petabytes in size:
>>> - LSST
>>> - National Climatic Data Center
>>> - US National Archives and Records Administration
>>>
>>> The technology we use to build preservation environments is based
>>> on the integrated Rule-Oriented Data System (iRODS). This is data grid
>>> technology that implements:
>>> - data virtualization (management of properties associated with
>>> each file such as descriptive metadata, provenance metadata,
>>> administrative metadata)
>>> - trust virtualization (management of authentication and
>>> authorization across administrative domains)
>>> - management virtualization (policies and procedures that enforce
>>> assertions about the collection).
>>>
>>> The software is open source (available at http://irods.diceresearch.org
>>> ), supports interoperability across vendor supplied storage and
>>> database technology, and is in production use around the world.
>>> The properties that are conserved include authenticity (linking of
>>> representation information to each file), integrity (use of
>>> replicas, checksums, synchronization), chain of custody (audit
>>> trails), trustworthiness (assessment criteria for validating
>>> properties).
>>>
>>> The system allows each community to specify its preservation
>>> policies as computer-actionable rules, and its preservation
>>> procedures as computer-executable micro-services. The state
>>> information that is generated by the procedures can be
>>> periodically queried to validate assessment criteria.
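
As a rough illustration of policy-as-rule plus periodic assessment — plain Python, not the iRODS rule language; the catalog fields and thresholds are made up for the example:

```python
from dataclasses import dataclass

@dataclass
class Policy:
    # A preservation policy expressed as machine-checkable assertions
    # (hypothetical fields; a real rule language is far richer).
    min_replicas: int
    max_days_since_fixity_check: int

def assess(catalog: list, policy: Policy) -> list:
    """Query the stored state information and report every file
    that currently violates the policy's assertions."""
    violations = []
    for record in catalog:
        if record["replicas"] < policy.min_replicas:
            violations.append(f"{record['name']}: too few replicas")
        if record["days_since_check"] > policy.max_days_since_fixity_check:
            violations.append(f"{record['name']}: fixity check overdue")
    return violations
```

Running such an assessment on a schedule is what turns "we have preservation policies" into something that can actually be validated.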
>>>
>>> The collaboration with NARA tests preservation concepts on a
>>> Transcontinental Persistent Archive Prototype that includes
>>> storage resources at 7 sites within the US.
>>>
>>> The LSST collaboration will demonstrate a data grid at SC08 next
>>> week. The focus is on the integration of data processing
>>> pipelines with data administration workflows.
>>>
>>> Reagan Moore
>>> University of North Carolina at Chapel Hill
>>>
>>>
>>>
>>>
>>>> On 10/11/2008, at 17:10, Rob Seaman wrote:
>>>>
>>>>> Here is a cautionary tale of data preservation from the UK:
>>>>>
>>>>> http://catless.ncl.ac.uk/Risks/25.44.html#subj7
>>>>
>>>>
>>>> Getting into the meat of the link, I find that the problems with
>>>> the Domesday project came basically from the fact that funding
>>>> was given to the selection and recording effort, with no thought
>>>> given to funding the curation of the stored data.
>>>>
>>>> It is also worrying that in those years the National Library was
>>>> handed material it wasn't really equipped to handle, and its
>>>> response was to lose it.
>>>>
>>>> So the main lessons to learn from that project would be:
>>>>
>>>> a) Establish long-term commitment to the data assets from the
>>>> start. The keyword here is commitment, not just long term.
>>>>
>>>> b) Try to use mainstream technologies as much as possible,
>>>> because ad hoc solutions can die unexpectedly. Domesday used an
>>>> ad hoc solution with players that were not mainstream. Perhaps
>>>> the effort should not have been attempted while no other
>>>> technologies were available?
>>>>
>>>> c) More than one copy of digital assets (preferably three?)
>>>> should be stored AND maintained at different locations. In the
>>>> same way paintings at a museum are kept at controlled
>>>> temperature and humidity, digital assets must be protected from
>>>> "bit rot" by also copying them onto the latest technology while
>>>> keeping the originals.
>>>>
>>>> a) and c) are clearly the most costly, and they are what
>>>> Domesday failed to address.
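
Copies that are maintained, not merely stored, boil down to periodic fixity comparison across sites: a replica that has silently rotted will disagree with the majority of its siblings. A minimal sketch (illustrative Python; the site names are hypothetical, and real repositories would compare stored checksums rather than load whole files):

```python
import hashlib
from collections import Counter

def fingerprint(data: bytes) -> str:
    """Checksum one replica's contents."""
    return hashlib.sha256(data).hexdigest()

def detect_rot(replicas: dict) -> list:
    """Compare checksums of the same asset held at different sites and
    flag any copy that disagrees with the majority (likely bit rot)."""
    sums = {site: fingerprint(data) for site, data in replicas.items()}
    majority, _ = Counter(sums.values()).most_common(1)[0]
    return sorted(site for site, s in sums.items() if s != majority)
```

With three or more maintained copies, a single corrupted replica can not only be detected this way but also repaired from the agreeing majority.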
>>>>
>>>> --
>>>> Juan de Dios Santander Vela
>>>> Graduate in Physical Sciences, Electronics Engineer
>>>> Doctoral student in Multimedia Technologies
>>>> Predoctoral Fellow of the Instituto de Astrofísica de Andalucía
>>>>
>>>> Franklin P. Adams: I find that a great part of the information I
>>>> have was acquired by looking something up and finding something
>>>> else on the way.
>