Homes for old software

The more hybrid archives we work with, the more obvious it becomes that we need access to repositories of older software (or ‘abandonware’). For older formats you often find that not only is the creating software obsolete, but any migration tool you can dig up is pretty out-of-date too. Recently I used oldversion.com to source older versions of CompuServe and Eudora to transform an old CompuServe account to mbox format with CS2Eudora. The oldversion site is really valuable and we could use more like it, and more in it. The trouble is, collecting and publishing proprietary ‘abandonware’ seems to be a bit of a grey area.

In 2003, the Internet Archive obtained some exemptions from the Digital Millennium Copyright Act (DMCA) that have allowed them to archive software, but this has to be done privately, with the software being made available only after copyright expiry. Not much help now, but promising for the long term. The best thing that could happen (from an archivist’s point of view) is for individuals and companies to formally relinquish their interests in older software and put it in the public domain. Ideally they would put an expiry date into the initial licence before the software becomes abandonware.

I’m curious to hear about other good abandonware sites, especially ones that include ‘productivity software’ (our focus is here rather than gaming!). The Macintosh Garden is a good one, and Apple themselves also provide access to some older software, like ClarisWorks. What else is out there that we should know about?

-Susan Thomas

OSS projects for accessing data held in .pst format

Thanks to Neil Jefferies for a link to this article in The Register, which tells us that MS has begun two open source projects that will make it possible for developers to create tools to ‘browse, read and extract emails, calendar, contacts and events information’ which live in MS Outlook’s .pst file format. These tools are the PST Data Structure View Tool and the PST File Format SDK, and both are to be Apache-licensed.

-Susan Thomas

MS format specs, inc. Outlook pst

Last year Microsoft said that they would publish the format specifications for personal store files (pst) – formats used to store data from Outlook products, including email, contacts and diary appointments. You can find them at MSDN here: http://msdn.microsoft.com/en-us/library/ff387869.aspx. There is also format information for other Office products: http://msdn.microsoft.com/en-us/library/cc313118.aspx.

Related sites of interest include Microsoft’s interoperability pages, such as their ‘open specification promise’ and Microsoft’s interoperability blog.

-Susan Thomas

XML Schema for archiving email accounts

I attended several great sessions at the Society of American Archivists conference last month. There is a wiki for the conference, but very few of the presentations have been posted so far…

One session I particularly enjoyed addressed the archiving of email – ‘Capturing the E-Tiger: New Tools for Email Preservation’. Archiving email is challenging for many reasons, which were very well put by the session speakers.

Both the EMCAP and CERP projects were introduced in the session.

EMCAP is a collaboration between state archives in North Carolina, Kentucky, and Pennsylvania to develop means to archive email. In the past, the archives have typically received email on CDs from a variety of systems, including MS Exchange, Novell GroupWise and Lotus Notes. One of the interesting outcomes of this work is software (an extension of the hmail software – see SourceForge) that enables ongoing capture of email from user systems, with users selecting which messages to archive. Email identified for archiving is normalised to an XML format and can be transformed to HTML for access. The software supports open email standards (POP3, SMTP, and IMAP4) as well as MySQL and MS SQL Server. The effort has been underway for five years and the software continues to be tested and refined.

CERP is a collaboration between the Smithsonian Institution Archives and the Rockefeller Archive Center. Their situation has more in common with archiving email at the Bodleian, where an email account is more likely to be accessioned from its owner in bulk than cumulatively. Ricc Ferrante gave an overview of the issues encountered, which were similar to our experiences on the Paradigm project and in working with creators more generally.

CERP has worked with EMCAP to publish an XML schema for preserving email accounts. Email is first normalised to mbox format and then converted to this XML standard using a prototype parser built in Squeak Smalltalk, which also has a web interface (Seaside/Comanche). The result of the transformation is a single XML file that represents an entire email account as per its original arrangement. Attachments can be embedded in the XML file, or externally referenced if on the larger side (over 25 KB). If I remember rightly, the largest email account that has been processed so far is c. 1.5 GB; we have one at the Library that’s significantly larger and I’d like to see how the parser handles this. It will be interesting to compare the schema/parser with The National Archives of Australia’s Xena. The developers are keen to receive comments on the schema, which is available here.
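
To make the mbox-to-XML step a little more concrete, here is a rough sketch in Python using only the standard library. It is not the CERP parser, and it does not follow the published schema: the element names (Account, Folder, Message, Body, Attachment, ExternalAttachment) and the function names are made up for illustration. The only detail taken from the session is the idea of embedding small attachments and referencing those over about 25 KB externally. This sketch only records the name and size of a large attachment; a real tool would also write the bytes out somewhere.

```python
import base64
import mailbox
import xml.etree.ElementTree as ET

# Assumption: "over 25 KB" is read here as more than 25 * 1024 bytes.
EXTERNAL_THRESHOLD = 25 * 1024


def message_to_element(msg, index):
    """Build a hypothetical <Message> element (NOT the CERP/EMCAP schema)."""
    elem = ET.Element("Message", id=str(index))
    for header in ("From", "To", "Subject", "Date"):
        value = msg.get(header)
        if value is not None:
            ET.SubElement(elem, header).text = str(value)
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        filename = part.get_filename()
        if filename is None:
            # No filename: treat the part as message body text.
            charset = part.get_content_charset() or "utf-8"
            try:
                text = payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset label in the message
                text = payload.decode("utf-8", errors="replace")
            body = ET.SubElement(elem, "Body", contentType=part.get_content_type())
            body.text = text
        elif len(payload) > EXTERNAL_THRESHOLD:
            # Large attachment: keep only a reference in the XML.
            ET.SubElement(elem, "ExternalAttachment",
                          name=filename, size=str(len(payload)))
        else:
            # Small attachment: embed it as base64 text.
            att = ET.SubElement(elem, "Attachment", name=filename)
            att.text = base64.b64encode(payload).decode("ascii")
    return elem


def mbox_to_xml(mbox_path, account_name, folder_name="Inbox"):
    """Convert one mbox file into a single hypothetical <Account> tree."""
    account = ET.Element("Account", name=account_name)
    folder = ET.SubElement(account, "Folder", name=folder_name)
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, msg in enumerate(box):
            folder.append(message_to_element(msg, index))
    finally:
        box.close()
    return account
```

Calling ET.ElementTree(mbox_to_xml("account.mbox", "example")).write("account.xml", encoding="utf-8", xml_declaration=True) would then produce one XML file for the account (the file names are placeholders). An account with several folders would need one mbox per folder and a loop over them, which is left out here.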

-Susan Thomas