Tag Archives: migration tools

Migrating documents

We have a collection that consists of several thousand documents in various archaic (well, 1980s/90s) word processor formats including Ami Professional and (its predecessor) Samna Word. Perhaps of interest to folks intent on discussing the implications of migration for authenticity of the items, some of those Ami Pro files contain the (automatically generated) line:

“File … was converted from Samna Word on …”

So which is the original now?

Migrating these file formats has not been straightforward, because it has proved remarkably tricky to ascertain a key piece of information – the file format of the original. This is not the fault of the file format tools (I’m using FITS, which itself wraps the usual suspects, JHOVE & DROID), but of the broader problem that the files have multiple formats. Ami Pro files are correctly identified as “text/plain”. The file command reports them as “ASCII English text”. Some (not all) have the file extension “.sam”, which usually means Ami Pro, but the “.sam” files are not all the same format.
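Because these files are plain text underneath, one way to tell them apart is to look at how they begin. The sketch below is only an illustration: it assumes that Ami Pro documents open with a "[ver]" marker, which is not stated in this post and should be checked against your own collection before you rely on it. The names looks_like_ami_pro and sniff are made up for the example.

```python
# ASSUMPTION: Ami Pro documents start with this marker. Verify it against
# known-good files from your own collection before trusting the result.
AMI_PRO_MARKER = b"[ver]"


def looks_like_ami_pro(data: bytes, marker: bytes = AMI_PRO_MARKER) -> bool:
    """Return True if the leading bytes start with the expected marker."""
    # Ignore any leading whitespace or blank lines before the marker.
    return data.lstrip().startswith(marker)


def sniff(path: str, marker: bytes = AMI_PRO_MARKER) -> bool:
    """Read the first few bytes of a file and test them for the marker."""
    with open(path, "rb") as handle:
        return looks_like_ami_pro(handle.read(64), marker)
```

A check like this only adds one more piece of evidence; it does not replace FITS or any other identification tool.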

Yet this small piece of metadata is essential, because without it, it is very difficult to identify the correct tool to perform the migration. For example, if I run my usual text to PDF tool – which is primed to leap into action on arrival of a “text/plain” document – the resultant PDF shows the internals of an Ami Pro file, not the neatly laid out document the creator saw. We have a further piece of information available too, and curiously it is the most useful. This is the “Category” from FTK, which correctly sorts the Ami Pros from the Samna Words.

This leads to a complex migration machine that needs to be capable of collating file format information from disparate sources and making sense of the differences, all within the context of the collection itself. If I know that creator X used Ami Pro a lot, then I can guess that “text/plain” & “.sam” means an Ami Pro document, for example. This approach is not without problems, however, not least that it requires a lot of manual input into what should ultimately be an automated and unwatched process. (One day, when it works better, I’ll try to share this code!)
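To make the idea of collating evidence concrete, here is a minimal sketch. It is not the migration machine described above, only an illustration of the reasoning: trust the FTK category when there is one, fall back on what is known about the creator, and flag everything else for a human. The Evidence class, the category strings and the CREATOR_DEFAULTS table are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Evidence:
    """Everything known about one file, gathered from different sources."""
    mime_type: Optional[str] = None     # e.g. as reported by FITS
    extension: Optional[str] = None     # lower-case, including the dot
    ftk_category: Optional[str] = None  # category string from FTK, if any
    creator: Optional[str] = None       # who the file came from


# Hypothetical collection knowledge: the word processor each creator favoured.
CREATOR_DEFAULTS: Dict[str, str] = {"creator-x": "Ami Pro"}

# Hypothetical category strings; use whatever your FTK export contains.
TRUSTED_CATEGORIES = {"Ami Pro", "Samna Word"}


def resolve_format(
    ev: Evidence, creator_defaults: Dict[str, str] = CREATOR_DEFAULTS
) -> Tuple[Optional[str], str]:
    """Return (format, reason); format is None when a person must decide."""
    # 1. The FTK category proved the most reliable source, so it wins.
    if ev.ftk_category in TRUSTED_CATEGORIES:
        return ev.ftk_category, "FTK category"
    # 2. "text/plain" plus ".sam" is ambiguous: use collection context.
    if ev.mime_type == "text/plain" and ev.extension == ".sam":
        favoured = creator_defaults.get(ev.creator or "")
        if favoured:
            return favoured, "creator habit, text/plain and .sam"
        return None, "ambiguous .sam file, needs manual review"
    # 3. Otherwise accept what the identification tool reported.
    if ev.mime_type:
        return ev.mime_type, "identification tool"
    return None, "no evidence"
```

Returning the reason alongside the guess keeps a record of why each file was routed the way it was, which helps when the manual input mentioned above is eventually needed.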

Sometimes you get lucky, and the tool to do the migration offers an “auto” mode for input. For this collection I am using a trial copy of FileMerlin to perform the migration and evaluate it. It actually works better if you let it guess the input format rather than attempt to tell it. Other tools, such as JODConverter, like to know the input format, and here you have a similar problem: you need to know what JODConverter is happy to accept rather than the real format. For example, send it a file with a content type of “application/rtf” and it responds with an internal server error. Send the same file with a content type of “application/msword” and the PDF is generated and returned to you.
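One way to keep track of such quirks is a small lookup table that maps the format you believe a file to be onto the content type the converter will actually accept. The sketch below holds only the one mismatch described above. The function names are invented for the example, and pairing the content type with an “Accept” header for the target format is an assumption about how the request is made, not something taken from the JODConverter documentation.

```python
from typing import Dict

# Real format -> content type the converter is known to accept.
# The single entry reflects the behaviour described in the post.
ACCEPTED_CONTENT_TYPE: Dict[str, str] = {
    "application/rtf": "application/msword",
}


def content_type_to_send(real_type: str) -> str:
    """Return the content type to declare; unknown types pass through."""
    return ACCEPTED_CONTENT_TYPE.get(real_type, real_type)


def build_headers(real_type: str, target_type: str = "application/pdf") -> Dict[str, str]:
    """Build request headers for a conversion (ASSUMED header layout)."""
    return {
        "Content-Type": content_type_to_send(real_type),
        "Accept": target_type,
    }
```

For an RTF file, build_headers("application/rtf") declares the content type as “application/msword”, while a type with no known quirk is sent unchanged.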

Then there is a final problem – sometimes you have to take several steps to get the file into shape. For this collection, FileMerlin should be able to migrate Ami Pro and Samna Word into PDFs. In practice, it crashes on a very small subset of the documents. To overcome this, I migrate these same documents to “rich text format” (which FileMerlin seems OK with) and then to PDF with JODConverter – sending the aforementioned “application/msword” content type. I had a similar problem with WordPerfect files, where using JOD directly changed the formatting of the original files. Using libwpd to create ODTs and then converting them to PDFs generated more accurate PDFs. (This is strange behaviour since OpenOffice itself uses libwpd!) Every time I hit a new (old) file format, the process of identifying it and generating a heuristic for handling it starts over.
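The “try the direct route, then fall back to a longer one” pattern can be sketched as a list of routes, each a chain of conversion steps. This is an illustration only: the converters are passed in as plain Python functions, a crash is modelled as an exception, and none of the names below correspond to real FileMerlin or JODConverter calls.

```python
from typing import Callable, List, Sequence

# A converter takes an input path and returns the path of its output.
Converter = Callable[[str], str]


class MigrationFailed(Exception):
    """Raised when every route has been tried and none succeeded."""


def run_route(path: str, route: Sequence[Converter]) -> str:
    """Feed the output of each step into the next and return the last."""
    for step in route:
        path = step(path)
    return path


def migrate(path: str, routes: Sequence[Sequence[Converter]]) -> str:
    """Try each route in order, moving on whenever a step fails."""
    errors: List[str] = []
    for route in routes:
        try:
            return run_route(path, route)
        except Exception as exc:  # a crashed tool surfaces as an exception
            names = " -> ".join(step.__name__ for step in route)
            errors.append(f"{names}: {exc}")
    raise MigrationFailed("; ".join(errors))


# Stand-in converters so the sketch runs without any real tools installed.
def direct_to_pdf(path: str) -> str:
    if "bad" in path:
        raise RuntimeError("converter crashed")
    return path + ".pdf"


def to_rtf(path: str) -> str:
    return path + ".rtf"


def rtf_to_pdf(path: str) -> str:
    return path + ".pdf"


ROUTES = [[direct_to_pdf], [to_rtf, rtf_to_pdf]]
```

Calling migrate("letter.sam", ROUTES) uses the direct route, while a file the first converter chokes on goes through the two-step route instead. Collecting the error from each failed route makes it easier to see why a document ended up where it did.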

I’m starting to think I need a neural network! That really would be putting the AI in OAIS!

-Peter Cliff

Homes for old software

The more hybrid archives we work with, the more obvious it becomes that we need access to repositories of older software (or ‘abandonware’). For older formats you often find that not only is the creating software obsolete, but any migration tool you can dig up is pretty out-of-date too. Recently I used oldversion.com to source older versions of CompuServe and Eudora to transform an old CompuServe account to mbox format with CS2Eudora. The oldversion site is really valuable and we could use more like it, and more in it. The trouble is, collecting and publishing proprietary ‘abandonware’ seems to be a bit of a grey area.

In 2003, the Internet Archive obtained some exemptions from the Digital Millennium Copyright Act (DMCA) that have allowed them to archive software, but this has to be done privately, with the software being made available after copyright expiry. Not much help now, but promising for the long term. The best thing that could happen (from an archivist’s point of view) would be for individuals and companies to formally rescind their interests in older software and put it in the public domain. Ideally they would put an expiry date into the initial licence, before the software becomes abandonware.

I’m curious to hear about other good abandonware sites, especially ones that include ‘productivity software’ (our focus is here rather than gaming!). The Macintosh Garden is a good one, and Apple themselves also provide access to some older software, like ClarisWorks. What else is out there that we should know about?

-Susan Thomas