
open source forever, right?

As much a note to self as anything, but also a cautionary tale…

Open source libraries (I mean software libraries rather than big buildings with books – so apologies to the non-technical readers!) are very useful. Sometimes they can vanish: projects go under, people stop being interested, and soon a code base is “unsupported” and might, one day, disappear from the Web. Take, for example, the Trilead Java SSH Library, the demise of which I think must be fairly recent.

A quick Google search suggests the following:

http://www.trilead.com/SSH_Library/

Which helpfully says:

“Trilead SSH for Java The freely available open-source library won’t be anymore developed nore supported by Trilead.” (sic.)

Unsupported, in this case, also means unavailable and there are no links to any code from here.

Other sites link to:

http://www.trilead.com/Products/Trilead-SSH-2-Java/

which gives a 404.

None of which is very helpful when your code is telling you:

Exception in thread "main" java.lang.NoClassDefFoundError: com/trilead/ssh2/InteractiveCallback

(Should any non-technical types still be reading, that means “I have no idea what you are talking about when you refer to a thing called com.trilead.ssh2.InteractiveCallback, so I’m not going to work, no way, not a chance, so there. Ner.”)

Now, had I been more awake, I probably would have noticed a sneaky little file by the name of “trilead.jar” in the SVNKit directory. I would have also duly added it to the classpath. But I wasn’t and I didn’t, and then got into a panic searching for it.
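(For the record, the fix is a one-liner when you launch the thing – something like the line below, where the main class name is made up for illustration and the ‘:’ separator becomes ‘;’ on Windows:

  java -cp .:svnkit.jar:trilead.jar org.example.MySvnClient

With trilead.jar on the classpath, the JVM suddenly knows all about com.trilead.ssh2.InteractiveCallback.)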

But, and here is the moral of the tale, I did find this:

“Also, in the meantime we will put Trilead SSH library source code into our repository (it is also distributed under BSD license) so it will remain available for the community and we still will be able to make minor bugfixes in it when necessary.” [SVNKit User Mailing List, 18th May 2009]

Hooray for Open Source!

The open source project SVNKit, which made use of the open source library, was able – thanks to the open licensing – to absorb the SSH library and make it available along with the SVNKit code. Even though the Trilead SSH Library is officially defunct, it lives on in the hands of its users. Marvellous, eh?

All of which is to say: 1) check the classpath and include all the jars, and 2) open licensing means that something at least has a chance of being preserved by someone other than the creator who got fed up with all the emails asking how it worked… 🙂

-Peter Cliff

Waxwork Accessions

I decided the other day that it would be useful to have a representative accession or two to play with. This way we could test the scalability and robustness (in dealing with different file formats, crazy filenames, and the like) of the various tools that will make up BEAM, and also try out some of our ideas regarding packaging, disk images and such.

It isn’t really possible to use a real accession for this purpose, mostly due to the confidentiality of some of our content. But I did want the accession to be as genuine as possible and here is how I did it. Any ideas for alternatives would be great!

The way I saw it, I needed three things to create the accession:

  1. A list of files and folders that formed a real accession
  2. A set of data that could be used – real documents, images, sound files, system files, etc.
  3. Some way of tying these together to create an accession modelled on a real one but containing public data

Fortunately Susan already had a list of the files that made up a 2GB hard drive from a donor – created on the forensic machine – which I thought would be a good starting point. Point 1 covered!

Next question was where to get the data. My first thought was to use public (open licensed) content from the Web – obtaining images when required through Flickr, getting documents via Scribd, etc. This is still a good approach for certain accessions. However, looking at my file list I quickly realised I wasn’t just dealing with nice, obvious “documents”. The accession contained a working copy of Windows 95 for example, jammed full of DLLs and EXEs. It also contained files pulled from old PCW disks by the owner, with no extension, applications from older systems, and all manner of oddities – “~$oto “, “~~S”, “!.bk!” are just some examples.

It occurred to me that I needed a more diverse source of files – most likely a real live system that could meet my request for a DLL while not revealing much about the original file. Where would I find this source? My own PC of course!

In theory my PC is dual-boot, running Ubuntu and Windows XP. The Windows XP partition is rarely used (I’ve nothing against XP, it just isn’t so good a software development environment as Linux), but it struck me it’d make an excellent source of files, even if it was a version of Windows some way down the tracks from 95. By pulling files from my Windows disk (mounted by Ubuntu) I could, hopefully, create a more representative accession with a few more problems to solve than just “document” content.

(I also thought I could try creating a file system with a representative set of files to choose from – dlls from Windows 95 disks, etc. – but that would mean some manual collation of said files. This may be where I go next!).

So, 1 and 2 covered, what I needed next was a way to tie the file list to the data. I decided to use the file extension for this. For example, if the file list contains:

C:\WINDOWS\SYSTEM32\ABC.DLL

I wanted to grab any file with a “.DLL” extension from my data source (the XP disk). Any random file, rather than one that matched the accession, because the randomness here is likely to cause problems when it comes to processing this artificial accession – and problems are exactly what we need to test something.
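Pulling the extension out of a path is a one-liner in Java. A throwaway sketch (the path and class name are just for illustration):

  // Hypothetical sketch: derive the lookup key (the extension) from one accession path
  public class ExtensionDemo {
      public static void main(String[] args) {
          String path = "C:\\WINDOWS\\SYSTEM32\\ABC.DLL";
          int dot = path.lastIndexOf('.');
          // files with no extension at all (plenty of those here!) get an empty key
          String ext = (dot >= 0) ? path.substring(dot + 1).toLowerCase() : "";
          System.out.println(ext); // prints "dll"
      }
  }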

This suggested I needed a way to ask my file system “What do you have with ‘.dll’ at the end of the path?” There were lots of ways to do this – and here is where Linux shines. We have ‘find’, ‘locate’, ‘which’, etc. on the command line to discover files. There is also ‘tracker’, which I could have set indexing the XP filesystem. In the end I opted for Solr.

Solr provides a very quick and easy way to build an index of just about anything – it uses Lucene behind the scenes. (I like the way that almost rhymes!) If you’re unfamiliar with either, then find out all about them quickly! In short, you tell it which fields you want to index (and how you want them indexed) and create XML documents that contain those fields. POST these to the Solr update service and it indexes them there and then.
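To give a flavour, an update message looks something like this – the field names are whatever your schema.xml declares, so “id” and “ext” here are just my illustrative choices:

  <add>
    <doc>
      <field name="id">/mnt/xp/WINDOWS/system32/abc.dll</field>
      <field name="ext">dll</field>
    </doc>
  </add>

POST that to the update URL (http://localhost:8983/solr/update in a default install), send a <commit/> the same way, and the document is searchable.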

I installed the Solr Web app, tweaked the configuration (including setting the data directory, because no matter what I did with the environment and JNDI variables it kept writing data to the directory from which Tomcat was started!), and then started posting documents for indexing to it. The document creation and POSTs were done with a simple Java “program” (really a script – I could’ve used just about any language, but since we’re mostly using Java and I’m trying to de-rust my Java skills, I figured why not do this with Java too). The index of around 140,000 files took about 15 minutes (I’ve no idea if that is good or not).
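For the curious, that indexing “program” boils down to something like the sketch below. To be clear, this is a reconstruction rather than the code I actually ran – the mount point, Solr URL and field names are all assumptions:

  import java.io.File;
  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  // Sketch: walk the mounted XP partition and POST one Solr document per file.
  public class SolrIndexer {

      static final String SOLR_UPDATE = "http://localhost:8983/solr/update";

      public static void main(String[] args) throws Exception {
          walk(new File("/mnt/xp")); // the Windows XP disk, mounted under Ubuntu
          post("<commit/>");         // make the new documents searchable
      }

      static void walk(File dir) throws Exception {
          File[] entries = dir.listFiles();
          if (entries == null) return; // unreadable directory - skip it
          for (File f : entries) {
              if (f.isDirectory()) {
                  walk(f);
              } else {
                  String name = f.getName();
                  int dot = name.lastIndexOf('.');
                  String ext = (dot >= 0) ? name.substring(dot + 1).toLowerCase() : "";
                  post("<add><doc>"
                          + "<field name=\"id\">" + escape(f.getPath()) + "</field>"
                          + "<field name=\"ext\">" + escape(ext) + "</field>"
                          + "</doc></add>");
              }
          }
      }

      // POST a chunk of XML to the Solr update handler.
      static void post(String xml) throws Exception {
          HttpURLConnection conn =
                  (HttpURLConnection) new URL(SOLR_UPDATE).openConnection();
          conn.setRequestMethod("POST");
          conn.setDoOutput(true);
          conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
          OutputStream out = conn.getOutputStream();
          out.write(xml.getBytes("UTF-8"));
          out.close();
          conn.getResponseCode(); // force the request; we don't care about the body
          conn.disconnect();
      }

      static String escape(String s) {
          return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
      }
  }

One POST per file is the lazy approach – batching lots of <doc> elements into each <add> would no doubt speed things up.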

(Renhart suggested an offshoot of this indexing too – namely the creation of a set of OS profiles, so that we can have a service that can be asked things like “What OS(es) does a file with SHA-1 hash XYZ belong to?” – enabling us to profile OSes and remove duplicates from our accessions).
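(The hashing half of that idea is easy enough, since SHA-1 ships with Java – a minimal sketch, with the profile service around it left as the interesting exercise:

  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.security.MessageDigest;

  // Sketch: print the SHA-1 hash of the file named on the command line.
  public class Sha1 {
      public static void main(String[] args) throws Exception {
          MessageDigest md = MessageDigest.getInstance("SHA-1");
          InputStream in = new FileInputStream(args[0]);
          byte[] buf = new byte[8192];
          int n;
          while ((n = in.read(buf)) != -1) {
              md.update(buf, 0, n); // hash the file a block at a time
          }
          in.close();
          StringBuilder hex = new StringBuilder();
          for (byte b : md.digest()) {
              hex.append(String.format("%02x", b)); // two hex digits per byte
          }
          System.out.println(hex);
      }
  }

Index those hashes per OS install and the “which OS(es) does this file belong to?” question becomes a simple lookup.)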

The final step was to use another Java “program” to cludge together the list of files in the accession, a lookup on the Solr index, and some file copying. Then it was done – one accession that mirrors a real live file structure and contains real live files, but none of those files are “private” or a problem if they’re lost. Even better, because we used more recent files, the accession is now 8GB rather than 2GB, which is more in line with what we’d expect to receive in the future.
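Again for the curious, the cludge works roughly like the sketch below. Same caveats as before – field names, URLs and paths are assumptions, and this lazy version takes the first match Solr returns for each extension rather than a genuinely random file:

  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.FileReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.net.URLEncoder;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Sketch: rebuild the accession's file tree using stand-in files found via Solr.
  public class AccessionBuilder {

      // pick the first "id" field out of Solr's XML response - crude but effective
      static final Pattern ID = Pattern.compile("<str name=\"id\">([^<]+)</str>");

      public static void main(String[] args) throws Exception {
          BufferedReader list = new BufferedReader(new FileReader("filelist.txt"));
          String path;
          while ((path = list.readLine()) != null) {
              int dot = path.lastIndexOf('.');
              String ext = (dot >= 0) ? path.substring(dot + 1).toLowerCase() : "";
              if (ext.length() == 0) continue; // extensionless files need another strategy
              String source = lookup(ext);
              if (source != null) {
                  // recreate the Windows path as a subtree under ./accession/
                  copy(new File(source), new File("accession", path.replace('\\', '/')));
              }
          }
          list.close();
      }

      // Ask Solr for one document whose "ext" field matches; return its path.
      static String lookup(String ext) throws Exception {
          URL url = new URL("http://localhost:8983/solr/select?rows=1&q="
                  + URLEncoder.encode("ext:\"" + ext + "\"", "UTF-8"));
          BufferedReader in = new BufferedReader(
                  new InputStreamReader(url.openStream(), "UTF-8"));
          StringBuilder response = new StringBuilder();
          String line;
          while ((line = in.readLine()) != null) {
              response.append(line);
          }
          in.close();
          Matcher m = ID.matcher(response);
          return m.find() ? m.group(1) : null;
      }

      // Plain old stream copy, creating parent directories as we go.
      static void copy(File from, File to) throws Exception {
          to.getParentFile().mkdirs();
          FileInputStream in = new FileInputStream(from);
          FileOutputStream out = new FileOutputStream(to);
          byte[] buf = new byte[8192];
          int n;
          while ((n = in.read(buf)) != -1) {
              out.write(buf, 0, n);
          }
          in.close();
          out.close();
      }
  }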

Hooray! Now gotta pack it into disk images and start exploring processing!

Should anyone be interested, the source code is available for download.

-Peter Cliff