Forensic archiving and search of web 2.0 sites

Bruce Wilson

As someone who has worked on a number of web application development projects over the years I understand the challenges of web content management and archiving more than most folks. Thus at LegalTech NY earlier this year I was particularly impressed by a vendor in the web archiving space called Hanzo Archives.

Many of us are familiar with a service called the Internet Archive (more commonly known as “The WayBack Machine”) which offers snapshots of previous versions of thousands of web sites, even small ones. It’s fun, and sometimes useful for information gathering, but hardly rises to the level of detail most of us would hope for in a litigation or compliance scenario.

What Hanzo does is take the idea of archiving web sites to a forensic level by comprehensively recording the content of a web site, including Flash and other non-HTML content, at frequent intervals. Once recorded, site archives are fully searchable and web content can be “replayed” exactly as it was published on a particular date, all in a manner that can be authenticated in court.

This fall I had the privilege of speaking with Mark Middleton, founder and CEO of Hanzo Archives, to satisfy my curiosity about what his product is capable of and who is using it.

Bruce: Mark, thank you for arranging to speak with me. I think I have a general understanding of what your archives do, but let me start off by asking you for some use cases that illustrate who needs your product and what they need it for.

Mark: Actually we have two products now. We are defining a new product, WebHold, which is a streamlined and simplified derivative of our existing ‘Enterprise’ service. We have advanced so far in the past couple of months that we are now able to collect the most insanely complicated web sites, where by comparison archiving something like financial services sites is simple.

To answer your question, our use cases would include litigation support and brand heritage. The common thread here is that, increasingly, companies are communicating and advertising to their audiences using web technologies. Whereas a company historically has been able to capture their communication in print or broadcast relatively easily, they are unable to do this for their web content and so for the first time in decades they have major communication channels they cannot capture for the future. One of the world’s most successful brands in the Food and Beverage industry has selected Hanzo for this purpose.

Here’s your legal use case. We have a prospect whose target audience for communication and advertising is young people. Our prospect communicates with them in sophisticated ways on the web – videos, games, surveys, and animation. They offer very sophisticated messaging about their products and put it on many websites in order to communicate their sophisticated brand to their customers. How does one capture that?

At the same time, in their words, they anticipate regular litigation about their products. They’ve got every other avenue covered – print, TV ads, materials provided on premises, but they cannot do web – they cannot rely on the “WayBack Machine” – none of the rich media is recorded there. They can do backups – but how do you recreate a web site from a backup, and how do you prove it was the one that was live on a particular date? We can capture their content on a regular basis, save it into secure containers, prove that the content in the containers is authentic and original, and recreate it in our archive system to look exactly as it did when it was live. So companies can enjoy the same level of confidence that they have in their other channels.
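To make the idea of provable authenticity a little more concrete, here is a minimal sketch in Python of one common technique: storing a cryptographic digest alongside each captured resource so that any later alteration can be detected. This is an illustration only, not a description of how Hanzo's containers work. The names `CaptureRecord`, `capture`, and `is_unaltered` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CaptureRecord:
    """One captured resource plus the evidence needed to re-check it later.

    Hypothetical structure for illustration; not Hanzo's container format.
    """
    url: str
    captured_at: str  # ISO 8601 timestamp, normalized to UTC
    sha256: str       # hex digest of the exact bytes that were captured
    body: bytes


def capture(url: str, body: bytes, when: datetime) -> CaptureRecord:
    """Record the bytes served at `url` together with a SHA-256 digest."""
    digest = hashlib.sha256(body).hexdigest()
    timestamp = when.astimezone(timezone.utc).isoformat()
    return CaptureRecord(url, timestamp, digest, body)


def is_unaltered(record: CaptureRecord) -> bool:
    """Recompute the digest and compare it with the one stored at capture time."""
    return hashlib.sha256(record.body).hexdigest() == record.sha256
```

A digest stored next to the content only detects accidental or careless changes. For evidentiary use, the digest would presumably also need to be kept somewhere the content's custodian cannot rewrite it, for example in a signed or independently timestamped log.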

Bruce: OK, again I think I understand, but to help explain it to others can you give me some concrete examples?

Mark: Here are a couple of examples from prospects and clients.

One is an investment house with a variety of products. Their website contains a mixture of historic performance data and propositions to entice the investor to buy into their investment product. Traditionally these kinds of offers were made in print prospectus documents, which regulators would require to be filed with them. On the web, companies are now making this offer in a unique way. People can select an investment product and pull up a calculator showing what returns they might be able to see. They can see graphs of performance, plus recent opinions of analysts, all on the same site. But because it is dynamically generated web content, as it is presented to the user, there is no capture anywhere of what someone sees. And so what has happened historically is that people take companies to court saying that the company is “not performing as to expectations as per your offer.” So now, because of this possibility, companies need to capture the web site experience so that they can prove it was reasonable and not misleading.

We’ve also received a lot of interest from pharmaceutical companies. Because of claims about their products and performance – drugs often don’t perform for an individual the way they perform statistically – these companies face potential class actions based on perceived underperformance. Advertisements in magazines and TV can all be captured, but web sites are much more difficult.

Bruce: I’ve worked on sites where the content keeps changing and have some idea of how messy it can be to try to figure out how it was at an earlier date. But how is Hanzo’s solution different from the current state of the art?

Mark: We have spoken with a U.S. pharmaceutical company who had to resurrect product information from their websites, even though it had been brought down years before. Maybe the judge gave them two months to resurrect it. They had to locate and re-hire former staff, building a team of 30 people to handle the project. Once they found the hard drives they needed, which were stored in a cupboard in a basement somewhere, they then had to rebuild from the code on up. That is an extreme case of course. But generally, when relying on backups or information stored in a content management system, they have to reconstruct physical server infrastructure and server software, including licensing, before they can even start with the content. But with Hanzo they don’t need any of that; they archive it independently and it can be reviewed immediately, on demand.

Bruce: OK, so by my way of thinking this is the same issue as disaster recovery—to be efficient you want to have a hot backup, not simply the opportunity to recreate your site from bare metal.

Also on our agenda for this conversation, you mentioned you have a new product? This is a SaaS product, did I understand that correctly?

Mark: Yes, we’ve been working on a new product – it’s called WebHold. We still only do web archiving. First, a little background. Most institutions that archive websites rely on software called web crawlers. We’ve used several web crawlers from the open source community and have also developed our own. In the last few years we have done a lot of research and had the opportunity to archive some very complex sites, and have developed tech that exceeds existing crawlers. But we have still kept to standard archive files to be consistent with standards, even though we are now putting multimedia, etc. in them. So what we’ve managed to do is this. With our technology we can capture sites very easily. Particularly for customers who have compliance requirements but are not cash rich, we can offer something effective and great value for money. We have come up with a product that will archive websites on a daily basis at a level of quality that will meet compliance requirements from an archival standpoint. This is something we can offer to FINRA [US] or FSA [UK] regulated companies with a high degree of reliability. It’s fully SaaS. Customers submit their sites, which are crawled, then the results are made available to the customer, and we archive their sites every day.
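For readers unfamiliar with web crawlers, the basic loop is simple: fetch a page, record it, extract its links, and queue any that belong to the same site. The sketch below shows that loop in Python using only the standard library. It is a deliberately simplified illustration and not Hanzo's crawler: the names `LinkExtractor` and `crawl` are hypothetical, the `fetch` callable stands in for real HTTP requests, and it only follows `href` and `src` attributes in static HTML, which is exactly the limitation that makes rich media and dynamically generated sites hard to capture.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the raw href/src attribute values found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl of one host.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that returns the page text for a URL, or None if it cannot be retrieved.
    Returns a dict mapping each captured URL to its content.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    archive = {}
    while queue and len(archive) < limit:
        url = queue.popleft()
        content = fetch(url)
        if content is None:
            continue  # nothing retrievable at this URL; skip it
        archive[url] = content
        parser = LinkExtractor()
        parser.feed(content)
        for link in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, link))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive
```

Passing `fetch` in as a parameter keeps the sketch testable without network access. A daily archiving service would presumably run something like this on a schedule and write the results into standard archive files rather than an in-memory dict.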

Bruce: What about archiving inside a firewall?

Mark: The regulatory requirements are to archive public facing websites that present advertising and offers that go to the consumer. For Enterprises it is also possible to archive intranets inside the firewall using Hanzo’s crawlers and access systems. We can do it either as SaaS over a VPN or as an appliance. Offering our products as an appliance was the result of opportunities we had to capture collaborative web platforms on corporate and government intranets.

Bruce: In an earlier blog entry about disaster recovery I learned that one underpublicized form of disaster is when a third party SaaS business goes under, thus cutting a company off from its data. Apparently this happens in niche verticals every so often. Can Hanzo be rigged up to capture a company’s data on an SaaS, say Salesforce.com? Not that I’m predicting that they’re going away any time soon….

Mark: Hanzo can archive some simple web based apps already. It’s a departure from the standard architecture of crawlers. But we can do that collaboratively with a client.

Bruce: How about pulling data out of a third party SaaS for the purposes of eDiscovery?

Mark: Hypothetically speaking, if it was for some reason undesirable or not possible for a third party SaaS provider to produce the data themselves, Hanzo could be used to get data for eDiscovery from a SaaS system.

2 Replies to “Forensic archiving and search of web 2.0 sites”

  1. Great article Bruce! For those looking for other providers in this field consider PageFreezer (http://pagefreezer.com) which also does archiving of online content. Since this article was written there’s been a continual massive growth in social media and thus a corresponding need for social media archiving – Perhaps fodder for another article?
