Forensic archiving and search of web 2.0 sites

Bruce Wilson

As someone who has worked on a number of web application development projects over the years I understand the challenges of web content management and archiving more than most folks. Thus at LegalTech NY earlier this year I was particularly impressed by a vendor in the web archiving space called Hanzo Archives.

Many of us are familiar with a service called the Internet Archive (more commonly known as “The WayBack Machine”) which offers snapshots of previous versions of thousands of web sites, even small ones. It’s fun, and sometimes useful for information gathering, but hardly rises to the level of detail most of us would hope for in a litigation or compliance scenario.

What Hanzo does is take the idea of archiving web sites to a forensic level by comprehensively recording the content of a web site, including Flash and other non-html content, at frequent intervals. Once recorded, site archives are fully searchable and web content can be “replayed” exactly as it was published on a particular date, all in a manner that can be authenticated in court.

This fall I had the privilege of speaking with Mark Middleton, founder and CEO of Hanzo Archives, to satisfy my curiosity about what his product is capable of and who is using it.

Bruce: Mark, thank you for arranging to speak with me. I think I have a general understanding of what your archives do, but let me start off by asking you for some use cases that illustrate who needs your product and what they need it for.

Mark: Actually we have two products now. We are defining a new product, WebHold, which is a streamlined and simplified derivative of our existing ‘Enterprise’ service. We have advanced so far in the past couple of months that we are now able to collect the most insanely complicated web sites, where by comparison archiving something like financial services sites is simple.

To answer your question, our use cases would include litigation support and brand heritage. The common thread here is that, increasingly, companies are communicating and advertising to their audiences using web technologies. Whereas a company historically has been able to capture their communication in print or broadcast relatively easily, they are unable to do this for their web content and so for the first time in decades they have major communication channels they cannot capture for the future. One of the world’s most successful brands in the Food and Beverage industry has selected Hanzo for this purpose.

Here’s your legal use case. We have a prospect whose target audience for communication and advertising is young people. Our prospect communicates with them in sophisticated ways on the web – videos, games, surveys, and animation. They offer very sophisticated messaging about their products, put it on many websites, in order to communicate their sophisticated brand to their customers. How does one capture that?

At the same time, in their words, they anticipate regular litigation about their products. They’ve got every other avenue covered – print, TV ads, materials provided on premises – but they cannot do web. They cannot rely on the “WayBack Machine” because none of the rich media is recorded there. They can do backups, but how do you recreate a web site from a backup, and how do you prove it was the one that was live on a particular date? We can capture their content on a regular basis, save it into secure containers, prove that the content in the containers is authentic and original, and recreate it in our archive system to look exactly as it did when it was live. So companies can enjoy the same level of confidence that they have in their other channels.
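
As a rough illustration of what "proving the content is authentic" can mean in practice, here is a minimal Python sketch of one common approach: recording a cryptographic digest alongside each captured resource and re-checking it later. This is an assumption for illustration only; the interview does not describe how Hanzo's containers actually work, and the function names, field names, and example URL below are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def capture_record(url: str, body: bytes, captured_at: datetime) -> dict:
    """Build a manifest entry that fingerprints one captured resource."""
    return {
        "url": url,
        "captured_at": captured_at.isoformat(),
        "sha256": hashlib.sha256(body).hexdigest(),
        "length": len(body),
    }


def verify(record: dict, body: bytes) -> bool:
    """Return True only if the stored bytes still match the recorded digest."""
    return (
        len(body) == record["length"]
        and hashlib.sha256(body).hexdigest() == record["sha256"]
    )


# Hypothetical captured page; in a real archive this would be the raw HTTP body.
page = b"<html><body>Fund X returned 7% last year</body></html>"
record = capture_record(
    "https://example.com/fund-x",
    page,
    datetime(2009, 11, 2, 12, 0, tzinfo=timezone.utc),
)
tampered = page.replace(b"7%", b"9%")

print(json.dumps(record, indent=2))
print(verify(record, page))      # unchanged bytes match the manifest
print(verify(record, tampered))  # any edit changes the digest
```

A digest alone only shows that the bytes have not changed since the manifest was written. Showing when the capture happened, and that the manifest itself was not rewritten, would need something more, such as signed or externally timestamped manifests.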

Bruce: OK, again I think I understand, but to help explain it to others can you give me some concrete examples?

Mark: Here are a couple of examples from prospects and clients.

One is an investment house with a variety of products. Their website contains a mixture of historic performance data and propositions to entice the investor to buy into their investment product. Traditionally these kinds of offers were made in print prospectus documents, which regulators would require to be filed with them. Specific to the web, the companies are now making this offer in a unique way. People can select an investment product, and pull up a calculator showing what returns they might be able to see. They can see graphs of performance, plus recent opinions of analysts, all on the same site. But because it is dynamically generated web content, as it is presented to the user, there is no capture of what someone sees anywhere. And so what has happened historically is that people take companies to court saying that the company is “not performing as to expectations as per your offer.” So now, because of this possibility, companies need to capture the web site experience so that they can prove it was reasonable and not misleading.

We’ve also received a lot of interest from pharmaceutical companies. Because of claims about their products and performance – drugs often don’t perform for an individual the way they perform statistically – these companies face potential class actions based on perceived underperformance. Advertisements in magazines and TV can all be captured, but web sites are much more difficult.

Bruce: I’ve worked on sites where the content keeps changing and have some idea of how messy it can be to try to figure out how it was at an earlier date. But how is Hanzo’s solution different from the current state of the art?

Mark: We have spoken with a U.S. pharmaceutical company who had to resurrect product information from their websites, even though it had been brought down years before. Maybe the judge gave them two months to resurrect it. They had to locate and re-hire former staff, building a team of 30 people to handle the project. Once they found the hard drives they needed, which were stored in a cupboard in a basement somewhere, they then had to rebuild from code on up. That is an extreme case of course. But generally, when relying on backups or information stored in a content management system they have to reconstruct physical server infrastructure, and server software, including licensing, before they can even start with the content. But with Hanzo they don’t need any of that; they archive it independently and it can be reviewed immediately, on demand.

Bruce: OK, so by my way of thinking this is the same issue as disaster recovery—to be efficient you want to have a hot backup, not simply the opportunity to recreate your site from bare metal.

Also on our agenda for this conversation, you mentioned you have a new product? This is an SaaS product, did I understand this correctly?

Mark: Yes, we’ve been working on a new product – it’s called WebHold. We still only do web archiving. First, a little background. Most institutions that archive websites rely on software called web crawlers. We’ve used several web crawlers from the open source community and have also developed our own. In the last few years we have done a lot of research and had the opportunity to archive some very complex sites, and have developed tech that exceeds existing crawlers. But we have still kept to standard archive files to be consistent with standards, even though we are now putting multimedia, etc. in them. So what we’ve managed to do is this. With our technology we can capture sites very easily. Particularly for customers who have compliance requirements but are not cash rich, we can offer something effective and great value for money. We have come up with a product that will archive websites on a daily basis at a level of quality that will meet compliance from an archival level. This is something we can offer to FINRA [US] or FSA [UK] regulated companies with a high degree of reliability. It’s fully SaaS. Customers submit their sites, which are crawled, then the results are made available to the customer, and we archive their sites every day.

Bruce: What about archiving inside a firewall?

Mark: The regulatory requirements are to archive public facing websites that present advertising and offers that go to the consumer. For Enterprises it is also possible to archive intranets inside the firewall using Hanzo’s crawlers and access systems. We can do it as a SaaS using either a VPN or as an appliance. Offering our products as an appliance was the result of opportunities we had to capture collaborative web platforms on corporate and government intranets.

Bruce: In an earlier blog entry about disaster recovery I learned that one underpublicized form of disaster is when a third party SaaS business goes under, thus cutting a company off from its data. Apparently this happens in niche verticals every so often. Can Hanzo be rigged up to capture a company’s data on an SaaS, say Salesforce.com? Not that I’m predicting that they’re going away any time soon….

Mark: Hanzo can archive some simple web based apps already. It’s a departure from standard architecture of crawlers. But we can do that collaboratively with a client.

Bruce: How about pulling data out of a third party SaaS for the purposes of eDiscovery?

Mark: Hypothetically speaking, if it was for some reason undesirable or not possible for a third party SaaS provider to produce the data themselves Hanzo could be used to get data for eDiscovery from a SaaS system.

IT crapshoot: cost-cutting is costly in disaster recovery, archiving for e-discovery

Disaster recovery and archiving are key zones of interaction for IT and Legal Departments. When a lawsuit is filed and an e-discovery production request is received, a company must examine all of its electronically stored information to find documents that are relevant to that lawsuit. Court battles may arise regarding the comprehensiveness of the examination, the need to lock down potentially important documents and metadata, and the cost of identifying, collecting, preserving, and reviewing documents — all of which are related to the way in which data is stored.

Photo credit: Josie Hill

With this in mind, I recently sought out Jishnu Mitra, President of Stratogent, a specialized application hosting and disaster recovery services provider, to obtain his perspective on disaster recovery best practices and the relationship between disaster recovery and e-discovery. Key points he made include:

  • effective disaster recovery sites are “hot” sites that can be used for secondary purposes rather than remaining idle;
  • “cold” sites are unlikely to get the job done and are not cheap;
  • efforts to keep IT budgets down by delaying or limiting disaster recovery, or by limiting archiving, can backfire;
  • budget-conscious IT departments are more likely to use archiving features built-in to their software of choice;
  • many IT and Legal personnel have a habit of being disrespectful towards one another and doing a poor job of communicating with one another;
  • more crossover Legal-IT people are needed.

Bruce: Can you provide a little background about Stratogent’s domain expertise?

Jishnu: We offer end-to-end application hosting services, including establishing the hosting requirements and architecture, hardware and software implementation, and proactive day-to-day application management including responding to any issues that arise. Most of the time we are tasked with building a full data center, not the building itself, of course, but a complete software and hardware hosting framework. We aren’t providers of any specific business application (as salesforce.com is). We design, deploy and operate all the layers on which modern business applications are hosted, including the application’s framework, e.g. .NET, Java or SAP Basis.

Our customers include multi-office companies, who require applications shared between offices, and web-based application SaaS (“software as a service”) companies. The scope is typically quite complex – we don’t build or manage general web sites or blogs — that’s a commodity market and too crowded. We build and manage custom application infrastructures for enterprises or for complex applications that require a range of IT skills to manage. Our customers hire us because they don’t want to budget to hire all of the people they would need to do this internally, or when they are deploying a new application that is beyond the current reach of their IT team. For example, if a company wants to start using a new-to-them ERP [“Enterprise Resource Planning”] application like SAP or (say) a Microsoft based enterprise landscape that needs to scale, we can multiplex our internal pool of talent to give their application 24-7 attention far cheaper than the company can hire and retain the specialized employees they need to do it themselves.

Bruce: So you supply the specialized competencies needed to build and operate complex application environments so that your customers can focus on their core competencies? Then their core competencies don’t need to include what you do in order for them to succeed.

Jishnu: Yes. They know what they want, they conceptualize what they want, but not the hardware they need and the infrastructure software. We can go in from the very beginning saying, “Here’s how you set up a highly available, clustered server farm for your social networking app,” and so on and so forth. We know how to customize it and set it up. They also don’t have our expertise in negotiating with hardware vendors, or in capacity planning, etc. Plus there’s the build phase, loading OSes, etc. We essentially give them over the course of our engagement the entire hosting framework on which the app runs and then take care of it for the long run.

Once we get their hosting framework to a steady state, they get to run with it for two, five or more years with no or limited failure. So their role is conceptualizing on day 1 and then we become a partner organization worrying about how to realize that dream, handling inevitable IT break-fix issues and managing changes over the entire life span of that system. Disaster recovery usually becomes part of that framework at some point.

Bruce: Can you give me some broad idea of the scope of disaster recovery work that you do?

Jishnu: Disaster recovery is not a separate arm of our business. It’s very integral to the hosting services we provide. We build disaster recovery sites at different levels of complexity. It can go from a small customer up to a really large customer. And over time Stratogent gets into innovative approaches to deal with disaster recovery. The philosophy of Stratogent is that we’re not trying to sell a boxed solution to all the customers. It’s more of a custom solution, not a mass market product. We say we will architect and host your solution – and as architect we always add very specific elements for our customer, not just one solution for everyone.

The basic approach, even for small customers, is to choose a convenient and correct location for the disaster recovery site and use a replication strategy based on whatever they can afford or have tolerance to accept. As much as possible a disaster recovery site should be up and running and ready to go at a flip of the switch. They can use the excess capacity at their disaster recovery site at quarter end to run financial reports or for other business purposes, plus it can be used for application QA and staging systems. They can be smart about it, and keep it on, so that they can have confidence in it.

Of course a disaster recovery solution like this can’t be built in just a month or two – to do it right requires creativity and diligence. In one recent instance when asked to do it “right now”, we had to go with a large vendor’s standard disaster recovery solution for our customer. Everyone knows that this does not get us anything beyond the checkmark for DR, so the plan is to go to a Stratogent solution over time, build a hot alternate site on the East coast, and sunset the large vendor’s standard disaster recovery arrangement.

Bruce: Given the importance of disaster recovery for a number of reasons, how seriously are companies taking it?

Jishnu: Everybody needs it, but it suffers from “high priority, low criticality”, and the problem rolls from budget year to budget year. Some unpleasant trigger like an outage, or an impending audit instigates furious activity in this direction, but then it goes on to the back burner again. In the recent instance, although disaster recovery was scheduled for a later phase for technical reasons, for SOX compliance the auditor demanded a disaster recovery solution by year end or our customer would fail their audit. So we went out and obtained a large vendor’s standard disaster recovery solution, which met the auditors’ requirements but isn’t comparable to a “hot” disaster recovery site.

The way disaster recovery solutions from some of the large vendors work is this: they have huge data centers where their customers can use equipment should a disaster happen. Customers pay a monthly fee for this privilege. When a disaster strikes, customers ship their backup tapes out there, fly their people out there, and start building a disaster recovery system from scratch. And by the way, if you have trouble, here’s the menu for emergency support services for which they will charge you more. And in 95 out of 100 cases it just doesn’t work, and is a monumental failure when you need it most. These are “cold” sites that have to be built from the ground up. It takes maybe 72 hours to get them up, or rather to get them asserted as “up”. Then, as someone like yourself with application development experience knows, it takes weeks to debug and get everything working correctly. And when you’re not actually using them, standard disaster recovery services are charging you an incredibly high amount of money for nothing except the option of bringing your people and tapes to their center, and then good luck.

Bruce: You mentioned running quarterly financials, QA, and staging as valuable uses for the excess capacity of “hot” disaster recovery sites. Could this excess capacity also be used for running e-discovery processes when the company is responding to a document production request?

Jishnu: Possibly, but I haven’t seen it done yet in a comprehensive manner. The problem is you still need to have the storage capacity for e-discovery somewhere. The e-discovery stuff is a significant chunk of storage, maybe tier 2 or 3, which demands different storage anyway, so it makes sense to keep the e-discovery data in the primary data center because it’s easier and faster to copy, etc. That said, it is very useful to employ the capacity available in the secondary site for e-discovery support activities like restoring data to an alternate instance of your application and for running large queries without affecting the live production systems.

Bruce: Do you deploy disaster recovery solutions that protect desktop drive, laptop drive, or shared drive data?

Jishnu: As I have said, our disaster recovery solutions are part of whatever application frameworks we are hosting. We as a company don’t get into the desktop environment, the local LANs that the companies have. We leave that to local teams or whatever partner does classic managed services. We do data centers and hosted frameworks. We don’t have the expertise or organizational structure to have people traveling to local sites, answering desktop-related user queries, etc. But any time it leaves our customer’s office and goes to the internet, from the edge of the office on out it’s ours.

Bruce: But when archiving is part of the customer’s platform hosted by you, it gets incorporated in your disaster recovery solution?

Jishnu: Yes.

Bruce: Is Stratogent involved when your customers must respond to e-discovery and regulatory compliance information retrieval requests?

Jishnu: Yes. For example, we recently went through and did what needed to be done when a particular customer asked for all the documents in response to a lawsuit. We brought in a consultant for that specific archiving system as well. Our administrators collaborated with the consultant and 2 people from the customer’s IT department. It took a couple of weeks to provide all the documents they asked for.

Bruce: Was the system designed from the outset with minimizing e-discovery costs in mind?

Jishnu: Unfortunately no. In this case archiving for e-discovery was an afterthought and was grafted on to the application later and a push-button experience wasn’t in the criteria when designing this particular system. But it woke us up. We realized this could get worse.

Bruce: So how do you do it differently now that you’ve had this experience?

Jishnu: Here we recommended to our customer that we upgrade to the newest version of the archiving solution and begin using untapped features that allow for a more push-button approach. Keep in mind that e-discovery products weren’t as popular or sophisticated as you see them now.

Bruce: Aren’t there third-party archiving solutions also?

Jishnu: There are several third-party products and you see the regular enterprise software vendors coming out with add-ons. We’re especially looking forward to the next version of Exchange from Microsoft, where for us the salient feature is archiving and retention. That is only because email is the number one retrieval request. On most existing setups getting the information for a lawsuit or another purpose takes us through an antiquated process of restoring mail boxes from tape and loads of manual labor. It’s pretty painful, it takes an inordinate amount of time to find specific emails, it’s not online, it takes days. For this reason we’re looking forward to Exchange 2010, which has features built INTO the product itself. Yes, some other vendors have add-on products that do this also.

Bruce: And I assume you’re familiar with Mimosa, in the case of Exchange?

Jishnu: Like Mimosa, yes. But when it’s built-in the customer is more likely to use it. By default customers don’t buy add-ons for budgetary reasons. It’s so much easier if the central product has what we need, and that is in fact happening a lot these days. I won’t be surprised if products in general evolve so that compliance and regulatory features get considered integral parts of the software and not someone else’s problem.

Bruce: Do you have other examples of document retrieval from backups or archives?

Jishnu: Actually there are three scenarios where we do document retrieval. Scenario one, which we discussed, is e-discovery. Scenario two is mergers and acquisitions, where we have seen retrieval requests and had to pretty much get information from all sorts of systems, a huge pain.

Scenario three is SaaS driven. For many of our customers, the bulk of their systems are either on-premises or hosted by Stratogent, but some of our customers use Salesforce.com or one of many, many small or industry specific SaaS vertical solutions. In one recent case, one of these niche vertical SaaS vendors, because of some of the issues in that industry, was about to go out of business. We had to go into emergency mode and create an on-premises mirror, actually more like a graveyard for the data, to keep it for the future, to enable us to fetch the data from that service. We figured out a solution for how to get all the customers’ data and replicate and keep it in our data center and continuously keep it up to date. Fortunately the vendors were cooperative and allowed access through their back door to allow us to achieve this. I call this “the SaaS fallback” scenario. SaaS is a great way to quickly get started on a new application, but BOY, if anything happens, or if you decide you aren’t happy, it becomes a data migration nightmare and worse than an on-premises solution because you have no idea how it’s being kept and have to figure out how to retrieve it through an API or some other means.
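
To make the “SaaS fallback” idea concrete, here is a minimal Python sketch of an incremental mirror: each run asks the vendor only for records changed since the newest one already stored locally, then upserts them. It is an illustration under assumptions, not a description of what Stratogent built. The `fetch_changed_since` callable stands in for whatever export API or back-door access a vendor provides, and the record fields (`id`, `updated_at`, `payload`) and table layout are hypothetical.

```python
import sqlite3
from typing import Callable, Iterable

Record = dict  # assumed keys: "id", "updated_at" (ISO 8601 string), "payload"


def sync_mirror(db: sqlite3.Connection,
                fetch_changed_since: Callable[[str], Iterable[Record]]) -> int:
    """Copy records changed since the last sync into a local mirror table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS mirror ("
        "id TEXT PRIMARY KEY, updated_at TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    # High-water mark: newest timestamp already mirrored ('' on the first run).
    (since,) = db.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM mirror"
    ).fetchone()
    count = 0
    for rec in fetch_changed_since(since):
        # Upsert so a record edited at the vendor overwrites the older copy.
        db.execute(
            "INSERT OR REPLACE INTO mirror (id, updated_at, payload) "
            "VALUES (?, ?, ?)",
            (rec["id"], rec["updated_at"], rec["payload"]),
        )
        count += 1
    db.commit()
    return count


# Stand-in for the vendor's data and export API (hypothetical).
VENDOR_DATA = [
    {"id": "a1", "updated_at": "2009-10-01T09:00:00", "payload": "first"},
    {"id": "b2", "updated_at": "2009-10-02T09:00:00", "payload": "second"},
]


def fake_vendor_api(since: str) -> list:
    return [r for r in VENDOR_DATA if r["updated_at"] > since]


db = sqlite3.connect(":memory:")
print(sync_mirror(db, fake_vendor_api))  # first run copies every record
print(sync_mirror(db, fake_vendor_api))  # second run finds nothing new
```

Comparing ISO 8601 strings works here because they sort chronologically when they share one format and time zone. A strict "greater than" filter can miss a record that shares a timestamp with the high-water mark, so a real sync would likely overlap the window slightly and rely on the upsert to absorb duplicates.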

Bruce: In e-discovery and other legal-driven document recovery scenarios, how important is collaboration between IT and Legal personnel, or should I say, how significant a problem is the lack of this collaboration?

Jishnu: I’ve seen the divide between IT and legal quite often. Calling it a divide is actually being polite; at worst both parties seem to think the others are clueless or morons. It’s a huge, huge gap. And I have also seen it playing out not just in traditional IT outfits, but also product based companies when I was principal architect at Borland. When attorneys came to talk to engineering about IP issues, open source contracts or even patent issues, there was no realization among the techies that it was important. In fact legal issues were labeled “blockers” and the entire legal department was “the business prevention department”. And there is exactly the opposite feeling in the other camp with how engineering leaders don’t “get it” and how talking to anybody in development or IT was like talking to a wall. The psychological and cultural issues between IT and legal have been there for a while. In some of the companies that have surmounted this issue, the key seems to be having a bridge person or team acting as an interpreter to communicate and keep both sides sane. Some technical folks I know have moved on to play a distinctly legal role in their organizations and they play a pivotal role in closing the gap between legal and IT.

Desktop, laptop, email backups critical for employee lawsuits

I recently spoke with Thao Tiedt, a labor and employment partner at Ryan Swanson & Cleveland, PLLC, a mid-sized full service Seattle law firm. (Full disclosure: I’ve benefited from her incisive advice a number of times when I was wearing the hat of corporate counsel.) Our conversation focused on eDiscovery from the perspective of consequences when individual employees use company computers in ways not approved by their employer.

Bruce: Thao, I first asked you this question some years ago, but I’ll ask again so you can catch me up and share this information with a wider audience. When employees of a company use a company computer, even for personal purposes, who does the information belong to after it winds up on the company’s computer?

From an IT perspective, preparing to defend against employee lawsuits starts long before "there is even a smell of dispute in the air."

Thao: In other words, do employees have an expectation of privacy? Yes and no. In the workplace the employer has the right to take that expectation away through a variety of policies and practices. This includes email and voice mail. With telephone conversations, an employer can’t listen without permission of both the employees and others on the line. States’ laws vary; some states require that at least one person on the conversation has to give you permission to record it. But permission can be obtained through fair warning – you don’t have to get explicit permission, it can be tacit, as when a message is played announcing that a conversation may be recorded – when someone hears that and doesn’t hang up permission is implicit. Employees may be given a policy manual or an explicit waiver to sign that states that privacy is waived. If an employee refuses to sign, they can’t stay employed.

Bruce: What happens when employees try to remove information from a company computer?

Thao: People think they’re smart and they can make information go away. Here’s a good example: one of my clients is a company that received a demand for arbitration over alleged sexual harassment. So I had the company put a hold on all of the computers involved, including both the employee’s and the accused manager’s – in their cases by physically picking the computers up. Upon technical evaluation it appeared that the claimant had been wiping hers. But she failed to realize that the company had backup tapes for disaster recovery purposes. Also, this particular company has multiple branches so it has central email servers. And after interviewing co-workers, a hint of impropriety appeared. I asked one of the claimant’s co-workers “anything else we should know?” The co-worker showed me a cellphone picture sent by the claimant, showing the claimant nude from waist up, with the caption “does this change your mind?” Apparently she had wanted the co-worker to date her and he had refused. When we looked at the company email accounts we found lots of these pictures, which we could tell from the background were taken in the company bathroom. It turns out she had been spending a lot of time on dating sites while at work and sending multiple men the pictures.

Later we learned that someone had asked her, “Don’t you think you should be careful?” She had answered, “No, someone in IT told me how to double-delete computer files.”

After all of this information came out in the open her cause of action went away. Given her behavior it was clear that if her accused manager had in fact asked her to expose herself, as she claimed, she would have gladly done so.

This just goes to show: no one should think they can make digital information go away.

There are a huge number of cases where the smoking guns are emails. Somehow people don’t think of emails as documents, they think of them as chit-chat. Far from it. For example, when training attorneys in our firm we teach them that emails are no different from formal letters sent to clients and should be handled with the same care.

Bruce: What about accessing web sites using work computers?

Thao: Of course web use can get traced back to inappropriate sites, like pornography servers for example. I actually had to go home to view a site that had been accessed by an employee on one occasion, because our firm’s own web filters are set so high I couldn’t do it from work. For a while I couldn’t order my own underwear online from work.

Anyway, it turned out this person was running a business on work time – the business of being web master for a porn site.

However, as a general rule an employee can conduct their own business on their lunch hour, as long as that isn’t a conflict with their employer in some fashion.

Bruce: I’ve read about studies that suggest employee productivity actually goes up when they can do a certain amount of personal work – scheduling doctors’ appointments and what not – from their work computer during work hours, because that flexibility leads to less tardiness and absenteeism and so forth. So how does an employer who believes this is true handle personal use of work computers?

Thao: Here’s what we say in our own [Ryan Swanson & Cleveland] employee manual: employees may make limited, incidental, responsible personal use of company computers.

Having said that, an employer can still intercept and log employee use of company computers. In the harassment case I mentioned, for example, we examined how both parties had used their computers. The accused manager was very uncomfortable with having attorneys review his work materials, but we needed to see his responses to her emails to make the company’s case. What we found didn’t support her case, but did lead us to caution him to stop unrelated inappropriate use of his work computer.

Bruce: What about when employees use their personal email account, like Gmail, from a work computer?

Thao: Does accessing email on a company computer waive privacy protection? Yes. There is no expectation of privacy for personal email stored on a company computer.

Bruce: How about a password for a personal email account, once it has been typed into a company computer?

Thao: Yes, if it’s on the work computer then it’s information that belongs to the employer.

Bruce: But can the employer use that information? What if they use the password to access an employee’s personal email account, like an AOL or Gmail account?

Thao: No. The employer can possess the password if it’s on the company’s computer, but they can’t use it to log into the personal email account.

Bruce: What about Google Gears, which makes local copies of personal email and Google documents on the computer being used, which might be a work computer?

Thao: Then the company has a right to see that information. Anything on the company computer is the company’s – if the company policy reads that way.

California sometimes has different views concerning privacy – they have a state constitutional right to privacy. But as long as companies have been up front with employees by notifying them that if information goes through a work computer, that information can be accessed by the company, then employer access to that information is allowed in California as well.

Bruce: When a lawsuit is threatened you send out a scary letter to employees telling them to avoid destroying evidence?

Thao: We send out a “scary letter” right away [to leave no doubt what is expected of people].

It can be the case that having electronically stored information collected by an outside vendor creates insulation against tampering and a better evidentiary chain of custody, even with intellectual property secrecy issues. Outside vendors can make good selections about what fits an eDiscovery inquiry.

What you don’t want is for opposing counsel to see something secret [and not responsive to a discovery request] that may be useful to their client in some way. If that happens it creates a question for that attorney about what their duty is to their client – to reveal or not to reveal that information – and then there’s the fact that you can’t get it out of your head once you’ve seen it. It will absolutely color your strategy down the road.

Also, concerning attorney-client privilege: privilege is waived whenever a privileged email is copied to anyone outside of “speaking agents of the company.” This happens all the time, even when recipients of privileged emails are warned. Forwarding emails is a hard habit to break.

Bruce: Symantec recently commissioned a study which revealed that a very high percentage of laid-off employees copy company information and take it with them when they go. What, if any, recourse does a company have when employees leave with info?

Thao: Here’s an example. One of my clients is a regional auto dealer association. A common problem they have is that new vehicle salespersons typically view the customers they sell to as “my customers” who they can “keep” after they move to a different dealership. Wrong – they are the dealer’s customers, not the salesperson’s. In addition, customer information is considered private under federal law. If someone captures that information for some purpose other than a business transaction, it violates federal privacy law.

Bruce: What remedies are available to an employer in this situation? What can an auto dealer do if a new vehicle salesperson takes a customer list with them?

Thao: The dealer can file for an injunction telling a dealer not to use information that came from other dealers. When dealers do receive such information it won’t be profitable because an injunction is very expensive for them to defend as well as scary and distracting.

And if the company whose information was taken can prove actual damages, then they can receive money damages from the new employer for tortious interference with private information. For example, I had a case where a person thought they were going to be terminated, so they copied specifications for a technical piece of equipment and emailed them to themselves. Then they changed information in the company computers regarding that equipment, which was very expensive for that company to correct. A new employer could be held liable for damages by accepting that information from the former employee.

Bruce: What about non-competition agreements – do those work?

Thao: A non-compete protects employer information that’s already in an employee’s head. It’s limited but it works. For example, it can say a vehicle sales employee can’t work in a dealership selling the same type of car in the same county, but usually can’t keep someone from working in the car business altogether, or for any company within that county. It works as long as you don’t prevent the employee from working anywhere in the same business.

Bruce: Did you read about the Motorola ex-CFO who quit, apparently under some kind of cloud, then returned his company laptop with files wiped? He then accused the company of retaliation, so the company accused him of spoliation. What can an employer do in this situation? Can the court award sanctions against an ex-employee for destroying evidence?

Thao: Yes, most people don’t understand that computer files must be preserved whenever there is even a smell of dispute in the air. Might the court award money sanctions? Possibly. Or, in some extremely serious situations the judge can order that the offending party can’t defend itself; or that a party can’t pursue its lawsuit – case dismissed. It’s a form of inconsistent pleading – a claimant can’t resist providing information and pursue a remedy simultaneously.

Bruce: From what you have said today it sounds like data backups of one sort or another are a critical element for eDiscovery, at least in your practice.

Thao: Disaster recovery backups just make sense as a litigation backup data source when dealing with employees. But you need historical backups that are locked down so that they can’t be erased for a period of time during which they might be needed.

Archiving is another thing you can do. For example, the Puget Sound Automobile Dealers Association maintains an electronic archive of participating dealers’ employee policy manuals over the years which can be used as evidence in an employee dispute.

Bruce: Which brings us to a final thought. There’s a lot of company data — confidential customer data — in the hands of non-attorneys who don’t have the same paranoia about casually exposing it that attorneys like you and I do….

Thao: Yes, you have to have confidence in IT people that they won’t be trolling confidential information, that they will keep it confidential.

Tape Indexing Breathes Life Into Tape Storage

Here’s an observation that can be tagged “mixed blessings”: foot dragging on the part of techno-lagging attorneys has shielded (and in some cases continues to shield) their clients from the full potential weight of eDiscovery requests. For example, even after years of discussion, the legal profession didn’t formally recognize the obligation to produce metadata in response to discovery requests before the Federal Rules of Civil Procedure amendments adopted at the end of 2006. More outrageously, some attorneys are still gaming to avoid eDiscovery altogether, as Magistrate Judge John M. Facciola (U.S. District Court, Washington D.C.) pointed out in his keynote presentation at LegalTech earlier this year.

Only a few years ago certain courts had ruled that data stored on tape could be considered “inaccessible” because it was so expensive to review it, and thus data stored on tape did not always need to be reviewed when answering an eDiscovery request (for example, the Zubulake decisions). More recently, however, the legal profession is becoming aware of advances which make tapes faster and cheaper to review, like technology for rapid disaster recovery.

What IT person doesn't look forward to working with historical data?

There are still a number of fine distinctions being made in this area of law, and the specific tape handling practices of different companies can render their tapes more or less “accessible.” (Ironically, companies that archive backup tapes indefinitely, which sounds like a safe practice, may be exposing themselves to a greater burden in eDiscovery, not to mention the extra cost of storing outdated tapes.) But broadly speaking, few companies storing information on tape can categorically rely on “inaccessibility” to rule out the risk of being required to review their tapes during eDiscovery anymore. For more about the law concerning inaccessibility, including California’s burden-shifting rules, I recommend this article by Winston & Strawn attorneys David M. Hickey and Veronica Harris.

Fortunately, two prongs of innovation are shrinking the issues surrounding eDiscovery and tape. The first prong, which happens to be the subject of this blog post, comes in the form of new tape indexing and document retrieval technology. The second prong, which involves substituting hard drives in place of tape, will be the subject of a future post.

To learn more about the current state of eDiscovery technology in the realm of tape, I recently spoke with Jim McGann, Vice President of Marketing at Index Engines. Index Engines’ solution comes in the form of an appliance (a hardware box pre-loaded with their software) that scans a broad variety of tapes and catalogs the content. The appliance indexes tape data and de-duplicates documents within the index using the hash values of the documents. At this point users can cull (selectively retrieve) potentially responsive documents from a batch of ingested tapes without first performing an expensive, resource-intensive full restoration of each tape. And because Index Engines can ingest all of the common tape storage formats, users don’t need to run or even possess the original software used to write to the tapes.
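The hash-based de-duplication described above can be illustrated with a minimal sketch. This is not Index Engines’ implementation: the function name, the choice of SHA-256 as the hash, and the sample documents are all assumptions made purely for illustration.

```python
import hashlib


def dedupe_by_hash(documents):
    """Index documents by content hash and report duplicates.

    `documents` is an iterable of (name, content_bytes) pairs, a
    hypothetical stand-in for content read from tape. SHA-256 is an
    assumed choice of hash function.
    """
    index = {}       # content hash -> name of the first copy encountered
    duplicates = []  # (duplicate name, name of the copy already indexed)
    for name, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        if digest in index:
            duplicates.append((name, index[digest]))
        else:
            index[digest] = name
    return index, duplicates


# Illustrative data only: two tapes holding the same memo under different names.
sample = [
    ("tape01/memo.doc", b"Q3 pricing memo"),
    ("tape02/memo_copy.doc", b"Q3 pricing memo"),
    ("tape02/contract.pdf", b"Dealer agreement"),
]
index, duplicates = dedupe_by_hash(sample)
```

Because identical content always produces the same hash, the second copy of the memo is recognized as a duplicate even though it sits on a different tape under a different file name, so only one copy needs to be reviewed.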

From a longer-term strategic perspective, Index Engines’ users can approach their tape stores incrementally, taking a first pass through their tapes in response to a particular discovery request, then add to their global tape index as new discovery requests are fielded. They can embark upon a proactive tape indexing campaign that will give them enhanced early case assessment capabilities. Users may also opt to extract important data that is not immediately needed but resides on old or degraded tape.

For companies with thousands or tens of thousands of tapes, indexing can allow significant numbers of tapes to be discarded since many individual tapes typically contain data which is almost entirely repeated on other tapes or has lasted past the end of its retention period – not to mention the corrupted or blank tapes which are being carefully stored nonetheless.

All of this makes Index Engines an extremely affordable (at least by Enterprise standards) alternative to restoring and reviewing tapes individually.

I asked Jim McGann whether Index Engines resembled dentists who teach patients good dental hygiene and, if successful, will wind up putting themselves out of a job. If Index Engines’ appliances succeed in indexing, de-duplicating, and extracting all of the stored tape in existence, while ever more affordable hard drive storage replaces tape storage, won’t the company be out of a job?

Jim pointed out that, for certain organizations which currently rely on tape storage, substituting hard drives for tape drives is simply not a viable option. Costs associated with re-routing system data and human work flows, as well as the risk of downtime during a transition, mean that many organizations won’t switch even after disk drives become less expensive. And Index Engines takes away much of the cost incentive for switching that would otherwise be driven by eDiscovery and compliance requirements. Finally, Jim says, Index Engines can be used to index almost all of the information customers have, not just tape data, which enables users to find non-tape information that must be reviewed for eDiscovery.

The other approach to the problem of tape storage (to be explored in more detail in a future blog post) involves near-line hard drive solutions. Leading hard drive storage vendors such as Isilon Systems claim that their “near line” solution is priced nearly as low as tape while offering higher performance and reliability. But advocates of tape, including the Boston-area based Clipper Group (in a whitepaper offered on tape drive vendor SpectraLogic’s web site), claim that the total cost of ownership of disk storage, taking into account factors such as floor space requirements and electricity, is still many, many times higher than that of tape.

So, as tapes look like they will be around for some time to come, companies with tapes will continue to need technologies like Index Engines’. And most will not be able to avoid discovery of tapes for much longer, if they are even still able to do so, thanks in part to the availability of these technologies.

What is Discovery? – explaining eDiscovery to non-lawyers

I met with a group of software developers earlier this week to talk about configuring a visual analytics solution to provide useful insights for eDiscovery. To help them understand the overall process I wrote out a short description of key concepts in Discovery. I omitted legal jargon and described Discovery as a simple, repeatable process that would appeal to engineers. If anyone has enhancements to offer I’d be happy to extend this set further.

Discovery

Discovery is a process of information exchange that takes place during most lawsuits. The goal of discovery is to allow the lawyers to paint a picture that sheds light on what actually happened. Ideally court proceedings are like an academic argument over competing research papers that have been written as accurately and convincingly as possible. Each side tries to assemble well-documented citations to letters, emails, contracts, and other documents, with information about where they were found, who created them, why they were created, when, and how they were distributed.

Discovery requests

The dead tree version of a law library.

The discovery process is governed by a published body of laws and regulations. Under these rules after a lawsuit has begun each side has the ability to ask the other side to search carefully for all documents, including electronic ones, that might help the court decide the case. Each party can ask the other for documents, using written forms called Discovery Requests. BOTH sides will need documents to support their respective positions in the lawsuit.

Responsive

Documents are “responsive” when they fit the description of documents being sought under discovery requests. Each side has the responsibility of being specific, and not over-inclusive, in describing the documents it requests. Each party has the opportunity to challenge the other’s discovery requests as being over broad, and any dispute that cannot be resolved by negotiation will be resolved by the court. But once the time for raising challenges is over each side has the obligation to take extensive steps to search for, make copies of, and deliver all responsive, non-privileged documents to the other side.

Privilege

Certain document types are protected from disclosure by privilege. The most common are attorney-client privilege and the related work product privilege which in essence cover communications between lawyers and clients and in certain cases non-lawyers working for lawyers or preparing for lawsuits. When documents are responsive, but also protected by privilege, they are described on a list called a privilege log and the log is delivered to the other side instead of the documents themselves.

Authentication

A document ordinarily isn’t considered self-explanatory. Before it can be used in court it must be explained or “authenticated” by a person who has first-hand knowledge of where the document came from, who created it, why it was created, how it was stored, etc. Authentication is necessary to discourage fakery and to limit speculation about the meaning of documents. Documents which simply appear with no explanation of where they came from may be criticized and ultimately rejected if they can’t be properly identified by someone qualified to identify them. Thus document metadata — information about the origins of documents — is of critical importance for discovery. (However, under the elaborate Rules of Evidence that must be followed by lawyers, a wide variety of assumptions may be made, depending on circumstances surrounding the documents, which may allow documents to be used even if their origins are disputed.)

Document custodians

The term “custodian” can be applied to anyone whose work involves storing documents. The spoken or written statement (“testimony”) of a custodian may be required to authenticate and explain information that is in their custody. And when a legal action involves the actions and responsibilities of relatively few people (as most legal actions ultimately do), those people will be considered key custodians whose documents will be examined more thoroughly. Everyone with a hard drive can be considered a document custodian with respect to that drive, although system administrators would ordinarily be considered the custodians of a company-wide document system like a file server. Documents like purchase orders, medical records, repair logs and the like, which are usually and routinely created by an organization (sometimes called “documents kept in the ordinary course of business”) may be authenticated by a person who is knowledgeable about the processes by which such documents were ordinarily created and kept, and who can identify particular documents as having been retrieved from particular places.

Types of documents which are discoverable and may be responsive

Typically any form of information can be requested in discovery, although attorneys are only beginning to explore the boundaries of the possibilities here. In the old days only paper documents and memories were sought through discovery. (Note: of course, physical objects may also be requested, for example, in a lawsuit claiming a defect in an airplane engine, parts of the engine may be requested.) As of today requests frequently include databases, spreadsheets, word processing documents, emails, instant messages, voice mail and other recordings, web pages, images, metadata about documents, document backup tapes, erased but still recoverable documents, and everything else attorneys can think of that might help explain the circumstances on which the lawsuit is based.

Discovery workflow

Discovery can be time consuming and expensive. Lawyers work closely with IT, known document custodians, and others with knowledge of the events and people involved in the lawsuit. First they attempt to identify what responsive documents might exist, where they might be kept, and who may have created or may have control over the documents that might exist. Based on what is learned through this collaboration, assumptions are made and iteratively improved about what documents may exist and where they are likely to be found. Efforts must be taken to instruct those who may have potentially responsive documents to avoid erasing them before they are found (this is called “litigation hold”). Then efforts are taken to copy potentially responsive documents, with metadata intact, into a central repository in which batch operations can take place. In recent years online repositories that enable remote access have become very popular for this purpose. Within this repository lawyers and properly qualified personnel can sort documents into groups using various search and de-duplication methodologies, set aside documents which are highly unlikely to contain useful information, then prioritize and assign remaining documents to lawyers for manual review. Reviewing attorneys then sort documents into responsive and non-responsive and privileged and non-privileged groupings. Eventually responsive, non-privileged documents are listed, converted into image files (TIFFs), and delivered to the other side, sometimes alongside copies of the documents in their original formats.
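For the engineers in the audience, the final sorting step can be sketched in a few lines. This is only an illustration of the responsive/privileged groupings described above, not a description of any real review platform; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewedDocument:
    """A document after attorney review (hypothetical structure)."""
    name: str
    responsive: bool
    privileged: bool


def sort_for_production(docs):
    """Split reviewed documents into a production set and a privilege log.

    Responsive, non-privileged documents are produced to the other side.
    Responsive but privileged documents are only listed on the privilege log.
    Non-responsive documents appear in neither list.
    """
    produce = [d.name for d in docs if d.responsive and not d.privileged]
    privilege_log = [d.name for d in docs if d.responsive and d.privileged]
    return produce, privilege_log


# Illustrative data only.
reviewed = [
    ReviewedDocument("purchase_order.pdf", responsive=True, privileged=False),
    ReviewedDocument("memo_to_counsel.doc", responsive=True, privileged=True),
    ReviewedDocument("lunch_menu.txt", responsive=False, privileged=False),
]
produce, privilege_log = sort_for_production(reviewed)
```

In practice the hard part is the human judgment behind the two flags; once those calls are made, assembling the production set and the privilege log is mechanical.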

Early Case Assessment (also called Early Data Assessment)

Even before receiving a discovery request, and sometimes even before a lawsuit has been filed, document review can be started in order to plan legal strategy (like settlement), prevent document erasure (“litigation holds”), etc. This preliminary review is called “Early Case Assessment” (or “Early Data Assessment”).

UPDATE: I describe the sources and development of legal procedural rules for e-discovery in a later blog post, Catch-22 for e-discovery standards?

When employees leave, company information leaves with them

A good topic for a future blog post will be a review of the technology that might prevent this from happening. A recent study revealed:

“Of about 950 people who said they had lost or left their jobs during the last 12 months, nearly 60 percent admitted to taking confidential company information with them, including customer contact lists and other data that could potentially end up in the hands of a competitor for the employee’s next job stint.

….

“Most of the data takers (53 percent) said they downloaded the information onto a CD or DVD, while 42 percent put it on a USB drive and 38 percent sent it as attachments via e-mail….”

Symantec, which commissioned this study (and which through a string of acquisitions has become a major vendor in the information management realm), just happens to be one of a number of software vendors who provide DLP (“data loss/leak prevention/protection”) solutions that can inhibit this sort of thing.

Meanwhile, over at RIM, the makers of the BlackBerry, the CEO isn’t shy about admitting that they record ALL company calls on the theory that everything employees say on the job is the company’s intellectual property.

I’m not an advocate for “big brother” work environments because I think there can be a strong relationship between genuine trust and employee productivity and creativity. Nonetheless, I have to admit that employees who are convinced that they will be held accountable for what they do with company information will be more conscientious about how they handle it.

Yet another topic for a future post will be examining how important information is misplaced when employees shift to new projects, positions, or companies.