Tape Indexing Breathes Life Into Tape Storage

Here’s an observation that can be tagged “mixed blessings”: foot-dragging on the part of techno-lagging attorneys has shielded (and in some cases continues to shield) their clients from the full potential weight of eDiscovery requests. For example, even after years of discussion, the legal profession didn’t formally recognize the obligation to produce metadata in response to discovery requests before the Federal Rules of Civil Procedure amendments adopted at the end of 2006. More outrageously, some attorneys are still gaming the system to avoid eDiscovery altogether, as Magistrate Judge John M. Facciola (U.S. District Court, Washington D.C.) pointed out in his keynote presentation at LegalTech earlier this year.

Only a few years ago certain courts had ruled that data stored on tape could be considered “inaccessible” because it was so expensive to review, and thus data stored on tape did not always need to be reviewed when answering an eDiscovery request (for example, the Zubulake decisions). More recently, however, the legal profession has become aware of advances that make tapes faster and cheaper to review, such as technology for rapid disaster recovery.

What IT person doesn't look forward to working with historical data?

There are still a number of fine distinctions being made in this area of law, and the specific tape handling practices of different companies can render their tapes more or less “accessible.” (Ironically, companies that archive backup tapes indefinitely, which sounds like a safe practice, may be exposing themselves to a greater burden in eDiscovery, not to mention the extra cost of storing outdated tapes.) But broadly speaking, few companies storing information on tape can categorically rely on “inaccessibility” to rule out the risk of being required to review their tapes during eDiscovery anymore. For more about the law concerning inaccessibility, including California’s burden-shifting rules, I recommend this article by Winston & Strawn attorneys David M. Hickey and Veronica Harris.

Fortunately, two prongs of innovation are shrinking the issues surrounding eDiscovery and tape. The first prong, which happens to be the subject of this blog post, comes in the form of new tape indexing and document retrieval technology. The second prong, which involves substituting hard drives for tape, will be the subject of a future post.

To learn more about the current state of eDiscovery technology in the realm of tape, I recently spoke with Jim McGann, Vice President of Marketing at Index Engines. Index Engines’ solution comes in the form of an appliance (a hardware box pre-loaded with their software) that scans a broad variety of tapes and catalogs the content. The appliance indexes tape data and de-duplicates documents within the index using the hash values of the documents. At this point users can cull (selectively retrieve) potentially responsive documents from a batch of ingested tapes without first performing an expensive, resource-intensive full restoration of each tape. And because Index Engines can ingest all of the common tape storage formats, users don’t need to run or even possess the original software used to write to the tapes.
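To make the de-duplication idea concrete, here is a minimal Python sketch of a hash-keyed index. It is an illustration only, not Index Engines’ implementation: the choice of SHA-256, the `build_index` function, and the sample tape IDs, paths, and contents are all assumptions made up for this example.

```python
import hashlib

def build_index(documents):
    """Map each content hash to every (tape_id, path) location that holds it.

    `documents` is an iterable of (tape_id, path, content_bytes) tuples.
    SHA-256 is an assumed choice of hash for this sketch.
    """
    index = {}
    for tape_id, path, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        index.setdefault(digest, []).append((tape_id, path))
    return index

# Hypothetical sample data: the same email was backed up on two tapes.
documents = [
    ("tape-001", "/mail/a.msg", b"Q3 forecast attached"),
    ("tape-002", "/mail/a.msg", b"Q3 forecast attached"),
    ("tape-002", "/docs/b.txt", b"Lease agreement draft"),
]

index = build_index(documents)
unique_count = len(index)  # distinct documents to review
duplicate_count = sum(len(locations) - 1 for locations in index.values())
print(unique_count, "unique documents,", duplicate_count, "duplicate copies")
```

Because identical content always produces the same hash, each document needs to be reviewed only once, while the index still records every tape on which a copy lives.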

From a longer-term strategic perspective, Index Engines’ users can approach their tape stores incrementally, taking a first pass through their tapes in response to a particular discovery request, then adding to their global tape index as new discovery requests are fielded. They can also embark upon a proactive tape indexing campaign that will give them enhanced early case assessment capabilities. Users may also opt to extract important data that is not immediately needed but resides on old or degraded tape.

For companies with thousands or tens of thousands of tapes, indexing can allow significant numbers of tapes to be discarded since many individual tapes typically contain data which is almost entirely repeated on other tapes or has lasted past the end of its retention period – not to mention the corrupted or blank tapes which are being carefully stored nonetheless.

All of this makes Index Engines an extremely affordable (at least by Enterprise standards) alternative to restoring and reviewing tapes individually.

I asked Jim McGann whether Index Engines resembled dentists who teach patients good dental hygiene and, if successful, will wind up putting themselves out of a job. If Index Engines’ appliances succeed in indexing, de-duplicating, and extracting all of the stored tape in existence, while ever more affordable hard drive storage replaces tape storage, won’t the company be out of a job?

Jim pointed out that, for certain organizations which currently rely on tape storage, substituting hard drives for tape drives is simply not a viable option. Costs associated with re-routing system data and human work flows, as well as the risk of downtime during a transition, mean that many organizations won’t switch even after disk drives become less expensive. And Index Engines takes away much of the cost incentive for switching that would otherwise be driven by eDiscovery and compliance requirements. Finally, Jim says, Index Engines can be used to index almost all of the information customers have, not just tape data, which enables users to find non-tape information that must be reviewed for eDiscovery.

The other approach to the problem of tape storage (to be explored in more detail in a future blog post) involves near-line hard drive solutions. Leading hard drive storage vendors such as Isilon Systems claim that their “near line” solution is priced nearly as low as tape while offering higher performance and reliability. But advocates of tape, including the Boston-area Clipper Group (in a whitepaper offered on tape drive vendor SpectraLogic’s web site), claim that the total cost of ownership of disk storage, taking into account factors such as floor space requirements and electricity, is still many, many times higher than that of tape.

So, as tapes look like they will be around for some time to come, companies with tapes will continue to need technologies like Index Engines’. And most will not be able to avoid discovery of tapes for much longer, if they are even still able to do so, thanks in part to the availability of these technologies.


The Evolution of eDiscovery Analytics Models, Part II: A Conversation with Nicholas Croce

I recently had the pleasure of speaking with Nicholas Croce, President of Inference Data, a provider of innovative analytics and review software for eDiscovery, following the company’s recent webinar, De-Mystifying Analytics. During our conversation I discovered that Nick is double-qualified as a legal technology visionary. He not only founded Inference, but has been involved with legal technologies for more than 12 years. Particularly focused on the intersection of technology and the law, Nick was directly involved in setting the standards for technology in the courtroom through working personally with the Federal Judicial Center and the Administrative Office of the US Courts.

I asked to speak with Nick because I wanted to pin him down on what I imagined I heard him say (between the words he actually spoke) during the live webinar he presented in mid-March. The hour-long interview and conversation ranged in topic, but was very specific in terms of where Nick sees the eDiscovery market going.

Sure enough, during our conversation Nick confirmed and further explained that he and his team, which includes CEO Lou Andreozzi, the former LexisNexis NA (North American Legal Markets) Chief Executive Officer, have designed Inference with not one, but two models of advanced eDiscovery analytics and legal review in mind.

As total data volume explodes, choosing the right way to sift out responsive documents becomes urgent

Please read my previous blog post The Evolution of eDiscovery Analytics Models, Part I: Trusting Analytics if you haven’t already and want to understand more about the assertions I make in this blog post.

In a nutshell, Inference is designed not only to deliver the current model of eDiscovery software analytics, which I have dubbed “Software Queued Review,” but the next generation analytics model as well, which I am currently calling “Statistically Validated Automated Review” (Nick calls it “auto-coding”).

Bruce: In a webinar you presented recently you explained statistical validation of eDiscovery analytics and offered predictions concerning the evolution of the EDRM (“Electronic Discovery Reference Model”).

I have a few specific questions to ask, but in general what I’d like to cover is:

1) where does Inference fit within the eDiscovery ecosystem,
2) how you think statistically validated discovery will ultimately be used, and
3) how you think the left side of the EDRM diagram (which is where document identification, collection, and preservation are situated) is going to evolve?

Nick: To first give some perspective on the genesis of Inference, it’s important to understand the environment in which it was developed. Prior to founding Inference I was President of DOAR Litigation Consulting. When I started at DOAR in 1997, the company was really more of a hardware company than anything else. I was privileged to be involved in the conversion of courtroom technology from wooden benches to the efficient digital displays of evidence we see today. Within a few years we became the predominant provider of courtroom technology, and it was amazing to see the legal system change and directly benefit from the introduction of technology. As people saw the dramatic benefits, and started saying “how do we use it?” we created a consulting arm around eDiscovery which provided the insight to see that this same type of evolution was needed within the discovery process.

This began around 2004-2005 when we started to see an avalanche of ESI (“Electronically Stored Information”) coming, and George Socha became a much needed voice in the field of eDiscovery. As a businessman I was reading about what was happening, and asking questions, and it seemed black and white to me – it had become impossible to review everything because of the tremendous volume of ESI with existing technology. As a result I started developing new technology for it, to not only manage the discovery of large data collections, but to improve and bring a new level of sophistication to the entire legal discovery process.

Inference was developed to help clients intelligently mine and review data, organize case workflow and strategy, and streamline and accelerate review. It’s the total process. But, today I still have to fight “the short term fix mentality” – lawyers who just care about “how do I get through this stuff faster”, which is the approach of some other providers, and which also relates to the transition I see in the EDRM model – I want to see the whole thing change.

Review is the highest dollar amount, the biggest pain: 70% of a corporation’s legal costs are within eDiscovery. People want to, and need to, speed up review. However, we also need to add intelligence back into the process.

Bruce: What differentiates Inference, where does it fit in?

Nick: I, and Inference, went further than just accelerating linear review and said: it has to be dynamic, not just coding documents as responsive / non-responsive.  I know this is going to sound cheesy I guess, but – you have to put “discovery” back into Discovery. You need to be able to quickly find documents during a deposition when a deponent says something like “I never saw a document from Larry about our financial statements”, and not just search for “responsive: yes/no”, “privileged: yes/no.”

Inference was, and is, designed to be dynamic – providing suggestions to reviewers, opportunities to see relationships between documents and document sets not previously perceived, helping to guide attorneys – intuitively. Inference follows standard, accepted methodologies, including Boolean keyword search, field and parametric search, and incorporates all of the tools required for review – redaction, subjective coding, production, etc.

In addition to that overriding principle, we wanted the ability to get data in from anyone, anywhere and at any time. Regulators are requiring incredibly aggressive production timelines; serial litigants re-use the same data set over and over; CIOs are trying to get control over searching data more effectively, including video and audio. Inference is designed to take ownership of data once it leaves the corporation, whether it is structured, semi-structured or unstructured data.

Inside the firewall, the steps on the left side of the EDRM model are being combined. Autonomy, EMC, Clearwell, StoredIQ – the crawling technologies – these companies are within inches of extracting metadata during the crawling process, and may be there already. This is where Inference comes in, since we can ingest this data directly. I call it the disintermediation of processing because at that point there are no additional costs for processing.

In the past someone would use EnCase for preservation, then Applied Discovery for processing (using date ranges and Boolean search terms), and at some cost per custodian, and per drive, you’d then pay for processing. It used to be over $2,500 per gig, now it’s more like $600 to $1,500 per gig, depending on multi-language use and such.

But once corporations automate the process with crawling and indexing solutions, all of the information goes right into Inference without the intermediary steps, which puts intelligence back in the process. You can ask the system to guide you whenever there’s a particular case, or an issue. If I know the issue is a conversation between Jeff and Michele during a certain date range, I can prime the system with that information, start finding stuff, and start looking at settlement of the dispute. But without automation it can take months to do, at much higher costs.

Inference also offers quality control aspects not previously available: after, say, one month, you can use the software to check review quality, find rogue reviewers, and fix the process. You can also ultimately do auto-coding.

Bruce: I think this is a good opening for segue to the next question: how will analytics ultimately be used in eDiscovery?

Nick: The two most basic components of review are “responsive” and “privileged.” I learned from the public testimony of Verizon’s director of eDiscovery, Patrick Oot, some very strong statistics from a major action they were involved in. The first level document review expense was astounding even before the issues were identified. The total cost of responsive and privileged review was something like $13.6 million.

The truth is that companies only do so many things. Pharma companies aren’t generally talking about real-estate transactions or baseball contracts.

Which brings us to auto-coding… sometimes I try to avoid calling it “auto-coding” versus “computer aided” or “computer recommended” coding. When someone says “the computer did it” attorneys tend to shut down, but if someone says “the computer recommended it” then they pay attention.

Basically auto-coding is applying issue tags to the population based on a sampling of documents. The way we do it is very accurate because it is iterative. It uses statistically sound sampling, recurrent models. It uses the same technology as concept clustering, but you cluster a much smaller percentage. Let’s say you create 10 clusters, tag those, then have the computer tag other documents consistently with the same concepts. Essentially, the computer makes recommendations which are then confirmed by an attorney, and the cycle is repeated until the necessary accuracy level has been achieved. This enables you to only look at a small percentage of the total document population.
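For readers who want to see the shape of that recommend-and-confirm loop, here is a minimal Python sketch. It is not Inference’s algorithm: word-overlap (Jaccard) similarity stands in for concept clustering, the `attorney` callback stands in for a human reviewer, and the documents, tags, and function names are all hypothetical. The sample here is simply the first few pending documents so the example is repeatable; a real workflow would draw a random sample.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(words, tagged):
    """Recommend the tag of the most similar attorney-confirmed document."""
    best_tag, best_score = None, -1.0
    for tagged_words, tag in tagged:
        score = jaccard(words, tagged_words)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

def auto_code(docs, attorney, seed_ids, batch=2, target=1.0, max_rounds=10):
    """Iterate: recommend tags, have the attorney check a sample, repeat.

    Returns (labels for every document, number of documents the attorney read).
    """
    confirmed = {i: attorney(i) for i in seed_ids}
    for _ in range(max_rounds):
        tagged = [(docs[i], tag) for i, tag in confirmed.items()]
        pending = sorted(i for i in docs if i not in confirmed)
        sample = pending[:batch]  # deterministic stand-in for a random sample
        if not sample:
            break
        hits = 0
        for i in sample:
            truth = attorney(i)
            hits += int(truth == recommend(docs[i], tagged))
            confirmed[i] = truth  # confirmed tags feed the next round
        if hits / len(sample) >= target:
            break  # sampled accuracy is good enough; stop reviewing
    tagged = [(docs[i], tag) for i, tag in confirmed.items()]
    labels = dict(confirmed)
    for i in docs:
        if i not in labels:
            labels[i] = recommend(docs[i], tagged)
    return labels, len(confirmed)

# Hypothetical eight-document population with two issues.
docs = {
    1: {"drug", "trial", "dosage"},    2: {"lease", "rent", "tenant"},
    3: {"drug", "dosage", "label"},    4: {"tenant", "rent", "deposit"},
    5: {"trial", "drug", "placebo"},   6: {"lease", "tenant", "landlord"},
    7: {"dosage", "label", "placebo"}, 8: {"rent", "deposit", "landlord"},
}
truth = {i: ("clinical" if i % 2 else "real-estate") for i in docs}

labels, reviewed = auto_code(docs, attorney=truth.get, seed_ids=[1, 2])
print(reviewed, "of", len(docs), "documents reviewed by the attorney")
```

In this toy population the attorney reads four of the eight documents and the remaining four are tagged by recommendation, which is the point of the approach: human review effort goes into seeding and checking rather than reading everything.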

Bruce: I spoke with one of the statistical sampling gurus at Navigant Consulting last month, who suggests that software validated by statistical sampling can be more accurate than human reviewers, with fewer errors, for analyzing large quantities of documents.

Nick: It makes sense. Document review is very labor intensive and redundant. Think about the type of documents you’re tagging for issues – it doesn’t even need to be conscious: it is an extremely rote activity on many levels which just lends itself to human error.
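As a rough illustration of what “validated by statistical sampling” can mean in practice, the Python sketch below estimates coding accuracy from a checked sample and puts a 95% confidence interval around it. The Wilson score interval is one standard choice and is my assumption here, not something Nick or Navigant specified; the sample figures are invented.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score confidence interval for a proportion.

    correct: sampled documents whose coding a human checker agreed with
    n:       total sampled documents
    z:       normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical check: 400 randomly sampled documents, 380 coded correctly.
low, high = wilson_interval(correct=380, n=400)
print(f"Observed accuracy 95.0%, 95% interval {low:.1%} to {high:.1%}")
```

With these made-up numbers the interval runs from roughly 92% to 97%, so a party could state, with a defined level of confidence, how accurate the coding of the whole population is likely to be without re-reviewing every document.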

Bruce: So let’s talk about what needs to happen before auto-coding becomes accepted, and becomes the rule rather than the exception. In your webinar presentation you danced around this a bit, saying, in effect, that we’re waiting for the right alignment of law firms, cases, and a judge’s decision. In my experience as an attorney, including some background in civil rights cases, the way to go about this is by deliberately seeking out best-case-scenario disputes that will become “test cases.” A party who has done its homework stands up and insists on using statistically validated auto-coding in an influential court, here we probably want the DC Circuit, the Second Circuit, or the Ninth Circuit, I suppose. When those disputes result in a ruling on statistical validity, the law will change and everything else will follow. Do you know of any companies in a position to do this, to set up test cases, and have you discussed it with anyone?

Nick: Test cases: who is going to commit to this? The general counsel? Who do they have to convince? Their outside counsel, who, ultimately, has to be comfortable with the potential outcome. But lawyers are trained to mitigate risk, and for now they see auto-coding or statistical sampling as a risk. I am working with a couple of counsel with scientific and/or mathematical backgrounds who “get” Bayesian methods and the benefits of using them. Once the precedents are set, and establish the use of statistical analysis as reasonable, it will be a risk not to use these technologies. It is the same as with legal research: online research tools were initially considered a risk. Now, it can be considered malpractice not to use them.

It can be frustrating for technologists to wait, but that’s how it is. Sometimes when we are following up with new installs of Inference we find that 6 weeks later they’ve gone back to using simple search instead of the advanced analytics tools. But even for those people, after using the advanced features for a few months they finally discover they can no longer live without it.

Bruce: Would you care to offer a prediction as to when these precedents will be set?

Nick: I really believe it will happen, there’s no ambiguity.  I just don’t know if it’s 6 months or a year. But general counsel are taking a more active role, because of the cost of litigation, because of the economy, and looking at expenses more closely. At some point there will be that GC and an outside counsel combination that will make it happen.

Bruce: After hearing my statistician friend from Navigant deliver a presentation on statistical sampling at LegalTech last month I found myself wondering why parties requesting documents wouldn’t want to insist that statistically validated coding be used by parties producing documents for the simple reason that this improves accuracy. What do you think?

Nick: Requesting parties are never going to say “I trust you.”

Bruce: But like they do now, the parties will still have to be able to discuss and will be expected to reach agreement about the search methods being used, right?

Nick: You can agree to the rules, but the producing party can choose a strategy that will be used to manage their own workflow – for example today they can do it linearly, offshore, or using analytics. The requesting party will leave the burden on the producing party.

Bruce: If they are only concerned with jointly defining responsiveness, in order to get a better-culled set of documents — that helps both sides?

Nick: That would be down the road… at that point my vision gets very cloudy, maybe opposing counsel gets access to concept searches – and they can negotiate over the concepts to be produced.

Bruce: There are many approaches to eDiscovery analytics. Will there have to be separate precedents set for each mathematical method used by analytics vendors, or even for each vendor-provided analytical solution?

Nick: I’d love to have Inference be the first case. But I don’t know how important the specific algorithm or methodology is going to be – that is a judicial issue. Right now we’re waiting for the perfect judge and the perfect case – so I’ll hope it’s Inference, rather than “generic” as to which analytics are used. I hope there’s a vendor shakeout – for example, ontology based analytics systems demo nicely, but “raptor” renders “birds” which is non-responsive, while “Raptor” is a critical responsive term in the Enron case.

Bruce: Perhaps vendors and other major stakeholders in the use of analytics in eDiscovery, for example, the National Archives, should be tracking ongoing discovery disputes and be prepared to file amicus briefs when possible to help support the development of good precedents.

Nick: Perhaps they should.
