Because e-discovery is complex, and the penalties for screwing it up are significant, the following choice should be considered periodically by attorneys, clients and IT people involved in e-discovery: “Do we do this piece of the project with the people we have already, or do we add people to our payroll who do this, or do we bring in an outside partner to do this?” This is when the IT people reading this post will start muttering the cliché truism “Build or Buy?” which means choosing between “do it ourselves” and finding a pre-packaged solution.
In a generalized “leadership” or “management” frame of mind the basic choice is: “Do, Delegate, or Dump.” I am fond of characterizing this choice as the assignment of the locus of control for decision-making, where an important consideration is who will do the best job of making the decisions once given that responsibility.
“Do” = Must I make a particular set of decisions myself – are those decisions an essential part of my role in the organization, and am I the one with the right information and motivation to make these decisions?
“Delegate” = Can someone else do just as well, or perhaps a better job, with making this set of decisions, especially after making these decisions became an essential part of their role?
“Dump”: should we even be in the business of making these decisions at all, or can we just drop that issue off of our plates somehow?
For example, one can dump having a company picnic to save money. One can’t dump bookkeeping, however, even in a very small company. But even in a very small company a leader can usually delegate or outsource primary responsibility for bookkeeping and expect to get good results while focusing on core competencies of the business such as production, customer relationships, and motivating team members.
Ultimately the choice boils down to this: Do I want to possess and maintain expertise in making certain decisions, at a certain level of granularity, as a core competency? If yes, then I must make it a core competency, which means investing the time, attention, and education it takes to do it right. If no, then I should bring in someone else who has that core competency and who is invested in doing it right.
In e-discovery, answering the question of what can be outsourced — or where to place the locus of control for decision-making — gets even more interesting since courts hold attorneys personally responsible not only for delivering high-quality document production results but for understanding and directing the process by which results are achieved. So the question becomes: Will attorneys generate better document production results when they personally control more of the process (for example, by personally, hands on the keyboard, deriving and executing search methodology)? Or, will they generate better results by collaborating more with outsourced experts, directing and supervising but delegating more of the hands-on decisions?
More than a few attorneys reading this might find that the choice is not as cut and dried as they think. In my next post I’ll explore this choice by applying the core competency / locus of control standard to competing document review automation solutions from Inference Data and H5.
From a business perspective, information should be handled like property. Like assets or supplies, information needs management.
Companies set policies to govern use, storage, and disposal of assets and office supplies. Companies also need to make and enforce rules governing electronically stored information, including how it is organized (who has access and how), stored (where and at what cost), retained (including backups and archives), and destroyed (deletion and non-deletion both have significant legal and cost consequences). These policies must balance the business, legal and technical needs of the company. Without them, a company opens itself up to losses from missed opportunities, employee theft, lawsuits, and numerous other risks.
Some information is analogous to company ASSETS. For example, let’s suppose a certain sales proposal took someone a week to write and required approval and edits from four other people plus 6 hours of graphics production time. An accountant isn’t going to list that proposal on the company books. But it is an asset. It can be edited and resubmitted to other potential customers in a fraction of the time it took to create the original. Like the machinery, furniture, or hand tools used to operate a business, company money was spent obtaining this information and it will retain some value for some time. It should be managed like an asset.
Some information is analogous to OFFICE SUPPLIES. For example, a manager spends a number of hours customizing a laptop with email account settings, browser bookmarks and passwords, ribbon and plugin preferences, nested document folders, security settings, etc. That customization information is crucial for the manager’s productivity in much the same way as having pens in the drawer, paper in the printer, staples in the stapler, and water in the water-cooler can be important for productivity. Productivity will be lost if it is lost. That information needs to be managed just as much as office supplies need to be manged.
From a business perspective, when company information is lost or damaged, or when users are under or over supplied, it is no different from mismanagement of company assets and office supplies.
Setting an information policy means:
identifying information use and control needs;
making choices and tradeoffs about how to meet those needs; and
taking responsibility for results and an ongoing process (setting goals / taking action / measuring progress / adjusting).
Information governance policy is an on-going process for managing valuable company information. All of the stakeholders – in particular, business units, IT, and Legal – must collaborate in order to draw a bullseye on company information management needs. The right people in the organization must be charged with responsibility for getting results or for making changes needed to get results.
Without a doubt, it takes time and money, and requires collaboration, to develop a “policy.” But we’re all accustomed to this type of preparation already. Let’s look at simple, familiar professional standards for just a moment:
Software developers test software on actual users and correct bugs and (hopefully) mistaken assumptions before releasing it. Avoiding these steps will undoubtedly lead to loss and possibly bankruptcy.
Attorneys meet with clients before going to trial to prepare both the client and the attorneys. If they don’t they risk losing their clients millions, or getting them locked up.
Advance preparation is as essential in information management as it is in software development and trial practice. Simply ignoring the issue, or dumping it on one person or a single department (like IT or Legal) can be very costly. Avoiding the planning component of information management is like putting in only 80% of the time and effort needed for the company to succeed. Avoiding 20% of the time and effort doesn’t yield a “savings” when the outcome is failure, as when an employee steals documents, essential information is lost when a building with no computer back-ups burns down, or old documents which would have been deleted under a proper information governance policy turn up in a lawsuit and cost millions.
Information policy does NOT flow from any of the following all-to-common realities:
The first meeting between the Legal Department and the new eDiscovery vendor is also the first meeting between Legal and IT (true story);
Ever since a certain person from the General Counsel’s office was made the head of corporate records management, no one in IT will talk to that person (true story);
Information management technology alone, without a company-specific understanding of the problems that the technology is meant to solve, is not a recipe for success. A recent article by Carol Sliwa, published by SearchStorage.com (April 22, 2009), offers a detailed look at issues surrounding efforts to reduce storage costs by assessing how information is being used and moving it to the least expensive storage tier possible.
The article has some powerful suggestions on developing information policy. First, Karthik Kannan, vice president of marketing and business development at Kazeon Systems Inc.:
“What we discovered over time is that customers need to be able to take some action on the data, not just find it…. Nobody wants to do data classification just for the sake of it. It has to be coupled with a strong business reason.”
“In order to really realize and get the benefit of data and storage classification, you have to start with a business process…. And it has to start from conversations with the business units and understanding the needs and requirements of the business. Only at the end, once you actually have everything in place, should you be looking at technology because then you’ll have a better set of requirements for that technology.”
It takes time, money and cooperation between departments that may have never worked together before to develop a working information governance policy. But that is not a reason to skip — or skimp on — the process. Companies need to protect their assets and productivity, and information governance has become an essential area for doing just that.
I recently had the pleasure of speaking with Nicholas Croce, President of Inference Data, a provider of innovative analytics and review software for eDiscovery, following the company’s recent webinar, De-Mystifying Analytics. During our conversation I discovered that Nick is double-qualified as a legal technology visionary. He not only founded Inference, but has been involved with legal technologies for more than 12 years. Particularly focused on the intersection of technology and the law, Nick was directly involved in setting the standards for technology in the courtroom through working personally with the Federal Judicial Center and the Administrative Office of the US Courts.
I asked to speak with Nick because I wanted to pin him down on what I imagined I heard him say (between the words he actually spoke) during the live webinar he presented in mid-March. The hour-long interview and conversation ranged in topic, but was very specific in terms of where Nick sees the eDiscovery market going.
Sure enough, during our conversation Nick confirmed and further explained that he and his team, which includes CEO Lou Andreozzi, the former LexisNexis NA (North American Legal Markets) Chief Executive Officer, have designed Inference with not one, but two models of advanced eDiscovery analytics and legal review in mind.
In a nutshell, Inference is designed not only to deliver the current model of eDiscovery software analytics, which I have dubbed “Software Queued Review,” but the next generation analytics model as well, which I am currently calling “Statistically Validated Automated Review” (Nick calls it “auto-coding”).
I have a few specific questions to ask, but in general what I’d like to cover is:
1) where does Inference fit within the eDiscovery ecosystem,
2) how you think statistically validated discovery will ultimately be used, and
3) how you think the left side of the EDRM diagram (which is where document identification, collection, and preservation are situated) is going to evolve?
Nick: To first give some perspective on the genesis of Inference, it’s important to understand the environment in which it was developed. Prior to founding Inference I was President of DOAR Litigation Consulting. When I started at DOAR in 1997, the company was really more of a hardware company than anything else. I was privileged to be involved in the conversion of courtroom technology from wooden benches to the efficient digital displays of evidence we see today, Within a few years we became the predominant provider of courtroom technology, and it was amazing to see the legal system change and directly benefit from the introduction of technology. As people saw the dramatic benefits, and started saying “how do we use it?” we created a consulting arm around eDiscovery which provided the insight to see that this same type of evolution was needed within the discovery process.
This began around 2004-2005 when we started to see an avalanche of ESI (“Electronically Stored Information”) coming, and George Socha became a much needed voice in the field of eDiscovery. As a businessman I was reading about what was happening, and asking questions, and it seemed black and white to me – it had become impossible to review everything because of the tremendous volume of ESI with existing technology. As a result I started developing new technology for it, to not only manage the discovery of large data collections, but to improve and bring a new level of sophistication to the entire legal discovery process.
Inference was developed to help clients intelligently mine and review data, organize case workflow and strategy, and streamline and accelerate review. It’s the total process. But, today I still have to fight “the short term fix mentality” – lawyers who just care about “how do I get through this stuff faster”, which is the approach of some other providers, and which also relates to the transition I see in the EDRM model – I want to see the whole thing change.
Review is the highest dollar amount, the biggest pain, 70% of a corporation’s legal costs are within eDiscovery. People want to, and need to, speed up review. However, we also need to add intelligence back into the process.
Bruce: What differentiates Inference, where does it fit in?
Nick: I, and Inference, went further than just accelerating linear review and said: it has to be dynamic, not just coding documents as responsive / non-responsive. I know this is going to sound cheesy I guess, but – you have to put “discovery” back into Discovery. You need to be able to quickly find documents during a deposition when a deponent says something like “I never saw a document from Larry about our financial statements”, and not just search for “responsive: yes/no”, “privileged: yes/no.”
Inference was, and is, designed to be dynamic – providing suggestions to reviewers, opportunities to see relationships between documents and document sets not previously perceived, helping to guide attorneys – intuitively. Inference follows standard, accepted methodologies, including Boolean keyword search, field and parametric search, and incorporates all of the tools required for review – redaction, subjective coding, production, etc.
In addition to that overriding principle, we wanted the ability to get data in from anyone, anywhere and at any time. Regulators are requiring incredibly aggressive production timelines; serial litigants re-use the same data set over and over; CIOs are trying to get control over searching data more effectively, including video and audio. Inference is designed to take ownership of data once it leaves the corporation, whether it is structured, semi-structured or unstructured data.
Inside the firewall, the steps on the left side of the EDRM model are being combined. Autonomy, EMC, Clearwell, StoredIQ — the crawling technologies – these companies are within inches of extracting metadata during the crawling process, and may be there already. This is where Inference comes in since we can ingest this data directly. I call it the disintermediation of processing because at that point there is no more additional costs for processing.
In the past someone would use EnCase for preservation, then Applied Discovery for processing (using date ranges and Boolean search terms), and at some cost per custodian, and per drive, you’d then pay for processing. It used to be over $2,500 per gig, now it’s more like $600 to $1500 per gig, depending on multi-language use and such.
But once corporations automate the process with crawling and indexing solutions, all of the information goes right into Inference without the intermediary steps, which puts intelligence back in the process. You can ask the system to guide you whenever there’s a particular case, or an issue. If I know the issue is a conversation between Jeff and Michele during a certain date range, I can prime the system with that information, start finding stuff, and start looking at settlement of the dispute. But without automation it can take months to do, at much higher costs.
Inference also offers quality control aspects not previously available: after, say, one month, you can use the software to check review quality, find rogue reviewers, and fix the process. You can also ultimately do auto-coding.
Bruce: I think this is a good opening for segue to the next question: how will analytics ultimately be used in eDiscovery?
Nick: The two most basic components of review are “responsive” and “privileged.” I learned from the public testimony of Verizon’s director of eDiscovery, Patrick Oot, some very strong statistics from a major action they were involved in. The first level document review expense was astounding even before the issues were identified. The total cost of responsive and privileged review was something like $13.6 million.
The truth is that companies only do so many things. Pharma companies aren’t generally talking about real-estate transactions or baseball contracts.
Which brings us to auto-coding… sometimes I try to avoid calling it “auto-coding” versus “computer aided” or “computer recommended” coding. When someone says “the computer did it” attorneys tend to shut down, but if someone says “the computer recommended it” then they pay attention.
Basically auto-coding is applying issue tags to the population based on a sampling of documents. The way we do it is very accurate because it is iterative. It uses statistically sound sampling, recurrent models. It uses the same technology as concept clustering, but you cluster a much smaller percentage. Let’s say you create 10 clusters, tag those, then have the computer tag other documents consistently with the same concepts. Essentially, the computer makes recommendations which are then confirmed by attorney, and then repeated until the necessary accuracy level has been achieved. This enables you to only look at a small percentage of the total document population.
Bruce: I spoke with one of the statistical sampling gurus at Navigant Consulting last month, who suggests that software validated by statistical sampling can be more accurate than human reviewers, with fewer errors, for analyzing large quantities of documents.
Nick: It makes sense. Document review is very labor intensive and redundant. Think about the type of documents you’re tagging for issues – it doesn’t even need to be conscious: it is an extremely rote activity on many levels which just lends itself to human error.
Bruce: So let’s talk about what needs to happen before auto-coding becomes accepted, and becomes the rule rather than the exception. In your webinar presentation you danced around this a bit, saying, in effect, that we’re waiting for the right alignment of law firms, cases, and a judge’s decision. In my experience as an attorney, including some background in civil rights cases, the way to go about this is by deliberately seeking out best-case-scenario disputes that will become “test cases.” A party who has done its homework stands up and insists on using statistically validated auto-coding in an influential court, here we probably want the DC Circuit, the Second Circuit, or the Ninth Circuit, I suppose. When those disputes result in a ruling on statistical validity, the law will change and everything else will follow. Do you know of any companies in a position to do this, to set up test cases, and have you discussed it with anyone?
Nick: Test cases: who is going to commit to this — the general counsel? Who do they have to convince? Their outside counsel, who, ultimately, has to be comfortable with the potential outcome. But lawyers are trained to mitigate risk, and for now they see auto-coding or statistical sampling as a risk. I am working with a couple of counsel with scientific and/or mathematic backgrounds who “get” Bayesian methods- and the benefits of using them. Once the precedents are set, and determine the use of statistical analysis as reasonable, it will be a risk not to use these technologies. As with legal research, online research tools were initially considered a risk. Now, it can be considered malpractice not to use them.
It can be frustrating for technologists to wait, but that’s how it is. Sometimes when we are following up with new installs of Inference we find that 6 weeks later they’ve gone back to using simple search instead of the advanced analytics tools. But even for those people, after using the advanced features for a few months they finally discover they can no longer live without it.
Bruce: Would you care to offer a prediction as to when these precedents will be set?
Nick: I really believe it will happen, there’s no ambiguity. I just don’t know if it’s 6 months or a year. But general counsel are taking a more active role, because of the cost of litigation, because of the economy, and looking at expenses more closely. At some point there will be that GC and an outside counsel combination that will make it happen.
Bruce: After hearing my statistician friend from Navigant deliver a presentation on statistical sampling at LegalTech last month I found myself wondering why parties requesting documents wouldn’t want to insist that statistically validated coding be used by parties producing documents for the simple reason that this improves accuracy. What do you think?
Nick: Requesting parties are never going to say “I trust you.”
Bruce: But like they do now, the parties will still have to be able to discuss and will be expected to reach agreement about the search methods being used, right?
Nick: You can agree to the rules, but the producing party can choose a strategy that will be used to manage their own workflow – for example today they can do it linearly, offshore, or using analytics. The requesting party will leave the burden on the producing party.
Bruce: If they are only concerned with jointly defining responsiveness, in order to get a better-culled set of documents — that helps both sides?
Nick: That would be down the road… at that point my vision gets very cloudy, maybe opposing counsel gets access to concept searches – and they can negotiate over the concepts to be produced.
Bruce: There are many approaches to eDiscovery analytics. Will there have to be separate precedents set for each mathematical method used by analytics vendors, or even for each vendor-provided analytical solution?
Nick: I’d love to have Inference be the first case. But I don’t know how important the specific algorithm or methodology is going to be – that is a judicial issue. Right now we’re waiting for the perfect judge and the perfect case – so I’ll hope it’s Inference, rather then “generic” as to which analytics are used. I hope there’s a vendor shakeout – for example, ontology based analytics systems demo nicely, but “raptor” renders “birds” which is non-responsive, while “Raptor” is a critical responsive term in the Enron case.
Bruce: Perhaps vendors and other major stakeholders in the use of analytics in eDiscovery, for example, the National Archives, should be tracking ongoing discovery disputes and be prepared to file amicus briefs when possible to help support the development of good precedents.
I met with a group of software developers earlier this week to talk about configuring a visual analytics solution to provide useful insights for eDiscovery. To help them understand the overall process I wrote out a short description of key concepts in Discovery. I omitted legal jargon and described Discovery as a simple, repeatable process that would appeal to engineers. If anyone has enhancements to offer I’d be happy to extend this set further.
Discovery is a process of information exchange that takes place during most lawsuits. The goal of discovery is to allow the lawyers to paint a picture that sheds lights on what actually happened. Ideally court proceedings are like an academic argument over competing research papers that have been written as accurately and convincingly as possible. Each side tries to assemble well-documented citations to letters, emails, contracts, and other documents, with information about where they were found, who created them, why they were created, when, and how they were distributed.
The discovery process is governed by a published body of laws and regulations. Under these rules after a lawsuit has begun each side has the ability to ask the other side to search carefully for all documents, including electronic ones, that might help the court decide the case. Each party can ask the other for documents, using written forms called Discovery Requests. BOTH sides will need documents to support their respective positions in the lawsuit.
Documents are “responsive” when they fit the description of documents being sought under discovery requests. Each side has the responsibility of being specific, and not over-inclusive, in describing the documents it requests. Each party has the opportunity to challenge the other’s discovery requests as being over broad, and any dispute that cannot be resolved by negotiation will be resolved by the court. But once the time for raising challenges is over each side has the obligation to take extensive steps to search for, make copies of, and deliver all responsive, non-privileged documents to the other side.
Certain document types are protected from disclosure by privilege. The most common are attorney-client privilege and the related work product privilege which in essence cover communications between lawyers and clients and in certain cases non-lawyers working for lawyers or preparing for lawsuits. When documents are responsive, but also protected by privilege, they are described on a list called a privilege log and the log is delivered to the other side instead of the documents themselves.
A document ordinarily isn’t considered self-explanatory. Before it can be used in court it must be explained or “authenticated” by a person who has first-hand knowledge of where the document came from, who created it, why it was created, how it was stored, etc. Authentication is necessary to discourage fakery and to limit speculation about the meaning of documents. Documents which simply appear with no explanation of where they came from may be criticized and ultimately rejected if they can’t be properly identified by someone qualified to identify them. Thus document metadata — information about the origins of documents — is of critical importance for discovery. (However, under the elaborate Rules of Evidence that must be followed by lawyers, a wide variety of assumptions may be made, depending on circumstances surrounding the documents, which may allow documents to be used even if their origins are disputed.)
The term “custodian” can be applied to anyone whose work involves storing documents. The spoken or written statement (“testimony”) of a custodian may be required to authenticate and explain information that is in their custody. And when a legal action involves the actions and responsibilities of relatively few people (as most legal actions ultimately do), those people will be considered key custodians whose documents will be examined more thoroughly. Everyone with a hard drive can be considered a document custodian with respect to that drive, although system administrators would ordinarily be considered the custodians of a company-wide document system like a file server. Documents like purchase orders, medical records, repair logs and the like, which are usually and routinely created by an organization (sometimes called “documents kept in the ordinary course of business”) may be authenticated by a person who is knowledgeable about the processes by which such documents were ordinarily created and kept, and who can identify particular documents as having been retrieved from particular places.
Types of documents which are discoverable and may be responsive
Typically any form of information can be requested in discovery, although attorneys are only beginning to explore the boundaries of the possibilities here. In the old days only paper documents and memories were sought through discovery. (Note: of course, physical objects may also be requested, for example, in a lawsuit claiming a defect in an airplane engine, parts of the engine may requested.) As of today requests frequently include databases, spreadsheets, word processing documents, emails, instant messages, voice mail and other recordings, web pages, images, metadata about documents, document backup tapes, erased but still recoverable documents, and everything else attorneys can think of that might help explain the circumstances on which the lawsuit is based.
Discovery can be time consuming and expensive. Lawyers work closely with IT, known document custodians, and others with knowledge of the events and people involved in the lawsuit. First they attempt to identify what responsive documents might exist, where they might be kept, and who may have created or may have control over the documents that might exist. Based on what is learned through this collaboration, assumptions are made and iteratively improved about what documents may exist and where they are likely to be found. Efforts must be taken to instruct those who may have potentially responsive documents to avoid erasing them before they are found (this is called “litigation hold”). Then efforts are taken to copy potentially responsive documents, with metadata intact, into a central repository in which batch operations can take place. In recent years online repositories that enable remote access have become very popular for this purpose. Within this repository lawyers and properly qualified personnel can sort documents into groups using various search and de-duplication methodologies, set aside documents which are highly unlikely to contain useful information, then prioritize and assign remaining documents to lawyers for manual review. Reviewing attorneys then sort documents into responsive and non-responsive and privileged and non-privileged groupings. Eventually responsive, non-privileged documents are listed, converted into image files (TIFFs), and delivered to the other side, sometimes alongside copies of the documents in their original formats.
Early Case Assessment (also called Early Data Assessment)
Even before receiving a discovery request, and sometimes even before a lawsuit has been filed, document review can be started in order to plan legal strategy (like settlement), prevent document erasure (“litigation holds”), etc. This preliminary review is called “Early Case Assessment” (or “Early Data Assessment”).