What is Discovery? – explaining eDiscovery to non-lawyers

I met with a group of software developers earlier this week to talk about configuring a visual analytics solution to provide useful insights for eDiscovery. To help them understand the overall process I wrote out a short description of key concepts in Discovery. I omitted legal jargon and described Discovery as a simple, repeatable process that would appeal to engineers. If anyone has enhancements to offer I’d be happy to extend this set further.


Discovery is a process of information exchange that takes place during most lawsuits. The goal of discovery is to allow the lawyers to paint a picture that sheds light on what actually happened. Ideally, court proceedings are like an academic argument over competing research papers that have been written as accurately and convincingly as possible. Each side tries to assemble well-documented citations to letters, emails, contracts, and other documents, with information about where they were found, who created them, why they were created, when, and how they were distributed.

Discovery requests

The dead tree version of a law library.

The discovery process is governed by a published body of laws and regulations. Under these rules, after a lawsuit has begun, each side can ask the other to search carefully for all documents, including electronic ones, that might help the court decide the case. Each party requests documents from the other using written forms called Discovery Requests. BOTH sides will need documents to support their respective positions in the lawsuit.


Documents are “responsive” when they fit the description of documents being sought under discovery requests. Each side has the responsibility of being specific, and not over-inclusive, in describing the documents it requests. Each party has the opportunity to challenge the other’s discovery requests as overbroad, and any dispute that cannot be resolved by negotiation will be decided by the court. But once the time for raising challenges is over, each side has the obligation to take extensive steps to search for, make copies of, and deliver all responsive, non-privileged documents to the other side.


Certain document types are protected from disclosure by privilege. The most common are attorney-client privilege and the related work product privilege, which in essence cover communications between lawyers and clients, and in certain cases non-lawyers working for lawyers or preparing for lawsuits. When documents are responsive but also protected by privilege, they are described on a list called a privilege log, and the log is delivered to the other side instead of the documents themselves.


A document ordinarily isn’t considered self-explanatory. Before it can be used in court it must be explained or “authenticated” by a person who has first-hand knowledge of where the document came from, who created it, why it was created, how it was stored, etc. Authentication is necessary to discourage fakery and to limit speculation about the meaning of documents. Documents which simply appear with no explanation of where they came from may be criticized and ultimately rejected if they can’t be properly identified by someone qualified to identify them. Thus document metadata — information about the origins of documents — is of critical importance for discovery. (However, under the elaborate Rules of Evidence that must be followed by lawyers, a wide variety of assumptions may be made, depending on circumstances surrounding the documents, which may allow documents to be used even if their origins are disputed.)

Document custodians

The term “custodian” can be applied to anyone whose work involves storing documents. The spoken or written statement (“testimony”) of a custodian may be required to authenticate and explain information that is in their custody. And when a legal action involves the actions and responsibilities of relatively few people (as most legal actions ultimately do), those people will be considered key custodians whose documents will be examined more thoroughly. Everyone with a hard drive can be considered a document custodian with respect to that drive, although system administrators would ordinarily be considered the custodians of a company-wide document system like a file server. Documents like purchase orders, medical records, repair logs and the like, which are usually and routinely created by an organization (sometimes called “documents kept in the ordinary course of business”) may be authenticated by a person who is knowledgeable about the processes by which such documents were ordinarily created and kept, and who can identify particular documents as having been retrieved from particular places.

Types of documents which are discoverable and may be responsive

Typically any form of information can be requested in discovery, although attorneys are only beginning to explore the boundaries of the possibilities here. In the old days only paper documents and memories were sought through discovery. (Note: of course, physical objects may also be requested; for example, in a lawsuit claiming a defect in an airplane engine, parts of the engine may be requested.) Today requests frequently include databases, spreadsheets, word processing documents, emails, instant messages, voice mail and other recordings, web pages, images, metadata about documents, document backup tapes, erased but still recoverable documents, and everything else attorneys can think of that might help explain the circumstances on which the lawsuit is based.

Discovery workflow

Discovery can be time consuming and expensive. Lawyers work closely with IT staff, known document custodians, and others with knowledge of the events and people involved in the lawsuit. First they attempt to identify what responsive documents might exist, where they might be kept, and who may have created or may control them. Based on what is learned through this collaboration, assumptions about what documents may exist and where they are likely to be found are made and iteratively improved. Those who may hold potentially responsive documents must be instructed not to erase them before they are found (this is called a “litigation hold”). Potentially responsive documents are then copied, with metadata intact, into a central repository in which batch operations can take place; in recent years online repositories that enable remote access have become very popular for this purpose. Within this repository lawyers and properly qualified personnel can sort documents into groups using various search and de-duplication methodologies, set aside documents which are highly unlikely to contain useful information, then prioritize and assign the remaining documents to lawyers for manual review. Reviewing attorneys then sort documents into responsive and non-responsive, and privileged and non-privileged, groupings. Eventually responsive, non-privileged documents are listed, converted into image files (TIFFs), and delivered to the other side, sometimes alongside copies of the documents in their original formats.
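For the engineers in the room, the copy-and-de-duplicate step can be sketched in a few lines. This is a hypothetical, minimal illustration (the custodians, file names, and contents are invented), not how any particular eDiscovery platform actually works:

```python
import hashlib

# Hypothetical in-memory "repository": (custodian, filename, content) tuples
# collected from several custodians during discovery.
collected = [
    ("alice", "contract_v1.txt", b"Agreement between Acme and Widgets Inc."),
    ("bob", "contract_copy.txt", b"Agreement between Acme and Widgets Inc."),
    ("alice", "notes.txt", b"Call with counsel re: engine defect claim."),
]

def deduplicate(docs):
    """Exact de-duplication by content hash: keep the first copy seen,
    and record which other custodians also held an identical copy."""
    seen = {}
    for custodian, name, content in docs:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            seen[digest]["also_held_by"].append(custodian)
        else:
            seen[digest] = {
                "name": name,
                "custodian": custodian,
                "also_held_by": [],
            }
    return list(seen.values())

unique_docs = deduplicate(collected)
print(len(unique_docs))  # 2 unique documents remain from 3 collected copies
```

Note that this only removes byte-for-byte duplicates; real review platforms also track which custodians held each copy (as above), because "who had it" is itself evidence.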

Early Case Assessment (also called Early Data Assessment)

Even before receiving a discovery request, and sometimes even before a lawsuit has been filed, document review can be started in order to plan legal strategy (like settlement), prevent document erasure (“litigation holds”), etc. This preliminary review is called “Early Case Assessment” (or “Early Data Assessment”).

UPDATE: I describe the sources and development of legal procedural rules for e-discovery in a later blog post, Catch-22 for e-discovery standards?

Search Classification and Analysis Tools for Information Governance

One of the most interesting technical issues being discussed at LegalTech last week was the question of how to classify, analyze, and review “unstructured” information like the content of emails, text documents, and presentations.

A familiar, simple-sounding answer leaps immediately to mind. Why not just hook all of these documents up to a search engine “crawler” and index all of the words in all the documents? Then run ordinary keyword searches on the whole set. It’s exactly like conducting Google searches, except instead of spanning a big chunk of the entire internet we only have to cover a few terabytes of corporate information – right?
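For the curious, that naive approach really is only a few lines of code: an inverted index mapping each word to the documents containing it. This is a toy sketch with made-up documents, just to make the idea concrete:

```python
from collections import defaultdict

# Toy corpus: document id -> text (real corpora are terabytes, of course).
documents = {
    1: "purchase order for engine parts",
    2: "email re engine defect and repair log",
    3: "quarterly budget spreadsheet",
}

# Build an inverted index: word -> set of document ids containing that word.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(*keywords):
    """Return ids of documents containing ALL of the given keywords."""
    result_sets = [index.get(w.lower(), set()) for w in keywords]
    return set.intersection(*result_sets) if result_sets else set()

print(sorted(search("engine")))            # [1, 2]
print(sorted(search("engine", "defect")))  # [2]
```

The mechanics are trivial; as the rest of this post argues, the hard part is everything the mechanics leave out.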

The bad news is that there are a number of wrinkles in the territory surrounding eDiscovery that render a pure Google-style search model less than perfect. The good news is that a variety of vendors offer well-conceived solutions meant to take these wrinkles into account. The remainder of this post will introduce some of the wrinkles; later posts will cover vendor solutions.

What’s different about eDiscovery from a Search perspective?

In an eDiscovery context the ideal for a classification and search solution is to allow searchers to identify ALL documents which meet their criteria, not just “the ten most relevant documents” or “at least one document that answers my question,” as is common with Google searches. Imagine running a Google search which returned 20,000 responses. You think this number is too big (it’s overinclusive), but you don’t want to risk missing any relevant documents. Then imagine getting the bill for a team of attorneys, at rates in excess of $100/hour, who read all of those documents in order to determine whether, in addition to containing the keywords you selected, they are actually relevant to the particular lawsuit at hand.
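In information-retrieval terms, this is the tension between precision (how much of what you read mattered) and recall (how much of what mattered you found). A back-of-the-envelope calculation, using the 20,000-document scenario above plus an assumed count of missed documents, shows why the eDiscovery emphasis on finding ALL documents is so expensive:

```python
# Hypothetical review outcome: a keyword search returned 20,000 documents.
# Assume attorneys judged 1,500 of them actually relevant, and that 200
# relevant documents elsewhere in the collection were missed by the search.
returned = 20_000
relevant_returned = 1_500
relevant_missed = 200

precision = relevant_returned / returned
recall = relevant_returned / (relevant_returned + relevant_missed)

print(f"precision: {precision:.1%}")  # 7.5%  -> attorneys read mostly noise
print(f"recall:    {recall:.1%}")     # 88.2% -> and still missed documents
```

A Google-style engine is tuned to put a few good answers at the top of the list (precision at rank 10); eDiscovery needs high recall across the whole set, which is a very different optimization target.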

Another common instance of overinclusiveness arises because unstructured information repositories such as email accounts frequently contain multiple versions of the same chunks of content. Many documents repeat some content from earlier versions and add some new content. For example, when emails are replied to, forwarded, or sent to multiple recipients, content already in the information pool is duplicated, and new information (email headers or comments) is added. Using conventional search, all versions of every document will fall within the search results and must be manually reviewed, at great cost, to determine what is important and what is merely redundant.
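One common way vendors attack this is near de-duplication: break each document into overlapping word “shingles” and compare the sets. Here is a hypothetical sketch (the sample emails and the 0.4 similarity threshold are made up for illustration):

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

original = "Please review the attached contract before Friday."
forwarded = ("FW: from Bob -- Please review the attached contract "
             "before Friday. Thanks!")
unrelated = "Quarterly sales figures are attached for your review."

sim_fwd = jaccard(shingles(original), shingles(forwarded))
sim_other = jaccard(shingles(original), shingles(unrelated))
print(sim_fwd > 0.4 > sim_other)  # the forwarded copy is a near-duplicate
```

Documents scoring above the threshold can be grouped so a reviewer reads the thread once, rather than once per copy.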

Another potential problem involves choosing keywords correctly. One could easily choose keywords that are logically related to the topic at hand and return a large number of relevant documents, but that miss many documents, or the most important ones, in the document pool (the search results are underinclusive). What if, as in the Enron case, “code” words were adopted by the perpetrators of a scam in an effort to cover their tracks? What if some of the documents are written in a language the searchers don’t speak, or use words or terms unfamiliar to them?

Solutions to these problems that various vendors have devised include semantic clustering, multivariate analysis of word positioning and frequency, keywords plus associative groupings, near de-duplication processes, and more. Each comes with strengths and weaknesses, of course, to be discussed in future posts.

LegalTech NY 2009 in review

I’m back from New York. And as of this morning I’m nearly caught up after wading through a flurry of activity surrounding LegalTech NY 2009. Personally I had a great trip because I made some wonderful new friends, discovered a number of new and relevant companies and their technology, and just plain enjoyed the NYC scene. Results for others may have varied — details below.

The Good, The Bad, and the Not So Aesthetically Pleasing

  • A keynote at LegalTech NY 2009

    Most of the presenters were reasonably well prepared and open to discussing any and all issues with members of the audience. (One panel I attended stood out for running out of material halfway through their time slot, but that was the exception rather than the rule.) The best presenter by far was DC-based Federal Magistrate Judge John Facciola. He was simply an exceptional speaker — I couldn’t multi-task on my laptop or iPhone (who needs wifi when there’s 3G, baby!) while he was talking for fear of missing the subtleties of his facial expressions and jokes. He made an important point about the future of the legal profession: competency in information technology is now crucial for trial attorneys and those managing discovery at any level. He told a number of vivid, relevant stories, but the one that really stood out for me was about a case which appeared before him involving two technology companies. Despite the obvious need for it, there was no eDiscovery in this case because counsel on both sides were uncomfortable with it and thus were willing to commit malpractice by overlooking it. Crazy, but indicative of a long-standing resistance to technology Judge Facciola regularly encounters with “old-school” lawyers.

  • Both presentations and exhibitors at the conference were primarily oriented around selling eDiscovery software and services. I expected this, and in fact this was a good arrangement for someone in my line of business. However, it bears noting that law firms are only part of the legal technology equation, and for a variety of reasons corporate legal departments are more influential users when it comes to legal tech. It would have been nice also to see a broader array of legal tech offerings, for example the CLM (contract lifecycle management) space was underrepresented.
  • Most of the exhibitors used admirable restraint in resisting the temptation to promote their particular solution while on stage. However I can’t resist the temptation to point out that one panelist, who coincidentally is employed by the very vendor that was sponsoring his panel (vendor name suppressed to avoid unjustly promoting them), must have exhorted the audience 20 times in 45 minutes with the imperative that everyone must have / needs / faces dire consequences unless they adopt “an integrated solution” to eDiscovery document collection and review — which is what his company was selling.
  • The conference infrastructure was adequate but not overwhelmingly impressive by Silicon Valley standards. Seating was particularly crowded in keynotes; free food was nearly non-existent (surprising, given that sponsored buffets present a high-visibility promotional opportunity for vendors); coffee was hard to find. But the lack of free wifi was an ongoing source of conversation amongst “wired” attendees. Apparently wifi was available for a fee if one found a certain booth somewhere on the premises. (I used 3G instead; one fellow I met actually picked up a wireless broadband card at a mobile telco store across the street to avoid paying for internet access at the event.) I understand it’s midtown Manhattan, so the ordinary hotel lobby bar (garnering a steady 2 out of 5 rating at Yelp) is selling $10 glasses of beer. And I appreciate a comprehensive revenue model as much as or more than the next fellow. But this branded the conference as not quite having its finger on today’s digital pulse, at least not by comparison to what we are accustomed to out on the left coast.
  • Kudos to the Metropolitan Transportation Authority (MTA) and the Port Authority for New York’s marvelous, amazingly effective transit system. It’s far from a free ride, but incredibly practical. Of special note is the relatively new AirTrain connecting JFK airport to the Long Island Rail Road, which continues on to midtown (Penn Station) only a short distance from the conference site. The transfer point was staffed with bizarrely friendly Port Authority employees helping poor lost tourists trying to figure out how to purchase ride cards and use them to activate turnstiles — very entertaining and even touching, considering how big and bad NYC’s reputation (still) is.
  • Special thanks to my old buddy Kenneth Adams for inviting me to stay with his family in their beautiful home in Garden City, Long Island (also a short walk from an L.I.R.R. station). Ken is not only the undisputed master of contract drafting, but also cooks a mean Neapolitan-style pizza.