e-Discovery document review: what should counsel outsource?

Earlier this week I blogged about placing the locus of control for e-discovery decisions in the right hands to ensure that the decisions made pass muster in court. To illustrate the potential impact of moving the locus of control for certain decisions to an outsourcing partner, let’s compare the document review solutions offered by H5 and Inference Data.

Gold standard counsel or expert linguists: who should take the lead? Photo credit: jeffisageek’s photostream

Both H5 and Inference enable users to improve results and potentially save vast amounts of money by teaching sophisticated software how to do document review faster and more accurately than human reviewers can. And the more the review process can be reliably automated, the more money is saved down the road, because the amount of manual review is reduced. All of this assumes the software is trained correctly, of course, which frames a locus of control question: who is best at training the software?

Last month I attended a webinar presented by H5. One thing that struck me as distinctive about H5 is their standard deployment of a team of linguists to improve detection of responsive documents from among the thousands or millions of documents in a document review. During the webinar I submitted a question asking what it is their linguists do that attorneys can’t do themselves. One of their people was kind enough to answer, more or less saying “These guys are more expert at this query-building process than attorneys.” Ouch.

I’ve long prided myself on my search ability (ask me about the time I deployed a Boolean double-negative in a Westlaw search for Puerto Rico “RICO” cases) and I’m sure many of my fellow attorneys are equally proud. However, I know people (or engineers, anyway) who are probably better at search than I am, and I know one or two otherwise blindingly brilliant attorneys who are seriously techno-lagged. More importantly, attorneys typically have a lot on their plates, and search expertise at the nitty-gritty “get the vocabulary exactly right” level is just one of a thousand equally important things on their minds, so it’s not realistically going to be a “core competency.” So I can see the wisdom in H5’s approach, although I wonder how many attorneys are willing to admit right out loud that they are better off outsourcing this competency.

The other side of the H5 coin is represented by Inference Data, which offers a tightly designed software solution that enables attorneys themselves to become the locus of control for search. For counsel with the proper training and technical aptitude, this strikes me as a killer combination, placing the locus of control — teaching the software to find the right documents — in the hands of the attorney who is the “gold standard” subject matter expert.

I can see where, depending on a number of different factors, either solution might be better. I encourage anyone facing this choice to make an informed decision about which approach leads to the best results rather than relying on their knee-jerk reaction.

e-Discovery outsourcing 101: who makes which decisions?

Because e-discovery is complex, and the penalties for screwing it up are significant, the following choice should be considered periodically by the attorneys, clients, and IT people involved in e-discovery: “Do we do this piece of the project with the people we already have, do we add people to our payroll who do this, or do we bring in an outside partner to do this?” This is when the IT people reading this post will start muttering the cliché “Build or buy?”, which means choosing between “do it ourselves” and finding a pre-packaged solution.

In a generalized “leadership” or “management” frame of mind the basic choice is: “Do, Delegate, or Dump.” I am fond of characterizing this choice as the assignment of the locus of control for decision-making, where an important consideration is who will do the best job of making the decisions once given that responsibility.

  • “Do” = Must I make a particular set of decisions myself? Are those decisions an essential part of my role in the organization, and am I the one with the right information and motivation to make them?
  • “Delegate” = Can someone else make this set of decisions just as well, or perhaps better, especially once doing so becomes an essential part of their role?
  • “Dump” = Should we even be in the business of making these decisions at all, or can we simply drop the issue from our plates?

For example, one can dump having a company picnic to save money. One can’t dump bookkeeping, however, even in a very small company. But even in a very small company a leader can usually delegate or outsource primary responsibility for bookkeeping and expect to get good results while focusing on core competencies of the business such as production, customer relationships, and motivating team members.

Ultimately the choice boils down to this: Do I want to possess and maintain expertise in making certain decisions, at a certain level of granularity, as a core competency? If yes, then I must make it a core competency, which means investing the time, attention, and education it takes to do it right. If no, then I should bring in someone else who has that core competency and who is invested in doing it right.

In e-discovery, answering the question of what can be outsourced — or where to place the locus of control for decision-making — gets even more interesting since courts hold attorneys personally responsible not only for delivering high-quality document production results but for understanding and directing the process by which results are achieved. So the question becomes: Will attorneys generate better document production results when they personally control more of the process (for example, by personally, hands on the keyboard, deriving and executing search methodology)? Or, will they generate better results by collaborating more with outsourced experts, directing and supervising but delegating more of the hands-on decisions?

More than a few attorneys reading this might find that the choice is not as cut and dried as they think. In my next post I’ll explore this choice by applying the core competency / locus of control standard to competing document review automation solutions from Inference Data and H5.

Catch-22 for e-discovery standards?

Lawyers are waiting for judges… who are waiting for lawyers. What’s a client to do?

E-discovery law is always going to be out of step with information technology because of the way in which the law develops.

Photo credit: Drew Vigal / CC BY-NC 2.0

The legal system is very rules oriented, including many purely procedural rules. This is so partly because the legal system is an adversarial process, in which disputes between opposing parties are resolved by a neutral third party judge. The legal system attempts to create a fair process by giving everyone advance notice of the rules that will be observed, then sticking to those rules. Because the process is considered so important, a great deal of time and money may be spent during a lawsuit to resolve disputes over whether one party or the other has broken the procedural rules. Serious breaches of purely procedural rules by one party can lead a judge to award victory to their opponent for no other reason than failing to follow procedure. A judge may also find that party’s attorneys personally liable for their participation in the rule-breaking.

Legal procedural rules are established by legislatures, like the US Congress and state assemblies; by court administrative bodies, which produce rules such as the Federal Rules of Civil Procedure and (initially) the Federal Rules of Evidence; and by judges, by way of the decisions (also known as rulings or opinions) they write to explain how they resolved a dispute between parties to a lawsuit.

Recommendations concerning legal procedures may be made by expert committees, which for e-discovery typically means The Sedona Conference and TREC Legal Track. Although judges regularly mention legal experts’ recommendations in their opinions – such mentions are typically known as “dicta” – both the recommendations and the dicta lack the authority of law.

In an ideal world, clearly defined, efficient, and flexible rules would be developed and followed for identifying and disclosing electronic data – like a well-developed API (Application Programming Interface) in the software realm, or a useful IEEE standard, which is akin to what TREC may ultimately develop. Unfortunately, as with many APIs and even internationally recognized standards, the reality in law is quite a bit messier than the ideal.

Legislators and judges are not experts in the realm of electronically stored information. Nor do they wish to be; nor should their role be expanded to require such expertise. Furthermore, judges tend to be reactive rather than proactive. Not only do they have very full agendas already, but the fundamental character of their role is to wait until they are presented with a dispute that clearly and narrowly addresses a particular issue before they make a ruling that takes a position on that issue. This is known as “judicial restraint”.

Two new influences are incrementally changing the legal rules concerning e-discovery. The first influence is the ripple effect of recent changes to the Federal Rules of Civil Procedure and Evidence, which clarify how electronically stored information is to be reviewed and disclosed. Attorneys and judges must attempt to follow these new rules, and judges are now deciding new e-discovery cases interpreting these rules for the first time in various situations. The second influence is the rising prominence of two prestigious but unofficial advisory committees, the aforementioned Sedona Conference and TREC Legal Track, which have researched and published detailed findings and recommendations concerning e-discovery best practices. These recommendations are increasingly being cited by judges in their rulings.

Notwithstanding the commendable efforts of those who have been working skillfully and hard to improve e-discovery standards, those standards remain quite broad and subject to a wide range of interpretations. As a result, a wide range of technologies and processes are thriving in the e-discovery realm. And the standards that exist at present do not offer the means by which to compare most of the popular technologies and processes currently in use.

Two change agents are likely to sharpen e-discovery standards and narrow the field of e-discovery technologies and processes.

The first change agent is the quality control movement, which is being promoted by vendors such as H5 and Inference and recommended by both TREC and Sedona (and which I’ve written a little about here). The thinking is that e-discovery, like any other mass-scale process – and here we’re talking about reviewing documents on the scale of tens of thousands to millions – must adhere to well-defined quality standards, just as a manufacturer producing thousands or millions of consumer product components must.

The second and most important change agent will be groundbreaking legal rulings by judges. When judges eventually rule that new standards must be observed, lawyers and their clients will follow (for the most part). Until then, lawyers are almost certain to stick to the status quo. In fact, when I discuss this topic with e-discovery solution vendors, I always hear that their attorney customers aren’t interested in anything the courts haven’t already accepted.

But therein lies the Catch-22: judges are waiting for lawyers to present an issue, while the lawyers are waiting for the judges to rule so they don’t have to present the issue. Because of judicial restraint, judges only rule on issues that have been unambiguously presented to them in the course of a lawsuit. But until the courts have ruled that certain technical standards are required, lawyers won’t advise their clients to rely on those standards. And until a lawyer’s client relies on a standard in a situation that puts that standard at issue in a lawsuit, judges can’t rule that the standard is legally valid. So round and round it goes.

The logjam breakers will be 1) further amendments to the procedural rules, based no doubt on the recommendations of TREC and Sedona and the vendors and lawyers that participate in those efforts, and possibly 2) a “civil rights campaign” approach. The latter is a scenario like the one we saw in the 1950s Supreme Court school desegregation decisions, in which clients stepped forward at some personal risk to offer their personal circumstances as “test cases.” By adopting and sticking to a certain principle, even though it created a controversy, they could bring about court rulings that effectively changed the law.

In a corporate setting the latter approach may only be possible with C-level and board support because of the risk involved. Opportunities would have to be identified and pursued in which it appeared that the company could save more in the long run by relying on e-discovery innovations, such as quality measures and automation, than it risked in the short run by going to court. In addition, the innovations would have to be fundamentally sound, albeit untested in court, while controversial enough that an opposing party was unwilling to accept them without a court challenge. (Ironically, whenever a client adopts an e-discovery innovation that doesn’t lead to controversy, it results in no judge’s ruling and thus no legal precedent for other clients’ attorneys to follow.)

Both of these logjam breakers are bound to move slowly, which means years of small, incremental change before definitive technology standards emerge. Meanwhile, the complexity of information management continues to increase, and the cost of e-discovery is bound to rise along with it. For these reasons, the more each of us does, in our individual and representative roles, to support standards that lead to increased efficiency and reduced cost, the better off we’ll all be.

TREC and the gold standard for document review

Ron Friedman recently blogged an excellent critique of TREC Legal Track’s effort to objectively assess eDiscovery document review practices. Like Ron, I commend TREC Legal Track while wishing to offer comments that may contribute to their success. Like me, Ron is an attorney with long experience working in the technology sector, although for comparison with his math background I can only claim four years of college courses concerning statistical methods for assessing human behavior.

Benchmarking is valuable almost everywhere.

I strongly recommend reading Ron’s post for the benefit of his insights, whether or not you are already familiar with TREC Legal Track. I’d also like to offer my own observations about TREC Legal Track’s finding of low consistency between document classification decisions made by subject matter experts, who are spoken of as “gold standard” reviewers, and ordinary legal document reviewers. (In TREC Legal Track’s study, ordinary reviewers were 2nd and 3rd year law students. In real life the subject matter expert role is played by in-house or outside counsel, while much of the actual review work is performed by contract or outsource attorneys.)

Generally speaking, quality control processes involve benchmarking against some standard. Mechanical processes can be meaningfully benchmarked by physically sampling output (this is the essence of Six Sigma, in particular). For example, as machine parts come off an assembly line, samples can be selected and measured, and the variance between their actual size and target size monitored, not only to detect defects but to flag the processes responsible for defects. Human processes can also be benchmarked in a variety of ways. (This is in part the province of ITIL, the “Information Technology Infrastructure Library,” and the basis for the idea of “service level agreements.”) For example, those responsible for a customer service center may track the number of issues handled per hour, the type of issues handled, the number of resolutions or escalations per issue, revenue gained or lost per issue, and so on.
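To make the benchmarking idea concrete, here is a minimal sketch in Python. The target size, tolerance, and defect-rate cutoff are invented numbers for illustration, not drawn from any real quality program:

```python
import random
import statistics

TARGET_MM = 25.0     # hypothetical target part size
TOLERANCE_MM = 0.05  # hypothetical acceptable deviation from target

def sample_parts(production_run, sample_size=50):
    """Pull a random sample of measured parts from a production run."""
    return random.sample(production_run, sample_size)

def benchmark(sample):
    """Measure sampled output against the target standard."""
    deviations = [abs(size - TARGET_MM) for size in sample]
    defect_rate = sum(d > TOLERANCE_MM for d in deviations) / len(sample)
    return {
        "mean_deviation": statistics.mean(deviations),
        "defect_rate": defect_rate,
        "flag_process": defect_rate > 0.01,  # flag the process if >1% defective
    }

# Simulated production run: part sizes cluster around the target.
run = [random.gauss(25.0, 0.02) for _ in range(10_000)]
print(benchmark(sample_parts(run)))
```

The same pattern applies to human processes: substitute issues handled per hour or escalations per issue for part sizes, and a service level agreement for the tolerance.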

Unfortunately, “responsiveness” and “privilege” are not only somewhat subjective in document review; the standards for responsiveness and privilege also vary from case to case. For this reason standards need to be developed “on the fly” for each case, and these standards will by necessity be arbitrary (that is, subjective) to some degree even if consistently applied. The good news is that the latest generation of document clustering applications incorporates tools for developing consistent document review standards on the fly. Through an iterative feedback loop, the humans teach the machine to look for documents with certain characteristics, while the machine forces the humans to refine their conception of responsiveness and privilege to a degree that the machine can reliably model. After enough iterations have passed and the machine has reached some measurable standard of consistency, the humans can step back and let the machine do the rest of the review work. The machine does it more consistently than human reviewers could themselves, and at a much lower cost.
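Here is a minimal sketch of such a feedback loop using Python and scikit-learn. The toy corpus, the labeling rule, and the stopping threshold are all invented; in a real review the `attorney_labels` function is a human reviewer, not code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for a collection of unreviewed documents.
documents = [
    "quarterly earnings restatement draft",
    "lunch menu for the cafeteria",
    "memo re: disputed contract terms",
    "holiday party photos",
    "email re: breach of warranty claims",
    "fantasy football standings",
] * 50  # repeated so there is something left to sample

def attorney_labels(docs):
    """Stand-in for human review: mark finance/contract/claim documents responsive."""
    return [int(any(k in d for k in ("earnings", "contract", "claim"))) for d in docs]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

labeled_idx = list(range(12))  # small seed set reviewed by humans
unlabeled_idx = list(range(12, len(documents)))
model = LogisticRegression()

for _ in range(5):  # a handful of feedback iterations
    y = attorney_labels([documents[i] for i in labeled_idx])
    model.fit(X[labeled_idx], y)
    # Ask the humans about the documents the model is least sure of.
    proba = model.predict_proba(X[unlabeled_idx])[:, 1]
    uncertainty = np.abs(proba - 0.5)
    if uncertainty.min() > 0.4:  # model is confident everywhere: stop iterating
        break
    most_uncertain = [unlabeled_idx[i] for i in np.argsort(uncertainty)[:10]]
    labeled_idx.extend(most_uncertain)
    unlabeled_idx = [i for i in unlabeled_idx if i not in most_uncertain]

# The trained model now classifies the remaining documents automatically.
predicted_responsive = model.predict(X[unlabeled_idx])
```

Each pass through the loop is one turn of the feedback cycle: the humans label the documents the machine finds hardest, and the machine’s refitted model embodies the sharpened standard.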

With document review, the very idea of defining a “gold standard” for classification is less useful than it sounds. For instance, even if a panel of leading legal scholars could be formed for each eDiscovery matter, the mere fact that someone may legitimately be called a leading scholar doesn’t mean that their views will be consistent with anyone else’s — just well reasoned. But a “gold standard” is not what’s important here. What’s important is that in each case the attorneys responsible for responding to a document request do everything they can to carefully define and consistently enforce reasonable document review standards. This is what the current crop of document clustering applications is intended to do. That is the current model, anyway. I don’t pretend to be able to name the vendors who can or cannot deliver on this promise, although I think this will be the number one question in eDiscovery technology before long.

UPDATE: I discuss TREC’s role in formulating new legal procedural rules for e-discovery in a later blog post, Catch-22 for e-discovery standards?

Reusing document clustering categories to spend less on eDiscovery?

After drafting a blog post about mass data sampling and classification in the “cloud,” I became curious about the potential for reusing categories developed in eDiscovery sampling and classification projects as “seeds” for later projects. For further insight I turned to Richard Turner, Vice President of Marketing at Content Analyst Company, LLC, a document clustering and review provider for eDiscovery.

Bruce: I wonder to what extent document categories that are created using document clustering software when reviewing documents for eDiscovery can be aggregated across multiple document requests and/or lawsuits within the same company. Can previously developed categories or tags be reused to seed, and thus speed up, document review in other cases?

Richard: Regarding the notion of aggregating document categories, it’s something that’s technically very feasible. And it could greatly speed document review if categories could be used to “seed” new reviews, new cases, etc. Here’s the challenge: we have found that most of the “categories” developed by our clients start out case-specific and are too granular to be valuable when the next case comes along. It also hasn’t seemed to matter whether categorization was being used by a corporate legal department or by outside counsel – they’re equally specific.

The idea itself had merit, so we tossed it around with our Product Solutions Architects, and they came up with several observations. First of all, the categories people develop are driven by their need to solve a specific eDiscovery challenge, i.e. documents that are responsive to the case at hand. Second, when the next issue or case comes along, they naturally start over again, first by identifying responsive documents and then by using those documents to create categories – any “overlap” is purely coincidental. Finally, to develop categories that were really useful across a variety of issues or cases, they would need to be fairly generic and probably not developed with any specific case in mind.

I think that’s very hard to do for a first or even second-level review – it’s not necessarily a natural progression, as people work backwards from the issues at hand. Privilege review, however, could be a different animal. There are some things in any case that invoke privilege because of the particulars of the case, for example attorney-client conversations, which are likely to involve different individuals in different litigation matters. There are other things that could logically be generic: company “trade secrets,” for example, would almost always be treated as privileged, as are certain normally redacted items such as PII (personally identifiable information). Privilege review is also a very expensive aspect of eDiscovery, since it involves physical “reads” by highly paid attorneys (not something you can comfortably offshore). Could “cloud seeding” have value for this aspect of eDiscovery? It’s an interesting thought.
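To make the “cloud seeding” idea concrete, here is a minimal sketch in Python of reusable, case-independent seed categories flagging privilege-review candidates before any case-specific training begins. The category names and patterns are invented for illustration:

```python
import re

# Generic, case-independent "seed" categories that could, in principle,
# be reused across matters (illustrative patterns only).
SEED_CATEGORIES = {
    "pii": [
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                 # US SSN format
        re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b"),  # card number format
    ],
    "trade_secret": [
        re.compile(r"\bproprietary\b", re.IGNORECASE),
        re.compile(r"\btrade secret\b", re.IGNORECASE),
    ],
}

def seed_classify(document_text):
    """Return the reusable categories a document matches, if any."""
    return [
        name
        for name, patterns in SEED_CATEGORIES.items()
        if any(p.search(document_text) for p in patterns)
    ]

doc = "Attached is our proprietary formula; employee SSN 123-45-6789."
print(seed_classify(doc))  # ['pii', 'trade_secret']
```

Documents matching a seed category would be routed to privilege review first, with case-specific categories layered on afterwards.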

Cloud-seeding: SaaS data classification via Panda Security’s new anti-virus offering

Panda Security recently released (in beta form) what it claims is the first cloud-based anti-virus / anti-malware solution for Windows PCs. Not only does it sound like a clever tool for data loss prevention, but it demonstrates another way in which information service providers can aggregate individual user data to develop classifications or benchmarks valuable to every user, a mechanism I’ve explored in previous blog posts.

In essence, every computer using Panda’s Cloud Antivirus is networked together through Panda’s server to form a “collective intelligence” for malware detection and prevention. Here’s how it works: users download and install Panda’s software – it’s a small application known as an “agent” because the heavy lifting is done on Panda’s server. These agents send reports back to the Panda server containing information about new files (and, I presume, related computer activity which might indicate the presence of malware). When the server receives reports about previously unknown files which resemble, according to the logic of the classification engine, files already known to be malware, these new files are classified as threats without waiting for manual review by human security experts.

Sampling at the right time and place allows proactive decision making.

For example, imagine a new virus is released onto the net by its creators. People surfing the net, opening emails, and inserting digital media start downloading this new file, which can’t be identified as a virus by traditional anti-virus software because it hasn’t been placed in any virus definitions list yet. Computers on which the Panda agent has been installed begin sending reports about the new file back to the Panda server. After some number of reports about the file are received by Panda’s server, the server is able to determine that the new file should be treated as a virus. At this point all computers in the Panda customer network are preemptively warned about the virus, even though it has only just appeared.
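A minimal sketch of what that server-side logic might look like follows. The report threshold, the feature-set report format, and the Jaccard similarity test are my assumptions for illustration, not Panda’s actual design:

```python
from collections import defaultdict

REPORT_THRESHOLD = 100   # assumed: reports needed before classifying a file
SIMILARITY_CUTOFF = 0.6  # assumed: minimum feature overlap with known malware

# Feature sets of files already known to be malware (toy data).
known_malware = {
    "a1b2c3": {"writes_to_system32", "disables_firewall", "self_replicates"},
}

report_counts = defaultdict(int)
classified_threats = set()

def jaccard(a, b):
    """Feature-set overlap between two files (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def receive_report(file_hash, features):
    """Handle one agent report; classify the file once evidence accumulates."""
    report_counts[file_hash] += 1
    if file_hash in classified_threats:
        return "already flagged"
    if report_counts[file_hash] >= REPORT_THRESHOLD:
        resembles_malware = any(
            jaccard(features, known) >= SIMILARITY_CUTOFF
            for known in known_malware.values()
        )
        if resembles_malware:
            classified_threats.add(file_hash)
            return "flagged: warn all agents"  # preemptive network-wide warning
    return "pending"

# Simulate many agents reporting the same previously unknown file.
for _ in range(100):
    status = receive_report("ff99ee", {"writes_to_system32", "disables_firewall"})
print(status)  # "flagged: warn all agents"
```

The point of the sketch is the aggregation step: no single agent’s report is decisive, but the accumulated reports let the server classify a file that no definitions list has ever seen.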

According to Panda’s April 29, 2009 press release:

Utilizing Panda’s proprietary cloud computing technology called Collective Intelligence, Panda Cloud Antivirus harnesses the knowledge of Panda’s global community of millions of users to automatically identify and classify new malware strains in almost real-time. Each new file received by Collective Intelligence is automatically classified in under six minutes. Collective Intelligence servers automatically receive and classify over 50,000 new samples every day. In addition, Panda’s Collective Intelligence system correlates malware information data collected from each PC to continually improve protection for the community of users.

Because Panda’s solution is cloud-based and free to consumers, it will reside on a large number of different computers and networks worldwide. This is how Panda’s cloud solution is able to fill a dual role as both sampling and classification engine for virus activity. On the one hand Panda serves as manager of a communal knowledge pool that benefits all consumers participating in the free service. On the other hand, Panda can sell the malware detection knowledge it gains to corporate customers – wherein lies the revenue model that pays for the free service.

I have friends working at two unrelated startups, one concerning business financial data and the other enterprise application deployment ROI, that both work along similar lines (although neither is free to consumers). Both startups offer a combination of analytics for each customer’s data plus access to benchmarks established by anonymously aggregating data across customers.

Panda’s cloud analytics, aggregation, and classification mechanism is also analogous to the non-Boolean document categorization software for eDiscovery discussed in previous posts on this blog, whereby unreviewed documents can be automatically (and thus inexpensively) classified for responsiveness and privilege:

Deeper, even more powerful extensions of this principle are also possible. I anticipate that we will soon see software which will automatically classify all of an organization’s documents as they are created or received, including documents residing on employees’ laptops and mobile devices. Using Panda-like classification logic, new documents will be classified accurately whether or not they are an exact match with anything previously known to the classification system. This will substantially improve implementation speed and accuracy for search, access control and collaboration, document deletion and preservation, endpoint protection, storage tiering, and all other IT, legal and business information management policies.
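As a sketch of what such classify-on-creation logic might look like: the categories, threshold, and fallback policy below are invented, and cosine similarity against per-category centroids is just one way to catch near-matches that are not exact matches with anything previously seen:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Historical documents whose policy categories are already known (toy data).
history = [
    ("board minutes q3", "retain_7_years"),
    ("board minutes q4 draft", "retain_7_years"),
    ("press release product launch", "public"),
    ("press release earnings", "public"),
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([text for text, _ in history]).toarray()

# One centroid vector per policy category.
centroids = {
    label: np.mean([X[i] for i, (_, l) in enumerate(history) if l == label], axis=0)
    for label in {l for _, l in history}
}

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def classify_on_create(text, threshold=0.2):
    """Tag a newly created document; fall back to manual review if unsure."""
    v = vectorizer.transform([text]).toarray()[0]
    label, score = max(
        ((l, cosine(v, c)) for l, c in centroids.items()), key=lambda t: t[1]
    )
    return label if score >= threshold else "route_to_manual_review"

print(classify_on_create("draft board minutes for q1"))  # retain_7_years
print(classify_on_create("vacation photos"))             # route_to_manual_review
```

A hook like this, run as documents are created or received, is what would let retention, access control, and preservation policies attach to a document from its first moment of existence.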