Earlier this week I blogged about placing the locus of control for e-discovery decisions in the right hands to ensure that the decisions made pass muster in court. To illustrate the potential impact of moving the locus of control for certain decisions to an outsourced partner, let’s compare the document review solutions offered by H5 and Inference Data.
Last month I attended a webinar presented by H5. One thing that struck me as distinctive about H5 is their standard deployment of a team of linguists to improve detection of responsive documents from among the thousands or millions of documents in a document review. During the webinar I submitted a question asking what it is their linguists do that attorneys can’t do themselves. One of their people was kind enough to answer, more or less saying “These guys are more expert at this query-building process than attorneys.” Ouch.
I’ve long prided myself on my search ability (ask me about the time I deployed a boolean double-negative in a Westlaw search for Puerto Rico “RICO” cases) and I’m sure many of my fellow attorneys are equally proud. However, I know people (or engineers, anyway) who are probably better at search than I am, and I know one or two otherwise blindingly brilliant attorneys who are seriously techno-lagged. More importantly, attorneys typically have a lot on their plates, and search expertise on a nitty-gritty “get the vocabulary exactly right” level is just one of a thousand equally important things on their minds, so it’s not realistically going to be a “core competency.” So I can see the wisdom in H5’s approach, although I wonder how many attorneys are willing to admit right out loud that they are better off outsourcing this competency.
I can see where, depending on a number of different factors, either solution might be better. I encourage anyone facing this choice to make an informed decision about which approach leads to the best results rather than relying on their knee-jerk reaction.
Because e-discovery is complex, and the penalties for screwing it up are significant, the following choice should be considered periodically by attorneys, clients and IT people involved in e-discovery: “Do we do this piece of the project with the people we have already, or do we add people to our payroll who do this, or do we bring in an outside partner to do this?” This is when the IT people reading this post will start muttering the cliché truism “Build or Buy?” which means choosing between “do it ourselves” and finding a pre-packaged solution.
In a generalized “leadership” or “management” frame of mind the basic choice is: “Do, Delegate, or Dump.” I am fond of characterizing this choice as the assignment of the locus of control for decision-making, where an important consideration is who will do the best job of making the decisions once given that responsibility.
“Do” = Must I make a particular set of decisions myself – are those decisions an essential part of my role in the organization, and am I the one with the right information and motivation to make these decisions?
“Delegate” = Can someone else do just as well, or perhaps a better job, with making this set of decisions, especially once making these decisions becomes an essential part of their role?
“Dump” = Should we even be in the business of making these decisions at all, or can we just drop that issue off of our plates somehow?
For example, one can dump having a company picnic to save money. One can’t dump bookkeeping, however, even in a very small company. But even in a very small company a leader can usually delegate or outsource primary responsibility for bookkeeping and expect to get good results while focusing on core competencies of the business such as production, customer relationships, and motivating team members.
Ultimately the choice boils down to this: Do I want to possess and maintain expertise in making certain decisions, at a certain level of granularity, as a core competency? If yes, then I must make it a core competency, which means investing the time, attention, and education it takes to do it right. If no, then I should bring in someone else who has that core competency and who is invested in doing it right.
In e-discovery, answering the question of what can be outsourced — or where to place the locus of control for decision-making — gets even more interesting since courts hold attorneys personally responsible not only for delivering high-quality document production results but for understanding and directing the process by which results are achieved. So the question becomes: Will attorneys generate better document production results when they personally control more of the process (for example, by personally, hands on the keyboard, deriving and executing search methodology)? Or, will they generate better results by collaborating more with outsourced experts, directing and supervising but delegating more of the hands-on decisions?
More than a few attorneys reading this might find that the choice is not as cut and dried as they think. In my next post I’ll explore this choice by applying the core competency / locus of control standard to competing document review automation solutions from Inference Data and H5.
Disaster recovery and archiving are key zones of interaction for IT and Legal Departments. When a lawsuit is filed and an e-discovery production request is received, a company must examine all of its electronically stored information to find documents that are relevant to that lawsuit. Court battles may arise regarding the comprehensiveness of the examination, the need to lock down potentially important documents and metadata, and the cost of identifying, collecting, preserving, and reviewing documents — all of which are related to the way in which data is stored.
With this in mind, I recently sought out Jishnu Mitra, President of Stratogent, a specialized application hosting and disaster recovery services provider, to obtain his perspective on disaster recovery best practices and the relationship between disaster recovery and e-discovery. Key points he made include:
effective disaster recovery sites are “hot” sites that can be used for secondary purposes rather than remaining idle;
“cold” sites are unlikely to get the job done and are not cheap;
efforts to keep IT budgets down by delaying or limiting disaster recovery, or by limiting archiving, can backfire;
budget-conscious IT departments are more likely to use archiving features built into their software of choice;
many IT and Legal personnel have a habit of being disrespectful towards one another and doing a poor job of communicating with one another;
more crossover Legal-IT people are needed.
Bruce: Can you provide a little background about Stratogent’s domain expertise?
Jishnu: We offer end-to-end application hosting services, including establishing the hosting requirements and architecture, hardware and software implementation, and proactive day-to-day application management including responding to any issues that arise. Most of the time we are tasked with building a full data center, not the building itself, of course, but a complete software and hardware hosting framework. We aren’t providers of any specific business application (as salesforce.com is). We design, deploy and operate all the layers on which modern business applications are hosted, including the application’s framework, e.g. .NET, Java or SAP Basis.
Our customers include multi-office companies, who require applications shared between offices, and web-based application SaaS (“software as a service”) companies. The scope is typically quite complex – we don’t build or manage general web sites or blogs — that’s a commodity market and too crowded. We build and manage custom application infrastructures for enterprises or for complex applications that require a range of IT skills to manage. Our customers hire us because they don’t want to budget to hire all of the people they would need to do this internally, or when they are deploying a new application that is beyond the current reach of their IT team. For example, if a company wants to start using a new-to-them ERP [“Enterprise Resource Planning”] application like SAP or (say) a Microsoft based enterprise landscape that needs to scale, we can multiplex our internal pool of talent to give their application 24-7 attention far cheaper than the company can hire and retain the specialized employees they need to do it themselves.
Bruce: So you supply the specialized competencies needed to build and operate complex application environments so that your customers can focus on their core competencies? Then their core competencies don’t need to include what you do in order for them to succeed.
Jishnu: Yes. They know what they want, they conceptualize what they want, but not the hardware they need and the infrastructure software. We can go in from the very beginning saying, “Here’s how you set up a highly available, clustered server farm for your social networking app,” and so on and so forth. We know how to customize it and set it up. They also don’t have our expertise in negotiating with hardware vendors, or in capacity planning, etc. Plus there’s the build phase, loading OSes, etc. We essentially give them over the course of our engagement the entire hosting framework on which the app runs and then take care of it for the long run.
Once we get their hosting framework to a steady state, they get to run with it for two, five, or more years with little or no failure. So their role is conceptualizing on day 1, and then we become a partner organization worrying about how to realize that dream, handling inevitable IT break-fix issues and managing changes over the entire life span of that system. Disaster recovery usually becomes part of that framework at some point.
Bruce: Can you give me some broad idea of the scope of disaster recovery work that you do?
Jishnu: Disaster recovery is not a separate arm of our business. It’s very integral to the hosting services we provide. We build disaster recovery sites at different levels of complexity. It can go from a small customer up to a really large customer. And over time Stratogent gets into innovative approaches to deal with disaster recovery. The philosophy of Stratogent is that we’re not trying to sell a boxed solution to all the customers. It’s more of a custom solution, not a mass market product. We say we will architect and host your solution – and as architect we always add very specific elements for our customer, not just one solution for everyone.
The basic approach, even for small customers, is to choose a convenient and correct location for the disaster recovery site and use a replication strategy based on whatever they can afford or have tolerance to accept. As much as possible a disaster recovery site should be up and running and ready to go at a flip of the switch. They can use the excess capacity at their disaster recovery site at quarter end to run financial reports or for other business purposes, plus it can be used for application QA and staging systems. They can be smart about it, and keep it on, so that they can have confidence in it.
Of course a disaster recovery solution like this can’t be built in just a month or two – to do it right requires creativity and diligence. In one recent instance when asked to do it “right now”, we had to go with a large vendor’s standard disaster recovery solution for our customer. Everyone knows that this does not get us anything beyond the checkmark for DR, so the plan is to go to a Stratogent solution over time, build a hot alternate site on the East Coast, and sunset the large vendor’s standard disaster recovery arrangement.
Bruce: Given the importance of disaster recovery for a number of reasons, how seriously are companies taking it?
Jishnu: Everybody needs it, but it suffers from “high priority, low criticality”, and the problem rolls from budget year to budget year. Some unpleasant trigger like an outage, or an impending audit instigates furious activity in this direction, but then it goes on to the back burner again. In the recent instance, although disaster recovery was scheduled for a later phase for technical reasons, for SOX compliance the auditor demanded a disaster recovery solution by year end or our customer would fail their audit. So we went out and obtained a large vendor’s standard disaster recovery solution, which met the auditors’ requirements but isn’t comparable to a “hot” disaster recovery site.
The way disaster recovery solutions from some of the large vendors work is this: they have huge data centers where their customers can use equipment should a disaster happen. Customers pay a monthly fee for this privilege. When a disaster strikes, customers ship their backup tapes out there, fly their people out there, and start building a disaster recovery system from scratch. And by the way, if you have trouble, here’s the menu for emergency support services for which they will charge you more. And in 95 out of 100 cases it just doesn’t work; it is a monumental failure when you need it most. These are “cold” sites that have to be built from the ground up. It takes maybe 72 hours to get them up, or rather, to get them asserted as “up”. Then, as someone like yourself with application development experience knows, it takes weeks to debug and get everything working correctly. And when you’re not actually using them, standard disaster recovery services are charging you an incredibly high amount of money for nothing except the option of bringing your people and tapes to their center, and then good luck.
Bruce: You mentioned running quarterly financials, QA, and staging as valuable uses for the excess capacity of “hot” disaster recovery sites. Could this excess capacity also be used for running e-discovery processes when the company is responding to a document production request?
Jishnu: Possibly, but I haven’t seen it done yet in a comprehensive manner. The problem is you still need to have the storage capacity for e-discovery somewhere. The e-discovery stuff is a significant chunk of storage, maybe tier 2 or 3, which demands different storage anyway, so it makes sense to keep the e-discovery data in the primary data center because it’s easier and faster to copy, etc. That said, it is very useful to employ the capacity available in the secondary site for e-discovery support activities like restoring data to an alternate instance of your application and for running large queries without affecting the live production systems.
Bruce: Do you deploy disaster recovery solutions that protect desktop drive, laptop drive, or shared drive data?
Jishnu: As I have said, our disaster recovery solutions are part of whatever application frameworks we are hosting. We as a company don’t get into the desktop environment, the local LANs that the companies have. We leave that to local teams or whatever partner does classic managed services. We do data centers and hosted frameworks. We don’t have the expertise or organizational structure to have people traveling to local sites, answering desktop-related user queries, etc. But any time it leaves our customer’s office and goes to the internet, from the edge of the office on out it’s ours.
Bruce: But when archiving is part of the customer’s platform hosted by you, it gets incorporated in your disaster recovery solution?
Bruce: Is Stratogent involved when your customers must respond to e-discovery and regulatory compliance information retrieval requests?
Jishnu: Yes. For example, we recently went through and did what needed to be done when a particular customer asked for all the documents in response to a lawsuit. We brought in a consultant for that specific archiving system as well. Our administrators collaborated with the consultant and two people from the customer’s IT department. It took a couple of weeks to provide all the documents they asked for.
Bruce: Was the system designed from the outset with minimizing e-discovery costs in mind?
Jishnu: Unfortunately no. In this case archiving for e-discovery was an afterthought that was grafted on to the application later, and a push-button experience wasn’t among the criteria when designing this particular system. But it woke us up. We realized this could get worse.
Bruce: So how do you do it differently now that you’ve had this experience?
Jishnu: Here we recommended to our customer that we upgrade to the newest version of the archiving solution and begin using untapped features that allow for a more push-button approach. Keep in mind that when this system was designed, e-discovery products weren’t as popular or sophisticated as you see them now.
Bruce: Aren’t there third-party archiving solutions also?
Jishnu: There are several third-party products and you see the regular enterprise software vendors coming out with add-ons. We’re especially looking forward to the next version of Exchange from Microsoft, where for us the salient feature is archiving and retention, simply because email is the number one retrieval request. On most existing setups, getting the information for a lawsuit or another purpose takes us through an antiquated process of restoring mailboxes from tape and loads of manual labor. It’s pretty painful, it takes an inordinate amount of time to find specific emails, it’s not online, it takes days. For this reason we’re looking forward to Exchange 2010, which has features built into the product itself. Yes, some other vendors have add-on products that do this also.
Bruce: And I assume you’re familiar with Mimosa, in the case of Exchange?
Jishnu: Like Mimosa, yes. But when it’s built-in the customer is more likely to use it. By default customers don’t buy add-ons for budgetary reasons. It’s so much easier if the central product has what we need, and that is in fact happening a lot these days. I won’t be surprised if products in general evolve so that compliance and regulatory features get considered integral parts of the software and not someone else’s problem.
Bruce: Do you have other examples of document retrieval from backups or archives?
Jishnu: Actually there are three scenarios where we do document retrieval. Scenario one, which we discussed, is e-discovery. Scenario 2 is mergers and acquisitions, where we have seen retrieval requests and had to pretty much get information from all sorts of systems, a huge pain.
Scenario 3 is SaaS driven. For many of our customers, the bulk of their systems are either on-premises or hosted by Stratogent, but some of our customers use SalesForce.com or one of many, many small or industry specific SaaS vertical solutions. In one recent case, one of these niche vertical SaaS vendors, because of some of the issues in that industry, was about to go out of business. We had to go into emergency mode and create an on-premise mirror, actually more like a graveyard for the data, to keep it for the future, to enable us to fetch the data from that service. We figured out a solution for how to get all the customers’ data and replicate and keep it in our data center and continuously keep it up to date. Fortunately the vendors were cooperative and allowed access through their back door to allow us to achieve this. I call this “the SaaS fallback” scenario. SaaS is a great way to quickly get started on a new application, but BOY, if anything happens, or if you decide you aren’t happy, it becomes a data migration nightmare and worse than an on-premises solution because you have no idea how it’s being kept and have to figure out how to retrieve it through an API or some other means.
Bruce: In e-discovery and other legal-driven document recovery scenarios, how important is collaboration between IT and Legal personnel, or should I say, how significant a problem is the lack of this collaboration?
Jishnu: I’ve seen the divide between IT and legal quite often. Calling it a divide is actually being polite; at worst both parties seem to think the others are clueless or morons. It’s a huge, huge gap. And I have also seen it playing out not just in traditional IT outfits, but also product based companies when I was principal architect at Borland. When attorneys came to talk to engineering about IP issues, open source contracts or even patent issues, there was no realization among the techies that it was important. In fact legal issues were labeled “blockers” and the entire legal department was “the business prevention department”. And there is exactly the opposite feeling in the other camp with how engineering leaders don’t “get it” and how talking to anybody in development or IT was like talking to a wall. The psychological and cultural issues between IT and legal have been there for a while. In some of the companies that have surmounted this issue, the key seems to be having a bridge person or team acting as an interpreter to communicate and keep both sides sane. Some technical folks I know have moved on to play a distinctly legal role in their organizations and they play a pivotal role in closing the gap between legal and IT.
Lawyers are waiting for judges… who are waiting for lawyers. What’s a client to do?
E-discovery law is always going to be out of step with information technology because of the way in which the law develops.
The legal system is very rules oriented, including many purely procedural rules. This is so partly because the legal system is an adversarial process, in which disputes between opposing parties are resolved by a neutral third party judge. The legal system attempts to create a fair process by giving everyone advance notice of the rules that will be observed, then sticking to those rules. Because the process is considered so important, a great deal of time and money may be spent during a lawsuit to resolve disputes over whether one party or the other has broken the procedural rules. Serious breaches of purely procedural rules by one party can lead a judge to award victory to their opponent for no other reason than failing to follow procedure. A judge may also find that party’s attorneys personally liable for their participation in the rule-breaking.
Legal procedural rules are established by legislatures, like the US Congress and state assemblies; by court administrative bodies, which issue rules such as the Federal Rules of Civil Procedure and (initially) the Federal Rules of Evidence; and by judges by way of decisions (also known as rulings or opinions) they write to explain how they resolved a dispute between parties to a lawsuit.
Recommendations concerning legal procedures may be made by expert committees, which for e-discovery typically means The Sedona Conference and TREC Legal Track. But although judges regularly mention legal experts’ recommendations in their opinions – such mentions are typically known as “dicta” – both the recommendations and dicta lack the authority of law.
In an ideal world, clearly defined, efficient, and flexible rules would be developed and followed for identifying and disclosing electronic data, like a well developed API (Application Programming Interface) in the software realm, or a useful IEEE standard, which is akin to what TREC may ultimately develop. Unfortunately, as with many APIs and even internationally recognized standards, the reality in law is quite a bit messier than the ideal.
Legislators and judges are not experts in the realm of electronically stored information. Nor do they wish to be; nor should their role be expanded to require such expertise. Furthermore, judges tend to be reactive rather than proactive. Not only do they have very full agendas already, but the fundamental character of their role is to wait until they are presented with a dispute that clearly and narrowly addresses a particular issue before they make a ruling that takes a position on that issue. This is known as “judicial restraint”.
Two new influences are incrementally changing the legal rules concerning e-discovery. The first influence is the ripple effect from recent changes to the Federal Rules of Civil Procedure and Evidence which clarify how electronically stored information is to be reviewed and disclosed. Attorneys and judges must attempt to follow these new rules, and judges are now deciding new e-discovery cases interpreting these rules for the first time in various situations. The second influence is the rising prominence of two prestigious but unofficial advisory committees, the aforementioned Sedona Conference and TREC Legal Track, which have researched and published detailed findings and recommendations concerning e-discovery best practices. These recommendations are increasingly being cited by judges in their rulings.
Notwithstanding the commendable efforts of those who have been working skillfully and hard to improve e-discovery standards, those standards remain quite broad and subject to a wide range of interpretations. As a result, a wide range of technologies and processes are thriving in the e-discovery realm. And the standards that exist at present do not offer the means by which to compare most of the popular technologies and processes currently in use.
Two change agents are likely to sharpen e-discovery standards and narrow the field of e-discovery technologies and processes.
The first change agent is the quality control movement which is being promoted by vendors such as H5 and Inference and recommended by both TREC and Sedona (which I’ve written a little about here). Their thinking is that e-discovery, like any other mass-scale process – and here we’re talking about reviewing documents on the scale of tens-of-thousands to millions – must adhere to well-defined quality standards, just as manufacturers that produce thousands or more of consumer product components must adhere to well-defined quality standards.
The second and most important change agent will be groundbreaking legal rulings by judges. When judges eventually rule that new standards must be observed, then lawyers and their clients will follow (for the most part). Until then lawyers are almost certain to stick to the status quo. In fact, when I discuss this topic with e-discovery solution vendors I always hear that their attorney customers aren’t interested in anything that the courts haven’t already accepted.
But therein lies the Catch-22: Judges are waiting for lawyers to present an issue while the lawyers are waiting for the judges to rule so they don’t have to present the issue. Because of judicial restraint, judges only rule on issues that have been unambiguously presented to them in the course of a lawsuit. But until the courts have ruled that certain technical standards are required, the lawyers won’t advise their clients to rely on those standards. And until a lawyer’s client relies on a standard in a situation which puts that standard at issue in a lawsuit, judges can’t rule that the standard is legally valid. So round and round it goes.
The log jam breakers will be 1) further amendments to the procedural rules, based no doubt on the recommendations of TREC and Sedona and the vendors and lawyers that participate in those efforts, and possibly 2) a “civil rights campaign” approach. The latter is a scenario like the one we saw in the 1950s Supreme Court school desegregation decisions, in which clients stepped forward at some personal risk to offer their personal circumstances as “test cases.” By adopting and sticking to a certain principle, even though it created a controversy, they could bring about court rulings that effectively changed the law.
In a corporate setting the latter approach may only be possible with C-level and board support because of the risk involved. Opportunities would have to be identified and pursued in which it appeared that the company could save more in the long run by relying on e-discovery innovations, such as quality measures and automation, than it risked in the short run by going to court. In addition, the innovations would have to be fundamentally sound, albeit untested in court, while controversial enough that an opposing party was unwilling to accept them without a court challenge. (Ironically, whenever a client adopts an e-discovery innovation that doesn’t lead to controversy, it results in no judge’s ruling and thus no legal precedent for other clients’ attorneys to follow.)
Both of these log jam breakers are bound to move slowly. This means years of small, incremental change before definitive technology standards result. Of course, the complexity of information management continues to increase and the cost of e-discovery is bound to continue to rise along with it. For these reasons, the more each of us does, in our individual and representative roles, to support standards that lead to increased efficiency and reduced cost, the better off we’ll all be.
As an enthusiastic user of SaaS (“Software as a Service”) applications, I’ve increased my own productivity via the cloud. But while wearing my Information Governance hat I see companies becoming sensitized to information control and risk management issues arising from SaaS use. In particular:
Company intellectual property (“IP”) frequently leaks out through employees’ SaaS use, often when subject matter experts within a company naively collaborate with “colleagues” outside the company; and
Company information may be preserved indefinitely rather than being deleted at the end of its useful life, thus remaining available for eDiscovery when it shouldn’t be.
Pi Corp’s Smart Desktop project is described by EMC’s CTO Jeff Nick in this video taken at EMC World last year. In a nutshell, Smart Desktop is meant to:
provide a central portal for all of an individual’s information collected from all of the information sources they use;
index and classify that information so it can be used more productively, for example, when a user begins performing a particular task the user will be prompted with a “view” (dashboard) of all of the information the system expects they will want, based on the user’s past performance and the system’s predictive intelligence algorithms;
“untether” information so that it is available to the user from any of the user’s devices, including mobile devices, and interchangeable across different sources; and
enable secure sharing such that people can share just the information they wish to share with those they want to share it with.
Once I’ve had a chance to evaluate Smart Desktop I’ll take a harder look at its Information Governance implications. Problems could arise for employers — albeit through no fault of Decho — if Smart Desktop (or Mozy, or another file sharing service, for that matter) is used by employees to share their employers’ IP with people outside of the company, or people within their company who have not been properly trained and cautioned about maintaining IP security. Similarly, if Smart Desktop (or Mozy, or another SaaS) enables employees to preserve company documents beyond their deletion dates, or to access company documents after they are no longer employees, this could prove difficult in eDiscovery or IP secrecy scenarios, where such information could become a costly surprise late in the game.
But for now I’ll presume that because Decho’s parent EMC has a strong Information Governance focus, Decho will ultimately provide not only the access controls that they currently envision, which will enable secure sharing across devices and users, but also group administration features that make it possible for companies to retain control over IP and information lifecycle management. In particular, I predict Decho will provide dynamic global indexing of information which enters any user account within a company’s user group, thereby making company information easy to find, place holds on, and collect for eDiscovery. I also predict Decho will offer document lifecycle management functionality, including automatically enforced retention and deletion policies.
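To make the idea of automatically enforced retention and deletion policies concrete, here is a minimal Python sketch. It is purely illustrative and says nothing about how Decho or any real product works: the `Document` fields, the per-document `retention_days` value, and the `on_legal_hold` flag are all hypothetical names I have made up for the example. The one rule it encodes is that a document becomes deletable once its retention period has passed, unless a legal hold applies.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    doc_id: str
    created: date
    retention_days: int          # hypothetical per-document retention period
    on_legal_hold: bool = False  # hypothetical flag set when a hold is placed


def is_due_for_deletion(doc: Document, today: date) -> bool:
    """True once the retention period has passed and no legal hold applies."""
    if doc.on_legal_hold:
        return False
    return today > doc.created + timedelta(days=doc.retention_days)


def enforce_retention(docs: list[Document], today: date) -> tuple[list[Document], list[Document]]:
    """Split documents into (kept, deleted) according to the policy above."""
    kept = [d for d in docs if not is_due_for_deletion(d, today)]
    deleted = [d for d in docs if is_due_for_deletion(d, today)]
    return kept, deleted
```

Under this sketch, a document created on January 1, 2009 with a 365-day retention period would be deleted when the policy runs at the end of 2010, while an otherwise identical document on legal hold would be kept.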
And while I’m making a Decho wish list, two more items:
I strongly recommend reading Ron’s post for the benefit of his insights, whether or not you are already familiar with TREC Legal Track. I’d also like to offer my own observations about TREC Legal Track’s finding of low consistency between document classification decisions made by subject matter experts, who are spoken of as “gold standard” reviewers, and ordinary legal document reviewers. (In TREC Legal Track’s study, ordinary reviewers were 2nd and 3rd year law students. In real life the subject matter expert role is played by in-house or outside counsel, while much of the actual review work is performed by contract or outsource attorneys.)
Generally speaking, quality control processes involve benchmarking against some standard. Mechanical processes can be meaningfully benchmarked by physically sampling output (this is the essence of Six Sigma, in particular). For example, as machine parts come off an assembly line, samples can be selected and measured and the variance between their actual size and target size monitored not only to detect defects but to flag the processes responsible for defects. Human processes can also be benchmarked in a variety of ways. (This is in part the province of ITIL, the “Information Technology Infrastructure Library,” and the basis for the idea of “service level agreements”.) For example, those responsible for a customer service center may track the number of issues handled per hour, the type of issues handled, the number of resolutions or escalations per issue, revenue gained or lost per issue, etc.
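To make the assembly-line example concrete, here is a minimal sketch of that kind of benchmark. The function name, the sample measurements, the 10 mm target and the 0.1 mm tolerance are all invented for illustration; a real Six Sigma program would use control charts and far larger samples.

```python
from statistics import mean, stdev

def benchmark_parts(measured_mm, target_mm, tolerance_mm):
    """Summarize how far sampled parts deviate from their target size."""
    deviations = [m - target_mm for m in measured_mm]
    # A part counts as a defect when it misses the target by more than the tolerance.
    defects = [m for m, d in zip(measured_mm, deviations) if abs(d) > tolerance_mm]
    return {
        "mean_deviation": mean(deviations),   # a consistent drift points to the process
        "std_deviation": stdev(deviations),   # spread of the sampled output
        "defect_rate": len(defects) / len(measured_mm),
    }

# Hypothetical measurements of eight parts pulled from the line.
sample = [10.02, 9.98, 10.01, 10.15, 9.99, 10.00, 9.97, 10.03]
report = benchmark_parts(sample, target_mm=10.0, tolerance_mm=0.1)
print(report)
```

The point of tracking the mean and spread alongside the defect rate is the one made above: the numbers flag not just bad parts but the process that is producing them.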
Unfortunately, “responsiveness” and “privilege” are not only somewhat subjective in document review; the standards for each also vary from case to case. For this reason standards need to be developed “on the fly” for each case, and these standards will by necessity be arbitrary (aka subjective) to some degree even if consistently applied. The good news is that the latest generation of document clustering software incorporates tools for developing consistent document review standards on the fly. Through an iterative feedback loop, the humans educate the machines to look for documents with certain characteristics, while the machines force the humans to refine their conception of responsiveness and privilege to a degree that the machine can reliably model it. After enough iterations have passed and the machine has reached some measurable standard of consistency, the humans can step back and let the machine do the rest of the review work. The machine does it more consistently than human reviewers could themselves, and at a much lower cost.
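What might a “measurable standard of consistency” look like? One common statistic for agreement between two sets of labels is Cohen's kappa, which discounts the agreement you would expect by chance alone. The sketch below is my own illustration, not a description of how any vendor's product measures consistency, and the two label lists are made up.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers' labels for the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of documents on which the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each reviewer labeled at random at their own base rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical calls on ten documents: R = responsive, N = non-responsive.
human = ["R", "R", "N", "N", "R", "N", "N", "N", "R", "N"]
machine = ["R", "R", "N", "N", "N", "N", "N", "N", "R", "R"]
print(round(cohens_kappa(human, machine), 3))
```

A review team could agree in advance on a kappa threshold for a given case and keep iterating until the machine's calls clear it on a fresh sample.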
With document review the very idea of defining a “gold standard” for classification is less useful than it sounds. For instance, even if a panel of leading legal scholars could be formed for each eDiscovery matter, the mere fact that someone legitimately may be called a leading scholar doesn’t mean that their views will be consistent with anyone else’s — just well reasoned. But a “gold standard” is not what’s important here. What’s important is that in each case the attorneys responsible for responding to a document request do everything they can to carefully define and consistently enforce reasonable document review standards. This is what the current crop of document clustering applications are intended to do. That is the current model, anyway. I don’t pretend to be able to name the vendors who can or cannot deliver on this promise, although I think this will be the number one question in eDiscovery technology before long.
I recently spoke with Thao Tiedt, a labor and employment partner at Ryan Swanson & Cleveland, PLLC, a mid-sized full service Seattle law firm. (Full disclosure: I’ve benefited from her incisive advice a number of times when I was wearing the hat of corporate counsel.) Our conversation focused on eDiscovery from the perspective of consequences when individual employees use company computers in ways not approved by their employer.
Bruce: Thao, I first asked you this question some years ago, but I’ll ask again so you can catch me up and share this information with a wider audience. When employees of a company use a company computer, even for personal purposes, who does the information belong to after it winds up on the company’s computer?
Thao: In other words, do employees have an expectation of privacy? Yes and no. In the workplace the employer has the right to take that expectation away through a variety of policies and practices. This includes email and voice mail. With telephone conversations, an employer can’t listen without permission of both the employees and others on the line. States’ laws vary; some states require that at least one person on the conversation has to give you permission to record it. But permission can be obtained through fair warning – you don’t have to get explicit permission, it can be tacit, as when a message is played announcing that a conversation may be recorded – when someone hears that and doesn’t hang up permission is implicit. Employees may be given a policy manual or an explicit waiver to sign that states that privacy is waived. If an employee refuses to sign, they can’t stay employed.
Bruce: What happens when employees try to remove information from a company computer?
Thao: People think they’re smart and they can make information go away. Here’s a good example: one of my clients is a company that received a demand for arbitration over alleged sexual harassment. So I had the company put a hold on all of the computers involved, including both the employee’s and the accused manager’s – in their cases by physically picking the computers up. Upon technical evaluation it appeared that the claimant had been wiping hers. But she failed to realize that the company had backup tapes for disaster recovery purposes. Also, this particular company has multiple branches so it has central email servers. And after interviewing co-workers, a hint of impropriety appeared. I asked one of the claimant’s co-workers “anything else we should know?” The co-worker showed me a cellphone picture sent by the claimant, showing the claimant nude from the waist up, with the caption “does this change your mind?” Apparently she had wanted the co-worker to date her and he had refused. When we looked at the company email accounts we found lots of these pictures, which we could tell from the background were taken in the company bathroom. It turns out she had been spending a lot of time on dating sites while at work and sending multiple men the pictures.
Later we learned that someone had asked her: don’t you think you should be careful? She had answered: no, someone in IT told me how to double-delete computer files.
After all of this information came out in the open her cause of action went away. Given her behavior it was clear that if her accused manager had in fact asked her to expose herself, as she claimed, she would have gladly done so.
This just goes to show: no one should think they can make digital information go away.
There are a huge number of cases where the smoking guns are emails. Somehow people don’t think of emails as documents, they think of them as chit-chat. Far from it. For example, when training attorneys in our firm we teach them that emails are no different from formal letters sent to clients and should be handled with the same care.
Bruce: What about accessing web sites using work computers?
Thao: Of course web use can get traced back to inappropriate sites, like pornography servers for example. I actually had to go home to view a site that had been accessed by an employee on one occasion, because our firm’s own web filters are set so high I couldn’t do it from work. For a while I couldn’t order my own underwear online from work.
Anyway, it turned out this person was running a business on work time – the business of being webmaster for a porn site.
However, as a general rule an employee can conduct their own business on their lunch hour, as long as that isn’t a conflict with their employer in some fashion.
Bruce: I’ve read about studies that suggest employee productivity actually goes up when they can do a certain amount of personal work – scheduling doctors appointments and what not, from their work computer during work hours – because that flexibility leads to less tardiness and absenteeism and so forth. So how does an employer who believes this is true handle personal use of work computers?
Thao: Here’s what we say in our own [Ryan Swanson & Cleveland] employee manual: employees may make limited, incidental, responsible personal use of company computers.
Having said that, an employer can still intercept and log employee use of company computers. In the harassment case I mentioned, for example, we examined how both parties had used their computers. The accused manager was very uncomfortable with having attorneys review his work materials, but we needed to see his responses to her emails to make the company’s case. What we found didn’t support her case, but did lead us to caution him to stop unrelated inappropriate use of his work computer.
Bruce: What about when employees use their personal email account, like Gmail, from a work computer?
Thao: Does accessing email on a company computer waive privacy protection? Yes. There is no expectation of privacy for personal email stored on a company computer.
Bruce: How about a password for a personal email account, once it has been typed into a company computer?
Thao: Yes, if it’s on the work computer then it’s information that belongs to the employer.
Bruce: But can the employer use that information? What if they use the password to access an employee’s personal email account, like an AOL or Gmail account?
Thao: No. The employer can possess the password if it’s on the company’s computer, but they can’t use it to log into the personal email account.
Bruce: What about Google Gears, which makes local copies of personal email and Google documents on the computer being used, which might be a work computer?
Thao: Then the company has a right to see that information. Anything on the company computer is the company’s – if the company policy reads that way.
California sometimes has different views concerning privacy – they have a state constitutional right to privacy. But as long as companies have been up front with employees by notifying them that if information goes through a work computer, that information can be accessed by the company, then employer access to that information is allowed in California as well.
Bruce: When a lawsuit is threatened you send out a scary letter to employees telling them to avoid destroying evidence?
Thao: We send out a “scary letter” right away [to leave no doubt what is expected of people].
It can be the case that having electronically stored information collected by an outside vendor creates insulation against tampering and a better evidentiary chain of custody, even with intellectual property secrecy issues. Outside vendors can make good selections about what fits an eDiscovery inquiry.
What you don’t want is for opposing counsel to see something secret [and not responsive to a discovery request] that may be useful to their client in some way. If that happens it creates a question for that attorney about what their duty is to their client – to reveal or not to reveal that information – and then there’s the fact that you can’t get it out of your head once you’ve seen it. It will absolutely color your strategy down the road.
Also, concerning attorney-client privilege: privilege is waived whenever a privileged email is copied to anyone outside of “speaking agents of the company.” This happens all the time, even when recipients of privileged emails are warned. Forwarding emails is a hard habit to break.
Thao: Here’s an example. One of my clients is a regional auto dealer association. A common problem they have is that new vehicle salespersons typically view the customers they sell to as “my customers” who they can “keep” after they move to a different dealership. Wrong – they are the dealer’s customers, not the salesperson’s. In addition, customer information is considered private under federal law. If someone captures that information but not because of a business transaction, for some other purpose, it violates Federal privacy law.
Bruce: What remedies are available to an employer in this situation? What can an auto dealer do if a new vehicle salesperson takes a customer list with them?
Thao: The dealer can file for an injunction telling a dealer not to use information that came from other dealers. When dealers do receive such information it won’t be profitable because an injunction is very expensive for them to defend as well as scary and distracting.
And if the company whose information was taken can prove actual damages, then they can receive money damages from the new employer for tortious interference with private information. For example, I had a case where a person thought they were going to be terminated, so they copied specifications for a technical piece of equipment and emailed them to themselves. Then they changed information in the company computers regarding that equipment, which was very expensive for that company to correct. A new employer could be held liable for damages by accepting that information from the former employee.
Bruce: What about non-competition agreements – do those work?
Thao: A non-compete protects employer information that’s already in an employee’s head. It’s limited but it works. For example, it can say a vehicle sales employee can’t work in a dealership selling the same type of car in the same county, but usually can’t keep someone from completely working in the car business, or for any company within that county. It works as long as you don’t prevent the employee from working anywhere in the same business.
Thao: Yes, most people don’t understand that computer files must be preserved whenever there is even a smell of dispute in the air. Might the court award money sanctions? Possibly. Or, in some extremely serious situations the judge can order that the offending party can’t defend itself; or that a party can’t pursue its lawsuit – case dismissed. It’s a form of inconsistent pleading – a claimant can’t resist providing information and pursue a remedy simultaneously.
Bruce: From what you have said today it sounds like data backups of one sort or another are a critical element for eDiscovery, at least in your practice.
Thao: Disaster recovery backups just make sense as a litigation backup data source when dealing with employees. But you need historical backups that are locked down so that they can’t be erased for a period of time during which they might be needed.
Archiving is another thing you can do. For example, the Puget Sound Automobile Dealers Association maintains an electronic archive of participating dealers’ employee policy manuals over the years which can be used as evidence in an employee dispute.
Bruce: Which brings us to a final thought. There’s a lot of company data — confidential customer data — in the hands of non-attorneys who don’t have the same paranoia about casually exposing it that attorneys like you and I do….
Thao: Yes, you have to have confidence in IT people that they won’t be trolling confidential information, that they will keep it confidential.
From a business perspective, information should be handled like property. Like assets or supplies, information needs management.
Companies set policies to govern use, storage, and disposal of assets and office supplies. Companies also need to make and enforce rules governing electronically stored information, including how it is organized (who has access and how), stored (where and at what cost), retained (including backups and archives), and destroyed (deletion and non-deletion both have significant legal and cost consequences). These policies must balance the business, legal and technical needs of the company. Without them, a company opens itself up to losses from missed opportunities, employee theft, lawsuits, and numerous other risks.
Some information is analogous to company ASSETS. For example, let’s suppose a certain sales proposal took someone a week to write and required approval and edits from four other people plus 6 hours of graphics production time. An accountant isn’t going to list that proposal on the company books. But it is an asset. It can be edited and resubmitted to other potential customers in a fraction of the time it took to create the original. Like the machinery, furniture, or hand tools used to operate a business, company money was spent obtaining this information and it will retain some value for some time. It should be managed like an asset.
Some information is analogous to OFFICE SUPPLIES. For example, a manager spends a number of hours customizing a laptop with email account settings, browser bookmarks and passwords, ribbon and plugin preferences, nested document folders, security settings, etc. That customization information is crucial for the manager’s productivity in much the same way as having pens in the drawer, paper in the printer, staples in the stapler, and water in the water-cooler can be important for productivity. Productivity will suffer if that information is lost. That information needs to be managed just as much as office supplies need to be managed.
From a business perspective, when company information is lost or damaged, or when users are under or over supplied, it is no different from mismanagement of company assets and office supplies.
Setting an information policy means:
identifying information use and control needs;
making choices and tradeoffs about how to meet those needs; and
taking responsibility for results and an ongoing process (setting goals / taking action / measuring progress / adjusting).
Information governance policy is an on-going process for managing valuable company information. All of the stakeholders – in particular, business units, IT, and Legal – must collaborate in order to draw a bullseye on company information management needs. The right people in the organization must be charged with responsibility for getting results or for making changes needed to get results.
Without a doubt, it takes time and money, and requires collaboration, to develop a “policy.” But we’re all accustomed to this type of preparation already. Let’s look at simple, familiar professional standards for just a moment:
Software developers test software on actual users and correct bugs and (hopefully) mistaken assumptions before releasing it. Avoiding these steps will undoubtedly lead to loss and possibly bankruptcy.
Attorneys meet with clients before going to trial to prepare both the client and the attorneys. If they don’t they risk losing their clients millions, or getting them locked up.
Advance preparation is as essential in information management as it is in software development and trial practice. Simply ignoring the issue, or dumping it on one person or a single department (like IT or Legal) can be very costly. Avoiding the planning component of information management is like putting in only 80% of the time and effort needed for the company to succeed. Avoiding 20% of the time and effort doesn’t yield a “savings” when the outcome is failure, as when an employee steals documents, essential information is lost when a building with no computer back-ups burns down, or old documents which would have been deleted under a proper information governance policy turn up in a lawsuit and cost millions.
Information policy does NOT flow from any of the following all-too-common realities:
The first meeting between the Legal Department and the new eDiscovery vendor is also the first meeting between Legal and IT (true story);
Ever since a certain person from the General Counsel’s office was made the head of corporate records management, no one in IT will talk to that person (true story);
Information management technology alone, without a company-specific understanding of the problems that the technology is meant to solve, is not a recipe for success. A recent article by Carol Sliwa, published by SearchStorage.com (April 22, 2009), offers a detailed look at issues surrounding efforts to reduce storage costs by assessing how information is being used and moving it to the least expensive storage tier possible.
The article has some powerful suggestions on developing information policy. First, Karthik Kannan, vice president of marketing and business development at Kazeon Systems Inc.:
“What we discovered over time is that customers need to be able to take some action on the data, not just find it…. Nobody wants to do data classification just for the sake of it. It has to be coupled with a strong business reason.”
“In order to really realize and get the benefit of data and storage classification, you have to start with a business process…. And it has to start from conversations with the business units and understanding the needs and requirements of the business. Only at the end, once you actually have everything in place, should you be looking at technology because then you’ll have a better set of requirements for that technology.”
It takes time, money and cooperation between departments that may have never worked together before to develop a working information governance policy. But that is not a reason to skip — or skimp on — the process. Companies need to protect their assets and productivity, and information governance has become an essential area for doing just that.
I recently had the pleasure of speaking with Nicholas Croce, President of Inference Data, a provider of innovative analytics and review software for eDiscovery, following the company’s recent webinar, De-Mystifying Analytics. During our conversation I discovered that Nick is double-qualified as a legal technology visionary. He not only founded Inference, but has been involved with legal technologies for more than 12 years. Particularly focused on the intersection of technology and the law, Nick was directly involved in setting the standards for technology in the courtroom through working personally with the Federal Judicial Center and the Administrative Office of the US Courts.
I asked to speak with Nick because I wanted to pin him down on what I imagined I heard him say (between the words he actually spoke) during the live webinar he presented in mid-March. The hour-long interview and conversation ranged in topic, but was very specific in terms of where Nick sees the eDiscovery market going.
Sure enough, during our conversation Nick confirmed and further explained that he and his team, which includes CEO Lou Andreozzi, the former LexisNexis NA (North American Legal Markets) Chief Executive Officer, have designed Inference with not one, but two models of advanced eDiscovery analytics and legal review in mind.
In a nutshell, Inference is designed not only to deliver the current model of eDiscovery software analytics, which I have dubbed “Software Queued Review,” but the next generation analytics model as well, which I am currently calling “Statistically Validated Automated Review” (Nick calls it “auto-coding”).
Bruce: I have a few specific questions to ask, but in general what I’d like to cover is:
1) where does Inference fit within the eDiscovery ecosystem,
2) how you think statistically validated discovery will ultimately be used, and
3) how you think the left side of the EDRM diagram (which is where document identification, collection, and preservation are situated) is going to evolve?
Nick: To first give some perspective on the genesis of Inference, it’s important to understand the environment in which it was developed. Prior to founding Inference I was President of DOAR Litigation Consulting. When I started at DOAR in 1997, the company was really more of a hardware company than anything else. I was privileged to be involved in the conversion of courtroom technology from wooden benches to the efficient digital displays of evidence we see today. Within a few years we became the predominant provider of courtroom technology, and it was amazing to see the legal system change and directly benefit from the introduction of technology. As people saw the dramatic benefits, and started saying “how do we use it?” we created a consulting arm around eDiscovery which provided the insight to see that this same type of evolution was needed within the discovery process.
This began around 2004-2005 when we started to see an avalanche of ESI (“Electronically Stored Information”) coming, and George Socha became a much needed voice in the field of eDiscovery. As a businessman I was reading about what was happening, and asking questions, and it seemed black and white to me – it had become impossible to review everything because of the tremendous volume of ESI with existing technology. As a result I started developing new technology for it, to not only manage the discovery of large data collections, but to improve and bring a new level of sophistication to the entire legal discovery process.
Inference was developed to help clients intelligently mine and review data, organize case workflow and strategy, and streamline and accelerate review. It’s the total process. But, today I still have to fight “the short term fix mentality” – lawyers who just care about “how do I get through this stuff faster”, which is the approach of some other providers, and which also relates to the transition I see in the EDRM model – I want to see the whole thing change.
Review is the highest dollar amount and the biggest pain; 70% of a corporation’s legal costs are within eDiscovery. People want to, and need to, speed up review. However, we also need to add intelligence back into the process.
Bruce: What differentiates Inference, where does it fit in?
Nick: I, and Inference, went further than just accelerating linear review and said: it has to be dynamic, not just coding documents as responsive / non-responsive. I know this is going to sound cheesy I guess, but – you have to put “discovery” back into Discovery. You need to be able to quickly find documents during a deposition when a deponent says something like “I never saw a document from Larry about our financial statements”, and not just search for “responsive: yes/no”, “privileged: yes/no.”
Inference was, and is, designed to be dynamic – providing suggestions to reviewers, opportunities to see relationships between documents and document sets not previously perceived, helping to guide attorneys – intuitively. Inference follows standard, accepted methodologies, including Boolean keyword search, field and parametric search, and incorporates all of the tools required for review – redaction, subjective coding, production, etc.
In addition to that overriding principle, we wanted the ability to get data in from anyone, anywhere and at any time. Regulators are requiring incredibly aggressive production timelines; serial litigants re-use the same data set over and over; CIOs are trying to get control over searching data more effectively, including video and audio. Inference is designed to take ownership of data once it leaves the corporation, whether it is structured, semi-structured or unstructured data.
Inside the firewall, the steps on the left side of the EDRM model are being combined. Autonomy, EMC, Clearwell, StoredIQ – the crawling technologies – these companies are within inches of extracting metadata during the crawling process, and may be there already. This is where Inference comes in since we can ingest this data directly. I call it the disintermediation of processing because at that point there are no more additional costs for processing.
In the past someone would use EnCase for preservation, then Applied Discovery for processing (using date ranges and Boolean search terms), and at some cost per custodian, and per drive, you’d then pay for processing. It used to be over $2,500 per gig, now it’s more like $600 to $1500 per gig, depending on multi-language use and such.
But once corporations automate the process with crawling and indexing solutions, all of the information goes right into Inference without the intermediary steps, which puts intelligence back in the process. You can ask the system to guide you whenever there’s a particular case, or an issue. If I know the issue is a conversation between Jeff and Michele during a certain date range, I can prime the system with that information, start finding stuff, and start looking at settlement of the dispute. But without automation it can take months to do, at much higher costs.
Inference also offers quality control aspects not previously available: after, say, one month, you can use the software to check review quality, find rogue reviewers, and fix the process. You can also ultimately do auto-coding.
Bruce: I think this is a good opening for segue to the next question: how will analytics ultimately be used in eDiscovery?
Nick: The two most basic components of review are “responsive” and “privileged.” I learned from the public testimony of Verizon’s director of eDiscovery, Patrick Oot, some very strong statistics from a major action they were involved in. The first level document review expense was astounding even before the issues were identified. The total cost of responsive and privileged review was something like $13.6 million.
The truth is that companies only do so many things. Pharma companies aren’t generally talking about real-estate transactions or baseball contracts.
Which brings us to auto-coding… sometimes I try to avoid calling it “auto-coding” versus “computer aided” or “computer recommended” coding. When someone says “the computer did it” attorneys tend to shut down, but if someone says “the computer recommended it” then they pay attention.
Basically auto-coding is applying issue tags to the population based on a sampling of documents. The way we do it is very accurate because it is iterative. It uses statistically sound sampling, recurrent models. It uses the same technology as concept clustering, but you cluster a much smaller percentage. Let’s say you create 10 clusters, tag those, then have the computer tag other documents consistently with the same concepts. Essentially, the computer makes recommendations which are then confirmed by an attorney, and the process is repeated until the necessary accuracy level has been achieved. This enables you to only look at a small percentage of the total document population.
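For readers who want to see what “confirmed by an attorney, then repeated until the necessary accuracy level has been achieved” could look like in practice, here is a rough sketch of the checking step for a single round. It is my own illustration, not Inference's method: the function names, the 95% target, the sample size of 200 and the simulated attorney are all assumptions, and the Wilson score interval is just one standard way to put a conservative floor under a sampled accuracy rate.

```python
import math
import random

def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

def round_is_accurate_enough(machine_tags, attorney_check, sample_size, target, seed=0):
    """Sample machine-tagged documents, have an attorney confirm them, and
    report whether the conservative accuracy estimate meets the target."""
    rng = random.Random(seed)
    sample_ids = rng.sample(sorted(machine_tags), sample_size)
    confirmed = sum(attorney_check(doc_id) == machine_tags[doc_id] for doc_id in sample_ids)
    lower = wilson_lower_bound(confirmed, sample_size)
    return lower >= target, lower

# Hypothetical population of 1,000 machine-tagged documents.
machine_tags = {i: "responsive" if i % 3 == 0 else "non-responsive" for i in range(1000)}

def simulated_attorney(doc_id):
    """Stand-in for a human reviewer who disagrees on every 20th document."""
    tag = machine_tags[doc_id]
    if doc_id % 20 == 0:
        return "non-responsive" if tag == "responsive" else "responsive"
    return tag

done, floor = round_is_accurate_enough(machine_tags, simulated_attorney, 200, target=0.95)
print("stop iterating" if done else "retrain and sample again", round(floor, 3))
```

If the floor falls short of the target, the attorney's corrections go back into the system and another round is sampled; either way, only the sampled documents need human eyes.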
Bruce: I spoke with one of the statistical sampling gurus at Navigant Consulting last month, who suggests that software validated by statistical sampling can be more accurate than human reviewers, with fewer errors, for analyzing large quantities of documents.
Nick: It makes sense. Document review is very labor intensive and redundant. Think about the type of documents you’re tagging for issues – it doesn’t even need to be conscious: it is an extremely rote activity on many levels which just lends itself to human error.
Bruce: So let’s talk about what needs to happen before auto-coding becomes accepted, and becomes the rule rather than the exception. In your webinar presentation you danced around this a bit, saying, in effect, that we’re waiting for the right alignment of law firms, cases, and a judge’s decision. In my experience as an attorney, including some background in civil rights cases, the way to go about this is by deliberately seeking out best-case-scenario disputes that will become “test cases.” A party who has done its homework stands up and insists on using statistically validated auto-coding in an influential court (here we probably want the DC Circuit, the Second Circuit, or the Ninth Circuit, I suppose). When those disputes result in a ruling on statistical validity, the law will change and everything else will follow. Do you know of any companies in a position to do this, to set up test cases, and have you discussed it with anyone?
Nick: Test cases: who is going to commit to this – the general counsel? Who do they have to convince? Their outside counsel, who, ultimately, has to be comfortable with the potential outcome. But lawyers are trained to mitigate risk, and for now they see auto-coding or statistical sampling as a risk. I am working with a couple of counsel with scientific and/or mathematical backgrounds who “get” Bayesian methods and the benefits of using them. Once the precedents are set, and establish that the use of statistical analysis is reasonable, it will be a risk not to use these technologies. As with legal research, online research tools were initially considered a risk. Now, it can be considered malpractice not to use them.
It can be frustrating for technologists to wait, but that’s how it is. Sometimes when we are following up with new installs of Inference we find that 6 weeks later they’ve gone back to using simple search instead of the advanced analytics tools. But even for those people, after using the advanced features for a few months they finally discover they can no longer live without it.
Bruce: Would you care to offer a prediction as to when these precedents will be set?
Nick: I really believe it will happen; there’s no ambiguity. I just don’t know if it’s 6 months or a year. But general counsel are taking a more active role, because of the cost of litigation and because of the economy, and are looking at expenses more closely. At some point there will be that GC and outside counsel combination that will make it happen.
Bruce: After hearing my statistician friend from Navigant deliver a presentation on statistical sampling at LegalTech last month I found myself wondering why parties requesting documents wouldn’t want to insist that statistically validated coding be used by parties producing documents for the simple reason that this improves accuracy. What do you think?
Nick: Requesting parties are never going to say “I trust you.”
Bruce: But, as they do now, the parties will still have to be able to discuss, and will be expected to reach agreement about, the search methods being used, right?
Nick: You can agree to the rules, but the producing party can choose a strategy that will be used to manage their own workflow – for example today they can do it linearly, offshore, or using analytics. The requesting party will leave the burden on the producing party.
Bruce: What if the parties are concerned only with jointly defining responsiveness, in order to get a better-culled set of documents? Wouldn’t that help both sides?
Nick: That would be down the road… at that point my vision gets very cloudy. Maybe opposing counsel gets access to concept searches, and they can negotiate over the concepts to be produced.
Bruce: There are many approaches to eDiscovery analytics. Will there have to be separate precedents set for each mathematical method used by analytics vendors, or even for each vendor-provided analytical solution?
Nick: I’d love to have Inference be the first case. But I don’t know how important the specific algorithm or methodology is going to be – that is a judicial issue. Right now we’re waiting for the perfect judge and the perfect case – so I’ll hope it’s Inference, rather than something “generic” as to which analytics are used. I hope there’s a vendor shakeout – for example, ontology-based analytics systems demo nicely, but in them “raptor” renders “birds,” which is non-responsive, while “Raptor” is a critical responsive term in the Enron case.
Bruce: Perhaps vendors and other major stakeholders in the use of analytics in eDiscovery, for example, the National Archives, should be tracking ongoing discovery disputes and be prepared to file amicus briefs when possible to help support the development of good precedents.
This blog post is the first of two on the topic of advanced eDiscovery analytics models. My goal is to make the point that lawyers don’t trust or use analytics to the degree that they should, according to scientifically sound conventions commonly employed by other professions, and to speculate about how this is going to change.
In this first post I’ll explain how we arrived where we are today by describing the progression of analytics across three generations of discovery technology.
The first generation, which I call “The Photocopier Era,” relies on labor-intensive, pre-analytics processes. Some lawyers are still stuck in this era.
The second generation is the current reigning model of analytics and review. I call it “software queued review.” Software queued review intelligently sorts and displays documents to enable attorneys to perform document review more efficiently. At the same time, software queued review allows – or should I say, requires? – attorneys to do more manual labor than is required to ensure review quality or to ensure that attorneys take personal responsibility for the discovery process.
The third, upcoming generation of analytics is only beginning to provoke widespread discussion in the legal community. I’ll call it “statistically validated automated review.” In this model, software performs the majority of document review work, leaving attorneys to do the minimum amount of review work. In fact, certain advanced analytics and workflow software solutions can already be calibrated, by attorney reviewers, to be more accurate than human reviewers typically are when reviewing vast quantities of documents.
Because it will radically reduce the amount of hands-on review, the third generation model is currently perceived by many lawyers as a risky break from legal tradition. But when this model is deployed outside of the legal profession it is not considered a giant step, technologically or conceptually. It is merely an application of scientifically grounded business processes.
In subsequent blog posts, including the second post in this series, I will look at what is being done to overcome the legal profession’s reluctance to adopt this more accurate, less expensive eDiscovery model.
The Pre-analytics Generation: Back to the Photocopier Era
Please return with me now to olden times of not-so-long-ago, the days before eDiscovery software. (Although even today, for smaller cases and cases that somehow don’t involve electronically stored information, the Photocopier Era is alive and well.)
In the beginning there were paper documents, usually stored within folders, file boxes, and file cabinets. Besides paper, staples, clips, folders, and boxes, photocopiers were the key document handling technology, with ever improving speed, sheet feeding, and collation options.
Gathering documents: When a lawsuit reached the discovery stage, clients following the instructions of their attorneys physically gathered their papers together. Photocopies were made. Some degree of effort was (usually) made to preserve “metadata” which in this era meant identifying where the pieces of paper had been stored, and how they had been labeled while stored.
Assessing documents: In this era every “document” was a physical sheet of paper, or multiple sheets clipped together in some manner. Each page was individually read by legal personnel (attorneys or paralegals supervised by attorneys) and sorted for responsiveness and privilege. Responsive, non-privileged documents were compiled into a complete set and then, individual page after individual page, each was numbered (more like impaled) with a hand-held, mechanical, auto-incrementing ink stamp (I can hear the “ka-chunk” of the Bates Stamp now… ah, those were the days).
Privileged documents were set to one side, and summarized in a typed list called a privilege log. Some documents containing privileged information were “redacted” using black markers (there was an art to doing this in a way so that the words couldn’t be read anyway – an art which even the FBI on one occasion in my experience failed to master).
Finally, the completed document set was photocopied, boxed, and delivered to opposing counsel, who in turn reviewed each sheet of paper, page by page.
The Present Generation of Analytics: Software Queued Review
Fast forward to today, the era of eDiscovery and software-queued review. In the present generation software is used to streamline, and thus reduce the cost of, reviewing documents for responsiveness and privilege.
Gathering documents: Nowadays, still relying on instructions from their attorneys, clients designate likely sources of responsive documents from a variety of electronic sources, including email, databases, document repositories, etc. Other media such as printed documents and audio recordings may also be designated when indicated.
After appropriate conversions are made (for example, laptop hard drives may need to be transferred, printed documents may need to be OCR scanned, audio recordings may need to be transcribed, adapters for certain types of data sources may need to be bought or built) all designated sources are ingested into a system which indexes the data, including all metadata, for review.
Some organizations already possess aggressive records management / email management solutions which provide the equivalent of real time ingestion and indexing of significant portions of their documents. Such systems are particularly valuable in a legal context because they enable more meaningful early case assessment (sometimes called “early data assessment”).
Assessing documents: In the current era attorneys can use tools such as Inference which use a variety of analytical methods and workflow schemas to streamline and thus speed up review. (Another such tool is Clustify, which I described in some detail in a previous blog entry.) Such advanced tools typically combine document analytics and summarization with document clustering, tagging, and support for human reviewer workflows. In other words, tools like Inference start with a jumble of all of the documents gathered from a client, documents which most likely contain a broad spectrum of pertinent and random, off-topic information, and sort them into neat, easy to handle, virtual piles of documents arranged by topic. The beauty of such systems is that all of the virtual piles can be displayed — and the documents within them browsed and marked — from one screen, and any number of people in any number of geographic locations can share the same documents organized the same way. Software can also help the people managing the discovery process to assign groups of documents to particular review attorneys, and help them track reviewer progress and accuracy in marking documents as responsive or not, and privileged or not.
The key benefits of this generation of analytics are speed and cost savings. Similar documents, including documents that contain similar ideas as well as exact duplicates and partial duplicates of documents, can be quickly identified and grouped together. When all of the documents in a group of similar documents are assigned to the same person or persons, those reviewers can work more quickly because they know more of what to expect as they see each new document. Studies have shown that review can be performed perhaps 70-80% faster, and thus at a fraction of the cost, using these mechanisms.
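To make the grouping idea concrete, here is a minimal sketch in Python of one common way to find exact and partial duplicates: compare documents by the overlap (Jaccard similarity) of their three-word phrases. This is an illustration only, not a description of how Inference or Clustify work internally. The function names, the sample documents, and the 0.5 threshold are all assumptions made for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word phrases ("shingles") in a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Share of phrases two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_similar(docs, threshold=0.5):
    """Greedily place each document in the first group whose first
    member it resembles closely enough; otherwise start a new group."""
    groups = []  # list of (shingles of first member, [document ids])
    for doc_id, text in docs.items():
        doc_shingles = shingles(text)
        for representative, members in groups:
            if jaccard(representative, doc_shingles) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((doc_shingles, [doc_id]))
    return [members for _, members in groups]


# Hypothetical documents, invented for this example.
EXAMPLE_DOCS = {
    "a": "please review the attached hedge agreement before friday",
    "b": "please review the attached hedge agreement before monday",
    "c": "lunch menu for the week is posted",
    "d": "please review the attached hedge agreement before friday",
}

if __name__ == "__main__":
    print(group_similar(EXAMPLE_DOCS))
```

Each resulting group could then be routed to a single reviewer. A production system would presumably also compare concepts rather than just wording, which this sketch does not attempt.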
Once review is complete, documents can be automatically prepared for transfer to opposing counsel, and privilege logs can be automatically generated. Opposing counsel can be sent electronic copies of responsive, non-privileged documents, which they in turn can review using analytical tools. (Inference is among the tools that are sometimes used by attorneys receiving such document sets, Nick tells me.)
The Coming Generation of Analytics: Statistically Validated Automated Review
The next software analytics model will be a giant leap forward when it is adopted. In this model software analytics intelligence is calibrated by human intelligence to automatically and definitively categorize the majority of documents collected as responsive or not, and as privileged or not, without document-by-document review by humans. In fact, some of the analytical engines already in existence, such as Inference, can be “trained” through a relatively brief iterative process to be more accurate in making content-based distinctions than human reviewers are.
To adopt this mechanism as standard, and preferred, in eDiscovery would be merely to apply the same best-practice statistical sampling standards currently relied upon to safeguard quality in life-or-death fields such as product manufacturing (think cars and airplanes) and medicine (think pharmaceuticals). The higher level of efficiency and accuracy that this represents is well within the scope of existing software. But while statistically validated automated review has been widely alluded to in legal technology circles, so far as I know it has not been used as a default by anyone when responding to document requests. Not yet. Reasons for this will be discussed in subsequent posts, including the next one.
Gathering documents: The Statistically Validated Automated Review model relies on document designation, ingestion, and indexing in much the same manner as described above with respect to Software Queued Review.
Assessing documents: In this model, a statistically representative sample of documents is first extracted from the collected set. Human reviewers study the documents in this sample, then agree upon how to code them as responsive / non-responsive and privileged / non-privileged. This coded sample becomes the “seed” for the analytics engine. Using pattern matching algorithms, the analytics engine makes a first attempt to code more documents from the collected set in the same way the human coders did, to match the coding from the seed sample. But because the analytics engine won’t have learned enough from a single sample to become highly accurate, another sample is taken. The human coders correct miscoding by the analytics engine, and their corrections are re-seeded to the engine. The process repeats until the level of error generated by the analytics engine is extremely low from the standpoint of scientific and industrial standards, and lower than the error rate human reviewers typically sustain when coding large volumes of documents.
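For readers who think in code, the seed, sample, correct, and repeat cycle can be sketched in a few lines of Python. Everything here is an assumption made for illustration. The “engine” is a deliberately naive keyword learner rather than any vendor’s algorithm. The corpus is synthetic, with a document counting as responsive when it mentions “raptor.” The sample size of 300 and the 2% error threshold are arbitrary. The stopping rule uses the Wilson score interval, which is one standard way to put an upper bound on an error rate measured from a random sample.

```python
import math
import random


def wilson_upper(errors, n, z=1.96):
    """Upper limit of the Wilson score interval (about 95% for z=1.96)."""
    p = errors / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + spread) / (1 + z * z / n)


def train(labelled):
    """Toy stand-in for an analytics engine: keep the words seen in
    responsive documents but never in non-responsive ones."""
    responsive_words, other_words = set(), set()
    for text, is_responsive in labelled:
        (responsive_words if is_responsive else other_words).update(text.split())
    return responsive_words - other_words


def predict(model, text):
    """Code a document as responsive if it contains any learned word."""
    return any(word in model for word in text.split())


def calibrate(corpus, human_code, sample_size=300, max_error=0.02,
              max_rounds=10, seed=1):
    """Seed, test on a fresh random sample, re-seed with corrections, repeat."""
    rng = random.Random(seed)
    remaining = list(corpus)
    rng.shuffle(remaining)  # successive slices are now random samples

    seed_sample, remaining = remaining[:sample_size], remaining[sample_size:]
    labelled = [(text, human_code(text)) for text in seed_sample]
    model = train(labelled)

    history = []  # (round, errors found in sample, upper bound on error rate)
    for round_no in range(1, max_rounds + 1):
        if len(remaining) < sample_size:
            break
        sample, remaining = remaining[:sample_size], remaining[sample_size:]
        coded = [(text, human_code(text)) for text in sample]
        errors = sum(predict(model, text) != label for text, label in coded)
        bound = wilson_upper(errors, sample_size)
        history.append((round_no, errors, bound))
        if bound <= max_error:
            break  # engine is validated at the chosen error threshold
        labelled += coded  # human corrections are fed back to the engine
        model = train(labelled)
    return model, history


def demo():
    """Synthetic corpus: a document is responsive when it mentions 'raptor'."""
    rng = random.Random(0)
    filler = [f"w{i}" for i in range(400)]
    corpus = []
    for i in range(4000):
        words = rng.choices(filler, k=6)
        if i % 5 == 0:
            words.append("raptor")
        corpus.append(" ".join(words))
    return calibrate(corpus, human_code=lambda text: "raptor" in text.split())


if __name__ == "__main__":
    model, history = demo()
    for round_no, errors, bound in history:
        print(f"round {round_no}: {errors} errors, error rate <= {bound:.3f}")
```

The point of the sketch is the shape of the workflow: the humans never review the whole collection, only successive random samples, and the loop ends when the measured error rate, with its margin of uncertainty, falls below an agreed threshold.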
By way of comparison this assessment process resembles the functioning of the current generation of email spam filters, which employ Bayesian mathematics and corrections by human readers (“spam” / “not spam”) that teach the filters to make better and better choices.
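The spam filter analogy can be made concrete with a tiny naive Bayes classifier, sketched below in Python. This is a textbook multinomial naive Bayes with add-one smoothing, offered as an assumption about the general family of techniques and not as the method used by any particular product. The class name, labels, and sample sentences are invented for the example.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of documents taught

    def teach(self, text, label):
        """Record one human decision (an initial coding or a correction)."""
        self.word_counts.setdefault(label, Counter()).update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        """Return the most probable label, or None if nothing was taught."""
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in text.lower().split():
                if word in vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    engine = TinyNaiveBayes()
    engine.teach("raptor hedge transaction approved", "responsive")
    engine.teach("lunch menu for friday", "not responsive")

    print(engine.classify("zoo raptor exhibit"))  # first guess, before correction

    # A human reviewer corrects the miscoding, and the engine learns from it.
    engine.teach("zoo raptor exhibit", "not responsive")
    print(engine.classify("zoo raptor exhibit"))
    print(engine.classify("raptor hedge transaction"))
```

With only two training documents the engine treats any mention of “raptor” as responsive. After a single correction it separates the zoo exhibit from the hedge transaction, which is the same feedback loop a “spam” / “not spam” button provides.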
After the Next Generation: Real Time Automated Review
It’s not another generation of analytics, but another shift is gradually occurring that will have a significant impact on eDiscovery. The day is approaching when virtually all information that people touch while working will be available and indexed in real time. From the perspective of analytics engines it is “pre-ingested” information. This will largely eliminate the gathering phase still common in previous generations. Vendors such as Kazeon, Autonomy, CA, Symantec, and others are already on the verge – and in some cases, perhaps, past the verge – of making this a possibility for their customers.
(Full disclosure of possible personal bias: I’m working with a startup with a replication engine that can in real time securely duplicate documents’ full content, plus metadata information about documents, as they are created on out-of-network devices, like laptops, to document management engines….)
The era of Real Time Automated Review will be both exciting and alarming. It will be exciting because instant access to all relevant documents should mean that more lawsuits settle on the facts, perhaps within weeks after a conflict erupts (see early case assessment, above), rather than waiting for the conclusion of a long, and sometimes murky, discovery process. It will be alarming because of the Orwellian “Big Brother” implications of systems that enable others to know every detail of the information you touch the moment you touch it, and at any time thereafter.
In my next post you’ll hear about my conversation with Nick Croce, including how Inference has prepared for the coming generation of automated discovery and where Nick thinks things are going next.