TREC and the gold standard for document review

Ron Friedman recently blogged an excellent critique of TREC Legal Track’s effort to objectively assess eDiscovery document review practices. Like Ron, I commend TREC Legal Track while wishing to offer comments that may contribute to their success. Like me, Ron is an attorney with long experience working in the technology sector, although for comparison with his math background I can only claim four years of college courses concerning statistical methods for assessing human behavior.

Benchmarking is valuable almost everywhere.

I strongly recommend reading Ron’s post for the benefit of his insights, whether or not you are already familiar with TREC Legal Track. I’d also like to offer my own observations about TREC Legal Track’s finding of low consistency between document classification decisions made by subject matter experts, who are spoken of as “gold standard” reviewers, and ordinary legal document reviewers. (In TREC Legal Track’s study, ordinary reviewers were 2nd and 3rd year law students. In real life the subject matter expert role is played by in-house or outside counsel, while much of the actual review work is performed by contract or outsource attorneys.)

Generally speaking, quality control processes involve benchmarking against some standard. Mechanical processes can be meaningfully benchmarked by physically sampling output (this is the essence of Six Sigma, in particular). For example, as machine parts come off an assembly line, samples can be selected and measured, and the variance between their actual size and target size monitored, not only to detect defects but to flag the processes responsible for defects. Human processes can also be benchmarked in a variety of ways. (This is in part the province of ITIL, the “Information Technology Infrastructure Library,” and the basis for the idea of “service level agreements.”) For example, those responsible for a customer service center may track the number of issues handled per hour, the type of issues handled, the number of resolutions or escalations per issue, revenue gained or lost per issue, etc.
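To make the assembly-line example concrete, here is a minimal sketch in Python of benchmarking a sample of parts against a target size. The function name, the millimeter units, the tolerance, and the sample values are all made up for illustration; real Six Sigma practice involves considerably more than this.

```python
from statistics import mean, pstdev

def check_sample(measurements_mm, target_mm, tolerance_mm):
    """Summarize a sample of measured parts against a target size.

    All names, units, and thresholds here are illustrative assumptions.
    """
    deviations = [m - target_mm for m in measurements_mm]
    # A part counts as a defect when it misses the target by more than the tolerance.
    defects = [m for m in measurements_mm if abs(m - target_mm) > tolerance_mm]
    return {
        "mean_deviation": mean(deviations),   # systematic drift away from target
        "spread": pstdev(measurements_mm),    # variability within the sample
        "defect_rate": len(defects) / len(measurements_mm),
    }

# Five hypothetical parts with a 10.0 mm target; one is clearly oversized.
sample = [10.02, 9.98, 10.01, 10.30, 9.99]
report = check_sample(sample, target_mm=10.0, tolerance_mm=0.05)
print(report)
```

A rising mean deviation or spread across successive samples is the kind of signal that points to a process problem rather than a single bad part.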

Unfortunately, “responsiveness” and “privilege” are not only somewhat subjective in document review; the standards for each also vary from case to case. For this reason standards need to be developed “on the fly” for each case, and these standards will by necessity be arbitrary (aka subjective) to some degree even if consistently applied. The good news is that the latest generation of document clustering software applications incorporates tools for developing consistent document review standards on the fly. Through an iterative feedback loop, the humans educate the machines to look for documents with certain characteristics, while the machines force the humans to refine their conception of responsiveness and privilege to a degree that the machine can reliably model it. After enough iterations have passed and the machine has reached some measurable standard of consistency, the humans can step back and let the machine do the rest of the review work. The machine does it more consistently than human reviewers could themselves, and at a much lower cost.
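One way to picture the “measurable standard of consistency” in the feedback loop described above is as an agreement rate between human and machine calls on a sample of documents, checked after each round. The sketch below assumes a simple percent-agreement measure and an arbitrary 90% threshold; neither comes from any particular vendor's product, and the labels (“R” for responsive, “N” for not responsive) are invented sample data.

```python
def agreement_rate(human_labels, machine_labels):
    """Fraction of sampled documents where the machine and human calls match."""
    if len(human_labels) != len(machine_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == m for h, m in zip(human_labels, machine_labels))
    return matches / len(human_labels)

def ready_for_handoff(rounds, threshold=0.9):
    """Return the 1-based round where agreement first meets the threshold, else None.

    The 0.9 default is a hypothetical cutoff chosen only for this example.
    """
    for number, (human, machine) in enumerate(rounds, start=1):
        if agreement_rate(human, machine) >= threshold:
            return number
    return None

# Three review rounds on the same five sampled documents.
human = ["R", "N", "R", "N", "R"]
rounds = [
    (human, ["N", "N", "R", "R", "R"]),  # machine agrees on 3 of 5
    (human, ["R", "N", "R", "R", "R"]),  # machine agrees on 4 of 5
    (human, ["R", "N", "R", "N", "R"]),  # machine agrees on 5 of 5
]
print(ready_for_handoff(rounds))  # 3
```

In practice the threshold, the sample size, and the choice of agreement statistic would all be judgment calls for the attorneys responsible for the review.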

With document review the very idea of defining a “gold standard” for classification is less useful than it sounds. For instance, even if a panel of leading legal scholars could be formed for each eDiscovery matter, the mere fact that someone legitimately may be called a leading scholar doesn’t mean that their views will be consistent with anyone else’s, just well reasoned. But a “gold standard” is not what’s important here. What’s important is that in each case the attorneys responsible for responding to a document request do everything they can to carefully define and consistently enforce reasonable document review standards. This is what the current crop of document clustering applications is intended to do. That is the current model, anyway. I don’t pretend to be able to name the vendors who can or cannot deliver on this promise, although I think this will be the number one question in eDiscovery technology before long.

UPDATE: I discuss TREC’s role in formulating new legal procedural rules for e-discovery in a later blog post, Catch-22 for e-discovery standards?

Desktop, laptop, email backups critical for employee lawsuits

I recently spoke with Thao Tiedt, a labor and employment partner at Ryan Swanson & Cleveland, PLLC, a mid-sized full service Seattle law firm. (Full disclosure: I’ve benefited from her incisive advice a number of times when I was wearing the hat of corporate counsel.) Our conversation focused on eDiscovery from the perspective of consequences when individual employees use company computers in ways not approved by their employer.

Bruce: Thao, I first asked you this question some years ago, but I’ll ask again so you can catch me up and share this information with a wider audience. When employees of a company use a company computer, even for personal purposes, who does the information belong to after it winds up on the company’s computer?

From an IT perspective, preparing to defend against employee lawsuits starts long before "there is even a smell of dispute in the air."

Thao: In other words, do employees have an expectation of privacy? Yes and no. In the workplace the employer has the right to take that expectation away through a variety of policies and practices. This includes email and voice mail. With telephone conversations, an employer can’t listen without permission of both the employees and others on the line. States’ laws vary; some states require that at least one person on the conversation has to give you permission to record it. But permission can be obtained through fair warning – you don’t have to get explicit permission, it can be tacit, as when a message is played announcing that a conversation may be recorded – when someone hears that and doesn’t hang up permission is implicit. Employees may be given a policy manual or an explicit waiver to sign that states that privacy is waived. If an employee refuses to sign, they can’t stay employed.

Bruce: What happens when employees try to remove information from a company computer?

Thao: People think they’re smart and they can make information go away. Here’s a good example: one of my clients is a company that received a demand for arbitration over alleged sexual harassment. So I had the company put a hold on all of the computers involved, including both the employee’s and the accused manager’s – in their cases by physically picking the computers up. Upon technical evaluation it appeared that the claimant had been wiping hers. But she failed to realize that the company had backup tapes for disaster recovery purposes. Also, this particular company has multiple branches so it has central email servers. And after interviewing co-workers, a hint of impropriety appeared. I asked one of the claimant’s co-workers “anything else we should know?” The co-worker showed me a cellphone picture sent by the claimant, showing the claimant nude from waist up, with the caption “does this change your mind?” Apparently she had wanted the co-worker to date her and he had refused. When we looked at the company email accounts we found lots of these pictures, which we could tell from the background were taken in the company bathroom. It turns out she had been spending a lot of time on dating sites while at work and sending multiple men the pictures.

Later we learned that someone had asked her: don’t you think you should be careful? She had answered no, someone in IT told me how to double-delete computer files.

After all of this information came out in the open her cause of action went away. Given her behavior it was clear that if her accused manager had in fact asked her to expose herself, as she claimed, she would have gladly done so.

This just goes to show: no one should think they can make digital information go away.

There are a huge number of cases where the smoking guns are emails. Somehow people don’t think of emails as documents; they think of them as chit-chat. Far from it. For example, when training attorneys in our firm we teach them that emails are no different from formal letters sent to clients and should be handled with the same care.

Bruce: What about accessing web sites using work computers?

Thao: Of course web use can get traced back to inappropriate sites, like pornography servers for example. I actually had to go home to view a site that had been accessed by an employee on one occasion, because our firm’s own web filters are set so high I couldn’t do it from work. For a while I couldn’t order my own underwear online from work.

Anyway, it turned out this person was running a business on work time: the business of being webmaster for a porn site.

However, as a general rule an employee can conduct their own business on their lunch hour, as long as that isn’t a conflict with their employer in some fashion.

Bruce: I’ve read about studies that suggest employee productivity actually goes up when they can do a certain amount of personal work – scheduling doctors appointments and what not, from their work computer during work hours – because that flexibility leads to less tardiness and absenteeism and so forth. So how does an employer who believes this is true handle personal use of work computers?

Thao: Here’s what we say in our own [Ryan Swanson & Cleveland] employee manual: employees may make limited, incidental, responsible personal use of company computers.

Having said that, an employer can still intercept and log employee use of company computers. In the harassment case I mentioned, for example, we examined how both parties had used their computers. The accused manager was very uncomfortable with having attorneys review his work materials, but we needed to see his responses to her emails to make the company’s case. What we found didn’t support her case, but did lead us to caution him to stop unrelated inappropriate use of his work computer.

Bruce: What about when employees use their personal email account, like Gmail, from a work computer?

Thao: Does accessing email on a company computer waive privacy protection? Yes. There is no expectation of privacy for personal email stored on a company computer.

Bruce: How about a password for a personal email account, once it has been typed into a company computer?

Thao: Yes, if it’s on the work computer then it’s information that belongs to the employer.

Bruce: But can the employer use that information? What if they use the password to access an employee’s personal email account, like an AOL or Gmail account?

Thao: No. The employer can possess the password if it’s on the company’s computer, but they can’t use it to log into the personal email account.

Bruce: What about Google Gears, which makes local copies of personal email and Google documents on the computer being used, which might be a work computer?

Thao: Then the company has a right to see that information. Anything on the company computer is the company’s – if the company policy reads that way.

California sometimes has different views concerning privacy – they have a state constitutional right to privacy. But as long as companies have been up front with employees by notifying them that if information goes through a work computer, that information can be accessed by the company, then employer access to that information is allowed in California as well.

Bruce: When a lawsuit is threatened you send out a scary letter to employees telling them to avoid destroying evidence?

Thao: We send out a “scary letter” right away [to leave no doubt what is expected of people].

It can be the case that having electronically stored information collected by an outside vendor creates insulation against tampering and a better evidentiary chain of custody, even with intellectual property secrecy issues. Outside vendors can make good selections about what fits an eDiscovery inquiry.

What you don’t want is for opposing counsel to see something secret [and not responsive to a discovery request] that may be useful to their client in some way. If that happens it creates a question for that attorney about what their duty is to their client – to reveal or not to reveal that information – and then there’s the fact that you can’t get it out of your head once you’ve seen it. It will absolutely color your strategy down the road.

Also, concerning attorney-client privilege: privilege is waived whenever a privileged email is copied to anyone outside of “speaking agents of the company.” This happens all the time, even when recipients of privileged emails are warned. Forwarding emails is a hard habit to break.

Bruce: Symantec recently commissioned a study which revealed that a very high percentage of laid-off employees copy company information and take it with them when they go. What, if any, recourse does a company have when employees leave with info?

Thao: Here’s an example. One of my clients is a regional auto dealer association. A common problem they have is that new vehicle salespersons typically view the customers they sell to as “my customers” who they can “keep” after they move to a different dealership. Wrong – they are the dealer’s customers, not the salesperson’s. In addition, customer information is considered private under federal law. If someone captures that information but not because of a business transaction, for some other purpose, it violates Federal privacy law.

Bruce: What remedies are available to an employer in this situation? What can an auto dealer do if a new vehicle salesperson takes a customer list with them?

Thao: The dealer can file for an injunction telling a dealer not to use information that came from other dealers. When dealers do receive such information it won’t be profitable because an injunction is very expensive for them to defend as well as scary and distracting.

And if the company whose information was taken can prove actual damages, then they can receive money damages from the new employer for tortious interference with private information. For example, I had a case where a person thought they were going to be terminated, so they copied specifications for a technical piece of equipment and emailed them to themselves. Then they changed information in the company computers regarding that equipment, which was very expensive for that company to correct. A new employer could be held liable for damages by accepting that information from the former employee.

Bruce: What about non-competition agreements – do those work?

Thao: A non-compete protects employer information that’s already in an employee’s head. It’s limited but it works. For example, it can say a vehicle sales employee can’t work in a dealership selling the same type of car in the same county, but usually can’t keep someone from completely working in the car business, or for any company within that county. It works as long as you don’t prevent the employee from working anywhere in the same business.

Bruce: Did you read about the Motorola ex-CFO who quit, apparently under some kind of cloud, then returned his company laptop with files wiped? He then accused the company of retaliation, so the company accused him of spoliation. What can an employer do in this situation? Can the court award sanctions against an ex-employee for destroying evidence?

Thao: Yes, most people don’t understand that computer files must be preserved whenever there is even a smell of dispute in the air. Might the court award money sanctions? Possibly. Or, in some extremely serious situations the judge can order that the offending party can’t defend itself; or that a party can’t pursue its lawsuit – case dismissed. It’s a form of inconsistent pleading – a claimant can’t resist providing information and pursue a remedy simultaneously.

Bruce: From what you have said today it sounds like data backups of one sort or another are a critical element for eDiscovery, at least in your practice.

Thao: Disaster recovery backups just make sense as a litigation backup data source when dealing with employees. But you need historical backups that are locked down so that they can’t be erased for a period of time during which they might be needed.

Archiving is another thing you can do. For example, the Puget Sound Automobile Dealers Association maintains an electronic archive of participating dealers’ employee policy manuals over the years which can be used as evidence in an employee dispute.

Bruce: Which brings us to a final thought. There’s a lot of company data — confidential customer data — in the hands of non-attorneys who don’t have the same paranoia about casually exposing it that attorneys like you and I do….

Thao: Yes, you have to have confidence in IT people that they won’t be trolling confidential information, that they will keep it confidential.

Reusing document clustering categories to spend less on eDiscovery?

After drafting a blog post about mass data sampling and classification in the “cloud,” I became curious about the potential for reusing categories developed in eDiscovery sampling and classification projects as “seeds” for later projects. For further insight I turned to Richard Turner, Vice President of Marketing at Content Analyst Company, LLC, a document clustering and review provider for eDiscovery.

Bruce: I wonder to what extent document categories that are created using document clustering software when reviewing documents for eDiscovery can be aggregated across multiple document requests and/or lawsuits within the same company. Can previously developed categories or tags be reused to seed and thus speed up document review in other cases?

Richard: Regarding the notion of aggregating document categories, etc., it’s something that’s technically very feasible. And it could greatly speed document review if categories could be used to “seed” new reviews, new cases, etc. Here’s the challenge: we have found that most of the “categories” developed by our clients start out case specific, and are too granular to be valuable when the next case comes along. It also hasn’t seemed to matter whether categorization was being used by a corporate legal department or an outside counsel – they’re equally specific.

The idea itself had merit, so we tossed it around with our Product Solutions Architects, and they came up with several observations. First of all, the categories people develop are driven by their need to solve a specific eDiscovery challenge, i.e. documents that are responsive to the case at hand. Second, when the next issue or case comes along, they naturally start over again, first by identifying responsive documents and then by using those documents to create categories – any “overlap” is purely coincidental. Finally, to develop categories that were really useful across a variety of issues or cases, they would need to be fairly generic and probably not developed with any specific case in mind.

I think that’s very hard to do for a first or even second-level review – it’s not necessarily a natural progression, as people work backwards from the issues at hand. Privilege review, however, could be a different animal. There are some things in any case that invoke privilege because of the particulars of the case, for example, attorney-client conversations which are likely to involve different individuals in different litigation matters. There are other things that could logically be generic – company “trade secrets” for example would almost always be treated as privileged, as are certain normally-redacted items such as PII (personally-identifiable information). Privilege review is also a very expensive aspect of eDiscovery, since it involves physical “reads” using highly-paid attorneys (not something you can comfortably offshore). Could “cloud seeding” have value for this aspect of eDiscovery? It’s an interesting thought.

Cloud-seeding: SaaS data classification via Panda Security’s new anti-virus offering

Panda Security recently released (in beta form) what it claims is the first cloud-based anti-virus / anti-malware solution for Windows PCs. Not only does it sound like a clever tool for data loss prevention, but it demonstrates another way in which information service providers can aggregate individual user data to develop classifications or benchmarks valuable to every user, a mechanism I’ve explored in previous blog posts.

In essence, every computer using Panda’s Cloud Antivirus is networked together through Panda’s server to form a “collective intelligence” for malware detection and prevention. Here’s how it works: users download and install Panda’s software – it’s a small application known as an “agent” because the heavy lifting is done on Panda’s server. These agents send reports back to the Panda server containing information about new files (and, I presume, related computer activity which might indicate the presence of malware). When the server receives reports about previously unknown files which resemble, according to the logic of the classification engine, files already known to be malware, these new files are classified as threats without waiting for manual review by human security experts.

Sampling at the right time and place allows proactive decision making.

For example, imagine a new virus is released onto the net by its creators. People surfing the net, opening emails, and inserting digital media start downloading this new file, which can’t be identified as a virus by traditional anti-virus software because it hasn’t been placed in any virus definitions list yet. Computers on which the Panda agent has been installed begin sending reports about the new file back to the Panda server. After some number of reports about the file are received by Panda’s server, the server is able to determine that the new file should be treated as a virus. At this point all computers in the Panda customer network are preemptively warned about the virus, even though it has only just appeared.
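The reporting flow in the example above can be sketched as a small server-side aggregator. This is only a toy model of the general idea, not Panda's actual logic: the class name, the similarity score, and both thresholds are hypothetical, and how a real classification engine decides that a file “resembles” known malware is not described in the material I have seen.

```python
from collections import defaultdict

class ReportAggregator:
    """Toy stand-in for a server that pools file reports from many agents."""

    def __init__(self, min_reports=3, min_similarity=0.8):
        self.min_reports = min_reports        # hypothetical: distinct agents required
        self.min_similarity = min_similarity  # hypothetical: resemblance cutoff
        self.reporters = defaultdict(set)     # file hash -> ids of agents that reported it
        self.flagged = set()                  # file hashes now treated as threats

    def receive(self, agent_id, file_hash, similarity_to_known_malware):
        """Record one agent report; return True if the file is now flagged."""
        if similarity_to_known_malware >= self.min_similarity:
            # A set ignores repeat reports from the same agent.
            self.reporters[file_hash].add(agent_id)
        if len(self.reporters[file_hash]) >= self.min_reports:
            self.flagged.add(file_hash)
        return file_hash in self.flagged

server = ReportAggregator(min_reports=3)
results = [server.receive(agent, "abc123", 0.9)
           for agent in ("pc-1", "pc-2", "pc-2", "pc-3")]
print(results)  # [False, False, False, True]
```

Once a hash lands in `flagged`, every other computer in the network can be warned about it, which is the pooling effect the example describes.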

According to Panda’s April 29, 2009 press release:

Utilizing Panda’s proprietary cloud computing technology called Collective Intelligence, Panda Cloud Antivirus harnesses the knowledge of Panda’s global community of millions of users to automatically identify and classify new malware strains in almost real-time. Each new file received by Collective Intelligence is automatically classified in under six minutes. Collective Intelligence servers automatically receive and classify over 50,000 new samples every day. In addition, Panda’s Collective Intelligence system correlates malware information data collected from each PC to continually improve protection for the community of users.

Because Panda’s solution is cloud-based and free to consumers, it will reside on a large number of different computers and networks worldwide. This is how Panda’s cloud solution is able to fill a dual role as both sampling and classification engine for virus activity. On the one hand Panda serves as manager of a communal knowledge pool that benefits all consumers participating in the free service. On the other hand, Panda can sell the malware detection knowledge it gains to corporate customers – wherein lies the revenue model that pays for the free service.

I have friends working in two unrelated startups, one concerning business financial data and the other Enterprise application deployment ROI, that both work along similar lines (although neither is free to consumers). Both startups offer a combination of analytics for each customer’s data plus access to benchmarks established by anonymously aggregating data across customers.

Panda’s cloud analytics, aggregation and classification mechanism is also analogous to the non-boolean document categorization software for eDiscovery discussed in previous posts in this blog, whereby unreviewed documents can be automatically (and thus inexpensively) classified for responsiveness and privilege.

Deeper, even more powerful extensions of this principle are also possible. I anticipate that we will soon see software which will automatically classify all of an organization’s documents as they are created or received, including documents residing on employees’ laptops and mobile devices. Using Panda-like classification logic, new documents will be classified accurately whether or not they are an exact match with anything previously known to the classification system. This will substantially improve implementation speed and accuracy for search, access control and collaboration, document deletion and preservation, end point protection, storage tiering, and all other IT, legal and business information management policies.

Tape Indexing Breathes Life Into Tape Storage

Here’s an observation that can be tagged “mixed blessings”: foot dragging on the part of techno-lagging attorneys has shielded (and in some cases continues to shield) their clients from the full potential weight of eDiscovery requests. For example, even after years of discussion, the legal profession didn’t formally recognize the obligation to produce metadata in response to discovery requests before the Federal Rules of Civil Procedure amendments adopted at the end of 2006. More outrageously, some attorneys are still gaming to avoid eDiscovery altogether, as Magistrate Judge John M. Facciola (U.S. District Court, Washington D.C.) pointed out in his keynote presentation at LegalTech earlier this year.

Only a few years ago certain courts had ruled that data stored on tape could be considered “inaccessible” because it was so expensive to review it, and thus data stored on tape did not always need to be reviewed when answering an eDiscovery request (for example, the Zubulake decisions). More recently, however, the legal profession has become aware of advances which make tapes faster and cheaper to review, like technology for rapid disaster recovery.

What IT person doesn't look forward to working with historical data?

There are still a number of fine distinctions being made in this area of law, and the specific tape handling practices of different companies can render their tapes more or less “accessible.” (Ironically, companies that archive backup tapes indefinitely, which sounds like a safe practice, may be exposing themselves to a greater burden in eDiscovery, not to mention the extra cost of storing outdated tapes.) But broadly speaking, few companies storing information on tape can categorically rely on “inaccessibility” to rule out the risk of being required to review their tapes during eDiscovery any more. For more about the law concerning inaccessibility, including California’s burden-shifting rules, I recommend this article by Winston & Strawn attorneys David M. Hickey and Veronica Harris.

Fortunately, two prongs of innovation are shrinking the issues surrounding eDiscovery and tape. The first prong, which happens to be the subject of this blog post, comes in the form of new tape indexing and document retrieval technology. The second solution, which involves substituting hard drives in place of tape, will be the subject of a future post.

To learn more about the current state of eDiscovery technology in the realm of tape, I recently spoke with Jim McGann, Vice President of Marketing at Index Engines. Index Engines’ solution comes in the form of an appliance (a hardware box pre-loaded with their software) that scans a broad variety of tapes and catalogs the content. The appliance indexes tape data and de-duplicates documents within the index using the hash values of the documents. At this point users can cull (selectively retrieve) potentially responsive documents from a batch of ingested tapes without first performing an expensive, resource-intensive full restoration of each tape. And because Index Engines can ingest all of the common tape storage formats, users don’t need to run or even possess the original software used to write to the tapes.
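The hash-based de-duplication described above can be illustrated with a few lines of Python. This is a generic sketch, not Index Engines' implementation: SHA-256 is my assumption (the conversation only referred to “hash values”), and the tape ids, paths, and contents are invented.

```python
import hashlib

def build_index(documents):
    """Map each content hash to every (tape_id, path) location holding that content."""
    index = {}
    for tape_id, path, content in documents:
        # Identical bytes always produce the same digest, so copies collapse to one key.
        digest = hashlib.sha256(content).hexdigest()
        index.setdefault(digest, []).append((tape_id, path))
    return index

# Hypothetical documents found on two tapes; the first two are byte-identical.
docs = [
    ("tape-01", "/mail/a.msg", b"Q3 forecast attached"),
    ("tape-02", "/mail/a.msg", b"Q3 forecast attached"),
    ("tape-02", "/docs/b.txt", b"Lunch on Friday?"),
]
index = build_index(docs)
print(len(docs), "stored copies,", len(index), "unique documents")
# 3 stored copies, 2 unique documents
```

Because the index records where each copy lives, a reviewer only needs to read one copy of a document while still knowing every tape it appears on.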

From a longer-term strategic perspective, Index Engines’ users can approach their tape stores incrementally, taking a first pass through their tapes in response to a particular discovery request, then add to their global tape index as new discovery requests are fielded. They can embark upon a proactive tape indexing campaign that will give them enhanced early case assessment capabilities. Users may also opt to extract important data that is not immediately needed but resides on old or degraded tape.

For companies with thousands or tens of thousands of tapes, indexing can allow significant numbers of tapes to be discarded since many individual tapes typically contain data which is almost entirely repeated on other tapes or has lasted past the end of its retention period – not to mention the corrupted or blank tapes which are being carefully stored nonetheless.
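As a rough illustration of how an index could identify tapes that add nothing, the sketch below marks a tape as redundant when every document hash on it also appears on a tape that is being kept. The greedy ordering and all of the names are my own assumptions. It also ignores retention periods and legal holds entirely, which would have to be checked before any real tape is discarded.

```python
def redundant_tapes(tape_hashes):
    """Greedily pick tapes whose every document hash also lives on a retained tape.

    tape_hashes maps a tape id to the set of document hashes found on that tape.
    """
    kept = dict(tape_hashes)
    discard = []
    # Consider the smallest tapes first so larger, more complete tapes are retained.
    for tape_id in sorted(tape_hashes, key=lambda t: len(tape_hashes[t])):
        others = set().union(*(h for t, h in kept.items() if t != tape_id))
        if tape_hashes[tape_id] <= others:
            # Everything on this tape (possibly nothing at all) exists elsewhere.
            discard.append(tape_id)
            del kept[tape_id]
    return discard

tapes = {
    "A": {"h1", "h2", "h3"},
    "B": {"h1", "h2"},   # fully repeated on tape A
    "C": set(),          # blank tape
    "D": {"h4"},         # holds a unique document
}
print(redundant_tapes(tapes))  # ['C', 'B']
```

Removing a tape from `kept` as soon as it is marked means two identical tapes can never both be discarded.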

All of this makes Index Engines an extremely affordable (at least by Enterprise standards) alternative to restoring and reviewing tapes individually.

I asked Jim McGann whether Index Engines resembled dentists who teach patients good dental hygiene and, if successful, will wind up putting themselves out of a job. If Index Engines’ appliances succeed in indexing, de-duplicating, and extracting all of the stored tape in existence, while ever more affordable hard drive storage replaces tape storage, won’t the company be out of a job?

Jim pointed out that, for certain organizations which currently rely on tape storage, substituting hard drives for tape drives is simply not a viable option. Costs associated with re-routing system data and human work flows, as well as the risk of downtime during a transition, mean that many organizations won’t switch even after disk drives become less expensive. And Index Engines takes away much of the cost incentive for switching that would otherwise be driven by eDiscovery and compliance requirements. Finally, Jim says, Index Engines can be used to index almost all of the information customers have, not just tape data, which enables users to find non-tape information that must be reviewed for eDiscovery.

The other approach to the problem of tape storage (to be explored in more detail in a future blog post) involves near-line hard drive solutions. Leading hard drive storage vendors such as Isilon Systems claim that their “near line” solution is priced nearly as low as tape while offering higher performance and reliability. But advocates of tape, including the Boston-area based Clipper Group (in a whitepaper offered on tape drive vendor SpectraLogic’s web site), claim that the total cost of ownership of disk storage, taking into account factors such as floor space requirements and electricity, is still many, many times higher than that of tape.

So, as tapes look like they will be around for some time to come, companies with tapes will continue to need technologies like Index Engines’. And most will not be able to avoid discovery of tapes for much longer, if they are even still able to do so, thanks in part to the availability of these technologies.

The Evolution of eDiscovery Analytics Models, Part II: A Conversation with Nicholas Croce

I recently had the pleasure of speaking with Nicholas Croce, President of Inference Data, a provider of innovative analytics and review software for eDiscovery, following the company’s recent webinar, De-Mystifying Analytics. During our conversation I discovered that Nick is double-qualified as a legal technology visionary. He not only founded Inference, but has been involved with legal technologies for more than 12 years. Particularly focused on the intersection of technology and the law, Nick was directly involved in setting the standards for technology in the courtroom through working personally with the Federal Judicial Center and the Administrative Office of the US Courts.

I asked to speak with Nick because I wanted to pin him down on what I imagined I heard him say (between the words he actually spoke) during the live webinar he presented in mid-March. The hour-long interview and conversation ranged in topic, but was very specific in terms of where Nick sees the eDiscovery market going.

Sure enough, during our conversation Nick confirmed and further explained that he and his team, which includes CEO Lou Andreozzi, the former LexisNexis NA (North American Legal Markets) Chief Executive Officer, have designed Inference with not one, but two models of advanced eDiscovery analytics and legal review in mind.

As total data volume explodes, choosing the right way to sift out responsive documents becomes urgent

Please read my previous blog post The Evolution of eDiscovery Analytics Models, Part I: Trusting Analytics if you haven’t already and want to understand more about the assertions I make in this blog post.

In a nutshell, Inference is designed not only to deliver the current model of eDiscovery software analytics, which I have dubbed “Software Queued Review,” but the next generation analytics model as well, which I am currently calling “Statistically Validated Automated Review” (Nick calls it “auto-coding”).

Bruce: In a webinar you presented recently you explained statistical validation of eDiscovery analytics and offered predictions concerning the evolution of the EDRM (“Electronic Discovery Reference Model”).

I have a few specific questions to ask, but in general what I’d like to cover is:

1) where does Inference fit within the eDiscovery ecosystem,
2) how you think statistically validated discovery will ultimately be used, and
3) how you think the left side of the EDRM diagram (which is where document identification, collection, and preservation are situated) is going to evolve?

Nick: To first give some perspective on the genesis of Inference, it’s important to understand the environment in which it was developed. Prior to founding Inference I was President of DOAR Litigation Consulting. When I started at DOAR in 1997, the company was really more of a hardware company than anything else. I was privileged to be involved in the conversion of courtroom technology from wooden benches to the efficient digital displays of evidence we see today. Within a few years we became the predominant provider of courtroom technology, and it was amazing to see the legal system change and directly benefit from the introduction of technology. As people saw the dramatic benefits and started saying “how do we use it?”, we created a consulting arm around eDiscovery, which provided the insight to see that this same type of evolution was needed within the discovery process.

This began around 2004-2005 when we started to see an avalanche of ESI (“Electronically Stored Information”) coming, and George Socha became a much needed voice in the field of eDiscovery. As a businessman I was reading about what was happening, and asking questions, and it seemed black and white to me – with existing technology it had become impossible to review everything because of the tremendous volume of ESI. As a result I started developing new technology for it, to not only manage the discovery of large data collections, but to improve and bring a new level of sophistication to the entire legal discovery process.

Inference was developed to help clients intelligently mine and review data, organize case workflow and strategy, and streamline and accelerate review. It’s the total process. But, today I still have to fight “the short term fix mentality” – lawyers who just care about “how do I get through this stuff faster”, which is the approach of some other providers, and which also relates to the transition I see in the EDRM model – I want to see the whole thing change.

Review is the highest dollar amount and the biggest pain; 70% of a corporation’s legal costs are within eDiscovery. People want to, and need to, speed up review. However, we also need to add intelligence back into the process.

Bruce: What differentiates Inference, where does it fit in?

Nick: I, and Inference, went further than just accelerating linear review and said: it has to be dynamic, not just coding documents as responsive / non-responsive. I know this is going to sound cheesy, I guess, but – you have to put “discovery” back into Discovery. You need to be able to quickly find documents during a deposition when a deponent says something like “I never saw a document from Larry about our financial statements”, and not just search for “responsive: yes/no”, “privileged: yes/no.”

Inference was, and is, designed to be dynamic – providing suggestions to reviewers, opportunities to see relationships between documents and document sets not previously perceived, helping to guide attorneys – intuitively. Inference follows standard, accepted methodologies, including Boolean keyword search, field and parametric search, and incorporates all of the tools required for review – redaction, subjective coding, production, etc.

In addition to that overriding principle, we wanted the ability to get data in from anyone, anywhere and at any time. Regulators are requiring incredibly aggressive production timelines; serial litigants re-use the same data set over and over; CIOs are trying to get control over searching data more effectively, including video and audio. Inference is designed to take ownership of data once it leaves the corporation, whether it is structured, semi-structured or unstructured data.

Inside the firewall, the steps on the left side of the EDRM model are being combined. Autonomy, EMC, Clearwell, StoredIQ – the crawling technologies – these companies are within inches of extracting metadata during the crawling process, and may be there already. This is where Inference comes in, since we can ingest this data directly. I call it the disintermediation of processing because at that point there are no more additional costs for processing.

In the past someone would use EnCase for preservation, then Applied Discovery for processing (using date ranges and Boolean search terms), and at some cost per custodian, and per drive, you’d then pay for processing. It used to be over $2,500 per gig, now it’s more like $600 to $1500 per gig, depending on multi-language use and such.

But once corporations automate the process with crawling and indexing solutions, all of the information goes right into Inference without the intermediary steps, which puts intelligence back in the process. You can ask the system to guide you whenever there’s a particular case, or an issue. If I know the issue is a conversation between Jeff and Michele during a certain date range, I can prime the system with that information, start finding stuff, and start looking at settlement of the dispute. But without automation it can take months to do, at much higher costs.

Inference also offers quality control aspects not previously available: after, say, one month, you can use the software to check review quality, find rogue reviewers, and fix the process. You can also ultimately do auto-coding.
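To make the idea of spotting “rogue reviewers” concrete, here is a minimal sketch of one way such a check could work. It assumes that a second-pass reviewer re-codes a sample of each reviewer’s documents and that a reviewer is flagged when agreement falls below a threshold. The function name `flag_reviewers`, the tuple layout, and the 80% threshold are illustrative assumptions, not a description of how Inference does it.

```python
from collections import defaultdict

def flag_reviewers(checks, min_agreement=0.8):
    """Compute per-reviewer agreement with a quality-control re-review.

    checks: iterable of (reviewer, original_tag, qc_tag) tuples, one per
    re-reviewed document (hypothetical layout). Returns the agreement rate
    for each reviewer and a sorted list of reviewers below min_agreement.
    """
    totals, matches = defaultdict(int), defaultdict(int)
    for reviewer, original_tag, qc_tag in checks:
        totals[reviewer] += 1
        matches[reviewer] += (original_tag == qc_tag)
    rates = {r: matches[r] / totals[r] for r in totals}
    flagged = sorted(r for r, rate in rates.items() if rate < min_agreement)
    return rates, flagged
```

A low agreement rate does not by itself show who is wrong, so in practice a flagged reviewer’s documents would still need a closer look by someone senior.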

Bruce: I think this is a good opening for segue to the next question: how will analytics ultimately be used in eDiscovery?

Nick: The two most basic components of review are “responsive” and “privileged.” I learned from the public testimony of Verizon’s director of eDiscovery, Patrick Oot, some very strong statistics from a major action they were involved in. The first level document review expense was astounding even before the issues were identified. The total cost of responsive and privileged review was something like $13.6 million.

The truth is that companies only do so many things. Pharma companies aren’t generally talking about real-estate transactions or baseball contracts.

Which brings us to auto-coding… sometimes I try to avoid calling it “auto-coding” versus “computer aided” or “computer recommended” coding. When someone says “the computer did it” attorneys tend to shut down, but if someone says “the computer recommended it” then they pay attention.

Basically auto-coding is applying issue tags to the population based on a sampling of documents. The way we do it is very accurate because it is iterative. It uses statistically sound sampling and recurrent models. It uses the same technology as concept clustering, but you cluster a much smaller percentage. Let’s say you create 10 clusters, tag those, then have the computer tag other documents consistently with the same concepts. Essentially, the computer makes recommendations which are then confirmed by an attorney, and the cycle is repeated until the necessary accuracy level has been achieved. This enables you to only look at a small percentage of the total document population.
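The loop Nick describes (sample, recommend, confirm, repeat until accurate enough) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a toy word-count model stands in for real concept clustering, the `attorney` argument is any function that returns the correct tag for a document, and all names and default values (`iterative_review`, `sample_size`, `target`) are hypothetical rather than taken from Inference.

```python
import random
from collections import Counter

def train(labeled):
    """Count how often each word appears under each tag (toy stand-in for concept modeling)."""
    counts = {}
    for text, tag in labeled:
        for word in text.lower().split():
            counts.setdefault(word, Counter())[tag] += 1
    return counts

def recommend(model, text, default="non-responsive"):
    """Recommend the tag whose training words best match the document."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default

def iterative_review(docs, attorney, sample_size=20, target=0.95, seed=0, max_rounds=10):
    """Sample, compare recommendations with attorney tags, retrain, and repeat."""
    rng = random.Random(seed)
    unreviewed = list(docs)
    rng.shuffle(unreviewed)
    labeled, model, agreement = [], {}, 0.0
    for _ in range(max_rounds):
        if not unreviewed:
            break
        sample, unreviewed = unreviewed[:sample_size], unreviewed[sample_size:]
        # The attorney confirms or corrects the tag for every sampled document.
        confirmed = [(doc, attorney(doc)) for doc in sample]
        # Agreement is measured before retraining, so it is an honest check.
        agreement = sum(recommend(model, doc) == tag for doc, tag in confirmed) / len(confirmed)
        labeled.extend(confirmed)
        model = train(labeled)
        if agreement >= target:
            break
    # Everything not reviewed by hand receives a computer-recommended tag.
    auto_coded = {doc: recommend(model, doc) for doc in unreviewed}
    return auto_coded, len(labeled), agreement
```

Calling `iterative_review(docs, attorney)` returns the recommended tags for the unreviewed documents, the number of documents an attorney actually looked at, and the agreement rate measured on the last sample. The point of the design is that each sample is scored before it is used for training, so the reported agreement reflects how the model performs on documents it has not yet seen.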

Bruce: I spoke with one of the statistical sampling gurus at Navigant Consulting last month, who suggests that software validated by statistical sampling can be more accurate than human reviewers, with fewer errors, for analyzing large quantities of documents.

Nick: It makes sense. Document review is very labor intensive and redundant. Think about the type of documents you’re tagging for issues – it doesn’t even need to be conscious: it is an extremely rote activity on many levels which just lends itself to human error.

Bruce: So let’s talk about what needs to happen before auto-coding becomes accepted, and becomes the rule rather than the exception. In your webinar presentation you danced around this a bit, saying, in effect, that we’re waiting for the right alignment of law firms, cases, and a judge’s decision. In my experience as an attorney, including some background in civil rights cases, the way to go about this is by deliberately seeking out best-case-scenario disputes that will become “test cases.” A party that has done its homework stands up and insists on using statistically validated auto-coding in an influential court; here we probably want the DC Circuit, the Second Circuit, or the Ninth Circuit, I suppose. When those disputes result in a ruling on statistical validity, the law will change and everything else will follow. Do you know of any companies in a position to do this, to set up test cases, and have you discussed it with anyone?

Nick: Test cases: who is going to commit to this – the general counsel? Who do they have to convince? Their outside counsel, who, ultimately, has to be comfortable with the potential outcome. But lawyers are trained to mitigate risk, and for now they see auto-coding or statistical sampling as a risk. I am working with a couple of counsel with scientific and/or mathematical backgrounds who “get” Bayesian methods and the benefits of using them. Once precedents are set that establish the use of statistical analysis as reasonable, it will be a risk not to use these technologies. It is the same as with legal research: online research tools were initially considered a risk. Now, it can be considered malpractice not to use them.

It can be frustrating for technologists to wait, but that’s how it is. Sometimes when we are following up with new installs of Inference we find that 6 weeks later they’ve gone back to using simple search instead of the advanced analytics tools. But even for those people, after using the advanced features for a few months they finally discover they can no longer live without it.

Bruce: Would you care to offer a prediction as to when these precedents will be set?

Nick: I really believe it will happen, there’s no ambiguity.  I just don’t know if it’s 6 months or a year. But general counsel are taking a more active role, because of the cost of litigation, because of the economy, and looking at expenses more closely. At some point there will be that GC and an outside counsel combination that will make it happen.

Bruce: After hearing my statistician friend from Navigant deliver a presentation on statistical sampling at LegalTech last month I found myself wondering why parties requesting documents wouldn’t want to insist that statistically validated coding be used by parties producing documents for the simple reason that this improves accuracy. What do you think?

Nick: Requesting parties are never going to say “I trust you.”

Bruce: But like they do now, the parties will still have to be able to discuss and will be expected to reach agreement about the search methods being used, right?

Nick: You can agree to the rules, but the producing party can choose a strategy that will be used to manage their own workflow – for example today they can do it linearly, offshore, or using analytics. The requesting party will leave the burden on the producing party.

Bruce: If they are only concerned with jointly defining responsiveness, in order to get a better-culled set of documents — that helps both sides?

Nick: That would be down the road… at that point my vision gets very cloudy, maybe opposing counsel gets access to concept searches – and they can negotiate over the concepts to be produced.

Bruce: There are many approaches to eDiscovery analytics. Will there have to be separate precedents set for each mathematical method used by analytics vendors, or even for each vendor-provided analytical solution?

Nick: I’d love to have Inference be the first case. But I don’t know how important the specific algorithm or methodology is going to be – that is a judicial issue. Right now we’re waiting for the perfect judge and the perfect case – so I’ll hope it’s Inference, rather than “generic” as to which analytics are used. I hope there’s a vendor shakeout – for example, ontology based analytics systems demo nicely, but “raptor” renders “birds” which is non-responsive, while “Raptor” is a critical responsive term in the Enron case.

Bruce: Perhaps vendors and other major stakeholders in the use of analytics in eDiscovery, for example, the National Archives, should be tracking ongoing discovery disputes and be prepared to file amicus briefs when possible to help support the development of good precedents.

Nick: Perhaps they should.


What is Discovery? – explaining eDiscovery to non-lawyers

I met with a group of software developers earlier this week to talk about configuring a visual analytics solution to provide useful insights for eDiscovery. To help them understand the overall process I wrote out a short description of key concepts in Discovery. I omitted legal jargon and described Discovery as a simple, repeatable process that would appeal to engineers. If anyone has enhancements to offer I’d be happy to extend this set further.

Discovery

Discovery is a process of information exchange that takes place during most lawsuits. The goal of discovery is to allow the lawyers to paint a picture that sheds light on what actually happened. Ideally court proceedings are like an academic argument over competing research papers that have been written as accurately and convincingly as possible. Each side tries to assemble well-documented citations to letters, emails, contracts, and other documents, with information about where they were found, who created them, why they were created, when, and how they were distributed.

Discovery requests

The dead tree version of a law library.

The discovery process is governed by a published body of laws and regulations. Under these rules, after a lawsuit has begun, each side has the ability to ask the other side to search carefully for all documents, including electronic ones, that might help the court decide the case. Each party can ask the other for documents using written forms called Discovery Requests. BOTH sides will need documents to support their respective positions in the lawsuit.

Responsive

Documents are “responsive” when they fit the description of documents being sought under discovery requests. Each side has the responsibility of being specific, and not over-inclusive, in describing the documents it requests. Each party has the opportunity to challenge the other’s discovery requests as being overbroad, and any dispute that cannot be resolved by negotiation will be resolved by the court. But once the time for raising challenges is over, each side has the obligation to take extensive steps to search for, make copies of, and deliver all responsive, non-privileged documents to the other side.

Privilege

Certain document types are protected from disclosure by privilege. The most common are attorney-client privilege and the related work product privilege, which in essence cover communications between lawyers and clients and, in certain cases, non-lawyers working for lawyers or preparing for lawsuits. When documents are responsive, but also protected by privilege, they are described on a list called a privilege log and the log is delivered to the other side instead of the documents themselves.

Authentication

A document ordinarily isn’t considered self-explanatory. Before it can be used in court it must be explained or “authenticated” by a person who has first-hand knowledge of where the document came from, who created it, why it was created, how it was stored, etc. Authentication is necessary to discourage fakery and to limit speculation about the meaning of documents. Documents which simply appear with no explanation of where they came from may be criticized and ultimately rejected if they can’t be properly identified by someone qualified to identify them. Thus document metadata — information about the origins of documents — is of critical importance for discovery. (However, under the elaborate Rules of Evidence that must be followed by lawyers, a wide variety of assumptions may be made, depending on circumstances surrounding the documents, which may allow documents to be used even if their origins are disputed.)

Document custodians

The term “custodian” can be applied to anyone whose work involves storing documents. The spoken or written statement (“testimony”) of a custodian may be required to authenticate and explain information that is in their custody. And when a legal action involves the actions and responsibilities of relatively few people (as most legal actions ultimately do), those people will be considered key custodians whose documents will be examined more thoroughly. Everyone with a hard drive can be considered a document custodian with respect to that drive, although system administrators would ordinarily be considered the custodians of a company-wide document system like a file server. Documents like purchase orders, medical records, repair logs and the like, which are usually and routinely created by an organization (sometimes called “documents kept in the ordinary course of business”) may be authenticated by a person who is knowledgeable about the processes by which such documents were ordinarily created and kept, and who can identify particular documents as having been retrieved from particular places.

Types of documents which are discoverable and may be responsive

Typically any form of information can be requested in discovery, although attorneys are only beginning to explore the boundaries of the possibilities here. In the old days only paper documents and memories were sought through discovery. (Note: of course, physical objects may also be requested; for example, in a lawsuit claiming a defect in an airplane engine, parts of the engine may be requested.) As of today requests frequently include databases, spreadsheets, word processing documents, emails, instant messages, voice mail and other recordings, web pages, images, metadata about documents, document backup tapes, erased but still recoverable documents, and everything else attorneys can think of that might help explain the circumstances on which the lawsuit is based.

Discovery workflow

Discovery can be time consuming and expensive. Lawyers work closely with IT, known document custodians, and others with knowledge of the events and people involved in the lawsuit. First they attempt to identify what responsive documents might exist, where they might be kept, and who may have created or may have control over the documents that might exist. Based on what is learned through this collaboration, assumptions are made and iteratively improved about what documents may exist and where they are likely to be found. Efforts must be taken to instruct those who may have potentially responsive documents to avoid erasing them before they are found (this is called “litigation hold”). Then efforts are taken to copy potentially responsive documents, with metadata intact, into a central repository in which batch operations can take place. In recent years online repositories that enable remote access have become very popular for this purpose. Within this repository lawyers and properly qualified personnel can sort documents into groups using various search and de-duplication methodologies, set aside documents which are highly unlikely to contain useful information, then prioritize and assign remaining documents to lawyers for manual review. Reviewing attorneys then sort documents into responsive and non-responsive and privileged and non-privileged groupings. Eventually responsive, non-privileged documents are listed, converted into image files (TIFFs), and delivered to the other side, sometimes alongside copies of the documents in their original formats.
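For the engineers in the audience, the repository steps above can be sketched as a tiny pipeline. This is a simplified illustration, not a description of any particular product: it assumes exact-duplicate detection by SHA-256 content hash (only one of many de-duplication methods), and the `Document` class and function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Document:
    custodian: str            # whose files the document was collected from
    path: str                 # original location, kept as metadata
    content: bytes
    responsive: bool = False  # set by a reviewing attorney
    privileged: bool = False  # set by a reviewing attorney

def deduplicate(docs):
    """Keep the first copy of each distinct content, identified by SHA-256 digest."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def split_for_production(reviewed):
    """Separate reviewed documents into the production set and the privilege log."""
    # Responsive, non-privileged documents are delivered to the other side.
    produce = [d for d in reviewed if d.responsive and not d.privileged]
    # Responsive but privileged documents are only described on the privilege log.
    privilege_log = [d for d in reviewed if d.responsive and d.privileged]
    return produce, privilege_log
```

In use, collected documents would pass through `deduplicate`, attorneys would set the `responsive` and `privileged` flags during review, and `split_for_production` would yield the two lists that are ultimately delivered. Non-responsive documents simply fall out of both lists.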

Early Case Assessment (also called Early Data Assessment)

Even before receiving a discovery request, and sometimes even before a lawsuit has been filed, document review can be started in order to plan legal strategy (like settlement), prevent document erasure (“litigation holds”), etc. This preliminary review is called “Early Case Assessment” (or “Early Data Assessment”).

UPDATE: I describe the sources and development of legal procedural rules for e-discovery in a later blog post, Catch-22 for e-discovery standards?