Earlier this week I blogged about placing the locus of control for e-discovery decisions in the right hands to ensure that the decisions made pass muster in court. To illustrate the potential impact of moving the locus of control for certain decisions to an outsource partner, let’s compare the document review solutions offered by H5 and Inference Data.
Last month I attended a webinar presented by H5. One thing that struck me as distinctive about H5 is their standard deployment of a team of linguists to improve detection of responsive documents from among the thousands or millions of documents in a document review. During the webinar I submitted a question asking what it is their linguists do that attorneys can’t do themselves. One of their people was kind enough to answer, more or less saying “These guys are more expert at this query-building process than attorneys.” Ouch.
I’ve long prided myself on my search ability (ask me about the time I deployed a boolean double-negative in a Westlaw search for Puerto Rico “RICO” cases) and I’m sure many of my fellow attorneys are equally proud. However, I know people (or engineers, anyway) who are probably better at search than I am, and I know one or two otherwise blindingly brilliant attorneys who are seriously techno-lagged. More importantly, attorneys typically have a lot on their plates, and search expertise on a nitty-gritty “get the vocabulary exactly right” level is just one of a thousand equally important things on their minds, so it’s not realistically going to be a “core competency.” So I can see the wisdom in H5’s approach, although I wonder how many attorneys are willing to admit right out loud that they are better off outsourcing this competency.
I can see where, depending on a number of different factors, either solution might be better. I encourage anyone facing this choice to make an informed decision about which approach leads to the best results rather than relying on their knee-jerk reaction.
As an enthusiastic user of SaaS (“Software as a Service”) applications, I’ve increased my own productivity via the cloud. But while wearing my Information Governance hat I see companies becoming sensitized to information control and risk management issues arising from SaaS use. In particular:
Company intellectual property (“IP”) frequently leaks out through employees’ SaaS use, often when subject matter experts within a company naively collaborate with “colleagues” outside the company; and
Company information may be preserved indefinitely rather than being deleted at the end of its useful life, thus remaining available for eDiscovery when it shouldn’t be.
Pi Corp’s Smart Desktop project is described by EMC’s CTO Jeff Nick in this video taken at EMC World last year. In a nutshell, Smart Desktop is meant to:
provide a central portal for all of an individual’s information collected from all of the information sources they use;
index and classify that information so it can be used more productively; for example, when a user begins performing a particular task, the user will be prompted with a “view” (dashboard) of all of the information the system expects they will want, based on the user’s past performance and the system’s predictive intelligence algorithms;
“untether” information so that it is available to the user from any of the user’s devices, including mobile devices, and interchangeable across different sources; and
enable secure sharing such that people can share just the information they wish to share with those they want to share it with.
Once I’ve had a chance to evaluate Smart Desktop I’ll take a harder look at its Information Governance implications. Problems could arise for employers — albeit through no fault of Decho — if Smart Desktop (or Mozy, or another file sharing service, for that matter) is used by employees to share their employers’ IP with people outside of the company, or people within their company who have not been properly trained and cautioned about maintaining IP security. Similarly, if Smart Desktop (or Mozy, or another SaaS) enables employees to preserve company documents beyond their deletion dates, or to access company documents after they are no longer employees, this could prove difficult in eDiscovery or IP secrecy scenarios, where such information could become a costly surprise late in the game.
But for now I’ll presume that because Decho’s parent EMC has a strong Information Governance focus, Decho will ultimately provide not only the access controls that they currently envision, which will enable secure sharing across devices and users, but also group administration features that make it possible for companies to retain control over IP and information lifecycle management. In particular, I predict Decho will provide dynamic global indexing of information which enters any user account within a company’s user group, thereby making company information easy to find, place holds on, and collect for eDiscovery. I also predict Decho will offer document lifecycle management functionality, including automatically enforced retention and deletion policies.
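To make that last prediction concrete, here is a minimal sketch of what automatically enforced retention and deletion with legal-hold awareness might look like. The categories, retention periods, and function names are my own illustrative assumptions, not features of any Decho product.

```python
from datetime import datetime, timedelta

# Illustrative retention rules; the categories and periods are hypothetical.
RETENTION_PERIODS = {
    "contract":  timedelta(days=365 * 7),   # keep contracts seven years
    "email":     timedelta(days=365 * 2),   # keep routine email two years
    "marketing": timedelta(days=365),       # keep marketing drafts one year
}

def documents_to_delete(documents, legal_holds, today=None):
    """Return ids of documents past their retention period and not under legal hold."""
    today = today or datetime.utcnow()
    expired = []
    for doc in documents:  # doc: {"id": ..., "category": ..., "created": datetime}
        period = RETENTION_PERIODS.get(doc["category"])
        if period is None:
            continue                    # unknown category: leave for manual review
        if doc["id"] in legal_holds:
            continue                    # on hold for eDiscovery: never auto-delete
        if doc["created"] + period < today:
            expired.append(doc["id"])
    return expired
```

The point of the sketch is simply that retention enforcement and legal holds have to be evaluated together, which is exactly the kind of group-level control individual SaaS accounts lack today.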
And while I’m making a Decho wish list, two more items:
I strongly recommend reading Ron’s post for the benefit of his insights, whether or not you are already familiar with TREC Legal Track. I’d also like to offer my own observations about TREC Legal Track’s finding of low consistency between document classification decisions made by subject matter experts, who are spoken of as “gold standard” reviewers, and ordinary legal document reviewers. (In TREC Legal Track’s study, ordinary reviewers were 2nd and 3rd year law students. In real life the subject matter expert role is played by in-house or outside counsel, while much of the actual review work is performed by contract or outsource attorneys.)
Generally speaking, quality control processes involve benchmarking against some standard. Mechanical processes can be meaningfully benchmarked by physically sampling output (this is the essence of Six Sigma). For example, as machine parts come off an assembly line, samples can be selected and measured and the variance between their actual size and target size monitored not only to detect defects but to flag the processes responsible for defects. Human processes can also be benchmarked in a variety of ways. (This is in part the province of ITIL, the “Information Technology Infrastructure Library,” and the basis for the idea of “service level agreements”.) For example, those responsible for a customer service center may track the number of issues handled per hour, the type of issues handled, the number of resolutions or escalations per issue, revenue gained or lost per issue, etc.
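As a toy illustration of benchmarking by sampling (my own sketch, not a prescribed Six Sigma or ITIL procedure), the following samples a process’s output, measures deviation from a target, and flags the process when the defect rate crosses a threshold. The parameter names and thresholds are arbitrary.

```python
import random
import statistics

def sample_and_flag(measurements, target, tolerance,
                    sample_size=50, max_defect_rate=0.01):
    """Sample a process's output, measure deviation from the target dimension,
    and flag the process when too many samples fall outside tolerance."""
    sample = random.sample(measurements, min(sample_size, len(measurements)))
    deviations = [abs(m - target) for m in sample]
    defect_rate = sum(1 for d in deviations if d > tolerance) / len(sample)
    return {
        "mean_deviation": statistics.mean(deviations),
        "defect_rate": defect_rate,
        "flag_process": defect_rate > max_defect_rate,
    }

# e.g. sample_and_flag(part_lengths_mm, target=100.0, tolerance=0.5)
```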
Unfortunately, “responsiveness” and “privilege” are not only somewhat subjective in document review; the standards for responsiveness and privilege also vary from case to case. For this reason standards need to be developed “on the fly” for each case, and these standards will by necessity be arbitrary (aka subjective) to some degree even if consistently applied. The good news is that the latest generation of document clustering software incorporates tools for developing consistent document review standards on the fly. Through an iterative feedback loop, the humans educate the machines to look for documents with certain characteristics, while the machines force the humans to refine their conception of responsiveness and privilege to a degree that the machine can reliably model. After enough iterations have passed and the machine has reached some measurable standard of consistency, the humans can step back and let the machine do the rest of the review work, more consistently than human reviewers could themselves and at a much lower cost.
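Here is a rough sketch of that iterative feedback loop, using a generic text classifier. It is not the algorithm of any particular review product; the agreement threshold, batch size, and the ask_attorney callback are assumptions made purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def iterative_review(documents, ask_attorney, seed_indices,
                     target_agreement=0.95, batch_size=50):
    """Alternate attorney labeling with machine modeling until the model predicts
    the attorneys' calls consistently enough, then classify the whole collection.
    seed_indices should include both responsive and non-responsive examples."""
    X = TfidfVectorizer().fit_transform(documents)
    labels = {i: ask_attorney(documents[i]) for i in seed_indices}  # True = responsive
    unlabeled = [i for i in range(len(documents)) if i not in labels]
    while unlabeled:
        model = LogisticRegression(max_iter=1000).fit(
            X[list(labels)], [labels[i] for i in labels])
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        new_labels = {i: ask_attorney(documents[i]) for i in batch}
        predictions = model.predict(X[batch])
        # How often does the machine's call match the attorneys' on this fresh batch?
        agreement = sum(p == new_labels[i] for p, i in zip(predictions, batch)) / len(batch)
        labels.update(new_labels)
        if agreement >= target_agreement:
            break   # the standard is now consistent enough to trust the model
    model = LogisticRegression(max_iter=1000).fit(
        X[list(labels)], [labels[i] for i in labels])
    return model.predict(X)   # machine-coded responsiveness for every document
```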
With document review, the very idea of defining a “gold standard” for classification is less useful than it sounds. For instance, even if a panel of leading legal scholars could be formed for each eDiscovery matter, the mere fact that someone legitimately may be called a leading scholar doesn’t mean that their views will be consistent with anyone else’s — just well reasoned. But a “gold standard” is not what’s important here. What’s important is that in each case the attorneys responsible for responding to a document request do everything they can to carefully define and consistently enforce reasonable document review standards. This is what the current crop of document clustering applications is intended to do. That is the current model, anyway. I don’t pretend to be able to name the vendors who can or cannot deliver on this promise, although I think this will be the number one question in eDiscovery technology before long.
Bruce: I wonder to what extent document categories that are created using document clustering software when reviewing documents for eDiscovery can be aggregated across multiple document requests and/or law suits within the same company. Can previously developed categories or tags be reused to seed and thus speed up document review in other cases?
Richard: Regarding the notion of aggregating document categories, etc., it’s something that’s technically very feasible. And it could greatly speed document review if categories could be used to “seed” new reviews, new cases, etc. Here’s the challenge: we have found that most of the “categories” developed by our clients start out case-specific, and are too granular to be valuable when the next case comes along. It also hasn’t seemed to matter whether categorization was being used by a corporate legal department or an outside counsel – they’re equally specific.
The idea itself had merit, so we tossed it around with our Product Solutions Architects, and they came up with several observations. First of all, the categories people develop are driven by their need to solve a specific eDiscovery challenge, i.e. documents that are responsive to the case at hand. Second, when the next issue or case comes along, they naturally start over again, first by identifying responsive documents and then by using those documents to create categories – any “overlap” is purely coincidental. Finally, to develop categories that were really useful across a variety of issues or cases, they would need to be fairly generic and probably not developed with any specific case in mind.
I think that’s very hard to do for a first or even second-level review – it’s not necessarily a natural progression, as people work backwards from the issues at hand. Privilege review, however, could be a different animal. There are some things in any case that invoke privilege because of the particulars of the case, for example, attorney-client conversations which are likely to involve different individuals in different litigation matters. There are other things that could logically be generic – company “trade secrets” for example would almost always be treated as privileged, as would certain normally-redacted items such as PII (personally-identifiable information). Privilege review is also a very expensive aspect of eDiscovery, since it involves physical “reads” using highly-paid attorneys (not something you can comfortably offshore). Could “cloud seeding” have value for this aspect of eDiscovery? It’s an interesting thought.
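To make the “generic category” idea concrete: patterns like standard PII formats do not depend on the case at hand, so they could plausibly seed a reusable “normally redacted” category across matters. The patterns below are illustrative only and far from exhaustive.

```python
import re

# Case-independent patterns that could seed a reusable "normally redacted" category.
PII_PATTERNS = {
    "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def flag_pii(text):
    """Return the PII categories detected in a document's text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

# e.g. flag_pii("Call me at home. SSN 123-45-6789.")  ->  ["ssn"]
```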
Panda Security recently released (in beta form) what it claims is the first cloud-based anti-virus / anti-malware solution for Windows PCs. Not only does it sound like a clever tool for data loss prevention, but it demonstrates another way in which information service providers can aggregate individual user data to develop classifications or benchmarks valuable to every user, a mechanism I’ve explored in previous blog posts.
In essence, every computer using Panda’s Cloud Antivirus is networked together through Panda’s server to form a “collective intelligence” for malware detection and prevention. Here’s how it works: users download and install Panda’s software – it’s a small application known as an “agent” because the heavy lifting is done on Panda’s server. These agents send reports back to the Panda server containing information about new files (and, I presume, related computer activity which might indicate the presence of malware). When the server receives reports about previously unknown files which resemble, according to the logic of the classification engine, files already known to be malware, these new files are classified as threats without waiting for manual review by human security experts.
For example, imagine a new virus is released onto the net by its creators. People surfing the net, opening emails, and inserting digital media start downloading this new file, which can’t be identified as a virus by traditional anti-virus software because it hasn’t been placed in any virus definitions list yet. Computers on which the Panda agent has been installed begin sending reports about the new file back to the Panda server. After some number of reports about the file are received by Panda’s server, the server is able to determine that the new file should be treated as a virus. At this point all computers in the Panda customer network are preemptively warned about the virus, even though it has only just appeared.
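A drastically simplified sketch of that mechanism follows. This is not Panda’s actual implementation; the report threshold, similarity measure, and data structures are assumptions chosen to show the shape of the logic.

```python
from collections import defaultdict

KNOWN_MALWARE_FEATURES = []        # feature-fingerprint sets of confirmed malware
report_counts = defaultdict(int)   # file hash -> number of agent reports seen so far
REPORT_THRESHOLD = 100             # reports required before auto-classification
SIMILARITY_THRESHOLD = 0.8         # how closely a file must resemble known malware

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def handle_agent_report(file_hash, features, blacklist):
    """Server-side handling of one agent report: classify the file as a threat once
    enough agents have reported it and it resembles known malware closely enough."""
    report_counts[file_hash] += 1
    if file_hash in blacklist:
        return "known threat"
    similarity = max((jaccard(features, known) for known in KNOWN_MALWARE_FEATURES),
                     default=0.0)
    if report_counts[file_hash] >= REPORT_THRESHOLD and similarity >= SIMILARITY_THRESHOLD:
        blacklist.add(file_hash)   # every networked agent is now warned preemptively
        return "newly classified threat"
    return "unknown"
```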
Utilizing Panda’s proprietary cloud computing technology called Collective Intelligence, Panda Cloud Antivirus harnesses the knowledge of Panda’s global community of millions of users to automatically identify and classify new malware strains in almost real-time. Each new file received by Collective Intelligence is automatically classified in under six minutes. Collective Intelligence servers automatically receive and classify over 50,000 new samples every day. In addition, Panda’s Collective Intelligence system correlates malware information data collected from each PC to continually improve protection for the community of users.
Because Panda’s solution is cloud-based and free to consumers, it will reside on a large number of different computers and networks worldwide. This is how Panda’s cloud solution is able to fill a dual role as both sampling and classification engine for virus activity. On the one hand Panda serves as manager of a communal knowledge pool that benefits all consumers participating in the free service. On the other hand, Panda can sell the malware detection knowledge it gains to corporate customers – wherein lies the revenue model that pays for the free service.
I have friends working in two unrelated startups, one concerning business financial data and the other Enterprise application deployment ROI, that both work along similar lines (although neither are free to consumers). Both startups offer a combination of analytics for each customer’s data plus access to benchmarks established by anonymously aggregating data across customers.
Panda’s cloud analytics, aggregation and classification mechanism is also analogous to the non-boolean document categorization software for eDiscovery discussed in previous posts in this blog, whereby unreviewed documents can be automatically (and thus inexpensively) classified for responsiveness and privilege:
Deeper, even more powerful extensions of this principle are also possible. I anticipate that we will soon see software which will automatically classify all of an organization’s documents as they are created or received, including documents residing on employees’ laptops and mobile devices. Using Panda-like classification logic, new documents will be classified accurately whether or not they are an exact match with anything previously known to the classification system. This will substantially improve implementation speed and accuracy for search, access control and collaboration, document deletion and preservation, end point protection, storage tiering, and all other IT, legal and business information management policies.
From a business perspective, information should be handled like property. Like assets or supplies, information needs management.
Companies set policies to govern use, storage, and disposal of assets and office supplies. Companies also need to make and enforce rules governing electronically stored information, including how it is organized (who has access and how), stored (where and at what cost), retained (including backups and archives), and destroyed (deletion and non-deletion both have significant legal and cost consequences). These policies must balance the business, legal and technical needs of the company. Without them, a company opens itself up to losses from missed opportunities, employee theft, lawsuits, and numerous other risks.
Some information is analogous to company ASSETS. For example, let’s suppose a certain sales proposal took someone a week to write and required approval and edits from four other people plus 6 hours of graphics production time. An accountant isn’t going to list that proposal on the company books. But it is an asset. It can be edited and resubmitted to other potential customers in a fraction of the time it took to create the original. As with the machinery, furniture, or hand tools used to operate a business, company money was spent obtaining this information, and it will retain some value for some time. It should be managed like an asset.
Some information is analogous to OFFICE SUPPLIES. For example, a manager spends a number of hours customizing a laptop with email account settings, browser bookmarks and passwords, ribbon and plugin preferences, nested document folders, security settings, etc. That customization information is crucial for the manager’s productivity in much the same way as having pens in the drawer, paper in the printer, staples in the stapler, and water in the water-cooler can be important for productivity. Productivity will be lost if that customization information is lost, and it needs to be managed just as much as office supplies need to be managed.
From a business perspective, when company information is lost or damaged, or when users are under or over supplied, it is no different from mismanagement of company assets and office supplies.
Setting an information policy means:
identifying information use and control needs;
making choices and tradeoffs about how to meet those needs; and
taking responsibility for results and an ongoing process (setting goals / taking action / measuring progress / adjusting).
Information governance policy is an on-going process for managing valuable company information. All of the stakeholders – in particular, business units, IT, and Legal – must collaborate in order to draw a bullseye on company information management needs. The right people in the organization must be charged with responsibility for getting results or for making changes needed to get results.
Without a doubt, it takes time and money, and requires collaboration, to develop a “policy.” But we’re all accustomed to this type of preparation already. Let’s look at simple, familiar professional standards for just a moment:
Software developers test software on actual users and correct bugs and (hopefully) mistaken assumptions before releasing it. Avoiding these steps will undoubtedly lead to loss and possibly bankruptcy.
Attorneys meet with clients before going to trial to prepare both the client and the attorneys. If they don’t they risk losing their clients millions, or getting them locked up.
Advance preparation is as essential in information management as it is in software development and trial practice. Simply ignoring the issue, or dumping it on one person or a single department (like IT or Legal) can be very costly. Avoiding the planning component of information management is like putting in only 80% of the time and effort needed for the company to succeed. Avoiding 20% of the time and effort doesn’t yield a “savings” when the outcome is failure, as when an employee steals documents, essential information is lost when a building with no computer back-ups burns down, or old documents which would have been deleted under a proper information governance policy turn up in a lawsuit and cost millions.
Information policy does NOT flow from any of the following all-too-common realities:
The first meeting between the Legal Department and the new eDiscovery vendor is also the first meeting between Legal and IT (true story);
Ever since a certain person from the General Counsel’s office was made the head of corporate records management, no one in IT will talk to that person (true story);
Information management technology alone, without a company-specific understanding of the problems that the technology is meant to solve, is not a recipe for success. A recent article by Carol Sliwa, published by SearchStorage.com (April 22, 2009), offers a detailed look at issues surrounding efforts to reduce storage costs by assessing how information is being used and moving it to the least expensive storage tier possible.
The article has some powerful suggestions on developing information policy. First, Karthik Kannan, vice president of marketing and business development at Kazeon Systems Inc.:
“What we discovered over time is that customers need to be able to take some action on the data, not just find it…. Nobody wants to do data classification just for the sake of it. It has to be coupled with a strong business reason.”
“In order to really realize and get the benefit of data and storage classification, you have to start with a business process…. And it has to start from conversations with the business units and understanding the needs and requirements of the business. Only at the end, once you actually have everything in place, should you be looking at technology because then you’ll have a better set of requirements for that technology.”
It takes time, money and cooperation between departments that may have never worked together before to develop a working information governance policy. But that is not a reason to skip — or skimp on — the process. Companies need to protect their assets and productivity, and information governance has become an essential area for doing just that.
I recently had the pleasure of speaking with Nicholas Croce, President of Inference Data, a provider of innovative analytics and review software for eDiscovery, following the company’s recent webinar, De-Mystifying Analytics. During our conversation I discovered that Nick is double-qualified as a legal technology visionary. He not only founded Inference, but has been involved with legal technologies for more than 12 years. Particularly focused on the intersection of technology and the law, Nick was directly involved in setting the standards for technology in the courtroom through working personally with the Federal Judicial Center and the Administrative Office of the US Courts.
I asked to speak with Nick because I wanted to pin him down on what I imagined I heard him say (between the words he actually spoke) during the live webinar he presented in mid-March. The hour-long interview and conversation ranged in topic, but was very specific in terms of where Nick sees the eDiscovery market going.
Sure enough, during our conversation Nick confirmed and further explained that he and his team, which includes CEO Lou Andreozzi, the former LexisNexis NA (North American Legal Markets) Chief Executive Officer, have designed Inference with not one, but two models of advanced eDiscovery analytics and legal review in mind.
In a nutshell, Inference is designed not only to deliver the current model of eDiscovery software analytics, which I have dubbed “Software Queued Review,” but the next generation analytics model as well, which I am currently calling “Statistically Validated Automated Review” (Nick calls it “auto-coding”).
I have a few specific questions to ask, but in general what I’d like to cover is:
1) where does Inference fit within the eDiscovery ecosystem,
2) how you think statistically validated discovery will ultimately be used, and
3) how you think the left side of the EDRM diagram (which is where document identification, collection, and preservation are situated) is going to evolve?
Nick: To first give some perspective on the genesis of Inference, it’s important to understand the environment in which it was developed. Prior to founding Inference I was President of DOAR Litigation Consulting. When I started at DOAR in 1997, the company was really more of a hardware company than anything else. I was privileged to be involved in the conversion of courtroom technology from wooden benches to the efficient digital displays of evidence we see today. Within a few years we became the predominant provider of courtroom technology, and it was amazing to see the legal system change and directly benefit from the introduction of technology. As people saw the dramatic benefits, and started saying “how do we use it?” we created a consulting arm around eDiscovery which provided the insight to see that this same type of evolution was needed within the discovery process.
This began around 2004-2005 when we started to see an avalanche of ESI (“Electronically Stored Information”) coming, and George Socha became a much needed voice in the field of eDiscovery. As a businessman I was reading about what was happening, and asking questions, and it seemed black and white to me – it had become impossible to review everything with existing technology because of the tremendous volume of ESI. As a result I started developing new technology, not only to manage the discovery of large data collections, but to improve and bring a new level of sophistication to the entire legal discovery process.
Inference was developed to help clients intelligently mine and review data, organize case workflow and strategy, and streamline and accelerate review. It’s the total process. But, today I still have to fight “the short term fix mentality” – lawyers who just care about “how do I get through this stuff faster”, which is the approach of some other providers, and which also relates to the transition I see in the EDRM model – I want to see the whole thing change.
Review is the highest dollar amount and the biggest pain; 70% of a corporation’s legal costs are within eDiscovery. People want to, and need to, speed up review. However, we also need to add intelligence back into the process.
Bruce: What differentiates Inference, where does it fit in?
Nick: I, and Inference, went further than just accelerating linear review and said: it has to be dynamic, not just coding documents as responsive / non-responsive. I know this is going to sound cheesy I guess, but – you have to put “discovery” back into Discovery. You need to be able to quickly find documents during a deposition when a deponent says something like “I never saw a document from Larry about our financial statements”, and not just search for “responsive: yes/no”, “privileged: yes/no.”
Inference was, and is, designed to be dynamic – providing suggestions to reviewers, opportunities to see relationships between documents and document sets not previously perceived, helping to guide attorneys – intuitively. Inference follows standard, accepted methodologies, including Boolean keyword search, field and parametric search, and incorporates all of the tools required for review – redaction, subjective coding, production, etc.
In addition to that overriding principle, we wanted the ability to get data in from anyone, anywhere and at any time. Regulators are requiring incredibly aggressive production timelines; serial litigants re-use the same data set over and over; CIOs are trying to get control over searching data more effectively, including video and audio. Inference is designed to take ownership of data once it leaves the corporation, whether it is structured, semi-structured or unstructured data.
Inside the firewall, the steps on the left side of the EDRM model are being combined. Autonomy, EMC, Clearwell, StoredIQ — the crawling technologies – these companies are within inches of extracting metadata during the crawling process, and may be there already. This is where Inference comes in since we can ingest this data directly. I call it the disintermediation of processing because at that point there are no additional processing costs.
In the past someone would use EnCase for preservation, then Applied Discovery for processing (using date ranges and Boolean search terms), and you’d pay for processing at some cost per custodian and per drive. It used to be over $2,500 per gig; now it’s more like $600 to $1,500 per gig, depending on multi-language use and such.
But once corporations automate the process with crawling and indexing solutions, all of the information goes right into Inference without the intermediary steps, which puts intelligence back in the process. You can ask the system to guide you whenever there’s a particular case, or an issue. If I know the issue is a conversation between Jeff and Michele during a certain date range, I can prime the system with that information, start finding stuff, and start looking at settlement of the dispute. But without automation it can take months to do, at much higher costs.
Inference also offers quality control aspects not previously available: after, say, one month, you can use the software to check review quality, find rogue reviewers, and fix the process. You can also ultimately do auto-coding.
Bruce: I think this is a good opening for a segue to the next question: how will analytics ultimately be used in eDiscovery?
Nick: The two most basic components of review are “responsive” and “privileged.” I learned from the public testimony of Verizon’s director of eDiscovery, Patrick Oot, some very strong statistics from a major action they were involved in. The first level document review expense was astounding even before the issues were identified. The total cost of responsive and privileged review was something like $13.6 million.
The truth is that companies only do so many things. Pharma companies aren’t generally talking about real-estate transactions or baseball contracts.
Which brings us to auto-coding… sometimes I try to avoid the term “auto-coding” in favor of “computer-aided” or “computer-recommended” coding. When someone says “the computer did it” attorneys tend to shut down, but if someone says “the computer recommended it” then they pay attention.
Basically auto-coding is applying issue tags to the population based on a sampling of documents. The way we do it is very accurate because it is iterative. It uses statistically sound sampling and recurrent models. It uses the same technology as concept clustering, but you cluster a much smaller percentage. Let’s say you create 10 clusters, tag those, then have the computer tag other documents consistently with the same concepts. Essentially, the computer makes recommendations which are then confirmed by an attorney, and the cycle is repeated until the necessary accuracy level has been achieved. This enables you to only look at a small percentage of the total document population.
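As an aside, the cluster-and-propagate idea Nick describes can be sketched in generic terms as follows. This is not Inference’s implementation; the cluster count, sample size, and the ask_attorney_to_tag callback are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def auto_code(documents, ask_attorney_to_tag, n_clusters=10, sample_per_cluster=20):
    """Have attorneys tag a small sample from each concept cluster, then recommend
    the dominant tag for the rest of that cluster's documents."""
    X = TfidfVectorizer().fit_transform(documents)
    cluster_of = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    tags = {}
    for c in range(n_clusters):
        members = [i for i, label in enumerate(cluster_of) if label == c]
        if not members:
            continue
        sample = members[:sample_per_cluster]
        sampled = {i: ask_attorney_to_tag(documents[i]) for i in sample}  # attorney calls
        counts = {}
        for tag in sampled.values():
            counts[tag] = counts.get(tag, 0) + 1
        recommended = max(counts, key=counts.get)    # dominant tag for this cluster
        for i in members:
            tags[i] = sampled.get(i, recommended)    # attorney call wins where available
    return tags
```

In practice the loop would be repeated, with attorneys confirming or correcting the recommendations, until the measured accuracy reaches the agreed standard.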
Bruce: I spoke with one of the statistical sampling gurus at Navigant Consulting last month, who suggests that software validated by statistical sampling can be more accurate than human reviewers, with fewer errors, for analyzing large quantities of documents.
Nick: It makes sense. Document review is very labor intensive and redundant. Think about the type of documents you’re tagging for issues – it doesn’t even need to be conscious: it is an extremely rote activity on many levels which just lends itself to human error.
Bruce: So let’s talk about what needs to happen before auto-coding becomes accepted, and becomes the rule rather than the exception. In your webinar presentation you danced around this a bit, saying, in effect, that we’re waiting for the right alignment of law firms, cases, and a judge’s decision. In my experience as an attorney, including some background in civil rights cases, the way to go about this is by deliberately seeking out best-case-scenario disputes that will become “test cases.” A party who has done its homework stands up and insists on using statistically validated auto-coding in an influential court, here we probably want the DC Circuit, the Second Circuit, or the Ninth Circuit, I suppose. When those disputes result in a ruling on statistical validity, the law will change and everything else will follow. Do you know of any companies in a position to do this, to set up test cases, and have you discussed it with anyone?
Nick: Test cases: who is going to commit to this — the general counsel? Who do they have to convince? Their outside counsel, who, ultimately, has to be comfortable with the potential outcome. But lawyers are trained to mitigate risk, and for now they see auto-coding or statistical sampling as a risk. I am working with a couple of counsel with scientific and/or mathematic backgrounds who “get” Bayesian methods, and the benefits of using them. Once precedents are set establishing the use of statistical analysis as reasonable, it will be a risk not to use these technologies. As with legal research, online research tools were initially considered a risk. Now, it can be considered malpractice not to use them.
It can be frustrating for technologists to wait, but that’s how it is. Sometimes when we are following up with new installs of Inference we find that 6 weeks later they’ve gone back to using simple search instead of the advanced analytics tools. But even those people, after using the advanced features for a few months, finally discover they can no longer live without them.
Bruce: Would you care to offer a prediction as to when these precedents will be set?
Nick: I really believe it will happen, there’s no ambiguity. I just don’t know if it’s 6 months or a year. But general counsel are taking a more active role because of the cost of litigation and the economy, and they are looking at expenses more closely. At some point there will be that GC and outside counsel combination that will make it happen.
Bruce: After hearing my statistician friend from Navigant deliver a presentation on statistical sampling at LegalTech last month I found myself wondering why parties requesting documents wouldn’t want to insist that statistically validated coding be used by parties producing documents for the simple reason that this improves accuracy. What do you think?
Nick: Requesting parties are never going to say “I trust you.”
Bruce: But like they do now, the parties will still have to be able to discuss and will be expected to reach agreement about the search methods being used, right?
Nick: You can agree to the rules, but the producing party can choose a strategy that will be used to manage their own workflow – for example today they can do it linearly, offshore, or using analytics. The requesting party will leave the burden on the producing party.
Bruce: If they are only concerned with jointly defining responsiveness, in order to get a better-culled set of documents — that helps both sides?
Nick: That would be down the road… at that point my vision gets very cloudy, maybe opposing counsel gets access to concept searches – and they can negotiate over the concepts to be produced.
Bruce: There are many approaches to eDiscovery analytics. Will there have to be separate precedents set for each mathematical method used by analytics vendors, or even for each vendor-provided analytical solution?
Nick: I’d love to have Inference be the first case. But I don’t know how important the specific algorithm or methodology is going to be – that is a judicial issue. Right now we’re waiting for the perfect judge and the perfect case – so I’ll hope it’s Inference, rather than “generic” as to which analytics are used. I hope there’s a vendor shakeout – for example, ontology-based analytics systems demo nicely, but “raptor” renders “birds,” which is non-responsive, while “Raptor” is a critical responsive term in the Enron case.
Bruce: Perhaps vendors and other major stakeholders in the use of analytics in eDiscovery, for example, the National Archives, should be tracking ongoing discovery disputes and be prepared to file amicus briefs when possible to help support the development of good precedents.
On an unrelated note, except that it’s also rather geeky, here’s a trend in how databases are being structured that a cloud computing buddy of mine recently tuned me in to. This article does an excellent job of simply describing why some of the most prominent database projects in the realm of cloud computing are moving away from traditional relational database management systems (RDBMS) — think SQL queries, normalized tables, and all that jazz — and towards less standardized, more XML-ish “key/value” data structures. The brief Wikipedia article entitled Document-oriented Database also offers some simply-stated information on this topic.
UPDATE: To improve scalability the social media and networking web application FriendFeed has started using a sort of a hybrid database that involves storing JSON (“a lightweight computer data interchange format”) objects within a simplified MySQL data structure. They’re pretty happy with their results, which speak louder than words. But as my (same) cloud computing buddy points out, this still falls short of achieving the goals of non-RDBMS databases such as CouchDB: “if you’re taking a relational DB and shoving JSON objects into it, you have to start asking yourself if there’s a more efficient way to store that data.”
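For the curious, here is a minimal sketch of that hybrid pattern: an opaque JSON blob stored in a simple key/value table. SQLite stands in for MySQL here only to keep the example self-contained; the table and field names are illustrative, not FriendFeed’s actual schema.

```python
import json
import sqlite3

# A FriendFeed-style hybrid: an opaque JSON blob in a simple key/value table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT, updated INTEGER)")

def save(entity_id, obj, updated):
    db.execute("REPLACE INTO entities (id, body, updated) VALUES (?, ?, ?)",
               (entity_id, json.dumps(obj), updated))

def load(entity_id):
    row = db.execute("SELECT body FROM entities WHERE id = ?", (entity_id,)).fetchone()
    return json.loads(row[0]) if row else None

save("post:42", {"author": "bruce", "title": "Cloud databases", "tags": ["nosql"]}, 1)
print(load("post:42")["title"])  # schema changes need no ALTER TABLE: just new JSON keys
```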
One of the most interesting technical issues being discussed at LegalTech last week was the question of how to classify, analyze, and review “unstructured” information like the content of emails, text documents, and presentations.
A familiar, simple-sounding answer leaps immediately to mind. Why not just hook all of these documents up to a search engine “crawler,” index all of the words in all the documents, and then run ordinary key-word searches on the whole set? It’s exactly like conducting Google searches, except instead of spanning a big chunk of the entire internet we only have to cover a few terabytes of corporate information – right?
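For the sake of discussion, here is roughly what that naive approach amounts to: a toy inverted index with simple AND keyword matching. The document contents are made up for illustration.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids containing it (a toy inverted index)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, *terms):
    """Return ids of documents containing every search term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "Quarterly revenue forecast", 2: "Forecast for the Raptor entity", 3: "Lunch menu"}
index = build_index(docs)
print(keyword_search(index, "forecast"))  # {1, 2}
```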
The bad news is that there are a number of wrinkles in the territory surrounding eDiscovery that render a pure Google-style search model less than perfect. The good news is that a variety of vendors offer well conceived solutions meant to take these wrinkles into account. The remainder of this post will introduce some of the wrinkles; later posts will be concerned with vendor solutions.
What’s different about eDiscovery from a Search perspective?
In an eDiscovery context the ideal for a classification and search solution is to allow searchers to identify ALL documents which meet their criteria, not just “the ten most relevant documents” or “at least one document that answers my question” as is common with Google searches. Imagine running a Google search which returned 20,000 responses. You think this number is too big — it’s overinclusive — but you don’t want to risk missing any relevant documents. Then imagine getting a bill for paying a team of attorneys at rates in excess of $100 / hour to read all of those documents in order to determine whether, in addition to containing the key words you selected, the documents are actually relevant to the particular law suit they are concerned with.
Another common instance of overinclusiveness arises because unstructured information repositories such as email accounts frequently contain multiple versions of the same chunks of content. Many documents will repeat some content from earlier versions and add some new content. For example, when emails are replied to, forwarded, or sent to multiple recipients, content already in the information pool is duplicated, and new information (email headers or comments) will have been added. Using conventional search, all versions of every document will fall within search results and must be manually reviewed at great cost to understand what is important and what is merely redundant.
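One simplified way to catch such near-duplicates is to compare overlapping word “shingles” between documents. Commercial near-de-duplication engines are far more sophisticated; the sketch below, with an arbitrary similarity threshold, just shows the basic idea.

```python
def shingles(text, k=5):
    """Break text into overlapping k-word chunks ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def near_duplicate(text_a, text_b, threshold=0.7):
    """Treat two documents as near-duplicates when their shingle sets mostly overlap,
    as with an email and the reply that quotes it in full."""
    a, b = shingles(text_a), shingles(text_b)
    similarity = len(a & b) / len(a | b)
    return similarity >= threshold
```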
Another potential problem involves choosing key words correctly. One could easily choose key words that are logically related to the topic at hand, and return a large number of relevant documents, but which miss many, or the most important documents in the document pool (the search results are underinclusive). What if, as in the Enron case, “code” words were adopted by perpetrators of a scam in an effort to cover their tracks? What if some number of documents are written in a language the searchers don’t speak, or use words or terms not familiar to searchers?
Solutions to these problems that various vendors have devised include semantic clustering, multi-variate analysis of word positioning and frequency, key words plus associative groupings, near de-duplication processes, and more. Each comes with both strengths and weaknesses, of course — to be discussed in future posts.