Earlier this week I blogged about placing the locus of control for e-discovery decisions in the right hands to ensure that the decisions made pass muster in court. To illustrate the potential impact of moving the locus of control for certain decisions to an outsource partner, let’s compare the document review solutions offered by H5 and Inference Data.
Last month I attended a webinar presented by H5. One thing that struck me as distinctive about H5 is their standard deployment of a team of linguists to improve detection of responsive documents from among the thousands or millions of documents in a document review. During the webinar I submitted a question asking what it is their linguists do that attorneys can’t do themselves. One of their people was kind enough to answer, more or less saying “These guys are more expert at this query-building process than attorneys.” Ouch.
I’ve long prided myself on my search ability (ask me about the time I deployed a boolean double-negative in a Westlaw search for Puerto Rico “RICO” cases) and I’m sure many of my fellow attorneys are equally proud. However, I know people (or engineers, anyway) who are probably better at search than I am, and I know one or two otherwise blindingly brilliant attorneys who are seriously techno-lagged. More importantly, attorneys typically have a lot on their plates, and search expertise on a nitty-gritty “get the vocabulary exactly right” level is just one of a thousand equally important things on their minds, so it’s not realistically going to be a “core competency.” So I can see the wisdom in H5’s approach, although I wonder how many attorneys are willing to admit right out loud that they are better off outsourcing this competency.
I can see where, depending on a number of different factors, either solution might be better. I encourage anyone facing this choice to make an informed decision about which approach leads to the best results rather than relying on their knee-jerk reaction.
As an enthusiastic user of SaaS (“Software as a Service”) applications, I’ve increased my own productivity via the cloud. But while wearing my Information Governance hat I see companies becoming sensitized to information control and risk management issues arising from SaaS use. In particular:
Company intellectual property (“IP”) frequently leaks out through employees’ SaaS use, often when subject matter experts within a company naively collaborate with “colleagues” outside the company; and
Company information may be preserved indefinitely rather than being deleted at the end of its useful life, thus remaining available for eDiscovery when it shouldn’t be.
Pi Corp’s Smart Desktop project is described by EMC’s CTO Jeff Nick in this video taken at EMC World last year. In a nutshell, Smart Desktop is meant to:
provide a central portal for all of an individual’s information collected from all of the information sources they use;
index and classify that information so it can be used more productively, for example, when a user begins performing a particular task the user will be prompted with a “view” (dashboard) of all of the information the system expects they will want, based on the user’s past performance and the system’s predictive intelligence algorithms;
“untether” information so that it is available to the user from any of the user’s devices, including mobile devices, and interchangeable across different sources; and
enable secure sharing such that people can share just the information they wish to share with those they want to share it with.
Once I’ve had a chance to evaluate Smart Desktop I’ll take a harder look at its Information Governance implications. Problems could arise for employers — albeit through no fault of Decho — if Smart Desktop (or Mozy, or another file sharing service, for that matter) is used by employees to share their employers’ IP with people outside of the company, or people within their company who have not been properly trained and cautioned about maintaining IP security. Similarly, if Smart Desktop (or Mozy, or another SaaS) enables employees to preserve company documents beyond their deletion dates, or to access company documents after they are no longer employees, this could prove difficult in eDiscovery or IP secrecy scenarios, where such information could become a costly surprise late in the game.
But for now I’ll presume that because Decho’s parent EMC has a strong Information Governance focus, Decho will ultimately provide not only the access controls that they currently envision, which will enable secure sharing across devices and users, but also group administration features that make it possible for companies to retain control over IP and information lifecycle management. In particular, I predict Decho will provide dynamic global indexing of information which enters any user account within a company’s user group, thereby making company information easy to find, place holds on, and collect for eDiscovery. I also predict Decho will offer document lifecycle management functionality, including automatically enforced retention and deletion policies.
And while I’m making a Decho wish list, two more items:
Bruce: I wonder to what extent document categories that are created using document clustering software when reviewing documents for eDiscovery can be aggregated across multiple document requests and/or law suits within the same company. Can previously developed categories or tags be reused to seed and thus speed up document review in other cases?
Richard: Regarding the notion of aggregating document categories, etc., it’s something that’s technically very feasible. And it could greatly speed document review if categories could be used to “seed” new reviews, new cases, etc. Here’s the challenge: we have found that most of the “categories” developed by our clients start out case-specific, and are too granular to be valuable when the next case comes along. It also hasn’t seemed to matter whether categorization was being used by a corporate legal department or an outside counsel – they’re equally specific.
The idea itself had merit, so we tossed it around with our Product Solutions Architects, and they came up with several observations. First of all, the categories people develop are driven by their need to solve a specific eDiscovery challenge, i.e. documents that are responsive to the case at hand. Second, when the next issue or case comes along, they naturally start over again, first by identifying responsive documents and then by using those documents to create categories – any “overlap” is purely coincidental. Finally, to develop categories that were really useful across a variety of issues or cases, they would need to be fairly generic and probably not developed with any specific case in mind.
I think that’s very hard to do for a first or even second-level review – it’s not necessarily a natural progression, as people work backwards from the issues at hand. Privilege review, however, could be a different animal. There are some things in any case that invoke privilege because of the particulars of the case, for example, attorney-client conversations which are likely to involve different individuals in different litigation matters. There are other things that could logically be generic – company “trade secrets” for example would almost always be treated as privileged, as are certain normally-redacted items such as PII (personally-identifiable information). Privilege review is also a very expensive aspect of eDiscovery, since it involves physical “reads” using highly-paid attorneys (not something you can comfortably offshore). Could “cloud seeding” have value for this aspect of eDiscovery? It’s an interesting thought.
Panda Security recently released (in beta form) what it claims is the first cloud-based anti-virus / anti-malware solution for Windows PCs. Not only does it sound like a clever tool for data loss prevention, but it demonstrates another way in which information service providers can aggregate individual user data to develop classifications or benchmarks valuable to every user, a mechanism I’ve explored in previous blog posts.
In essence, every computer using Panda’s Cloud Antivirus is networked together through Panda’s server to form a “collective intelligence” for malware detection and prevention. Here’s how it works: users download and install Panda’s software – it’s a small application known as an “agent” because the heavy lifting is done on Panda’s server. These agents send reports back to the Panda server containing information about new files (and, I presume, related computer activity which might indicate the presence of malware). When the server receives reports about previously unknown files which resemble, according to the logic of the classification engine, files already known to be malware, these new files are classified as threats without waiting for manual review by human security experts.
For example, imagine a new virus is released onto the net by its creators. People surfing the net, opening emails, and inserting digital media start downloading this new file, which can’t be identified as a virus by traditional anti-virus software because it hasn’t been placed in any virus definitions list yet. Computers on which the Panda agent has been installed begin sending reports about the new file back to the Panda server. After some number of reports about the file are received by Panda’s server, the server is able to determine that the new file should be treated as a virus. At this point all computers in the Panda customer network are preemptively warned about the virus, even though it has only just appeared.
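To make the "some number of reports" idea concrete, here is a minimal sketch of a server-side tally. It is an illustration only, not Panda's actual logic: the class name, the `report_threshold` value, and the rule of counting distinct reporting agents are all assumptions of mine.

```python
from collections import defaultdict


class CollectiveClassifier:
    """Toy server-side tally of agent reports (hypothetical, not Panda's logic)."""

    def __init__(self, report_threshold=3):
        # Assumed rule: flag a file once this many distinct agents
        # have reported it as resembling known malware.
        self.report_threshold = report_threshold
        self.suspicious_reporters = defaultdict(set)  # file hash -> agent ids
        self.flagged = set()

    def report(self, agent_id, file_hash, resembles_malware):
        """Record one agent report; return True if the file is now flagged."""
        if resembles_malware:
            self.suspicious_reporters[file_hash].add(agent_id)
        if len(self.suspicious_reporters[file_hash]) >= self.report_threshold:
            self.flagged.add(file_hash)
        return file_hash in self.flagged


server = CollectiveClassifier(report_threshold=3)
results = [
    server.report("pc-1", "abc123", True),
    server.report("pc-1", "abc123", True),  # same agent again, not counted twice
    server.report("pc-2", "abc123", True),
    server.report("pc-3", "abc123", True),  # third distinct agent tips the tally
]
print(results)
```

Once a hash lands in `flagged`, a real service would presumably push a warning to every agent, including machines that have never seen the file.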
Utilizing Panda’s proprietary cloud computing technology called Collective Intelligence, Panda Cloud Antivirus harnesses the knowledge of Panda’s global community of millions of users to automatically identify and classify new malware strains in almost real-time. Each new file received by Collective Intelligence is automatically classified in under six minutes. Collective Intelligence servers automatically receive and classify over 50,000 new samples every day. In addition, Panda’s Collective Intelligence system correlates malware information data collected from each PC to continually improve protection for the community of users.
Because Panda’s solution is cloud-based and free to consumers, it will reside on a large number of different computers and networks worldwide. This is how Panda’s cloud solution is able to fill a dual role as both sampling and classification engine for virus activity. On the one hand Panda serves as manager of a communal knowledge pool that benefits all consumers participating in the free service. On the other hand, Panda can sell the malware detection knowledge it gains to corporate customers – wherein lies the revenue model that pays for the free service.
I have friends working in two unrelated startups, one concerning business financial data and the other enterprise application deployment ROI, that both work along similar lines (although neither is free to consumers). Both startups offer a combination of analytics for each customer’s data plus access to benchmarks established by anonymously aggregating data across customers.
Panda’s cloud analytics, aggregation and classification mechanism is also analogous to the non-boolean document categorization software for eDiscovery discussed in previous posts in this blog, whereby unreviewed documents can be automatically (and thus inexpensively) classified for responsiveness and privilege:
Deeper, even more powerful extensions of this principle are also possible. I anticipate that we will soon see software which will automatically classify all of an organization’s documents as they are created or received, including documents residing on employees’ laptops and mobile devices. Using Panda-like classification logic, new documents will be classified accurately whether or not they are an exact match with anything previously known to the classification system. This will substantially improve implementation speed and accuracy for search, access control and collaboration, document deletion and preservation, end point protection, storage tiering, and all other IT, legal and business information management policies.
From a business perspective, information should be handled like property. Like assets or supplies, information needs management.
Companies set policies to govern use, storage, and disposal of assets and office supplies. Companies also need to make and enforce rules governing electronically stored information, including how it is organized (who has access and how), stored (where and at what cost), retained (including backups and archives), and destroyed (deletion and non-deletion both have significant legal and cost consequences). These policies must balance the business, legal and technical needs of the company. Without them, a company opens itself up to losses from missed opportunities, employee theft, lawsuits, and numerous other risks.
Some information is analogous to company ASSETS. For example, let’s suppose a certain sales proposal took someone a week to write and required approval and edits from four other people plus 6 hours of graphics production time. An accountant isn’t going to list that proposal on the company books. But it is an asset. It can be edited and resubmitted to other potential customers in a fraction of the time it took to create the original. Like the machinery, furniture, or hand tools used to operate a business, company money was spent obtaining this information and it will retain some value for some time. It should be managed like an asset.
Some information is analogous to OFFICE SUPPLIES. For example, a manager spends a number of hours customizing a laptop with email account settings, browser bookmarks and passwords, ribbon and plugin preferences, nested document folders, security settings, etc. That customization information is crucial for the manager’s productivity in much the same way as having pens in the drawer, paper in the printer, staples in the stapler, and water in the water-cooler can be important for productivity. Productivity suffers if that information is lost. That information needs to be managed just as much as office supplies need to be managed.
From a business perspective, when company information is lost or damaged, or when users are under or over supplied, it is no different from mismanagement of company assets and office supplies.
Setting an information policy means:
identifying information use and control needs;
making choices and tradeoffs about how to meet those needs; and
taking responsibility for results and an ongoing process (setting goals / taking action / measuring progress / adjusting).
Information governance policy is an on-going process for managing valuable company information. All of the stakeholders – in particular, business units, IT, and Legal – must collaborate in order to draw a bullseye on company information management needs. The right people in the organization must be charged with responsibility for getting results or for making changes needed to get results.
Without a doubt, it takes time and money, and requires collaboration, to develop a “policy.” But we’re all accustomed to this type of preparation already. Let’s look at simple, familiar professional standards for just a moment:
Software developers test software on actual users and correct bugs and (hopefully) mistaken assumptions before releasing it. Avoiding these steps will undoubtedly lead to loss and possibly bankruptcy.
Attorneys meet with clients before going to trial to prepare both the client and the attorneys. If they don’t they risk losing their clients millions, or getting them locked up.
Advance preparation is as essential in information management as it is in software development and trial practice. Simply ignoring the issue, or dumping it on one person or a single department (like IT or Legal) can be very costly. Avoiding the planning component of information management is like putting in only 80% of the time and effort needed for the company to succeed. Avoiding 20% of the time and effort doesn’t yield a “savings” when the outcome is failure, as when an employee steals documents, essential information is lost when a building with no computer back-ups burns down, or old documents which would have been deleted under a proper information governance policy turn up in a lawsuit and cost millions.
Information policy does NOT flow from any of the following all-too-common realities:
The first meeting between the Legal Department and the new eDiscovery vendor is also the first meeting between Legal and IT (true story);
Ever since a certain person from the General Counsel’s office was made the head of corporate records management, no one in IT will talk to that person (true story);
Information management technology alone, without a company-specific understanding of the problems that the technology is meant to solve, is not a recipe for success. A recent article by Carol Sliwa, published by SearchStorage.com (April 22, 2009), offers a detailed look at issues surrounding efforts to reduce storage costs by assessing how information is being used and moving it to the least expensive storage tier possible.
The article has some powerful suggestions on developing information policy. First, Karthik Kannan, vice president of marketing and business development at Kazeon Systems Inc.:
“What we discovered over time is that customers need to be able to take some action on the data, not just find it…. Nobody wants to do data classification just for the sake of it. It has to be coupled with a strong business reason.”
“In order to really realize and get the benefit of data and storage classification, you have to start with a business process…. And it has to start from conversations with the business units and understanding the needs and requirements of the business. Only at the end, once you actually have everything in place, should you be looking at technology because then you’ll have a better set of requirements for that technology.”
It takes time, money and cooperation between departments that may have never worked together before to develop a working information governance policy. But that is not a reason to skip — or skimp on — the process. Companies need to protect their assets and productivity, and information governance has become an essential area for doing just that.
This blog post is the first of two on the topic of advanced eDiscovery analytics models. My goal is to make the point that lawyers don’t trust or use analytics to the degree that they should, according to scientifically sound conventions commonly employed by other professions, and to speculate about how this is going to change.
In this first post I’ll explain why we arrived where we are today by describing the progression of analytics across three generations of Discovery technology.
The first generation, which I call “The Photocopier Era,” relies on labor-intensive, pre-analytics processes. Some lawyers are still stuck in this era.
The second generation is the current reigning model of analytics and review. I call it “software queued review.” Software queued review intelligently sorts and displays documents to enable attorneys to perform document review more efficiently. At the same time, software queued review allows – or should I say, requires? – attorneys to do more manual labor than is required to ensure review quality or to ensure that attorneys take personal responsibility for the discovery process.
The third, upcoming generation of analytics is only beginning to provoke widespread discussion in the legal community. I’ll call it “statistically validated automated review.” In it software is used to perform the majority of document review work, leaving attorneys to do the minimum amount of review work. In fact, certain advanced analytics and workflow software solutions can already be calibrated, by attorney reviewers, to be more accurate than human reviewers typically are capable of when reviewing vast quantities of documents.
Because it will radically reduce the amount of hands-on review, the third generation model is currently perceived by many lawyers as a risky break from legal tradition. But when this model is deployed outside of the legal profession it is not considered a giant step, technologically or conceptually. It is merely an application of scientifically grounded business processes.
In subsequent blog posts, including the second post in this series, I will look at what is being done to overcome the legal profession’s reluctance to adopt this more accurate, less expensive eDiscovery model.
The Pre-analytics Generation: Back to the Photocopier Era
Please return with me now to olden times of not-so-long-ago, the days before eDiscovery software. (Although even today, for smaller cases and cases that somehow don’t involve electronically stored information, the Photocopier Era is alive and well.)
In the beginning there were paper documents, usually stored within folders, file boxes, and file cabinets. Besides paper, staples, clips, folders, and boxes, photocopiers were the key document handling technology, with ever improving speed, sheet feeding, and collation options.
Gathering documents: When a lawsuit reached the discovery stage, clients following the instructions of their attorneys physically gathered their papers together. Photocopies were made. Some degree of effort was (usually) made to preserve “metadata” which in this era meant identifying where the pieces of paper had been stored, and how they had been labeled while stored.
Assessing documents: In this era every “document” was a physical sheet of paper, or multiple sheets clipped together in some manner. Each page was individually read by legal personnel (attorneys or paralegals supervised by attorneys) and sorted for responsiveness and privilege. Responsive, non-privileged documents were compiled into a complete set and then, individual page after individual page, each was numbered (more like impaled) with a hand-held, mechanical, auto-incrementing ink stamp (I can hear the “ka-chunk” of the Bates Stamp now… ah, those were the days).
Privileged documents were set to one side, and summarized in a typed list called a privilege log. Some documents containing privileged information were “redacted” using black markers (there was an art to doing this in a way so that the words couldn’t be read anyway – an art which even the FBI on one occasion in my experience failed to master).
Finally, the completed document set was photocopied, boxed, and delivered to opposing counsel, who in turn reviewed each sheet of paper, page by page.
The Present Generation of Analytics: Software Queued Review
Fast forward to today, the era of eDiscovery and software-queued review. In the present generation software is used to streamline, and thus reduce, the cost of reviewing documents for responsiveness and privilege.
Gathering documents: Nowadays, still relying on instructions from their attorneys, clients designate likely sources of responsive documents from a variety of electronic sources, including email, databases, document repositories, etc. Other media such as printed documents and audio recordings may also be designated when indicated.
After appropriate conversions are made (for example, laptop hard drives may need to be transferred, printed documents may need to be OCR scanned, audio recordings may need to be transcribed, adapters for certain types of data sources may need to be bought or built) all designated sources are ingested into a system which indexes the data, including all metadata, for review.
Some organizations already possess aggressive records management / email management solutions which provide the equivalent of real time ingestion and indexing of significant portions of their documents. Such systems are particularly valuable in a legal context because they enable more meaningful early case assessment (sometimes called “early data assessment”).
Assessing documents: In the current era attorneys can use tools such as Inference which use a variety of analytical methods and workflow schemas to streamline and thus speed up review. (Another such tool is Clustify, which I described in some detail in a previous blog entry.) Such advanced tools typically combine document analytics and summarization with document clustering, tagging, and support for human reviewer workflows. In other words, tools like Inference start with a jumble of all of the documents gathered from a client, documents which most likely contain a broad spectrum of pertinent and random, off-topic information, and sort them into neat, easy to handle, virtual piles of documents arranged by topic. The beauty of such systems is that all of the virtual piles can be displayed — and the documents within them browsed and marked — from one screen, and any number of people in any number of geographic locations can share the same documents organized the same way. Software can also help the people managing the discovery process to assign groups of documents to particular review attorneys, and help them track reviewer progress and accuracy in marking documents as responsive or not, and privileged or not.
The key benefit of this generation of analytics is speed and cost savings. Similar documents, including documents that contain similar ideas as well as exact duplicates and partial duplicates of documents, can be quickly identified and grouped together. When a group of documents contains similar documents, and all of the documents in that group are assigned to the same person or persons, they can work more quickly because they know more of what to expect as they see each new document. Studies have shown that review can be performed perhaps 70-80% faster, and thus at a fraction of the cost, using these mechanisms.
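For readers curious how near-duplicates can be grouped at all, here is a minimal sketch using word "shingles" and Jaccard similarity, a common textbook technique. I am not claiming this is how Inference or Clustify work; the function names, the shingle size, and the 0.5 threshold are assumptions chosen for illustration. Note that this only catches textual overlap, not documents that express similar ideas in different words.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_similar(docs, threshold=0.5):
    """Greedily place each document in the first group whose first member is similar enough."""
    groups = []  # each group is a list of document ids
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    for doc_id in docs:
        for group in groups:
            if jaccard(sets[doc_id], sets[group[0]]) >= threshold:
                group.append(doc_id)
                break
        else:
            # No existing group matched, so this document starts a new one.
            groups.append([doc_id])
    return groups


docs = {
    "a": "please review the attached supply contract before friday",
    "b": "please review the attached supply contract before monday",
    "c": "lunch order for the team offsite next week",
}
groups = group_similar(docs)
print(groups)
```

Each resulting group could then be assigned to a single reviewer, which is where the speed gain described above comes from.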
Once review is complete, documents can be automatically prepared for transfer to opposing counsel, and privilege logs can be automatically generated. Opposing counsel can be sent electronic copies of responsive, non-privileged documents, which they in turn can review using analytical tools. (Inference is among the tools that are sometimes used by attorneys receiving such document sets, Nick tells me.)
The Coming Generation of Analytics: Statistically Validated Automated Review
The next software analytics model will be a giant leap forward when it is adopted. In this model software analytics intelligence is calibrated by human intelligence to automatically and definitively categorize the majority of documents collected as responsive or not, and as privileged or not, without document-by-document review by humans. In actuality, some of the analytical engines already in existence – such as Inference — can be “trained” through a relatively brief iterative process to be more accurate in making content-based distinctions than human reviewers can.
To adopt this mechanism as standard, and preferred, in eDiscovery would be merely to apply the same best practices statistical sampling standards currently relied upon to safeguard quality in life-or-death situations such as product manufacturing (think cars and airplanes) and medicine (think pharmaceuticals). The higher level of efficiency and accuracy that this represents is well within the scope of existing software. But while statistically validated automated review has been widely alluded to in legal technology circles, so far as I know it has not been used as a default by anyone when responding to document requests. Not yet. Reasons for this will be discussed in subsequent posts, including the next one.
Gathering documents: The Statistically Validated Automated Review model relies on document designation, ingestion, and indexing in much the same manner as described above with respect to Software Queued Review.
Assessing documents: In this model, a statistically representative sample of documents is first extracted from the collected set. Human reviewers study the documents in this sample then agree upon how to code them as responsive / non-responsive, privileged / non-privileged. This coded sample becomes the “seed” for the analytics engine. Using pattern matching algorithms the analytics engine makes a first attempt to code more documents from the collected set in the same way the human coders did, to match the coding from the seed sample. But because the analytics engine won’t have learned enough from a single sample to become highly accurate, another sample is taken. The human coders correct miscoding by the analytics engine, and their corrections are re-seeded to the engine. The process repeats until the level of error generated by the analytics engine is extremely low from the standpoint of scientific and industrial standards, and more accurate than human reviewers are typically capable of sustaining when coding large volumes of documents.
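The stopping condition in that loop can be sketched with a little arithmetic. The example below assumes that after each round human reviewers check a random sample of the engine's coding and count the errors, and that review stops once the upper end of a 95% Wilson score interval for the error rate falls below an agreed ceiling. The sample size, the error counts, and the 2% ceiling are invented for illustration and are not drawn from any real matter or product.

```python
import math


def wilson_upper_bound(errors, sample_size, z=1.96):
    """Upper end of the Wilson score interval for an observed error proportion."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    centre = p + z**2 / (2 * sample_size)
    margin = z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2))
    return (centre + margin) / denom


def rounds_until_acceptable(errors_per_round, sample_size, max_error_rate):
    """Return the first round whose error bound is within the ceiling, else None."""
    for round_number, errors in enumerate(errors_per_round, start=1):
        if wilson_upper_bound(errors, sample_size) <= max_error_rate:
            return round_number
    return None


# Hypothetical error counts found by reviewers in successive 400-document samples.
observed_errors = [60, 22, 6, 1]
stop_round = rounds_until_acceptable(observed_errors, sample_size=400, max_error_rate=0.02)
print(stop_round)
```

The point of using an interval rather than the raw error percentage is that a small sample with few errors can look better than it is; the bound accounts for that uncertainty.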
By way of comparison this assessment process resembles the functioning of the current generation of email spam filters, which employ Bayesian mathematics and corrections by human readers (“spam” / “not spam”) that teach the filters to make better and better choices.
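As a rough illustration of that spam-filter analogy, here is a tiny naive Bayes classifier that learns from labelled messages and from later human corrections. It is a teaching sketch under simplifying assumptions (whitespace tokenization, add-one smoothing, two fixed labels); the class and method names are mine, and real filters are considerably more elaborate.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Tiny two-class naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "not spam": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Learn from one human-labelled message (an initial seed or a correction)."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        """Return the label with the higher log-probability for this text."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["not spam"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Smoothed class prior, so an unseen class never produces log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(counts.values())
            for word in text.lower().split():
                # Add-one smoothing, with one extra slot for unknown words.
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary) + 1))
            scores[label] = score
        return max(scores, key=scores.get)


spam_filter = NaiveBayesFilter()
spam_filter.train("win a free prize now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "not spam")
spam_filter.train("contract draft attached for review", "not spam")

message = "claim your free consultation"
before = spam_filter.classify(message)

# A human reader marks a similar message as legitimate; the filter relearns.
spam_filter.train("claim your free consultation with our attorney", "not spam")
after = spam_filter.classify(message)
print(before, "->", after)
```

Swap "spam" and "not spam" for "responsive" and "non-responsive" and the correction loop is the same shape as the seeding process described above.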
After the Next Generation: Real Time Automated Review
It’s not another generation of analytics, but another major shift is gradually occurring that will have a significant impact on eDiscovery. The day is approaching when virtually all information that people touch while working will be available and indexed in real time. From the perspective of analytics engines it is “pre-ingested” information. This will largely negate the gathering phase still common in previous generations. Vendors such as Kazeon, Autonomy, CA, Symantec, and others are already on the verge – and in some cases, perhaps, past the verge – of making this a possibility for their customers.
(Full disclosure of possible personal bias: I’m working with a startup with a replication engine that can in real time securely duplicate documents’ full content, plus metadata information about documents, as they are created on out-of-network devices, like laptops, to document management engines….)
The era of Real Time Automated Review will be both exciting and alarming. It will be exciting because instant access to all relevant documents should mean that more lawsuits settle on the facts, in perhaps weeks, after a conflict erupts (see early case assessment, above), rather than waiting for the conclusion of a long, and sometimes murky, discovery process. It’s alarming because of the Orwellian “Big Brother” implications of systems that enable others to know every detail of the information you touch the moment you touch it, and at any time thereafter.
In my next post you’ll hear about my conversation with Nick Croce, including how Inference has prepared for the coming generation of automated discovery and where Nick thinks things are going next.
As I mentioned in a previous post about document content and metadata, an organization must know what its documents are about before making decisions concerning how to manage them. This blog post provides a brief explanation of how automated, software-driven tools can help us make document handling decisions based on the content of our documents.
From an information governance perspective, in order to develop useful document handling policies, an organization should first study the documents it already has. After gathering information about document properties and taking into account needs such as security, productivity, and compliance, practical rules can be developed concerning (1) how to categorize an organization’s documents and (2) how to handle documents within each category. For example, let’s suppose a company has a number of different employees who enter into a variety of contracts with different customers, suppliers, employees, and partners. After grasping what types of contracts have been made and are likely to be made, the company can develop (or formalize) rules about grouping contracts into categories and how the categories should be handled. Perhaps a rule will specify that consulting contracts over $100,000 must be routed to legal, stored in a contract management archive, and not deleted for a period of five years after the end of the contract term.
Then when new documents are created, they can be examined, categorized, and handled according to the policy for their category.
Software can help us choose useful ways to categorize our documents. Software can also automatically sort new documents into established categories, then initiate actions (workflows) on those documents based on category-specific document handling rules.
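To make the contract example concrete, here is a minimal sketch of how such a category-specific handling rule might look in software. Everything in it (the `Contract` fields, the `handling_actions` function, the wording of the actions) is my own illustration rather than any vendor's actual product; only the rule itself comes from the example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Contract:
    # Hypothetical fields, chosen only to express the example rule.
    kind: str         # e.g. "consulting", "supply"
    value_usd: float
    term_end: date    # last day of the contract term


def add_years(d, years):
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def handling_actions(contract):
    """Return the actions the example rule triggers for one contract."""
    actions = []
    # Rule: consulting contracts over $100,000 go to legal, are archived,
    # and may not be deleted until five years after the term ends.
    if contract.kind == "consulting" and contract.value_usd > 100_000:
        actions.append("route to legal")
        actions.append("store in contract management archive")
        retain_until = add_years(contract.term_end, 5)
        actions.append(f"retain until {retain_until.isoformat()}")
    return actions


big_consulting = Contract("consulting", 150_000, date(2010, 6, 30))
small_consulting = Contract("consulting", 100_000, date(2010, 6, 30))
print(handling_actions(big_consulting))
print(handling_actions(small_consulting))
```

The point of the sketch is simply that once a document has been categorized, the handling policy for that category can be applied mechanically.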
There are many business scenarios in which a company benefits from using software to understand what its documents are about. A particularly high-stakes scenario is eDiscovery, where attorneys must review some number of company documents to determine whether the documents will be needed in a lawsuit. The greater the number of documents that must be individually reviewed by attorneys, the more attorney time used, and thus the more money spent. On the other hand, the more errors made (for example, if documents that should be reviewed aren’t), the higher the likelihood of a multimillion-dollar penalty in court. As a result, quite a few software vendors now offer products for categorizing and sorting documents in order to reduce both the time it takes attorneys to review documents and the number of reviewing errors.
It’s a poorly kept secret that years ago I was a litigation attorney grown accustomed to managing sizable quantities of paper documents on top of a large desk. To keep everything organized and accessible I created many piles to accumulate the documents relating to different cases, and the projects within those cases, and thus my desk was covered with piles and sub-piles corresponding to cases, briefs, discovery, and trials. And as new pieces of paper came in, after a glance to understand what a piece of paper was about, I could place it on my desk in such a way that it would be easy to find, grouped with other pieces of paper that I would eventually need for the same project. For the most part it was a surprisingly effective system, although not particularly accessible to anyone besides myself. On any given day most of the square footage of my desktop was obscured by a grid formed by individual documents, short stacks of paper, stacks of paper clipped and/or cross-hatched to create groupings, file folders containing paper, stacks of file folders, expanding file jackets containing file folders, stacks of loaded file jackets, boxes of file jackets, and even mixed piles containing documents, folders and file jackets (perhaps on top of a box).
Fast forward to 2009. I recently spoke with Bill Dimm, founder of software vendor Hot Neuron, to find out more about what their document clustering solution does and how it is used. Appropriately enough, Bill’s solution goes by the name Clustify. (I am featuring Clustify in this post because its narrow focus on document clustering, combined with Bill Dimm’s patience over three weeks of intermittent emails and conversations, made it a particularly suitable example for this topic. The pros, cons, and comparables of various clustering solutions, including Clustify, will have to wait for a future post.)
Software like Clustify solves a sorting problem analogous to the one I solved with my large desk and intricate piles, but much faster and in a way that allows the results to be integrated with other software applications and the workflows of many other people.
Clustify uses a proprietary mathematical algorithm to rapidly cluster (aka sort or group) documents according to the degree of commonality between the words used within the documents. Once clustering is completed, it provides information about the clusters and presents further options concerning both clustering and actions that can be initiated on the documents by cluster. Clustify offers three grouping options. Documents can be clustered by literal similarity to one another, which is to say, near-duplicates of long documents are grouped together. They can be clustered by literal sub-sets, which is to say, short documents that are near-duplicates of long documents are grouped together, since they incorporate chunks of the same source documents. And documents can be clustered by concept, which is to say, documents are placed in the same group when they use many of the same words but aren’t necessarily duplicating each other.
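Clustify's algorithm is proprietary, so I can't show it, but the general idea of measuring "commonality between the words used" can be illustrated with two textbook measures. The sketch below is my own simplification, not Clustify's method: `jaccard` scores how alike two documents are overall (the near-duplicate case), while `containment` scores how much of a short document appears inside a longer one (the sub-set case). The function names and sample sentences are invented for illustration.

```python
import re


def words(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all words used in either document."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def containment(short, long):
    """Fraction of the short document's words that also occur in the long one."""
    ws, wl = words(short), words(long)
    if not ws:
        return 0.0
    return len(ws & wl) / len(ws)


long_doc = "the supplier shall deliver the goods within thirty days of the order"
excerpt = "deliver the goods within thirty days"

print(jaccard(excerpt, long_doc))      # overall similarity is only moderate
print(containment(excerpt, long_doc))  # but the excerpt is fully contained
```

An excerpt can score modestly on overall similarity yet be entirely contained in its source, which is presumably why a tool would treat near-duplicates and sub-sets as separate grouping options.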
With respect to speed, Clustify in particular can process a disk containing 1.3 million documents in as little as 20 minutes on Linux, or 50 minutes on Windows, without any sort of pre-processing. And as new documents are added to a document collection, they can be sorted into existing clusters automatically.
An important benefit of fast clustering speed combined with a streamlined user interface is that an operator can quickly grasp, and iteratively improve, the quality of the clusters being generated. As Bill Dimm puts it, there are many ways to group documents, and software only provides a tool for helping people settle on clusters that are useful for the particular task at hand. There is no abstract, mathematically absolute way to group documents. Cluster quality depends on the context. After documents have been clustered by software, the clusters should be examined by someone who understands the purpose(s) for which the clusters are intended. The software may produce clusters that are useful on its first pass, but clusters may also be over- or under-inclusive. With Clustify, settings can easily be adjusted and clustering re-run to quickly come up with the most useful balance between cluster size and quantity, which also affects topicality. For instance, if a particular cluster is meant to be about football rather than sports generally, the only references to “Broncos” in that cluster should be about the football team, not the rodeo event. This cluster adjustment process can be repeated, allowing the software operator to zero in on a desirable level of inclusiveness and topicality.
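The adjust-and-re-run loop is easier to picture with a toy example. The sketch below uses a simple greedy clustering of my own devising (it is not Clustify's algorithm, and the `threshold` setting is a stand-in for whatever settings a real product exposes): each document joins the first cluster whose first member it resembles closely enough, or else starts a new cluster. The four sample headlines are invented.

```python
import re


def words(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Shared words divided by all words in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(docs, threshold):
    """Greedy pass: compare each document with the first member of each cluster."""
    clusters = []  # each cluster is a list of indexes into docs
    for i, doc in enumerate(docs):
        w = words(doc)
        for members in clusters:
            leader = words(docs[members[0]])
            if similarity(w, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # nothing was close enough
    return clusters


docs = [
    "broncos win the football game",
    "broncos lose the football game",
    "rodeo broncos buck the riders",
    "football season ticket prices rise",
]

loose = cluster(docs, threshold=0.2)   # over-inclusive: rodeo lands with football
tight = cluster(docs, threshold=0.5)   # stricter: rodeo gets its own cluster
print(loose)
print(tight)
```

With the loose setting the rodeo headline is swept into the football cluster because it shares “broncos” and “the”; tightening the setting separates it. Neither result is mathematically "right"; which one is useful depends on what the clusters are for.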
Besides speed, another advantage of automated document clustering is consistency. Once appropriate clusters have been established, the same clustering parameters can be applied to new documents based on mathematical similarities between the new documents and already clustered documents. As more than one vendor will tell you, studies show that properly calibrated software can perform mass sorting more consistently than human reviewers can.
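As a rough illustration of the consistency point, here is a sketch of sorting a new document into already established clusters. It is an assumption-laden toy, not how any particular product works: each cluster is represented by a single sample text, the similarity measure is plain word overlap, and the `assign` function and cluster names are invented. What it does show is that the same document, compared against the same clusters with the same setting, always lands in the same place.

```python
import re


def words(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Shared words divided by all words in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def assign(new_doc, clusters, threshold):
    """Return the name of the most similar cluster, or None if none is close enough.

    `clusters` maps a cluster name to one representative text (a simplification).
    """
    w = words(new_doc)
    best_name, best_score = None, 0.0
    for name, representative in clusters.items():
        score = similarity(w, words(representative))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


clusters = {
    "football": "broncos win the football game",
    "rodeo": "rodeo broncos buck the riders",
}

print(assign("broncos football game tonight", clusters, threshold=0.3))
print(assign("quarterly budget report", clusters, threshold=0.3))
```

A document that resembles no existing cluster is left unassigned rather than forced into a poor fit, so a person can decide whether a new category is needed.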
A third advantage of automated document clustering is that it can funnel documents into different workflows. Particular clusters of documents can be assigned priorities and assigned to specific people for review. Clustify, for example, provides two-way integration with LexisNexis’s Concordance document processing solution. Clustify’s clustering engine is also accessible to third-party applications via an API, and its data is stored in simple formatted text files, allowing clustering tasks to be automatically batched and incorporated into more extensive project workflows.
In a future post I will discuss how metadata visualization can be used to develop useful content handling policies.