As I mentioned in a previous post about document content and metadata, an organization must know what its documents are about before making decisions concerning how to manage them. This blog post provides a brief explanation of how automated, software-driven tools can help us make document handling decisions based on the content of our documents.
From an information governance perspective, in order to develop useful document handling policies, an organization should first study the documents it already has. After gathering information about document properties and taking into account needs such as security, productivity, and compliance, practical rules can be developed concerning (1) how to categorize an organization’s documents and (2) how to handle documents within each category. For example, let’s suppose a company has a number of different employees who enter into a variety of contracts with different customers, suppliers, employees, and partners. After grasping what types of contracts have been made and are likely to be made, the company can develop (or formalize) rules about how to group contracts into categories and how each category should be handled. Perhaps a rule will specify that consulting contracts over $100,000 must be routed to legal, stored in a contract management archive, and not deleted for a period of five years after the end of the contract term.
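To make the example rule concrete, here is a minimal sketch of how it might be written down in Python. The names Contract, add_years, and handling_actions are hypothetical, invented for illustration only, and a real policy engine would carry many more rules than this one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Contract:
    kind: str          # hypothetical category label, e.g. "consulting"
    value_usd: float   # total contract value in US dollars
    term_end: date     # last day of the contract term


def add_years(d, years):
    """Shift a date by whole years, falling back to Feb 28 for Feb 29."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def handling_actions(contract):
    """Return the handling steps the example rule requires, if any."""
    actions = []
    if contract.kind == "consulting" and contract.value_usd > 100_000:
        actions.append("route to legal")
        actions.append("store in contract management archive")
        # Keep the contract for five years after the end of its term.
        retain_until = add_years(contract.term_end, 5)
        actions.append(f"do not delete before {retain_until.isoformat()}")
    return actions


example = Contract("consulting", 250_000, date(2012, 6, 30))
for step in handling_actions(example):
    print(step)
```

A contract that falls outside the rule, such as a supply contract or a consulting contract of exactly $100,000, simply gets an empty list of actions in this sketch.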
Then when new documents are created, they can be examined, categorized, and handled according to the policy for their category.
Software can help us choose useful ways to categorize our documents. Software can also automatically sort new documents into established categories, then initiate actions (workflows) on those documents based on category-specific document handling rules.

There are many business scenarios in which a company benefits from using software to understand what its documents are about. A particularly high-stakes scenario is eDiscovery, where attorneys must review some number of company documents to determine whether the documents will be needed in a lawsuit. The greater the number of documents that must be individually reviewed by attorneys, the more attorney time is used, and thus the more money is spent. On the other hand, the more errors made – for example, if documents that should be reviewed aren’t – the higher the likelihood of a multimillion-dollar penalty in court. As a result, quite a few software vendors now offer products for categorizing and sorting documents in order to reduce both the time it takes attorneys to review documents and the number of reviewing errors.
It’s a poorly kept secret that years ago I was a litigation attorney grown accustomed to managing sizable quantities of paper documents on top of a large desk. To keep everything organized and accessible I created many piles to accumulate the documents relating to different cases, and the projects within those cases, and thus my desk was covered with piles and sub-piles corresponding to cases, briefs, discovery, and trials. And as new pieces of paper came in, after a glance to understand what a piece of paper was about, I could place it on my desk in such a way that it would be easy to find, grouped with other pieces of paper that I would eventually need for the same project. For the most part it was a surprisingly effective system, although not particularly accessible to anyone besides myself. On any given day most of the square footage of my desktop was obscured by a grid formed by individual documents, short stacks of paper, stacks of paper clipped and/or cross-hatched to create groupings, file folders containing paper, stacks of file folders, expanding file jackets containing file folders, stacks of loaded file jackets, boxes of file jackets, and even mixed piles containing documents, folders and file jackets (perhaps on top of a box).
Fast forward to 2009. I recently spoke with Bill Dimm, founder of software vendor Hot Neuron, to find out more about what their document clustering solution does and how it is used. Appropriately enough, Bill’s solution goes by the name Clustify. (I am featuring Clustify in this post because its narrow focus on document clustering, combined with Bill Dimm’s patience over three weeks of intermittent emails and conversations, made it a particularly suitable example for this topic. The pros, cons, and comparables of various clustering solutions, including Clustify, will have to wait for a future post.)
Software like Clustify solves a sorting problem analogous to the one I solved with my large desk and intricate piles, but much faster and in a way which allows the results to be integrated with other software applications and the workflows of many other people.
Clustify uses a proprietary mathematical algorithm to rapidly cluster (aka sort or group) documents according to the degree of commonality between the words used within the documents. Once clustering is completed, it provides information about the clusters and presents further options concerning both clustering and actions that can be initiated on the documents by cluster. Clustify offers three grouping options. Documents can be clustered by literal similarity to one another, which is to say, near-duplicates of long documents are grouped together. They can be clustered by literal sub-sets, which is to say, short documents that are near-duplicates of long documents are grouped together, since they incorporate chunks of the same source documents. And documents can be clustered by concept, which is to say, documents are placed in the same group when they use many of the same words but aren’t necessarily duplicating each other.
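Clustify’s algorithm is proprietary, so the following is not how Clustify works. It is only a rough sketch, under my own assumptions, of what grouping documents by shared words can look like. It uses Jaccard similarity (shared words divided by all words) for near-duplicates, a simple containment score for short documents that sit inside longer ones, and a greedy pass that compares each document to the first member of each existing group. All function names here are made up for illustration.

```python
import re


def words(text):
    """Lowercased set of word tokens in a document."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all words; high for near-duplicates."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(short, long):
    """Fraction of the shorter document's words found in the longer one."""
    return len(short & long) / len(short) if short else 0.0


def cluster(docs, threshold, similarity=jaccard):
    """Greedy grouping: join the first cluster whose first member is
    similar enough, otherwise start a new cluster."""
    word_sets = [words(d) for d in docs]
    clusters = []  # each cluster is a list of document indexes
    for i, ws in enumerate(word_sets):
        for members in clusters:
            if similarity(ws, word_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "The supplier shall deliver the goods within thirty days",
    "The supplier shall deliver the goods within sixty days",
    "Quarterly football scores for the Broncos",
]
print(cluster(docs, threshold=0.7))
print(containment(words("deliver the goods"), words(docs[0])))
```

In this toy setup, a high threshold behaves like near-duplicate grouping, while a lower threshold loosely approximates grouping documents that merely share much of their vocabulary.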
With respect to speed, Clustify in particular can process a disk containing 1.3 million documents in as little as 20 minutes on Linux, or 50 minutes on Windows, without any sort of pre-processing. And as new documents are added to a document collection, they can be sorted into existing clusters automatically.
An important benefit of fast clustering speed combined with a streamlined user interface is that an operator can quickly grasp, and iteratively improve, the quality of the clusters being generated. As Bill Dimm puts it, there are many ways to group documents, and software only provides a tool for helping people settle on clusters that are useful for the particular task at hand. There is no abstract, mathematically absolute way to group documents. Cluster quality depends on the context. After documents have been clustered by software, the clusters should be examined by someone who understands the purpose(s) for which the clusters are intended. The software may produce clusters that are useful on its first pass, but clusters may also be over- or under-inclusive. With Clustify, settings can easily be adjusted and clustering re-run to quickly come up with the most useful balance between cluster size and quantity, which also affects topicality. For instance, if a particular cluster is meant to be about football rather than sports generally, the only references to “Broncos” in that cluster should be about the football team, not the rodeo event. This cluster adjustment process can be repeated, allowing the software operator to zero in on a desirable level of inclusiveness and topicality.
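The trade-off between cluster size and cluster quantity can be seen in a toy experiment. The sketch below is my own illustration, not Clustify’s settings or method. It links any two documents whose word overlap (Jaccard similarity) meets a threshold, then counts the resulting groups. The four sample documents and the function names are invented for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Word overlap between two texts: shared words / all words."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def count_clusters(docs, threshold):
    """Number of groups when documents at or above the threshold are linked."""
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the group's root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if jaccard(docs[i], docs[j]) >= threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(docs))})


docs = [
    "broncos win football game",
    "broncos lose football game",
    "broncos rodeo event tonight",
    "rodeo event schedule tonight",
]
for threshold in (0.1, 0.5, 0.9):
    print(threshold, count_clusters(docs, threshold))
```

With these sample documents, the loosest threshold lumps football and rodeo together because of the shared word “broncos”, the middle threshold separates them into two groups, and the strictest threshold leaves every document on its own. An operator re-running clustering with different settings is searching for the middle ground.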
Besides speed, another advantage of automated document clustering is consistency. Once appropriate clusters have been established, the same clustering parameters can be applied to new documents based on mathematical similarities between the new documents and already clustered documents. As more than one vendor will tell you, studies show that properly calibrated software can perform mass sorting more consistently than human reviewers can.
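As an illustration of why software sorts consistently, here is a small sketch of assigning a new document to an existing cluster. It assumes, purely for the example, that each cluster is summarized by one representative text and that similarity is plain word overlap. The function names, the labels, and the 0.3 threshold are all hypothetical.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def assign(new_doc, representatives, threshold=0.3):
    """Return the label of the most similar cluster, or None if no
    cluster is similar enough to the new document."""
    new_words = tokens(new_doc)
    best_label, best_score = None, 0.0
    for label, rep_text in representatives.items():
        rep_words = tokens(rep_text)
        union = new_words | rep_words
        score = len(new_words & rep_words) / len(union) if union else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


representatives = {
    "football": "broncos win football game",
    "rodeo": "rodeo event schedule tonight",
}
print(assign("broncos football game recap", representatives))
print(assign("quarterly tax filing", representatives))
```

Because the function depends only on its inputs, the same document always lands in the same cluster, no matter how many times or in what order it is processed.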
A third advantage of automated document clustering is that it can funnel documents into different workflows. Particular clusters of documents can be given priorities and assigned to specific people for review. Clustify, for example, provides two-way integration with LexisNexis’s Concordance document processing solution. Clustify’s clustering engine is also accessible to third-party applications via an API, and its data is stored in simple formatted text files, allowing clustering tasks to be automatically batched and incorporated into more extensive project workflows.
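To show the general idea of funneling clusters into workflows, here is a short sketch that turns cluster assignments into a prioritized review queue. It does not use Clustify’s API or file formats, which I am not reproducing here. The document IDs, cluster labels, reviewer names, and the build_review_queue function are all invented for illustration.

```python
def build_review_queue(cluster_of, routing, default=("unassigned", 99)):
    """Turn {doc_id: cluster_label} into (priority, reviewer, doc_id)
    tuples, with the most urgent items (lowest number) first."""
    queue = []
    for doc_id, label in cluster_of.items():
        reviewer, priority = routing.get(label, default)
        queue.append((priority, reviewer, doc_id))
    return sorted(queue)


# Hypothetical output of a clustering run: document ID -> cluster label.
cluster_of = {
    "doc-1": "contracts",
    "doc-2": "newsletters",
    "doc-3": "contracts",
    "doc-4": "misc",
}

# Hypothetical routing rules: cluster label -> (reviewer, priority).
routing = {
    "contracts": ("legal", 1),
    "newsletters": ("paralegal", 3),
}

for priority, reviewer, doc_id in build_review_queue(cluster_of, routing):
    print(priority, reviewer, doc_id)
```

Clusters without a routing rule fall to the bottom of the queue as unassigned, which gives a human a clear list of groups that still need a handling decision.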
In a future post I will discuss how metadata visualization can be used to develop useful content handling policies.