Geeky Friday: UNIX goes to 1234567890, Web 2.0 goes non-RDBMS

For the truly geeky, a cause to celebrate! Our calendars say today is Friday the 13th. But to my Mac (OS X) and other UNIX-derived computers, which measure time internally as the number of seconds since January 1, 1970, today is something over 1.2 billion seconds. Later today it will be 1234567890; here’s the running countdown if you are anxiously awaiting the moment.
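For anyone who would rather check the arithmetic than trust the countdown, here is a minimal Python sketch using only the standard library. The helper name `epoch_to_utc` is my own invention; the conversion assumes the usual convention that the count starts at midnight UTC.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds: int) -> datetime:
    """Convert seconds since 1970-01-01 00:00:00 UTC into a timezone-aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    moment = epoch_to_utc(1234567890)
    # Prints the UTC date and time; your local clock will differ by your UTC offset.
    print(moment.isoformat())  # 2009-02-13T23:31:30+00:00
```

So the magic moment lands late in the evening UTC, which is mid-afternoon on the US west coast.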

On an unrelated note, except that it’s also rather geeky, a cloud computing buddy of mine recently tuned me in to a trend in how databases are being structured. This article does an excellent job of simply describing why some of the most prominent database projects in the realm of cloud computing are moving away from traditional relational database management systems (RDBMS), meaning SQL queries, normalized tables, and all that jazz, and towards less standardized, more XML-ish “key/value” data structures. The brief Wikipedia article entitled Document-oriented Database also offers some simply-stated information on this topic.

UPDATE: To improve scalability, the social media and networking web application FriendFeed has started using a sort of hybrid database that involves storing JSON (“a lightweight computer data interchange format”) objects within a simplified MySQL data structure. They’re pretty happy with their results, which speak louder than words. But as my (same) cloud computing buddy points out, this still falls short of achieving the goals of non-RDBMS databases such as CouchDB: “if you’re taking a relational DB and shoving JSON objects into it, you have to start asking yourself if there’s a more efficient way to store that data.”
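To make the hybrid idea concrete, here is a rough Python sketch of storing JSON objects as opaque values in a single two-column SQL table. It is only an illustration and not FriendFeed's actual schema: SQLite (from the Python standard library) stands in for MySQL, and the table name `entities`, its columns, and the helper functions are all hypothetical.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with one key/value table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    return conn


def put(conn: sqlite3.Connection, key: str, doc: dict) -> None:
    """Serialize the document to JSON and store it under the key, replacing any old value."""
    conn.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (key, json.dumps(doc)),
    )


def get(conn: sqlite3.Connection, key: str):
    """Return the stored document as a dict, or None if the key is absent."""
    row = conn.execute("SELECT body FROM entities WHERE id = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None


if __name__ == "__main__":
    store = open_store()
    # Two documents with completely different fields share one table.
    put(store, "user:1", {"name": "Ann", "feeds": ["blog", "photos"]})
    put(store, "entry:9", {"title": "Geeky Friday", "likes": 3})
    print(get(store, "user:1"))
```

Notice that the database never looks inside `body`: adding a field means changing the JSON, not the table. That is the appeal, and also the source of my buddy's objection, since the relational engine is being used as little more than a key/value store.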

Search Classification and Analysis Tools for Information Governance

One of the most interesting technical issues being discussed at LegalTech last week was the question of how to classify, analyze, and review “unstructured” information like the content of emails, text documents, and presentations.

A familiar, simple-sounding answer leaps immediately to mind. Why not just hook up all of these documents to a search engine “crawler,” index all of the words in all the documents, and then run ordinary key-word searches on the whole set? It’s exactly like conducting Google searches, except that instead of spanning a big chunk of the entire internet we only have to cover a few terabytes of corporate information, right?
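The crawl-and-index idea really is simple at its core. Below is a toy Python sketch of an inverted index (a map from each word to the documents containing it) with an all-words-must-match search. The function names and the sample documents are made up for illustration; a real search engine adds ranking, stemming, and much more.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of document ids in which it appears."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of ALL documents containing every word of the query (no ranking)."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


if __name__ == "__main__":
    docs = {
        "memo-1": "Quarterly earnings report attached",
        "memo-2": "Notes from the earnings call",
        "memo-3": "Lunch menu for Friday",
    }
    index = build_index(docs)
    print(sorted(search(index, "earnings")))
```

Note that `search` returns every matching document rather than a ranked top ten.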

The bad news is that there are a number of wrinkles in the territory surrounding eDiscovery that render a pure Google-style search model less than perfect. The good news is that a variety of vendors offer well-conceived solutions meant to take these wrinkles into account. The remainder of this post will introduce some of the wrinkles; later posts will cover vendor solutions.

What’s different about eDiscovery from a Search perspective?

In an eDiscovery context the ideal for a classification and search solution is to allow searchers to identify ALL documents which meet their criteria, not just “the ten most relevant documents” or “at least one document that answers my question” as is common with Google searches. Imagine running a Google search which returned 20,000 responses. You think this number is too big (it’s overinclusive), but you don’t want to risk missing any relevant documents. Then imagine getting a bill for paying a team of attorneys at rates in excess of $100 / hour to read all of those documents in order to determine whether, in addition to containing the key words you selected, the documents are actually relevant to the particular lawsuit at hand.

Another common instance of overinclusiveness arises because unstructured information repositories such as email accounts frequently contain multiple versions of the same chunks of content. Many documents will repeat some content from earlier versions and add some new content. For example, when emails are replied to, forwarded, or sent to multiple recipients, content already in the information pool is duplicated, and new information (email headers or comments) will have been added. Using conventional search, all versions of every document will fall within search results and must be manually reviewed at great cost to understand what is important and what is merely redundant.

Another potential problem involves choosing key words correctly. One could easily choose key words that are logically related to the topic at hand and return a large number of relevant documents, yet miss many documents, or the most important ones, in the document pool (that is, the search results are underinclusive). What if, as in the Enron case, “code” words were adopted by perpetrators of a scam in an effort to cover their tracks? What if some number of documents are written in a language the searchers don’t speak, or use words or terms not familiar to searchers?

Solutions to these problems that various vendors have devised include semantic clustering, multi-variate analysis of word positioning and frequency, key words plus associative groupings, near de-duplication processes, and more. Each comes with both strengths and weaknesses, of course — to be discussed in future posts.
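To give a flavor of one item on that list, here is a rough Python sketch of near de-duplication using word “shingles” (overlapping runs of words) and Jaccard similarity. This is a textbook technique, not any particular vendor's method; the shingle size, the 0.5 threshold, and the function names are all assumptions chosen for illustration.

```python
import re
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, size: int = 3) -> Set[Shingle]:
    """Return the set of overlapping word runs of the given size found in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Size of the intersection divided by size of the union (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Treat two texts as near-duplicates when their shingle overlap meets the threshold."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


if __name__ == "__main__":
    original = "Please review the attached contract before Friday and send comments"
    forwarded = "FYI Please review the attached contract before Friday and send comments"
    unrelated = "Lunch is at noon in the cafeteria today"
    print(is_near_duplicate(original, forwarded))
    print(is_near_duplicate(original, unrelated))
```

A forwarded email with a short note on top shares nearly all of its shingles with the original, so a reviewer could be shown one copy plus the differences instead of reading every version in full.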

Citrix and Intel take the next step in Virtualizing Information Management

The fact that a company’s information is scattered across vast numbers of desktop and laptop hard drives creates multiple headaches, not the least of which arises when a lawsuit or regulatory proceeding requires the company to inventory and/or recover the information stored on those drives.

One solution, provided by software vendors such as Citrix, turns laptops and desktops into mere terminals connecting to company data which resides on company servers. This approach took another step forward last month, at least in terms of user experience, when Intel and Citrix announced that they will embed a Citrix hypervisor (software that hosts virtual machines, here used to run the Citrix terminal application) within laptops and desktops, thereby substantially improving performance by allowing more work (including access to USB peripherals) to happen on the laptop or desktop itself. There is an interesting dynamic between corporate IT departments, who generally want to “lock down” company PCs to guard against user misuse and abuse, and corporate IT users accustomed to privacy and the trust of their employers. It will be interesting to see how the embedded hypervisor strategy will affect that dynamic, as it may in essence allow company PCs to operate in both “company” and “personal” modes.

LegalTech NY 2009 in review

I’m back from New York. And as of this morning I’m nearly caught up after wading through a flurry of activity surrounding LegalTech NY 2009. Personally I had a great trip because I made some wonderful new friends, discovered a number of new and relevant companies and their technology, and just plain enjoyed the NYC scene. Results for others may have varied — details below.

The Good, The Bad, and the Not So Aesthetically Pleasing

  • A keynote at LegalTech NY 2009

    Most of the presenters were reasonably well prepared and open to discussing any and all issues with members of the audience. (One panel I attended stood out for running out of material halfway through their time slot, but that was the exception rather than the rule.) The best presenter by far was DC-based Federal Magistrate Judge John Facciola. He was simply an exceptional speaker: I couldn’t multi-task on my laptop or iPhone (who needs wifi when there’s 3G, baby!) while he was talking for fear of missing the subtleties of his facial expressions and jokes. He made an important point about the future of the legal profession: competency in information technology is now crucial for trial attorneys and those managing discovery at any level. He told a number of vivid, relevant stories, but the one that really stood out for me was about a case which appeared before him involving two technology companies. Despite the obvious need for it, there was no eDiscovery in this case because counsel on both sides were uncomfortable with it and thus were willing to commit malpractice by overlooking it. Crazy, but indicative of a long-standing resistance to technology that Judge Facciola regularly encounters with “old-school” lawyers.

  • Both presentations and exhibitors at the conference were primarily oriented around selling eDiscovery software and services. I expected this, and in fact this was a good arrangement for someone in my line of business. However, it bears noting that law firms are only part of the legal technology equation, and for a variety of reasons corporate legal departments are more influential users when it comes to legal tech. It would also have been nice to see a broader array of legal tech offerings; for example, the CLM (contract lifecycle management) space was underrepresented.
  • Most of the exhibitors used admirable restraint in resisting the temptation to promote their particular solution while on stage. However, I can’t resist pointing out that one panelist, who coincidentally is employed by the very vendor that was sponsoring his panel (vendor name suppressed to avoid unjustly promoting them), must have exhorted the audience 20 times in 45 minutes that everyone must have / needs / faces dire consequences unless they adopt “an integrated solution” to eDiscovery document collection and review, which is what his company was selling.
  • The conference infrastructure was adequate but not overwhelmingly impressive by Silicon Valley standards. Seating was particularly crowded in keynotes, free food was nearly non-existent (surprising, given that sponsored buffets present a high-visibility promotional opportunity for vendors), and coffee was hard to find. But it was the lack of free wifi that was an ongoing source of conversation amongst “wired” attendees. Apparently wifi was available for a fee if one found a certain booth somewhere on the premises. (I used 3G instead; one fellow I met actually picked up a wireless broadband card at a mobile telco store across the street to avoid paying for internet access at the event.) I understand it’s mid-town Manhattan, so the ordinary hotel lobby bar (garnering a steady 2 out of 5 rating at Yelp) is selling $10 glasses of beer. And I appreciate a comprehensive revenue model as much as or more than the next fellow. But this branded the conference as not quite having its finger on today’s digital pulse, at least not by comparison to what we are accustomed to out on the left coast.
  • Kudos to the Metropolitan Transportation Authority (MTA) and the Port Authority for New York’s marvelous, amazingly effective transit system. It’s far from a free ride, but incredibly practical. Of special note is the relatively new AirTrain connecting JFK airport to the Long Island Rail Road, which continues on to midtown (Penn Station) only a short distance from the conference site. The transfer point was staffed with bizarrely friendly Port Authority employees helping poor lost tourists trying to figure out how to purchase ride cards and use them to activate turnstiles. It was very entertaining and even touching, considering how big and bad NYC’s reputation (still) is.
  • Special thanks to my old buddy Kenneth Adams for inviting me to stay with his family in their beautiful home in Garden City, Long Island (also a short walk from an L.I.R.R. station). Ken is not only the undisputed master of contract drafting, but also cooks a mean Neapolitan-style pizza.