De-mystifying Machine Learning

I was a little surprised to see a post in a respected tech publication just the other day about how unfathomable machine learning is, and how unknown its impact is going to be. Agreed, machine learning is still unfamiliar to many people, and its potential is enormous. But maybe I can help demystify it a little by sharing some of my own experience applying machine learning in a real life situation.

I really dug into machine learning a few years back while working on a marketing campaign about the use of analytics during the discovery phase of lawsuits. I got hands-on by downloading the somewhat-famous Enron emails, which I popped into a MySQL database server, and did a little poking around in them using Tableau. But what really helped me understand the power of machine learning was studying emerging e-discovery technology, culminating in a conversation with data scientist and entrepreneur Nicolas Croce (see the interview here).


Before I share what I learned, first some background for those who aren’t already familiar with what the legal profession calls “discovery”. Discovery is the process by which lawyers are permitted to obtain evidence, including documents and electronic records, from their opponents. This is permitted under civil and criminal law so that the lawyers for both sides can assemble the evidence that courts need to make good decisions. In a major legal action, discovery can involve literally millions of documents and equivalent types of records (images, emails, database entries, etc.). Both sides must review these documents to identify which are important and why.

The rise of digital information has greatly increased the scale of discovery. Thus the term e-discovery: the “e” stands for “electronic”. Even way back in 2002, when the internet was just a fraction of what it is today, just 158 Enron employees had generated 600,000 different email messages that had to be reviewed during discovery.

Historically, every document produced in discovery had to be carefully scrutinized by human eyes and hands. Imagine dozens or even hundreds of well-trained humans spending hundreds or thousands of hours reading page after page. Besides the expense, a major problem arises when human reviewers become burned out by looking at so many documents, hour after hour, day after day after day. Even highly skilled and motivated humans get tired, and their accuracy suffers. They start to overlook both documents and details. So you see where this leads.

Visualize, if you will, working as a document reviewer. The lead attorneys instruct you to find every document that mentions specific topics, spelling out in writing what they want you to find and possibly giving you concrete examples. While reading through the giant stacks of documents assigned to you, you flag anything that exactly matches what you were told to find, plus everything you believe to be related based on your understanding of the words and phrases being used. The attorneys eventually review your work and provide feedback: yes, this was important; no, you can ignore that next time.

This is where machine learning comes in. Like human document reviewers, computers can read text, in a manner of speaking. When you provide a machine learning algorithm with a “seed” consisting of key words, or perhaps a sample document, the algorithm can compare that seed with the contents of other documents. It looks in those documents for words, and groups of words, that match or resemble the sample you provide. Like you, it can also find closely related but different words and groups of words. For example, if the seed is the phrase “affordable care act”, the results should include documents that mention “Obamacare” or “ACA” even if they don’t include the phrase “affordable care act”, because those terms are frequently used together elsewhere.
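To make the “affordable care act” example a little more concrete, here is a minimal Python sketch of the idea. It is only an illustration, not how any particular e-discovery product works: the five-document corpus, the stopword list, and the function names related_terms and expanded_search are all made up for this post. It finds words that keep showing up alongside the seed phrase, then uses those words to pull in documents that never mention the seed at all.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
documents = [
    "the affordable care act also called obamacare or aca",
    "obamacare enrollment opens under the affordable care act",
    "aca premiums and the affordable care act subsidies",
    "quarterly earnings call scheduled for friday",
    "obamacare and aca deadlines announced",
]

# Tiny hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "and", "or", "also", "under", "for"}


def related_terms(docs, seed, min_docs=2):
    """Return words that appear in at least min_docs documents containing the seed."""
    seed_words = set(seed.split())
    counts = Counter()
    for doc in docs:
        if seed in doc:
            # Count each word once per document.
            for word in set(doc.split()):
                if word not in seed_words and word not in STOPWORDS:
                    counts[word] += 1
    return {word for word, count in counts.items() if count >= min_docs}


def expanded_search(docs, seed):
    """Return documents that contain the seed phrase or any related term."""
    terms = related_terms(docs, seed)
    return [doc for doc in docs if seed in doc or terms & set(doc.split())]


seed = "affordable care act"
print(sorted(related_terms(documents, seed)))
for doc in expanded_search(documents, seed):
    print(doc)
```

The last document in the toy corpus never uses the seed phrase, but it is still returned because it uses words that travel with the seed elsewhere. Real systems use far more sophisticated statistics, but the intuition is the same.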

Interestingly, unlike you, a machine learning algorithm doesn’t even have to know the language it’s “reading” to match word patterns. To the machine, words are just groups of characters, sequences of character codes (like ASCII) that are bunched together. Machine learning takes note of the fact that certain characters are frequently grouped together. These may be words, names, acronyms, or whatever, but the computer doesn’t have to know that to recognize them and note their positions. It also takes note of the fact that certain groups of characters tend to appear more or less frequently in combination with other groups of characters. Essentially, a machine learning algorithm creates a map that records when certain characters tend to be clustered together (it counts the frequency of character combinations) and uses that map to discover relationships between different words and phrases without “knowing” what those words mean.
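Here is a rough sketch of that “map” of character combinations, again in Python. It assumes one simple way of doing it: counting every run of three consecutive characters (the choice of three is arbitrary) and comparing two texts by how similar their counts are, using cosine similarity. The function names char_ngrams and cosine_similarity are my own for this illustration. Notice that nothing in the code knows what a word is, let alone what language it is looking at.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count every run of n consecutive characters in the text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Compare two count maps: 1.0 means same profile, 0.0 means nothing shared."""
    shared = set(a) & set(b)
    dot = sum(a[gram] * b[gram] for gram in shared)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


seed = char_ngrams("affordable care act")
close = char_ngrams("the affordable care act of 2010")
far = char_ngrams("quarterly trading report")

print(cosine_similarity(seed, close))  # higher: many shared character groups
print(cosine_similarity(seed, far))    # lower: few or no shared character groups
```

Because it only looks at character groups, the same sketch would run unchanged on text in any language or character set that Python can read.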

In addition, machine learning algorithms can be “trained” to ignore matches they find that aren’t important. They start off returning all of the matches they find based on the initial seed; wait for feedback; are told by human operators which matches are useful and which are not; and do it all over again, applying the additional information. Training can require multiple feedback sessions to fine-tune the ability of the algorithm to match the patterns that users want to find.
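The feedback loop can be sketched in a few lines of Python too. This is a deliberately simplified stand-in for real training, under my own assumptions: each word carries a weight, the seed words start at 1.0, and every time a reviewer marks a document as useful or not, the weights of its words are nudged up or down by a fixed step. The example documents and the names score and apply_feedback are hypothetical.

```python
from collections import defaultdict


def score(doc, weights):
    """Add up the weights of the distinct words in a document."""
    return sum(weights.get(word, 0.0) for word in set(doc.split()))


def apply_feedback(weights, doc, is_relevant, step=1.0):
    """Nudge the weight of each word in a reviewed document up or down."""
    for word in set(doc.split()):
        weights[word] += step if is_relevant else -step


# Start from a seed: every seed word gets an initial weight of 1.0.
weights = defaultdict(float)
for word in "enron trading".split():
    weights[word] = 1.0

# One round of human feedback on documents the seed matched.
apply_feedback(weights, "enron holiday party invitation", is_relevant=False)
apply_feedback(weights, "trading losses hidden from auditors", is_relevant=True)

# Documents the algorithm has not seen before.
print(score("auditors questioned hidden losses", weights))
print(score("holiday party at enron", weights))
```

After a single round of feedback, a new document about hidden losses scores well even though it contains neither seed word, while another party invitation scores below zero. Repeating the loop with more reviewed documents is what fine-tunes the results.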

Human reviewers can recognize related concepts that use different wording, and can learn to fine-tune their reviews. In the same way, machine learning algorithms that are given enough material to work with and enough feedback can find relevant documents that use different words from the ones they were originally given to look for, but that express the same concepts. And they can cover a whole lot of text in almost no time at all.

Machine learning adds tremendous value for e-discovery because it’s fast and accurate, saving money and improving quality. Which is great. But here’s the whole point of taking this long detour into e-discovery: the way this works is remarkably parallel to the way you would do it yourself. All that machine learning does here is to match patterns, patterns defined by a human user, that already exist in the material being reviewed.

With this example you can easily imagine how, like you, machines can quickly pick out specific concepts – patterns of words, phrases, and related words and phrases – from an ocean of text.

So here’s where the magic begins. Now, instead of an ocean of text, like those millions of e-discovery documents that you could read through yourself given enough time (!), imagine a million pages of sensor data. Or an endless stream of network data packets. Unlike you, who (no offense) would probably tire long before recognizing any meaningful patterns when passing your eyes across page after page of numerical data, machine learning can apply the same pattern matching technique it uses with text to find patterns in literally any digital data set.
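To show how the text trick can carry over to numbers, here is one possible sketch in Python. It assumes a very simple encoding that I chose for illustration: each change between consecutive sensor readings becomes a letter (U for up, D for down, F for flat, with an arbitrary tolerance of 0.5), and then the same kind of character-group counting used for text is applied to the resulting string. The readings are made up.

```python
from collections import Counter


def to_symbols(readings, tolerance=0.5):
    """Turn consecutive readings into letters: U (up), D (down), F (flat)."""
    symbols = []
    for previous, current in zip(readings, readings[1:]):
        if current - previous > tolerance:
            symbols.append("U")
        elif previous - current > tolerance:
            symbols.append("D")
        else:
            symbols.append("F")
    return "".join(symbols)


def ngram_counts(symbols, n=3):
    """Count every run of n consecutive letters, exactly as with text."""
    return Counter(symbols[i:i + n] for i in range(len(symbols) - n + 1))


# Invented temperature readings with one brief spike in the middle.
readings = [20.0, 20.1, 20.0, 20.2, 20.1, 25.0, 20.0, 20.1, 20.0, 20.1]

symbols = to_symbols(readings)
counts = ngram_counts(symbols)

print(symbols)
for pattern, count in counts.most_common():
    print(pattern, count)
```

The steady stretches produce the same letter group over and over, while the spike produces groups that appear only once, which is exactly the kind of rare pattern you might want a machine to flag for you.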

And here’s the really cool bit: humans can invoke machine learning to find just about any pattern that they can conceive of. Algorithms and patterns that we’ve already discovered can often be applied to other data sets. We can even work collectively to create shared libraries of machine learning templates. For example, information associated with certain types of security intrusions into computer networks can be shared, giving anyone with network monitoring capabilities a starting point for detecting hacking on their own systems.

What can you do with machine learning? Imagine what you’d like to discover, if only you could observe a certain thing and learn its “language”. If you can capture the data somehow, even if there are millions of “pages” of data, and can identify patterns you want to explore, then you’re in luck. A machine learning assistant can learn how to read the language, and isolate the patterns you’re looking for.
