Lessons in Agile Machine Learning from Walmart

Takeaways from Sam Charrington’s May, 2017 interview with Jennifer Prendki, senior data science manager and principal data scientist for Walmart.com

I am very grateful to Sam Charrington for his TWiML&AI podcast series. So far I have listened to about 70 episodes (~50 hours). Every episode is reliably fascinating: so many amazing people accomplishing incredible things. It’s energizing!

The September 5, 2017 podcast, recorded in May 2017 at Sam’s Future of Data Summit event, featured his interview with Jennifer Prendki, who at the time was senior data science manager and principal data scientist for Walmart’s online business (she has since become head of data science at Atlassian). Jennifer provides an instructive window into agile methodology in machine learning, a topic that will become more and more important as machine learning becomes mainstream and production-centric (or “industrialized”, as Sam dubs it).

I’ve taken the liberty of capturing key takeaways from her interview in this blog post. (To be clear, I had no part in creating the podcast itself.) If this topic matters to you, please listen to the original podcast, available via iTunes, Google Play, Soundcloud, Stitcher, and YouTube.


Jennifer Prendki was a member of an internal Walmart data science team supporting two other internal teams, the Perceive team and the Guide team, delivering essential components of Walmart.com’s search experience. The Perceive team is responsible for providing autocomplete and spell check to help improve customers’ search queries. The Guide team is responsible for ranking the search results, helping customers find what they are looking for as easily as possible.

Photo by Edwin Andrade on Unsplash

In pursuit of their mission, Jennifer and her teams:

• developed a unified set of data and tools that enable the teams they support to efficiently build and operate machine learning models;

• adopted standards for building and operating machine learning models, including metrics for model performance and processes for improving models over time; and

• took steps to reliably deliver the best available combined data from Walmart stores and Walmart.com (in particular, stores want access to information about what customers did online, such as click data and add-to-cart activity).

Challenges Addressed By Jennifer’s Team

Challenge > Jennifer’s Approach
Acquiring the right skills. Developing and putting machine learning solutions into production at scale requires a number of specialized skills. > Walmart’s machine learning teams include statistical analysts, data scientists, and machine learning engineers.
Statistical analysts have a clear understanding of whether data is sufficient to build a model (but don’t expect statistical analysts to be experts in machine learning, or put them in charge of creating the models).
Data scientists understand which is the best type of model to solve a problem, and deliver model prototypes.
Machine learning engineers know how to optimize models to work efficiently at scale, and push them to production.
Retaining institutional knowledge. Data scientists and developers are in short supply and may jump to new jobs. > Document everything. Detailed documentation was required for all new machine learning models, enabling future hires to quickly replicate work done by their predecessors.
Inefficient models. Machine learning models were being pushed to production before they were optimized. > Culture change and standardized processes. Jennifer and her teams re-focused on “making things right” in the first place using checklists of requirements that must be satisfied before machine learning models can be put into production, including standards for accuracy, CPU consumption, retraining, etc.
Disconnect between model creators and model implementers. Data scientists are typically focused on building accurate models. They may throw models over the fence to engineers, who are typically focused on scalability and may implement on a different platform, generating errors and delays. > Paired responsibility. Walmart pairs data scientists with machine learning engineers from the inception of each new model through its entire life cycle. Data scientists take the lead in early stages; machine learning engineers take the lead in production stages.
Focus on the wrong model performance metrics. Machine learning model performance was being assessed using metrics selected by the data scientist who originally developed each model, without reference to overall business goals or tradeoffs between different goals. > Choose performance metrics for each model individually, and evaluate performance side by side with other models. Walmart began selecting up to 10 different measures for each of its machine learning models, including both technical and business measures (such as data quality, CPU consumption, add-to-carts, clicks, customer satisfaction, ROI). Performance is then evaluated in context with the performance of other models. Example: Reduced use of spell check suggested that the spell check model wasn’t performing well. In reality, autocomplete was becoming more and more effective, making spell check less and less necessary.
Multiple similar but inconsistent data sets. Many versions of the same data sets were circulating within the company, some of lower quality than others, with no way to identify which was best. > One version of the truth. Jennifer’s team started tracking the origins of existing data sets, defining which were most current/most accurate, and making those sets available throughout the company.
Machine learning models were seasonally overfitted. Machine learning models were being retrained on a regular schedule. But sometimes they needed retraining more frequently to accommodate seasonal changes in demand for certain products (holiday shopping patterns are different from summer shopping patterns). > Understanding the models. A protocol was implemented to determine which model parameters are extremely stable over time (essentially unchanged even when the model is retrained) and which are volatile, with large error over time. Once this is understood, models can be retrained when changes to the volatile parameters are anticipated.
Excessive model training and retraining costs. Data scientists tend to use 4 times as much data as they need to train machine learning models, which is resource intensive and costly. > Measure the incremental benefit to model accuracy from training, and establish requirements for training cost vs accuracy. 99% of the accuracy may be enough for a model, if it can be achieved at 25% of the cost.
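
The cost-versus-accuracy trade-off in the last row can be made concrete with a small sketch. It is not Walmart's actual procedure, which the interview does not detail. It assumes you have already measured model accuracy at several training-set sizes (a learning curve); the function name, the 99% threshold default, and the sample numbers are all hypothetical.

```python
from typing import List, Tuple


def cheapest_sufficient_fraction(
    curve: List[Tuple[float, float]],
    min_relative_accuracy: float = 0.99,
) -> Tuple[float, float]:
    """Pick the smallest training-data fraction that is "accurate enough".

    `curve` holds (data_fraction, accuracy) pairs measured beforehand.
    A point qualifies when its accuracy is at least `min_relative_accuracy`
    times the accuracy obtained with the largest data fraction.
    """
    if not curve:
        raise ValueError("curve must contain at least one measurement")
    ordered = sorted(curve)  # ascending by data fraction
    full_accuracy = ordered[-1][1]  # accuracy when training on the most data
    target = min_relative_accuracy * full_accuracy
    for fraction, accuracy in ordered:
        if accuracy >= target:
            return fraction, accuracy
    # Reached only if min_relative_accuracy > 1; fall back to the full data set.
    return ordered[-1]


# Hypothetical measurements, invented for illustration only.
example_curve = [(0.10, 0.80), (0.25, 0.895), (0.50, 0.90), (1.00, 0.902)]
print(cheapest_sufficient_fraction(example_curve))
```

With these made-up numbers, a quarter of the data already reaches more than 99% of the full-data accuracy, so the function selects the 25% point. In practice, the threshold would come from the agreed requirements for training cost versus accuracy.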

Lessons Learned / Guiding Principles

• Don’t wait for a customer to complain because they had a bad experience (for example, getting the wrong recommendations in search results) to tell you that something is wrong with a machine learning model. You want to make changes before it impacts the customer.

• The notion of technical debt (when a thing is put into production more quickly in exchange for sub-optimal performance) applies to machine learning also.

• A great deal of effort can go into building the data lake your teams need.

• Find the right metric to assess customer satisfaction, but recognize that there isn’t one “true” metric for every machine learning model.

• You will have some crosstalk between models. One model’s underperformance could be due to another model’s overperformance (see the example above about autocomplete and spell check). You can’t keep things segmented and track just one model at a time; you have to take a holistic view.

• Machine learning models can become distorted when you don’t have a complete picture of what’s driving customer behavior. In particular, connecting online and in-store behavior (such as Google Online-To-Offline Attribution) is extremely valuable. Example: a customer’s failure to add an item they viewed online to their shopping cart (no “add-to-cart”) could be attributed to an error by the machine learning model, as would happen if the model showed the customer the wrong item. But in reality, the customer may have decided to purchase the exact item the model showed them, then purchased it from the brick-and-mortar store instead (a.k.a. “webrooming”), in which case the model worked perfectly.
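
To illustrate the webrooming point, here is a minimal sketch of how joining online and in-store data changes the way a missing add-to-cart is read. The record layout, field names, and labels are assumptions made for this example, not Walmart's schema. A real attribution pipeline would also need identity matching and a time window between the online view and the store purchase.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Impression:
    """One item shown to a customer online (hypothetical record layout)."""
    customer_id: str
    item_id: str
    added_to_cart: bool


def label_outcomes(
    impressions: Iterable[Impression],
    store_purchases: Set[Tuple[str, str]],
) -> List[str]:
    """Label each impression using both online and in-store signals.

    `store_purchases` is a set of (customer_id, item_id) pairs bought in a
    physical store. Without it, every missing add-to-cart looks like a miss.
    """
    labels = []
    for imp in impressions:
        if imp.added_to_cart:
            labels.append("online_add_to_cart")
        elif (imp.customer_id, imp.item_id) in store_purchases:
            # The model showed the right item; the sale happened offline.
            labels.append("webroomed")
        else:
            labels.append("no_conversion")
    return labels


# Invented sample data for illustration only.
sample_impressions = [
    Impression("c1", "tv", True),
    Impression("c2", "tv", False),
    Impression("c3", "tv", False),
]
sample_store_purchases = {("c2", "tv")}
print(label_outcomes(sample_impressions, sample_store_purchases))
```

Here customer c2 never clicked add-to-cart but bought the same item in a store, so the impression is counted as a success for the model and not as an error.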


> Listen to the original podcast
