Predictive Model Data Prep: An Art and Science

As promised in the Moving Beyond Data Visualization to Predictive Analytics article, I have collected the following predictive modeling data preparation tips for you. My intent is to enlighten and help you successfully get started with predictive analytics or smart data discovery tools. Data visualization tools offer basic, automated quick insights into outliers, trends and clusters from data in a spreadsheet, dimensional data warehouse, OLAP cube or OLTP database. Predictive analytics and smart data discovery tools, by contrast, contain algorithms that provide much deeper predictive (what will happen) and prescriptive (how to optimize) capabilities, and they need data formatted for that purpose.

Don’t Underestimate Data Prep

My first tip is a collection of fun facts. In researching this topic, I found an excellent, must-read analysis called Quantifying the Case for Enhanced Data Preparation. Blue Hill Research surveyed 186 data analysts to learn how much time they spend on data work. We have always known that data prep is the most time-consuming task, estimated to devour 60% to 80% of the work effort on an analysis project. Data visualization is the quick and easy part!

Blue Hill’s findings in February 2016 revealed that the median amount of time a data analyst spends working with data is between 4 and 6 hours each day. Remarkably, 66% of data analysts reported spending at least 4 hours each day and 15% of analysts reported spending at least 8 hours a day on it.

Taking it one step further, Blue Hill estimated the cost of data prep.

Data prep costs approximately $22,000 per data analyst per year.

Other related studies by Ventana Research in 2015 further show data preparation as the top time-consuming activity. Needless to say, data prep is an area where any improvement or automation of tedious and often unpleasant data cleaning work is a quick win.

[Figure: Ventana Research chart of time spent on data preparation. Source: Ventana Research]

The top methods used for data prep today are manual work in Excel spreadsheets, scripting/coding (R, Python, VBA), embedded features in visualization tools and niche self-service data prep tools like Datawatch Monarch, Alteryx, Paxata, Trifacta, and Informatica Rev. However, I am seeing more data prep automation in predictive analytics and smart data discovery tools. The most impressive automated data prep and ETL solution that I have seen so far is an industry-focused analytics offering, Datorama.

Predictive Model Data Prep Differences

Predictive model data preparation requires a different thought process. Analysts need to organize data in a machine learning-friendly format that accurately reflects the business process, outcome and measures that they want to evaluate. Depending on the project, data preparation might be a one-time activity or a periodic one. As new predictive insights get revealed, it is common to further experiment.

Data preparation is both an art and a science. Since each data set, business objective and analysis is unique, with its own data source challenges, there is no exact recipe for success. The following guidelines can help get you started.

Where to Start

You begin predictive model data preparation by choosing the business question you want to answer and selecting an outcome metric along with potential influencer variables. From there you will clean and shape the data, create analytical features and take samples.

Predictive algorithms (excluding unstructured text, cognitive and a few other types) prefer data input as a table, view, or comma separated (.csv) flat file of rows and columns. If you have data stored in a dimensional data warehouse or transactional database format, you will want to use record identifiers or keys to join fields from multiple tables to create a single unified, flattened “view”. Your view will contain an outcome metric along with input predictor variables collected at a level of analytical granularity that you can make actionable decisions upon.
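As a concrete illustration, here is a minimal sketch in Python with pandas of joining tables on a shared key to produce that single flattened view. The customers and orders tables, and all of the column names, are hypothetical; in practice you would join your own fact and dimension tables using their record identifiers or keys.

```python
import pandas as pd

# Hypothetical dimension table: one row per customer
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["East", "West", "East"],
    "signup_date": pd.to_datetime(["2015-01-10", "2015-03-22", "2015-06-05"]),
})

# Hypothetical fact table: one row per order
orders = pd.DataFrame({
    "order_id": [100, 101, 102, 103],
    "customer_id": [1, 1, 2, 3],
    "order_total": [250.0, 90.0, 430.0, 120.0],
})

# Join on the shared key to create a single unified, flattened view
flat_view = orders.merge(customers, on="customer_id", how="left")
print(flat_view)
```

The same flattening could be done with a SQL view or an export to a .csv file; the point is to end up with one table of rows and columns at the granularity you plan to act on.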

[Figure: Flattened view]

You will shape your data into analytical features, derived variables that you feel might describe or influence the outcome metric. If you ever saw the movie or read the book Moneyball: The Art of Winning an Unfair Game by Michael Lewis, which describes how new performance metrics such as On-Base Percentage (OBP) and Slugging Percentage (SLG) revolutionized baseball analysis, you will be using a similar approach. Shaping data takes subject matter expertise to creatively select and transform variables for maximum influence.
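To make this concrete, here is a minimal sketch, using a hypothetical batting table, of deriving an On-Base Percentage feature from raw counting stats. The column names are invented for the example; the idea is that the derived ratio carries more signal than any of its raw components alone.

```python
import pandas as pd

# Hypothetical season batting totals for a few players
batting = pd.DataFrame({
    "player": ["A", "B", "C"],
    "H":   [160, 140, 175],   # hits
    "BB":  [70, 45, 30],      # walks
    "HBP": [5, 3, 8],         # hit by pitch
    "AB":  [520, 500, 560],   # at bats
    "SF":  [4, 6, 5],         # sacrifice flies
})

# Derived ratio feature: On-Base Percentage = times on base / plate appearances
batting["OBP"] = (batting["H"] + batting["BB"] + batting["HBP"]) / (
    batting["AB"] + batting["BB"] + batting["HBP"] + batting["SF"]
)
print(batting[["player", "OBP"]])
```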

Keep in mind that analytical feature variables need to be created at the correct level of granularity for decision-making. You don’t want to overly aggregate and you may not need lower levels of detail.

Choose a metric level of granularity that is actionable, understandable, and useful.

Predictive views of features are single, wide rows compared to the narrow reporting or application database rows built for other purposes. A predictive view summarizes a lifetime of values in one single row, with many columns that can describe different points in time, rather than multiple rows for each date.
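As a rough sketch, assuming a hypothetical order history table, the reshaping from many narrow transaction rows into one wide feature row per customer might look like this in pandas. The column names and the as-of snapshot date are invented for the example; it also shows the relative date rollups mentioned in tip 4 below.

```python
import pandas as pd

# Hypothetical order history: tall and narrow, one row per transaction
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2015-01-05", "2015-04-18", "2015-09-02", "2015-02-11", "2015-07-30"]),
    "order_total": [120.0, 80.0, 200.0, 60.0, 90.0],
})
as_of = pd.Timestamp("2016-01-01")  # snapshot date for the analysis

# Roll the lifetime of history up into one wide row per customer
features = orders.groupby("customer_id").agg(
    order_count=("order_total", "count"),
    lifetime_spend=("order_total", "sum"),
    avg_order_value=("order_total", "mean"),
    first_order=("order_date", "min"),
    last_order=("order_date", "max"),
).reset_index()

# Relative date rollups: durations are friendlier to algorithms than raw dates
features["days_since_first_order"] = (as_of - features["first_order"]).dt.days
features["days_since_last_order"] = (as_of - features["last_order"]).dt.days
features = features.drop(columns=["first_order", "last_order"])
print(features)
```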

Twelve Predictive Model Data Prep Tips

  1. Choose a metric level of granularity for actionable decision-making with the predictive output.
  2. Predictive algorithms assume that each record is independent and unrelated. If relationships do exist between records, create a new derived variable called a feature to capture data relationships.
  3. When selecting predictor variables, keep in mind that you want to gather a maximum amount of information from a minimum number of variables to avoid the curse of dimensionality without overfitting or underfitting.
  4. Dates will likely get rolled up to a duration such as Days from Demo to Trial. Common date rollups include the earliest date and the most recent date, in either absolute or relative form (the wide-row sketch above shows relative rollups).
  5. Decide how to deal with outliers. Some algorithms, such as regression, are sensitive to them because outliers inflate the standard deviation used in statistical significance calculations. Confirm whether the data is relevant and real. If you expect it to happen again, do not remove those points. Alternatively, you can reduce outlier influence by using transformations or by converting the numeric variable into a categorical one with binning (see the first sketch after this list).
  6. For missing values, decide whether you want to delete the record or impute a likely or expected value. Imputing the mean can artificially reduce your standard deviation, so a distribution-based approach is more reliable. As you deal with missing values, do not lose the initial context: a common approach is to add a column to the row to flag that the value was missing (see the first sketch after this list). Also, don't get too ambitious about removing missing values. Sometimes the pattern is in the missing data.
  7. Treat incorrect values as missing if there are only a few. If there are many inaccurate values, try to determine what happened and repair them. Predictive algorithms assume the input information is correct.
  8. Reduce variable skew. When a skewed distribution needs correction, the variable is typically transformed by a function that has a disproportionate effect on the tails of the distribution. Log transforms (natural log ln(x), log10(x)), the multiplicative inverse (1/x), the square root sqrt(x), or powers (x^n) are the most often used corrective functions (see the skew sketch after this list).
  9. Avoid using high-cardinality fields, which are categorical attributes that contain a very large number of distinct values.
  10. Do not use duplicate, redundant or other highly correlated variables that carry the same information or live in the same hierarchy to avoid collinearity issues.
  11. Ordinal variables such as Low, Medium and High are problematic for predictive models. If you have ordinal values, transform them into continuous or categorical variables.
  12. Creating features from several combined variables and ratios usually provides more improvement in model accuracy than any single-variable transformation because of the information gain associated with these interactions. Learn to love ratios (the On-Base Percentage sketch earlier is one example).
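For tips 5 and 6, here is a minimal sketch, using an invented annual_spend column, of flagging missing values before imputing and of binning a numeric variable to tame an outlier. The bin edges and the median imputation are placeholder choices for illustration, not recommendations for any particular data set.

```python
import numpy as np
import pandas as pd

# Hypothetical input with one missing value and one extreme outlier
df = pd.DataFrame({"annual_spend": [1200.0, 950.0, np.nan, 880.0, 250000.0]})

# Tip 6: flag missingness first so the original context is not lost
df["annual_spend_missing"] = df["annual_spend"].isna().astype(int)
df["annual_spend"] = df["annual_spend"].fillna(df["annual_spend"].median())

# Tip 5: reduce outlier influence by converting the numeric variable
# into a categorical one with binning
df["spend_band"] = pd.cut(
    df["annual_spend"],
    bins=[0, 500, 1000, 5000, np.inf],
    labels=["low", "medium", "high", "extreme"],
)
print(df)
```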
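For tip 8, here is a brief sketch of correcting a right-skewed variable with a log transform. The claim_amount column is hypothetical, and log1p is used instead of a plain log only so the transform stays defined at zero.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable: most values small, one long upper tail
df = pd.DataFrame({"claim_amount": [120.0, 200.0, 310.0, 450.0, 900.0, 48000.0]})

# Tip 8: the log transform compresses the long tail of the distribution
df["claim_amount_log"] = np.log1p(df["claim_amount"])

# Compare skewness before and after the transform
print(df["claim_amount"].skew(), df["claim_amount_log"].skew())
```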


Human Mind + Machine = Better Together

The beauty of the expert human mind in combination with predictive algorithms, machine learning and automation is that together they surface amazing insights that might never be found using manual visual analysis techniques. Since the quality of predictive model output relies on the quality of the input, predictive model data preparation is a critical success factor. Professional data analysts and data scientists understand statistics, know how algorithms work, and prepare data properly to achieve optimal results.

Today there is an array of niche data preparation tools and features within modern analytics offerings. With all the time and money at stake, I imagine we will continue to see more data preparation automation. However, even if the most common routines get automated, a business subject matter expert will likely still create the killer predictive model feature or creative ratio, just as they did in the analytics classic, Moneyball: The Art of Winning an Unfair Game.