Demystifying Data Science | Machine Learning

What is it and why is it so hard?

My goal in this blog is to back away from the hype, peel back the curtain surrounding Machine Learning (ML), and define what it is and what it does for our world. I hope to lower the barrier to entry and the intimidation factor so that the “data analysis newbie” and/or the “data scientist wannabe” can start exploring the exciting world of ML.

ML is an evolutionary discipline that grew out of the data mining space. ML is intended to give us real insight into the reasons behind our successes or failures by analyzing our data. It helps us understand our customers, products, mission and plan in ways that simply weren’t possible or practical before. ML is not a panacea for making sense of all your big data. It is not a magic wand to wave at your datasets to pop out intuitive insights that lead to the proverbial pot of gold profit margins for your business. Nor will it ever be that. ML is not perfect or exact.

Traditional data analysis is concerned with finding the answers to the “known questions” the business routinely asks. An example is the total profit made in Q3 of 2017 in the western region of North America. This kind of question can usually be answered using traditional business intelligence technologies.
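To make the idea of a “known question” concrete, here is a minimal sketch in Python. It is not taken from any particular business intelligence product; the sales records, the field names and the helper functions quarter and total_profit are all made up for illustration. The point is simply that the question is fully specified up front, so answering it is a matter of filtering and adding.

```python
# Hypothetical sales records; every value here is invented for illustration.
sales = [
    {"date": "2017-07-14", "region": "West", "profit": 1200.0},
    {"date": "2017-08-02", "region": "West", "profit": 800.0},
    {"date": "2017-09-30", "region": "East", "profit": 950.0},
    {"date": "2017-05-21", "region": "West", "profit": 400.0},
]


def quarter(date_str):
    """Return the calendar quarter (1-4) for a 'YYYY-MM-DD' date string."""
    month = int(date_str[5:7])
    return (month - 1) // 3 + 1


def total_profit(records, year, qtr, region):
    """Sum profit for the records matching a given year, quarter and region."""
    return sum(
        r["profit"]
        for r in records
        if r["date"].startswith(str(year))
        and quarter(r["date"]) == qtr
        and r["region"] == region
    )


# The "known question": total profit in Q3 of 2017 in the western region.
q3_west = total_profit(sales, 2017, 3, "West")
print(q3_west)
```

Nothing here is learned from the data. We already knew which rows mattered and what to do with them, which is exactly what separates this kind of reporting from ML.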

By contrast, ML is meant to give us several highly likely or possible answers to the questions about our data that we’ve not thought to ask. ML uses algorithm-based approaches for extracting insights and knowledge from data as it exists at a point in time. More specifically, it leverages algorithmic analysis methodologies to explore large amounts of data in search of meaningful patterns and implied rules. As such, what it finds is neither permanent nor enduring. It is more akin to weather forecasting than to exact sciences such as chemistry or physics.

The patterns discovered in the data are only meaningful to us if they give us actionable insights.
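As a toy illustration of searching data for a pattern nobody asked about, the sketch below splits a list of numbers into two groups using a stripped-down, one-dimensional version of k-means clustering. This is only one of many possible techniques and I am not claiming it is what any given ML tool does. The monthly_spend figures and the two_means function are hypothetical.

```python
def two_means(values, iterations=10):
    """Split values into two groups around two moving centers (1-D k-means, k=2)."""
    # Start the two centers at the extremes of the data (a simplifying assumption).
    low, high = min(values), max(values)
    group_low, group_high = list(values), []
    for _ in range(iterations):
        # Assign each value to whichever center it is closer to.
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_high:
            break  # all values are identical, so there is nothing to split
        # Move each center to the average of its group.
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high


# Hypothetical monthly spend per customer; the numbers are invented.
monthly_spend = [12, 15, 14, 11, 95, 102, 98, 13]
low_group, high_group = two_means(monthly_spend)
print(low_group)
print(high_group)
```

We never told the code where the boundary between “small” and “large” spenders should be; it came out of the data as it looked at that moment. Whether that split is actionable is still a question for a person who understands the business.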

ML cartoon

What ML truly represents, in my opinion, is a paradigm shift. With the commoditization of big data and the cloud infrastructures that made that possible, acquiring and analyzing large datasets is no longer the herculean effort it was just three or four years ago. Huge amounts of data are readily available for collection, storage, and analysis. The problem has shifted from “how do we get the data?” to “how do we use the data we’ve got?”

Roles of a Machine Learning Expert

To quote Tracy Teal, co-founder and Executive Director of Data Carpentry, “before now the conversation has been around bringing compute to data or data to compute, but with ML we are bringing people and understanding to the data and that becomes knowledge.” (1) As great and empowering as this sounds, the people part of this quote is both the best part of ML and its Achilles heel. It is the reason there is so much hype, as well as so much confusion, around ML and why it seems so daunting. Occupying a space within the field of data science, and as a subset of artificial intelligence, ML lies at the intersection of mathematics, statistics and computer science. It requires a combination of skill sets that have traditionally been found in separate job descriptions and therefore typically requires separate professionals.

Subject Matter Expertise: It’s all about the data. You need to first understand your data. You need to understand the implications of poor data quality for your ML models, and understand the data well enough to know how changes in its various attributes or features affect your organization’s mission or business model. Essentially, you must be a Subject Matter Expert (SME) on your data. This role was traditionally held by the data expert in the business’s marketing, inventory, supply chain or financial departments, but that person is usually not a data analyst by vocation.

Data Analysis: You need data analyst skills, and you need to be comfortable working with the software tools and techniques necessary to acquire, load, cleanse and finally analyze the applicable datasets.

Mathematical/Statistical Analysis: Mathematical skills, specifically in statistical analysis, are also needed to choose the most appropriate algorithm for the given dataset and for the type of analysis the problem you’re trying to solve calls for.

IT Infrastructure: You need to understand your IT environment and infrastructure well enough to quickly configure and deploy the server resources and software needed to stand up a new ML “laboratory” for testing and improving your ML model with live or real data.

Computer Science/Programming: You need to be comfortable with statistical analysis programming languages such as R, and with leveraging analysis libraries in languages such as C, C# and Python.

ML Cartoon

I just described to you the roles of no fewer than five separate professionals, and perhaps as many as ten, yet a data scientist or an ML expert is expected not only to possess all these skills but to be a master of all of them. This combination of skill sets at a mastery level rarely exists in one person, nor is it realistic for a “data scientist wannabe” to acquire a mastery level of these skills while maintaining their ‘9 to 5’. Now you can start to see why the barrier to entry for data science and ML is perceived to be so high. The good news is that both employers and software vendors are starting to realize that finding all of these skill sets in one person is nearly impossible, so a better strategy is both to leverage the existing strengths of these separate roles and to offer tools and training that bridge the gaps in the requisite skills of existing professionals. Grow a data scientist rather than hire one. The software vendors’ approach to bridging the skill-set gaps via training and software tools will be discussed more closely in my next blog entry.

Sources:
(1) https://www.youtube.com/watch?v=xMmpMXlSzW0

Images:
http://analyticscube.com/2017/07/
http://bigdata-madesimple.com/dilberts-20-funniest-cartoons-on-big-data/