What’s The Best Path To Becoming A Data Scientist?
There’s a lot of interest in becoming a data scientist, and for good reasons: high impact, high job satisfaction, high salaries, high demand. A quick search yields a plethora of possible resources that could help — MOOCs, blogs, Quora answers to this exact question, books, Master’s programs, bootcamps, self-directed curricula, articles, forums and podcasts. Their quality is highly variable; some are excellent resources and programs, some are click-bait laundry lists. Since this is a relatively new role and there’s no universal agreement on what a data scientist does, it’s difficult for a beginner to know where to start, and it’s easy to get overwhelmed.
Many of these resources follow a common pattern: 1) here are the skills you need, and 2) here is where to learn each of them. Learn Python from this link, R from this one; take a machine learning class and “brush up” on your linear algebra. Download the iris data set and train a classifier (“learn by doing!”). Install Spark and Hadoop. Don’t forget about deep learning: work your way through the TensorFlow tutorial (the one for ML beginners, so you can feel even worse about not understanding it). Buy that old orange Pattern Classification book to display on your desk after you give up two chapters in.
This makes sense; our educational institutions trained us to think that’s how you learn things. It might eventually work, too, but it’s an unnecessarily inefficient process. Some programs have capstone projects (often using curated, clean data sets with a clear purpose, which sounds good but isn’t). Many recognize there’s no substitute for ‘learning on the job’, but how do you get that data science job in the first place?
What does a data scientist do?
There is much discussion, and often confusion, around the term “data scientist.” In short, data science is the process of asking questions of data and getting answers from it. Breaking the roles of data scientists into four distinct categories, each with its own focus, may help clarify the different uses of the term.
The first category, referred to in this article as Data Scientist 1 (DS1), is responsible for creating the data strategy and the overall technological requirements for how data will be collected, stored, formatted, and accessed throughout the life cycle of rapid insight gathering. This type of data scientist is also relied on to develop AI and other software tools that let the other groups ask questions of the data and get answers. Another key responsibility is making sure that users have ‘good data’: data that is free of errors and of formatting that makes it difficult to use.
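As a rough illustration of what checking for ‘good data’ might look like in practice, the sketch below audits a few records for missing values, inconsistent date formats, and non-numeric measurements. The records, field names, and the choice of year-month-day as the expected date format are all hypothetical; a real audit would depend on the organization's own data and rules.

```python
from datetime import datetime

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"part_id": "A-100", "measured_at": "2023-04-01", "width_mm": "12.5"},
    {"part_id": "A-101", "measured_at": "04/02/2023", "width_mm": "12.7"},
    {"part_id": "", "measured_at": "2023-04-03", "width_mm": "n/a"},
]


def is_year_month_day(text):
    """True if the text parses as a year-month-day date (assumed convention)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_number(text):
    """True if the text can be converted to a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def audit(records):
    """Return a list of (row_index, field, problem) tuples."""
    problems = []
    for i, row in enumerate(records):
        if not row["part_id"].strip():
            problems.append((i, "part_id", "missing value"))
        if not is_year_month_day(row["measured_at"]):
            problems.append((i, "measured_at", "unexpected date format"))
        if not is_number(row["width_mm"]):
            problems.append((i, "width_mm", "not numeric"))
    return problems


if __name__ == "__main__":
    for problem in audit(RECORDS):
        print(problem)
```

A report like this gives the DS1 something concrete to bring to the people who produce and use the data, rather than a general impression that the data is "messy."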
The DS1 has a critical, and often difficult, role in engaging the subject matter experts (SMEs) who have vital knowledge about how processes function and are measured. Understanding the needs of SMEs can be daunting when the DS1 is asked to help create meaningful algorithms. Getting the right data in the right format to the right people is the basis for creating a top-notch organization. The DS1 also plays a significant role in the technical work of making data rapidly available, since the volume and velocity of the data can make analysis an overwhelming task for the data scientists described later.
The second category, Data Scientist 2 (DS2), works with SMEs to understand the types of data available and what the SMEs need in order to perform advanced analytics. The DS2 is supported by statisticians and by a new breed of graduates with master’s degrees in analytics; the latter focus on analyzing data and less on the underlying mathematical theory. Both disciplines provide significant support to SMEs performing advanced analytics. The DS2 must also recognize that a major consideration in integrating analytics is how it fits with the Fourth Industrial Revolution. Terms like Industry 4.0, Manufacturing 4.0, Smart Manufacturing, and Quality 4.0 are being used, and sometimes misused, more often. One must consider how data creation, collection, and formatting have changed in this environment.
Defining the most relevant and useful data to be created and collected is valuable. Taking time to brainstorm what key questions need to be answered with data can also add value to the discussion around data sources and what data to collect.
The next exercise would be to measure the volume of data that is desired and the velocity at which it is being created. Taking time to work through measuring the impact of the volume and velocity of data available will oftentimes educate both the DS1 and DS2 in ways that optimize the tasks of each.
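One simple way to put numbers on volume and velocity is to total what has been ingested and divide by the time span it arrived over. The sketch below does this for a small, hypothetical log of ingested batches; the log layout and the numbers are assumptions for illustration, and the resulting rates are only rough averages over the observed window.

```python
from datetime import datetime

# Hypothetical ingestion log: (ISO timestamp, record count, size in bytes).
BATCHES = [
    ("2023-04-01T00:00:00", 1_000, 250_000),
    ("2023-04-01T00:10:00", 1_500, 375_000),
    ("2023-04-01T00:20:00", 500, 125_000),
]


def volume_and_velocity(batches):
    """Summarize total volume and rough average arrival rates."""
    times = [datetime.fromisoformat(stamp) for stamp, _, _ in batches]
    total_records = sum(count for _, count, _ in batches)
    total_bytes = sum(size for _, _, size in batches)

    # Time between the first and last batch, in seconds.
    span_seconds = (max(times) - min(times)).total_seconds()

    # With a single batch (or identical timestamps) no rate can be estimated.
    if span_seconds > 0:
        records_per_second = total_records / span_seconds
        bytes_per_second = total_bytes / span_seconds
    else:
        records_per_second = None
        bytes_per_second = None

    return {
        "records": total_records,
        "bytes": total_bytes,
        "records_per_second": records_per_second,
        "bytes_per_second": bytes_per_second,
    }


if __name__ == "__main__":
    print(volume_and_velocity(BATCHES))
```

Even a crude summary like this gives the DS1 and DS2 a shared starting point for discussing storage, access speed, and which analyses are realistic.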
A conversation about the analysis methods that will turn data into meaningful information for decision-making will uncover the technical nature of advanced analytics, as well as how the data must be formatted to accomplish the task. At this point the DS1 and DS2 can bring in the SMEs for meaningful discussion. From the questions that surfaced during earlier conversations about what data is needed, the DS2 can start to build an analytics strategy. That strategy starts by separating simple data exploration with Business Intelligence (BI) software from more sophisticated methods such as machine learning.
At this point the DS1 and DS2 may collaborate on the purchase of analytical software. Some general themes to consider when purchasing from a software vendor are: 1) Avoid software that is ‘overkill’. Software companies may ‘bundle’ many features into a package, including analytical procedures and features that may never be used. 2) Ensure that the software is not overcomplicated. It should install and configure without significant time and resources; software that is difficult to install and configure may also be difficult to maintain. 3) Watch the price. This sounds obvious, but costs can grow when the vendor starts charging for ‘consulting’ to install and configure their system.