Data teams are very hard to hire. Engineers are in short supply and high demand, resumés are difficult to vet without time-consuming phone or coding screens, and without FAANG’s resources it can be hard to find people as passionate about your product as you are. As the hiring manager, you may be lucky enough to afford the services of a professional recruiter, but recruiters often send you completely unsuitable candidates.
The other day I had coffee with a very friendly recruiter I know in NYC to discuss these industry issues. We agreed that it would be useful to define what makes up a great data team and what the roles actually mean.
Who should I hire first? And how do they work together?
- Data Analyst
- Your goal in hiring a data team is probably business insights and getting to a point at which your actions are largely driven by data rather than intuition. For this, a competent Data Analyst is your starting point. Thankfully, Data Analysts are in plentiful supply. New grads with a passion for digging for insights and a quantitative or semi-quantitative (e.g. Econ) background shouldn’t be too tough to find. On the high end you can find those with experience in SQL and also relatively new BI tools such as Looker. If you have the budget, 75% of the engineering tasks that they will encounter can be solved with plug-and-play pipeline tools such as Segment. Most standard business insights (“what happened yesterday/last month?”, “what’s an underserved demographic in my customer base?”) can be answered without an expensive Data Scientist.
- Data Engineer
- At some point, your databases and queries are going to become too slow. You’ll also want to scrape a data source or use an API that no one has built a Segment connector for. Perhaps you’ll want some custom lead scoring for sales, you’ll run into a bunch of security issues, or you’ll have some giant set of queries that you want refreshed every morning with a notification sent to you in Slack. Not having a programming background, your Data Analyst will feel out of their depth. This is when you need to hire a Data Engineer. Their expertise really lies in getting your datasets to the point at which they can be analyzed. If your tables are getting 50 million events per month, then queries will start to take hours overnight and tables may even lock up. The Data Engineer can alleviate this: they will be able to optimize the indexes of your database’s tables for fast lookups, create materialized views of helpful aggregates that refresh every morning, connect custom APIs to your data warehouse and generally go above and beyond what is covered by plug-and-play tools.
- Data Scientist
- Now that you’re getting standard business insights, and getting them at scale from interesting sources, you’re almost certainly going to be curious about the siren song of AI. What if you could predict which customers will churn or convert? What if you could cluster your customer base by behavior? This sort of next-level insight is all possible with a competent Data Scientist. They will use Machine Learning libraries in Python to try to automatically classify your customers and actions in your system. They can predict the future (with certain degrees of accuracy) and tell you stories about your data. This will involve using the large datasets that the Data Engineer has provided. Once these insights have been surfaced, the Data Scientist will have a trained model which can be saved to disk and refreshed. But what if this needs to be refreshed on a daily basis, and those insights added to a table that the Data Analyst or even a Salesperson can read? Then you need to go back to the Data Engineer and have them set up a backend system to do this.
- Data Science Engineer, Machine Learning Engineer
- Sometimes, your (possibly more old-school) Data Engineer will be 100% focused on pipelines and database admin and not particularly experienced at implementing the Python-based models your Data Scientist came up with. For this issue, you need a Data Science Engineer or Machine Learning Engineer. Their expertise is in deploying, scaling and refreshing the model that the Data Scientist came up with. The Data Scientist should be focusing entirely on probabilities, tweaking model parameters and confidence scores; your Data Analyst should be focusing on the higher-level narrative; and your DS/ML Engineer can now take the model and make sure it delivers your insights quickly, and with clean, fresh, correct data.
- Visualization Engineer
- What if Looker doesn’t have the charts you want? What if you need to visualize a network? What if your data has 16 dimensions and you’ve exhausted all the color, size and shape options in your scatter plot? Then you need the very specialized role of Visualization Engineer to build you custom visualizations and think of better ways to surface the insights that the Data Analyst struggled with.
- Business Intelligence Engineer
- This is an ambiguous title – they could be a Data Science Engineer, or Data Engineer.
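To make the split between the Data Analyst and the Data Engineer a little more concrete, here is a minimal sketch in Python using the built-in `sqlite3` module. Everything in it is an assumption made for illustration: the `events` table, its columns, the sample rows and the function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse. SQLite has no materialized views, so a plain summary table that gets rebuilt on a schedule plays that part here.

```python
import sqlite3

# Hypothetical schema: an in-memory "events" table standing in for a warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (customer_id INTEGER, event_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2024-03-01", 20.0),
        (2, "2024-03-01", 35.0),
        (1, "2024-03-02", 15.0),
        (3, "2024-03-02", 50.0),
    ],
)


def daily_revenue(conn, day):
    """Analyst-style question: how much revenue came in on a given day?"""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM events WHERE event_date = ?",
        (day,),
    ).fetchone()
    return row[0]


# Engineer-style fix 1: an index so lookups by date can avoid a full table scan.
conn.execute("CREATE INDEX idx_events_date ON events (event_date)")


def refresh_daily_summary(conn):
    """Engineer-style fix 2: rebuild a precomputed aggregate table.

    A scheduler (cron, Airflow, ...) would call this every morning.
    """
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS daily_summary")
        conn.execute(
            "CREATE TABLE daily_summary AS "
            "SELECT event_date, COUNT(*) AS n_events, SUM(amount) AS revenue "
            "FROM events GROUP BY event_date"
        )


refresh_daily_summary(conn)
summary = conn.execute(
    "SELECT event_date, n_events, revenue FROM daily_summary ORDER BY event_date"
).fetchall()
```

The analyst only ever needs `daily_revenue` or a read from `daily_summary`; the index and the refresh job are the parts that typically need someone with an engineering background once the tables get large.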
|Role|Focus|Typical tools|
|---|---|---|
|Data Analyst|Basic business insights|Looker/Tableau, SQL, Excel, Redash|
|Data Scientist|More complicated business predictions|Jupyter Notebooks, Python, scikit-learn, NumPy, SciPy, statsmodels|
|Data Engineer|Getting “Big Data” to the point at which it can be analyzed, as well as connecting custom data sources|SQL, Python, Scrapy, spaCy, Selenium, Kafka, Airflow, Luigi, Spark, AWS, Redshift|
|Data Science Engineer or Machine Learning Engineer|Implementing the Data Scientist’s models at scale|SQL, Python, scikit-learn, NumPy, SciPy, AWS, TensorFlow|
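The handoff between the Data Scientist and the DS/ML Engineer can be sketched roughly as follows. This is an illustration under stated assumptions, not a recommended stack: a tiny hand-written logistic regression (standard library only) stands in for a real library such as scikit-learn, the two features (scaled login activity and scaled support tickets), the training rows, the `churn_scores` table and all function names are made up, and SQLite stands in for the warehouse.

```python
import math
import os
import pickle
import sqlite3
import tempfile


def predict_proba(weights, x):
    """Churn probability for one feature tuple; weights = [bias, w1, ..., wn]."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights with plain batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            grads[0] += error
            for j, xj in enumerate(x):
                grads[j + 1] += error * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# --- Data Scientist side: train on (made-up) history and save the model to disk.
# Features: (scaled logins last month, scaled support tickets); label 1 = churned.
rows = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.0), (0.2, 0.8), (0.1, 0.9), (0.3, 0.7)]
labels = [0, 0, 0, 1, 1, 1]
model_path = os.path.join(tempfile.mkdtemp(), "churn_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(train_logistic(rows, labels), f)


# --- Engineer side: a job that reloads the model and refreshes a readable table.
def refresh_scores(conn, model_path, customers):
    """Score every customer and upsert the result into churn_scores."""
    with open(model_path, "rb") as f:
        weights = pickle.load(f)
    with conn:  # commit on success, roll back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS churn_scores ("
            "customer_id INTEGER PRIMARY KEY, churn_probability REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO churn_scores VALUES (?, ?)",
            [(cid, predict_proba(weights, x)) for cid, x in customers.items()],
        )


conn = sqlite3.connect(":memory:")
todays_customers = {101: (0.85, 0.1), 102: (0.15, 0.9)}  # hypothetical IDs
refresh_scores(conn, model_path, todays_customers)

# What the Data Analyst or a Salesperson would read from the table.
at_risk = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM churn_scores "
        "WHERE churn_probability > 0.5 ORDER BY customer_id"
    )
]
```

The point of the split is that `train_logistic` can change as often as the Data Scientist likes, while `refresh_scores` only depends on the saved file and the output table, so it can run daily without anyone opening a notebook.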
But why does everyone just focus on Data Science?
Data Science is a buzzword that some use for anyone who sits at the intersection of Data Engineering and Analytics. One of the biggest mistakes companies make is hiring too many Data Scientists and not enough Data Engineers. Data Scientists have recently become easier to hire, as many former Mathematicians, Chemists, Statisticians and other quantitative grads have found they can either easily rebrand themselves or attend a 3-month Bootcamp and get hired in a junior role. However, their models and insights can be limited if they don’t know how to scale or deploy them, and they will spend much of their time cleaning datasets rather than concentrating deeply on the message and predictions hidden within. Of course, as someone who now mostly focuses on Data Engineering I am biased, but I would say that each Data Scientist should be paired with at least one Data Engineer. I actually moved into Data Engineering from Data Science out of necessity for my own work.
I hope this guide was successful in explaining today’s data teams. Please comment or email me if you have any ideas about how to improve it.