How much does a PhD add to a Data Scientist’s salary: explanation from the Stack Overflow 2017–2020…
SHAP values for years of coding, educational level, and more
As a follow-up to the previous story about choosing Data professional roles, this article investigates how important certain characteristics (such as previous education, company size, or coding experience) are to a Data Scientist’s yearly compensation.
To answer this question, let us take a closer look at the public Kaggle dataset "Salary and more - Data Scientist, Analyst, Engineer", which contains 2017–2020 data processed from the Stack Overflow Annual Developer Survey.
Full details of the analysis can be found in this public Kaggle notebook.
Step 1 — data preprocessing
Here, data preprocessing consists of the following steps:
selecting data for a single country (United States);
rescaling the label column to kUSD/year;
removing the 5% of respondents with the largest compensations and the 5% with the smallest;
selecting only the high cardinality data in categorical features;
replacing null values.
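The steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the notebook's code: the column names `Country`, `ConvertedComp`, and `EdLevel`, the `"Unknown"` placeholder for nulls, and dropping rows without a label are all assumptions. The category-selection step is left out because the text does not say how it is done. Rows are plain dictionaries to keep the example dependency-free.

```python
def preprocess(rows, country="United States", tail_percent=5):
    """Filter, rescale, trim, and fill a list of survey rows (dicts)."""
    # Keep one country; rows without a compensation value are dropped (assumption).
    kept = [dict(r) for r in rows
            if r.get("Country") == country and r.get("ConvertedComp") is not None]

    # Rescale the label from USD/year to kUSD/year.
    for r in kept:
        r["comp_kusd"] = r.pop("ConvertedComp") / 1000.0

    # Drop the lowest and highest tail_percent of respondents by compensation.
    kept.sort(key=lambda r: r["comp_kusd"])
    k = len(kept) * tail_percent // 100
    kept = kept[k:len(kept) - k]

    # Replace remaining nulls with an explicit placeholder category (assumption).
    for r in kept:
        for key, value in r.items():
            if value is None:
                r[key] = "Unknown"
    return kept


# Synthetic rows for illustration only, not survey data.
rows = [{"Country": "United States", "ConvertedComp": 1000 * i,
         "EdLevel": None if i % 10 == 0 else "Master's degree"}
        for i in range(1, 101)]
rows.append({"Country": "Germany", "ConvertedComp": 70000, "EdLevel": None})

clean = preprocess(rows)
print(len(clean), clean[0]["comp_kusd"], clean[-1]["comp_kusd"])
```

Trimming both tails before modelling keeps a handful of extreme answers from dominating the RMSE reported in the next step.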
Step 2 — setting up a Machine Learning model to predict the yearly compensation
The data prepared in the previous step are randomly split into training and test samples and modelled with CatBoostRegressor, which handles categorical features explicitly. The root mean squared error (RMSE) of the resulting model is about 32 kUSD/year, an improvement over the baseline model, which predicts the same yearly compensation of about 108 kUSD/year for every respondent and has an RMSE of about 37 kUSD/year.
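To make the baseline comparison concrete, here is a minimal sketch of how a constant-prediction baseline RMSE can be computed. The function names and the compensation values are illustrative assumptions. Only the 32 and 37 kUSD/year figures in the last lines come from the text.

```python
import math
from statistics import fmean


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the label."""
    return math.sqrt(fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def constant_baseline_rmse(y_train, y_test):
    """RMSE of a model that predicts the training mean for every respondent."""
    mean = fmean(y_train)
    return rmse(y_test, [mean] * len(y_test))


# Synthetic compensations in kUSD/year, for illustration only.
y_train = [80.0, 100.0, 120.0, 140.0]
y_test = [90.0, 110.0, 130.0]
print(constant_baseline_rmse(y_train, y_test))

# Relative improvement implied by the RMSE figures quoted in the text.
relative_improvement = (37 - 32) / 37
print(f"{relative_improvement:.1%}")
```

By this measure, the quoted figures correspond to a reduction of roughly 13.5% in RMSE relative to the constant baseline.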
Step 3 — explanation of the obtained Machine Learning model
Here, we use the SHapley Additive exPlanations (SHAP) method, one of the most common tools for explaining Machine Learning models. SHAP splits each prediction into additive per-feature contributions around a reference prediction, so SHAP values carry the same units as the label, here kUSD/year.
First, we look at the range of SHAP values for each feature of interest:
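The additive property can be illustrated with an exact Shapley computation on a toy model. This is a sketch under simplifying assumptions, not the algorithm the `shap` package applies to tree models: "missing" features are taken from a single reference row, and the toy model with its coefficients is made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x against one background row."""
    n = len(x)

    def value(subset):
        # Features in the subset come from x, the rest from the background row.
        z = [x[i] if i in subset else background[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Made-up compensation model in kUSD/year: z = [years_coding, has_phd]."""
    years, phd = z
    return 60.0 + 3.0 * years + 10.0 * phd + 1.0 * years * phd


x = [10.0, 1.0]           # respondent to explain
background = [0.0, 0.0]   # reference respondent
phi = shapley_values(toy_model, x, background)
print(phi, toy_model(background) + sum(phi), toy_model(x))
```

The contributions sum to the difference between the explained prediction and the reference prediction, which is why each of them is expressed in kUSD/year.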

As we see, the largest range of SHAP values is due to years of professional coding (YearsCodePro variable). Not surprisingly, respondents with more professional coding experience are paid more, with a difference of about 50 kUSD/year between the most and the least experienced respondents:

Interestingly, if the respondent works not only as a Data Scientist but also as a Data Analyst, Business Analyst, or Database Administrator, the expected yearly compensation is noticeably lower (by up to 10 kUSD/year). A likely explanation is that average salaries for these professions are generally lower than Data Scientist’s salaries, so such “mixed” positions can be interpreted as a transition toward Data Scientist positions. Consistent with this, the downward trend does not hold for Data Engineering positions, whose salaries are very close to those of Data Scientists.
Another interesting effect is the impact of educational level:

Not surprisingly, among educational levels the doctoral (PhD) degree has the largest positive impact. However, looking at individual SHAP values coloured by survey year (2017 to 2020), we can see that most of the highest SHAP values for a PhD degree occur in 2017, while the 2020 data show much smaller SHAP values. Indeed, the average SHAP value of a PhD degree over 2017–2020 is 8.1 kUSD/year; for 2017 it is 10.6 kUSD/year, and for 2020 it is only 5.3 kUSD/year. In other words, according to this model, the value of a doctoral degree for a Data Scientist position roughly halved between 2017 and 2020.
Remarkably, there is no clear trend with organisation size. The average difference between very large (> 5,000 employees) and very small (1–20 employees) organisations is no more than 1 kUSD/year:
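Per-year averages of this kind could be computed along the following lines. This is a sketch only: the record layout, the field names `Year`, `EdLevel`, and `shap_edlevel`, and the category label `"Doctoral degree"` are assumptions, and the numbers are placeholders rather than the article's data.

```python
from collections import defaultdict
from statistics import fmean


def mean_shap_by_year(records, level="Doctoral degree"):
    """Average SHAP value of one education level, per survey year."""
    by_year = defaultdict(list)
    for r in records:
        if r["EdLevel"] == level:
            by_year[r["Year"]].append(r["shap_edlevel"])
    return {year: fmean(values) for year, values in sorted(by_year.items())}


# Placeholder values in kUSD/year, for illustration only (not the article's data).
records = [
    {"Year": 2017, "EdLevel": "Doctoral degree", "shap_edlevel": 12.0},
    {"Year": 2017, "EdLevel": "Doctoral degree", "shap_edlevel": 8.0},
    {"Year": 2020, "EdLevel": "Doctoral degree", "shap_edlevel": 6.0},
    {"Year": 2020, "EdLevel": "Doctoral degree", "shap_edlevel": 4.0},
    {"Year": 2020, "EdLevel": "Bachelor's degree", "shap_edlevel": -1.0},
]
print(mean_shap_by_year(records))
```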

Finally, looking at the SHAP values of the survey year, we see a steady increase in predicted compensation of about 4 kUSD per year, a yearly growth rate of roughly 4%, throughout 2017–2020:

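The growth rate quoted above is consistent with a simple ratio, assuming the 4 kUSD yearly increase is taken relative to the average compensation of about 108 kUSD/year. The text does not state the reference value, so that choice is an assumption, as is the helper name below.

```python
def yearly_growth_rate(increase_kusd, base_kusd):
    """Yearly increase expressed as a fraction of a reference compensation."""
    return increase_kusd / base_kusd


# 4 kUSD/year increase relative to the ~108 kUSD/year average (assumed reference).
rate = yearly_growth_rate(4.0, 108.0)
print(f"{rate:.1%}")
```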
I hope these results are useful to you. If you have questions or comments, do not hesitate to write in the comments below or reach me directly through LinkedIn or Twitter.