Show Me the Money

Maura Cerow
5 min read · Apr 14, 2020

Using Decision Trees & Ensemble Methods to predict baseball player salaries.

We should be in baseball season right now. Hockey should be in playoffs. Basketball should exist. I'll even admit to golf being around. But sadly they're all on hiatus while the world implodes, and I for one miss sports. So to include them in my everyday quarantine life, I figured let's try to predict player salaries (and then be in awe of how much athletes get paid in a single season) using my newly learned Decision Tree & Ensemble Method skills.

In an effort to flex some web scraping skills, I pulled my data from USA Today. It's a super simple dataset: just player name, team, position, and annual salary. I set out to see if team and player position are good indicators of how much a player will make in a given year. After all, 6 of the top 10 highest paid players are starting pitchers. Max Scherzer signed with the Nationals back in 2015 on a $210 million contract. His salary for 2019 was a whopping $42m, more than $12m ahead of the next highest paid player. Clearly this data has some outliers.
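For the curious, the scrape itself can be tiny. This is only a sketch: the URL, the column order, and the salary cleanup below are my assumptions, not the exact code behind the post.

```python
# A rough sketch of the scrape. The URL and table markup here are
# placeholders; the real USA Today page will differ.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.usatoday.com/sports/mlb/salaries/"  # placeholder URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")

rows = []
for tr in soup.select("table tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 4:  # assumed layout: name, team, position, salary
        rows.append(cells[:4])

df = pd.DataFrame(rows, columns=["name", "team", "position", "salary"])
# Strip '$' and ',' so salary becomes numeric
df["salary"] = df["salary"].str.replace(r"[$,]", "", regex=True).astype(float)
```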

Now, to me there's good reason to believe the team also dictates the salary of a player. That's the entire premise of Moneyball, where smaller markets buy up undervalued players based on statistical analysis and sell off overvalued ones to other markets.

Digging into my data, I first wanted to check the distribution of my target variable, in this case salary. I removed outliers using the IQR (interquartile range) rule. Below is a graph of the distribution.
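The filter is only a couple of lines, assuming the `df` from the scrape above; the 1.5 × IQR fence is the standard rule.

```python
import matplotlib.pyplot as plt

# Keep salaries inside Q1 - 1.5*IQR and Q3 + 1.5*IQR, the usual IQR fence
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Distribution of the target after trimming
df["salary"].hist(bins=30)
plt.xlabel("2019 salary ($)")
plt.ylabel("number of players")
plt.show()
```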

Before I ran my first model, I had to turn my two categorical features into dummies, dropping the first level of each to avoid multicollinearity. My small DataFrame grew to 42 columns.
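In pandas that's one call (the column names here are assumed from the dataset description):

```python
# One dummy column per team and per position, minus the first level of
# each, so the dummies aren't perfectly collinear
X = pd.get_dummies(df[["team", "position"]], drop_first=True)
y = df["salary"]
```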

To set the stage, my data ranges from $555k to $14m, with the average salary for 2019 being $2.8m. That's a pretty wide window to be predicting, but I'm curious to see how well Decision Trees, and later Random Forests, handle the job. My first model is a simple Decision Tree. I didn't tune any hyperparameters; I just let it run with all my dummies and scored it on my test data. Decision Trees get a bad rap for overfitting, and it's not entirely unjust: left unconstrained, a tree will carve out a space for each individual observation in the training set, which isn't super helpful when predicting on a new dataset. This model returned a mean squared error of 16.4 million. So maybe I should try something else.
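Here's a minimal sketch of that first model. The test-set fraction and random seed are my assumptions, since the post doesn't specify them.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Assumed split; the post doesn't say how the data was divided
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

tree = DecisionTreeRegressor(random_state=42)  # no tuning, all defaults
tree.fit(X_train, y_train)
print(mean_squared_error(y_test, tree.predict(X_test)))
```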

Random Forest has to be one of my favorite prediction methods I've learned so far. Random Forest, along with other ensemble methods, uses the idea of the 'wisdom of the crowd' to predict outcomes. This is the idea that as we average out multiple predictions from different models, we get close to the actual value (think Central Limit Theorem). Random Forest takes a collection of decision trees and, for Regression Forests, takes the average prediction as the output. Each tree is weighted the same in Random Forest. While each individual tree has high variance, the goal is to create diverse opinions whose errors average out. When I ran my first Random Forest, I again didn't tune any hyperparameters. My MSE improved, but only slightly, to 15.2 million.
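With the same split, that's just a swap of the estimator (again with defaults; the seed is my assumption):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)  # defaults, no tuning yet
rf.fit(X_train, y_train)
print(mean_squared_error(y_test, rf.predict(X_test)))
```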

Next up, I ran a GridSearch to find the best parameters for my Random Forest model. I'm finally touching the hyperparameters! GridSearch uses cross-validation to test every combination of the parameter values a user supplies and keeps the best-performing set. I adjusted the number of estimators, the depth of each tree, the max features considered at each split, the minimum samples required to split a node, and whether or not to use bootstrapped data. After updating my hyperparameters in my Random Forest Regressor, I saw some improvements! My MSE came in at 11 million.
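The post doesn't list the exact grid, so the values below are illustrative; the shape of the search is the point.

```python
from sklearn.model_selection import GridSearchCV

# Illustrative values only, covering the hyperparameters mentioned above
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 10, None],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": [2, 5, 10],
    "bootstrap": [True, False],
}

grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(mean_squared_error(y_test, grid.best_estimator_.predict(X_test)))
```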

The final model I wanted to try is AdaBoost. AdaBoost is another ensemble method, but where Random Forest produces independent models, AdaBoost iterates over each decision tree, attempting to correct what the previous tree got wrong. AdaBoost also weights each tree differently depending on its performance: the lower the error rate for a given learner, the more weight it gets in the pool. Where Random Forest uses strong learners, AdaBoost combines weak learners into one strong learner. AdaBoost has some problems when it comes to outliers. Because boosting keeps focusing on the observations it gets wrong, it can assign outsized weight to an outlier that doesn't conform to the rest of the data. When I ran this model, my MSE actually increased. It didn't increase tremendously, but it did go up to 11.9 million.
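A minimal sketch with scikit-learn; the base-learner depth and number of estimators are my assumptions (a depth-3 tree is also sklearn's default base estimator for AdaBoostRegressor):

```python
from sklearn.ensemble import AdaBoostRegressor

# Weak learners: shallow depth-3 trees; n_estimators is an assumed value
ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                        n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
print(mean_squared_error(y_test, ada.predict(X_test)))
```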

So maybe not the most practical model: my best MSE is still over 11.0 million, from the Random Forest with grid search. Some things I could do to improve it are adding features like years active, batting average, etc., to really turn this into something worthwhile.

Let’s hope baseball comes back soon for more unbelievable feats and the occasional mascot bashing.

GitHub
