April 2021

Sports Analytics- Baseball Data Science

Summary:

In this project, our goal is to predict pitcher performance using previous years' game data and in-game pitch-by-pitch data from Major League Baseball (MLB) regular seasons. We construct features that capture a player's skill and how consistently he carries that skill forward over several years. Using these features, we predict pitcher performance, as measured by the same features, in the final season of our dataset.

Skill-based vs. Luck-based features:

Skill-based features truly capture the player's ability to perform and (positively) influence the play. These are attributes that generally depend on the player himself and not on events beyond his control. In simple words, a good score on these features reflects that the player is really good and deserves to be called the 'best player' in terms of these features.

Luck-based features, on the other hand, depend on several external events which are beyond the control of the individual player. They are affected by factors like how the player's teammates perform, or how (badly) the opponents perform, etc. In simple words, a good score on these features may mean that the player is somewhat overrated or was just 'lucky' to have a good match.

These features need to be constructed because they are not explicitly seen during the game. What we observe during a game is a complicated mix of skill-based features and luck-based features, along with several random events which are beyond prediction.

Thus, some players who are actually very good may be undervalued and some players who are not so good may be overvalued.

Skill vs. Luck-based features in Baseball:

As our goal is to predict pitcher performances (will be precisely defined later), we need to construct features that measure how good the pitcher is and also learn about features that fall in the luck-based category. Using the data as shown below, let's construct the following features:

bbrate (Walk rate): bb/bf
krate (Strikeout rate): k/ (bf-bb)
hrrate (Home Run rate): hr/ (bf-bb-k)
hrate (Hit rate): (h - hr)/ (bf-bb-k-hr)

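As may be apparent from the above formulas, we single out the particular type of event we are interested in by conditioning on the sample space: each denominator removes the batters already accounted for by the preceding outcomes (e.g. krate is computed only over batters who did not walk). For more details, please refer to the McCracken article.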
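A minimal sketch of the four formulas above in plain Python. The function name and the season totals are hypothetical (made-up numbers, not a real pitcher); only the formulas come from the text.

```python
def pitcher_rates(bf, bb, k, hr, h):
    """Return the four rate features from one pitcher's season totals.

    bf = batters faced, bb = walks, k = strikeouts, hr = home runs, h = hits.
    Each denominator drops the outcomes already counted by earlier rates.
    """
    return {
        "bbrate": bb / bf,
        "krate": k / (bf - bb),
        "hrrate": hr / (bf - bb - k),
        "hrate": (h - hr) / (bf - bb - k - hr),
    }

# Hypothetical season line, for illustration only.
rates = pitcher_rates(bf=800, bb=50, k=200, hr=20, h=170)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Note how the denominator shrinks at each step: 800 batters for bbrate, 750 for krate, 550 for hrrate, and 530 for hrate.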

After calculating these features, we fit a linear regression model between successive years' feature values for each pitcher, e.g. fit y = krate vs. x = krate_{previous_year} as shown below:

These are the slope coefficients (m) in y=mx +c we observed for year-over-year fits:

m=0.747 for krate (Strikeout rate)
m=0.547 for bbrate (Walk rate)
m=0.295 for hrrate (Home run rate)
m=0.172 for hrate (Hit rate)
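The year-over-year fit can be sketched as follows. This is an illustration under assumptions: the krate values are made up, the dictionary layout keyed by (pitcher, year) is hypothetical, and the closed-form least-squares line stands in for whatever regression routine was actually used.

```python
def fit_line(x, y):
    """Ordinary least squares for y = m*x + c, returning (m, c)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, mean_y - m * mean_x

# Made-up krate values keyed by (pitcher, year).
krate = {
    ("p1", 2016): 0.15, ("p1", 2017): 0.18,
    ("p2", 2016): 0.20, ("p2", 2017): 0.21,
    ("p3", 2016): 0.25, ("p3", 2017): 0.25,
    ("p4", 2016): 0.30, ("p4", 2017): 0.28,
    ("p5", 2016): 0.35, ("p5", 2017): 0.33,
    ("p6", 2017): 0.22,  # no 2016 season, so this pitcher forms no pair
}

# Pair each season with the same pitcher's following season.
pairs = [
    (value, krate[(pitcher, year + 1)])
    for (pitcher, year), value in krate.items()
    if (pitcher, year + 1) in krate
]
x_prev = [prev for prev, _ in pairs]
y_next = [nxt for _, nxt in pairs]
m, c = fit_line(x_prev, y_next)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
```

A slope close to 1 means last year's value carries over almost fully; a slope close to 0 means last year's value says little about next year.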

Explanation:

These are very interesting observations. Roughly speaking, if pitcher A performed really well in the current season and recorded a lot of strikeouts, it is highly probable that he will carry that performance into the next season. The slope of 0.747 means that roughly 75 % of a pitcher's edge in strikeout rate over the average is expected to carry forward to the next season. This is because strikeout rate (krate) is a skill-based feature and is therefore a really important variable in predicting pitcher performance.

A similar argument works for the walk rate (bbrate). The pitcher's ability to avoid giving walks also carries forward to the next season, with a weight of about 55 %.

For Home Run rate (hrrate), the pitcher is not really in control of preventing home runs consistently over the seasons, as it depends on a lot of factors and may have more to do with the batter's abilities than the pitcher's.

For Hit rate (hrate), we clearly see that this is a luck-based feature. It depends on a lot of factors like fielding done by the pitcher's teammates, the ability of the batter and runners to run fast, etc. Hence, the pitcher's ability to avoid hits does not necessarily carry forward to the next season.

This can be further illustrated as follows:

Here, we compare the top 10 pitchers with the highest krate for the year 2016 and 2017.

Strikeout rate of top 10 pitchers in 2016 and 2017

As we can see in the above figure, 5 players: Strasburg Stephen, Kershaw Clayton, Archer Chris, Scherzer Max, and Ray Robbie appear in the top 10 list of krate for both the years.

Now, let's do a similar comparison for the Hit rate hrate:

Not a single common player appears in the top 10 list of hrate for 2016 and 2017! In fact, Ray Robbie and Syndergaard Noah, who were in the top 10 list of krate, also appear in the top 10 list of hrate in 2016 (they made a lot of strikeouts but also allowed many hits).

This clearly explains that krate is a skill-based feature and pitchers carry forward the performance over the years as measured by krate. Hit rate (hrate) is clearly a luck-based feature and it won't be helpful in predicting the pitcher performances over the years.

Goal:

Now that we have established and understood which features are skill-based vs. luck-based, henceforth, we will carry forward our work to improve our predictions of the Strikeout rate (krate) and the Walk rate (bbrate).

Our goal is to reduce the out-of-sample Mean Square Error (MSE) in the linear regression fit of

y = krate vs. x= krate_{previous_year}

AND

y = bbrate vs. x= bbrate_{previous_year}

The Called Strike Model:

We will build a 'Called Strike Model' (CS model) which predicts the probability of a pitch being called a strike or not a strike (a ball) by the umpire. To do this, we use in-game pitch-by-pitch data for regular-season MLB games. We include only those pitches where the batter did not swing, so that the only possible outcomes are a called strike or a ball. We build this probability model so that we can use it to construct more features, which will greatly help us predict krate and bbrate as mentioned in our Goal. Our pitch-by-pitch data set looks as follows:

Let's call the pitch-by-pitch data set shown above as 'pitch data' for simplicity. The pitch data set has many columns (90+), but we will be describing the in-game pitch data features that we will be using for our called strike model. Detailed information about the in-game pitch data features can be found here.

Building the Called Strike Model:

We will try to capture all the factors that influence the outcome to be called a strike or not by the umpire. As we have many columns in our pitch data set, we will be taking an iterative approach in selecting the features which are relevant to our model. We will be using a Logit model to predict the probability of a pitch being called a strike (label=1) or not strike i.e. ball (label=0).
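To make the logit step concrete, here is a minimal sketch that fits a logistic regression by Newton's method in NumPy. It is a hand-rolled stand-in so that the steps are visible, not the library fit used for the results in this post. The single feature (a distance from the zone centre) and the twelve labels are made up for illustration.

```python
import numpy as np

def fit_logit(features, labels, n_iter=25):
    """Fit logit coefficients (intercept first) by Newton's method."""
    X = np.column_stack([np.ones(len(labels)), np.asarray(features, dtype=float)])
    y = np.asarray(labels, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))             # current probabilities
        gradient = X.T @ (y - p)                         # log-likelihood gradient
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])   # negative of its Hessian
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def predict_prob(beta, features):
    """Predicted probability of a called strike for each row of features."""
    X = np.column_stack([np.ones(len(features)), np.asarray(features, dtype=float)])
    return 1.0 / (1.0 + np.exp(-X @ beta))

# Made-up data: distance from the zone centre and whether a strike was called.
dist = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
called_strike = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

beta = fit_logit(dist, called_strike)
prob_cs = predict_prob(beta, dist)
print("intercept, slope:", beta)
```

On this toy data the slope comes out negative: the farther a pitch is from the centre, the lower its predicted called-strike probability.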

Strike Zone related features:

o High/ Low Miss: Measures how high/low the ball was from the top/bottom of the Strike zone rectangle.

o Left/ Right Miss: Measures how left/right wide the ball was from the left/right of the Strike zone rectangle.

o dist_mid: Euclidean distance of the ball from the middle of the strike zone.
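The three strike zone features could be computed along these lines. This is a sketch under several assumptions that are not stated in the text: pitch location is given as plate_x and plate_z in feet, the batter's zone top and bottom are available as sz_top and sz_bot, the zone is as wide as home plate (17 inches), and a miss is recorded as zero when the pitch is inside the zone in that direction.

```python
import math

# Assumption: the zone spans the width of home plate (17 inches), centred at x = 0.
HALF_PLATE_FT = 17 / 12 / 2

def strike_zone_features(plate_x, plate_z, sz_top, sz_bot):
    """Distances (in feet) by which a pitch misses each edge of the zone."""
    mid_z = (sz_top + sz_bot) / 2
    return {
        "high_miss": max(plate_z - sz_top, 0.0),
        "low_miss": max(sz_bot - plate_z, 0.0),
        "right_miss": max(plate_x - HALF_PLATE_FT, 0.0),
        "left_miss": max(-HALF_PLATE_FT - plate_x, 0.0),
        # Euclidean distance from the centre of the zone rectangle.
        "dist_mid": math.hypot(plate_x, plate_z - mid_z),
    }

# Hypothetical pitch: high and off the plate to the positive-x side.
features = strike_zone_features(plate_x=1.0, plate_z=4.0, sz_top=3.5, sz_bot=1.5)
print(features)
```

Clipping the misses at zero keeps each feature one-sided, so the model can learn a separate penalty for each edge of the zone.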

We get the following result when we fit our logit model using these features mentioned above on the pitch data.

Strikezone features logit fit coefficients

We fit the logit model on pitch data for the years 2012-2017.

As we can see above, all the strike zone features that we constructed are statistically significant (check the z score values or p-values) in predicting the probability of strike.

We obtain a Brier score loss of 0.059178 on the out-of-sample 2018 pitch data.

This will be our methodology: we keep adding new features to our CS model and retain them only if they have a statistically significant coefficient and reduce the Brier score loss on the out-of-sample 2018 pitch data.

Note: Brier score loss is improved by improving our prediction on the probability of calling a strike (probCS). E.g. if probCS=0.9 and the pitch is actually called a strike (label=1), then there is obviously very little error in our CS model for that pitch.
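For reference, the Brier score loss is the mean squared difference between the predicted probability and the 0/1 label. A small sketch with made-up pitches (the function name is ours):

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

# The single pitch from the note above: probCS = 0.9 and a called strike.
single_pitch_loss = brier_score([1], [0.9])

# Four made-up pitches: two well predicted, two poorly predicted.
labels = [1, 0, 1, 0]
prob_cs = [0.9, 0.2, 0.6, 0.7]
loss = brier_score(labels, prob_cs)
print(f"single pitch: {single_pitch_loss:.4f}, four pitches: {loss:.4f}")
```

The confident, correct prediction contributes only about 0.01 to the loss, while the pitch predicted at 0.7 that was called a ball contributes 0.49, so a lower score means better-calibrated, sharper probabilities.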

Let's observe this effect with our own eyes: we look at all pitches thrown to Mike Trout in the two special counts, 3-0 and 0-2, discussed under Categorical features. This is what we see:

pitches faced by Mike Trout when 3 balls 0 strikes and 0 balls 2 strikes

Categorical features:

o Ball-Strike score before the pitch: We make categorical features related to various combinations of ball-strike scores possible.

e.g. 2 balls 0 strikes (2-0), 0 balls 1 strike (0-1), 1 ball 2 strikes (1-2).

There are 12 possible counts (0 to 3 balls times 0 to 2 strikes), which gives 11 indicator variables if one count serves as the baseline. Let's focus our attention on two special cases: 3 balls 0 strikes (3-0) and 0 balls 2 strikes (0-2). Remember that in baseball 3 strikes result in a strikeout and 4 balls result in a walk, hence these two cases are particularly important! (Also, remember that our final goal is to predict krate and bbrate in the linear regression that we did at the beginning.)

What happens in the 3 balls 0 strikes (3-0) case? Maybe the umpire is psychologically more inclined (or pressured) to call a strike, as another ball call would result in a walk. Thus, we expect the mean called strike probability (probCS) to be larger than usual.

What happens in the 0 balls 2 strikes (0-2) case? Maybe the umpire is psychologically more inclined (or pressured) to NOT call a strike, as one more strike means the batter is struck out! Thus, we expect the mean called strike probability (probCS) to be lower than usual.

For Mike Trout, we get the mean probCS in the 3-0 case to be 0.603 and mean probCS in the 0-2 case to be 0.079

Thus, we have evidence supporting our hypothesis that 3-0 and 0-2 are special cases! (Note: We will build on this later)

o Indicator for Left/Right-handed batter: Let's add another categorical feature telling us whether the batter is left or right-handed, along with the plate_x value for the pitch.
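One way to encode these categorical features is sketched below. Enumerating 0 to 3 balls and 0 to 2 strikes gives 12 counts; dropping one as the baseline leaves 11 indicator columns. The choice of 0-0 as baseline, the column names, and the "L"/"R" coding of the batter's side are assumptions made for illustration.

```python
# All ball-strike counts before a pitch: balls 0-3, strikes 0-2.
COUNTS = [f"{balls}-{strikes}" for balls in range(4) for strikes in range(3)]

def count_indicators(balls, strikes, baseline="0-0"):
    """One 0/1 column per count, leaving out the baseline count."""
    current = f"{balls}-{strikes}"
    return {f"count_{c}": int(c == current) for c in COUNTS if c != baseline}

def batter_side_features(stand, plate_x):
    """Indicator for a left-handed batter, kept alongside the pitch's plate_x."""
    return {"bat_left": int(stand == "L"), "plate_x": plate_x}

# A 3-0 pitch to a left-handed batter (made-up location).
row = {**count_indicators(3, 0), **batter_side_features("L", plate_x=-0.4)}
print(row)
```

With this encoding each count coefficient in the logit model is read relative to the baseline count.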

Kinematics related features:

o Velocities: We add the velocity of the ball along the x, y, and z-direction into our model.

o Acceleration: We add the acceleration of the ball along the x and z-direction. Note: ay (acceleration along the y-direction) i.e. the direction from the batter to the pitcher did not turn out to be statistically significant! This probably means how well the ball "swings" up-down or left-right is more important than how it accelerates while going towards the batter.

Finally, we have constructed the following features using the pitch data set:

Strike zone-related features.
Category-related features.
Kinematics-related features.

We use all these features in our CS model (i.e. logit) to predict probCS and reduce the Brier score loss on the out-of-sample 2018 pitch data. These are the coefficients we obtain for the final CS model:

final logit model coefficients by statsmodels python

Insights: Notice that in the ball-strikes score combination category, we obtain the largest positive coefficient in the 3-0 case and the largest negative coefficient in the 0-2 case.

We obtain a Brier score loss of 0.05691 on the out-of-sample 2018 pitch data.

Now that we have finally settled with our CS model, we will make a new column probCS (i.e probability of calling a strike) by doing fit and predict over successive years. Meaning: We will fit the CS model on pitch data of 2012 and predict probCS for 2013, then we will fit the CS model on pitch data of 2013 and predict probCS for 2014 and so on. We do this to avoid data leakage, as our CS model should have no information about the seasons which are yet to happen! Hence, we will have the pitch data set along with all the features which we created in the CS model (strike zone features, category features, and kinematics features) along with the predicted probability probCS column.
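The fit-on-one-year, predict-the-next scheme can be sketched as a small loop. The helper below is hypothetical and takes the fitting and prediction routines as arguments; the stand-in "model" here is just the previous year's overall called-strike rate, not the CS model itself, and the labels are made up.

```python
def walk_forward(pitches_by_year, fit, predict):
    """Fit on each year and predict only the following year in the data."""
    years = sorted(pitches_by_year)
    predictions = {}
    for train_year, test_year in zip(years, years[1:]):
        model = fit(pitches_by_year[train_year])        # sees past data only
        predictions[test_year] = predict(model, pitches_by_year[test_year])
    return predictions

# Stand-in model: the share of called strikes in the training year.
def fit_rate(labels):
    return sum(labels) / len(labels)

def predict_rate(rate, labels):
    return [rate] * len(labels)

# Made-up called-strike labels per season.
labels_by_year = {2012: [1, 0, 1, 1], 2013: [0, 0, 1, 0], 2014: [1, 1]}
prob_cs_by_year = walk_forward(labels_by_year, fit_rate, predict_rate)
print(prob_cs_by_year)
```

The first season never receives predictions, because no earlier season exists to fit on; every other season is predicted by a model that has not seen it.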

In the figure above, the left plot shows all the pitches faced by Mike Trout and are color-coded based on the pitches actually being called a strike or not (label=1 or 0). The green box represents the average strike zone for Mike Trout.

The right plot shows the same data where the color gradient is determined by the probCS value for that point. Clearly, the yellow region which represents high probability values comes out approximating the actual strike zone for Mike.

Now, let's make a 3D plot where the pitch location (plate_x, plate_z) represents a plane and the height is a measure of the probCS for that point.

The figure above shows the probability distribution of pitches faced by Mike Trout. We clearly see that the yellow points (which were actually called strike) have high probCS values and show the shape of the strike zone (which is actually somewhat circular).

The concept of Extra Strike:

Till now, by constructing the CS model, we have the probCS value of each and every pitch thrown by the pitchers in the pitch data set. After going through all that effort in making the CS model and adding the extra probCS column in the pitch data set, now we will finally make use of probCS to construct two new features which will help us predict krate and bbrate (Remember that was our Goal in the beginning!). Let's construct a new feature called the "Extra Strike".

Q: What is Extra Strike?

A: We define Extra Strike to be simply: extraStrike= strike - probCS, where strike is the dependent variable which is label=1 when called strike and label=0 when not called strike.

Q: What does extraStrike signify?

A: Let us explain with the help of an example. Let's say we have probCS=0.7 for a particular pitch by pitcher A, i.e. our model predicts that this particular pitch by pitcher A has a 70 % chance of being called a strike.

Now, if strike=1, this means that pitcher A got a benefit of 1-0.7=0.3 credited to his name because the pitch was only 70 % "good" for being a strike. He got the extra 30 % bonus.

If strike=0, then extraStrike=0-0.7=-0.7, which means the pitcher did not get the 0.7 credit that the pitch deserved. He is debited 0.7 of a strike.
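The same arithmetic in code, as a minimal sketch (the function name is ours). Summing over several pitches shows how the credits and debits net out; the three probCS values are made up.

```python
def extra_strike(strike, prob_cs):
    """Credit (positive) or debit (negative) relative to the model's expectation."""
    return strike - prob_cs

credit = extra_strike(strike=1, prob_cs=0.7)   # called a strike: bonus of about 0.3
debit = extra_strike(strike=0, prob_cs=0.7)    # called a ball: debit of about 0.7

# Net extra strikes over three made-up pitches as (strike, probCS) pairs.
pitches = [(1, 0.7), (0, 0.7), (1, 0.25)]
net = sum(extra_strike(s, p) for s, p in pitches)
print(f"credit={credit:.2f}, debit={debit:.2f}, net={net:.2f}")
```

If probCS is well calibrated, the credits and debits cancel on average, so a pitcher whose total stays far from zero is getting more (or fewer) calls than the model expects.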

Two new features called xstr1S and xstr1B:

Now, from the pitch data set, we will select only those pitches where we have the special ball-strike score combination that we explained earlier (i.e. 0-2 case and 3-0 case). Why are we doing so? Remember, we are interested in predicting the pitcher's ability to make a strikeout (not strikes) and the pitcher's ability to prevent walks (not balls).

xstr1S:

0-2 is the case when there is 1 strike left to go for a strikeout (this will be a highly predictive feature for krate!). Hence, xstr1S is calculated by taking the 0-2 pitches in the pitch data set, grouping them by pitcher and game year, and summing the extraStrike feature explained above.

xstr1B:

3-0 is the case when there is 1 ball left to go for a walk (this will be a highly predictive feature for bbrate!). Hence, xstr1B is calculated by taking the 3-0 pitches in the pitch data set, grouping them by pitcher and game year, and summing the extraStrike feature explained above.

Finally, we have two new columns: xstr1S and xstr1B feature values for each pitcher for each game year (2013-2018).
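A minimal sketch of this aggregation in plain Python. The record layout and key names (pitcher, game_year, balls, strikes, strike, probCS) are assumptions for illustration, and the five pitches are made up.

```python
from collections import defaultdict

def aggregate_extra_strikes(pitches):
    """Sum extraStrike per (pitcher, game_year) on 0-2 and on 3-0 counts."""
    xstr1S = defaultdict(float)  # one strike away from a strikeout (0-2)
    xstr1B = defaultdict(float)  # one ball away from a walk (3-0)
    for pitch in pitches:
        key = (pitch["pitcher"], pitch["game_year"])
        extra = pitch["strike"] - pitch["probCS"]
        count = (pitch["balls"], pitch["strikes"])
        if count == (0, 2):
            xstr1S[key] += extra
        elif count == (3, 0):
            xstr1B[key] += extra
    return dict(xstr1S), dict(xstr1B)

# Made-up pitches; the 1-1 pitch is ignored by both features.
pitches = [
    {"pitcher": "A", "game_year": 2017, "balls": 0, "strikes": 2, "strike": 1, "probCS": 0.25},
    {"pitcher": "A", "game_year": 2017, "balls": 0, "strikes": 2, "strike": 0, "probCS": 0.5},
    {"pitcher": "A", "game_year": 2017, "balls": 3, "strikes": 0, "strike": 1, "probCS": 0.75},
    {"pitcher": "A", "game_year": 2017, "balls": 1, "strikes": 1, "strike": 1, "probCS": 0.5},
    {"pitcher": "B", "game_year": 2017, "balls": 3, "strikes": 0, "strike": 0, "probCS": 0.5},
]
xstr1S, xstr1B = aggregate_extra_strikes(pitches)
print(xstr1S, xstr1B)
```

Pitcher B has no 0-2 pitches here and therefore no xstr1S entry; when joining with the krate table, such pitchers need a default (for example 0) or have to be dropped.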

Remember, we also have krate and bbrate feature values for each pitcher for each year. We'll join the two datasets and finally obtain the following:

As we did earlier, we fit an OLS model between successive years' feature values for each pitcher.

i.e. fit y = krate vs. x=( krate_{previous_year}, xstr1S_{previous_year}/bf) as shown below:

Similarly, we fit an OLS model between successive years' feature values for each pitcher.

i.e. fit y = bbrate vs. x=( bbrate_{previous_year}, xstr1B_{previous_year}/bf) as shown below:
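A two-regressor fit of this kind, followed by an out-of-sample MSE, might look like the sketch below. It uses NumPy least squares as a stand-in for the OLS routine. All numbers are synthetic: the training targets are generated from arbitrarily chosen coefficients so the fit can be checked, and they are not the results reported in this post.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef

def predict_ols(coef, X):
    design = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return design @ coef

def mse(y_true, y_pred):
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

# Synthetic training rows: (krate_previous_year, xstr1S_previous_year / bf).
X_train = [[0.15, 0.01], [0.20, -0.02], [0.25, 0.00], [0.30, 0.03], [0.35, -0.01]]
# Targets built from arbitrary coefficients (0.05, 0.7, -0.5), illustration only.
y_train = [0.05 + 0.7 * k - 0.5 * x for k, x in X_train]

coef = fit_ols(X_train, y_train)

# Held-out rows play the role of the out-of-sample season.
X_test = [[0.22, 0.01], [0.30, -0.02]]
y_test = [0.20, 0.27]
test_mse = mse(y_test, predict_ols(coef, X_test))
print("coefficients:", coef, "out-of-sample MSE:", test_mse)
```

The same three functions apply unchanged to the bbrate fit by swapping in bbrate_{previous_year} and xstr1B_{previous_year}/bf as the two regressors.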

The results we obtain for the two OLS fits of krate and bbrate on 2013-2018 data are as follows:

Insights:

In the first OLS fit for krate, note that xstr1S/bf represents the extra strike credit per batter faced that the pitcher gained as a lucky bonus on 0-2 counts. A high value of xstr1S/bf means the pitcher was credited with strikeouts that he did not completely deserve. The more undeserved credit the pitcher receives, the luckier he was, and this luck counts against his true skill (which is represented by krate). Hence, the coefficient of xstr1S/bf is negative and almost of the same magnitude as the coefficient of krate_{previous_year}.

In the second OLS fit for bbrate, note that xstr1B/bf represents the extra strike credit per batter faced that the pitcher gained as a lucky bonus on 3-0 counts. A high value of xstr1B/bf means that pitches which should have been called balls (and, on 3-0, would have resulted in a walk) were called strikes instead, so the pitcher was credited with strikes he did not completely deserve. The more undeserved strike credit he receives, the luckier he was. Hence, the coefficient of xstr1B/bf is positive, and it contributes positively towards predicting the true walk rate of the pitcher.

To summarize, we first identified the skill-based features i.e. krate and bbrate which are a good measure to predict pitcher performances over the years. We calculated krate and bbrate using the game data set.

We then went through a lot of effort to develop the CS model which calculates the probability of calling a strike (probCS) for every pitch. From that, we calculated xstr1S and xstr1B which are used as features in our final OLS fit to predict krate and bbrate.

Results:

We achieve an MSE of 0.0019187 on the out-of-sample 2019 data set for predicting the Strikeout rate (krate)

and an MSE of 0.0004601 on the out-of-sample 2019 data set for predicting the Walk rate (bbrate).

This analysis can be used to predict pitcher performance in the next season using data available from previous seasons. It also gives a quantitative measure of which players are overvalued or undervalued, by observing their xstr1S and xstr1B values. This can help in determining player salaries or drafting a team. The analysis may also be useful in fantasy leagues, by showing which players have both luck and skill on their side.
