
April 2021

Sports Analytics: Soccer Data Science

Summary:

In this project, our aim is to predict the probability of the home team winning a game. We use data from the top five European leagues (EPL, La Liga, Serie A, Bundesliga, and Ligue 1) and construct several types of features from match data and in-game shot-by-shot data for games played during the 2014-2018 seasons.

Skill-based vs. Luck-based features:

Skill-based features capture a player's or a team's ability to perform and (positively) influence play. These attributes generally depend on the team's skill rather than on events beyond anyone's control. In simple words, a good score with respect to these features reflects that the team is genuinely good and deserves to be called the 'best team' in terms of these features.

Luck-based features, on the other hand, depend on external events that are beyond the control of the team's own play. They are affected by factors such as how the player's teammates perform or how (badly) the opponents perform. In simple words, a good score with respect to these features may mean that the player or team is somewhat overrated or was just 'lucky' to have a good match.

These features need to be constructed because they are not explicitly seen during the game. What we observe during a game is a complicated mix of skill-based features and luck-based features, along with several random events that are beyond prediction.

Skill vs. Luck-based features in Soccer:

Our goal is to predict the probability of the home team winning. In soccer, there are several features that can be constructed using game data. We are going to use goals, shots, and shots on target as our initial features in our prediction model. Then we will add more features to this model using in-game shot data.

Game Data:

This is what the game data looks like (shown below). For both the home and away teams, we have the goals scored at full-time, the number of shots taken, and the number of shots on target.

Game data

Construct average of features:

We are going to construct features that are based on goals, shots, and shots on target taken by the home and away teams. Let's take goals for instance:

  • GD_Home = G_Home - G_Away: Goal differential from the home team's perspective is simply goals scored by the home team minus goals scored by the away team at full-time.

  • GD_avg_Home: We take the average of the home team's goal differentials over all games played before the current game. E.g., if the home team is Manchester United, we average the goal differentials (from Man Utd's perspective) of all games Man Utd played before the current game (both home and away games!).

Similarly for the Away team:

  • GD_Away = G_Away - G_Home: Goal differential from the away team's perspective is simply goals scored by the away team minus goals scored by the home team at full-time.

  • GD_avg_Away: We take the average of GD_Away of the away team for all games that are played before the current game.

GD_avg (call it 'average goal differential') measures by how much the team outscores its opponents on average. If GD_avg > 0, the team scores more goals than it concedes on average. If GD_avg < 0, the team usually concedes more than it scores. A clear example would be a match between Team A with, say, GD_avg = 2.1 and Team B with GD_avg = -0.9. This means that, on average, Team A outscores its opponents by about 2 goals and Team B is outscored by its opponents by about 1 goal. When the GD_avg values have opposite signs like this, Team A would likely win the match against Team B.

In case both the GD_avgs are of the same sign, the magnitudes of the GD_avg would matter (More on this later).

Similarly, we calculate the 'average shots' S_avg and the 'average shots on target' ST_avg.
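The averaging described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the record layout (keys home, away, g_home, g_away), the function name, and the choice to return None for a team with no earlier games are all assumptions. The key point is that each game's features are computed only from games played before it, so the current result never leaks into its own features.

```python
from collections import defaultdict


def add_pre_game_averages(games):
    """Attach pre-game average goal differentials to each game.

    `games` must be in chronological order. Each game is a dict with the
    (assumed) keys: home, away, g_home, g_away.
    """
    totals = defaultdict(float)  # running sum of goal differentials per team
    counts = defaultdict(int)    # number of games played so far per team
    out = []
    for g in games:
        home, away = g["home"], g["away"]
        # Averages use prior games only; None when the team has no history yet.
        gd_avg_home = totals[home] / counts[home] if counts[home] else None
        gd_avg_away = totals[away] / counts[away] if counts[away] else None
        out.append({**g, "gd_avg_home": gd_avg_home, "gd_avg_away": gd_avg_away})
        # Update the running totals only after the features were recorded.
        gd_home = g["g_home"] - g["g_away"]
        totals[home] += gd_home
        totals[away] -= gd_home  # GD_Away = -GD_Home
        counts[home] += 1
        counts[away] += 1
    return out


# Toy fixture list (made-up teams and scores).
games = [
    {"home": "A", "away": "B", "g_home": 2, "g_away": 0},
    {"home": "B", "away": "C", "g_home": 1, "g_away": 1},
    {"home": "A", "away": "C", "g_home": 0, "g_away": 1},
    {"home": "C", "away": "A", "g_home": 3, "g_away": 1},
]
features = add_pre_game_averages(games)
```

The same loop would work for S_avg and ST_avg by swapping the goal columns for shots or shots on target.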

Model

We will use a generalized linear model (logit model) to predict the probability of the home team winning. Our dependent variable 'Win_Home' is 1 when the home team wins and 0 when the game is a draw or the home team loses. We will use the features constructed above, namely GD_avg, S_avg, and ST_avg, for both the home and away teams.
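To make the model form concrete, here is a small sketch of how a logit model turns features into a home-win probability, and how Win_Home is coded. The coefficient values, the intercept, and the feature names used as dictionary keys are hypothetical placeholders chosen for illustration; they are not the fitted values from this project.

```python
import math


def win_home(g_home, g_away):
    """Dependent variable: 1 if the home team wins, 0 for a draw or a loss."""
    return 1 if g_home > g_away else 0


def home_win_probability(features, coefficients, intercept):
    """Logit model: p = 1 / (1 + exp(-(intercept + sum_i b_i * x_i)))."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for illustration only (NOT the fitted values).
coefficients = {
    "GD_avg_Home": 0.8,
    "GD_avg_Away": -0.8,
    "S_avg_Home": 0.05,
    "S_avg_Away": -0.05,
}
intercept = -0.3

# Hypothetical match: a strong home side against a weaker away side.
match = {"GD_avg_Home": 2.1, "GD_avg_Away": -0.9, "S_avg_Home": 15.0, "S_avg_Away": 10.0}
p = home_win_probability(match, coefficients, intercept)
```

In practice the coefficients would be estimated from the training seasons by a statistics package rather than set by hand.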

Our training data consists of seasons 2014-2017, and the test set is the 2018 season. We will try to minimize the Brier score on the test set.
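For reference, the Brier score for a binary outcome is the mean squared difference between the predicted probability and the actual 0/1 outcome, so lower is better. A minimal sketch follows; the function name and the toy numbers are illustrative only.

```python
def brier_score(predicted, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(predicted) != len(outcomes) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(predicted, outcomes)) / len(predicted)


# Toy example: three games with predicted home-win probabilities.
toy_score = brier_score([0.7, 0.2, 0.9], [1, 0, 0])

# A forecaster that always says 0.5 scores exactly 0.25, a handy baseline.
coin_flip_score = brier_score([0.5, 0.5], [1, 0])
```

A model is only informative on this metric if it scores below simple baselines such as the constant 0.5 forecast.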

These are the coefficients we get:

Goal difference shots

As we can see above, Shots on target are not statistically significant (look at the p-values).

Insights: Let us try to understand why shots are significant and why shots on target are not. To take a shot, the team needs to put together some good play (passing, running, dribbling, keeping possession, etc.) in order to get closer to the goal and attempt a shot. Thus, there is a lot of skill involved in taking shots, and a good team will likely create more opportunities to get closer to the goal and attempt a shot. This helps explain why S_avg (average shots) is significant in the logit model.

What happens post-contact, after the shot is taken, depends more on external factors than on the player who took the shot. It's possible that even after making a great play and getting closer to the goal, the shot is simply blocked by a defender or narrowly misses the goalpost. Hence, whether a shot ends up on target involves a large luck component and depends heavily on the post-contact dynamics of the ball, which the shot-taking player does not completely control.

Now let's remove ST_avg and check the coefficients of the logit model again:

Goal difference shots

As we can see above, all the features are statistically significant (the reported p-values are effectively 0 for both GD_avg and S_avg).

Brier score on test set: 0.2166244

Shots Data:

Now we will construct more features to add to our model by using the shots data. The shots data contains detailed information for every shot taken during the game. This is what it looks like:

Shots data features

Features Explained:

Let us go through a brief description of all the features that we will use from the shots data:

  • shot_id: Unique id for every shot taken in the data set.

  • game_id: Current game id.

  • team_id: Id of the team to which the shot-taking player belongs.

  • result: Result of what happened after taking the shot. The categories are:

    • MissedShots

    • BlockedShot

    • SavedShot

    • ShotOnPost

    • Goal

  • x,y: Position co-ordinates of the player on the pitch while taking the shot.

  • situation: The game play situation in which the shot was taken. These fall into the categories:

    • OpenPlay: Shot taken during open play

    • Penalty: The shot was a penalty kick

    • SetPiece: The shot was taken on a set piece.

    • DirectFreekick: The shot was a direct free kick

    • FromCorner: The shot was taken after the ball came from a corner.

  • shot_type: The body part used by the player to take the shot.

    • LeftFoot

    • RightFoot

    • Head

    • OtherBodyPart

  • isHome: Indicates whether the team is playing at its home field or not.

  • Behind: A constructed feature that indicates whether the team was behind when the player scored the goal. E.g., Real Madrid are 2-0 down at half time. Say Ronaldo scores and the score is 2-1. In this case, Behind=1 because the player scored when the team was behind.
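As an illustration of how a flag like Behind could be derived, the sketch below walks through one game's shots and tracks the running score. It assumes the shots of a single game are already in chronological order (the column list above does not show a time field) and counts only rows whose result is Goal, so own goals are ignored; the function name is hypothetical. It also does not check the extra 'team eventually didn't lose' condition used later for comeback goals.

```python
def add_behind_flag(shots):
    """Flag goals scored while the scoring team was trailing.

    `shots` holds one game's shots in chronological order (an assumption);
    each shot is a dict with at least team_id and result.
    """
    score = {}  # running goals per team_id
    out = []
    for shot in shots:
        team = shot["team_id"]
        score.setdefault(team, 0)
        # Goals of the other team so far (0 if it has not taken a shot yet).
        opponent_goals = max((g for t, g in score.items() if t != team), default=0)
        is_goal = shot["result"] == "Goal"
        behind = 1 if is_goal and score[team] < opponent_goals else 0
        out.append({**shot, "behind": behind})
        if is_goal:
            score[team] += 1
    return out


# Toy game: team 2 goes 2-0 up, then team 1 scores three times.
shots = [
    {"team_id": 2, "result": "Goal"},
    {"team_id": 2, "result": "Goal"},
    {"team_id": 1, "result": "SavedShot"},
    {"team_id": 1, "result": "Goal"},  # 0-2 -> 1-2, scored while behind
    {"team_id": 1, "result": "Goal"},  # 1-2 -> 2-2, scored while behind
    {"team_id": 1, "result": "Goal"},  # 2-2 -> 3-2, level, so not behind
]
flags = [s["behind"] for s in add_behind_flag(shots)]
```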

Shots Data Visualization:

Let's do some exploratory data analysis and have a look at various statistics, top players, etc.

Top 20 goals and assists
Top 20 shots and shots on post
Top 20 header goals and comeback goals

Most of the shots are taken near the penalty box, and the closer the player gets to the goal, the higher the chances of converting the shot into a goal. The points on the left side of the field are own goals. This can be seen below:

All shots on football / soccer field distribution

You can also view the shot distribution for individual players using the interactive plot shown below. Select the player you want to view by entering his name in the 'player' box. Hover over the soccer field to get detailed information about the shot taken at that position (x,y). Then you can highlight the shots by selecting an option from the 'Result' section, e.g. highlight all the shots which resulted in a goal. You can also select the situation in which the shot was taken, e.g. select the shots which were taken during a SetPiece.

The plot also shows the form of the player in terms of goals scored.

Moreover, you can select more than one player at a time for comparison. E.g. Select Ronaldo and Messi and then compare their form, shot results, and goal situation. Use the 'Highlight Player' option to view the shots distribution of a particular player whenever multiple players are selected.

Note: Selecting many players may take time to load. It's recommended to select two players or at most three when comparing.

Construct average differentials features from Shots data:

We are going to construct features that we'll call average differentials. The process is similar to what we already did when constructing average goal differentials from the game data. Let's take the 'shots on post' feature to explain the process:

  • SP: For each team and each game, count the shots on post that occurred during the game.

  • SPD: Take the difference of SPs from the home and away teams' perspectives during the game. Thus, we will have SPD_Home and SPD_Away.

  • SPD_avg_Home: We take the average of the home team's SPD over all games played before the current game, just like we did previously with the game data. Similarly, we have SPD_avg_Away.

  • SPDD = SPD_avg_Home - SPD_avg_Away. Meaning: SPD_avg_Home measures how many more shots on post the home team gets, on average, than its opponents. SPDD is the difference between the two teams' averages. E.g., take the El Clasico match shown below. It tells us that Real Madrid on average outdid their opponents by +0.2 in terms of shots on post, while Barcelona outdid their opponents by +0.57. Hence, when the two teams face off, the difference between their average margins over opponents is 0.2 - 0.57 = -0.37, i.e., SPDD.

Shots on post Real Madrid
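The per-game step (SP and SPD) and the final differential (SPDD) can be sketched as below. The function names and the toy shot records are assumptions for illustration; the 0.2 and 0.57 values are the El Clasico averages quoted above.

```python
from collections import Counter


def shots_on_post_differential(shots, home_id, away_id):
    """Return (SPD_Home, SPD_Away) for one game's shot records."""
    counts = Counter(s["team_id"] for s in shots if s["result"] == "ShotOnPost")
    spd_home = counts[home_id] - counts[away_id]
    return spd_home, -spd_home  # the away differential is the mirror image


def differential_of_averages(avg_home, avg_away):
    """featureDD = feature_avg_Home - feature_avg_Away (e.g. SPDD)."""
    return avg_home - avg_away


# Toy game: home team (id 1) hits the post twice, away team (id 2) once.
toy_shots = [
    {"team_id": 1, "result": "ShotOnPost"},
    {"team_id": 2, "result": "ShotOnPost"},
    {"team_id": 1, "result": "Goal"},
    {"team_id": 1, "result": "ShotOnPost"},
]
spd_home, spd_away = shots_on_post_differential(toy_shots, home_id=1, away_id=2)

# El Clasico example from the text: SPD_avg_Home = 0.2, SPD_avg_Away = 0.57.
spdd = differential_of_averages(0.2, 0.57)
```

The pre-game averaging between these two steps is the same running average used for the goal differentials.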

Similarly, following the exact same process as described above, we will make a featureDD for own goals, header goals, penalty goals, and comeback goals. Comeback goals are goals scored when the team was behind and eventually didn't lose the match, i.e., the team fought back and the match resulted in a draw or a win for the team.

Thus, we have the following features which we will use:

  • GDD: Difference between GD_avg's of the two teams (Already have it from game data)

  • SDD: Shots (Already have it from game data)

  • OGDD: Own goals

  • SPDD: Shots on posts

  • HGDD: Header goals

  • CGDD: Come back goals

  • PGDD: Penalty goals

Now, how do we decide which of header goals, own goals, penalty goals, comeback goals, etc. are significant in predicting the outcome of the game? To do this, we have to understand how much of a skill component each feature has. We already know from our analysis of the game data that goals and shots taken are statistically significant. Hence, we will now try to figure out how significant OGDD, SPDD, HGDD, CGDD, and PGDD are in predicting the outcome of the game.

To do this, we will use market probabilities for the games played. Let pH denote the probability of the home team winning according to the market (which can be calculated from betting odds).
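The text does not say how the market probability was derived from the odds. One common approach, shown here purely as an assumption, is to invert decimal odds and rescale the three outcomes so they sum to 1, which removes the bookmaker's margin proportionally. The function name and the example odds are made up.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Convert decimal odds into (pH, pD, pA) that sum to 1.

    Assumes decimal odds and proportional removal of the bookmaker margin;
    other normalization schemes exist.
    """
    raw = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    total = sum(raw)  # exceeds 1 when the bookmaker builds in a margin
    return tuple(r / total for r in raw)


# Hypothetical odds for a home favourite.
p_home, p_draw, p_away = implied_probabilities(2.0, 3.5, 4.0)
```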

Logit function: logit(p) = ln(p / (1 - p))

logit(p) maps a probability in (0, 1) onto the whole real line, which puts it on the same kind of unbounded scale as goal differentials and the other features. We then fit a linear regression of logit(p) on GDD and the other features that we constructed to figure out which features are significant.
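A small sketch of the transformation and of a least-squares fit on the transformed values is given below. For brevity it fits a single feature with the closed-form formulas, whereas the regression in this project uses several features at once; the GDD values and market probabilities are made-up toy numbers.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data: GDD per game and the market's home-win probability (made up).
gdd = [-1.0, -0.5, 0.0, 0.5, 1.0]
p_market = [0.25, 0.35, 0.45, 0.55, 0.65]
slope, intercept = fit_line(gdd, [logit(p) for p in p_market])
```

A positive slope here would mean the market rates the home team more highly as its GDD grows, which is the kind of relationship the t-values in the regression are testing for.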

Linear Regression Fit of y= logit(pH) vs x= (GDD,SDD,SPDD,OGDD,HGDD,CGDD,PGDD).

We obtained the following coefficients:

Linear regression coefficients

Explanation:

Explaining which features are important using t-values (and p-values) of their coefficients from the fit:

GDD and SDD:

Already explained during analysis of game data.

SPDD:

Shots on post represent really good quality shots, which are essentially almost goals. Since GDD is significant, it makes sense that SPDD is significant too.

OGDD:

Own goals partly reflect the skill of the opposition in creating an opportunity that led to an own goal, but they also involve a lot of bad luck for the team that scored the own goal. Own goals are usually due to misplaced passes, defending errors, deflections, rebounds, erroneous clearances, etc. All these events have a lot of randomness in them and are thus mostly luck-based. Hence, in a way, the team that scored an own goal didn't actually deserve to concede one. Because those own goals dragged down the team's goal averages, the coefficient of OGDD is positive, compensating for the bad luck the team had in mistakenly scoring own goals.

HGDD:

Header goals are usually scored close to the goal, in the inner box area inside the penalty box. Hence, they do reflect some level of skill by the team in getting that close to the goal. However, header goals depend on a lot of external factors, just like shots on target. First, it's more difficult to score with the head than with a normal shot. Second, players usually do not have much control over the ball post-contact, as it's already difficult enough to make a proper connection and head the ball in the penalty box (because it may be crowded, defenders are trying to push or clear the ball, etc.). Also, goals scored with the head are already counted in GDD. Hence, header goals do not have a statistically significant coefficient, as they contain more luck than skill.

CGDD:

Comeback goals somewhat represent the team's ability to 'fight back' when they are down. These are the goals scored by the team when they were behind and eventually didn't lose the match, hence the 'comeback'. Making a comeback does involve a lot of skill. However, the coefficient is slightly negative. That's because, although the team made a comeback in that particular game, they were behind nevertheless. It's possible that, more often than not, a comeback won't happen, because it depends on a lot of (luck-based) factors too. Hence, although we counted only the games where the comeback was successful, the feature also suggests that there may be games where the comeback wasn't successful and the team conceded goals and fell behind, thus (slightly) negatively contributing to their chances of winning.

PGDD:

Penalty goals do represent some level of skill, since the team has to be able to get close to the goal, into the penalty box, and win a penalty kick. However, being awarded a penalty kick is a big bonus for the team. In our dataset, about 75.76% of penalties are converted to goals, but had it been a typical shot from the spot where the foul was committed, the probability of that shot being a goal would almost certainly be well below 75%. Hence, the team gets an extra lucky bonus from the penalty kick, and PGDD is only slightly significant.
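A conversion rate like the one quoted above can be computed directly from the situation and result columns of the shots data. The sketch below uses a handful of made-up shot records, so its numbers are not the dataset's; the function name is hypothetical.

```python
def conversion_rate(shots, situation):
    """Share of shots in the given situation whose result is 'Goal'."""
    subset = [s for s in shots if s["situation"] == situation]
    if not subset:
        return None  # no shots of this kind, so the rate is undefined
    goals = sum(1 for s in subset if s["result"] == "Goal")
    return goals / len(subset)


# Made-up records: 4 penalties (3 scored) and 5 open-play shots (1 scored).
toy_shots = (
    [{"situation": "Penalty", "result": "Goal"}] * 3
    + [{"situation": "Penalty", "result": "SavedShot"}]
    + [{"situation": "OpenPlay", "result": "Goal"}]
    + [{"situation": "OpenPlay", "result": "MissedShots"}] * 4
)
penalty_rate = conversion_rate(toy_shots, "Penalty")
open_play_rate = conversion_rate(toy_shots, "OpenPlay")
```

Comparing the two rates side by side is one simple way to see how much of a bonus a penalty is relative to an ordinary shot.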

Hence, we use all the features described above (except HGDD) in our final logit model to predict the probability of the home team winning.

Logit Fit of y= Win_Home vs x= (GDD,SDD,SPDD,OGDD,CGDD,PGDD).

Final logit model features

Results:

We achieve a Brier score of 0.2159003 on the out-of-sample test set (the 2018 season) for predicting the probability of the home team winning. Note that this is an improvement over the Brier score of 0.2166244 obtained using just the goals- and shots-based features from the game data. Thus, in-game shots data helps in predicting the outcome of the game: the custom-made, skill-based features constructed from it improve the performance of the logit model.
