#15: Getting Started with Free StatsBomb Event Data – xG Shot Map Tutorial

Introduction

After attending StatsBomb’s Introduction to Football Analytics last week, I was inspired to take another look at the free event data that they offer. One of the main obstacles to breaking into the football analytics industry is getting data to play around with and show what you can do, which is why StatsBomb’s commitment to offering these data samples for free is so valuable and should be taken advantage of! There are endless possibilities for insights and visualisations using the event data, limited only by your creativity.

Support and tutorials for using the data in R are also freely available, including StatsBomb’s own StatsBombR package and FCrStats’s Twitter and GitHub, which provide functions for creating custom pitches for visualisations. Did I mention they were both free?

https://github.com/statsbomb/StatsBombR & @StatsBomb
https://github.com/FCrSTATS & @FC_rstats

It can be intimidating to start to work with complex data like this, so I will go through step by step and create a version of a popular match visualisation: an Expected Goals Shot Map.

Since the FIFA Women’s World Cup is currently taking place and the StatsBombR package is continually being updated with new games as they are played, I thought I’d use the recent England v Argentina game as an example.

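Install R and RStudio

For those completely new to R, you can download the latest version of R here:

https://cran.r-project.org/

RStudio, the editor used in this tutorial, is a separate download from Posit. Once both are installed, you can install packages from CRAN using:

install.packages("…package name here…")

StatsBombR and SBpitch are hosted on GitHub rather than CRAN, so install those from the repositories linked above, for example with devtools::install_github().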

Load Libraries

To start we will load the relevant libraries.

library("StatsBombR") # Event data
library("SBpitch")    # Custom functions for creating pitches
library("ggplot2")    # Building visualisations 
library("tidyverse")  # Data manipulation

Create Blank Pitch

Using FCrStats’s SBpitch package, you can create a pitch for custom visualisations with the create_Pitch() function. You can specify the colours and which lines you want to see. For the xG Shot Map, we will use the whole pitch.

# Create a blank pitch using create_Pitch()
blank_pitch <- create_Pitch(
  goaltype = "box",
  grass_colour = "#202020", 
  line_colour =  "#797876", 
  background_colour = "#202020", 
  goal_colour = "#131313"
)

blank_pitch

Get StatsBomb Data

Using the StatsBombR package, getting access to the free events data is as simple as running the StatsBombFreeEvents() function as below and storing it in your environment.

statsbomb_events <- StatsBombFreeEvents()

## [1] "Whilst we are keen to share data and facilitate research, we also urge you to be responsible with the data. Please register your details on https://www.statsbomb.com/resource-centre and read our User Agreement carefully."

Get Match Info

There are over 100 variables for each event of each match, so we want to narrow the data set down to a single match: England v Argentina at the FIFA Women’s World Cup. We are also only interested in shots, so we will only keep events of that type.

I have also included the colours for the respective teams to use later on.

The x,y location of each event is stored in a single variable as an array.
Using the separate() function in the Tidyverse we can extract these and create new variables called “location_x” and “location_y”.
Use as.numeric() to make the new location variables numeric so we can plot them later.

event_type <- "Shot"
team1_colour <- "red4"
team2_colour <- "lightblue"

# Narrow down to a specific match: England Women's v Argentina Women's
match <- statsbomb_events %>%
  filter(# FIFA Women's World Cup Competition ID
           competition_id == 72 & 
           # England Women's v Argentina Women's Match ID
           match_id == 22962 & 
           # Only keep events that are shots
           type.name == event_type ) %>%
  # X,Y locations are stored in a single array column, separate() into two columns.
  # Splitting on anything that is not a digit or decimal point keeps fractional coordinates intact.
  separate(col = location, into = c(NA, "location_x","location_y"),
           sep = "[^0-9.]+", extra = "drop") %>%
  mutate(location_x = as.numeric(location_x),
         location_y = 80 - as.numeric(location_y))
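As a cross-check on the location split, here is a minimal base R sketch of the same idea on made-up data. The `toy_shots` data frame and its values are invented for illustration, and the sketch assumes `location` is a list column holding one numeric c(x, y) vector per event.

```r
# Toy stand-in for the events data: `location` is a list column of c(x, y)
toy_shots <- data.frame(team.name = c("Team A", "Team B"))
toy_shots$location <- list(c(108.5, 36), c(95, 52.3))

# Pull the first and second element out of every location vector
toy_shots$location_x <- vapply(toy_shots$location, function(loc) loc[1], numeric(1))

# Mirror the 80 - y flip used in the pipeline above
toy_shots$location_y <- 80 - vapply(toy_shots$location, function(loc) loc[2], numeric(1))

toy_shots[, c("team.name", "location_x", "location_y")]
```

Working on the numeric vectors directly avoids any text parsing, which can be handy if the string split ever gives unexpected values.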

Create Goal and xG Indicators

Since we are interested in the actual goals and expected goals of each shot, we can create a goal indicator variable and respective expected goal variables for the shots of each team.

match <- match %>%
  mutate(# Create a goal indicator
         Goal = ifelse(shot.outcome.name == "Goal","1","0"),
         # Create England goal indicator and xG
         team1_Goal = ifelse((shot.outcome.name == "Goal" & team.name == unique(match$team.name)[1]),"1","0"),
         team1_xG = ifelse(team.name == unique(match$team.name)[1],shot.statsbomb_xg,NA),
         # Create Argentina goal indicator and xG
         team2_Goal = ifelse((shot.outcome.name == "Goal" & team.name == unique(match$team.name)[2]),"1","0"),
         team2_xG = ifelse(team.name == unique(match$team.name)[2],shot.statsbomb_xg,NA)
)
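A quick way to sanity-check indicators like these is to total them per team. The sketch below uses a made-up `toy_match` data frame (the teams, outcomes and xG values are invented) and a numeric goal flag, which is my own variation so that it can be summed without converting from text.

```r
# Made-up shots: team, StatsBomb-style outcome name and xG value
toy_match <- data.frame(
  team.name         = c("Team A", "Team A", "Team A", "Team B"),
  shot.outcome.name = c("Goal", "Saved", "Off T", "Blocked"),
  shot.statsbomb_xg = c(0.45, 0.10, 0.05, 0.03)
)

# Numeric goal flag (1 = goal, 0 = no goal) so it can be summed directly
toy_match$goal_flag <- as.integer(toy_match$shot.outcome.name == "Goal")

# Total goals and total xG per team
team_summary <- aggregate(
  toy_match[, c("goal_flag", "shot.statsbomb_xg")],
  by = list(team.name = toy_match$team.name),
  FUN = sum
)
team_summary
```

If the totals here disagree with the final score of the real match, that is usually a sign the filter or the indicator needs another look.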

Plot Shot Locations

Okay, lots of preparation done so far. Let’s plot some shots!

ggplot2 builds plots in layers from the ground up. Remember the blank_pitch we made earlier? We use that as a base and add the shot locations on top, using geom_point() to draw each shot as a point.

# Plotting raw shot locations
blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y), colour = "white")

Oops, looks like all the shots happened at the same end, regardless of team. We need to reverse the shot locations of one team. Since we know from create_Pitch() that the pitch dimensions are 120 x 80, we can subtract that team’s coordinates from those values.

# Looks like all shots are at the same end, need to reverse the locations of one team
match <- match %>%
  mutate(location_x = ifelse(team.name == unique(match$team.name)[1],
                             120 - location_x,
                             location_x),
         location_y = ifelse(team.name == unique(match$team.name)[1],
                             80 - location_y,
                             location_y)
         )
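The flip itself is just a reflection through the centre of the pitch. As a small sketch, here it is wrapped in a hypothetical helper, `flip_location()`, which is my own name and not part of StatsBombR or SBpitch. It assumes the 120 x 80 pitch dimensions used above.

```r
# Hypothetical helper: mirror points through the centre of a 120 x 80 pitch
flip_location <- function(x, y, pitch_length = 120, pitch_width = 80) {
  list(x = pitch_length - x, y = pitch_width - y)
}

# Two made-up shot locations
flipped <- flip_location(x = c(108, 95), y = c(36, 50))
flipped$x
flipped$y

# Flipping twice returns the original coordinates
restored <- flip_location(flipped$x, flipped$y)
```

Because applying the flip twice gets you back to where you started, it is worth making sure this step only runs once per session on the same data.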

Plot Respective Coloured Locations

Let’s see if it worked!

# Try again, with different colours for each team
blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y, colour = team.name)) +
  theme(legend.position="none") + 
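  # Name the colours by team so team 1 always gets team1_colour (unnamed values follow alphabetical order)
  scale_colour_manual(values = setNames(c(team1_colour, team2_colour), unique(match$team.name)))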

Oof, looks like England had lots of shots and denied Argentina anything significant.

Highlight Goals

We have shot locations, but it would be nice to see which shots were goals. Using the goal indicator we created earlier, we can give goals a different shape (triangle) to differentiate them.

# Now highlight the goals
blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y, colour = team.name, shape = Goal)) +
  theme(legend.position="none") + 
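  # Name the colours by team so team 1 always gets team1_colour (unnamed values follow alphabetical order)
  scale_colour_manual(values = setNames(c(team1_colour, team2_colour), unique(match$team.name)))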

Plot Size of xG

Looks like England scored from their shot closest to the goal in the Argentina 6-yard box. Let’s see how likely they were to score by using the size of the points to reflect the expected goals.

# Now use size to reflect shot xG
blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y, colour = team.name, shape = Goal, size = shot.statsbomb_xg)) +
  theme(legend.position="none") + 
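  # Name the colours by team so team 1 always gets team1_colour (unnamed values follow alphabetical order)
  scale_colour_manual(values = setNames(c(team1_colour, team2_colour), unique(match$team.name)))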

Looks like England scored with their best chance and could potentially have scored a few more considering their volume of relatively good shots. This is the skeleton of an Expected Goals Shot Map. We can add annotations to make the final plot more presentable and to quantify each team’s expected goals versus actual goals.

Add Titles and Annotations

blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y, colour = team.name, shape = Goal, size = shot.statsbomb_xg)) +
  theme(legend.position="none") + 
  scale_colour_manual(values = setNames(c(team1_colour, team2_colour), unique(match$team.name))) + 
  # England's xG
  geom_text(aes(x = 2, y=78,label = unique(match$team.name)[1]), hjust=0, vjust=0.5, size = 5, colour = team1_colour) +
  geom_text(aes(x = 2, y=75,label = paste0("Expected Goals (xG): ",round(sum(match$team1_xG, na.rm = TRUE),2))), hjust=0, vjust=0.5, size = 3, colour = team1_colour) + 
  geom_text(aes(x = 2, y=73,label = paste0("Actual Goals: ",round(sum(as.numeric(match$team1_Goal), na.rm = TRUE),0))), hjust=0, vjust=0.5, size = 3, colour = team1_colour) + 
  geom_text(aes(x = 2, y=71,label = paste0("xG Difference: ",round(sum(as.numeric(match$team1_Goal), na.rm = TRUE),0)-round(sum(match$team1_xG, na.rm = TRUE),2))), hjust=0, vjust=0.5, size = 3, colour = team1_colour) +
  # Argentina's xG
  geom_text(aes(x = 80, y=78,label = unique(match$team.name)[2]), hjust=0, vjust=0.5, size = 5, colour = team2_colour) +
  geom_text(aes(x = 80, y=75,label = paste0("Expected Goals (xG): ",round(sum(match$team2_xG, na.rm = TRUE),2))), hjust=0, vjust=0.5, size = 3, colour = team2_colour) + 
  geom_text(aes(x = 80, y=73,label = paste0("Actual Goals: ",round(sum(as.numeric(match$team2_Goal), na.rm = TRUE),0))), hjust=0, vjust=0.5, size = 3, colour = team2_colour) + 
  geom_text(aes(x = 80, y=71,label = paste0("xG Difference: ",round(sum(as.numeric(match$team2_Goal), na.rm = TRUE),0)-round(sum(match$team2_xG, na.rm = TRUE),2))), hjust=0, vjust=0.5, size = 3, colour = team2_colour)
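The annotation text is computed twice, once per team, so one option is to pull the string building into a small function. This is only a sketch: `xg_labels()` is a hypothetical helper of my own, and the input values are made up.

```r
# Hypothetical helper: build the three annotation strings for one team
xg_labels <- function(xg, goals) {
  total_xg    <- round(sum(xg, na.rm = TRUE), 2)
  total_goals <- sum(goals, na.rm = TRUE)
  c(
    paste0("Expected Goals (xG): ", total_xg),
    paste0("Actual Goals: ", total_goals),
    paste0("xG Difference: ", total_goals - total_xg)
  )
}

# Made-up shots for one team; NA marks rows that belong to the other team
team_labels <- xg_labels(xg = c(0.45, 0.10, NA, 0.05), goals = c(1, 0, NA, 0))
team_labels
```

Each element of the result could then be passed as the label of one geom_text() layer, which keeps the plotting code shorter when you switch to another match.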

That looks a little better: at least we now know the score and how each team did compared to their expected goals. After creating a blank pitch, we only need to add layers to get a visualisation of the information we want, which is incredibly powerful. To get the visualisation for another match, simply change the match_id (and team colours) above!

The only packages used to create this are those loaded above, with the free events data provided by StatsBomb and extra functions/tutorials by FCrStats.

Again, you can find those two here:
@StatsBomb
@FC_rstats

Hopefully this will help some people get started and overcome any initial intimidation. I will look to provide more of these step-by-step guides going forward as I get to play around with the data more.

@TLMAnalytics

#11 Normalizing xG Chain – Are all actions created equal?

In this post I will be taking a look at the concepts of xG Chain (xGC) and xG Buildup (xGB), why they are useful and how we can develop these concepts to get even more use from them. Both of these concepts further the expected goals (xG) and expected assists (xA) metrics, allowing the contribution of players not directly involved in a goal to be accounted for.

xG is a likelihood attached to each shot that estimates the chance of that shot resulting in a goal. This metric is only really useful for players who take lots of shots, such as forwards.

xA is attached to the pass that immediately precedes a shot and measures the likelihood that the pass will become an assist from the following shot. This metric aims to widen the influence of xG by attributing play to the creative players who set up the shots that xG describes.

Both of these are intuitive and simple concepts that provide an estimate for specific actions on the pitch. Since goals and assists are key events in a match, it makes sense to focus analysis on them since they are incredibly predictive. xG and xA are very limited, however: they only care about a shot and the preceding pass, so they don’t tell us anything about the play leading up to that point. It turns out that the majority of football isn’t just taking turns taking shots, so it would be nice to be able to do something like xG/xA for other actions on the pitch.

Just as xA extends xG by attributing the result to the preceding pass, xG Chain extends xA by doing the same for the whole preceding possession chain. In this way you can widen the influence of xG to all players that are involved in the preceding possession. Where xG mainly highlights forwards and xA mainly highlights creative players, xG Chain aims to highlight players that make contributions to the possessions that end up with a shot. These could include your ‘assisting the assister’ players, your deep-lying playmakers like Jorginho who get criticised for a lack of assists, or your progressive passing defenders who wouldn’t usually get the credit they potentially deserve for starting effective possessions.

Calculating xG Chain: https://statsbomb.com/2018/08/introducing-xgchain-and-xgbuildup/

  • Find all possessions each player is involved in
  • Find all shots within those possessions
  • Sum the xG of those shots (usually take the highest xG per possession)
  • Assign that sum to each player, however involved they are
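The steps above can be sketched in a few lines of base R. Everything here is made up for illustration: the `events` data frame, its column names and the xG values are my own assumptions rather than StatsBomb’s data format, and the sketch takes the "highest xG per possession" option.

```r
# Made-up events: one row per on-ball action, xG filled in for shots only
events <- data.frame(
  possession = c(1, 1, 1, 2, 2, 3),
  player     = c("A", "B", "C", "A", "D", "B"),
  type       = c("Pass", "Pass", "Shot", "Pass", "Shot", "Pass"),
  xg         = c(NA, NA, 0.30, NA, 0.10, NA)
)

# Find the shots and value each possession by its highest-xG shot
shots <- events[events$type == "Shot", ]
possession_xg <- vapply(split(shots$xg, shots$possession), max, numeric(1))

# Find the possessions each player is involved in (counted once per possession)
involvement <- unique(events[, c("possession", "player")])
involvement$xgc <- as.numeric(possession_xg[as.character(involvement$possession)])
involvement$xgc[is.na(involvement$xgc)] <- 0  # possessions without a shot are worth 0

# Assign the full possession value to every involved player and sum per player
xg_chain <- tapply(involvement$xgc, involvement$player, sum)
xg_chain
```

Player A passes in both shot-ending possessions so collects both values, while player B’s pass in possession 3 adds nothing because that possession never produced a shot.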

You can normalise xGC per 90 minutes to see contributions per match. However, this still highlights forwards and creative players: if they are the players getting the shots, they get all the credit for their own shots plus any other possession chains they are involved in.
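The per-90 normalisation is a simple rescaling by minutes played. A minimal sketch with made-up season totals (the `players` data frame and its numbers are invented):

```r
# Made-up season totals: xG Chain and minutes played
players <- data.frame(
  player  = c("A", "B"),
  xgc     = c(12.0, 6.0),
  minutes = c(2700, 900)
)

# Scale each total to a per-90-minutes rate
players$xgc_per90 <- players$xgc / players$minutes * 90
players
```

Player A has twice the raw xGC, but player B comes out ahead per 90 minutes because those contributions came in a third of the playing time.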

Since the aim is to highlight players that xG and xA don’t directly pick up, you can calculate xGC without including the shots and assists to get xG Buildup. This leaves all of the preceding actions to the assist and the shot, or all of the build up play as it were. By removing assists and shots, the dominance of forwards is removed and the remaining players are heavily involved in all the play up to just before the defining assist and shot. You can also normalize xGB per 90 mins to see contributions per match. Again, each player involved gets equal contribution as long as they are involved in the possession chain in some way.
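As a rough sketch of the buildup idea, the same kind of toy events table can be filtered before credit is shared out. The `events` data and column names are made up, and the sketch assumes rows are in chronological order so that the pass directly before a shot in the same possession can be treated as the assist-type pass to remove.

```r
# Made-up events in chronological order, xG filled in for shots only
events <- data.frame(
  possession = c(1, 1, 1, 1, 2, 2),
  player     = c("A", "B", "C", "D", "B", "D"),
  type       = c("Pass", "Pass", "Pass", "Shot", "Pass", "Shot"),
  xg         = c(NA, NA, NA, 0.30, NA, 0.10)
)

# Flag shots, and the pass immediately before each shot in the same possession
is_shot <- events$type == "Shot"
next_is_shot <- c(is_shot[-1], FALSE)
same_possession_next <- c(events$possession[-1] == events$possession[-nrow(events)], FALSE)
is_key_pass <- events$type == "Pass" & next_is_shot & same_possession_next

# Value each possession by its highest-xG shot
possession_xg <- vapply(split(events$xg[is_shot], events$possession[is_shot]), max, numeric(1))

# Keep only the buildup actions, then credit each remaining player once per possession
buildup <- unique(events[!is_shot & !is_key_pass, c("possession", "player")])
buildup$xgb <- as.numeric(possession_xg[as.character(buildup$possession)])
xg_buildup <- tapply(buildup$xgb, buildup$player, sum)
xg_buildup
```

Only players A and B receive buildup credit here: player C’s pass and player B’s pass in possession 2 led directly to shots, and player D only took the shots.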

xG Chain and especially xG Buildup are great metrics that highlight the contributions of players leading up to assists and shots. They allow players that don’t contribute directly to goals to make a case for their own importance. Normalising per 90 mins is an effective way to allow for reduced player minutes due to injury or substitutions, and evaluate all players on the same basis.

As great as the concepts of xGC and xGB are, there is a clear and influential flaw in the calculation when assigning the xG of the possession chain to the players involved: each player gets equal contribution no matter how involved they were. Player A, who makes a simple 5-yard pass in their own half, gets the same assigned contribution as player B, who made the decisive through ball to a player who squared it for an open goal. Neither player would get credit in xG/xA, but both would get the same xGC/xGB contribution, despite the fact that player A’s contribution was potentially incidental while player B’s turned the possession chain from probing to penetrating and led to a shot on goal.

Another way to consider the contributions of each player is to ask: if you were to remove that player’s action, how likely is it that the possession chain would still have occurred? If you remove player A’s simple pass, it doesn’t take much for the possession chain to maintain its low threat, whereas if you remove player B’s decisive through ball then it’s unlikely that the possession chain continues in the same way. In this way, player B’s contribution could be argued to be more important than player A’s.

This leads to considering other ways of normalising xGC and xGB. Each method of assigning contribution and normalising will highlight different aspects of the build up.

Since you have all the information of each possession chain, you may have access to the number of passes or touches that each player contributed to the chain. If you apportion the xGC by the number of passes or touches, you can get a good idea of the proportion of involvement that each player has in each possession chain. For example, take a possession chain involving two players, C and D, where player C made 3 passes and player D made 4 passes, with a resulting shot that has an xG of 0.7. Player C contributed 3/7 of the passes so gets an xGC of 3/7 * 0.7 = 0.3, and player D contributed 4/7 of the passes so gets an xGC of 4/7 * 0.7 = 0.4. Since player D was involved slightly more than player C, player D gets a higher xGC. A similar calculation can be made using touches, which gives credit to players who dribble rather than just counting passes.
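The player C and player D example can be written out directly. This sketch uses the numbers from the example above, with variable names of my own choosing:

```r
# Passes made by each player in the possession chain
passes <- c(C = 3, D = 4)

# xG of the shot that ended the chain
shot_xg <- 0.7

# Share the xG out in proportion to each player's passes
xgc_share <- passes / sum(passes) * shot_xg
xgc_share
```

The shares always add back up to the xG of the shot, so this approach splits the credit between players rather than handing the full value to everyone involved.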

You aren’t limited to just counting passes or touches of the ball, you can get more creative with the allocations if you want to credit specific types of actions. You could only count progressive passes that move the ball forward by at least 10 yards, try to quantify the most important or necessary actions of a possession chain (decisive through ball/taking on a player in the box) or count the number of opposition players taken out of the game by each player involved, where ‘taking a player out the game’ may be defined as moving the ball closer to the defending team’s goal than the player.
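As one example of a more selective allocation, here is a sketch that only counts progressive passes. The `chain_passes` data, the column names and the xG value are made up, and it assumes `start_x` and `end_x` are measured in yards towards the opposition goal.

```r
# Made-up passes from a single possession chain that ended in a shot
chain_passes <- data.frame(
  player  = c("A", "A", "B", "B", "B"),
  start_x = c(30, 45, 50, 62, 80),
  end_x   = c(35, 60, 62, 60, 100)
)
shot_xg <- 0.5

# A pass counts as progressive if it moves the ball forward by at least 10 yards
chain_passes$progressive <- (chain_passes$end_x - chain_passes$start_x) >= 10

# Count progressive passes per player and share the xG out by those counts
progressive_counts <- tapply(chain_passes$progressive, chain_passes$player, sum)
xgc_progressive <- progressive_counts / sum(progressive_counts) * shot_xg
xgc_progressive
```

Player B made three of the five passes but only two were progressive, so the split is two thirds to B and one third to A. A chain with no progressive passes at all would need separate handling, since the division would be by zero.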

xG Chain and xG Buildup are both intuitive and simple metrics that assign contributions to players that don’t get directly involved in taking shots or assists but are frequently involved in preceding actions to these events. On their own they can already highlight players that seem to contribute well under the ‘eye-test’ when you watch them, but they can be misleading and provide many false positives since all actions are considered equal under xG Chain.

@TLMAnalytics

Credit to StatsBomb and Thom Lawrence for introducing the concepts and providing clear explanations and examples. They even include free data sets for the FAWSL and the 2018 FIFA World Cup if anyone wants to try for themselves. Check them out here:

https://statsbomb.com/