#23 – FBRef and Progressive Passes

Featured

There’s so much more data available for football matches now than there ever has been. There are some fantastic initiatives from StatsBomb and Metrica Sports making tracking and event data freely available. But there’s also lots of other information more widely available out on the web at the likes of https://www.transfermarkt.co.uk/ and, more recently, fbref.com. These won’t be as detailed match to match, but they offer a wider overview of what’s happening on and off the pitch.

Accessing information from the web is time consuming unless automated, which is made very easy by some powerful Python packages and widely available tutorials, such as FCPython. There are limits to what you can/should access automatically, since we don’t want to overload websites with requests. More information on scraping can be found here.

There are many public functions for scraping different sites, but my advice would be to write your own so you actually understand what you’re getting. I’ll still add my own interpretation of a web scraping function which works for player/squad tables on fbref, so here’s a link to the GitHub page with the functions I used and some examples: https://github.com/ciaran-grant/fbref_data

Progressive Passes

There are lots of types of passes available on fbref, and it’s much appreciated having all this data available to explore. Here I’m taking a look at progressive passes, with a view to seeing who’s leading the way this year and whether there are any styles we can infer.

Of the progressive passing metrics available, the ones I’ve focused on are below:

  • Total Progressive Distance
  • Number of Progressive Passes
  • Passes into the Final Third
  • Passes into the Penalty Area
  • Key Passes
  • Expected Assists

These have been chosen to capture a cross-section of both the quantity and quality of ball progression.

As each player has played a different number of minutes, I have used per-90-minute rates to compare players. It may be more suitable to use minutes in possession for offensive measures such as passing, and minutes out of possession for defensive measures, but per 90 minutes goes most of the way there.

Each metric has a significantly different range of values: Total Progressive Distance per 90 minutes will be in the 100s or 1000s, whilst xA per 90 minutes will usually be between 0 and 0.5. An extra unit of progressive distance is far less impressive than an extra unit of xA. To compare between statistics I’ve normalised each metric relative to its best performer, which forces all comparisons to be relative to peers’ productivity per 90 minutes.

This sometimes makes it hard to compare between different groups, but as long as you’re aware of what everything is relative to, that problem is minimised.
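The per-90 and relative-to-best normalisation described above can be sketched in a few lines of Python. This is a minimal illustration with invented numbers; the column names are placeholders, not fbref’s actual headers:

```python
import pandas as pd

# Invented season totals plus minutes played for three players.
df = pd.DataFrame({
    "player": ["A", "B", "C"],
    "minutes": [2700, 1800, 900],
    "prog_distance": [9000, 7200, 2700],   # total progressive distance
    "xa": [6.0, 9.0, 1.5],                 # expected assists
})

metrics = ["prog_distance", "xa"]

# Convert totals to per-90 rates so minutes played don't dominate.
per90 = df[metrics].div(df["minutes"] / 90, axis=0)

# Normalise each metric relative to its best performer (best = 1.0),
# putting 100s-scale distances and 0-0.5-scale xA on the same footing.
normalised = per90 / per90.max()
```

Player B tops both metrics per 90 here, so both of B’s normalised values come out as 1.0 and everyone else is expressed as a fraction of B’s rate.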

On to the fun stuff!

Out of all players in the 2019/20 season, it’s no surprise that two of the most complete progressive passers were Lionel Messi and Kevin De Bruyne. We’ve passed the Messi sense-check at least.

Both are among the top across almost all statistics in both ball progression and actually creating quality chances.

There seem to be two styles that most players fit into (the above two are aliens, so they don’t count). First, there are chance creators:

Angel Di Maria leads everyone in passes into the penalty area and xA, with lots of key passes too. These types of players seem to be great at using the ball in and around the box, converting possession into chances.

There are also deep progressors:

This is among all players, position agnostic, and David Alaba appears as the best passer into the final third, whilst also ranking highly for progressive pass volume and distance. These are deeper players who can move the ball from the halfway line into the final third for your chance creators to thrive.

I have identified these styles using intuition alone; the next step will be to test my theory by applying some clustering or PCA to these players to try to identify more styles.

As everyone is secretly wondering, here is a selection of some of the best U23 performers I found. In no particular order we have Christopher Nkunku, Martin Odegaard, Jadon Sancho and Trent Alexander-Arnold. These are normalised relative to other U23 players:

For perspective on just how good these players are, here they are relative to everyone. They are already some of the best players in the world, which is pretty scary.

#19: How to quantify the prevention of a potential goal scoring opportunity

Featured

Chances Created and Chances Missed

Chances Created is a metric which tries to quantify the number of goal scoring opportunities that a player is directly involved in.
Opta’s definition is Assists + Key Passes, where an Assist is a pass (final touch) that results in a goal from the subsequent play, and a Key Pass is a pass that results in a shot that doesn’t become a goal.
So chances created reduces to the final passes (touches) before another player has an attempt at goal (whether they score or not).
As I’ve discussed on The Monthly Football Podcast, assists alone are pretty random, since they rely on the shot actually going in. Chances created is less noisy and should predict future assists more reliably than assists themselves do, simply because chances are created in far greater volume than goals are actually scored.

Chances created relies on a player actually getting a shot away at the end; otherwise there is no record of the opportunity. Opta also have ‘Chance Missed’, defined as a big chance where the player doesn’t get a shot away. A chance missed is attributed to the player who has the big chance and decides not to shoot, which doesn’t help the creator who provided it. If we assume the miss is largely down to the receiving player not executing an attempt, then mapping these chances missed back to the creator, in addition to chances created, would credit them for creating the opportunity rather than punishing them for something out of their control, such as the forward delaying a shot and missing the chance to take one.
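As a toy illustration of the reattribution idea above (all events and names invented), chances created and a creator-adjusted version that also credits chances missed might look like:

```python
# Each record: who played the final pass and what the receiver did next.
events = [
    {"creator": "Silva", "outcome": "goal"},     # assist
    {"creator": "Silva", "outcome": "shot"},     # key pass
    {"creator": "Silva", "outcome": "no_shot"},  # big chance, no shot taken
    {"creator": "Ozil",  "outcome": "shot"},     # key pass
]

def chances_created(events, player):
    # Opta-style: assists (goal) + key passes (shot that isn't a goal).
    return sum(e["creator"] == player and e["outcome"] in ("goal", "shot")
               for e in events)

def chances_created_adjusted(events, player):
    # Also credit the creator when the receiver never got a shot away.
    return sum(e["creator"] == player for e in events)
```

Here Silva has 2 chances created under the standard definition, but 3 once the unconverted big chance is mapped back to him.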

Chances Denied Metric

As usual, chances created and most quantified statistics deal with the offensive side of the game, since it’s more tangible. Shots are there and they happen; counting them is pretty straightforward. A bit less straightforward is counting the passes prior to shots, as chances created does. Both can be tracked over many events to quantify expected outcomes based on similar situations in the past, which gives us expected goals and expected assists. What is not straightforward is quantifying the benefit of defensive actions.

We can count tackles, blocks, interceptions and recoveries, however, much like steals, blocks and rebounds in the NBA, they don’t quite tell the whole story about how a defence works. Weaker teams are asked to defend more since they have less possession, this means they have more chances to rack up interceptions, tackles and recoveries. Possession adjusting these measures helps somewhat to normalise these differences, which means that we can compare the frequency of each action assuming they all have equal chances to do so. However it’s still hard to differentiate the quality of the actions, or how important they were to each team.

Chances denied is an attempt to quantify how much of an opportunity was denied by an interception or ball recovery. In a purely defensive sense (denying your opponent a goal scoring opportunity), recovering the ball in the middle of the pitch is not as important as recovering it on the edge of your own box. Expected threat (xT), created by Karun Singh (@karun1710), quantifies how likely a team is to score from each location on the pitch within the next 5 actions. If we assign an xT value to each recovery, interception or tackle based on where on the pitch it occurs, we get a proxy for how important each action was. Since defensive teams will get more opportunities, it may also be worth possession adjusting this to compare like for like.
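A minimal sketch of that idea in Python, assuming we already have an expected-threat lookup by pitch location (the zone values below are invented for illustration, not Karun Singh’s actual grid):

```python
# Pitch length runs 0-120, with our own goal at x = 0. We look up the
# *opponent's* threat at the spot where the ball was won back: a recovery
# deep in our own third extinguishes far more danger than one upfield.
def opponent_xt(x):
    if x < 40:       # our defensive third
        return 0.08
    elif x < 80:     # middle third
        return 0.03
    else:            # opponent's third
        return 0.01

recoveries = [  # invented events: player and x-location of each recovery
    {"player": "Fernandinho", "x": 25},
    {"player": "Fernandinho", "x": 70},
    {"player": "Kante", "x": 35},
]

def chances_denied(recoveries, player):
    # Sum the opponent threat extinguished by the player's recoveries.
    return sum(opponent_xt(r["x"]) for r in recoveries
               if r["player"] == player)
```

A real version would use the full xT surface rather than coarse thirds, and possession-adjust the totals as discussed above.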

The general concept being captured here is the quality of chance, or potential quality of chance, that is denied by the defender’s action. This quantity can be given solely to the defender making the action, or assigned collectively to the players involved to recognise the team aspect of defensive play. There is a question of whether to include tactical fouls as well as legal ball recoveries, but I’ll save that for another time.

@TLMAnalytics

https://www.optasports.com/news/opta-s-event-definitions/
https://karun.in/blog/expected-threat.html

#16: StatsBomb – Messi Ball Receipt Locations

Featured

** Re-uploaded with correct y-axis (duh, Messi played on the right..)

Introduction

Over the last few months, StatsBomb have released all of the event data for Lionel Messi’s La Liga matches. Data this detailed and clean is incredibly hard (expensive) to come by, so giving free access to everyone is amazing and much appreciated!

I’ve only just had a chance to take a look at the data, and seen many great pieces put out already. Considering who the data is of, I don’t think it will ever go out of fashion so it’s never too late to start playing around.

You can get access to the data and there’s a very helpful getting started guide here:
https://statsbomb.com/resource-centre/
https://statsbomb.com/2019/07/messi-data-release-part-1-working-with-statsbomb-data-in-r/

** You do need to have the latest version of R and the StatsbombR package installed

There are almost too many things to look at in this data set, so I’ve decided to try to focus on a specific part of Messi’s game and see what I find.

Messi gets the ball, a lot. And he obviously does great things with it once he’s got it, but taking a look at where/how he manages to get the ball would be interesting. Surely the one thing an opposing team would try to do against him would be to try to stop him getting the ball or at least limit him receiving the ball in dangerous areas. That’s the inspiration for taking a look at where he receives the ball on the pitch.

Data Prep

Load all the necessary libraries. The usual suspects for R data manipulation like plyr/tidyverse/magrittr, plotting graphs and pitches with FC_RStats’ SBpitch/ggplot2/cowplot and access to the data in StatsBombR.

# Libraries ---------------
library(plyr)
library(StatsBombR)
library(SBpitch)
library(ggplot2)
library(tidyverse)
library(magrittr)
library(cowplot)

We are only looking at La Liga matches, so let’s only load matches from that competition. There is even a cleaning function ‘allclean’ which adds in some extra columns which will be of use such as x/y locations. We have joined on the season names also as they’re much more intuitive than the season id that has been assigned.

There are events from Messi’s debut season, 2004/05, through to 2015/16, covering shots, passes and even nutmegs. We’re interested in passes received by the man himself. Note there is also an indicator, “ball_receipt.outcome.name”, that identifies when a pass is missed; we want to exclude these and only look at passes Messi actually received (NA values).

Plotting a pitch

To get some perspective relative to an actual football pitch and use StatsBomb’s event location data, FC_RStats has created a function “create_Pitch” which does exactly that. Using ggplot2 and a set of pitch type parameters, it’s easy to plot a pitch with the same proportions as the event data collected by StatsBomb.

This pitch can be used as the base to visualise all events by plotting the x/y locations.

goaltype = "box"
grass_colour = "#202020"
line_colour =  "#797876"
background_colour = "#202020" 
goal_colour = "#131313"

ymin <- 0 # minimum width
ymax <- 80 # maximum width
xmin <- 0 # minimum length
xmax <- 120 # maximum length

blank_pitch <- create_Pitch(
  goaltype = goaltype,
  grass_colour = grass_colour, 
  line_colour =  line_colour, 
  background_colour = background_colour, 
  goal_colour = goal_colour,
  padding = 0
)

plot(blank_pitch)

Quick Look

Initial data and visual processing is done, we can now start to take a look at the interesting stuff!

At a high level, we can have a look at all of the times Messi received the ball and plot them on a pitch. This will probably get overcrowded but can start to provide some understanding.

# All ball receipts ------------
Messi_Plot <- 
  blank_pitch +
    geom_point(data = Messi_Ball_Receipts, aes(x=location.x, y=80-location.y), colour = "purple") +
    ggtitle("Messi Ball Receipts") +
    theme(plot.background = element_rect(fill = grass_colour),
          plot.title = element_text(hjust = 0.5, colour = line_colour))
plot(Messi_Plot)

Since each time Messi receives the ball it happens at a specific location, we’ve used points to represent this on the pitch. It looks okay initially, but it’s pretty hard to work out exactly what’s going on and it doesn’t really tell us anything we didn’t already know: Messi gets the ball a lot in the opposition’s half.

There are lots of overlapping points, let’s try to get a view of the density distribution to see where specifically he has received the ball the most.

# Density Receipts ----------------
Messi_Density_Plot <- 
  blank_pitch +
    geom_density_2d(data = Messi_Ball_Receipts, aes(x=location.x, y=80-location.y), colour = "purple") +
    ggtitle("Messi Ball Receipts - Density") +
    theme(plot.background = element_rect(fill = grass_colour),
          plot.title = element_text(hjust = 0.5, colour = line_colour))
plot(Messi_Density_Plot)

This mostly suggests the same thing, Messi likes to receive the ball in the opposition half. Though we can now also see that there are two “peaks”, one far out wide near the top and one closer to the centre. More central areas are more dangerous, whereas you might get more space out wide to be able to receive the ball easier.

Luckily (definitely not luckily) StatsBomb just so happen to have a flag which identifies events that occurred under pressure. I believe under pressure is taken as having an opposition player within X metres of you actively affecting your decision making.

Let’s take a look at Messi’s ball receives whilst under pressure and under no pressure. I would expect that you would be under pressure more often the closer you get to the opposition’s goal.

# Pressure --------------
Messi_Ball_Receipts <- Messi_Ball_Receipts %>%
  mutate(pressure = ifelse(is.na(under_pressure), "No Pressure", "Pressure"))

Messi_Pressure_Plot <-
  blank_pitch +
  geom_point(data = Messi_Ball_Receipts, aes(x=location.x, y=80-location.y, colour = pressure)) +
  ggtitle("Messi Ball Receipts by Pressure") +
  theme(plot.background = element_rect(fill = grass_colour),
        plot.title = element_text(hjust = 0.5, colour = line_colour),
        legend.position = "bottom",
        legend.title = element_blank(),
        legend.background = element_rect(fill = grass_colour),
        legend.text = element_text(color = line_colour))

Messi_Pressure_Plot

I’m not really sure what I expected. It’s pretty hard to distinguish between the two as there are so many points. To further filter the data we can take a look at this in each season.

# Pressure Season Loop ----------------
for (i in rev(La_Liga$season_name)) {
    print(
      blank_pitch +
        geom_point(data = Messi_Ball_Receipts %>% filter(season_name == i), 
                   aes(x=location.x, y=80-location.y, colour = pressure))  +
        ggtitle(paste0("Messi Ball Receipts by Pressure - ", i)) +
        theme(plot.background = element_rect(fill = grass_colour),
              plot.title = element_text(hjust = 0.5, colour = line_colour),
              legend.position = "bottom",
              legend.title = element_blank(),
              legend.background = element_rect(fill = grass_colour),
              legend.text = element_text(color = line_colour))
      ) 
}

Now there’s a lot less going on. Remember those two peaks of ball receipts from above? We can see here that they come from Messi receiving the ball in different areas of the pitch in different seasons. Again, this is something we probably already knew. Messi started his career as a wide forward, so he received the ball out wide most of the time. From 2009/10 onwards he starts to receive the ball much more centrally, coinciding with his time playing as a “False 9” up front. Coincidentally, his already ridiculous production output skyrocketed. Messi getting the ball in central areas = goal machine.

This still hasn’t really answered the question of where Messi receives the ball under pressure, as it’s hard to tell whether there’s a pattern to the blue/red or if it’s all just random.

Something that can help here are marginal density plots. These can be plotted along each axis separately and can hopefully display the distribution of ball receipts more intuitively.

Taking a look at all seasons initially.

xdens_pressure <- axis_canvas(Messi_Pressure_Plot, axis = "x") +
  geom_density(data = Messi_Ball_Receipts, aes(x=location.x, fill = pressure), alpha = 0.5) +
  xlim(xmin, xmax)

combined_pressure_plot <- insert_xaxis_grob(Messi_Pressure_Plot, xdens_pressure, position = "top") 

ydens_pressure <- axis_canvas(Messi_Pressure_Plot, axis = "y", coord_flip = TRUE) +
  geom_density(data = Messi_Ball_Receipts, aes(x=80-location.y, fill = pressure), alpha = 0.5) +
  xlim(ymin, ymax) +
  coord_flip()

combined_pressure_plot %<>%
  insert_yaxis_grob(., ydens_pressure, position = "right")
ggdraw(combined_pressure_plot)

Again there’s a bit too much going on on the pitch here, but looking at the marginal distributions across each axis is interesting.

Across the top it looks like there’s not too much difference in distribution between “Pressure” and “No Pressure”. There is a higher peak for “No Pressure” about halfway inside the opposition’s half, which could be because Barcelona practically camp outside the opposition box and all defenders sit on the edge of their own box for the majority of most games.

Along the right is as expected: there are many more passes received under no pressure out wide.

And for each season separately.

for (i in rev(La_Liga$season_name)) {
    p <- blank_pitch +
      geom_point(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=location.x, y=80-location.y, colour = pressure)) +
      ggtitle(paste0("Messi Ball Receipts - ", i)) +
      theme(plot.background = element_rect(fill = grass_colour),
            plot.title = element_text(hjust = 0.5, colour = line_colour),
            legend.position = "bottom",
            legend.title = element_blank(),
            legend.background = element_rect(fill = grass_colour),
            legend.text = element_text(color = line_colour))


    xdens <- axis_canvas(p, axis = "x") +
      geom_density(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=location.x, fill = pressure), alpha = 0.5) +
      xlim(xmin, xmax)
    xplot <- insert_xaxis_grob(p, xdens, position = "top") 

    ydens <- axis_canvas(p, axis = "y", coord_flip = TRUE) +
      geom_density(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=80-location.y, fill = pressure), alpha = 0.5) +
      xlim(ymin, ymax) +
      coord_flip()

    comb_plot <- insert_yaxis_grob(xplot, ydens, position = "right")
    print(ggdraw(comb_plot))
}

Now this is what we all came here to see.

For the first six seasons of his career (2004/05 – 2009/10), Messi actually received the ball closer to goal under no pressure than he did under pressure (the distribution across the top), which is pretty incredible and the opposite of both what you would expect and what we saw overall. These are the wide-forward Messi seasons, which shows just how good he was, and was becoming, in that role. The location where he most often received the ball under no pressure (the peak of the top distribution) actually moves closer to the opposition goal each season until 2008/09.

Then in 2009/10 something magical happens. Somehow he manages to receive the ball under no pressure both closer to the goal (across the top) AND dead in the centre of the field (along the right). Which of course is a recipe for success.

From then on, looks like teams at least tried to put pressure on him when he received the ball close to the goal. Not really sure that worked so much though.

There are a lot more amazing things from Messi’s career hidden away in this amazing data set. Thanks again to StatsBomb for the free access to explore and show off some things that are possible with the data.

@TLMAnalytics

#14 What Defines a Successful Season?

Featured


No matter what happens, Liverpool’s season is a success.

With the best Premier League title race going down to the last day of the season, it’s a stark contrast to last season, when Manchester City had the league wrapped up and were aiming for 100 points. They became the Premier League team with the most points in a season, beating Chelsea’s 95 from 2004/05 by 5 points, arguably becoming the most successful Premier League team. They were so good that it’s no surprise that this year Manchester City are on 95 points with a game to play, potentially finishing on 98 and recording the second-highest points total the year after smashing the record. The surprise of this year is that despite Manchester City being so good, the title is still going down to the final day. Liverpool have 94 points with a game to play, and a win will bring them to 97, making them at least the team with the third-highest total, depending on how Manchester City’s game turns out. We are likely seeing two of the top three Premier League sides ever in the same season, with the best of them being one of these teams last year! It’s truly an incredible season, and hopefully we appreciate how good these teams are.

This brings us to the imminent question and judgement on the whole season that comes from these last games. One team will be champions and one team will not. One team’s season will be a success and one team’s won’t. That may seem unfair since as discussed, these could be two of the three best teams to be seen in the Premier League.

However, there are of course more trophies to be won and more routes to success than just the Premier League. Manchester City got 100 points last year, are on course for 98 this year, have already won the Carabao Cup and Community Shield, and are in the FA Cup final. They are on course for the domestic treble, and the only team to get more points than them in a season was themselves last year. But they were knocked out of the Champions League at the quarter-final stage by Tottenham, which is what people are keen to focus on: despite domestic success, Manchester City once again failed in Europe. The criticism is fair; Manchester City were favourites to beat Tottenham over two legs but didn’t, largely due to prioritising the league over their first-leg match. That’s where the problem with success lies. Manchester City were going for the quadruple and it looks likely they will have to settle for a domestic treble. This just shows how high their standards are and what perceived success is for a team of their quality. With two games to go, one in the league and one in the FA Cup final, from here they expect to win both. Even if they don’t, they already have the joint second-highest points total and have won the Carabao Cup. That is probably not a success considering how close they got to all of their goals, but it is one hell of a season, with every chance to do the same again next year.

In comparison and with the incredible Champions League semi-finals just behind us, Liverpool have made it to the Champions League final for the second time in two years and are favourites to win this time against Tottenham. Liverpool are in contention to win the Premier League and the Champions League this year, that is an incredible achievement in itself. They lost to the Real Madrid three-peat side with Cristiano Ronaldo and without Mohamed Salah last year, as expected. Most teams don’t get to a single European final, let alone get to back to back finals. They have managed to beat Paris St Germain, Bayern Munich and Barcelona on their way to the final, even with Lionel Messi largely pulling the semi-final tie away from them in the first leg, they were the better team across both legs and you can’t argue they don’t deserve to be there.

As a worst-case scenario for the finish to this season, if Liverpool lose their final Premier League match and lose the Champions League final to Tottenham, they will still have the fourth-highest points total in a season and have reached back-to-back Champions League finals. Even in that worst case, you could argue it’s a successful season. They are expected to win the Champions League and beat Wolves on the final day, ultimately finishing on 97 points and coming second to the second-best team in the Premier League. Their expected finish to the season is definitely a success. And if Manchester City were to drop points and Liverpool won the league title, doing the Premier League and Champions League double whilst recording the second-highest points total in a season would cement this team among the Premier League’s best. A team can’t on one hand potentially be considered the best ever and on the other be judged to have had an unsuccessful season based on two games of football. No matter what happens, Liverpool’s season is a success.

@TLMAnalytics

#12 Statsbomb Event Data – Fernandinho Replacements

Featured

Manchester City find themselves once again top of the Premier League, with the chance to retain the title for the first time since Manchester United in 2008/09, 10 years ago. However, they also find themselves without Fernandinho, the only seemingly irreplaceable player in a squad that overflows with talent. Fernandinho has missed four Premier League games so far this season: the two at the end of December, both of which they lost, leaving the league title in Liverpool’s hands, and the two most recent games, both dominating 1-0 wins. Even if their performances were no worse and just lacked some luck, there is no doubt that nobody else in their squad can do exactly what Fernandinho does.

Even Guardiola has commented that there is no doubt they will be looking to bring in a replacement:

“I think with the way we play we need a guy who has of course physicality, is quick in the head and reading where our spaces to attack are”

Guardiola

In this post I will try to scout a replacement for Fernandinho using Statsbomb’s 2018 FIFA World Cup event data. This is a small sample, so it only includes players and their performances at the World Cup. I will define some metrics that describe the type of player who would fit the role Fernandinho plays, and identify the players who performed best on those metrics during the World Cup.

Guardiola talks about physicality, quickness of thought and reading where the spaces to attack will be. Those qualities are hard to quantify; however, adapting some simpler metrics could give a good shortlist.

We know that Manchester City will have the ball a lot and want to get it forward to their more attacking players in attacking areas, relying on Fernandinho to progress the ball. Using Statsbomb’s passing events, with start and end locations in x, y coordinates, I have defined a ‘Progressive Pass’ as one that moves the ball more than 10m up the pitch. Players with the ability to progress the ball forwards are what we want. It could be argued that we should only include players who progress the ball from deeper positions, to more accurately emulate Fernandinho’s role, but the sample is small as it is and the ability to play progressive passes is what we are looking for.
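The progressive-pass filter described above can be sketched like this (treating StatsBomb’s 0–120 length coordinate as roughly metres; the sample passes are invented):

```python
# A 'Progressive Pass': the end location is more than 10 units further
# up the pitch (towards the opponent's goal) than the start location.
def is_progressive(pass_event, threshold=10):
    x_start = pass_event["location"][0]
    x_end = pass_event["end_location"][0]
    return x_end - x_start > threshold

passes = [
    {"player": "Kroos", "location": (30, 40), "end_location": (55, 35)},
    {"player": "Kroos", "location": (60, 20), "end_location": (62, 70)},  # sideways
]
progressive = [p for p in passes if is_progressive(p)]
```

Only the first pass qualifies: it gains 25 units up the pitch, while the second is essentially a switch of play.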

Whilst lots of players are great passers, what makes Manchester City so special and Fernandinho so hard to replace is their ability and willingness to win the ball higher up the pitch. Check out a previous post (linked below) where I show how many more times they win the ball back in the opposition’s half. In the same vein, using Statsbomb’s ball recovery events with x, y locations, I count the times each player has recovered the ball in the opponent’s half. This tries to capture the ability to win the ball back quickly after losing it and pin the opposition back.
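Counting opponent-half recoveries is then a simple filter on the recovery locations (halfway line at x = 60 in StatsBomb coordinates; the events below are invented):

```python
from collections import Counter

recoveries = [
    {"player": "Fernandinho", "x": 75},
    {"player": "Fernandinho", "x": 40},
    {"player": "Kante", "x": 65},
]

# Keep only recoveries past the halfway line and tally per player.
opp_half_recoveries = Counter(
    r["player"] for r in recoveries if r["x"] > 60
)
```

Combining this tally with the progressive pass counts gives the two axes used for the shortlist below.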

https://thelastmananalytics.home.blog/2018/11/06/3-are-man-city-better-without-the-ball-defensive-analysis/

The combination of progressive passes and high ball recovery is used as a proxy for the type of skills that Fernandinho portrays and can be used to get a shortlist of players that perform similarly. Looking at only the players who played positions considered as central midfield or defensive midfield, the top 10 is below.

Figure 1: Midfield Progressive Passes and Opponent Half Recoveries Top 10 from 2018 FIFA World Cup

One thing to note is that these are pure counts and not per game or per 90min. It would be worth taking a look at that to account for the differences in games and minutes played. For example, Croatia making the Final and Germany getting knocked out in Group Stage is a difference of four games, so Toni Kroos making it to 2nd on the absolute list is incredible.

Initially it looks like the list makes sense, players like Kroos, Modric, Rakitic are all players who you could see being able to play in a deeper midfield role. Mascherano is also in the same mould, even more so considering he has played at Centre Back most of the time for Barcelona and Fernandinho has begun to slot in there to bring the ball out.

Those players are all 30+ years old so no better than Fernandinho in terms of potential replacements. Granit Xhaka and Marcelo Brozovic are two that are just entering their prime midfield years at the age of 26. This is where it’s important to note that when scouting, context is important and large sample sizes are encouraged. Xhaka may have the progressive passing ability and love of yellow cards, but probably wouldn’t have the discipline.

This post has outlined a way to narrow down a shortlist of potential replacements for Fernandinho; the same method can be used to find similar players for any player, as long as you can identify what you are looking for. Ideally you would use a much larger sample of games and look at a player’s contribution per game or per 90 mins to get a more stable shortlist. In the future I would like to look at some unsupervised methods which don’t require you to specify or create the comparison fields as I have done here.

I have included the total passing heatmaps and the recovery maps of selected players; if you want to see any players specifically from the World Cup from any position then give me a shout!

Once again, massive shout out to Statsbomb for providing the free source of event level data, it’s hard to come by and even harder to collect so it’s much appreciated!

@TLMAnalytics

#11 Normalizing xG Chain – Are all actions created equal?

Featured

In this post I will be taking a look at the concepts of xG Chain (xGC) and xG Buildup (xGB), why they are useful and how we can develop these concepts to get even more use from them. Both of these concepts further the expected goals (xG) and expected assists (xA) metrics, allowing the contribution of players not directly involved in a goal to be accounted for.

xG is a likelihood attached to each shot, estimating the chance of that shot becoming a goal. This metric is only really useful for players who take lots of shots, such as forwards.

xA is attached to the pass that immediately precedes a shot, measuring the likelihood that the pass becomes an assist via the following shot. This metric aims to widen the influence of xG, crediting the creative players who create the shots that xG measures.

Both are intuitive, simple concepts that provide an estimate for specific actions on the pitch. Goals and assists are key, highly predictive events in a match, so it makes sense to focus analysis on them. xG and xA are very limited, however: they only care about a shot and its preceding pass, so they tell us nothing about the play that happens leading up to that point. It turns out that the majority of football isn’t just taking turns to shoot, so it would be nice to be able to do something like xG/xA for other actions on the pitch.

Just as xA extends xG by attributing the result to the preceding pass, xG Chain extends xA by doing the same for the whole preceding possession chain. In this way you widen the influence of xG to all players involved in the preceding possession. Where xG mainly highlights forwards and xA mainly highlights creative players, xG Chain aims to highlight players who contribute to the possessions that end in a shot. These could include your ‘assisting the assister’ players, deep-lying playmakers like Jorginho who get criticised for a lack of assists, or progressive-passing defenders who wouldn’t usually get the credit they potentially deserve for starting effective possessions.

Calculating xG Chain: https://statsbomb.com/2018/08/introducing-xgchain-and-xgbuildup/

  • Find all possessions each player is involved in
  • Find all shots within those possessions
  • Sum the xG of those shots (usually take the highest xG per possession)
  • Assign that sum to each player, however involved they are

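The steps above can be sketched in a few lines of Python. This is a minimal illustration assuming hypothetical possession-chain records; the `players` and `shots_xg` field names are made up for this sketch, not StatsBomb's actual data format:

```python
# Minimal sketch of the xG Chain calculation described above.
# Each hypothetical chain record lists the players involved and the
# xG of every shot taken during that possession.
from collections import defaultdict

def xg_chain(possession_chains):
    """Sum, per player, the best xG of every chain they were involved in."""
    xgc = defaultdict(float)
    for chain in possession_chains:
        if not chain["shots_xg"]:
            continue  # chains without a shot contribute nothing
        chain_xg = max(chain["shots_xg"])  # usually take the highest xG per possession
        for player in sorted(set(chain["players"])):  # everyone involved gets full credit
            xgc[player] += chain_xg
    return dict(xgc)

chains = [
    {"players": ["A", "B", "C"], "shots_xg": [0.1, 0.5]},
    {"players": ["B", "C"], "shots_xg": [0.25]},
    {"players": ["A", "D"], "shots_xg": []},
]
print(xg_chain(chains))  # {'A': 0.5, 'B': 0.75, 'C': 0.75}
```

Note that player D, involved only in the shotless chain, earns nothing, while B and C accumulate credit from both shooting chains.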
You can normalise xGC per 90 minutes to see contribution per match. However, this still highlights forwards and creative players: the players taking the shots get full credit for their own shots plus any other possession chains they are involved in.

Since the aim is to highlight players that xG and xA don’t directly pick up, you can calculate xGC while excluding the shots and assists, giving xG Buildup. This leaves all of the actions preceding the assist and the shot: all of the build-up play, as it were. By removing assists and shots, the dominance of forwards disappears, and the remaining players are those heavily involved in the play up to just before the defining assist and shot. You can also normalise xGB per 90 minutes to see contribution per match. Again, each involved player gets an equal contribution, as long as they are part of the possession chain in some way.
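xG Buildup can be sketched the same way. Here I assume hypothetical records listing `(player, action)` pairs per chain (made-up field names again); a player whose only involvement is the shot or the assist earns no build-up credit:

```python
# Sketch of xG Buildup: like xG Chain, but the shot and the assist are
# stripped out, so only build-up involvement earns credit.
from collections import defaultdict

def xg_buildup(possession_chains):
    xgb = defaultdict(float)
    for chain in possession_chains:
        buildup_players = sorted({
            player for player, action in chain["actions"]
            if action not in ("shot", "assist")  # remove the defining events
        })
        for player in buildup_players:
            xgb[player] += chain["xg"]
    return dict(xgb)

chains = [
    {"xg": 0.5,
     "actions": [("E", "pass"), ("D", "pass"), ("C", "assist"), ("B", "shot")]},
]
print(xg_buildup(chains))  # {'D': 0.5, 'E': 0.5} -- shooter B and assister C get nothing
```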

xG Chain and especially xG Buildup are great metrics that highlight the contributions of players leading up to assists and shots. They allow players that don’t contribute directly to goals to make a case for their own importance. Normalising per 90 mins is an effective way to allow for reduced player minutes due to injury or substitutions, and evaluate all players on the same basis.

As great as the concepts of xGC and xGB are, there is a clear and influential flaw in assigning the xG of the possession chain to the players involved: each player gets an equal contribution no matter how involved they were. So player A, who makes a simple five-yard pass in their own half, gets the same assigned contribution as player B, who plays the decisive through ball to a player who squares it for an open goal. Neither would get credit under xG/xA, but both get the same xGC/xGB contribution, despite the fact that player A’s contribution was potentially arbitrary while player B’s turned the possession chain from probing to penetrating and into a shot on goal.

Another way to consider each player's contribution is to ask: if you removed that player's action, how likely is it that the possession chain still occurs? Remove player A's simple pass and it doesn't take much for the chain to maintain its low threat, whereas if you remove player B's decisive through ball it's unlikely the chain continues in the same way. In this sense, player B's contribution could be argued to be more important than player A's.

This leads to considering other ways of normalising xGC and xGB, each method of assigning contribution and normalising will highlight different aspects of the build up.

Since you have all the information about each possession chain, you may have access to the number of passes or touches each player contributed to it. If you apportion the xGC by the share of passes or touches, you get a good idea of each player's proportion of involvement in each chain. For example, suppose a possession chain involves two players, C and D, where player C made 3 passes and player D made 4, with a resulting shot of 0.7 xG. Player C contributed 3/7 of the passes, so gets an xGC of 3/7 * 0.7 = 0.3, and player D contributed 4/7, so gets 4/7 * 0.7 = 0.4. Since player D was involved slightly more than player C, player D gets the higher xGC. A similar calculation using touches would also credit players who dribble, rather than just counting passes.
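The worked example above translates directly into a short sketch (function and field names are my own for illustration):

```python
# Split a chain's xG in proportion to each player's pass count,
# rather than crediting everyone equally.
def proportional_xgc(pass_counts, chain_xg):
    total = sum(pass_counts.values())
    return {p: n / total * chain_xg for p, n in pass_counts.items()}

# Players C and D from the example: 3 and 4 passes, 0.7 xG shot.
shares = proportional_xgc({"C": 3, "D": 4}, chain_xg=0.7)
print({p: round(v, 3) for p, v in shares.items()})  # {'C': 0.3, 'D': 0.4}
```

Swapping pass counts for touch counts gives the dribbler-friendly variant mentioned above without changing the function.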

You aren’t limited to counting passes or touches; you can get more creative with the allocations if you want to credit specific types of actions. You could count only progressive passes that move the ball forward by at least 10 yards, try to quantify the most important or necessary actions of a possession chain (a decisive through ball, taking on a player in the box), or count the number of opposition players taken out of the game by each player involved, where ‘taking a player out of the game’ may be defined as moving the ball closer to the defending team’s goal than that player.

xG Chain and xG Buildup are both intuitive and simple metrics that assign contributions to players who aren't directly involved in shots or assists but are frequently involved in the actions preceding them. On their own they can already highlight players who seem to contribute well under the ‘eye test’, but they can be misleading and produce many false positives, since all actions are considered equal under xG Chain.

@TLMAnalytics

Credit to Statsbomb and Thom Lawrence for introducing these concepts and providing clear explanations and examples. They even include free data sets for the FAWSL and the 2018 FIFA World Cup if anyone wants to try it themselves. Check them out here:

https://statsbomb.com/

#10 Match Report: Man City 2 – 1 Liverpool

Featured

Liverpool head into their first game of 2019 still unbeaten and 7 points clear of arguably the best ever Premier League side, reigning champions Manchester City. Manchester City were on course for another incredible year, and still are by anyone else’s standards; however, losing at home to Crystal Palace and then away to Leicester in two of their last three games was not in the script for their next documentary.

Up to Christmas, City had been unbeaten too, sitting top of the league having already played all of the other ‘Top 6’ sides away from home. It was looking like the question was whether City could go unbeaten, with Liverpool doing amazingly just to keep up. A severe dip in form, a key injury and some incredible shooting against them saw City relinquish the lead in the title race, with Liverpool not looking like slowing down at all.

A Liverpool win at the Etihad and the gap becomes 10 points; arguably the title race would be over without a Liverpool collapse (not impossible). A draw would maintain the 7-point gap, but would also give Liverpool hope that they can continue their excellent season, since the champions couldn’t beat them at their own ground. A win for City would reduce the gap to 4 points, which means City would still be relying on Liverpool messing up, but it would also mean Liverpool are no longer untouchable and City will have put doubt in Liverpool’s minds.

Considering City finished champions 25 points ahead of Liverpool and won 5-0 in this fixture last season, if I told you this was the most even game of the season so far, you would be surprised to say the least. It shows how far Liverpool have come in such a short space of time that this is indeed the case: the game was incredibly even, and almost any result could have happened if it were repeated.

City did end up winning 2-1; however, the Expected Goals (xG) from Understat suggest it wasn’t an easy win. The xG score was City 1.18 – 1.38 Liverpool, suggesting that, if replayed, a draw would be the most likely result, with Liverpool arguably winning slightly more often than City. For a game between two of the highest-scoring teams in the league, there were not many shots or chances created, only 9 – 7 for City and Liverpool respectively. This low shot volume adds to the variance in the xG numbers and emphasises that the outcome would come down to individual finishing skill or luck, rather than an overwhelming inevitability that someone would score.

Figure 1: Size of bubble = Expected Goals (xG), Location = Location of shot, Stars = Goals

In terms of finishing, Liverpool were not very clinical, though they did create the best chance of the game: a lovely cross-field pass followed by a first-time ball across the box for a tap-in to an empty goal. They also had a shot cleared off the line by centimetres following a scramble after a rebound off the post. Other than those two, Liverpool were limited to shots through crowds of bodies. City managed to manufacture some chances through counter attacks, and also capitalised on the fact that Sergio Aguero is an incredible finisher from tight angles. Whilst Liverpool scored their highest-xG chance (0.62), City missed both of their highest-xG chances (0.49, 0.32) and scored from two lower-xG chances (0.06, 0.05), which suggests that City’s finishing when it mattered was the difference.

Since not many chances were being created, most of the game and its interesting plays happened between the two boxes. There are three players I’d like to highlight, all playing central midfield: Fernandinho, Bernardo Silva and James Milner. It’s hard to quantify the effect these players had on the game, but all three were excellent at denying the opposition any space or progression up the pitch.

No player had more ball recoveries than Silva with 10; Fernandinho had 9, and Milner, despite only being on the pitch for about an hour, had 7. In and of itself a ball recovery doesn’t mean much; what is so impressive, especially for the City players, is the area of the pitch where they won the ball back.

Figure 2: 25/65 Man City recoveries in Liverpool’s half, 10/56 Liverpool recoveries in Man City’s half

https://thelastmananalytics.home.blog/2018/11/06/3-are-man-city-better-without-the-ball-defensive-analysis/

5 of Silva's 10 ball recoveries and 4 of Fernandinho's 9 were in Liverpool’s half, which suggests City were winning the ball back high up the pitch and not allowing Liverpool to progress very far. Compared to other players with high recovery counts, this is significant. Not only recoveries: Silva also completed 3 tackles around the halfway line from 8 (!) attempts and made 4 interceptions in Liverpool’s half. As you can imagine, Silva got around the pitch a lot in this game, covering 13.7 km, the most by any player in a game this season. I don’t usually like those kinds of stats, since they say nothing about a player’s involvement and may just suggest that a player is out of position and recovering for the whole game. However, Silva was definitely involved, and sometimes the extra effort one player puts in makes others do the same.

A lot of Fernandinho’s work is done off the ball, in ways that aren’t captured by tackles, interceptions or distance covered. It’s clear how large an impact he has on City’s midfield: the two games he missed through injury are the two games they have lost so far this season. Fernandinho deserves more than a paragraph on one game to highlight his skills; he’ll be the focus of an upcoming post. But City need to find a replacement quickly, or find a way of playing that doesn’t rely so heavily on him sweeping up behind the front five’s press.

It’s a shame that James Milner was the one to come off early in the second half; Milner plays similarly to Bernardo Silva when Liverpool use a midfield three, and was as effective as Silva defensively until he was taken off. Moving to a 4-2-3-1 when they needed to score was probably sensible; however, needing a goal and leaving Jordan Henderson on the pitch alongside Fabinho (a better version of Henderson) doesn’t always end well. It worked out, since Liverpool scored an amazing team goal, but they may have been more of a threat with Milner alongside Fabinho. It also doesn’t help pushing Wijnaldum out to left wing with several wingers sitting on the bench, but hey.

Come the end of the season, this game will be regarded as a turning point whatever happens. Whether Liverpool collapse and City come back to win a second title in a row, or Liverpool brush it off and continue in the same manner, we will find out; but Manchester City have shown their hand and they are here to stay until the end of the season. We have our first real title race in years. Take it in and enjoy it.

Thanks to @StatsZone and Understat for images, stats and xG numbers.

@TLMAnalytics

#7 Defensive Metrics [Decision Making]

Featured

“If I have to make a tackle then I have already made a mistake.”

Paolo Maldini

It’s a famous quote I’m sure you’ll have heard, but you can hear the penny drop in every single person who hears it for the first time. One of the best defenders (if not the best) ever to have played football couldn’t be wrong, could he? Yet defenders and defensive players are judged mainly on statistics such as the number of tackles or blocks, and tackles and blocks are usually last-ditch attempts to prevent an opponent from progressing.

Defending is a constant, ongoing process throughout a football match, no matter who has the ball or where it is on the pitch. Collectively and individually, every player is moving into positions that adhere to a defensive structure, with the aim of conceding the fewest goals possible. Each player contributes by performing defensive actions, usually recorded as tackles or blocks. However, performing a tackle or block first requires the opposition to have the ball in a potentially dangerous area; or rather, it first requires you to allow the opposition to have the ball in a potentially dangerous area. More important, and less easy to quantify, are the actions and abilities that prevent a forward from getting the ball into dangerous areas in the first place.

It doesn’t seem a stretch to suggest that something better than blocking every shot on goal is preventing every shot from being taken in the first place.

When a forward has the ball, they will have an aim in mind for that possession; there will be a hierarchy of aims, ranging from scoring a goal down to simply retaining the ball. The defender will also have an aim in mind when the forward has the ball: their hierarchy is roughly the reverse of the forward’s, ranging from not conceding a goal up to winning the ball back. The immediate aim of both forward and defender will depend on factors such as location on the pitch, time of the game, game state, and how each player perceives the other's abilities.

For example, if the striker has the ball in the penalty area then their primary aim may be to take a shot to score a goal, whilst the defender’s primary aim may be to not concede a goal.

If the fullback has the ball in their own half, then their primary aim probably won’t be to score a goal straight away, but rather to progress the ball up the pitch, either through midfield or down the line to the winger. If those two options aren't available, they may need to drop down their hierarchy to maintaining possession and recycle the ball back to the goalkeeper or centre backs. In this case, the defender may be a striker or winger who has closed the ball down, and their primary aim here may be to prevent forward progression of the ball towards a more dangerous position.

Figure 1: Davies’ decision making options v Chelsea

These thought processes go back and forth between the players throughout a match; even when nowhere near the ball, these are things players need to consider, perhaps at a more minute level. Furthering the fullback and winger example above, the fullback’s aim is ball progression and the winger’s aim is to prevent it. If possible, the fullback would play the ball straight into the striker to progress it as far up the pitch as quickly as possible; collectively, the defence needs to negate that option. Maybe the defending centre back marks the striker tightly, with the defending central midfielder also blocking off any direct pass, just enough that the fullback doesn’t consider the pass to the striker a viable option.

Figure 2: Chelsea unable to prevent Davies from progressing the ball

If the defending team sufficiently prevents efficient progress into dangerous areas of the pitch, their job is made much easier. As we can see in Figures 1 and 2, Chelsea were unable to prevent ball progression; as a result, they were left defending a more dangerous situation and even resorted to tackling and blocking (!).

The decisions available to each player, defender or forward, aren’t limited to marking or blocking passing options, passing or shooting. Forwards may want to dribble past players, cross the ball from wide, or, off the ball, make runs into space to receive it. These decisions force defenders to react; how well they deal with the questions asked by the forwards depends on the abilities of the teams and players in question.

It would be interesting to look at the decision making of defenders and forwards in different situations by counting the frequency of each decision overall, and whether it depends on the opponent or the location on the pitch. A decision here for a forward would be a simple action such as attempting a shot, attempting a dribble, passing the ball up the line or retaining possession. A defensive decision would depend on the forward's decision, and it would be interesting to see whether players change their decisions significantly against certain opponents. It could be a way to measure the degree to which a defender can force a forward into uncomfortable positions and into unfavourable decisions, or decisions lower down the forward’s hierarchy of aims.

As always, any feedback or questions are welcome. These are primitive ideas and just looking to provoke thoughts of football analytics from a different perspective.

@TLMAnalytics

#6 Defensive Metrics [Optimal Positioning]

Featured

Even though the most tracked part of a football match is where the ball is, the most interesting things often happen off the ball. The ball is only in play for 50-60 minutes of a 90-minute match, and each individual player is on the ball for only a minimal part of that time. The majority of the game is played off the ball by all players, who must move about the pitch in relation to their teammates, the opposition and the ball. A forward can move off the ball to find space between defenders to receive a pass, whilst the defenders need to keep an eye on the forwards and track these attempts.

For a specific player, at a given point of the game, there are locations on the pitch that would be considered worse positions than others. For example, if the opposition had the ball on the edge of your penalty area, you would consider your central defender to be in a worse position standing by the opposition’s corner flag than marking the opposition’s forward. Since there is a concept of better or worse positions, there may be an optimal position for a specific player at a given point of the game. You could also think of it as: if you were to remove a single player from the game, where would you place them back such that you couldn’t move them anywhere better?

Several factors could affect the perception of a position at a given time being better or worse. These could be physical states of the game, such as locations of teammates, opposition or the ball. They could also be non-physical, such as score, the aim of the tactics or time on the clock. Considering where the ball is and who is in control of the ball will dictate the general area where your teammates and opposition will set up. Considering the tactics and formation that you and the opposition are looking to play will dictate the general areas of where individual players will set up.

Different tactical styles, scores and time remaining will affect what the aims of a team are. Some teams, such as Manchester City, want to control the largest surface area of the pitch possible. Control of the pitch can be determined by which team is likely to get possession of the ball if the ball was located in that area. Whilst other teams are aware that they can’t afford to try to control the largest surface area of the pitch, but rather look to control the areas of high interest such as around their own penalty area. Individual players need to position themselves with appropriate distances between each other to reflect their tactical style and goal. Certain distances between certain players would be better or worse than others, so again there must be an optimal distance with respect to tactical aim. When each player achieves this optimal distance, the collective team would appear to perform optimally.

A geometric way of viewing the areas of control on a pitch would be to look at Voronoi diagrams. 

Figure 1: Voronoi Diagram Red v Blue

If we look at Figure 1 with team red against team blue, the polygon surrounding each player corresponds to the area of the pitch that they control. If the ball happened to be within that boundary, they would most likely reach it first (assuming every player has the same speed). This concept has been around for a while and has been made practical by the technology available to football clubs; player tracking is everywhere nowadays and is crucial to understanding how your team is performing.
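A simple way to approximate the control areas in a diagram like Figure 1 is a grid sample: assign each point of the pitch to its nearest player and sum the cell areas. This is a sketch under stated assumptions (a 105 x 68 m pitch, equal player speeds, made-up positions), not a full Voronoi construction:

```python
# Grid approximation of Voronoi control areas: each sample point on the
# pitch belongs to the nearest player; summing cells gives area controlled.
import numpy as np

def control_areas(positions, pitch=(105.0, 68.0), step=0.5):
    """positions: {player: (x, y)} in metres. Returns approx. controlled area in m^2."""
    names = list(positions)
    pts = np.array([positions[n] for n in names])           # (P, 2) player coords
    xs = np.arange(step / 2, pitch[0], step)                # cell-centre x coords
    ys = np.arange(step / 2, pitch[1], step)                # cell-centre y coords
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)       # (G, 2) sample points
    d2 = ((grid[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)  # squared distances (G, P)
    nearest = d2.argmin(axis=1)                             # index of closest player
    cell = step * step
    return {n: float((nearest == i).sum()) * cell for i, n in enumerate(names)}

# Three illustrative players on one team; positions are invented.
areas = control_areas({"GK": (5, 34), "CB": (20, 34), "ST": (70, 34)})
print(areas)  # areas sum to ~105 * 68 = 7140 m^2; the striker controls the most space
```

A weighted version, as suggested later in the post, would simply multiply each cell by an importance weight (e.g. higher near the goals) before summing.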

Voronoi diagrams can be used at the team level to understand structure and how well a team transitions between situations, but they are also useful at the player level, as you can identify which players find the most space or which players are best at denying space.

In terms of quantifying better or worse positions for an individual player, the surface area of a player’s Voronoi region can be indicative of how well positioned they are. It is important to note that not all spaces on the football pitch are equal: controlling the areas closer to the goals is more beneficial than controlling the centre of the pitch. Perhaps a weighted surface area would be a better quantifier of control of the pitch, and would be another contributor to identifying optimal positioning.

@TLMAnalytics


Special mention to the following for their existing work on football geometry and Voronoi diagrams:
@Soccermatics –  https://medium.com/@Soccermatics/the-geometry-of-attacking-football-bee87e7a749
@UTVilla –  http://durtal.github.io/interactives/Football-Voronoi/

#5 Defensive Metric Concepts [Expected Shot Block]

Featured

There are emerging metrics in football such as Expected Goals, Key Passes, Progressive Passes and now even Expected Assists. These all measure single events in a football match and quantify their utility or effectiveness with an indicator or a probability. They are all measurements of how effective a player is at executing actions on the ball, particularly offensive actions such as shooting or creating shots from passes. They give us a better idea of which players and teams are most (or least) effective at offensive events: the higher your Expected Goals and the more Key Passes you make, the more goals your team is likely to score.

However, there aren’t similar metrics that measure defensive contribution. This may be because attributing goal-scoring contribution is comparatively objective: each goal has a single scorer, so allocating credit is easy. Allocating defensive contribution means quantifying the presence of a non-event; it is hard to measure how much effect a player or team has on the opposition not scoring. I think that if a sensible defensive metric of any kind can be constructed, it could be as useful as any of the offensive metrics above. I will use this series to brainstorm some ideas for such defensive metrics and how they could be computed.

Expected Shot Block:

Where better to start with defensive metrics than the clearest act of denying a goal: the shot block. This concept is a direct counterpart to, and is inspired by, Expected Goals.

For Expected Goals, each shot is given a probability of being a goal based on historical shots of a similar type. For example, a headed shot from a cross may be given 0.1 xG, whereas a shot from a counter attack inside the 6-yard box may be given 0.5 xG, suggesting the shot from the counter is 5 times more likely to go in than the headed effort. These may not be realistic numbers, but the concept stands: some shots are more likely to go in than others, depending on factors including where on the pitch the shot was taken, the play that led to it, what the shot was taken with, and the game state of the match.

The concept of the Expected Shot Block is to calculate the probability that a shot is blocked by a defender. This requires more information than the event data of the shot alone: it requires knowing whether a defender is present, and how likely a defender is to make a block in a similar situation, based on historically similar shots. The assessment is made at the point of contact of the shot. Based on past shots, you can categorise them into similar categories to those of Expected Goals, with the added factor of the presence of a defender or defenders between the ball and the goal. The ball location and the two goalposts form a triangle; if any defenders are inside this area, the shot is identified as having the potential to be blocked. The goalkeeper’s presence is expected, and since we are looking at shots being blocked rather than saved, the goalkeeper’s location can be acknowledged but isn't required for the calculation (though the goalkeeper’s position may alter the direction of the shot, and so affect blocking numbers).
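The ball-and-posts triangle test described above can be sketched with a standard point-in-triangle check using signed cross products. The coordinate system (goal line at x = 105, posts 7.32 m apart) and the example positions are illustrative assumptions:

```python
# Flag a shot as "potentially blockable" if any defender stands inside the
# triangle formed by the ball and the two goalposts.

def _side(p, a, b):
    """Cross product telling which side of the line a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def in_triangle(p, a, b, c):
    s1, s2, s3 = _side(p, a, b), _side(p, b, c), _side(p, c, a)
    # Inside (or on an edge) if all cross products share a sign.
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)

# Assumed pitch coordinates: goal line at x = 105, posts at y = 30.34 and 37.66.
POSTS = ((105.0, 30.34), (105.0, 37.66))

def potentially_blocked(ball, defenders):
    return any(in_triangle(d, ball, *POSTS) for d in defenders)

print(potentially_blocked((88, 34), [(95, 34)]))  # defender in the shooting lane -> True
print(potentially_blocked((88, 34), [(95, 20)]))  # defender well wide -> False
```

Given a historical sample, shots flagged this way could be grouped with the usual xG factors to estimate a block probability per category.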

Expected Shot Block - FM

Furthering the concept of an Expected Shot Block would be to calculate the percentage of the goal that is open to the shot at the point of contact. Once a shot is identified as potentially blockable, you can calculate the percentage of the goal available to shoot at where the defenders wouldn’t make a block. This calculation could be done in either 2D or 3D. You can assume an average area for the defender’s body and block out the portion of the goal the defender stands in front of. In 2D this is less accurate than in 3D, since it assumes the defender can block a shot of any height; in 3D, you could build silhouettes of the defender’s limbs and compute a more accurate percentage.
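A rough 2D version of this calculation: cast each defender's assumed body width onto the goal line from the ball's point of view, merge the resulting shadows, and report the unblocked share of the goal mouth. The 0.5 m effective body width and the coordinates are illustrative assumptions, not calibrated values:

```python
# 2D "percentage of goal open" sketch using similar triangles.
GOAL_X, POST_LO, POST_HI = 105.0, 30.34, 37.66  # assumed goal-line geometry (metres)
BODY_HALF_WIDTH = 0.25                           # assumed blocking half-width per defender

def open_goal_fraction(ball, defenders):
    bx, by = ball
    shadows = []
    for dx, dy in defenders:
        if not bx < dx < GOAL_X:
            continue                              # defender not between ball and goal line
        scale = (GOAL_X - bx) / (dx - bx)         # ray magnification onto the goal line
        centre = by + (dy - by) * scale           # where the ball->defender ray hits the line
        half = BODY_HALF_WIDTH * scale            # projected half-width of the body
        lo, hi = max(centre - half, POST_LO), min(centre + half, POST_HI)
        if lo < hi:
            shadows.append((lo, hi))
    covered, last = 0.0, POST_LO                  # merge overlapping shadow intervals
    for lo, hi in sorted(shadows):
        lo = max(lo, last)
        if hi > lo:
            covered += hi - lo
            last = hi
    return 1.0 - covered / (POST_HI - POST_LO)

# One central defender 5 m in front of the ball shadows ~14% of the goal.
print(round(open_goal_fraction((95, 34), [(100, 34)]), 2))  # 0.86
```

The 3D silhouette version described above would replace the 1D intervals with 2D shapes on the goal plane, but the shadow-and-merge idea is the same.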

There are some problems with this concept, but I think it has potential. You need more than just event data; you also need player location data, which is harder to get. It also assumes the shot is hit straight at goal, whereas many players bend and curl the ball around defenders. A defender being in the way doesn’t guarantee a blocked shot: plenty of shots go through a defender’s legs or just past a limb. Defenders, like all players, aren’t perfect.

It’s hard to quantify defensive actions, and shot blocking seemed the most relevant place to start since it relates to the current set of offensive metrics, though it’s not perfect. If anyone has thoughts or comments on issues I may have missed, please do let me know!

@TLMAnalytics

#4 Team Analysis: AS Monaco – Realistic Expectations

Featured

In this piece I will take a look at AS Monaco’s Ligue 1 performance to see why they are sitting in 19th after 13 games. Looking at their recent transfers, I’ll investigate what their expectations would have been compared to what their expectations should be.

This is a team that only two years ago won the Ligue 1 title, holding top spot for most of the season, with a squad full of emerging young talent. Players such as Kylian Mbappe, Bernardo Silva, Thomas Lemar, Benjamin Mendy, Tiemoue Bakayoko and Fabinho all contributed greatly alongside veterans such as Radamel Falcao and Joao Moutinho. This season, the only one of those players still at the club is Falcao; Monaco were a club that thrived off the talent of these rising stars and cashed in. Larger European clubs such as Paris Saint-Germain, Manchester City, Atletico Madrid, Chelsea and Liverpool came in and stripped Monaco of their title-winning team.

Since the system worked so well previously, it’s not hard to see why they have tried to replicate that success and reinvest in a new crop of youngsters to complement the likes of Falcao once again. They have brought in Youri Tielemans, Pietro Pellegri, Willem Geubbels, Benjamin Henrichs and Aleksandr Golovin, all under 22, with Pellegri and Geubbels both 16 when signed. The concept they are trying to repeat is to build the team around giving these promising youngsters lots of playing time, hoping to accelerate their development and maturity, and therefore prolong their careers at the highest level.

This transition couldn’t happen overnight, and these recruitment changes have taken two years. Last year Monaco managed a 2nd-place finish behind a resurgent PSG team, which is respectable considering they loaned Paris their best asset in Mbappe for the year and lost Bernardo Silva and Mendy to Manchester City. That’s a lot of attacking threat to lose; however, they managed to keep hold of Lemar, Fabinho and Moutinho. Fabinho and Moutinho are two competent central midfielders who can take control of any given game, giving Monaco the foundation to let their forwards do their thing. It could be suggested that losing Fabinho to Liverpool and Moutinho to Wolves in the summer before this season were the losses hardest to replace. Fabinho has proved himself worthy of a spot in a Jurgen Klopp midfield three, which is saying something, and Moutinho is part of the Portuguese midfield duo at a Wolves team already proving themselves a competent Premier League side.

Of the youngsters brought in over the last two years, only Youri Tielemans has shown the promise to replace either of them. However, Tielemans previously played a more offensive midfield role at Anderlecht and for Belgium; relying on a young player who is still getting used to controlling a game from deep may not be the best idea.

That brings us to Monaco’s current crisis: they sit 19th in Ligue 1 after 13 games and have just been thrashed 4-0 for the second time by PSG [who have won their opening 13 games and sit 13 points clear]. After 9 games, Leonardo Jardim, who was in charge of their title-winning season, left by mutual agreement and has been replaced by Thierry Henry in his first managerial role.

Monaco v PSG 11Nov18

A team that has finished 1st and 2nd in the previous two seasons shouldn’t be anywhere near the bottom of the league at this point of the season. They have underperformed their xG and xGA across the 13 games: they haven’t scored as many as they should have, and have conceded more than they should have. Though not by a huge margin, they have 12 goals from 16.31 xG and have conceded 22 from 17.47 xGA. If they regress to the mean, we can expect Monaco to perform better than the current standings suggest, but that is nowhere near challenging for the title. Based on previous seasons, the expectation would be to dominate most games by creating lots of chances and giving away few. Their goal difference of -10 against an xGD of -1.16 shows the gap between reality and expectation, but comparing that xGD to 2nd-placed Lille’s +7.28 [PSG’s xGD = 23.66] shows how far away from pre-season expectations they are.
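As a quick sanity check on those figures, the quoted totals fit together as follows (a throwaway sketch; the per-game rates mentioned below are derived the same way):

```python
# Monaco after 13 Ligue 1 games, using the Understat figures quoted above
games = 13
goals_for, goals_against = 12, 22
xg_for, xg_against = 16.31, 17.47

goal_diff = goals_for - goals_against              # -10
xg_diff = round(xg_for - xg_against, 2)            # -1.16

# Over/under-performance relative to expectation
attack_shortfall = round(goals_for - xg_for, 2)    # -4.31 goals below xG
defence_excess = round(goals_against - xg_against, 2)  # +4.53 goals above xGA

# Per-game rates
scored_per_game = round(goals_for / games, 2)      # 0.92
conceded_per_game = round(goals_against / games, 2)  # 1.69
```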

Except there doesn’t seem to be anything clearly wrong. Their defence is as leaky as suggested, conceding 1.69 goals per game. They aren’t creating enough good chances to score the goals to win games, scoring 0.92 goals per game. They have had 150 shots, creating 15 big chances, but have conceded 154 shots and 19 big chances so far. Most worrying is that they aren’t controlling games: they aren’t putting the opposition under pressure, and they seem to be playing many of the teams in the league on a level playing field. So far this season they are performing like an unlucky mid-table side, nowhere near their expectation of European qualification.

When I say there doesn’t seem to be anything clearly wrong, I mean that there doesn’t seem to be anything immediately fixable. It’s not the case that they can change just one thing and go back to being the title-challenging side they used to be, because they are literally a different team to that one: even if the expectation hasn’t changed, the players definitely have, and they are not as good as those who left. Unfortunately, it seems as though Monaco’s attempts to recruit a new group of young title winners, or at least challengers, haven’t worked so far, which isn’t surprising. It will take time for the players to get used to playing in a top 5 European league, playing with each other and handling the expectations all at once. They aren’t suddenly a bad team, just not what they were last year, and the expectations surrounding the team need to reflect that. It is also worth noting that they have had some serious injury concerns, which have forced them to play perhaps more youngsters than planned.

*credit to Understat and @Statszone for the numbers and figure

@TLMAnalytics

#26 Written Summary – Invincibles Defending with StatsBomb Events

Overview

Arsenal’s Invincibles season is unique to the Premier League: they are the only team that has gone a complete 38 game season without losing (as has been achieved in other leagues). They’re considered to be one of the best teams ever to play in the Premier League. Going a whole season without losing suggests they were at least decent at defending, reducing or completely removing the chance of bad luck ruining their perfect record. This post aims to look into how Arsenal’s defence managed this.

Using StatsBomb’s public event data for the Arsenal 03/04 season* (33 games), I take a look at where Arsenal’s defensive actions take place and how opponents attempted to progress the ball and create chances against them.

Event data records all on-ball actions from a match, which is more granular than high-level team and player totals. For offensive events such as passing and shooting, event data is great since they are usually on-ball actions. On-ball defensive actions such as tackles, interceptions and recoveries are also well captured by event data; these are the defensive events I will be using.

We can see where these events frequently occur on the pitch for Arsenal compared to their opposition. The key nuance is that just because these are the defensive events recorded doesn’t mean they are all of the defensive plays that take place on the pitch. The hard part about defensive analysis is that a large proportion of defending consists of non-events that won’t be captured by event data.

Event data still provides insight by combining the defensive events from Arsenal with the offensive events from Arsenal’s opponents. Arsenal’s defensive events show where their on-ball defending took place. Their opponents’ offensive events show how they approached attacking Arsenal; this may be the offensive team getting their way or Arsenal’s defence forcing opponents to play in a certain way. Given Arsenal’s success, the majority of the time it was probably the latter.

Please find the underlying code and methodology here: https://github.com/ciaran-grant/StatsBomb-Invincibles

Arsenal’s Defensive Events

The below plot is a combination of a 2D histogram grid and marginal distribution plots along each axis. We can see that the frequency of defensive actions is evenly spread left to right and heavily skewed towards their own half.

More specifically, the highest-action areas are in front of their own goal and out wide in the full back areas above the penalty area. Defensive actions in their own penalty area are expected, as it is the closest area to their goal and where crosses into the box are dealt with.

The full back areas seem to reflect more proactive defending, with actions made before the opponent gets closer to the byline. Passes and cutbacks from these areas close to the byline and penalty area usually generate high-quality shooting chances, so minimising the opponent’s ability to get there is valuable.

Figure 1: Marginal Distribution Plot of Arsenal’s Defensive Events
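A figure like this can be sketched with matplotlib’s `GridSpec`: a central 2D histogram with marginal histograms along each axis. The coordinates below are synthetic stand-ins for the event locations (the real plot uses the pitch-drawing helpers from the repo linked above):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

rng = np.random.default_rng(0)
# Synthetic stand-in for defensive event coordinates on a vertical pitch:
# x in [0, 80] across the pitch, y in [0, 120] towards the opponent's goal
x = rng.uniform(0, 80, 500)
y = rng.beta(2, 5, 500) * 120  # skewed towards the defending team's own half

fig = plt.figure(figsize=(5, 8))
gs = gridspec.GridSpec(4, 4, figure=fig)
ax_main = fig.add_subplot(gs[1:, :3])
ax_top = fig.add_subplot(gs[0, :3], sharex=ax_main)
ax_right = fig.add_subplot(gs[1:, 3], sharey=ax_main)

# 2D histogram grid plus marginal distributions along each axis
counts, _, _, _ = ax_main.hist2d(x, y, bins=[8, 12],
                                 range=[[0, 80], [0, 120]], cmap="Reds")
ax_top.hist(x, bins=8, range=(0, 80))
ax_right.hist(y, bins=12, range=(0, 120), orientation="horizontal")
plt.close(fig)
```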

The below density grid compares Arsenal’s defensive events relative to all defensive events in their games. Areas where Arsenal had more events than the overall baseline are in red, and fewer in blue. The darkest red areas are again the full back areas, suggesting that Arsenal’s full backs performed more on-ball defensive actions than their opponents’. They defended their own penalty area about as frequently as their opponents did, and defended less frequently in the opposition half.

Figure 2: 2D Histogram of Arsenal’s events relative to all defensive events
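The relative comparison behind Figure 2 can be sketched as the difference between two normalised 2D histograms (synthetic coordinates again; the notebook below wraps the same idea in a `plot_histogram_ratio_pitch` helper):

```python
import numpy as np

rng = np.random.default_rng(1)
bins, extent = [8, 12], [[0, 80], [0, 120]]

# Synthetic stand-ins: one team's defensive locations skewed to their own
# half, versus a uniform all-teams baseline
team_x, team_y = rng.uniform(0, 80, 400), rng.beta(2, 5, 400) * 120
all_x, all_y = rng.uniform(0, 80, 800), rng.uniform(0, 120, 800)

team_h, _, _ = np.histogram2d(team_x, team_y, bins=bins, range=extent)
all_h, _, _ = np.histogram2d(all_x, all_y, bins=bins, range=extent)

# Normalise each grid to proportions and take the difference: positive cells
# are where the team acts more often than the baseline (red), negative less (blue)
relative = team_h / team_h.sum() - all_h / all_h.sum()
```

Plotting `relative.T` with a diverging colormap (e.g. `plt.imshow(relative.T, cmap="RdBu_r", origin="lower")`) gives the red/blue comparison described above.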

Opponent’s Ball Progressions

Taking a look at opponents’ ball progressions shows the other side of the story. Do Arsenal’s full back areas see so many defensive events because Arsenal are ‘funneling’ their opponents there as a strength, or do Arsenal’s opponents target and exploit their full back areas?

These progressions have been grouped into approximate phases of play through the thirds and into the penalty area.

The progressions have been grouped into similar types by comparing KMeans and Agglomerative Clustering methods. Reassuringly, both methods produced a similar number of clusters, but KMeans appeared to perform better by grouping passes that were similar at both start and end locations. Further details can be found here: https://github.com/ciaran-grant/StatsBomb-Invincibles
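The comparison can be sketched as follows: each progression becomes a (start_x, start_y, end_x, end_y) vector, KMeans takes a fixed cluster count, and agglomerative clustering infers its count from a distance threshold (the data and parameter values here are synthetic placeholders, not the notebook’s):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(2)
# Each progression as (start_x, start_y, end_x, end_y); synthetic stand-ins
X = np.column_stack([
    rng.uniform(0, 80, 300),     # start_x
    rng.uniform(80, 120, 300),   # start_y (own third)
    rng.uniform(0, 80, 300),     # end_x
    rng.uniform(40, 80, 300),    # end_y (middle third)
])

# KMeans needs the number of clusters up front...
km_labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(X)
# ...whereas agglomerative clustering can derive it from a distance threshold
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=120)
agg_labels = agg.fit_predict(X)

n_km, n_agg = len(np.unique(km_labels)), len(np.unique(agg_labels))
```

Comparing `n_km` against `n_agg` (and inspecting the member passes of each cluster) is the kind of cross-check described above.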

Own Third

Figure 3: Clusters of opponent’s ball progressions from their own third

We find that the most frequent ball progressions from their own third are shorter progressions into the middle third. There are short progressions centrally in clusters 1 and 7, with wider progressions in clusters 9 and 6. Longer progressions were less frequent.

Shorter progressions are easier to complete and less risky, so it is not surprising that they are the most frequent. This says nothing about how high quality or sustainable these progressions are, but adds to the idea that most of the play happens out wide.

Middle Third

Figure 4: Clusters of opponent’s ball progressions from the middle third

Similar to the progressions from their own third, the most frequent progressions from the middle third are shorter down the wide areas in clusters 2 and 9, with some longer progressions in clusters 7 and 5. Clusters 8 and 4 suggest a number of progressions do make it into the centre of Arsenal’s own half.

Final Third

Figure 5: Clusters of opponent’s ball progressions from their final third

Again, there are lots of short progressions in the wide full back areas in the final third, seen in clusters 6 and 2. The two least frequent clusters appear to be deep crosses into the penalty area. These are the only consistent progressions into the penalty area, but usually create lower-quality chances through headers or contested shots.

Penalty Area

Figure 6: Clusters of opponent’s ball progressions into Arsenal’s penalty area

There are fewer groups here due to fewer progressions into the penalty area, which is expected. There are more progressions from each wide area, with a large proportion coming from much shorter progressions in clusters 2 and 1. It’s more difficult to successfully progress the ball long distances into the penalty area. The few clusters here are broadly grouped; intuitively, human analysts could create these groupings quickly, which suggests the clustering hasn’t helped much here.

Shot Assists

Figure 7: Clusters of opponent’s shot assists

When looking at only shot assists, many are received outside the penalty area, which suggests the resulting shots are likely from further out and of lower quality, absent further context. Due to the added restriction of requiring a shot at the end of these passes, there is likely more variance in them, making it harder to identify clear patterns.

Takeaways

Across ball progressions throughout the whole pitch, the most frequent were shorter and in the wide areas. This is expected, as shorter progressions are lower risk and wide areas are less important to defend, so they are areas the defending team is willing to concede.

The progressions from the middle third appear to be longer than those from their own third or the final third. Context is needed in each individual case, but this may be due to the lower cost of failure further away from your own goal, or to taking advantage of a short window of opportunity to quickly progress the ball longer distances into the final third.

When in the final third, the progressions shorten again. This is not due to lower risk, but likely due to a more densely populated area: the majority of the players on the pitch will be within a single half, and navigating through there requires precision and patience from the offensive team.

Any completed progression into the penalty area is a success for the offensive team. There is a high chance you will create a shot, and if you do it’s likely to be of higher quality than one from outside the penalty area. Though not all completed progressions into the penalty area are created equal. If it’s a completed carry into the penalty area then awesome: you likely have the ball under control and can get a pretty good shot or an extra pass. If it’s a completed pass then it depends how the player receives it; aerially or on the ground makes a difference to the shot quality. Aerial passes are harder to control and headers are of lower quality than shots with the feet; however, aerial passes are usually easier to complete into the penalty area. So: higher quantity, lower quality than ground passes or carries.

Shot assists rely on there being a shot at the end of them, so the reasoning is circular: they created a shot, so they are ‘good’ progressions, but they are only ‘good’ progressions because they created a shot. As we can see, they are much more scattered, which means it’s harder to understand without context why they created shooting opportunities; the locations alone don’t tell us much. Although the context is available within StatsBomb’s data, I haven’t taken a further look here.

When considering how these progressions relate to Arsenal’s defensive events, remember that the majority of their defensive events were performed out wide in the full back areas and in their own penalty area: out wide in the full back areas more than other teams, whilst within their own penalty area at around the same rate as other teams.

At each third of the pitch, the most frequent ball progressions were out wide, which places the ball frequently in the full back areas. Due to the nature of defensive events, the only events recorded are the on-ball actions defined above, including pressures, ball recoveries and tackles. The ball needs to be close to you for these actions to be performed and recorded as events, so opponents frequently progressing the ball out wide fits well with Arsenal performing defensive events in their defensive wide positions.

What this doesn’t tell us is whether these are causally linked or merely correlated. I would suggest there are more ball progressions made out wide than centrally across all of football, because defences are more willing to concede that space, so this doesn’t necessarily tell us much about Arsenal specifically. Though in Arsenal’s matches they do perform defensive events in their full back areas more than other teams, which may suggest there is something more than just correlation.

If it is a specified game plan to funnel the ball out wide and perform defensive events there, then Arsenal have done a great job of executing it. It’s a robust defensive plan if you can get it to work: the wider the ball is, the harder it is to score immediately from there. When defending, it’s often useful to use the edges of the pitch as an ‘extra’ defender, which makes it easier to overwhelm offensive players.

If you’ve made it this far, thanks for your attention. Please find the code in previous blog posts and at https://github.com/ciaran-grant/StatsBomb-Invincibles

@TLMAnalytics

#26 – Invincibles Defending with StatsBomb Events (v2.0)

**This is a second submission to the course below.**

Updates include:

  • Defined structure
  • Including multiple clustering methods (Agglomerative clustering)
  • More attention paid to parameter selection (number of clusters and threshold distance)

The below is a Jupyter Notebook and can be found here alongside respective Python modules: https://github.com/ciaran-grant/StatsBomb-Invincibles

This was an initial submission for the Udacity Data Scientist Nanodegree, with updates likely to come from further feedback. https://www.udacity.com/course/data-scientist-nanodegree--nd025

@TLMAnalytics

Udacity Capstone Project – Arsenal’s Invincibles’ Defence

Project Definition

Project Overview

Arsenal’s Invincibles season is unique to the Premier League: they are the only team that has gone a complete 38 game season without losing. They’re considered to be one of the best teams ever to play in the Premier League. Going a whole season without losing suggests they were at least decent at defending, reducing or completely removing the chance of bad luck ruining their perfect record. This project aims to look into how Arsenal’s defence managed this.

Using StatsBomb’s public event data for the Arsenal 03/04 season* (33 games), I take a look at where Arsenal’s defensive actions take place and how opponents attempted to progress the ball and create chances against them.

Problem Statement

The goal is to identify areas of Arsenal’s defensive strengths and the frequent approaches used by opponents. Tasks involved are as follows:

  1. Download and preprocess StatsBomb’s event data
  2. Explore and visualise Arsenal’s defensive actions
  3. Explore and visualise Opponent’s ball progression by thirds
  4. Cluster and evaluate Opponent’s ball progressions
  5. Cluster and evaluate Opponent’s shot creations

Metrics

A major problem when using many clustering algorithms is identifying how many clusters exist in the data, since they require that number as an input parameter. Sometimes expert judgement can provide a good estimate. However, some clustering algorithms, such as agglomerative clustering, instead take other inputs, such as a distance threshold, to determine the clusters.

The following metrics are used to determine the number of clusters for k-means clustering, and the distance threshold for agglomerative clustering. They consider the density of points within clusters and between clusters.

  1. Sum of squares within cluster:

Calculated using the inertia_ attribute of the KMeans class, which computes the sum of squared distances of samples to their closest cluster centre. Lower scores indicate denser clusters.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

  2. Silhouette Coefficient:

Calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. The best value is 1 and the worst value is -1.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

  3. Calinski-Harabasz Index:

Also known as the Variance Ratio Criterion, defined as the ratio of between-cluster dispersion to within-cluster dispersion. Higher scores signal clusters that are dense and well separated.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html

  4. Davies-Bouldin Index:

The average similarity of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. The minimum score is 0, with lower values indicating better clustering.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html
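The four metrics above can be computed in one sweep over candidate values of k, using the scikit-learn calls linked above (synthetic data here; the notebook’s ClusterEval module wraps the same idea):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # stand-in for the progression start/end vectors

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "inertia": km.inertia_,                                       # 1. lower = denser
        "silhouette": silhouette_score(X, km.labels_),                # 2. best 1, worst -1
        "calinski_harabasz": calinski_harabasz_score(X, km.labels_),  # 3. higher = better
        "davies_bouldin": davies_bouldin_score(X, km.labels_),        # 4. lower = better
    }
```

Plotting each metric against k (or against the distance threshold for agglomerative clustering) and looking for elbows or extrema is the selection procedure described above.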

Analysis

Import Libraries
In [1]:
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

from CustomPitch import createVerticalPitch
from StatsBombPrep import ball_progression_events_into_thirds
from StatsBombViz import plot_sb_events, plot_sb_event_location, plot_sb_events_clusters, plot_individual_cluster_events
from StatsBombViz import plot_sb_event_grid_density_pitch, plot_histogram_ratio_pitch
from ClusterEval import kmeans_cluster, agglomerative_cluster, cluster_colour_map, cluster_evaluation, plot_cluster_evaluation
Load StatsBomb Data

StatsBomb collect event data for lots of football matches and have made a selection of matches (including all FAWSL) freely available so that amateur projects can take place. All they ask is that you sign up to their StatsBomb Public Data User Agreement here: https://statsbomb.com/academy/. The data itself is accessed via their GitHub here: https://github.com/statsbomb/open-data

Their data is stored in json files, with the event data for each match identifiable by its match id. The relevant match ids can be found in the matches.json file, and the relevant season and competition codes in the competitions.json file.

In [2]:
# Load competitions to find correct season and league codes
with open('open-data-master/data/competitions.json') as f:
    competitions = json.load(f)
    
competitions_df = pd.DataFrame(competitions)
competitions_df[competitions_df['competition_name'] == 'Premier League']

# Arsenal Invincibles Season
with open('open-data-master/data/matches/2/44.json', encoding='utf-8') as f:
    matches = json.load(f)

# Find match ids
matches_df = json_normalize(matches, sep="_")
match_id_list = matches_df['match_id']

# Load events for match ids
arsenal_events = pd.DataFrame()
for match_id in match_id_list:
    with open('open-data-master/data/events/'+str(match_id)+'.json', encoding='utf-8') as f:
        events = json.load(f)
    events_df = json_normalize(events, sep="_")
    events_df['match_id'] = match_id
    events_df = events_df.merge(matches_df[['match_id', 'away_team_away_team_name', 'away_score', 
                                            'home_team_home_team_name', 'home_score']],
                                on='match_id', how='left')    
    arsenal_events = pd.concat([arsenal_events, events_df])  # DataFrame.append is deprecated

print('Number of matches: '+ str(len(match_id_list)))
Number of matches: 33

Data Preprocessing

Since we are working with event data, there is access to the pitch location in x, y coordinates of each recorded event. These are recorded as per their Pitch Coordinates found in the documentation: https://github.com/statsbomb/open-data/blob/master/doc/Open%20Data%20Events%20v4.0.0.pdf

The coordinates are mapped to a horizontal pitch with the origin (0, 0) in the top left corner and (120, 80) in the bottom right. Since I am interested in defensive analysis from Arsenal’s point of view, I thought it would be easier to interpret if these were converted to a vertical pitch with Arsenal defending from the bottom and the opposition attacking downwards.

The location tuples for start, pass_end and carry_end are separated into x and y components. They are all rotated to fit the vertical pitch, and then I create a universal end location for all progression events.

In [3]:
# Separate locations into x, y
arsenal_events[['location_x', 'location_y']] = arsenal_events['location'].apply(pd.Series)
arsenal_events[['pass_end_location_x', 'pass_end_location_y']] = arsenal_events['pass_end_location'].apply(pd.Series)
arsenal_events[['carry_end_location_x', 'carry_end_location_y']] = arsenal_events['carry_end_location'].apply(pd.Series)

# Create vertical locations
arsenal_events['vertical_location_x'] = 80 - arsenal_events['location_y']
arsenal_events['vertical_location_y'] = arsenal_events['location_x']
arsenal_events['vertical_pass_end_location_x'] = 80 - arsenal_events['pass_end_location_y']
arsenal_events['vertical_pass_end_location_y'] = arsenal_events['pass_end_location_x']
arsenal_events['vertical_carry_end_location_x'] = 80 - arsenal_events['carry_end_location_y']
arsenal_events['vertical_carry_end_location_y'] = arsenal_events['carry_end_location_x']

# Create universal end locations for event type
arsenal_events['end_location_x'] = np.where(arsenal_events['type_name'] == 'Pass',
                                            arsenal_events['pass_end_location_x'],
                                            np.where(arsenal_events['type_name'] == 'Carry',
                                                     arsenal_events['carry_end_location_x'],
                                                     np.nan))
arsenal_events['end_location_y'] = np.where(arsenal_events['type_name'] == 'Pass', 
                                            arsenal_events['pass_end_location_y'],
                                            np.where(arsenal_events['type_name'] == 'Carry', 
                                                     arsenal_events['carry_end_location_y'],
                                                     np.nan))
arsenal_events['vertical_end_location_x'] = np.where(arsenal_events['type_name'] == 'Pass',
                                                     arsenal_events['vertical_pass_end_location_x'],
                                                     np.where(arsenal_events['type_name'] == 'Carry', 
                                                              arsenal_events['vertical_carry_end_location_x'],
                                                              np.nan))
arsenal_events['vertical_end_location_y'] = np.where(arsenal_events['type_name'] == 'Pass', 
                                                     arsenal_events['vertical_pass_end_location_y'],
                                                     np.where(arsenal_events['type_name'] == 'Carry', 
                                                              arsenal_events['vertical_carry_end_location_y'],
                                                              np.nan))

As we are only interested in defensive events, we need to define a list of them. The full list of available events is located in the events documentation above.

I have defined all defensive events to use throughout below and filtered all events for those. Due to the nature of event data, these are all on-ball defensive actions. Often defensive plays are off-ball and explicitly deny future events from taking place, so these events may only highlight the end point of a potential sequence of ‘invisible’ defensive plays.

In [4]:
defensive_actions_list = ['Clearance', 'Pressure', 'Duel', 'Ball Recovery',
                          'Foul Committed', 'Block', 'Interception', '50/50']
defensive_actions = arsenal_events[arsenal_events['type_name'].isin(defensive_actions_list)]
arsenal_defensive_actions = arsenal_events[(arsenal_events['team_name'] == 'Arsenal') & 
                                           (arsenal_events['type_name'].isin(defensive_actions_list))]
opponents_defensive_actions = arsenal_events[(arsenal_events['team_name'] != 'Arsenal') & 
                                             (arsenal_events['type_name'].isin(defensive_actions_list))]

Since event data cannot tell the full defensive story, taking a look at the opposition’s offensive plays may help.

Firstly, I remove all set pieces in favour of keeping only open play progressions. This is because defending open play and defending set pieces require different approaches; I will be focusing on open play progressions here.

We also need to define which event types we class as ball progressions. Here I have defined passes, carries and dribbles. The difference between a Carry and a Dribble is that a Dribble is ‘an attempt by a player to beat an opponent’, whilst a Carry is ‘a player controls the ball at their feet while moving or standing still’. In the data this distinction is clear, since a Carry has a start and end point whilst a Dribble starts and ends in the same place.

The defined list of ball progression columns is aspirational: it includes fields I think would be useful to consider next time, but which I didn’t get to look further into.

As mentioned above, when using the vertical pitch it is useful to have the opponents moving in the opposite direction to the defending team (Arsenal), which makes the visualisations more interpretable.

I have separated the ball progressions into thirds of the pitch, since different approaches may be taken in different areas. More caution would be expected closer to your own goal than in the opponent’s third.

Narrowing down towards goal-threatening progressions, I’ve separated out passes and carries into the penalty area, as well as shot assists.

Finally, dribbles are considered separately since they don’t have a separate end location. The focus is on progressions via passes and carries, since they have quantifiable start-to-end progression up the pitch.

In [5]:
# Remove Set Pieces
open_play_patterns = ['Regular Play', 'From Counter', 'From Keeper']
arsenal_open_play = arsenal_events[arsenal_events['play_pattern_name'].isin(open_play_patterns)]

# Define Opponents Ball Progression
event_types = ['Pass', 'Carry', 'Dribble']
ball_progression_cols = [
    'id', 'player_name', 'period', 'possession', 'duration', 'type_name', 
    'possession_team_name', 'team_name', 'play_pattern_name', 
    'vertical_location_x', 'vertical_location_y', 'vertical_end_location_x', 'vertical_end_location_y',
    'pass_length', 'pass_angle', 'pass_height_name', 
    'pass_body_part_name', 'pass_type_name', 'pass_outcome_name', 
    'ball_receipt_outcome_name', 'pass_switch', 'pass_technique_name',
    'pass_cross', 'pass_through_ball', 
    'pass_shot_assist', 'shot_statsbomb_xg', 'pass_goal_assist', 'pass_cut_back', 'under_pressure'
    ]

# Filter Opponents and Ball Progression Columns
opponent_ball_progressions = arsenal_open_play[(arsenal_open_play['team_name'] != 'Arsenal') & 
                                               (arsenal_open_play['type_name'].isin(event_types))]
opponent_ball_progressions = opponent_ball_progressions[ball_progression_cols]

# Reverse locations for opponents
opponent_ball_progressions['vertical_location_x'] = 80-opponent_ball_progressions['vertical_location_x']
opponent_ball_progressions['vertical_location_y'] = 120-opponent_ball_progressions['vertical_location_y']
opponent_ball_progressions['vertical_end_location_x'] = 80-opponent_ball_progressions['vertical_end_location_x']
opponent_ball_progressions['vertical_end_location_y'] = 120-opponent_ball_progressions['vertical_end_location_y']

# Separate events into thirds, penalty area
from_own_third_opp, from_mid_third_opp, from_final_third_opp, into_pen_area_opp = \
    ball_progression_events_into_thirds(opponent_ball_progressions)

# Filter shot assists
opponent_shot_assists = opponent_ball_progressions[opponent_ball_progressions['pass_shot_assist'] == True]

# Filter dribbles
opponent_dribbles = opponent_ball_progressions[opponent_ball_progressions['type_name'] == 'Dribble']
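The real `ball_progression_events_into_thirds` lives in the repo’s StatsBombPrep module; a minimal sketch of what it might do, with the third boundaries and penalty-area coordinates as assumptions on the 80x120 vertical pitch above (opponents attacking towards y = 0), is:

```python
import pandas as pd

def ball_progression_events_into_thirds(df):
    # On the vertical pitch the opponents attack towards y = 0 (Arsenal's goal),
    # so their own third is y >= 80 and their final third is y < 40.
    y0, y1 = df["vertical_location_y"], df["vertical_end_location_y"]
    progressing = y1 < y0  # the event moves the ball towards Arsenal's goal

    from_own_third = df[(y0 >= 80) & progressing]
    from_mid_third = df[(y0 >= 40) & (y0 < 80) & progressing]
    from_final_third = df[(y0 < 40) & progressing]

    # Penalty area assumed to span x in [18, 62], y <= 18 on the 80x120 pitch
    x1 = df["vertical_end_location_x"]
    into_pen_area = df[x1.between(18, 62) & (y1 <= 18) & (y0 > 18)]
    return from_own_third, from_mid_third, from_final_third, into_pen_area

# Tiny demonstration with two hand-made progressions
demo = pd.DataFrame({
    "vertical_location_x": [40, 40],
    "vertical_location_y": [100, 30],
    "vertical_end_location_x": [40, 40],
    "vertical_end_location_y": [60, 10],
})
own, mid, final, pen = ball_progression_events_into_thirds(demo)
```

The return order matches the unpacking in the cell above; the exact filtering logic in the repo may differ.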

Data Exploration

This is an incredibly sparse dataset: there are lots of events and lots of categorical columns providing much-needed context. Most columns aren’t applicable to most events, so there are lots of missing or uninteresting values.

In [6]:
arsenal_defensive_actions.isnull().sum().sort_values(ascending=False)
Out[6]:
vertical_end_location_y     10251
goalkeeper_end_location     10251
goalkeeper_type_name        10251
goalkeeper_position_id      10251
goalkeeper_position_name    10251
                            ...  
home_team_home_team_name        0
away_score                      0
away_team_away_team_name        0
match_id                        0
id                              0
Length: 161, dtype: int64

The most frequent defensive action is a Pressure. Pressures don’t always result in a turnover, whereas a Ball Recovery or Interception does.

In [7]:
arsenal_defensive_actions.groupby('type_name').count()['id']
Out[7]:
type_name
50/50               20
Ball Recovery     2020
Block              573
Clearance         1096
Duel              1338
Foul Committed     528
Interception       258
Pressure          4418
Name: id, dtype: int64

There are lots more Carries and Passes than Dribbles, but I expected there to be more Passes relative to Carries than there are.

In [8]:
opponent_ball_progressions.groupby('type_name').count()['id']
Out[8]:
type_name
Carry      5352
Dribble     327
Pass       6398
Name: id, dtype: int64

There are lots more progressions via Passes than Carries in the opposition’s own third and middle third. This is likely due to the reduced risk of a forward pass compared to a carry. If you lose the ball due to a misplaced pass, the ball is likely higher up the field and the passer is closer to their own goal as an extra defender. If you lose the ball due to being tackled, then you likely lose the ball from where you are and are chasing back to catch up.

In [9]:
from_own_third_opp.groupby('type_name').count()['id']
Out[9]:
type_name
Carry    148
Pass     833
Name: id, dtype: int64
In [10]:
from_mid_third_opp.groupby('type_name').count()['id']
Out[10]:
type_name
Carry    172
Pass     676
Name: id, dtype: int64

In the final third, there is a much more even split between Carries and Passes, perhaps due to forward passing becoming much harder further up the pitch and the reduced risk of Carries when you are far away from your own goal.

In [11]:
from_final_third_opp.groupby('type_name').count()['id']
Out[11]:
type_name
Carry    488
Pass     531
Name: id, dtype: int64

Although Carries are good at getting near the penalty area, Passes are still the main way into the penalty area.

In [12]:
into_pen_area_opp.groupby('type_name').count()['id']
Out[12]:
type_name
Carry     85
Pass     351
Name: id, dtype: int64

And, since shot assists are a subset of passes, there are no Carries here.

In [13]:
opponent_shot_assists.groupby('type_name').count()['id']
Out[13]:
type_name
Pass    81
Name: id, dtype: int64

Data Visualisation

Looking at all defensive events from Arsenal across the season leads to an overplotting mess. It’s pretty hard to tell anything from this.

In [14]:
plot_sb_event_location(arsenal_defensive_actions, alpha = 0.3)
Out[14]:
(<Figure size 360x720 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x252f5c342e8>)

Looking at all of the opponent’s ball progressions is arguably even worse.

In [15]:
plot_sb_events(opponent_ball_progressions, alpha = 0.3)
Out[15]:
(<Figure size 360x720 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x2528fd6d978>)

Due to the lower volume, we can actually see that there are more dribbles out wide than centrally.

In [16]:
plot_sb_event_location(opponent_dribbles)
Out[16]:
(<Figure size 360x720 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x25298c82be0>)

Methodology

Event data records all on-ball actions from a match and is more granular than high-level team and player totals. As well as the event types, the event locations and extra information for each event type are recorded to provide context where possible. Event data is great for offensive events such as passing and shooting, since they are usually on-ball actions. There are also on-ball defensive actions, such as tackles, interceptions and recoveries, which are well captured by event data; these are the defensive events I will be using. With these, we can see where defensive events frequently occur on the pitch for Arsenal compared to their opposition. The key nuance is that just because these are the defensive events recorded doesn't mean they are all of the defensive plays that take place on the pitch.
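As a rough illustration of the filtering involved, a defensive-event subset can be pulled out of a StatsBomb-style events DataFrame as below. This is my own sketch: the column names follow the open-data convention, but `DEFENSIVE_TYPES` is an assumed list, not necessarily the exact set used in this analysis.

```python
import pandas as pd

# Assumed set of on-ball defensive event types; adjust to your data.
DEFENSIVE_TYPES = ["Duel", "Interception", "Ball Recovery", "Block", "Clearance"]

def defensive_events(events: pd.DataFrame, team: str) -> pd.DataFrame:
    """Return the on-ball defensive events performed by `team`."""
    mask = (events["team_name"] == team) & events["type_name"].isin(DEFENSIVE_TYPES)
    return events.loc[mask].copy()
```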

The hard part about defensive analysis is that a large proportion of defending consists of non-events, which won't be captured by event data. If an opportunity to pass the ball to the striker is removed by a defender blocking the passing lane, that pass never happens and is never recorded. The effective defensive play was denying the offensive team a potential event.

Using event data to evaluate defences can be done with a combination of Arsenal's defensive events and their opponents' offensive actions. Arsenal's defensive events show where their on-ball defending took place. Their opponents' offensive events show how teams approached attacking Arsenal; this may be the offensive team getting their way, or Arsenal's defence forcing opponents to play in a certain way. Given Arsenal's success, the majority of the time it's the latter.

To look at Arsenal’s defensive events, I plot the defensive events in a gridded 2D histogram across the pitch with marginal density plots along each axis to highlight areas of high activity. The same is done for Arsenal’s opponents and the differences can be highlighted using the ratio.
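The plotting helpers used below are the author's own; as a minimal sketch of the same idea, a gridded 2D histogram with marginal histograms can be built with plain matplotlib. Pitch dimensions of 80 x 120 (width x length, vertical pitch) are assumed here.

```python
import matplotlib.pyplot as plt

def event_grid_density(x, y, bins=(8, 12)):
    """2D histogram of event locations with marginal histograms on each axis."""
    fig = plt.figure(figsize=(5, 8))
    gs = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                          wspace=0.05, hspace=0.05)
    ax = fig.add_subplot(gs[1, 0])                    # main pitch grid
    ax_histx = fig.add_subplot(gs[0, 0], sharex=ax)   # marginal across the width
    ax_histy = fig.add_subplot(gs[1, 1], sharey=ax)   # marginal along the length
    ax.hist2d(x, y, bins=bins, range=[[0, 80], [0, 120]], cmap="Reds")
    ax_histx.hist(x, bins=bins[0], range=(0, 80), color="darkred")
    ax_histy.hist(y, bins=bins[1], range=(0, 120), color="darkred",
                  orientation="horizontal")
    return fig, ax, ax_histx, ax_histy
```

The ratio plot follows the same layout, dividing one team's binned counts by the overall counts bin by bin.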

To look at Arsenal’s opponents events, I specifically look at ball progression including passes, carries and dribbles. These are separated into similar locations on the pitch since different approaches will be required:

  1. From their own third forwards
  2. From the middle third forwards
  3. From the final third forwards
  4. Into the penalty area
  5. Shot assists
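As a sketch of how the location splits above might be implemented, assuming the conventional StatsBomb orientation with x running 0-120 towards the opponent's goal (the plots in this post use a rotated vertical pitch, but the idea is the same):

```python
def starting_third(x_start: float, pitch_length: float = 120) -> str:
    """Bucket a progression by the third of the pitch it starts from."""
    if x_start < pitch_length / 3:
        return "own third"
    elif x_start < 2 * pitch_length / 3:
        return "middle third"
    return "final third"
```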

These categories of ball progressions are clustered using K-Means and Agglomerative Clustering on the start and end locations to group similar ball progressions for each area of the pitch. The assumed number of clusters is decided using expert judgement and four evaluation measures:

  1. Sum of squared variance within cluster
  2. Silhouette Coefficient
  3. Calinski-Harabasz Index
  4. Davies-Bouldin Index
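All four measures are available in scikit-learn (KMeans' `inertia_` serves as the within-cluster sum of squares). A minimal sketch of the evaluation loop, assuming each progression is represented by its start and end coordinates `(x0, y0, x1, y1)` — the `cluster_evaluation` helper used later presumably does something similar:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def evaluate_kmeans(X: np.ndarray, max_clusters: int = 10) -> dict:
    """Score KMeans fits across a range of cluster counts."""
    scores = {}
    for k in range(2, max_clusters + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        scores[k] = {
            "inertia": model.inertia_,  # within-cluster sum of squares
            "silhouette": silhouette_score(X, model.labels_),
            "calinski_harabasz": calinski_harabasz_score(X, model.labels_),
            "davies_bouldin": davies_bouldin_score(X, model.labels_),
        }
    return scores
```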

These clusters will represent the approaches opponents used to try to attack and create chances against Arsenal. No concrete conclusions can be drawn from this alone, but the clusters will give a better understanding.

Implementation

Defensive Events

In [17]:
# Arsenal Defensive Events Density Grid
fig, ax, ax_histx, ax_histy = plot_sb_event_grid_density_pitch(arsenal_defensive_actions)

# Title
fig.suptitle("Arsenal Defensive Actions",
             x=0.5, y=0.92,
            fontsize = 'xx-large', fontweight='bold',
            color = 'darkred')

# Direction of play arrow
ax.annotate("",
            xy=(10, 30), xycoords='data',
            xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10,
                            headwidth=20,
                            headlength=20,
                            edgecolor='black',
                            facecolor = 'darkred',
                            connectionstyle="arc3")
            )
Out[17]:
Text(10, 10, '')

Rather than the overplotting mess we saw above, this plot combines a 2D histogram grid with marginal density plots along each axis. We can see that the frequency of defensive actions is spread evenly left to right and skewed heavily towards Arsenal's own half.

More specifically, the highest-activity areas are in front of their own goal and out wide in the full back areas above the penalty area. Defensive actions in their own penalty area are expected, as it is the area closest to goal and where crosses into the box are dealt with.

The full backs appear to be more proactive, making defensive actions before the opponent gets close to the byline. Passes and cutbacks from the areas near the byline and penalty area usually generate high-quality shooting chances, so minimising the opponent's ability to get there is valuable.

In [18]:
# All Defensive Events Density Grid
fig, ax, ax_histx, ax_histy = plot_sb_event_grid_density_pitch(defensive_actions,
                                                               grid_colour_map = 'Purples',
                                                               bar_colour = 'purple')

# Title
fig.suptitle("All Defensive Actions",
             x=0.5, y=0.92,
            fontsize = 'xx-large', fontweight='bold',
            color = 'purple')

# Direction of play arrow
ax.annotate("",
            xy=(10, 30), xycoords='data',
            xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10,
                            headwidth=20,
                            headlength=20,
                            edgecolor='black',
                            facecolor = 'darkviolet',
                            connectionstyle="arc3")
            )
Out[18]:
Text(10, 10, '')

Here we have all defensive events across the matches that Arsenal were in, so Arsenal will be a major contributor to these frequencies. It’s interesting to see the events still largely take place in their own penalty area, but less so in the full back areas.

In [19]:
# Relative Grid Density
fig, ax = plot_histogram_ratio_pitch(arsenal_defensive_actions, defensive_actions)

fig.suptitle("Arsenal's Relative Defensive Actions",
             x=0.5, y=0.87,
            fontsize = 'xx-large', fontweight='bold',
            color = 'darkred')
ax.annotate("",
            xy=(10, 30), xycoords='data',
            xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10,
                            headwidth=20,
                            headlength=20,
                            edgecolor='black',
                            facecolor = 'darkred',
                            connectionstyle="arc3")
            )
fig.tight_layout()

Whilst seeing all defensive events gave some insight into what everyone was doing overall, we can also look at the relative difference between Arsenal's defensive events and the overall picture. This density grid shows where Arsenal had more events than overall in red and fewer in blue.

The darkest red areas are again the full back areas, suggesting that Arsenal's full backs performed more on-ball defensive actions than their opponents. They defended their penalty area about as frequently as their opponents did, and acted less frequently in their opponents' half.

Refinement

By taking a look at opponents' ball progressions, we can see the opponents' point of view. Do Arsenal's full back areas see so many defensive events because Arsenal are 'funnelling' opponents there as a perceived strength, or do Arsenal's opponents target and exploit the full back areas?

Ball Progression – Own Third

In [20]:
# From Own Third - Ball Progressions
fig, ax = plot_sb_events(from_own_third_opp, alpha = 0.5, figsize = (5, 10), pitch_theme='dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - From Own Third",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
# Dotted red line for own third
ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.7, linewidth=2, linestyle='--', zorder=5)
plt.tight_layout()

Even looking only at ball progressions from their own third, it's still hard to identify any trends or patterns of play. There are lots of longer passes, though I have only included progressions of at least 10 yards.
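The 10-yard cut-off amounts to a simple distance filter on start and end coordinates. A hypothetical version, with column names assumed rather than taken from the actual notebook:

```python
import numpy as np
import pandas as pd

def filter_min_progression(events: pd.DataFrame, min_distance: float = 10) -> pd.DataFrame:
    """Keep only events that move the ball at least `min_distance` units."""
    dist = np.hypot(events["end_x"] - events["x"], events["end_y"] - events["y"])
    return events.loc[dist >= min_distance].copy()
```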

K-Means
In [21]:
# Create cluster evaluation dataframe for up to 50 clusters
own_third_clusters_kmeans = cluster_evaluation(from_own_third_opp, method='kmeans', max_clusters=50)

# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - From Own Third"
fig, axs = plot_cluster_evaluation(own_third_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[21]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - From Own Third')
  1. Sum of Squares (Lower) – More clusters gives a lower sum of squares.
  2. Silhouette Coefficient (Higher) – Fewer clusters gives a higher coefficient, especially under 5, with a drop around 15.
  3. Calinski-Harabasz Index (Higher) – Fewer clusters gives a higher index.
  4. Davies-Bouldin Index (Lower) – Minimum just less than 10.

Just less than 10 clusters seems appropriate.
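A plausible shape for the `kmeans_cluster` helper used below, assuming progressions are encoded as `(start_x, start_y, end_x, end_y)` rows — this is my sketch, not the author's exact function:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cluster_sketch(X: np.ndarray, n_clusters: int, seed: int = 1000) -> np.ndarray:
    """Label each progression with its KMeans cluster."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
```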

In [53]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_own_third = kmeans_cluster(from_own_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_own_third))))
# Plot each cluster
title = "KMeans Ball Progression Clusters - From Own Third"
fig, axs = plot_individual_cluster_events(3, 3, from_own_third_opp, cluster_labels_own_third, sample_size=10)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters

We find that the most frequent cluster types for progressing the ball from their own third are shorter progressions into the middle third. There are short progressions centrally in clusters 10 and 17, with wider progressions in clusters 4 and 7.

Shorter progressions are easier to complete and less risky, so it's not surprising that they are the most frequent. That says nothing about the quality or sustainability of these progressions, though.

Agglomerative Clustering
In [23]:
# Create cluster evaluation dataframe for up to 300 distance threshold
own_third_clusters_agglo = cluster_evaluation(from_own_third_opp, method='agglomerative', min_distance=10, max_distance=500)

# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - From Own Third"
fig, axs = plot_cluster_evaluation(own_third_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[23]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - From Own Third')
  1. Sum of Squares (Lower) – Lower distance gives lower sum of squares, flat from 200 and sharp increase at 350.
  2. Silhouette Coefficient (Higher) – Really low distance or peaks again at around 250.
  3. Calinski-Harabasz Index (Higher) – Higher distance gives higher index.
  4. Davies-Bouldin Index (Lower) – Really low distance or dips again at 350.

250 is chosen as it appears to be the peak for the Silhouette Coefficient, with the other metrics trading off.
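A possible shape for the `agglomerative_cluster` helper used below, again assuming `(start_x, start_y, end_x, end_y)` rows. The non-obvious part is that setting `n_clusters=None` in scikit-learn lets the distance threshold decide the cluster count:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def agglomerative_cluster_sketch(X: np.ndarray, distance_threshold: float) -> np.ndarray:
    """Cluster progressions, letting the distance threshold decide the count."""
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=distance_threshold,
                                    linkage="ward")
    return model.fit_predict(X)
```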

In [48]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_own_third = agglomerative_cluster(from_own_third_opp, 250)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_own_third))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - From Own Third"
fig, axs = plot_individual_cluster_events(2, 3, from_own_third_opp, cluster_labels_agglo_own_third, figsize=(10, 10))
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 7 clusters

There are fewer clusters than with the KMeans method, and they seem to be weighted more towards the end location: clusters 1, 5 and 6 all end just inside the middle third in the centre, on the left and on the right respectively, whilst the remaining clusters all end closer to the opposite half.

Ball Progression – Middle Third

In [25]:
# From Mid Third - Ball Progressions
fig, ax = plot_sb_events(from_mid_third_opp, alpha = 0.5, figsize = (5, 10), pitch_theme='dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - From Mid Third",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
# Dotted red lines for middle third
ax.hlines(y=[40, 80], xmin=0, xmax=80, color='red', alpha=0.7, linewidth=2, linestyle='--', zorder=5)
fig.tight_layout()

Ball progressions from the middle third seem to have trajectories that are denser out wide, so the ball rarely travels through the middle of the pitch into Arsenal's own third. This may be another indicator that the ball is being directed outside, likely because lots of players are located centrally.

KMeans
In [26]:
# Create cluster evaluation dataframe for up to 50 clusters
mid_third_clusters_kmeans = cluster_evaluation(from_mid_third_opp, method='kmeans', max_clusters=50)

# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - From Mid Third"
fig, axs = plot_cluster_evaluation(mid_third_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[26]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - From Mid Third')
  1. Sum of Squares (Lower) – The more clusters, the lower the error.
  2. Silhouette Coefficient (Higher) – After settling, the peak seems to be just less than 10.
  3. Calinski-Harabasz Index (Higher) – The fewer clusters, the higher the index.
  4. Davies-Bouldin Index (Lower) – After settling, more clusters give a lower index, with a drop around 10.

To balance 1 and 3 whilst satisfying both 2 and 4, 9 clusters seems appropriate.

In [27]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_mid_third = kmeans_cluster(from_mid_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_mid_third))))
# Clustered Ball Progressions - From Mid Third
fig, axs = plot_individual_cluster_events(3, 3, from_mid_third_opp, cluster_labels_mid_third)
# Title
fig.suptitle("KMeans Ball Progression Clusters - From Middle Third",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for middle third across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters

Similar to the progressions from their own third, the most frequent progressions from the middle third are shorter down the wide areas in clusters 2 and 9, with some longer progressions in clusters 7 and 5. Clusters 8 and 4 suggest a number of progressions do make it into the centre of Arsenal’s own half.

Agglomerative
In [28]:
# Create cluster evaluation dataframe for up to 300 distance threshold
own_mid_clusters_agglo = cluster_evaluation(from_mid_third_opp, method='agglomerative', min_distance=10, max_distance=500)
print("There are {} passes.".format(from_mid_third_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - From Mid Third"
fig, axs = plot_cluster_evaluation(own_mid_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 848 passes.
Out[28]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - From Mid Third')
  1. Sum of Squares (Lower) – Lower distance means lower error, flatten at 200 and jumps at 300.
  2. Silhouette Coefficient (Higher) – Higher distance gives higher coefficient, spikes at 300+.
  3. Calinski-Harabasz Index (Higher) – Higher distance gives higher index, spikes at 300+.
  4. Davies-Bouldin Index (Lower) – Bottom of the curve seems to be around 200.

300+ is likely not producing sensible results, with near-constant error across the board. 200 seems reasonable everywhere.

In [29]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_mid_third = agglomerative_cluster(from_mid_third_opp, 200)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_mid_third))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - From Mid Third"
fig, axs = plot_individual_cluster_events(2, 4, from_mid_third_opp, cluster_labels_agglo_mid_third, figsize=(10,10), sample_size=10)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for middle third across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 8 clusters

There is a similar number of clusters found across both methods. Agglomerative clustering again seems to weight the end locations more heavily than the start locations, with lots of shorter progressions just into the final third, especially in wider areas such as clusters 4 and 3.

It has also found clusters entering the half spaces just outside the Arsenal penalty area, in clusters 1 and 2, which is interesting. These 8 clusters seem to form a 2 row x 4 column grid of end locations.

Ball Progression – Final Third

In [30]:
# From Final Third - Ball Progressions
fig, ax = plot_sb_events(from_final_third_opp, alpha = 0.5, figsize = (5, 10), pitch_theme = 'dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - From Final Third",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
# Dotted red line for final third
ax.hlines(y=40, xmin=0, xmax=80, color='red', alpha=0.7, linewidth=2, linestyle='--', zorder=5)
fig.tight_layout()

Once in the final third, the progressions are less vertically direct. It's hard to tell here, but there look to be lots of shorter progressions along with longer, direct progressions into the box, likely crosses.

KMeans
In [31]:
# Create cluster evaluation dataframe for up to 50 clusters
final_third_clusters_kmeans = cluster_evaluation(from_final_third_opp, method='kmeans', max_clusters=50)
print("There are {} passes.".format(from_final_third_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - From Final Third"
fig, axs = plot_cluster_evaluation(final_third_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 1019 passes.
Out[31]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - From Final Third')
  1. Sum of Squares (Lower) – More clusters gives lower error.
  2. Silhouette Coefficient (Higher) – Fewer clusters gives a higher coefficient.
  3. Calinski-Harabasz Index (Higher) – Fewer clusters gives a higher index.
  4. Davies-Bouldin Index (Lower) – The drop just before 10 looks optimal.

Just before 10 is the only clear signal, so I have chosen 9 clusters.

In [32]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_final_third = kmeans_cluster(from_final_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_final_third))))
# Clustered Ball Progressions - From Final Third
fig, axs = plot_individual_cluster_events(3, 3, from_final_third_opp, cluster_labels_final_third, sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - From Final Third",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines marking the thirds across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters

There are lots of short progressions in the wide full back areas of the final third. The two least frequent clusters appear to be deep crosses into the penalty area; these are the only consistent progressions into the penalty area.

Agglomerative
In [33]:
# Create cluster evaluation dataframe for up to 300 distance threshold
own_final_clusters_agglo = cluster_evaluation(from_final_third_opp, method='agglomerative', min_distance=10, max_distance=500)
print("There are {} passes.".format(from_final_third_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - From Final Third"
fig, axs = plot_cluster_evaluation(own_final_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 1019 passes.
Out[33]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - From Final Third')
  1. Sum of Squares (Lower) – Lower distance gives lower error, spikes at 250.
  2. Silhouette Coefficient (Higher) – Higher distance gives higher coefficient, spikes at 250.
  3. Calinski-Harabasz Index (Higher) – Higher distance gives higher index, spikes at 250.
  4. Davies-Bouldin Index (Lower) – Optimal looks around 150-250

250 seems appropriate across all.

In [47]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_final_third = agglomerative_cluster(from_final_third_opp, 250)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_final_third))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - From Final Third"
fig, axs = plot_individual_cluster_events(2, 4, from_final_third_opp, cluster_labels_agglo_final_third, figsize=(10,8), sample_size=10)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red line for final third across all axes
for ax in axs.flatten():
    ax.hlines(y=40, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 7 clusters

Again a similar number of clusters is selected, with lots of progressions short and in wide areas. The least frequent again appear to be deep, wide crosses into the penalty area, and are the only consistent penalty area entries.

Ball Progression – Penalty Area

In [35]:
# Into Penalty Area - Ball Progressions
fig, ax = plot_sb_events(into_pen_area_opp, alpha = 0.7, figsize = (5, 10), pitch_theme = 'dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - Into Penalty Area",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
fig.tight_layout()

The majority of progressions into the penalty area come from out wide; however, some are incredibly direct, even starting in their own half.

KMeans
In [36]:
# Create cluster evaluation dataframe for up to 50 clusters
pen_area_clusters_kmeans = cluster_evaluation(into_pen_area_opp, method='kmeans', max_clusters=50)
print("There are {} passes.".format(into_pen_area_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - Penalty Area"
fig, axs = plot_cluster_evaluation(pen_area_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 436 passes.
Out[36]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - Penalty Area')
  1. Sum of Squares (Lower) – The more clusters, the lower the error.
  2. Silhouette Coefficient (Higher) – Very high coefficient below 5 clusters.
  3. Calinski-Harabasz Index (Higher) – The fewer clusters, the higher the index.
  4. Davies-Bouldin Index (Lower) – Very low index below 5 clusters.

The peak Silhouette Coefficient matches the minimum Davies-Bouldin Index at 4 clusters.

In [37]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_pen_area = kmeans_cluster(into_pen_area_opp, 4)
print("There are {} clusters".format(len(np.unique(cluster_labels_pen_area))))
# Plot individual clusters
fig, axs = plot_individual_cluster_events(2, 2, into_pen_area_opp, cluster_labels_pen_area, sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - Into Penalty Area",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
There are 4 clusters
Out[37]:
Text(0.5, 1.01, 'KMeans Ball Progression Clusters - Into Penalty Area')

A smaller number of clusters is selected here, grouping progressions into broader groups. These seem to be split into longer and shorter, and from each wide area. This doesn't really help us.

Agglomerative
In [38]:
# Create cluster evaluation dataframe for up to 300 distance threshold
pen_area_clusters_agglo = cluster_evaluation(into_pen_area_opp, method='agglomerative', min_distance=10, max_distance=500)
print("There are {} passes.".format(into_pen_area_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - Penalty Area"
fig, axs = plot_cluster_evaluation(pen_area_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 436 passes.
Out[38]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - Penalty Area')
  1. Sum of Squares (Lower) – Lower distance gives lower error.
  2. Silhouette Coefficient (Higher) – Spike at 200 to max.
  3. Calinski-Harabasz Index (Higher) – Spike at 250 to max.
  4. Davies-Bouldin Index (Lower) – Lowest near 0 and around 250.

250 seems appropriate across the board.

In [39]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_pen_area = agglomerative_cluster(into_pen_area_opp, 250)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_pen_area))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - Penalty Area"
fig, axs = plot_individual_cluster_events(2, 2, into_pen_area_opp, cluster_labels_agglo_pen_area, sample_size=10)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
There are 4 clusters
Out[39]:
Text(0.5, 1.01, 'Agglomerative Ball Progression Clusters - Penalty Area')

An appropriate distance again gives 4 clusters, which groups progressions very broadly into short or long and from left or right.

Ball Progression – Shot Assists

In [40]:
# Shot Assist
fig, ax = plot_sb_events(opponent_shot_assists, pitch_theme = 'dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - Shot Assists",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
fig.tight_layout()

Shot assists seem to be largely random; the criterion of having a shot attached to the end of the pass is incredibly strict and contextual to the other players.

KMeans
In [41]:
# Create cluster evaluation dataframe for up to 50 clusters
shot_assist_clusters_kmeans = cluster_evaluation(opponent_shot_assists, method='kmeans', max_clusters=50)
print("There are {} passes.".format(opponent_shot_assists.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - Shot Assists"
fig, axs = plot_cluster_evaluation(shot_assist_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 81 passes.
Out[41]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - Shot Assists')
  1. Sum of Squares (Lower) – More clusters gives lower error.
  2. Silhouette Coefficient (Higher) – Peak is at 4.
  3. Calinski-Harabasz Index (Higher) – Peak is at 4.
  4. Davies-Bouldin Index (Lower) – More clusters gives a lower index, with a sharp drop around 5.

4 seems optimal for both the Silhouette Coefficient and the Calinski-Harabasz Index.

In [42]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_shot_assist = kmeans_cluster(opponent_shot_assists, 4)
print("There are {} clusters".format(len(np.unique(cluster_labels_shot_assist))))
# Clustered Ball Progressions - Shot Assists 
fig, axs = plot_individual_cluster_events(2, 2, opponent_shot_assists, cluster_labels_shot_assist, figsize=(8,10),sample_size=10)
# Title
fig.suptitle("Ball Progression Clusters - Shot Assists",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
There are 4 clusters
Out[42]:
Text(0.5, 1.01, 'Ball Progression Clusters - Shot Assists')

When looking only at shot assists, there are lots from the left full back area as well as passes through the middle. Both are areas where completing a pass towards goal gives a higher chance of generating a shot. However, shot assists that do not penetrate the penalty area suggest those shots are likely taken from further out and are of lower quality.

Agglomerative
In [43]:
# Create cluster evaluation dataframe for up to 300 distance threshold
shot_assist_clusters_agglo = cluster_evaluation(opponent_shot_assists, method='agglomerative', min_distance=10, max_distance=200)
# 200 max_distance chosen here as lower distance between all samples and few samples.

# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - Shot Assists"
fig, axs = plot_cluster_evaluation(shot_assist_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[43]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - Shot Assists')
  1. Sum of Squares (Lower) – Lower distance gives lower error.
  2. Silhouette Coefficient (Higher) – Peak coefficient around 100.
  3. Calinski-Harabasz Index (Higher) – Peak index at 75-125, apart from near 0.
  4. Davies-Bouldin Index (Lower) – Dip in the index at 125.

125 is chosen as appropriate across all metrics.

In [44]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_shot_assist = agglomerative_cluster(opponent_shot_assists, 125)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_shot_assist))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - Shot Assist"
fig, axs = plot_individual_cluster_events(2, 3, opponent_shot_assists, cluster_labels_agglo_shot_assist, figsize=(10,10),sample_size=4)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
There are 5 clusters
Out[44]:
Text(0.5, 1.01, 'Agglomerative Ball Progression Clusters - Shot Assist')

The small number of clusters makes sense given the fewer progressions. Some clusters have end locations far from the penalty area, such as clusters 1 and 4; both are central and require context to understand how players found shooting opportunities from there. They are likely counter attacks with few defenders between the ball and the goal.

Results

Both k-means and agglomerative clustering produced a similar number of clusters after evaluation, and reasonable groupings on visual inspection.

The k-means clusters appear to align trajectories of progression from start to end locations fairly well; the samples in the visualisations show consistent start and end locations within each cluster. Unlike the agglomerative clusters, the groupings don’t appear to be dominated by end location alone.

The agglomerative clusters appeared to be more heavily weighted towards the end locations of progressions: many of the clusters were effectively grids based on where the progression ended. The aim of using clustering was to find patterns that weren’t readily available to a human analyst, so in this sense the agglomerative clustering didn’t add anything extra.

There is no indication of the quality of the shots produced, and no context to compare against the wider league. What is promising is that nothing significant stands out: the common breaches of the Arsenal defence are simply good progressions when they come off. It’s not clear how many more of these types of opportunities were prevented by defensive plays off the ball.

Justification

Opponent’s Ball Progressions

Across ball progressions throughout the whole pitch, the most frequent were shorter and in the wide areas. This is expected, as shorter progressions are lower risk and wide areas are less important to defend, so they are space the defending team is willing to concede.

Progressions from the middle third appear to be longer than those from a team’s own third or the final third. Context is needed in each individual case, but this may be due to a lower risk of failure further away from your own goal, or to taking advantage of a short window of opportunity to quickly progress the ball longer distances into the final third.

When in the final third, progressions shorten again. This is likely not due to lower risk but to a more densely populated area: the majority of the players on the pitch will be within a single half, and navigating through it requires precision and patience from the attacking team.

Any completed progression into the penalty area is a success for the attacking team. There is a high chance you will create a shot, and if you do it’s likely to be a higher quality shot than from outside the penalty area. Not all completed progressions into the penalty area are created equal, though. A completed carry into the penalty area is ideal: you likely have the ball under control and can get a good shot or extra pass away. A completed pass depends on how the player receives it; aerially or on the ground makes a difference to the shot quality. Aerial passes are harder to control and headers are of lower quality than shots with the feet, but aerial passes are usually easier to complete into the penalty area. So: higher quantity, lower quality than ground passes or carries.

Shot assists rely on there being a shot at the end of them, so the reasoning is circular: they created a shot so they are ‘good’ progressions, but they are also ‘good’ progressions because they created a shot. As we can see, they are much more scattered, which makes it harder to understand from the locations alone why they created shooting opportunities. That context is available within StatsBomb’s data, but I haven’t taken a further look here.

Arsenal’s Defensive Events

When considering how these progressions affect Arsenal’s defensive events, remember that the majority of their defensive events were performed out wide in the full back areas and in their own penalty area: out wide in the full back areas more than other teams, while defensive events within their own penalty area were around the same as other teams.

At each third of the pitch, the most frequent ball progressions were out wide, which places the ball frequently in the full back areas. By the nature of event data, the only defensive events recorded are the defined on-ball actions, including pressures, ball recoveries, tackles etc. The ball needs to be close to a player for these actions to be performed and recorded, so opponents frequently progressing the ball out wide fits well with Arsenal performing defensive events in their wide defensive positions.

What this doesn’t tell us is whether these are causally linked or merely correlated. I would suggest there are more ball progressions made out wide than centrally across all of football, because the defence is more willing to concede that space, so this doesn’t necessarily tell us much about Arsenal specifically. However, Arsenal do perform defensive events in their full back areas more than other teams, which may suggest something more than correlation.

If it is a specified game plan to funnel the ball out wide and perform defensive events there, then Arsenal have done a great job of executing it. It’s a robust defensive plan if you can make it work: the wider the ball is, the harder it is to score immediately from there. When defending, it’s often useful to use the edges of the pitch as an ‘extra’ defender, which makes it easier to overwhelm attacking players.

Conclusion

Reflection

I set out to try to understand why Arsenal’s defence worked so well during their unbeaten season. Using StatsBomb’s event data for the majority of the season, I analysed where Arsenal’s defensive events were performed and how that compared to their opponents. This could only tell part of the story, since defensive events only cover on-ball actions; it is accepted that defending involves much more than on-ball actions, so further analysis was needed. I analysed how their opponents progressed the ball up the pitch from each third and how they created shots, clustering the locations to identify the most frequent types of progression. Considering both sides, it’s clear that much of Arsenal’s defensive work happens in their full back areas and that their opponents try to progress the ball down the flanks. What is not clear is whether Arsenal are causing this via off-the-ball actions or whether it’s just coincidence.

I found it hard to directly answer the question I set out with; that’s likely because the question was too broad. However, I am definitely further along the road than when I started, and it has been interesting working through these problems.

It was great working with event level data and trying to find interesting ways to communicate visualisations, hopefully the plots combined with the football pitches work well to add to understanding.

Improvement

Next time I would definitely pose a narrower, more specific question.

Another improvement would be access to tracking data for these matches. Defensive analysis places much more emphasis on off-ball events and the distances between all the players on the pitch; tracking data would provide the locations of all players and the ball at all times. This would provide much greater detail, but would also be much more complicated to work with.

WordPress conversion from # 26 – Arsenal’s Invincibles Defence (Submission 2).ipynb by nb2wp v0.3.1

# 26 – Invincibles Defending with StatsBomb Events

The below is a Jupyter Notebook and can be found here alongside respective Python modules: https://github.com/ciaran-grant/StatsBomb-Invincibles

This was an initial submission for the Udacity Data Scientist Nanodegree, updates likely to come from further feedback. https://www.udacity.com/course/data-scientist-nanodegree--nd025

@TLMAnalytics

Udacity Capstone Project – Arsenal’s Invincibles’ Defence

Arsenal’s Invincibles season is unique in the Premier League: they are the only team to have gone a complete 38 game season without losing, and they’re considered one of the best teams ever to play in it. Going a whole season without losing suggests they were at least decent at defending, reducing or removing the chance of bad luck ruining their perfect record. This project aims to look into how Arsenal’s defence managed this.

Using StatsBomb’s public event data for the Arsenal 03/04 season* (33 games), I take a look at where Arsenal’s defensive actions take place and how opponents attempted to progress the ball and create chances against them.

The goal is to identify areas of Arsenal’s defensive strengths and the frequent approaches used by opponents. Tasks involved are as follows:

  1. Download and preprocess StatsBomb’s event data
  2. Explore and visualise Arsenal’s defensive actions
  3. Explore and visualise Opponent’s ball progression by thirds
  4. Cluster and evaluate Opponent’s ball progressions
  5. Cluster and evaluate Opponent’s shot creations

A major problem when using many clustering algorithms is identifying how many clusters exist in the data, since they require that number as an input parameter. Sometimes expert judgement can provide a good estimate. Other algorithms, such as agglomerative clustering, instead take inputs like a distance threshold that determines the clusters.

The following metrics are used to determine the number of clusters for k-means, and the distance threshold for agglomerative clustering. They consider the density of points within clusters and between clusters.

  1. Sum of squares within cluster:

Calculated using the inertia_ attribute of the KMeans class: the sum of squared distances of samples to their closest cluster center.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

  2. Silhouette Coefficient:

Calculated using the mean intra-cluster distance and mean nearest-cluster distance for each sample. Best value is 1 and worst value is -1.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

  3. Calinski-Harabasz Index:

Known as the Variance Ratio Criterion, defined as the ratio of between-cluster dispersion to within-cluster dispersion. Higher scores signal clusters that are dense and well separated.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html

  4. Davies-Bouldin Index:

Average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Minimum score is 0, lower values for better clusters.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html
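The four metrics above can all be computed directly with scikit-learn’s documented APIs. A small sketch on synthetic blob data (the blobs stand in for the event locations used in the post):

```python
# Computing the four evaluation metrics for a single k-means fit,
# using scikit-learn's documented APIs on synthetic blob data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_

sil = silhouette_score(X, labels)        # -1 (worst) to 1 (best)
ch = calinski_harabasz_score(X, labels)  # higher is better
db = davies_bouldin_score(X, labels)     # 0 is best, lower is better

print("Sum of squares (inertia_):", round(km.inertia_, 1))
print("Silhouette:", round(sil, 3))
print("Calinski-Harabasz:", round(ch, 1))
print("Davies-Bouldin:", round(db, 3))
```

In practice you would compute these for a range of candidate cluster counts (or distance thresholds) and look for agreement across the metrics, as the evaluation plots earlier in the post do.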

Analysis

Import Libraries

In [1]:
import json
from pandas import json_normalize  # moved out of pandas.io.json in pandas 1.0
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

from CustomPitch import createVerticalPitch
from StatsBombPrep import