#25 – Why Liverpool’s Expected Goal Conundrum Makes Sense

As the new 2020/21 Premier League season is about to get underway, a big question is whether Liverpool can replicate their utter dominance. They won the league by 18 points, and it never really looked like anything else was going to happen. Not even Pep Guardiola’s Manchester City, who achieved 100 and 98 points in the previous two years, could come close; they couldn’t maintain that pace for a third season. Liverpool have just had consecutive 97 and 99 point seasons and will hope to replicate that form in the upcoming third season.

It has been noted that despite winning the league in record time, Liverpool’s expected goals and goal difference were still not as good as Manchester City’s. This suggests that Manchester City were in fact the better team over the course of the season and that Liverpool were merely lucky to win the league by such a margin.

Using shot distance, shot times and expected goals from www.fbref.com, I’ve approximated expected goals per shot for both Liverpool and Manchester City’s 2019/20 Premier League seasons. Each match’s expected goals total has been apportioned across that match’s shots using shot distance. Per-shot information allows expected goals and minutes to be aggregated by gamestate.
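The exact weighting isn’t spelled out above, so here is a minimal sketch of the apportioning step in R, assuming each shot’s share of the match total is proportional to 1 / distance. The data frame, the match total and the apportion_xg helper are all made up for illustration.

```r
# Hypothetical shots for one team in one match (distance in yards).
shots <- data.frame(
  minute   = c(12, 34, 58, 77),
  distance = c(8, 20, 12, 25)
)
match_xg <- 1.6  # made-up match-level expected goals total

# Assumption: closer shots take a bigger share, weighted by 1 / distance.
apportion_xg <- function(distance, total_xg) {
  weights <- 1 / distance
  total_xg * weights / sum(weights)
}

shots$xg <- apportion_xg(shots$distance, match_xg)
print(shots)
```

Whatever weighting is chosen, the per-shot values still add back up to the match total, so season-level expected goals are unchanged.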

Gamestate is an important factor in contextualising football matches. Stronger teams usually spend more time winning a game than weaker teams. Teams that are winning are no longer under obligation to push forward as much, with the losing team responsible for trying to get back into the game.
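As a rough sketch of how minutes per gamestate could be counted for a single match, assuming a simple log of goal minutes and ignoring stoppage time (the goals data frame and the gamestate_minutes helper are hypothetical):

```r
# Hypothetical goal log for one match, from one team's point of view.
goals <- data.frame(
  minute         = c(20, 55, 70),
  scored_by_team = c(TRUE, FALSE, TRUE)
)

gamestate_minutes <- function(goals, match_length = 90) {
  goals <- goals[order(goals$minute), ]
  # Goal difference during each spell: 0 until the first goal,
  # then a running total that moves by +1 or -1 with each goal.
  state <- c(0, cumsum(ifelse(goals$scored_by_team, 1, -1)))
  # Each spell runs from one goal (or kick-off) to the next goal (or full time).
  spell_length <- diff(c(0, goals$minute, match_length))
  tapply(spell_length, state, sum)
}

mins <- gamestate_minutes(goals)
print(mins)
```

Here the team is level for 35 minutes and one goal ahead for 55. Summing these per-match results over a season gives the minutes played in each gamestate.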

Across both the 97 and 99 point Liverpool seasons in 2018/19 and 2019/20, Liverpool achieved the majority of their expected goals in winning gamestates. Notably, in the recent 2019/20 season they actually achieved more expected goals when winning by a single goal than when drawing.

Whereas for Manchester City, there is a clear distinction between their 98 point 2018/19 season and the recent 2019/20 season. They earned a much larger proportion of their expected goals at a neutral gamestate than in their title-winning season; they are clearly creating the chances but perhaps have been wasteful. It’s also noticeable that they created lots of their expected goals when already 3+ ahead, and lots more when losing than in the previous year.

Now let’s take a look at how long each team has spent in each gamestate across the season.

Much like the expected goals charts, the proportion of minutes played by each club suggests a clear difference in the approach each team needed to adopt. Liverpool spent a much larger proportion of their time winning by 1 goal, whilst Manchester City spent more time losing and winning by 3+.

They both spent a similar proportion of time at neutral gamestates, though Manchester City’s expected goals at this gamestate were much higher. This suggests City required more chances and shots to go ahead in a game, whereas Liverpool went ahead more efficiently and spent more time ahead at +1.

As mentioned earlier, your approach can change once you are winning. You no longer need to force anything or take risks; the responsibility to equalise or reduce the deficit is on the opponents, so they are the ones who need to take risks. Playing with no risks allows for a higher floor in performance, and no doubt being in winning positions so much helped Liverpool maintain their momentum throughout the season. When you aren’t winning, you are required to create chances and shoot more, which in turn helps build up your expected goals numbers.

Manchester City built up a lot of expected goals whilst at a neutral gamestate, when they were losing and when they were 3+ ahead. Goals scored at neutral gamestates convert into points most easily, and here Liverpool were more efficient than Manchester City. When losing, you need to create shots (and rack up expected goals numbers) to get back into the game, but you’re only losing because you didn’t score the first goal. When you’re 3+ ahead, these shots and expected goals likely won’t change the points return from the match. Manchester City spent lots of time either needing to score goals in losing or neutral gamestates or absolutely crushing teams, with little in between, which perhaps explains their ridiculous expected goals numbers.

Liverpool spent little time and few expected goals getting from neutral to +1 gamestates, meaning they spent less time with the responsibility to take risks. They spent little time losing and lots of time ahead at +1, with little time spent at 3+. They got ahead early and then not much else happened in the game, which is a pretty good strategy to win. They’re deserving champions, and this perhaps explains why their expected goals aren’t as bonkers as Manchester City’s.

@TLMAnalytics

#24 Premier League Points History

It’s finally nearing the end of the current season, so I wanted to have a look at past league points totals to get some context for how this season is shaping up.

To do so, I’ve taken league tables from www.fbref.com for the last 24 seasons of 38-game Premier League football. A Jupyter notebook used to get the league tables and create the plots is in my GitHub here: https://github.com/ciaran-grant/premier_league_points

This post looks to investigate the following three questions:

  • How do points totals recently compare to early Premier League seasons?
  • What has happened to the gap between the champions and relegation survivors?
  • How many points do you tend to need to qualify for Europe recently?

History of Premier League Points

It looks like the top 6 have been getting more points recently at the expense of the bottom half. There is only a limited number of points available across all teams, so the more points the big boys get, the fewer are available for the lower teams.

Does this mean there is an increasing gap between the top teams and the rest? We’ve gone through several incarnations with a top 4, then top 6 and now arguably a top 2 with Liverpool and Manchester City.

Champions and Relegation Survival

For simplicity, relegation survival has been calculated as one more point than the points achieved by 18th place, to avoid getting too picky about goal difference.
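A minimal sketch of that calculation in R, using made-up points totals for two seasons rather than the real tables (the tables data frame and its column names are assumptions, not the notebook’s actual structure):

```r
# Hypothetical final points by league position for two seasons (not real data).
tables <- data.frame(
  season   = rep(c("A", "B"), each = 4),
  position = rep(c(1, 17, 18, 19), times = 2),
  points   = c(95, 38, 34, 30, 88, 40, 39, 33),
  stringsAsFactors = FALSE
)

# Survival threshold as defined above: one more point than 18th place.
eighteenth <- tables[tables$position == 18, ]
survival   <- setNames(eighteenth$points + 1, eighteenth$season)

# Gap between the champions and the survival threshold in each season.
champions <- tables[tables$position == 1, ]
gap <- setNames(champions$points, champions$season) - survival[champions$season]

print(survival)
print(gap)
```

Adding one point slightly overstates the requirement in seasons where 17th and 18th were level on points and separated by goal difference, which is the simplification being accepted here.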

Only 4 times out of the 24 seasons has a team required 40 points to survive, whilst it seems around 35 will usually be enough to be safe. Of course you want to aim for more points, but this seems to bust the myth of 40 points required for survival, often less is sufficient.

The points required to become champions have increased, meaning the gap between champions and relegation survival has grown in recent years. 2016/17, 17/18 and 18/19 have produced 3 of the highest points totals ever, with only Chelsea’s 95 in 04/05 in amongst them.

Could just be recency bias, but Liverpool were on track for 100+ this year. They’re likely to actually get 95+, and both Liverpool and City are expected to get close to 90+ again next year. They’re making 90 points seem normal, and it is becoming the minimum needed to win now.

How about qualifying for Europe?

European Qualification

It’s not only the champions who have seen points inflation. In line with the whole top 6 sweeping up more points, it looks like more points are required to qualify for both European competitions as well.

The last decade has seen a higher average points requirement for getting both Champions League and Europa League qualifications, with 70 points for Champions League and 60 points for Europa League.

Current Season Context

Champions Liverpool are on 93 points with 2 games left and will likely break 95. This would make the last 3 seasons the highest 3 totals ever by a Premier League champion. They could still win the league this year with fewer points than they managed last year (97) and be among the 4 highest points scoring teams ever.

Champions League qualification looks like coming up short of the 70 points mark. Chelsea, Leicester and Manchester United are all comfortably in Europe, sitting above 60 points, but it is all to play for with 2 games left. It will be a low bar for both European competitions this year.

As for relegation survival, Bournemouth and Aston Villa both sit on 31 points with 2 games left. If things stay as they are, the 32 points required for survival would be the lowest since 09/10, when 31 points were needed. That year 40 points would’ve been good enough to finish 14th; this year 40 points looks good for 15th.

Final Thoughts

This season seems to be getting stretched by how ridiculously good Liverpool have been for 90% of it, before subconsciously or otherwise taking their foot off the gas once the title was wrapped up. Apart from the prime Messi/Ronaldo Barcelona and Real Madrid teams, I wouldn’t expect many teams to even get above 90 points, let alone dream of nearing 100.

Are these points totals also consistent across other 38 game seasons in France, Italy and Spain? How does this change when you consider Germany’s 34 game season or the 46 game seasons in the Football League?

@TLMAnalytics

#21 A View of the Suspended 2019/20 Season by Shots

Here’s a quick look back to see where the suspended seasons were left. I’m going to take a look at how efficient teams have been at converting shots into shots on target whilst simultaneously limiting shots on target against. How many shots a team takes is the most basic indicator of attacking output. You don’t shoot, you don’t score after all. Shots on target goes a step further to add a qualitative element to the shots. You don’t shoot on target, you don’t score.

If shots are indicative of how likely a team is to score and shots on target are even better, then we can also turn that around defensively. To win a game you need to score more goals than the opponent, which means that as well as trying to score yourself, you need to prevent the opposition from scoring. Reducing the shots conceded likely reduces the chances to concede a goal, and reducing the shots on target conceded does even better.

During a game, teams will try to maximise their own chances of scoring and reduce the opposition’s chances of scoring at the same time. A measure that captures both aspects of this sufficiently is known as the total shots ratio, with a respective total shots on target ratio as well. For a specific game, the total shot ratio is calculated for each respective team below:

Total Shot Ratio = Shots by Team / Total Shots of both Teams

Since every shot one team takes is a shot conceded by the other team, the sum of ratios for each team will always be 1. This also means that a total shots conceded ratio can be calculated, which turns out to be equal to the total shot ratio for the opposition in a single game.

This measure considers the proportion of shots you take against the total shots in a game. This means that if both teams have lots of shots, then the match is more likely to be equal. Whereas if one team takes lots of shots AND stops the opponent from taking lots of shots then that must be better, which is reflected here.

I’ve taken the shots and shots on target information for each match so far in each league to calculate the total shot ratio and total shots on target ratio for each match. Although not every team has played each other just yet, an average of these per-game ratios was the easiest way to represent how each team has performed so far. Teams with high total shots on target ratios are likely to be the stronger teams in the league. Teams with higher total shots on target ratios than total shots ratios appear to be more efficient in terms of creating shots on target for themselves and limiting their opponents to just shots. Teams with lower total shots on target ratios than total shots ratios, on the other hand, seem to favour quantity over quality when it comes to shot selection, or are susceptible to conceding lots of shots on target.

Below are the results for each league. They seem to be pretty good approximations for the current league tables and potentially manage to group teams into tiers.
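To make the calculation concrete, here is a small sketch in R of the per-match total shot ratio and the per-team average, using made-up shot counts (the column names are assumptions rather than the ones in the source data):

```r
# Hypothetical match-level shot counts (not real data).
matches <- data.frame(
  home       = c("A", "B", "C"),
  away       = c("B", "C", "A"),
  home_shots = c(15, 8, 10),
  away_shots = c(5, 12, 10),
  stringsAsFactors = FALSE
)

# Reshape to one row per team per match.
team_matches <- data.frame(
  team          = c(matches$home, matches$away),
  shots_for     = c(matches$home_shots, matches$away_shots),
  shots_against = c(matches$away_shots, matches$home_shots),
  stringsAsFactors = FALSE
)

# Total Shot Ratio = shots by team / total shots of both teams.
team_matches$tsr <- team_matches$shots_for /
  (team_matches$shots_for + team_matches$shots_against)

# Average of the per-match ratios for each team.
avg_tsr <- tapply(team_matches$tsr, team_matches$team, mean)
print(round(avg_tsr, 3))
```

The total shots on target ratio follows the same pattern with shots on target swapped in for shots. A match with no shots at all would give 0 / 0, so it would need to be dropped or handled separately.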

La Liga

La Liga 2019/20 – Total Shots Ratios and Total Shots on Target Ratios
  • Barcelona and Valencia are both getting a much higher proportion of shots on target in their matches, which suggests that they are perhaps hesitant to shoot and would rather manufacture better chances. Or they are great at limiting their opponents to settling for off target shots
  • Sevilla, Eibar and Espanyol are at the opposite end of the spectrum, pretty inefficient at both ends
  • Real Sociedad are deserving of their top 4 place and Bilbao arguably should be better off than their mid table place suggests

Serie A

Serie A 2019/20 – Total Shots Ratios and Total Shots on Target Ratios
  • Yet another reason why Atalanta are so good this year: they out-create their opponents for shots on target by a bigger margin than anyone else in the league
  • 2nd down to 7th consists of the remaining European challengers and Sampdoria, who actually sit 16th! Potentially unlucky to be that far down based on shot counts

Premier League

Premier League 2019/20 – Total Shots Ratios and Total Shots on Target Ratios
  • Man City and Chelsea lead the way but are both pretty inefficient considering their shot dominance.
  • Liverpool have clearly been the best team in the league and are 3rd here, with a suggestion they have been one of the more efficient teams. This goes to show that game state can have an impact on shot counts.
  • It doesn’t look so good for Tottenham/Arsenal, who are actually below 0.5 for Total Shots on Target Ratio. Either by chance or design, neither is getting shots on target as often as a top half side would expect, let alone a Champions League team.

Ligue 1

Ligue 1 2019/20 – Total Shots Ratios and Total Shots on Target Ratios
  • Lille are way up there almost with Paris which bodes well for them!
  • Lyon look to be an efficient mid table team, which goes with their below par season so far. Expected them up with Marseille/Lille at least.

Bundesliga

Bundesliga 2019/20 – Total Shots Ratios and Total Shots on Target Ratios
  • Top 4 looks as expected, with Gladbach’s incredibly efficient shot to shots on target helping them keep pace.
  • The bottom half seems to be pretty inefficient, Hertha don’t appear to like shooting on target that much..

All data from: http://www.football-data.co.uk

@TLMAnalytics

#20 Age x Minutes Played in Top 5 European Leagues

Considering the suspensions of all major leagues, I thought it would be a good chance to catch up on how each one is getting on. Specifically, I’m taking a look at the distribution of minutes played by players in each team by the age of those players. The inspiration for this came from Real Sociedad: the core of their team has been a bunch of players in their early twenties, and they would finish 4th should the leagues end now. This did seem unusual, and wanting to compare them with other teams sparked a project that I have suddenly come across much more time to complete.

The data for this analysis has come from transfermarkt.com. The helpful tutorials on FC Python, specifically https://fcpython.com/scraping/introduction-scraping-data-transfermarkt, make it possible to get all of the data you see into a much cleaner, easier to use table or dataframe.

Minutes played is in all competitions during the 2019/2020 season up to the latest round of games before league suspensions.

I’ve taken a look at the top 5 European leagues and noted down some interesting, some weird and some funny things that came up.

La Liga

  • Real Sociedad

The initial inspiration for this project, so it’s nice to confirm the intuition that allocating lots of minutes to a younger group of players isn’t the norm. They have done well this year, I expect this team to have players poached sooner rather than later.

  • Real Madrid

What struck me with this was the gap between the core 7/8 players in their prime ages and the rest of the squad. They are always in ‘win now’ mode so it’s hard to ease in any youngsters, especially with the expectations of the last decade. But they’re going to need to start to trust a few more of the younger players they actually have an abundance of.

  • Atletico Madrid

In my head, Atletico’s team is full of 30+ year olds. They are all just past their prime and have all the experience and tactical nous a single team could contain. Then I see that the majority of their team is late 20s, actually just hitting their prime. They’ve got some kind of weird, Simeone style conveyor belt going on behind the scenes there.

Ligue 1

  • Lyon/Lille

Another case of lots of minutes played by young players, not too surprising that it’s Lyon and Lille. However worth pointing out again because this is still very much out of the ordinary and some teams seem to be able to consistently do this.

  • Rennes

This outlier up in the top left corner is Eduardo Camavinga, born in November 2002. He’s still 17. He’s on course to play 3000 minutes this year at centre midfield. He’s pretty good.

  • Montpellier

This other outlier in the top right is Vitorino Hilton, born in September 1977. He’s 42. He’s on course to play 3000 minutes this year at centre back. I had never heard of him before. He’s 42 and playing 3000 minutes in Ligue 1. I have gained much respect for him.

Bundesliga

  • Gladbach/BVB

Obligatory lots of minutes for young players alert here. No surprise from Dortmund, but Gladbach are just a notch or two below in terms of the quality and quantity of minutes they’re getting from younger players this year.

  • Leipzig

RB Leipzig should also be in the lots of minutes for young players, but I thought they deserved a separate mention just because they ONLY play young players. There might be a virtual barrier around the training ground which doesn’t let you in after your 30th birthday.

  • Paderborn

Paderborn are currently bottom of the Bundesliga and look to have generally spread the minutes around, maybe trying different players or tactics to find something that works. I haven’t watched them. But interesting that the player with most minutes played is 22 year old Sebastian Vasiliadis in centre midfield. Is he the player the coaches trust the most? Or have they had injury troubles to other key players? He could be staying in the Bundesliga next year.

Serie A

  • Juventus

If there was ever a ‘win now’ team’s age profile, it would be this one. High proportion of minutes given to age 30+ players means that surely a rebuild is coming soon. De Ligt/Demiral are probably the start of that. (Note Buffon in the bottom right skewing the x-axis for every other team.)

  • AC Milan

The only really negatively correlated distribution I’ve seen, lots of minutes for young and not so many minutes for old players. I’d heard that Milan had made a decision to go in a different direction than overpaying just-past-their-prime players on long contracts, seems like they’re at least giving younger guys a go.

  • Atalanta

Special mention here goes to the island of players in their prime that Atalanta seem to have collected together. Hopefully they have a Gasperini type conveyor belt ready to go.

Premier League

  • Dwight McNeil

This is probably my favourite distribution of all. And yet there is nothing actually surprising about what there is to see. Burnley are a team of all old/in their prime players, and then there’s Dwight McNeil. Come on Sean, give him someone from his own generation to talk to at least.

  • Wolves, Sheffield Utd don’t rotate

There’s definitely something to be said here about team understanding and having consistent line-ups. All these teams have a group of players playing the majority of their minutes, all these teams arguably perform above expectations. This may be confirmation bias as I haven’t checked the cases where core teams underperform. Wolves have played more games than the other teams in Europe, and still don’t rotate. Looking at you Conor Coady at 4000+ minutes already.

  • Aston Villa

The obligatory team with lots of minutes for young players is Aston Villa? They’re not exactly seeing the success that similar teams around Europe are having, probably because most of these are debut seasons, all at once and in a relegation fight.

That about wraps it up, if there are any interesting things you have spotted then give me a shout! Once again, this wouldn’t have been possible without starting off with getting the data from transfermarkt so huge props to FCPython and their website which does a great job. Check them out at @FC_Python.

@TLMAnalytics

#19: How to quantify the prevention of a potential goal scoring opportunity


Chances Created and Chances Missed

Chances Created is a metric which tries to quantify the number of goal scoring opportunities that a player is directly involved in.
Opta’s definition is Assists + Key Passes, where Assists are passes (final touches) which result in a goal from the subsequent play and Key Passes (KP) are passes which result in a shot that doesn’t become a goal.
So chances created can be reduced to the final passes (touches) before another player has an attempt at goal (and scores or doesn’t).
As I’ve discussed on The Monthly Football Podcast, assists alone are pretty random since you rely on the shot actually going in the goal. Chances created are a bit less noisy and should predict future assists more reliably than assists themselves do: there are far more chances created than goals, and the metric doesn’t rely on goals actually being scored, which is hard.

Chances created relies on a player to actually have a shot at the end, otherwise there is no record of the opportunity. Opta also have ‘Chance Missed’, which is defined as a big chance opportunity where the player doesn’t get a shot away. Chance missed will be attributed to the player who has the big chance and decides not to shoot, which doesn’t help the creator who provided the chance. If we assume that the miss is largely due to the player not executing an attempt, then mapping these chances missed back to the creator in addition to chances created would give credit to creating the opportunity and not punishing them for something out of their control, such as the forward deciding to delay a shot and missing the chance to.

Chances Denied Metric

As usual, chances created and most quantified statistics deal with the offensive side of the game since it’s more tangible. Shots are there and they happen, so counting them is pretty straightforward. A bit less straightforward is counting the passes prior to shots, as with chances created. Both of these can be tracked over many events to quantify expected outcomes based on similar situations in the past, which results in expected goals and expected assists. What is not straightforward is how to quantify the benefit of defensive actions.

We can count tackles, blocks, interceptions and recoveries; however, much like steals, blocks and rebounds in the NBA, they don’t quite tell the whole story about how a defence works. Weaker teams are asked to defend more since they have less possession, which means they have more chances to rack up interceptions, tackles and recoveries. Possession adjusting these measures helps somewhat to normalise these differences, so that we can compare the frequency of each action as if every team had equal chances to make them. However, it’s still hard to differentiate the quality of the actions, or how important they were to each team.

Chances denied are an attempt to quantify how much of an opportunity was denied by an interception or ball recovery. In a purely defensive sense, that is, denying your opponent a goal scoring opportunity, recovering the ball in the middle of the pitch is not as important as recovering the ball on the edge of your own box. Expected threat (xT), created by Karun Singh (@karun1710), is a metric which quantifies how likely a team is to score from each location on the pitch within the next 5 actions. If we assign the xT to a recovery, interception or tackle based on the location on the pitch where it occurs, then we may get a proxy for how important each action was. Since defensive teams will get more opportunities, it may be worth possession adjusting this as well to compare like for like.

The general concept I’m trying to capture here is the quality of chance, or potential quality of chance, that is denied by the action the defender takes. This quantity can be given solely to the defender making the action or collectively assigned to the players involved to appreciate the team aspect of defensive play. There is a question of whether to include tactical fouls in here as well as legal ball recoveries, but I’ll save that for another time.
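To illustrate the idea, here is a rough sketch in R that looks up the opponent’s xT at the location of each defensive action and sums it per player. The 4 x 3 grid and its values are entirely made up (a real xT surface is much finer and would be fitted from event data), and the xt_denied helper and the actions data frame are hypothetical.

```r
# Made-up xT grid from the opponent's point of view: 4 zones along the pitch
# length by 3 across the width. Row 1 is nearest the defending team's own goal.
xt_grid <- matrix(
  c(0.30, 0.40, 0.30,
    0.08, 0.10, 0.08,
    0.02, 0.03, 0.02,
    0.01, 0.01, 0.01),
  nrow = 4, byrow = TRUE
)
pitch_length <- 120
pitch_width  <- 80

# Look up the grid cell containing each (x, y); x = 0 is the defender's own goal line.
xt_denied <- function(x, y, grid) {
  row <- pmin(floor(x / pitch_length * nrow(grid)) + 1, nrow(grid))
  col <- pmin(floor(y / pitch_width * ncol(grid)) + 1, ncol(grid))
  grid[cbind(row, col)]
}

# Hypothetical defensive actions.
actions <- data.frame(
  player = c("CB", "CB", "CM"),
  type   = c("interception", "recovery", "tackle"),
  x      = c(15, 40, 60),
  y      = c(40, 10, 40),
  stringsAsFactors = FALSE
)
actions$xt <- xt_denied(actions$x, actions$y, xt_grid)

# Chances denied per player: the sum of xT at the locations of their actions.
chances_denied <- tapply(actions$xt, actions$player, sum)
print(chances_denied)
```

The interception on the edge of the box is worth far more here than the tackle on the halfway line, which is the distinction that raw counts miss. A possession adjustment could then be applied to these per-player totals in whatever way the raw counts are adjusted.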

@TLMAnalytics

https://www.optasports.com/news/opta-s-event-definitions/
https://karun.in/blog/expected-threat.html

#18: Space & Structure: Attempting to quantify (Pt. 2)

Introduction

Following on from the previous post where I tried to talk through why space and structure matters in football, this post here will try (emphasis on try) to quantify the concepts. The inspiration for this method of quantifying structure has come from looking at event data and trying to combine that with a notion of structure for each event.

When picturing each event, it’s at a point in time of a match with the known locations of each player. They’re snapshots in time with a load of information about how each team is organised for this specific event. These can be plotted individually on a pitch, with a visual representation for each event to provide context. If you do plot each event individually, as you’d expect, each plot will look different as players move around the pitch. This difference is what I would like to quantify. Each event has a bunch of player locations in relation to the pitch and ball locations, so is there a way to quantify this specific set of locations for an event? If so, can we see how that quantity changes when moving on to the next event?

If a team is defending and they have a certain quantity attached to their structure during a specific event, then when the attacking team does something which changes the defensive team’s structure massively and creates a goal scoring chance, I’d expect this quantity to change significantly to reflect the change in scenario on the pitch.

In the brief research I’ve done trying to find something that fits these criteria, it seemed sensible to take a look at network graphs. They have nodes which can represent players and edges which can be weighted depending on distances between the players at each event. Networks have been used in football in the past with reference to passing metrics and average positions to quantify and qualify different styles of play. They’re usually aggregated views of whole matches, whereas the idea here is to create a time series of individual sequential events, each represented by a network that uses distances, to show how a match unfolds. For each match there would be a network representing each event, which allows a distance metric to be calculated between consecutive events to represent how significant or important an event was in changing the structure or immediate flow of the game. Since networks don’t inherently care for locations, I have compensated by adding in the locations of the centre of the pitch, each corner and each goal to reflect movement around the pitch.

For example, a pass between centre backs intending to maintain possession and cycle the ball may not produce a large difference in either team’s structure before and after the pass. Whereas a through ball which plays a forward through on goal and cuts through the defence will produce a much larger difference in structures before and after the through ball. This is the type of distinction I hope will be quantified.
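As a toy illustration of that distinction, the sketch below compares pairwise distance matrices before and after an event using the Frobenius norm of their difference in base R. This is only a stand-in chosen for simplicity and is not necessarily the distance used later via the NetworkDistance package. The four locations and the structure_change helper are made up.

```r
# Four made-up (x, y) locations at one event and after two possible next events.
before        <- rbind(c(60, 40), c(50, 30), c(50, 50), c(100, 40))
after_recycle <- rbind(c(58, 42), c(50, 30), c(50, 50), c(100, 40))
after_through <- rbind(c(100, 40), c(69, 30), c(69, 50), c(100, 55))

# Assumption: measure structural change as the Frobenius norm of the difference
# between the two pairwise distance matrices (the weighted adjacency matrices).
structure_change <- function(a, b) {
  norm(as.matrix(dist(a)) - as.matrix(dist(b)), type = "F")
}

small <- structure_change(before, after_recycle)
large <- structure_change(before, after_through)
print(c(recycle = small, through_ball = large))
```

The recycled pass barely moves anything, so the two distance matrices are nearly identical, while the through ball shifts most of the pairwise distances and scores much higher.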

# Libraries ---------------
library("SBpitch")
library(tidyverse)
library(igraph)
library(NetworkDistance)

Example

Enough theory, let’s try and see a simplified extreme example. Since for each event all player locations are necessary and that data is both hard and expensive to come by, I have created an extremely unrealistic event which hopefully shows the concept.

Reminder on how to go about plotting a pitch using the ‘SBpitch’ library.

goaltype = "box"
grass_colour = "#202020"
line_colour =  "#797876"
background_colour = "#202020" 
goal_colour = "#131313"

ymin <- 0 # minimum width
ymax <- 80 # maximum width
xmin <- 0 # minimum length
xmax <- 120 # maximum length

blank_pitch <- create_Pitch(
  goaltype = goaltype,
  grass_colour = grass_colour, 
  line_colour =  line_colour, 
  background_colour = background_colour, 
  goal_colour = goal_colour,
  padding = 0
)

plot(blank_pitch)


Locations on the pitch are static, so will be universal for each event.

# Pitch Locations
centre_circle <- c((xmin+xmax)/2, (ymin+ymax)/2)
bottom_left_corner <- c(xmin, ymin)
top_left_corner <- c(xmin, ymax)
bottom_right_corner <- c(xmax, ymin)
top_right_corner <- c(xmax, ymax)
left_goal <- c(xmin, (ymin+ymax)/2)
right_goal <- c(xmax, (ymin+ymax)/2)

Event 1

The first event is a pass, where both teams are in relatively neutral positions, almost like from a kick-off. The ball is on the half way line.

# Ball Location
ball_xy_1 <- centre_circle

# Player Locations
player_1_position_1 <- c(10,40)
player_2_position_1 <- c(20, 15)
player_3_position_1 <- c(20, 30)
player_4_position_1 <- c(20, 50)
player_5_position_1 <- c(20, 65)
player_6_position_1 <- c(50, 15)
player_7_position_1 <- c(50, 30)
player_8_position_1 <- c(59, 40)
player_9_position_1 <- c(50, 65)
player_10_position_1 <- c(80, 30)
player_11_position_1 <- c(80, 50)

player_12_position_1 <- c(120-10, 80-40)
player_13_position_1 <- c(120-20, 80-15)
player_14_position_1 <- c(120-20, 80-30)
player_15_position_1 <- c(120-20, 80-50)
player_16_position_1 <- c(120-20, 80-65)
player_17_position_1 <- c(120-50, 80-15)
player_18_position_1 <- c(120-50, 80-30)
player_19_position_1 <- c(120-50, 80-50)
player_20_position_1 <- c(120-50, 80-65)
player_21_position_1 <- c(120-80, 80-30)
player_22_position_1 <- c(120-80, 80-50)

# Complete Location DataFrame
event_1_player_locations <- as.data.frame(cbind(
  centre_circle,
  bottom_left_corner,
  top_left_corner,
  bottom_right_corner,
  top_right_corner,
  left_goal,
  right_goal,
  ball_xy_1,
  player_1_position_1,
  player_2_position_1,
  player_3_position_1,
  player_4_position_1,
  player_5_position_1,
  player_6_position_1,
  player_7_position_1,
  player_8_position_1,
  player_9_position_1,
  player_10_position_1,
  player_11_position_1,
  player_12_position_1,
  player_13_position_1,
  player_14_position_1,
  player_15_position_1,
  player_16_position_1,
  player_17_position_1,
  player_18_position_1,
  player_19_position_1,
  player_20_position_1,
  player_21_position_1,
  player_22_position_1
))
event_1_xy <- as.data.frame(t(event_1_player_locations)) 
names(event_1_xy)[1] <- "location_x"
names(event_1_xy)[2] <- "location_y"
teams <- c("pitch", "pitch", "pitch", "pitch", "pitch", "pitch", "pitch",
           "ball", 
           "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
           "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2")
player_size <- c(1, 1, 1, 1, 1, 1, 1, 1.5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
team_colours <- c("cyan", "Red", "White", "Green")


# Plot event 1 on the pitch
blank_pitch +
  geom_point(data = event_1_xy, aes(x=location_x, y = location_y, colour = teams),  size = player_size) +
  scale_colour_manual(values = team_colours) +
  theme(legend.position = "none")


There are no immediate goal scoring chances for the team in possession. We can calculate a distance matrix based on the distances between each player, the ball and points on the pitch. This distance matrix will help to create a network which looks entirely unappealing but contains all the information we want.

# Create distance matrix (weighted adjacency matrix)
dist_event_1 <- as.matrix(dist(event_1_xy %>% select(c("location_x", "location_y"))))

# create Network for all
graph_event_1 <- graph_from_adjacency_matrix(dist_event_1,
                                             mode = "undirected",
                                             weighted = TRUE,
                                             diag = FALSE)
plot(graph_event_1)
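To make the distance matrix a little more concrete, here is a minimal sketch on three made-up points (the names and coordinates are purely illustrative, not taken from the event above). It uses only base R, so it should run without the packages loaded earlier.

```r
# Toy example: three hypothetical points on a 120 x 80 pitch
toy_xy <- data.frame(
  location_x = c(60, 100, 100),
  location_y = c(40, 40, 70),
  row.names = c("centre_circle", "ball", "defender")
)

# Pairwise Euclidean distances as a symmetric matrix with a zero diagonal
toy_dist <- as.matrix(dist(toy_xy))
toy_dist
```

The ball is 40 units from the centre circle and 30 from the defender, and the defender is 50 from the centre circle. Each off-diagonal entry becomes an edge weight once the matrix is treated as a weighted adjacency matrix.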

Event 2

The second event comes immediately after the first: a pass has been made to the forward, who receives the ball in a much more advanced position. Amazingly, there has been no attempt to block him or deny him any space; if anything, the defence has moved completely out of the way. This is pretty poor defending, and it leaves a pretty poor defensive structure after the pass.

# Ball Location
ball_xy_2 <- c(100, 40) # Edge of the right penalty area

# Player Locations
player_1_position_2 <- c(10+10, 40)
player_2_position_2 <- c(20+19, 15)
player_3_position_2 <- c(20+19, 30)
player_4_position_2 <- c(20+19, 50)
player_5_position_2 <- c(20+19, 65)
player_6_position_2 <- c(50+19, 15)
player_7_position_2 <- c(50+19, 30)
player_8_position_2 <- c(50+19, 50)
player_9_position_2 <- c(50+19, 65)
player_10_position_2 <- c(80+19, 30)
player_11_position_2 <- c(80+19, 40)

player_12_position_2 <- c(120-10, 80-40)
player_13_position_2 <- c(120-30, 80-10)
player_14_position_2 <- c(120-20, 80-25)
player_15_position_2 <- c(120-20, 80-55)
player_16_position_2 <- c(120-30, 80-70)
player_17_position_2 <- c(120-50, 80-15)
player_18_position_2 <- c(120-50, 80-30)
player_19_position_2 <- c(120-50, 80-50)
player_20_position_2 <- c(120-50, 80-65)
player_21_position_2 <- c(120-80, 80-30)
player_22_position_2 <- c(120-80, 80-50)

event_2_player_locations <- as.data.frame(cbind(
  centre_circle,
  bottom_left_corner,
  top_left_corner,
  bottom_right_corner,
  top_right_corner,
  left_goal,
  right_goal,
  ball_xy_2,
  player_1_position_2,
  player_2_position_2,
  player_3_position_2,
  player_4_position_2,
  player_5_position_2,
  player_6_position_2,
  player_7_position_2,
  player_8_position_2,
  player_9_position_2,
  player_10_position_2,
  player_11_position_2,
  player_12_position_2,
  player_13_position_2,
  player_14_position_2,
  player_15_position_2,
  player_16_position_2,
  player_17_position_2,
  player_18_position_2,
  player_19_position_2,
  player_20_position_2,
  player_21_position_2,
  player_22_position_2
))
event_2_xy <- as.data.frame(t(event_2_player_locations))
names(event_2_xy)[1] <- "location_x"
names(event_2_xy)[2] <- "location_y"
teams <- c("pitch", "pitch", "pitch", "pitch", "pitch", "pitch", "pitch",
           "ball", 
           "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
           "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2")
player_size <- c(1, 1, 1, 1, 1, 1, 1, 1.5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
team_colours <- c("cyan", "Red", "White", "Green")

# Plot event 2 on the pitch
blank_pitch +
  geom_point(data = event_2_xy, aes(x=location_x, y = location_y, colour = teams),  size = player_size) +
  scale_colour_manual(values = team_colours) +
  theme(legend.position = "none")

plot of chunk event_2_pitch

Again we can create a distance matrix, which helps create an unappealing network representation of each player’s location in relation to each other, the ball and points on the pitch.

# Create distance matrix (weighted adjacency matrix)
dist_event_2 <- as.matrix(dist(event_2_xy %>% select(c("location_x", "location_y"))))

# Create Network for all
graph_event_2 <- graph_from_adjacency_matrix(dist_event_2,
                                             mode = "undirected",
                                             weighted = TRUE,
                                             diag = FALSE)
plot(graph_event_2)

Computing Network Differences

We have two events: one before a pass is made, with the defence in a seemingly comfortable position, and a second after a brilliantly executed pass that caused the defence to immediately abort mission. There is a clear difference in the players’ locations and the scenario that can be seen from the pitch view; hopefully comparing the networks can reflect this intuitive difference.

The package NetworkDistance has several metrics which aim to compute how different several networks are from each other using their adjacency matrices. The distance matrices that we have calculated are weighted adjacency matrices and should be appropriate candidates. It’s worth noting the obvious: these are fully connected, undirected and weighted networks, so not all measures will be appropriate. I am no expert in quantifying networks, however some research into the measures led to the intuition that the best candidates for representing the differences seen on the pitch won’t involve counting or replacing edges or looking at node-specific measures. This is because each network will be fully connected, so the only difference is the weightings between each pair of nodes, and we are interested in overall differences rather than individual nodes just yet, though there may be some merit to looking at those.

The types of distance measures that seem to fit these criteria are (weighted) spectral distances, which compare the eigenvalues (the spectrum) of a matrix derived from each network, such as the adjacency or Laplacian matrix, and the diffusion distance, which is based on heat diffusion and has an element of time involved.
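As a rough illustration of the spectral idea, the sketch below compares the sorted eigenvalues of two toy distance matrices using a plain Euclidean distance. The point locations and the helper names (spectrum, spectral_distance) are made up for this example, and this is only the general principle in base R, not necessarily the exact calculation NetworkDistance performs.

```r
# Two toy weighted adjacency (distance) matrices; the points are made up
pts_a <- matrix(c(0, 0,  10, 0,  0, 10), ncol = 2, byrow = TRUE)
pts_b <- matrix(c(0, 0,  10, 0,  0, 30), ncol = 2, byrow = TRUE)
dist_a <- as.matrix(dist(pts_a))
dist_b <- as.matrix(dist(pts_b))

# Eigenvalues of a symmetric matrix, returned in decreasing order
spectrum <- function(m) eigen(m, symmetric = TRUE, only.values = TRUE)$values

# Euclidean distance between the two spectra
spectral_distance <- function(m1, m2) {
  sqrt(sum((spectrum(m1) - spectrum(m2))^2))
}

spectral_distance(dist_a, dist_a)  # a network compared with itself
spectral_distance(dist_a, dist_b)  # the third point has moved
```

A network compared with itself gives zero, and the value grows as the edge weights diverge. Note that this comparison needs both matrices to have the same number of nodes.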

The measures for each are below, though I haven’t got an intuition for the scale I’m expecting these numbers to be on, as there’s no real situation for comparison. They are split into the pairwise distances between each network and a matrix representing the respective spectra. In this case there are only two networks for two events and so a single pairwise distance.

# Calculate distance between events (can do all events at once)
events_list <- list(
  dist_event_1, 
  dist_event_2
)

# Centrality
# event_centrality_close <- nd.centrality(events_list, out.dist = TRUE, mode = "Close", directed = FALSE)
# event_centrality_degree <- nd.centrality(events_list, out.dist = TRUE, mode = "Degree", directed = FALSE)
# event_centrality_btness <- nd.centrality(events_list, out.dist = TRUE, mode = "Between", directed = FALSE)

# L2 Distance of Continuous Spectral Densities
event_csd <- nd.csd(events_list, out.dist = TRUE, bandwidth = 1)

# Discrete Spectral Distance
event_dsd_adj <- nd.dsd(events_list, out.dist = TRUE, type = "Adj")
event_dsd_lap <- nd.dsd(events_list, out.dist = TRUE, type = "Lap")
event_dsd_slap <- nd.dsd(events_list, out.dist = TRUE, type = "SLap")
event_dsd_nlap <- nd.dsd(events_list, out.dist = TRUE, type = "NLap")

# Edge Difference Distance
# event_edd <- nd.edd(events_list, out.dist = TRUE)

# Extremal Distance with top-k Eigenvalues
# event_extremal <- nd.extremal(events_list, out.dist = TRUE, k = ceiling(nrow(events_list)/5))

# Graph Diffusion Distance
event_gdd <- nd.gdd(events_list, out.dist = TRUE, vect = seq(from = 0.1, to = 1, length.out = 10))

# Hamming Distance
# event_hamming <- nd.hamming(events_list, out.dist = TRUE)

# HIM Distance
# event_him <- nd.him(events_list, out.dist = TRUE, xi = 1, ntest = 10)

# Network Flow Distance
# event_nfd <- nd.nfd(events_list, order = 0, out.dist = TRUE, vect = seq(from = 0, to = 10, length.out = 1000))

# Distance with Weighted Spectral Distribution
event_wsd <- nd.wsd(events_list, out.dist = TRUE, K = 50, wN = 4)

# Combine Relevant Distances
events_distance <- data.frame(cbind(
  event_csd,
  event_dsd_adj,
  event_dsd_lap,
  event_dsd_slap,
  event_dsd_nlap,
  event_wsd,
  event_gdd
))

events_distance
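For the diffusion distance, the following is a minimal base R sketch of the idea as I understand it: build a weighted Laplacian for each network, compute the heat kernel exp(-t * L) at several time points, and take the largest Frobenius-norm gap between the two kernels. The toy matrices and helper names are assumptions for illustration, and the package may differ in its details.

```r
# Toy weighted adjacency matrices built from made-up point locations
pts_a <- matrix(c(0, 0,  1, 0,  0, 1), ncol = 2, byrow = TRUE)
pts_b <- matrix(c(0, 0,  1, 0,  0, 3), ncol = 2, byrow = TRUE)
adj_a <- as.matrix(dist(pts_a))
adj_b <- as.matrix(dist(pts_b))

# Weighted graph Laplacian: diagonal matrix of row sums minus the adjacency
laplacian <- function(a) diag(rowSums(a), nrow = nrow(a)) - a

# Heat kernel exp(-time * L) via the eigendecomposition of the Laplacian
heat_kernel <- function(a, time) {
  e <- eigen(laplacian(a), symmetric = TRUE)
  e$vectors %*% diag(exp(-time * e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# Largest Frobenius-norm gap between the two heat kernels over the time grid
diffusion_distance <- function(a1, a2, times = seq(0.1, 1, length.out = 10)) {
  gaps <- sapply(times, function(time) {
    norm(heat_kernel(a1, time) - heat_kernel(a2, time), type = "F")
  })
  max(gaps)
}

diffusion_distance(adj_a, adj_a)
diffusion_distance(adj_a, adj_b)
```

One thing this makes visible is that the time grid interacts with the size of the edge weights: with large weights the heat kernels flatten out very quickly, so the choice of time points affects how much difference is picked up.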

Still to come

There are of course some factors that haven’t been considered here such as different areas of the pitch being more important than others. Future ideas would ideally include a weighting that is applied depending on where the ball is located on the pitch. A further improvement to this would be to separate each team for the same event and evaluate their network differences separately alongside the overall difference.
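The per-team split could be sketched roughly as follows, reusing the idea of a teams label per row. The mini event and the helper team_distance_matrix are hypothetical, written in base R purely to show the shape of the approach.

```r
# Hypothetical mini event: the ball plus two players per team
event_xy <- data.frame(
  location_x = c(100, 20, 39, 110, 90),
  location_y = c(40, 40, 15, 40, 70)
)
teams <- c("ball", "1", "1", "2", "2")

# Distance matrix restricted to one team's players, optionally with the ball
team_distance_matrix <- function(xy, teams, team, include_ball = TRUE) {
  keep <- teams == team | (include_ball & teams == "ball")
  as.matrix(dist(xy[keep, c("location_x", "location_y")]))
}

dist_team_1 <- team_distance_matrix(event_xy, teams, "1")
dist_team_2 <- team_distance_matrix(event_xy, teams, "2")
dim(dist_team_1)
```

Each team's matrix could then be compared across events in the same way as the full network, giving one difference per team alongside the overall one.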

This concept is still a work in progress, but I thought it was worth throwing the idea out there. If you have any questions, suggestions or spot any errors, please give me a shout!

thelastmananalytics@hotmail.com
@TLMAnalytics

#17 Space & Structure: Why it matters (Pt. 1)

I always like the idea that Lionel Messi is such a good player that he creates more space by standing still than other players do running around. Space is at a premium on the football pitch as it is, so being able to manufacture more space is a great skill to have. I should clarify: space in important areas of the pitch is at a premium. There is actually lots of space on the pitch, it’s just that the majority of it isn’t contested as it is considered unimportant. Exactly where these important areas are is up for debate, but it is generally accepted that the areas close to each goal are more important than others. Being able to manufacture space in these important areas of the pitch is a great skill to have, and one that ultimately determines how well a player and team will perform.

This is the first part of what will likely be a few entries on my thoughts about Space and Structure’s importance in football.

Football is a simple game; the opposition players make it complicated. Putting the small round thing in the large rectangular white thing is a pretty simple concept, and is easy to do when there are no other players trying to stop you. For example, it’s much easier to score an open goal than it is to score in the resulting play from your own goal kick. Aside from the distance, the main obstruction is other players actively preventing your progress up the field towards their goal.

Denying the opposition comfortable possession in important areas of the pitch is a desirable feature of a well working defensive system, whilst gaining comfortable possession in important areas of the pitch is desirable when attacking. Whole styles of play are designed and implemented with both of these dimensions in mind.

  • Guardiola’s Barcelona used quick, short passing to probe and pull a defence out of shape before exploiting the available space to create high probability goal scoring chances.
  • Simeone’s Atlético Madrid employ a disciplined low block where defenders deny space and cover each other, whilst not over extending.

These are two opposite extremes of the same spectrum: both systems have an established structure which maintains control over the space in important areas of the pitch.

Guardiola’s Barcelona worked so well for many reasons, but partly because they were able to consistently pull defenders out of position and distort the structure that the defending team had employed. Once the structure was out of shape, it became much easier to manufacture space where they wanted and create goal scoring opportunities.

Simeone’s Atlético Madrid consistently denied goal scoring opportunities partly due to their individual and structural discipline. They maintain their structure and ensure contingency plans, such as double teams and covering defenders, are available and effective when required.

It’s clear that managing and controlling space in important areas of the pitch is crucial both with and without the ball. The best attacking teams seem to make defensive efforts obsolete, whilst the best defensive teams make defending look so simple. Quantifying contributions to attacking play is more well established due to individual measurements such as shots, goals, assists and now also ‘Expected’ metrics. This is because scoring goals has been considered an individual achievement, readily quantified by how many goals a player scores themselves. Defensive contributions are harder to quantify and much more nuanced. There are individual metrics such as blocks, tackles and interceptions, however these don’t always correlate with fewer goals conceded. Weaker teams are put in positions to perform these actions more often than stronger teams, but they also concede more goals.

As individual contributions are less important defensively, it seems more reasonable to seek to quantify defensive efforts using team-oriented measurements. In different phases of the game, teams will adopt a shape which reflects both what they want to protect and how they want to protect it. For example, a team may retreat and allow the opposition to carry the ball out of the opposition’s own half, but apply pressure as soon as the ball enters the defending team’s half. This suggests that they see the important areas of the pitch as being in their own half and want to protect this area. Alternatively, a team may adopt a high press and immediately press the opposition in the opposition’s half, with the aim of winning the ball back and countering nearer to the opposition’s goal. This approach suggests that their important areas of the pitch cover a much larger area of the field, with a lower emphasis on their own half. These examples are specific to a point in time and will evolve as the ball moves around the pitch; constantly re-evaluating the important areas and reacting to the other team are decisions that individual players must make throughout a match.

Without the ball, a team will want to ensure that their intended structure is maintained. With the ball, a team will try to break the defensive team’s structure. The assumption here is that it’s much harder to create goal scoring opportunities when attacking a defence’s intended structure than once you’ve forced them out of their comfort zone. Great attacking teams can force defences out of their structure more easily, usually by precise ball movement or individual skill moving the ball past opponents. Great defensive teams avoid being disrupted from their structure, usually by being comfortable in a wide array of structures and so effectively always in an adequate defensive shape or forcing the attacking team to play by their rules.

It seems obvious that the team which controls the space on the pitch will control the game. When watching football matches, we can get an intuition about how each team sets up and how that affects the flow of a game, however it’s hard to quantify that intuition. It’s hard to determine exactly what makes Barcelona’s and Atlético Madrid’s use of structure and space so good, as they appear to have completely opposite styles. In the following parts of my thoughts on structure and space, I’ll take a look at potential ways to quantify or measure their use of space and structure.

@TLMAnalytics

#16: StatsBomb – Messi Ball Receipt Locations

** Re-uploaded with correct y-axis (duh, Messi played on the right..)

Introduction

Over the last few months, StatsBomb have released all of the event data for matches including Lionel Messi’s La Liga matches. Data this detailed and clean is incredibly hard (expensive) to come by, so to give free access to everyone is amazing and much appreciated!

I’ve only just had a chance to take a look at the data, and have seen many great pieces put out already. Considering who the data is about, I don’t think it will ever go out of fashion, so it’s never too late to start playing around.

You can get access to the data and there’s a very helpful getting started guide here:
https://statsbomb.com/resource-centre/
https://statsbomb.com/2019/07/messi-data-release-part-1-working-with-statsbomb-data-in-r/

** You do need to have the latest version of R and the StatsBombR package installed

There are almost too many things to look at in this data set, so I’ve decided to try to focus on a specific part of Messi’s game and see what I find.

Messi gets the ball, a lot. And he obviously does great things with it once he’s got it, but taking a look at where/how he manages to get the ball would be interesting. Surely the one thing an opposing team would try to do against him would be to try to stop him getting the ball or at least limit him receiving the ball in dangerous areas. That’s the inspiration for taking a look at where he receives the ball on the pitch.

Data Prep

Load all the necessary libraries. The usual suspects for R data manipulation like plyr/tidyverse/magrittr, plotting graphs and pitches with FC_RStats’ SBpitch/ggplot2/cowplot and access to the data in StatsBombR.

# Libraries ---------------
library(plyr)
library(StatsBombR)
library("SBpitch")
library("ggplot2")
library(tidyverse)
library(magrittr)
library(cowplot)

We are only looking at La Liga matches, so let’s only load matches from that competition. There is even a cleaning function ‘allclean’ which adds in some extra columns which will be of use such as x/y locations. We have joined on the season names also as they’re much more intuitive than the season id that has been assigned.

There are events from matches from Messi’s debut season 2004/05 through to 2015/16, consisting of events such as shots, passes and even nutmegs. We’re interested in passes received by the man himself. Note there is also an indicator “ball_receipt.outcome.name” that identifies when a pass is missed, we want to exclude these and only look at passes to Messi that he received (NA values).
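The filtering step described above might look something like this on a toy events table. The column ball_receipt.outcome.name is the one mentioned in the text; the other column names, the shortened player labels and every row are assumptions made up for illustration, so treat this as a sketch of the logic rather than the exact code used to build the data for the plots.

```r
# Toy stand-in for the cleaned events table (all rows are invented)
events <- data.frame(
  type.name = c("Ball Receipt*", "Ball Receipt*", "Pass", "Ball Receipt*"),
  player.name = c("Messi", "Messi", "Messi", "Xavi"),
  ball_receipt.outcome.name = c(NA, "Incomplete", NA, NA),
  stringsAsFactors = FALSE
)

# Keep ball receipts by the chosen player where no failed outcome is recorded
is_receipt <- events$type.name == "Ball Receipt*"
is_messi <- events$player.name == "Messi"
completed <- is.na(events$ball_receipt.outcome.name)

messi_receipts_toy <- events[is_receipt & is_messi & completed, ]
nrow(messi_receipts_toy)
```

Only the first row survives: the second is a missed pass, the third is not a ball receipt and the fourth belongs to another player.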

Plotting a pitch

To get some perspective relative to an actual football pitch and use StatsBomb’s event location data, FC_RStats has created a function “create_Pitch” which does exactly that. Using ggplot2 and a set of pitch type parameters, it’s easy to plot a pitch with the same proportions as the event data collected by StatsBomb.

This pitch can be used as the base to visualise all events by plotting the x/y locations.

goaltype = "box"
grass_colour = "#202020"
line_colour =  "#797876"
background_colour = "#202020" 
goal_colour = "#131313"

ymin <- 0 # minimum width
ymax <- 80 # maximum width
xmin <- 0 # minimum length
xmax <- 120 # maximum length

blank_pitch <- create_Pitch(
  goaltype = goaltype,
  grass_colour = grass_colour, 
  line_colour =  line_colour, 
  background_colour = background_colour, 
  goal_colour = goal_colour,
  padding = 0
)

plot(blank_pitch)

Quick Look

Initial data and visual processing is done, we can now start to take a look at the interesting stuff!

At a high level, we can have a look at all of the times Messi received the ball and plot them on a pitch. This will probably get overcrowded but can start to provide some understanding.

# All ball receipts ------------
Messi_Plot <- 
  blank_pitch +
    geom_point(data = Messi_Ball_Receipts, aes(x=location.x, y=80-location.y), colour = "purple") +
    ggtitle("Messi Ball Receipts") +
    theme(plot.background = element_rect(fill = grass_colour),
          plot.title = element_text(hjust = 0.5, colour = line_colour))
plot(Messi_Plot)

Since each time Messi receives the ball is in a specific location, we’ve used points to represent this on the pitch. This looks okay initially, but it’s pretty hard to work out exactly what’s going on and doesn’t really tell us anything we didn’t already know. Messi gets the ball a lot in the opposition’s half.

There are lots of overlapping points, let’s try to get a view of the density distribution to see where specifically he has received the ball the most.

# Density Receipts ----------------
Messi_Density_Plot <- 
  blank_pitch +
    geom_density_2d(data = Messi_Ball_Receipts, aes(x=location.x, y=80-location.y), colour = "purple") +
    ggtitle("Messi Ball Receipts - Density") +
    theme(plot.background = element_rect(fill = grass_colour),
          plot.title = element_text(hjust = 0.5, colour = line_colour))
plot(Messi_Density_Plot)

This mostly suggests the same thing: Messi likes to receive the ball in the opposition half. Though we can now also see that there are two “peaks”, one far out wide near the top and one closer to the centre. More central areas are more dangerous, whereas you might get more space out wide to be able to receive the ball more easily.

Luckily (definitely not luckily) StatsBomb just so happen to have a flag which identifies events that occurred under pressure. I believe under pressure is taken as having an opposition player within X metres of you actively affecting your decision making.

Let’s take a look at Messi’s ball receipts whilst under pressure and under no pressure. I would expect that you would be under pressure more often the closer you get to the opposition’s goal.

# Pressure --------------
Messi_Ball_Receipts <- Messi_Ball_Receipts %>%
  mutate(pressure = ifelse(is.na(under_pressure), "No Pressure", "Pressure"))

Messi_Pressure_Plot <-
  blank_pitch +
  geom_point(data = Messi_Ball_Receipts, aes(x=location.x, y=80-location.y, colour = pressure)) +
  ggtitle("Messi Ball Receipts by Pressure") +
  theme(plot.background = element_rect(fill = grass_colour),
        plot.title = element_text(hjust = 0.5, colour = line_colour),
        legend.position = "bottom",
        legend.title = element_blank(),
        legend.background = element_rect(fill = grass_colour),
        legend.text = element_text(color = line_colour))

Messi_Pressure_Plot

I’m not really sure what I expected. It’s pretty hard to distinguish between the two as there are so many points. To further filter the data we can take a look at this in each season.

# Pressure Season Loop ----------------
for (i in rev(La_Liga$season_name)) {
    print(
      blank_pitch +
        geom_point(data = Messi_Ball_Receipts %>% filter(season_name == i), 
                   aes(x=location.x, y=80-location.y, colour = pressure))  +
        ggtitle(paste0("Messi Ball Receipts by Pressure - ", i)) +
        theme(plot.background = element_rect(fill = grass_colour),
              plot.title = element_text(hjust = 0.5, colour = line_colour),
              legend.position = "bottom",
              legend.title = element_blank(),
              legend.background = element_rect(fill = grass_colour),
              legend.text = element_text(color = line_colour))
      ) 
}

Now there’s a lot less going on. Remember those two peaks of ball receipts from above? We can see here that this is due to Messi receiving the ball in different areas of the pitch in different seasons. Again, this is something we already probably knew. Messi started his career as a wide forward, so he received the ball out wide most of the time. From 2009/10 onwards he starts to receive the ball much more centrally, coinciding with his time playing as a “False 9” up front. Coincidentally, his already ridiculous production output skyrocketed. Messi getting the ball in central areas = goal machine.

This still hasn’t really answered the question of where Messi receives the ball under pressure, as it’s hard to tell if there’s a pattern to the blue/red or if it’s all just random.

Something that can help here are marginal density plots. These can be plotted along each axis separately and can hopefully display the distribution of ball receipts more intuitively.

Taking a look at all seasons initially.

xdens_pressure <- axis_canvas(Messi_Pressure_Plot, axis = "x") +
  geom_density(data = Messi_Ball_Receipts, aes(x=location.x, fill = pressure), alpha = 0.5) +
  xlim(xmin, xmax)

combined_pressure_plot <- insert_xaxis_grob(Messi_Pressure_Plot, xdens_pressure, position = "top") 

ydens_pressure <- axis_canvas(Messi_Pressure_Plot, axis = "x") +
  geom_density(data = Messi_Ball_Receipts, aes(x=80-location.y, fill = pressure), alpha = 0.5) +
  xlim(ymin, ymax) +
  coord_flip()

combined_pressure_plot %<>%
  insert_yaxis_grob(., ydens_pressure, position = "right")
ggdraw(combined_pressure_plot)

Again there’s a bit too much going on on the pitch here, but looking at the marginal distributions across each axis is interesting.

Across the top it looks like there’s not too much difference in distribution between “Pressure” and “No Pressure”. There is a higher peak for “No Pressure” about halfway inside the opposition’s half, which could be due to Barcelona practically camping themselves outside the opposition box, with all defenders on the edge of their own box for the majority of most games.

Along the right is as expected: there are many more ball receipts under no pressure out wide.
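The peaks described above can also be located numerically rather than by eye. Below is a small base R sketch on simulated x-locations (the numbers are invented, not StatsBomb data, and density_peak is a hypothetical helper) which finds where each kernel density estimate is highest.

```r
set.seed(1)

# Simulated x-locations for two groups (invented, not StatsBomb data)
receipts <- data.frame(
  location.x = c(rnorm(500, mean = 90, sd = 8), rnorm(500, mean = 75, sd = 8)),
  pressure = rep(c("No Pressure", "Pressure"), each = 500)
)

# x value at which the kernel density estimate is highest
density_peak <- function(x, from = 0, to = 120) {
  d <- density(x, from = from, to = to)
  d$x[which.max(d$y)]
}

# One peak location per pressure group
peaks <- tapply(receipts$location.x, receipts$pressure, density_peak)
peaks
```

Applied to the real data, the same helper could be run on location.x and on 80 - location.y within each pressure group to compare the two marginal distributions directly.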

And for each season separately.

for (i in rev(La_Liga$season_name)) {
    p <- blank_pitch +
      geom_point(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=location.x, y=80-location.y, colour = pressure)) +
      ggtitle(paste0("Messi Ball Receipts - ", i)) +
      theme(plot.background = element_rect(fill = grass_colour),
            plot.title = element_text(hjust = 0.5, colour = line_colour),
            legend.position = "bottom",
            legend.title = element_blank(),
            legend.background = element_rect(fill = grass_colour),
            legend.text = element_text(color = line_colour))


    xdens <- axis_canvas(p, axis = "x") +
      geom_density(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=location.x, fill = pressure), alpha = 0.5) +
      xlim(xmin, xmax)
    xplot <- insert_xaxis_grob(p, xdens, position = "top") 

    ydens <- axis_canvas(p, axis = "x") +
      geom_density(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=80-location.y, fill = pressure), alpha = 0.5) +
      xlim(ymin, ymax) +
      coord_flip()

    comb_plot <- insert_yaxis_grob(xplot, ydens, position = "right")
    print(ggdraw(comb_plot))
}

Now this is what we all came here to see.

For the first 6 seasons of his career (2004/05 – 2009/10), Messi actually received the ball closer to goal under no pressure than he did under pressure (distribution across the top), which is pretty incredible and opposed to both what you would expect and what we saw overall. These are the wide forward Messi seasons, which shows just how good he was and how good he was getting at being a wide forward. Where he most often received the ball under no pressure (peak of the top distribution) actually moves closer to the opposition goal until 2008/09.

Then in 2009/10 something magical happens. Somehow he manages to receive the ball under no pressure both closer to the goal (across the top) AND dead in the centre of the field (along the right). Which of course is a recipe for success.

From then on, it looks like teams at least tried to put pressure on him when he received the ball close to the goal. I’m not really sure that worked so much though.

There are a lot more amazing things from Messi’s career hidden away in this amazing data set. Thanks again to StatsBomb for the free access to explore and show off some things that are possible with the data.

@TLMAnalytics

16: StatsBomb – Messi Ball Receipt Locations

Introduction

Over the last few moths, Statsbomb have released all of the event data for matches including Lionel Messi’s La Liga matches. Data this detailed and clean is incredibly hard (expensive) to come by, so to give free access to everyone is amazing and much appreciated!

I’ve only just had a chance to take a look at the data, and seen many great pieces put out already. Considering who the data is of, I don’t think it will ever go out of fashion so it’s never too late to start playing around.

You can get access to the data and there’s a very helpful getting started guide here:
https://statsbomb.com/resource-centre/
https://statsbomb.com/2019/07/messi-data-release-part-1-working-with-statsbomb-data-in-r/

** You do need to have the latest version of R and the StatsbombR package installed

There are almost too many things to look at in this data set, so I’ve decided to try to focus on a specific part of Messi’s game and see what I find.

Messi gets the ball, a lot. And he obviously does great things with it once he’s got it, but taking a look at where/how he manages to get the ball would be interesting. Surely the one thing an opposing team would try to do against him would be to try to stop him getting the ball or at least limit him receiving the ball in dangerous areas. That’s the inspiration for taking a look at where he receives the ball on the pitch.

Data Prep

Load all the necessary libraries. The usual suspects for R data manipulation like plyr/tidyverse/magrittr, plotting graphs and pitches with FC_RStats’ SBpitch/ggplot2/cowplot and access to the data in StatsBombR.

# Libraries ---------------
library(plyr)
library(StatsBombR)
library("SBpitch")
library("ggplot2")
library(tidyverse)
library(magrittr)
library(cowplot)

We are only looking at La Liga matches, so let’s only load matches from that competition. There is even a cleaning function ‘allclean’ which adds in some extra columns which will be of use such as x/y locations. We have joined on the season names also as they’re much more intuitive than the season id that has been assigned.

There are events from matches from Messi’s debut season 2004/05 through to 2015/16, consisting of events such as shots, passes and even nutmegs. We’re interested in passes received by the man himself. Note there is also an indicator “ball_receipt.outcome.name” that identifies when a pass is missed, we want to exclude these and only look at passes to Messi that he received (NA values).

Plotting a pitch

To get some perspective relative to an actual football pitch and use StatsBomb’s event location data, FC_RStats has created a function “create_Pitch” which does exactly that. Using ggplot2 and a set of pitch type parameters, it’s easy to plot a pitch with the same proportions as the event data collected by StatsBomb.

This pitch can be used as the base to visualise all events by plotting the x/y locations.

goaltype = "box"
grass_colour = "#202020"
line_colour =  "#797876"
background_colour = "#202020" 
goal_colour = "#131313"

ymin <- 0 # minimum width
ymax <- 80 # maximum width
xmin <- 0 # minimum length
xmax <- 120 # maximum length

blank_pitch <- create_Pitch(
  goaltype = goaltype,
  grass_colour = grass_colour, 
  line_colour =  line_colour, 
  background_colour = background_colour, 
  goal_colour = goal_colour,
  padding = 0
)

plot(blank_pitch)

Quick Look

Initial data and visual processing is done, we can now start to take a look at the interesting stuff!

At a high level, we can have a look at all of the times Messi received the ball and plot them on a pitch. This will probably get overcrowded but can start to provide some understanding.

# All ball receipts ------------
Messi_Plot <- 
  blank_pitch +
    geom_point(data = Messi_Ball_Receipts, aes(x=location.x, y=location.y), colour = "purple") +
    ggtitle("Messi Ball Receipts") +
    theme(plot.background = element_rect(fill = grass_colour),
          plot.title = element_text(hjust = 0.5, colour = line_colour))
plot(Messi_Plot)

Since each time Messi receives the ball is in a specific location, we’ve used points to represent this on the pitch. This looks okay initially, but it’s pretty hard to work out exactly what’s going on and doesn’t really tell us anything we didn’t already know. Messi gets the ball a lot in the opposition’s half.

There are lots of overlapping points, let’s try to get a view of the density distribution to see where specifically he has received the ball the most.

# Density Receipts ----------------
Messi_Density_Plot <- 
  blank_pitch +
    geom_density_2d(data = Messi_Ball_Receipts, aes(x=location.x, y=location.y), colour = "purple") +
    ggtitle("Messi Ball Receipts - Density") +
    theme(plot.background = element_rect(fill = grass_colour),
          plot.title = element_text(hjust = 0.5, colour = line_colour))
plot(Messi_Density_Plot)

This mostly suggests the same thing, Messi likes to receive the ball in the opposition half. Though we can now also see that there are two “peaks”, one far out wide near the top and one closer to the centre. More central areas are more dangerous, whereas you might get more space out wide to be able to receive the ball easier.

Luckily (definitely not luckily) StatsBomb just so happen to have a flag which identifies events that occurred under pressure. I believe under pressure is taken as having an opposition player within X metres of you actively affecting your decision making.

Let’s take a look at Messi’s ball receipts whilst under pressure and under no pressure. I would expect a player to be under pressure more often the closer they get to the opposition’s goal.

# Pressure --------------
Messi_Ball_Receipts <- Messi_Ball_Receipts %>%
  mutate(pressure = ifelse(is.na(under_pressure), "No Pressure", "Pressure"))

Messi_Pressure_Plot <-
  blank_pitch +
  geom_point(data = Messi_Ball_Receipts, aes(x=location.x, y=location.y, colour = pressure)) +
  ggtitle("Messi Ball Receipts by Pressure") +
  theme(plot.background = element_rect(fill = grass_colour),
        plot.title = element_text(hjust = 0.5, colour = line_colour),
        legend.position = "bottom",
        legend.title = element_blank(),
        legend.background = element_rect(fill = grass_colour),
        legend.text = element_text(color = line_colour))

Messi_Pressure_Plot

I’m not really sure what I expected. It’s pretty hard to distinguish between the two as there are so many points. To further filter the data we can take a look at this in each season.
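
When the points are too crowded to read, a quick table can help. This is a minimal sketch on made-up data, assuming (as in the code above) that under_pressure is TRUE when pressured and NA otherwise; the `receipts` data frame and its values are hypothetical.

```r
# Hypothetical mini version of the receipts data (not real StatsBomb data)
receipts <- data.frame(
  season_name    = c("2008/2009", "2008/2009", "2008/2009",
                     "2009/2010", "2009/2010", "2009/2010"),
  under_pressure = c(TRUE, NA, NA, TRUE, TRUE, NA)
)

# Same labelling rule as the tutorial code: NA means no pressure
receipts$pressure <- ifelse(is.na(receipts$under_pressure), "No Pressure", "Pressure")

# Row-wise proportions: share of receipts in each pressure state per season
pressure_share <- prop.table(table(receipts$season_name, receipts$pressure), margin = 1)
round(pressure_share, 2)
```

This doesn't say anything about where on the pitch the pressure happens, but it does show whether the overall share changes from season to season.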

# Pressure Season Loop ----------------
for (i in rev(La_Liga$season_name)) {
    print(
      blank_pitch +
        geom_point(data = Messi_Ball_Receipts %>% filter(season_name == i), 
                   aes(x=location.x, y=location.y, colour = pressure))  +
        ggtitle(paste0("Messi Ball Receipts by Pressure - ", i)) +
        theme(plot.background = element_rect(fill = grass_colour),
              plot.title = element_text(hjust = 0.5, colour = line_colour),
              legend.position = "bottom",
              legend.title = element_blank(),
              legend.background = element_rect(fill = grass_colour),
              legend.text = element_text(color = line_colour))
      ) 
}

Now there’s a lot less going on. Remember those two peaks of ball receipts from above? We can see here that this is due to Messi receiving the ball in different areas of the pitch in different seasons. Again, this is something we already probably knew. Messi started his career as a wide forward so will receive the ball out wide most of the time. From 2009/10 onwards he starts to receive the ball much more centrally, coinciding with his time playing as a “False 9” up front. Coincidentally, his already ridiculous production output skyrocketed. Messi getting the ball in central areas = goal machine.

This still hasn’t really answered the question of where Messi receives the ball under pressure, as it’s hard to tell if there’s a pattern to the blue/red or if it’s all just random.

Something that can help here are marginal density plots. These can be plotted along each axis separately and can hopefully display the distribution of ball receipts more intuitively.

Taking a look at all seasons initially.

xdens_pressure <- axis_canvas(Messi_Pressure_Plot, axis = "x") +
  geom_density(data = Messi_Ball_Receipts, aes(x=location.x, fill = pressure), alpha = 0.5) +
  xlim(xmin, xmax)

combined_pressure_plot <- insert_xaxis_grob(Messi_Pressure_Plot, xdens_pressure, position = "top") 

ydens_pressure <- axis_canvas(Messi_Pressure_Plot, axis = "x") +
  geom_density(data = Messi_Ball_Receipts, aes(x=location.y, fill = pressure), alpha = 0.5) +
  xlim(ymin, ymax) +
  coord_flip()

combined_pressure_plot %<>%
  insert_yaxis_grob(., ydens_pressure, position = "right")
ggdraw(combined_pressure_plot)

Again there’s a bit too much going on on the pitch here, but looking at the marginal distributions across each axis is interesting.

Across the top, there doesn’t look to be much difference in distribution between “Pressure” and “No Pressure”. There is a higher peak for “No Pressure” about halfway inside the opposition’s half, which could be due to Barcelona practically camping outside the opposition box while all the defenders sit on the edge of their own box for the majority of most games.

Along the right is as expected: there are many more ball receipts under no pressure out wide.
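
To back up what the marginal densities show by eye, you could summarise each axis by pressure group. The sketch below uses simulated numbers (the `receipts` data frame is hypothetical, not the real Messi data) and picks the median as the summary, which is my choice rather than anything StatsBomb prescribes.

```r
# Simulated stand-in data: two pressure groups with different typical locations
set.seed(42)
receipts <- data.frame(
  location.x = c(rnorm(100, mean = 80, sd = 10), rnorm(100, mean = 90, sd = 10)),
  location.y = c(rnorm(100, mean = 20, sd = 10), rnorm(100, mean = 40, sd = 10)),
  pressure   = rep(c("No Pressure", "Pressure"), each = 100)
)

# Median x and y per pressure group, one row per group
marginal_summary <- aggregate(cbind(location.x, location.y) ~ pressure,
                              data = receipts, FUN = median)
marginal_summary
```

A single median hides any second peak, so it is a complement to the density plots rather than a replacement.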

And for each season separately.

for (i in rev(La_Liga$season_name)) {
    p <- blank_pitch +
      geom_point(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=location.x, y=location.y, colour = pressure)) +
      ggtitle(paste0("Messi Ball Receipts - ", i)) +
      theme(plot.background = element_rect(fill = grass_colour),
            plot.title = element_text(hjust = 0.5, colour = line_colour),
            legend.position = "bottom",
            legend.title = element_blank(),
            legend.background = element_rect(fill = grass_colour),
            legend.text = element_text(color = line_colour))


    xdens <- axis_canvas(p, axis = "x") +
      geom_density(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=location.x, fill = pressure), alpha = 0.5) +
      xlim(xmin, xmax)
    xplot <- insert_xaxis_grob(p, xdens, position = "top") 

    ydens <- axis_canvas(p, axis = "x") +
      geom_density(data = Messi_Ball_Receipts %>% filter(season_name == i), aes(x=location.y, fill = pressure), alpha = 0.5) +
      xlim(ymin, ymax) +
      coord_flip()

    comb_plot <- insert_yaxis_grob(xplot, ydens, position = "right")
    print(ggdraw(comb_plot))
}

Now this is what we all came here to see.

For the first 6 seasons of his career (2004/05 – 2009/10), Messi actually received the ball closer to goal under no pressure than he did under pressure (distribution across the top), which is pretty incredible and opposed to both what you would expect and what we saw overall. These are the wide forward Messi seasons, which shows just how good he was and how good he was getting at being a wide forward. Where he most often received the ball under no pressure (peak of the top distribution) actually moves closer to the opposition goal until 2008/09.

Then in 2009/10 something magical happens. Somehow he manages to receive the ball under no pressure both closer to the goal (across the top) AND dead in the centre of the field (along the right). Which of course is a recipe for success.

From then on, looks like teams at least tried to put pressure on him when he received the ball close to the goal. Not really sure that worked so much though.
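
The “peak of the top distribution” can also be pulled out as a number instead of being read off the plot. Here is a rough sketch using base R's density() on simulated data; `density_peak()` is a hypothetical helper of mine and the seasons and values are made up for illustration.

```r
# Hypothetical helper: x value where the kernel density estimate is highest
density_peak <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

# Simulated receipts for two seasons (not real data)
set.seed(7)
receipts <- data.frame(
  season_name = rep(c("2007/2008", "2009/2010"), each = 200),
  location.x  = c(rnorm(200, mean = 75, sd = 8), rnorm(200, mean = 95, sd = 8))
)

# One peak location per season
peaks <- tapply(receipts$location.x, receipts$season_name, density_peak)
peaks
```

On the real data you would filter to one pressure group first, then compare how the peak moves season by season.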

There are a lot more amazing things from Messi’s career hidden away in this amazing data set. Thanks again to StatsBomb for the free access to explore and show off some things that are possible with the data.

@TLMAnalytics

#15: Getting Started with Free StatsBomb Event Data – xG Shot Map Tutorial

Introduction

After attending StatsBomb’s Introduction to Football Analytics last week, I was inspired to take another look at the free events data that they offer. One of the main obstacles to breaking into the football analytics industry is getting data to play around with and show what you can do, which is why StatsBomb’s commitment to offering such samples of data for free is so amazing and should be taken advantage of! There are endless possibilities for insight and visualisations using the event data, limited only by your creativity.

Support and tutorials for using the data in R are also freely available, including StatsBomb’s own StatsBombR package and FCrStats’s Twitter and GitHub, which provide functions for creating custom pitches for visualisations. Did I mention they were both free?

https://github.com/statsbomb/StatsBombR & @StatsBomb
https://github.com/FCrSTATS & @FC_rstats

It can be intimidating to start to work with complex data like this, so I will go through step by step and create a version of a popular match visualisation: an Expected Goals Shot Map.

Since the FIFA Women’s World Cup is currently taking place and the StatsBombR package is continually being updated with new games as they are played, I thought I’d use the recent England v Argentina game as an example.

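Install R and RStudio

For those completely new to R, you can download the latest version of R itself from CRAN at the link below. RStudio, the editor that runs on top of R, is a separate download.

https://cran.r-project.org/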

And install packages using:

install.packages("…package name here…")
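
One wrinkle: as far as I know, StatsBombR and SBpitch are installed from GitHub, not through install.packages(). The sketch below just checks which of the two are missing. The repository paths are taken from the GitHub links above, with "FCrSTATS/SBpitch" being my assumption for where SBpitch lives, and the install line is left commented out because it needs the devtools package and an internet connection.

```r
# GitHub repositories for the two non-CRAN packages (SBpitch path is an assumption)
github_pkgs <- c(StatsBombR = "statsbomb/StatsBombR", SBpitch = "FCrSTATS/SBpitch")

# Which of them are not installed yet?
installed <- vapply(names(github_pkgs), requireNamespace, logical(1), quietly = TRUE)
to_install <- github_pkgs[!installed]

if (length(to_install) > 0) {
  message("Missing: ", paste(names(to_install), collapse = ", "))
  # Uncomment to install:
  # for (repo in to_install) devtools::install_github(repo)
}
```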

Load Libraries

To start we will load the relevant libraries.

library("StatsBombR") # Event data
library("SBpitch")    # Custom functions for creating pitches
library("ggplot2")    # Building visualisations 
library("tidyverse")  # Data manipulation

Create Blank Pitch

Using FCrStat’s SBpitch package you can create a pitch to use with custom visualisations using the create_Pitch() function. You can specify the colours and which lines you want to see. For the xG Shot Map, we will use the whole pitch.

# Create a blank pitch using create_Pitch()
blank_pitch <- create_Pitch(
  goaltype = "box",
  grass_colour = "#202020", 
  line_colour =  "#797876", 
  background_colour = "#202020", 
  goal_colour = "#131313"
)

blank_pitch

Get StatsBomb Data

Using the StatsBombR package, getting access to the free events data is as simple as running the StatsBombFreeEvents() function as below and storing it in your environment.

statsbomb_events <- StatsBombFreeEvents()

## [1] "Whilst we are keen to share data and facilitate research, we also urge you to be responsible with the data. Please register your details on https://www.statsbomb.com/resource-centre and read our User Agreement carefully."

Get Match Info

There are over 100 variables for each event of each match, so we want to narrow the data set down to a single match. We are interested in the FIFA Women’s World Cup match between England and Argentina. We are also only interested in shots, so will only include those types of events.

I have also included the colours for the respective teams to use later on.

The x,y location of each event is stored in a single variable as an array.
Using the separate() function in the Tidyverse we can extract these and create new variables called “location_x” and “location_y”.
Use as.numeric() to make the new location variables numeric so we can plot them later.

event_type <- "Shot"
team1_colour <- "red4"
team2_colour <- "lightblue"

# Narrow down to a specific match: England Women's v Argentina Women's
match <- statsbomb_events %>%
  filter(# Fifa Women's World Cup Competition ID
           competition_id == 72 & 
           # England Women's v Argentina Women's match ID
           match_id == 22962 & 
           # Only keep events that are shots
           type.name == event_type ) %>%
  # X,Y locations are stored in a single array column, separate() into two columns
  separate(col = location, into = c(NA, "location_x","location_y")) %>%
  mutate(location_x = as.numeric(location_x),
         location_y = 80 - as.numeric(location_y))
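
A word of caution on separate(): its default separator splits on any run of non-alphanumeric characters, which includes the decimal point, so coordinates that aren't whole numbers may lose their decimals. If that bites, one alternative is to pull the values straight out of the list column. This is a sketch on made-up shots (the `shots` data frame is hypothetical), not a replacement the StatsBombR authors recommend.

```r
# Hypothetical shots with list-column locations in (x, y) order
shots <- data.frame(id = 1:3)
shots$location <- list(c(108.5, 36.2), c(95, 50), c(114.1, 41.7))

# Pull each coordinate straight out of the list column
shots$location_x <- vapply(shots$location, function(loc) loc[1], numeric(1))

# Mirror the 80 - y flip used in the tutorial code
shots$location_y <- 80 - vapply(shots$location, function(loc) loc[2], numeric(1))

shots[, c("id", "location_x", "location_y")]
```

vapply() with numeric(1) also fails loudly if any location is missing or malformed, which is usually what you want at this stage.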

Create Goal and xG Indicators

Since we are interested in the actual goals and expected goals of each shot, we can create a goal indicator variable and respective expected goal variables for the shots of each team.

match <- match %>%
  mutate(# Create a goal indicator
         Goal = ifelse(shot.outcome.name == "Goal","1","0"),
         # Create England goal indicator and xG
         team1_Goal = ifelse((shot.outcome.name == "Goal" & team.name == unique(match$team.name)[1]),"1","0"),
         team1_xG = ifelse(team.name == unique(match$team.name)[1],shot.statsbomb_xg,NA),
         # Create Argentina goal indicator and xG
         team2_Goal = ifelse((shot.outcome.name == "Goal" & team.name == unique(match$team.name)[2]),"1","0"),
         team2_xG = ifelse(team.name == unique(match$team.name)[2],shot.statsbomb_xg,NA)
)
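
The per-team columns above are handy for the plot annotations later, but if you only want the totals, a grouped sum gets you there more directly. The sketch below uses a made-up `shots` data frame with hypothetical team names and xG values, and stores the goal indicator as a number so it can be summed without converting.

```r
# Made-up shots for two hypothetical teams (not real match data)
shots <- data.frame(
  team.name         = c("Team A", "Team A", "Team B", "Team A"),
  shot.outcome.name = c("Saved", "Goal", "Off T", "Blocked"),
  shot.statsbomb_xg = c(0.10, 0.45, 0.05, 0.08)
)

# Numeric goal indicator: 1 for a goal, 0 otherwise
shots$Goal <- as.integer(shots$shot.outcome.name == "Goal")

# Total goals and xG per team, plus the difference between them
team_summary <- aggregate(cbind(Goal, shot.statsbomb_xg) ~ team.name,
                          data = shots, FUN = sum)
names(team_summary) <- c("team", "goals", "xG")
team_summary$diff <- team_summary$goals - team_summary$xG
team_summary
```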

Plot Shot Locations

Okay, lots of preparation done so far. Let’s plot some shots!

ggplot2 builds plots from the ground upwards. Remember the blank_pitch we made earlier? We use that as a base and add the shot locations on top, using geom_point to add points/dots.

# Plotting raw shot locations
blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y), colour = "white")

Oops, looks like all the shots happened at the same end, regardless of team. We need to reverse the shot locations of one team. Since we know from create_Pitch() that the pitch dimensions are 120 x 80, we can use those.

# Looks like all shots are at the same end, need to reverse the locations of one team
match <- match %>%
  mutate(location_x = ifelse(team.name == unique(match$team.name)[1],
                             120 - location_x,
                             location_x),
         location_y = ifelse(team.name == unique(match$team.name)[1],
                             80 - location_y,
                             location_y)
         )
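
If you plan to reuse the flip for other matches, it could be wrapped in a small function. The sketch below is one way to do it: `flip_location()` is a hypothetical helper of mine, the pitch size defaults to the 120 x 80 used above, and the `shots` data is made up.

```r
# Hypothetical helper: mirror locations through the centre of the pitch
# for the rows where `flip` is TRUE, leaving the rest untouched
flip_location <- function(x, y, flip, pitch_length = 120, pitch_width = 80) {
  data.frame(
    x = ifelse(flip, pitch_length - x, x),
    y = ifelse(flip, pitch_width - y, y)
  )
}

# Made-up shots for two hypothetical teams
shots <- data.frame(
  team.name  = c("Team A", "Team B", "Team A"),
  location_x = c(110, 100, 90),
  location_y = c(30, 50, 40)
)

# Flip only Team A so the two teams attack opposite ends
flipped <- flip_location(shots$location_x, shots$location_y,
                         flip = shots$team.name == "Team A")
flipped
```

Flipping both x and y is a rotation through the centre spot, so a shot from a team's right-hand side stays on their right once they are attacking the other way.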

Plot Respective Coloured Locations

Let’s see if it worked!

# Try again, with different colours for each team
blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y, colour = team.name)) +
  theme(legend.position="none") + 
  scale_colour_manual(values = c(team1_colour, team2_colour))

Oof, looks like England had lots of shots and denied Argentina anything significant.
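
One thing to watch with scale_colour_manual(): an unnamed vector of colours is matched to the team names in alphabetical order, not in order of appearance, so team1_colour may not land on the team you think of as team 1. Naming the vector removes the guesswork. A minimal sketch, where the `teams` vector is a stand-in for unique(match$team.name) and the exact team names are assumptions:

```r
team1_colour <- "red4"
team2_colour <- "lightblue"

# Stand-in for unique(match$team.name); the names here are assumed
teams <- c("England Women's", "Argentina Women's")

# Named vector: each colour is tied to a team explicitly
team_colours <- setNames(c(team1_colour, team2_colour), teams)
team_colours

# In the plot this would be used as:
#   scale_colour_manual(values = team_colours)
```

With a named vector the order of the colours no longer matters, so you don't need to swap them when a different team happens to shoot first.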

Highlight Goals

We have shot locations, but it would be nice to see which shots were goals. Using the goal indicator we created earlier, we can use a different shape (triangle) to differentiate them.

# Now highlight the goals
blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y, colour = team.name, shape = Goal)) +
  theme(legend.position="none") + 
  scale_colour_manual(values = c(team1_colour, team2_colour))

Plot Size of xG

Looks like England scored from their shot closest to the goal in the Argentina 6-yard box. Let’s see how likely they were to score by using the size of the points to reflect the expected goals.

# Now use size to reflect shot xG
blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y, colour = team.name, shape = Goal, size = shot.statsbomb_xg)) +
  theme(legend.position="none") + 
  scale_colour_manual(values = c(team1_colour, team2_colour))

Looks like England scored with their best chance and could potentially have scored a few more considering their volume of relatively good shots. This is the skeleton of an Expected Goals Shot Map; we can add annotations to make the final plot look more presentable and quantify each team’s expected goals versus actual goals.

Add Titles and Annotations

blank_pitch + 
  geom_point(data = match, aes(x=location_x, y=location_y, colour = team.name, shape = Goal, size = shot.statsbomb_xg)) +
  theme(legend.position="none") + 
  scale_colour_manual(values = c(team2_colour, team1_colour)) + 
  # Team 1 (England) xG
  geom_text(aes(x = 2, y=78,label = unique(match$team.name)[1]), hjust=0, vjust=0.5, size = 5, colour = team1_colour) +
  geom_text(aes(x = 2, y=75,label = paste0("Expected Goals (xG): ",round(sum(match$team1_xG, na.rm = TRUE),2))), hjust=0, vjust=0.5, size = 3, colour = team1_colour) + 
  geom_text(aes(x = 2, y=73,label = paste0("Actual Goals: ",round(sum(as.numeric(match$team1_Goal), na.rm = TRUE),0))), hjust=0, vjust=0.5, size = 3, colour = team1_colour) + 
  geom_text(aes(x = 2, y=71,label = paste0("xG Difference: ",round(sum(as.numeric(match$team1_Goal), na.rm = TRUE),0)-round(sum(match$team1_xG, na.rm = TRUE),2))), hjust=0, vjust=0.5, size = 3, colour = team1_colour) +
  # Team 2 (Argentina) xG
  geom_text(aes(x = 80, y=78,label = unique(match$team.name)[2]), hjust=0, vjust=0.5, size = 5, colour = team2_colour) +
  geom_text(aes(x = 80, y=75,label = paste0("Expected Goals (xG): ",round(sum(match$team2_xG, na.rm = TRUE),2))), hjust=0, vjust=0.5, size = 3, colour = team2_colour) + 
  geom_text(aes(x = 80, y=73,label = paste0("Actual Goals: ",round(sum(as.numeric(match$team2_Goal), na.rm = TRUE),0))), hjust=0, vjust=0.5, size = 3, colour = team2_colour) + 
  geom_text(aes(x = 80, y=71,label = paste0("xG Difference: ",round(sum(as.numeric(match$team2_Goal), na.rm = TRUE),0)-round(sum(match$team2_xG, na.rm = TRUE),2))), hjust=0, vjust=0.5, size = 3, colour = team2_colour)

That looks a little better, at least we now know the score and how each team did compared to their expected goals. After creating a blank pitch, we only need to add layers to get a visualisation of the information we want which is incredibly powerful. To get the visualisation for another match, simply change the match_id (and team colours) above!
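
The annotation code repeats the same paste0() calls for each team, so if you make these maps regularly it might be worth building the label text in one place. A rough sketch: `xg_labels()` is a hypothetical helper of mine and the team name and numbers are made up.

```r
# Hypothetical helper: build the four annotation strings for one team
xg_labels <- function(team, goals, xg) {
  c(
    team,
    paste0("Expected Goals (xG): ", round(xg, 2)),
    paste0("Actual Goals: ", goals),
    paste0("xG Difference: ", round(goals - xg, 2))
  )
}

# Example with made-up numbers
labels <- xg_labels("Team A", goals = 1, xg = 0.634)
labels
```

Each string could then be passed to its own geom_text() layer, keeping the positions and colours from the plot above.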

The only packages used to create this are those loaded above, with the free events data provided by StatsBomb and extra functions/tutorials by FCrStats.

Again, you can find those two here:
@StatsBomb
@FC_rstats

Hopefully this will help some people get started and overcome any initial intimidation. I will look to provide more of these step-by-step guides going forward, the more I get to play around with the data.

@TLMAnalytics