#29 – Player Contracts and Performance

Players usually get to sign a new contract for one of two reasons:

1. Sustained increase in performance

2. Transferred to a new club

These are two sides of the same coin: some kind of increased performance is rewarded, either at the current club or by moving to a different club.

It’s an easy narrative to sell that a player whose contract is running down will put up good performances and try harder to secure that next payday, while after signing that contract they will ease off, safe in their secure future, and performances will drop.

Whilst that may be true in some specific cases, the integrity of the large majority of players surely wouldn’t be swayed that easily.

That’s what I’ve taken a look at here: how player performance in the games leading up to a contract compares with performance after signing it.

The players used were those with their contract dates available here: https://www.spotrac.com/epl/contracts/

They are the top 100 contracts from the English Premier League.

The player performance measures were taken from match logs on: https://fbref.com/

Player performance is not completely described in counting stats, no matter how advanced. Context is always necessary when interpreting statistics for football performances. This analysis will take a very high level view and take a rolling average of position specific measures in matches leading up to a contract date and then the same after to compare.
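The rolling-average comparison can be sketched as below; the match-log structure, column names and values here are all hypothetical, purely to illustrate the method:

```python
import pandas as pd

def before_after_averages(match_logs, contract_date, measure, window=5):
    """Compare the rolling average of a performance measure in matches
    before a contract date against matches after it."""
    logs = match_logs.sort_values('date')
    rolling = logs[measure].rolling(window, min_periods=1).mean()
    before = rolling[logs['date'] < contract_date]
    after = rolling[logs['date'] >= contract_date]
    return before.mean(), after.mean()

# Made-up match log purely for illustration
logs = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=10, freq='7D'),
    'goals_per_90': [0.5, 0.7, 0.9, 0.8, 1.0, 0.4, 0.3, 0.5, 0.4, 0.6],
})
before, after = before_after_averages(logs, pd.Timestamp('2020-02-15'), 'goals_per_90')
```

The same position-specific measures would be swapped in per player, with the contract dates taken from the Spotrac list above.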

Even at this high level there is context to consider. Some of the contracts are due to increased performance levels whilst others are due to transfers which would likely be the result of increased performance.

Transfers can only occur in the transfer windows, in January or in the summer. So some of these comparisons will be between seasons and some within a season, but in both cases the player will be playing with different teammates and potentially in a different role at their new team.

For contract extensions in the same teams, they can occur at any time and so could occur mid-season.

Performance measures for positions are as you’d expect:

  • For forwards they are mainly goalscoring measures
  • Attacking midfielders are more creative measures
  • Defensive midfielders are some possession and defensive measures
  • Central defenders are similar to defensive midfielders actually
  • Full backs are defensive measures and creativity.

So let’s take a look at some specific players. Below is a selection of high-profile players including Aubameyang, Kane, Grealish, De Bruyne, Chilwell, Wan-Bissaka and Maguire – a mixture of transfers and extensions.

First up is Aubameyang, who was the inspiration for this work. After signing a large extension with Arsenal it doesn’t feel like he’s hit the same levels as before, and the numbers suggest similar. He’s potentially guilty of the ‘not trying after securing the extension’ narrative, but more likely it’s just a decline in performance due to age.

Next is Grealish, who has been amazing for Aston Villa, with transfer rumours linking him to Man Utd. His extension at Villa doesn’t seem to have had the same effect as Aubameyang’s: arguably Grealish is even better now than before the contract. Whether it’s down to his age or the motivation of the Euros coming up, the contract extension seems deserved.

Lastly, a transfer example with Ben Chilwell, who had a great season with Leicester last year and is now playing for a ‘better’ team at Chelsea. This whole process seems to have passed Chilwell by, as his performances are just as varied after the transfer and new contract as they were before.

Now taking a look at whether contracts seem to affect some positions more than others. Each individual player’s average measure has been averaged to try to find an overall trend.

There are no clear drop offs after signing a new contract. Most of the time immediately before the contract seems to be the worst performances and they seem to pick it up afterwards. Perhaps contract talks have a negative effect on player performances from a mental perspective.

Arguably forwards have the sharpest drop in performance leading up to contracts, and whilst their performance does increase afterwards, it’s not certain they’ll return to pre-contract levels.

Forwards’ performance measures are largely driven by goalscoring, which is heavily luck-influenced in itself. The players getting new contracts and transfers are those likely to have been performing well leading up to them. This likely includes some hot goalscoring streaks and over-performance that will be hard to replicate going forward.

And finally taking a look at all positions together, to try to get an overall view at player performances before and after signing contracts.

It seems that there’s a drop in performance leading up to the contract and then it picks back up afterwards.

This is largely driven by the high pre-contract performances of forwards discussed earlier; the remaining positions show less of a difference.

As mentioned multiple times, this sample of players is implicitly biased towards players who have performed well before signing contracts, especially large ones. If looking at all contracts across all leagues and taking into account the contract monetary amount there may be further patterns to find.

But in this sample we seem to be seeing a slight drop in performance leading up to contracts and then a return to mostly similar levels afterwards. There is the inevitable regression to the mean argument which is applicable across all positions. It’s largely affecting positions and players that rely on more variable performance measures like goalscoring rather than midfielders or defenders whose measures include higher volume measures such as touches or passes.

@TLMAnalytics

#28 – Quantifying Non-Shot Chances

Why do they matter?

Goals, shots and expected goals are only counted if shots are actually taken.

Problem

The problem is that there are some chances where a goal is likely but a player doesn’t shoot for some reason: they choose to pass, they hesitate, and so on.

How do we measure or quantify these Non-Shot chances?

Existing Solutions

Expected threat is a great way of thinking about this, as Karun Singh (https://twitter.com/karun1710) proposed and outlines here: https://karun.in/blog/expected-threat.html. It manages to attribute value to passes and plays before goals and assists.

The secondary Ozil assist that penetrates the defence is the one that actually creates the goal scoring opportunity, even though Kolasinac is the one credited with the assist and Aubameyang actually scores the goal.

Expected threat can attribute a proportion of the goal contribution to Ozil’s pass as well as Kolasinac’s assist which is great.

Even if there isn’t a shot at the end of the play, expected threat is assigned to all locations on the pitch so you can see how valued it is.

Specifically, you can quantify the threat of each action by working out the difference in expected threat at the start and end of the action. If it increases, then you’re more likely to score in the next few actions than previously which is good.
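That difference can be sketched as below; the 12×8 grid shape follows Karun Singh’s post, but the threat surface here is a made-up placeholder that simply increases towards the attacking goal:

```python
import numpy as np

# Placeholder expected threat surface: 12 x 8 zones, threat rising
# linearly towards the attacking goal (row 11). Values are illustrative only.
xt_grid = np.tile(np.linspace(0.005, 0.35, 12)[:, None], (1, 8))

def zone(x, y, pitch_length=120, pitch_width=80):
    """Map pitch coordinates to an xT grid cell."""
    row = min(int(x / pitch_length * 12), 11)
    col = min(int(y / pitch_width * 8), 7)
    return row, col

def action_threat(start_xy, end_xy):
    """Threat of an action = xT at its end location minus xT at its start."""
    return xt_grid[zone(*end_xy)] - xt_grid[zone(*start_xy)]

# A long forward pass adds threat; an action staying in one zone adds none
forward_pass = action_threat((10, 40), (100, 40))
```

A positive value means the action moved the ball somewhere the team is more likely to score from within the next few actions.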

How This Solves the Problem

If a player has a great chance to shoot but doesn’t, there is no recording of this under traditional metrics.

Using expected threat, the quantified threat at that location could be attributed to that possession. So even though there was no shot, there is still some value attributed to getting in a good position to score.

Across all possessions in a match, each team will get into threatening positions and sometimes shoot, sometimes not.

If they shoot, then great we can count the shot and assign it an expected goal value. If they don’t, then we can assign that possession the highest expected threat value achieved in that possession.

This way, all possessions are worth something and reflect the intuition that a goal could happen on any given possession but sometimes things just don’t fall exactly right.
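That rule — take the shot’s xG if one was taken, otherwise the highest threat reached — can be sketched as follows (the event structure here, dicts with 'type' and 'value' keys, is hypothetical):

```python
def possession_value(events):
    """Value a possession: the xG of its shot if one was taken,
    otherwise the highest threat value reached during the possession."""
    shots = [e['value'] for e in events if e['type'] == 'shot']
    if shots:
        return shots[-1]  # possessions end with the shot
    return max(e['value'] for e in events)

# One possession ending in a shot, one fizzling out after a promising carry
with_shot = [{'type': 'carry', 'value': 0.04}, {'type': 'shot', 'value': 0.12}]
no_shot = [{'type': 'pass', 'value': 0.02}, {'type': 'carry', 'value': 0.09}]
```

Under this scheme the second possession is still worth something, even though it never appears in a traditional box score.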

A Simplistic Approach

Now I would like to use expected threat values, but haven’t created my own expected threat model yet.

I have created a simple expected goals model for this and will use it as an approximation of the concept that expected threat develops more rigorously.

Expected threat is calculated through iterations of solving what a player is likely to do from each position and each subsequent iteration works out what they can do in subsequent actions.

An expected threat model with a single iteration is essentially an expected goals model, so it approximates expected threat. It assumes a player can only shoot from each area of the pitch and works out the likelihood of scoring from there, as opposed to also considering moving or passing the ball to other locations.
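The iteration can be sketched like this; the two-zone probabilities and transition matrix are made up purely for illustration. With n_iter=1 it returns the shoot-only (xG-style) surface described above, and further iterations add the value of moving the ball on:

```python
import numpy as np

def expected_threat(shoot_p, goal_p, move_p, transition, n_iter=5):
    """Iteratively solve a simplified expected threat model.

    shoot_p[z]      : probability of shooting from zone z
    goal_p[z]       : probability a shot from zone z scores (an xG surface)
    move_p[z]       : probability of moving the ball instead (1 - shoot_p)
    transition[z,w] : probability a move from zone z ends in zone w
    """
    xt = np.zeros_like(goal_p)
    for _ in range(n_iter):
        xt = shoot_p * goal_p + move_p * (transition @ xt)
    return xt

# Toy two-zone pitch: zone 1 is closer to goal
shoot_p = np.array([0.2, 0.8])
goal_p = np.array([0.05, 0.30])
move_p = 1 - shoot_p
transition = np.array([[0.5, 0.5], [0.5, 0.5]])

xg_surface = expected_threat(shoot_p, goal_p, move_p, transition, n_iter=1)
xt_surface = expected_threat(shoot_p, goal_p, move_p, transition, n_iter=5)
```

Every extra iteration can only add value, so the full xT surface always sits at or above the single-iteration xG surface.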

What I’ve done:

1. Using StatsBomb event data with shot freeze frames to train a logistic regression model to predict the probability of scoring a goal considering distance from goal, angle of goal available, number of defenders blocking the goal and the distance to the closest defender.

https://github.com/ciaran-grant/Non-Shot-xG

Much of the work uses the Friends of Tracking (https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking) tutorials as a base and builds from there.

import numpy as np

def calculate_xG(shot):
    ''' calculate_xG (shot)
    Calculates the Expected Goals based on a model trained using StatsBomb event and freeze frame data.
    Input is a row of a dataframe with columns for distance, angle, distance_nearest_defender, number_blocking_defenders
    '''
    # For the model, get the intercept
    intercept = 1.0519
    # For as many variables as put in the model,
    # bsum = intercept + sum(coefficient * variable value)
    bsum = (intercept
            + 0.1080 * shot['distance']
            - 1.6109 * shot['angle']
            - 0.1242 * shot['distance_nearest_defender']
            + 0.3260 * shot['number_blocking_defenders'])
    # Calculate probability of goal as 1 / (1 + exp(model output))
    xG = 1 / (1 + np.exp(bsum))
    return xG

The above model is by no means the best (or even close), but using StatsBomb’s freeze frame event data for shots lets you build a model that can use defender location information, which is great for this concept.
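As a quick sanity check, here is the same linear predictor and logistic transform applied to one hypothetical shot (the feature values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical shot: ~12 units out, moderate angle, nearest defender 2 away, one blocker
shot = pd.Series({'distance': 12.0, 'angle': 0.5,
                  'distance_nearest_defender': 2.0,
                  'number_blocking_defenders': 1})

bsum = (1.0519 + 0.1080 * shot['distance'] - 1.6109 * shot['angle']
        - 0.1242 * shot['distance_nearest_defender']
        + 0.3260 * shot['number_blocking_defenders'])
xG = 1 / (1 + np.exp(bsum))  # roughly 0.17 for these inputs
```

Note the sign convention: with these coefficients a larger linear predictor means a lower scoring probability, so greater distance lowers xG and a wider angle raises it, as you’d expect.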

2. Using Metrica’s sample tracking data (match 2), calculated the expected goal value for all frames of the game, creating the necessary features from the locations of the ball and nearest players for both the home and away teams.

3. It’s only physically possible for an attacking player to shoot if they have the ball, so I have created a new non-shot xG for those frames where the ball is within a certain distance of an attacking player.

Note: this is approximating an attacking player being in control of the ball, but also picks up frames where the ball flies past a player from a cross or an attacker walks past the goalkeeper holding the ball.
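The ball-control approximation can be sketched as a simple nearest-attacker distance check; the 2-unit threshold here is an arbitrary assumption:

```python
import numpy as np

def ball_in_control(ball_xy, attacker_xys, max_dist=2.0):
    """Approximate 'an attacker controls the ball' as the ball being
    within max_dist of the nearest attacking player."""
    dists = np.linalg.norm(np.asarray(attacker_xys) - np.asarray(ball_xy), axis=1)
    return bool(dists.min() <= max_dist)
```

As noted above, this is exactly the kind of check that also fires on crosses flying past a player, so the threshold is a trade-off rather than a true possession flag.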

x-axis: Time (s), y-axis: Expected Goals.
Home team scores a goal. Expected goal (red) and non-shot expected goals (green) are tracked on time series below. When a shot is taken, they are the same thing.

4. Metrica define possessions as per their documentation. For each possession, if it ended in a shot, then take the xG of that shot, otherwise take the highest xG_available to approximate the non-shot quality of the possession.

x-axis: Time (s), y-axis: Expected Goals.
Home team gets into the opponent’s penalty area, but misplaces the cross and turns over possession.

Non-shot expected goals are NOT designed to suggest that the above red player should have shot rather than trying to cross for a better opportunity to shoot. They are useful to track the quality of possessions where there is no alternative. The above possession doesn’t count in any traditional metrics because no shots were taken, but perhaps should be tracked somewhere.

5. Totalled all the traditional game statistics including shot xG and the new non-shot xG which are below.

                          Home    Away
Goals                        3       2
Shots                       13      11
Expected Goals            2.04    2.02
Non-Shot Expected Goals   5.31    3.46
Passes                     543     421
Possessions                147     136

Post match box score for Metrica Sample Game 2

Discussion

As you can see, the non-shot xG is always higher than the standard xG, since it includes the xG plus the quality of the possessions without shots. Looking at the difference between the two gives a measure of how ‘wasteful’ a team was.

A much higher non-shot xG than xG would suggest the team got into quality shooting positions frequently but didn’t take advantage and take the shots.

A more similar non-shot xG and xG would suggest the team made the most of their shooting positions by taking the shot.
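Using the totals from the box score above, that ‘wastefulness’ gap is just the difference between the two measures:

```python
# Box-score totals from Metrica Sample Game 2 (table above)
box = {
    'Home': {'xG': 2.04, 'non_shot_xG': 5.31},
    'Away': {'xG': 2.02, 'non_shot_xG': 3.46},
}

# Threat created that never turned into a shot
wastefulness = {team: round(v['non_shot_xG'] - v['xG'], 2) for team, v in box.items()}
```

By this reading the home team left noticeably more un-shot threat on the table than the away team, despite near-identical conventional xG totals.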

Again, using the simplistic xG model here is an approximation for something more sophisticated like expected threat. But I think it nicely highlights the distinction and what’s missing when only relying on goals, shots and expected goals to review how a team performed in a match solely from the box score.

There are some notebooks and code alongside this that I’ve put on GitHub, do check it out if you’d like! Feel free to ask questions or comment over at @TLMAnalytics.

#27 – Can You Dominate Without the Ball?

The inspiration for this discussion came from the Tottenham v Arsenal match last weekend, where Arsenal heavily dominated possession (68% v 32%) and shots (12 v 6) and sent in 34 crosses. Despite seemingly dominating the on-ball metrics, and appearing unlucky to come away losing 2-0 if you only looked at the post-game stats, watching the game tells a different story. Are there ways of measuring the dominance of a match, and do they require the ball?

Arsenal didn’t ever really seem like scoring and Spurs didn’t ever really feel under threat of conceding. It’s worth mentioning that for the majority of the game Spurs were winning, having scored with their first open play shot, which meant Spurs didn’t need to overextend themselves. From then on the responsibility for attacking fell to Arsenal, whilst Spurs could soak up pressure and hit on the counter if they so chose. Which they of course did.

Spurs had a game plan and their intentions were clear; they also executed that game plan extremely well, appearing comfortable the whole game. Arsenal maybe had a game plan, but their intentions were not so clear; it’s not clear they executed it, and arguably they did not, since they lost the game whilst appearing desperate. Intent, or the execution of intent, is a potential candidate for measuring dominance.

So in this particular match it seems the difference was that the defending team had a game plan, executed it, and so were comfortable. This is different to a superficially similar scenario where a possession-heavy team is creating lots of chances and seems able to score at will. Think Manchester City, Barcelona or Bayern Munich against lower skilled opposition: they have a game plan, execute it and are comfortable when playing at their best.

This highlights the distinction between traditional box-score statistical dominance and the eye test. Controlling the ball doesn’t mean you are controlling the match, rather it’s the control of space on the pitch and particularly the areas of importance.

For example without the ball:

  • Spurs were winning
  • The only way they would lose is by letting Arsenal create chances
  • That’s most likely if Arsenal get the ball near Spurs’ penalty area
  • So Spurs decided not to let them do that (very successfully)

Or with the ball:

  • Barcelona are controlling possession and are actively trying to score
  • Perhaps to go ahead or increase their lead
  • They need to create chances
  • To do so they need to get control of the ball near the opponent’s area

Both scenarios end up with the team that can control the area on the pitch they want to seemingly dominating the game.

I feel like a flow chart would be appropriate, something like this makes sense. There are lots of options I haven’t put here since I just wanted to get the main point across. And the word dominant is used optimistically here, there are of course exceptions to everything in specific circumstances.

Arsenal:

  • Do you want the ball? YES
  • Can you get the ball? YES
  • Do you want to score? YES
  • Can you create chances? NO
  • Not dominant

Spurs:

  • Do you want the ball? NO
  • Are you conceding chances? NO
  • Dominant

In terms of quantifying this match dominance, the questions in the flow chart roughly correspond to existing metrics.

  • Do you want the ball? Gamestate and style dependent (Eg. winning, losing or drawing and possession/counter attack)
  • Can you get the ball? Gamestate, style and possession dependent (Eg. pressures by area of the pitch and passes completed, attempted)
  • Do you want to score? Gamestate dependent (Eg. winning, losing or drawing)
  • Can you create chances? Gamestate, style and space dependent (Eg. chances created, xG, shots, passes into the penalty area)
  • Are you conceding chances? Gamestate, style and space dependent (Eg. chances conceded, xGA, shots against, passes conceded into own penalty area)

Notice all of these questions are gamestate dependent, which shows how important it is. Gamestate usually leads the game plan of a team, if you’re losing you want to score and if you’re winning you want to not concede. Gamestate is easy to quantify, so could be used as an approximation for intent or combined with a style metric to further outline what a suggested game plan for a team would be in a particular scenario.

The other dependencies involve possession, space and style. Possession could be important, but not on its own. Pitch control models are great at understanding which team controls which area of the pitch at any given point of a match. You could go a step further and introduce a possession quality weighting to create a possession value model, which would reward control of the pitch closer to areas of importance.

Combining these two together would involve creating a time series of a match that weights the pitch control model or possession value model of each part of the match by the gamestate. For example, if you are winning, then the possession value of the ball near the opponent’s goal isn’t so high since you don’t need to score. If you are losing then the possession value in the opponent’s half is much higher, as you need to generate chances to get back into the game.
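A crude sketch of that weighting — the weights themselves are made up purely for illustration, the point being only that the same possession value counts for more when a team is chasing the game:

```python
def gamestate_weight(goal_diff):
    """Hypothetical weights: attacking value matters more when losing."""
    if goal_diff < 0:
        return 1.5  # losing: need to create chances to get back into the game
    if goal_diff > 0:
        return 0.5  # winning: attacking value matters less
    return 1.0      # level

def weighted_dominance(frames):
    """frames: (possession_value, goal_difference) samples over a match.
    Returns a gamestate-weighted dominance time series."""
    return [pv * gamestate_weight(gd) for pv, gd in frames]

# Same raw possession value under three different gamestates
series = weighted_dominance([(0.1, -1), (0.1, 0), (0.1, 1)])
```

In a real version the raw values would come from a pitch control or possession value model evaluated frame by frame.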

This is a really long way of saying that if you can execute your gameplan successfully and comfortably, arguably you are the dominant team in that period of the match. And just because you have the ball a lot, if you can’t do anything with it then it’s not that great, I’m sure we can all agree there.

@TLMAnalytics

#26 – Invincibles Defending with StatsBomb Events (v3.0)

Project Overview

Arsenal’s Invincibles season is unique in the Premier League: they are the only team to have gone a complete 38 game season without losing, and they’re considered one of the best teams ever to play in it. Going a whole season unbeaten suggests they were at least decent at defending, reducing or removing the bad luck that could ruin a perfect record. This project aims to look into how Arsenal’s defence managed this. Training a model is not necessary, as this is a descriptive task rather than a predictive one; instead, I will focus on answering the questions with data visualisations that effectively represent and communicate the answers.

Problem Statement

The goal is to identify areas of Arsenal’s defensive strengths and the frequent approaches used by opponents. Tasks involved are as follows:

  1. Download and preprocess StatsBomb’s event data.
  2. Explore and visualise Arsenal’s defensive actions.
  3. Explore and visualise Opponent’s ball progression by thirds.
  4. Cluster and evaluate Opponent’s ball progressions using several clustering methods.
  5. Visualise clustering results to aid understanding.
  6. Cluster and evaluate Opponent’s shot creations using several clustering methods.
  7. Visualise clustering results to aid understanding

Dataset Description

Using StatsBomb’s public event data for the Arsenal 03/04 season* (33 games), I take a look at where Arsenal’s defensive actions take place and how opponents attempted to progress the ball and create chances against them.

Since we are working with event data, there is access to the pitch location in x, y coordinates of each recorded event. These are recorded as per their Pitch Coordinates found in the documentation: https://github.com/statsbomb/open-data/blob/master/doc/Open%20Data%20Events%20v4.0.0.pdf

There are 112,773 rows and 161 columns in the data. Each row is an event from a game that has a unique id, timestamp, team, player and event information such as type and location. We will not need all events as we will filter for only defensive event types and then separately for ball progressions from Arsenal’s opponents.

There are 10251 Arsenal defensive events and 12077 opponent ball progression events. Example rows of each dataset are below:

Metrics

To determine the appropriate number of clusters for K-Means and distance threshold for Agglomerative clustering, the below metrics are used. They try to consider the density of points within clusters and between clusters.

  1. Sum of squares within cluster:

Calculated using the inertia_ attribute of the k-means class, to compute the sum of squared distances of samples to their closest cluster center. Lower scores for more dense clusters.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

  2. Silhouette Coefficient:

Calculated using the mean intra-cluster distance and mean nearest-cluster distance for each sample. Best value is 1 and worst value is -1.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

  3. Calinski-Harabasz Index:

Known as the Variance Ratio Criterion, defined as the ratio between within-cluster dispersion and between-cluster dispersion. Higher scores signal clusters are dense and well separated.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html

  4. Davies-Bouldin Index:

Average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Minimum score is 0, lower values for better clusters.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html
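All four metrics are available directly in scikit-learn; here is a minimal, self-contained example on made-up two-blob data standing in for event coordinates:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Two well-separated synthetic blobs standing in for event locations
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
scores = {
    'inertia': km.inertia_,                                       # lower = denser clusters
    'silhouette': silhouette_score(X, km.labels_),                # in [-1, 1], higher = better
    'calinski_harabasz': calinski_harabasz_score(X, km.labels_),  # higher = better
    'davies_bouldin': davies_bouldin_score(X, km.labels_),        # lower = better
}
```

Sweeping n_clusters (or the distance threshold for Agglomerative clustering) and recording these scores at each setting is how the optimal values are chosen later on.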

Data Processing

In [1]:
import json
from pandas.io.json import json_normalize
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

from CustomPitch import createVerticalPitch
from StatsBombPrep import ball_progression_events_into_thirds
from StatsBombViz import plot_sb_events, plot_sb_event_location, plot_sb_events_clusters, plot_individual_cluster_events
from StatsBombViz import plot_sb_event_grid_density_pitch, plot_histogram_ratio_pitch
from ClusterEval import kmeans_cluster, agglomerative_cluster, cluster_colour_map, cluster_evaluation, plot_cluster_evaluation

StatsBomb collect event data for lots of football matches and have made freely available a selection of matches (including all FAWSL) to allow amateur projects to be able to take place. All they ask is to sign up to their StatsBomb Public Data User Agreement here: https://statsbomb.com/academy/. And get access to the data via their GitHub here: https://github.com/statsbomb/open-data

Their data is stored in json files, with the event data for each match identifiable by their match ids. To get the relevant match ids, you can find those in the matches.json file. To get the relevant season and competitions, you can find that in the competitions.json file.

In [2]:
# Load competitions to find correct season and league codes
with open('open-data-master/data/competitions.json') as f:
    competitions = json.load(f)
    
competitions_df = pd.DataFrame(competitions)
competitions_df[competitions_df['competition_name'] == 'Premier League']

# Arsenal Invincibles Season
with open('open-data-master/data/matches/2/44.json', encoding='utf-8') as f:
    matches = json.load(f)

# Find match ids
matches_df = json_normalize(matches, sep="_")
match_id_list = matches_df['match_id']

# Load events for match ids
arsenal_events = pd.DataFrame()
for match_id in match_id_list:
    with open('open-data-master/data/events/'+str(match_id)+'.json', encoding='utf-8') as f:
        events = json.load(f)
    events_df = json_normalize(events, sep="_")
    events_df['match_id'] = match_id
    events_df = events_df.merge(matches_df[['match_id', 'away_team_away_team_name', 'away_score', 
                                            'home_team_home_team_name', 'home_score']],
                                on='match_id', how='left')    
    arsenal_events = arsenal_events.append(events_df)

print('Number of matches: '+ str(len(match_id_list)))
Number of matches: 33
  • The location tuples for the start, pass_end and carry_end events are separated out. They are all rotated to fit the vertical pitch, and then a universal end location is created for all progression events.
  • All defensive events used throughout are defined below, and all events are filtered for those.
  • All set pieces are removed in favour of keeping only open play progressions. Defending open play and defending set pieces require different approaches; the focus here is open play.
  • The event types we class as ball progressions also need defining.
  • On the vertical pitch it is more interpretable to have the opponents moving in the opposite direction to the defending team (Arsenal).
  • Ball progressions are separated into thirds of the pitch, since different approaches may be taken in different areas.
  • Narrowing down towards goal-threatening progressions, passes and carries into the penalty area and shot assists are separated out.
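The thirds separation can be sketched with pandas, assuming a vertical StatsBomb pitch 120 units long with the attack running from 0 towards 120 (the start_y column name is hypothetical):

```python
import pandas as pd

def split_into_thirds(progressions):
    """Split ball progression events by the third of the pitch they start from."""
    bins = [0, 40, 80, 120]
    labels = ['own third', 'middle third', 'final third']
    thirds = pd.cut(progressions['start_y'], bins=bins, labels=labels,
                    include_lowest=True)
    return {label: progressions[thirds == label] for label in labels}

# Made-up progression events
events = pd.DataFrame({'start_y': [12.0, 55.0, 95.0, 30.0],
                       'type_name': ['Pass', 'Carry', 'Pass', 'Pass']})
by_third = split_into_thirds(events)
```

Passes and carries into the penalty area would be filtered in the same way, using their end locations instead.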

Exploratory Analysis

The most frequent defensive action is a Pressure. Pressures don’t always result in a turnover, whereas a Ball Recovery or Interception would.

In [7]:
arsenal_defensive_actions.groupby('type_name').count()['id']
Out[7]:
type_name
50/50               20
Ball Recovery     2020
Block              573
Clearance         1096
Duel              1338
Foul Committed     528
Interception       258
Pressure          4418
Name: id, dtype: int64

There are lots more Carries and Passes than Dribbles, but I expected there to be more Passes relative to Carries than there are.

In [8]:
opponent_ball_progressions.groupby('type_name').count()['id']
Out[8]:
type_name
Carry      5352
Dribble     327
Pass       6398
Name: id, dtype: int64

There are lots more progressions via Passes than Carries in the opposition’s own third and middle third. This is likely due to the reduced risk of a forward pass compared to a carry. If you lose the ball due to a misplaced pass, the ball is likely higher up the field and the passer is closer to their own goal as an extra defender. If you lose the ball due to being tackled, then you likely lose the ball from where you are and are chasing back to catch up.

In [9]:
from_own_third_opp.groupby('type_name').count()['id']
Out[9]:
type_name
Carry    148
Pass     833
Name: id, dtype: int64
In [10]:
from_mid_third_opp.groupby('type_name').count()['id']
Out[10]:
type_name
Carry    172
Pass     676
Name: id, dtype: int64

In the final third, there is a much more even split between Carries and Passes. Perhaps due to forward passing becoming much harder the further up the pitch and the reduced risk of Carries since you are so far away from your own goal.

In [11]:
from_final_third_opp.groupby('type_name').count()['id']
Out[11]:
type_name
Carry    488
Pass     531
Name: id, dtype: int64

Although Carries are good at getting near the penalty area, Passes are still the main way into the penalty area.

In [12]:
into_pen_area_opp.groupby('type_name').count()['id']
Out[12]:
type_name
Carry     85
Pass     351
Name: id, dtype: int64

Shot assists are a subset of passes, so no Carries appear here.

In [13]:
opponent_shot_assists.groupby('type_name').count()['id']
Out[13]:
type_name
Pass    81
Name: id, dtype: int64

Looking at all defensive events from Arsenal across the season leads to an overplotting mess. There are so many defensive events that it’s hard to see any clear trends, other than that there are fewer higher up the pitch.

In [14]:
plot_sb_event_location(arsenal_defensive_actions, alpha = 0.3)
Out[14]:
(<Figure size 360x720 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x252f5c342e8>)

Looking at all of the opponent’s ball progressions is just as difficult to understand, there are lots of progressions in the middle and down the wide areas.

In [15]:
plot_sb_events(opponent_ball_progressions, alpha = 0.3)
Out[15]:
(<Figure size 360x720 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x2528fd6d978>)

Data Visualisation

Defensive Events

To better understand and explore Arsenal’s defensive events, the below plot is a combination of a 2D histogram grid and marginal density plots across each axis. We can see that the frequency of defensive actions is evenly spread left to right and more heavily skewed to their own half.

More specifically, the highest action areas are in front of their own goal and out wide in the full back areas above the penalty area. Defensive actions in their own penalty area are expected, as it’s the area closest to their goal and where crosses into the box are dealt with.

The full back areas seem to show more proactive defensive actions, made before the opponent gets closer to the byline. Passes and cutbacks from these areas close to the byline and penalty area usually generate high quality shooting chances, so minimising the opponent’s ability to get there is great.

In [17]:
# Arsenal Defensive Events Density Grid
fig, ax, ax_histx, ax_histy = plot_sb_event_grid_density_pitch(arsenal_defensive_actions)

# Title
fig.suptitle("Arsenal Defensive Actions",
             x=0.5, y=0.92,
            fontsize = 'xx-large', fontweight='bold',
            color = 'darkred')

# Direction of play arrow
ax.annotate("",
            xy=(10, 30), xycoords='data',
            xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10,
                            headwidth=20,
                            headlength=20,
                            edgecolor='black',
                            facecolor = 'darkred',
                            connectionstyle="arc3")
            )
Out[17]:
Text(10, 10, '')

We can also take a look at the relative difference between Arsenal’s defensive events and the overall view. This density grid shows where Arsenal had more events than overall in red and less than overall in blue.

The darkest red areas are again the full back areas, suggesting that Arsenal’s full backs performed more on-ball defensive actions than their opponents. Whereas they defended their penalty area about as evenly as opponents and less frequently in their opponent’s half.

In [19]:
# Relative Grid Density
fig, ax = plot_histogram_ratio_pitch(arsenal_defensive_actions, defensive_actions)

fig.suptitle("Arsenal's Relative Defensive Actions",
             x=0.5, y=0.87,
            fontsize = 'xx-large', fontweight='bold',
            color = 'darkred')
ax.annotate("",
            xy=(10, 30), xycoords='data',
            xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10,
                            headwidth=20,
                            headlength=20,
                            edgecolor='black',
                            facecolor = 'darkred',
                            connectionstyle="arc3")
            )
fig.tight_layout()

Opponent Ball Progression

By taking a look at opponents’ ball progressions we can potentially see the opponent’s point of view here. Do Arsenal’s full back areas have so many defensive events because Arsenal are ‘funneling’ their opponents there as a strength, or do Arsenal’s opponents target and exploit their full back areas?

Across each third of the pitch, both KMeans and agglomerative clustering methods were used. To identify the optimal number of clusters (KMeans) and distance threshold (agglomerative), a range of values was evaluated against the specified metrics. The optimal values were then used to create and visualise the clusters; a full example using own third progressions follows, with results from the remaining thirds below.
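As a rough sketch of what an evaluation helper like this might do internally for the agglomerative case, assuming each progression is a row of start and end coordinates (this is an illustration only; the real implementation lives in the repo's ClusterEval module):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def sweep_distance_thresholds(X, thresholds):
    """Fit agglomerative clustering at each distance threshold and
    record the resulting cluster count and silhouette score."""
    results = []
    for t in thresholds:
        labels = AgglomerativeClustering(
            n_clusters=None, distance_threshold=t).fit_predict(X)
        n = len(np.unique(labels))
        # Silhouette is only defined for 2..n_samples-1 clusters
        sil = silhouette_score(X, labels) if 1 < n < len(X) else float("nan")
        results.append({"threshold": t, "n_clusters": n, "silhouette": sil})
    return results

# Toy rows standing in for [x_start, y_start, x_end, y_end] progressions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(10, 1, (30, 4)), rng.normal(50, 1, (30, 4))])
for r in sweep_distance_thresholds(X, [5, 20, 100]):
    print(r)
```

Higher thresholds merge more aggressively and give fewer clusters, which is why the evaluation plots trade off against each other as the threshold grows.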

Own Third

In [20]:
# From Own Third - Ball Progressions
fig, ax = plot_sb_events(from_own_third_opp, alpha = 0.5, figsize = (5, 10), pitch_theme='dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - From Own Third",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
# Dotted red line for own third
ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.7, linewidth=2, linestyle='--', zorder=5)
plt.tight_layout()
K-Means
In [21]:
# Create cluster evaluation dataframe for up to 50 clusters
own_third_clusters_kmeans = cluster_evaluation(from_own_third_opp, method='kmeans', max_clusters=50)

# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - From Own Third"
fig, axs = plot_cluster_evaluation(own_third_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[21]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - From Own Third')
  1. Sum of Squares (Lower) – More clusters gives a lower sum of squares.
  2. Silhouette Coefficient (Higher) – Fewer clusters gives a higher coefficient, especially under 5, and it drops around 15.
  3. Calinski-Harabasz Index (Higher) – Fewer clusters gives a higher index.
  4. Davies-Bouldin Index (Lower) – Best at just less than 10 clusters.

Just under 10 clusters seems appropriate.

In [53]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_own_third = kmeans_cluster(from_own_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_own_third))))
# Plot each cluster
title = "KMeans Ball Progression Clusters - From Own Third"
fig, axs = plot_individual_cluster_events(3, 3, from_own_third_opp, cluster_labels_own_third, sample_size=10)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters
Agglomerative Clustering
In [23]:
# Create cluster evaluation dataframe for up to 300 distance threshold
own_third_clusters_agglo = cluster_evaluation(from_own_third_opp, method='agglomerative', min_distance=10, max_distance=500)

# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - From Own Third"
fig, axs = plot_cluster_evaluation(own_third_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[23]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - From Own Third')
  1. Sum of Squares (Lower) – Lower distance gives a lower sum of squares; flat from 200 with a sharp increase at 350.
  2. Silhouette Coefficient (Higher) – Highest at very low distances, with another peak at around 250.
  3. Calinski-Harabasz Index (Higher) – Higher distance gives a higher index.
  4. Davies-Bouldin Index (Lower) – Lowest at very low distances, dipping again at 350.

250 is chosen as it appears to be a peak for the Silhouette Coefficient, with the other metrics trading off acceptably.

In [48]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_own_third = agglomerative_cluster(from_own_third_opp, 250)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_own_third))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - From Own Third"
fig, axs = plot_individual_cluster_events(2, 3, from_own_third_opp, cluster_labels_agglo_own_third, figsize=(10, 10))
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 7 clusters

There are fewer clusters than with the KMeans method. They seem to be weighted more towards the end location: clusters 1, 5 and 6 all end just inside the middle third in the centre, on the left and on the right respectively, whilst the remaining clusters all end closer to the opposite half.

The remaining thirds all underwent the same evaluation process, and KMeans produced more appropriate groupings across the board. Agglomerative clustering focused on end locations when grouping and effectively produced a structured grid of clusters, which wasn’t the point of this exercise.

Middle Third

In [27]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_mid_third = kmeans_cluster(from_mid_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_mid_third))))
# Clustered Ball Progressions - From Mid Third
fig, axs = plot_individual_cluster_events(3, 3, from_mid_third_opp, cluster_labels_mid_third)
# Title
fig.suptitle("KMeans Ball Progression Clusters - From Middle Third",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for middle third across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters

Similar to the progressions from their own third, the most frequent progressions from the middle third are shorter down the wide areas in clusters 2 and 9, with some longer progressions in clusters 7 and 5. Clusters 8 and 4 suggest a number of progressions do make it into the centre of Arsenal’s own half.

Final Third

In [32]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_final_third = kmeans_cluster(from_final_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_final_third))))
# Clustered Ball Progressions - From Final Third
fig, axs = plot_individual_cluster_events(3, 3, from_final_third_opp, cluster_labels_final_third, sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - From Final Third",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for middle third across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters

Lots of short progressions in the wide full back areas in the final third. The two least frequent clusters appear to be deep crosses into the penalty area; these are the only consistent progressions into the penalty area.

Penalty Area

In [37]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_pen_area = kmeans_cluster(into_pen_area_opp, 4)
print("There are {} clusters".format(len(np.unique(cluster_labels_pen_area))))
# Plot individual clusters
fig, axs = plot_individual_cluster_events(2, 2, into_pen_area_opp, cluster_labels_pen_area, sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - Into Penalty Area",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
There are 4 clusters
Out[37]:
Text(0.5, 1.01, 'KMeans Ball Progression Clusters - Into Penalty Area')

A smaller number of clusters is selected here, grouping the progressions more broadly. The groups seem to be split between longer and shorter progressions, and between each wide area.

Shot Assists

In [42]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_shot_assist = kmeans_cluster(opponent_shot_assists, 4)
print("There are {} clusters".format(len(np.unique(cluster_labels_shot_assist))))
# Clustered Ball Progressions - Shot Assists 
fig, axs = plot_individual_cluster_events(2, 2, opponent_shot_assists, cluster_labels_shot_assist, figsize=(8,10),sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - Shot Assists",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
There are 4 clusters
Out[42]:
Text(0.5, 1.01, 'Ball Progression Clusters - Shot Assists')

When looking at only shot assists, there are lots from the left full back area as well as passes through the middle. Both are areas where, if you can complete a pass towards goal, there is a higher chance of generating a shot. However, shot assists that do not penetrate the penalty area suggest that the resulting shots are likely from further out and of lower quality.

Conclusion

Both KMeans and agglomerative clustering produced a similar number of clusters after evaluation and reasonable groupings after visualisation. The KMeans clusters align the trajectories of progressions from start to end locations fairly well: from the samples in the visualisations, there are consistent start and end locations within each cluster, without falling into a rigid pattern of groupings. The agglomerative clusters, by contrast, appeared to be more heavily weighted towards the end locations of progressions, meaning many clusters were grouped into a grid based primarily on where progressions ended up. The aim of using clustering approaches was to try to find patterns that weren’t readily available to a human analyst, so in this sense the agglomerative clustering didn’t add anything extra.

There is no indication of the quality of the shots produced, and no context to compare against the wider league. But what is promising is that nothing significant stands out: the common breaches of the Arsenal defence are simply good progressions when they come off. It’s not clear here how many more of these types of opportunities were prevented by defensive plays off the ball.

Across ball progressions throughout the whole pitch, the most frequent were shorter and in the wide areas. This is expected as shorter progressions are lower risk and wide areas are less important to defend, so are areas the defending team are willing to concede.

Progressions from the middle third appear to be longer than those from their own third or the final third. Context is needed in all individual circumstances, but this may be due to a lower risk of failure when further away from your own goal, or to taking advantage of a short window of opportunity to quickly progress the ball longer distances into the final third.

When in the final third, the progressions shorten again. This is not due to lower risk, but likely due to a more densely populated area. The majority of the players on the pitch will be within a single half, and navigating through there requires precision and patience from the offensive team.

Any completed progression into the penalty area is a success for the offensive team. There is a high chance you will create a shot, and if you do it’s likely to be a higher quality shot than from outside the penalty area. Though not all completed progressions into the penalty area are created equal. If it’s a completed carry into the penalty area then awesome, you likely have the ball under control and can get a pretty good shot or extra pass. If it’s a completed pass then it depends how the player receives it; aerially or on the ground makes a difference to the shot quality. Aerial passes are harder to control and headers are of lower quality than shots with the feet, but aerial passes are usually easier to complete into the penalty area. So: higher quantity, lower quality than ground passes or carries.

Shot assists rely on there being a shot at the end of them. So, circularly, they created a shot and are therefore ‘good’ progressions, but they are also ‘good’ progressions because they created a shot. As we can see, they are much more scattered, which makes it harder to understand without further context why they created shooting opportunities, since the locations alone don’t tell us much. Although that context is available within StatsBomb’s data, I haven’t taken a further look here.

When considering how these progressions affect Arsenal’s defensive events, remember that the majority of their defensive events were performed out wide in the full back areas and in their own penalty area: out wide in the full back areas more than other teams, whilst defensive events within their own penalty area were around the same as other teams.

At each third of the pitch, the most frequent ball progressions were out wide, which places the ball frequently in the full back areas. Due to the nature of defensive events, the only events recorded would be the on-ball actions that were defined including pressures, ball recoveries, tackles etc. The ball needs to be close to you to be able to perform these actions and get them recorded as events, so the opponent ball progression frequently going out wide combined with Arsenal’s defensive events in their defensive wide positions fits together well.

What this doesn’t tell us is whether these are causally linked or merely correlated. I would suggest there are more ball progressions made out wide than centrally throughout football, because defences are more willing to concede that space, so this doesn’t necessarily tell us much about Arsenal specifically. Though in Arsenal’s matches, they do perform defensive events in their full back areas more than other teams, which may suggest there is something more than correlation.

If it is a specified game plan to funnel the ball out wide and perform defensive events there, then Arsenal have done a great job at completing that. It’s a robust defensive plan if you can get it to work, the wider the ball is, the harder it is to score immediately from there. When defending it’s often useful to utilise the ends of the pitch as an ‘extra’ defender, which makes it easier to overwhelm offensive players.

Improvements

Next time I would definitely be more narrow and specific with the question I set out to answer. There were no set acceptance criteria for whether the results produced were sufficient to answer the question. This led to the more descriptive, exploratory nature of the project, which ended up scratching the surface rather than delving deeper into solving a specific problem.

Another useful tool would be to get access to tracking level data for these matches. For defensive analysis there is much more emphasis on off-ball events and distances between all the players on the pitch, tracking data would provide the locations of all players on the pitch and the ball at all times. This would provide much greater detail but also be much more complicated to work with.

Reflection

I set out to try to understand why Arsenal’s defence worked so well during their unbeaten season. Using StatsBomb’s event data for the majority of the season, I analysed where Arsenal’s defensive events were performed and how that compared to their opponents. This could only tell part of the story since defensive events only cover on-ball actions; defending covers a whole lot more than that, so further analysis was needed. I analysed how their opponents progressed the ball up the pitch from each third and how they created shots, clustering the locations to identify the most frequent types of progressions. Considering both sides, it’s clear that much of Arsenal’s defensive work happens in their full back areas and their opponents try to progress the ball down the flanks. What is not clear is whether Arsenal cause this to happen via off the ball actions or whether it’s just coincidence.

I found it hard to directly answer the question I set out with; that’s likely because the question was too broad. However, I am definitely further along the road than when I started, and it has been interesting trying to work through these problems.

It was great working with event level data and trying to find interesting ways to communicate visualisations, hopefully the plots combined with the football pitches work well to add to understanding.

#26 Written Summary – Invincibles Defending with StatsBomb Events

Overview

Arsenal’s Invincibles season is unique in the Premier League: they are the only team that has gone a complete 38 game season without losing (though it has been achieved in other leagues). They’re considered to be one of the best teams ever to play in the Premier League. Going a whole season without losing suggests they were at least decent at defending, reducing or completely removing the chance of bad luck ruining their perfect record. This post aims to look into how Arsenal’s defence managed this.

Using StatsBomb’s public event data for the Arsenal 03/04 season* (33 games), I take a look at where Arsenal’s defensive actions take place and how opponents attempted to progress the ball and create chances against them. Find these here:

Event data records all on-ball actions from a match, this is more granular than high level team and player totals. For offensive events such as passing and shooting, event data is great since they are usually on-ball actions. There are on-ball defensive actions such as tackles, interceptions and recoveries which are well captured by event data, these are the defensive events that I will be using.
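As a hedged sketch of how such defensive events might be selected from a flattened StatsBomb events DataFrame (the event-type list here is an illustrative assumption, not the post's exact definition; `type_name` and `team_name` are the column names produced by `json_normalize(..., sep="_")`):

```python
import pandas as pd

# Assumed set of on-ball defensive event types, for illustration only
DEFENSIVE_EVENT_TYPES = [
    "Pressure", "Interception", "Ball Recovery",
    "Block", "Clearance", "Duel", "50/50",
]

def get_defensive_actions(events_df, team):
    """Select one team's on-ball defensive events with their locations."""
    mask = (events_df["team_name"] == team) & \
           events_df["type_name"].isin(DEFENSIVE_EVENT_TYPES)
    return events_df.loc[mask, ["type_name", "location"]].reset_index(drop=True)

# Tiny stand-in DataFrame to show the shape of the operation
events_df = pd.DataFrame({
    "team_name": ["Arsenal", "Arsenal", "Everton"],
    "type_name": ["Pressure", "Pass", "Ball Recovery"],
    "location": [[30.0, 10.0], [40.0, 40.0], [60.0, 70.0]],
})
print(len(get_defensive_actions(events_df, "Arsenal")))  # 1 (the Pressure)
```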

We can see where these events frequently occur on the pitch for Arsenal compared to their opposition. The key nuance is that these are only the defensive events that were recorded, not all of the defensive plays that take place on the pitch. The hard part about defensive analysis is that a large proportion of defending consists of non-events, which won’t be captured by event data.

Event data still provides insight by combining Arsenal’s defensive events with their opponents’ offensive events. Arsenal’s defensive events show where their on-ball defending took place. Their opponents’ offensive events show how opponents approached attacking Arsenal; this may be the offensive team getting their way, or Arsenal’s defence forcing opponents to play a certain way. Given Arsenal’s success, the majority of the time it’s likely the latter.

Please find the underlying code and methodology here: https://github.com/ciaran-grant/StatsBomb-Invincibles

Arsenal’s Defensive Events

The below plot is a combination of a 2D histogram grid and marginal distribution plots across each axis. We can see that the frequency of defensive actions is evenly spread left to right and more heavily skewed to their own half.

More specifically, the highest action areas are in front of their own goal and out wide in the full back areas above the penalty area. Defensive actions in their own penalty area are expected, as that is the area closest to their goal and where crosses into the box are dealt with.

Defending in the full back areas seems more proactive, with defensive actions made before the opponent gets closer to the byline. Passes and cutbacks from these areas close to the byline and penalty area usually generate high quality shooting chances, so minimising the opponent's ability to get there is great.

Figure 1: Marginal Distribution Plot of Arsenal’s Defensive Events
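A plot like Figure 1 can be sketched with matplotlib's gridspec: a main 2D histogram flanked by marginal histograms that share its axes. This is a minimal stand-in using random locations, not the post's `plot_sb_event_grid_density_pitch` helper:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Random stand-in event locations on an 80 x 120 StatsBomb-style pitch
rng = np.random.default_rng(0)
x = rng.uniform(0, 80, 500)   # across the pitch
y = rng.uniform(0, 120, 500)  # along the pitch

fig = plt.figure(figsize=(6, 8))
gs = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                      wspace=0.05, hspace=0.05)
ax = fig.add_subplot(gs[1, 0])                    # main 2D histogram grid
ax_histx = fig.add_subplot(gs[0, 0], sharex=ax)   # marginal across width
ax_histy = fig.add_subplot(gs[1, 1], sharey=ax)   # marginal along length

ax.hist2d(x, y, bins=[8, 12], cmap="Reds")
ax_histx.hist(x, bins=8, color="darkred")
ax_histy.hist(y, bins=12, color="darkred", orientation="horizontal")
fig.savefig("marginal_density.png")
```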

The below density grid compares Arsenal’s defensive events relative to all defensive events in their games: areas where Arsenal had more events than overall are shown in red, and fewer in blue. The darkest red areas are again the full back areas, suggesting that Arsenal’s full backs performed more on-ball defensive actions than their opponents. Whereas they defended their penalty area about as often as their opponents, and acted less frequently in their opponents’ half.

Figure 2: 2D Histogram of Arsenal’s events relative to all defensive events

Opponent’s Ball Progressions

By taking a look at opponent’s ball progressions we can see the opponent’s point of view here. Do Arsenal’s full back areas have so many defensive events because they are ‘funneling’ their opponents there as they see it as a strength or do Arsenal’s opponents target and exploit their full back areas?

These progressions have been grouped into approximate phases of play through the thirds and into the penalty area.

The progressions have been grouped into similar types by comparing KMeans and agglomerative clustering methods. Reassuringly, there were a similar number of clusters from both methods, but the KMeans method appeared to perform better by grouping passes that are similar at both their start and end locations. Further details can be found here: https://github.com/ciaran-grant/StatsBomb-Invincibles

Own Third

Figure 3: Clusters of opponent’s ball progressions from their own third

We find that the most frequent ball progressions from their own third are shorter progressions into the middle third. There are short progressions centrally in clusters 1 and 7, with wider progressions in clusters 9 and 6. Longer progressions were less frequent.

Shorter progressions are easier to complete and less risky, so not surprising that they are the most frequent. This says nothing for how quality or sustainable these progressions are, but adds to the idea that most of the play appears to be out wide.

Middle Third

Figure 4: Clusters of opponent’s ball progressions from the middle third

Similar to the progressions from their own third, the most frequent progressions from the middle third are shorter down the wide areas in clusters 2 and 9, with some longer progressions in clusters 7 and 5. Clusters 8 and 4 suggest a number of progressions do make it into the centre of Arsenal’s own half.

Final Third

Figure 5: Clusters of opponent’s ball progressions from their final third

Again, lots of short progressions in the wide full back areas in the final third, seen in clusters 6 and 2. The two least frequent clusters appear to be deep crosses into the penalty area. These are the only consistent progressions into the penalty area, but they usually create lower quality chances through headers or contested shots.

Penalty Area

Figure 6: Clusters of opponent’s ball progressions into Arsenal’s penalty area

There are fewer groups here due to fewer progressions into the penalty area, which is expected. There are more progressions from each wide area, with a large proportion coming from much shorter progressions in clusters 2 and 1. It’s more difficult to successfully progress the ball long distances into the penalty area. The few clusters here are pretty broadly grouped, intuitively human analysts could create these groupings pretty quickly which suggests the clustering hasn’t helped much here.

Shot Assists

Figure 7: Clusters of opponent’s shot assists

When looking at only shot assists, there are lots that are received outside the penalty area. This suggests that those shots are likely from further out and of lower quality, absent further context. Due to the added restriction of requiring a shot at the end of these passes, there is likely more variance in these passes, making it harder to identify clear patterns.

Takeaways

Across ball progressions throughout the whole pitch, the most frequent were shorter and in the wide areas. This is expected as shorter progressions are lower risk and wide areas are less important to defend, so are areas the defending team are willing to concede.

Progressions from the middle third appear to be longer than those from their own third or the final third. Context is needed in all individual circumstances, but this may be due to a lower risk of failure when further away from your own goal, or to taking advantage of a short window of opportunity to quickly progress the ball longer distances into the final third.

When in the final third, the progressions shorten again. This is not due to lower risk, but likely due to a more densely populated area. The majority of the players on the pitch will be within a single half, and navigating through there requires precision and patience from the offensive team.

Any completed progression into the penalty area is a success for the offensive team. There is a high chance you will create a shot, and if you do it’s likely to be a higher quality shot than from outside the penalty area. Though not all completed progressions into the penalty area are created equal. If it’s a completed carry into the penalty area then awesome, you likely have the ball under control and can get a pretty good shot or extra pass. If it’s a completed pass then it depends how the player receives it; aerially or on the ground makes a difference to the shot quality. Aerial passes are harder to control and headers are of lower quality than shots with the feet, but aerial passes are usually easier to complete into the penalty area. So: higher quantity, lower quality than ground passes or carries.

Shot assists rely on there being a shot at the end of them. So, circularly, they created a shot and are therefore ‘good’ progressions, but they are also ‘good’ progressions because they created a shot. As we can see, they are much more scattered, which makes it harder to understand without further context why they created shooting opportunities, since the locations alone don’t tell us much. Although that context is available within StatsBomb’s data, I haven’t taken a further look here.

When considering how these progressions affect Arsenal’s defensive events, remember that the majority of their defensive events were performed out wide in the full back areas and in their own penalty area: out wide in the full back areas more than other teams, whilst defensive events within their own penalty area were around the same as other teams.

At each third of the pitch, the most frequent ball progressions were out wide, which places the ball frequently in the full back areas. Due to the nature of defensive events, the only events recorded would be the on-ball actions that were defined including pressures, ball recoveries, tackles etc. The ball needs to be close to you to be able to perform these actions and get them recorded as events, so the opponent ball progression frequently going out wide combined with Arsenal’s defensive events in their defensive wide positions fits together well.

What this doesn’t tell us is whether these are causally linked or merely correlated. I would suggest there are more ball progressions made out wide than centrally throughout football, because defences are more willing to concede that space, so this doesn’t necessarily tell us much about Arsenal specifically. Though in Arsenal’s matches, they do perform defensive events in their full back areas more than other teams, which may suggest there is something more than correlation.

If it is a specified game plan to funnel the ball out wide and perform defensive events there, then Arsenal have done a great job at completing that. It’s a robust defensive plan if you can get it to work, the wider the ball is, the harder it is to score immediately from there. When defending it’s often useful to utilise the ends of the pitch as an ‘extra’ defender, which makes it easier to overwhelm offensive players.

If you’ve made it this far, thanks for your attention. Please find the code in previous blog posts and at https://github.com/ciaran-grant/StatsBomb-Invincibles

@TLMAnalytics

# 26 – Invincibles Defending with StatsBomb Events (v2.0)

**This is a second submission to the below course**

Updates include:

  • Defined structure
  • Including multiple clustering methods (Agglomerative clustering)
  • More attention paid to parameter selection (number of clusters and threshold distance)

The below is a Jupyter Notebook and can be found here alongside respective Python modules: https://github.com/ciaran-grant/StatsBomb-Invincibles

This was an initial submission for the Udacity Data Scientist Nanodegree, updates likely to come from further feedback. https://www.udacity.com/course/data-scientist-nanodegree--nd025

@TLMAnalytics

Udacity Capstone Project – Arsenal’s Invincibles’ Defence

Project Definition

Project Overview

Arsenal’s Invincibles season is unique in the Premier League: they are the only team that has gone a complete 38 game season without losing. They’re considered to be one of the best teams ever to play in the Premier League. Going a whole season without losing suggests they were at least decent at defending, reducing or completely removing the chance of bad luck ruining their perfect record. This project aims to look into how Arsenal’s defence managed this.

Using StatsBomb’s public event data for the Arsenal 03/04 season* (33 games), I take a look at where Arsenal’s defensive actions take place and how opponents attempted to progress the ball and create chances against them.

Problem Statement

The goal is to identify areas of Arsenal’s defensive strengths and the frequent approaches used by opponents. Tasks involved are as follows:

  1. Download and preprocess StatsBomb’s event data
  2. Explore and visualise Arsenal’s defensive actions
  3. Explore and visualise Opponent’s ball progression by thirds
  4. Cluster and evaluate Opponent’s ball progressions
  5. Cluster and evaluate Opponent’s shot creations

Metrics

A major problem when using many clustering algorithms is identifying how many clusters exist in the data since they require that as an input parameter. Sometimes expert judgement can provide a good estimate. However, some clustering algorithms such as agglomerative clustering require other inputs such as distance thresholds to determine clusters.

When using k-means clustering, the following metrics are used to determine the number of clusters; when using agglomerative clustering, the same metrics are used to determine the distance threshold. They consider the density of points within clusters and the separation between clusters.

  1. Sum of squares within cluster:

Calculated using the inertia_ attribute of the k-means class, to compute the sum of squared distances of samples to their closest cluster center. Lower scores for more dense clusters.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

  2. Silhouette Coefficient:

Calculated using the mean intra-cluster distance and mean nearest-cluster distance for each sample. Best value is 1 and worst value is -1.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

  3. Calinski-Harabasz Index:

Also known as the Variance Ratio Criterion, defined as the ratio of between-cluster dispersion to within-cluster dispersion. Higher scores signal clusters are dense and well separated.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html

  4. Davies-Bouldin Index:

Average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Minimum score is 0, lower values for better clusters.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html
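The four metrics above can be computed in a few lines with scikit-learn. A minimal sketch on toy data standing in for progression coordinates (the data here is illustrative, not from the matches):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Two well-separated toy groups standing in for progression rows
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(10, 1, (30, 4)), rng.normal(50, 1, (30, 4))])

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = km.labels_

print("sum of squares (lower is denser):", km.inertia_)
print("silhouette (-1 to 1, higher is better):", silhouette_score(X, labels))
print("calinski-harabasz (higher is better):", calinski_harabasz_score(X, labels))
print("davies-bouldin (>= 0, lower is better):", davies_bouldin_score(X, labels))
```

For a well-separated clustering like this toy example, the silhouette should be close to 1 and the Davies-Bouldin index close to 0.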

Analysis

Import Libraries
In [1]:
import json
from pandas.io.json import json_normalize
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

from CustomPitch import createVerticalPitch
from StatsBombPrep import ball_progression_events_into_thirds
from StatsBombViz import plot_sb_events, plot_sb_event_location, plot_sb_events_clusters, plot_individual_cluster_events
from StatsBombViz import plot_sb_event_grid_density_pitch, plot_histogram_ratio_pitch
from ClusterEval import kmeans_cluster, agglomerative_cluster, cluster_colour_map, cluster_evaluation, plot_cluster_evaluation
Load StatsBomb Data

StatsBomb collect event data for lots of football matches and have made a selection of matches (including all FAWSL) freely available so that amateur projects can take place. All they ask is that you sign up to their StatsBomb Public Data User Agreement here: https://statsbomb.com/academy/. The data is then accessible via their GitHub here: https://github.com/statsbomb/open-data

Their data is stored in JSON files, with the event data for each match identifiable by its match id. The relevant match ids can be found in the matches.json file, and the relevant seasons and competitions in the competitions.json file.

In [2]:
# Load competitions to find correct season and league codes
with open('open-data-master/data/competitions.json') as f:
    competitions = json.load(f)
    
competitions_df = pd.DataFrame(competitions)
competitions_df[competitions_df['competition_name'] == 'Premier League']

# Arsenal Invincibles Season
with open('open-data-master/data/matches/2/44.json', encoding='utf-8') as f:
    matches = json.load(f)

# Find match ids
matches_df = json_normalize(matches, sep="_")
match_id_list = matches_df['match_id']

# Load events for match ids
events_list = []
for match_id in match_id_list:
    with open('open-data-master/data/events/'+str(match_id)+'.json', encoding='utf-8') as f:
        events = json.load(f)
    events_df = json_normalize(events, sep="_")
    events_df['match_id'] = match_id
    events_df = events_df.merge(matches_df[['match_id', 'away_team_away_team_name', 'away_score', 
                                            'home_team_home_team_name', 'home_score']],
                                on='match_id', how='left')    
    events_list.append(events_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
arsenal_events = pd.concat(events_list, ignore_index=True)

print('Number of matches: '+ str(len(match_id_list)))
Number of matches: 33

Data Preprocessing

Since we are working with event data, there is access to the pitch location in x, y coordinates of each recorded event. These are recorded as per their Pitch Coordinates found in the documentation: https://github.com/statsbomb/open-data/blob/master/doc/Open%20Data%20Events%20v4.0.0.pdf

The coordinates are mapped to a horizontal pitch with the origin (0, 0) in the top left corner, and (120, 80) in the bottom right. Since I am interested in defensive analysis from the point of view of Arsenal, I thought it would be easier to interpret if we converted these to a vertical pitch with Arsenal defending from the bottom and the opposition attacking downwards.

The location tuples for the start, pass_end and carry_end are separated into x and y. They are all rotated to fit the vertical pitch, and then I create a universal end location for all progression events.

In [3]:
# Separate locations into x, y
arsenal_events[['location_x', 'location_y']] = arsenal_events['location'].apply(pd.Series)
arsenal_events[['pass_end_location_x', 'pass_end_location_y']] = arsenal_events['pass_end_location'].apply(pd.Series)
arsenal_events[['carry_end_location_x', 'carry_end_location_y']] = arsenal_events['carry_end_location'].apply(pd.Series)

# Create vertical locations
arsenal_events['vertical_location_x'] = 80 - arsenal_events['location_y']
arsenal_events['vertical_location_y'] = arsenal_events['location_x']
arsenal_events['vertical_pass_end_location_x'] = 80 - arsenal_events['pass_end_location_y']
arsenal_events['vertical_pass_end_location_y'] = arsenal_events['pass_end_location_x']
arsenal_events['vertical_carry_end_location_x'] = 80 - arsenal_events['carry_end_location_y']
arsenal_events['vertical_carry_end_location_y'] = arsenal_events['carry_end_location_x']

# Create universal end locations for event type
arsenal_events['end_location_x'] = np.where(arsenal_events['type_name'] == 'Pass',
                                            arsenal_events['pass_end_location_x'],
                                            np.where(arsenal_events['type_name'] == 'Carry',
                                                     arsenal_events['carry_end_location_x'],
                                                     np.nan))
arsenal_events['end_location_y'] = np.where(arsenal_events['type_name'] == 'Pass', 
                                            arsenal_events['pass_end_location_y'],
                                            np.where(arsenal_events['type_name'] == 'Carry', 
                                                     arsenal_events['carry_end_location_y'],
                                                     np.nan))
arsenal_events['vertical_end_location_x'] = np.where(arsenal_events['type_name'] == 'Pass',
                                                     arsenal_events['vertical_pass_end_location_x'],
                                                     np.where(arsenal_events['type_name'] == 'Carry', 
                                                              arsenal_events['vertical_carry_end_location_x'],
                                                              np.nan))
arsenal_events['vertical_end_location_y'] = np.where(arsenal_events['type_name'] == 'Pass', 
                                                     arsenal_events['vertical_pass_end_location_y'],
                                                     np.where(arsenal_events['type_name'] == 'Carry', 
                                                              arsenal_events['vertical_carry_end_location_y'],
                                                              np.nan))

As we are only interested in defensive events, we need to define a list of those event types. The full list of available events is located in the events documentation above.

I have defined all defensive events to use throughout below and filtered all events for those. Due to the nature of event data, these are all on-ball defensive actions. Often defensive plays are off-ball and explicitly deny future events from taking place, so these events may only highlight the end point of a potential sequence of ‘invisible’ defensive plays.

In [4]:
defensive_actions_list = ['Clearance', 'Pressure', 'Duel', 'Ball Recovery',
                          'Foul Committed', 'Block', 'Interception', '50/50']
defensive_actions = arsenal_events[arsenal_events['type_name'].isin(defensive_actions_list)]
arsenal_defensive_actions = arsenal_events[(arsenal_events['team_name'] == 'Arsenal') & 
                                           (arsenal_events['type_name'].isin(defensive_actions_list))]
opponents_defensive_actions = arsenal_events[(arsenal_events['team_name'] != 'Arsenal') & 
                                             (arsenal_events['type_name'].isin(defensive_actions_list))]

Since event data cannot tell the full defensive story, taking a look at the opposition’s offensive plays may help.

Firstly I remove all set pieces in favour of keeping only open play progressions. Defending against open play and set pieces requires different approaches, and I will be focusing on open play progressions here.

We also need to define which event types we class as ball progressions. Here I have defined passes, carries and dribbles. The difference between a Carry and a Dribble is that a Dribble is ‘an attempt by a player to beat an opponent’, whilst a Carry is ‘a player controls the ball at their feet while moving or standing still’. In the data this distinction is clearer since a Carry has a start and end point, whilst a Dribble starts and ends in the same place.

The defined list of ball progression columns was aspirational: things I think would be useful to consider for next time, but didn’t get to look further into.

As mentioned above, when using the vertical pitch it would be useful to have the opponents moving in the opposite direction to the defending team (Arsenal) to feel more interpretable when visualising.

I have separated the ball progressions into thirds on the pitch since different approaches may be taken in different areas of the pitch. More caution would be expected closer to your own goal than in the opponents third.

Narrowing down further towards goal-threatening progressions, I’ve separated out passes and carries into the penalty area and shot assists.

Finally, dribbles are considered separately since they don’t have a separate end location. There is more focus on the progressions via passes and carries since they have quantifiable start to end progressions up the pitch.
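The helper ball_progression_events_into_thirds comes from the custom StatsBombPrep module and its internals aren’t shown here. The sketch below is my guess at what it might do, assuming a vertical 80 x 120 pitch with opponents attacking towards y = 0, thirds at y = 40 and y = 80, and the 10-yard minimum progression mentioned later:

```python
import pandas as pd

def ball_progression_events_into_thirds_sketch(df):
    """Hypothetical split of ball progressions by starting third, plus entries
    into the penalty area. Assumes a vertical 80 x 120 pitch with the
    opposition attacking towards y = 0."""
    # Keep only forward progressions of at least 10 yards (towards y = 0)
    prog = df[(df['vertical_location_y'] - df['vertical_end_location_y']) >= 10]

    from_own_third = prog[prog['vertical_location_y'] >= 80]
    from_mid_third = prog[(prog['vertical_location_y'] >= 40) &
                          (prog['vertical_location_y'] < 80)]
    from_final_third = prog[prog['vertical_location_y'] < 40]

    # Penalty area: 18 yards deep, spanning x in [18, 62] on an 80-yard width
    into_pen_area = prog[(prog['vertical_end_location_y'] <= 18) &
                         (prog['vertical_end_location_x'] >= 18) &
                         (prog['vertical_end_location_x'] <= 62)]
    return from_own_third, from_mid_third, from_final_third, into_pen_area

# Tiny illustrative frame: one qualifying progression starting in each third,
# plus one too short to count
toy = pd.DataFrame({
    'vertical_location_y':     [100, 60, 30, 50],
    'vertical_end_location_y': [85, 30, 10, 45],
    'vertical_end_location_x': [40, 40, 40, 40],
})
own, mid, final, pen = ball_progression_events_into_thirds_sketch(toy)
```

Note that the 10-yard filter would also drop Dribbles (same start and end point), which is consistent with them being handled separately.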

In [5]:
# Remove Set Pieces
open_play_patterns = ['Regular Play', 'From Counter', 'From Keeper']
arsenal_open_play = arsenal_events[arsenal_events['play_pattern_name'].isin(open_play_patterns)]

# Define Opponents Ball Progression
event_types = ['Pass', 'Carry', 'Dribble']
ball_progression_cols = [
    'id', 'player_name', 'period', 'possession', 'duration', 'type_name', 
    'possession_team_name', 'team_name', 'play_pattern_name', 
    'vertical_location_x', 'vertical_location_y', 'vertical_end_location_x', 'vertical_end_location_y',
    'pass_length', 'pass_angle', 'pass_height_name', 
    'pass_body_part_name', 'pass_type_name', 'pass_outcome_name', 
    'ball_receipt_outcome_name', 'pass_switch', 'pass_technique_name',
    'pass_cross', 'pass_through_ball', 
    'pass_shot_assist', 'shot_statsbomb_xg', 'pass_goal_assist', 'pass_cut_back', 'under_pressure'
    ]

# Filter Opponents and Ball Progression Columns
opponent_ball_progressions = arsenal_open_play[(arsenal_open_play['team_name'] != 'Arsenal') & 
                                               (arsenal_open_play['type_name'].isin(event_types))]
opponent_ball_progressions = opponent_ball_progressions[ball_progression_cols]

# Reverse locations for opponents
opponent_ball_progressions['vertical_location_x'] = 80-opponent_ball_progressions['vertical_location_x']
opponent_ball_progressions['vertical_location_y'] = 120-opponent_ball_progressions['vertical_location_y']
opponent_ball_progressions['vertical_end_location_x'] = 80-opponent_ball_progressions['vertical_end_location_x']
opponent_ball_progressions['vertical_end_location_y'] = 120-opponent_ball_progressions['vertical_end_location_y']

# Separate events into thirds, penalty area
from_own_third_opp, from_mid_third_opp, from_final_third_opp, into_pen_area_opp = \
    ball_progression_events_into_thirds(opponent_ball_progressions)

# Filter shot assists
opponent_shot_assists = opponent_ball_progressions[opponent_ball_progressions['pass_shot_assist'] == True]

# Filter dribbles
opponent_dribbles = opponent_ball_progressions[opponent_ball_progressions['type_name'] == 'Dribble']

Data Exploration

This is an incredibly sparse dataset: there are lots of events and lots of categorical columns to provide much-needed context. Most columns aren’t applicable to most events, so there are lots of missing or uninteresting values.

In [6]:
arsenal_defensive_actions.isnull().sum().sort_values(ascending=False)
Out[6]:
vertical_end_location_y     10251
goalkeeper_end_location     10251
goalkeeper_type_name        10251
goalkeeper_position_id      10251
goalkeeper_position_name    10251
                            ...  
home_team_home_team_name        0
away_score                      0
away_team_away_team_name        0
match_id                        0
id                              0
Length: 161, dtype: int64

The most frequent defensive action is a Pressure. Pressures don’t always result in a turnover, whereas Ball Recovery or Interception would.

In [7]:
arsenal_defensive_actions.groupby('type_name').count()['id']
Out[7]:
type_name
50/50               20
Ball Recovery     2020
Block              573
Clearance         1096
Duel              1338
Foul Committed     528
Interception       258
Pressure          4418
Name: id, dtype: int64

There are lots more Carries and Passes than Dribbles, but I expected there to be more Passes relative to Carries than there are.

In [8]:
opponent_ball_progressions.groupby('type_name').count()['id']
Out[8]:
type_name
Carry      5352
Dribble     327
Pass       6398
Name: id, dtype: int64

There are lots more progressions via Passes than Carries in the opposition’s own third and middle third. This is likely due to the reduced risk of a forward pass compared to a carry. If you lose the ball due to a misplaced pass, the ball is likely higher up the field and the passer is closer to their own goal as an extra defender. If you lose the ball due to being tackled, then you likely lose the ball from where you are and are chasing back to catch up.

In [9]:
from_own_third_opp.groupby('type_name').count()['id']
Out[9]:
type_name
Carry    148
Pass     833
Name: id, dtype: int64
In [10]:
from_mid_third_opp.groupby('type_name').count()['id']
Out[10]:
type_name
Carry    172
Pass     676
Name: id, dtype: int64

In the final third, there is a much more even split between Carries and Passes. Perhaps this is due to forward passing becoming much harder further up the pitch, and the reduced risk of Carries since you are so far away from your own goal.

In [11]:
from_final_third_opp.groupby('type_name').count()['id']
Out[11]:
type_name
Carry    488
Pass     531
Name: id, dtype: int64

Although Carries are good at getting near the penalty area, Passes are still the main way into the penalty area.

In [12]:
into_pen_area_opp.groupby('type_name').count()['id']
Out[12]:
type_name
Carry     85
Pass     351
Name: id, dtype: int64

Shot assists are a subset of passes, so there are no Carries here.

In [13]:
opponent_shot_assists.groupby('type_name').count()['id']
Out[13]:
type_name
Pass    81
Name: id, dtype: int64

Data Visualisation

Looking at all defensive events from Arsenal across the season leads to an overplotting mess. It’s pretty hard to tell anything from this.

In [14]:
plot_sb_event_location(arsenal_defensive_actions, alpha = 0.3)
Out[14]:
(<Figure size 360x720 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x252f5c342e8>)

Looking at all of the opponent’s ball progressions is arguably even worse.

In [15]:
plot_sb_events(opponent_ball_progressions, alpha = 0.3)
Out[15]:
(<Figure size 360x720 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x2528fd6d978>)

Due to the lower volume, we can actually see that there are more dribbles out wide than centrally.

In [16]:
plot_sb_event_location(opponent_dribbles)
Out[16]:
(<Figure size 360x720 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x25298c82be0>)

Methodology

Event data records all on-ball actions from a match; this is more granular than high-level team and player totals. As well as the event types, the event locations and extra information for each event type are recorded to provide context where possible. Event data is great for offensive events such as passing and shooting, since they are usually on-ball actions. There are also on-ball defensive actions such as tackles, interceptions and recoveries that are well captured by event data; these are the defensive events that I will be using. With them, we can see where defensive events frequently occur on the pitch for Arsenal compared to their opposition. The key nuance is that just because these are the defensive events recorded doesn’t mean they are all of the defensive plays that take place on the pitch.

The hard part about defensive analysis is that a large proportion of defending consists of non-events, which won’t be captured by event data. If an opportunity to pass the ball to your striker is removed by a defensive player blocking the passing lane, that pass will not happen and will not be recorded. The effective defensive play was to deny a potential event from taking place for the offensive team.

Using event data to evaluate defences can be done with a combination of the defensive events from Arsenal and the offensive events from Arsenal’s opponents. Arsenal’s defensive events will show where their on-ball defending took place. Their opponents’ offensive events will show how opponents approached attacking Arsenal; this may be the offensive team getting their way, or Arsenal’s defence forcing opponents to play in a certain way. Given Arsenal’s success, the majority of the time it’s likely the latter.

To look at Arsenal’s defensive events, I plot the defensive events in a gridded 2D histogram across the pitch with marginal density plots along each axis to highlight areas of high activity. The same is done for Arsenal’s opponents and the differences can be highlighted using the ratio.
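The plotting helpers are custom, but the underlying idea of plot_sb_event_grid_density_pitch can be sketched with plain matplotlib (the function name, bin counts and layout here are my assumptions, not the helper’s actual internals):

```python
import numpy as np
import matplotlib.pyplot as plt

def grid_density_sketch(x, y, pitch_width=80, pitch_length=120, bins=(8, 12)):
    """2D histogram of event locations with marginal histograms on each axis."""
    fig = plt.figure(figsize=(5, 8))
    gs = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                          wspace=0.05, hspace=0.05)
    ax = fig.add_subplot(gs[1, 0])
    ax_histx = fig.add_subplot(gs[0, 0], sharex=ax)
    ax_histy = fig.add_subplot(gs[1, 1], sharey=ax)

    # Gridded counts over the pitch, plus the two marginal distributions
    ax.hist2d(x, y, bins=bins, range=[[0, pitch_width], [0, pitch_length]],
              cmap='Reds')
    ax_histx.hist(x, bins=bins[0], range=(0, pitch_width), color='darkred')
    ax_histy.hist(y, bins=bins[1], range=(0, pitch_length), color='darkred',
                  orientation='horizontal')
    return fig, ax, ax_histx, ax_histy

# Usage with random event locations (illustrative only)
rng = np.random.default_rng(0)
fig, ax, ax_histx, ax_histy = grid_density_sketch(rng.uniform(0, 80, 500),
                                                  rng.uniform(0, 120, 500))
```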

To look at Arsenal’s opponents events, I specifically look at ball progression including passes, carries and dribbles. These are separated into similar locations on the pitch since different approaches will be required:

  1. From their own third forwards
  2. From the middle third forwards
  3. From the final third forwards
  4. Into the penalty area
  5. Shot assists

These categories of ball progressions are clustered using K-Means and Agglomerative Clustering on the start and end locations to group similar ball progressions for each area of the pitch. The assumed number of clusters is decided using expert judgement and four evaluation measures:

  1. Sum of squared variance within cluster
  2. Silhouette Coefficient
  3. Calinski-Harabasz Index
  4. Davies-Bouldin Index

These clusters will represent the approaches that opponents tried to attack and create chances against Arsenal. No concrete results will be able to be drawn from only this, but they will give a better understanding.
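The kmeans_cluster and agglomerative_cluster helpers come from the custom ClusterEval module; a plausible sketch, assuming they cluster on the four-dimensional (start_x, start_y, end_x, end_y) feature vector:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering

LOCATION_COLS = ['vertical_location_x', 'vertical_location_y',
                 'vertical_end_location_x', 'vertical_end_location_y']

def kmeans_cluster_sketch(events, n_clusters):
    """Cluster ball progressions on their start and end coordinates."""
    X = events[LOCATION_COLS].to_numpy()
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=1000).fit_predict(X)

def agglomerative_cluster_sketch(events, distance_threshold):
    """Cut a Ward dendrogram at a distance threshold; n_clusters must be
    None when distance_threshold is set."""
    X = events[LOCATION_COLS].to_numpy()
    return AgglomerativeClustering(n_clusters=None,
                                   distance_threshold=distance_threshold,
                                   linkage='ward').fit_predict(X)

# Usage on synthetic locations (illustrative only)
rng = np.random.default_rng(1000)
toy = pd.DataFrame(rng.uniform(0, 120, size=(80, 4)), columns=LOCATION_COLS)
toy_kmeans_labels = kmeans_cluster_sketch(toy, 5)
toy_agglo_labels = agglomerative_cluster_sketch(toy, 200)
```

With the distance-threshold form, the number of clusters falls out of the threshold rather than being specified up front, which is why the two methods need different evaluation inputs.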

Implementation

Defensive Events

In [17]:
# Arsenal Defensive Events Density Grid
fig, ax, ax_histx, ax_histy = plot_sb_event_grid_density_pitch(arsenal_defensive_actions)

# Title
fig.suptitle("Arsenal Defensive Actions",
             x=0.5, y=0.92,
            fontsize = 'xx-large', fontweight='bold',
            color = 'darkred')

# Direction of play arrow
ax.annotate("",
            xy=(10, 30), xycoords='data',
            xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10,
                            headwidth=20,
                            headlength=20,
                            edgecolor='black',
                            facecolor = 'darkred',
                            connectionstyle="arc3")
            )
Out[17]:
Text(10, 10, '')

Rather than the overplotting mess we saw earlier, this plot is a combination of a 2D histogram grid and marginal density plots along each axis. We can see that the frequency of defensive actions is evenly spread left to right and heavily skewed towards their own half.

More specifically, the highest-action areas are in front of their own goal and out wide in the full back areas above the penalty area. Defensive actions in their own penalty area are expected, as that is the closest area to your own goal and where crosses into the box are dealt with.

The full back areas seem to be more proactive in making defensive actions before the opponent gets closer to the byline. Passes and cutbacks from these areas close to the byline and penalty area usually generate high-quality shooting chances, so minimising the opponent’s ability to get there is great.

In [18]:
# All Defensive Events Density Grid
fig, ax, ax_histx, ax_histy = plot_sb_event_grid_density_pitch(defensive_actions,
                                                               grid_colour_map = 'Purples',
                                                               bar_colour = 'purple')

# Title
fig.suptitle("All Defensive Actions",
             x=0.5, y=0.92,
            fontsize = 'xx-large', fontweight='bold',
            color = 'purple')

# Direction of play arrow
ax.annotate("",
            xy=(10, 30), xycoords='data',
            xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10,
                            headwidth=20,
                            headlength=20,
                            edgecolor='black',
                            facecolor = 'darkviolet',
                            connectionstyle="arc3")
            )
Out[18]:
Text(10, 10, '')

Here we have all defensive events across the matches that Arsenal were in, so Arsenal will be a major contributor to these frequencies. It’s interesting to see the events still largely take place in their own penalty area, but less so in the full back areas.

In [19]:
# Relative Grid Density
fig, ax = plot_histogram_ratio_pitch(arsenal_defensive_actions, defensive_actions)

fig.suptitle("Arsenal's Relative Defensive Actions",
             x=0.5, y=0.87,
            fontsize = 'xx-large', fontweight='bold',
            color = 'darkred')
ax.annotate("",
            xy=(10, 30), xycoords='data',
            xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10,
                            headwidth=20,
                            headlength=20,
                            edgecolor='black',
                            facecolor = 'darkred',
                            connectionstyle="arc3")
            )
fig.tight_layout()

Whilst seeing all defensive events gave some insight into what everyone was doing overall, we can also take a look at the relative difference between Arsenal’s defensive events and the overall view. This density grid shows where Arsenal had more events than overall in red and fewer than overall in blue.

The darkest red areas are again the full back areas, suggesting that Arsenal’s full backs performed more on-ball defensive actions than their opponents. Whereas they defended their penalty area about as evenly as opponents and less frequently in their opponent’s half.
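The idea behind plot_histogram_ratio_pitch can be sketched with NumPy alone (the function name and bin sizes are my assumptions): bin both event sets onto the same grid and divide one histogram by the other.

```python
import numpy as np

def histogram_ratio_sketch(x_a, y_a, x_all, y_all, bins=(8, 12), pitch=(80, 120)):
    """Ratio of one team's binned event counts to the overall binned counts.
    Bins with no events at all are left as NaN rather than dividing by zero."""
    rng = [[0, pitch[0]], [0, pitch[1]]]
    counts_a, _, _ = np.histogram2d(x_a, y_a, bins=bins, range=rng)
    counts_all, _, _ = np.histogram2d(x_all, y_all, bins=bins, range=rng)
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = np.where(counts_all > 0, counts_a / counts_all, np.nan)
    return ratio

# Usage with random locations: the 'team' events are a subset of all events
gen = np.random.default_rng(0)
xa, ya = gen.uniform(0, 80, 200), gen.uniform(0, 120, 200)
xb, yb = gen.uniform(0, 80, 300), gen.uniform(0, 120, 300)
ratio = histogram_ratio_sketch(xa, ya,
                               np.concatenate([xa, xb]),
                               np.concatenate([ya, yb]))
```

Because the team’s events are a subset of all events, each bin’s ratio lies between 0 and 1; the plot then colours bins above and below the team’s overall share.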

Refinement

By taking a look at opponents’ ball progressions we can potentially see the opponents’ point of view here. Do Arsenal’s full back areas have so many defensive events because Arsenal are ‘funneling’ opponents there as a strength, or do Arsenal’s opponents target and exploit their full back areas?

Ball Progression – Own Third

In [20]:
# From Own Third - Ball Progressions
fig, ax = plot_sb_events(from_own_third_opp, alpha = 0.5, figsize = (5, 10), pitch_theme='dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - From Own Third",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
# Dotted red line for own third
ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.7, linewidth=2, linestyle='--', zorder=5)
plt.tight_layout()

Even whilst only looking at the ball progressions from their own third, it’s still hard to identify any trends or patterns of play. There are lots of longer passes, though I have only included progressions of at least 10 yards.

K-Means
In [21]:
# Create cluster evaluation dataframe for up to 50 clusters
own_third_clusters_kmeans = cluster_evaluation(from_own_third_opp, method='kmeans', max_clusters=50)

# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - From Own Third"
fig, axs = plot_cluster_evaluation(own_third_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[21]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - From Own Third')
  1. Sum of Squares (Lower) – More clusters give a lower sum of squares.
  2. Silhouette Coefficient (Higher) – Fewer clusters give a higher coefficient, especially under 5, with a drop around 15.
  3. Calinski-Harabasz Index (Higher) – Fewer clusters give a higher index.
  4. Davies-Bouldin Index (Lower) – Best (lowest) values just less than 10 clusters.

Just less than 10 seems appropriate.

In [53]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_own_third = kmeans_cluster(from_own_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_own_third))))
# Plot each cluster
title = "KMeans Ball Progression Clusters - From Own Third"
fig, axs = plot_individual_cluster_events(3, 3, from_own_third_opp, cluster_labels_own_third, sample_size=10)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters

We find that the most frequent cluster types for progressing the ball from their own third are shorter progressions into the middle third. There are short progressions centrally in clusters 10 and 17, with wider progressions in clusters 4 and 7.

Shorter progressions are easier to complete and less risky, so it’s not surprising that they are the most frequent. This says nothing about the quality or sustainability of these progressions.

Agglomerative Clustering
In [23]:
# Create cluster evaluation dataframe for distance thresholds from 10 to 500
own_third_clusters_agglo = cluster_evaluation(from_own_third_opp, method='agglomerative', min_distance=10, max_distance=500)

# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - From Own Third"
fig, axs = plot_cluster_evaluation(own_third_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[23]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - From Own Third')
  1. Sum of Squares (Lower) – Lower distance gives lower sum of squares, flat from 200 and sharp increase at 350.
  2. Silhouette Coefficient (Higher) – Really low distance or peaks again at around 250.
  3. Calinski-Harabasz Index (Higher) – Higher distance gives higher index.
  4. Davies-Bouldin Index (Lower) – Really low distance or dips again at 350.

250 is chosen as it appears to peak the Silhouette Coefficient, with the other metrics trading off.

In [48]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_own_third = agglomerative_cluster(from_own_third_opp, 250)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_own_third))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - From Own Third"
fig, axs = plot_individual_cluster_events(2, 3, from_own_third_opp, cluster_labels_agglo_own_third, figsize=(10, 10))
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 7 clusters

There are fewer clusters than with the KMeans method. They seem to be weighted more towards the end location: clusters 1, 5 and 6 all end just inside the middle third in the centre, on the left and on the right respectively, whilst the remaining clusters all end closer to the opposite half.

Ball Progression – Middle Third

In [25]:
# From Mid Third - Ball Progressions
fig, ax = plot_sb_events(from_mid_third_opp, alpha = 0.5, figsize = (5, 10), pitch_theme='dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - From Mid Third",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
# Dotted red lines for middle third
ax.hlines(y=[40, 80], xmin=0, xmax=80, color='red', alpha=0.7, linewidth=2, linestyle='--', zorder=5)
fig.tight_layout()

Ball progressions from the middle third seem to have trajectories that are more dense out wide, so the ball rarely travels through the middle of the pitch in Arsenal’s own third. This may be another indicator that the ball is being directed outside, likely because there are lots of players located centrally.

KMeans
In [26]:
# Create cluster evaluation dataframe for up to 50 clusters
mid_third_clusters_kmeans = cluster_evaluation(from_mid_third_opp, method='kmeans', max_clusters=50)

# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - From Mid Third"
fig, axs = plot_cluster_evaluation(mid_third_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
Out[26]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - From Mid Third')
  1. Sum of Squares (Lower) – The more clusters, the lower the error.
  2. Silhouette Coefficient (Higher) – After settling, the peak seems just less than 10.
  3. Calinski-Harabasz Index (Higher) – The fewer clusters, the higher the index.
  4. Davies-Bouldin Index (Lower) – After settling, more clusters lower the index, with a drop around 10.

To balance 1 and 3, whilst satisfying both 2 and 4, 9 clusters seems appropriate.

In [27]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_mid_third = kmeans_cluster(from_mid_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_mid_third))))
# Clustered Ball Progressions - From Mid Third
fig, axs = plot_individual_cluster_events(3, 3, from_mid_third_opp, cluster_labels_mid_third)
# Title
fig.suptitle("KMeans Ball Progression Clusters - From Middle Third",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for middle third across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters

Similar to the progressions from their own third, the most frequent progressions from the middle third are shorter down the wide areas in clusters 2 and 9, with some longer progressions in clusters 7 and 5. Clusters 8 and 4 suggest a number of progressions do make it into the centre of Arsenal’s own half.

Agglomerative
In [28]:
# Create cluster evaluation dataframe for distance thresholds from 10 to 500
own_mid_clusters_agglo = cluster_evaluation(from_mid_third_opp, method='agglomerative', min_distance=10, max_distance=500)
print("There are {} passes.".format(from_mid_third_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - From Mid Third"
fig, axs = plot_cluster_evaluation(own_mid_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 848 passes.
Out[28]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - From Mid Third')
  1. Sum of Squares (Lower) – Lower distance means lower error; flattens at 200 and jumps at 300.
  2. Silhouette Coefficient (Higher) – Higher distance gives a higher coefficient, spiking at 300+.
  3. Calinski-Harabasz Index (Higher) – Higher distance gives a higher index, spiking at 300+.
  4. Davies-Bouldin Index (Lower) – The bottom of the curve seems to be around 200.

300+ is likely not producing sensible results, given the constant error across the board. 200 seems reasonable everywhere.

In [29]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_mid_third = agglomerative_cluster(from_mid_third_opp, 200)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_mid_third))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - From Mid Third"
fig, axs = plot_individual_cluster_events(2, 4, from_mid_third_opp, cluster_labels_agglo_mid_third, figsize=(10,10), sample_size=10)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 8 clusters

A similar number of clusters is found across both methods. Agglomerative clustering again seems to weight the end locations more heavily than the start locations. There are lots of shorter progressions just into the final third, especially in wider areas such as clusters 4 and 3.

It has also found progressions entering the half space just outside the Arsenal penalty area in clusters 1 and 2, which is interesting. These 8 clusters seem to form a 2 row x 4 column grid of end locations.
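One way to probe (or counteract) that end-location bias is to rescale the start and end coordinate columns before clustering: with Euclidean ward distances, scaling a column by a weight w scales its contribution to squared distance by w². A minimal sketch with made-up data and a hypothetical helper, not the notebook's actual method:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in progression features: start (x0, y0) and end (x1, y1).
rng = np.random.default_rng(1)
X = rng.uniform(0, 120, size=(300, 4))

def weighted_agglomerative(X, distance_threshold,
                           start_weight=1.0, end_weight=1.0):
    """Scale start/end columns before ward clustering so that
    neither the start nor the end location dominates the distance."""
    Xw = X.astype(float).copy()
    Xw[:, :2] *= start_weight   # start x/y columns
    Xw[:, 2:] *= end_weight     # end x/y columns
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=distance_threshold,
                                    linkage='ward')
    return model.fit_predict(Xw)

labels_balanced = weighted_agglomerative(X, 200)
labels_start_heavy = weighted_agglomerative(X, 200, start_weight=2.0)
```

Upweighting the start coordinates would force the algorithm to separate progressions by origin as well as destination, which could be compared against the end-location-grid behaviour seen above.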

Ball Progression – Final Third

In [30]:
# From Final Third - Ball Progressions
fig, ax = plot_sb_events(from_final_third_opp, alpha = 0.5, figsize = (5, 10), pitch_theme = 'dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - From Final Third",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
# Dotted red line for final third
ax.hlines(y=40, xmin=0, xmax=80, color='red', alpha=0.7, linewidth=2, linestyle='--', zorder=5)
fig.tight_layout()

Once in the final third, the progressions are less vertically direct. It's hard to tell here, but there appear to be lots of shorter progressions and some longer direct progressions into the box, likely crosses.

KMeans
In [31]:
# Create cluster evaluation dataframe for up to 50 clusters
final_third_clusters_kmeans = cluster_evaluation(from_final_third_opp, method='kmeans', max_clusters=50)
print("There are {} passes.".format(from_final_third_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - From Final Third"
fig, axs = plot_cluster_evaluation(final_third_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 1019 passes.
Out[31]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - From Final Third')
  1. Sum of Squares (Lower) – More clusters gives lower error.
  2. Silhouette Coefficient (Higher) – Fewer clusters gives a higher coefficient.
  3. Calinski-Harabasz Index (Higher) – Fewer clusters gives a higher index.
  4. Davies-Bouldin Index (Lower) – Drop just before 10 looks optimal.

Just before 10 is the only clear signal, so 9 was chosen.
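Reading the cluster count off the Davies-Bouldin curve can also be done programmatically. A minimal sketch on synthetic blob data (not the real pass features), assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Synthetic data: three well-separated blobs so a clear optimum exists.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 5, size=(100, 4))
               for loc in (10, 60, 110)])

# Sweep candidate cluster counts and score each with the DB index.
db_scores = {}
for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=1000).fit_predict(X)
    db_scores[k] = davies_bouldin_score(X, labels)

best_k = min(db_scores, key=db_scores.get)  # lowest DB index wins
```

On real event data the curves are rarely this clean, which is why the notebook cross-checks all four metrics by eye rather than trusting a single argmin.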

In [32]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_final_third = kmeans_cluster(from_final_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_final_third))))
# Clustered Ball Progressions - From Final Third
fig, axs = plot_individual_cluster_events(3, 3, from_final_third_opp, cluster_labels_final_third, sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - From Final Third",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red lines for the third boundaries across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 9 clusters

Lots of short progressions in wide full back areas in the final third. The two least frequent clusters appear to be deep crosses into the penalty area; these are the only consistent progressions into the penalty area.

Agglomerative
In [33]:
# Create cluster evaluation dataframe for up to 300 distance threshold
own_final_clusters_agglo = cluster_evaluation(from_final_third_opp, method='agglomerative', min_distance=10, max_distance=500)
print("There are {} passes.".format(from_final_third_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - From Final Third"
fig, axs = plot_cluster_evaluation(own_final_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 1019 passes.
Out[33]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - From Final Third')
  1. Sum of Squares (Lower) – Lower distance gives lower error, spikes at 250.
  2. Silhouette Coefficient (Higher) – Higher distance gives higher coefficient, spikes at 250.
  3. Calinski-Harabasz Index (Higher) – Higher distance gives higher index, spikes at 250.
  4. Davies-Bouldin Index (Lower) – Optimal looks to be around 150-250.

250 seems appropriate across all.

In [47]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_final_third = agglomerative_cluster(from_final_third_opp, 250)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_final_third))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - From Final Third"
fig, axs = plot_individual_cluster_events(2, 4, from_final_third_opp, cluster_labels_agglo_final_third, figsize=(10,8), sample_size=10)
# Title
fig.suptitle(title,
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
# Dotted red line for the final third across all axes
for ax in axs.flatten():
    ax.hlines(y=40, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are 7 clusters

Again a similar number of clusters is selected, with lots of short progressions in wide areas. The least frequent clusters again appear to be deep wide crosses into the penalty area, and these are the only consistent penalty area entries.

Ball Progression – Penalty Area

In [35]:
# Into Penalty Area - Ball Progressions
fig, ax = plot_sb_events(into_pen_area_opp, alpha = 0.7, figsize = (5, 10), pitch_theme = 'dark')

# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - Into Penalty Area",
             fontdict = dict(fontsize=12,
                             fontweight='bold',
                             color='white'),
            pad=-10)
fig.tight_layout()

The majority of progressions into the penalty area come from out wide, however there are some that are incredibly direct and even from their own half.
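That wide-vs-central split could be quantified directly from the start coordinates. A rough sketch on made-up StatsBomb-style data (120 x 80 pitch), where the ~18-unit wide channels are an assumption for illustration rather than an official definition:

```python
import numpy as np

# Stand-in penalty-area entries: (x_start, y_start, x_end, y_end),
# with ends landing inside a box roughly 102-120 long, 18-62 wide.
rng = np.random.default_rng(3)
passes = rng.uniform([0, 0, 102, 18], [120, 80, 120, 62], size=(436, 4))

# Wide channels: the outer ~18 units along each touchline (assumed).
y_start = passes[:, 1]
wide = (y_start < 18) | (y_start > 62)
wide_share = wide.mean()  # fraction of entries starting in wide areas
```

Computing this share per cluster would put a number on the visual impression that most penalty-area entries originate out wide.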

KMeans
In [36]:
# Create cluster evaluation dataframe for up to 50 clusters
pen_area_clusters_kmeans = cluster_evaluation(into_pen_area_opp, method='kmeans', max_clusters=50)
print("There are {} passes.".format(into_pen_area_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - Penalty Area"
fig, axs = plot_cluster_evaluation(pen_area_clusters_kmeans)
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 436 passes.
Out[36]:
Text(0.5, 0.98, 'KMeans Cluster Evaluation - Penalty Area')
  1. Sum of Squares (Lower) – More clusters gives lower error.
  2. Silhouette Coefficient (Higher) – Very high coefficient below 5 clusters.
  3. Calinski-Harabasz Index (Higher) – Fewer clusters gives a higher index.
  4. Davies-Bouldin Index (Lower) – Very low index below 5 clusters.

The peak Silhouette Coefficient matches the minimum Davies-Bouldin Index at 4 clusters here.
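When two metrics agree like this, the pick can be automated. A small sketch against an illustrative evaluation table, where the column names are assumptions about what `cluster_evaluation` returns and the numbers are made up to mirror the agreement at 4:

```python
import pandas as pd

# Illustrative evaluation table (assumed shape and column names).
eval_df = pd.DataFrame({
    'n_clusters':     [2, 3, 4, 5, 6],
    'silhouette':     [0.41, 0.48, 0.55, 0.50, 0.46],
    'davies_bouldin': [0.90, 0.75, 0.60, 0.68, 0.72],
})

# Silhouette wants a maximum, Davies-Bouldin a minimum.
k_sil = eval_df.loc[eval_df['silhouette'].idxmax(), 'n_clusters']
k_db = eval_df.loc[eval_df['davies_bouldin'].idxmin(), 'n_clusters']
agree = (k_sil == k_db)  # both point to k=4 in this toy table
```

Agreement between independent criteria is a stronger signal than either alone, which is why 4 is a comfortable choice here despite being a coarse segmentation.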

In [37]:
np.random.seed(1000)

# KMeans based on chosen number of clusters
cluster_labels_pen_area = kmeans_cluster(into_pen_area_opp, 4)
print("There are {} clusters".format(len(np.unique(cluster_labels_pen_area))))
# Plot individual clusters
fig, axs = plot_individual_cluster_events(2, 2, into_pen_area_opp, cluster_labels_pen_area, sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - Into Penalty Area",
             fontsize = 'xx-large', fontweight='bold',
             color = 'white',
             x=0.5, y=1.01)
There are 4 clusters
Out[37]:
Text(0.5, 1.01, 'KMeans Ball Progression Clusters - Into Penalty Area')

A smaller number of clusters is selected here, grouping the progressions into broader groups: split between longer and shorter, and from each wide area. This doesn't really help us.

Agglomerative
In [38]:
# Create cluster evaluation dataframe for up to 500 distance threshold
pen_area_clusters_agglo = cluster_evaluation(into_pen_area_opp, method='agglomerative', min_distance=10, max_distance=500)
print("There are {} passes.".format(into_pen_area_opp.shape[0]))
# Plot cluster evaluation metrics by cluster number
title = "Agglomerative Cluster Evaluation - Penalty Area"
fig, axs = plot_cluster_evaluation(pen_area_clusters_agglo, method='agglomerative')
fig.suptitle(title,
            fontsize = 'xx-large', fontweight='bold')
There are 436 passes.
Out[38]:
Text(0.5, 0.98, 'Agglomerative Cluster Evaluation - Penalty Area')
  1. Sum of Squares (Lower) – Lower distance gives lower error.
  2. Silhouette Coefficient (Higher) – Spike at 200 to max.
  3. Calinski-Harabasz Index (Higher) – Spike at 250 to max.
  4. Davies-Bouldin Index (Lower) – Lowest near 0 and around 250.

250 seems appropriate across the board.

In [39]:
np.random.seed(1000)

# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_pen_area = agglomerative_cluster(into_pen_area_opp, 250)
print("There are {}