Project Overview
Arsenal’s Invincibles season is unique in the Premier League: they are the only team to have gone a complete 38-game season without losing, and they are considered one of the best teams ever to play in the league. Going a whole season unbeaten suggests they were at least decent at defending, enough to reduce or remove the chance of bad luck ruining their perfect record. This project looks into how Arsenal’s defence managed this. Training a model is not necessary, as this is a descriptive task rather than a predictive one. Instead, I focus on answering the questions with data visualisations that represent and communicate the answers effectively.
Problem Statement
The goal is to identify areas of Arsenal’s defensive strengths and the frequent approaches used by opponents. Tasks involved are as follows:
- Download and preprocess StatsBomb’s event data.
- Explore and visualise Arsenal’s defensive actions.
- Explore and visualise opponents’ ball progression by thirds.
- Cluster and evaluate opponents’ ball progressions using several clustering methods.
- Visualise clustering results to aid understanding.
- Cluster and evaluate opponents’ shot creations using several clustering methods.
- Visualise clustering results to aid understanding.
Dataset Description
Using StatsBomb’s public event data for the Arsenal 03/04 season* (33 games), I take a look at where Arsenal’s defensive actions take place and how opponents attempted to progress the ball and create chances against them.
Since we are working with event data, there is access to the pitch location in x, y coordinates of each recorded event. These are recorded as per their Pitch Coordinates found in the documentation: https://github.com/statsbomb/open-data/blob/master/doc/Open%20Data%20Events%20v4.0.0.pdf
There are 112,773 rows and 161 columns in the data. Each row is an event from a game that has a unique id, timestamp, team, player and event information such as type and location. We will not need all events as we will filter for only defensive event types and then separately for ball progressions from Arsenal’s opponents.
There are 10,251 Arsenal defensive events and 12,077 opponent ball progression events. Example rows of each dataset are below:


Metrics
To determine the appropriate number of clusters for K-Means and distance threshold for Agglomerative clustering, the below metrics are used. They try to consider the density of points within clusters and between clusters.
- Sum of squares within cluster:
Calculated using the inertia_ attribute of the KMeans class, which is the sum of squared distances of samples to their closest cluster centre. Lower scores indicate denser clusters.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- Silhouette Coefficient:
Calculated using the mean intra-cluster distance and mean nearest-cluster distance for each sample. Best value is 1 and worst value is -1.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
- Calinski-Harabasz Index:
Also known as the Variance Ratio Criterion, defined as the ratio of between-cluster dispersion to within-cluster dispersion. Higher scores signal clusters that are dense and well separated.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html
- Davies-Bouldin Index:
Average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Minimum score is 0, lower values for better clusters.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html
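To make the direction of each metric concrete, here is a minimal sketch that scores a sensible clustering against a deliberately poor one (the same labels, shuffled). The data is synthetic, and the four features standing in for start and end coordinates are an assumption for illustration. The helper names `within_cluster_sum_of_squares` and `evaluate` are hypothetical and are not part of the project's ClusterEval module.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def within_cluster_sum_of_squares(X, labels):
    """Sum of squared distances from each sample to its cluster mean."""
    return float(sum(
        ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
        for k in np.unique(labels)
    ))


def evaluate(X, labels):
    """Collect the four metrics for one set of cluster labels."""
    return {
        "sum_of_squares": within_cluster_sum_of_squares(X, labels),  # lower is better
        "silhouette": metrics.silhouette_score(X, labels),  # higher is better
        "calinski_harabasz": metrics.calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": metrics.davies_bouldin_score(X, labels),  # lower is better
    }


# Synthetic stand-in data with 4 features (assumed: start x/y and end x/y).
X, _ = make_blobs(n_samples=300, centers=4, n_features=4, random_state=0)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
# Shuffling the labels keeps cluster sizes but destroys the structure.
shuffled_labels = np.random.default_rng(0).permutation(kmeans_labels)

good = evaluate(X, kmeans_labels)
poor = evaluate(X, shuffled_labels)
for name in good:
    print(f"{name:>18}: kmeans={good[name]:.3f}  shuffled={poor[name]:.3f}")
```

The shuffled labels should score worse on every metric, which is a quick sanity check on which way each index points.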
Data Processing
In [1]:
import json
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# pandas.io.json.json_normalize is deprecated since pandas 1.0; import it from pandas directly
from pandas import json_normalize
from sklearn import metrics

warnings.filterwarnings('ignore')

# Project helper modules
from CustomPitch import createVerticalPitch
from StatsBombPrep import ball_progression_events_into_thirds
from StatsBombViz import plot_sb_events, plot_sb_event_location, plot_sb_events_clusters, plot_individual_cluster_events
from StatsBombViz import plot_sb_event_grid_density_pitch, plot_histogram_ratio_pitch
from ClusterEval import kmeans_cluster, agglomerative_cluster, cluster_colour_map, cluster_evaluation, plot_cluster_evaluation
StatsBomb collect event data for many football matches and have made a selection of matches (including all FAWSL) freely available so that amateur projects can take place. All they ask is that you sign up to their StatsBomb Public Data User Agreement here: https://statsbomb.com/academy/. You can then access the data via their GitHub here: https://github.com/statsbomb/open-data
Their data is stored in json files, with the event data for each match identifiable by their match ids. To get the relevant match ids, you can find those in the matches.json file. To get the relevant season and competitions, you can find that in the competitions.json file.
In [2]:
# Load competitions to find the correct season and league codes
with open('open-data-master/data/competitions.json') as f:
    competitions = json.load(f)
competitions_df = pd.DataFrame(competitions)
competitions_df[competitions_df['competition_name'] == 'Premier League']

# Arsenal Invincibles season (competition 2, season 44)
with open('open-data-master/data/matches/2/44.json', encoding='utf-8') as f:
    matches = json.load(f)

# Find match ids
matches_df = json_normalize(matches, sep="_")
match_id_list = matches_df['match_id']

# Load events for each match id and attach the match-level team names and scores
match_info = matches_df[['match_id', 'away_team_away_team_name', 'away_score', 'home_team_home_team_name', 'home_score']]
events_list = []
for match_id in match_id_list:
    with open('open-data-master/data/events/' + str(match_id) + '.json', encoding='utf-8') as f:
        events = json.load(f)
    events_df = json_normalize(events, sep="_")
    events_df['match_id'] = match_id
    events_df = events_df.merge(match_info, on='match_id', how='left')
    events_list.append(events_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
arsenal_events = pd.concat(events_list)
print('Number of matches: ' + str(len(match_id_list)))
- The location tuples for the start, pass_end and carry_end locations are split into separate coordinates. They are all rotated to fit a vertical pitch, and I then create a universal end location for all progression events.
- I define the set of defensive event types used throughout and filter all events for those.
- I first remove all set pieces and keep only open play progressions. Defending from open play and from set pieces requires different approaches, and I focus on open play here.
- I also define which event types are classed as ball progressions.
- On the vertical pitch, the opponents are shown moving in the opposite direction to the defending team (Arsenal), which makes the visualisations easier to interpret.
- I separate the ball progressions into thirds of the pitch, since different approaches may be taken in different areas.
- Narrowing down towards goal-threatening progressions, I separate out passes and carries into the penalty area, and shot assists.
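The preprocessing steps above could be sketched roughly as follows. This is not the project's StatsBombPrep code: the function `prepare_progressions`, the set of pass types treated as set pieces, the restriction to passes and carries, and the exact rotation convention (opponents attacking down a vertical pitch 80 wide and 120 long) are all assumptions made for illustration. The column names follow what `json_normalize(events, sep="_")` produces for StatsBomb events.

```python
import numpy as np
import pandas as pd

PITCH_LENGTH, PITCH_WIDTH = 120, 80  # StatsBomb pitch dimensions

# Assumed set of pass types treated as set pieces (hypothetical choice).
SET_PIECE_PASS_TYPES = {"Corner", "Free Kick", "Throw-in", "Goal Kick", "Kick Off"}


def prepare_progressions(events: pd.DataFrame) -> pd.DataFrame:
    """Filter to open play passes and carries and add vertical-pitch coordinates."""
    df = events[events["type_name"].isin(["Pass", "Carry"])].copy()
    df = df[~df["pass_type_name"].isin(SET_PIECE_PASS_TYPES)]

    # One universal end location: pass end for passes, carry end for carries.
    end_locations = [
        p if t == "Pass" else c
        for t, p, c in zip(df["type_name"], df["pass_end_location"], df["carry_end_location"])
    ]
    start = np.array(df["location"].tolist(), dtype=float)
    end = np.array(end_locations, dtype=float)

    # Rotate so the opponents attack down the vertical pitch (assumed convention).
    df["start_x"] = PITCH_WIDTH - start[:, 1]
    df["start_y"] = PITCH_LENGTH - start[:, 0]
    df["end_x"] = PITCH_WIDTH - end[:, 1]
    df["end_y"] = PITCH_LENGTH - end[:, 0]

    # Thirds are defined on the original lengthwise coordinate of the start location.
    df["third"] = pd.cut(start[:, 0], bins=[0, 40, 80, 120],
                         labels=["own", "middle", "final"], include_lowest=True)
    return df


# Tiny made-up example: an open play pass, a carry, a corner and a pressure.
events = pd.DataFrame({
    "type_name": ["Pass", "Carry", "Pass", "Pressure"],
    "pass_type_name": [np.nan, np.nan, "Corner", np.nan],
    "location": [[30.0, 20.0], [60.0, 40.0], [120.0, 0.0], [50.0, 50.0]],
    "pass_end_location": [[55.0, 10.0], np.nan, [110.0, 40.0], np.nan],
    "carry_end_location": [np.nan, [85.0, 45.0], np.nan, np.nan],
})
progressions = prepare_progressions(events)
print(progressions[["type_name", "start_x", "start_y", "end_x", "end_y", "third"]])
```

With this convention, a progression starting in the opponent's own third sits above y=80 on the vertical pitch, which matches the dotted lines drawn at y=80 and y=40 in the plots later on.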
Exploratory Analysis
The most frequent defensive action is a Pressure. Pressures don’t always result in a turnover, whereas a Ball Recovery or Interception would.
In [7]:
arsenal_defensive_actions.groupby('type_name').count()['id']
Out[7]:
There are lots more Carries and Passes than Dribbles, but I expected there to be more Passes relative to Carries than there are.
In [8]:
opponent_ball_progressions.groupby('type_name').count()['id']
Out[8]:
There are lots more progressions via Passes than Carries in the opposition’s own third and middle third. This is likely due to the reduced risk of a forward pass compared to a carry. If you lose the ball due to a misplaced pass, the ball is likely higher up the field and the passer is closer to their own goal as an extra defender. If you lose the ball due to being tackled, then you likely lose the ball where you are and have to chase back to catch up.
In [9]:
from_own_third_opp.groupby('type_name').count()['id']
Out[9]:
In [10]:
from_mid_third_opp.groupby('type_name').count()['id']
Out[10]:
In the final third, there is a much more even split between Carries and Passes. This is perhaps because forward passing becomes much harder the further up the pitch you are, while Carries carry less risk since you are so far away from your own goal.
In [11]:
from_final_third_opp.groupby('type_name').count()['id']
Out[11]:
Although Carries are good at getting near the penalty area, Passes are still the main way into the penalty area.
In [12]:
into_pen_area_opp.groupby('type_name').count()['id']
Out[12]:
Shot assists are a subset of passes, so there are no Carries here.
In [13]:
opponent_shot_assists.groupby('type_name').count()['id']
Out[13]:
Looking at all defensive events from Arsenal across the season leads to an overplotted mess. There are so many defensive events that it’s hard to see any clear trends, apart from there being fewer higher up the pitch.
In [14]:
plot_sb_event_location(arsenal_defensive_actions, alpha = 0.3)
Out[14]:
Looking at all of the opponents’ ball progressions is just as difficult to understand. There are lots of progressions in the middle and down the wide areas.
In [15]:
plot_sb_events(opponent_ball_progressions, alpha = 0.3)
Out[15]:
Data Visualisation
Defensive Events
To better understand and explore Arsenal’s defensive events, the plot below combines a 2D histogram grid with marginal density plots along each axis. The frequency of defensive actions is evenly spread from left to right and heavily skewed towards their own half.
More specifically, the highest action areas are in front of their own goal and out wide in the full back areas above the penalty area. Defensive actions in their own penalty area are expected, as that is the area closest to your goal and where crosses into the box are dealt with.
The full backs seem to be proactive in making defensive actions before the opponent gets closer to the byline. Passes and cutbacks from areas close to the byline and penalty area usually generate high quality shooting chances, so minimising the opponent’s ability to get there is valuable.
In [17]:
# Arsenal Defensive Events Density Grid
fig, ax, ax_histx, ax_histy = plot_sb_event_grid_density_pitch(arsenal_defensive_actions)
# Title
fig.suptitle("Arsenal Defensive Actions", x=0.5, y=0.92, fontsize = 'xx-large', fontweight='bold', color = 'darkred')
# Direction of play arrow
ax.annotate("", xy=(10, 30), xycoords='data', xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10, headwidth=20, headlength=20, edgecolor='black', facecolor = 'darkred', connectionstyle="arc3"))
Out[17]:
We can also take a look at the relative difference between Arsenal’s defensive events and the overall view. This density grid shows where Arsenal had more events than overall in red and fewer than overall in blue.
The darkest red areas are again the full back areas, suggesting that Arsenal’s full backs performed more on-ball defensive actions than their opponents. In contrast, they defended their penalty area about as often as their opponents did, and defended less frequently in their opponent’s half.
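One plausible way to build such a relative grid is to bin both sets of events into the same 2D histogram, convert each to a share of its own total, and take the ratio per cell. The sketch below is an assumption about the general technique, not the implementation behind the project's `plot_histogram_ratio_pitch` helper; the function name `relative_density` and the 10 by 10 unit cell size are hypothetical.

```python
import numpy as np


def relative_density(team_xy, all_xy, bins=(8, 12), pitch=((0, 80), (0, 120))):
    """Ratio of a team's share of events per cell to the overall share per cell."""
    team_hist, _, _ = np.histogram2d(team_xy[:, 0], team_xy[:, 1], bins=bins, range=pitch)
    all_hist, _, _ = np.histogram2d(all_xy[:, 0], all_xy[:, 1], bins=bins, range=pitch)
    team_share = team_hist / team_hist.sum()
    all_share = all_hist / all_hist.sum()
    # Ratio above 1: relatively more team events in the cell; below 1: relatively fewer.
    # Cells with no events at all are left as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(all_share > 0, team_share / all_share, np.nan)


# Made-up example on a vertical pitch: one event at the centre of every 10x10 cell
# overall, with the team's events confined to the half where y < 60.
xs, ys = np.meshgrid(np.arange(5, 80, 10), np.arange(5, 120, 10))
all_xy = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
team_xy = all_xy[all_xy[:, 1] < 60]

ratio = relative_density(team_xy, all_xy)
print(ratio.shape)
print(ratio[:, 0], ratio[:, -1])
```

In the example the team's half of the grid comes out at 2 (twice the overall share) and the other half at 0. A diverging colour map centred on 1 would then give the red and blue shading described above.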
In [19]:
# Relative Grid Density
fig, ax = plot_histogram_ratio_pitch(arsenal_defensive_actions, defensive_actions)
fig.suptitle("Arsenal's Relative Defensive Actions", x=0.5, y=0.87, fontsize = 'xx-large', fontweight='bold', color = 'darkred')
ax.annotate("", xy=(10, 30), xycoords='data', xytext=(10, 10), textcoords='data',
            arrowprops=dict(width=10, headwidth=20, headlength=20, edgecolor='black', facecolor = 'darkred', connectionstyle="arc3"))
fig.tight_layout()
Opponent Ball Progression
By taking a look at opponents’ ball progressions we can potentially see the opponent’s point of view. Do Arsenal’s full back areas have so many defensive events because Arsenal are ‘funnelling’ their opponents there, seeing it as a strength, or do Arsenal’s opponents target and exploit their full back areas?
Across each third of the pitch, both KMeans and Agglomerative clustering methods were used. To identify the optimal number of clusters (KMeans) and distance threshold (Agglomerative), a range of cluster numbers and distance thresholds were tried and evaluated across the specified metrics. The chosen values were then used to create and visualise the clusters. There is a full example using own third progressions, followed by results from the remaining thirds, below.
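The evaluation procedure described above can be sketched as a simple sweep: fit one model per candidate value, then record the resulting number of clusters and the scores. This is a minimal illustration on synthetic data, not the project's `cluster_evaluation` function. The helper name `sweep`, the candidate values, and the use of scikit-learn's default (Ward) linkage for the agglomerative model are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def sweep(X, method, values):
    """Fit one model per candidate value and record its cluster count and scores."""
    rows = []
    for value in values:
        if method == "kmeans":
            model = KMeans(n_clusters=value, n_init=10, random_state=0)
        else:
            # n_clusters=None lets the distance threshold decide where to cut the tree.
            model = AgglomerativeClustering(n_clusters=None, distance_threshold=value)
        labels = model.fit_predict(X)
        n_clusters = len(np.unique(labels))
        if n_clusters < 2 or n_clusters >= len(X):
            continue  # the scores need between 2 and n_samples - 1 clusters
        rows.append({
            "value": value,
            "n_clusters": n_clusters,
            "silhouette": silhouette_score(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in for progression features (assumed: start x/y and end x/y).
X, _ = make_blobs(n_samples=300, centers=4, n_features=4, random_state=0)

kmeans_results = sweep(X, "kmeans", range(2, 8))
agglo_results = sweep(X, "agglomerative", [5, 10, 20, 50])
print(kmeans_results.round(3))
print(agglo_results.round(3))
```

Note that for the agglomerative model a larger distance threshold gives fewer clusters, so the two sweeps run in opposite directions when read in terms of cluster count.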
Own Third
In [20]:
# From Own Third - Ball Progressions
fig, ax = plot_sb_events(from_own_third_opp, alpha = 0.5, figsize = (5, 10), pitch_theme='dark')
# Set figure colour to dark theme
fig.set_facecolor("#303030")
# Title
ax.set_title("Opponent Ball Progression - From Own Third", fontdict = dict(fontsize=12, fontweight='bold', color='white'), pad=-10)
# Dotted red line for own third
ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.7, linewidth=2, linestyle='--', zorder=5)
plt.tight_layout()
In [21]:
# Create cluster evaluation dataframe for up to 50 clusters
own_third_clusters_kmeans = cluster_evaluation(from_own_third_opp, method='kmeans', max_clusters=50)
# Plot cluster evaluation metrics by cluster number
title = "KMeans Cluster Evaluation - From Own Third"
fig, axs = plot_cluster_evaluation(own_third_clusters_kmeans)
fig.suptitle(title, fontsize = 'xx-large', fontweight='bold')
Out[21]:
- Sum of Squares (Lower) – More clusters gives lower sum of squares.
- Silhouette Coefficient (Higher) – Fewer clusters gives a higher coefficient, especially under 5, and it drops at around 15.
- Calinski-Harabasz Index (Higher) – Fewer clusters gives a higher index.
- Davies-Bouldin Index (Lower) – Max just less than 10
Just under 10 clusters seems appropriate, so 9 clusters are used below.
In [53]:
np.random.seed(1000)
# KMeans based on chosen number of clusters
cluster_labels_own_third = kmeans_cluster(from_own_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_own_third))))
# Plot each cluster
title = "KMeans Ball Progression Clusters - From Own Third"
fig, axs = plot_individual_cluster_events(3, 3, from_own_third_opp, cluster_labels_own_third, sample_size=10)
# Title
fig.suptitle(title, fontsize = 'xx-large', fontweight='bold', color = 'white', x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
In [23]:
# Create cluster evaluation dataframe for distance thresholds from 10 to 500
own_third_clusters_agglo = cluster_evaluation(from_own_third_opp, method='agglomerative', min_distance=10, max_distance=500)
# Plot cluster evaluation metrics by distance threshold
title = "Agglomerative Cluster Evaluation - From Own Third"
fig, axs = plot_cluster_evaluation(own_third_clusters_agglo, method='agglomerative')
fig.suptitle(title, fontsize = 'xx-large', fontweight='bold')
Out[23]:
- Sum of Squares (Lower) – Lower distance gives lower sum of squares, flat from 200 with a sharp increase at 350.
- Silhouette Coefficient (Higher) – Highest at really low distances, then peaks again at around 250.
- Calinski-Harabasz Index (Higher) – Higher distance gives higher index.
- Davies-Bouldin Index (Lower) – Lowest at really low distances, then dips again at 350.
A distance threshold of 250 is chosen, as the Silhouette Coefficient appears to peak there while the other metrics trade off against each other.
In [48]:
np.random.seed(1000)
# Agglomerative Clustering based on chosen distance threshold
cluster_labels_agglo_own_third = agglomerative_cluster(from_own_third_opp, 250)
print("There are {} clusters".format(len(np.unique(cluster_labels_agglo_own_third))))
# Plot each cluster
title = "Agglomerative Ball Progression Clusters - From Own Third"
fig, axs = plot_individual_cluster_events(2, 3, from_own_third_opp, cluster_labels_agglo_own_third, figsize=(10, 10))
# Title
fig.suptitle(title, fontsize = 'xx-large', fontweight='bold', color = 'white', x=0.5, y=1.01)
# Dotted red lines for own third across all axes
for ax in axs.flatten():
    ax.hlines(y=80, xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are fewer clusters than with the KMeans method. They seem to be more weighted towards the end location: clusters 1, 5 and 6 all end just inside the middle third, in the centre, left and right side respectively, whilst the remaining clusters all end closer to the opposite half.
The remaining thirds all underwent the same evaluation process and KMeans produced more appropriate groupings across the board. Agglomerative clustering focused on the end locations when grouping and grouped according to a structured grid which wasn’t the point of this exercise.
Middle Third
In [27]:
np.random.seed(1000)
# KMeans based on chosen number of clusters
cluster_labels_mid_third = kmeans_cluster(from_mid_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_mid_third))))
# Clustered Ball Progressions - From Mid Third
fig, axs = plot_individual_cluster_events(3, 3, from_mid_third_opp, cluster_labels_mid_third)
# Title
fig.suptitle("KMeans Ball Progression Clusters - From Middle Third", fontsize = 'xx-large', fontweight='bold', color = 'white', x=0.5, y=1.01)
# Dotted red lines for middle third across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
Similar to the progressions from their own third, the most frequent progressions from the middle third are shorter down the wide areas in clusters 2 and 9, with some longer progressions in clusters 7 and 5. Clusters 8 and 4 suggest a number of progressions do make it into the centre of Arsenal’s own half.
Final Third
In [32]:
np.random.seed(1000)
# KMeans based on chosen number of clusters
cluster_labels_final_third = kmeans_cluster(from_final_third_opp, 9)
print("There are {} clusters".format(len(np.unique(cluster_labels_final_third))))
# Clustered Ball Progressions - From Final Third
fig, axs = plot_individual_cluster_events(3, 3, from_final_third_opp, cluster_labels_final_third, sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - From Final Third", fontsize = 'xx-large', fontweight='bold', color = 'white', x=0.5, y=1.01)
# Dotted red lines marking the thirds across all axes
for ax in axs.flatten():
    ax.hlines(y=[40,80], xmin=0, xmax=80, color='red', alpha=0.5, linewidth=2, linestyle='--', zorder=5)
There are lots of short progressions in the wide full back areas in the final third. The two least frequent clusters appear to be deep crosses into the penalty area, and these are the only consistent progressions into the penalty area.
Penalty Area
In [37]:
np.random.seed(1000)
# KMeans based on chosen number of clusters
cluster_labels_pen_area = kmeans_cluster(into_pen_area_opp, 4)
print("There are {} clusters".format(len(np.unique(cluster_labels_pen_area))))
# Plot individual clusters
fig, axs = plot_individual_cluster_events(2, 2, into_pen_area_opp, cluster_labels_pen_area, sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - Into Penalty Area", fontsize = 'xx-large', fontweight='bold', color = 'white', x=0.5, y=1.01)
Out[37]:
A smaller number of clusters is selected here, giving broader groups. These seem to be split into longer and shorter progressions, and by each wide area.
Shot Assists
In [42]:
np.random.seed(1000)
# KMeans based on chosen number of clusters
cluster_labels_shot_assist = kmeans_cluster(opponent_shot_assists, 4)
print("There are {} clusters".format(len(np.unique(cluster_labels_shot_assist))))
# Clustered Ball Progressions - Shot Assists
fig, axs = plot_individual_cluster_events(2, 2, opponent_shot_assists, cluster_labels_shot_assist, figsize=(8,10), sample_size=10)
# Title
fig.suptitle("KMeans Ball Progression Clusters - Shot Assists", fontsize = 'xx-large', fontweight='bold', color = 'white', x=0.5, y=1.01)
Out[42]:
When looking at only shot assists, there are lots from the left full back area as well as passes through the middle. Both of these are areas where if you can complete a pass towards goal then there is a higher chance of generating a shot. However, shot assists that do not penetrate the penalty area suggest that those shots are likely from further out and of lower quality.
Conclusion
Both KMeans and agglomerative clustering produced a similar number of clusters after evaluation and reasonable groupings after visualisation. The KMeans clusters appear to align trajectories of progression from start to end locations fairly well, and the samples in the visualisations show consistent start and end locations within each cluster. Unlike the agglomerative clusters, there doesn’t seem to be a grid-like pattern to the KMeans groupings. The agglomerative clusters appeared to be more heavily weighted towards the end locations of progressions, which means that lots of the clusters were grouped into grids primarily focused on where they ended up. The aim of using clustering approaches was to try to find patterns that weren’t readily available to a human analyst, so in this sense the agglomerative clustering didn’t add anything extra.
There is no indication as to the quality of the shots produced and there is no context to compare to the wider league. What is promising is that nothing significant stands out, and the common breaches of the Arsenal defence are just good progressions when they do come off. It’s not clear here how many more of these types of opportunities were prevented by defensive plays off the ball.
Across ball progressions throughout the whole pitch, the most frequent were shorter and in the wide areas. This is expected as shorter progressions are lower risk and wide areas are less important to defend, so are areas the defending team are willing to concede.
The length of progressions from the middle third appears to be longer than in their own third or in the final third. Context is needed in all individual circumstances, but this may be due to a lower risk of failure from being further away from your own goal, or to trying to take advantage of a short window of opportunity to quickly progress the ball longer distances into the final third.
When in the final third, the progressions shorten again. This is not due to lower risk, but likely due to a more densely populated area. The majority of all players on the pitch will be within a single half, and navigating through there requires precision and patience from the offensive team.
Any completed progression into the penalty area is a success for the offensive team. There is a high chance you will create a shot, and if you do it’s likely to be a higher quality shot than from outside the penalty area. However, not all completed progressions into the penalty area are created equal. If it’s a completed carry into the penalty area then you likely have the ball under control and can get a pretty good shot or extra pass. If it’s a completed pass then it depends how the player receives it: aerially or on the ground makes a difference to the shot quality. Aerial passes are harder to control and headers are of lower quality than shots with the feet, but aerial passes are usually easier to complete into the penalty area. So they offer higher quantity but lower quality than ground passes or carries.
Shot assists rely on there being a shot at the end of them. So, circularly, they are ‘good’ progressions because they created a shot. As we can see, they are much more random, which means it’s harder to understand without context why they created shooting opportunities, since the locations alone don’t tell us anything. Although the context is available within StatsBomb’s data, I haven’t taken a further look here.
When considering how these progressions affect Arsenal’s defensive events, remember that the majority of their defensive events were performed out wide in the full back areas and in their own penalty area. They performed more defensive events out wide in the full back areas than other teams, whilst their defensive events within their own penalty area were around the same as other teams.
At each third of the pitch, the most frequent ball progressions were out wide, which places the ball frequently in the full back areas. Due to the nature of defensive events, the only events recorded would be the on-ball actions that were defined including pressures, ball recoveries, tackles etc. The ball needs to be close to you to be able to perform these actions and get them recorded as events, so the opponent ball progression frequently going out wide combined with Arsenal’s defensive events in their defensive wide positions fits together well.
What this doesn’t tell us is whether these are causally linked or just correlated. I would suggest there are more ball progressions made out wide than centrally in all of football, because the defence is more willing to concede that space, so this doesn’t necessarily tell us much about Arsenal specifically. In Arsenal’s matches, though, they do perform defensive events in their full back areas more than other teams, which may suggest that there is something more than just correlation.
If it is a specified game plan to funnel the ball out wide and perform defensive events there, then Arsenal have done a great job of carrying it out. It’s a robust defensive plan if you can get it to work: the wider the ball is, the harder it is to score immediately from there. When defending it’s often useful to utilise the ends of the pitch as an ‘extra’ defender, which makes it easier to overwhelm offensive players.
Improvements
Next time I would definitely be more narrow and specific with the question I set out to answer. There were no set acceptance criteria for whether or not the results produced were sufficient to answer the question. This led to the more descriptive, exploratory nature of the project, which ended up scratching the surface rather than delving deeper into solving a specific problem.
Another useful tool would be access to tracking level data for these matches. Defensive analysis places much more emphasis on off-ball events and the distances between all the players on the pitch. Tracking data would provide the locations of all players and the ball at all times, which would give much greater detail but also be much more complicated to work with.
Reflection
I set out to try to understand why Arsenal’s defence worked so well during their unbeaten season. Using StatsBomb’s event data for the majority of the season, I analysed where Arsenal’s defensive events were performed and how that compared to their opponents. This could only tell part of the story since defensive events only cover on-ball actions. It is accepted that defending covers a whole lot more than just on-ball actions, so further analysis was needed. I analysed how their opponents progressed the ball up the pitch from each third and how they created shots, clustering the locations to identify the most frequent types of progressions. Considering both sides, it’s clear that much of Arsenal’s defensive work happens in their full back areas and their opponents try to progress the ball down the flanks. What is not clear is whether Arsenal are causing this to happen via off-the-ball actions or whether it’s just coincidence.
I found it hard to directly answer the question that I set out to, likely because the question was too broad. However, I am definitely further along the road than when I started, and it has been interesting trying to work through these problems.
It was great working with event level data and trying to find interesting ways to communicate through visualisations. Hopefully the plots combined with the football pitches work well to add to understanding.