Motivation

Formations and player roles are a very vague concept. We simplify things into 4-2-3-1s and 4-3-3s and right backs and centre forwards but within those simplifications there are many nuances to how different teams and different players function.

This logic simplifes the role of a player to moving the ball from one part of the pitch to another. This by itself isn’t a sufficient measure of similar players, but it may still be a measure of similar roles. For instance it doesn’t include physical attributes of pace, stamina, etc. quality of shooting, passing, etc. quality of awareness, positioning, etc. This results from this logic could be used in addition to these other measures for better results.

Methodology

To compare players in isolation

  1. Isolate periods of longer than 60 minutes during which no substitutions happened, no formation changes happened. Remove the rest of the data.

  2. For each player, in each match, extract four sets of loctions -
    • all the locations where the player received a pass from i.e. the location of the passer who passed the ball to this player. I will address this as (xstart1, ystart1)
    • all the locations where the player received a pass at i.e. the location of the player when he received the ball I will address this as (xend1, yend1)
    • all the locations where the player passed from i.e. the location of the player when he passed the ball to another player. I will address this as (xstart2, ystart2)
    • all the locations the player passed to i.e. the location of the recipient of the pass when he received the ball. I will address this as (xend2, yend2)

eg. if we are interested in player B, then for a play in which player A passed to player B and then player B passed to player C, we’d see two passes - pass 1 which went from A at location Start1 to B at location End1, and then pass 2 which went from B at location Start2 to C at location End2, with player B having moved with the ball from location End1 to location Start2.

Each of those four points contribute respectively to the four different datasets desribed above.

  1. Compare each of those four distribution with the respective distributions of all other players in their matches. Use a distance measure ( earth mover’s distance ) to decide whether the distributions of two players are similar.

  2. Combine the EMD between each set of distances to come up with an overall distance. I add the four distances up as if they were Euclidan distances, i.e. overall distance = ( ( Distance1 ^ 2 ) + ( Distance2 ^ 2 ) + ( Distance3 ^ 2 ) + ( Distance4 ^ 2 ) ) ^ 0.5

To compare teams

Using the player comparison metric, compare all eleven players of a team in one match with all eleven players of a team in another match. Find a one to one pairing between players such that the sum of distances within all eleven pairs is the smallest. This can be solved as an assignment problem, i.e. for each comparison, of all the possible ( 11 * 10 * 9 … * 1 ) = 11! possible pairings between the 11 players in each match, the one with the least sum of distances across each player pair is chosen as the best pairing.

Here are all the 11 X 11 distances with the final pairs highlighted for some comparisons to give you an idea of how this works.

The sum or average of these eleven distances or is a measure of how similarly the two teams passed which can be assumed to be a proxy for how similar the two teams’ playing strategy was. The smaller the sum of distances, the closer the formations of the two teams. Instead of the average distance, the maximum distance amongst these eleven distances is a stricter measure which can be used for the same purpose and this is what I prefer.

EMD illustration

It is a measure of the amount of change needed to transform one distribution to another. For instance, if you have two distributions, one concentrated 100% at (0,0) and another 50% at (0,1) and 50% at (1,1) then the EMD between them would be 1.2071068 = 0.5 * ( distance of moving a point from c(0,1) to c(0,0) = 1 ) + 0.5 * ( distance of moving a point from c(1,1) to c(0,0) = 1.4142135 )

Here is an example from the dataset. Three distributions of points on the pitch, and the EMD between them are shown below. These three distributions are of locations from where Toby Alderweireld made passes against these teams.

Match1 Match2 DistancePassXY
A vs Huddersfield H vs Leicester 5.251464
A vs Huddersfield A vs West Brom 22.282919
H vs Leicester A vs West Brom 24.657782

Note how the game against WBA as a very different set of points where passes were made from, and accordingly the distance between that game and the other two games is much higher than the distance between the other two games.

Additional notes

  • Passing data is probably more reflective of attacking similarity than defensive similarity. Like how I simplified attacking play to moving the ball from one area of the pitch to another, we could consider defense being about not letting the ball move into some area of the pitch. That would need off the ball positioning data which I don’t have access to.

  • For more role specific distances, the distance measure can be extended to any other spatial distribution as well, eg. locations where the player shoots from, where the player tackles, etc. While defensive actions get ignored, my hope is that offensive actions indirectly still get captured, for instance if you expect to see a player shoot from certain positions, you may have less passes from the same position for him and that should cause the distance to increase between him and another player who tends to pass from that same position.

  • I could have chosen to calculate the distance between the start and end coordinates of the passes as two four dimensional data (xstart1, ystart1, xend1, yend1) and (xstart2, ystart2, xend2, yend2). I instead choose to calculate the distance between the set of starting points and the ending points separately as four two dimensional datasets (xstart1, ystart1), (xend1, yend1), (xstart2, ystart2), and (xend2, yend2). Cons - A player who passes from xy1 to xy2, and xy3 to xy4, and another player who passes from xy1 to xy4, and xy3 to xy2, would have 0 distance in the chosen methodology which would be incorrect. Pros - it is easier to understand the reasons for the overall distance between two distributions. I haven’t verified this but my hope is that the case in the cons is probably rare and the interpretability is worth that loss. I’ll test this at a later point in time.

  • Other aspects of a team’s strategy, such as the playing style of opposition, what’s at stake, the situation in the game, red cards, etc. have been ignored for sake of simplicity. It should be trivial to match the playing style of the respective oppositions using the same method that we’re using to match the players and the teams.

Data used

  • I only used data from matches for the 2017-18 season of the EPL, the EFL, Ligue 1, La Liga, Bundesliga, and Serie A.

Get in touch

Do you have suggestions, comments, new ideas to build on top of this, etc.? I’d love to hear. Find me on Twitter - @thecomeonman.