This is a thinking out loud kind of post. A bunch of basic charts and numbers are here which should be enough to give an idea of what I’m proposing. I want to think about this some more, which is usually a long and slow process because life gets in the way, before doing a version with nice looking vizes, etc. + also get feedback and comments from you guys.
http://www.lukebornn.com/papers/fernandez_sloan_2019.pdf proposes a probabilistic possession value based on combining the possession value ( PV ) of the part of the pitch from a PV model, the probability of the ball being controlled in the part of the pitch from a pitch control model, the pass difficulty from a pass probability model, and the probability of the ball being played to that part of the pitch from an action likelihood model. These numbers can then be used to infer passing tendencies, awareness, positioning, and many other things.
Pitch control, pass probability, and action likelihood are outcome based models. Pitch control tells you the probability of controlling the ball if it were played to that part of the pitch. Pass probability tells you the probability of successfully making a pass to a particular point on the pitch and is likely based on historic pass data which means it is outcome based. Action likelihood is also probably based on historic match data and is also probably outcome based. An additional aspect we need to evaluate things like player decision making is intent. That’s what the rest of the post is about.
We’ll exclude action likelihood and only look at the output just from the possession value and the pitch control models. This lets us evaluate any opportunity across the pitch without prejudice for what players tend to do in that situation.
If you’re familiar with pitch control and possession value, you can probably just glance at the Glossary section and then skip straight to the Intent vs. Outcome section.
Pitch control tells you the probability of a team and its players controlling the ball at various parts of the pitch if the ball were to be passed there. This is done for each instant of the game and changes based on the positions of the players and their movement.
I use Spearman’s model but you could probably swap it out for any other model too.
The probability of scoring at the end of a play / in the next some moves / in the next some seconds once the team has a ball in a particular part of the pitch.
I’ve used the PV grid at https://raw.githubusercontent.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking/master/EPV_grid.csv which looks like this for a team attacking left to right -
The possession value sharply climbs close to 50% near the opposition goal but is quite low for most of the pitch.
We can rotate the pitch halfway around and get the PV values for the respective part of the pitch for the opposition since that team is attacking right to left.
You should be able to swap it out for another PV model, if you so prefer.
You should be able to combine these to evaluate what passing options offer progress to the team in possession.
AttackOriginPV
DefenseOriginPV = -AttackOriginPV
(x,y)
, the PV for each target location can be calculated for both
the teams, AttackTargetPV_xy
and DefenseTargetPV_xy
(x,y)
, the attacking team has a
probability of keeping control of the ball, AttackProbabilty_xy
,
and therefore the defending team’s probability is
DefenseProbabilty_xy = 1 - AttackProbabilty_xy
(x,y)
, we can estimate an expected PV for playing
a pass there:
AttackTargetEPV_xy = ( AttackProbabilty_xy * AttackTargetPV_xy ) - ( DefenseProbabilty_xy * DefenseTargetPV_xy )
DefenseTargetEPV_xy = ( DefenseProbabilty_xy * DefenseTargetPV_xy ) - ( AttackProbabilty_xy * AttackTargetPV_xy )
AttackEPVAdded_xy = AttackTargetEPV_xy - AttackOriginPV
DefenseEPVAdded_xy = DefenseTargetEPV_xy - DefenseOriginPV
AttackTargetEPV_xy
. We’ll call the EPV added at that point
AttackMaxEPVAdded
. This value may not always be positive.A positive AttackEPVAdded_xy
implies that passing the ball to that
point on the pitch is likely to increase the chance of scoring even
after considering the risk of conceding the ball. A negative value
implies that passing the ball to that point is likely going to decrease
the chance of scoring.
I ignore the possibility of passes that travel more than 2/3 the length of the pitch. Only Ederson can make such passes. We consider the area that is within a radius of 2/3 length of the pitch at any point of time and ignore everything outside of it.
We can expect DefenseEPVAdded_xy
to be positive very often, since
DefenseOriginPV
is a negative value so the more interesting thing to
analyse would be AttackEPVAdded_xy
.
I use one frame to give an example of these calculations.
Here is what the pitch control looks like in a random frame. I exclude players that are offside so the two players from the blue team in the offside position have zero contribution to pitch control.
This is what the pitch looks like in terms of AttackEPVAdded_xy
.
There is some territory in the middle of the pitch where passing the
ball is expected to return a positive AttackEPVAdded_xy
. And then
there is an area near the opposition goal, which has a larger
AttackEPVAdded_xy
than the middle of the pitch. This area lights up
because 1, the goalkeeper isn’t given extra powers to control the ball
in the pitch control model which is not the case in real life since
keepers can catch the ball with their hands, dive on it, etc. and 2,
pitch control is only reflective of passes, not shots. As a result of
this inconsistency, the logic above would suggest that in this situation
the optimal action for the team in possession, the blue team, should be
to pass the ball all the way to the goal and hope that the small chance
of another player from the same team controlling the ball pays off for
the high reward you’d get from successfully controlling the ball at that
location.
Don’t want to raise your hopes so specifying that what I’ll be proposing here doesn’t directly address this goalkeeper area problem but does help reduce the problem.
I use game 1 from Metrica Sports’ release available here - https://github.com/metrica-sports/sample-data/tree/master/data/Sample_Game_1. The findings WRT to the home and away team obviously don’t carry over to all games but the more general inferences probably do.
Let us look at the proportion of the pitch that offers a positive
AttackEPVAdded_xy
at each instant the team had possession of the ball.
AttackEPVAdded_xy
for both the
teams. It is exactly 0% of the area in 19% for the home team and 11%
for the away team has the ball. A value of 0% for
AttackEPVAdded_xy
means the team in possession of the ball has no
options on the pitch to pass to where they are expected to increase
their chances of scoring by enough to negate the chance of
conceding. This is very important because a number of times a pass
is played just to retain possession and not necessarily increase the
chance of scoring and the typical way PV is applied would give these
passes a negative score which is unfair because the situation is not
taken into account.AttackEPVAdded_xy
for the attacking
team.Let us look at the values of the AttackEPVAdded_xy
itself.
If the team were to randomly pass to any point on the pitch, at each
instant of the game, what AttackEPVAdded_xy
could they expect? This
would be the same as averaging the AttackEPVAdded_xy
values over the
entire pitch and would look like -
If we instead look at the best option the team has at any instant,
AttackMaxEPVAdded
-
AttackMaxEPVAdded
between -1% and 2% for both teams. Specifically the amount of time
is -Team | HighAttackMaxDeltaPVIntent\_pct |
---|---|
A | 96% |
H | 95% |
The proportion of time AttackMaxEPVAdded
is above 2% -
Team | LowAttackMaxDeltaPVIntent\_pct |
---|---|
A | 3% |
H | 4% |
How about where these options are available?
The location offering AttackMaxEPVAdded
at each instant during the
sample of frames looks like as below. Both teams have almost equal
possession so don’t worry too much about normalising this chart.
We haven’t looked at the distribution of the ball positions in these frames so we shouldn’t draw detailed conclusions from just these figures. That analysis is not necessary for the point of this post anyway so we’ll skip it. This is just warming you up to how we can look at things.
If we filter only the frames where AttackMaxEPVAdded
> 0.02, these
are the frames which had at least one pass which allowed the team to
make substantial progress towards scoring. Note that these are potential
passes that could have been played and not necessarily the actual pass
that was played at the time. There may also have been more than one such
pass at any frame but we will pick only the one with the
AttackMaxEPVAdded
value. It’s also likely that during the course of
the play, a similar kind of pass continued to remain the optimal choice
for consecutive frames in the sample and you would therefore see it
multiple times.
This is point of this post. Thanks for staying with me. All the stuff above was to get you comfortable with the setting which hopefully you are by now.
Note how most high value AttackEPVAdded_xy
passes are long passes,
often aimed at the edges of the pitch. Note how the optimal targets are
also concentrated around the edges of the pitch. Long passes are likely
to fetch more PV because it usually gets you much closer to the goal
from where the ball was before. Edges of the pitch are more attractive
because defenders will tend to stay towards the insides of the pitch
leaving one side less for a defender to impact the pitch control of a
wide positioned player from the attacking team. This is expected.
The reason I started writing is because while these are rewarding passes if they actually happen, it is also much harder to execute such passes. In the EPV paper, there is a pass probability component which is a logistic regression model but if it is like most other pass probability models then it is also modelling the probability of accurately passing to a particular point on the pitch and the training data is probably based on historic pass data which are again outcomes. I think there is room to improve this.
The risk of passing the ball out of bounds, or not being able to pass
accurately to the point with the maximum EPV added is much higher for
passes close to the sidelines, long passes, and so on compared to easier
passes to someone in the middle of the pitch or to someone closer to the
player in possession. Until now we were looking at an outcome based
AttackMaxEPVAdded
i.e. the EPV the team could expect to gain if the
ball reached a particular (x.y)
but if we switch to an intention based
AttackEPVAdded_xy
, i.e. the EPV the team could expect to gain if the
ball was attempted to be passed to a particular (x.y)
then we would
need to incorporate the risks of inaccuracy in passing. There are other
factors that could also cause inaccuracy such as pressure on the ball,
the passer’s body orientation, etc. but for now we will keep it simple
and consider only the length of the pass and the target location. Note
that the term inaccuracy is typically used to describe whether a pass
was successfully received by a teammate but in this post it is being
used to describe whether a pass intended for a particular location
actually reaches that location or goes somewhere else.
To model this inaccuracy - let us assume that a pass intended for a particular location x,y could actually go anywhere in the neighbourhood of x,y depending on how far x,y is from the pass origin location. To simulate this, I take the intended pass target location and calculate possible actual pass target location by adding some noise to the intended coordinates. For actual targets that go out of bounds the appropriate corner, goal kick, or throw in location is calculated and considered as the next event and the PV calculation is done accordingly.
For each intended target location, the probability of the ball landing in any of the actual target location in the neighbourhood of the intended target location is modelled as a two dimensional gaussian centred around the intended target location. The standard deviation of the gaussian is a linear function of the intended pass length such that 99% of the pass attempts land within a radius of ( 10% of the intended pass length ) from the intended pass target location. For computational convenience, we ignore points outside this radius.
Now that we have these probabilities, we can calculate a new set of pitch control and expected possession values for each intended target location by aggregating the outcome based pitch control and expected possession values of all its actual target locations weighed by the the probability of that actual target location.
This is not the same as just having a pass distance feature in the pass probability logistic regression model. We are redefining pitch control and possession value altogether by differentiating what they mean for a receiver and what they mean for a passer.
Another one frame example to give you an idea of how this manifests. From here on, the charts marked outcome are the same set of numbers from the charts in the earlier section and the charts marked intent are the new set of numbers.
Here is what the delta in control probability between the two looks like. Note the change in the colour scale -
Let’s exclude the edges and see how much the central areas change by. Again note the change in the colour scale -
What difference does that translate to in terms of the
AttackEPVAdded_xy
Okay not very easy to spot differences. The delta in the central area looks like -
The optimal intent based pass target location is quite far from the optimal outcome based pass target location. Optimal location marked with a + -
The target is still on the right wing, but near the centre line as per the intent model whereas the outcome based optimal location is around a quarter of the length of the pitch farther ahead and much closer to the edge of the pitch.
Before we look at the distributions from earlier again, here is a quick comparison of the change -
AttackMaxEPVAdded
values take a hit when
you compare them to the corresponding intent based
AttackMaxEPVAdded
. This is expected since a lot of these were near
the edges of the pitch or very long passes.Here are the rest of the distributions with the comparisons.
AttackEPVAdded_xy
in the outcome version of the model to smaller
parts of the pitch in the intent version of the model.The other distributions also change in a similar manner -
AttackMaxEPVAdded
values. This makes sense, you’d expect ådding noise would reduce the
reward possible from the optimal choice.AttackMaxEPVAdded
passes. These were
the kind of passes where there is a chance of the ball going out of
bounds as well or not being hit exactly at the target which get
penalised in the intent model but not in the outcome model.Here is a distribution of the difference in distance between the optimal intent based pass distance and the optimal outcome based pass distance bucketed by the optimal outcome based pass distance.
( We use a distance based noise function because we expect longer passes to be harder to execute. The trend of longer passes seeing a more drastic change might be a self-fulfilling result as a consequence of our formulation of noise. Some day we’ll have a good passing inaccuracy model and then we’ll know. )
The distance between the two optimal locations -
Most of the time is is within 10 units ( 1/12th the length of the pitch ) which means they are close enough to each that is probably doesn’t matter much. There are many cases though where this distance is quite large, i.e. cases the two locations are farther apart than 10 units. Those cases are where an intent based model is evaluating the situation and evaluating the options better than an outcome based model.
Team | HighDistance |
---|---|
H | 15% |
A | 21% |
If we look at shots as another example, there is a gap between xG and PSxG which is where an intent based model would fit in. xG captures all the possibilities in that situation and PSxG captures the final outcome from that situation but neither are a good measure of intent. To connect intent, wanting to shoot at a particular point on the goal, with a particular speed, with a particular trajectory etc. with outcome, the final shot properties, we could repeat a similar process as above highlighting what sort of errors can be expected in any of those properties of intent. For eg. choosing a very high speed shot probably causes higher noise in the target point on the goal and therefore the intent based PSxG should include other noisy trajectories which consider the possibility of a shot being in the neighbourhood of the final shot to evaluate how good a decision it was to pick that point.
The difference between the expected reward from an intention and an outcome is noteworthy.
There are a lot of simple assumptions built into this analysis. It would be useful to validate those assumptions from data, which I don’t think we can do from publicly available resources though.
On Twitter - @thecomeonman, or on mail - mail dot thecomeonman at gmail dot com