Forecasting Team Passes in Football Matches
A data-driven approach to predict pass counts in upcoming matches using historical performance metrics across multiple leagues.
Dataset Overview
Our analysis is based on a comprehensive dataset of 4,414 matches spanning six major football leagues:
  • English Premier League
  • La Liga
  • Ligue 1
  • Brazilian Serie A
  • Swedish Allsvenskan
  • Major League Soccer
Each match provides 31 statistics per team plus 31 for their opponent, with one full season of data per league.
Summary
Fundamental statistics of key actions and events during a game, such as xG, goals, and passes.
Target Variable - Passes
Predicting the number of passes by the main team in the next match.
Sample Dataset
Feature Engineering
3-match Rolling Averages to Capture Recent Form
  • Computed 3-match rolling averages for all 60 performance metrics:
  • 30 team-specific features (e.g., xg_rolling, shots_rolling)
  • 30 opponent-specific features (e.g., opp_xg_rolling, opp_shots_rolling)
  • This approach captures recent form, smooths out random variation, and prevents data leakage by excluding the current match's stats (see the sketch below).
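A minimal pandas sketch of this step, assuming a long-format dataframe with hypothetical columns team, date, and the per-match stats (only two stats shown here):

```python
import pandas as pd

# Hypothetical long-format data: one row per team per match, sorted chronologically.
matches = pd.read_csv("matches.csv").sort_values(["team", "date"])

stat_cols = ["xg", "shots"]  # in the full project this loop covers all rolling metrics

for col in stat_cols:
    # shift(1) drops the current match so each row only sees earlier games,
    # then rolling(3).mean() averages the three most recent completed matches.
    matches[f"{col}_rolling"] = (
        matches.groupby("team")[col]
        .transform(lambda s: s.shift(1).rolling(3).mean())
    )
```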
One-hot encoding of categorical variables
  • Categorical variables, including league_code, opp_team_code, and season_code, are converted to binary format where each category is represented by a separate column.
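Continuing from the matches dataframe above, a short sketch of the encoding step (the column names are those listed in the bullet and may differ from the actual dataset):

```python
import pandas as pd

# Each category becomes its own 0/1 column, e.g. league_code_EPL, league_code_LaLiga, ...
encoded = pd.get_dummies(
    matches,
    columns=["league_code", "opp_team_code", "season_code"],
    dtype=int,
)
```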
Model Development (Regression)
Linear Regression Model
  • Forward stepwise selection (Cp criterion) was used to identify a predictive subset from 60 rolling stats.
  • The algorithm starts with no predictors and adds one variable at a time, selecting the variable that most improves model performance (lowest Cp).
  • Cp includes a penalty term that grows with the number of predictors, discouraging needlessly large models.
  • The process stops as soon as adding further variables no longer reduces Cp, ensuring the model remains simple and avoids unnecessary complexity.
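A sketch of this selection procedure, assuming X is a pandas DataFrame of the 60 rolling features and y the pass counts; the project's actual implementation may differ in its details:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mallows_cp(X, y, features, sigma2):
    """Cp = (RSS + 2 * d * sigma^2) / n for the model built on `features`."""
    model = LinearRegression().fit(X[features], y)
    rss = np.sum((y - model.predict(X[features])) ** 2)
    return (rss + 2 * len(features) * sigma2) / len(y)

def forward_stepwise(X, y):
    # sigma^2 is estimated from the full model containing all candidate predictors.
    full = LinearRegression().fit(X, y)
    sigma2 = np.sum((y - full.predict(X)) ** 2) / (len(y) - X.shape[1] - 1)

    selected, remaining, best_cp = [], list(X.columns), np.inf
    while remaining:
        scores = {f: mallows_cp(X, y, selected + [f], sigma2) for f in remaining}
        candidate, cp = min(scores.items(), key=lambda kv: kv[1])
        if cp >= best_cp:          # stop once adding a variable no longer lowers Cp
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_cp = cp
    return selected
```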
Polynomial Regression Model
  • Polynomial terms of degrees 2–4 were added to the selected numerical features to capture non-linear relationships.
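One way to generate these terms with scikit-learn (a sketch assuming selected holds the chosen numeric feature names and X_train/X_test come from the train-test split):

```python
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion; degrees 3 and 4 are built the same way by changing `degree`.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train[selected])
X_test_poly = poly.transform(X_test[selected])
```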
Regression Model - Selected Features
Team Performance
16 variables capturing average performance of the team's last 3 matches
attacks_rolling, counter_attacks_rolling, corners_rolling, crosses_rolling, free_kicks_won_rolling, goal_kicks_rolling, goals_rolling, offsides_rolling, passes_acc_rolling, passes_forwards_rolling, passes_in_final_third_rolling, passes_long_rolling, passes_short_rolling, possessions_duration_rolling, saves_rolling, and xg_rolling.
Opponent Performance
10 variables capturing average stats of the team's last 3 opponents
opp_corners_rolling, opp_dribbles_rolling, opp_free_kicks_won_rolling, opp_interceptions_rolling, opp_offsides_rolling, opp_passes_forwards_rolling, opp_passes_in_final_third_rolling, opp_passes_short_rolling, opp_shots_on_rolling, and opp_throw_ins_rolling.
Contextual Factors
3 categorical variables describing match context, always included regardless of the performance-based feature selection
tournament_id, opp_team_id, and season_id.
Regression Model Performance
We evaluated the performance of our regression models using Root Mean Squared Error (RMSE) and R-squared (R²).
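A short sketch of how these metrics can be computed with scikit-learn, assuming y_test and y_pred come from any of the fitted models:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
```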
Linear Regression
Stable performance across both training and testing datasets, indicating a good balance without significant overfitting.
Polynomial (Degree 2)
Slight increase in RMSE and decrease in R² on the test set, suggesting minor overfitting.
Polynomial (Degrees 3 & 4)
Severe overfitting, with near-perfect fit on the training data (RMSE ≈ 0, R² ≈ 1.00) but extremely poor generalization on the test set (huge RMSE, negative R²).
Model Development (Decision Tree)
Decision trees predict outcomes by partitioning the predictor space into distinct regions, assigning each region the mean response value of its observations. This is achieved through recursive binary splitting to minimize prediction error (RSS).
Algorithm:
1. Initialization
Begin with all observations in a single region at the tree's root.
2. Recursive Binary Splitting
Evaluate every predictor j and every candidate split point s, and choose the pair (j, s) that minimizes the residual sum of squares (RSS):
RSS = \sum_{i : x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1(j,s)})^2 + \sum_{i : x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2(j,s)})^2
3. Iteration
Apply the splitting process to each new child node created.
4. Stopping Criterion
Splitting halts when a predefined maximum depth (i.e., the maximum number of splits from the root node to any terminal node) is reached.
5. Parameter Tuning
Train trees of varying depths (1 to 20), compute test set RMSE for each, and select the depth with the lowest test RMSE to balance complexity and predictive accuracy.
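A minimal scikit-learn sketch of step 5, assuming the same train-test split as the regression models:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

test_rmse = {}
for depth in range(1, 21):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_test)
    test_rmse[depth] = np.sqrt(mean_squared_error(y_test, pred))

best_depth = min(test_rmse, key=test_rmse.get)  # depth with the lowest test RMSE
```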
Decision Tree Model Fitting
Optimal Tree Depth
Decision Tree Structure
Model Comparison
Summary:
  • Linear Regression is chosen as the final model because it has the lowest test RMSE and the most consistent performance across training and test sets.
  • Although the Decision Tree performs comparably, it is slightly worse on the test set.
Use Case: Predicting Real Madrid's Next Match
Predicted Real Madrid's pass count for a hypothetical upcoming match based on their most recent 3 matches:
503.3
Predicted passes for Real Madrid's next match against Barcelona
Limitations & Next Steps
1. Linear Regression (LR)
Limitation: Stepwise selection is a greedy, discrete method that may miss the globally optimal set of predictors.
Improvement: Regularisation methods (e.g., Ridge or Lasso) offer a more systematic approach, balancing fit and complexity globally by shrinking coefficients toward zero.
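For example, a Lasso fit with cross-validated shrinkage could replace the stepwise search (a sketch of the suggested direction, not the method used here):

```python
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardise first (the L1 penalty treats all coefficients on a common scale),
# then let LassoCV choose the shrinkage strength alpha by cross-validation.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X_train, y_train)
kept = X_train.columns[lasso.named_steps["lassocv"].coef_ != 0]
```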
2. Polynomial Regression
Limitation: Higher-degree polynomials cause the feature count to explode (e.g., 5,535 terms at degree 3), so the model overfits the training set and oscillates wildly on the test set.
Improvement: Generalised Additive Models (GAMs) can capture non-linear relationships through smooth functions for each predictor.
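A brief illustration with the pygam library, shown for three of the selected features only; this is a sketch of the suggested direction, not part of the current pipeline:

```python
from pygam import LinearGAM, s

# One smooth spline term per predictor (here just three illustrative columns).
X = X_train[["xg_rolling", "passes_short_rolling", "corners_rolling"]].to_numpy()
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y_train)
```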
3. Decision Tree (DT)
Limitation: Tuning only the maximum depth is simplistic and can lead to over- or underfitting. Decision trees are also unstable: small changes in the training data can produce very different trees.
Improvement:
  • Use Cost Complexity Pruning to systematically prune a large tree.
  • Ensemble methods (e.g., bagging, random forest) can reduce variance and improve predictive power by combining multiple trees.
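A sketch of both ideas with scikit-learn, assuming the same X_train/y_train as before:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Cost-complexity pruning: grow a large tree, then compare subtrees along the pruning path.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
pruned = [DecisionTreeRegressor(ccp_alpha=a, random_state=0).fit(X_train, y_train)
          for a in path.ccp_alphas]

# Random forest: averaging many decorrelated trees reduces variance.
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
```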
4. Train-test Validation
Limitation: Results are sensitive to how the data is split (e.g., a single 80/20 split).
Improvement: Use k-fold cross-validation (e.g., 5- or 10-fold) to better evaluate model performance across different subsets of data.
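A sketch of the proposed evaluation, assuming selected_features holds the chosen predictors (hypothetical name):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

scores = cross_val_score(
    LinearRegression(), X[selected_features], y,
    scoring="neg_root_mean_squared_error", cv=5,
)
print(f"5-fold RMSE: {-scores.mean():.1f} (+/- {scores.std():.1f})")
```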