2020-2021 - Correlations and win model

GatoLouco

Sophomore
Nov 13, 2019
5,636
116
63
Apologies if there might be mistakes or typos.

Data:

Team            Wins       Reb. Margin  Eff. FG%  Def. Eff. FG%  Assists  FT Made    Turnovers  Blocks
Illinois        16         9.8          54.6      46.6           16       15.3       12.8       2.7
Iowa            14         8.4          54.4      49.1           17.9     13.8       9.4        3.9
Michigan        14         9.6          54.1      44.4           14.1     12.5       11.5       4.2
Purdue          13         3.6          49.9      48.1           12.9     13.4       12.2       3.5
Ohio State      12         4.6          53.7      50.2           12.7     15.2       11         3.1
Rutgers         10         -0.6         49.1      48.5           12.9     9.8        10.7       5.2
Wisconsin       9          0.7          47.4      50.1           12.6     10.6       8.9        3.3
Michigan State  9          -5.3         45.5      48.8           14       13.3       12.1       4.3
Maryland        9          -1.3         51        48.4           11.9     11.2       10.9       3.3
Penn State      7          -2.7         46.6      53.6           12.7     12.8       11.4       2.1
Indiana         7          -3.5         48        51.5           13.3     15.4       11.6       3.1
Minnesota       6          -6           43.4      51.8           12.7     15         10.5       4.2
Northwestern    6          -6.8         50.1      52.7           13.4     9.8        12.1       2.4
Nebraska        3          -10.2        48.2      50.9           13.1     10.4       14.7       3.1
Average         9.6        0            49.7      49.6           13.6     12.8       11.4       3.5
NU Rank         #12 (Tie)  #13          #6        #13            #5       #13 (Tie)  #11        #13

Correlations to wins:
Category         Correlation
Reb. Margin      96.2%
Defensive EFG%   -76.0%
Eff. FG%         75.0%
Assists          56.7%
FT Made          39.2%
Turnovers        -26.2%
Blocks           19.8%
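
For anyone who wants to check the numbers, here is a minimal sketch of how the correlations above could be reproduced. It assumes they are plain Pearson correlations between each column and wins; the `pearson` helper and the variable names are made up for illustration, and only two of the columns are shown.

```python
import math

# Conference wins and two stat columns, transcribed from the data table
# (team order: Illinois ... Nebraska).
wins = [16, 14, 14, 13, 12, 10, 9, 9, 9, 7, 7, 6, 6, 3]
reb_margin = [9.8, 8.4, 9.6, 3.6, 4.6, -0.6, 0.7, -5.3, -1.3, -2.7,
              -3.5, -6.0, -6.8, -10.2]
blocks = [2.7, 3.9, 4.2, 3.5, 3.1, 5.2, 3.3, 4.3, 3.3, 2.1,
          3.1, 4.2, 2.4, 3.1]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(f"Reb. Margin vs wins: {pearson(reb_margin, wins):.1%}")
print(f"Blocks vs wins:      {pearson(blocks, wins):.1%}")
```

The same helper can be pointed at any other column of the table.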

Predictive model (R Square = 94.6%)

Wins = 16.4 + 0.53*RebMg - 0.17*DefEFG% - 0.06*EFG% + 0.11*Assists + 0.08*FTMade + 0.14*TO + 0.23*Blocks

The p-value for Reb. Margin is 4.01%; the next-lowest p-value (Defensive EFG%) is 67.78%.
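
As a sanity check on the equation, here is a rough sketch that plugs two rows of the data table into it. The coefficients are the rounded ones from the post, and reading the censored term as assists is an assumption (it is the only table column without a coefficient otherwise); `COEF` and `predicted_wins` are illustrative names.

```python
# Rounded coefficients as posted. "assists" is an assumed reading of the
# censored term in the equation.
COEF = {
    "reb_margin": 0.53,
    "def_efg": -0.17,
    "efg": -0.06,
    "assists": 0.11,
    "ft_made": 0.08,
    "turnovers": 0.14,
    "blocks": 0.23,
}
INTERCEPT = 16.4

def predicted_wins(row):
    """Evaluate the posted linear equation for one team's stat line."""
    return INTERCEPT + sum(COEF[name] * value for name, value in row.items())

# Stat lines copied from the data table.
northwestern = {"reb_margin": -6.8, "def_efg": 52.7, "efg": 50.1,
                "assists": 13.4, "ft_made": 9.8, "turnovers": 12.1,
                "blocks": 2.4}
illinois = {"reb_margin": 9.8, "def_efg": 46.6, "efg": 54.6,
            "assists": 16, "ft_made": 15.3, "turnovers": 12.8,
            "blocks": 2.7}

print(f"Northwestern: {predicted_wins(northwestern):.1f} predicted, 6 actual")
print(f"Illinois:     {predicted_wins(illinois):.1f} predicted, 16 actual")
```

Because the coefficients are rounded, the predictions will not match a full-precision fit exactly.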

Conclusion: Last season, rebounding mattered more than anything. We were not good at rebounding.

Also, the idea that we just can't hit an open shot is not backed up by the data. We just notice our misses a lot more than our opponents'.
 

NUThump

Redshirt
May 29, 2001
1,321
21
38
Thanks for posting. Question, since I am not very familiar with these models: would calculating an effective FG margin (offense - defence) and using that be any different than using them as separate inputs? The top 5 would all have a positive margin and the bottom 5 would all be negative.
 

GatoLouco

Considering EFG% margin (Eff. FG% minus Defensive Eff. FG%), our rank is #8:
Team  EFG% Margin
MI    9.7
IL    8
IA    5.3
OSU   3.5
MD    2.6
PU    1.8
RU    0.6
NU    -2.6
NE    -2.7
WI    -2.7
MSU   -3.3
IN    -3.5
PSU   -7
MN    -8.4

Correlation to wins is 84.14%.
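
A small sketch of how the margin, the rank and the correlation above could be recomputed from the data table in the first post. It assumes the margin is simply Eff. FG% minus Defensive Eff. FG% and that the correlation is a Pearson correlation; the names used here are illustrative only.

```python
import math

# team: (wins, Eff. FG%, Defensive Eff. FG%), copied from the data table.
teams = {
    "Illinois": (16, 54.6, 46.6), "Iowa": (14, 54.4, 49.1),
    "Michigan": (14, 54.1, 44.4), "Purdue": (13, 49.9, 48.1),
    "Ohio State": (12, 53.7, 50.2), "Rutgers": (10, 49.1, 48.5),
    "Wisconsin": (9, 47.4, 50.1), "Michigan State": (9, 45.5, 48.8),
    "Maryland": (9, 51.0, 48.4), "Penn State": (7, 46.6, 53.6),
    "Indiana": (7, 48.0, 51.5), "Minnesota": (6, 43.4, 51.8),
    "Northwestern": (6, 50.1, 52.7), "Nebraska": (3, 48.2, 50.9),
}

# Offense minus defense, rounded to one decimal like the posted list.
margins = {team: round(efg - def_efg, 1)
           for team, (_, efg, def_efg) in teams.items()}

ranked = sorted(margins, key=margins.get, reverse=True)
nu_rank = ranked.index("Northwestern") + 1

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

corr = pearson([margins[t] for t in teams], [teams[t][0] for t in teams])
print(f"NU rank by EFG% margin: #{nu_rank}")
print(f"Correlation of EFG% margin with wins: {corr:.2%}")
```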

The model is still strong, with an R-squared of 94.33%. The p-value for Reb. Margin is 1.6%, but the p-value for EFG% margin is 96.25%. Statistically, rebounding is really the only reliable predictor here.

Wins = 3.37 + 0.57*RebMg - 0.01*EFG%Mg + 0.05*Assists + 0.06*FTMade + 0.25*TO + 0.54*Blocks
 

PurpleWhiteBoy

Redshirt
Feb 25, 2021
5,303
0
0
Just curious, gato... if you ignore blocks (which to me seems practically irrelevant) I'd guess the correlation is basically the same?

Also, if you have the data, wouldn't you want turnover margin as an input?
Turnover margin (and rebounding margin) tells you how many more (or fewer) opportunities you have to shoot.

Needless to say, I think this type of unbiased approach is excellent. But I said it anyhow.
 

GatoLouco

That can be work for tomorrow! On turnovers, we actually averaged 0.6 fewer than our opponents.
 

DaCat

All-Conference
May 29, 2001
25,505
1,899
113
Good work.
 

Hungry Jack

All-Conference
Nov 17, 2008
37,171
2,666
67
I don't like the model output. Therefore it sucks.

/s
 

AdamOnFirst

All-Conference
Nov 29, 2021
9,708
1,346
113
We have to have the nerdiest goddamn basketball fans on the planet.

Well done NU. Very NU. Love you for it.
 

mission_cat

Redshirt
Nov 9, 2012
91
1
0
Not to be that guy, but this model is nonsense. Not only is OLS a weird choice for this data, but it's clear you're going to have massive overfitting issues when you have 14 observations and 7 covariates. In fact, the ridiculously high r-squared is a red flag that overfitting is happening.
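
To make the overfitting worry concrete, here is a hedged sketch (not the model from the first post) that regresses the actual win totals on columns of pure random noise. It computes the R-squared of an ordinary least squares fit with an intercept by orthogonalizing the centered predictors (modified Gram-Schmidt); `r_squared`, the seed and the noise columns are all assumptions made for illustration.

```python
import math
import random

wins = [16, 14, 14, 13, 12, 10, 9, 9, 9, 7, 7, 6, 6, 3]

def r_squared(columns, y):
    """R-squared of an OLS fit (with intercept) of y on the given columns."""
    n = len(y)
    mean_y = sum(y) / n
    resid = [v - mean_y for v in y]       # centering absorbs the intercept
    tss = sum(v * v for v in resid)       # total sum of squares
    basis = []
    for col in columns:
        mean_c = sum(col) / n
        q = [v - mean_c for v in col]
        for b in basis:                   # remove what earlier columns explain
            dot = sum(qi * bi for qi, bi in zip(q, b))
            q = [qi - dot * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(sum(qi * qi for qi in q))
        if norm < 1e-12:
            continue                      # column adds no new direction
        q = [qi / norm for qi in q]
        basis.append(q)
        dot = sum(ri * qi for ri, qi in zip(resid, q))
        resid = [ri - dot * qi for ri, qi in zip(resid, q)]
    rss = sum(v * v for v in resid)       # residual sum of squares
    return 1.0 - rss / tss

rng = random.Random(0)                    # fixed seed, arbitrary choice
noise = [[rng.gauss(0, 1) for _ in wins] for _ in range(13)]
r2_by_k = [r_squared(noise[:k], wins) for k in range(1, 14)]
for k, r2 in enumerate(r2_by_k, start=1):
    print(f"{k:2d} noise predictors: R^2 = {r2:.3f}")
```

R-squared can only go up as predictors are added, and with 13 noise columns plus an intercept on 14 observations the fit is exact. For Gaussian noise like this, the expected R-squared with k predictors is about k/13, so 7 meaningless inputs would already explain roughly half the variance on average.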
 

GatoLouco

Dude. I’m playing with numbers. I do not aim to be the next KenPom. I’ve now changed the model 4 times. All while taking a break from stuff at work. No advanced statistic in basketball is a regression model. But I will probably continue to play with it, remove variables, etc. I could increase the sample size by using game data and not averages but, again, I’m playing, not intending to be an analyst.

Anyway, unless my memory really fails me, the r squared is the opposite of a red flag. The high p values of every variable other than rebounding margin are what you could hang your hat on.

Run your stuff, play with the numbers yourself. Make suggestions?
 

mission_cat

Fair enough, I shouldn't have been so blunt, since I do consider this an interesting topic, but the model as it currently stands is just statistical noise. Here are some more constructive thoughts:

1) r-squared being near 1 is absolutely a red flag for overfitting for a model with 14 observations and 7 inputs. Imagine a model with 14 observations and a single, random categorical covariate taking unique values from 1-14. The model r-squared is 1, but the model is obviously just noise. Your model is just a slightly less extreme version of this (see here for more). The other dead giveaway is that the p-values for the regression coefficients are extremely high, so essentially none of the model covariates are statistically significant.

2) the easiest and most interesting way to fix the model is just to expand your data set to include all games for all Division I teams. Or if not all teams, then at least multiple seasons' worth of data for the B1G.

3) if you have the time to look into it, running a Poisson regression would be more appropriate here given that you're looking at a discrete outcome.
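
For point 3, a minimal sketch of what a Poisson regression looks like on this data, using a single predictor (rebounding margin) to keep it small. It models log(expected wins) = a + b * RebMargin and fits a and b with Newton's method; `fit_poisson` is an illustrative helper written for this post, not a library function, and a real analysis would more likely use a statistics package.

```python
import math

wins = [16, 14, 14, 13, 12, 10, 9, 9, 9, 7, 7, 6, 6, 3]
reb_margin = [9.8, 8.4, 9.6, 3.6, 4.6, -0.6, 0.7, -5.3, -1.3, -2.7,
              -3.5, -6.0, -6.8, -10.2]

def fit_poisson(x, y, iterations=25):
    """Fit log(E[y]) = a + b*x by Newton's method on the Poisson likelihood."""
    a = math.log(sum(y) / len(y))   # start at the intercept-only fit
    b = 0.0
    for _ in range(iterations):
        mu = [math.exp(a + b * xi) for xi in x]
        # Gradient of the log-likelihood with respect to a and b.
        g_a = sum(yi - mi for yi, mi in zip(y, mu))
        g_b = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 information matrix.
        h_aa = sum(mu)
        h_ab = sum(mi * xi for xi, mi in zip(x, mu))
        h_bb = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton update: solve the 2x2 system by hand.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

a, b = fit_poisson(reb_margin, wins)
fitted = [math.exp(a + b * xi) for xi in reb_margin]
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"Each extra rebound of margin multiplies expected wins by {math.exp(b):.3f}")
```

Unlike the linear model, the fitted values here can never go negative, which is one reason a count model suits win totals. The small-sample caveat from point 1 still applies.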
 

PurpleWhiteBoy

Your comment about overfitting is fair, but definitely too harsh, as you have said.
However, the number of inputs can be reduced.
Steals and Turnovers can become one variable.
Rebounding Margin is obviously quite predictive of success all by itself.
Blocked shots can't add much, so should be removed.
Assists are probably also relatively unimportant.

If it's 3 or 4 variables, I think your concerns are largely addressed.
And I think you'll still get a high correlation.
 

GatoLouco

I tried several new variables and none of them has a good correlation with wins. As a result, even with fewer variables in the regression, the p values were not good.

Assist/Turnover Ratio: 49.82%
Steals: -46.17%
Turnover Margin: -21.54%
Turnover Margin + Steals: -34.52%

I might try adding more data later. I am trying to stay away from non-conference teams or non-conference games, to eliminate data from blowouts and the like, but I can add more seasons of B1G play. I do not remember Poisson regression, so I would have to re-educate myself on that one.

For me the insight, with no illusion that this is great data, is that of all the variables that increase possessions (rebounds, steals, turnovers and blocks), only rebounds fit as a predictor of wins and losses.
 

Hungry Jack

Not to be that guy, but this model is nonsense. Not only is OLS a weird choice for this data, but it's clear you're going to have massive overfitting issues when you have 14 observations and 7 covariates. In fact, the ridiculously high r-squared is a red flag that overfitting is happening.
I actually understand this. Somewhat.
 

mission_cat

Your comment about overfitting is fair, but definitely too harsh, as you have said.
However, the number of inputs can be reduced.
Steals and Turnovers can become one variable.
Rebounding Margin is obviously quite predictive of success all by itself.
Blocked shots can't add much, so should be removed.
Assists are probably also relatively unimportant.

If it's 3 or 4 variables, I think your concerns are largely addressed.
And I think you'll still get a high correlation.
Why would steals and turnovers be reducible to just steals? Is there something inherent about steals that it should correlate to turnovers?

Re: rebounding margin, it's intuitive that it's predictive, but it's an obvious result. A high rebounding margin is pretty much always the result of your team making lots of shots and the other team missing lots of shots. So we're basically just talking about wins being correlated with winning margin, which, yeah, that should be obvious. John Gasaway has a pretty famous article on rebounding margin here, if you want a good read.

Relatedly, rebounding margin is itself very strongly correlated with the differential between offensive and defensive EFG%. So if there's a place to reduce the model complexity it would be to remove the EFG% vars. But to make the model more interesting, I think it could be good to replace rebounding margin with offensive and defensive rebound %.
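
To show why rebound % can tell a different story than margin, here is a small sketch using the usual definitions: offensive rebound % = your offensive rebounds / (your offensive rebounds + opponent's defensive rebounds), and defensive rebound % = your defensive rebounds / (your defensive rebounds + opponent's offensive rebounds). The two box-score lines are invented for illustration and are not from any real game.

```python
def rebound_profile(oreb, dreb, opp_oreb, opp_dreb):
    """Return (rebound margin, offensive rebound %, defensive rebound %)."""
    margin = (oreb + dreb) - (opp_oreb + opp_dreb)
    oreb_pct = oreb / (oreb + opp_dreb)   # share of own misses recovered
    dreb_pct = dreb / (dreb + opp_oreb)   # share of opponent misses secured
    return margin, oreb_pct, dreb_pct

# Hypothetical game 1: the opponent misses a lot, so there are many
# defensive boards available, but the team is poor on the offensive glass.
game_1 = rebound_profile(oreb=6, dreb=30, opp_oreb=12, opp_dreb=14)

# Hypothetical game 2: fewer opponent misses, but the team rebounds
# a larger share of the misses at both ends.
game_2 = rebound_profile(oreb=12, dreb=24, opp_oreb=8, opp_dreb=18)

for label, (margin, o_pct, d_pct) in (("Game 1", game_1), ("Game 2", game_2)):
    print(f"{label}: margin {margin:+d}, OREB% {o_pct:.1%}, DREB% {d_pct:.1%}")
```

Both made-up games have the same +10 margin, yet the rebounding percentages differ at both ends, which is the information the raw margin hides.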
 

PurpleWhiteBoy

Steals would be a proxy for turnovers forced, if we didn't have that stat.
Essentially the winner of a game should be very correlated with the number of shots attempted by each team and the success rates.
Possessions determine number of shots. Rebounds, steals, turnovers determine number of possessions.
(whereas blocks and assists do not)
Free throws are another factor, but possibly not that impactful.

I'll read the article you linked.
 

GatoLouco

Why would steals and turnovers be reducible to just steals? Is there something inherent about steals that it should correlate to turnovers?

Re: rebounding margin, it's intuitive that it's predictive, but it's an obvious result. A high rebounding margin is pretty much always the result of your team making lots of shots and the other team missing lots of shots. So we're basically just talking about wins being correlated with winning margin, which, yeah, that should be obvious. John Gasaway has a pretty famous article on rebounding margin here, if you want a good read.

Relatedly, rebounding margin is itself very strongly correlated with the differential between offensive and defensive EFG%. So if there's a place to reduce the model complexity it would be to remove the EFG% vars. But to make the model more interesting, I think it could be good to replace rebounding margin with offensive and defensive rebound %.
Good read. Enjoyed it. It points out the issues with rebounding margin. And I do agree that rebound percentage is a better metric. Torvik is a much smarter guy than me and uses Rebound %.

But I am not sure rebound margin needs to die. Assuming competitive games, like the ones in the B1G, rebounding margin will, in most cases, point to having more possessions throughout the game. And more possessions than an opponent means, on average, more points. The interesting thing to me is that none of the other stats that point to more possessions than the opponent, for example, turnover margin, show any big correlation to wins.
 

mission_cat

Steals would be a proxy for turnovers forced, if we didn't have that stat.
Essentially the winner of a game should be very correlated with the number of shots attempted by each team and the success rates.
Possessions determine number of shots. Rebounds, steals, turnovers determine number of possessions.
(whereas blocks and assists do not)
Free throws are another factor, but possibly not that impactful.

I'll read the article you linked.
Turnovers are the number of turnovers the team commits, whereas steals are the number of steals the team gets. No correlation whatsoever is implied - teams can get loads of steals and commit loads of turnovers (e.g. NUWBB teams of late), or whatever combination you could imagine.

I think we're pretty much saying the same thing regarding success rate of possessions - this is exactly what rebounding margin is measuring. I think it's cool to see this borne out in the results, but if the goal is to identify interesting traits associated with winning, this ain't it. This is just a version of the old John Madden quote, "usually the team that scores the most points wins the game".
 

mission_cat

Good read. Enjoyed it. It points out the issues with rebounding margin. And I do agree that rebound percentage is a better metric. Torvik is a much smarter guy than me and uses Rebound %.

But I am not sure rebound margin needs to die. Assuming competitive games, like the ones in the B1G, rebounding margin will, in most cases, point to having more possessions throughout the game. And more possessions than an opponent means, on average, more points. The interesting thing to me is that none of the other stats that point to more possessions than the opponent, for example, turnover margin, show any big correlation to wins.
Just to clarify: both teams get the same number of possessions in a game. The only caveat to that is that obviously one team could get 1 more possession than its opponent if it got the first and last possession, but the differential cannot exceed 1. This is just the nature of the game, and perhaps when you say "possessions" you mean scores or shots? If you're interested in teams that generate more shots/opportunities through rebounds, I'd definitely focus on adding something specifically measuring offensive rebounds.
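
A toy sketch of the possession definition above, counting possessions and shot attempts from a made-up sequence of plays. The event labels and the `tally` helper are invented for illustration, and free throws are ignored to keep it short.

```python
def tally(events):
    """Count possessions and shot attempts per team from (team, outcome) pairs.

    Outcomes (hypothetical labels):
      "make"      - made shot, possession ends
      "miss_dreb" - miss rebounded by the defense, possession ends
      "miss_oreb" - miss rebounded by the offense, possession continues
      "turnover"  - possession ends without a shot
    """
    possessions = {}
    shots = {}
    for team, outcome in events:
        possessions.setdefault(team, 0)
        shots.setdefault(team, 0)
        if outcome in ("make", "miss_dreb", "miss_oreb"):
            shots[team] += 1
        if outcome in ("make", "miss_dreb", "turnover"):
            possessions[team] += 1   # the other team gets the ball
    return possessions, shots

# Made-up sequence: NU grabs two offensive rebounds on its first trip.
game = [
    ("NU", "miss_oreb"), ("NU", "miss_oreb"), ("NU", "make"),
    ("OPP", "miss_dreb"),
    ("NU", "turnover"),
    ("OPP", "make"),
]

possessions, shots = tally(game)
print("Possessions:", possessions)
print("Shot attempts:", shots)
```

In this sequence both teams end up with two possessions, but NU gets three shot attempts to the opponent's two: offensive rebounds add shots, not possessions.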

In any case, it just bears repeating that in your dataset rebound margin is itself directly correlated to scoring more than your opponent, so it's no surprise it's correlated with wins. Gasaway calls it meaningless because, among other things, it's not pace-adjusted and so rebounding margin will always look better for fast-paced teams. In the B1G, however, where pace of play is similarly slow across teams and their conference schedules are nearly identical, the variance in pace is less of an issue. Instead, the issue is that rebounding margin isn't interesting, it's just another way of asking "did you score more than your opponents?"
 

GatoLouco

You are correct. I am using the word possessions wrong. Five offensive rebounds don't mean 5 more possessions as the possession only ends when the other team gets the ball.
 

PurpleWhiteBoy

I found that article to be pretty weak, to be honest. My guess is that the author is fixated on other people using rebounding margin to define which teams are the best at rebounding. Given varying levels of competition and styles of play, sure, that's a valid gripe (although quite overstated in that article).

However, rebounding is a determinant of possessions AND a reflection on how well your defense is forcing missed shots, so it is a major determinant of which team scores more points - all things being equal.

When I use the word "possessions" I am talking about opportunities to get a shot.
If I shoot and miss and then get the ball, to me that is a new possession. Do people really think something different?

I am pretty confident that rebounding margin, turnover margin and shooting percentages would be good "predictors" of who won a given game. I guess free throws would make it even more robust.
 

mission_cat

I found that article to be pretty weak, to be honest. My guess is that the author is fixated on other people using rebounding margin to define which teams are the best at rebounding. Given varying levels of competition and styles of play, sure, that's a valid gripe (although quite overstated in that article).
I think the context for the author's fixation is that people continue to use it instead of more relevant metrics.

However, rebounding is a determinant of possessions AND a reflection on how well your defense is forcing missed shots, so it is a major determinant of which team scores more points - all things being equal.
Rebounding margin is too ambiguous of a metric to call it a determinant of which team scores more points. The entire point is that it can just as easily be viewed as a consequence of which team scores more points. You tend to get more rebounds only after you play good team defense and force missed shots. You limit the number of rebounds your opponent can get by playing good offense and scoring with high frequency. Rebounds are downstream of these team skills. You could argue that a team could play good defense, force lots of missed shots, and still not have good rebounding. This is yet another argument to look at defensive rebounding percentage, since this roughly controls for how good your team is at defending and will measure their skill at rebounding, independent of their skill at forcing bad shots.

I agree that generating more possessions via good offensive rebounding is a great measure of team skill, but (1) the vast majority of rebounds are defensive, so you should look at a statistic that isolates offensive rebounds, and (2) if a team is really good at shooting, you will naturally have fewer opportunities to get offensive rebounds. So you should look at offensive rebound % for the same reasons you would look at defensive rebound % - it controls for offensive skill and will specifically measure how good a team is at rebounding alone.

When I use the word "possessions" I am talking about opportunities to get a shot.
If I shoot and miss and then get the ball, to me that is a new possession. Do people really think something different?
Yes, a possession is defined as beginning when you receive the ball and ending when the other team takes possession of the ball. Offensive rebounds just extend a single possession.

I am pretty confident that rebounding margin, turnover margin and shooting percentages would be good "predictors" of who won a given game. I guess free throws would make it even more robust.
I completely agree. If we added a pace-adjusted measure for turnover margin to the season-level stats above, I'm sure it would be significantly correlated with wins. But if the goal of this analysis is to find out why teams are good, then turnover margin and FT margin should be way more interesting than rebound margin.
 

PurpleWhiteBoy

The way I see it, the use of these stats would be to use them as inputs and try to project or estimate the final score. You wouldn't take the final score and then try to estimate how many rebounds each team got because nobody really cares.
 

PurpleWhiteBoy

I looked at some numbers from the 2019-20 Big Ten season.
My conclusion is that if you take your EFG% in a game, your opponent's EFG% in the game, the made free throw differential in the game, the rebounding margin and the turnover margin, you can predict the final scoring difference quite accurately.

That may seem pretty obvious. You outrebound your opponent by 5, win the turnover battle by 7, you should win by 12, unless you shoot the ball worse than they do or they make a lot more free throws...
Something along those lines.

What I like about it is that it gives me a way to modify the actual +/- for a guy based on his contributions (or damage) to the team while he was on the court.