
Exploring the Relationship Between Runs and Wins, and Observing Outliers

Omar 382

Well-Known Member
16,827
1,166
173
Joined
Jul 17, 2013
Hoopla Cash
$ 1,000.00
Fav. Team #1
Fav. Team #2
Fav. Team #3
Most people understand, or at least believe, that a run differential of about 10 runs leads to 1 win. I used a linear regression, and got the following output:
[Image: linear regression output]

Therefore, a team's estimated winning percentage can be obtained from the following formula:

Wpct = 0.4999918 + 0.0006287 × RD

This formula tells us that a team with a run differential of 0 (say, 750 runs scored and 750 runs allowed) can expect to win about half its games, or 81 of 162. Each additional run of differential is associated with a 0.0006287 increase in predicted winning percentage. A team scoring 760 runs and allowing 750 therefore has a run differential of +10 and a predicted winning percentage of 0.500 + 10 × 0.0006287 ≈ 0.506. A .506 winning percentage in a 162-game season corresponds to about 82 wins, which matches the 10-runs-per-win rule of thumb.
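As a quick check of that arithmetic, here is a minimal Python sketch of the fitted line. The coefficients are the ones quoted in the formula; the function names and the 162-game default are just labels chosen for this illustration.

```python
# Coefficients quoted in the formula above.
INTERCEPT = 0.4999918
SLOPE = 0.0006287  # change in winning percentage per run of differential


def predicted_wpct(run_diff):
    """Winning percentage predicted by the fitted line for a run differential."""
    return INTERCEPT + SLOPE * run_diff


def predicted_wins(run_diff, games=162):
    """Predicted wins over a season of `games` games (162 is assumed here)."""
    return predicted_wpct(run_diff) * games


# Compare an even team with a team that outscores opponents by 10 runs.
for rd in (0, 10):
    print(rd, round(predicted_wpct(rd), 3), round(predicted_wins(rd), 1))
```

The gap between the two printed win totals is the slope times 10 runs times 162 games, which is where the "10 runs is about 1 win" figure comes from.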

I analyzed all teams since 2000, and plotted their residuals (basically the difference between the actual and estimated winning percentages of each team) versus the run differential for the fitted linear model. Here are my results:

[Image: residuals versus run differential for the fitted linear model]

The graphic may make the model look less effective than it is, since quite a few points sit away from the straight line, but remember that I used -0.05 and 0.05 as the plot parameters. If I had used -0.10 and 0.10 instead, the dots would appear even closer to the line.

[If you are wondering about the model's efficacy, read this; if not, skip it. I took the root mean square error (RMSE) to estimate the average magnitude of the errors. Approximately two thirds of the residuals fall between −RMSE and +RMSE, and about 95% fall between −2·RMSE and +2·RMSE, which is what you would expect from roughly normally distributed errors. Therefore, my model looks fairly sound.]
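For anyone who wants to repeat that check, a rough sketch follows. It assumes you already have a list of residuals (actual minus predicted winning percentage). The `residuals` values below are made-up placeholders, not the data behind the plot, and the function names are hypothetical. The RMSE here is the plain mean over n; a regression summary that divides by the degrees of freedom would give a slightly larger value.

```python
import math


def rmse(residuals):
    """Root mean square error: square root of the mean squared residual."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def share_within(residuals, k=1.0):
    """Fraction of residuals lying between -k*RMSE and +k*RMSE."""
    limit = k * rmse(residuals)
    return sum(1 for r in residuals if abs(r) <= limit) / len(residuals)


# Placeholder residuals for illustration only (NOT the real team data).
residuals = [0.010, -0.020, 0.030, -0.015, 0.005, -0.040, 0.025, -0.005]

print("RMSE:", round(rmse(residuals), 4))
print("share within 1 RMSE:", share_within(residuals, 1))
print("share within 2 RMSE:", share_within(residuals, 2))
```

With real team data, shares near two thirds and 95% would be in line with the description above.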

The funny thing I noticed was the two outliers: the 2008 Angels and the 2006 Indians. The Angels had a +68 run differential, so according to the linear equation they were supposed to have a 0.542 winning percentage; they ended the season at 0.617. The residual for this team is 0.617 − 0.542 = 0.075. On the other side, the 2006 Cleveland Indians, with a +88 run differential, are seen as a 0.555 team by the linear model, but they actually finished at a mere 0.481, for a residual of 0.481 − 0.555 = −0.074.
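To put those two residuals in more familiar units, this sketch converts them to wins. It uses only the run differentials and winning percentages quoted above; the dictionary layout, the function name, and the 162-game season length are assumptions added for illustration. Because the quoted coefficients and percentages are rounded, the last digit can differ slightly from the figures in the text.

```python
INTERCEPT = 0.4999918
SLOPE = 0.0006287
GAMES = 162  # assumed season length

# (run differential, actual winning percentage) as quoted in the post.
teams = {
    "2008 Angels": (68, 0.617),
    "2006 Indians": (88, 0.481),
}


def residual(run_diff, actual_wpct):
    """Actual minus predicted winning percentage under the fitted line."""
    return actual_wpct - (INTERCEPT + SLOPE * run_diff)


for name, (rd, actual) in teams.items():
    gap_in_wins = residual(rd, actual) * GAMES
    print(f"{name}: about {gap_in_wins:+.0f} wins versus the linear estimate")
```

Both teams land roughly a dozen wins away from the line, in opposite directions.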

I wonder what, if anything observable, caused these two teams to over- and underperform the linear model. Questions and comments are welcome!
 

Omar 382

Probably the problem with the '06 Indians is that @SlinkyRedfoot rooted for them and cursed them with the stink he got from that hobo in Kensington when he went to North Philly in May
 

Omar 382

How'd last year's Rangers do?
Good question. The database I was working with only went up to 2015, so I'm not sure where they would be in that model, but probably pretty high and to the right. If they release a database for 2016, I'll definitely plot it!
 

Guy Incognito

Crack a window, will ya?
24,089
5,003
533
Joined
Jul 26, 2016
Location
The Village!
Hoopla Cash
$ 342.86
Fav. Team #1
Fav. Team #2
Fav. Team #3
Outside of any real in-depth analysis, the '06 Indians had 3 90-win teams in their division (White Sox won 90 games and finished 5 games out of second), while the AL West in '08 only had the Angels finish above .500. So I'm guessing the outlier in both cases was the relative strength/weakness of the division.
 

Omar 382

Outside of any real in-depth analysis, the '06 Indians had 3 90-win teams in their division (White Sox won 90 games and finished 5 games out of second), while the AL West in '08 only had the Angels finish above .500. So I'm guessing the outlier in both cases was the relative strength/weakness of the division.
Wow! I never thought of that. Goes to show how important recognizing and having domain knowledge is.

I actually can't believe I didn't think of that, when something like their divisions is staring you right in the face.
 

MilkSpiller22

Gorilla
33,689
6,429
533
Joined
Apr 18, 2013
Hoopla Cash
$ 89,217.00
Fav. Team #1
Fav. Team #2
Fav. Team #3
Just wondering why you would make that formula up when there is already the Pythagorean estimated win%

and how does your formula compare to it??

and does your formula take into account different totals with the same differentials?? are teams better when they score more or prevent more runs??
 

Guy Incognito

Wow! I never thought of that. Goes to show how important recognizing and having domain knowledge is.

I actually can't believe I didn't think of that, when something like their divisions is staring you right in the face.
Gotta have context with the numbers sometimes.

Honestly, I didn't look at standings first either. Just kind of stumbled across it when I was looking for other info on the Angels team (I was prepared to go on some diatribe about how the Indians stunk playing small ball, as they were great in blowouts but bad in one-run games), and saw the AL West standings... then looked at the '06 AL Central standings, and, yeah, there's only so many wins to go around.
 

soxfan1468927

Well-Known Member
7,001
978
113
Joined
Jul 3, 2013
Location
603
Hoopla Cash
$ 7,185.00
Fav. Team #1
Fav. Team #2
Fav. Team #3
Wow! I never thought of that. Goes to show how important recognizing and having domain knowledge is.

I actually can't believe I didn't think of that, when something like their divisions is staring you right in the face.
Even if you take out games in the division, the Angels were a 98-99 win team outside the division, and the Indians were an 80-win team in games played outside theirs. Not much difference.
 

Omar 382

Just wondering why you would make that formula up when there is already the Pythagorean estimated win%

and how does your formula compare to it??

and does your formula take into account different totals with the same differentials?? are teams better when they score more or prevent more runs??
Linear regression is a classic way of predicting something. I could have used Bill James' formula, but I chose not to.

Bill James' formula is better. Its RMSE is lower, and it is much better in extreme situations. For example, a team with a run differential of +1,000 would be expected by the linear regression to win more than all of its games, which is impossible. In the Pythagorean formula, they have a winning percentage around .800, which is more feasible.

These things will likely never happen in our lifetime, but they do make James' formula superior to the classic linear regression.
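To make the comparison concrete, here is a small sketch of both estimators. The Pythagorean version uses Bill James' original exponent of 2, and the 2,000-scored, 1,000-allowed season is a hypothetical example picked only so that the run differential is +1,000. The function names are mine, not from the regression output.

```python
INTERCEPT = 0.4999918
SLOPE = 0.0006287


def linear_wpct(runs_scored, runs_allowed):
    """Fitted line from the first post; depends only on the differential."""
    return INTERCEPT + SLOPE * (runs_scored - runs_allowed)


def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expectation: RS^x / (RS^x + RA^x), with x = 2 assumed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical extreme season with a +1,000 run differential.
print(round(linear_wpct(2000, 1000), 3))       # exceeds 1, which is impossible
print(round(pythagorean_wpct(2000, 1000), 3))  # stays between 0 and 1
```

Other totals with the same +1,000 differential give different Pythagorean values, since that formula looks at the ratio of runs rather than the gap alone.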
 

MilkSpiller22

how did the Baltimore Orioles of 2015 do in this... Picking them because if my memory is correct, they were a first place team with a terrible RD, due to being amazing in one run games...
 

Omar 382

Gotta have context with the numbers sometimes.

Honestly, I didn't look at standings first either. Just kind of stumbled across it when I was looking for other info on the Angels team (I was prepared to go on some diatribe about how the Indians stunk playing small ball, as they were great in blowouts but bad in one-run games), and saw the AL West standings... then looked at the '06 AL Central standings, and, yeah, there's only so many wins to go around.
Gotta have context with the numbers all the time, brother.
 

MilkSpiller22

Linear regression is a classic way of predicting something. I could have used Bill James' formula, but I chose not to.

Bill James' formula is better. Its RMSE is lower, and it is much better in extreme situations. For example, a team with a run differential of +1,000 would be expected by the linear regression to win more than all of its games, which is impossible. In the Pythagorean formula, they have a winning percentage around .800, which is more feasible.

These things will likely never happen in our lifetime, but they do make James' formula superior to the classic linear regression.


I understand that... My question was why put so much work in for a known inferior method... and then ask why it may be inferior...
 

Omar 382

I understand that... My question was why put so much work in for a known inferior method... and then ask why it may be inferior...
When did I ask if it was inferior?
 

soxfan1468927

how did the Baltimore Orioles of 2015 do in this... Picking them because if my memory is correct, they were a first place team with a terrible RD, due to being amazing in one run games...
I believe you're thinking of 2012. They were a wild card team that won 93 games with a run differential of +7. They went 29-9 in 1-run games and 16-2 in extra innings.
 

MilkSpiller22

When did I ask if it was inferior?

my bad... but some of the reason for the outlier is because of the inferior formula... Your formula doesn't take into account how different TOTALS with same RD could be affected...

But even the Pythagorean estimated records for the Angels was pretty off, the Angels had 12 more wins than the estimated record should have been... Cleveland was only 4 wins less... Not that big of a difference...

But I would love to see in general form if it is better to Score more runs or prevent more runs... and where the line is for that...
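On the totals question, the Pythagorean formula (with an assumed exponent of 2) does give different answers for the same run differential. A rough sketch, using two made-up teams that both finish at +100:

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expectation with an assumed exponent of 2."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Two hypothetical teams with the same +100 differential.
low_scoring = pythagorean_wpct(650, 550)   # gets there by preventing runs
high_scoring = pythagorean_wpct(900, 800)  # gets there by scoring runs

print(round(low_scoring, 3), round(high_scoring, 3))
```

Under this formula a fixed differential is worth more in a low-scoring environment, so at equal differential the run-prevention team grades out a little better. The linear model, which only sees the differential, gives both teams the same estimate.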
 

MilkSpiller22

I believe you're thinking of 2012. They were a wild card team that won 93 games with a run differential of +7. They went 29-9 in 1-run games and 16-2 in extra innings.


yes, it was that 2012 Orioles...
 

MilkSpiller22

my bad... but some of the reason for the outlier is because of the inferior formula... Your formula doesn't take into account how different TOTALS with same RD could be affected...

But even the Pythagorean estimated records for the Angels was pretty off, the Angels had 12 more wins than the estimated record should have been... Cleveland was only 4 wins less... Not that big of a difference...

But I would love to see in general form if it is better to Score more runs or prevent more runs... and where the line is for that...


took the wrong Indians... the 2006 Indians had 11 less wins than they were expected... so I guess I solved nothing...
 

MilkSpiller22

interestingly though, the 2006 Indians had a 51% save percentage compared to the 2008 Angels having a 74%...

I guess Bullpens COULD be the entire reason...
 