Omar 382
Most people understand, or at least believe, that a run differential of about 10 runs leads to one extra win. I fit a linear regression of winning percentage on run differential and got the following output:
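For anyone who wants to reproduce this, the fit itself is just ordinary least squares with one predictor. Here is a minimal sketch in Python; the (RD, Wpct) pairs are made up for illustration, while the real fit used every team's actual run differential and record:

```python
# Ordinary least squares for Wpct = intercept + slope * RD.
# The (RD, Wpct) pairs below are made up for illustration only.
data = [(-120, 0.42), (-60, 0.47), (0, 0.50), (45, 0.53), (110, 0.57)]

n = len(data)
mean_rd = sum(rd for rd, _ in data) / n
mean_w = sum(w for _, w in data) / n

# slope = covariance(RD, Wpct) / variance(RD); intercept from the means
slope = (sum((rd - mean_rd) * (w - mean_w) for rd, w in data)
         / sum((rd - mean_rd) ** 2 for rd, _ in data))
intercept = mean_w - slope * mean_rd
print(intercept, slope)
```

Running the same two lines of arithmetic over the real team data is what produces the coefficients below.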
Therefore, a team's estimated winning percentage can be obtained from the following formula:
Wpct = 0.4999918 + 0.0006287 × RD
This formula tells us that a team with a run differential of 0 (say, 750 runs scored and 750 runs allowed) can expect to win about half its games, or 81 games. In addition, a one-unit increase in run differential leads to a 0.0006287 increase in winning percentage. Therefore, a team scoring 760 runs and allowing 750 has a run differential of +10 and a predicted winning percentage of 0.500 + 10 · 0.0006287 ≈ 0.506. A .506 winning percentage in a 162-game season corresponds to about 82 wins.
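The worked example above is easy to script; a quick sanity check using nothing beyond the two coefficients in the formula:

```python
# Predicted winning percentage from run differential, using the
# coefficients of the fitted line above.
def predicted_wpct(run_diff):
    return 0.4999918 + 0.0006287 * run_diff

rd = 760 - 750            # scored 760, allowed 750
wpct = predicted_wpct(rd)
print(round(wpct, 3))     # 0.506
print(round(wpct * 162))  # 82 wins
```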
I analyzed all teams since 2000, and plotted their residuals (basically the difference between the actual and estimated winning percentages of each team) versus the run differential for the fitted linear model. Here are my results:
The plot may make the model look less effective than it is, since quite a few points sit well away from the zero line, but remember that I used −0.05 and 0.05 as the y-axis limits. If I had instead used −0.10 and 0.10, the points would appear much closer to the line.
[If you are wondering about the model's efficacy, read this; if not, skip it. I took the root mean square error (RMSE) to estimate the typical magnitude of the errors. Approximately two thirds of the residuals fall between −RMSE and +RMSE, while about 95% fall between −2·RMSE and +2·RMSE. Therefore, my model looks fairly sound.]
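For completeness, here is how the RMSE and those coverage fractions can be computed; a small sketch where `actual` and `predicted` would be the actual and model-estimated winning percentages for each team:

```python
import math

def rmse(actual, predicted):
    """Root mean square error of the residuals actual - predicted."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def coverage(actual, predicted, k=1):
    """Fraction of residuals within k * RMSE of zero."""
    e = rmse(actual, predicted)
    residuals = [a - p for a, p in zip(actual, predicted)]
    return sum(abs(r) <= k * e for r in residuals) / len(residuals)
```

Calling `coverage(actual, predicted, 1)` and `coverage(actual, predicted, 2)` on the team data gives the "about two thirds" and "about 95%" figures mentioned above.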
The funny thing I noticed was the two outliers: the 2008 Angels and the 2006 Indians. The Angels, with a +68 run differential, were supposed, according to the linear equation, to have a 0.542 winning percentage; they ended the season at 0.617, for a residual of 0.617 − 0.542 = 0.075. On the other side, the 2006 Cleveland Indians, with a +88 run differential, are seen as a 0.555 team by the linear model, but they actually finished at a mere 0.481, corresponding to the residual 0.481 − 0.555 = −0.074.
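Those two residuals are easy to check with the same fitted line (the third decimal can differ by one from the figures above depending on where you round the prediction):

```python
def predicted_wpct(run_diff):
    # Coefficients from the fitted line above
    return 0.4999918 + 0.0006287 * run_diff

# (team, run differential, actual winning percentage)
outliers = [("2008 Angels", 68, 0.617), ("2006 Indians", 88, 0.481)]

for team, rd, actual in outliers:
    residual = actual - predicted_wpct(rd)
    print(team, round(predicted_wpct(rd), 3), round(residual, 3))
```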
I wonder what, if anything observable, caused these two teams to overperform and underperform the linear model by so much. Questions and comments are welcome!