Bill Polian: Genius or Crazy? 2-15-07

During the 2006-2007 NFL season, the Indianapolis Colts achieved many ‘firsts’
in NFL history. Unlike Peyton Manning’s record 49 touchdown passes in the 2004
season, many of these firsts were negative for the Colts. They were the first team to
reach the playoffs after ranking last in the league in rushing yards allowed. They gave
up the second most rushing yards in a single game since the NFL merger, 375 yards
against the Jacksonville Jaguars, which was also by far the worst in franchise history.
Despite this poor regular-season performance, they became the first team with such
a poor defensive season to win (or even reach) the Super Bowl.
Throughout the season, the media was very critical of the team for its poor play.
Many sportswriters predicted the Colts would lose early in the playoffs, based mostly on
the popular adages that surround the NFL playoffs. These adages are “defense wins
championships” and, the one that I will be testing, “To win, you have to run the ball and
stop the run.” In the face of question after question regarding the defense and this adage,
Bill Polian, the Colts’ highly animated and opinionated general manager, made a
statement that seemed to contradict all popular opinions about what it takes to win in the
NFL. He declared that the important statistics in the NFL had nothing to do with rushing,
and that it was really yards per pass attempt and turnover ratio that were most important.
Since none of the popular NFL adages mention yards per pass attempt at all, this
statement was not well received among Colts fans, many of whom believed that he
was making excuses for the way he had built the Colts team. The goal of this paper is to
test this statement using statistics.
To begin with, I had to decide what method I would use to test this statement.
Using a linear regression to find the relationship between wins and some of the various
statistics was the first method that came to mind. Even though this seems a simple
method of testing, the combinations of statistics that I chose resulted in 306 linear
regressions that needed to be run in Minitab, and the collection of data itself was a
difficult process.
The first step, then, was the collection of data. I decided that I did not want to go
back too many years, because rule changes and varying styles of play may have changed
which statistics mattered in earlier seasons. I used data from the 1999 season up
through the 2006 season, and also all of those seasons put together into one data set. I
also had to decide which statistics to test, and so when I collected my data I included the
following statistics: offensive pass attempts (O PA), offensive pass yards (O PY),
offensive yards per pass (O YPP), offensive rush attempts (O RA), offensive rush yards
(O RY), offensive yards per rush (O YPR), offensive turnovers (O TO), and all of the
preceding statistics for each defense as well. Instead of using turnover ratio, I used the
turnover difference (TO Diff), defensive takeaways minus offensive giveaways. These
statistics should be analogous, since both compare a team’s takeaways with its giveaways.
I wanted to collect the data in an automated fashion, because entering the data by hand
for each season would have been exceedingly tedious. Each year from my source was sorted
by points scored for the offense and points given up by the defense. I wanted a way to put
all the teams in the same order for each season so I could just copy and paste the data into
Minitab. Since I did not know how to write a script to extract the data, I
used the most powerful tool that I was familiar with, Microsoft Excel.
In Excel, I used a function called “VLOOKUP,” which stands for vertical look-up.
This function searches for an input term in the first column of a table of data in Excel,
then returns the value from a specified column of the matching row. For
example, if I were looking for data from the third column of my data table for the Colts, I
would tell “VLOOKUP” to look for “IND” (for Indianapolis) and output the data from
column 3 of that row. I did this for each year of my data by copying the table of stats into
Excel from the web page, then applying this function to all the relevant stats. It kept me
from having to manually input each data point for all eight seasons with over fourteen
different categories to record.
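
For readers who would rather script this step than use a spreadsheet, the same look-up idea can be sketched in Python. This is only an illustration, not the procedure used for this paper: the rows, the abbreviations other than “IND,” and every number below are made-up placeholders, and the helper name vlookup is hypothetical.

```python
# Hypothetical rows as they might be pasted from a stats page:
# (team, pass attempts, pass yards). All numbers are placeholders.
raw_rows = [
    ("NE", 500, 3500),
    ("IND", 550, 4400),
    ("JAX", 450, 2900),
]


def vlookup(key, table, col_index):
    """Mimic an exact-match VLOOKUP: find the row whose first column equals
    key and return the value in the 1-based column col_index."""
    for row in table:
        if row[0] == key:
            return row[col_index - 1]
    raise KeyError(key)


# One fixed team order reused for every season, so that the columns
# line up when they are pasted into the statistics package.
team_order = ["IND", "JAX", "NE"]
pass_yards = [vlookup(team, raw_rows, 3) for team in team_order]
print(pass_yards)  # [4400, 2900, 3500]
```

Applying the same team_order to every season’s table gives columns that can be stacked without re-sorting by hand.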
Next came the initial analysis of the data. I decided to start by simply testing each
individual statistic with a linear regression model predicting wins in the regular season. I
figured this would be mostly useful for giving me an idea as to which statistics were
going to be most important. I didn’t expect that any of the statistics from this
single-variable regression would predict the number of wins a team would have to a high
level of accuracy. Table 1 and Figure 1 show some selected results from the
single-variable regressions.
[Scatterplot of wins 00 (y-axis) vs D RA (x-axis)]
Figure 1: Scatterplot with regression for wins versus D RA in 2000

Year      Best predictor (R² %)    2nd best predictor (R² %)
1999      D RA (47.2)              D RY (43.8)
2000      D RA (64.3)              TO Diff (48.1)
2001      D RA (53)                D YPP (36.6)
2002      D RA (50.2)              D YPP (32.9)
2003      D RA (44.4)              O YPP (43.9)
2004      D RA (46.3)              O RY (43.5)
2005      D RA (56)                O RA (49.7)
2006      O RA (41.3)              D RA (30.7)
overall   D RA (47.8)              O RA (32.4)
Table 1: Sample results from the single variable test
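
As a rough illustration of what each single-variable regression computes, the sketch below fits wins = a + b·x by least squares and reports R². The regressions in this paper were run in Minitab; this Python version is only a stand-in, and the six data points are invented placeholders rather than real team statistics.

```python
def simple_regression(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx                      # slope
    a = mean_y - b * mean_x            # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / ss_tot    # share of variance explained
    return a, b, r_squared


# Placeholder values only (not real data): a D RA column and a wins column.
d_ra = [380, 420, 450, 480, 520, 560]
wins = [13, 11, 10, 8, 6, 4]
a, b, r2 = simple_regression(d_ra, wins)
print(f"wins = {a:.2f} + {b:.4f} * D_RA, R^2 = {100 * r2:.1f}%")
```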
After I analyzed all of the single-variable regression data, the astounding fact was
that defensive rushing attempts was clearly the best single-variable predictor. This did not
seem to directly follow any of the adages about the NFL, unless you take the large logical
step that failing to stop the run implies that teams will run more. Defensive yards per rush
had an R² value of only 0.8%, however, so this theory does not explain the result very
well. Yards per pass did show up frequently in the top three R² values, which told me that
Bill Polian may actually be onto something.
Although the single-variable regressions did not give much in terms
of accurately predicting the number of wins a team may end up with, they did help to tell me
which statistics were important when I began my multiple-variable regressions. There
was no way I could test every combination of two statistics, so I decided to test the
corresponding statistics offensively and defensively (e.g., O YPP and D YPP, O RA and
D RA), along with many of the stats coupled with TO difference. I chose TO
difference partly because I knew that it was part of Bill Polian’s stated theory, but
also because it had shown a substantial R² in the single-variable tests, even though it was
never the highest value. Table 2 shows selected results, much like Table 1.
Year      Best predictors (R² %)    2nd best predictors (R² %)
1999      O YPP, D YPP (54.3)       D RA, TO Diff (50.2)
2000      D RA, TO Diff (71.3)      O RA, D RA (69.8)
2001      D RA, TO Diff (62.2)      O RA, D RA (56.1)
2002      D RA, TO Diff (63.3)      O RA, D RA (50.3)
2003      O YPP, D YPP (63.2)       O YPP, TO Diff (54)
2004      D RA, TO Diff (54.3)      O RY, TO Diff (48.5)
2005      D RY, TO Diff (65.9)      D RA, TO Diff (65.7)
2006      D YPP, TO Diff (71)       O RA, TO Diff (64.7)
overall   D RA, TO Diff (56)        O RA, D RA (51.8)
Table 2: Some sample values for the 2 variable regression
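
A regression with two or more predictors can be sketched the same way. The version below solves the least-squares normal equations with plain Gaussian elimination so that it needs no external libraries; it is an assumed stand-in for what a statistics package does internally, not the Minitab procedure itself, and the (D RA, TO Diff) values are invented placeholders.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.
    Assumes the system is not singular (predictors are not collinear)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def multiple_regression(predictors, y):
    """Fit y = b0 + b1*x1 + b2*x2 + ... by least squares.
    predictors holds one row of predictor values per team.
    Returns (coefficients, r_squared), intercept first."""
    rows = [[1.0] + list(p) for p in predictors]  # intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1 - ss_res / ss_tot


# Placeholder (D RA, TO Diff) pairs and wins; not real team data.
predictors = [(380, 12), (420, 5), (450, 7), (480, -2), (520, -6), (560, -10)]
wins = [13, 11, 10, 8, 6, 4]
coefficients, r_squared = multiple_regression(predictors, wins)
print(f"R^2 = {100 * r_squared:.1f}%")
```

Adding a third value to each row of predictors turns this into a three-variable regression.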
Once again, the two-variable regression shows that rushing attempts dominate most of the
best predictors, along with turnover difference. At this point in the data analysis, I was
beginning to think that Bill Polian’s theory would have to be rejected. I had only one
more set of combinations to test: those involving three variables. I needed to test this
last set because I wanted to test O YPP, D YPP, and TO Diff together, which was the
combination named in Bill Polian’s statement.
This time, I chose the predictors that had proven to be the best over the course of
the single- and two-variable regressions. A sample table for the variables used is Table 3,
and the overall results are shown in Table 4, with only the best predictors listed.
Predictor 1   Predictor 2   Predictor 3   R² (%)
O PA          D PA          TO Diff       25.4
O PY          D PY          TO Diff       44.2
O YPP         D YPP         TO Diff       62.1
O RA          D RA          TO Diff       50.3
O RY          D RY          TO Diff       42.2
O YPR         D YPR         TO Diff       23.4
Table 3: Sample values for 3 variable testing of 1999 statistics
Year      Best predictors (R² %)
1999      O YPP, D YPP, TO Diff (62.1)
2000      O RA, D RA, TO Diff (73.8)
2001      O YPP, D YPP, TO Diff (67.6)
2002      O RA, D RA, TO Diff (63.3)
2003      O YPP, D YPP, TO Diff (77)
2004      O YPP, D YPP, TO Diff (58)
2005      O YPP, D YPP, TO Diff (71.9)
2006      O YPP, D YPP, TO Diff (74.5)
overall   O YPP, D YPP, TO Diff (62.1)
Table 4: Best predictors for 3 variable regression
The results of this final regression test were unexpected. In all but two of the seasons,
as well as in the overall data set, yards per pass and turnover difference were the best
predictors for wins. Although the earlier single- and two-variable regression results had
pointed towards rushing attempts, the final test with three variables pointed to yards per
pass. I think this is because BOTH offensive yards per pass and defensive yards per pass
are more consistent predictors than offensive and defensive rush attempts taken together,
since offensive rush attempts is not a good predictor. Figures 2 and 3 illustrate this point.
[Scatterplot of wins 01 (y-axis) vs O YPP and D YPP (x-axis)]
Figure 2: Showing the negative relationship between wins and D YPP and the positive
relationship between wins and O YPP for the 2001 season
[Scatterplot of wins 01 (y-axis) vs O RA and D RA (x-axis)]
Figure 3: Showing the relationships between offensive and defensive rushing attempts
and wins for the 2001 season

The preceding graphs seem to substantiate the following claim. Defensive rushing
attempts is the best single predictor and is included in the best two-variable predictor,
but when it is combined with offensive rushing attempts the prediction does not work as
consistently as the yards per pass prediction equation, because of the extra variance added
by offensive rushing attempts. If you take Bill Polian’s statement generally, that offensive
yards per pass, defensive yards per pass, and turnover difference are the most important
statistics, the regression model test appears to substantiate this claim.
After I had finished regression testing, I was eager to validate my results by using
the model utility F test, where F is calculated using the equation
F = [(SSres(reduced) - SSres(full))/g] / [SSres(full)/(n-k-1)].
Here SSres is the residual sum of squares, n is the number of observations, k is the
number of predictors in the full model, and g is the number of predictors removed to form
the reduced model. First, when I calculated SSres(full) by
fitting a regression with all the variables as predictors, I was not happy with the
result. The p-values for the coefficients of both O YPP and D YPP were greater
than 0.05, meaning that hypothesis tests on those coefficients could not conclude
that they were statistically different from zero. Performing the F test
only confirmed this result with an F of 0.5578, far too small to reject the hypothesis
that the coefficients are zero. This seemed to be a contradiction to me, since the
best overall predictors were O YPP, D YPP, and TO Diff. The coefficients for each of the
rushing attempt statistics showed p-values of approximately 0, so they clearly could not
be removed.
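
The arithmetic of the F statistic above can be sketched in a few lines of Python. The function name partial_f is hypothetical and the sums of squares, n, and k in the example call are placeholders, not the values from this analysis; g is taken to be the number of predictors dropped from the full model.

```python
def partial_f(ss_res_reduced, ss_res_full, g, n, k):
    """F statistic for testing whether g predictors can be dropped from a
    full model that has k predictors and was fitted to n observations."""
    numerator = (ss_res_reduced - ss_res_full) / g
    denominator = ss_res_full / (n - k - 1)
    return numerator / denominator


# Placeholder inputs: dropping g = 2 predictors raises the residual sum of
# squares from 200 to 210 in a 13-predictor model with 250 observations.
f_stat = partial_f(ss_res_reduced=210.0, ss_res_full=200.0, g=2, n=250, k=13)
print(round(f_stat, 3))  # 5.9
```

The statistic is then compared with an F distribution with g and n-k-1 degrees of freedom to obtain a p-value; that step is left to a statistics package here.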
I attribute my lack of a satisfactory explanation for this discrepancy to
inexperience with the use of model utility testing. When testing certain groups of
variables, the result may show that those variables do not contribute enough to the
regression to justify keeping them. However, testing the same variable in a different group
may show that it should remain in the regression. I was not sure of a way to use
model utility testing to get a definitive answer about which variables made the best
predictions.
The last method I wanted to use was an actual test of a regression equation on
data outside the set that was used to create it. I decided to use data from 1998
and, for a few teams, to test the regression for O YPP, D YPP, and TO Diff against the
regression for O RA, D RA, and TO Diff. Table 5 shows the results.
Actual wins   YPP predicted wins   RA predicted wins
12            11                   10
10            12                   12
10            9                    10
9             9                    8
3             4                    3
Table 5: Simple test of overall regression models for 5 teams from 1998
It would be difficult to achieve definitive results testing in this way. In fact what this
method really shows is that both predictions are fairly accurate, but it would take
hundreds of data points to draw a strong conclusion about one regression or the other.
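
One simple way to put a number on “fairly accurate” is the mean absolute error of each model on the five teams in Table 5. This summary measure is an addition for illustration and was not part of the original analysis; the three lists below are copied from Table 5.

```python
# Values copied from Table 5.
actual_wins = [12, 10, 10, 9, 3]
ypp_predicted = [11, 12, 9, 9, 4]
ra_predicted = [10, 12, 10, 8, 3]


def mean_absolute_error(actual, predicted):
    """Average size of the prediction miss, in wins."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


print(mean_absolute_error(actual_wins, ypp_predicted))  # 1.0
print(mean_absolute_error(actual_wins, ra_predicted))   # 1.0
```

Both models miss by one win on average for these five teams, which is consistent with the point that a sample this small cannot separate them.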
One final point that is important to mention is the use of R² instead of adjusted R². At
the beginning of my work on testing the data, I did not know what adjusted R² was. I did
not know that it takes into account the number of variables that are used in the
regression. However, once I learned this fact, I realized that I had only compared R²
values between regressions with the same number of variables, so I did not need to worry
about the difference between R² and adjusted R².
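
For reference, the usual adjustment can be sketched as follows. The R² values are taken from Tables 1 and 4 for 1999; the choice of n = 31 is an assumption (one observation per team, with 31 teams in the league that season) and may not match the exact data set used in Minitab.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# 1999 values: best single predictor (47.2%) and best three predictors (62.1%).
# n = 31 is an assumed sample size, one row per team.
print(round(adjusted_r2(0.472, n=31, k=1), 3))  # 0.454
print(round(adjusted_r2(0.621, n=31, k=3), 3))  # 0.579
```

The penalty grows with the number of predictors, which is why the adjusted value matters mainly when models of different sizes are compared.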

In conclusion, it appears that Bill Polian’s statement about yards per pass being
the most important statistic is actually pretty close to the truth. It is not definitive, and the
R² values show that the statistics are never a great predictor of the variance in wins
around the NFL. However, a simple test has shown that the regression equation built from
all of the data from the last eight seasons can predict fairly well the number of wins a
team will have at the end of the 16-game regular season.

References
Pro Football Reference website, “www.pro-football-reference.com,” last accessed 2-14-07.
