Bill Polian: Genius or Crazy? 2-15-07

During the 2006-2007 NFL season, the Indianapolis Colts achieved many ‘firsts’

in NFL history. Unlike Peyton Manning’s record 49 touchdown passes in the 2004

season, many of these firsts were negative for the Colts. They were the first team to

reach the playoffs after ranking last in the league in rushing yards allowed. They gave

up the second-most rushing yards in a single game since the AFL-NFL merger, 375 yards

against the Jacksonville Jaguars, which was also by far the worst total in franchise history.

Despite the bad performance during the regular season, they were the first team with such

a poor defensive season to win (or even reach) the Super Bowl.

Throughout the season, the media was very critical of the team for its poor play.

Many sportswriters predicted the Colts would lose early in the playoffs, based mostly on

the popular adages that surround the NFL playoffs. These adages are “defense wins

championships” and, the one that I will be testing, “to win, you have to run the ball and

stop the run.” In the face of question after question regarding the defense and this adage,

Bill Polian, the Colts’ highly animated and opinionated general manager, made a

statement that seemed to contradict all popular opinions about what it takes to win in the

NFL. He declared that the important statistics in the NFL had nothing to do with rushing,

and that it was really yards per pass attempt and turnover ratio that were most important.

Since none of the popular NFL adages mention yards per pass attempt at all, this

statement was not very well-received among Colts fans, many of whom believed that he

was making excuses for the way he had built the Colts team. The goal of this paper is to

test this statement using statistics.

To begin with, I had to decide what method I would use to test this statement.

Using a linear regression to find the relationship between wins and some of the various
statistics was the first method that came to mind. Even though this seems a simple

method of testing, the combinations of statistics that I chose resulted in 306 linear

regressions that needed to be run in Minitab, and the collection of data itself was a

difficult process.

The first step, then, was the collection of data. I decided that I did not want to go

back too many years, because rule changes and varying styles of play may have altered

the important statistics in seasons too long ago. I used data from the 1999 season up

through the 2006 season, and also all of those seasons put together into one data set. I

also had to decide which statistics to test, and so when I collected my data I included the

following statistics: offensive pass attempts (O PA), offensive pass yards (O PY),

offensive yards per pass (O YPP), offensive rush attempts (O RA), offensive rush yards

(O RY), offensive yards per rush (O YPR), offensive turnovers (O TO), and all of the

preceding statistics for each defense as well. Instead of using turnover ratio, I used the

turnover difference, defensive takeaways minus offensive giveaways. These statistics

should be analogous.

I needed a way to collect the data in an automated fashion, because entering the data for

each season by hand would have been exceedingly tedious. Each season in my source was sorted

by points scored for the offense and points given up by the defense. I wanted a way to put

all the teams in the same order for each season so I could just copy and paste the data into

Minitab. Since I did not know how to write a script to extract the data in some way, I

used the most powerful tool that I was familiar with, Microsoft Excel.

In Excel, I used a function called “VLOOKUP,” which stands for vertical look-up.

This function searches for an input term in the first column of a table of data in Excel,
then returns the value from a specified column of the matching row. For

example, if I were looking for data from the third column of my data table for the Colts, I

would tell “VLOOKUP” to look for “IND” (for Indianapolis) and output the data from

column 3 of that row. I did this for each year of my data by copying the table of stats into

Excel from the web page, then applying this function to all the relevant stats. It kept me

from having to manually input each data point for all eight seasons with over fourteen

different categories to record.
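The lookup step described above can be sketched in a few lines of Python. This is only an illustration of what an exact-match VLOOKUP does, not the Excel workbook used for this paper; the table contents and the names `season_table`, `vlookup`, and `team_order` are hypothetical placeholders.

```python
# Hypothetical season table, as pasted from a stats page: one row per team,
# with the team abbreviation in the first column. All numbers are placeholders.
season_table = [
    ["NE", 500, 3600, 7.2],
    ["IND", 550, 4400, 8.0],
    ["JAX", 480, 3100, 6.5],
]


def vlookup(key, table, col_index):
    """Mimic an exact-match VLOOKUP: find the row whose first column equals
    `key` and return the value in column `col_index` (1-based, like Excel)."""
    for row in table:
        if row[0] == key:
            return row[col_index - 1]
    raise KeyError(key)  # Excel would display #N/A in this case


# Put every season into the same fixed team order before copying it onward.
team_order = ["IND", "JAX", "NE"]
aligned_column = [vlookup(team, season_table, 3) for team in team_order]
print(aligned_column)  # [4400, 3100, 3600]
```

Because every season is read out in the same `team_order`, the columns from different seasons line up row by row no matter how the source page sorted them.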

Next came the initial analysis of the data. I decided to start by simply testing each

individual statistic with a linear regression model predicting wins in the regular season. I

figured this would be mostly useful for giving me an idea as to which statistics were

going to be most important. I didn’t expect that any of the statistics from this single

variable regression would predict the number of wins a team would have to a high level

of accuracy. Table 1 and figure 1 show some selected results from the single variable

regressions.

[Figure: scatterplot of wins 00 vs D RA; x-axis D RA (350 to 600), y-axis wins 00 (0 to 14)]

Figure 1: Scatterplot with regression for wins versus D RA in 2000


Year      Best predictor (R² %)    2nd best predictor (R² %)
1999      D RA (47.2)              D RY (43.8)
2000      D RA (64.3)              TO Diff (48.1)
2001      D RA (53)                D YPP (36.6)
2002      D RA (50.2)              D YPP (32.9)
2003      D RA (44.4)              O YPP (43.9)
2004      D RA (46.3)              O RY (43.5)
2005      D RA (56)                O RA (49.7)
2006      O RA (41.3)              D RA (30.7)
overall   D RA (47.8)              O RA (32.4)

Table 1: Sample results from the single variable test

After analyzing all of the single variable regression data, the astounding fact was

that defensive rushing attempts was clearly the best single variable predictor. This did not

seem to directly follow any of the adages about the NFL, unless you take the large logical

step that failing to stop the run implies that teams will run more. Defensive yards per rush

had an R² value of only 0.8%, however, so this theory does not explain everything very

well. Yards per pass did show up frequently in the top three R² values, which told me that

Bill Polian may actually be onto something.
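For readers who want to see what one of these single variable regressions computes, here is a minimal least-squares sketch in Python. It is not the Minitab procedure used for this paper; the function name `simple_regression` and the sample values for `d_ra` and `wins` are hypothetical placeholders, not data from the study.

```python
def simple_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares and
    return (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - SSres / SStot: the share of the variance in y explained by x.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Placeholder values only: opponent rush attempts and wins for six teams.
d_ra = [400, 450, 500, 550, 480, 520]
wins = [12, 10, 7, 5, 9, 6]
slope, intercept, r_squared = simple_regression(d_ra, wins)
print(f"slope={slope:.4f}, R^2={100 * r_squared:.1f}%")
```

Running the same function once per statistic and per season, then sorting by R², would give a ranking of the same form as Table 1.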

Although running the single variable regression data did not give much in terms

of accurately predicting the number of wins a team may end up with, it did help to tell me

which statistics were important when I began my multiple variable regressions. There

was no way I could test every combination of two statistics, so I decided to test the
corresponding statistics offensively and defensively (e.g. O YPP and D YPP, O RA and D

RA), along with many of the stats coupled with TO difference. I chose TO

difference partly because I knew that it was part of Bill Polian’s stated theory, but

also because it had shown a substantial R² in the single variable tests, even though it was

never the highest value. Table 2 shows selected results much like table 1.

Year      Best predictors (R² %)       2nd best predictors (R² %)
1999      O YPP, D YPP (54.3)          D RA, TO Diff (50.2)
2000      D RA, TO Diff (71.3)         O RA, D RA (69.8)
2001      D RA, TO Diff (62.2)         O RA, D RA (56.1)
2002      D RA, TO Diff (63.3)         O RA, D RA (50.3)
2003      O YPP, D YPP (63.2)          O YPP, TO Diff (54)
2004      D RA, TO Diff (54.3)         O RY, TO Diff (48.5)
2005      D RY, TO Diff (65.9)         D RA, TO Diff (65.7)
2006      D YPP, TO Diff (71)          O RA, TO Diff (64.7)
overall   D RA, TO Diff (56)           O RA, D RA (51.8)

Table 2: Some sample values for the 2 variable regression

Once again, the two variable regression shows that rushing attempts dominate most of the

best predictors, along with turnover difference. At this point in the data analysis, I was

beginning to think that Bill Polian’s theory would have to be rejected. I had only one

more set of combinations to test: those involving three variables. I needed to test this

last set because I wanted to test O YPP, D YPP, and TO Diff together, which was Bill

Polian’s statement.
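The two and three variable regressions work the same way as the single variable ones, only with more predictors. The sketch below fits such a model in plain Python by solving the normal equations; it is an illustration under stated assumptions, not the Minitab output behind the tables. The function name `fit_multiple_regression` and every number in the example are hypothetical.

```python
def fit_multiple_regression(rows, y):
    """Least-squares fit of y on several predictors.
    rows: one tuple of predictor values per team; y: wins per team.
    Returns (coefficients, r_squared), with the intercept first."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    # Augmented normal equations: (X^T X) b = X^T y
    A = [
        [sum(xr[i] * xr[j] for xr in X) for j in range(p)]
        + [sum(xr[i] * yi for xr, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    coeffs = [A[i][p] / A[i][i] for i in range(p)]
    predictions = [sum(c * v for c, v in zip(coeffs, xr)) for xr in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coeffs, 1 - ss_res / ss_tot


# Placeholder values only: (D RA, TO Diff) for six teams, and their wins.
predictors = [(400, 10), (450, 3), (500, -2), (550, -8), (480, 5), (520, -4)]
wins = [12, 10, 7, 5, 9, 6]
coeffs, r_squared = fit_multiple_regression(predictors, wins)
print(coeffs, f"R^2={100 * r_squared:.1f}%")
```

Adding a third value to each tuple (for example O YPP, D YPP, TO Diff) gives the three variable case without changing the function.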
This time, I chose the predictors that had proven to be the best over the course of

the single and two variable regressions. A sample table for the variables used is table 3,

and the overall results are shown in table 4, with only the best predictors listed.

Predictor 1   Predictor 2   Predictor 3   R² %
O PA          D PA          TO Diff       25.4
O PY          D PY          TO Diff       44.2
O YPP         D YPP         TO Diff       62.1
O RA          D RA          TO Diff       50.3
O RY          D RY          TO Diff       42.2
O YPR         D YPR         TO Diff       23.4
Table 3: Sample values for 3 variable testing of 1999 statistics

Year      Best predictors (R² %)
1999      O YPP, D YPP, TO Diff (62.1)
2000      O RA, D RA, TO Diff (73.8)
2001      O YPP, D YPP, TO Diff (67.6)
2002      O RA, D RA, TO Diff (63.3)
2003      O YPP, D YPP, TO Diff (77)
2004      O YPP, D YPP, TO Diff (58)
2005      O YPP, D YPP, TO Diff (71.9)
2006      O YPP, D YPP, TO Diff (74.5)
overall   O YPP, D YPP, TO Diff (62.1)

Table 4: Best predictors for 3 variable regression

The results of this final regression test were unexpected. In all but two of the seasons,

along with the overall data set, yards per pass and turnover difference were the best predictors of

wins. After the earlier single and two variable regression results had all pointed towards

rushing attempts, the final test with three variables pointed to yards per pass. I think this

is because offensive yards per pass and defensive yards per pass are BOTH consistent
predictors, whereas the rush attempt pair is weakened because offensive

rush attempts is not a good predictor. Figures 2 and 3 illustrate this point.

[Figure: scatterplot of wins 01 vs O YPP and D YPP; x-axis X-Data (5 to 9), y-axis wins 01 (0 to 14)]

Figure 2: Showing the negative relationship between wins and D YPP and the positive

relationship between wins and O YPP for the 2001 season

[Figure: scatterplot of wins 01 vs O RA and D RA; x-axis X-Data (350 to 600), y-axis wins 01 (0 to 14)]

Figure 3: Showing the relationships between offensive and defensive rushing attempts

and wins for the 2001 season


The preceding graphs seem to substantiate the claim that even though defensive rushing

attempts is the best single predictor and is included in the best two variable predictor,

when combined with offensive rushing attempts the prediction does not work as

consistently as the yards per pass prediction equation because of the extra variance added

by offensive rushing attempts. If you take Bill Polian’s statement generally, that offensive

yards per pass, defensive yards per pass, and turnover difference are the most important

statistics, the regression model test appears to substantiate this claim.

After I had finished regression testing, I was eager to validate my results by using

the model utility F test, where F is calculated using the equation

F = [(SSres(reduced) - SSres(full))/g]/[SSres(full)/(n-k-1)], where n is the number of

observations, k is the number of predictors in the full model, and g is the number of

predictors removed to form the reduced model. First, when I calculated SSres(full) by

calculating a regression with all the variables as predictors, I was not happy with the

result. The p-values for the coefficients of both O YPP and D YPP were greater

than 0.05, meaning that hypothesis testing on those coefficients could not conclude

that they were statistically different from zero. Performing the F test

only confirmed this result with an F of 0.5578, too small to show that the coefficients are

statistically different from zero. This seemed to be a contradiction to me, since the

best overall predictors were O YPP, D YPP, and TO Diff. The coefficients for each of the

rushing attempt statistics showed p-values of approximately 0, so they clearly could not

be removed.
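The F statistic in the equation above is simple to compute once the two residual sums of squares are known. The sketch below is a direct transcription of that equation into Python; the function name `partial_f_statistic` and the numbers in the example are hypothetical and are not the values from my regressions.

```python
def partial_f_statistic(ss_res_reduced, ss_res_full, g, n, k):
    """F = [(SSres(reduced) - SSres(full)) / g] / [SSres(full) / (n - k - 1)]

    g: number of predictors dropped to form the reduced model
    n: number of observations
    k: number of predictors in the full model
    """
    numerator = (ss_res_reduced - ss_res_full) / g
    denominator = ss_res_full / (n - k - 1)
    return numerator / denominator


# Placeholder values only: dropping 2 of 5 predictors with 32 observations.
f_value = partial_f_statistic(
    ss_res_reduced=110.0, ss_res_full=100.0, g=2, n=32, k=5
)
print(round(f_value, 4))  # 1.3
```

The result is compared against the critical value of the F distribution with g and n-k-1 degrees of freedom; a small F means that dropping the g predictors barely increases the residual sum of squares.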

I attribute my lack of a satisfactory explanation for this discrepancy to

inexperience with the use of model utility testing. When testing certain groups of

variables, the result may show that those variables do not contribute enough to the

regression to keep them in the model. However, testing a variable in a different group may
show that that variable should remain in the regression. I was not sure of a way to use

model utility testing to get a definitive answer about which variables made the best

predictions.

The last method I wanted to use was an actual test of a regression equation on

some data not within the set that was used to create it. I decided to use data from 1998,

and test the regression for O YPP, D YPP and TO Diff, against the regression for O RA,

D RA, TO Diff for a few teams. Table 5 shows the results.

Actual wins   YPP predicted wins   RA predicted wins
12            11                   10
10            12                   12
10            9                    10
9             9                    8
3             4                    3

Table 5: Simple test of overall regression models for 5 teams from 1998

It would be difficult to achieve definitive results testing in this way. In fact what this

method really shows is that both predictions are fairly accurate, but it would take

hundreds of data points to draw a strong conclusion about one regression or the other.
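One way to put a number on "fairly accurate" is the mean absolute error of each model over the five teams in Table 5. This summary measure is an addition for illustration and was not part of the original analysis; the function name `mean_absolute_error` is my own, while the win totals are copied from Table 5.

```python
# Values copied from Table 5 (five teams from the 1998 season).
actual_wins = [12, 10, 10, 9, 3]
ypp_predicted = [11, 12, 9, 9, 4]
ra_predicted = [10, 12, 10, 8, 3]


def mean_absolute_error(actual, predicted):
    """Average size of the prediction misses, in wins."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


ypp_mae = mean_absolute_error(actual_wins, ypp_predicted)
ra_mae = mean_absolute_error(actual_wins, ra_predicted)
print(ypp_mae, ra_mae)  # 1.0 1.0
```

Both models miss by one win on average over these five teams, which is consistent with the point that a sample this small cannot separate them.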

One final point that is important to mention is the use of R² instead of R²(adj). At

the beginning of my work on testing the data, I did not know what R²(adj) was. I did not

know that it took into account the number of variables that were being used when finding

R². However, once I learned this fact, I realized that I only compared R² values between

regressions with the same number of variables, so I did not need to worry about the

difference from R² to R²(adj).
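For reference, the standard adjustment is R²(adj) = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k is the number of predictors. The short Python sketch below applies it to the 2006 three variable R² from Table 4. The function name `adjusted_r_squared` is hypothetical, and n = 32 (one observation per team in 2006) is an assumption made for this illustration.

```python
def adjusted_r_squared(r_squared, n, k):
    """R^2(adj) = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# 2006 three variable result from Table 4: R^2 = 74.5%, k = 3 predictors.
# n = 32 is an assumption (one row per team in the 2006 season).
adj_2006 = adjusted_r_squared(0.745, n=32, k=3)
print(f"{100 * adj_2006:.1f}%")

# With n and k held fixed, the adjustment preserves the ordering of R^2
# values, which is why comparisons within one table are unaffected.
assert adjusted_r_squared(0.70, 32, 3) < adjusted_r_squared(0.75, 32, 3)
```

The adjustment only matters when comparing models with different numbers of predictors, such as a row of Table 1 against a row of Table 4.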


In conclusion, it appears that Bill Polian’s statement about yards per pass being

the most important statistic is actually pretty close to the truth. It is not definitive, and the

R² values show that the statistics are never a great predictor of the variance in wins

around the NFL. However, a simple test has shown that using the regression equation for

all of the data from the last eight seasons can predict fairly well the number of wins a

team will have at the end of the 16 game regular season.


References

Pro Football Reference website, “www.pro-football-reference.com,” last accessed 2-14-07.
