Final Report


Federal Reserve Economic Data Estimation and Projection

ESE 499 Final Report
February 10, 2014 – May 2, 2014

Submitted to:
Professor Dennis Mell
Systems Science and Engineering
Washington University
on April 28, 2014

Submitted by:
Sean Feng
Senior in Systems Engineering and Finance
665 South Skinker Boulevard, St. Louis, MO 63105
(989) 488-3991
sean.feng@wustl.edu

Peter Suyderhoud
Senior in Systems Engineering and Economics
7116 Forsyth Boulevard, St. Louis, MO 63105
(703) 405-5340
petersuyderhoud@gmail.com

Under the direction of:
Mr. Keith Taylor
Data Desk, Research Division
Federal Reserve Bank of St. Louis
One Federal Reserve Bank Plaza
Broadway and Locust Streets
St. Louis, MO 63101
(314) 444-4211
Keith.g.taylor@stls.frb.org

Abstract of Federal Reserve Economic Data Estimation and Projection
Sean Feng and Peter Suyderhoud

The aim of this project is to use systems engineering techniques to accurately estimate and predict economic metrics, including Gross Domestic Product (GDP), as reported by the Federal Reserve Bank of St. Louis (St. Louis Fed). As the St. Louis Fed takes initiatives to expand the number of series it tracks and reports, the Data Desk is facing difficulties in confirming data estimates and generating new projections. Our work will serve as a preliminary foundation for the St. Louis Fed to more effectively validate data for its vintages. Using both archival and real-time data from the Archival Federal Reserve Economic Data (ALFRED) database, we will apply concepts such as random processes and statistical regression to update existing estimations and to project new predictions for the coming period. We will implement our algorithms in Matlab and similar coding platforms in order to provide an initial framework that the St. Louis Fed can adapt for its proprietary software.

Resource Requirements

In order to complete our project, we will need computer time in the lab, access to Matlab and other similar platforms, advising time with ESE and economics professors, and client time.

1. Historical Economic Data from ALFRED
   a. Client has provided us with sufficient data
2. Faculty Advising Time
   a. Randall Hoven (Electrical and Systems Engineering)
   b. Two hours/month
3. Client Meeting Time
   a. Keith Taylor
   b. Five hours/month
4. Computer Access
   a. Urbauer 214 computer lab
   b. Matlab, Stata
   c. Estimated 50 hours of usage


Table of Contents

Abstract
Resource Requirements
Introduction
Problem Statement
Employed Technical Approaches
    Process Details
    Instructions for Using the Code
Conclusion and Next Steps
Appendices


Introduction

Federal Reserve Economic Data (FRED) is an online database maintained by the Research Division of the Federal Reserve Bank of St. Louis containing more than 140,000 economic time series from approximately 60 sources. Examples of these time series include consumer price indexes, employment, population, exchange rates, interest rates, and gross domestic product. These series are collected from agencies such as the U.S. Census Bureau and the Bureau of Labor Statistics and compiled by the Data Desk, the team responsible for all data found in FRED. In addition to tracking the real-time observations of each time series, the Data Desk also keeps vintages of these series for revision purposes. On the initial release date, values in each time series are assigned preliminary values, which are subsequently revised over the following period. Each value is revised twice in total; the third value becomes the final observation.

Problem Statement

Last year, the Data Desk increased the number of series in FRED by 100%, with plans to increase it by the same amount for the year ending December 2014. As the size of the database increases, the Data Desk is having difficulty adequately validating the data before writing it to the database. The current data validation method requires an excess of time and resources, as it involves error checks for each individual data entry. As the St. Louis Fed rapidly scales up FRED, it faces the challenge of validating new data fast enough to keep up with new entries. We aim to create an algorithm that uses existing vintages to create an estimate for the coming period that can be compared to the actual vintage when it is released. To do this, we will need to differentiate pure economic noise from human error and methodology revisions. We intend to create this algorithm specifically for GDP data, since it is the most complete and least noisy data series. This estimation will allow the Data Desk to quickly validate a new vintage entry by comparing its values to the generated estimation. If the values fall within a predetermined confidence interval, then the Data Desk can validate the entire vintage without reviewing each individual data point in the series.

Employed Technical Approaches

Our project relied primarily on statistical approaches highlighted in the systems engineering and economics curricula at Washington University in St. Louis. The primary focus of this project was to develop statistical models that make the best use of the extensive historical data to revise and predict new data. Topics required for the project included, but were not limited to: Random Processes and Kalman Filtering, Probability and Statistics, Operations Research, Optimization, Algorithms and Data Structures, and Econometrics.

Random Processes. The input data itself contains considerable random noise. This noise was divided into three categories: natural revisions, methodology revisions, and human error. In our algorithm, we distinguished between the effects of economic revision noise and methodology revisions, while disregarding human error, since raw data with this error already edited out is available. In addition, we identified a set of acceptable error margins for our output vintage, given the level of input error and our tolerance for error.

Statistical Model Identification. In order to properly regress the data to provide revisions and predictions, we identified an appropriate mathematical model that relates the historical data to the newest vintage. We explored a variety of models, including linear, exponential, logarithmic, and polynomial regression techniques. We also determined limits on the scope of the input in order to remove irrelevant or outdated data from our model.

Algorithms and Data Analysis. After distinguishing the input error and identifying a model, we used the Matlab platform to develop code that takes historical data as input and generates revisions and predictions. This process required us to explore different implementation techniques, since not all methods provided us with equally accurate or plausible results.

Process Details

In order to implement the code in Matlab, we first needed to standardize the data input. We decided to outline seven requirements for the data input:

1. Time must be represented as a number.
2. The first column and row must be headers that contain only time as a number.
3. The data must have a fixed step width: there must be a consistent number of vintages before the next observation is estimated.
4. The revision time must be fixed: the time to reach a steady-state value must be constant across vintages within one data set.
5. The revision time cannot exceed three.
6. Fundamental methodology changes must overlap with a preliminary vintage.
7. Fundamental methodology changes must update all historical values in the same period.

Using this standardized GDP dataset as input, we developed an algorithm (see Appendix A) to compile the steady-state GDP values for each observation date. We used the vector of steady-state GDP values to identify the dynamics of the data set. We used Stata to test different models and time windows in order to find the best model for predicting new preliminary GDP estimates. We quickly decided to narrow the window to include only values after mid-2009 in order to remove the noisy effects of the 2008 financial crisis; when we included historical values before then, the GDP estimates for future periods were consistently too high. Within this time window, we identified two plausible models: linear and log-linear (Appendix D). Both models had strong t-values and high adjusted R-squared values. Although the linear model seemed to fit the historical data better, we recognized that a log-linear model would be better in the long run.

Next, we categorized the error in the input data. We took as given that GDP values are reported quarterly and then revised monthly, so that there is one initial estimate followed by two revisions. By the third month of the quarter, the value tends to stay constant until there is a fundamental methodology change. Since we removed the effects of human error from the input data and treated methodology changes as an additional input, we could isolate the output error and categorize error values for the preliminary estimate and for each successive revision. We then used an updated algorithm (see Appendix B) to identify the standard deviation for each change (e.g., from a preliminary value to a first revision). This allowed us to develop confidence intervals for the revision estimations. The error for preliminary estimates was instead based on a manual input, which in our case was the standard deviation from our regression model.
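As an illustration of the model comparison described above (which was carried out in Stata), a minimal Matlab sketch along the following lines could compare the linear and log-linear candidates. The data and variable names here are placeholders, not values from the project:

    % Placeholder steady-state GDP values over a hypothetical post-mid-2009 window
    t  = (2009.5:0.25:2013.75)';                 % quarterly observation dates as numbers
    ts = t - 2009.5;                             % centered time, to keep the fit well conditioned
    y  = 14000 + 120*ts + 30*randn(size(t));     % placeholder GDP values, not real data

    pLin = polyfit(ts, y, 1);                    % linear model:      GDP = b0 + b1*(t - 2009.5)
    pLog = polyfit(ts, log(y), 1);               % log-linear model:  ln(GDP) = b0 + b1*(t - 2009.5)

    % Compare goodness of fit on the original GDP scale
    rsq = @(obs, fit) 1 - sum((obs - fit).^2) / sum((obs - mean(obs)).^2);
    fprintf('Linear R^2:     %.4f\n', rsq(y, polyval(pLin, ts)));
    fprintf('Log-linear R^2: %.4f\n', rsq(y, exp(polyval(pLog, ts))));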

With an identified model and categorized errors, we were able to create a new vintage estimation that could be used to validate new real vintages (see Appendix C). In our estimated vintage, we categorized three types of values:

1. Steady-state values: after steady-state is reached for an observation, we assumed it would remain constant until the next methodology change.
2. Revision values (to be checked): revisions after the preliminary value that can still change. The number of revision values that must be checked depends on the step width and the number of revisions until steady-state.
3. Preliminary estimate: the new observation added to a vintage, based on the regression model.

For our vintage estimation, we generated steady-state values and revision values by setting them equal to the most recent vintage. We generated preliminary estimates using our regression models. We then validated revision values and preliminary estimates by using the estimated vintage as the mean of a confidence interval and applying the errors identified earlier to set specific upper and lower limits. Since we assumed that steady-state values would stay constant, we also output the sum of the error of the steady-state values, which should be zero if there was no human error or methodology change. Finally, we backtested our algorithm with historical GDP data to see how valid the revisions and predictions were. After passing the backtest, we decided to extend our tests to less convenient and less normalized datasets. We tested total nonfarm employees (PAYEMS), which has a step width of 1 and a revision number of 3. Our code was completely functional for this extension, but it would need a better prediction model for proper validation.
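Stated roughly, a checked value v_real in the new vintage is accepted when it satisfies an interval test of the form

    |v_real - v_est| <= (tolerance in standard deviations) * sigma

where v_est is the corresponding value in the estimated vintage and sigma is the error identified for that revision stage (for the preliminary estimate, the standard deviation from the regression model). This restatement is ours, offered only to make the validation rule explicit.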

Instructions for Using the Code

Part A: Dataset Formatting and Model Identification

1. Replace column and row headers with dates in the form of a number.
   January 2014 should be represented as:
   December 2014 should be represented as:
2. Remove all columns that cause the step width to exceed the appropriate constant amount.
3. Input the appropriate dataset and manual inputs into ValuesForRegression.m:
   revisions: the number of revision periods before steady-state is reached, including a preliminary value
   stepWidth: the number of vintages before a new preliminary value is added
   latestShift: the column number of the most recent methodology change
   firstFinal: the index of the first steady-state value in the dataset
4. Regress the output vector valuesForRegression against time to identify a model and its corresponding parameters in order to predict new preliminary values.

Part B: Vintage Validation

1. Input the appropriate dataset and manual inputs into ErrorID.m in order to output the errors for each revision:
   revisions: the number of revision periods before steady-state is reached, including a preliminary value
   stepWidth: the number of vintages before a new preliminary value is added
   lastVintage: input 0 if the most recent vintage ends on a preliminary value, 1 if it ends on a first revision, or 2 if it ends on a second revision
   firstPrelim: the index of the first preliminary estimate in the dataset
2. Input the appropriate dataset and manual inputs into vintageEstimation.m:
   revisions: the number of revision periods before steady-state is reached, including a preliminary value
   stepWidth: the number of vintages before a new preliminary value is added
   lastVintage: input 0 if the most recent vintage ends on a preliminary value, 1 if it ends on a first revision, or 2 if it ends on a second revision
   time: numeric representation of the time of the real vintage that must be validated
   errorTolerance: number of standard deviations included in the confidence interval
   revisionError(revisions): error tolerance (in terms of standard deviations) for preliminary predictions
3. Input the appropriate regression parameters as identified in Part A. Change the following if-statement when necessary:
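(The original if-statement is not reproduced in this text version of the report. It presumably selects which regression model from Part A generates the new preliminary estimate; a minimal sketch of such a statement, with placeholder names b0, b1, time, and useLogLinear, might look like the following.)

    % Hypothetical stand-in for the if-statement referenced above
    if useLogLinear
        newPrelim = exp(b0 + b1 * time);   % log-linear prediction model
    else
        newPrelim = b0 + b1 * time;        % linear prediction model
    end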

4. Type the following command into the Matlab command window:

   [validation, cumulativeError, errorFinder] = vintageEstimation(revisionError, realVintage)

   validation: returns 1 if all values that were meant to be changed fall within the predefined confidence interval
   cumulativeError: sums the error for all values that were not meant to be changed
   errorFinder: identifies the row index of all values that were not meant to be changed
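For example, assuming the manual inputs in vintageEstimation.m have been set for the GDP dataset (three revision periods), a call might look like the sketch below; the tolerance values and file name are placeholders, not values from the project:

    % Error tolerances (in standard deviations) for each revision stage,
    % with the last entry applying to the preliminary prediction
    revisionError = [2 2 2];

    % The new real vintage to validate, loaded from a hypothetical file
    realVintage = readmatrix('gdp_vintage_2014_04.csv');

    [validation, cumulativeError, errorFinder] = vintageEstimation(revisionError, realVintage);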

Conclusion and Next Steps

Using GDP data as a sample set, we identified an appropriate regression model and the revision errors between vintages in order to specify confidence intervals for new vintage validation. We implemented code that takes standardized inputs and allows for quick changes to the error tolerance and other validation parameters. More importantly, we have made our process repeatable in order to set a foundation from which the Data Desk team at the St. Louis Fed can create a more efficient validation platform that extends across all datasets. Our code does have limitations due to the scope of the course. We set a limit of three on the revision number in order to simplify the logic of the code. If the Data Desk encounters more complicated data sets with longer steady-state times, the ErrorID.m and vintageEstimation.m functions could be extended to cover such cases. For even more complicated datasets, where a constant steady-state time may not exist, other engineering methods should be explored, such as Kalman filtering or non-linear regression.

Appendices

Appendix A: ValuesForRegression
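(The original appendix contained the Matlab source for ValuesForRegression.m, which is not reproduced in this text version. The sketch below is an illustrative reconstruction based on the description in Process Details, not the original code; the assumed data layout, with vintage dates in row 1, observation dates in column 1, and NaN entries before an observation first appears, and all internal details are assumptions.)

    function [valuesForRegression, obsTimes] = ValuesForRegression(data, revisions, stepWidth, latestShift, firstFinal)
    % Illustrative reconstruction (not the original appendix code).
    % Collects the steady-state value of every observation that has completed
    % its revision cycle, restricted to observations first released after the
    % most recent methodology change.

    obsTimes = data(2:end, 1);        % observation dates
    values   = data(2:end, 2:end);    % rows = observations, columns = vintages
    nVint    = size(values, 2);

    valuesForRegression = [];
    keptTimes = [];

    for r = firstFinal:size(values, 1)
        firstCol = find(~isnan(values(r, :)), 1);   % vintage of the preliminary value
        if isempty(firstCol) || firstCol < latestShift
            continue;                               % skip rows from before the methodology change
        end
        if nVint - firstCol + 1 >= revisions        % revision cycle complete?
            valuesForRegression(end+1, 1) = values(r, end);   % steady-state value
            keptTimes(end+1, 1) = obsTimes(r);
        end
    end

    obsTimes = keptTimes;
    % stepWidth is kept in the signature to match the report's list of manual
    % inputs; this simplified sketch infers the same timing from the NaN padding.
    end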


Appendix B: ErrorID
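(The original Matlab source for ErrorID.m is likewise not reproduced here. The following is an illustrative reconstruction of the error-categorization step it is described as performing: for each revision stage, compute the standard deviation of the change between consecutive vintage values. The data layout assumption is the same as in Appendix A, and the details are ours, not the original code.)

    function revisionStd = ErrorID(data, revisions, stepWidth, lastVintage, firstPrelim)
    % Illustrative reconstruction (not the original appendix code).
    % Estimates the standard deviation of the change introduced at each revision
    % stage (preliminary -> first revision, first revision -> second revision, ...).

    values  = data(2:end, 2:end);       % rows = observations, columns = vintages
    nVint   = size(values, 2);
    changes = cell(revisions - 1, 1);   % one list of observed changes per revision stage

    for r = firstPrelim:size(values, 1)
        c0 = find(~isnan(values(r, :)), 1);     % vintage of the preliminary value
        if isempty(c0)
            continue;
        end
        for k = 1:revisions - 1
            if c0 + k <= nVint                  % has the k-th revision been released?
                changes{k}(end+1, 1) = values(r, c0 + k) - values(r, c0 + k - 1);
            end
        end
    end

    revisionStd = zeros(revisions - 1, 1);
    for k = 1:revisions - 1
        revisionStd(k) = std(changes{k});       % per-stage revision error
    end
    % stepWidth and lastVintage are kept in the signature to match the report's
    % description of the manual inputs; this simplified sketch does not use them.
    end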


Appendix C: vintageEstimation
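(As with the other appendices, the original vintageEstimation.m code is not reproduced in this version. The sketch below reconstructs the behavior described in the report: carry steady-state and revision values forward from the most recent vintage, add a regression-based preliminary estimate, and validate the real vintage against confidence intervals. The file name, the hard-coded manual inputs, the call to the Appendix B sketch, and the simplifying assumption of a step width of one vintage per observation are all ours.)

    function [validation, cumulativeError, errorFinder] = vintageEstimation(revisionError, realVintage)
    % Illustrative reconstruction (not the original appendix code).
    % Builds an estimated vintage and validates realVintage against it.
    % Assumes realVintage is a standardized matrix that includes the new
    % vintage column and the new observation row.

    % Hypothetical manual inputs, edited in the file as described in the report
    data           = readmatrix('gdp_vintages.csv');   % standardized vintage matrix (hypothetical file)
    revisions      = 3;         % revision periods before steady state, incl. the preliminary value
    errorTolerance = 2;         % confidence interval width in standard deviations
    time           = 2014.25;   % numeric date of the new observation (placeholder)
    b0 = 9000; b1 = 150;        % placeholder regression parameters from Part A
    useLogLinear   = false;     % the if-statement from Part B, step 3

    values     = data(2:end, 2:end);
    lastStored = values(:, end);                  % most recent stored vintage

    % Build the estimated vintage: carry old values forward, append a prediction
    estimate = lastStored;
    if useLogLinear
        newPrelim = exp(b0 + b1 * time);          % log-linear prediction
    else
        newPrelim = b0 + b1 * time;               % linear prediction
    end
    estimate(end + 1, 1) = newPrelim;

    % Confidence interval half-widths; this sketch assumes a step width of one,
    % so the last (revisions - 1) stored observations are still being revised
    revisionStd = ErrorID(data, revisions, 1, 0, 1);      % hypothetical manual inputs
    halfWidth   = zeros(size(estimate));
    halfWidth(end) = revisionError(revisions);            % tolerance for the preliminary estimate
    halfWidth(end - (revisions - 1):end - 1) = errorTolerance * revisionStd(end:-1:1);

    % Validate the real vintage against the estimate
    newValues  = realVintage(2:end, end);                 % newest column of the real vintage
    diffs      = newValues - estimate;
    checked    = halfWidth > 0;                           % revision values and the preliminary estimate
    validation = all(abs(diffs(checked)) <= halfWidth(checked));
    cumulativeError = sum(diffs(~checked));               % should be near zero for steady-state rows
    errorFinder     = find(~checked & abs(diffs) > 0);    % steady-state rows that nonetheless changed
    end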


Appendix D: GDP Regressions versus Time

Linear

Log-Linear
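(The regression output for each model is not reproduced in this text version. The two candidate models take the general forms below, where t is the numeric observation date and b0, b1 are the fitted parameters:

    Linear:      GDP_t = b0 + b1 * t
    Log-linear:  ln(GDP_t) = b0 + b1 * t)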
