
Econometric tasks

• Estimation
• Testing hypotheses
• Forecasting / predicting / simulation

Econometrics models

1_Deterministic part of the model

• Relation between the explanatory variables and the dependent variable (linear or nonlinear)
• What variables should be included (each variable has a theory behind it and a relationship with
the model)
• What are the implied relationships among the variables (what affects what)
• Which variables are exogenous and which are endogenous

2_Stochastic part of the model


• How should the error term enter? (+ u_i or * u_i?)
• Additive?
• Multiplicative?
• What probability distribution might it plausibly follow? (We expect it to be normally
distributed; what happens if it is not normally distributed?)

Data
Experimental vs observational data
Randomized controlled trials: data are collected in a way that lets us establish causation. That x
causes y must be established by research; we design an experiment, otherwise what we observe is just
a correlation. Eg small kids are given sugar candies and most parents believe this makes them
hyperactive, but research was conducted to check whether the candy really is the cause.

We have a control group and a treatment group, formed by randomly dividing the sample. Divide the
sample into 2 groups, 50 in control and 50 in treatment, with no selection bias. Then give the
treatment to the treatment group only, to test whether it is because of the candy alone that the kids
are becoming hyperactive (or, as another example, to check the efficacy of a COVID vaccine).
Afterwards, check what the difference between the control group and the treatment group is. Because
assignment was random, any difference can be attributed to the treatment. This type of data is
experimental data.
National Sample Surveys and NFHS data are observational data. The drawback of this type is that it
is hard to establish causation from it.
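
The randomized design described above can be sketched in R with simulated (made-up) data: 100 units are randomly split into control and treatment groups of 50 each, and the difference in group means estimates the treatment effect. The variable names, the outcome scale and the assumed true effect of 2 are illustrative assumptions, not results from any study.

```r
set.seed(42)                       # make the random draws reproducible

n <- 100                           # total sample size
# Random assignment: shuffle 50 "control" and 50 "treatment" labels
group <- sample(rep(c("control", "treatment"), each = n / 2))

true_effect <- 2                   # assumed effect, used only to simulate data
# Simulated outcome: baseline noise plus the effect for treated units
activity <- rnorm(n, mean = 10, sd = 3) + true_effect * (group == "treatment")

# Difference in group means estimates the average treatment effect
means <- tapply(activity, group, mean)
estimated_effect <- means[["treatment"]] - means[["control"]]

table(group)                       # 50 units in each group
estimated_effect                   # close to true_effect, up to sampling noise
```

With observational data there is no such random assignment, so the same difference in means could also reflect selection into the groups.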

Formation of data. 3 types of data-

Cross section data


• Many units observed at one point in time
• Examples include individual census or survey respondents, states or countries, reed
students, colleges etc.

Time series data

• Same unit observed at many points in time (usually equally spaced)
• Examples include national macroeconomic variables

Panel (longitudinal data)


• Multiple units observed at multiple points in time (precondition: the units, e.g. households,
should remain the same in every period)
• Eg: state level and national level time series data (for many states or countries)

Pooled cross section data- taking a sample at one point in time, and taking another, separately
drawn sample at a different point in time. It is different from panel data because the same units
are not followed over time.

Estimation and Testing

Data science- finding the causal link or causal nature

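What is an estimator: OLS (minimizes the sum of squared distances between the observed and fitted
values), MLE (chooses the parameter values that maximize the likelihood of the observed data).

Logit models- for yes or no type of data, where ordinary regression analysis does not fit.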

• Determining the proper estimator for the data and model
• May need to test properties of data to determine this
• Estimators are mathematical functions or algorithms that are used to calculate the values of
the parameters of a model from a sample of data. They can be point estimators like least
squares or MLE, or distribution estimators like Bayesian methods, and they provide an
estimate of the unknown population parameters based on the sample data.
• Performing estimation and testing hypotheses of interest
After estimating, we move on to testing whether the model is right or not
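
As a minimal sketch of estimation followed by testing, the R code below fits OLS on simulated data and checks the result against the closed-form least squares formula. The data-generating values (intercept 1, slope 2) and all variable names are assumptions made for illustration only.

```r
set.seed(1)

# Simulated data (illustrative): y = 1 + 2*x + noise
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# OLS via R's built-in linear model function
fit <- lm(y ~ x)
beta_lm <- coef(fit)

# The same estimates from the closed-form OLS formula (X'X)^{-1} X'y
X <- cbind(1, x)                         # design matrix with an intercept column
beta_manual <- solve(t(X) %*% X, t(X) %*% y)

beta_lm                                  # intercept and slope estimates
summary(fit)$coefficients                # estimates with standard errors and t-tests
```

The coefficient table from summary() is where hypothesis tests on individual parameters are usually read off.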

There could be some problems like

omitted variable bias (there was some variable which was very important in determining the effect,
but we don't have the variable or we haven't included it; find a solution or get a proxy for that
variable, like monthly per capita income, since no one would actually disclose their own income; this
does not work for high income groups though),

multicollinearity (usually not a big problem, unless it is perfect multicollinearity, where one
variable perfectly explains another),

autocorrelation (mostly a problem of time series data: the error term in the previous period affects
the present error term, which can also affect the future error term, making the usual standard
errors unreliable),

heteroscedasticity (the variance of the error term is not constant; we need constant variance, so
homoscedasticity is favored; we ask why the variance is increasing and what the remedy is), and
functional form, which we refine.

Diagnostic evaluation of model


• Examination of diagnostic statistics
• Residuals as clues to missing effects
• Revise model, data set, and/or estimation method and repeat

Assessment of validity

Difference between causation and correlation

Mostly Harmless Econometrics- by Joshua Angrist and Jörn-Steffen Pischke (causation and correlation explained)

Causation- an additional year of education causes wages to increase by a given amount, all else
equal

Correlation- an additional year of education is associated with higher wages

In economic studies based on observational data, causation often cannot be determined; what we observe is only correlation


R Programming
Variables
A variable is a name for a memory location containing a value. A variable in R can store numeric
values, complex values, words, matrices or even tables.

In R, we do not need to declare a variable before we use it, unlike other programming languages like
Java, C, C++, etc.

A variable can hold any kind of data type.

In R we can use the following for assignment of values to variables: '=', '<-', '->'

R programming data types

In R, the data type is not specified for a variable in advance; rather, the variable gets the data
type of the R object assigned to it. R is called a dynamically typed language, which means we can
change the data type of the same variable again and again when using it in a program.

Eg: x = 15, x = TRUE, x = "Hi"

The data type tells us which types of value a variable has and what types of mathematical,
relational or logical operations can be applied to it without causing an error.
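
A short sketch of the three assignment operators and of dynamic typing, using class() to show the type after each assignment. The variable names a, b and c_val are placeholders chosen for this example.

```r
# Three equivalent ways to assign a value
a = 10
b <- 20
30 -> c_val          # rightward assignment

# The same variable can hold different types over time
x = 15
class(x)             # "numeric"
x = TRUE
class(x)             # "logical"
x = "Hi"
class(x)             # "character"
```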

Data types-

Logical- TRUE, FALSE (Boolean values)

Integer- Whole numbers (written with an L suffix, e.g. 5L)

Numeric- Decimal (floating-point) values

Complex- 3 + 2i (numbers with an imaginary part)

Character/String- text in quotes, e.g. "Hi"

Vectors

They are the most basic R data objects and there are six types of atomic vectors. The six atomic
vector types are-

Logical

Numeric

Integer

Complex

Character

Raw

In general, a vector is defined and initialized in the following manner

Vtr = c(2, 5, 11, 24)

Put L after every value to store it as an integer, e.g. c(2L, 5L, 11L, 24L).

In the case of c(1L, 1.2, 1.3), the vector gets stored as numeric and 1L is converted to 1.0.
If a vector mixes numbers and strings, all of its elements become characters.
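
The coercion rules above can be checked directly with class(). This is a small sketch; the vector names are hypothetical and chosen for this example.

```r
vec_num <- c(2, 5, 11, 24)        # numeric (double) by default
vec_int <- c(2L, 5L, 11L, 24L)    # L suffix stores integers
vec_mix <- c(1L, 1.2, 1.3)        # integer + double -> all numeric
vec_chr <- c(1, "a", TRUE)        # any string -> all character

class(vec_num)   # "numeric"
class(vec_int)   # "integer"
class(vec_mix)   # "numeric"
class(vec_chr)   # "character"
vec_chr          # "1" "a" "TRUE"
```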

If we don't want all elements to become characters, we use "lists", which store the data as it is
without changing its type.

Lists are quite similar to vectors, but lists are R objects which can contain elements of different
types, like numbers, strings, vectors and even another list.

Vtr <- c("Hi", "Bye", "What up")

mylist <- list(Vtr, 22.5, 14965, TRUE)
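
As a brief illustration that a list keeps each element's type, the sketch below rebuilds the same list and inspects its items with double square brackets. The lower-case name vtr is used here only for the example.

```r
vtr <- c("Hi", "Bye", "What up")
mylist <- list(vtr, 22.5, 14965, TRUE)

class(mylist[[1]])   # "character": the vector is stored unchanged
class(mylist[[2]])   # "numeric"
class(mylist[[4]])   # "logical"

mylist[[1]][2]       # second element of the first item: "Bye"
length(mylist)       # 4
```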

Matrix

A matrix is the R object in which the elements are arranged in a 2 dimensional rectangular layout.

Basic syntax for creating a matrix in R- matrix(data, nrow, ncol, byrow, dimnames)

Where data is the input vector which becomes the data elements of the matrix.

nrow is the number of rows to be created

ncol is the number of columns to be created

byrow is a logical value. If TRUE, then the input vector elements are arranged by row.

dimnames is the names assigned to the rows and columns.

Mymatrix <- matrix(c(1:25), nrow = 5, ncol = 5, byrow = TRUE)

The default of byrow is FALSE.

If the number of rows times the number of columns is more than the number of data elements, the data
elements start repeating.
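
A small sketch of how byrow changes the fill order, how short input is repeated, and how dimnames is supplied as a list. All object names are placeholders for this example.

```r
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_col <- matrix(1:6, nrow = 2, ncol = 3)   # byrow = FALSE is the default

by_row[1, ]   # 1 2 3
by_col[1, ]   # 1 3 5

# Fewer data elements than cells: the values are repeated (recycled)
recycled <- matrix(1:4, nrow = 2, ncol = 4)
recycled[1, ]   # 1 3 1 3

# dimnames takes a list: row names first, then column names
named <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
named["r2", "c1"]   # 2
```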

Array
Arrays in R are data objects which can be used to store data in more than two dimensions. array()
takes vectors as input and uses the values in the 'dim' parameter to create an array.

The basic syntax for creating an array in R is-

array(data, dim, dimnames)

where

data is the input vector which becomes the data elements of the array

dim is the dimension of the array, where you pass the number of rows, the number of columns and the
number of matrices to be created.

dimnames are the names assigned to the rows and columns.

Myarray <- array(c(1:16), dim = c(4,4,2))
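
To show how the three indices of an array work, here is a minimal sketch with 24 values so that every cell gets a distinct value. The name arr and the chosen dimensions are assumptions for this example.

```r
# 24 values arranged as 2 matrices, each with 4 rows and 3 columns
arr <- array(1:24, dim = c(4, 3, 2))

dim(arr)        # 4 3 2
arr[2, 3, 1]    # row 2, column 3 of the first matrix: 10
arr[, , 2]      # the whole second matrix (values 13 to 24)
```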

Data Frame

A data frame is a table or a two dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values for each column. In a data frame:

Column names shouldn't be empty

Each column should have the same number of data items

Data stored in a data frame can be of numeric, factor or character type.

The row names should be unique.

Emp_id = c(100:104)

Emp_name = c("timmy", "benny", "jenny", "lenny", "bonny")

Dept = c("sales", "finance", "marketing", "HR", "RD")

Emp_data = data.frame(Emp_id, Emp_name, Dept)

Emp_data

Data Operators

Arithmetic Operators- performing basic arithmetic

a+b

a-b

a*b

a/b

a%%b

> x=22
> y=7
> x+y    #addition
[1] 29
> x-y    #subtraction
[1] 15
> x%%y   #modulus
[1] 1
> x*y    #multiplication
[1] 154
> x/y    #division
[1] 3.142857
> x^y    #exponent
[1] 2494357888
> x%/%y  #floor division
[1] 3

Relational Operators

These operators help us perform relational operations, like checking if a variable is
greater than, less than or equal to another variable. The output of a relational operator is always
a logical value.

==, !=, >, <, >=, <=

They are used for conditions, as in 'if (a==b)'

num3 = ( num1 == num2 )

Printing num3, we get FALSE or TRUE

Logical Operators

These operators combine or negate Boolean (logical) values: and, or, not

a & b combines the vectors element by element and gives TRUE where both values are TRUE

a | b combines the vectors element by element and gives TRUE where at least one of the values is TRUE

!a takes each element of the vector and gives the opposite logical value

a && b, a || b work as double-and / double-or on single values. Older versions of R compared only
the first element of each vector; since R 4.3.0, operands with more than one element give an error.
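
A small sketch of the element-wise operators on vectors and of && inside an if condition. The values and names are chosen only for illustration.

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

a & b    # TRUE FALSE FALSE  (element-wise AND)
a | b    # TRUE TRUE FALSE   (element-wise OR)
!a       # FALSE FALSE TRUE  (element-wise NOT)

# && and || are meant for single values, as in if() conditions
p <- 7
q <- 3
if (p > 5 && q < 5) {
  result <- "both conditions hold"
} else {
  result <- "at least one condition fails"
}
result   # "both conditions hold"
```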

IF statements, Else If statements, Else statements

a=7

b=7

if (a>b) {
  print ("a greater than b")
} else if (a<b) {
  print ("a less than b")
} else {
  print ("both numbers are equal")
}

Loops

Repeat loop: it repeats a statement or a group of statements until a break statement is reached.
The repeat loop is the best example of an exit-controlled loop, where the code is first executed and
then the condition is checked to determine whether control should stay inside the loop or exit from it.

While loop: it helps to repeat a statement or a group of statements while a given condition is true.
The while loop is slightly different from the repeat loop. It is an example of an entry-controlled
loop, where the condition is checked first and control is passed inside the loop to execute the code
only if the condition is found to be true.
For loop: repeats a statement or a group of statements a fixed number of times. Unlike the repeat
and while loops, the for loop is used in situations where we know beforehand the number of times the
code needs to be executed. It is like the while loop in that the condition is checked first and only
then does the code written inside get executed.

x=2

repeat {
  x=x^2
  if (x>100) {
    print(x)
    break
  }
}
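
For comparison with the repeat loop, here is a minimal sketch of the while and for loops described above. The counters and the sums computed are illustrative choices.

```r
# while loop: the condition is checked before each pass
i <- 1
total <- 0
while (i <= 5) {
  total <- total + i
  i <- i + 1
}
total          # 15

# for loop: the number of iterations is known in advance
squares <- numeric(5)
for (k in 1:5) {
  squares[k] <- k^2
}
squares        # 1 4 9 16 25
```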

Fibonacci Sequence
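
One possible way to generate the sequence with a for loop is sketched below. The number of terms and the starting values 0 and 1 are assumptions made for this example.

```r
n_terms <- 10                      # assumed number of terms
fib <- numeric(n_terms)
fib[1:2] <- c(0, 1)                # starting values

# Each new term is the sum of the two previous terms
for (i in 3:n_terms) {
  fib[i] <- fib[i - 1] + fib[i - 2]
}
fib   # 0 1 1 2 3 5 8 13 21 34
```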

Mean

Median
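
A brief sketch of the built-in functions for the mean and the median, on a made-up vector. Note that base R's mode() reports how an object is stored, not the most frequent value.

```r
v <- c(2, 4, 4, 7, 9)   # illustrative values

mean(v)      # 5.2
median(v)    # 4
mode(v)      # "numeric": the storage mode, not the statistical mode
```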
Mode

Functions

A <- c (1,2,3,4,5,6,7,8)

y <- table(A)
y

names(y)[which(y==max(y))]

here; table, which, max, names are all functions

(A) is an argument

Function type: Inbuilt function

myfunction (data)

return myfunction

Function syntax

Function_name <- function(arg1, arg2, ...) {

  #code fragments

  return(value) #value to return

}
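
As a sketch of a user-defined function following this syntax, the code below wraps the table(), which() and max() idea from the earlier example into a reusable function. The name get_mode is hypothetical.

```r
# get_mode is a hypothetical name chosen for this example
get_mode <- function(values) {
  counts <- table(values)                         # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]   # value(s) with the highest count
  return(modes)                                   # returned as character strings
}

get_mode(c(1, 2, 2, 3, 3, 3))   # "3"
get_mode(c("a", "b", "a"))      # "a"
```

If several values tie for the highest count, all of them are returned.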
Importing and Exporting data

R works most easily with datasets stored as text files. Typically, values in text files are
separated, or delimited, by tabs or spaces, or by commas (CSV files).

Reading in text data

• R provides several related functions to read data stored as files
• Use read.csv() to read in data stored as CSV and read.delim() to read in text data delimited
by other characters (such as tabs or spaces)
• For read.delim(), specify the delimiter in the sep= argument.
• Both read.csv() and read.delim() assume the first row of the text file is a row of variable
names. If this is not true, use the argument header=FALSE.
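
A minimal round-trip sketch: a small made-up data frame is written to temporary files and read back with read.csv() and read.delim(). The data, the object names and the use of tempfile() are assumptions for illustration.

```r
# Hypothetical example data
people <- data.frame(name = c("A", "B", "C"),
                     height = c(150, 160, 170))

csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")

# Export: comma-separated and tab-separated versions
write.csv(people, csv_file, row.names = FALSE)
write.table(people, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Import: the first row is read as variable names by default
from_csv <- read.csv(csv_file)
from_tab <- read.delim(tab_file, sep = "\t")

from_csv$height      # 150 160 170
names(from_tab)      # "name" "height"
```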
Third row, second column, e.g. mydata[3, 2].

Values 1 to 2 of column height, e.g. mydata$height[1:2].

All values of column diabetic, e.g. mydata$diabetic.

To go from a data frame to a vector, use $

And to go from a vector to a subset, we use [ ]

Variables, or columns of a data-frame can be selected with the $ operator, and the resulting object is
a vector. We can further subset elements of the selected column vector using [ ].
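
The subsetting rules above can be tried on a small made-up data frame. The column names height and diabetic follow the text; the values and the id column are invented for this sketch.

```r
# Hypothetical data frame
mydata <- data.frame(id = 1:4,
                     height = c(150, 162, 171, 158),
                     diabetic = c("no", "yes", "no", "no"))

mydata[3, 2]          # third row, second column: 171
mydata$height[1:2]    # values 1 to 2 of column height: 150 162
mydata$diabetic       # all values of column diabetic

is.vector(mydata$height)   # TRUE: $ returns a vector
```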

Subsetting a numeric column creates a numeric vector.


Examining the structure of an object
Use dim() on two dimensional objects to get the number of rows and columns.

Use str() to see the structure of the object, including its class and the data types of elements. We
also see the first few values of each variable.
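
A short sketch of dim() and str() on a made-up data frame; the data and names are assumptions for illustration.

```r
mydata <- data.frame(id = 1:4,
                     height = c(150, 162, 171, 158),
                     diabetic = c("no", "yes", "no", "no"))

dim(mydata)     # 4 3  (rows, columns)
nrow(mydata)    # 4
ncol(mydata)    # 3
str(mydata)     # class, column types and the first few values of each column
class(mydata)   # "data.frame"
```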
Adding new variables to the data frame

To add a column variable called logHeight to mydata

mydata$logHeight = log(mydata$height)

To add a new column filled with a constant

mydata$z = rep(0,5) #function that will repeat zero, five times

You cannot create a variable longer than the data frame; if mydata has only 3 rows, the line above
gives an error

Correct: mydata$z = rep(0,3)
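
The sketch below adds columns to a made-up data frame with 3 rows and uses tryCatch() to show that a vector of the wrong length is rejected. The data and the column name w are assumptions for this example.

```r
mydata <- data.frame(height = c(150, 162, 171))   # 3 rows

mydata$logHeight <- log(mydata$height)   # new column computed from an existing one
mydata$z <- rep(0, 3)                    # constant column: length matches nrow

# A vector longer than the data frame is rejected with an error
too_long <- tryCatch({
  mydata$w <- rep(0, 5)
  "no error"
}, error = function(e) "error")

names(mydata)   # "height" "logHeight" "z"
too_long        # "error"
```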

OTHER USEFUL FUNCTIONS

log() -logarithm

min_rank() -rank values (from the dplyr package)

cut() -cut a continuous variable into intervals, with the new value signifying into which
interval the original value falls

scale() -standardizes the variable

lag(), lead() -lag and lead a variable (from the dplyr package)

cumsum() -cumulative sum

rowMeans(), rowSums() -means and sums of several columns

To select all except just two columns (dplyr): select(dog_data, -c(id, sex))
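
A few of the base R functions from this list are sketched below on made-up values. The dplyr functions (min_rank(), lag(), lead(), select()) are left out so that the example runs without extra packages.

```r
v <- c(1, 10, 100, 1000)

log(v)                # natural logarithm
log10(v)              # base-10 logarithm: 0 1 2 3
cumsum(v)             # 1 11 111 1111

# cut(): assign each value to an interval; as.integer() gives the interval number
bins <- cut(v, breaks = c(0, 50, 500, 5000))
as.integer(bins)      # 1 1 2 3

# scale(): subtract the mean and divide by the standard deviation
z <- as.vector(scale(v))

# rowMeans() / rowSums(): summaries across several columns
scores <- data.frame(test1 = c(80, 60), test2 = c(90, 70))
rowMeans(scores)      # 85 65
rowSums(scores)       # 170 130
```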

Appending
Appending is the stacking of two data frames on top of each other: the rows of one data frame get
added to the other. Make sure to check that the columns are the same using names().
Merging
Merging is the adding of columns from one data frame to another, by matching rows on a common
variable.

The 'dplyr' package has several types of merges that can be done using different functions like
inner_join(), left_join(), right_join() and such.
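
A minimal sketch of appending and merging in base R, using rbind() and merge() instead of the dplyr joins so that no package is required. The data frames and the key column id are made up for illustration.

```r
# Hypothetical data frames
first  <- data.frame(id = 1:2, score = c(80, 90))
second <- data.frame(id = 3:4, score = c(70, 60))

# Appending: check that the columns match, then stack the rows
identical(names(first), names(second))   # TRUE
stacked <- rbind(first, second)
nrow(stacked)                            # 4

# Merging: add columns by matching rows on the key "id"
info <- data.frame(id = c(1L, 2L, 5L), dept = c("sales", "HR", "RD"))
inner <- merge(stacked, info, by = "id")                 # keeps ids 1 and 2
left  <- merge(stacked, info, by = "id", all.x = TRUE)   # keeps all 4 ids

nrow(inner)        # 2
nrow(left)         # 4
left$dept          # "sales" "HR" NA NA
```

Here merge() with its defaults behaves like an inner join, and all.x = TRUE gives the equivalent of a left join.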
