Ecotrix With R and Python

Econometric tasks
• Estimation
• Testing hypotheses
• Forecasting / predicting / simulation
Econometric models
1_Deterministic part of the model
• Relation between the explanatory variables and the dependent variable (linear or nonlinear)
• What variables should be included (each variable needs a theory behind its relationship with the
model)
• What are the causal relationships among the variables (what affects what)
• Which variables are exogenous and which are endogenous
2_Stochastic part of the model
• How should the error term enter? (+ui or *ui?)
  • Additive?
  • Multiplicative?
• What probability distribution might it plausibly follow? (We usually assume it is normally
distributed; what happens if it is not?)
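The difference between an additive and a multiplicative error term can be sketched with simulated data. The following is only an illustration under assumed values: the coefficients, sample size and error spread are made up, and the multiplicative model is assumed to have the form y = a * x^b * exp(u), which becomes linear after taking logs.

```r
# Sketch: additive vs multiplicative error terms (all numbers are made up).
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
u <- rnorm(n, mean = 0, sd = 0.2)

# Additive error: y = b0 + b1 * x + u
y_add <- 2 + 0.5 * x + u
fit_add <- lm(y_add ~ x)

# Multiplicative error: y = a * x^b * exp(u)
# Taking logs gives log(y) = log(a) + b * log(x) + u, which is linear again
y_mult <- 2 * x^0.5 * exp(u)
fit_mult <- lm(log(y_mult) ~ log(x))

coef(fit_add)    # slope estimate should be near the assumed 0.5
coef(fit_mult)   # slope on log(x) should be near the assumed 0.5
```

In the multiplicative case the regression is run on the logged variables, so the slope is read as an elasticity rather than a unit change.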
Data
Experimental vs observational data
Randomized controlled trials: data are collected in such a way that causation can be established.
Whether x causes y must be established by research, so we design an experiment; otherwise we only
have a correlation. Eg small kids are given sugar candies and most parents believe it makes them
hyperactive, but research was conducted to check whether the candy is really the cause.
We have a control group and a treatment group, formed by randomly dividing the sample.
Divide the sample into 2 groups, 50 in control and 50 in treatment, with no selection bias. Then give
the treatment to the treatment group only, to test whether it is because of the candy alone that the
kids become hyperactive (or, in another setting, to check the efficacy of a covid vaccine). Check
later what the difference between the control group and the treatment group is. Because the groups
were assigned at random, any systematic difference can be attributed to the treatment. This type of
data is experimental data.
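The random split into control and treatment described above can be sketched as follows. The data are simulated and the "activity" score is a hypothetical outcome; in this made-up example the candy has no true effect, so the estimated difference should be small.

```r
# Sketch of a randomized experiment with simulated (made-up) data.
set.seed(42)
n <- 100

# Randomly assign 50 children to control and 50 to treatment
group <- sample(rep(c("control", "treatment"), each = n / 2))

# Hypothetical activity score; by construction the candy has NO true effect
activity <- rnorm(n, mean = 50, sd = 10)

# The difference in group means estimates the treatment effect
means <- tapply(activity, group, mean)
effect <- means[["treatment"]] - means[["control"]]
effect

# Two-sample t-test of the null hypothesis of no difference between groups
t.test(activity ~ group)
```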
National Sample Survey and NFHS data are observational data. The drawback of this type of data is
that causal relationships are hard to establish from it.
Formation of data. 3 types of data-
Cross section data
• Many units observed at one point in time
• Examples include individual census or survey respondents, states or countries, Reed
students, colleges etc.
Time series data
• Same unit observed at many points in time (usually equally spaced)
• Examples include national macroeconomic variables
Panel (longitudinal) data
• Multiple units observed at multiple points in time (precondition: the units, e.g. households,
should remain the same in every period)
• Eg: state level or national level time series data (for many states or countries)
Pooled cross section data- a sample is taken at one point in time and another sample is taken from a
different place at a different point in time. It differs from panel data because the samples do not
follow the same units: they come from different locations and different times.
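The layouts of these data types can be sketched as small data frames. The state labels, years and income figures are hypothetical and only show how the rows are organised.

```r
# Hypothetical numbers, used only to show the layout of each data type.

# Cross section: many units, one point in time
cross_section <- data.frame(state = c("A", "B", "C"), year = 2020,
                            income = c(10, 12, 9))

# Time series: one unit, many points in time
time_series <- data.frame(state = "A", year = 2018:2020,
                          income = c(8, 9, 10))

# Panel: the same units followed over several years
panel <- expand.grid(state = c("A", "B", "C"), year = 2018:2020)
panel$income <- seq_len(nrow(panel))

# In a balanced panel every unit appears exactly once per period
table(panel$state, panel$year)
```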
Estimation and Testing
Data science- finding the causal link or causal nature
What is an estimator: OLS (minimizes the sum of squared distances between the data and the fitted
line), MLE (chooses the parameter values that maximize the likelihood of the observed data).
Logit models- for yes or no (binary) data, where ordinary linear regression does not fit well.
• Determining the proper estimator for the data and model
• May need to test properties of data to determine this
• Estimators are mathematical functions or algorithms used to calculate the values of the
parameters of a model from a sample of data. They can be point estimators like least squares or
MLE, or distribution estimators like Bayesian methods, and they provide an estimate of the
unknown population parameters based on the sample data.
• Performing estimation and testing hypotheses of interest
After estimating, we move on to testing whether the model is correctly specified.
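As an illustration of an estimator as a function of the sample, the OLS formula can be computed by hand and compared with R's built-in lm(). The data are simulated and the true intercept and slope (1 and 2) are assumptions chosen for the example.

```r
# Sketch: OLS computed by hand and with lm(), on simulated data.
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)   # assumed true intercept 1 and slope 2

# Design matrix with a column of ones for the intercept
X <- cbind(1, x)

# OLS estimator: solves (X'X) b = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(y ~ x)
cbind(by_hand = as.vector(beta_hat), lm = coef(fit))

# Estimates, standard errors and t-tests of H0: coefficient = 0
summary(fit)$coefficients
```

The coefficient table from summary() covers the "testing hypotheses" step for the simplest case of individual coefficients.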
There could be some problems like
omitted variable bias (a variable that is very important in determining the effect is missing, either
because we don't have it or because we haven't included it; the solution is to find it or get a proxy
for that variable, like monthly per capita income, since no one would actually disclose their own
income; this can't be done for high income groups though),
multicollinearity (usually not a big problem, except for perfect multicollinearity, where one variable
perfectly explains another),
autocorrelation (mostly a problem of time series data: the error term in the previous period affects the
present error term, which can in turn affect the future error term, making the usual standard errors
unreliable),
heteroscedasticity (the variance of the error term is not constant; we need constant variance, so
homoscedasticity is favored, and we ask why the variance is changing and what the remedy is),
and functional form, which we refine.
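One way to look for heteroscedasticity is to regress the squared residuals on the explanatory variable, in the spirit of the Breusch-Pagan test. The sketch below uses base R only and simulated data in which the error spread is assumed to grow with x; it is an illustration, not a full diagnostic routine.

```r
# Sketch: detecting non-constant error variance with an auxiliary regression.
set.seed(7)
n <- 500
x <- runif(n, 1, 10)
u <- rnorm(n, sd = 0.5 * x)   # assumed: error spread grows with x
y <- 1 + 2 * x + u

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on x
aux <- lm(resid(fit)^2 ~ x)

# LM statistic n * R^2, compared with a chi-squared(1) distribution
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
p_value   # a small p-value points towards heteroscedasticity
```

A common remedy, not shown here, is to keep the OLS coefficients and use heteroscedasticity-robust standard errors.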
Diagnostic evaluation of model
• Examination of diagnostic statistics
• Residuals as clues to missing effects
• Revise model, data set, and/or estimation method and repeat
Assessment of validity
Difference between causation and correlation
Mostly Harmless Econometrics- by Joshua Angrist and Jörn-Steffen Pischke (causation and correlation explained)
Causation- an additional year of education causes wages to increase by a given amount, all else
equal
Correlation- an additional year of education is associated with higher wages
In economic studies based on observational data, causation often cannot be determined directly; what we observe is only correlation

R Programming
Variables
A variable is a name for a memory location that contains a value. A variable in R can store
numeric values, complex values, words (character strings), matrices or even tables.
In R, we do not need to declare a variable before we use it, unlike other programming languages like
Java, C, C++, etc.
A variable can hold any kind of data type.
In R we can use the following for assignment of values to variables: ' = ', ' <- ', ' -> '
R programming data types
In R, the data type of a variable is not specified in advance; rather, the variable gets the data type
of the R object assigned to it. R is called a dynamically typed language, which means we can change
the data type of the same variable again and again when using it in a program.
Eg: x = 15, x = TRUE, x = "Hi"
The data type tells us which kind of value a variable holds and which mathematical, relational or
logical operations can be applied to it without causing an error.
Data types-
Logical- TRUE, FALSE (Boolean values)
Integer- whole numbers, e.g. 5L
Numeric- floating point (decimal) values
Complex- 3 + 2i (numbers with an imaginary part)
Character/String- text, e.g. "Hi"
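The three assignment operators and the way a variable takes on the type of whatever is assigned to it can be checked with class(). The variable names n_int, z and res are only illustrative.

```r
x = 15           # assignment with =
class(x)         # "numeric"
x <- TRUE        # assignment with <-
class(x)         # "logical"
"Hi" -> x        # assignment with ->
class(x)         # "character"

n_int <- 5L
z <- 3 + 2i
class(n_int)     # "integer"
class(z)         # "complex"
typeof(15)       # "double": numeric values are stored as doubles

# An operation that the data type does not support raises an error
res <- try("Hi" + 1, silent = TRUE)
inherits(res, "try-error")   # TRUE
```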
Vectors
They are the most basic R data objects and there are six types of atomic vectors. The six atomic
vector types are-
Logical
Numeric
Integer
Complex
Character
Raw
In general, a vector is defined and initialized in the following manner
Vtr = c(2, 5, 11, 24)
Put L after every value to store it as an integer (vec4), e.g. c(2L, 5L, 11L, 24L).
In the case of c(1L, 1.2, 1.3), the vector is stored as numeric and 1L is converted to 1.0.
A vector that mixes numbers and character strings (vec6) becomes all characters.
If we don't want all to become characters, we use "lists", which store the data as it is without
changing its type.
Lists are quite similar to vectors, but lists are R objects which can contain elements of different
types like numbers, strings, vectors and even another list.
Vtr <- c("Hi", "Bye", "What up")
mylist <- list(Vtr, 22.5, 14965, TRUE)
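The coercion rules for vectors, and the way a list avoids them, can be seen by inspecting the class of each object. The object names below are illustrative.

```r
v_int <- c(2L, 5L, 11L)      # L suffix: stored as integer
v_num <- c(1L, 1.2, 1.3)     # 1L is coerced to 1.0: stored as numeric
v_chr <- c(1, "a", TRUE)     # one string makes every element a character

class(v_int)   # "integer"
class(v_num)   # "numeric"
class(v_chr)   # "character"

# A list keeps each element in its own type
mylist <- list(c("Hi", "Bye"), 22.5, 14965L, TRUE)
sapply(mylist, class)   # "character" "numeric" "integer" "logical"
```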
Matrix
A matrix is an R object in which the elements are arranged in a two-dimensional rectangular layout.
Basic syntax for creating a matrix in R- matrix(data, nrow, ncol, byrow, dimnames)
where
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical value. If TRUE, the input vector elements are arranged by row.
dimnames is a list with the names assigned to the rows and columns.
Mymatrix <- matrix(c(1:25), nrow = 5, ncol = 5, byrow = TRUE)
The default of byrow is FALSE, so the matrix is filled column by column.
If nrow * ncol is larger than the number of data elements, the data elements start repeating
(they are recycled).
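A small sketch of the byrow argument, recycling and dimnames; the matrices and their names are made up for the example.

```r
# Fill by row versus the default fill by column
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row[1, ]   # 1 2 3
m_col[1, ]   # 1 3 5

# Recycling: 3 values fill a 3 x 2 matrix by repeating
m_rec <- matrix(1:3, nrow = 3, ncol = 2)
m_rec[, 2]   # 1 2 3

# dimnames: a list of row names and column names
m_named <- matrix(1:4, nrow = 2,
                  dimnames = list(c("r1", "r2"), c("c1", "c2")))
m_named["r2", "c1"]   # 2
```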
Array
Arrays in R are data objects which can be used to store data in more than two dimensions. array()
takes a vector as input and uses the values in the 'dim' parameter to create an array.
The basic syntax for creating an array in R is-
array(data, dim, dimnames)
where
data is the input vector which becomes the data elements of the array.
dim is the dimension of the array, where you pass the number of rows, the number of columns and the
number of matrices to be created.
dimnames is a list with the names assigned to each dimension (rows, columns, matrices).
Myarray <- array(c(1:16), dim = c(4, 4, 2))   # the 16 values are recycled to fill both 4x4 matrices
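Indexing an array takes one index per dimension: row, column, then matrix. A minimal sketch with an illustrative 3 x 4 x 2 array:

```r
arr <- array(1:24, dim = c(3, 4, 2))
dim(arr)        # 3 4 2

arr[2, 3, 1]    # row 2, column 3 of the first matrix: 8
arr[1, 1, 2]    # first element of the second matrix: 13
arr[, , 2]      # the whole second matrix (3 rows, 4 columns)
```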
Data Frame
A data frame is a table, or two-dimensional array-like structure, in which each column contains
the values of one variable and each row contains one set of values from each column. In a data frame:
Column names shouldn't be empty.
Each column should have the same number of data items.
Data stored in a data frame can be of numeric, factor or character type.
The row names should be unique.
Emp_id = c(100:104)
Emp_name = c("timmy", "benny", "jenny", "lenny", "bonny")
Dept = c("sales", "finance", "marketing", "HR", "RD")
Emp_data = data.frame(Emp_id, Emp_name, Dept)   # R is case sensitive: Dept, not dept
Emp_data
Data Operators
Arithmetic Operators- performing basic arithmetic
a+b
a-b
a*b
a/b
a%%b
a^b
a%/%b
> x = 22
> y = 7
> x + y      # addition
[1] 29
> x - y      # subtraction
[1] 15
> x %% y     # modulus (remainder)
[1] 1
> x * y      # multiplication
[1] 154
> x / y      # division
[1] 3.142857
> x ^ y      # exponentiation
[1] 2494357888
> x %/% y    # floor (integer) division
[1] 3
Relational Operators
These operators perform relational operations (comparisons), such as checking whether a variable is
greater than, less than or equal to another variable. The output of a relational operator is always a
logical value.
==, !=, >, <, >=, <=
They are used in conditions such as 'if (a == b)'.
num3 = (num1 == num2)
Printing num3, we get FALSE or TRUE.
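A short sketch of relational operators on single values and on vectors; the numbers are arbitrary.

```r
num1 <- 10
num2 <- 12
num3 <- (num1 == num2)
num3            # FALSE
num1 != num2    # TRUE
num1 <= num2    # TRUE

# On vectors the comparison is made element by element
v <- c(3, 8, 12)
v > 5           # FALSE TRUE TRUE
v[v > 5]        # keeps only the elements for which the condition is TRUE
```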
Logical Operators
These operators combine logical (Boolean) values using and, or and not.
a&b combines the vectors element by element and gives TRUE where both values are TRUE.
a|b combines the vectors element by element and gives TRUE where at least one of the values is TRUE.
!a takes each element of the vector and gives the opposite logical value.
a&&b, a||b (double-and / double-or) compare single logical values only. Older versions of R used just
the first element of a longer vector; since R 4.3.0, passing a vector of length greater than one is an error.
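The element-wise and single-value forms can be compared on two small logical vectors; the vectors are made up for the example.

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

a & b    # TRUE FALSE FALSE
a | b    # TRUE TRUE FALSE
!a       # FALSE FALSE TRUE

# && and || are meant for single values, e.g. inside if ()
x <- 5
in_range <- (x > 0) && (x < 10)
in_range   # TRUE
```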
IF statements, Else If statements, Else statements
a = 7
b = 7
# else / else if must sit on the same line as the closing brace
if (a > b) {
  print("a greater than b")
} else if (a < b) {
  print("a less than b")
} else {
  print("both numbers are equal")
}
Loops
Repeat loop: it repeats a statement or a group of statements until a break statement is reached.
The repeat loop is the best example of an exit controlled loop, where the code is first executed and then
a condition inside the loop is checked to determine whether control should stay inside the loop or exit from it.
While loop: it repeats a statement or a group of statements while a given condition is TRUE.
The while loop is slightly different from the repeat loop. It is an example of an entry
controlled loop, where the condition is checked first and control passes inside the loop to execute
the code only if the condition is found to be true.
For loop: it repeats a statement or a group of statements a fixed number of times. Unlike the repeat
and while loops, the for loop is used in situations where we know beforehand the number of times the
code needs to be executed. It is like the while loop in that the condition is checked first
and only then does the code written inside get executed.
x = 2
repeat {
  x = x^2            # square x on every pass
  if (x > 100) {     # exit condition is checked after the code has run
    print(x)
    break
  }
}
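For comparison with the repeat loop, here is a minimal sketch of a while loop and a for loop. The for loop is used to build the first ten Fibonacci numbers; the variable names are illustrative.

```r
# while loop: the condition is checked before every pass
i <- 1
total <- 0
while (i <= 5) {
  total <- total + i
  i <- i + 1
}
total   # 15

# for loop: the number of passes is known beforehand
fib <- numeric(10)
fib[1:2] <- c(1, 1)
for (k in 3:10) {
  fib[k] <- fib[k - 1] + fib[k - 2]
}
fib     # 1 1 2 3 5 8 13 21 34 55
```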
Fibonacci Sequence
Mean
Median
MODE
Functions
A <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- table(A)   # frequency of each value
y
names(y)[which(y == max(y))]   # value(s) with the highest frequency, i.e. the mode
Here table, which, max and names are all functions.
(A) is an argument.
Function type: built-in function
A function is called as myfunction(data) and returns a value.
Function syntax
Function_name <- function(arg1, arg2, …) {
  # code fragments
  return(value)   # value to return
}
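Following the function syntax above, a user-defined function for the mode can be sketched next to the built-in mean() and median(). The name get_mode and the sample values are hypothetical.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
get_mode <- function(v) {
  counts <- table(v)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # value(s) with the highest count
}

scores <- c(2, 3, 3, 5, 7, 3, 5)
mean(scores)       # 4
median(scores)     # 3
get_mode(scores)   # "3" (returned as character, because it comes from names())
```

R has no built-in function for the statistical mode; the built-in mode() reports the storage mode of an object instead.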
Importing and Exporting data
R works most easily with datasets stored as text files. Typically, values in text files are separated,
or delimited, by tabs or spaces, or by commas (CSV files).
Reading in text data
• R provides several related functions to read data stored as files.
• Use read.csv() to read in data stored as CSV and read.delim() to read in text data delimited
by other characters (such as tabs or spaces).
• For read.delim(), specify the delimiter in the sep= argument.
• Both read.csv() and read.delim() assume the first row of the text file is a row of variable
names. If this is not true, use the argument header=FALSE.
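The reading functions above can be tried without any external data by first writing small temporary files. The file contents and column names are made up; write.csv() is included as one way to export.

```r
# Create a small CSV file in a temporary location (made-up contents)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,height,diabetic",
             "1,160,no",
             "2,172,yes",
             "3,168,no"), tmp_csv)
d_csv <- read.csv(tmp_csv)          # first row is used as variable names
names(d_csv)

# A tab-delimited file without a header row
tmp_tab <- tempfile(fileext = ".txt")
writeLines(c("1\t160\tno", "2\t172\tyes"), tmp_tab)
d_tab <- read.delim(tmp_tab, sep = "\t", header = FALSE)
names(d_tab)                        # default names V1, V2, V3

# Exporting: write a data frame back to a CSV file
out_csv <- tempfile(fileext = ".csv")
write.csv(d_csv, out_csv, row.names = FALSE)
```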
Subsetting a data frame, for example:
Third row, second column: mydata[3, 2]
Values 1 to 2 of column height: mydata$height[1:2]
All values of column diabetic: mydata$diabetic
To go from a data frame to a vector, use $
And to go from a vector to a subset, we use [ ]
Variables, or columns, of a data frame can be selected with the $ operator, and the resulting object is
a vector. We can further subset elements of the selected column vector using [ ].
Subsetting a numeric column creates a numeric vector.
Examining the structure of an object
Use dim() on two dimensional objects to get the number of rows and columns.
Use str() to see the structure of the object, including its class and the data types of its elements. We
also see the first few values of each variable.
Adding new variables to the data frame
To add a column variable called logHeight to mydata
mydata$logHeight = log(mydata$height)
To add a new column from a vector
mydata$z = rep(0, 5)   # rep() repeats zero five times
You cannot add a variable that is longer than the number of rows of the data frame; if mydata has only 3
rows, the line above gives an error.
Correct: mydata$z = rep(0, 3)
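The steps for inspecting a data frame and adding columns can be put together as follows, assuming a hypothetical three-row mydata. try() is used only so that the error from the over-long vector does not stop the script.

```r
# Hypothetical three-row data frame
mydata <- data.frame(id = 1:3, height = c(160, 172, 168))

dim(mydata)    # 3 2  (rows, columns)
str(mydata)    # class, column types and first few values

# Add a transformed column and a constant column of matching length
mydata$logHeight <- log(mydata$height)
mydata$z <- rep(0, nrow(mydata))

# A vector longer than the number of rows is rejected with an error
res <- try(mydata$w <- rep(0, 5), silent = TRUE)
inherits(res, "try-error")   # TRUE
names(mydata)                # id, height, logHeight, z
```

Using nrow(mydata) instead of a hard-coded length keeps the new column the right size if the data change.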
OTHER USEFUL FUNCTIONS
log() -logarithm
min_rank() -rank values (from the dplyr package)
cut() -cut a continuous variable into intervals; the result is a factor whose levels show into which
interval the original value falls
scale() -standardizes the variable
lag(), lead() -lag and lead a variable (from the dplyr package)
cumsum() -cumulative sum
rowMeans(), rowSums() -means and sums of several columns
To select all except just two columns (with select() from dplyr): select(dog_data, -c(id, sex))
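Several of these functions can be tried on a short vector using base R alone. In this sketch rank() with ties.method = "min" stands in for dplyr's min_rank(), and a manual shift stands in for dplyr's lag(); the values in h are made up.

```r
h <- c(150, 160, 170, 180, 190)   # made-up heights

log(h)                                           # natural logarithm
rank(h, ties.method = "min")                     # base-R counterpart of min_rank()
groups <- cut(h, breaks = c(140, 160, 180, 200)) # factor of intervals
table(groups)
z <- as.vector(scale(h))                         # mean 0, standard deviation 1
h_lag <- c(NA, head(h, -1))                      # manual one-period lag
cumsum(h)                                        # 150 310 480 660 850

# Row-wise means and sums across several columns
m <- cbind(h, h + 10)
rowMeans(m)
rowSums(m)
```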
Appending
Appending is the stacking of two data frames on top of each other, usually done before merging. Make sure
to check that the columns are the same using names(). The rows of one data frame get added to the other.
Merging
Merging brings together columns from different data frames by matching rows on a common key variable.
The dplyr package has several types of merges that can be done using different functions like
inner_join(), left_join(), right_join() and such.
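Appending and merging can also be sketched in base R, with rbind() for appending and merge() for merging. This is a base-R stand-in for the dplyr joins named above, chosen so that the example runs without extra packages; the data frames are made up.

```r
# Two made-up data frames with the same columns
df1 <- data.frame(id = 1:2, score = c(80, 90))
df2 <- data.frame(id = 3:4, score = c(70, 85))

# Appending: check the columns match, then stack the rows
identical(names(df1), names(df2))   # TRUE
appended <- rbind(df1, df2)

# Merging: add columns from another data frame by matching on id
info <- data.frame(id = c(1L, 2L, 5L), dept = c("sales", "HR", "RD"))

inner <- merge(appended, info, by = "id")                 # only ids found in both
left  <- merge(appended, info, by = "id", all.x = TRUE)   # keep every row of appended

inner
left    # dept is NA for ids 3 and 4, which have no match in info
```

With dplyr loaded, inner_join(appended, info, by = "id") and left_join(appended, info, by = "id") should give the corresponding results.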
