Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 7

cd "C:\Users\Lenovo\Documents\LUMS\Fall 2022\Econometrics\01 dta"

use "roster"

*opens the roster dataset. You can also directly click the dta file or

*write the full path after use.

tab age

set more off

*Disables the annoying "more" option displayed everytime you run a command that obtains results so long, they can't fit in the screen all at
once.

*Numeric variables are stored as byte, int, long, float or double.

*Byte, int and long can only hold integers

*String variables are stored as str9, str13 etc.

*String variables can hold non-numeric characters as well. No mathematical operation can be formed on them

tab age

summarize age

drop if age==2018

*drops observations in which age=2018

tab sb1q4

tab sb1q4, nolab

*Or

label list sb1q4

*finding the value labels of gender

drop if sb1q4==0

summarize age if sb1q4==2

*restricting the observations for which the age variable is being summarized,

*to only female respondents (compare female age stats with overall stats)
count if sb1q4==1 & age<18

*211,185 respondents are male and below 18

*TASK: How many respondents are above or equal to 18 and either separated or

*divorced?

count if age>=18 & (sb1q7==4 | sb1q7==5)

*2178 respondents above or equal to 18 are either separated or divorced.

sum age if sb1q7==2

tab sb1q7 sb1q4

*Two-way tabulation. Breakdown of respondents in each marital status category by gender, or, breakdown of male and female respondents by
the marital status category (depends if you are following values

*horizontally or vertically). Try the "column", "row", "column nofreq" and "row nofreq" options with the tab command (feel free to use the help
file to see how to use these options)

*and see how you can enhance the results obtained from this two-way tabulation command.

cd "C:\Users\Lenovo\Documents\LUMS\Fall 2022\Econometrics\01 dta"

use "roster"

set more off

drop if age==2018

*drops observations in which age=2018

preserve

drop sb1q62

restore

*restores the dataset and undo any changes you made

codebook age

*provides more detailed summary statistics

*TASK: How many districts in this dataset?

codebook district
list age in 1/5

*lists the first 5 observations

ed

*saving dataset

save "roster_new", replace

/*Data Cleaning involves:

1) Renaming Variables

2) Assigning/Changing Variable Labels

3) Defining Value Labels

4) Generating New Variables

5) Creating smaller subsets of data

6) Merging Datasets

7) Reshaping Datasets

8) Removing special characters from string variables

9) Converting variables from string to numeric or vice versa

10)Recoding Missing Values

*/

*To rename a variable:

rename sb1q4 gender

*TASK: Rename the variable sb1q11 to HHmember

rename sb1q11 HHmember

*To assign a variable label:

label variable hhcode "Household Code"


*TASK: Assign the variable label "Primary Sampling Unit" to the psu variable

label variable psu "Primary Sampling Unit"

*Changing Variable Label

label variable age

label variable age "Age"

*Recoding Missing Values

tab gender

replace gender=. if gender==0

tab gender

tab gender, missing

tab gender, nolab missing

*TASK: In the marital status variable, replace the 0 values with missing values

replace sb1q7=. if sb1q7==0

cd "C:\Users\Lenovo\Documents\LUMS\Fall 2022\Econometrics\01 dta"

use "roster"

set more off

rename sb1q4 gender

replace gender=. if gender==0

*Generating Variables and Assigning Value Labels

gen female=0

replace female=1 if gender==2

replace female=. if gender==.

/*We can now assign value labels to the female variable through the following

two-step procedure:

1) Define a value label

2) Assign that value label to our variable of interest


*/

tab female

lab define gender 0"Male" 1"Female"

lab val female gender

tab female

label list gender

*TASK: Now generate a variable named "Male". It should take the value 1 for

*male respondents and 0 for female respondents. Assign value labels to this

*variable.

*To verify that you have generated the correct binary variable for male:

tab gender male

*Changing Value Labels

tab sb1q11

tab sb1q11, nolab

lab val sb1q11

tab sb1q11

recode sb1q11 (2=0)

tab sb1q11

lab def hhmember 0"No" 1"Yes"

lab val sb1q11 hhmember

tab sb1q11

*Conversion from numeric to string

decode sb1q7, gen(marital)

tostring age, gen(age_string)

tostring sb1q4, gen(gender)

cd "C:\Users\Lenovo\Documents\LUMS\Fall 2022\Econometrics\01 dta"


*Importing from excel

import excel "C:\Users\Lenovo\Documents\Job Applications\LUMS\Spring 2023\Econometrics\roster_subset.xlsx", sheet("Sheet1") firstrow

*Generating variables from string variables

describe sb1q7

gen divorced=0

replace divorced=1 if sb1q7=="divorced"

replace divorced=. if sb1q7==""

tab divorced

*TASK: Generate a binary variable and name it "married". It should take the

*value 1 if the respondent is currently married and 0 otherwise.

*Convert a string variable to a numeric categorical variable

encode sb1q7, generate(marital)

*TASK: Convert sb1q4 from string to numeric

encode sb1q4, gen(gender)

use "roster.dta", clear

*Convert province from numeric to string

decode province, gen(province_string)

*upper(), lower() and proper functions

*To make all the characters into uppercase

replace province_string = upper(province_string)

*To make all the characters into lowercase

replace province_string = lower(province_string)


*To make all just the first character into uppercase

replace province_string = proper(province_string)

*TASK: Convert the district variable from numeric to string. Transform the

*variable such that all the characters are in uppercase.

di subinstr("Ec00nometrics","00","o",.)

di subinstr("IceCCCCCCCCCCream","C","",.)

di subinstr("Iceream","er","ecr",.)

replace province_string = subinstr(province_string,"a","@",.)

***Replace @ with a

replace province_string = subinstr(province_string,"@","a",.)

di substr("Econometrics",1,3)

di substr("Econometrics",6,.)

di substr("Econometrics",-3,.)

gen prov_code = substr(province_string,1,3)

*generates the first three characters from the province_string variable

You might also like