
data_cleaningmanagement.p2 - Printed on 25-Sep-23 1:59:50 AM


***************Cleaning Management.P2********************************
**********************************************************************
* clear, directory
**********************************************************************


*clear
clear all
set more off
macro drop _all

*directory
cd "/Users/..../data_management"

// globals
global raw "data/raw"
global clean "data/clean"
global analysis "data/analysis"
global logs "logs"


**********************************************************************
* open a log file
**********************************************************************

*open log file (forward slash so the path also works on macOS/Linux)
*log using "$logs/load_clean_ipeds", replace


**********************************************************************
* input raw dataset
**********************************************************************

*load data from ipeds
import delimited using "$raw/us_data.csv"

*take a look at the data
browse


**********************************************************************
* identify and clean up the variables i want to work with
**********************************************************************

*rename variables
rename sectorofinstitutionhd2015 sector
rename carnegieclassification2015underg classification
rename fulltimeundergraduateenrollmentd ugrad_enrl_ft
rename undergraduateenrollmentdrvef2015 ugrad_enrl
rename percentadmittedtotaldrvadm2015 pct_admitted

*change the order
order unitid institutionname sector classification ugrad_enrl ugrad_enrl_ft pct_admitted


**********************************************************************
* take a quick look at sector
**********************************************************************

*cross-tab. levels of sector
tab sector
tab sector, m // it's a good habit to ALWAYS be thinking about missing values. here there are none.


* ipeds also provides a reference that explains what the values of sector mean

/*
Sector of institution (HD2015) 1 Public, 4-year or above
Sector of institution (HD2015) 2 Private not-for-profit, 4-year or above
Sector of institution (HD2015) 3 Private for-profit, 4-year or above
Sector of institution (HD2015) 4 Public, 2-year
Sector of institution (HD2015) 5 Private not-for-profit, 2-year
Sector of institution (HD2015) 6 Private for-profit, 2-year
Sector of institution (HD2015) 7 Public, less-than 2-year
Sector of institution (HD2015) 8 Private not-for-profit, less-than 2-year
Sector of institution (HD2015) 9 Private for-profit, less-than 2-year
*/


**********************************************************************
* what we want to derive from the sector variable
* 1. a way to identify 4-year or not 4-year
* 2. a way to identify public
**********************************************************************


*identify four-year colleges
gen four_year = 0
replace four_year = 1 if sector == 1
replace four_year = 1 if sector == 2
replace four_year = 1 if sector == 3


*identify public colleges (more elegant)
*note: sector 7 (public, less-than 2-year) would also need to be listed if it appears in the data
gen public = 0
replace public = 1 if inlist(sector, 1, 4)

***** most elegant approach *****
drop four_year public

*identify four-year colleges
gen four_year = inrange(sector, 1, 3) // all values outside the range, including missing, become zero.
                                      // in this dataset, there are no missing values for sector

*identify public colleges
gen public = inlist(sector, 1, 4) // all values other than 1 and 4, including missing, become zero.
                                  // in this dataset, there are no missing values for sector
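
*optional check (a sketch, not part of the original file): the comments above assume that
*sector has no missing values. assert stops the do-file with an error if that assumption is false.
assert !missing(sector)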

*cross-tab
tab sector four_year
tab sector public

**cross-tab. before, we had 1284 Private not-for-profit, 4-year or above (category 2)
tab public four_year

*drop the four-year for-profit colleges (sector 3).
drop if sector == 3

**cross-tab. do i get 1284 for not public, four-year?
tab public four_year // yes


**********************************************************************
* label values of variables
* 1. four-year
* 2. public
**********************************************************************



*label four-year
label define four_year_label 0 "2-Year" 1 "4-Year"
label values four_year four_year_label


*did it work?
tab four_year


*label public
label define public 0 "Private" 1 "Public"
label values public public


*did it work?
tab public


**********************************************************************
* inspect pct_admitted variable
**********************************************************************

*summarize
su pct_admitted, d

*look at "obs." what does this tell you?
di _N

*generate indicator variable to investigate missing patterns more carefully
gen miss_admit = mi(pct_admitted)

*cross-tab and compute the percentage of miss_admit by liberal arts
*note: liberal_arts is not created in this file; it is assumed to already exist in the data
tab miss_admit liberal_arts, col // 91.5% non-missing for lib arts.
                                 // 51% non-missing for non-liberal arts.

*how about institutions with "community college" in their name?
su pct_admitted if regexm(institutionname, "Community College") // only 5 report data!


*look at missing rate using the more granular classification system
tab classification miss_admit, m
tab classification miss_admit, row

*based on the description, it seems that most community colleges will be in 1-4.
tab classification if regexm(institutionname, "Community College")


**********************************************************************
* compute % of students who are full-time
**********************************************************************

*generate new variable (a proportion between 0 and 1, despite the "pct" in the name)
gen ft_pct = ugrad_enrl_ft / ugrad_enrl

*summary stats
su ft_pct, d
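
*optional sanity check (a sketch, not part of the original file): full-time enrollment should
*not exceed total enrollment, so this count is expected to be 0. count only reports the number;
*it does not stop the do-file.
count if ft_pct > 1 & !missing(ft_pct)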

*histogram
hist ft_pct, density // spelled out: histogram can read a bare "d" as discrete, and ft_pct is continuous



**********************************************************************
* re-order variables
**********************************************************************

*maybe we prefer this order
order unitid institutionname ugrad_enrl_ft ft_pct four_year public pct_admitted liberal_arts ugrad_enrl

* move liberal arts to the spot after institutionname
order liberal_arts, after(institutionname)



save "$clean/us_college_data.dta", replace



**********************************************************************
* now, we decide we want to include more data about these colleges.
* perhaps we want to know more about variability by region.
* go back to ipeds, repeat the same steps for identifying the sample, and
* get a count of the number of students enrolling in each college by "home state."
* this time, for some reason, the data are in an excel spreadsheet, not a .csv file.
**********************************************************************

*clear the clean dataset from stata
clear

*load the new dataset (without options, the header row is read in as data)
import excel using "$raw/us_data_stud_state.xlsx"



*clear it out and try again
clear

*tell stata that the first row has variable names
import excel using "$raw/us_data_stud_state.xlsx", firstrow



*get rid of the "EF2014C_RV" prefix. keep the rest of the variable name
rename EF2014C_RV* *

* prepare data for merge
* the key variable is the identifier--in this case, unitid
* 2 things to note
* 1. the VARIABLE NAME must match EXACTLY.
* 2. the VARIABLE CONTENTS must match EXACTLY.

*the other dataset, the "master" dataset, has a variable called "unitid"
*this dataset, the "using" dataset, has a variable called UnitID
*must rename!

*rename
rename UnitID unitid

*for a merge like this, we can't have duplicates on the merging variable (unitid)
*let's double-check
duplicates report unitid
*duplicates report State

*a manual way to ask stata to complain about uniqueness

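*the State check is expected to fail if several colleges share a state.
*capture noisily shows the error without stopping the do-file.
capture noisily bys State: assert (_n==1)
bys unitid: assert (_n==1)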
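
*a more compact alternative (a sketch, not part of the original file): isid exits with an
*error unless the listed variable uniquely identifies the observations.
isid unitid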

*save the "using" dataset
save "$clean/college_stud_state.dta", replace


**********************************************************************
* load the master dataset and merge in the new data
**********************************************************************

*master
use "$clean/us_college_data.dta", clear // one-step approach for clearing and loading

*merge
merge 1:1 unitid using "$clean/college_stud_state.dta"

*keep only the colleges that matched (_merge == 3).
*this drops colleges that appear in only one of the two datasets
drop if _merge != 3

*get rid of the auto-generated _merge variable
drop _merge

*drop Institution Name. it was in both datasets with different variable names
drop InstitutionName

**********************************************************************
* Loop
* 1. change the crazy looking variable labels for the state variables to something more sensible
*    using a loop
* 2. save the new file to "analysis" folder, replacing the old version of the analysis file
**********************************************************************

*change the variable labels for the state variables

label var Alabama "Alabama Enrollees"
label var Alaska "Alaska Enrollees"


*type "help foreach" into the command window
*use the loop structure below to relabel ALL the state variables following the pattern above.
*you'll only have to put one line of code into the loop


*loop. note that we can take advantage of the order of the variables using "-"
foreach x of varlist Arizona-Wyoming {
    label var `x' "`x' Enrollees"
}

*** make the variable names and contents lowercase
*rename variables
rename Alabama-Wyoming, lower


*replace values of state with lower-case values
replace State = lower(State)


***get rid of spaces in the state variable
*note that State still has spaces, but variable names do not (and cannot)
replace State = subinstr(State, " ", "", .)
tab State



***note that missing is equivalent to zero for alabama-wyoming
*this seems like a good thing to fix with a loop.
*before going for the loop, it's helpful to do a few cases without looping
replace alabama = 0 if alabama == .
replace alaska = 0 if alaska == .



foreach s of varlist arizona-wyoming {
    replace `s' = 0 if `s' == .
}

***break down hard problems into easier problems

*alabama
gen out_of_state = .
*replace out_of_state = alaska + arizona + ... // hmmm, seems hard
drop out_of_state


gen in_state = . // maybe this will be easier. then we could subtract in-state from "UStotal" to get out of state.
replace in_state = alabama if State == "alabama"
replace in_state = alaska if State == "alaska"
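
*one way the remaining states could be handled (a sketch, not part of the original file).
*assumes every variable from arizona to wyoming is a state count whose name matches the
*lower-case, space-free values of State created above.
foreach s of varlist arizona-wyoming {
    replace in_state = `s' if State == "`s'"
}
*out-of-state enrollment would then be the total minus in_state; the exact name of the
*total variable is not shown in this file, so that step is left out here.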
339
340 *log close
341
342
343
344
345
346
347
348
