-
Notifications
You must be signed in to change notification settings - Fork 63
/
Copy pathTutorial4_CleaningData.Rmd
365 lines (290 loc) · 13.8 KB
/
Tutorial4_CleaningData.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
% Tutorial 4: Cleaning and Merging Data
% DPI R Bootcamp
% Jared Knowles
# Overview
In this lesson we hope to learn about:
- The Strategic Data Project
- Checking data for errors
- Recoding data and changing data types
- Diagnostics and error checks
<p align="center"><img src="img/dpilogo.png" height="81" width="138"></p>
```{r setup, include=FALSE}
# set global chunk options
opts_chunk$set(fig.path='figure/slides4-', cache.path='cache/slides4-',fig.width=8,fig.height=6,
message=FALSE,error=FALSE,warning=FALSE,echo=TRUE,cache=TRUE,autodep=TRUE,comment=NA)
```
# Data Setup
- Let's read in a new dataset now that has some messiness to it
```{r dataread}
load('data/Student_Attributes.rda')
head(stuatt[,1:4],7)
```
- What's wrong with this?
# How can R help correct this?
- Identify problems
- Enforce business rules for messy data consistently
- Build data cleaning into all analyses tasks across the workflow
- Analyze inconsistencies and do reports
# Strategic Data Project
- [The Strategic Data Project](http://www.gse.harvard.edu/~pfpie/index.php/sdp/) is a project housed at Harvard Center for Education Policy Research aimed at bringing high quality research methods and data analysis to bear on strategic management and policy decisions in the education sector
- SDP was formed on two fundamental premises:
1. Policy and management decisions can directly influence schools' and teachers' ability to improve student achievement
2. Valid and reliable data analysis significantly improves the quality of decision making
- Their focus is on bringing together the right people, assembling the right data, and performing the right analysis because this will improve decisions made by leadership
- They are smart folks who have done a lot of the important but unexciting work of systematically identifying how to clean, document, and transparently evaluate datasets
<p align="center"><img src="img/sdp.gif" height="112" width="329"></p>
# Toolkit - Data Cleaning
- The SDP has come up with a great tutorial and guided analyses using a great synthetic data set to help walk through the process of cleaning data
- This was written in Stata, we are porting it to R (you can contribute to this effort if you like), and are going to walk through just a single lesson of it here (Clean Data Building)
- You can get the toolkit lesson that this tutorial is adapted from [online](http://www.gse.harvard.edu/~pfpie/index.php/sdp/tools)
- There are five toolkits in addition to a data guide that are incredibly helpful, so we are just touching the tip of the iceberg
- Other modules include:
1. How to identify essential data elements for analyzing student achievement
2. Clean, check, and build variables in the dataset
3. Connect relevant datasets from different soruces
4. Analyze datasets
5. Adopt coding best practices to facilitate shared and replicable data analysis
# SDP Task 1 Student Attributes Intro
- Drop the `first_9th_school_year_reported` variable
```{r dropvar9}
stuatt$first_9th_year_reported<-NULL
```
- To drop variables in R we assign them `NULL`, another R quirk
# SDP Task 1 - Step 1: Consistent Gender
- Is gender unique for each student?
```{r uniquegender}
length(unique(stuatt$sid))
length(unique(stuatt$sid,stuatt$male))
```
- Nah, we have 21,803 unique students in our dataset, but *21,806* unique combinations of gender and student
# Testing Uniqueness
- Below we write a small function that automates the check we did on the last slide
- How does this work?
```{r uniquenesstest}
testuniqueness<-function(id,group){
length(unique(id))==length(unique(id,group))
} # Need better varname and some optimization to the speed of this code
testuniqueness(stuatt$sid,stuatt$male)
testuniqueness(stuatt$sid,stuatt$race_ethnicity)
testuniqueness(stuatt$sid,stuatt$birth_date)
```
- Messy...
# Where is the data messy?
```{r seethegenderprob}
stuatt[17:21,1:3]
```
- Student 7 has an inconsistently reported gender in our data
- We need a business rule to handle fixing this, and a way to implement it
- SDP provides the rule, R provides the systematic implementation
# Unifying Consistent Gender Values
- First we create a variable with the number of unique values gender takes per student
- In R to do this we create a summary table of student attributes by collapsing the data set into one row per student using the `plyr` strategy we learned in Tutorial 3
- Then we ask R to tell us how many rows have what values for the length of gender
```{r collapsegender}
library(plyr)
sturow<-ddply(stuatt,.(sid),summarize,nvals_gender=length(unique(male)))
table(sturow$nvals_gender)
```
- So 4 students have more than one unique value for gender
# Fixing the pesky observations
- At this point there are a number of business rules we could adopt
- We could assign students the most recent value, the most frequent value, or even a random value!
- Let's see if replacing it with the most frequent value works
```{r statmode}
# A function to find the most frequent value
library(eeptools)
sturow<-ddply(stuatt,.(sid),summarize,nvals_gender=length(unique(male)),gender_mode=statamode(male),gender_recent=tail(male,1))
head(sturow[7:10,])
```
# Fixing observations II
- Now we have two objects `stuatt` and `sturow` and we need to replace some values from `stuatt` with some values from `sturow`
- `merge` to the rescue!
- Let's `merge` our two data objects into a temporary data object called `tempdf`
```{r reconcilegender}
tempdf<-merge(stuatt,sturow) # R finds the linking variable already
head(tempdf[17:21,c(1,2,3,10,11)])
print(subset(tempdf[,c(1,2,3,10,11)],sid==12506))
```
- We fixed observation 7, but not observation 12506
# Fixing where the mode does not work
```{r reconcilegender2}
print(subset(tempdf[,c(1,2,3,10,11,12)],sid==12506))
```
- Our next business rule is to assign the most recent value of gender from the `gender_recent` variable when there is not a value of `gender_mode` that is valid
- This seems like a pretty simple job for `recoding` our variable!
# Recode Gender
- Two step process: first we assign `tempdf$male` to be the same as `tempdf$gender_mode`
- Then, where `tempdf$male` is now a "." indicating no modal category exists, we assign `tempdf$gender_recent` to be `tempdf$male`
- Go ahead and try this and use `testuniqueness(tempdf$id,tempdf$male)` to check if it worked
# Results
```{r solutiongender}
tempdf$male<-tempdf$gender_mode
tempdf$male[tempdf$male=="."]<-tempdf$gender_recent[tempdf$male=="."]
# we have to put the filter on both sides of the assignment operator
testuniqueness(tempdf$id,tempdf$male)
```
- Now let's clean up our workspace, we created a lot of temporary variables that we don't need
```{r cleanup}
rm(sturow)
stuatt<-tempdf
stuatt$nvals_gender<-NULL
stuatt$gender_mode<-NULL
stuatt$gender_recent<-NULL
# or just run stuatt<-tempdf[,1:9]
rm(tempdf)
```
# Create a consistent race and ethnicity indicator
- Let's practice the same procedure on race
### A Note About Variable Types
- In the SDP Toolkit you are advised to convert the `race_ethnicity` variable to numeric and add labels to it
- This is because Stata and other statistical packages don't have internal data structures that can handle the `factor` variable type like R can, and rely on numeric coding schemes
- Why don't we need to do this in R?
- In fact, in R, we should probably recode the `male` variable as a factor with values `M` and `F`
- One problem is that our datafile uses 'NA' for Native American and we do have to recode that... why?
# Recoding Race
- What's wrong with our race variable?
```{r raceerror}
summary(stuatt$race_ethnicity)
```
- How do we do this?
```{r recodeerrror,eval=FALSE}
length(stuatt$race_ethnicity[is.na(stuatt$race_ethnicity)])
stuatt$race_ethnicity[is.na(stuatt$race_ethnicity)]<-"AI"
summary(stuatt$race_ethnicity)
```
- Why doesn't this work?
# Correct conversion
```{r recodenaCORRECT}
length(stuatt$race_ethnicity[is.na(stuatt$race_ethnicity)])
stuatt$race_ethnicity<-as.character(stuatt$race_ethnicity)
stuatt$race_ethnicity[is.na(stuatt$race_ethnicity)]<-"AI"
stuatt$race_ethnicity<-factor(stuatt$race_ethnicity)
summary(stuatt$race_ethnicity)
```
- Factors are pesky, even though they are useful and keep us from having to remember numeric representations of our data
- In fact, if you read the toolkit, this is a big drawback of _Stata_ because you must constantly refer back to the numbers to remember what number corresponds to "hispanic"
# Inconsistency Within Years
- Let's consider student 3 in our dataset
```{r student3}
stuatt[7:9,c("sid","school_year","race_ethnicity")]
```
- How is this different from our prior problem?
- Since student 3 was recorded twice in the same year and given a different _race/ethnicity_ we now have to figure out some rules for assigning a consistent value
# Business Rule
- Again, we are implementing a business rule which means we are making some arbitrary decisions about the data
- In this case, if a student is _hispanic_ we will code both values as hispanic
- If the student is _not hispanic_ in either observation, we will code the student as _multiple
# Let's calculate the number of values per year
```{r createnvalswithinyearI}
nvals<-ddply(stuatt,.(sid,school_year),summarize,
nvals_race=length(unique(race_ethnicity)),
tmphispanic=length(which(race_ethnicity=="H")))
tempdf<-merge(stuatt,nvals)
# Clean up
rm(nvals)
# Recode race_ethnicity
tempdf$race2<-tempdf$race_ethnicity
tempdf$race2[tempdf$nvals_race>1 & tempdf$tmphispanic==1]<-"H"
tempdf$race2[tempdf$nvals_race>1 & tempdf$tmphispanic!=1]<-"M/O"
tempdf$race_ethnicity<-tempdf$race2
# Clean up by removing old variables
tempdf$race2<-NULL
tempdf$nvals_race<-NULL
tempdf$tmphispanic<-NULL
# Resort our result
tempdf<-tempdf[order(tempdf$sid,tempdf$school_year),]
```
# Compare them
```{r racewithinyearcompare,echo=FALSE}
subset(tempdf[,c("sid","school_year","race_ethnicity")],sid==3 &school_year<2007 | sid==8552 &school_year<2007 | sid==11382 &school_year<2007)
subset(stuatt[,c("sid","school_year","race_ethnicity")],sid==3 &school_year<2007 | sid==8552 &school_year<2007 | sid==11382 &school_year<2007)
```
# OK
- Merge it back together
```{r collapsecleaneddata}
stuatt<-tempdf
rm(tempdf)
```
# Break in Case of Emergency
```{r createnvalswithinyear,eval=FALSE}
# Stupid hack workaround of ddply bug when running too many of these sequentially
ddply_race <- function(x, y,z){
NewColName <- "race_ethnicity"
z <- ddply(x, .(y,z), .fun = function(xx,col){
c(nvals_race = length(unique(xx[,col])))},
NewColName)
z$sid<-z$y
z$school_year<-z$z
z$y<-NULL
z$z<-NULL
return(z)
}
nvals<-ddply_race(stuatt,stuatt$sid,stuatt$school_year)
tempdf<-merge(stuatt,nvals)
tempdf$temp_ishispanic<-NA
tempdf$temp_ishispanic[tempdf$race_ethnicity=="H"& tempdf$nvals_race>1]<-1
```
# Inconsistency across years
- So we are in the clear right?
- No, our data still has messiness across years:
```{r raceunique}
head(stuatt[,c("sid","school_year","race_ethnicity")])
```
- Student 1 and 2 are both listed as black and hispanic at alternate times
# So...
### What do we do?
### Try it on your own
### Remember, this is tough stuff, so feel free to ask for help!
# Answer
```{r racevarcleaning}
tempdf<-ddply(stuatt,.(sid),summarize,var_temp=statamode(race_ethnicity),
nvals=length(unique(race_ethnicity)),most_recent_year=max(school_year),
most_recent_var=tail(race_ethnicity,1))
tempdf$race2[tempdf$var_temp!="."]<-tempdf$var_temp[tempdf$var_temp!="."]
tempdf$race2[tempdf$var_temp=="."]<-paste(tempdf$most_recent_var[tempdf$var_temp=="."])
tempdf<-merge(stuatt,tempdf)
head(tempdf[,c(1,2,4,14)],7)
```
- Why do we have to do a paste command?
- What other parts of this code are important to remember?
- Always filter on both sides
- Always use `summarize` in the `ddply` call in this situation
# A Faster Way
- The nice thing about R is we can role processes together once we understand them
- Let's build a script to do this more efficiently
```{r scripting, eval=FALSE}
task1<-function(df,id,year,var){
require(plyr)
mdf<-eval(parse(text=paste('ddply(',df,',.(',id,'),summarize,
var_temp=statamode(',var,'),
nvals=length(unique(',var,')),most_recent_year=max(',year,'),
most_recent_var=tail(',var,',1))',sep="")))
mdf$var2[mdf$var_temp!="."]<-mdf$var_temp[mdf$var_temp!="."]
mdf$var2[mdf$var_temp=="."]<-
as.character(mdf$most_recent_var[mdf$var_temp=="."])
ndf<-eval(parse(text=paste('merge(',df,',mdf)',sep="")))
rm(mdf)
return(ndf)
}
# Note data must be sorted
tempdf<-task1(stuatt,stuatt$sid,stuatt$school_year,stuatt$race_ethnicity)
```
# Other References
- [The Strategic Data Project Toolkit](http://www.gse.harvard.edu/~pfpie/index.php/sdp/tools)
- [UCLA ATS: R FAQ on Data Management](http://www.ats.ucla.edu/stat/r/faq/default.htm)
- [Video Tutorials](http://www.twotorials.com/)
- [The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham](http://www.jstatsoft.org/v40/i01) available in the Journal of Statistical Software vol 40, Issue 1, April 2011
# Session Info
It is good to include the session info, e.g. this document is produced with **knitr** version `r packageVersion('knitr')`. Here is my session info:
```{r session-info}
print(sessionInfo(), locale=FALSE)
```
# Attribution and License
<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license" href="http://creativecommons.org/publicdomain/mark/1.0/">
<img src="http://i.creativecommons.org/p/mark/1.0/88x31.png"
style="border-style: none;" alt="Public Domain Mark" />
</a>
<br />
This work (<span property="dct:title">R Tutorial for Education</span>, by <a href="www.jaredknowles.com" rel="dct:creator"><span property="dct:title">Jared E. Knowles</span></a>), in service of the <a href="http://www.dpi.wi.gov" rel="dct:publisher"><span property="dct:title">Wisconsin Department of Public Instruction</span></a>, is free of known copyright restrictions.
</p>