
Data Understanding and Preparation

OPEN REFINE TUTORIAL

Module 1.2 Video – Instructions

1. Basic customer data clean up


a. Download and unzip Customers-messy.zip and familiarize yourself with the dataset
b. Create Project > Browse > Customers-messy.zip
c. Character encoding (browse through the many options available) > select UTF-8
d. Enter a project name > Create Project
e. Combine contactFirstName and contactLastName into contactFullName (steps f to j)
f. contactFirstName > Edit cells > Common transforms > Trim leading and trailing whitespace
g. contactLastName > Edit cells > Common transforms > Trim leading and trailing whitespace
h. contactFirstName > Edit column > Add column based on this column
i. Enter the expression: value + " " + cells["contactLastName"].value
j. Enter the new column name: contactFullName
k. Use Undo / Redo to step back and forth through the changes
l. contactFullName > Edit column > Remove this column
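
For readers who want to check the result outside OpenRefine, the trim-and-combine steps above can be sketched in Python. This is only an illustration: the function name full_name and the sample values are assumptions, not part of the tutorial data.

```python
def full_name(first: str, last: str) -> str:
    """Trim both name parts, then join them with a single space."""
    # Mirrors steps f, g and i: trim whitespace, then concatenate.
    return f"{first.strip()} {last.strip()}"


# Hypothetical messy values with stray whitespace.
print(full_name("  Jean ", " King  "))
```

Trimming before concatenating matters: without it, the stray spaces end up in the middle of the combined name.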
2. Faceting (text)
a. City > Edit cells > Common transforms > To titlecase
b. Country > Facet > Text facet
c. In the left panel, scroll down to see multiple values for the same country: "United states", "US", etc.
d. Click Cluster to get the overall view
e. Try the different Method options, such as key collision, and keying functions, such as fingerprint
f. Tick Merge? for the clusters you want to combine, then click Merge Selected & Re-Cluster
g. Method > nearest neighbor, Distance function > PPM; tick Merge? and click Merge Selected & Re-Cluster
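
To give an idea of what key collision clustering does, here is a minimal Python sketch. The fingerprint function below is a simplified approximation (trim, lowercase, drop ASCII punctuation, sort and de-duplicate tokens), not OpenRefine's exact implementation, and the sample country spellings are made up for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified fingerprint key (an approximation, not OpenRefine's code)."""
    no_punct = str.maketrans("", "", string.punctuation)
    cleaned = value.strip().lower().translate(no_punct)
    # Sort and de-duplicate tokens so that word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group distinct spellings that collide on the same key."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    # A key only forms a cluster if more than one spelling maps to it.
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}


# Hypothetical values for the Country column.
countries = ["United States", "united states ", "States, United", "US", "USA"]
print(cluster(countries))
```

Note that "US" and "USA" do not collide with "United States" under this key, which is one reason to try other methods such as nearest neighbor as well.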
3. Faceting (number)
a. creditLimit > Edit cells > Common transforms > To number
b. creditLimit > Facet > Numeric facet
c. In the facet panel on the left, uncheck Numeric so that only the non-numeric values remain, then click change
d. Enter the expression: toNumber(value.replace("USD","").replace("$","").replace(",",""))
e. Go to creditLimit > Edit cells > Transform
f. Enter the same expression to convert the cells themselves: toNumber(value.replace("USD","").replace("$","").replace(",",""))
g. Export > Comma-separated value
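
The clean-up expression in steps d and f can be mirrored in Python as a rough check. The function name to_number and the sample strings are assumptions for illustration; the real creditLimit values may use other formats.

```python
def to_number(value: str) -> float:
    """Strip the currency markers and thousands separators, then convert."""
    cleaned = value.replace("USD", "").replace("$", "").replace(",", "")
    # strip() removes the space left behind in values like "USD 21,000".
    return float(cleaned.strip())


# Hypothetical credit limit strings.
for raw in ["USD 21,000", "$1,500.50", "7100"]:
    print(raw, "->", to_number(raw))
```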

4. Stocks data cleanup


a. Data: Stocks.zip (JSON)
b. Set the language setting to English
c. Create Project > Browse > Stocks.zip
d. Click on the top-level JSON enclosure to convert the preview to a tabular format

e. Once the preview is available, enable Trim leading and trailing whitespace from strings (options below the table)
f. Enable Parse cell text into numbers/dates
g. Enter a project name > Create Project
h. All (left corner) > Edit columns > Re-order / remove columns (lets you discard irrelevant columns)
i. Click on any cell and edit its contents
j. Click the dropdown of any column (say Volume) > Sort (choose options as applicable)

5. Stocks data augmentation


a. Data: stocks_demo.csv (a subset of the original zip file, used for the web service calls)
b. Create Project > Browse > stocks_demo.csv
c. Enter a project name > Create Project
d. Ticker > Edit column > Add column by fetching URLs
e. Test the URLs below in a browser to make sure they work:

http://dev.markitondemand.com/MODApis/

http://dev.markitondemand.com/MODApis/Api/v2/Quote?symbol=AAPL

f. Invoke the web API from OpenRefine. Use the following expression to build the URL for each ticker:

'http://dev.markitondemand.com/MODApis/Api/v2/Quote?symbol=' + escape(value,'url')

g. Extract the last price from the fetched response with:

value.partition("<LastPrice>")[2].partition("</LastPrice>")[0]
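
The two expressions can be tried out in Python without calling the service. In this sketch the helper names quote_url and last_price are hypothetical, and SAMPLE_RESPONSE is an invented snippet used only to show the extraction, not a real API response.

```python
from urllib.parse import quote

BASE_URL = "http://dev.markitondemand.com/MODApis/Api/v2/Quote?symbol="

# Invented sample text, only to demonstrate the extraction step.
SAMPLE_RESPONSE = (
    "<StockQuote><Symbol>AAPL</Symbol>"
    "<LastPrice>123.45</LastPrice></StockQuote>"
)


def quote_url(ticker: str) -> str:
    """Build the request URL, percent-encoding the ticker."""
    return BASE_URL + quote(ticker, safe="")


def last_price(response_text: str) -> str:
    """Return the text between the LastPrice tags, or "" if they are absent."""
    after_open = response_text.partition("<LastPrice>")[2]
    return after_open.partition("</LastPrice>")[0]


print(quote_url("AAPL"))
print(last_price(SAMPLE_RESPONSE))
```

Splitting on the tags is quick but fragile; a proper XML parser would be the safer choice if the response layout is not fixed.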

Other Examples

6. Remove rows with NA


a. Column > Facet > Text facet > include or exclude the NA value in the facet
b. All > Edit rows > Remove all matching rows

7. Filter rows with NA


a. Column > Text filter > include or exclude the NA value in the filter
b. All > Edit rows > Remove all matching rows
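
The effect of selecting NA and removing the matching rows can be sketched in Python as follows. The rows, the column name and the "NA" marker are assumptions for illustration.

```python
def drop_na_rows(rows, column, marker="NA"):
    """Keep only the rows whose value in `column` is not the marker."""
    return [row for row in rows if row.get(column) != marker]


# Hypothetical rows; the second one carries the NA marker.
rows = [
    {"ticker": "AAPL", "volume": "1200"},
    {"ticker": "MSFT", "volume": "NA"},
    {"ticker": "IBM", "volume": "800"},
]
print(drop_na_rows(rows, "volume"))
```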

8. Remove brackets around text


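Edit cells > Transform > Expression box:
value.split("(")[0].split(")")[0]
This keeps only the text before the first "(" (for example, "Apple Inc (AAPL)" becomes "Apple Inc "). To delete just the bracket characters and keep the text inside them, use value.replace("(","").replace(")","") instead.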
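
The difference between cutting at the bracket and deleting only the bracket characters is easy to see in a small Python sketch. Both function names and the sample value are hypothetical.

```python
def before_bracket(value: str) -> str:
    """Same chain as the expression above: keep the text before the first "("."""
    return value.split("(")[0].split(")")[0]


def strip_brackets(value: str) -> str:
    """Alternative: delete only the bracket characters, keep the text inside."""
    return value.replace("(", "").replace(")", "")


sample = "Apple Inc (AAPL)"  # hypothetical cell value
print(repr(before_bracket(sample)))
print(repr(strip_brackets(sample)))
```

The first variant leaves a trailing space behind, so a whitespace trim afterwards is usually worthwhile.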

9. Split a column into multiple columns


Edit Column > Split into Several Columns

10. Multi-column operations


Values of all columns in the current row
forEach(row.columnNames,cn,cells[cn].value)

Merge all columns in the project into a single column


forEach(row.columnNames,cn,cells[cn].value).join("|")

Count the blank cells in each row (facet on the result to find rows with some blank cells)


filter(row.columnNames,cn,isBlank(cells[cn].value)).length()
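
The multi-column expressions above can be approximated in Python by treating a row as a dictionary of column name to value. This is a sketch under assumptions: the row is invented, the helper names are hypothetical, and "blank" is taken to mean None or an empty string.

```python
def merge_row(row: dict, sep: str = "|") -> str:
    """Join every column value of the row into one string."""
    # None is rendered as an empty string (an assumption of this sketch).
    return sep.join("" if v is None else str(v) for v in row.values())


def count_blank(row: dict) -> int:
    """Count the cells that are None or an empty string."""
    return sum(1 for v in row.values() if v is None or v == "")


# Hypothetical row with two blank cells.
row = {"ticker": "AAPL", "volume": "", "price": None}
print(merge_row(row))
print(count_blank(row))
```

A count of zero means the row is complete, so filtering on counts greater than zero isolates the rows that still need attention.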

Useful Links

https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions

https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

http://schoolofdata.org/handbook/recipes/cleaning-spending-data-open-refine/

http://blogs.worldbank.org/opendata/unpivoting-data-excel-open-refine-and-python
