This document provides steps for scraping real estate listings from Craigslist, cleaning the data using OpenRefine, and creating an initial visualization of the data in Tableau. It outlines scraping listings from Boston Craigslist using WebHarvy, then exporting the raw data to OpenRefine for cleaning tasks such as removing duplicates and parsing fields. The cleaned data is then exported to Tableau, where filters are applied to identify erroneous prices and the listings are mapped, with price represented by marker size and maximum price used as a filter.
Original Description:
Big data and cloud computing are everywhere. Try your hand at this data exercise.
Acronyms abound. Tremendous complexity. Use building blocks, not code. This is easy. An EPPM of 10 requires 500 professionals.
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0
Data preparation and cleansing:
- Missing values
- Duplicative entries
- Conventions (dates, time, geographies)
- Spacing
Can we measure data cleanliness? What's our Pareto point?
AWS -> EC2
Launch instance: ami-c6b61fae (US-EAST)
Instance type: m3.medium
Connect. You should see some software on the desktop.
- Scrape all of Craigslist's Boston apartment listings using WebHarvy
- Examine, clean, and prepare the data set using OpenRefine
- Map our data and apply filters using Tableau
...all without writing a single line of code. WebHarvy is a hyper-intelligent utility for scraping website data, made by SysNucleus, makers of USBTrace. Heavy-duty alternatives: Scrapy (scrapy.org), Beautiful Soup.
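For a sense of what those heavier tools do under the hood, here is a minimal sketch using only Python's standard-library `html.parser` (the listing markup and prices below are made up, not real Craigslist HTML):

```python
from html.parser import HTMLParser

# Hypothetical snippet mimicking a listings results page (not real markup).
SAMPLE = """
<ul>
  <li class="result-row"><a href="/apa/1.html">2BR near Kendall Square - $2400</a></li>
  <li class="result-row"><a href="/apa/2.html">Studio in Back Bay - $1850</a></li>
</ul>
"""

class ListingParser(HTMLParser):
    """Collect (href, title) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.listings = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text immediately following an open <a> tag is the listing title.
        if self._href and data.strip():
            self.listings.append((self._href, data.strip()))
            self._href = None

parser = ListingParser()
parser.feed(SAMPLE)
for href, title in parser.listings:
    print(href, "->", title)
```

Tools like WebHarvy, Scrapy, and Beautiful Soup automate exactly this kind of pattern extraction, plus pagination and politeness, which is why the exercise needs no code at all.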
http://shoutkey.com/wire
1. Start Config
2. Click on Hungry Mother; capture text
3. Click on Hungry Mother; capture URL
4. Click on Kendall Square/MIT; capture text
5. Click on the last review; capture text
Clear
1. Mine -> Scrape a list of similar links
2. Click on Hungry Mother
Let's start collecting information on the first sub-page.
Edit / Clear
Navigate into a sub-page
Start Config
Set as Next Page Link
Other WebHarvy features: Scheduler, Input keywords, Pause, Inject, Proxy, Database export.
(A word of caution: scraping often violates terms of service. Potentially not viable for apps or commercial purposes! Try visiting Craigslist from within AWS, by the way!)
Download Craigslist Boston from http://shoutkey.com/glorify
Look at our data: open Boston Dirty.csv (20k rows of mess!)
Time to CLEAN: launch GOOGLE-REFINE.EXE
Within Mozilla Firefox, navigate to http://127.0.0.1:3333/
Create Project -> This Computer -> Browse
Parse by tab
Create Project
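Before cleaning, it helps to quantify the mess. A small stdlib sketch of the kind of audit OpenRefine's facets give you for free (the rows below are invented stand-ins; the real Boston Dirty.csv has different columns and 20k rows):

```python
import csv, io

# Hypothetical rows standing in for "Boston Dirty.csv" -- illustrating the
# typical problems: exact duplicates, missing values, formatted numbers.
RAW = """Title,Price,Location
2BR near Kendall Square,$2400,Cambridge
2BR near Kendall Square,$2400,Cambridge
Studio in Back Bay,,Boston
Sunny 1BR,"1,850",Boston
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
dupes = len(rows) - len({tuple(r.values()) for r in rows})
missing_price = sum(1 for r in rows if not r["Price"].strip())
print(f"{len(rows)} rows, {dupes} duplicate(s), {missing_price} missing price(s)")
```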
1. First, sort your column.
2. Then invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears at the top middle of the data table.
3. Then invoke Edit cells > Blank down on the Title column.
4. Then, on that column, invoke Facet > Custom facets > Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the leftmost "All" dropdown menu.
6. Remove the facet.
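The sort / blank-down / remove flow above can be mirrored in a few lines of Python (a sketch for intuition only, with made-up listings; the exercise itself stays code-free):

```python
# Mirror OpenRefine's flow: after sorting, any row whose Title equals the
# previous row's Title is "blanked down" and removed as a duplicate.
listings = [
    {"Title": "2BR near Kendall Square", "Price": "$2400"},
    {"Title": "Studio in Back Bay", "Price": "$1850"},
    {"Title": "2BR near Kendall Square", "Price": "$2400"},
]

listings.sort(key=lambda r: r["Title"])   # step 1: sort the column
deduped, prev = [], None
for row in listings:                      # steps 3-5: blank down, then remove
    if row["Title"] != prev:
        deduped.append(row)
    prev = row["Title"]

print(len(deduped))
```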
Then run the To Number transform again
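What the "To number" transform is doing, roughly, once the price formatting has been stripped (a Python sketch of the equivalent logic; the `to_number` helper is illustrative, not an OpenRefine API):

```python
def to_number(raw):
    """Roughly what stripping price formatting plus 'To number' achieves:
    drop '$' and ',' and convert, returning None when conversion fails."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

print(to_number("$2,400"))   # 2400.0
print(to_number("call me"))  # None
```

Rows that come back as None are exactly the ones the blank facet will catch on the second pass.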
Increment the radius to 7, making judgment calls along the way. Then change the Distance Function and repeat.
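To see what the radius and distance function mean, here is a sketch of nearest-neighbour clustering with a classic edit-distance function (the place names are invented examples; OpenRefine's actual implementation differs in detail):

```python
def levenshtein(a, b):
    """Classic edit distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Values within the radius of a representative are candidates to merge
# into a single geography; widening the radius merges more aggressively.
radius = 2
names = ["Cambridge", "Camridge", "Cambrige", "Boston"]
cluster = [n for n in names if levenshtein(n, "Cambridge") <= radius]
print(cluster)
```

This is why the radius needs judgment calls: too small and real variants stay split, too large and distinct places get merged.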
Looks like we have SOME really expensive real estate. Data errors?
Load Boston Clean.csv
Go to Worksheet
A great semantic example: Tableau understands that this text translates to a lat/long.
Look at the map in the lower-right corner.
Let's filter the data.
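The idea behind the filter step, sketched in Python (the prices and the $50,000 cutoff are invented for illustration; in Tableau you drag the same threshold interactively):

```python
# Made-up monthly prices; the point is flagging implausible values
# before mapping, since one $9,999,999 "listing" would swamp the scale.
prices = [1850, 2400, 3100, 9_999_999, 2200]

threshold = 50_000  # judgment call: Boston rents above this are data errors
errors = [p for p in prices if p > threshold]
clean = [p for p in prices if p <= threshold]
print(f"kept {len(clean)}, flagged {errors}")
```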
Under Measures, drag Price onto Size in Marks.
Change SUM(Price) to AVG(Price).
Drag Price into Filters, change it to MAX(Price), and select "At Most".
Right-click on the filter and show Quick Filter.
Drag City onto Label.
Menu: Map -> Map Options.
Click on a node for info and drill-down potential.
To recap, we:
1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a really cursory mapping visualization
Please come talk to me.