Data Mashups in R
Jeremy Leipzig and Xiao-Yi Li
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Data Mashups in R by Jeremy Leipzig and Xiao-Yi Li Copyright © 2011 Jeremy Leipzig and Xiao-Yi Li. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or
[email protected].
Editor: Mike Loukides Production Editor: Kristen Borg Proofreader: Kristen Borg
Cover Designer: Karen Montgomery Interior Designer: David Futato Illustrator: Robert Romano
Printing History: March 2011:
First Edition.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Data Mashups in R, the image of a black-billed Australian bustard, and related trade dress are trademarks of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-1-449-30353-2 [LSI] 1299253461
Table of Contents
Introduction

1. Mapping Foreclosures
   Messy Address Parsing
   Exploring “streets”
   Obtaining Latitude and Longitude Using Yahoo
   Shaking the XML Tree
   The Many Ways to Philly (Latitude)
   Using Data Structures
   Using Helper Methods
   Using Internal Class Methods
   Exceptional Circumstances
   The Unmappable Fake Street
   No Connection
   Taking Shape
   Finding a Usable Map
   PBSmapping
   Developing the Plot
   Preparing to Add Points to Our Map
   Exploring R Data Structures: geoTable
   Making Events of Our Foreclosures
   Turning Up the Heat
   Factors When You Need Them
   Filling with Color Gradients

2. Statistics of Foreclosure
   Importing Census Data
   Descriptive Statistics
   Descriptive Plots
   Correlation
   Final Thoughts

Appendix: Getting Started
Introduction
Programmers may spend a good part of their careers scripting code to conform to commercial statistics packages, visualization tools, and domain-specific third-party software. The same tasks can force end users to spend countless hours in copy-paste purgatory, each minor change necessitating another grueling round of formatting tabs and screenshots.

Luckily, R scripting offers some reprieve. Because this open source project garners the support of a large community of package developers, the R statistical programming environment provides an amazing level of extensibility. Data from a multitude of sources can be imported into R and processed using R packages to aid statistical analysis and visualization. R scripts can also be configured to produce high-quality reports in an automated fashion, saving time, energy, and frustration.

This book will demonstrate how real-world data is imported, managed, visualized, and analyzed within R. Spatial mashups provide an excellent way to explore the capabilities of R, encompassing R packages, R syntax, and data structures. Instead of canned sample data, we will be plotting and analyzing actual current home foreclosure auctions. Through this exercise, we hope to provide a general idea of how the R environment works with R packages as well as its own capabilities in statistical analysis. We will be accessing spatial data in several formats (HTML, XML, shapefiles, and text), both locally and over the web, to produce a map of home foreclosures and perform statistical analysis on these events.
CHAPTER 1
Mapping Foreclosures
Messy Address Parsing

To illustrate how to combine data from disparate sources for statistical analysis and visualization, let’s focus on one of the messiest sources of data around: web pages. The Philadelphia sheriff’s office posts foreclosure auctions on its website each month. How do we collect this data, massage it into a reasonable form, and work with it?

First, create a new folder (for example, ~/Documents/Rmashup) to contain our project files. It is helpful to change the R working directory to your newly created folder.

#In Unix/MacOS
> setwd("~/Documents/Rmashup/")
#In Windows
> setwd("C:/~/Rmashup/")
We can download this foreclosure listings web page from within R (or you may instead choose to save the raw HTML from your web browser):

> download.file(url="http://www.phillysheriff.com/properties.html", destfile="properties.html")
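Once the page is saved locally, a natural next step is to read it into R as a character vector, one element per line, so that later pattern matching can work line by line. The following is a minimal sketch under stated assumptions, not the authors' code: it writes a few sample lines to a temporary file (standing in for properties.html) and reads them back. The names sample_file and listing_lines are illustrative.

```r
# Stand-in for properties.html: a temporary file with a few sample lines.
sample_file <- tempfile(fileext = ".html")
writeLines(c("6321 Farnsworth St.",
             "62nd Ward 1,379.88 sq. ft. BRT# 621533500",
             "  1402 E. Mt. Pleasant Ave.  "),
           sample_file)

# Read the file as a character vector, one element per line.
listing_lines <- readLines(sample_file, warn = FALSE)

# Trim stray leading/trailing whitespace before any pattern matching.
listing_lines <- trimws(listing_lines)

print(listing_lines)
```

With the real download, the same readLines() call would be pointed at "properties.html" in the working directory.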
Here is an excerpt of this web page’s content (HTML tags not shown); note that each street address sits on a line by itself:

6321 Farnsworth St.
62nd Ward 1,379.88 sq. ft. BRT# 621533500 Improvements: Residential Property
HOMER SIMPSON C.P. January Term, 2006 No. 002619 $27,537.87 Phelan Hallinan & Schmieg, L.L.P.
243-467
1402 E. Mt. Pleasant Ave.
50th Ward approximately 1,416 sq. ft. more or less BRT# 502440300 ...
The sheriff’s raw HTML listings are inconsistently formatted, but with the right regular expression we can identify street addresses: notice how they appear alone on a line. Our goal is to submit viable addresses to the geocoder. Here are some typical addresses that our regular expression should match:

3509 N. Lee St.
2120-2128 E. Allegheny Ave.
7601 Crittenden St., #E-10
370 Tomlinson Place
2311 N. 33rd St.
6822-24 Old York Rd.
335 W. School House Lane
These are not addresses and should not be matched:

2,700 sq. ft.
BRT# 124077100
Improvements: Residential Property
C.P. June Term, 2009 No. 00575
R has built-in functions that allow the use of Perl-type regular expressions. For more info on regular expressions, see Mastering Regular Expressions (O’Reilly) and Regular Expression Pocket Reference (O’Reilly).

With some minor deletions to clean up address idiosyncrasies, we should be able to correctly identify street addresses from the mess of other data contained in properties.html. We’ll use a single regular expression pattern to do the cleanup. For clarity, we can break the pattern into the familiar elements of an address (number, name, suffix):

> stNum [...]

> streets[grep("Place",streets)]
[1] "1430 Dondill Place" "370 Tomlinson Place" "8025 Pompey Place"
[4] "7330 Boreal Place" "2818 Ryerson Place" "8416 Suffolk Place"
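To make the number/name/suffix decomposition concrete, here is a rough sketch of how such a pattern might be assembled and checked against the sample lines listed earlier. Every pattern piece below (st_num, st_name, st_suf, st_unit) is a hypothetical stand-in written for illustration; none of them is the authors' original definition, and a pattern for the full listing would likely need more suffixes and cleanup rules.

```r
# Hypothetical pattern pieces; NOT the authors' original definitions.
st_num  <- "^[0-9]{2,5}(-[0-9]+)?"                      # 2-5 digits, optional range (6822-24)
st_name <- "([NSEW]\\. )?[0-9A-Za-z ]+"                 # optional direction, then street name
st_suf  <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)\\.?"   # common suffixes, optional period
st_unit <- "(, #.+)?$"                                  # optional unit (", #E-10"), end of line

address_pattern <- paste0(st_num, " ", st_name, " ", st_suf, st_unit)

should_match <- c("3509 N. Lee St.",
                  "2120-2128 E. Allegheny Ave.",
                  "7601 Crittenden St., #E-10",
                  "370 Tomlinson Place",
                  "2311 N. 33rd St.",
                  "6822-24 Old York Rd.",
                  "335 W. School House Lane")

should_not_match <- c("2,700 sq. ft.",
                      "BRT# 124077100",
                      "Improvements: Residential Property",
                      "C.P. June Term, 2009 No. 00575")

# perl = TRUE selects the Perl-compatible regular expression engine.
is_address <- function(x) grepl(address_pattern, x, perl = TRUE)

print(is_address(should_match))
print(is_address(should_not_match))
```

Keeping the pieces as separate strings makes it easy to adjust one element, such as adding a suffix, without rereading the whole expression.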
To order foreclosures by street number, strip everything from the first non-numeric character onward, cast the remainder as numeric, and use order() to get the sorted indices.

> streets[order(as.numeric(gsub("[^0-9].+",'',streets)))]
[1] "21 S. 51st St." "22 E. Garfield St."
[3] "26 W. Manheim St." "26 N. Felton St."
[5] "30 S. 58th St." "31 N. Columbus Blvd."
...
[1259] "12122 Barbary Rd." "12223 Medford Rd."
[1261] "12430 Wyndom Rd." "12701 Medford Rd."
[1263] "12727 Medford Rd." "13054 Townsend Rd."
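The ordering step is easier to follow on a small vector. This sketch applies the same gsub() and order() calls to three addresses taken from the examples above; demo_streets, leading_number, and sorted_streets are illustrative names.

```r
demo_streets <- c("370 Tomlinson Place",
                  "21 S. 51st St.",
                  "2120-2128 E. Allegheny Ave.")

# Delete everything from the first non-digit onward, leaving the leading number.
leading_number <- as.numeric(gsub("[^0-9].+", "", demo_streets))

# order() returns the indices that would sort the numbers in ascending order.
sorted_streets <- demo_streets[order(leading_number)]

print(leading_number)
print(sorted_streets)
```

Note that a range such as 2120-2128 is sorted by its first number only, because the hyphen is the first non-digit character.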
Obtaining Latitude and Longitude Using Yahoo

To plot our foreclosures on a map, we’ll need to get latitude and longitude coordinates for each street address. Yahoo Maps provides this functionality (called “geocoding”) as a REST-enabled web service. Via HTTP, the service accepts a URL containing a partial or full street address, and returns an XML document with the relevant information. It doesn’t matter whether a web browser or a robot is submitting the request, as long as the URL is correctly formatted. The URL must contain an appid parameter and as many street address arguments as are known.

http://local.yahooapis.com/MapsService/V1/geocode?appid=YD-9G7bey8_JXxQP6rxl.fBFGgCdNjoDMACQA--&street=1+South+Broad+St&city=Philadelphia&state=PA

In response we get:

39.951405
-75.163735
1 S Broad St
Philadelphia
PA
19107-3300
US
To use this service with your mashup, you must sign up with Yahoo! and receive an Application ID. Use that ID with the appid parameter of the request URL. You can sign up at http://developer.yahoo.com/wsregapp/.
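As a sketch of how a request URL of this shape could be assembled in R, the hypothetical helper below pastes the pieces together and replaces spaces with "+", mirroring the example URL. The function name build_geocode_url and the placeholder "YOUR_APP_ID" are assumptions made for illustration; the code only builds the string and does not contact the service.

```r
# Hypothetical helper; the name and arguments are illustrative.
build_geocode_url <- function(street, city, state, appid) {
  # Replace spaces with "+" to mirror the example request URL.
  plus_encode <- function(x) gsub(" ", "+", x, fixed = TRUE)
  paste0("http://local.yahooapis.com/MapsService/V1/geocode",
         "?appid=", appid,
         "&street=", plus_encode(street),
         "&city=", plus_encode(city),
         "&state=", plus_encode(state))
}

request_url <- build_geocode_url("1 South Broad St", "Philadelphia", "PA",
                                 appid = "YOUR_APP_ID")  # placeholder, not a real ID
print(request_url)
```

Addresses containing other reserved characters (for example "#" or "&") would need fuller percent-encoding than this simple space replacement provides.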
Shaking the XML Tree

Parsing well-formed and valid XML should be less convoluted than parsing the sheriff’s HTML. An XML parsing package is available for R; here’s how to install it from CRAN’s repository:

> install.packages("XML")
> library("XML")
If you are behind a firewall or proxy and getting errors: On Unix, set your http_proxy environment variable. On Windows, try the custom install R wizard with the “internet2” option instead of “standard”. You can find additional information at http://cran.r-project.org/bin/windows/base/rw-FAQ.html#The-Internet-download-functions-fail_00.
Our goal is to extract values contained within the <Latitude> and <Longitude> leaf nodes. These nodes live within the <Result> node, which lives inside a <ResultSet> node, which itself lies inside the root node. To find an appropriate function for getting these values, call library(help=XML), which lists the functions in the XML package.

> library(help=XML) #hit space to scroll, q to exit
> ?xmlTreeParse
You’ll see that the function xmlTreeParse will accept an XML file or URL and return an R structure. After inserting your Yahoo App ID, paste in this block:

> library(XML)
> appid [...]

> lat [...]
> str(lat)
 chr "39.951405"
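To show the general shape of walking a parsed XML tree down to a leaf value, here is a minimal sketch that parses an inline string rather than a live response. The sample document is a simplified stand-in whose element names are assumptions, and the variable names (sample_xml, doc, root, lng) are illustrative; it relies only on xmlTreeParse(), xmlRoot(), and xmlValue() from the XML package.

```r
library(XML)

# Simplified stand-in for a geocoder reply; element names are assumptions.
sample_xml <- paste0(
  "<ResultSet><Result>",
  "<Latitude>39.951405</Latitude>",
  "<Longitude>-75.163735</Longitude>",
  "</Result></ResultSet>")

# asText = TRUE parses the string itself rather than treating it as a file name.
doc  <- xmlTreeParse(sample_xml, asText = TRUE)
root <- xmlRoot(doc)   # the top-level <ResultSet> element in this sample

# Walk down by element name, then pull the text out with xmlValue().
lat <- xmlValue(root[["Result"]][["Latitude"]])
lng <- xmlValue(root[["Result"]][["Longitude"]])

str(lat)
str(lng)
```

The values come back as character strings, so they would need as.numeric() before being used as plotting coordinates.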
Using Internal Class Methods

There are usually multiple ways to accomplish the same task in R. Another means to get to our character lat/long data is to use the value method provided by the node itself:

> lat lat