Spark GraphX in Action, 1st Edition
Michael Malak
Spark GraphX in Action
MICHAEL S. MALAK
ROBIN EAST
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
ISBN 9781617292521
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 21 20 19 18 17 16
brief contents
PART 1 SPARK AND GRAPHS
1 ■ Two important technologies: Spark and graphs
2 ■ GraphX quick start
3 ■ Some fundamentals
contents
preface
acknowledgments
about this book
about the cover illustration
… meet Big Data
GraphX lets you choose: graph parallel or data …
3 Some fundamentals
3.1 Scala, the native language of Spark
Scala’s philosophy: conciseness and expressiveness
Functional programming
Inferred typing
3.2 Spark
Distributed in-memory data: RDDs
Laziness
… and sbt
3.3 Graph terminology
Basics
RDF graphs vs. property graphs
3.4 Summary
4 GraphX Basics
4.1 Vertex and edge classes
4.2 Mapping operations
Simple graph transformation
Map/Reduce
Iterated Map/Reduce
4.3 Serialization/deserialization
Reading/writing binary format
JSON format
GEXF format for Gephi visualization software
4.4 Graph generation
Deterministic graphs
Random graphs
4.5 Pregel API
4.6 Summary
5 Built-in algorithms
5.1 Seek out authoritative nodes: PageRank
PageRank algorithm explained
Invoking PageRank in …
… serialization
9.4 Graph partitioning
9.5 Summary
preface
[Figure: trend chart for the terms Big Data, Hadoop, Machine learning, Apache Spark, and Edges and vertices; horizontal axis 2005 to 2015]
Note that for the generic terms spark and graphs we had to substitute the overly
specific Apache Spark and edges and vertices, but the trends can still be seen. A couple of
these technologies, machine learning and graphs, have long histories within academic
computer science and are attracting new interest in the commercial realm as the
availability of Big Data is now mainstreaming these technologies. If you studied these
technologies in school as theory, the world is ready now for you to put them into practice.
A lot of companies, including the ones we work for and have worked for in the
past, have put Spark (though not necessarily GraphX) into production. This makes
it more than a little convenient to try GraphX first when you start prototyping
graph solutions. If you have a Spark cluster already, or if you decide to spin up a
Spark cluster in the cloud, such as with Databricks or Amazon, you can get started
with graphs without having to set up a new graph-specific cluster or technology, and
you can use your Spark skills in the GraphX API. As more and more applications of
graphs hit the newsstands, from rooting out terrorist networks on Twitter to fraud
detection in credit card transaction data, GraphX becomes an easy platform choice
for trying them out.
In this book, we simultaneously take on two ambitious goals: to cover everything
possible about Spark GraphX, and to assume little to no expertise about any of the
technologies represented by the aforementioned buzzwords. The biggest challenge
was the large number of prerequisites to get into GraphX: specifically, Spark, Scala,
and graphs. Other challenges were the extensive GraphX API and the many different
ways graphs can be used. The result is an In Action book that differs a bit from others:
it takes a while to get started, with the first five chapters laying the groundwork, and
there are a number of interesting examples rather than one that gradually gets built
up over the course of the book. In books about other technologies the reader might
come with a problem to solve; this book attempts to demystify graphs by showing
precisely what problems graphs can solve. And it does so without assuming a lot of
background knowledge and experience.
acknowledgments
We would like to acknowledge the many, many people at Manning Publications who
helped usher this book into being. In particular, three individuals helped us
immeasurably in transforming, and in many cases redirecting, our writing to produce the much
better result you are reading now. Marina Michaels, our development editor from the
beginning, tirelessly marked up our chapters with great technical questions despite
being brand new to both Spark and graphs. Michael Roberts, our technical development
editor for most of the process, was as profuse as Marina with his suggestions,
but also brought technical expertise to the task. Antonio Magnaghi, our
technical proofreader, went above and beyond, not just proofing the source code
but effectively providing an extra technical development editor pass over the text of
the whole book.
We would also like to thank the many reviewers who read early drafts of the book and
provided helpful suggestions, including Andy Petrella, Brent Foust, Charles Feduke,
Gaurav Bhardwaj, Jason Kolter, Justin Fister, Michael Bright, Paul-Michael Sorhaindo,
Rodrigo Abreu, Romi Kuntsman, Sumit Pal, and Vincent Liard.
Michael Malak thanks his wife and children for their patience during the long months
of preparing this manuscript.
Robin East thanks his wife Jane and their two boys. They put up with his strange
writing hours and habit of disappearing upstairs at odd times.
about this book
With Spark GraphX in Action we hope to bring down to earth the sometimes esoteric
topic of graphs, while explaining how to use them from the in-memory distributed
computing framework that has gained the most mindshare, Apache Spark.
Author Online
Purchase of Spark GraphX in Action includes free access to a private web forum run by
Manning Publications where you can make comments about the book, ask technical
questions, and receive help from the authors and from other users. To access the
forum and subscribe to it, point your web browser to
www.manning.com/books/spark-graphx-in-action. This page provides information on how to
get on the forum once you are registered, what kind of help is available, and the rules
of conduct on the forum.
CHAPTER 1 Two important technologies: Spark and graphs
another example of a graph processing system, but Giraph is limited to slow Hadoop
Map/Reduce. GraphX, Giraph, and GraphLab are all separate implementations of
the ideas expressed in the Google Pregel paper. Such graph processing systems are
optimized for running algorithms on the entire graph in a massively parallel manner,
as opposed to working with small pieces of graphs, as graph databases do. To draw a
comparison to the world of standard relational databases, graph databases like Neo4j
are like OLTP (Online Transaction Processing), whereas graph processing systems like
GraphX are like OLAP (Online Analytical Processing).
Graphs can store various kinds of data: geospatial, social media, paper citation
networks, and, of course, web page links. A tiny social media network graph is shown in
figure 1.5. “Ann,” “Bill,” “Charles,” “Diane,” and “Went to gym this morning” are
vertices, and “Is-friends-with,” “Wrote-status,” and “Likes-status” are edges.
[Figure 1.5 diagram not reproduced; surviving labels: vertices Bill and Diane, edges is-friends-with and wrote-status]
Figure 1.5 If Charles shares his status with friends of friends, determining the list
of who could see his status would be cumbersome to figure out if you only had
tables or arrays to work with.
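To make the friends-of-friends question concrete, here is a minimal sketch in plain Python (not GraphX, and not code from the book). The vertex and edge names are the ones listed in the text; which vertices each edge connects is an assumption, because the figure itself did not survive. The helper names `friends_within` and `status_audience` are hypothetical, chosen for illustration only.

```python
from collections import deque

# Edges as (source, edge label, destination) triples. The names come from the
# text; the exact connections are ASSUMED, since the figure is not available.
EDGES = [
    ("Ann", "is-friends-with", "Bill"),
    ("Bill", "is-friends-with", "Charles"),
    ("Charles", "is-friends-with", "Diane"),
    ("Charles", "wrote-status", "Went to gym this morning"),
    ("Diane", "likes-status", "Went to gym this morning"),
]


def friends_within(edges, person, max_hops):
    """Return everyone reachable from `person` in at most `max_hops`
    is-friends-with steps, treating friendship as symmetric."""
    # Build an adjacency map from the friendship edges only.
    neighbors = {}
    for src, label, dst in edges:
        if label == "is-friends-with":
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    # Breadth-first search that stops expanding at the hop limit.
    seen = {person}
    queue = deque([(person, 0)])
    while queue:
        current, hops = queue.popleft()
        if hops == max_hops:
            continue
        for friend in neighbors.get(current, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))

    seen.discard(person)
    return seen


def status_audience(edges, status, max_hops=2):
    """Return the sorted names of people within `max_hops` friendship
    steps of whoever wrote `status`."""
    authors = [src for src, label, dst in edges
               if label == "wrote-status" and dst == status]
    audience = set()
    for author in authors:
        audience |= friends_within(edges, author, max_hops)
    return sorted(audience)


print(status_audience(EDGES, "Went to gym this morning"))
# prints ['Ann', 'Bill', 'Diane']
```

With the edges stored as a graph, the question becomes a short traversal from the status's author. Answering it from flat tables would mean joining the friendship table against itself once per hop.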
Language: Finnish
By
Eliza Orzeszko
Finnish translation
Oh! She truly would not have wished it to turn out that way, but it could not
be helped. Even the cheapest dress cost money, and she had to struggle hard
just to make her father's salary stretch to everything. So far, at least,
nothing had been lacking, although her father had to endure privations, for
with his health as frail as it was, more nourishing food would have been
needed, especially fruit…
At first she walked with her head bowed, but then her gaze rose and
admired the trees of the park. They stood motionless in the still air,
in the light of the autumn sun, whose golden rays here and there broke
through the yellowed or reddened leaves. Now and then a withered
leaf rustled under the feet of the walker, who slowed her steps more and
more, letting her gaze glide from the treetops, glowing red and yellow,
down along the thick trunks entwined with green creepers.
She concluded that this park was charming, even if it was only a
very small park in a small town. But perhaps it was charming
precisely because a stillness reigned here that would be impossible
in a big city.
Admittedly, those were not very wise dreams, but here, in these
surroundings, those irresistible fancies arose, leaving for a while
a wistfulness in the depths of the heart. And what, after all, is
wise in this world?
"We live there."
"Fifteen years."
They fell silent. Flustered once more, the girl bent her head over her
work and began to sew again; the man leaned against the fence,
watching her, making no move to leave. It was precisely that watching
that flustered the girl.
"Quite so."
But it seemed as though the reader had been pondering something at the
same time, for suddenly he closed the book and, bowing, said:
"Of course. When one is so rich, one can surely do whatever one pleases."
He laughed.