COLL Report Typesafe Apache Spark
CONTENTS
Foreword
Apache Spark Survey 2015 - Quick Snapshot
31% are evaluating Spark now
78% of users need faster processing of larger data sets
67% of users need Spark for event stream processing
54% of users run Spark standalone
TOP 3 LANGUAGES USED WITH SPARK: 88% Scala, 44% Java, 22% Python
RESPONDENTS: 74% Developers, 8% Data Scientists, 7% C-level execs
TOP 3 INDUSTRIES: 16% Telecommunications / Networks, 12% Banking / Finance, 11% Retail
CHAPTER 1: INTRODUCTION
Back in the summer of 2014, we launched the results of a survey on Java 8, which gave us
a lot of the information we were looking for but also contained a small, golden nugget of data
that we didn't expect: out of more than 3000 developers surveyed, a shocking 17% of
them reported using Apache Spark in production. Whoa.
Apache Spark is "a fast and general engine for large-scale data processing" built using Scala
and Akka, two technologies among many that we at Typesafe recommend for building
Reactive systems. Notice that "fast" is emphasized in the Spark description? As we've
learned, it's actually not the size, but rather the speed or velocity of the data that is the
challenge. So why Scala and Akka, you ask? You can refer to this posting by Matei Zaharia,
the creator of Spark, for his full answer.
With this foundation in mind, it made a lot of sense to learn more. So we asked a total of
2136 respondents about Spark awareness and adoption, the most-demanded features/modules,
and how organizations use Spark in production today. We partnered with
Databricks (also founded by Matei) to bring full lifecycle support for Apache Spark
to Typesafe customers.
We think of this next phase of technology as Reactive Big Data. But whatever you call it,
it's already here.
The industry verticals in which respondents place themselves are fairly varied.
The largest groups (Telcos at 16%, Banks at 12%, Retailers at 11%,
Software/Tech at 10% and Advertising at 9%) are all huge consumers of
complex data sets, and their business models often depend on crunching
real-time data for reactive decision making at times of peak traffic/usage.
JOB TYPE/ROLE
74% Developer
INDUSTRY FOCUS
16% Telecommunications / Networks
12% Banking / Finance
11% Retail
4% Healthcare / Insurance
33% Other (including Biotechnology/Chemistry, Machinery, Education, Government, Utilities and other sectors)
7%
Core OS
7%
6%
Microsoft Azure
5%
Marathon
4%
Kubernetes
2%
Aurora
What Goals Do Organizations Hope to Achieve with Apache Spark?
13% Currently using in production
2% Evaluated, will use in 2016 or later
31% Evaluating Spark now
That said, a similar linear trend exists for searches like "Hadoop" and "Big Data",
so while Spark might defeat Hadoop in the processing power and event
streaming areas, it is also designed to cooperate very well with Hadoop;
both are Apache Software Foundation projects, after all. This is no secret: the creators
of Spark, who later founded Databricks, speak directly to the complementary
relationship between Hadoop and Spark in a January 2014 blog post.
20% Planning to use in 2015
28% Um, what's Spark?
78% Fast Batch Processing of Large Data Sets
60%
56%
Support for Event Stream Processing
55%
Fast Data Queries in Real Time
Improved Programmer Productivity
82%
Streaming Library (Spark Streaming)
65%
Machine Learning Library (MLlib)
59%
Dean Wampler
Author & Big Data Expert, Typesafe
@deanwampler
Integrated SQL (SparkSQL)
51%
Graph Algorithms Library (GraphX)
25%
How Organizations Use Spark Today
1st: Scala 88%
2nd: Java 44%
3rd: Python 22%
Standalone 54%
YARN 42%
Local Mode 29%
Mesos 26%
Cassandra
20%
46% Databases
29% Amazon S3
12% Other*
*Including: Apache Cassandra, Amazon Kinesis and Apache HBase
WHAT IS YOUR BIGGEST BARRIER TO USING SPARK EFFECTIVELY?
Here we get to analyze hundreds of write-in answers by hand... fun! We found the write-in
answers to be generally legible and only occasionally off-topic mumbo jumbo (e.g. something
about tabs vs. spaces). We asked about barriers to using Spark effectively at this
time, then manually clustered the answers into "sentiment categories", if you will.
1st: Low Awareness / Experience
2nd: Current Requirements Don't Fit
3rd: Too Immature
In line with the previous question, we also had a large collection of suggestions for
improving support. Generally, these mirror the issues perceived as barriers to using Spark
effectively in the previous question, but with some slight differences in semantics.
Here are the top 3 sentiment categories that we hope can serve as useful feedback for
future Spark development.
1st: Integration, Integration, Integration!
2nd: Deeper Examples, Docs & Tutorials
3rd: Maturity Through Features
Final Thoughts
Spark has become the Big Data tool of choice for a future of Reactive Systems,
fueled by organizations in need of faster data and event streaming features.
By this point, we're sure you understand that Spark awareness
and adoption are experiencing remarkable growth. Developers have a
pent-up need to eliminate issues with MapReduce, such as a difficult
API, poor performance, and restriction to batch jobs only.
You should consider Spark as the tool that meets these needs,
providing excellent performance at scale, a concise and intuitive API,
and support for event stream processing and iterative algorithms.
Spark is less mature than older technologies, like MapReduce, so
developers also need good documentation, example applications,
and guidance on runtime performance tuning, management and
monitoring. Spark is also driving interest in Scala, the language in
which Spark is written, but developers and data scientists can also
use Java, Python, and soon, R.
It's all very good, more or less. So if you, like our sensible PR team,
were looking for the "Top 3 Takeaways From This Survey", here they are in
more shareable form:
Google Trends confirms this, and the survey shows that 72% of
respondents have at least evaluation or research experience with
Spark; 35% are using it or have decided to implement it.
Faster data processing and event streaming are the focus for enterprises.
Getting Started with Spark
Introducing the Typesafe Reactive Platform
Typesafe (Twitter: @Typesafe) is dedicated to helping developers build Reactive applications on the JVM. Backed by Greylock
Partners, Shasta Ventures, Bain Capital Ventures and Juniper Networks, Typesafe is headquartered in San Francisco with
offices in Switzerland and Sweden. To start building Reactive applications today, download Typesafe Activator.
© 2015 Typesafe