COLL Report Typesafe Apache Spark

APACHE SPARK
PREPARING FOR THE NEXT WAVE OF REACTIVE BIG DATA
CONTENTS
Foreword
Apache Spark Survey 2015 - Quick Snapshot
INTRODUCTION: Is Apache Spark the Future in Reactive Big Data?
CHAPTER 2: The People and Organizations Interested in Apache Spark
CHAPTER 3: What Goals Do Organizations Hope to Achieve with Apache Spark?
CHAPTER 4: How Organizations Use Spark Today
CHAPTER 5: Barriers, Concerns and Support Desires Expressed by Respondents
Final Thoughts
FOREWORD BY MATEI ZAHARIA, CREATOR OF APACHE SPARK

I'm very excited to see this survey, built with Typesafe, that represents the largest poll of Spark developers yet.
Apache Spark has rapidly been gaining traction over the past few years, and I'm thrilled to see the wide variety
of use cases and environments where it is being deployed. This survey of over 2100 developers alone highlights
that over 500 enterprises are using or planning to use Spark in production in 2015, in environments ranging from
Hadoop clusters to public and private clouds, with data sources including key-value stores, databases, streaming
data and file systems. Their use cases range from batch workloads to SQL queries, stream processing and
machine learning, highlighting Spark's unique capability as a simple, unified platform for data processing.
At Databricks and within the Spark community, this type of feedback is critical in helping us continue to
enhance Spark for many more use cases and make Big Data simpler for enterprises of all sizes.
Matei Zaharia
CTO at Databricks and Vice President, Apache Spark
@matei_zaharia
APACHE SPARK SURVEY 2015 - QUICK SNAPSHOT
31% are evaluating Spark now
20% are planning to use Spark in 2015
13% are running Spark in production
82% of users chose Spark to replace MapReduce
78% of users need faster processing of larger data sets
67% of users need Spark for event stream processing
62% of users load data into Spark with Hadoop DFS
54% of users run Spark standalone
TOP 3 LANGUAGES USED WITH SPARK
88% Scala
44% Java
22% Python
RESPONDENTS
74% Developers
8% Data Scientists
7% C-level execs
TOP 3 INDUSTRIES
Telecoms, Banks, Retail
CHAPTER 1: INTRODUCTION
Is Apache Spark the Future in Reactive Big Data?
Back in the summer of 2014, we launched the results of a survey on Java 8, which gave us
a lot of the information we were looking for but also contained a small, golden nugget of data
that we didn't expect: out of more than 3000 developers surveyed, a shocking 17% of
them reported using Apache Spark in production. Whoa.
Apache Spark is a fast and general engine for large-scale data processing built using Scala
and Akka, two technologies among many that we at Typesafe recommend for building
Reactive systems. Notice that "fast" is emphasized in the Spark description? As we've
learned, it's actually not the size, but rather the speed or velocity of the data that is the
challenge. So why Scala and Akka, you ask? You can refer to this posting by Matei for
his full answer.
With this foundation in mind, it made a lot of sense to learn more. So we asked a total of
2136 respondents about Spark awareness and adoption, the most-demanded features/
modules, and how organizations use Spark in production today. We partnered with
Databricks (also founded by Matei) in order to bring full lifecycle support for Apache Spark
to Typesafe customers.
We think of this next phase of technology as Reactive Big Data. But whatever you call it,
it's already here.
"When we started Spark, we had two goals: we
wanted to work with the Hadoop ecosystem,
which is JVM-based, and we wanted a concise
programming interface similar to Microsoft's
DryadLINQ (the first language-integrated Big
Data framework I know of, that begat things
like FlumeJava and Crunch). On the JVM, the
only language that would offer that kind of API
was Scala, due to its ability to capture functions
and ship them across the network. Scala's static
typing also made it much easier to control performance compared to, say, Jython or Groovy."
Matei Zaharia
CTO at Databricks and Vice President, Apache Spark
@matei_zaharia
CHAPTER 2: WHO IS GETTING FIRED UP OVER SPARK?
The People and Organizations Interested in Apache Spark
WHAT BEST DESCRIBES YOUR ROLE?

The respondents who joined our survey generally adhere to the common
technology industry demographics: a vast majority of software developers
(74%) along with a smattering of other professionals. However, rather than
having a more sizeable segment of Architects (3.5%), we can see higher
representation of Data Scientists (7.5%) and C-level Executives (6.5%), clearly
speaking to the ripple effect that Big Data has across an organization.
The industry verticals in which respondents place themselves are fairly varied.
The largest segments, Telcos (16%), Banks (12%), Retailers (11%),
Software/Tech (10%) and Advertising (9%), are all huge consumers of
complex data sets, and their business models often depend on crunching
real-time data for reactive decision making at times of peak traffic/usage.
JOB TYPE/ROLE
74% Developer
7.5% Data Scientist
6.5% C-Level Executive
3.5% Software Architect
3.5% Dev Ops
1% Business Analyst
6.5% Other
INDUSTRY FOCUS
16% Telecommunications / Networks
12% Banking / Finance
11% Retail
10% Software / Technology
9% Advertising
5% Consulting
4% Healthcare / Insurance
33% Other (including Biotechnology/Chemistry, Machinery, Education, Government
and Utilities and other sectors)
WHICH OF THE FOLLOWING TECHNOLOGIES DO YOU USE FOR YOUR PRODUCTION INFRASTRUCTURE?
53% Amazon EC2
34% Docker
22% Cloudera CDH
16% Ansible
14% Mesos
13% OpenStack
12% Apache.org Builds of Hadoop
10% Hortonworks HDP
10% Heroku
8% Google Compute Engine
7% CoreOS
7% MapR Hadoop Distribution
6% Microsoft Azure
5% Marathon
4% Kubernetes
2% Aurora
11% Other XaaS
INFRASTRUCTURE TECHNOLOGIES IN USE
We see quite a lot of complementary technologies in this
breakdown of production infrastructure tools, from
IaaS/PaaS to frameworks and containers. The market has
settled on Amazon EC2 (53%), with Docker (34%) and
Cloudera CDH (22%) also retaining good market shares.
From relative obscurity just 2 years ago, it's interesting to see
multi-functional Ansible (16%) appear in the mix. Mesos
(14%) and OpenStack (13%) haven't always been so close
in market share, so we're curious to see where things will head
in 2015-16.
In the end, we are receiving self-reported statistics from a
sample population that includes mainly developers, so it's
not always clear if this question was interpreted as "have you
ever seen this technology appear in your organization in any
form?" as opposed to confirmed instances of enterprise-wide
production usage.
CHAPTER 3: A NEW HOPE
What Goals Do Organizations Hope to Achieve with Apache Spark?
WHICH BEST DESCRIBES YOUR COMPANY'S INTEREST (OR AWARENESS) WITH SPARK?
A solid majority representing 72% of respondents have at least some
experience with Apache Spark, and a total of 35% are currently using or
planning to use it this year (or next). Notably, the largest single segment (31%)
is currently evaluating Spark, but since 28% had never heard of Spark at the
time of this survey (funnily, this group is now 0%!), there is still a ways to go.
But trends can be discerned, both in buzz and adoption, from sources as
varied as this survey and Google Trends:
CURRENT RELATIONSHIP WITH SPARK
31% Evaluating Spark now
28% Um, what's Spark?
20% Planning to use in 2015
13% Currently using in production
6% Evaluated, not planning to use
2% Evaluated, will use in 2016 or later
GOOGLE TRENDS - APACHE SPARK INTEREST OVER TIME
(Chart: Google Trends search interest in Apache Spark over time; surviving axis labels: 2011, 2013.)
That said, a similar linear trend exists for searches like "Hadoop" and "Big Data",
so while Spark might defeat Hadoop in the processing power and event
streaming areas, it is also designed to cooperate very well with Hadoop;
both are Apache Software Foundation projects, after all. This is no secret; the creators
of Spark, who later founded Databricks, speak directly to the complementary
relationship between Hadoop and Spark in a January 2014 blog post.
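As a quick cross-check of the headline figures, the chart shares can be added up in a few lines of Python. This is only a sketch: the grouping of chart categories into the 72% and 35% totals is our reading of the text, not something the survey spells out.

```python
# Shares (in percent) as read from the "Current relationship with Spark" chart.
shares = {
    "Evaluating Spark now": 31,
    "Um, what's Spark?": 28,
    "Planning to use in 2015": 20,
    "Currently using in production": 13,
    "Evaluated, not planning to use": 6,
    "Evaluated, will use in 2016 or later": 2,
}

# All six segments together should account for every respondent.
total = sum(shares.values())

# Assumption: "at least some experience" means everyone who had heard of Spark.
some_experience = total - shares["Um, what's Spark?"]

# Assumption: "using or planning to use it this year (or next)" combines
# production users with both groups that plan to adopt.
using_or_planning = (
    shares["Currently using in production"]
    + shares["Planning to use in 2015"]
    + shares["Evaluated, will use in 2016 or later"]
)

print(total, some_experience, using_or_planning)  # prints: 100 72 35
```

Under that reading, the segments sum to 100% and reproduce both the 72% and the 35% quoted in the text.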
WHAT PROBLEMS ARE YOU TRYING TO SOLVE WITH SPARK THAT OTHER TOOLS DON'T SOLVE?
Respondents' most prevalent goals focus on gains in processing
speed, which is indeed one of the most exciting benchmarks: recent Spark in-memory
performance tests showed it could process data at up to 100x the speed of Hadoop.
However, users are also excited to implement event stream processing, which was an
impossibility using previous technologies. As Typesafe CTO Jonas Bonér explains in
his 2015 tech trends article in Wired.com, it's the velocity of data that concerns most
organizations, not the size.
"Most so-called Big Data problems
today are actually better described in
the context of velocity instead of size.
You want Fast Data. Speed is the
problem to solve, not size."
Jonas Bonér
CTO, Typesafe
@jboner
BUSINESS GOALS IN MIND
78% Fast Batch Processing of Large Data Sets
60% Support for Event Stream Processing
56% Fast Data Queries in Real Time
55% Improved Programmer Productivity
WHICH OF THE FOLLOWING SPARK FEATURES OR MODULES ARE MOST LIKELY TO SOLVE YOUR BIG DATA CHALLENGES?
As you might predict, Spark Core API replacement (82%) and to a lesser extent Spark
Streaming (65%) are seen as the biggest benefits of adoption, highlighting the
shortcomings of MapReduce in terms of API friendliness, sheer performance and event
streaming. Spark's MLlib (59%) and SparkSQL (51%) modules are smaller priorities and
GraphX (25%) seems like a distant goal for most.
SPARK FEATURES/MODULES IN DEMAND
82% Core API as a Replacement for MapReduce
65% Streaming Library (Spark Streaming)
59% Machine Learning Library (MLlib)
51% Integrated SQL (SparkSQL)
25% Graph Algorithms Library (GraphX)
"Spark uses sophisticated caching of
intermediate data in memory between
processing steps, considerably improving the performance of applications
compared to comparable MapReduce
implementations. Compared to the
MapReduce API, the Spark API is
amazingly intuitive, providing concise,
expressive operations that are often
needed for analytics. So, in addition to
addressing a wider class of problems,
Spark is improving the productivity of
developers who use it."
Dean Wampler
Author & Big Data Expert, Typesafe
@deanwampler
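The benefit of caching intermediate data between processing steps, as described in the quote above, can be illustrated with a small plain-Python sketch. It does not use the Spark API; the `parse` function, the sample lines and the call counter are all hypothetical stand-ins for an expensive processing step.

```python
from collections import Counter

lines = ["Spark is fast", "Spark is concise", "MapReduce is batch"]
stats = {"parse_calls": 0}  # counts how often the "expensive" step runs

def parse(line):
    """Hypothetical expensive step: tokenize one line of input."""
    stats["parse_calls"] += 1
    return line.lower().split()

def words():
    """Lazily recompute the intermediate data (all words) from the raw lines."""
    return (word for line in lines for word in parse(line))

# Without caching: each downstream step recomputes the intermediate data.
total_words = sum(1 for _ in words())
word_counts = Counter(words())
calls_without_cache = stats["parse_calls"]

# With caching: materialize the intermediate data once and reuse it in memory.
stats["parse_calls"] = 0
cached_words = list(words())
total_words_cached = len(cached_words)
word_counts_cached = Counter(cached_words)
calls_with_cache = stats["parse_calls"]

print(calls_without_cache, calls_with_cache)  # prints: 6 3
```

Both variants produce the same totals and word counts, but the cached variant runs the expensive step once per input line instead of once per line for every downstream step.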
HOW WILL YOU USE SPARK TO PROCESS YOUR DATA?

DATA PROCESSING WITH SPARK
39% Read or Write Data to One or More Databases
41% Static Reports
46% SQL Queries and Business Intelligence
46% Write Data to Hadoop Distributed File System (HDFS)
59% Ad-hoc Queries and Reporting
61% ETL Data from External Sources
67% Event Stream Processing

71% Use Spark as Part of a Larger Data Pipeline
65% Extract Information from Data Sooner Rather than Later
40% Automate Decision Making at Runtime
When it comes to data sources used by Spark, there is a reasonable amount of
variance. Event stream processing (67%), clearly a priority, remains a focus for
over two-thirds of respondents; a further breakdown of this aspect is presented
on this page. The rest of these priorities speak to current legacy systems:
developers will use Spark as a replacement for MapReduce in traditional batch
mode applications, including ETL (61%) jobs for moving, cleaning, and reformatting
data sets, and this will affect the rest of their data processing methods as well.
Many respondents feel that event stream processing will be a key killer feature
of Spark, and see it helping their entire data pipeline (71%) as a whole. This
points to the idea of extracting data sooner rather than later (65%), which seems to
encourage the evolution towards Reactive systems with Big Data at the heart
of it all. Decision making automation at runtime (which sounds a bit to us like
continuous deployment) is also something that about 40% of respondents
consider as data velocity increases.
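To make "sooner rather than later" concrete, here is a minimal plain-Python sketch of processing events one at a time rather than waiting for a complete batch. It is not the Spark Streaming API; the event names, the `running_counts` helper and the two-click threshold are hypothetical.

```python
from collections import Counter

def running_counts(events):
    """Yield per-key totals after every event instead of after the whole batch."""
    totals = Counter()
    for key in events:
        totals[key] += 1
        yield dict(totals)  # snapshot that is available immediately

events = ["click", "view", "click", "purchase", "click"]
snapshots = list(running_counts(events))

# A runtime decision can fire as soon as a threshold is crossed,
# here on the third event (index 2), before the stream has ended.
alert_at = next(i for i, snap in enumerate(snapshots) if snap.get("click", 0) >= 2)

print(alert_at, snapshots[-1])
```

A batch job would only see the final totals, whereas the streaming-style loop exposes every intermediate state, which is what makes automated decisions at runtime possible.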
CHAPTER 4: APACHE SPARK IN USE
How Organizations Use Spark Today
WHICH PROGRAMMING LANGUAGES ARE IMPORTANT TO YOUR SPARK INSTALLATION?
Considering that Apache Spark was designed with Scala and
Akka, it's not surprising that the earliest users of this technology
would be focused on Scala (88%). That said, as Spark
adoption goes more mainstream on the JVM, we expect
Java (44%) to increase in priority over time. Python (22%)
is represented by just over one-fifth of users, and is the 3rd
language after Scala and Java that Spark documentation
has prioritized. Other languages that users would like to see
supported include R, loved by data scientists and statisticians,
plus Clojure, Groovy, Ruby and Go.
WHICH LANGUAGES ARE IMPORTANT TO YOUR SPARK INSTALLATION?
1st: Scala 88%
2nd: Java 44%
3rd: Python 22%
Honorable mentions: R, Clojure, Groovy, Ruby & Go
WHERE ARE YOU RUNNING SPARK CURRENTLY?
Standalone (54%) and Local mode (29%) installations of Spark seem logical for early
users with different testing purposes, and one can always add to a cluster later. Otherwise,
YARN (42%), aka MapReduce 2, and Mesos (26%) are the general go-to choices for
integrating and running Spark with current systems. Cassandra (20%) is another Apache
project that not only integrates well with Spark's event streaming power, but shares a
similar vision of supporting highly responsive, resilient, elastic systems. Also mentioned by
about 3% of respondents is Amazon Elastic MapReduce.
WHERE DO YOU RUN SPARK?
54% Standalone
42% YARN
29% Local Mode
26% Mesos
20% Cassandra
HOW DO YOU LOAD YOUR DATA INTO SPARK?
When it comes to data loading, respondents take from a
wide spectrum of technologies: from DBs to messaging and
file systems to plain socket connections, almost anything
goes. The winner here is HDFS (62%), which makes perfect
sense: the things users cannot get done with Hadoop are
designed to be ported over to Spark to finish the job, again
emphasizing the complementary nature of these two
technologies. Unspecified databases (46%) are in use by
almost half of respondents, and Apache Kafka (41%) is a
hot messaging broker built by LinkedIn using Scala in 2011
that now leverages Spark's event streaming capabilities.
Amazon S3 comes in at 29%, little surprise considering
Amazon's infrastructure dominance with EC2 and their fairly
comprehensive stack portfolio.
HOW DO YOU LOAD DATA INTO SPARK?
62% Hadoop Distributed File System (HDFS)
46% Databases
41% Apache Kafka
29% Amazon S3
18% Other Services (e.g. over socket connection)
12% Other*
*Including: Apache Cassandra, Amazon Kinesis and Apache HBase
CHAPTER 5: SO WHAT'S THE DELAY IN ADOPTION?
Barriers, Concerns and Support Desires Expressed by Respondents
WHAT IS YOUR BIGGEST BARRIER TO USING SPARK EFFECTIVELY?
Here we get to analyze hundreds of write-in answers by hand...fun! We found the write-in
answers to be generally legible and only occasionally off-topic mumbo jumbo (e.g. something
about tabs vs. spaces). We asked about barriers to using Spark effectively at this
time, then manually clustered them into "sentiment categories", if you will.
LARGEST BARRIERS TO USING SPARK EFFECTIVELY
1st: Low Awareness / Experience
2nd: Current Requirements Don't Fit
3rd: Too Immature
"Low awareness / experience" makes sense, since
Spark adoption is still growing. A year from now, we predict that awareness of Spark will be considerably higher
and no longer considered a barrier to adoption or use.
"Current requirements don't fit" reflects a lack of
urgency among the majority of enterprises; however,
since the data shows that most early adopters use Spark
to replace MapReduce, this group will likely re-evaluate
their requirements as the need for data velocity increases.
"Too immature" refers to integrations with
middleware, platforms, tooling and programming
languages. As adoption increases, you should check
the Spark pages regularly for updates on feature
and API maturity.
HOW CAN SUPPORT BE IMPROVED?
In line with the previous question, we also had a large collection of suggestions for
improving support. Generally, these mirror the issues perceived as barriers to using Spark
effectively in the previous question, but with some slight differences in semantics.
Here are the top 3 sentiment categories that we hope can serve as useful feedback for
future Spark development.
HOW CAN SUPPORT BE IMPROVED?
1st: Integration Integration Integration!
2nd: Deeper Examples, Docs & Tutorials
3rd: Maturity Through Features
"Integration integration integration!" comes in
loudly as a definite requirement for many users, some of
whom may not be aware of currently supported
technologies, since they specifically mentioned Scala,
Java and Hadoop, which are first-class citizens for Spark.
"Deeper examples, docs & tutorials" are important
for making the case for Spark. We see documentation,
more real-life case studies and tutorial options (like
these) from vendors as answering these needs.
"Maturity through features" is the final area where
respondents see a lot of room to improve. Specifically
mentioned are immaturity in the Spark feature set
related to the client and streaming functionality, issues
related to clustering and the overall stability of
Spark in production.
Final Thoughts
Spark has become the Big Data tool of choice for a future of Reactive Systems,
fueled by organizations in need of faster data and event streaming features.
By this point, we're sure you now understand that Spark awareness
and adoption are experiencing remarkable growth. Developers have a
pent-up need to eliminate issues with MapReduce, such as a difficult
API, poor performance, and restriction to batch jobs only.
You should consider Spark as the tool that meets these needs,
providing excellent performance at scale, a concise and intuitive API,
and support for event stream processing and iterative algorithms.
Spark is less mature than older technologies, like MapReduce, so
developers also need good documentation, example applications,
and guidance on runtime performance tuning, management and
monitoring. Spark is also driving interest in Scala, the language in
which Spark is written, but developers and data scientists can also
use Java, Python, and soon, R.
It's all very good, more or less. So if you, like our sensible PR team,
were looking for the "Top 3 Takeaways From This Survey", here they are in
more shareable form:
Spark awareness and adoption are seeing exponential growth.
Google Trends confirms this and the survey shows that 72% of
respondents have at least evaluation or research experience with
Spark; 35% are using it or have decided to implement it.
Faster data processing and event streaming are the focus for enterprises.
By far the most desirable features are Spark's vastly improved
processing power over MapReduce (over 78% mention this) and the
ability to process event streams (over 66% mention this), a limitation
of current technologies.
Perceived barriers to adoption are not major blockers.
When asked, respondents mentioned lack of in-house experience and
perceived immaturity of some Spark components and integrations
with other middleware and management tools. Also cited are needs
for better commercial support options and for more comprehensive
documentation and advanced examples.
DON'T WORRY...WE HAVE MORE FOR YOU HERE
Hands-on Spark Workshop with Typesafe Activator
Getting Started with Spark
Introducing the Typesafe Reactive Platform
Typesafe (Twitter: @Typesafe) is dedicated to helping developers build Reactive applications on the JVM. Backed by Greylock
Partners, Shasta Ventures, Bain Capital Ventures and Juniper Networks, Typesafe is headquartered in San Francisco with
offices in Switzerland and Sweden. To start building Reactive applications today, download Typesafe Activator.
© 2015 Typesafe
