COLL Report Typesafe Apache Spark

APACHE SPARK

PREPARING FOR THE NEXT WAVE OF REACTIVE BIG DATA

CONTENTS
Foreword
Apache Spark Survey 2015 - Quick Snapshot
INTRODUCTION: Is Apache Spark the Future in Reactive Big Data?
CHAPTER 2: The People and Organizations Interested in Apache Spark
CHAPTER 3: What Goals Do Organizations Hope to Achieve with Apache Spark?
CHAPTER 4: How Organizations Use Spark Today
CHAPTER 5: Barriers, Concerns and Support Desires Expressed by Respondents
Final Thoughts

FOREWORD BY MATEI ZAHARIA, CREATOR OF APACHE SPARK


I'm very excited to see this survey, built with Typesafe, that represents the largest poll of Spark developers yet.
Apache Spark has rapidly been gaining traction over the past few years, and I'm thrilled to see the wide variety
of use cases and environments where it is being deployed. This survey of over 2100 developers alone highlights
that over 500 enterprises are using or planning to use Spark in production in 2015, in environments ranging from
Hadoop clusters to public and private clouds, with data sources including key-value stores, databases, streaming
data and file systems. Their use cases range from batch workloads to SQL queries, stream processing and
machine learning, highlighting Spark's unique capability as a simple, unified platform for data processing.
At Databricks and within the Spark community, this type of feedback is critical in helping us continue to
enhance Spark for many more use cases and make Big Data simpler for enterprises of all sizes.
Matei Zaharia
CTO at Databricks and Vice President, Apache Spark
@matei_zaharia

APACHE SPARK SURVEY 2015 - QUICK SNAPSHOT

31% are evaluating Spark now
13% are running Spark in production
20% are planning to use Spark in 2015
82% of users chose Spark to replace MapReduce
78% of users need faster processing of larger data sets
67% of users need Spark for event stream processing
62% of users load data into Spark with Hadoop DFS
54% of users run Spark standalone

TOP 3 LANGUAGES USED WITH SPARK
88% Scala
44% Java
22% Python

RESPONDENTS
74% Developers
8% Data Scientists
7% C-level execs

TOP 3 INDUSTRIES
Telecoms, Banks, Retail

CHAPTER 1: INTRODUCTION

Is Apache Spark the Future in Reactive Big Data?

INTRODUCTION
Back in the summer of 2014, we launched the results of a survey on Java 8, which gave us
a lot of the information we were looking for but also contained a small, golden nugget of data
that we didn't expect: out of more than 3000 developers surveyed, a shocking 17% of
them reported using Apache Spark in production. Whoa.
Apache Spark is a "fast and general engine for large-scale data processing" built using Scala
and Akka, two technologies among many that we at Typesafe recommend for building
Reactive systems. Notice that "fast" is emphasized in the Spark description? As we've
learned, it's actually not the size, but rather the speed or velocity of the data that is the
challenge. So why Scala and Akka, you ask? You can refer to this posting by Matei for
his full answer.
With this foundation in mind, it made a lot of sense to learn more. So we asked a total of
2136 respondents about Spark awareness and adoption, the most-demanded features/
modules, and how organizations use Spark in production today. We partnered with
Databricks (also founded by Matei) in order to bring full lifecycle support for Apache Spark
to Typesafe customers.
We think of this next phase of technology as Reactive Big Data. But whatever you call it,
it's already here.

"When we started Spark, we had two goals: we wanted to work with the Hadoop ecosystem,
which is JVM-based, and we wanted a concise programming interface similar to Microsoft's
DryadLINQ (the first language-integrated Big Data framework I know of, that begat things
like FlumeJava and Crunch). On the JVM, the only language that would offer that kind of API
was Scala, due to its ability to capture functions and ship them across the network. Scala's static
typing also made it much easier to control performance compared to, say, Jython or Groovy."
Matei Zaharia
CTO at Databricks and Vice President, Apache Spark
@matei_zaharia

CHAPTER 2: WHO IS GETTING FIRED UP OVER SPARK?

The People and Organizations Interested in Apache Spark

WHAT BEST DESCRIBES YOUR ROLE?


The respondents who joined our survey generally adhere to the common
technology industry demographics: a vast majority of software developers
(74%) along with a smattering of other professionals. However, rather than
a more sizeable segment of Architects (3.5%), we see higher
representation of Data Scientists (7.5%) and C-level Executives (6.5%), clearly
speaking to the ripple effect that Big Data has across an organization.

The industry verticals in which respondents place themselves are fairly varied.
The largest segments, Telcos (16%), Banks (12%), Retailers (11%),
Software/Tech (10%) and Advertising (9%), are all huge consumers of
complex data sets, and their business models often depend on crunching
real-time data for reactive decision making at times of peak traffic/usage.

JOB TYPE/ROLE
74% Developer
7.5% Data Scientist
6.5% C-Level Executive
3.5% Software Architect
3.5% Dev Ops
1% Business Analyst
6.5% Other

INDUSTRY FOCUS
16% Telecommunications / Networks
12% Banking / Finance
11% Retail
10% Software / Technology
9% Advertising
5% Consulting
4% Healthcare / Insurance
33% Other (including Biotechnology/Chemistry, Machinery, Education, Government and Utilities and other sectors)

WHICH OF THE FOLLOWING TECHNOLOGIES DO YOU USE FOR YOUR PRODUCTION INFRASTRUCTURE?
53% Amazon EC2
34% Docker
22% Cloudera CDH
16% Ansible
14% Mesos
13% OpenStack
12% Apache.org Builds of Hadoop
10% HortonWorks HDP
10% Heroku
8% Google Compute Engine
7% CoreOS
7% MapR Hadoop Distribution
6% Microsoft Azure
5% Marathon
4% Kubernetes
2% Aurora
11% Other XaaS

INFRASTRUCTURE TECHNOLOGIES IN USE

We see quite a lot of complementary technologies in this
breakdown of production infrastructure tools, from
IaaS/PaaS to frameworks and containers. The market has
settled on Amazon EC2 (53%), with Docker (34%) and
Cloudera CDH (22%) also retaining good market shares.
From relative obscurity just 2 years ago, it's interesting to see
multi-functional Ansible (16%) appear in the mix. Mesos
(14%) and OpenStack (13%) haven't always been so close
in market share, so we're curious to see where things will head
in 2015-16.
In the end, we are receiving self-reported statistics from a
sample population that includes mainly developers, so it's
not always clear if this question was interpreted as "have you
ever seen this technology appear in your organization in any
form?" as opposed to confirmed instances of enterprise-wide
production usage.

CHAPTER 3: A NEW HOPE

What Goals Do Organizations Hope to Achieve with Apache Spark?

WHICH BEST DESCRIBES YOUR COMPANY'S INTEREST (OR AWARENESS) WITH SPARK?
A solid majority representing 72% of respondents have at least some
experience with Apache Spark, and a total of 35% are currently using or
planning to use it this year (or next). Notably, the largest single segment (31%)
is currently evaluating Spark, but since 28% had never heard of Spark at the
time of this survey (funnily, this group is now 0%!), there is still a ways to go.
But trends can be discerned, both in buzz and adoption, from sources as
varied as this survey and Google Trends:

[Chart: Google Trends interest in Apache Spark over time; surviving axis labels: 2011, 2013]

That said, a similar linear trend exists for searches like "Hadoop" and "Big Data",
so while Spark might defeat Hadoop in the processing power and event
streaming areas, it is also designed to cooperate very well with Hadoop;
both are Apache Foundation projects, after all. This is no secret; the creators
of Spark, who later founded Databricks, speak directly to the complementary
relationship between Hadoop and Spark in a January 2014 blog post.

CURRENT RELATIONSHIP WITH SPARK
31% Evaluating Spark now
28% Um, what's Spark?
20% Planning to use in 2015
13% Currently using in production
6% Evaluated, not planning to use
2% Evaluated, will use in 2016 or later
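
As a quick sanity check, the 72% and 35% totals quoted in this chapter can be rebuilt from the six segments of the relationship chart. The snippet below is only a sketch in Python: the variable names are ours, and the way we group segments into "using or planning to use" is our assumption about how the 35% was derived, not something the survey spells out.

```python
# Share of respondents (in percent) for each segment of the
# "current relationship with Spark" chart, as reported in this survey.
segments = {
    "Evaluating Spark now": 31,
    "Um, what's Spark?": 28,
    "Planning to use in 2015": 20,
    "Currently using in production": 13,
    "Evaluated, not planning to use": 6,
    "Evaluated, will use in 2016 or later": 2,
}

# Assumption (ours): "using or planning to use" groups these three segments.
adopting = (
    "Currently using in production",
    "Planning to use in 2015",
    "Evaluated, will use in 2016 or later",
)

using_or_planning = sum(segments[name] for name in adopting)

# Everyone except the "what's Spark?" group has at least some experience.
some_experience = sum(
    share for name, share in segments.items() if name != "Um, what's Spark?"
)

total = sum(segments.values())

print(using_or_planning)  # 35
print(some_experience)    # 72
print(total)              # 100
```

Under that grouping, the segments reproduce both headline figures and add up to 100%.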

WHAT PROBLEMS ARE YOU TRYING TO SOLVE WITH SPARK THAT OTHER TOOLS DON'T SOLVE?
The goals respondents cite most often focus on gains in processing
speed, which is indeed one of Spark's most exciting benchmarks: recent in-memory
performance tests showed it could process data at up to 100x the speed of Hadoop MapReduce.
However, users are also excited to implement event stream processing, which was an
impossibility using previous technologies. As Typesafe CTO Jonas Bonér explains in
his 2015 tech trends article in Wired.com, it's the velocity of data that concerns most
organizations, not the size.

"Most so-called Big Data problems today are actually better described in
the context of velocity instead of size. You want Fast Data. Speed is the
problem to solve, not size."
Jonas Bonér
CTO, Typesafe
@jboner

BUSINESS GOALS IN MIND
78% Fast Batch Processing of Large Data Sets
60% Support for Event Stream Processing
56% Fast Data Queries in Real Time
55% Improved Programmer Productivity

WHICH OF THE FOLLOWING SPARK FEATURES OR MODULES ARE MOST LIKELY TO SOLVE YOUR BIG DATA CHALLENGES?
As you might predict, the Spark Core API as a replacement for MapReduce (82%) and, to a lesser extent, Spark
Streaming (65%) are seen as the biggest benefits of adoption, highlighting the
shortcomings of MapReduce in terms of API friendliness, sheer performance and event
streaming. Spark's MLlib (59%) and SparkSQL (51%) modules are smaller priorities, and
GraphX (25%) seems like a distant goal for most.

SPARK FEATURES/MODULES IN DEMAND
82% Core API as a Replacement for MapReduce
65% Streaming Library (Spark Streaming)
59% Machine Learning Library (MLlib)
51% Integrated SQL (SparkSQL)
25% Graph Algorithms Library (GraphX)

"Spark uses sophisticated caching of intermediate data in memory between
processing steps, considerably improving the performance of applications
compared to comparable MapReduce implementations. Compared to the
MapReduce API, the Spark API is amazingly intuitive, providing concise,
expressive operations that are often needed for analytics. So, in addition to
addressing a wider class of problems, Spark is improving the productivity of
developers who use it."
Dean Wampler
Author & Big Data Expert, Typesafe
@deanwampler

HOW WILL YOU USE SPARK TO PROCESS YOUR DATA?

DATA PROCESSING WITH SPARK
39% Read or Write Data to One or More Databases
41% Static Reports
46% SQL Queries and Business Intelligence
46% Write Data to Hadoop Distributed File System (HDFS)
59% Ad-hoc Queries and Reporting
61% ETL Data from External Sources
67% Event Stream Processing
71% Use Spark as Part of a Larger Data Pipeline
65% Extract Information from Data Sooner Rather than Later
40% Automate Decision Making at Runtime

When it comes to how respondents plan to process data with Spark, there is a reasonable amount of
variance. Event stream processing (67%) is clearly a priority and a focus for
about two-thirds of respondents; a further breakdown of this aspect is presented
on this page. The rest of these priorities speak to current legacy systems:
developers will use Spark as a replacement for MapReduce in traditional batch
mode applications, including ETL (61%) jobs for moving, cleaning, and re-formatting data sets, and this will affect the rest of their data processing methods as well.
Many respondents feel that event stream processing will be a key killer feature
of Spark, and see it helping their entire data pipeline (71%) as a whole. This
points to the idea of extracting information from data sooner rather than later (65%), which seems to
encourage the evolution towards Reactive systems with Big Data at the heart
of it all. Decision making automation at runtime (which sounds a bit to us like
continuous deployment) is also something that about 40% of respondents
consider as data velocity increases.

CHAPTER 4: APACHE SPARK IN USE

How Organizations Use Spark Today

WHICH PROGRAMMING LANGUAGES ARE IMPORTANT TO YOUR SPARK INSTALLATION?
Considering that Apache Spark was designed with Scala and
Akka, it's not surprising that the earliest users of this technology would be focused on Scala (88%). That said, as Spark
adoption goes more mainstream on the JVM, we expect
Java (44%) to increase in priority over time. Python (22%)
is important to just over one-fifth of users, and is the 3rd
language after Scala and Java that Spark documentation
has prioritized. Other languages that users would like to see
supported include R, loved by data scientists and statisticians, plus Clojure, Groovy, Ruby and Go.

WHICH LANGUAGES ARE IMPORTANT TO YOUR SPARK INSTALLATION?
1st: Scala 88%
2nd: Java 44%
3rd: Python 22%
Honorable mentions: R, Clojure, Groovy, Ruby & Go

WHERE ARE YOU RUNNING SPARK CURRENTLY?
Standalone (54%) and Local mode (29%) installations of Spark seem logical for early
users with different testing purposes, and one can always add to a cluster later. Otherwise,
YARN (42%), aka MapReduce 2, and Mesos (26%) are the general go-to choices for
integrating and running Spark with current systems. Cassandra (20%) is another Apache
project that not only integrates well with Spark's event streaming power, but shares a
similar vision of supporting highly responsive, resilient, elastic systems. Also mentioned by
about 3% of respondents is Amazon Elastic MapReduce.

WHERE DO YOU RUN SPARK?
54% Standalone
42% YARN
29% Local Mode
26% Mesos
20% Cassandra

HOW DO YOU LOAD YOUR DATA INTO SPARK?
When it comes to data loading, respondents draw on a
wide spectrum of technologies: from DBs to messaging and
file systems to plain socket connections, almost anything
goes. The winner here is HDFS (62%), which makes perfect
sense: work that users cannot get done with Hadoop can be
ported over to Spark to finish the job, again
emphasizing the complementary nature of these two
technologies. Unspecified databases (46%) are in use by
almost half of respondents, and Apache Kafka (41%) is a
hot messaging broker, built by LinkedIn using Scala in 2011,
that now integrates with Spark's event streaming capabilities.
Amazon S3 comes in at 29%, little surprise considering
Amazon's infrastructure dominance with EC2 and their fairly
comprehensive stack portfolio.

HOW DO YOU LOAD DATA INTO SPARK?
62% Hadoop Distributed File System (HDFS)
46% Databases
41% Apache Kafka
29% Amazon S3
18% Other Services (e.g. over socket connection)
12% Other*
*Including: Apache Cassandra, Amazon Kinesis and Apache HBase

CHAPTER 5: SO WHAT'S THE DELAY IN ADOPTION?

Barriers, Concerns and Support Desires Expressed by Respondents

WHAT IS YOUR BIGGEST BARRIER TO USING SPARK EFFECTIVELY?

Here we get to analyze hundreds of write-in answers by hand...fun! We found the write-in
answers to be generally legible and only occasionally off-topic mumbo jumbo (e.g. something about tabs vs. spaces). We asked about barriers to using Spark effectively at this
time, then manually clustered the answers into sentiment categories, if you will.

LARGEST BARRIERS TO USING SPARK EFFECTIVELY
1st: Low Awareness / Experience
2nd: Current Requirements Don't Fit
3rd: Too Immature

"Low awareness / experience" makes sense, since
Spark adoption is still growing. A year from now, we predict that awareness of Spark will be considerably higher
and no longer considered a barrier to adoption or use.

"Current requirements don't fit" reflects a lack of
urgency among the majority of enterprises; however,
since the data shows that most early adopters use Spark
to replace MapReduce, this group will likely re-evaluate
their requirements as the need for data velocity increases.

"Too immature" refers to integrations with
middleware, platforms, tooling and programming
languages. As adoption increases, you should check
the Spark pages regularly for updates on feature
and API maturity.

HOW CAN SUPPORT BE IMPROVED?

In line with the previous question, we also had a large collection of suggestions for
improving support. Generally, these mirror the issues perceived as barriers to using Spark
effectively in the previous question, but with some slight differences in semantics.
Here are the top 3 sentiment categories that we hope can serve as useful feedback for
future Spark development.

HOW CAN SUPPORT BE IMPROVED?
1st: Integration Integration Integration!
2nd: Deeper Examples, Docs & Tutorials
3rd: Maturity Through Features

"Integration integration integration!" comes in
loudly as a definite requirement for many users, some of
whom may not be aware of currently supported
technologies, since they specifically mentioned Scala,
Java and Hadoop, which are first-class citizens for Spark.

"Deeper examples, docs & tutorials" are important
for making the case for Spark. We see documentation,
more real-life case studies and tutorial options (like
these) from vendors as answering these needs.

"Maturity through features" is the final area where
respondents see a lot of room to improve. Specifically
mentioned are immaturity in the Spark feature set
related to the client and streaming functionality, issues
related to clustering and the overall stability of
Spark in production.

Final Thoughts
Spark has become the Big Data tool of choice for a future of Reactive Systems,
fueled by organizations in need of faster data and event streaming features.

FINAL THOUGHTS
By this point, we're sure you now understand that Spark awareness
and adoption are experiencing remarkable growth. Developers have a
pent-up need to eliminate issues with MapReduce, such as a difficult
API, poor performance, and restriction to batch jobs only.
You should consider Spark as the tool that meets these needs,
providing excellent performance at scale, a concise and intuitive API,
and support for event stream processing and iterative algorithms.
Spark is less mature than older technologies, like MapReduce, so
developers also need good documentation, example applications,
and guidance on runtime performance tuning, management and
monitoring. Spark is also driving interest in Scala, the language in
which Spark is written, but developers and data scientists can also
use Java, Python, and soon, R.
It's all very good, more or less. So if you, like our sensible PR team,
were looking for the "Top 3 Takeaways From This Survey", here they are in
more shareable form:

Spark awareness and adoption are seeing exponential growth.
Google Trends confirms this and the survey shows that 72% of
respondents have at least evaluation or research experience with
Spark, while 35% are using it or have decided to implement it.

Faster data processing and event streaming are the focus for enterprises.
By far the most desirable features are Spark's vastly improved
processing power over MapReduce (over 78% mention this) and the
ability to process event streams (over 66% mention this), a limitation
of current technologies.

Perceived barriers to adoption are not major blockers.
When asked, respondents mentioned lack of in-house experience and
perceived immaturity of some Spark components and integrations
with other middleware and management tools. Also cited are needs
for better commercial support options and for more comprehensive
documentation and advanced examples.

DON'T WORRY...WE HAVE MORE FOR YOU HERE

Hands-on Spark Workshop with Typesafe Activator
Getting Started with Spark
Introducing the Typesafe Reactive Platform

Typesafe (Twitter: @Typesafe) is dedicated to helping developers build Reactive applications on the JVM. Backed by Greylock
Partners, Shasta Ventures, Bain Capital Ventures and Juniper Networks, Typesafe is headquartered in San Francisco with
offices in Switzerland and Sweden. To start building Reactive applications today, download Typesafe Activator.

© 2015 Typesafe
