The Art of Collaborative Data Science at Scale

A unified approach that boosts data science agility and productivity
Introduction
The creation and consumption of data is accelerating at an incredible pace. This ongoing growth in data is fueling big investments in strategies and technologies that will empower enterprises to harness data-driven insights and uncover new innovations. Whether it is discovering new drugs to cure cancer, creating a more engaging shopping experience that will delight customers, or uncovering ways to improve supply chain efficiencies, enterprises are turning to data science and AI as a means to tap into the immense value embedded in the mountains of data they are generating.

The Hottest Big Data Job on the Planet


Consequently, the role of the data scientist in today’s modern enterprise is more critical than ever. The demand for data science and analytics roles across industries is through the roof. In fact, studies show that by 2020 the number of jobs for all US data professionals will increase by 364,000 openings to 2,720,000, and that the fastest-growing roles driving this surge are data scientists and advanced analysts, which are projected to see demand for their services spike by 28% by 2020.¹ But with such huge demand for these roles, finding qualified candidates has been a struggle.

Machine learning, big data, and data science skills are the most challenging to recruit for, and leaving these roles unfilled can create the greatest disruption to ongoing product development and go-to-market strategies. With companies struggling to tap into a reliable and steady flow of data science talent, they need to shift their focus to making their current data science teams more productive. If an organization doesn’t have the tools and processes in place for its data scientists to do their jobs effectively, hiring more of them accomplishes little.

1  https://www.forbes.com/sites/louiscolumbus/2017/05/13/ibm-predicts-demand-for-data-scientists-will-soar-28-by-2020/#5a40ab5b7e3b

Challenges Data Scientists Face
The effectiveness of a data scientist often depends on having access to the right set of tools, technologies, and processes to carry out their incredibly complex tasks. Here are the key factors that tend to contribute to an unproductive and ineffective data science and analytics team.

Infrastructure Complexity
The move to the cloud is fast becoming a primary objective for businesses looking to
reduce costs and create competitive differentiation. Part of the challenge associated
with this inexorable shift is the complexity that surrounds setting up and maintaining
a big data infrastructure. The explosion in data growth pushes organizations to move
faster with infrastructure investments that can harness and derive value from this
data. But doing so ultimately creates an over-reliance on DevOps teams to do the
heavy lifting.² For companies that don’t have dedicated DevOps teams to help with
these infrastructure issues, the responsibility often falls on the data scientists to fend
for themselves. This creates an environment where they are not able to focus on their
core responsibilities because they are spending so much time configuring and setting
up infrastructure.

2 https://www.cloudcomputing-news.net/news/2016/sep/01/vendor-lock-in-is-big-roadblock-to-cloud-success-survey-finds/

Disparate Technologies
Companies are trying to use a myriad of technologies to achieve their goal of becoming a more data-driven business. Open source projects such as Apache Spark™, Hive, Presto, Kafka, MapReduce, and Impala offer the promise of a competitive advantage, but also come with management complexity and unexpected costs.³ Relying on disparate technologies can be incredibly challenging, as they all follow different release cycles, lack institutional support mechanisms, and deliver varying levels of performance. Additionally, the skills needed to integrate these technologies are in short supply, and innovation can be jeopardized by the effort spent maintaining a stitched-together infrastructure. Again, this responsibility often falls on the data scientists, which further slows their progress.

Siloed teams
The productivity of teams across a data organization can be severely impacted without a seamless and dependable big data platform. It’s very difficult for the traditionally siloed functional roles of data scientist, data engineer, and business user to achieve any synergy and work together, both within a function and across teams, to explore data and solve business problems. When each role views data through a separate lens, collaboration is very difficult, trust in the analytics can be misplaced,⁴ and the speed of innovation is slowed.

3 https://www.sungardas.com/en-GB/resources/white-papers/the-cloud-hangover/
4 https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/getting-big-impact-from-big-data

Data exploration at scale
Exploring data at scale can be difficult and costly. Most organizations rely on single-threaded tools to perform data exploration, so the amount of memory on the data scientist’s machine directly limits how much data can be explored. As a result, data scientists are often forced to train models against small samples of data, which can result in less accurate models.
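
As an illustration of this contrast (a minimal, vendor-neutral sketch), the Python snippet below compares a single-threaded pandas workflow, which is bounded by the memory of one machine and typically forced onto a sample, with a Spark DataFrame that distributes the same summary across a cluster. The file paths and column names are hypothetical.

    # Single-threaded exploration: pandas loads everything into local memory,
    # so large datasets usually have to be downsampled first.
    import pandas as pd

    sample = pd.read_csv("events_sample.csv")              # hypothetical sampled extract
    print(sample["purchase_amount"].describe())

    # Distributed exploration: a Spark DataFrame is evaluated lazily and the work
    # is spread across the cluster, so the full dataset can be summarized.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.parquet("s3://my-bucket/events/")  # hypothetical full dataset
    events.groupBy("country").agg(
        F.count("*").alias("events"),
        F.avg("purchase_amount").alias("avg_purchase"),
    ).show()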

Model training is resource intensive
Model training involves using data to incrementally improve a model’s ability to solve a given problem. The process involves many steps: training the model, evaluating the results, tuning parameters to further optimize the model, and repeating. Training complex machine learning models against massive data sets is even harder in isolation, without the ability to collaborate on models with peers. The more complex the models, the longer it takes to bring new capabilities to market.
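
As a rough sketch of that train-evaluate-tune loop (not tied to any particular platform), the scikit-learn snippet below searches a small hyperparameter grid with cross-validation; the synthetic dataset stands in for real training data.

    # A minimal train / evaluate / tune loop using scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Every parameter combination is trained and scored with cross-validation,
    # which is the "train, evaluate, tune, repeat" cycle described above.
    param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
    search.fit(X_train, y_train)

    print("best params:", search.best_params_)
    print("held-out accuracy:", search.score(X_test, y_test))

Each point in the grid multiplies the training work, which is why this step benefits from scalable compute and from teammates being able to see and build on each other’s experiments.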

Difficult to share insights
Part of a data scientist’s role is to share results with team members and stakeholders for input and decision making. The trick is presenting those insights in a way that resonates with non-technical audiences. The inability to do so can hamper cross-team collaboration and slow progress.

The Power of Collaboration in Data Science
It’s no secret that better collaboration often leads to improved operational efficiency and
productivity. With no shortage of data to sift through and the pressures of the business to
build accurate models quickly, it’s important for data scientists to work effectively with
other team members, colleagues across teams such as engineering, and stakeholders.

Achieving a highly collaborative environment can positively impact team efficiency, productivity, and innovation, resulting in more models delivered to production faster and, ultimately, more revenue. As organizations continue working to become more data-driven, it is critical to create easier access to and visibility into the data, the models trained against it, and the insights uncovered within it.

Notebooks have become the ideal environment for data scientists to explore data, build and train models, and share insights with others. Notebooks provide a superior data science experience over R and Python alone, as they tend to offer easier access to data and computing resources as well as capabilities that foster productivity and collaboration.

Not all Notebooks Are Created Equal
There are a number of open source notebook solutions available today. They all provide a flexible coding and prototyping environment, but many have critical shortcomings that can impede data science productivity. The fundamental problem is simple: data science is collaborative, yet most open source notebooks are built for individuals working alone.

Most open source notebooks also require extensive DevOps work to set up and configure, which can severely limit a data scientist’s ability to focus on the data. Furthermore, they lack the collaborative capabilities that have made Databricks’ integrated workspace a staple in some of the most innovative companies in the world.

The Databricks Interactive Workspace:
A Unified Approach to Data Science

With so many data challenges facing enterprises, acting as a brake on innovation, distracting the organization from its core competencies, and slowing time to market for new products and insights, a new approach needs to be considered.

Databricks offers an interactive workspace that takes traditional notebook environments to the next level. By integrating and streamlining the individual elements that comprise the analytics lifecycle, data teams can quickly access data, provision compute resources, and work together to build models, creating a culture of accelerated innovation.

Databricks Advantage: Performance

Focus On Your Data, Not DevOps
A cloud-native platform that abstracts away the complexities of Apache Spark management, resulting in a highly elastic, reliable, and performant platform on which to build innovative products.
• Automated Cluster Management: Launch expertly-tuned Spark clusters with a few clicks. Databricks Spark clusters are fully managed and automatically scale to your workload.
• Optimized for Performance: Built on top of Spark and native to the cloud, Databricks Runtime optimizes Spark, making it 10-40x faster and more reliable.
• Enterprise Security: Databricks protects your data at every level with a unified security model featuring fine-grained controls, data encryption, identity management, rigorous auditing, and support for compliance standards.


Accelerate Innovation with Collaborative Data Science
Increase the productivity of your data science team by 4-5x through collaboration and the democratization of data and insights.
• Collaborative Workspace: Speed up iterative model building and tuning with interactive notebooks purpose-built to instill collaboration across teams.
• Support for Multiple Programming Languages: Interactively query large-scale data sets in R, Python, Scala, or SQL.
• Built-in Visualizations: Visualize insights through a wide assortment of point-and-click visualizations, or use powerful scriptable options like matplotlib, ggplot, and D3.
• Highly Extensible: Make use of popular libraries within your notebook or job, such as scikit-learn, nltk, pandas, etc.

Share Insights via Interactive Dashboards
Share insights with your colleagues and customers, or let them run interactive queries with Spark-powered dashboards.
• One Click Publishing: Create shareable dashboards from notebooks with a single click. One notebook can be tailored into multiple dashboard views.
• Continuous Updates: Publish dashboards and schedule the content to be updated continuously.
• Parameterized Dashboards: Enable non-technical users to perform scenario analysis directly from published dashboards.
• Dashboard Widgets: Input widgets allow you to parameterize your dashboards.

The following pages present a deeper dive into the differentiating capabilities of the Databricks collaborative notebook compared to today’s common open source notebooks.


• Multi-Language Support: Enables commands across Python, Scala, SQL, or R (including markdown) in the same notebook, allowing users to seamlessly mix and match languages as needed (a short sketch follows this list).
• Collaboration: Designed for collaboration, Databricks notebooks include features such as comments, a viewer log, and history.
• Live Sharing & Editing: Real-time collaboration among team members performing data modeling or analysis.
• Revision History: Simplifies version control by removing the need to manually create, save, and manage notebook changes made during the development of a Spark application.
• GitHub and Bitbucket Integration: Enables source and version control of a notebook.
• SparkR Support: Allows data scientists to leverage Spark’s distributed processing engine to analyze large-scale datasets and interactively run jobs on them from the R shell.
• No Vendor Lock-in: Users can import and export notebooks in source format, allowing for seamless migration of source code (Scala, Python, R, SQL) in and out of Databricks for use in an IDE of your choice.
• Attach Multiple Notebooks to a Cluster: Enables efficient cluster resource management, as many users and notebooks can share a single running cluster.
• Debugging: Easier debugging capabilities for Spark jobs executed from notebooks.
• Keyboard Shortcuts: Extensive keyboard shortcuts for easy in-notebook development and navigation.
• REST API: Trigger or query a notebook externally via the Databricks REST API for easy programmatic access.
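
To make the multi-language support above concrete, here is a minimal sketch of how cells in a Databricks notebook switch languages with magic commands such as %sql and %md (the Python cell uses the predefined spark session). The bucket path, table, and column names are hypothetical.

    # Cell 1 (Python): prepare data and expose it to SQL as a temporary view.
    df = spark.read.parquet("s3://my-bucket/orders/")   # illustrative path
    df.createOrReplaceTempView("orders")

    # Cell 2 (SQL): the %sql magic switches this cell's language and renders a table.
    # %sql
    # SELECT country, COUNT(*) AS order_count
    # FROM orders
    # GROUP BY country
    # ORDER BY order_count DESC

    # Cell 3 (Markdown): %md cells hold formatted commentary for collaborators.
    # %md ### Orders by country, notes for the weekly review

Because the SQL and markdown cells live in the same notebook as the Python code, analysts and engineers can work in the language they prefer without leaving the shared document.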

• Notebook Workflows: Ability to invoke notebooks from other notebooks, chaining them together.
• Autocomplete: Automatically completes function names and invocations within the notebook.
• Parameterized Queries: More easily reuse notebooks by utilizing parameterized queries (a short sketch follows this list).
• Extensibility: Make use of popular libraries within your notebook or job, such as scikit-learn, nltk, pandas, etc.
• One-Click Visualizations: Provides a wide range of visualizations, including full big-data pivoting, histograms, scatterplots, maps, etc.
• Pipeline Workflows: Data scientists and data engineers can use the same notebooks from model creation through production jobs.
• One Click Publishing from Notebooks: Create shareable dashboards from notebooks with a single click.
• Continuous Dashboard Updates: Publish dashboards and schedule the content to be updated continuously.
• Parameterized Dashboards: Provide drop-downs in dashboards so viewers can change input parameters to dashboard values.
• Schedule Dashboards: Execute jobs for production pipelines on a specified schedule directly from a dashboard.
• Dashboard Widgets: Input widgets allow you to parameterize your dashboards.
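
As a small illustration of the parameterized-query, widget, and notebook-workflow rows above, the sketch below uses the dbutils.widgets and dbutils.notebook.run utilities exposed in Databricks notebooks; the widget names, table, and notebook path are hypothetical.

    # Input widgets parameterize a notebook and any dashboard published from it.
    dbutils.widgets.dropdown("region", "US", ["US", "EU", "APAC"])
    dbutils.widgets.text("start_date", "2018-01-01")

    region = dbutils.widgets.get("region")
    start_date = dbutils.widgets.get("start_date")

    # Parameterized query: the widget values feed the SQL behind the dashboard.
    query = (
        "SELECT region, COUNT(*) AS orders FROM sales "
        "WHERE region = '{0}' AND order_date >= '{1}' GROUP BY region"
    ).format(region, start_date)
    summary = spark.sql(query)
    display(summary)  # display() renders tables and charts in Databricks notebooks

    # Notebook workflow: invoke another notebook, passing along the same parameters.
    result = dbutils.notebook.run(
        "/Shared/score_model",                           # hypothetical notebook path
        3600,                                            # timeout in seconds
        {"region": region, "start_date": start_date},
    )

When such a notebook is published as a dashboard, the widgets appear as drop-downs and text fields, letting non-technical viewers rerun the analysis with their own parameters.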

Case in Point:
How Leading Companies Have Increased
Data Science Productivity with Databricks

• Company: Overstock.com is the premier online destination for furniture and home décor, using
technology to help savvy shoppers find the best prices on top styles to create their dream homes.
• Use Case: Identifying unique patterns and tendencies that indicate a user is ready to purchase.
They apply a user score that represents a customer’s likelihood of purchase. Based on this score, they
can determine how to create a more personalized shopping experience that drives conversion and
customer lifetime value.
• Solution: With a Data Science team versed in different programming languages, it was difficult to share
data and collaborate on insights. Databricks has greatly simplified model training as teams can work in
the same interactive notebooks with different programming languages and quickly train and iterate on
machine learning models.
• Business Benefits: They were able to significantly close the gap from development to production.
– Decreased the cost of moving models to production by nearly half
– Stand up new models in one fifth the time previously required
– Can make inter-day improvements on existing models without new deploys
– In-notebook version control allows them to roll back individual changes inside a notebook
– Makes a general trial-and-error approach to exploratory analysis seamless


• Company: HomeAway, a subsidiary of Expedia, allows travelers to search for vacation rentals
in desired destinations.
• Use Case: To facilitate a match between traveler and vacation rental, HomeAway must show search results
that are relevant to the traveler’s specific interests by serving the best images based on search criteria.
• Solution: HomeAway selected Databricks to replace its homegrown environment consisting of open source
Spark and Zeppelin notebooks as they were spending too much time on DevOps and not enough on the
data. They needed a platform that fostered team collaboration and accelerated model training.
• Business Benefits: With Databricks, the management of their Spark infrastructure has been greatly
simplified through its native access to S3, interactive notebooks, and cluster management capabilities —
increasing productivity by 4x.


• Company: Dollar Shave Club is a men’s lifestyle brand and e-commerce company on a mission to change
the way men address their shaving and grooming needs.
• Use Case: Their interactions with members and guests generate a mountain of data. Knowing that this data would be an asset in improving the member experience, they strive to serve product recommendations prior to an order shipment, suggesting additional products for members to add to their shipment.
• Solution and Benefits: Dollar Shave Club adopted Databricks within their Data Science & Engineering team due to its interactive notebooks that drive collaboration across teams. The notebook interface offers real-time commenting and interactive visualizations, which has accelerated data science and analyst productivity. This improved productivity has led to faster time-to-market for their recommendation engine, resulting in increased customer engagement and revenue.

Conclusion

Databricks’ collaborative notebook environment makes it easier for developers, data engineers, and data scientists to tap into the power of Apache Spark, explore data at scale, train accurate machine learning models, and share insights with others.

Get started with a free trial of Databricks and take your data science to the next level today.

START YOUR FREE TRIAL

Contact us for a personalized demo


databricks.com/contact-databricks

© Databricks 2018. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation. Privacy Policy | Terms of Use