
Administering Data Science Cloud

F11247-02
January 2019
Copyright © 2018, 2019, Oracle and/or its affiliates. All rights reserved.

This software and related documentation are provided under a license agreement containing restrictions on
use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your
license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify,
license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means.
Reverse engineering, disassembly, or decompilation of this software, unless required by law for
interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If
you find any errors, please report them to us in writing.

If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on
behalf of the U.S. Government, then the following notice is applicable:

U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software,
any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are
"commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-
specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the
programs, including any operating system, integrated software, any programs installed on the hardware,
and/or documentation, shall be subject to license terms and license restrictions applicable to the programs.
No other rights are granted to the U.S. Government.

This software or hardware is developed for general use in a variety of information management applications.
It is not developed or intended for use in any inherently dangerous applications, including applications that
may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you
shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its
safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this
software or hardware in dangerous applications.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of
their respective owners.

Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are
used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron,
the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro
Devices. UNIX is a registered trademark of The Open Group.

This software or hardware and documentation may provide access to or information about content, products,
and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly
disclaim all warranties of any kind with respect to third-party content, products, and services unless otherwise
set forth in an applicable agreement between you and Oracle. Oracle Corporation and its affiliates will not be
responsible for any loss, costs, or damages incurred due to your access to or use of third-party content,
products, or services, except as set forth in an applicable agreement between you and Oracle.
Contents

Preface
    Documentation Accessibility
    Conventions

1 Introduction

2 Architecture
    Overview
    Requirements and Limits
    Supported Public Clouds
    Docker Containers and Services
    Security

3 Installation
    Prerequisites
    Airgap Installation
    RHEL 7 Installation Instructions

4 Configuration
    Admin Console Settings
    LDAP Integration
    SMTP Integration
    Git Provider Integration
    Environment Management
    Enabling Hadoop and Spark
    On-Demand Instances
    SSO Configuration

5 Administration
    Upgrading
    User Management
    Cluster Management
    Database Administration
    Monitoring and Logging
    On-Demand Resources
    Resource Management Dashboard

6 Troubleshooting

7 Appendix
    Dockerfile Basics and Best Practices
    Amazon AWS Examples
    Google Compute Engine Examples
    Environments and Dependencies
    Changing Paths of Installed Components
    Changing the SSL Certificate
    Docker Commands Glossary

8 Release Notes
Preface
Administering Data Science Cloud describes how to install, configure, and administer
Data Science Cloud.

Topics
• Documentation Accessibility
• Conventions

Documentation Accessibility
For information about Oracle's commitment to accessibility, visit the Oracle Accessibility Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.

Access to Oracle Support


Oracle customers that have purchased support have access to electronic support through My Oracle Support. For information, visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=info or visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=trs if you are hearing impaired.

Conventions
The following text conventions are used in this document:

Convention   Meaning
boldface     Boldface type indicates graphical user interface elements associated with an action, or terms defined in text or the glossary.
italic       Italic type indicates book titles, emphasis, or placeholder variables for which you supply particular values.
monospace    Monospace type indicates commands within a paragraph, URLs, code in examples, text that appears on the screen, or text that you enter.

1
Introduction
The DataScience.com platform combines the tools, libraries, and languages your team
loves with the infrastructure and workflows your organization needs.

What is the DataScience.com Platform?


The platform combines three key components:
• Infrastructure - systems tasks (like spawning servers) are abstracted and
handled automatically so data scientists can focus on the substance of their work
• Tools - open source tools (like Jupyter, R Shiny, or modeling libraries) that data
scientists need are integrated into a centralized place
• Workflow - automation for tasks, collaboration, and communication that let data
science teams effectively deliver on their mission

How to Read These Docs


In the Using Data Science Cloud guide, you’ll learn how to set up an account and
launch different types of analyses across the platform. In this guide, you’ll learn about
the Platform architecture, security, installation process, and other configuration.
• Architecture
• Installation
• Configuration
• Administration
• Troubleshooting
• Appendix
• Release Notes

2
Architecture
Learn about the Data Science Cloud architecture.

Topics
• Overview
• Requirements and Limits
• Supported Public Clouds
• Docker Containers and Services
• Security

Overview
Introduction
The DataScience.com Platform’s core infrastructure consists of three node types:
• A single master node
• One or more core nodes
• One or more worker nodes
Each of these node types runs a set of components or a set of containers that serve
various functions.
Most of these components communicate with one another internally; however, some
communicate with various external services like Quay and Datadog. For details such
as port numbers and egress points, see Requirements and Limits.


Components
The following components and their constituent containers make up the underlying
infrastructure of the DataScience.com Platform. We’ve included short descriptions of
each, as well as links to external documentation in the Further Reading section.

Node Provisioning
The Master node is automatically provisioned during the installation process; however,
Core and Worker nodes must be manually added to the cluster. For detailed
instructions on how to add or remove nodes, see Cluster Management.

Installation Paths
The following files and directories are created on the hosts during installation:
• systemd files (root:root 644)
– /usr/lib/systemd/system/docker.service
– /etc/systemd/system/replicated.service
– /etc/systemd/system/replicated-operator.service
– /etc/systemd/system/replicated-ui.service
• sysconfig files (root:root 644)
– /etc/sysconfig/replicated
– /etc/sysconfig/replicated-operator
• executables (root:root 755)
– /usr/bin/docker
– /usr/bin/docker-containerd
– /usr/bin/docker-containerd-ctr
– /usr/bin/docker-containerd-shim
– /usr/bin/dockerd
– /usr/bin/docker-init
– /usr/bin/docker-proxy
– /usr/bin/docker-runc
• utilities
– /var/lib/docker (root:root 755)
– /var/lib/replicated (replicated:docker 755)
– /var/lib/replicated-operator (replicated:docker 755)
• pid & process
– /var/run/docker (root:root 700)
– /var/run/replicated (root:root 755)
– /var/run/replicated-operator (root:root 755)
• logs

– /var/log/dscloud (root:root 755)

Users Created
The following users and groups are created on the hosts during installation:
• users
– replicated:x:1001:993::/home/replicated:/bin/bash
• groups
– docker:x:993:replicated

Further Reading
• Consul Architecture
• Datadog
• Docker Swarm
• Logspout
• Logstash
• Nginx
• PostgreSQL
• Registrator
• Replicated

Requirements and Limits


The requirements in this guide are minimum suggestions. Each installation is unique
and comes with its own constraints. We recommend that you evaluate this guide within
the framework of your existing infrastructure and expected usage.

Required Information
In order to configure the DataScience.com Platform, you’ll need certain information
about your instance.

Host Requirements
Per host:
• RAM: 32GB
• CPU: 8 core
• Disk Space: 300GB

Note:
The optimal configuration is 300G of disk space at the root volume. If this is not possible in your installation, at least 100G should be allocated to the directories in the following table. For more information, please see Changing Paths of Installed Components.


Directory              Purpose
/var/lib/docker        Runtime environment for Docker containers and images
/var/lib/replicated    Storage for installation metadata
/var/log/dscloud       Log files for run script sessions and APIs
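To check whether a host meets the disk requirement, you can inspect the relevant volumes with df. A minimal sketch (the specific directories exist only after installation; before installing, checking the root volume is sufficient):

$ df -h /
$ df -h /var/lib/docker /var/lib/replicated /var/log/dscloud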

Supported Operating Systems


(64-bit distributions)

Supported Browsers
The DataScience.com Platform relies on native flexbox support, which requires the
following minimum versions: - Apple Safari 10+ - Google Chrome 49+ - Microsoft Edge
14+ - Microsoft Internet Explorer 11+ (partial support; there are some known issues
with flexbox) - Mozilla Firefox 51+ - Opera 43+

Additional Software
The installation script for the DataScience.com Platform will automatically install the
correct version of docker-engine; please ensure this version is not overwritten by other
configuration management tools.
• docker 17.03-ce+

Port Configuration
The following ports should be opened between the specified sources and destinations.
“Administrative IP(s)” refers to the IP(s) from which Systems Administrators will need
to access the instance. “User IP(s)” refers to the IP(s) from which users of the
DataScience.com Platform will be accessing the application.
Caution
LDAP and SMTP Ports
For integrations such as LDAP and SMTP, we’ve provided the most commonly used ports. Please confirm these ports with your service administrator(s).
For a real-world example of this configuration, see Amazon AWS Examples.

Port             Usage                          Source(s)                           Destination(s)
25               Unencrypted SMTP traffic       Master node                         SMTP server
80               HTTP (redirects to HTTPS)      Administrative IP(s) & User IP(s)   Master node
389 (optional)   Non-SSL LDAP traffic           Master node                         LDAP server
443              HTTPS                          Administrative IP(s) & User IP(s)   Master node
465              Encrypted SMTP traffic         Master node                         SMTP server
636 (optional)   SSL LDAP traffic               Master node                         LDAP server
2376             Docker remote socket           Master node                         All nodes
2377             Docker Swarm API               All nodes                           All nodes
5000             Logstash ingress               All nodes                           Master node
5432             Postgres traffic               Master & Core nodes                 Postgres endpoint
7946             Docker Swarm                   All nodes                           All nodes
8080             HTTP (redirects to HTTPS)      Administrative IP(s) & User IP(s)   Master node
8085             GitHub OAuth authentication;   github.com & all DS nodes           All nodes
                 Docker Event Listener
8300-8302        Consul                         All nodes                           All nodes
8500             Consul                         All nodes                           All nodes
8600             Consul                         All nodes                           All nodes
8686             Darkroom                       All nodes                           All nodes
8800             Admin Console                  Administrative IP(s)                Master node
8830             Acquiesce                      All nodes                           All nodes
8899             Graphite & Statsd              All nodes                           Master node
9870-9880        Cluster management             All nodes                           All nodes
32768-61000      Proxy routing to containers    Master node                         All nodes
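How these ports are opened depends on your environment. As a hedged sketch for a RHEL 7 Master node running firewalld (adjust the port list to the table above and to each node's role):

$ sudo firewall-cmd --permanent --add-port=80/tcp --add-port=443/tcp
$ sudo firewall-cmd --permanent --add-port=8800/tcp
$ sudo firewall-cmd --permanent --add-port=9870-9880/tcp
$ sudo firewall-cmd --reload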

The Master node also requires egress allowed to the following:

Host                            Purpose
d3dgvlzhobmmuy.cloudfront.net   Ongoing access; delivers static site assets for the DataScience.com Platform
api.replicated.com              Initial installation, upgrades, and ongoing access; checks for updates and syncs license
get.datascience.com             Initial installation and adding new nodes; houses the provisioning script for new nodes
install.datascience.com         Initial installation and upgrades; houses the installation script for the Admin Console
hub.docker.com                  Initial installation and upgrades; houses some dependencies for the Admin Console and DataScience.com Platform
quay.io                         Initial installation and upgrades; houses the images required for the Admin Console and DataScience.com Platform
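Before installing, you may want to confirm that the Master node can actually reach these endpoints. A simple sketch using curl (the HTTP status codes returned vary by host; what matters is that the connections succeed):

$ for host in api.replicated.com get.datascience.com install.datascience.com \
    hub.docker.com quay.io d3dgvlzhobmmuy.cloudfront.net; do
    curl -sS -m 10 -o /dev/null -w "$host: %{http_code}\n" "https://$host" \
      || echo "$host: unreachable"
  done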

Caution
Installations Without Internet Egress
If your installation does not have egress to the internet, see Airgap Installation.

Important
Connecting to Data Sources
In addition to the previous information, please ensure that routes are open between the DataScience.com Platform and whatever data sources you plan to connect to.

Databases
For more detailed information on configuring PostgreSQL for use with the DataScience.com Platform, please see Database Administration.
• Engine: PostgreSQL
• RAM: 8GB
• CPU: 2 core
• Space: 300GB
• Max connections: 1000
• PostgreSQL engine version: 9.5.4
• Schema: public
• Database name: platform
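Full instructions for creating this database are in Database Administration; as a minimal sketch, assuming psql is available and {db_endpoint} and {admin_user} are placeholders for your own values:

$ psql -h {db_endpoint} -p 5432 -U {admin_user} -d postgres \
    -c "CREATE DATABASE platform;"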

Git Providers
In order to create projects in the DataScience.com Platform, you must integrate with a
Git provider. We currently support the following:
• GitHub.com
• GitHub Enterprise 2.9+
• Bitbucket.org
• GitLab.com
• GitLab Enterprise 7+

SMTP Server
The Platform relies on SMTP for sending various notifications. It’s also required when
using built-in authentication, as initial user invitations and password reset emails are
sent using SMTP.
If you do not have an existing SMTP server, you may use a third party SMTP service.
The following services have been tested and verified as compatible with the
DataScience.com Platform:
• Amazon AWS SES
• FoxPass
• Mandrill
• SendGrid
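To confirm that the Master node can reach your SMTP server before configuring the integration, you can probe the connection with openssl (smtp.example.com is a placeholder; use port 465 for implicit TLS or port 25 with STARTTLS):

$ openssl s_client -connect smtp.example.com:465 -quiet
$ openssl s_client -connect smtp.example.com:25 -starttls smtp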

LDAP Providers
The Platform offers optional LDAP integration for user authentication. At the moment, it supports the following providers:
• OpenLDAP
• Microsoft Active Directory

Limits
File upload/download limit: 200MB


Google Compute Engine and SMTP: GCE does not currently support the use of
standard SMTP servers. They do, however, offer support for their own Gmail service
as well as several third party providers, including SendGrid.

Supported Public Clouds


Introduction
The DataScience.com Platform was built to run anywhere, including in public clouds
such as Amazon AWS, Microsoft Azure, and Google Compute Engine. This article
assumes you already have a working knowledge of the cloud of your choice; if not,
we’ve included links to their documentation.
We’ve also included links to some helpful tools that may help you speed up the
process of creating infrastructure in these clouds.

Cloud Documentation Links


• Amazon AWS
• Google Compute Engine
• Microsoft Azure

Useful Tools
• Ansible: Configuration management that uses human readable YAML and
agentless architecture
• Terraform: Infrastructure as code; supports a number of providers, including
Amazon AWS, Microsoft Azure, and Google Cloud

Docker Containers and Services


Introduction
Users of the DataScience.com Platform can create and interact with their own
services, which are segregated by containerization.
Containers allow each user to install dependencies, inject environment variables, and
run code, all without interfering with other users’ workflows or the underlying host.
Containers are similar to virtual machines (“VMs”) in that they can house and run
applications and their dependencies. Unlike VMs, however, containers share the same
underlying operating system (“OS”), making them more lightweight and portable.
The DataScience.com Platform leverages Docker for containerization. The official
Docker documentation includes a wealth of information about containerization as well
as Docker’s overall architecture.


Security
Introduction
Data integrity is important to us, and because of that, the DataScience.com Platform
offers several features that ensure the security of your environment.

Product Features
• Single Tenant Architecture Environments managed by the DataScience.com
team are deployed in single-tenant architectures. Single tenant architecture gives
the benefit of secure and isolated environments. Isolated environments prevent
bad outside actors from discovering sensitive information. Additionally, isolation
increases fault tolerance between environments.
• Containerization via Docker Doing data science work necessitates working with data, and reading data into a computing environment increases the risk of a data breach. The DataScience.com Platform mitigates this risk through containerization. The Platform provisions compute resources with Docker. Containers reduce the risk of passive breach through isolated computing environments, and the risk of active breach through the ease of starting and stopping containers. Doing data science work in containers keeps data secure for users and hard to breach for external actors.
• Encrypted Secrets The DataScience.com Platform secures private data connections through environment variables. Connecting to data is secured by encrypting secrets both in transit and at rest.
• Logging and Audit Trails The DataScience.com Platform includes robust logging
features. Built-in logging is available as part of a downloadable support bundle.
We also offer the option to route logs to the logging provider of your choice.
• Role-Based Permissions Roles offer granular control over access to Projects
and other areas of the DataScience.com Platform. These permissions can be
assigned at a team level or at an individual user level.
• LDAP and Active Directory integrations LDAP and Active Directory integrations
make it easy to map existing security groups to the DataScience.com Platform.
• Amazon AWS GovCloud The DataScience.com Platform is approved for Amazon
AWS GovCloud. Customers with sensitive data and regulated IT workloads can
install the DataScience.com Platform on-premises or in the AWS GovCloud to
meet strict compliance and regulatory standards.

Additional Security
As part of our commitment to building a secure platform, the DataScience.com
Platform undergoes regular assessments by an external security firm. These
assessments are conducted in accordance with OWASP and PTES standards, and
they allow us to quickly identify vulnerabilities in our software and infrastructure.

Configuration
• Logging and audit trails
• LDAP and Active Directory integrations

3
Installation
We’ve aimed to make installing the DataScience.com Platform as easy as possible, but if you have any questions or run into problems, feel free to reach out to our support team.
Airgapped Installations
Note that the following instructions require internet access from the host. To install the
DataScience.com Platform in an airgapped environment, start with our Airgap
installation instructions.
Red Hat Installations
If you are installing the DataScience.com Platform on RHEL, you must install a newer version of Docker than is available from RHEL’s repositories. Please follow the initial instructions in RHEL 7 Installation Instructions and then come back to this guide.

Prerequisites
• Provisioned hosts that meet the minimum requirements
• Sudo access on the provisioned hosts
• Proxy address if the hosts require a proxy to access the internet
• LDAP information, if using; a full list of required details is in the LDAP Integration
section
• SMTP information; a full list of required details is in the SMTP Integration section
• Git provider information; more details can be found in the Git Provider Integration
section
• Postgres instance information:
– Database endpoint and port
– Database administrator credentials
– The platform database should be created prior to beginning the installation.
Instructions for creating the database can be found in our Database
Administration article.

Installation
Rebooting the Master Node
If the Master node needs to be rebooted during the installation process, every effort
should be made to ensure it is rebooted cleanly to prevent corruption of the Admin
Panel’s internal database. If this isn’t possible, and the database does become
corrupted, re-run the installation script to restart the installation process.


Airgap Installation
Introduction
An airgapped environment is one that is isolated from potentially insecure networks
such as the internet. Installation of the DataScience.com Platform in these
environments involves a few more steps than on hosts with egress allowed to the
internet.

Requirements
• An airgap-enabled license for the DataScience.com Platform
• Sudo permissions on the hosts on which you’ll be installing the DataScience.com
Platform
• A way to transfer installation files onto the host(s)
• An instance of PostgreSQL that is reachable from the Platform cluster
• A Git provider that is reachable from the Platform cluster
• Ability for all hosts to communicate with one another as described in our
Networking section.

Installation
1. Provision the hosts needed for your installation. For a clustered installation
(recommended), you will need a minimum of three hosts.
2. Download the following items from a host with internet access, and then transfer
them to the Master node:
• Docker 17.03-ce+ and any dependencies, such as libltdl
• The tarball for installing the Platform Admin console
• The Platform airgap package, which contains all the images necessary to run
the DataScience.com Platform. (A unique link and password will be provided
to you by the DataScience.com team when you receive your license.)
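As a hedged sketch of step 2 on a RHEL/CentOS staging host with internet access (package names and the destination path are illustrative; the Admin Console tarball and .airgap package come from the unique links provided with your license):

$ sudo yum install -y yum-utils
$ yumdownloader --resolve docker-ce   # downloads docker-ce plus dependencies such as libltdl
$ scp *.rpm replicated.tar.gz {airgap_package}.airgap {user}@{master_node_ip}:~/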
3. Once you have transferred everything to the Master node, first install Docker and
its dependencies.
4. Next, run the following sequence of commands to install the Admin console:

$ tar -xvzf replicated.tar.gz
$ cat ./install.sh | sudo bash -s airgap

Note:
If the installer is unable to detect the IP address of the host, it may
prompt you to choose a network interface. Generally, this will be a host
interface, not a Docker network interface.

5. Next, navigate to the Admin console in a web browser, either directly on the
Master node’s host at https://localhost:8800 or from another host with access
to the Master node.


6. Bypass your browser’s TLS warning if applicable, and either upload an SSL cert or
choose the option to use a self-signed cert.


7. On the following screen, upload your airgap-enabled license.


8. Click to select Airgapped, then click Continue.


9. Provide the full path to the .airgap file you transferred in step 2, then click
Continue.

10. The Admin Console will verify the airgap package. This may take a few minutes.

11. From here, you may proceed with the installation the same as you would in an
internet-routable environment, starting with step 5 of our Installation guide.


RHEL 7 Installation Instructions


Kernel Version
Installation on RHEL 7 requires a minimum kernel patch level of 3.10.0-514. To
upgrade your kernel, run

$ yum -y update kernel-3.10.0-514.el7.x86_64

on the host and then reboot.
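You can confirm the running kernel version before and after the reboot with uname; anything at or above patch level 3.10.0-514 is sufficient:

$ uname -r
3.10.0-514.el7.x86_64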

Installing Docker 17 CE
If you are installing the DataScience.com Platform on RHEL 7, you will need to install a newer version of Docker than is available from RHEL’s repositories. These commands should be run on all hosts you are planning to use for your installation.

$ sudo yum install -y yum-utils
$ sudo yum install -y policycoreutils
$ curl -o container-selinux.rpm https://get.datascience.com/container-selinux-2.21-1.el7.noarch.rpm
$ sudo rpm -i container-selinux.rpm
$ sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
$ sudo yum makecache fast
$ sudo yum -y install docker-ce
$ sudo systemctl start docker

4
Configuration
The DataScience.com Platform comes with an extensive number of customizations for
both Users and Administrators. This section will walk you through the custom settings
available in the Admin Console, SMTP and LDAP integration, and configuring the
Platform to seamlessly sync with your Git provider.
Admin Console Settings: This section includes settings like hostname and database
endpoints, as well as integrations with services like Datadog so you can ship usage
statistics to your existing provider. You can reach the admin panel by navigating to
https://yourplatformhostname.com:8800 in a browser.

LDAP Integration: The DataScience.com Platform integrates with your existing LDAP
or Active Directory instance. This will allow you to control access to the
DataScience.com Platform using existing users and groups.
SMTP Integration: Configuring the DataScience.com Platform to work with your
existing mail server allows the platform to send invitations to new users via email. In
addition to being a convenient way to add users, this also offers heightened security;
users create their own passwords, which means there is no need to transfer
credentials.
Git Provider Integration: Integration with your existing Git provider like GitHub or
GitLab allows Users to work from a central location and collaborate more easily on
projects.
Environment Management: Environments are customized, pre-installed collections of
dependencies and packages that can be created by Admins and distributed to Users
on the DataScience.com Platform.
Enabling Hadoop and Spark: Connect to an external Hadoop cluster from the
DataScience.com Platform to allow your users to analyze your Big Data.
On-Demand Instances: Instances of the DataScience.com Platform running in Amazon
AWS can use on-demand resources for workloads that exceed the available resources
on the cluster.
SSO Configuration: The DataScience.com Platform supports Single Sign On (SSO)
integration for both LDAP and SAML.

Admin Console Settings


Introduction
The DataScience.com Platform offers a number of customization settings and optional
integrations.


LDAP Integration
Introduction
In addition to a built-in user management system, the DataScience.com Platform
offers LDAP integration. This will allow you to control access to the Platform via
existing LDAP users.

Prerequisites
• A supported LDAP provider
• Hostname and port for the LDAP server
• Search user credentials
• LDAP schema:
• Base DN
• User search DN
• Restricted user group (optional)
• Username field
• Test user credentials (optional)
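Before entering these values in the Admin Console, it can save time to verify them from the command line. A sketch using OpenLDAP's ldapsearch, with placeholders for the server, search user, base DN, and username field:

$ ldapsearch -H ldap://{ldap_host}:389 \
    -D "{search_user_dn}" -W \
    -b "{base_dn}" "(uid={test_username})"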


Configuration
1. From the Settings view in the Admin Console, click Authentication on the left-hand
menu.


2. Click the radio button for LDAP.


3. You will be prompted to enter the LDAP settings necessary for the integration.


4. To verify your LDAP settings are correct, provide the credentials for the test user,
then click Test Credentials.

SMTP Integration
Introduction
Although you can install the DataScience.com Platform and create the administrator
user without SMTP integration, the integration is required in order to invite additional
users. Configuring SMTP integration also allows the Platform to automatically send
outputs via email.

Prerequisites
• The routable address and port for your SMTP server
• Credentials for a user with permission to log in to the SMTP server and send mail
• A desired “From” address for mail sent from the DataScience.com Platform


Configuration
1. From the Settings view in the Admin Dashboard, click Email Server Settings in the left-hand menu.
2. Click the radio button next to Enable SMTP.


3. You will be prompted to provide SMTP settings necessary for the integration.
When you’re finished, test your configuration by clicking Test SMTP
Authentication.

Git Provider Integration


Introduction
Connecting projects to your code repository allows users to easily interact with
branches and files from inside the Platform.

Supported Providers
• GitHub.com
• GitHub Enterprise 2.9+
• Bitbucket.org
• GitLab.com
• GitLab Enterprise 7+

GitHub OAuth Integration


Connecting the DataScience.com Platform to your GitHub repositories first requires an
authentication integration. The following steps show how to create a GitHub app and
connect it to the DataScience.com Platform.

Bitbucket Integration
The Bitbucket integration uses Bitbucket’s App Passwords feature to grant the
DataScience.com Platform access to your repositories. A Bitbucket App Password is
just like your account password, but meant for other apps to control Bitbucket on your
behalf.
Each user must bring their own Bitbucket App Password over to the DataScience.com
Platform. For end-user documentation on how to register a Bitbucket App Password,
see Files and Version Control > Authenticate.
To enable the Bitbucket integration, visit Settings > Git Providers, select your Bitbucket
type from the list of provider options, and fill in the details about your Bitbucket
account.

GitLab Integration
How users authenticate with GitLab depends on the version of GitLab they’ll be
connecting to. GitLab 7 and 8 use passwords for authentication, while GitLab 9 and
GitLab.com use access tokens.
Each user must bring their own GitLab access token over to the DataScience.com
Platform. For end-user documentation on how to register a GitLab access token, see
the Git Configuration section of our User documentation.
To enable the GitLab integration as an administrator, visit Settings > Git Providers.
Select GitLab from the list of provider options, then fill in the details about your GitLab
account.


Manually Editing Providers in Postgres


1. To manually add providers in Postgres, first connect to the database endpoint using administrative credentials. We recommend using psql for this.
2. Once you have authenticated to the endpoint, ensure that you’re connected to the ‘platform’ database. Then, delete the provider you intend to modify or remove. To create or recreate a provider, return to the DataScience.com Platform settings page and go to the Git Providers tab.
Make sure to replace the following placeholder values with your own:
{git_provider_id}

SELECT * FROM git_providers; -- find the id of the provider
DELETE FROM git_providers WHERE id = {git_provider_id};

3. To verify that your providers have been modified correctly, run the following query:

SELECT * FROM git_providers;

Environment Management
Introduction
Environments are customized, pre-installed collections of dependencies and packages
that can be created by Admins and distributed to Users on the DataScience.com
Platform. Users require environments to be built in order to do work on the Platform.
Therefore, it is critical that an Admin create environments during the installation
process. Additional environments can be built anytime thereafter.
In this guide, you will learn how to build environments on behalf of Users.

Background
The DataScience.com Platform uses Docker to containerize workloads on your
instance. Docker containers allow Users to spin up isolated work environments that
have all of the software needed for their analysis pre-installed. A Docker container is a
running instance created from a Docker Image. Docker Images are immutable files
that define the runtime of containers. When a user runs a script or launches a Jupyter
session on the Platform, they are running a Docker Image that is stored in an internal
Docker Registry. The Environments feature allows Admin users to create their own
Docker Images and submit them to the Docker Registry by writing a Dockerfile within
the Platform interface.
If you are unfamiliar with Docker and Dockerfiles, check out Docker’s documentation
for more details. Please also refer to our Dockerfile Basics and Best Practices
documentation for several example Dockerfiles and best practices.

What is an Environment?
An environment is defined by a Dockerfile and is associated with metadata such as
name, description, and a README. There are two categories of environments: Base
and User.
On the DataScience.com Platform, all environments except for the Default Base
environment must inherit from a pre-existing Base environment. All Base environments can be expanded upon by Base and User environments that inherit from them. On the other
hand, User environments cannot be used to seed other environments.
It is convenient to envision these relationships as an inheritance tree. Default Base is
the root node of this tree. Any Base or User environments that are subsequently built
are nodes branching off of it, with User nodes always representing leaf nodes. In this
analogy, the main difference between Base and User environments is that the former
can become parent nodes while the latter can only be a child node of a Base
environment (i.e., a leaf).
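To make the inheritance model concrete, here is a hypothetical Dockerfile for a User environment of the kind you might write in the Environments interface. The FROM image tag is an assumption for illustration; in practice it refers to the Base environment you select:

# Inherit from a previously built Base environment (hypothetical tag)
FROM {base_environment_image}
# Layer additional packages on top; a User environment like this
# cannot be used to seed further environments
RUN pip install --no-cache-dir pandas scikit-learn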


DataScience.com Standard Example Environments


During the installation process, your Admin will have built the DataScience.com
Standard Example Base and User environments. These environments have Python
2.7 and R 3.3.3.
The DataScience.com Standard Example Base environment inherits from the Default
Base Environment and expands upon it with a selection of popular Python and R data
science packages that we have curated. In turn, the DataScience.com Standard
Example User one inherits from its namesake base environment and has all the
Platform tools enabled. Please refrain from deleting or modifying these environments.
They have been designed to be fully compatible with all user onboarding,
engagement, and education materials that we will provide. If these environments are
modified in any way, we are unable to guarantee that these materials will be able to
run successfully.

Enabling Hadoop and Spark


Introduction
The DataScience.com Platform provides seamless integration for Apache Hadoop,
Hive, and Spark. The Platform will connect to your data where it lives, so there is no
need to move data or add/replace costly infrastructure. To enable Hadoop, Hive, and
Spark on your instance, you will need to follow a two-step process: (i) configure your
Hadoop cluster and (ii) build your Hadoop-enabled environments.

Hadoop Cluster Configuration


To configure your cluster, navigate to Administration in the top menu bar and click on
the Hadoop Cluster tab. Select your Hadoop provider from the drop-down list to begin.


Fill in the form with the cluster name, provider version, and additional security
information. Then, choose to enable Hadoop, Hive, and/or Spark by checking the
respective boxes. When you choose to enable a certain framework, you will be
prompted to add additional information in the form of configuration files that you will
upload into the form. In this next section, you will learn how to locate and acquire
these files.
Please gather configuration materials from edge nodes or client machines that have successfully connected to the cluster. Cluster server nodes may not have the proper configurations.

Building a Hadoop-Enabled Environment


Building a Hadoop-enabled environment is similar to building any Base environment,
except all of the Dockerfile commands for installation are form-field driven. Simply
navigate to the Environments screen, select Add Environment > Base Environment,
and choose the Install Hadoop Dependencies option. Once selected, choose your
Hadoop distribution as your provider, select your version, and build.


Currently, the DataScience.com Platform supports MapR versions 5.2.1 and 5.2.2 as
well as Cloudera version 5.10.


For more information about building environments, see our Environment Management
documentation.
After you have created an available Hadoop-enabled Base environment, create a User
environment from this new Base environment to enable your Users to connect to the
cluster.

Note:
The DataScience.com environment build tool only supports installing one
version of Spark at a time in a given environment. A Hadoop-enabled
environment will install either Spark 1.6.0 or the appropriate Spark 2.x
release according to the specified Hadoop provider distribution version.

Other Providers
Coming soon! First-class support for EMR and Hortonworks is in development. Don’t see your provider? Contact success@datascience.com.

On-Demand Instances
Introduction
Instances of the DataScience.com Platform running in Amazon AWS can use on-
demand resources for workloads that exceed the available resources on the cluster.
Support for on-demand resources in Microsoft Azure and Google Compute Engine is
coming soon.

Configuring EC2 Options in the Admin Console


Here's a list of descriptions and necessary configuration information.


Configuring the IAM Policy

SSO Configuration
The DataScience.com Platform offers Single Sign On (SSO) integration for both LDAP
and SAML. The following sections outline necessary steps for an Admin to configure
the integration.

SAML 2.0 Based SSO

• SAML 2.0 is an open standard used by the majority of SSO providers.
• SSO via SAML works by redirecting Users to the login page of their SSO provider. Once they are logged in, the provider will recognize the app they are trying to access and redirect their request to the correct address with information identifying the User.

Proxy Based LDAP SSO


• This feature signs Users onto the Platform based on a request header, which is
expected to contain identifying LDAP information about the User.
• The name of the header and the LDAP field to be used are configurable in the
Admin Panel.
Only use this feature if needed.
In order for the Platform to be secure with this functionality enabled, it must only be
accessible to Users via a proxy server that has knowledge of the User’s identity and
that enforces all requests have a header containing a User’s identifying LDAP
information.
Please use caution when enabling this feature.
The proxy must guarantee that all requests will have a target header name overwritten
by either the user’s LDAP identifying information or a blank/invalid value. If the server
is accessible from a connection that is not routed through the proxy or if request
headers are not overridden, the Platform would become highly vulnerable.
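As an illustration of what the proxy must enforce, every request reaching the Platform should carry the configured header. Both the header name shown here (X-Remote-User) and the LDAP value are assumptions; the actual names are set in the Admin Panel:

$ curl -H "X-Remote-User: uid=jdoe,ou=people,dc=example,dc=com" \
    https://yourplatformhostname.com/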

5
Administration
The DataScience.com Platform was designed with administrative simplicity in mind.
That’s why it’s built on familiar tools like Docker and Postgres, and comes with a built-
in Admin Console that does most of the heavy lifting.
This section gives you an overview of common administrative tasks such as upgrading
the platform and managing the cluster. We’ve also included instructions for things like
manually adding users in Postgres in the event that you lose administrative access to
the UI.
Upgrading Shows you how to check for updates to the Platform and how to install
them.
User Management Provides instructions for user management, both from within the
Platform and in Postgres.
Cluster Management Lists everything you need to know about adding and removing
nodes from the cluster.
Database Administration Shows you how to create the Platform’s database from the
command line using psql.
Monitoring and Logging Directs you to the various monitoring and logging features
available in the Admin Console.
On-Demand Resources Provides instructions for managing on-demand resources, as
well as hints for how to resolve common issues.
Resource Management Dashboard Shows you how to use the Resource Management
Dashboard to see the health and availability of the compute hosts, what services are
running, and the amount of compute resource each service is using.

Upgrading
Introduction
Platform Restart Required
Because updates to the DataScience.com Platform require container restarts, we
highly recommend scheduling a maintenance window prior to performing an upgrade.
Container restarts could result in lost work and other unintended consequences.

Installing Upgrades
1. From the main Dashboard view in the Admin Console, you can see:
• Whether the Platform is up to date
• The last time the Admin Console checked for an update
• The current version of the Platform


2. If an update is available, the Check Now button will turn into a View Update button.
Clicking the View Update button will take you to the Release History view.

3. The Release History view will show you all versions available to you, both past and present, and the date they were installed (if applicable). If an update is available, the Release History view will display the release notes for that version.


4. To install the newest update, click Install Update:

Or to install a different update, find it in the list and click Install:


5. After you’ve selected an update to install, Replicated will automatically pull images
for the update and push them to the appropriate nodes.

6. When the update has finished installing, the DataScience.com Platform should
automatically restart.

Airgapped Environments
To upgrade airgapped environments, you must first tell the Admin Console where to
look for new packages.
1. In the Admin Console, click the gear icon in the top right menu, then select
Console Settings.


2. From the Console Settings navigation menu, click Airgapped Settings.

3. The Update Path is the location the Admin Console will continuously check for
new Platform packages. This setting will default to the directory from which you
uploaded the initial ‘.airgap’ package. If desired, change the update path and/or license file path.

4. If you’ve made changes, scroll to the bottom of the page and click Save.

User Management
Introduction
The DataScience.com Platform currently offers two methods of User authentication:
built-in and LDAP integration. This article focuses solely on User administration using
built-in authentication; for information about our LDAP integration, please see the
appropriate article on LDAP Integration.

User Roles and Permissions

Action                            Admin   Standard   Read Only
View outputs and activity         Yes     Yes        Yes
View files                        Yes     Yes        No
Launch containers                 Yes     Yes        No
Manage instance users and teams   Yes     No         No
Edit user roles                   Yes     No         No


Adding a User
1. While logged in to the Platform as an Administrator, navigate to Administration via
the top menu bar.


2. You will be directed to the Users and Teams tab. Next to Users in the left-hand
portion of the screen, click Add New to create a new User.


3. Provide the new User’s name and email address, and specify their role. When you
are finished, click Invite. This will send an email to the new User, prompting them
to accept the invitation and create a password.
Adding users with LDAP
If you manage your Users through LDAP, adding a User is equivalent to looking up the User in the LDAP server. Clicking the + Add New button will open a modal where you can search for the User and assign the User a permission as normal. The User must exist in the directory first for the DataScience.com Platform to be aware of the User. The User will then use their LDAP username and password to log into the DataScience.com Platform. Note: LDAP groups are currently not supported.


Creating a Team
1. While logged in to the Platform as an administrator, navigate to Administration via
the top menu bar.


2. You will be directed to the Users and Teams tab. Next to Teams in the right-hand
portion of the screen, click Add New to create a new Team.


3. On the following screen, provide a name for your new Team and then add Users
as collaborators. When you are finished, save the new Team by clicking Save.


Adding Users to Teams


1. To add users to an existing team, navigate to Administration via the top menu bar.


2. You will be directed to the Users and Teams tab. Select the team you want to add
users to by clicking it.


3. You can edit team settings, including members, by clicking Edit Team.


4. Add the new team member by typing their name into the Add Members field.
When you’re finished, click Save.


Manually Adding Users in Postgres


Users can also be manually added to the DataScience.com Platform in Postgres.
1. To manually add users in Postgres, first connect to the database endpoint from the
command line using administrative credentials. We recommend using psql for this.
2. Once you have authenticated to the endpoint, ensure that you’re connected to the Platform database with the command \c platform, then run the following command for each user you want to add.
Make sure to replace the following placeholder values with your own:
• {id} - Numeric identifier for the new user; this value must be unique.
• {email} - User’s email address
• {pass_hash} - Hash of the user’s password
• {first_name} - User’s first name
• {last_name} - User’s last name
• {role} - User’s role (ADMIN, READ-ONLY, or STANDARD)

INSERT INTO users
  (id, email, password, first_name, last_name, status, created_at, updated_at, role)
VALUES
  ('{id}', '{email}', '{pass_hash}', '{first_name}', '{last_name}', 'ACTIVE',
   current_timestamp, current_timestamp, '{role}');
3. To verify that the users have been added successfully, run the following command
for each user and confirm the results. Remember to replace {user_email} with the
user’s email address.

SELECT * FROM users WHERE email = '{user_email}';

Manually Crafting Invitation Links


Integrating with an SMTP server or service will allow the Platform to automatically
send invitation emails to new users. However, in cases where the Platform is unable to
connect to SMTP, you can manually craft invitation links using the following pattern:
https://<platformURL>/auth/new-password?token=<token>

The token can be obtained by doing the following:


1. Invite the new user from within the Platform, as usual, or manually add them in
Postgres.
2. Connect to the platform database and query the users table to obtain the new
user’s ID. For example:

SELECT * FROM users WHERE email = 'user@domain.com';


3. Use the user ID obtained from step 2 to query the reset_invitations table and
obtain that user’s unique token. For example:

SELECT * FROM reset_invitations WHERE id = 1;

You can then use the user’s token to craft an invitation link for them.
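The two lookups can also be combined into a single query. This sketch assumes the relationship implied by the queries above (the reset_invitations row keyed by the user's ID) and a token column, which is an assumption about the schema:

SELECT u.email, r.token
FROM users u
JOIN reset_invitations r ON r.id = u.id   -- join keyed by user ID, per the queries above
WHERE u.email = 'user@domain.com';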

Cluster Management
Introduction
Downtime Required
We highly recommend adding and removing nodes during maintenance windows or
other quiet periods, as you will need to restart the DataScience.com Platform.
Although downtime should be brief, there is still a risk of users losing unsaved work.

Adding Nodes
Adding a node to the cluster is relatively straightforward; however, we recommend that
you remind users to save their work, as you will need to restart the Platform.
In most cases, after the initial installation, you’ll be adding Worker nodes in order to
expand the pool of standing resources. However, the instructions below are applicable
to any node type.


1. In the Dashboard view of the Admin Console, stop the application.

2. In the Admin Console, navigate to the Cluster view using the menu in the upper
right-hand corner.


3. Provision the additional node(s) first by obtaining the daemon token. This is found
in the Cluster view.

4. Run the following command on each of the new hosts. Remember to replace
{master_node_ip}, {daemon_token}, {new_host_priv_ip}, {new_host_pub_ip},
and {tags} with the appropriate values.
Tags
The {tags} option accepts tag names as comma-separated values, i.e.
Worker,Common. There should be no spaces between tag names. Please make
sure your tags are correct before running this command. If you provision a node
that is tagged incorrectly, you will need to destroy and recreate it.

$ curl -sSL -o operator.sh https://get.datascience.com/operator.sh
$ chmod +x operator.sh
$ sudo ./operator.sh daemon-endpoint={master_node_ip}:9879 \
    daemon-token={daemon_token} \
    private-address={new_host_priv_ip} \
    public-address={new_host_pub_ip} \
    tags={tags}
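For example, a Worker node with hypothetical addresses (the token placeholder stands in for your real daemon token) might be provisioned like this:

$ sudo ./operator.sh daemon-endpoint=10.0.0.5:9879 \
    daemon-token={daemon_token} \
    private-address=10.0.0.12 \
    public-address=54.201.10.8 \
    tags=Worker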


5. Once the new node has been added, you may restart the application.

Removing Nodes
1. When removing Core nodes, you must first stop the DataScience.com Platform. If
you’re removing a Worker, this step is not necessary. To stop the Platform,
navigate to the Dashboard view in the Admin Console and click Stop Now.

2. Once the Platform has stopped, stop the ‘replicated-operator’ service. Note that
the example below is applicable only for distributions using upstart; yours may
differ.

$ stop replicated-operator
3. On the Master node, run the following command, replacing {node_id} with the node’s ID. You can find the node’s ID in the Cluster view in the Admin Console or by running ‘docker node ls’.

$ docker node rm {node_id}


4. Finally, in the Cluster view of the Admin Console, remove the node by clicking the
trash icon next to the disconnected node.

5. Once the node has been removed, you may restart the Platform.

Database Administration
Introduction
This article provides some notes on common PostgreSQL configuration pitfalls and
instructions for creating the initial DataScience.com Platform database.

Prerequisites
• An existing instance of PostgreSQL — please see our requirements section for
sizing suggestions
• Credentials for a database user who has permissions to create new databases
and who is an administrator of the public schema
• Access to psql or some other PostgreSQL management tool, either locally or on
the database host itself


Basic Configuration
Configuration File Locations
PostgreSQL configuration file locations may differ based on your Linux distribution. To find postgresql.conf, you can query the database with SHOW config_file. Normally, pg_hba.conf is in the same location, but you can verify that by finding its default path in postgresql.conf.

Stand-By Configuration
PostgreSQL offers stand-by replication for high availability. Essentially, this setup
maintains a replica of the database on a secondary server, which can be used as a
read-only failover should something happen to the primary PostgreSQL host.
As a first step, create two PostgreSQL servers and configure them using the previous
steps in Basic Configuration.

Failing over to the hot standby


Before attempting a failover, it is important to ensure your hot standby is fully
operational by having the above-mentioned monitoring in place.
1. Ensure that PostgreSQL on the master is stopped with systemctl stop postgresql-9.5. Depending on the circumstances, you may also want to disable it from restarting on reboot with chkconfig.
2. On the standby, run touch /tmp/postgresql.trigger.5432. This is the “trigger file” that switches the server from standby to master mode.
3. In order for the Platform to use the hot standby server, you must do one of the
following (depending on your configuration):
a. In the settings page, change the PostgreSQL host from the old master’s IP to
the standby’s IP.
b. If you have a DNS record for the master, change it to point to the standby.
4. Restart the DataScience.com Platform from the settings page in order for the
change to take effect.
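One simple check that the promotion took effect is to ask PostgreSQL whether it is still in recovery (standby) mode; after a successful failover, the former standby should return false:

$ psql -h {standby_ip} -U {admin_user} -d platform -c "SELECT pg_is_in_recovery();"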

Database Backups
Database backups are an essential component of a disaster-recovery plan. If your
PostgreSQL is hosted on a cloud-provider managed service such as AWS RDS or
Google Cloud SQL, you should enable regular backups on a schedule that suits your
disaster recovery requirements.
If you are self-managing your PostgreSQL server, a simple backup example is outlined
below. More robust backup configurations could include running the backup from a
different server, scheduling the backup as a cron job, or archiving and removing old
backup files.

$ pg_dump platform > backup-$(date +%F)

In the event of a database failure, you would restore with a command similar to:

$ psql platform < backup-YYYY-MM-DD
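As a sketch of scheduling the backup as a cron job with simple retention (the schedule, backup directory, and 14-day retention are assumptions to adapt; note that % must be escaped in crontab entries):

# crontab entry: nightly backup at 02:00, pruning backups older than 14 days
0 2 * * * pg_dump platform > /var/backups/platform-$(date +\%F) && find /var/backups -name 'platform-*' -mtime +14 -delete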


Further Reading
• Installing and configuring PostgreSQL
• Using the ‘psql’ command

Monitoring and Logging


Introduction
The DataScience.com Platform offers several options for monitoring resources and
capturing logs. You can use the built-in features, integrate with external providers, or
both.

Monitoring Resources
You can check basic usage information at any time using the graphs displayed in the
Dashboard view of the Admin Console.


You can also send usage metrics to your existing Datadog account. For detailed
instructions on how to configure this integration, please see the Datadog section of our
Settings article.

Logging
The Admin Console allows you to download logs as a support bundle.
1. In the Admin Console, navigate to the Support view using the menu at the top right
corner.

2. To download the support bundle, click Download Support Bundle.

3. The Admin Console will generate your support bundle in ‘.tar.gz’ format. This may
take a few minutes, depending on how many logs have been generated.

4. Unzip and untar the file. Its contents will include stdout and stderr for all
containers, configuration details of all nodes in the cluster, daemon logs and
configurations, etc.
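For example, assuming the downloaded bundle is named supportbundle.tar.gz:

$ tar -xzf supportbundle.tar.gz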

Audit Trails
Any major changes made in the Admin Console are tracked in the Audit Log.
Examples of these changes are application stops and starts, upgrades, and support
bundle downloads.


To access the Audit Log:


1. Navigate to the Audit Log view using the menu at the top right corner of the Admin
Console.


2. In this view, you can see a list of recent major changes, including the date and
time the changes were made.


On-Demand Resources
Introduction
This article contains instructions for managing on-demand resources, as well as hints
for how to resolve common issues.

Garbage Collection for Orphaned Nodes


In general, each record in on_demand_instance has a one-to-one correspondence with
a runnable_instances record. Thus, if at any point there are records in
on_demand_instance that lack a counterpart in runnable_instances, it is safe to
assume that these instances have been orphaned and should be manually terminated
in AWS.
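A sketch of a query to surface such orphans, assuming a foreign-key column linking the two tables (the column name runnable_instance_id is an assumption; adjust to your schema):

SELECT o.*
FROM on_demand_instance o
LEFT JOIN runnable_instances r ON r.id = o.runnable_instance_id
WHERE r.id IS NULL;   -- on-demand records with no runnable counterpart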

Resource Management Dashboard


In this article you will learn how to use the Resource Management Dashboard. Note
that the resource management dashboard is only accessible to users with the Admin
level of privileges.
The dashboard displays the following information:
• The health and availability of the compute hosts
• What services are running
• The amount of compute resource each service is using
This snapshot shows you where you can access the Resource Management
Dashboard. Click on Resources in the top menu bar.


Within the Resource Management Dashboard, you will see two tabs: Hosts and
Services. Hosts corresponds to the instances serving the Platform. Information such
as IP address, Host ID, Status, Tags, Memory Usage, CPU Usage, Services, and
Largest Available is shown at the host level. The definitions for each field follow:
• Host ID: Corresponds to the Docker swarm node ID
• IP Address: The IP address of the host
• Status: Corresponds to the Docker status of these nodes. Four values are
possible: Unknown, Down, Ready, and Disconnected. Most often, Ready or Down
will be displayed.
• Tags: Tags assigned to the host. Current values are: “-” and On-demand, to
distinguish between standing pool and on-demand hosts.
• Memory Usage: There are two figures here: X/Y GB. X represents the memory
currently reserved for user-run services (e.g. Jupyter sessions, runs, deployed
APIs, etc.) and Y represents the total amount of memory available on the host.
The same convention applies to CPU Usage. Note that the following color
scheme is used:
– green: any size container can be spawned on the host
– yellow: only a subset of container sizes can be spawned on the host
– red: nothing can be spawned on the host
• Services: The number of platform services that are currently running on this host
• Largest Available: The largest Docker container available on this host; this
container is available to host Jupyter sessions, deployed APIs, scheduled runs,
etc.
If you click any Host ID, you can see the user services that are running on the host.
You can click on each service and reach the corresponding service page for more
details. The Status field can take the following values: Created, Restarting, and
Running. The state Running follows the Created or Restarting states.


When you click on the Services tab, you can see a list of all the services that are
currently running. Current services include Jupyter sessions, Deployed APIs, RStudio
sessions, Shiny Apps, Script Runs, and Scheduled Runs.


For each service, you can click on the Service Name and reach the Service page. You
can also remove the service by clicking on the Remove option that will appear next to
the Host ID.


Mem and CPU are the amount of memory and number of CPUs reserved for this
particular service.

6
Troubleshooting
Introduction
These articles address some of the more common issues you may experience with the
DataScience.com Platform. However, if you can’t find the answer to your question
here, you’re always welcome to reach out to our support team at
support@datascience.com.

Blank login page


To identify the underlying problem, inspect the login page for errors using your
browser’s developer tools. These can be accessed in the following ways:
• Chrome
• Edge
• Firefox
• Safari

Master not joining cluster


Cause: This usually occurs when a firewall, either on the host or elsewhere on the
network, disallows traffic to the Master node in the cluster management port range
(9870 - 9880).
Solution: Double-check local and external firewalls to ensure all nodes are able to
communicate with one another and themselves over ports 9870 - 9880.
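
To verify connectivity from a Core or Worker node, a simple port sweep such as the following sketch can help; the MASTER_IP value is a placeholder for your Master node's address.

# check each port in the cluster management range
$ MASTER_IP=10.20.1.144   # substitute your Master node's address
$ for port in $(seq 9870 9880); do nc -zv -w 2 "$MASTER_IP" "$port"; done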

7
Appendix
This section includes various real-world examples of configurations like environments,
security groups, and databases. We’ll be adding to this section frequently, so be sure
to check back often.
• Dockerfile Basics and Best Practices: Best practices and examples of Dockerfiles
for building environments
• Amazon AWS Examples: Examples of security group and RDS configurations in
Amazon AWS
• Google Compute Engine Examples: Examples of GCE configuration options
• Environments and Dependencies: Environments from the User’s perspective, including
where to find information, choosing an environment, and adding dependencies
• Changing Paths of Installed Components: Options for administrators who wish to use
alternate mount points or paths
• Changing the SSL Certificate: Instructions for updating the SSL certificate being used
by both the DataScience.com Platform and the Admin Console
• Docker Commands Glossary: Helpful tips for troubleshooting the Platform with Docker
commands

Dockerfile Basics and Best Practices


In this section, you will learn how to create custom Docker images for your team of
data scientists.
In order to build Docker images that contain the tools and dependencies your team
needs, you need to write instructions in a Dockerfile, which is a text file that contains
all the commands (in order) that need to be run to build the desired image. If you are
not familiar with Dockerfiles, we recommend reading this Docker tutorial.
Prior to beginning, please review the following best practices and warnings for building
environments on the Platform with Dockerfiles.

Best Practices
• If you reference files in your instructions, use relative paths, not absolute paths.
• Conda and R don’t play well together. If you intend to have R and Python cross
dependencies, avoid using Conda. Instead, install Python, R, and their respective
libraries via pip, R and apt-get commands.
Environment Dockerfiles are based on the Debian 8.5 Linux distribution. This means
you can only use Debian’s system package manager, apt-get. Other package
managers, such as yum, will not work.
• Make sure that your Python and R interpreters are in your PATH. The version that is
in your PATH will be executed. If you have installed multiple versions of the Python
interpreter (e.g. Python 2.7, Python 3.6), make sure you activate the right one in
your PATH. The same goes for R.
• Generally, we recommend having separate environments for Python 2 and 3.
• Take advantage of image “inheritance”. Build base images that could be used for
other Base or User environments. Avoid creating images with very intricate sets of
dependencies by breaking them into smaller images. This will help with
debugging.
• Read the build logs very carefully. Sometimes installation errors will occur, yet the
image build could still be successful.

Dockerfile Basics
This section outlines the basics of writing Dockerfiles. For those who are already
familiar with Dockerfile, you may skip this section and proceed to the next one.
Below is a description of all Dockerfile instructions currently supported for Base and
User environments.

Putting It All Together


This section contains a few examples of Dockerfiles you can use in your workflow. The
example starts with a base image that installs the package manager Conda. This base
image should be built from the default base image environment.
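
As a hedged sketch of such a base image, the Dockerfile below installs Miniconda; the FROM image name "datascience/base" is a placeholder for your default base image environment, and the installer URL reflects the era of this release and may differ today. The sketch assumes curl is available in the base image.

# "datascience/base" stands in for your default base image environment
FROM datascience/base

# install Miniconda into /opt/conda and add it to the PATH
RUN curl -sSL https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh \
    && bash /tmp/miniconda.sh -b -p /opt/conda \
    && rm /tmp/miniconda.sh
ENV PATH=/opt/conda/bin:$PATH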

Amazon AWS Examples


Introduction
This section includes several example configurations for Amazon AWS RDS and
security groups.
These examples assume that you are using an AWS account with VPC networking. If
you are using EC2 Classic, your installation may vary substantially. Please contact
support@datascience.com for more detailed suggestions.

Required Resources
An AWS installation that follows these examples presupposes that the following AWS
resources exist and are configured correctly:
• An AWS account
• A VPC with subnets, route tables, and internet and/or NAT gateways configured to
allow unrestricted egress to the internet. It is also possible to allow egress to the
internet with direct connect through another data center.
• Either a root account or an IAM user account with privileges to create, edit, and
delete EC2 instances, EBS volumes, security groups, IAM users, IAM policies,
SES users, AMIs, tags, and keypairs. For troubleshooting purposes, it is ideal to
have at least view access to VPC, subnet, and route tables as well.
• An AMI maintained by your organization that meets the minimum requirements
and includes the SSH keys of the person or team that will be conducting the
installation. If you do not currently maintain an AMI that meets the requirements,
DataScience.com can offer recommendations that can be adapted to fit your
needs.


Other required resources not specific to AWS:


• A method of provisioning servers, volumes, security groups and other resources.
This can be open-source tools such as Terraform, Puppet, Chef, or SaltStack; it
can be AWS-native methods such as CloudFormation or the AWS Console; or it
can be tooling proprietary to your organization.
• A method of creating and updating DNS records. Route 53 is preferred, but
externally maintained DNS is also acceptable.
• (Recommended for production) A method of creating and exporting SSL
certificates in PEM format. At this time, AWS Certificate Manager is not suitable,
since its certificates cannot be exported. While it is possible to use self-signed
certificates with the DataScience.com Platform, this configuration is not
recommended for production, as your users will see a browser security warning.

Resources Created During the Installation


Required:
• EC2 instances and EBS volumes with specifications meeting the minimum
requirements
• Unless you use direct connect and manage your network’s firewall externally to
AWS, you will need to create security groups.
Optional:
• If you do not have an existing SMTP service, you will need to create an SES
Domain with SMTP credentials.
• In order to use the on-demand feature, create the resources as outlined in On-
Demand Instances.

Security Groups
Unless your DataScience.com Platform installation must meet specific compliance
requirements, we suggest installing all infrastructure components (Master host, Core
hosts, Worker hosts, and the database) in the same security group, with all TCP traffic
allowed between nodes. You may need to open additional ports in order to connect to
your data sources.
For more information about the specific ports the Platform uses to communicate,
please see our Networking section.
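
For example, a hedged AWS CLI sketch of such a security group follows; the group name, VPC ID, and security group ID are placeholders (create-security-group returns the real group ID).

# create one security group for all Platform infrastructure components
$ aws ec2 create-security-group --group-name platform-cluster \
    --description "DataScience.com Platform nodes" --vpc-id vpc-0123456789abcdef0

# allow all TCP traffic between members of the group (self-referencing rule)
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 0-65535 --source-group sg-0123456789abcdef0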


Subnet
For on-demand instances to start and launch properly, they should originate from a
private subnet. Once you have created a private subnet with the routing described
below, on-demand instances route properly through the public subnet.
1. Create a subnet that doesn’t overlap with the main/public subnet.


2. Create a NAT gateway (Create New EIP) that resides in the main/public subnet.


3. Create a route table for the private subnet. Add route 0.0.0.0/0 and select the NAT
gateway that was created in the previous step.


4. Associate the private subnet with the route table that was created in the previous
steps.
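
The console steps above can also be scripted. A hedged AWS CLI sketch follows; every ID and the CIDR block are placeholders for values from your own VPC (each create command returns the ID used in the next step).

# 1. create a private subnet that does not overlap the main/public subnet
$ aws ec2 create-subnet --vpc-id vpc-0123 --cidr-block 10.20.3.0/24

# 2. allocate an EIP and create a NAT gateway in the main/public subnet
$ aws ec2 allocate-address --domain vpc
$ aws ec2 create-nat-gateway --subnet-id subnet-public0123 --allocation-id eipalloc-0123

# 3. create a route table with a default route through the NAT gateway
$ aws ec2 create-route-table --vpc-id vpc-0123
$ aws ec2 create-route --route-table-id rtb-0123 \
    --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0123

# 4. associate the private subnet with the new route table
$ aws ec2 associate-route-table --route-table-id rtb-0123 --subnet-id subnet-private0123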


RDS
Amazon’s RDS managed database service is a convenient way to set up a
PostgreSQL server. Below are some minimal suggestions for launching RDS in the
AWS console. Note that this example does not include provisioned IOPS, a
recommended feature for production installations.
In these examples, you should substitute the security group you plan to use with your
Master, Core, and Worker servers in the Network & Security section.


Google Compute Engine Examples


Introduction
This section includes examples for SMTP settings and database configurations in
Google Compute Engine (“GCE”).

Firewall Rules for Master Node


When creating your Master node, make sure to set the network tag as master. Then,
create a firewall rule to open the following ports: tcp:80; tcp:8080; tcp:8085; tcp:443;
tcp:900; tcp:8800; tcp:8500
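
As a hedged sketch, the equivalent gcloud command might look like the following; the rule name is a placeholder, the port list is taken as written above, and the command assumes the default network.

# open the Platform ports to instances tagged "master"
$ gcloud compute firewall-rules create platform-master \
    --allow tcp:80,tcp:443,tcp:900,tcp:8080,tcp:8085,tcp:8500,tcp:8800 \
    --target-tags master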


SMTP Server Settings


Outgoing SMTP ports 25, 465, and 587 are blocked in GCE except to
smtp-relay.gmail.com and smtp.gmail.com.

To set up SMTP using smtp.gmail.com, navigate to My Account → Sign-In & Security
from Google, and then click Create App Passwords. Once you get the password,
configure your SMTP settings in the Admin Console.

Supply your Google email address as the SMTP username, and the app password as
the SMTP password. The From Address should match your Google email address.
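
Put together, and assuming the Admin Console exposes the standard SMTP fields, the resulting settings would look roughly like this sketch; the address and password are placeholders.

SMTP Host:     smtp.gmail.com
SMTP Port:     587
SMTP Username: you@example.com
SMTP Password: <app password generated above>
From Address:  you@example.com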


Database Configuration
Create a GCE database using PostgreSQL.

When configuring the database instance, allow access from your Master node. Note
that in GCE, you may only whitelist access from the Master node’s public IP address.

Once the PostgreSQL instance has been created, you may create the necessary
Platform database using GCE’s web UI.
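
If you prefer the command line over the web UI, a hedged gcloud sketch follows; the instance name and Master IP are placeholders, and the exact commands available may depend on your gcloud version.

# whitelist the Master node's public IP on the Cloud SQL instance
$ gcloud sql instances patch platform-db --authorized-networks=203.0.113.10/32

# create the Platform database
$ gcloud sql databases create platform --instance=platform-db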


Environments and Dependencies


Introduction
Environments are customized, pre-installed collections of dependencies and packages
that can be created by Admins and distributed to Users on the DataScience.com
Platform.

Launching Environments
To run analyses or create outputs on the Platform, you can launch Docker containers
to host your work. When spawning containers, you can configure the environment that
you want to run. Choose an environment from the dropdown menu. Only environments
that are available for your tool will be available to select and run.

Adding Additional Requirements


When configuring your container, you can specify additional requirements to install at
runtime by clicking the Add Requirements button on the action modals. Depending on
the language selected, you’ll find forms for pip, R (which runs
install.packages("...")), and apt dependencies. When you include a list (in text
file format) of packages for these installers, the Platform will install them before
running your code.
If you are using the Conda package manager, supplying pip dependencies via Add
Requirements is not currently supported. With Conda-based environments, avoid pip-
only dependencies where possible. If this is not an option, install the required pip
dependencies during environment building.
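
For example, a hypothetical Dockerfile line baking a pip-only dependency into a Conda-based environment might look like this; the /opt/conda install path is an assumption.

# install a pip-only dependency with Conda's bundled pip at image build time
RUN /opt/conda/bin/pip install plotly==2.0.12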


Notice that the form above points to a text file called requirements.txt. While you can
call that file anything you want, it must be formatted with a different package name on
each line. The apt-get and R installers accept only package names and install the
latest stable version. The pip installer accepts either a package name or a version-
locked name, as in the example below:

# install locked version of plotly
plotly==2.0.12
# install latest version of seaborn
seaborn

For pip only, the comments in the example above are valid syntax.

Changing Paths of Installed Components


If your server provisioning process restricts the space available on the root volume or
you prefer to configure installed components with separate mount points, you can do
so during the initial Platform installation.
If you have root volume restrictions, another good alternative is to mount /var on a
separate volume; however, this would need to be done before the operating system is
installed.
These commands are destructive.
If you are running these commands after the initial installation, the directories below
should be backed up before proceeding.
In the below example, a volume is mounted to /datascience.

systemctl stop docker
mv /var/lib/docker /datascience/var/lib/docker
ln -s /datascience/var/lib/docker /var/lib/docker
mv /var/lib/replicated /datascience/var/lib/replicated
ln -s /datascience/var/lib/replicated /var/lib/replicated
systemctl start docker

If space on your root volume is extremely limited, you may also wish to move the
internal logging directory. Use the full path to the logging directory you wish to use,
including a trailing slash (for example, /datascience/logs/).


It is also possible to use a volume mount for your Postgres server(s). This example
shows the data directory for Postgres 9.5 on RHEL being mounted on a volume in
/apps/datascience. You may adjust as needed for packages/distros that use data
directories other than /var/lib/pgsql. It’s also possible to do this after installing
Postgres if you mv the directory.

mkdir -p /apps/datascience/var/lib/pgsql
ln -s /apps/datascience/var/lib/pgsql /var/lib/pgsql
yum install -y https://download.postgresql.org/pub/repos/yum/9.5/redhat/rhel-7-x86_64/pgdg-redhat95-9.5-2.noarch.rpm
yum list postgresql*
yum install -y postgresql95-server

Changing the SSL Certificate


Should you wish to update the SSL certificate being used by both the
DataScience.com Platform and the Admin Console, you must have both the new cert
and its corresponding private key.
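
Before you begin, you may wish to confirm that the certificate and private key actually match; the following sketch compares their moduli (the file names are placeholders). The two digests should be identical.

$ openssl x509 -noout -modulus -in new-cert.pem | openssl md5
$ openssl rsa  -noout -modulus -in new-key.pem  | openssl md5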


1. To update the cert, first navigate to the Admin Panel and click the gear icon in the
top right corner. In the dropdown menu, click Console Settings.

2. From here, you can choose to use a self-signed certificate, provide the path to a
cert and key file already on the Master node, or upload the cert and key file.

3. Once you have updated the SSL cert, scroll to the bottom of the settings page and
click Save.

4. Restart the DataScience.com Platform to use the newly updated certificate.


Docker Commands Glossary


If the DataScience.com Platform is your first introduction to the Docker ecosystem,
some common commands can help you quickly familiarize yourself with the basics of
operating and troubleshooting Docker containers in the context of the Platform. Of
course, there is much more information available in the official Docker documentation.

docker info
What it’s for: High-level Docker daemon information, including installed version,
available memory and CPU
When to use it: If you want to confirm which Docker version you are running,
especially after initially installing it, or upgrading the OS

$ docker info
Containers: 33
Running: 13
Paused: 0
Stopped: 20
Images: 215
Server Version: 17.06.0-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 609
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs

docker ps
What it’s for: Showing currently running containers on your host. Each row will
include a unique container ID, the image from which it was launched, a user-friendly
name for the container, the shell command that runs when the container launches,
timing data about the launch, and which ports are forwarded between host and the
container.
When to use it: Verifying that the container components of the DataScience.com
Platform are running.

# confirm that the main component of the platform is running
$ docker ps | grep platform
9fd4ebbcb041  quay.io/datascience/platform:latest  "config/docker/doc..."  10 hours ago  Up 10 hours  0.0.0.0:8000->8000/tcp, 0.0.0.0:8080->8080/tcp

# the -a flag shows all containers, including exited ones
$ docker ps -a | grep pollen
5e9e29621b10  10.20.1.144:9874/datascience-pollen:latest

docker logs
What it’s for: To obtain logs for system-level containers
When to use it: Most commonly used during initial installation, this command will
output everything logged to a container since it started running. As you can see in the
example, it takes a container ID or name as a parameter. The below example shows
an authentication error with a Git provider that corresponds to an error you would see
in the browser.

$ docker logs platform | grep -i error
{"name":"platform","hostname":"d9f6f726a9d4","pid":7051,"level":50,"err":{"message":"No git authorization for current user.","name":"Error","stack":"Error: No git authorization for current user.

docker service ls
What it’s for: Unlike ‘docker ps’, which is specific to the current host, this command
shows services running on all nodes in your cluster. Whereas ‘docker ps’ will show
system-level containers, this will show all the sessions launched by your users. Most
of this data is exposed to you in the resources dashboard. Note that the service ID will
be different from the container ID that you see in ‘docker ps’.
When to use it: If you suspect a user’s session has failed for reasons that aren’t
obvious in either the session logs or the resource dashboard, you can look at the
service list to get more information.

$ docker service ls
ID            NAME                                       MODE        REPLICAS  IMAGE
0e6fwj479kvi  deploy-another-test-706570-v1              replicated  1/1       10.20.1.144:9874/another-test-706570:1
jb7hwui2mtuu  jupyter-another-test-repo-jupyter-1915-v1  replicated  0/1       10.20.1.144:9874/datascience-jupyter-python2:2.3.1

docker service inspect and docker service ps


What it’s for: Getting more detail about a user’s service
When to use it: Most of the useful information outputted by these commands is
available in the resources dashboard and the session details page, but you can see
detailed information about how and where a user’s session launched.

$ docker service ps 0e6fwj479kvi
ID            NAME                             IMAGE                                   NODE            DESIRED STATE  CURRENT STATE          ERROR  PORTS
q34p9z65o3bs  deploy-another-test-706570-v1.1  10.20.1.144:9874/another-test-706570:1  ip-10-20-2-115  Running        Running 23 hours ago          *:33350->8001/tcp

$ docker service inspect 0e6fwj479kvi
{
    "ID": "0e6fwj479kvind6giq3tx6s5d",
    "Version": {
        "Index": 319046558
    },
    "CreatedAt": "2017-08-02T05:42:04.412705186Z",
    "UpdatedAt": "2017-08-02T05:42:04.418101318Z",
    "Spec": {
        "Name": "deploy-another-test-706570-v1",
        "TaskTemplate": {
            "ContainerSpec": {
                "Image": "10.20.1.144:9874/another-test-706570:1",
                "Labels": {
                    "SERVICE_8001_NAME": "deploy-another-test-706570-v1",
                    "SERVICE_CHECK_HTTP": "/deploy/deploy-another-test-706570-v1/status",
                    "SERVICE_CHECK_INTERVAL": "30s",
                    "SERVICE_NAME": "deploy-another-test-706570-v1",
                    "SERVICE_TAGS": "deploy"
                },
                "Env": [
                    "MODEL_FILE=problematic_script.py",
                    "MODEL_FUNCTION=prop_yes_dec",
                    "BASE_PATH=/deploy/deploy-another-test-706570-v1",
8
Release Notes
Version 5.0.0 - December 8, 2017 (Preview)
The following release is a major feature update, which is available in preview for
selected customers.

Version 4.2.2 - October 4, 2017


The following release is a minor feature update, ready for installation on the available
release channel.

Version 4.1.1 - September 20, 2017


The following release is a minor feature update, ready for installation on the available
release channel.

Version 4.0.1 - September 6, 2017


The following release is a major feature update, ready for installation on the available
release channel.

Version 3.9.1 - August 23, 2017


The following release is a minor feature update, available on the Stable release
channel.

Version 3.8.1 - August 9, 2017


The following release is a minor feature update, available on the Stable release
channel.

Version 3.7.1 - July 26, 2017


The following release is a minor feature update, available on the Stable release
channel.

Version 3.6.1 - July 13, 2017


The following release is a minor feature update, available on the Stable release
channel.

Version 3.5.1 - June 28, 2017


The following release is a minor feature update, available on the Stable release
channel.

Version 3.4.1 - June 15, 2017


The following release is a minor feature update, available on the Stable release
channel.


Version 3.3.1 - June 7, 2017


The following release is a minor feature update, available on the Stable release
channel.

Version 3.2.1 - May 31, 2017


The following release is a minor feature update, available on the Stable release
channel.

Version 3.1.1 - May 4, 2017


The following release is a major feature update. It is now available as version 3.1.1.
