Datascience Admin Guide
Administering Data Science Cloud
F11247-02
January 2019
Copyright © 2018, 2019, Oracle and/or its affiliates. All rights reserved.
This software and related documentation are provided under a license agreement containing restrictions on
use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your
license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify,
license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means.
Reverse engineering, disassembly, or decompilation of this software, unless required by law for
interoperability, is prohibited.
The information contained herein is subject to change without notice and is not warranted to be error-free. If
you find any errors, please report them to us in writing.
If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on
behalf of the U.S. Government, then the following notice is applicable:
U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software,
any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are
"commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-
specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the
programs, including any operating system, integrated software, any programs installed on the hardware,
and/or documentation, shall be subject to license terms and license restrictions applicable to the programs.
No other rights are granted to the U.S. Government.
This software or hardware is developed for general use in a variety of information management applications.
It is not developed or intended for use in any inherently dangerous applications, including applications that
may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you
shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its
safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this
software or hardware in dangerous applications.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of
their respective owners.
Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are
used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron,
the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro
Devices. UNIX is a registered trademark of The Open Group.
This software or hardware and documentation may provide access to or information about content, products,
and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly
disclaim all warranties of any kind with respect to third-party content, products, and services unless otherwise
set forth in an applicable agreement between you and Oracle. Oracle Corporation and its affiliates will not be
responsible for any loss, costs, or damages incurred due to your access to or use of third-party content,
products, or services, except as set forth in an applicable agreement between you and Oracle.
Contents
Preface
Documentation Accessibility v
Conventions v
1 Introduction
2 Architecture
Overview 2-1
Requirements and Limits 2-4
Supported Public Clouds 2-8
Docker Containers and Services 2-8
Security 2-9
3 Installation
Prerequisites 3-1
Airgap Installation 3-2
RHEL 7 Installation Instructions 3-7
4 Configuration
Admin Console Settings 4-1
LDAP Integration 4-2
SMTP Integration 4-6
Git Provider Integration 4-8
Environment Management 4-9
Enabling Hadoop and Spark 4-12
On-Demand Instances 4-16
SSO Configuration 4-18
5 Administration
Upgrading 5-1
User Management 5-8
Cluster Management 5-21
Database Administration 5-25
Monitoring and Logging 5-27
On-Demand Resources 5-32
Resource Management Dashboard 5-32
6 Troubleshooting
7 Appendix
Dockerfile Basics and Best Practices 7-1
Amazon AWS Examples 7-2
Google Compute Engine Examples 7-12
Environments and Dependencies 7-16
Changing Paths of Installed Components 7-18
Changing the SSL Certificate 7-21
Docker Commands Glossary 7-23
8 Release Notes
Preface
Administering Data Science Cloud describes how to install, configure, and administer
Data Science Cloud.
Topics
• Documentation Accessibility
• Conventions
Documentation Accessibility
For information about Oracle's commitment to accessibility, visit the Oracle
Accessibility Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.
Conventions
The following text conventions are used in this document:
Convention: Meaning
boldface: Boldface type indicates graphical user interface elements associated with an action, or terms defined in text or the glossary.
italic: Italic type indicates book titles, emphasis, or placeholder variables for which you supply particular values.
monospace: Monospace type indicates commands within a paragraph, URLs, code in examples, text that appears on the screen, or text that you enter.
1 Introduction
The DataScience.com platform combines the tools, libraries, and languages your team
loves with the infrastructure and workflows your organization needs.
2 Architecture
Learn about the Data Science Cloud architecture.
Topics
• Overview
• Requirements and Limits
• Supported Public Clouds
• Docker Containers and Services
• Security
Overview
Introduction
The DataScience.com Platform’s core infrastructure consists of three node types:
• A single master node
• One or more core nodes
• One or more worker nodes
Each of these node types runs a set of components or a set of containers that serve
various functions.
Most of these components communicate with one another internally; however, some
communicate with various external services like Quay and Datadog. For details such
as port numbers and egress points, see Requirements and Limits.
Components
The following components and their constituent containers make up the underlying
infrastructure of the DataScience.com Platform. We’ve included short descriptions of
each, as well as links to external documentation in the Further Reading section.
Node Provisioning
The Master node is automatically provisioned during the installation process; however,
Core and Worker nodes must be manually added to the cluster. For detailed
instructions on how to add or remove nodes, see Cluster Management.
Installation Paths
The following files and directories are created on the hosts during installation:
• systemd files (root:root 644)
– /usr/lib/systemd/system/docker.service
– /etc/systemd/system/replicated.service
– /etc/systemd/system/replicated-operator.service
– /etc/systemd/system/replicated-ui.service
• sysconfig files (root:root 644)
– /etc/sysconfig/replicated
– /etc/sysconfig/replicated-operator
• executables (root:root 755)
– /usr/bin/docker
– /usr/bin/docker-containerd
– /usr/bin/docker-containerd-ctr
– /usr/bin/docker-containerd-shim
– /usr/bin/dockerd
– /usr/bin/docker-init
– /usr/bin/docker-proxy
– /usr/bin/docker-runc
• utilities
– /var/lib/docker (root:root 755)
– /var/lib/replicated (replicated:docker 755)
– /var/lib/replicated-operator (replicated:docker 755)
• pid & process
– /var/run/docker (root:root 700)
– /var/run/replicated (root:root 755)
– /var/run/replicated-operator (root:root 755)
• logs
– /var/log/dscloud (root:root 755)
Users Created
The following users and groups are created on the hosts during installation:
• users
– replicated:x:1001:993::/home/replicated:/bin/bash
• groups
– docker:x:993:replicated
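These entries follow the standard passwd and group formats. As a post-install sanity check, you can parse the entry to confirm the UID, GID, and home directory; a minimal sketch (the entry string is the one shown above; on a live host you would fetch it with getent):

```shell
#!/bin/sh
# Sanity-check the installer-created account by parsing its passwd
# entry. On a live host you would fetch the entry with:
#   getent passwd replicated
entry="replicated:x:1001:993::/home/replicated:/bin/bash"
uid=$(echo "$entry" | cut -d: -f3)
gid=$(echo "$entry" | cut -d: -f4)
home=$(echo "$entry" | cut -d: -f6)
echo "uid=$uid gid=$gid home=$home"
# prints: uid=1001 gid=993 home=/home/replicated
```

The GID 993 matches the docker group created above, which is what allows the replicated user to talk to the Docker daemon.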
Further Reading
• Consul Architecture
• Datadog
• Docker Swarm
• Logspout
• Logstash
• Nginx
• PostgreSQL
• Registrator
• Replicated
Requirements and Limits
Required Information
To configure the DataScience.com Platform, you'll need certain information about your instance.
Host Requirements
Per host:
• RAM: 32GB
• CPU: 8 core
• Disk Space: 300GB
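A host can be checked against these minimums from the shell before installation; a minimal sketch for Linux hosts (checking the root filesystem for disk space is an assumption; adjust the path to wherever /var/lib/docker will live):

```shell
#!/bin/sh
# Compare this host's resources against the documented minimums.
# check_min LABEL ACTUAL MINIMUM prints "OK" or "LOW" for each check.
check_min() {
  if [ "$2" -ge "$3" ]; then
    echo "$1: OK ($2 >= $3)"
  else
    echo "$1: LOW ($2 < $3)"
  fi
}

ram_gb=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 / 1024 ))
cpus=$(nproc)
disk_gb=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')

check_min RAM  "$ram_gb"  32
check_min CPU  "$cpus"    8
check_min Disk "$disk_gb" 300
```

Run this on each host you plan to add to the cluster; any LOW line means the host falls short of the requirements above.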
Directory: Purpose
/var/lib/docker: Runtime environment for Docker containers and images
/var/lib/replicated: Storage for installation metadata
/var/log/dscloud: Log files for run script sessions and APIs
Supported Browsers
The DataScience.com Platform relies on native flexbox support, which requires the
following minimum versions:
• Apple Safari 10+
• Google Chrome 49+
• Microsoft Edge 14+
• Microsoft Internet Explorer 11+ (partial support; there are some known issues with flexbox)
• Mozilla Firefox 51+
• Opera 43+
Additional Software
The installation script for the DataScience.com Platform will automatically install the
correct version of docker-engine; please ensure this version is not overwritten by other
configuration management tools.
• docker 17.03-ce+
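To confirm that the installed Docker version still meets the 17.03 minimum, a version-aware sort comparison works without extra tooling; a sketch (the sample version strings are illustrative):

```shell
#!/bin/sh
# Guard against configuration management downgrading docker-engine:
# version_ok INSTALLED MINIMUM prints OK when INSTALLED >= MINIMUM,
# using sort -V (version-aware sort).
version_ok() {
  if [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -1)" = "$2" ]; then
    echo "OK"
  else
    echo "TOO OLD"
  fi
}
# On a real host: installed=$(docker --version | awk '{print $3}' | tr -d ',')
version_ok "17.03.2-ce" "17.03"   # prints: OK
version_ok "1.12.6"     "17.03"   # prints: TOO OLD
```

A plain string comparison would wrongly rank 1.12.6 above 17.03, which is why sort -V is used here.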
Port Configuration
The following ports should be opened between the specified sources and destinations.
“Administrative IP(s)” refers to the IP(s) from which Systems Administrators will need
to access the instance. “User IP(s)” refers to the IP(s) from which users of the
DataScience.com Platform will be accessing the application.
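When verifying these routes, you can probe a port from the relevant source host; a minimal sketch using bash's /dev/tcp redirection (the host and port probed here are placeholders; substitute your actual destinations):

```shell
#!/bin/sh
# Probe TCP reachability from a source host without nc or telnet,
# using bash's /dev/tcp redirection. probe HOST PORT prints
# "open HOST:PORT" or "closed HOST:PORT".
probe() {
  if timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open $1:$2"
  else
    echo "closed $1:$2"
  fi
}
probe 127.0.0.1 9   # the discard port is almost always closed
```

Run the probe from the Administrative or User IP side as appropriate, since a port may be open from one source and filtered from another.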
Caution
LDAP and SMTP Ports
For integrations such as LDAP and SMTP, we’ve provided the most commonly used
ports. Please confirm these ports with your service administrator(s).
For a real world example of this configuration, see Amazon AWS Examples.
Host: Purpose
d3dgvlzhobmmuy.cloudfront.net
• Ongoing access
• Delivers static site assets for the DataScience.com Platform
api.replicated.com
• Initial installation, upgrades, and ongoing access
• Checks for updates and syncs the license
get.datascience.com
• Initial installation and adding new nodes
• Houses the provisioning script for new nodes
install.datascience.com
• Initial installation and upgrades
• Houses the installation script for the Admin Console
hub.docker.com
• Initial installation and upgrades
• Houses some dependencies for the Admin Console and DataScience.com Platform
quay.io
• Initial installation and upgrades
• Houses the images required for the Admin Console and DataScience.com Platform
Caution
Installations Without Internet Egress
If your installation does not have egress to the internet, see Airgap Installation.
Important
Connecting to data sources
In addition to the previous information, ensure that routes are open between the
DataScience.com Platform and any data sources you plan to connect to.
Databases
For more detailed information on configuring PostgreSQL for use with the
DataScience.com Platform, see Database Administration.
• Engine: PostgreSQL 9.5.4
• RAM: 8GB
• CPU: 2 cores
• Disk Space: 300GB
• Max connections: 1000
• Schema: public
• Database name: platform
Git Providers
In order to create projects in the DataScience.com Platform, you must integrate with a
Git provider. We currently support the following:
• GitHub.com
• GitHub Enterprise 2.9+
• Bitbucket.org
• GitLab.com
• GitLab Enterprise 7+
SMTP Server
The Platform relies on SMTP for sending various notifications. It’s also required when
using built-in authentication, as initial user invitations and password reset emails are
sent using SMTP.
If you do not have an existing SMTP server, you may use a third party SMTP service.
The following services have been tested and verified as compatible with the
DataScience.com Platform:
• Amazon AWS SES
• FoxPass
• Mandrill
• SendGrid
LDAP Providers
The Platform offers optional LDAP integration for user authentication. It currently
supports the following providers:
• OpenLDAP
• Microsoft Active Directory
Limits
File upload/download limit: 200MB
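A pre-flight size check helps users catch oversized files before an upload fails against this limit; a sketch (the demo file path and size are illustrative):

```shell
#!/bin/sh
# Pre-flight check of a file against the 200MB upload limit, so users
# can catch oversized files before an upload fails.
LIMIT=$((200 * 1024 * 1024))
check_size() {
  size=$(stat -c%s "$1")
  if [ "$size" -le "$LIMIT" ]; then
    echo "ok: $1 ($size bytes)"
  else
    echo "too large: $1 ($size bytes)"
  fi
}
# demo with a 1MB scratch file
dd if=/dev/zero of=/tmp/demo.bin bs=1024 count=1024 2>/dev/null
check_size /tmp/demo.bin   # prints: ok: /tmp/demo.bin (1048576 bytes)
```

Files over the limit should be split or transferred by other means (for example, object storage) rather than through the Platform's upload mechanism.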
Supported Public Clouds
Google Compute Engine and SMTP: GCE does not currently support the use of
standard SMTP servers. It does, however, offer support for Google's own Gmail
service as well as several third-party providers, including SendGrid.
Useful Tools
• Ansible: Configuration management that uses human-readable YAML and an
agentless architecture
• Terraform: Infrastructure as code; supports a number of providers, including
Amazon AWS, Microsoft Azure, and Google Cloud
Security
Introduction
Data integrity is important to us, and because of that, the DataScience.com Platform
offers several features that ensure the security of your environment.
Product Features
• Single Tenant Architecture: Environments managed by the DataScience.com
team are deployed in single-tenant architectures, which provide secure, isolated
environments. Isolated environments prevent outside actors from discovering
sensitive information, and isolation increases fault tolerance between
environments.
• Containerization via Docker: Doing data science work necessitates working with
data, and reading data into a computing environment increases the risk of a data
breach. The DataScience.com Platform mitigates this risk through containerization.
Additional Security
As part of our commitment to building a secure platform, the DataScience.com
Platform undergoes regular assessments by an external security firm. These
assessments are conducted in accordance with OWASP and PTES standards, and
they allow us to quickly identify vulnerabilities in our software and infrastructure.
Configuration
• Logging and audit trails
• LDAP and Active Directory integrations
3 Installation
We’ve aimed to make installing the DataScience.com Platform as easy as possible,
but if you have any questions or run into problems, feel free to reach out to our
support team.
Airgapped Installations
Note that the following instructions require internet access from the host. To install the
DataScience.com Platform in an airgapped environment, start with our Airgap
installation instructions.
Red Hat Installations
If you are installing the DataScience.com Platform on RHEL, you must install a newer
version of Docker than is available from RHEL’s repositories. First follow the RHEL 7
Installation Instructions, then return to this guide.
Prerequisites
• Provisioned hosts that meet the minimum requirements
• Sudo access on the provisioned hosts
• Proxy address if the hosts require a proxy to access the internet
• LDAP information, if using; a full list of required details is in the LDAP Integration
section
• SMTP information; a full list of required details is in the SMTP Integration section
• Git provider information; more details can be found in the Git Provider Integration
section
• Postgres instance information:
– Database endpoint and port
– Database administrator credentials
– The platform database should be created prior to beginning the installation.
Instructions for creating the database can be found in our Database
Administration article.
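As a sketch of that pre-installation database step, the following assembles (but does not execute) a psql invocation that creates the platform database; the endpoint and admin username are placeholders, and the authoritative procedure is in Database Administration:

```shell
#!/bin/sh
# Assemble (but do not run) the psql command that pre-creates the
# platform database. Host, port, and admin user are placeholders.
DB_HOST="postgres.internal.example.com"
DB_PORT=5432
DB_ADMIN="dsadmin"
cmd="psql -h $DB_HOST -p $DB_PORT -U $DB_ADMIN -c \"CREATE DATABASE platform;\""
echo "$cmd"
```

Substitute your actual endpoint and credentials before running the printed command from a host with network access to the database.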
Installation
Rebooting the Master Node
If the Master node needs to be rebooted during the installation process, every effort
should be made to ensure it is rebooted cleanly to prevent corruption of the Admin
Panel’s internal database. If this isn’t possible, and the database does become
corrupted, re-run the installation script to restart the installation process.
Airgap Installation
Introduction
An airgapped environment is one that is isolated from potentially insecure networks
such as the internet. Installation of the DataScience.com Platform in these
environments involves a few more steps than on hosts with egress allowed to the
internet.
Requirements
• An airgap-enabled license for the DataScience.com Platform
• Sudo permissions on the hosts on which you’ll be installing the DataScience.com
Platform
• A way to transfer installation files onto the host(s)
• An instance of PostgreSQL that is reachable from the Platform cluster
• A Git provider that is reachable from the Platform cluster
• Ability for all hosts to communicate with one another as described in our
Networking section.
Installation
1. Provision the hosts needed for your installation. For a clustered installation
(recommended), you will need a minimum of three hosts.
2. Download the following items from a host with internet access, and then transfer
them to the Master node:
• Docker 17.03-ce+ and any dependencies, such as libltdl
• The tarball for installing the Platform Admin console
• The Platform airgap package, which contains all the images necessary to run
the DataScience.com Platform. (A unique link and password will be provided
to you by the DataScience.com team when you receive your license.)
3. Once you have transferred everything to the Master node, first install Docker and
its dependencies.
4. Next, run the following sequence of commands to install the Admin console:
Note:
If the installer is unable to detect the IP address of the host, it may
prompt you to choose a network interface. Generally, this will be a host
interface, not a Docker network interface.
5. Next, navigate to the Admin console in a web browser, either directly on the
Master node’s host at https://localhost:8800 or from another host with access
to the Master node.
6. Bypass your browser’s TLS warning if applicable, and either upload an SSL cert or
choose the option to use a self-signed cert.
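If you opt for a self-signed cert as a stopgap, one can be generated with openssl; a sketch (the CN and the output paths are placeholders; substitute your Platform hostname):

```shell
#!/bin/sh
# Generate a throwaway self-signed certificate and key. The CN and
# the output paths are placeholders; substitute your hostname.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=platform.example.com" \
  -keyout /tmp/admin-console.key -out /tmp/admin-console.crt 2>/dev/null
openssl x509 -in /tmp/admin-console.crt -noout -subject
```

Browsers will continue to warn about a self-signed cert; see Changing the SSL Certificate for replacing it with one from a trusted CA.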
9. Provide the full path to the .airgap file you transferred in step 2, then click
Continue.
10. The Admin Console will verify the airgap package. This may take a few minutes.
11. From here, you may proceed with the installation the same as you would in an
internet-routable environment, starting with step 5 of our Installation guide.
RHEL 7 Installation Instructions
Installing Docker 17 CE
If you are installing the DataScience.com Platform on RHEL 7, you will need to install
a newer version of Docker than is available from RHEL’s repositories. These
commands should be run on all hosts you plan to use for your installation.
4 Configuration
The DataScience.com Platform comes with an extensive number of customizations for
both Users and Administrators. This section will walk you through the custom settings
available in the Admin Console, SMTP and LDAP integration, and configuring the
Platform to seamlessly sync with your Git provider.
Admin Console Settings: This section includes settings like hostname and database
endpoints, as well as integrations with services like Datadog so you can ship usage
statistics to your existing provider. You can reach the admin panel by navigating to
https://yourplatformhostname.com:8800 in a browser.
LDAP Integration: The DataScience.com Platform integrates with your existing LDAP
or Active Directory instance. This will allow you to control access to the
DataScience.com Platform using existing users and groups.
SMTP Integration: Configuring the DataScience.com Platform to work with your
existing mail server allows the platform to send invitations to new users via email. In
addition to being a convenient way to add users, this also offers heightened security;
users create their own passwords, which means there is no need to transfer
credentials.
Git Provider Integration: Integration with your existing Git provider like GitHub or
GitLab allows Users to work from a central location and collaborate more easily on
projects.
Environment Management: Environments are customized, pre-installed collections of
dependencies and packages that can be created by Admins and distributed to Users
on the DataScience.com Platform.
Enabling Hadoop and Spark: Connect to an external Hadoop cluster from the
DataScience.com Platform to allow your users to analyze your Big Data.
On-Demand Instances: Instances of the DataScience.com Platform running in Amazon
AWS can use on-demand resources for workloads that exceed the available resources
on the cluster.
SSO Configuration: The DataScience.com Platform supports Single Sign On (SSO)
integration for both LDAP and SAML.
LDAP Integration
Introduction
In addition to a built-in user management system, the DataScience.com Platform
offers LDAP integration. This will allow you to control access to the Platform via
existing LDAP users.
Prerequisites
• A supported LDAP provider
• Hostname and port for the LDAP server
• Search user credentials
• LDAP schema:
– Base DN
– User search DN
– Restricted user group (optional)
– Username field
• Test user credentials (optional)
Configuration
1. From the Settings view in the Admin Console, click Authentication on the left-hand
menu.
3. You will be prompted to enter the LDAP settings necessary for the integration.
4. To verify your LDAP settings are correct, provide the credentials for the test user,
then click Test Credentials.
SMTP Integration
Introduction
Although you can install the DataScience.com Platform and create the administrator
user without SMTP integration, the integration is required in order to invite additional
users. Configuring SMTP integration also allows the Platform to automatically send
outputs via email.
Prerequisites
• The routable address and port for your SMTP server
• Credentials for a user with permission to log in to the SMTP server and send mail
• A desired “From” address for mail sent from the DataScience.com Platform
Configuration
1. From the Settings view in the Admin Dashboard, click Email Server Settings in the
left-hand menu.
2. Click the radio button next to Enable SMTP.
3. You will be prompted to provide SMTP settings necessary for the integration.
When you’re finished, test your configuration by clicking Test SMTP
Authentication.
Git Provider Integration
Supported Providers
• GitHub.com
• GitHub Enterprise 2.9+
• Bitbucket.org
• GitLab.com
• GitLab Enterprise 7+
Bitbucket Integration
The Bitbucket integration uses Bitbucket’s App Passwords feature to grant the
DataScience.com Platform access to your repositories. A Bitbucket App Password is
just like your account password, but meant for other apps to control Bitbucket on your
behalf.
Each user must bring their own Bitbucket App Password over to the DataScience.com
Platform. For end-user documentation on how to register a Bitbucket App Password,
see Files and Version Control > Authenticate.
To enable the Bitbucket integration, visit Settings > Git Providers, select your Bitbucket
type from the list of provider options, and fill in the details about your Bitbucket
account.
GitLab Integration
How users authenticate with GitLab depends on the version of GitLab they’ll be
connecting to. GitLab 7 and 8 use passwords for authentication, while GitLab 9 and
GitLab.com use access tokens.
Each user must bring their own GitLab access token over to the DataScience.com
Platform. For end-user documentation on how to register a GitLab access token, see
the Git Configuration section of our User documentation.
To enable the GitLab integration as an administrator, visit Settings > Git Providers.
Select GitLab from the list of provider options, then fill in the details about your GitLab
account.
Environment Management
Introduction
Environments are customized, pre-installed collections of dependencies and packages
that can be created by Admins and distributed to Users on the DataScience.com
Platform. Users require environments to be built in order to do work on the Platform.
Therefore, it is critical that an Admin create environments during the installation
process. Additional environments can be built anytime thereafter.
In this guide, you will learn how to build environments on behalf of Users.
Background
The DataScience.com Platform uses Docker to containerize workloads on your
instance. Docker containers allow Users to spin up isolated work environments that
have all of the software needed for their analysis pre-installed. A Docker container is a
running instance created from a Docker Image. Docker Images are immutable files
that define the runtime of containers. When a user runs a script or launches a Jupyter
session on the Platform, they are running a container created from a Docker Image
stored in an internal Docker Registry. The Environments feature allows Admin users to create their own
Docker Images and submit them to the Docker Registry by writing a Dockerfile within
the Platform interface.
If you are unfamiliar with Docker and Dockerfiles, check out Docker’s documentation
for more details. Please also refer to our Dockerfile Basics and Best Practices
documentation for several example Dockerfiles and best practices.
What is an Environment?
An environment is defined by a Dockerfile and is associated with metadata such as
name, description, and a README. There are two categories of environments: Base
and User.
On the DataScience.com Platform, all environments except for the Default Base
environment must inherit from a pre-existing Base environment. A Base environment
can be extended by both Base and User environments that inherit from it. On the other
hand, User environments cannot be used to seed other environments.
It is convenient to envision these relationships as an inheritance tree. Default Base is
the root node of this tree. Any Base or User environments that are subsequently built
are nodes branching off of it, with User nodes always representing leaf nodes. In this
analogy, the main difference between Base and User environments is that the former
can become parent nodes while the latter can only be a child node of a Base
environment (i.e., a leaf).
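Concretely, a User environment's Dockerfile simply starts FROM a Base environment image; a minimal sketch, where the registry path, tag, and package pin are all hypothetical:

```shell
#!/bin/sh
# Write a minimal User environment Dockerfile that inherits from a
# Base environment image. The registry path, tag, and package pin
# are all hypothetical.
cat > /tmp/Dockerfile.user-env <<'EOF'
FROM registry.internal:5000/base-environments/python-datasci:1.0
RUN pip install --no-cache-dir xgboost==0.80
EOF
head -1 /tmp/Dockerfile.user-env
```

Because the User environment only adds layers on top of the Base image, rebuilding it after a Base change is cheap; see Dockerfile Basics and Best Practices for fuller examples.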
Enabling Hadoop and Spark
Fill in the form with the cluster name, provider version, and additional security
information. Then, choose to enable Hadoop, Hive, and/or Spark by checking the
respective boxes. When you choose to enable a certain framework, you will be
prompted to add additional information in the form of configuration files that you will
upload into the form. In this next section, you will learn how to locate and acquire
these files.
Please gather configuration materials from edge nodes or client machines that have
successfully connected to the cluster. Cluster server nodes may not have the proper
configurations.
Currently, the DataScience.com Platform supports MapR versions 5.2.1 and 5.2.2 as
well as Cloudera version 5.10.
For more information about building environments, see our Environment Management
documentation.
After you have created an available Hadoop-enabled Base environment, create a User
environment from this new Base environment to enable your Users to connect to the
cluster.
Note:
The DataScience.com environment build tool only supports installing one
version of Spark at a time in a given environment. A Hadoop-enabled
environment will install either Spark 1.6.0 or the appropriate Spark 2.x
release according to the specified Hadoop provider distribution version.
Other Providers
Coming soon! First-class support for EMR and Hortonworks is in development. Don’t
see your provider? Contact success@datascience.com.
On-Demand Instances
Introduction
Instances of the DataScience.com Platform running in Amazon AWS can use on-
demand resources for workloads that exceed the available resources on the cluster.
Support for on-demand resources in Microsoft Azure and Google Compute Engine is
coming soon.
SSO Configuration
The DataScience.com Platform offers Single Sign On (SSO) integration for both LDAP
and SAML. The following sections outline necessary steps for an Admin to configure
the integration.
5 Administration
The DataScience.com Platform was designed with administrative simplicity in mind.
That’s why it’s built on familiar tools like Docker and Postgres, and comes with a built-
in Admin Console that does most of the heavy lifting.
This section gives you an overview of common administrative tasks such as upgrading
the platform and managing the cluster. We’ve also included instructions for things like
manually adding users in Postgres in the event that you lose administrative access to
the UI.
Upgrading: Shows you how to check for updates to the Platform and how to install
them.
User Management: Provides instructions for user management, both from within the
Platform and in Postgres.
Cluster Management: Lists everything you need to know about adding and removing
nodes from the cluster.
Database Administration: Shows you how to create the Platform’s database from the
command line using psql.
Monitoring and Logging: Directs you to the various monitoring and logging features
available in the Admin Console.
On-Demand Resources: Provides instructions for managing on-demand resources, as
well as hints for resolving common issues.
Resource Management Dashboard: Shows you how to use the Resource Management
Dashboard to see the health and availability of the compute hosts, what services are
running, and the amount of compute resources each service is using.
Upgrading
Introduction
Platform Restart Required
Because updates to the DataScience.com Platform require container restarts, we
highly recommend scheduling a maintenance window prior to performing an upgrade.
Container restarts could result in lost work and other unintended consequences.
Installing Upgrades
1. From the main Dashboard view in the Admin Console, you can see:
• Whether the Platform is up to date
• The last time the Admin Console checked for an update
• The current version of the Platform
2. If an update is available, the Check Now button will turn into a View Update button.
Clicking the View Update button will take you to the Release History view.
3. The Release History view will show you all versions available to you, both past
and present, and the date they were installed (if applicable). If an update is
available, the Release History view will display the release notes for that version.
5. After you’ve selected an update to install, Replicated will automatically pull images
for the update and push them to the appropriate nodes.
6. When the update has finished installing, the DataScience.com Platform should
automatically restart.
Airgapped Environments
To upgrade airgapped environments, you must first tell the Admin Console where to
look for new packages.
1. In the Admin Console, click the gear icon in the top right menu, then select
Console Settings.
3. The Update Path is the location the Admin Console will continuously check for
new Platform packages. This setting defaults to the directory from which you
uploaded the initial ‘.airgap’ package. If desired, change the update path.
4. If you’ve made changes, scroll to the bottom of the page and click Save.
User Management
Introduction
The DataScience.com Platform currently offers two methods of User authentication:
built-in and LDAP integration. This article focuses solely on User administration using
built-in authentication; for information about our LDAP integration, please see the
appropriate article on LDAP Integration.
Adding a User
1. While logged in to the Platform as an Administrator, navigate to Administration via
the top menu bar.
2. You will be directed to the Users and Teams tab. Next to Users in the left-hand
portion of the screen, click Add New to create a new User.
3. Provide the new User’s name and email address, and specify their role. When you
are finished, click Invite. This will send an email to the new User, prompting them
to accept the invitation and create a password.
Adding users with LDAP
If you manage your Users through LDAP, adding a User is equivalent to looking up the
User in the LDAP server. Clicking the + Add New button will open a modal where you
can search for the User and assign the User a permission as normal. The User must
exist in the active directory first for the DataScience.com Platform to be aware of the
User. The User will then use their LDAP username and password to log into the
DataScience.com Platform. Note: LDAP groups are currently not supported.
Creating a Team
1. While logged in to the Platform as an administrator, navigate to Administration via
the top menu bar.
2. You will be directed to the Users and Teams tab. Next to Teams in the right-hand
portion of the screen, click Add New to create a new Team.
3. On the following screen, provide a name for your new Team and then add Users
as collaborators. When you are finished, save the new Team by clicking Save.
Adding Users to a Team
1. While logged in to the Platform as an administrator, navigate to Administration via
the top menu bar.
2. You will be directed to the Users and Teams tab. Select the team you want to add
users to by clicking it.
3. You can edit team settings, including members, by clicking Edit Team.
4. Add the new team member by typing their name into the Add Members field.
When you’re finished, click Save.
To add users directly in the database, run an INSERT statement like the following
for each user, replacing the bracketed values:

INSERT INTO users (id, email, password, first_name, last_name, status, created_at, updated_at, role)
VALUES ('{id}', '{email}', '{pass_hash}', '{first_name}', '{last_name}', 'ACTIVE', current_timestamp, current_timestamp, '{role}');
3. To verify that the users have been added successfully, run the following command
for each user and confirm the results. Remember to replace {user_email} with the
user’s email address.
SELECT * FROM users WHERE email = '{user_email}';

For example:

SELECT * FROM users WHERE email = 'user@domain.com';
3. Use the user ID obtained from step 2 to query the reset_invitations table and
obtain that user’s unique token. For example:
SELECT * FROM reset_invitations WHERE id = 1;
You can then use the user’s token to craft an invitation link for them.
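As a hypothetical sketch only (the hostname and URL path below are placeholders I have assumed for illustration, not a documented Platform endpoint — check your installation for the actual invitation URL format):

```shell
# Hypothetical example: hostname and path are placeholders, not a
# documented endpoint. Substitute the token returned by the query above.
PLATFORM_URL="https://platform.example.com"
TOKEN="abc123"
INVITE_LINK="${PLATFORM_URL}/invitations/accept?token=${TOKEN}"
echo "$INVITE_LINK"
```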
Cluster Management
Introduction
Downtime Required
We highly recommend adding and removing nodes during maintenance windows or
other quiet periods, as you will need to restart the DataScience.com Platform.
Although downtime should be brief, there is still a risk of users losing unsaved work.
Adding Nodes
Adding a node to the cluster is relatively straightforward; however, we recommend that
you remind users to save their work, as you will need to restart the Platform.
In most cases, after the initial installation, you’ll be adding Worker nodes in order to
expand the pool of standing resources. However, the instructions below are applicable
to any node type.
2. In the Admin Console, navigate to the Cluster view using the menu in the upper
right-hand corner.
3. Provision the additional node(s) first by obtaining the daemon token. This is found
in the Cluster view.
4. Run the following command on each of the new hosts. Remember to replace
{master_node_ip}, {daemon_token}, {new_host_priv_ip}, {new_host_pub_ip},
and {tags} with the appropriate values.
Tags
The {tags} option accepts tag names as comma-separated values, e.g.
Worker,Common. There should be no spaces between tag names. Please make
sure your tags are correct before running this command. If you provision a node
that is tagged incorrectly, you will need to destroy and recreate it.
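Because a mistagged node must be destroyed and recreated, it can be worth sanity-checking the {tags} value before provisioning. A minimal sketch ("Worker,Common" is an example value):

```shell
# Check that the tag string is comma-separated with no spaces.
tags="Worker,Common"
case "$tags" in
  *" "*) tags_ok="no"  ;;   # a space anywhere makes the value invalid
  *)     tags_ok="yes" ;;
esac
echo "tags_ok=$tags_ok"
```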
5. Once the new node has been added, you may restart the application.
Removing Nodes
1. When removing Core nodes, you must first stop the DataScience.com Platform. If
you’re removing a Worker, this step is not necessary. To stop the Platform,
navigate to the Dashboard view in the Admin Console and click Stop Now.
2. Once the Platform has stopped, stop the ‘replicated-operator’ service. Note that
the example below is applicable only for distributions using upstart; yours may
differ.
$ stop replicated-operator
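On distributions using systemd rather than upstart, the equivalent would be:

```
$ sudo systemctl stop replicated-operator
```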
3. On the Master node, run the following command. You can find the node’s ID in the
Cluster view in the Admin Console or by running ‘docker node ls’.
$ docker node rm {node_id}
4. Finally, in the Cluster view of the Admin Console, remove the node by clicking the
trash icon next to the disconnected node.
5. Once the node has been removed, you may restart the Platform.
Database Administration
Introduction
This article provides some notes on common PostgreSQL configuration pitfalls and
instructions for creating the initial DataScience.com Platform database.
Prerequisites
• An existing instance of PostgreSQL — please see our requirements section for
sizing suggestions
• Credentials for a database user who has permissions to create new databases
and who is an administrator of the public schema
• Access to psql or some other PostgreSQL management tool, either locally or on
the database host itself
Basic Configuration
Configuration File Locations
PostgreSQL configuration file locations may differ based on your Linux distribution. To
find postgresql.conf, you can query the database with SHOW config_file. Normally,
pg_hba.conf are in the same location but you can verify that by finding its default path
in postgresql.conf.
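For example, from psql or any other client connection (SHOW hba_file reports the pg_hba.conf location directly):

```
$ psql -U postgres -c 'SHOW config_file;'
$ psql -U postgres -c 'SHOW hba_file;'
```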
Stand-By Configuration
PostgreSQL offers stand-by replication for high availability. Essentially, this setup
maintains a replica of the database on a secondary server, which can be used as a
read-only failover should something happen to the primary PostgreSQL host.
As a first step, create two PostgreSQL servers and configure them using the previous
steps in Basic Configuration.
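As a sketch of the typical settings for 9.x-era streaming replication (parameter names as of PostgreSQL 9.5; the host, user, and password values are placeholders, and you should consult the PostgreSQL documentation for your version):

```
# postgresql.conf on the primary
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 64

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host={primary_host} user={replication_user} password={password}'
```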
Database Backups
Database backups are an essential component of a disaster-recovery plan. If your
PostgreSQL is hosted on a cloud-provider managed service such as AWS RDS or
Google Cloud SQL, you should enable regular backups on a schedule that suits your
disaster recovery requirements.
If you are self-managing your PostgreSQL server, a simple backup example is outlined
below. More robust backup configurations could include running the backup from a
different server, scheduling the backup as a cron job, or archiving and removing old
backup files.
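A minimal sketch of such a backup, assuming a database named datascience (the host and user are placeholders; substitute your own):

```
$ pg_dump -h {db_host} -U {db_user} -Fc datascience > datascience-$(date +%Y%m%d).dump
```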
In the event of a database failure, you would restore with a command similar to:
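Assuming a custom-format pg_dump archive like the one above, a restore might look like:

```
$ pg_restore -h {db_host} -U {db_user} -d datascience --clean datascience-20190101.dump
```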
Further Reading
• Installing and configuring PostgreSQL
• Using the ‘psql’ command
Monitoring and Logging
Monitoring Resources
You can check basic usage information at any time using the graphs displayed in the
Dashboard view of the Admin Console.
You can also send usage metrics to your existing Datadog account. For detailed
instructions on how to configure this integration, please see the Datadog section of our
Settings article.
Logging
The Admin Console allows you to download logs as a support bundle.
1. In the Admin Console, navigate to the Support view using the menu at the top right
corner.
3. The Admin Console will generate your support bundle in ‘.tar.gz’ format. This may
take a few minutes, depending on how many logs have been generated.
4. Unzip and untar the file. Its contents will include stdout and stderr for all
containers, configuration details of all nodes in the cluster, daemon logs and
configurations, etc.
Audit Trails
Any major changes made in the Admin Console are tracked in the Audit Log.
Examples of these changes are application stops and starts, upgrades, and support
bundle downloads.
2. In this view, you can see a list of recent major changes, including the date and
time the changes were made.
On-Demand Resources
Introduction
This article contains instructions for managing on-demand resources, as well as hints
for how to resolve common issues.
Resource Management Dashboard
Within the Resource Management Dashboard, you will see two tabs: Hosts and
Services. Hosts corresponds to the instances serving the Platform. Information such
as IP address, Host ID, Status, Tags, Memory Usage, CPU Usage, Services, and
Largest Available are available at the host level. The definitions for each field follow:
• Host ID: Corresponds to the Docker swarm node ID
• IP Address: The IP address of the host
• Status: Corresponds to the Docker status of these nodes. Four values are
possible: Unknown, Down, Ready, and Disconnected. Most often, Ready or Down
will be displayed.
• Tags: Tags assigned to the host. Current values are: “-” and On-demand, to
distinguish between standing pool and on-demand hosts.
• Memory Usage: There are two figures here: X/Y GB. X represents the memory
currently reserved for user run services (e.g. Jupyter sessions, runs, deployed
APIs, etc.) and Y represents the total amount of memory available on the host.
The same convention applies to CPU Usage. Note that we adopted the following
color scheme:
– green: any size container can be spawned on the host
– yellow: only a subset of container sizes can be spawned on the host
– red: nothing can be spawned on the host
• Services: The number of platform services that are currently running on this host
• Largest Available: The largest Docker container size currently available on this
host; such a container can host Jupyter sessions, deployed APIs, scheduled runs,
etc.
If you click any Host ID, you can see the user services that are running on the host.
You can click on each service and reach the corresponding service page for more
details. The Status field can take the following values: Created, Restarting, and
Running. The state Running follows the Created or Restarting states.
When you click on the Services tab, you can see a list of all the services that are
currently running. Current services include Jupyter sessions, Deployed APIs, RStudio
sessions, Shiny Apps, Script Runs, and Scheduled Runs.
For each service, you can click on the Service Name and reach the Service page. You
can also remove the service by clicking on the Remove option that will appear next to
the Host ID.
Mem and CPU are the amount of memory and number of CPUs reserved for this
particular service.
6
Troubleshooting
Introduction
These articles address some of the more common issues you may experience with the
DataScience.com Platform. However, if you can’t find the answer to your question
here, you’re always welcome to reach out to our support team at
support@datascience.com.
7
Appendix
This section includes various real-world examples of configurations like environments,
security groups, and databases. We’ll be adding to this section frequently, so be sure
to check back often.
• Dockerfile Basics and Best Practices: Best practices and examples of Dockerfiles
for building environments
• Amazon AWS Examples: Examples of security group and RDS configurations in
Amazon AWS
• Google Compute Engine Examples: Examples of GCE configuration options
• Environments and Dependencies: Environments from the User’s perspective,
including where to find information, choosing an environment, and adding
dependencies
• Changing Paths of Installed Components: Options for administrators who wish to
use alternate mount points or paths
• Changing the SSL Certificate: Instructions for updating the SSL certificate used by
both the DataScience.com Platform and the Admin Console
• Docker Commands Glossary: Helpful tips for troubleshooting the Platform with
Docker commands
Best Practices
• If you reference files in your instructions, use relative paths, not absolute paths.
• Conda and R don’t play well together. If you intend to have cross-dependencies
between R and Python, avoid using Conda. Instead, install Python, R, and their
respective libraries via pip, R, and apt-get commands.
Environment Dockerfiles are based on the Debian 8.5 Linux distribution. This
means you can only use Debian’s system package manager, apt-get. Other
package managers, such as yum, will not work.
• Make sure that your Python and R interpreters are in your PATH. The version that is
in your PATH will be executed. If you have installed multiple versions of the Python
interpreter (e.g. Python 2.7, Python 3.6), make sure you activate the right one in
your PATH. The same goes for R.
• Generally, we recommend having separate environments for Python 2 and 3.
• Take advantage of image “inheritance”. Build base images that could be used for
other Base or User environments. Avoid creating images with very intricate sets of
dependencies by breaking them into smaller images. This will help with
debugging.
• Read the build logs very carefully. Sometimes installation errors will occur, yet the
image build could still be successful.
Dockerfile Basics
This section outlines the basics of writing Dockerfiles. If you are already familiar with
Dockerfiles, you may skip this section and proceed to the next one.
Below is a description of all Dockerfile instructions currently supported for Base and
User environments.
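As an illustrative sketch only (the base image tag and package names below are placeholder assumptions, not the Platform's actual base image), a minimal environment Dockerfile following the practices above might look like:

```dockerfile
# Hypothetical example; the FROM line and package names are placeholders.
FROM debian:8.5

# Install system dependencies with apt-get, the only supported
# package manager, and clean up the package index afterwards.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies via pip.
RUN pip3 install numpy pandas
```

Keeping base images small like this makes them easier to reuse as parents for other Base or User environments.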
Amazon AWS Examples
Required Resources
An AWS installation that follows these examples presupposes that the following AWS
resources exist and are configured correctly:
• An AWS account
• A VPC with subnets, route tables, and internet and/or NAT gateways configured to
allow unrestricted egress to the internet. It is also possible to allow egress to the
internet with direct connect through another data center.
• Either a root account or an IAM user account with privileges to create, edit, and
delete EC2 instances, EBS volumes, security groups, IAM users, IAM policies,
SES users, AMIs, tags, and keypairs. For troubleshooting purposes, it is ideal to
have at least view access to VPC, subnet, and route tables as well.
• An AMI maintained by your organization that meets the minimum requirements
and includes the SSH keys of the person or team that will be conducting the
installation. If you do not currently maintain an AMI that meets the requirements,
DataScience.com can offer recommendations that can be adapted to fit your
needs.
Security Groups
Unless your DataScience.com Platform installation must meet specific compliance
requirements, we suggest installing all infrastructure components (Master host, Core
hosts, Worker hosts, and the database) in the same security group, with all TCP traffic
allowed between nodes. You may need to open additional ports in order to connect to
your data sources.
For more information about the specific ports the Platform uses to communicate,
please see our Networking section.
Subnet
For on-demand instances to start and launch properly, they should originate from a
private subnet. Once you have created a private subnet and configured its routing, on-
demand instances route properly through the public subnet.
1. Create a subnet that doesn’t overlap with the main/public subnet.
2. Create a NAT gateway (Create New EIP) that resides in the main/public subnet.
3. Create a route table for the private subnet. Add route 0.0.0.0/0 and select the NAT
gateway that was created in the previous step.
4. Associate the private subnet with the route table that was created in the previous
steps.
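The console steps above can also be sketched with the AWS CLI. All resource IDs below (vpc-…, subnet-…, eipalloc-…, nat-…, rtb-…) are hypothetical placeholders; substitute the IDs from your own account:

```
$ aws ec2 create-subnet --vpc-id vpc-0abc1234 --cidr-block 10.0.2.0/24
$ aws ec2 allocate-address --domain vpc
$ aws ec2 create-nat-gateway --subnet-id subnet-public1 --allocation-id eipalloc-0abc1234
$ aws ec2 create-route-table --vpc-id vpc-0abc1234
$ aws ec2 create-route --route-table-id rtb-0abc1234 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0abc1234
$ aws ec2 associate-route-table --route-table-id rtb-0abc1234 --subnet-id subnet-private1
```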
RDS
Amazon AWS’s RDS managed database service is a convenient way to set up a
PostgreSQL server. Below are some minimal suggestions for launching RDS in the
AWS console. Note that this example does not include provisioned IOPS, which are a
recommended feature for production installations.
In these examples, you should substitute the security group you plan to use with your
Master, Core, and Worker servers in the Network & Security section.
Google Compute Engine Examples
Supply your Google email address as the SMTP username, and the app password as
the SMTP password. The From Address should match your Google email address.
Database Configuration
Create a GCE database using PostgreSQL.
When configuring the database instance, allow access from your Master node. Note
that in GCE, you may only whitelist access from the Master node’s public IP address.
Once the PostgreSQL instance has been created, you may create the necessary
Platform database using GCE’s web UI.
Environments and Dependencies
Launching Environments
To run analyses or create outputs on the Platform, you can launch Docker containers
to host your work. When spawning containers, you can configure the environment that
you want to run. Choose an environment from the dropdown menu. Only environments
that are available for your tool will be available to select and run.
Notice that the form above points to a text file called requirements.txt. While you can
name that file anything you want, it must be formatted with one package name per
line. The apt-get and R installers accept only package names and install the latest
stable version. The pip installer accepts either a package name or a version-locked
name. For pip only, comments in the requirements file are valid syntax.
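A sketch of what such a pip requirements file might contain (the package names and versions are examples):

```
# requirements.txt — comments like this are valid for pip only
numpy
pandas==0.23.4
scikit-learn>=0.19
```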
Changing Paths of Installed Components

To relocate the Docker and Replicated data directories onto another mount point,
stop Docker, move each directory, and symlink the original path to the new location:

systemctl stop docker
mv /var/lib/docker /datascience/var/lib/docker
ln -s /datascience/var/lib/docker /var/lib/docker
mv /var/lib/replicated /datascience/var/lib/replicated
ln -s /datascience/var/lib/replicated /var/lib/replicated
systemctl start docker
If space on your root volume is extremely limited, you may also wish to move the
internal logging directory. Use the full path to the logging directory you wish to use,
including a trailing slash.
It is also possible to use a volume mount for your Postgres server(s). This example
shows the data directory for Postgres 9.5 on RHEL being mounted on a volume in /
apps/datascience. You may adjust as needed for packages/distros that use data
directories other than /var/lib/pgsql. It’s also possible to do this after installing
Postgres if you mv the directory.
mkdir -p /apps/datascience/var/lib/pgsql
ln -s /apps/datascience/var/lib/pgsql /var/lib/pgsql
yum install -y https://download.postgresql.org/pub/repos/yum/9.5/redhat/rhel-7-x86_64/pgdg-redhat95-9.5-2.noarch.rpm
yum list postgresql*
yum install -y postgresql95-server
Changing the SSL Certificate
1. To update the cert, first navigate to the Admin Console and click the gear icon in the
top right corner. In the dropdown menu, click Console Settings.
2. From here, you can choose to use a self-signed certificate, provide the path to a
cert and key file already on the Master node, or upload the cert and key file.
3. Once you have updated the SSL cert, scroll to the bottom of the settings page and
click Save.
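Before uploading a cert and key, it can help to confirm they actually belong together by comparing their moduli with openssl. A minimal sketch (the file names are examples; here a throwaway self-signed pair is generated for demonstration — with your real files, skip the generation step):

```shell
# Generate a throwaway self-signed pair for demonstration only.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=example" \
    -keyout key.pem -out cert.pem -days 1 2>/dev/null

# A cert and key that belong together share the same RSA modulus.
cert_mod=$(openssl x509 -noout -modulus -in cert.pem | openssl md5)
key_mod=$(openssl rsa -noout -modulus -in key.pem | openssl md5)
[ "$cert_mod" = "$key_mod" ] && echo "cert and key match"
```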
Docker Commands Glossary
docker info
What it’s for: High-level Docker daemon information, including installed version,
available memory and CPU
When to use it: If you want to confirm which Docker version you are running,
especially after initially installing it, or upgrading the OS
$ docker info
Containers: 33
Running: 13
Paused: 0
Stopped: 20
Images: 215
Server Version: 17.06.0-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 609
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
docker ps
What it’s for: Showing currently running containers on your host. Each row will
include a unique container ID, the image from which it was launched, a user-friendly
name for the container, the shell command that runs when the container launches,
timing data about the launch, and which ports are forwarded between host and the
container.
When to use it: Verifying that the container components of the DataScience.com
Platform are running.
docker logs
What it’s for: To obtain logs for system-level containers
When to use it: Most commonly used during initial installation, this command will
output everything logged to a container since it started running.
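For example (the container ID is a placeholder; find real IDs with docker ps):

```
$ docker logs --tail 100 {container_id}
```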
docker service ls
What it’s for: Unlike ‘docker ps’, which is specific to the current host, this command
shows services running on all nodes in your cluster. Whereas ‘docker ps’ will show
system-level containers, this will show all the sessions launched by your users. Most
of this data is exposed to you in the resources dashboard. Note that the service ID will
be different from the container ID that you see in ‘docker ps’.
When to use it: If you suspect a user’s session has failed for reasons that aren’t
obvious in either the session logs or the resource dashboard, you can look at the
service list to get more information.
$ docker service ls
ID            NAME                                       MODE        REPLICAS  IMAGE
0e6fwj479kvi  deploy-another-test-706570-v1              replicated  1/1       10.20.1.144:9874/another-test-706570:1
jb7hwui2mtuu  jupyter-another-test-repo-jupyter-1915-v1  replicated  0/1       10.20.1.144:9874/datascience-jupyter-python2:2.3.1
Inspecting one of these services (for example, with docker service inspect) returns
additional detail like the following (output truncated):

{
  "ID": "0e6fwj479kvind6giq3tx6s5d",
  "Version": {
    "Index": 319046558
  },
  "CreatedAt": "2017-08-02T05:42:04.412705186Z",
  "UpdatedAt": "2017-08-02T05:42:04.418101318Z",
  "Spec": {
    "Name": "deploy-another-test-706570-v1",
    "TaskTemplate": {
      "ContainerSpec": {
        "Image": "10.20.1.144:9874/another-test-706570:1",
        "Labels": {
          "SERVICE_8001_NAME": "deploy-another-test-706570-v1",
          "SERVICE_CHECK_HTTP": "/deploy/deploy-another-test-706570-v1/status",
          "SERVICE_CHECK_INTERVAL": "30s",
          "SERVICE_NAME": "deploy-another-test-706570-v1",
          "SERVICE_TAGS": "deploy"
        },
        "Env": [
          "MODEL_FILE=problematic_script.py",
          "MODEL_FUNCTION=prop_yes_dec",
          "BASE_PATH=/deploy/deploy-another-test-706570-v1",
8
Release Notes
Version 5.0.0 - December 8, 2017 (Preview)
The following release is a major feature update, which is available in preview for
selected customers.