
Data Storage Services

Architecting with Google Compute Engine

CLOUD STORAGE, CLOUD SQL, CLOUD DATASTORE

Every application needs to store data, whether it's business data, media to be
streamed, or sensor data from devices. Sometimes we characterize data by the three
Vs: variety, which refers to how similar, structured, or variable the data is; velocity, or
essentially how fast the data arrives; and volatility, which refers to how long the
data retains value and therefore needs to be accessible.
From an application-centered perspective, the technology used to store and retrieve
the data, whether it's a database or an object store, is less important than whether that
service supports the application's requirements for efficiently storing and retrieving the
data, given its characteristics.

When you take a look at Google's offering for data storage services, there are a lot of
services to choose from. In this module, we will cover Cloud Storage, Cloud SQL,
Cloud Spanner, Cloud Datastore, and Cloud Bigtable. To get you more comfortable
with these services, you will get to apply them in three labs.
Let me start by giving you a high-level overview of these different services.
Data storage services

Service         | Capacity   | Access metaphor             | Read                           | Write                          | Update granularity              | Usage
Cloud Storage   | Petabytes+ | Like files in a file system | Have to copy to local disk     | One file                       | An object (a "file")            | Store blobs
Cloud SQL       | Terabytes  | Relational database         | SELECT rows                    | INSERT row                     | Field                           | No-ops SQL database on the cloud
Cloud Spanner   | Petabytes  | Globally scalable RDBMS     | Transactional reads and writes | Transactional reads and writes | SQL, schemas, ACID transactions | Strong consistency, high availability
Cloud Datastore | Terabytes  | Persistent Hashmap          | Filter objects on property     | Put object                     | Attribute                       | Structured data from App Engine apps
Cloud Bigtable  | Petabytes  | Key-values, HBase API       | Scan rows                      | Put row                        | Row                             | No-ops, high throughput, scalable, flattened data
BigQuery        | Petabytes  | Relational                  | SELECT rows                    | Batch/stream                   | Field                           | Interactive SQL* querying; fully managed data warehouse

This table shows the different services that I just mentioned and highlights attributes
such as the capacity, access metaphor, and usage. I am not going to cover the whole
table right now as I will cover each column as we cover each service.
Speaking of columns, I also listed another service on the right side which is BigQuery.
I grayed this service out as it sits on the edge between data storage and data
processing. You can store data in BigQuery, but the usual reason to do this is to use
BigQuery's big data analysis and interactive querying capabilities. For this reason,
BigQuery is covered later in the course.
https://cloud.google.com/storage-options/

Storage and database decision chart

Is your data structured?
  No  → Cloud Storage
  Yes → Is your workload analytics?
          Yes → Do you need updates or low latency?
                  Yes → Cloud Bigtable
                  No  → BigQuery
          No  → Is your data relational?
                  No  → Cloud Datastore
                  Yes → Do you need horizontal scalability?
                          Yes → Cloud Spanner
                          No  → Cloud SQL

Now, if text-heavy tables aren’t your thing, I also added this decision tree to help you
identify the solution that best fits your application. Let’s walk through this together:
First, ask yourself if your data is structured. If it’s not, choose Cloud Storage.
If your data is structured, ask if your workload focuses on analytics. If it does, you will
want to choose Cloud Bigtable or BigQuery, depending on whether you need updates
or low latency.
Otherwise, check if your data is relational. If it’s not relational, choose Cloud
Datastore.
If it is relational, you will want to choose Cloud SQL or Cloud Spanner, depending on
your need for horizontal scalability.
Now, depending on your application, you might use one or several of these services to
get the job done.

For more information, including mobile solutions, see
https://cloud.google.com/storage-options/

Scope

Infrastructure Track:
• Infrastructure
• Service differentiators
• When to consider using each service
• Basic knowledge for setting up and connecting to a service
• Administration tasks

Data Engineering Track:
• How to use a database system
• Design, organization, structure, schema, and use for an application
• Details about how a service stores and retrieves structured data

Before we dive into each of the data storage services, let’s define the scope of this
module.
The purpose is to understand, from an infrastructure perspective, what services are
available and when to consider using them.
So, we'll be covering things like relational databases, Bigtable, and NoSQL databases,
and certain circumstances in which you might say, this sounds very data-centric.
Which it is, but from an infrastructure standpoint, you need to understand what those
service differentiators are and when you would decide which platform to bring into
your organization and support.
Now, you may not know all the intricacies, for example, how to run complex queries,
optimizations, etc. But you need to have the basic knowledge for setting everything
up, connecting to the different services, and being able to audit and make sure that
everything is online all the time.
I recommend looking into the Data Engineering courses if you want a deeper dive
into the design, organization, structure, and schemas, and details on exactly how data
is optimized, served, and stored within those different services.

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

Let’s start with Google Cloud Storage.



Data storage services

Service         | Capacity   | Access metaphor             | Read                           | Write                          | Update granularity              | Usage
Cloud Storage   | Petabytes+ | Like files in a file system | Have to copy to local disk     | One file                       | An object (a "file")            | Store blobs
Cloud SQL       | Terabytes  | Relational database         | SELECT rows                    | INSERT row                     | Field                           | No-ops SQL database on the cloud
Cloud Spanner   | Petabytes  | Globally scalable RDBMS     | Transactional reads and writes | Transactional reads and writes | SQL, schemas, ACID transactions | Strong consistency, high availability
Cloud Datastore | Terabytes  | Persistent Hashmap          | Filter objects on property     | Put object                     | Attribute                       | Structured data from App Engine apps
Cloud Bigtable  | Petabytes  | Key-values, HBase API       | Scan rows                      | Put row                        | Row                             | No-ops, high throughput, scalable, flattened data
BigQuery        | Petabytes  | Relational                  | SELECT rows                    | Batch/stream                   | Field                           | Interactive SQL* querying; fully managed data warehouse

On the left-hand side you can see Cloud Storage is really designed for scale,
specifically petabytes and even exabytes of capacity. Some like to think of this as files
in a file system but it’s not really a file system, it's just simply a bucket and you're
going to drop objects in that bucket. You can create directories so to speak but really
a directory is just another object that points to different objects in the bucket.
You’re not going to be able to easily index all of these files like you would in a file
system; none of that is really tracked. You simply have a specific URL to access
each object.

Storage classes

Class          | Feature        | Design patterns                                                            | Availability | Durability    | Duration       | Retrieval cost
Regional       | Regional       | Data that is used in one region or needs to remain in region              | 99.9%        | 99.999999999% | Hot data       | None
Multi-Regional | Geo-redundant  | Data that is used globally and has no regional restrictions               | 99.95%       | 99.999999999% | Hot data       | None
Nearline       | Backup         | Backups; data that is accessed no more than once a month                  | 99.0%        | 99.999999999% | 30-day minimum | $
Coldline       | Archived or DR | Archival or disaster recovery (DR) data accessed no more than once a year | 99.0%        | 99.999999999% | 90-day minimum | $$

When it comes to Cloud Storage there are four different types of storage classes:
Regional, Multi-regional, Nearline and Coldline.
Regional Storage enables you to store data at lower cost, with the trade-off of data
being stored in a specific regional location, instead of having redundancy distributed
over a large geographic area. So for example, your data may be stored in us-central1,
europe-west1 or asia-east1.
Multi-Regional Storage, on the other hand, is geo-redundant, which means Cloud
Storage stores your data redundantly in at least two geographic locations separated
by at least 100 miles within the multi-regional location of the bucket. Therefore,
Multi-Regional Storage can be placed only in multi-regional locations, such as the
United States, the European Union, or Asia, not specific regional locations like
us-central1 or asia-east1.
Multi-Regional Storage is appropriate for storing data that is frequently accessed,
such as serving website content, interactive workloads, or data supporting mobile and
gaming applications. Whereas, regional storage is appropriate for storing data in the
same regional location as Compute Engine instances or Kubernetes Engine clusters
that use the data, providing you better performance for data-intensive computations.
Nearline storage is a low-cost, highly durable storage service for storing infrequently
accessed data. This storage class is a better choice than Multi-Regional Storage or
Regional Storage in scenarios where you plan to read or modify your data on average
once a month or less. For example, if you want to continuously add files to Cloud
Storage and plan to access those files once a month for analysis, Nearline Storage is
a great choice.
Coldline Storage is a very-low-cost, highly durable storage service for data
archiving, online backup, and disaster recovery. However, unlike other "cold" storage
services, your data is available within milliseconds, not hours or days. Coldline
Storage is the best choice for data that you plan to access at most once a year, due to
its slightly lower availability, 90-day minimum storage duration, costs for data access,
and higher per-operation costs. For example, use Coldline if you want to archive data
or need access to it only in the event of a disaster.
Now, let’s focus on durability and availability. All of these storage classes have 11
nines of durability, but what does that mean? Does that mean you have access to
your files at all times? No. What it means is that you won't lose data. You may still be
unable to access the data, which is like going to your bank and saying, well, my money
is in there; it's 11 nines durable. But when the bank is closed, I don't have access to it.
That's availability, and it differs between the storage classes.

Cloud Storage overview

Buckets
• Naming requirements
• Cannot be nested
Objects
• Inherit storage class of bucket when created
• No minimum size; unlimited storage
Access
• gsutil command
• (RESTful) JSON API or XML API

So Cloud Storage is broken down into a couple of different items here. First of all,
there are buckets which are required to have a globally unique name and cannot be
nested.
The data that you put into those buckets are objects that inherit the storage class of
the bucket and those objects could be text files, doc files, video files, etc. There is no
minimum size to those objects and you can scale this as much as you want as long
as your quota allows it.
To access the data, you can use the gsutil command in Cloud Shell or either the
JSON or XML APIs.
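As a minimal sketch of the gsutil workflow, assuming a hypothetical bucket name
my-unique-bucket (bucket names must be globally unique):

# Create a bucket (names must be globally unique)
gsutil mb gs://my-unique-bucket
# Upload a local file as an object
gsutil cp ./report.txt gs://my-unique-bucket/report.txt
# List and download objects
gsutil ls gs://my-unique-bucket
gsutil cp gs://my-unique-bucket/report.txt ./report-copy.txt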

Changing default storage classes

Multi-Regional → Nearline or Coldline; Regional → Nearline or Coldline

● You can change the default storage class of a bucket.
● The default class is applied to objects as they are created in the bucket.
● The change only affects new objects added after the change.
● A Regional bucket can never be changed to Multi-Regional.
● A Multi-Regional bucket can never be changed to Regional.
● Objects can be moved from one bucket to another bucket with the same storage
class from the GCP Console; however, moving objects to buckets of different storage
classes requires using the gsutil command from Cloud Shell.

Now, once you have created a bucket, you might want to change the storage type of
that bucket. You can change the default storage class of a bucket, but you can't
change a regional bucket to a multi-regional one or vice versa. As the slide illustrates,
multi-regional can be changed to Coldline or Nearline, and regional can be changed
to Coldline or Nearline as well.
You can also move objects from one bucket to another using the GCP Console;
however, moving objects to buckets of different storage classes requires using the
gsutil command in Cloud Shell.
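For example, a sketch with gsutil, assuming a hypothetical bucket my-bucket:

# Change the default storage class applied to new objects
gsutil defstorageclass set nearline gs://my-bucket
# Rewrite an existing object to a different storage class
gsutil rewrite -s coldline gs://my-bucket/old-logs.txt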

Access control

The following mechanisms control buckets and objects in a project, and they can be
used together:

● Cloud IAM (project level): roles such as Owner, Editor, and Viewer.
  storage.admin controls buckets and objects; storage.objectAdmin controls
  objects only.
● Access control lists (ACLs): finer-grained control over buckets and objects.
● Signed URL: a signed and timed cryptographic key that grants access.
● Signed policy document: controls what kind of file can be uploaded.
Let’s look at access control for your objects and buckets that are part of a project. We
can use IAM for the project to say what individual user or service account can see the
bucket, list the objects in the bucket, view the names of the objects in the bucket or
create new buckets.
For most purposes, Cloud IAM is sufficient, and roles are inherited from project to
bucket to object. Access control lists or ACLs offer finer control.
For more detailed control, signed URLs provide a cryptographic key that gives
time-limited access to a bucket or object.
Finally, a signed policy document further refines the control by determining what kind
of file can be uploaded by someone with a signed URL.
Let’s take a closer look at ACLs and signed URLs.

Access control lists (ACLs)

Scope:
● allUsers
● allAuthenticatedUsers
● Google accounts (individuals, groups, domains)
● Cloud Storage ID
● Project convenience values:
  ○ owners-[CS project ID]
  ○ editors-[CS project ID]
  ○ viewers-[CS project ID]

Permission:
● Owner
● Writer
● Reader

An ACL can contain at most 100 entries. The predefined ACL project-private is
applied by default to all new buckets and objects.

An ACL is a mechanism you can use to define who has access to your buckets and
objects, as well as what level of access they have. Each ACL consists of one or more
entries, and each entry consists of two pieces of information:
A scope, which defines who can perform the specified actions (for example, a specific
user or group of users). And a permission, which defines what actions can be
performed (for example, read or write).
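As a sketch of managing ACL entries with gsutil (the bucket, object, and account
here are hypothetical):

# Grant a specific Google account Reader permission on an object
gsutil acl ch -u jane@example.com:R gs://my-bucket/report.txt
# Make an object publicly readable (scope AllUsers)
gsutil acl ch -u AllUsers:R gs://my-bucket/report.txt
# View the current ACL
gsutil acl get gs://my-bucket/report.txt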
https://cloud.google.com/storage/docs/access-control/lists

Signed URLs
● “Valet key” access to buckets and objects via ticket:
○ Ticket is a cryptographically signed URL
○ Time-limited
○ Operations specified in ticket: HTTP GET, PUT, DELETE (not POST)
○ Program can decide when or whether to give out signed URL
○ Any user with URL can invoke permitted operations
● Example using private account key and gsutil:
gsutil signurl -d 10m path/to/privatekey.p12 gs://bucket/object

For some applications, it is easier and more efficient to grant limited-time access
tokens that can be used by any user, instead of using account-based authentication
for controlling resource access (e.g., when you don’t want to require users to have
Google accounts).
Signed URLs allow you to do this for Cloud Storage. You create a URL that grants
read or write access to a specific Cloud Storage resource and specifies when the
access expires. That URL is signed using a private key associated with a service
account. When the request is received, Cloud Storage can verify that the
access-granting URL was issued on behalf of a trusted security principal (the service
account), and delegates its trust of that account to the holder of the URL.
After you give out the signed URL, it is out of your control. So you want the signed
URL to expire after some reasonable amount of time, as shown in this example, which
is set to expire after 10 minutes (-d 10m).

Example: Signed URL


https://storage.googleapis.com/testdata.txt?
GoogleAccessId=1234567890123@developer.gserviceaccount.com&
Expires=1331155464&
Signature=BClz9e...LoDeAGnfzCd4fTsWcLbal9sFpqXsQI8IQi1493mw%3D

● Query parameters
○ GoogleAccessId: Email address of the service account
○ Expires: When the signature expires in UNIX epoch
○ Signature: Signature is cryptographic hash of composite URL string
● The string that is digitally signed must contain:
○ HTTP verb (GET)
○ Expiration
○ Canonical resource (/bucket/object)

Here's another example of what a signed URL looks like. Your applications could call
it, or you could simply make it available to individuals for download. That way, you can
ensure that access to a file expires, instead of relying on lifecycle management to
delete older files.

Cloud Storage features

● Customer-supplied encryption key (CSEK)
  ○ Use your own key instead of Google-managed keys
● Object Lifecycle Management
  ○ Automatically delete or archive objects
● Object Versioning
  ○ Maintain multiple versions of objects*
● Directory synchronization
  ○ Synchronizes a VM directory with a bucket
● Object Change Notification
● Data import
● Strong consistency

There are also a number of different features that come with Cloud Storage. I am
going to cover these at a high-level for now as we are about to dive deeper into some
of them.
Earlier in the course, we already talked a little about Customer-supplied encryption
keys when attaching persistent disks to virtual machines. This allows you to supply
your own encryption keys instead of the Google-managed keys, which is also
available for Cloud Storage.
Cloud Storage also provides Object Lifecycle Management, which lets you
automatically delete or archive objects.
Another feature is Object Versioning, which allows you to maintain multiple versions
of objects in your bucket. Now, I placed a star here because you are charged for the
versions as if they were multiple files, which is something to keep in mind.
Cloud Storage also offers directory synchronization so that you can sync a VM
directory with a bucket.
As for Object Change Notification, data import and strong consistency, we will discuss
these in more detail after going into object versioning and object lifecycle
management.

Object Versioning

(Diagram: when a new Object A is written to a versioned bucket, the previous live
version, Object A (g1), moves to the archive, and the new live version becomes
Object A (g2).)

● Buckets with Object Versioning enabled maintain a history of modifications
(overwrite/delete) of objects.
● Bucket users can list archived versions of an object, restore an object to an
older state, or delete a version.
● Supports conditional updates.

To support the retrieval of objects that are deleted or overwritten, Google Cloud
Storage offers the Object Versioning feature. You enable Object Versioning for a
bucket. Once enabled, Cloud Storage creates an archived version of an object each
time the live version of the object is overwritten or deleted. The archived version
retains the name of the object but is uniquely identified by a generation number as
illustrated with (g1).
When Object Versioning is enabled, you can list archived versions of an object,
restore the live version of an object to an older state, or permanently delete an
archived version, as needed. You can turn versioning on or off for a bucket at any
time. Turning versioning off leaves existing object versions in place and causes the
bucket to stop accumulating new archived object versions.
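A quick sketch with gsutil, assuming a hypothetical bucket my-bucket; archived
generations show up when listing with -a:

# Enable (or inspect) Object Versioning on a bucket
gsutil versioning set on gs://my-bucket
gsutil versioning get gs://my-bucket
# List all versions of an object, including archived generations
gsutil ls -a gs://my-bucket/report.txt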

For more information, see: https://cloud.google.com/storage/docs/object-versioning



Object Lifecycle Management

● Lifecycle management policies specify actions to be performed


on objects that meet certain rules.
● Examples include:
○ Downgrade storage class on objects older than a year
○ Delete objects created before a specific date
○ Keep only the 3 most recent versions of an object
● Object inspection occurs in asynchronous batches, so rules
may not be applied immediately.
● Changes to lifecycle configurations can take 24 hours to apply.

To support common use cases like setting a Time to Live for objects, archiving older
versions of objects, or "downgrading" storage classes of objects to help manage
costs, Cloud Storage offers Object Lifecycle Management.
You can assign a lifecycle management configuration to a bucket. The configuration is
nothing but a set of rules which apply to all the objects in the bucket. So when an
object meets the criteria of one of the rules, Cloud Storage automatically performs a
specified action on the object. Here are some example use cases:
Downgrade the storage class of objects older than a year to let’s say Coldline
Storage.
Delete objects created before a specific date. For example, January 1, 2017.
And keep only the 3 most recent versions of each object in a bucket with versioning
enabled.
Now, updates to your lifecycle configuration may take up to 24 hours to go into effect.
This means that when you change your lifecycle configuration, Object Lifecycle
Management may still perform actions based on the old configuration for up to 24
hours. So keep that in mind.
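A lifecycle configuration is a JSON rule set that you apply to a bucket with gsutil. As a
hedged sketch, this hypothetical rule deletes objects older than 365 days:

# Write a rule that deletes objects older than 365 days
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 365}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-bucket
gsutil lifecycle get gs://my-bucket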

For more information, see: https://cloud.google.com/storage/docs/lifecycle



Object Change Notification

Cloud Storage can watch a bucket and send notifications to external applications when
objects change (through a web hook).

(Diagram: a new photo added to a Photos bucket triggers a notification, POST
http://app/newthumb, to app servers on Compute Engine, which write a thumbnail
to a Thumbnails bucket.)

Cloud Pub/Sub Notifications for Cloud Storage is the recommended way to track
changes.

Object Change Notification can be used to notify an application when an object is


updated or added to a bucket through a watch request.
Completing a watch request creates a new notification channel, which is the means
by which a notification message is sent to an application watching a bucket using a
webhook.
After a notification channel is initiated, Cloud Storage notifies the application any time
an object is added, updated, or removed from the bucket. For example, as shown
here, when you add a new picture to a bucket, an application could be notified to
create a thumbnail.
Now, that being said, Cloud Pub/Sub notifications are the recommended way to track
changes to objects in your Cloud Storage buckets because they're faster, more
flexible, easier to set up, and more cost-effective. Cloud Pub/Sub is Google’s
distributed real-time messaging service which is covered in the Developing
Applications course.
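As a sketch of the recommended approach, a Pub/Sub notification configuration can
be attached to a bucket with gsutil (the topic name here is hypothetical):

# Publish a JSON message to the topic "object-events" whenever objects change
gsutil notification create -t object-events -f json gs://my-bucket
# List notification configurations on the bucket
gsutil notification list gs://my-bucket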


For more information, see:


https://cloud.google.com/storage/docs/object-change-notification

Data import
● Storage Transfer Service enables high-performance imports of online data into Cloud
Storage buckets
○ Imports from another bucket, an S3 bucket, or web source
○ Can schedule transfers, create filters, etc.
○ Accessible via console or APIs
● Google Transfer Appliance enables secure transfer of up to a petabyte of data by
leasing a high capacity storage server from Google
● Offline Media Import is a service where physical media is sent to a third-party provider
who uploads the data

The GCP Cloud Storage Browser allows you to upload individual files to your bucket
but what if you have to upload terabytes or even petabytes of data? There are three
services that address this: Storage Transfer Service, Google Transfer Appliance and
Offline Media Import.
The Storage Transfer Service enables high-performance imports of online data. That
data source can be another Cloud Storage bucket, an Amazon S3 bucket, or an
HTTP/HTTPS location.
Transfer Appliance on the other hand is a high-capacity storage server that you lease
from Google. You simply connect it to your network, load it with data, and then ship it
to an upload facility where the data is uploaded to Cloud Storage. This service
enables you to securely transfer up to a petabyte of data on a single appliance.
Finally, Offline Media Import is a third party service where physical media, such as
storage arrays, hard disk drives (HDDs), tapes, and USB flash drives, is sent to a
provider who uploads the data.

For more information, see:


https://cloud.google.com/transfer-appliance/docs/introduction
https://cloud.google.com/storage/docs/offline-media-import-export

Strong consistency

Cloud Storage provides strong global consistency, including both data and metadata:

• Read-after-write
• Read-after-metadata-update
• Read-after-delete
• Bucket listing
• Object listing
• Granting access to resources

When you upload an object to Cloud Storage and you receive a success response,
the object is immediately available for download and metadata operations from any
location where Google offers service. This is true whether you create a new object or
overwrite an existing object. Because uploads are strongly consistent, you will never
receive a 404 Not Found response or stale data for a read-after-write or
read-after-metadata-update operation.
Strong global consistency also extends to deletion operations on objects. If a deletion
request succeeds, an immediate attempt to download the object or its metadata will
result in a 404 Not Found status code. You get the 404 error because the object no
longer exists after the delete operation succeeds.
Also, bucket listing is strongly consistent. For example, if you create a bucket, then
immediately perform a list buckets operation, the new bucket appears in the returned
list of buckets.
Finally, object listing is also strongly consistent. For example, if you upload an object
to a bucket and then immediately perform a list objects operation, the new object
appears in the returned list of objects.
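As a tiny illustration of read-after-write consistency (bucket and file names are
hypothetical): the object is readable the moment the upload returns success.

# A successful upload is immediately visible to reads and metadata operations
gsutil cp ./data.csv gs://my-bucket/data.csv && gsutil stat gs://my-bucket/data.csv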

Choosing Cloud Storage

Structured or unstructured data?
  Structured   → Consider a structured data storage service
  Unstructured → Read less than once per year?
                   Yes → Consider Coldline Storage
                   No  → Read less than once per month?
                           Yes → Consider Nearline Storage
                           No  → Read only by VMs in a specific region?
                                   Yes → Consider Regional Storage
                                   No  → Consider Multi-Regional Storage
storage service Storage Storage Storage Storage

To summarize, let’s look at this decision tree to help you find the appropriate storage
class in Cloud Storage.
If you are going to read your data less than once a year, you should consider using
Coldline storage.
If you are going to read your data less than once a month, you should consider using
Nearline storage.
And if you're going to be doing reads and writes more often than that, then you should
consider choosing Multi-Regional or Regional, depending on your locality needs.
Speaking of that, another reason to choose Regional storage is if you want to ensure
that your data stays within a specific geographic or political region, to comply with
legal requirements, for example.
So far we only considered unstructured data. Structured data services are covered
next in this module.

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

Let’s take some of the Cloud Storage concepts that we just discussed and apply them
in a lab.

Lab: Cloud Storage

Completion: 60 minutes; Access: 120 minutes

Objectives: in this lab, you learn how to perform the following tasks:
● Create and use buckets
● Set access control lists to restrict access
● Use your own encryption keys
● Implement version controls
● Use directory synchronization
● Share a bucket across projects using Cloud IAM

In this lab, you'll create buckets and exercise many of the advanced options available
in Cloud Storage. You'll set access control lists to limit who can have access to your
data and what they're allowed to do with it. You'll use the ability to supply and manage
your own encryption keys for additional security. You'll enable object versioning to
track changes in the data and you'll configure lifecycle management so that objects
are automatically archived or deleted after a specified period. Finally, you’ll use the
directory synchronization feature that I mentioned and share your buckets across
projects using Cloud IAM.

The lab will take about 60 minutes to complete, but you have 120 minutes before it
times out.

Lab review

In this lab you learned to create and work with buckets and objects, and you learned about the
following features for Cloud Storage:

● CSEK: Customer-supplied encryption key
  ○ Use your own encryption keys
  ○ Rotate keys
● ACL: Access control list
  ○ Set an ACL for private, and modify to public
● Lifecycle management
  ○ Set policy to delete objects after 31 days
● Versioning
  ○ Create a version and restore a previous version
● Directory synchronization
  ○ Recursively synchronize a VM directory with a bucket
● Cross-project resource sharing using Cloud IAM
  ○ Enable access to resources across projects

In this lab, you learned to create and work with buckets and objects, and you learned
about the following Cloud Storage features:
● Customer-supplied encryption keys
● Access control lists
● Lifecycle management
● Versioning
● Directory synchronization
● And Cross-project resource sharing using IAM
Now that you're familiar with many of the advanced features of Cloud Storage, you
might consider using them in a variety of applications that you might not have
previously considered. A common, quick, and easy way to start using GCP is to use
Cloud Storage as a backup service.

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

Let’s jump into the structured, or relational, data storage services. First up is Cloud
SQL.

Data storage services

Service         | Capacity   | Access metaphor             | Read                           | Write                          | Update granularity              | Usage
Cloud Storage   | Petabytes+ | Like files in a file system | Have to copy to local disk     | One file                       | An object (a "file")            | Store blobs
Cloud SQL       | Terabytes  | Relational database         | SELECT rows                    | INSERT row                     | Field                           | No-ops SQL database on the cloud
Cloud Spanner   | Petabytes  | Globally scalable RDBMS     | Transactional reads and writes | Transactional reads and writes | SQL, schemas, ACID transactions | Strong consistency, high availability
Cloud Datastore | Terabytes  | Persistent Hashmap          | Filter objects on property     | Put object                     | Attribute                       | Structured data from App Engine apps
Cloud Bigtable  | Petabytes  | Key-values, HBase API       | Scan rows                      | Put row                        | Row                             | No-ops, high throughput, scalable, flattened data
BigQuery        | Petabytes  | Relational                  | SELECT rows                    | Batch/stream                   | Field                           | Interactive SQL* querying; fully managed data warehouse

● Cloud SQL is a fully managed, or no-ops, database service that makes it easy
to set up, maintain, manage, and administer your relational databases on
Google Cloud Platform. Cloud SQL has a high capacity, capable of handling
terabytes of storage.
● The databases are relational, which means that you're simply going to be
running queries, such as SELECT statements to read data or INSERT
statements to write data.
● Alternatively, you could install a SQL Server application image on a VM using
Compute Engine, but there are certain benefits to using Cloud SQL as a
managed service inside of GCP.
● Let’s take a look at those benefits.

Cloud SQL

● Fully managed MySQL and PostgreSQL databases


○ Fully managed instances
○ Patches and updates automatically applied
○ You still have to administer MySQL users
● Cloud SQL supports many clients
○ gcloud sql
○ App Engine, G Suite scripts
○ Applications and tools Cloud SQL
■ SQL Workbench, Toad
■ External applications using standard MySQL drivers

Since Cloud SQL is a fully managed service for either MySQL or PostgreSQL
databases, patches and updates are automatically applied. However, you still have to
administer MySQL users with the native authentication tools that come with these
databases.
Cloud SQL supports many clients, such as Cloud Shell, App Engine, and G Suite
scripts. It also supports other applications and tools that you might be used to, like
SQL Workbench, Toad, and other external applications using standard MySQL drivers.
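For example, a Second Generation MySQL instance can be created from Cloud Shell
with gcloud; the instance name, tier, region, and password here are placeholders:

# Create a MySQL 5.7 Cloud SQL instance
gcloud sql instances create my-instance \
    --database-version=MYSQL_5_7 \
    --tier=db-n1-standard-1 \
    --region=us-central1
# Set the password for the MySQL root user
gcloud sql users set-password root --host=% --instance=my-instance --password=[PASSWORD]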

Cloud SQL generations

● Second Generation (recommended)


○ Up to 7x throughput and 20x storage capacity of First Generation
○ Up to 208 GB of RAM and 10 TB of data storage
○ MySQL 5.6 or 5.7
○ InnoDB only
● First Generation
○ Up to 16 GB of RAM and 500 GB of data storage
○ MySQL 5.5
○ IPv6 connectivity and On Demand activation policy
● Some MySQL features are not supported, such as user-defined functions (UDFs)

Cloud SQL offers two different generations. I recommend using the second generation
because it provides up to 7 times the throughput and 20 times the storage capacity of
the first generation, giving you up to 208 GB of RAM and 10 TB of data storage. This
generation works with either MySQL 5.6 or 5.7 but only supports InnoDB as the
storage engine.
The first generation provides lower memory and storage and only works with MySQL
5.5. However, this generation supports connections over both IPv4 and IPv6
addresses and supports the On Demand activation policy.
Generations aside, the MySQL functionality provided by a Cloud SQL instance is the
same as the functionality provided by a locally-hosted MySQL instance. However,
there are a few differences between a standard MySQL instance and a Cloud SQL
instance. For example, user-defined functions are not supported. For a full list of
differences between Cloud SQL and standard MySQL functionality, I recommend
looking at the Cloud SQL documentation.

Cloud SQL services

Replica services
• Read replicas
• Failover replicas
• External replicas
• External master to Cloud SQL first gen replica

Backup service
• On demand
• Scheduled/automated
• Auto-quiesce

Import/Export
• Import data to Cloud SQL
• Export data from Cloud SQL

Scaling
• Scale up (change machine capacity); requires restart
• Scale out via read replicas

Let’s focus on some other services provided by Cloud SQL:
There is a replica service that can replicate data between multiple zones, with
automatic failover if an outage occurs.
Cloud SQL also provides automated and on-demand backups with point-in-time
recovery.
You can import and export databases using mysqldump, or import and export CSV
files.
Cloud SQL can also scale up, which requires a machine restart, or scale out using
read replicas.
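As a sketch of these services with gcloud (instance, bucket, and database names are
hypothetical):

# Create an on-demand backup
gcloud sql backups create --instance=my-instance
# Export a database to a SQL dump file in Cloud Storage
gcloud sql export sql my-instance gs://my-bucket/dump.sql --database=mydb
# Add a read replica to scale out reads
gcloud sql instances create my-replica --master-instance-name=my-instance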

https://cloud.google.com/sql/docs/mysql/replication/tips

Connecting to Cloud SQL

● Basic connection
  ○ IP address
  ○ Add authorized network
● Secure access
  ○ Whitelist IP addresses
  ○ Add to authorized networks list
  ○ Configure SSL
  ○ Static IP
● Instance-level access
● MySQL database-level access
  ○ MySQL users
  ○ Database access

There are a couple of options when it comes to connecting to Cloud SQL instances.
Using the most basic connection, you can grant any application access to a Cloud
SQL instance by authorizing the IP addresses that the application uses to connect.
This is a great way to get started but I don’t recommend it for production instances.
For a more secure access, you can temporarily whitelist IP addresses to easily
connect from the GCP Console. This is best for quick administration tasks requiring
the mysql command-line tool. You can also configure SSL certificate management for
a Cloud SQL instance and connect to the mysql client using SSL which you will do in
the next lab.
Cloud SQL also provides instance-level access to authorize access to your Cloud
SQL instance from an application or client (that could be running on Google App
Engine or externally) or another GCP service, such as Compute Engine.
Also, as I mentioned earlier, you can use the native MySQL Access Privilege System
to control which MySQL users have access to the data in your instance.

Cloud SQL Proxy and Secure SSL

(Diagram: on a Compute Engine VM, the application speaks normal MySQL to a local
proxy client on TCP 3306; the proxy client opens a secure tunnel over TCP 3307 to the
proxy server in front of the Cloud SQL Second Generation instance, which speaks
normal MySQL to the database. Without the proxy, the connection is unsecured IP.)

Here is an example of connecting a mysql client to your Google Cloud SQL instance
using the Cloud SQL Proxy, rather than over IP.
The Cloud SQL Proxy provides secure access to your Cloud SQL Second Generation
instances without having to whitelist IP addresses or configure SSL.
Cloud SQL Proxy works by having a local client, called the proxy, running in the local
environment. Your application communicates with the proxy with the standard
database protocol used by your database. The proxy uses a secure tunnel to
communicate with its companion process running on the server.
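A minimal sketch of running the proxy locally, assuming the hypothetical instance
connection name my-project:us-central1:my-instance:

# Start the proxy, listening on local port 3306
./cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:3306 &
# Connect with the standard mysql client through the secure tunnel
mysql -u root -p --host=127.0.0.1 --port=3306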

Choosing Cloud SQL

Need full relational capability?
  No  → Consider a NoSQL service
  Yes → Is max 10 TB okay? Are max 4,000 concurrent connections okay?
        Is application-managed location okay?
          No to any  → Consider Cloud Spanner
          Yes to all → Specific OS requirements, custom DB configuration
                       requirements, or special backup requirements?
                         Yes → Consider hosting your own DB on a VM
                         No  → Consider Cloud SQL

To summarize, let’s look at this decision tree to help you find the right data storage
service with full relational capability.
If you need more than 10 TB of storage space, over 4000 concurrent connections to
your database or if you want your application design to be responsible for scaling,
availability, and location management when scaling up globally, then consider using
Cloud Spanner which we will cover later in this module.
If you have no concerns with these constraints, ask yourself if you have specific OS
requirements, custom database configuration requirements or special backup
requirements. If you do, you want to consider hosting your own database on a VM
using Compute Engine. Otherwise, I strongly recommend using Cloud SQL as a fully
managed service for your relational databases.

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

Let’s take some of the Cloud SQL concepts that we just discussed and apply them in
a lab.

Lab: Cloud SQL

Completion: 45 minutes; Access: 90 minutes

Objectives: in this lab, you learn how to perform the following tasks:
● Create a Cloud SQL instance
● Create a VM to serve as a database client and install software
● Restrict access to the Cloud SQL instance to a single IP address
● Download sample GCP billing data in *.csv format and load it into the database
● Configure the Cloud SQL instance and the client to use SSL encryption

In this lab, you will create a Cloud SQL instance. Then, you will create a VM to serve
as the database client and to populate the database with sample GCP billing data. To
access the Cloud SQL server, you will first restrict access to a single IP address and
then employ SSL encryption for a more secure connection.

The lab will take about 45 minutes to complete, but you have 90 minutes before it
times out.

Lab Review

In this lab you:
● Configured Cloud SQL for use by a client.
● Improved security by requiring SSL certificates.

In this lab you configured Cloud SQL for use by a client. Then, you improved security
by requiring SSL certificates.
Cloud SQL is a great way to rapidly provision a familiar database environment while
offloading many of the administrative activities. This was a good example of a
managed service where some of the service parameters are yours to configure
and some parts are handled automatically by GCP. One thing to keep in mind is where
that dividing line is and what your responsibilities are with respect to the security of
the overall solution.

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

If Cloud SQL does not fit your requirements because you need horizontal scalability,
consider using Cloud Spanner.

Data storage services

Service         | Capacity   | Access metaphor             | Read                           | Write                          | Update granularity              | Usage
Cloud Storage   | Petabytes+ | Like files in a file system | Have to copy to local disk     | One file                       | An object (a "file")            | Store blobs
Cloud SQL       | Terabytes  | Relational database         | SELECT rows                    | INSERT row                     | Field                           | No-ops SQL database on the cloud
Cloud Spanner   | Petabytes  | Globally scalable RDBMS     | Transactional reads and writes | Transactional reads and writes | SQL, schemas, ACID transactions | Strong consistency, high availability
Cloud Datastore | Terabytes  | Persistent Hashmap          | Filter objects on property     | Put object                     | Attribute                       | Structured data from App Engine apps
Cloud Bigtable  | Petabytes  | Key-values, HBase API       | Scan rows                      | Put row                        | Row                             | No-ops, high throughput, scalable, flattened data
BigQuery        | Petabytes  | Relational                  | SELECT rows                    | Batch/stream                   | Field                           | Interactive SQL* querying; fully managed data warehouse

Cloud Spanner is the service built for the cloud specifically to combine the benefits of
relational database structure with non-relational horizontal scale. This service can
provide petabytes of capacity and offers transactional consistency at global scale,
schemas, SQL, and automatic, synchronous replication for high availability.
Natural use cases include financial applications and inventory applications
traditionally served by relational database technology.

Cloud Spanner

● Strong consistency
○ Strongly consistent secondary indexes
○ Globally consistent
● SQL support
○ SQL (ANSI 2011 with extensions)
○ ALTER statements for schema changes Cloud
Spanner
● Managed instances
○ High availability through data replication

Cloud Spanner offers strong consistency, including strongly consistent secondary
indexes. It offers SQL support, with ALTER statements for schema changes. Cloud
Spanner also offers managed instances with high availability through transparent,
synchronous, built-in data replication.
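For example, creating an instance and database with gcloud (the names and config
here are placeholders):

# Create a regional Cloud Spanner instance with one node
gcloud spanner instances create test-instance \
    --config=regional-us-central1 --description="Test instance" --nodes=1
# Create a database on the instance
gcloud spanner databases create test-db --instance=test-instance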

Characteristics

             | Cloud Spanner | Relational DB | Non-Relational DB
Schema       | Yes           | Yes           | No
SQL          | Yes           | Yes           | No
Consistency  | Strong        | Strong        | Eventual
Availability | High          | Failover      | High
Scalability  | Horizontal    | Vertical      | Horizontal
Replication  | Automatic     | Configurable  | Configurable

Let’s compare Cloud Spanner with both relational and non-relational databases.
Like a relational database, Cloud Spanner has schemas, SQL, and strong consistency.
And like a non-relational database, Cloud Spanner offers high availability and
horizontal scalability, with automatic replication.
In essence, Cloud Spanner offers the best of the relational and non-relational worlds.
These features allow for mission-critical use cases, such as building consistent
systems for transactions and inventory management in the financial services and
retail industries.

Cloud Spanner

Open standards
• Standard SQL (ANSI 2011)
• Encryption, audit logging, Identity and Access Management
• Client libraries in popular languages: Java, Python, Go, Node.js
• JDBC driver

Workloads
• Transactional
• Scale-out
• Global data plane
• Database consolidation

Cloud Spanner supports many open standards, such as the ones that I listed here. It
also supports many workloads, like transactional workloads for companies that have
outgrown their single-instance relational database management system and have
already moved to a NoSQL solution but need transactional consistency, or that are
looking to move to a scalable solution. Cloud Spanner also allows for database
consolidation, for companies that store their business data in multiple database
products with variable maintenance overheads and capabilities and need to
consolidate their data.
To better understand how all of this works, let’s look at the architecture of Cloud
Spanner.

Architecture

(Diagram: a Cloud Spanner instance replicates databases DB 1 and DB 2 across
Zone 1, Zone 2, and Zone 3.)

A Cloud Spanner instance replicates data in N cloud zones which can be within one
region or across several regions. The database placement is configurable, meaning
you can choose which region to put your database in. This architecture allows for high
availability and global placement.

Data replication

(Diagram: an update to Table 1 is synchronously replicated across Zone 1, Zone 2,
and Zone 3; Table 2 is replicated the same way.)

The replication of data is synchronized across zones using Google’s global fiber
network. The use of atomic clocks ensures atomicity whenever you are updating your
data.

Cloud Spanner best practices: https://cloud.google.com/spanner/docs/best-practices



Cloud IAM roles

You can apply Cloud IAM roles to individual databases:
• spanner.admin
• spanner.databaseAdmin
• spanner.databaseReader
• spanner.databaseUser
• spanner.viewer

Cloud Spanner has its own set of IAM access roles. This allows you to have the same
security mechanisms without having to create something separate for your database.
These IAM permissions can be granted to a database, instance, or GCP project.
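As a sketch, granting a database-level role to a user might look like this (the member
and resource names are hypothetical):

# Grant read-only access to a single database
gcloud spanner databases add-iam-policy-binding test-db \
    --instance=test-instance \
    --member=user:jane@example.com \
    --role=roles/spanner.databaseReader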

For more information, see https://cloud.google.com/spanner/docs/



Feature details
● Tables
○ Optional parent-child relationships between tables
○ Interleaving of child rows within parent rows on primary key
● Primary keys and secondary keys
● Database splits
● Transactions
○ Locking read-write, and read-only
● Timestamp bounds
○ Strong, Bounded staleness, Exact staleness, Maximum staleness

https://cloud.google.com/spanner/

Cloud Spanner offers many different features as you can see here: tables, Primary
and secondary keys, Database splits, Transactions, and Timestamp bounds. Feel free
to explore the documentation to dive deeper into these features. However, because
the scope of this course is to understand under what circumstances you would use
Cloud Spanner, let’s look at a decision tree.

https://cloud.google.com/spanner/docs/schema-and-data-model

Choosing Cloud Spanner

Outgrown single-instance RDBMS? Sharding for throughput? Need transactional
consistency? Need global data and strong consistency? Need DB consolidation?
  Yes to any → Consider Cloud Spanner
  No to all  → Need full relational capability?
                 Yes → Consider Cloud SQL
                 No  → Consider a NoSQL service

If you have outgrown any relational database, are sharding your databases for
throughput, need transactional consistency, need global data and strong consistency,
or just want to consolidate your databases, consider using Cloud Spanner. If you
don’t need any of these but still need full relational capability, consider Cloud SQL;
otherwise, consider a NoSQL service, such as Cloud Datastore, which we will cover
next.

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

If you are looking for a highly-scalable NoSQL database for your applications,
consider using Cloud Datastore.

Data storage services

Service         | Capacity   | Access metaphor             | Read                           | Write                          | Update granularity              | Usage
Cloud Storage   | Petabytes+ | Like files in a file system | Have to copy to local disk     | One file                       | An object (a "file")            | Store blobs
Cloud SQL       | Terabytes  | Relational database         | SELECT rows                    | INSERT row                     | Field                           | No-ops SQL database on the cloud
Cloud Spanner   | Petabytes  | Globally scalable RDBMS     | Transactional reads and writes | Transactional reads and writes | SQL, schemas, ACID transactions | Strong consistency, high availability
Cloud Datastore | Terabytes  | Persistent Hashmap          | Filter objects on property     | Put object                     | Attribute                       | Structured data from App Engine apps
Cloud Bigtable  | Petabytes  | Key-values, HBase API       | Scan rows                      | Put row                        | Row                             | No-ops, high throughput, scalable, flattened data
BigQuery        | Petabytes  | Relational                  | SELECT rows                    | Batch/stream                   | Field                           | Interactive SQL* querying; fully managed data warehouse

You can think of Cloud Datastore as a persistent hashmap that offers terabytes of
capacity. One of its main use cases is to store structured data from App Engine apps.

Cloud Datastore

(Diagram: an App Engine app tries memcache first and falls back to Cloud Datastore
on a cache miss.)

● API support for several languages
● Data replicated across several data centers
● Queries scale with the size of the result set, not the size of the data set
● Fast and easy provisioning
● Autoscale
● RESTful endpoints
● Schemaless access
● ACID transactions
● SQL-like capabilities
● Local development tools
● Authentication, secure sharing
● Built-in redundancy

Cloud Datastore is paired with a memcache service to increase performance for
repeatedly read data. Typically in development, the App Engine application will try
memcache first, and then on a cache miss, access Cloud Datastore. This strategy
radically improves performance and reduces costs.
Cloud Datastore automatically handles sharding and replication, providing you with a
highly available and durable database that scales automatically to handle your
applications' load. Cloud Datastore provides a myriad of capabilities such as ACID
transactions, SQL-like queries, indexes and much more.

Information on consistency models in Cloud Datastore:


https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consi
stency-with-google-cloud-datastore/

Cloud Datastore

Kind: category of an object
Entity: a single object
Property: individual data for an object
Key: unique ID

Entity groups and ancestor queries
• An "ancestor query" yields a strongly consistent result, versus a non-ancestor
query, which yields an eventually consistent result.
• This enables application developers to balance requests.

While the Cloud Datastore interface has many of the same features as traditional
databases, as a NoSQL database it differs from them in the way that it describes
relationships between data objects. In Cloud Datastore, a category of object is known
as a Kind, one object is an Entity, individual data for an object is a Property and a
unique ID for an object is a Key. Whereas, in a relational database these would be a
table, row, field and primary key, respectively.
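For example, the GQL you will use in the upcoming lab looks much like SQL, but it
queries entities by kind and property rather than rows by table and field; a sketch
with a hypothetical Task kind:

# GQL: fetch Task entities whose "done" property is false
SELECT * FROM Task WHERE done = false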

Cloud Datastore replication


● Multiple locations
○ Multi-Regional
■ Multi-region redundancy; higher availability
○ Regional locations
■ Lower write latency; co-location with other resources
○ Global points of presence: lower latency for the end user
● Service-level agreement
○ Regional (multi-zone) and Multi-Regional replication enable high availability with
an SLA covering:
■ 99.95% for Multi-Regional locations
■ 99.9% for Regional locations

Built into Datastore is synchronous replication over a wide geographic area. When
you first use Cloud Datastore, you must choose a location where the project's data is
stored. To reduce latency and increase availability, store your data close to the users
and services that need it.
There are two types of locations where you can store data using Cloud Datastore:
multi-region locations and regional locations.
Multi-region locations provide multi-region redundancy and higher availability.
Regional locations provide lower write latency and co-location with other GCP
resources that your application may use.
Both of these options enable high availability with slightly different SLAs for monthly
uptime percentage.

Choosing Cloud Datastore

Schema might change and need an adaptable database? Scale down to zero?
Want low maintenance overhead scaling up to TBs?
  No to all  → Consider a SQL solution such as Cloud SQL or Cloud Spanner
  Yes to any → Transactional consistency required?
                 Yes → Consider Cloud Datastore
                 No  → Consider Cloud Bigtable (depending on cost/size)

To summarize, let’s look at this decision tree to help you determine if Cloud Datastore
is the right storage service for your data.
If your schema might change and you need an adaptable database, if you need to
scale down to zero, or if you want low maintenance overhead scaling up to terabytes,
consider using Cloud Datastore. Also, if you don’t require transactional consistency,
you might want to consider Cloud Bigtable instead, depending on the cost or size of
your data.
I am going to cover Cloud Bigtable after the next lab.

Cloud Firestore is the next generation of Cloud Datastore

Datastore mode (new server projects):
● Compatible with Datastore applications
● Strong consistency
● No entity group limits

Native mode (new mobile and web apps):
● Strongly consistent storage layer
● Collection and document data model
● Real-time updates
● Mobile and Web client libraries

Cloud Firestore is the next generation of Cloud Datastore. Cloud Firestore can
operate in Datastore mode, making it backwards-compatible with Cloud Datastore.
By creating a Cloud Firestore database in Datastore mode, you can access Cloud
Firestore's improved storage layer while keeping Cloud Datastore system behavior.
This removes the following Cloud Datastore limitations:
● Queries are no longer eventually consistent; instead, they are all strongly
consistent.
● Transactions are no longer limited to 25 entity groups.
● Writes to an entity group are no longer limited to 1 per second.
Cloud Firestore in Native mode introduces new features such as:
● A new, strongly consistent storage layer
● A collection and document data model
● Real-time updates
● Mobile and Web client libraries
Cloud Firestore is backwards compatible with Cloud Datastore, but the new data
model, real-time updates, and mobile and web client library features are not. To
access all of the new Cloud Firestore features, you must use Cloud Firestore in
Native mode. A general rule of thumb is to use Cloud Firestore in Datastore mode for
new server projects and Native mode for new mobile and web apps. For more
information on choosing between Native mode and Datastore mode, refer to the
documentation: https://cloud.google.com/datastore/docs/firestore-or-datastore

As the next generation of Cloud Datastore, Cloud Firestore is compatible with all
Cloud Datastore APIs and client libraries. Existing Cloud Datastore users will be
live-upgraded to Cloud Firestore automatically at a future date. You can learn more
about this upgrade here: https://cloud.google.com/datastore/docs/upgrade-to-firestore

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

Let’s take some of the Cloud Datastore concepts that we just discussed and apply
them in a lab.

Lab: Cloud Datastore

Completion: 40 minutes; Access: 80 minutes

Objectives: in this lab, you learn how to perform the following tasks:
● Initialize Cloud Datastore
● Create content in the database
● Query the content using both GQL and Kind queries
● Access the Cloud Datastore Admin console

(Diagram: a Kind named Task with three entities, each with a Key and with
Description, created, and done properties.)

In this lab, you'll create a database in Cloud Datastore with three entities. Then, you
will perform basic operations, like querying content and accessing the Cloud
Datastore Admin console.

The lab will take about 40 minutes to complete, but you have 80 minutes before it
times out.

Lab Review

In this lab you:
● Created a Cloud Datastore database
● Populated the database with data entities
● Ran query-by-kind and GQL queries
● Enabled the Cloud Datastore Admin console so that you could learn how to
clean up data and remove test data

In this lab, you created a Cloud Datastore database. You populated the database with
data entities and ran both Query by kind and Query by GQL queries. You then
enabled the Cloud Datastore Admin console so that you could see how to clean up
and remove test data.
Datastore is an extremely flexible and scalable data storage service. It's used in many
applications where scalable growth and consistent performance are required.

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

Cloud Bigtable is Google's NoSQL Big Data database service.



Data storage services

Service         | Capacity   | Access metaphor             | Read                           | Write                          | Update granularity              | Usage
Cloud Storage   | Petabytes+ | Like files in a file system | Have to copy to local disk     | One file                       | An object (a "file")            | Store blobs
Cloud SQL       | Terabytes  | Relational database         | SELECT rows                    | INSERT row                     | Field                           | No-ops SQL database on the cloud
Cloud Spanner   | Petabytes  | Globally scalable RDBMS     | Transactional reads and writes | Transactional reads and writes | SQL, schemas, ACID transactions | Strong consistency, high availability
Cloud Datastore | Terabytes  | Persistent Hashmap          | Filter objects on property     | Put object                     | Attribute                       | Structured data from App Engine apps
Cloud Bigtable  | Petabytes  | Key-values, HBase API       | Scan rows                      | Put row                        | Row                             | No-ops, high throughput, scalable, flattened data
BigQuery        | Petabytes  | Relational                  | SELECT rows                    | Batch/stream                   | Field                           | Interactive SQL* querying; fully managed data warehouse

Cloud Bigtable is a sparsely populated table that can scale to billions of rows and
thousands of columns, allowing you to store terabytes or even petabytes of data. A
single value in each row is indexed and this value is known as the row key.
Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very
low latency. It supports high read and write throughput at low latency, so it's a great
choice for both operational and analytical applications, including IoT, user analytics,
and financial data analysis.
Cloud Bigtable is actually the same database that powers many of Google’s core
services, including Search, Analytics, Maps, and Gmail.

Cloud Bigtable

● Fully managed NoSQL database


● Petabyte-scale with very low latency
● Seamless scalability for throughput
● Learns and adjusts to access patterns

In short, Cloud Bigtable is a fully managed NoSQL database with petabyte-scale and
very low latency. It seamlessly scales for throughput and it learns to adjust to specific
access patterns.

Cloud Bigtable

● Flexible, scalable data service
● High throughput, low latency
● High capacity (petabytes)
● Ideal for storing single-keyed data with very low latency
● Excellent for big data applications
● Can be accessed via an HBase extension
● Benefits over HBase:
  ○ Scales by machine count, not limited by QPS
  ○ Low ops: admin tasks are automatic
  ○ Resizes in seconds

(Diagram: clients reach Cloud Bigtable through the Data API, streaming, and batch
processing paths.)

There are different ways to interact with Cloud Bigtable. Cloud Bigtable is exposed to
applications through multiple client libraries, including a supported extension to the
Apache HBase library. Cloud Bigtable also excels as a storage engine for batch
MapReduce operations, stream processing/analytics, and machine-learning
applications.
Cloud Bigtable's powerful back-end servers offer several key advantages over a
self-managed HBase installation:
From a scalability perspective, a self-managed HBase installation has a design
bottleneck that limits the performance after a certain QPS rate is reached. Cloud
Bigtable does not have this bottleneck, and so you can scale your cluster up to handle
more queries by increasing your machine count.
Also, Cloud Bigtable handles administration tasks like upgrades and restarts
transparently and can resize clusters without downtime.

Structure and schema

The "follows" column family (cells can hold multiple versions):

Row key     | gwashington | jadams | tjefferson | wmckinley
gwashington |             | 1      |            |
jadams      | 1           |        | 1          |
tjefferson  | 1           | 1      |            | 1
wmckinley   |             |        | 1          |

● NoSQL, sparse wide-column database
● Single index: the row key
● Atomic single-row transactions

Cloud Bigtable stores data in massively scalable tables, each of which is a sorted
key/value map as the one on this slide. The table is composed of rows, each of which
typically describes a single entity, and columns, which contain individual values for
each row. Each row is indexed by a single row key, and columns that are related to
one another are typically grouped together into a column family.
The example here has a follows column family that represents a hypothetical social
network in which United States presidents are following each other.
Also, Cloud Bigtable tables are sparse; if a cell does not contain any data, it does not
take up any space.
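As a sketch with the cbt command-line tool, assuming it is already configured with a
project and instance, the follows table above could be created and populated like this
(the table name is hypothetical):

# Create the table and a column family named "follows"
cbt createtable follows-table
cbt createfamily follows-table follows
# Write a cell: row key "gwashington" follows "jadams"
cbt set follows-table gwashington follows:jadams=1
# Read the rows back
cbt read follows-table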

Processing is separated from storage

(Diagram: clients connect to a processing layer of Bigtable nodes, while storage lives
separately on the Colossus file system.)

This diagram shows a simplified version of Cloud Bigtable’s overall architecture. It
illustrates that processing, which is done through a front-end server pool and nodes,
is handled separately from the storage.
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to
help balance the workload of queries (Tablets are similar to HBase regions.) Tablets
are stored on Colossus, which is Google's file system, in SSTable format.
An SSTable provides a persistent, ordered immutable map from keys to values, where
both keys and values are arbitrary byte strings.

Learns access patterns

(Diagram: data blocks A through E on the Colossus file system, with one Bigtable
node frequently accessing a subset of them.)

As I mentioned earlier, Cloud Bigtable learns to adjust to specific access patterns. If a
certain Bigtable node is frequently accessing a certain subset of data...

Rebalances without moving data

(Diagram: the pointers from Bigtable nodes to data blocks A through E are updated
so the workload is distributed evenly; the data itself stays in place on Colossus.)

… Cloud Bigtable will update the indexes so that other nodes can distribute that
workload evenly as shown here.

Throughput scales linearly

(Charts: QPS versus number of Bigtable nodes at three scales, from a handful of
nodes to hundreds, showing roughly linear growth of about 10,000 QPS per node.)

That throughput scales linearly, so for every single node that you do add you're going
to see a linear scale of throughput performance, up to hundreds of nodes.

Choosing Cloud Bigtable

Storing > 1 TB of structured data? Very high volume of writes? Need r/w latency
< 10 milliseconds and strong consistency? Need HBase API compatibility?
  Yes to any → Consider Cloud Bigtable (Bigtable scales UP well)
  No to all  → Consider Cloud Datastore (Datastore scales DOWN well)

In summary, if you need to store more than 1 TB of structured data, have a very high
volume of writes, need read/write latency of less than 10 milliseconds along with
strong consistency, or need a storage service that is compatible with the HBase API,
consider using Cloud Bigtable.
If you don’t need any of these and are looking for a storage service that scales down
well, consider using Cloud Datastore.
Speaking of scaling, the smallest Cloud Bigtable cluster you can create has three
nodes and can handle 30,000 operations per second. Keep in mind that you pay for
those nodes while they are operational, whether your application is using them or
not.

Agenda
● Google Cloud Storage
● Lab
● Cloud SQL
● Lab
● Cloud Spanner
● Cloud Datastore
● Lab
● Cloud Bigtable
● Quiz

Quiz

What data storage service might you select if you just needed to migrate a standard
relational database running on a single machine in a data center to the cloud?

1. Cloud SQL
2. BigQuery
3. Persistent Disk
4. Cloud Storage

Quiz

What data storage service might you select if you just needed to migrate a standard
relational database running on a single machine in a data center to the cloud?

1. Cloud SQL *
2. BigQuery
3. Persistent Disk
4. Cloud Storage

Explanation:
Cloud SQL offers a PostgreSQL server or a MySQL server as a managed service.

Quiz

Which GCP data storage service offers ACID transactions and can scale globally?

1. Cloud Storage
2. Cloud CDN
3. Cloud Spanner
4. Cloud SQL

Quiz

Which GCP data storage service offers ACID transactions and can scale globally?

1. Cloud Storage
2. Cloud CDN
3. Cloud Spanner *
4. Cloud SQL

Explanation:
Cloud Spanner provides ACID (Atomicity, Consistency, Isolation, Durability) properties
that enable transactional reads and writes on the database. It can also scale globally.

Quiz

Which data storage service provides data warehouse services for storing data but also
offers an interactive SQL interface for querying the data?

1. BigQuery
2. Cloud Dataproc
3. Cloud Datalab
4. Cloud SQL

Quiz

Which data storage service provides data warehouse services for storing data but also
offers an interactive SQL interface for querying the data?

1. BigQuery *
2. Cloud Dataproc
3. Cloud Datalab
4. Cloud SQL

Explanation:
BigQuery is a data warehousing service that allows the storage of huge data sets
while making them immediately processable without having to extract or run the
processing in a separate service.

More resources

Cloud Storage
https://cloud.google.com/storage/docs/

Cloud SQL
https://cloud.google.com/sql/docs/

Cloud Spanner
https://cloud.google.com/spanner/docs/

Cloud Datastore
https://cloud.google.com/datastore/docs/

Cloud Bigtable
https://cloud.google.com/bigtable/docs/
