Introduction To Google Cloud Big Data Platform: Lecturer: Phd. Tran Minh Quang Data Engineering - Group 12


Introduction to Google Cloud

Big Data Platform


Lecturer: PhD. Tran Minh Quang
Data Engineering - Group 12
Google BigQuery
Agenda
● What is BigQuery?
● Why BigQuery?
● BigQuery Organization
● Accessing BigQuery
● BigQuery Architecture: Dremel
● References
What is BigQuery?
● BigQuery is a service provided by Google Cloud Platform,
a suite of products & services that includes application
hosting, cloud computing, database services, … on
Google’s scalable infrastructure
● BigQuery is Google’s solution for companies who need a
fully-managed and cloud-based interactive query service
for massive datasets
Why BigQuery?
● Service for interactive analysis of massive datasets (TBs)
○ Query billions of rows: seconds to write, seconds to return
○ Uses a SQL-style query syntax
○ Offered as a service, accessible through an API
Why BigQuery? (cont’d)
● Reliable and Secure
○ Replicated across multiple machines
○ Secured through Access Control Lists
Why BigQuery? (cont’d)
● Scalable
○ Store hundreds of terabytes
○ Pay only for what you use
● Fast
○ Run ad hoc queries on multi-terabyte datasets in seconds
BigQuery Organization
BigQuery is structured as a hierarchy with 4 levels:

● Projects: Top-level containers in the Google Cloud Platform that store the data
● Datasets: Within projects, datasets hold one or more tables of data
● Tables: Within datasets, tables are row-column structures that hold actual data
● Jobs: The tasks you are performing on the data, such as running queries, loading data,
and exporting data
Example: BigQuery, Datasets, and Tables
● Here is an example of the left-pane navigation within BigQuery
● Projects are identified by the project name, for example ‘bigquery-public-data’
● You can expand projects to see the corresponding datasets, for example ‘github_repos’
● Tables are referenced by their project and dataset as: <project>.<dataset>.<table>
○ for example ‘bigquery-public-data.github_repos.contents’
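The <project>.<dataset>.<table> naming scheme can be illustrated with a small sketch. The TableRef class and parse_table_ref function are hypothetical helpers written for this slide, not part of any BigQuery library, and the sketch assumes none of the three parts contains a dot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """One table, addressed by the three levels of the hierarchy."""
    project: str
    dataset: str
    table: str


def parse_table_ref(ref: str) -> TableRef:
    """Split '<project>.<dataset>.<table>' into its three parts.

    Hypothetical helper; assumes no part contains a dot.
    """
    parts = ref.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <project>.<dataset>.<table>, got {ref!r}")
    return TableRef(*parts)


ref = parse_table_ref("bigquery-public-data.github_repos.contents")
print(ref.project, ref.dataset, ref.table)
```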
Accessing BigQuery
● Web UI (bigquery.cloud.google.com)
● Command line (the bq tool, installed with the gcloud CLI)
● Third-party tools
○ Tableau
○ QlikView
○ R
○ Excel
○ …
● RESTful API
RESTful API (for datasets)
Method   HTTP Request
delete   DELETE /bigquery/v2/projects/{projectId}/datasets/{datasetId}
get      GET    /bigquery/v2/projects/{projectId}/datasets/{datasetId}
insert   POST   /bigquery/v2/projects/{projectId}/datasets
list     GET    /bigquery/v2/projects/{projectId}/datasets
patch    PATCH  /bigquery/v2/projects/{projectId}/datasets/{datasetId}
update   PUT    /bigquery/v2/projects/{projectId}/datasets/{datasetId}
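As a rough illustration of how the dataset methods map to HTTP requests, the sketch below fills in the path templates from the table. The BASE_URL host and the dataset_request helper are assumptions made for this example; no request is actually sent, and authentication is left out.

```python
from urllib.parse import quote

# Assumed service host; check the official API reference before relying on it.
BASE_URL = "https://bigquery.googleapis.com"

# Method name -> (HTTP verb, path template), copied from the table.
DATASET_METHODS = {
    "delete": ("DELETE", "/bigquery/v2/projects/{projectId}/datasets/{datasetId}"),
    "get": ("GET", "/bigquery/v2/projects/{projectId}/datasets/{datasetId}"),
    "insert": ("POST", "/bigquery/v2/projects/{projectId}/datasets"),
    "list": ("GET", "/bigquery/v2/projects/{projectId}/datasets"),
    "patch": ("PATCH", "/bigquery/v2/projects/{projectId}/datasets/{datasetId}"),
    "update": ("PUT", "/bigquery/v2/projects/{projectId}/datasets/{datasetId}"),
}


def dataset_request(method, project_id, dataset_id=None):
    """Return (verb, url) for a dataset method (hypothetical helper)."""
    verb, template = DATASET_METHODS[method]
    if "{datasetId}" in template and dataset_id is None:
        raise ValueError(f"{method} needs a dataset_id")
    path = template.format(
        projectId=quote(project_id, safe=""),
        datasetId=quote(dataset_id or "", safe=""),
    )
    return verb, BASE_URL + path


print(dataset_request("get", "my-project", "my_dataset"))
print(dataset_request("list", "my-project"))
```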
RESTful API (for jobs)
Method           HTTP Request
cancel           POST /bigquery/v2/projects/{projectId}/jobs/{jobId}/cancel
get              GET  /bigquery/v2/projects/{projectId}/jobs/{jobId}
getQueryResults  GET  /bigquery/v2/projects/{projectId}/queries/{jobId}
insert           POST /bigquery/v2/projects/{projectId}/jobs
                 POST /upload/bigquery/v2/projects/{projectId}/jobs
list             GET  /bigquery/v2/projects/{projectId}/jobs
query            POST /bigquery/v2/projects/{projectId}/queries
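A minimal sketch of what a call to the query method could look like, using only the Python standard library. The host name, the placeholder project ID and the access token are assumptions for illustration; the request object is only built here, not sent.

```python
import json
import urllib.request

# Assumed service host; check the official API reference before relying on it.
BASE_URL = "https://bigquery.googleapis.com"


def build_query_request(project_id, sql, access_token):
    """Build (but do not send) a POST to the 'query' method.

    Hypothetical helper: the token is assumed to be an OAuth 2.0 access
    token obtained elsewhere.
    """
    body = {"query": sql, "useLegacySql": False}
    return urllib.request.Request(
        url=f"{BASE_URL}/bigquery/v2/projects/{project_id}/queries",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_query_request(
    "my-project",  # placeholder project ID
    "SELECT COUNT(*) AS n FROM `bigquery-public-data.github_repos.contents`",
    "<ACCESS_TOKEN>",  # placeholder, not a real credential
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing req to urllib.request.urlopen with a valid token; the response is a JSON document.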
BigQuery Architecture: Dremel
● Data model/Storage
● Query execution
Data model/Storage
● Columnar Storage
● Nested/Repeated Fields
● No indexes => every query is a single full scan of the table from disk
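The idea behind columnar storage can be sketched in a few lines: the same rows are stored one column at a time, so a query that needs one field reads only that column. This is a toy in-memory illustration with hypothetical helper names, not how BigQuery lays out data on disk.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"title": "The Three Musketeers", "eur": 19},
    {"title": "Feuernacht", "eur": None},
    {"title": "Get Fit, Stay Fit", "eur": 12},
]


def to_columns(rows):
    """Pivot row records into one list per field (column-oriented layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def scan_sum(columns, name):
    """Full scan of a single column, skipping NULLs; other columns are not read."""
    return sum(value for value in columns[name] if value is not None)


columns = to_columns(rows)
total_eur = scan_sum(columns, "eur")
print(columns["eur"], total_eur)
```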
BOOK 1:
  AUTHOR: Dumas
  TITLE: The Three Musketeers
  PRICE:
    DISCOUNT: 0
    USD: 20
    EUR: 19
BOOK 2:
  AUTHOR: Yrsa Sigurdardottir
  AUTHOR: Tina Flecken
  AUTHOR: Elma Klein
  TITLE: Feuernacht
BOOK 3:
  TITLE: Get Fit, Stay Fit
  PRICE:
    DISCOUNT: 0
    EUR: 12
  PRICE:
    DISCOUNT: 1
    EUR: 11
Columnar Representation
(each value is followed by its (R, D) levels)

AUTHOR
  Dumas (0, 1)
  Yrsa Sigurdardottir (0, 1)
  Tina Flecken (1, 1)
  Elma Klein (1, 1)
  NULL (0, 0)

TITLE
  The Three Musketeers (0, 1)
  Feuernacht (0, 1)
  Get Fit, Stay Fit (0, 1)

PRICE.DISCOUNT
  0 (0, 2)
  NULL (0, 0)
  0 (0, 2)
  1 (1, 2)

PRICE.USD
  20 (0, 2)
  NULL (0, 0)
  NULL (0, 1)
  NULL (1, 1)

PRICE.EUR
  19 (0, 2)
  NULL (0, 0)
  12 (0, 2)
  11 (1, 2)
R = In the path to the field, what is the last repeated field?
D = In the path to the field, how many fields are defined?
Fields in parentheses are missing from the record and stored as NULL.

Record                            Column                R  D
BOOK 1:
  AUTHOR: Dumas                   AUTHOR                0  1
  TITLE: The Three Musketeers     TITLE                 0  1
  PRICE:
    DISCOUNT: 0                   PRICE.DISCOUNT        0  2
    USD: 20                       PRICE.USD             0  2
    EUR: 19                       PRICE.EUR             0  2
BOOK 2:
  AUTHOR: Yrsa Sigurdardottir     AUTHOR                0  1
  AUTHOR: Tina Flecken            AUTHOR[1]             1  1
  AUTHOR: Elma Klein              AUTHOR[2]             1  1
  TITLE: Feuernacht               TITLE                 0  1
  (PRICE)
    (DISCOUNT): NULL              (PRICE).(DISCOUNT)    0  0
    (EUR): NULL                   (PRICE).(EUR)         0  0
    (USD): NULL                   (PRICE).(USD)         0  0
BOOK 3:
  (AUTHOR): NULL                  (AUTHOR)              0  0
  TITLE: Get Fit, Stay Fit        TITLE                 0  1
  PRICE:
    DISCOUNT: 0                   PRICE.DISCOUNT        0  2
    EUR: 12                       PRICE.EUR             0  2
    (USD): NULL                   PRICE.(USD)           0  1
  PRICE:
    DISCOUNT: 1                   PRICE[1].DISCOUNT     1  2
    EUR: 11                       PRICE[1].EUR          1  2
    (USD): NULL                   PRICE[1].(USD)        1  1
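The R and D values for the three books can be reproduced with a short sketch of record shredding. It is a simplified illustration, not Dremel's actual implementation: the SCHEMA encoding and the shred function are hypothetical, and every field is treated as optional or repeated, as on the slides.

```python
from collections import defaultdict

# field name -> (mode, children); children is None for a leaf field.
SCHEMA = {
    "AUTHOR": ("repeated", None),
    "TITLE": ("optional", None),
    "PRICE": ("repeated", {
        "DISCOUNT": ("optional", None),
        "USD": ("optional", None),
        "EUR": ("optional", None),
    }),
}

BOOKS = [
    {"AUTHOR": ["Dumas"], "TITLE": "The Three Musketeers",
     "PRICE": [{"DISCOUNT": 0, "USD": 20, "EUR": 19}]},
    {"AUTHOR": ["Yrsa Sigurdardottir", "Tina Flecken", "Elma Klein"],
     "TITLE": "Feuernacht"},
    {"TITLE": "Get Fit, Stay Fit",
     "PRICE": [{"DISCOUNT": 0, "EUR": 12}, {"DISCOUNT": 1, "EUR": 11}]},
]


def leaf_columns(path, children):
    """Yield the dotted names of all leaf columns under a field."""
    if children is None:
        yield ".".join(path)
    else:
        for name, (_, sub) in children.items():
            yield from leaf_columns(path + (name,), sub)


def shred(record, schema, out, path=(), r=0, d=0, rep_depth=0):
    """Append (value, R, D) triples to out[column] for one record."""
    for name, (mode, children) in schema.items():
        field_path = path + (name,)
        if mode == "repeated":
            values, depth = record.get(name, []), rep_depth + 1
        else:
            values = [record[name]] if record.get(name) is not None else []
            depth = rep_depth
        if not values:
            # Missing field: emit NULL, keeping the levels reached so far.
            for column in leaf_columns(field_path, children):
                out[column].append((None, r, d))
            continue
        for i, value in enumerate(values):
            # The first occurrence inherits R; later ones repeat at this field.
            level_r = r if i == 0 else depth
            if children is None:
                out[".".join(field_path)].append((value, level_r, d + 1))
            else:
                shred(value, children, out, field_path, level_r, d + 1, depth)


columns = defaultdict(list)
for book in BOOKS:
    shred(book, SCHEMA, columns)

for name, entries in columns.items():
    print(name, entries)
```

Each record starts with R = 0, which is how a reader of the column can tell where one book ends and the next begins.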
Query execution
● Tree architecture
● Runs on the order of tens of thousands of machines connected by Google’s petabit network (more than 1 Pbit/s)
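The tree architecture can be pictured with a toy aggregation: leaf servers scan their own partition and return partial results, and each level above combines the partials of its children until the root holds the answer. The function names and the fan-out parameter are hypothetical, and everything runs in one process here; it is only a sketch of the idea.

```python
def leaf_scan(partition):
    """Leaf server: scan one partition and return a partial (COUNT, SUM)."""
    return len(partition), sum(partition)


def combine(partials):
    """Intermediate or root server: merge partial results from its children."""
    return (
        sum(count for count, _ in partials),
        sum(total for _, total in partials),
    )


def run_tree(partitions, fanout=2):
    """Aggregate bottom-up through a tree with the given fan-out."""
    level = [leaf_scan(p) for p in partitions]
    if not level:
        return 0, 0
    while len(level) > 1:
        level = [combine(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
count, total = run_tree(partitions)
print(count, total)
```

Because COUNT and SUM can be merged in any grouping, the result does not depend on the fan-out, which is what lets the work spread over many machines.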
DEMO
References
● https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/
● https://cloud.google.com/files/BigQueryTechnicalWP.pdf
● https://cloud.google.com/bigquery/docs/
● https://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_tue_1415_Ryan
Boyd.pdf
