
Data Storage services in GCP

[Diagram: data storage services in GCP (Cloud Dataproc, Cloud SQL, Bigtable, BigQuery, Cloud Datastore, Cloud Storage, Cloud Spanner), grouped into managed services and serverless services]

Relational database: Cloud SQL, Cloud Spanner
Data warehouse: BigQuery
NoSQL: Cloud Datastore
Big data database service: Bigtable

Useful links
https://db-engines.com/en/system/Google+Cloud+Bigtable%3BGoogle+Cloud+Datastore
https://cloud.google.com/solutions/data-lifecycle-cloud-platform#processing_large-scale_data
https://cloud.google.com/solutions/data-lifecycle-cloud-platform#storing_object_data
Choosing a storage service by data type and workload:
Structured, transactional: Cloud SQL, Cloud Spanner
Structured, analytical: BigQuery
Semi-structured, fully indexed: Cloud Datastore
Semi-structured, accessed by row key: Bigtable
Unstructured: Cloud Storage
Google Cloud Interconnect

Choosing a connectivity option (Cloud VPN vs Partner Interconnect vs Dedicated Interconnect):
Cloud VPN: traffic travels over the public internet; suitable when that is acceptable and bandwidth needs are modest.
Partner Interconnect: private connectivity through a supported service provider, for bandwidth below 10 Gbps.
Dedicated Interconnect: a direct physical connection to Google's network, for 10 Gbps and above.
[Diagram: a hybrid architecture in which an on-premises "render farm" connects through a local gateway to compute resources and Cloud Storage in GCP]

1) Google Cloud Storage & Large Object upload speeds


Breaking this down, gsutil can automatically use object composition to perform parallel uploads of large local files to Google Cloud Storage. It works by splitting a large file into component pieces that are uploaded in parallel and then composed in the cloud (the temporary components are deleted afterwards).
You can enable this by setting the `parallel_composite_upload_threshold` option on gsutil (or by updating your .boto file, as the console output suggests).
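For example (the file and bucket names are illustrative), the threshold can be passed on the command line for a single upload or persisted in the .boto configuration file:

gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./large-archive.tar gs://my-example-bucket/
# equivalent .boto setting:
# [GSUtil]
# parallel_composite_upload_threshold = 150M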
Cloud SQL vs Cloud Spanner


1) Cloud Spanner is horizontally scalable, whereas Cloud SQL is not.

2) Cloud Spanner offers regional and multi-region instance configurations, whereas Cloud SQL offers only regional configurations.
3) Cloud Spanner supports relational, structured, and semi-structured data, whereas Cloud SQL supports relational data.
4) Cloud Spanner does not use open-source database engines, unlike Cloud SQL.
5) Cloud SQL supports MySQL, PostgreSQL, and SQL Server.
6) Cloud Spanner can scale both horizontally and vertically, whereas Cloud SQL scales only vertically (reads can be scaled out with read replicas).

Cloud SQL horizontal scaling explanation


https://cloud.google.com/community/tutorials/horizontally-scale-mysql-database-backend-with-google-cloud-sql-and-proxysql
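The tutorial above scales reads by placing ProxySQL in front of Cloud SQL read replicas. As a minimal sketch (the instance names are assumptions), a read replica can be added to an existing primary with:

gcloud sql instances create my-replica-1 --master-instance-name=my-primary-instance

ProxySQL then distributes read traffic across the replicas while writes continue to go to the primary.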
BigQuery

1) Export data limitations


https://cloud.google.com/bigquery/docs/exporting-data
2) BigQuery charges for storage, queries, and streaming inserts.
Note: loading data from files and exporting data are free operations.
3) Cached results in BigQuery
https://cloud.google.com/bigquery/docs/cached-results
4) BigQuery loading data
https://cloud.google.com/bigquery/docs/loading-data
5) BigQuery partition tables
6) Sources to load data into BigQuery (Google Drive, Cloud Bigtable, file upload, Cloud Storage)
7) Differences between Standard SQL vs Legacy SQL
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
8) BigQuery Web UI limitations.
9) Concurrent rate limit for interactive queries: 100 concurrent queries
Note: dry-run queries do not count against this limit. You can specify a dry run query using the --dry_run flag or by setting the dryRun property in a query job (see the example after this list).
10) Cross-region federated querying — 1 TB per project per day
11) Query execution time limit — 6 hours
12) BigQuery Export Limitations
• You cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage. For information on saving query results, see Downloading and saving query results.
• You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
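As a sketch of the dry-run and cache controls mentioned above (the query is illustrative): a dry run reports the bytes a query would process without running it, and the cache can be bypassed when fresh results are required:

bq query --dry_run --use_legacy_sql=false 'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10'
bq query --nouse_cache --use_legacy_sql=false 'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10'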
BigQuery Metadata query

1) How to get table metadata.


SELECT * FROM DataSet.INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'YOUR_TABLE_NAME'

2) Best way to deduplicate a partition of data in a table


MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

3) Select all columns except some in Google BigQuery


SELECT * EXCEPT(title, comment) FROM publicdata.samples.wikipedia LIMIT 10

4) How to recover a dropped table?

Method 1 (recover the table at no cost): bq cp dataset.table@1577833205000 dataset.new_table
Method 2: SELECT * FROM `project.dataset.table` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
BigQuery bq command line

• Mechanism to export a BigQuery table to Google Cloud Storage:

bq extract --destination_format CSV 'bigquery-public-data:austin_311.311_service_requests' gs://dataflow01-test/myfile.csv

• Mechanism to export a BigQuery table schema in JSON format to Google Cloud Storage (bq show writes to stdout, so it is streamed to the bucket with gsutil cp -):

bq show --format=prettyjson 'bigquery-public-data:austin_311.311_service_requests' | gsutil cp - gs://dataflow01-test/311_service_requests.json
GCP Cloud Storage
• Buckets can be regional, dual-region, or multi-region.
• Storage classes are Standard (frequently accessed data), Nearline (accessed less than once every 30 days), Coldline (less than once every 90 days), and Archive (less than once a year) (see the example below).
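For example (the bucket name and location are assumptions), a bucket with a specific default storage class can be created with gsutil:

gsutil mb -c nearline -l us-central1 gs://my-example-nearline-bucket/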
BigQuery
• An external (federated) data source can be queried directly without loading the data into BigQuery.
• Supported sources: Cloud Bigtable, Cloud Storage, Google Drive, and Cloud SQL.
Use cases for external data sources:
• Loading and cleaning your data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage (see the sketch after this list).
• Having a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
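As a minimal sketch of the first use case above (the dataset, table, and bucket names are assumptions), an external table over CSV files in Cloud Storage can be defined and the cleaned result written into native BigQuery storage:

CREATE EXTERNAL TABLE mydataset.sales_external (sale_id STRING, amount NUMERIC)
OPTIONS (format = 'CSV', uris = ['gs://my-example-bucket/sales/*.csv']);

CREATE TABLE mydataset.sales_clean AS
SELECT sale_id, amount
FROM mydataset.sales_external
WHERE amount IS NOT NULL;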

Controlling costs in BigQuery

• Avoid SELECT * (query only the columns that you need).
• Don't run queries just to explore or preview table data; use the preview options instead.
• BigQuery supports the following data preview options:
  • In the Cloud Console, on the table details page, click the Preview tab to sample the data.
  • In the bq command-line tool, use the bq head command and specify the number of rows to preview (see the example below).
  • In the API, use tabledata.list to retrieve table data from a specified set of rows.
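For example (the dataset and table names are illustrative), previewing rows with bq head incurs no query cost, unlike running a SELECT:

bq head --max_rows=10 mydataset.mytable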
Managing storage in GCP
• Zonal persistent disks are available in HDD and SSD variants.
• Create blank disks, or create disks from a source (an existing persistent disk, a snapshot, or an image).
• Blank disks must be formatted after attaching them to an instance.
• You can increase the size of a zonal persistent disk but not decrease it.
• It is a best practice to back up your disks using snapshots to prevent unintended data loss.
• Use sudo resize2fs to grow the file system after resizing a persistent disk (see the sketch below).
• Set autoDelete to true if you want the disk to be deleted when the instance is deleted.
• Sharing static data between multiple instances from one persistent disk is cheaper than replicating your data to unique disks for individual instances.

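As a sketch (the disk name, zone, and size are assumptions), growing a zonal persistent disk and then the file system on it looks like this:

gcloud compute disks resize my-data-disk --size=200GB --zone=us-central1-a
# then, inside the VM (assuming an ext4 file system directly on /dev/sdb):
sudo resize2fs /dev/sdb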
Regional persistent disks

• Regional persistent disks can't be used as boot disks.
• You can create a regional persistent disk from a snapshot, but not from an image.
• You can't use a regional persistent disk with a memory-optimized or compute-optimized machine type.
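As a small illustration (the disk name, region, and zones are assumptions), a regional persistent disk replicated across two zones can be created with:

gcloud compute disks create my-regional-disk --size=200GB --type=pd-ssd --region=us-central1 --replica-zones=us-central1-a,us-central1-b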
Virtual machines (VMs) in GCP
Instance groups
• Instance groups come in two types: managed instance groups (MIGs) and unmanaged instance groups.

Shielded VMs
• Quickly protect VMs against advanced threats.
• Ensure workloads are trusted and verified.
• Help protect secrets against exfiltration and replay (see the sketch below).
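As a sketch (the instance name, zone, and image are assumptions), Shielded VM options can be enabled at instance creation time:

gcloud compute instances create my-shielded-vm --zone=us-central1-a --image-family=debian-12 --image-project=debian-cloud --shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring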
Python Pandas

Ask | pandas command
Distinct | drop_duplicates(), unique() (use nunique() to get the count)
Top 10 records from a table | head(10)
Show columns | df.columns
Show index | df.index
Group by | groupby('column')['column'].sum()
Give an alias to the aggregated field | reset_index(name='QTY_Sum')
Order by | sort_values('QTY_Sum', ascending=False)
Lower | str.lower()
Substring | str.slice(1) / str.split()
Cast | df['DataFrameColumn'].astype(int)
Get the count of each value in a column | value_counts()
Summarize the DataFrame | describe()
Select | df[['column1', 'column2']]
Get row and column counts | df.shape
Select all columns except the last 3 | df.iloc[:, :-3]

Ex: users.groupby(['gender','occupation'])['occupation'].count().reset_index(name='CNT_Gen').sort_values(by=['occupation','gender'], ascending=False)

Unstack and stack
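A minimal sketch of stack and unstack (the DataFrame below is made up): unstack pivots an index level into columns, and stack reverses it.

import pandas as pd

df = pd.DataFrame({'gender': ['F', 'F', 'M', 'M'],
                   'occupation': ['artist', 'doctor', 'artist', 'doctor'],
                   'cnt': [10, 20, 30, 40]})
wide = df.groupby(['gender', 'occupation'])['cnt'].sum().unstack()  # occupations become columns
long = wide.stack()  # back to a Series with a (gender, occupation) MultiIndex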
