
Python Key Term

numpy
__name__, "__main__"
dict()

list
sorted()
reversed()
pprint
set
Spark Window functions
first_value(parameter),
lag()
lead()
last_value(parameter)
.read in spark
.write in spark
Functionalities

If the Python interpreter is running that module (the source file) as the main program, it sets the special __name__ variable to "__main__"
constructor builds dictionaries directly from sequences of key-value pairs
mapping of key to value
mutable sequence of objects
sorts any iterable series, returns a list
reverses any iterable series, returns a reverse iterator
pretty-print: allows you to customize the formatting of your output
delimited by {}, with comma-separated items; an empty {} creates a dict, so we need to use set() for an empty set
perform a calculation over a group of records
holds first value
holds previous value
holds next value
holds last value
reads a file from the desired location
writes a file to the desired location
Syntax/Examples

if __name__ == "__main__":
dict([('key1', value1), ('key2', value2), ('key3', value3)])

from pprint import pprint as pp

spark.read.load("examples/src/main/resources/people.json", format="json")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
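A minimal sketch tying together sorted(), reversed(), set(), and pprint from the key terms above (the sample data is illustrative):

from pprint import pprint as pp

numbers = [3, 1, 4, 1, 5, 9, 2, 6]                   # illustrative data
print(sorted(numbers))                               # sorts any iterable, returns a new list
print(list(reversed(numbers)))                       # reversed() returns a reverse iterator
unique = set(numbers)                                # duplicates removed; use set() for an empty set
pp({"original": numbers, "unique": sorted(unique)})  # pretty-print a nested structure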
ETL

.read()
.write()

createDataFrame()

filter()

dbutils.fs
dbutils.fs.ls(path)
Window (pyspark.sql.window)

databricks secrets

mount_point
Directly connect Blob

coalesce
repartition

select()

foreachPartition()

.range(1000, 10000)
%run command
ARM Template

CI/CD
Error Records
How to write data from a file to SQL Server

dataframe and RDD

validate Email
Dataframe column sequence
regexp_replace
Lookup
Get Metadata
badRecordsPath using option parameter.

logging

udf
Extract

logDF = (spark
  .read
  .option("header", True)
  .csv("/path/to/access_log.csv")  # the path here is a placeholder
)

(df.write.format('parquet')
  .bucketBy(100, 'year', 'month')
  .mode("overwrite")
  .saveAsTable('bucketed_table'))
df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')]
,schema=('id','bar'))
df4.groupBy("year").pivot("course").sum("earnings").collect()
df.select(df.name).orderBy(df.name.asc()).collect()
df.select(df.age.cast(StringType()).alias('ages')).collect()

provides utilities for working with FileSystems


Shows list of files in specified path
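An illustrative sketch of common dbutils.fs calls inside a Databricks notebook (the paths are placeholders):

files = dbutils.fs.ls("/mnt/azblobinputdata")        # list files at a path
for f in files:
    print(f.path, f.size)

dbutils.fs.mkdirs("/tmp/demo")                       # create a directory
dbutils.fs.cp("/mnt/azblobinputdata/retail_data.csv", "/tmp/demo/retail_data.csv")  # copy a file
dbutils.fs.rm("/tmp/demo", recurse=True)             # remove recursively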

from pyspark.sql.window import Window

Secrets allow you to secure your credentials and reference them in your code instead of hard-coding them
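A hedged sketch of reading a secret from a notebook instead of hard-coding it; the scope and key names below are illustrative and assume the databricks secrets CLI setup shown later in these notes:

# Fetch the storage account key from a secret scope rather than hard-coding it.
storage_account_access_key = dbutils.secrets.get(scope="gws-blob-storage",
                                                 key="storage-acct-key")
spark.conf.set("fs.azure.account.key.<blob name>.blob.core.windows.net",
               storage_account_access_key)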

accountKey = "fs.azure.account.key.<blob name>.blob.core.windows.net"


dbutils.fs.mount(
source = "wasbs://<container name>@<blob name>.blob.core.windows.net",
mount_point = "/mnt/azblobinputdata",
extra_configs = {accountKey:"<generate token to get this key>"})
storage_account_name = '<blob name>'

storage_account_access_key = '<key>'

blob_container = '<container name>'


spark.conf.set('fs.azure.account.key.' + storage_account_name + '.blob.core.windows.net', storage_account_access_key)
coalesce: we can reduce the number of partitions; it creates unequal-sized partitions with minimum shuffling, e.g. df.coalesce(n)
repartition: we can increase or decrease the number of partitions; it creates equal-sized partitions; it will shuffle and evenly distribute the data
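A small sketch contrasting the two (the partition counts are illustrative):

df = spark.range(0, 1000)                # illustrative DataFrame
print(df.rdd.getNumPartitions())         # current number of partitions

fewer = df.coalesce(2)                   # narrow transformation, minimal shuffling
print(fewer.rdd.getNumPartitions())      # 2, possibly unequal in size

more = df.repartition(8)                 # full shuffle, evenly sized partitions
print(more.rdd.getNumPartitions())       # 8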
df.createOrReplaceGlobalTempView("people")
>>> df2 = df.filter(df.age > 3)
>>> df2.createOrReplaceGlobalTempView("people")
>>> df3 = spark.sql("select * from global_temp.people")
>>> sorted(df3.collect()) == sorted(df2.collect())
>>> def f(people):
...     for person in people:
...         print(person.name)
>>> df.foreachPartition(f)
spark.range(1000, 10000)
To call a notebook inside another
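A minimal usage sketch (the notebook path is a placeholder; in Databricks, %run must be in a cell by itself):

%run ./shared/setup_notebook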

If you need a way of deploying infrastructure-as-code to Azure, then Azure Resource Manager (ARM) Templates are the obvious way of doing it simply and repeatedly. They define the objects you want, their types, names and properties in a JSON file which can be understood by the ARM API.

Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

val jdbcHostname = "<hostname>"
val jdbcPort = 1433
val jdbcDatabase = "<database>"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()

connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
RDD (Resilient Distributed Dataset) is type-safe and fault-tolerant, but the developer has to take care of optimization.
DataFrame is built on top of RDD and is more optimized.

Regular expression
DataFrame[["column1", "column2", "column3"]]
to replace matched patterns, e.g. removing the extra spaces in a string (see the sketch after this list)
to read the content of database tables or files
allows reading the metadata of its sources
to implement error handling in Databricks and the location from which we read the error logs
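A hedged regexp_replace sketch for the extra-spaces case mentioned above (the DataFrame, column name, and pattern are illustrative):

from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame([("12  Main   Street",)], ["address"])
# Collapse runs of whitespace to a single space.
cleaned = df.withColumn("address", regexp_replace("address", r"\s+", " "))
cleaned.show(truncate=False)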

Log files are a better way to capture information because they allow us to see our logged information over time in one place.
5 standard logging levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
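A tiny sketch emitting each of the five levels (the file and logger names mirror the example later in these notes):

import logging

logging.basicConfig(filename='log_file.log', level=logging.DEBUG,
                    format='%(asctime)s:%(levelname)s:%(message)s')
log = logging.getLogger("my-logger")

log.debug("diagnostic detail")        # DEBUG
log.info("normal operation")          # INFO
log.warning("something unexpected")   # WARNING
log.error("operation failed")         # ERROR
log.critical("cannot continue")       # CRITICAL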

import org.apache.spark.sql.functions.udf
Transform

serverErrorDF = (logDF
.filter((col("code") >= 500) & (col("code") < 600))
.select("date", "time", "extention", "code")
)

df.write.format('json').save(os.path.join(tempfile.mkdtemp(),
'data'))

df.filter('bar not in ("a","b")').show()


salesDf.filter(col('Item')=='Pencil').select("Region","Rep","Item","Units").show()
df.withColumnRenamed('age', 'age2').collect()
[Row(age2=2, name='Alice'), Row(age2=5, name='Bob')]

display(dbutils.fs.ls(writePath))
Window functions perform a calculation over a group of records, for example calculating a cumulative sum or accessing the values of a row appearing before the current row.
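A hedged PySpark sketch of the window functions listed above: a cumulative sum plus lag/lead over an ordered window (the data and column names are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

sales = spark.createDataFrame(
    [("east", 1, 10), ("east", 2, 20), ("east", 3, 15),
     ("west", 1, 5), ("west", 2, 25)],
    ["region", "day", "amount"])

w = Window.partitionBy("region").orderBy("day")

result = (sales
          .withColumn("running_total", F.sum("amount").over(w))    # cumulative sum
          .withColumn("prev_amount", F.lag("amount").over(w))      # value from the previous row
          .withColumn("next_amount", F.lead("amount").over(w))     # value from the next row
          .withColumn("first_amount", F.first("amount").over(w)))  # first value in the window
result.show()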

pip install databricks-cli


databricks configure --token
databricks secrets create-scope --scope gws-blob-storage
databricks secrets put --scope gws-blob-storage --key storage-acct-key
databricks secrets list-scopes
databricks secrets list --scope gws-blob-storage
df = spark.read.option("inferSchema", "true")\
.option("header", "true")\
.csv("/mnt/azblobinputdata/retail_data.csv")
display(df.limit(5))

filePath = "wasbs://" + blob_container + "@" +


storage_account_name +
".blob.core.windows.net/SampleData.csv"
salesDf = spark.read.format("csv").load(filePath, inferSchema =
True, header = True)
salesDf.show()
df.coalesce(n)
df.repartition(n)
val employees_table = spark.read.jdbc(jdbcUrl, "employees", connectionProperties)

RDD: it was difficult to do simple operations like join, group by, and outer join; beginners found it difficult to write the code.
DataFrame gives the user the feel of a table; it is built on top of RDD.
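A small sketch of the same aggregation written against an RDD and against a DataFrame (the data is illustrative), showing why the DataFrame API reads more like a table operation:

pairs = [("a", 1), ("b", 2), ("a", 3)]                       # illustrative key-value data

rdd = spark.sparkContext.parallelize(pairs)
rdd_totals = rdd.reduceByKey(lambda x, y: x + y).collect()   # RDD API

df = spark.createDataFrame(pairs, ["key", "value"])
df.groupBy("key").sum("value").show()                        # DataFrame API, optimized by Spark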

val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.parquet("/input/parquetFile")

import logging
logging.basicConfig(filename='log_file.log', level=logging.DEBUG,
                    format='%(asctime)s:%(levelname)s:%(message)s')
log = logging.getLogger("my-logger")
log.info("Hello, world")

import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)
dataset.withColumn("upper", upperUDF('text)).show
Load
(serverErrorDF.write
  .parquet("/path/to/serverErrors"))  # output path is a placeholder

df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 'data'))

df.write.partitionBy('year', 'month').parquet(os.path.join(tempfile.mkdtemp(), 'data'))

serverErrorDF = (logDF
.filter((col("code") >= 500) & (col("code") < 600))
.select("date", "time", "extention", "code")
)

display(serverErrorDF)

from pyspark.sql.window import Window

# Defines partitioning specification and ordering specification.
windowSpec = \
  Window \
    .partitionBy(...) \
    .orderBy(...)

# Defines a Window Specification with a ROW frame.
windowSpec.rowsBetween(start, end)

# Defines a Window Specification with a RANGE frame.
windowSpec.rangeBetween(start, end)
In Spark, a DataFrame is a distributed collection of data organized into named columns. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
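A hedged sketch constructing DataFrames from two of those sources, a JSON file and an existing RDD (the file path and data are illustrative):

from pyspark.sql import Row

# From a structured data file (same sample path used earlier in these notes).
peopleDF = spark.read.json("examples/src/main/resources/people.json")

# From an existing RDD of Row objects.
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=2), Row(name="Bob", age=5)])
fromRddDF = spark.createDataFrame(rdd)
fromRddDF.show()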
Libraries Syntax
pyspark.sql.dataframe.DataFrame .range(100,1000)
spark.createDataFrame

pyspark.sql.dataframe.DataFrame
pyspark.sql.dataframe.DataFrame corruptDF.dropna("any")
from pyspark.sql.functions import col, max, min min("colname")
col("id")
max("id")
pyspark.sql.dataframe.DataFrame df.dropDuplicates()

from pyspark.sql.functions import col, lower, translate .lower("columnname")

pyspark.sql.dataframe.DataFrame .split("good")
from pyspark.sql.types import StringType spark.udf.register()

from pyspark.sql.functions import sha1, rand

%sql SQL API


Examples

corruptDF = spark.createDataFrame([
(11, 66, 5),
(12, 68, None),
(1, None, 6),
(2, 72, 7)],
["hour", "temperature", "wind"]
)

duplicateDF.dropDuplicates(["id", "favorite_color"])
dupedWithColsDF = (dupedDF
.select(col("*"),
lower(col("firstName")).alias("lcFirstName"),
lower(col("lastName")).alias("lcLastName"),
lower(col("middleName")).alias("lcMiddleName"),
translate(col("ssn"), "-", "").alias("ssnNums")
))

def manual_split(x):
    return x.split("s")

manual_split("My name is Lipsa")

manualSplitPythonUDF = spark.udf.register("manualSplitSQLUDF", manual_split, StringType())


randomDF = (spark.range(1, 10000 * 10 * 10 * 10)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1("random_value"))
)

%sql
SELECT id,
hash,
manualSplitSQLUDF(hash) as augmented_col
FROM
randomTable
1. Introduction about the experience, projects worked on Azure ( ADF & Databricks ).
2. Asked about ADF pipeline I worked on, explained the flow.
3. What is lookup table, why we used it.
4. What is self-hosted IR?
5. How and where you will configure your self-hosted IR ?
6. Scenario based question on ADF
If we have data coming from 30 different countries to one place, the data will be coming in on a daily basis, and we need to process it. The schema will be the same for all the data. How would you design the pipeline? What would your pipeline look like? How many pipelines would you create? How would you check the incoming data, and how would you read it?
7. What windows function in spark you worked on ?
8. What component I worked on my search engine project ?
9. What Databricks transformation I worked on ?
10. Will ask to explain some transformation function which we worked, how we did it, the steps.
11. What to use to connect Databricks to Blob storage, what is a mount, what is a mount point?
12. What language I worked on, python, spark?
13. What Azure Technologies I know other than ADF and Databricks. 
14. Difference between coalesce () and repartition(). Is there any difference or not ?
1. Introduction
2. Roles and responsibility in current project and previous project- followed by a lot of sub queries
3. Work on Databricks
4. How to read files and export files in Databricks
5. Various connections from Databricks

6. Distribution in Azure Synapse

7. How to improve performance in Azure Synapse


8. How to authenticate BLOB from ADF
9. What will you do if you do not fully understand a PBI in a sprint
10. What will you do if you find that you need more infrastructure to achieve a task?
Roles and responsibility in current project – and cross questioning on the same
Not a single question on Python
Write file with some specific name
coalesce () and repartition() functions
logging mechanism in ADF and Databricks
Pass variable from one activity to other in ADF
Different activities used in ADF
Connection to ADLS from Databricks
Have you used any other source system than cloud – how to implement the same
Null handling
Logic apps
Cluster configuration
Best file format to be written? Why?
A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. When Synapse SQL runs a query, the work is divided into 60 smaller queries that run in parallel.
Dataflows
Validation activity

look up activity

Azure Synapse

Round Robin Table


Hash Table

Hash Function

Polybase
Steps of Polybase

Integration Runtime
Way of writing transformations without writing any code by drag and drop
works as traffic control for the course of actions in your ADF pipeline
the Validation activity's task is to provide better control of your ADF pipeline execution
retrieves a dataset from any of the ADF data sources
the output of the Lookup activity can be used in a subsequent activity

Massively Parallel Processing (MPP) is the coordinated processing of a single task by multiple processors.
A round-robin distributed table is a table where the data is distributed evenly (or as evenly as possible) across distributions; it can be used for loading from the staging to the production layer.
A hash-distributed table assigns each row to a distribution using a hash function on key values; it improves query performance on large fact tables and is used from source to staging.
A hash function is a function which, when given a key, generates an address in the table.
Polybase is a technology that accesses external data stored in Azure Blob storage or Azure Data Lake by using the Transact-SQL language.
Connection: connect to SQL DW
Create an external data source
Create an external file format
Create a schema for the external table
Create the external table
Create a schema for the table
Create the SQL DW table and load the data

An Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities such as Data Flows and Data Movement. It has access to resources in either public networks, or hybrid scenarios (public and private networks).
child items, last modified date, exists, created

T-SQL: a set of programming extensions that add several features to Structured Query Language (SQL)
CREATE MASTER KEY ENCRYPTION BY PASSWORD='MyP@ssword123secretword';

CREATE DATABASE SCOPED CREDENTIAL mycredential
WITH IDENTITY = 'credential',
Secret = 'VgqTsEqg1beXqhb+mO/wZh9OUZ+ByhKJJj7qc9pBne9e+BsdFo4Mfdl+u8Gh94tDqCR0/uNZ4KHr0r4WuK85lA=='

CREATE EXTERNAL DATA SOURCE mycustomers


WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://mycontainer@polybasestoragesqlshack.blob.core.windows.net/',
CREDENTIAL = mycredential
);
CREATE EXTERNAL FILE FORMAT csvformat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        First_Row = 2
    )
);

SELECT [name]
      ,[lastname]
      ,[email]
FROM [AdventureWorks2016CTP3].[dbo].[customerstable]
CREATE EXTERNAL TABLE customerstable
(
name VARCHAR(128),
lastname VARCHAR(128),
email VARCHAR(100)
)
WITH
(
LOCATION = '/',
DATA_SOURCE = mycustomers,
FILE_FORMAT = csvformat
);
you can write project description here
