
Python Key Term

numpy
__name__, "__main__"
dict()

list
sorted()
reversed()
pprint
set
Spark Window functions
first_value(parameter),
lag()
lead()
last_value(parameter)
.read in spark
.write in spark
Functionalities

If the Python interpreter is running that module (the source file) as the main program, it sets the special __name__ variable to "__main__"
constructor builds dictionaries directly from sequences of key-value pairs
mapping of key to value
mutable sequence of objects
sorts any iterable series, returns a list
reverses any iterable series, returns a reverse iterator
pretty-print: allows you to customize the formatting of your output
delimited by {}, with comma-separated items; an empty {} creates a dict, so we need to use set() for an empty set
perform a calculation over a group of records
holds first value
holds previous value
holds next value
holds last value
reads a file from the desired location
writes a file to the desired location
Syntax/Examples

if __name__ == "__main__":
dict([('key1', value1), ('key2', value2), ('key3', value3)])

from pprint import pprint as pp

spark.read.load("examples/src/main/resources/people.json", format="json")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
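A minimal sketch tying together sorted(), reversed(), set(), and pprint from the key terms above (the sample data is illustrative):

from pprint import pprint as pp

numbers = [3, 1, 4, 1, 5, 9, 2, 6]                   # illustrative data
print(sorted(numbers))                               # sorts any iterable, returns a new list
print(list(reversed(numbers)))                       # reversed() returns a reverse iterator
unique = set(numbers)                                # duplicates removed; use set() for an empty set
pp({"original": numbers, "unique": sorted(unique)})  # pretty-print a nested structure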
ETL

.read()
.write()

createDataFrame()

filter()

dbutils.fs
dbutils.fs.ls(path)
Window (pyspark.sql.window)

databricks secrets

mount_point
Directly connect Blob

coalesce
repartition

select()

foreachPartition()

.range(1000, 10000)
%run command
ARM Template

CI/CD
Error Records
How to write data from a file to SQL Server

dataframe and RDD

validate Email
Dataframe column sequence
regexp_replace
Lookup
Get Metadata
badRecordsPath using option parameter.

logging

udf
Extract

logDF = (spark
  .read
  .option("header", True)
  .csv("/path/to/access_log.csv")  # the path here is a placeholder
)

(df.write.format('parquet')
  .bucketBy(100, 'year', 'month')
  .mode("overwrite")
  .saveAsTable('bucketed_table'))
df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')]
,schema=('id','bar'))
df4.groupBy("year").pivot("course").sum("earnings").collect()
df.select(df.name).orderBy(df.name.asc()).collect()
df.select(df.age.cast(StringType()).alias('ages')).collect()

provides utilities for working with FileSystems


Shows list of files in specified path
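An illustrative sketch of common dbutils.fs calls inside a Databricks notebook (the paths are placeholders):

files = dbutils.fs.ls("/mnt/azblobinputdata")        # list files at a path
for f in files:
    print(f.path, f.size)

dbutils.fs.mkdirs("/tmp/demo")                       # create a directory
dbutils.fs.cp("/mnt/azblobinputdata/retail_data.csv", "/tmp/demo/retail_data.csv")  # copy a file
dbutils.fs.rm("/tmp/demo", recurse=True)             # remove recursively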

from pyspark.sql.window import Window

Secrets allow you to secure your credentials and reference them in your code instead of hard-coding them
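A hedged sketch of reading a secret from a notebook instead of hard-coding it; the scope and key names below are illustrative and assume the databricks secrets CLI setup shown later in these notes:

# Fetch the storage account key from a secret scope rather than hard-coding it.
storage_account_access_key = dbutils.secrets.get(scope="gws-blob-storage",
                                                 key="storage-acct-key")
spark.conf.set("fs.azure.account.key.<blob name>.blob.core.windows.net",
               storage_account_access_key)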

accountKey = "fs.azure.account.key.<blob name>.blob.core.windows.net"


dbutils.fs.mount(
source = "wasbs://<container name>@<blob name>.blob.core.windows.net",
mount_point = "/mnt/azblobinputdata",
extra_configs = {accountKey:"<generate token to get this key>"})
storage_account_name = '<blob name>'

storage_account_access_key = '<key>'

blob_container = '<container name>'


spark.conf.set('fs.azure.account.key.' + storage_account_name + '.blob.core.windows.net', storage_account_access_key)
coalesce: we can reduce the number of partitions; it creates unequal-sized partitions with minimum shuffling, e.g. df.coalesce(n)
repartition: we can increase or decrease the number of partitions; it creates equal-sized partitions; it will shuffle and evenly distribute the data
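A small sketch contrasting the two (the partition counts are illustrative):

df = spark.range(0, 1000)                # illustrative DataFrame
print(df.rdd.getNumPartitions())         # current number of partitions

fewer = df.coalesce(2)                   # narrow transformation, minimal shuffling
print(fewer.rdd.getNumPartitions())      # 2, possibly unequal in size

more = df.repartition(8)                 # full shuffle, evenly sized partitions
print(more.rdd.getNumPartitions())       # 8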
df.createOrReplaceGlobalTempView("people")
>>> df2 = df.filter(df.age > 3)
>>> df2.createOrReplaceGlobalTempView("people")
>>> df3 = spark.sql("select * from global_temp.people")
>>> sorted(df3.collect()) == sorted(df2.collect())
>>> def f(people):
...     for person in people:
...         print(person.name)
>>> df.foreachPartition(f)
spark.range(1000, 10000)
To call a notebook inside another
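A minimal usage sketch (the notebook path is a placeholder; in Databricks, %run must be in a cell by itself):

%run ./shared/setup_notebook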

If you need a way of deploying infrastructure-as-code to Azure, then Azure Resource Manager (ARM) Templates are the obvious way of doing it simply and repeatedly. They define the objects you want, their types, names and properties in a JSON file which can be understood by the ARM API.

Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

val jdbcHostname = "<hostname>"
val jdbcPort = 1433
val jdbcDatabase = "<database>"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()

connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
RDD (Resilient Distributed Dataset) is type-safe and fault-tolerant, but the developer has to take care of optimization.
DataFrame is built on top of RDD and is more optimized.

Regular expression
DataFrame[["column1", "column2", "column3"]]
to replace matched patterns, e.g. removing the extra spaces in a string (see the sketch after this list)
to read the content of database tables or files
allows reading the metadata of its sources
to implement error handling in Databricks and the location from which we read the error logs
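A hedged regexp_replace sketch for the extra-spaces case mentioned above (the DataFrame, column name, and pattern are illustrative):

from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame([("12  Main   Street",)], ["address"])
# Collapse runs of whitespace to a single space.
cleaned = df.withColumn("address", regexp_replace("address", r"\s+", " "))
cleaned.show(truncate=False)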

Log files are a better way to capture information because they allow us to see our logged information over time in one place.
5 standard logging levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
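A tiny sketch emitting each of the five levels (the file and logger names mirror the example later in these notes):

import logging

logging.basicConfig(filename='log_file.log', level=logging.DEBUG,
                    format='%(asctime)s:%(levelname)s:%(message)s')
log = logging.getLogger("my-logger")

log.debug("diagnostic detail")        # DEBUG
log.info("normal operation")          # INFO
log.warning("something unexpected")   # WARNING
log.error("operation failed")         # ERROR
log.critical("cannot continue")       # CRITICAL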

import org.apache.spark.sql.functions.udf
Transform

serverErrorDF = (logDF
.filter((col("code") >= 500) & (col("code") < 600))
.select("date", "time", "extention", "code")
)

df.write.format('json').save(os.path.join(tempfile.mkdtemp(),
'data'))

df.filter('bar not in ("a","b")').show()


salesDf.filter(col('Item')=='Pencil').select("Region","Rep","Item","Units").show()
df.withColumnRenamed('age', 'age2').collect()
[Row(age2=2, name='Alice'), Row(age2=5, name='Bob')]

display(dbutils.fs.ls(writePath))
Window functions perform a calculation over a group of records, for example calculating a cumulative sum or accessing the values of a row appearing before the current row.
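A hedged PySpark sketch of the window functions listed above: a cumulative sum plus lag/lead over an ordered window (the data and column names are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

sales = spark.createDataFrame(
    [("east", 1, 10), ("east", 2, 20), ("east", 3, 15),
     ("west", 1, 5), ("west", 2, 25)],
    ["region", "day", "amount"])

w = Window.partitionBy("region").orderBy("day")

result = (sales
          .withColumn("running_total", F.sum("amount").over(w))    # cumulative sum
          .withColumn("prev_amount", F.lag("amount").over(w))      # value from the previous row
          .withColumn("next_amount", F.lead("amount").over(w))     # value from the next row
          .withColumn("first_amount", F.first("amount").over(w)))  # first value in the window
result.show()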

pip install databricks-cli


databricks configure --token
databricks secrets create-scope --scope gws-blob-storage
databricks secrets put --scope gws-blob-storage --key storage-acct-key
databricks secrets list-scopes
databricks secrets list --scope gws-blob-storage
df = spark.read.option("inferSchema", "true")\
.option("header", "true")\
.csv("/mnt/azblobinputdata/retail_data.csv")
display(df.limit(5))

filePath = "wasbs://" + blob_container + "@" +


storage_account_name +
".blob.core.windows.net/SampleData.csv"
salesDf = spark.read.format("csv").load(filePath, inferSchema =
True, header = True)
salesDf.show()
df.coalesce(n)
df.repartition(n)
val employees_table = spark.read.jdbc(jdbcUrl, "employees", connectionProperties)

RDD: it was difficult to do simple operations like join, group by, and outer join; beginners found it difficult to write the code.
DataFrame gives the user the feel of a table; it is built on top of RDD.
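A small sketch of the same aggregation written against an RDD and against a DataFrame (the data is illustrative), showing why the DataFrame API reads more like a table operation:

pairs = [("a", 1), ("b", 2), ("a", 3)]                       # illustrative key-value data

rdd = spark.sparkContext.parallelize(pairs)
rdd_totals = rdd.reduceByKey(lambda x, y: x + y).collect()   # RDD API

df = spark.createDataFrame(pairs, ["key", "value"])
df.groupBy("key").sum("value").show()                        # DataFrame API, optimized by Spark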

val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.parquet("/input/parquetFile")

import logging
logging.basicConfig(filename='log_file.log', level=logging.DEBUG,
                    format='%(asctime)s:%(levelname)s:%(message)s')
log = logging.getLogger("my-logger")
log.info("Hello, world")

import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)
dataset.withColumn("upper", upperUDF('text)).show
Load
(serverErrorDF.write
  .parquet("/path/to/serverErrors"))  # output path is a placeholder

df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 'data'))

df.write.partitionBy('year', 'month').parquet(os.path.join(tempfile.mkdtemp(), 'data'))

serverErrorDF = (logDF
.filter((col("code") >= 500) & (col("code") < 600))
.select("date", "time", "extention", "code")
)

display(serverErrorDF)

from pyspark.sql.window import Window

# Defines partitioning specification and ordering specification.
windowSpec = \
  Window \
    .partitionBy(...) \
    .orderBy(...)

# Defines a Window Specification with a ROW frame.
windowSpec.rowsBetween(start, end)

# Defines a Window Specification with a RANGE frame.
windowSpec.rangeBetween(start, end)
In Spark, a DataFrame is a distributed collection of data organized into named columns. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
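A hedged sketch constructing DataFrames from two of those sources, a JSON file and an existing RDD (the file path and data are illustrative):

from pyspark.sql import Row

# From a structured data file (same sample path used earlier in these notes).
peopleDF = spark.read.json("examples/src/main/resources/people.json")

# From an existing RDD of Row objects.
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=2), Row(name="Bob", age=5)])
fromRddDF = spark.createDataFrame(rdd)
fromRddDF.show()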
Libraries Syntax
pyspark.sql.dataframe.DataFrame .range(100,1000)
spark.createDataFrame

pyspark.sql.dataframe.DataFrame
pyspark.sql.dataframe.DataFrame corruptDF.dropna("any")
from pyspark.sql.functions import col, max, min min("colname")
col("id")
max("id")
pyspark.sql.dataframe.DataFrame df.dropDuplicates()

from pyspark.sql.functions import col, lower, translate .lower("columnname")

pyspark.sql.dataframe.DataFrame .split("good")
from pyspark.sql.types import StringType spark.udf.register()

from pyspark.sql.functions import sha1, rand

%sql SQL API


Examples

corruptDF = spark.createDataFrame([
(11, 66, 5),
(12, 68, None),
(1, None, 6),
(2, 72, 7)],
["hour", "temperature", "wind"]
)

duplicateDF.dropDuplicates(["id", "favorite_color"])
dupedWithColsDF = (dupedDF
.select(col("*"),
lower(col("firstName")).alias("lcFirstName"),
lower(col("lastName")).alias("lcLastName"),
lower(col("middleName")).alias("lcMiddleName"),
translate(col("ssn"), "-", "").alias("ssnNums")
))

def manual_split(x):
    return x.split("s")

manual_split("My name is Lipsa")

manualSplitPythonUDF = spark.udf.register("manualSplitSQLUDF", manual_split, StringType())


randomDF = (spark.range(1, 10000 * 10 * 10 * 10)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1("random_value"))
)

%sql
SELECT id,
hash,
manualSplitSQLUDF(hash) as augmented_col
FROM
randomTable
1. Introduction about the experience, projects worked on Azure ( ADF & Databricks ).
2. Asked about ADF pipeline I worked on, explained the flow.
3. What is lookup table, why we used it.
4. What is self-hosted IR?
5. How and where you will configure your self-hosted IR ?
6. Scenario based question on ADF
If we have data coming from 30 different countries to one place, the data will be coming in on a daily basis, and we need to process it. The schema will be the same for all the data. How would you design the pipeline? What would your pipeline look like? How many pipelines would you create? How would you check the incoming data, and how would you read it?
7. What windows function in spark you worked on ?
8. What component I worked on my search engine project ?
9. What Databricks transformation I worked on ?
10. Will ask to explain some transformation function which we worked, how we did it, the steps.
11. What to use to connect Databricks to Blob storage, what is a mount, what is a mount point?
12. What language I worked on, python, spark?
13. What Azure Technologies I know other than ADF and Databricks. 
14. Difference between coalesce () and repartition(). Is there any difference or not ?
1. Introduction
2. Roles and responsibility in current project and previous project- followed by a lot of sub queries
3. Work on Databricks
4. How to read files and export files in Databricks
5. Various connections from Databricks

6. Distribution in Azure Synapse

7. How to improve performance in Azure Synapse


8. How to authenticate BLOB from ADF
9. What will you do if you do not fully understand a PBI in a sprint
10. What will you do if you find that you need more infrastructure to achieve a task?
Roles and responsibility in current project – and cross questioning on the same
Not a single question on Python
Write file with some specific name
coalesce () and repartition() functions
logging mechanism in ADF and Databricks
Pass variable from one activity to other in ADF
Different activities used in ADF
Connection to ADLS from Databricks
Have you used any other source system than cloud – how to implement the same
Null handling
Logic apps
Cluster configuration
Best file format to be written? Why?
A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. When Synapse SQL runs a query, the work is divided into 60 smaller queries that run in parallel.
Dataflows
Validation activity

look up activity

Azure Synapse

Round Robin Table


Hash Table

Hash Function

Polybase
Steps of Polybase

Integration Runtime
Way of writing transformations without writing any code by drag and drop
works as traffic control for the course of actions in your ADF pipeline
the Validation activity's task is to provide better control of your ADF pipeline execution
retrieves a dataset from any of the ADF data sources
the output of the Lookup activity can be used in a subsequent activity

Massively Parallel Processing (MPP) is the coordinated processing of a single task by multiple processors.
A round-robin distributed table is a table where the data is distributed evenly (or as evenly as possible) across distributions; it can be used for loading from the staging to the production layer.
A hash-distributed table assigns each row to a distribution using a hash function on key values; it improves query performance on large fact tables and is used from source to staging.
A hash function is a function which, when given a key, generates an address in the table.
Polybase is a technology that accesses external data stored in Azure Blob storage or Azure Data Lake by using the Transact-SQL language.
Connection: connect to SQL DW
Create an external data source
Create an external file format
Create a schema for the external table
Create the external table
Create a schema for the table
Create the SQL DW table and load the data

An Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities such as Data Flows and Data Movement. It has access to resources in either public networks, or hybrid scenarios (public and private networks).
child items, last modified date, exists, created

T-SQL: a set of programming extensions that add several features to Structured Query Language (SQL)
CREATE MASTER KEY ENCRYPTION BY PASSWORD='MyP@ssword123secretword';

CREATE DATABASE SCOPED CREDENTIAL mycredential
WITH IDENTITY = 'credential',
Secret = 'VgqTsEqg1beXqhb+mO/wZh9OUZ+ByhKJJj7qc9pBne9e+BsdFo4Mfdl+u8Gh94tDqCR0/uNZ4KHr0r4WuK85lA=='

CREATE EXTERNAL DATA SOURCE mycustomers


WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://mycontainer@polybasestoragesqlshack.blob.core.windows.net/',
CREDENTIAL = mycredential
);
CREATE EXTERNAL FILE FORMAT csvformat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        First_Row = 2
    )
);

SELECT [name]
      ,[lastname]
      ,[email]
FROM [AdventureWorks2016CTP3].[dbo].[customerstable]
CREATE EXTERNAL TABLE customerstable
(
name VARCHAR(128),
lastname VARCHAR(128),
email VARCHAR(100)
)
WITH
(
LOCATION = '/',
DATA_SOURCE = mycustomers,
FILE_FORMAT = csvformat
);
you can write project description here
