Interview Prep
numpy: Python library for numerical computing.
__name__, "__main__": if the Python interpreter is running a module (the source file) as the main program, it sets the module's special __name__ variable to "__main__".
    Syntax: if __name__ == "__main__":
dict(): the constructor builds dictionaries directly from sequences of key-value pairs; a dict is a mapping of keys to values.
    Syntax: dict([('key1', value1), ('key2', value2), ('key3', value3)])
list: mutable sequence of objects.
sorted(): sorts any iterable series, returns a list.
reversed(): reverses any iterable series, returns a reverse iterator.
pprint: pretty-print module that allows you to customize the formatting of your output.
set: delimited by {}, with comma-separated items; an empty {} creates a dict, so use set() to create an empty set.
Spark window functions: perform a calculation over a group of records (see the sketch below this list).
first_value(parameter): holds the first value in the window.
lag(): holds the previous row's value.
lead(): holds the next row's value.
last_value(parameter): holds the last value in the window.
.read in Spark: reads the file in the desired location.
    Syntax: spark.read.load("examples/src/main/resources/people.json", format="json")
.write in Spark: writes the file to the desired location.
    Syntax: usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
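A minimal PySpark sketch of the window functions above; the grp/val DataFrame and its column names are made up for illustration, and the running total mirrors the cumulative-sum use case mentioned later in these notes:

from pyspark.sql import SparkSession
from pyspark.sql.functions import first, lag, lead, last, sum as _sum
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)],
    ["grp", "val"])

# ordered window for offset functions and running totals
w_order = Window.partitionBy("grp").orderBy("val")
# same window with an unbounded frame, so first/last see the whole group
w_full = w_order.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select(
    "grp", "val",
    first("val").over(w_full).alias("first_value"),    # first value in the group
    lag("val", 1).over(w_order).alias("lag"),          # previous row's value
    lead("val", 1).over(w_order).alias("lead"),        # next row's value
    last("val").over(w_full).alias("last_value"),      # last value in the group
    _sum("val").over(w_order).alias("running_total"),  # cumulative sum
).show()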
ETL
.read()
.write()
createDataFrame()
filter()
dbutils.fs
dbutils.fs.ls(path)
spark.window()
databricks secrets
mount_point (see the mount sketch after this list)
Directly connect Blob (see the secrets example in the Extract section)
coalesce
repartition
select()
.range(1000, 10000)
%run command
ARM Template
CI/CD
Error Records
how to write data from file to SQL server
validate Email
Dataframe column sequence
regexp_replace
Lookup
Get Metadata
badRecordsPath using option parameter.
logging
udf
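The mount_point item above gets no example later in these notes; here is a minimal sketch using the standard dbutils.fs.mount API, with every angle-bracketed name a placeholder:

# mount an Azure Blob container into DBFS; all <...> values are placeholders
dbutils.fs.mount(
    source="wasbs://<container>@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={
        "fs.azure.account.key.<storage-account>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<key-name>")})

display(dbutils.fs.ls("/mnt/<mount-name>"))  # verify the mount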
Extract
logDF = (spark
  .read
  .option("header", True)
  .csv(sourcePath)  # sourcePath: placeholder path to the CSV log file
)
(df.write.format('parquet')
.mode("overwrite")
.saveAsTable('bucketed_table'))
df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')], schema=('id','bar'))
df4.groupBy("year").pivot("course").sum("earnings").collect()
df.select(df.name).orderBy(df.name.asc()).collect()
df.select(df.age.cast(StringType()).alias('ages')).collect()
Secrets allow you to secure your credentials and reference them in your code instead of hard-coding them, as in:
storage_account_access_key = '<key>'
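A hedged sketch of the secrets pattern: fetch the key with dbutils.secrets.get instead of hard-coding it (the scope and key names are placeholders), then use it to connect to Blob storage directly:

storage_account_access_key = dbutils.secrets.get(scope="<scope-name>", key="<key-name>")

# directly connect to Blob storage without mounting; <storage-account> is a placeholder
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    storage_account_access_key)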
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
Class.forName(driverClass)

val jdbcHostname = "<hostname>"
val jdbcPort = 1433
val jdbcDatabase = "<database>"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
RDD (Resilient Distributed Dataset): type-safe and fault-tolerant, but the developer has to take care of optimization.
DataFrame: built on top of RDDs, more optimized (Spark's Catalyst optimizer plans the execution).
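A tiny illustrative sketch of the difference (the data and names are made up):

rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])  # RDD: low-level, you optimize
df = rdd.toDF(["id", "bar"])    # DataFrame: a schema on top of the RDD
df.filter(df.id > 1).explain()  # Catalyst produces the optimized plan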
regexp_replace: regular expression replacement, e.g. to remove the extra spaces in a string (see the sketch below).
DataFrame column sequence: select columns in a given order, e.g. DataFrame[["column1", "column2", "column3"]].
Lookup: to read the content of database tables or files.
Get Metadata: allows reading the metadata of its sources.
Error Records: how to implement error handling in Databricks, and where we read the error logs from.
logging: log files are a better way to capture information because they let us see our logged information over time in one place. There are 5 standard logging levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
udf (Scala): import org.apache.spark.sql.functions.udf
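A minimal sketch of regexp_replace and of email validation with a regular expression; df and its text/email columns are hypothetical, and the email pattern is deliberately simplified:

from pyspark.sql.functions import regexp_replace, col

# collapse runs of whitespace to a single space (removing the extra spaces)
cleaned = df.withColumn("text", regexp_replace(col("text"), r"\s+", " "))

# keep only rows whose email matches a simplified pattern
valid = df.filter(col("email").rlike(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"))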
Transform
from pyspark.sql.functions import col

serverErrorDF = (logDF
  .filter((col("code") >= 500) & (col("code") < 600))
  .select("date", "time", "extention", "code")
)
import os, tempfile
df.write.format('json').save(os.path.join(tempfile.mkdtemp(), 'data'))
display(dbutils.fs.ls(writePath))
Window functions perform a calculation over a group of records, e.g. calculating a cumulative sum, or accessing the values of a row appearing before the current row (lag).
val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.parquet("/input/parquetFile")
import logging

logging.basicConfig(filename='log_file.log', level=logging.DEBUG,
                    format='%(asctime)s:%(levelname)s:%(message)s')

log = logging.getLogger("my-logger")
log.info("Hello, world")
df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 'data'))

df.write.partitionBy('year', 'month').parquet(os.path.join(tempfile.mkdtemp(), 'data'))
display(serverErrorDF)
corruptDF.dropna("any")  # returns pyspark.sql.dataframe.DataFrame

from pyspark.sql.functions import col, max, min
min("colname")
col("id")
max("id")

df.dropDuplicates()  # returns pyspark.sql.dataframe.DataFrame

.split("good")

from pyspark.sql.types import StringType
spark.udf.register()
corruptDF = spark.createDataFrame([
(11, 66, 5),
(12, 68, None),
(1, None, 6),
(2, 72, 7)],
["hour", "temperature", "wind"]
)
duplicateDF.dropDuplicates(["id", "favorite_color"])
from pyspark.sql.functions import col, lower, translate

dupedWithColsDF = (dupedDF
  .select(col("*"),
    lower(col("firstName")).alias("lcFirstName"),
    lower(col("lastName")).alias("lcLastName"),
    lower(col("middleName")).alias("lcMiddleName"),
    translate(col("ssn"), "-", "").alias("ssnNums")
  ))
def manual_split(x):
return x.split("s")
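For the SQL query below to work, manual_split has to be registered under the SQL name it uses; a plausible registration (the return type is an assumption, based on split returning a list of strings):

from pyspark.sql.types import ArrayType, StringType

# register manual_split for use from SQL as manualSplitSQLUDF
spark.udf.register("manualSplitSQLUDF", manual_split, ArrayType(StringType()))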
%sql
SELECT id,
       hash,
       manualSplitSQLUDF(hash) AS augmented_col
FROM randomTable
1. Introduction about the experience, projects worked on in Azure (ADF & Databricks).
2. Asked about the ADF pipeline I worked on; explained the flow.
3. What is a lookup table, and why did we use it?
4. What is a self-hosted IR?
5. How and where will you configure your self-hosted IR?
6. Scenario-based question on ADF:
   If data arrives daily from 30 different countries into one place and we need to process it, and the schema is the same for all of the data: how would you design the pipeline? What would your pipeline look like? How many pipelines would you create? How would you check that data is coming in, and how would you read it?
7. Which window functions in Spark have you worked on?
8. Which components did I work on in my search engine project?
9. Which Databricks transformations did I work on?
10. Will ask to explain some transformation function we worked on: how we did it, the steps.
11. How to connect Databricks to Blob storage; what is a mount, and what is a mount point?
12. Which languages did I work in: Python, Spark?
13. Which Azure technologies do I know other than ADF and Databricks?
14. Difference between coalesce() and repartition(): is there any difference or not? (See the sketch below this list.)
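A quick sketch for question 14, reusing the .range example from the notes above:

df = spark.range(1000, 10000)
print(df.rdd.getNumPartitions())

df2 = df.coalesce(2)     # narrows partitions without a full shuffle; cannot increase the count
df3 = df.repartition(8)  # full shuffle; can increase or decrease the count
print(df2.rdd.getNumPartitions(), df3.rdd.getNumPartitions())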
1. Introduction
2. Roles and responsibilities in the current and previous projects, followed by a lot of follow-up questions
3. Work on Databricks
4. How to read and export files in Databricks
5. Various connections from Databricks
Lookup activity
Azure Synapse
Hash Function
Polybase
Steps of Polybase
Integration Runtime
A way of writing transformations without writing any code, by drag and drop.
Works as the traffic control of your ADF course of actions.
The Validation activity's task is to provide better control of your ADF pipeline execution.
Lookup activity: retrieves a dataset from any of the ADF data sources; the output of the Lookup activity can be consumed in a subsequent activity.
Massively Parallel Processing (MPP) is the coordinated processing of a single task by multiple processors.
A round-robin distributed table spreads the data evenly (or as evenly as possible) across all distributions; it is commonly used for staging tables.
A hash-distributed table assigns each row to a distribution by applying a hash function to a distribution column; it improves query performance on large fact tables.
A hash function is a function which, when given a key, generates an address in the table.
Polybase is a technology that accesses external data stored in Azure Blob Storage or Azure Data Lake via the Transact-SQL language.
Steps of Polybase:
1. Connection: connect to SQL DW.
2. Create an external data source.
3. Create an external file format.
4. Create the schema for the external table.
5. Create the external table.
6. Create the schema for the target table.
7. Create the SQL DW table and load the data.
T-SQL: a set of programming extensions that add several features to Structured Query Language (SQL).
CREATE MASTER KEY ENCRYPTION BY PASSWORD='MyP@ssword123secretword';