Spark With Scala Recently Asked Interview Questions: Trendytech Insights


Suppose there is a 500 MB file and a 1000-node cluster. How should we choose the block
size in Hadoop? a) use a 64 MB block size b) use a 128 MB block size c) it makes no difference
The answer given was 64 MB; can you explain why? Sorry, I couldn't frame the question
exactly. My thinking was that if we divide the file into smaller blocks, the metadata in the
NameNode grows, so it didn't seem like an ideal choice. There was also the reverse question:
with a 4-node cluster, the correct answer given was 256 MB, I think. Can you please go
through these two questions and give some clarification?
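
A rough way to see the trade-off (a back-of-the-envelope sketch, not from the original thread; only the numbers in the question are assumed): each HDFS block becomes roughly one map task, so on a 1000-node cluster a 64 MB block size splits the 500 MB file into about 8 blocks and lets about 8 nodes work in parallel, while the NameNode metadata for 8 blocks is negligible. On a 4-node cluster, parallelism is capped at 4 anyway, so larger 256 MB blocks mean fewer tasks and less scheduling overhead.

    object BlockMath extends App {
      // Only the 500 MB figure from the question is assumed here.
      val fileMb = 500

      // Each HDFS block is processed by (roughly) one map task.
      def numBlocks(blockMb: Int): Int =
        math.ceil(fileMb.toDouble / blockMb).toInt

      println(numBlocks(64))  // 8 blocks -> up to 8 nodes in parallel (good on 1000 nodes)
      println(numBlocks(128)) // 4 blocks -> at most 4 nodes busy
      println(numBlocks(256)) // 2 blocks -> fewer, larger tasks; suits a small cluster
    }

In other words, the metadata-growth worry matters when millions of small blocks accumulate across many files, not for the handful of blocks a single 500 MB file produces.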


Trendytech Insights
MODERATOR · 8 months ago

Spark with Scala recently asked interview questions
1. Please explain how an SMB join works internally.
2. What is the Catalyst Optimizer in Spark?
3. What are the different techniques to tune your Spark application?
4. What is a companion object in Scala?
5. How do you implement a singleton design pattern in Scala?
6. What happens internally when you submit a Spark job?
7. How will you optimize a join between two large tables?
8. What is the difference between sort aggregate and hash aggregate?
9. What is the difference between client mode and cluster mode in Spark?
10. Why do we say the Parquet file format works well with Spark?

Let's try to answer these! Post your answers in the comments section. #SumitTeaches

P.S. It is never too late to be what you might have been. ~George Eliot
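
As a starter for questions 4 and 5, here is a minimal sketch (names like AppConfig and Employee are made up for illustration): in Scala, an object is a language-level singleton, and an object sharing its name and source file with a class is that class's companion object.

    // Q5: a Scala object is a built-in singleton - exactly one
    // instance, created lazily on first use.
    object AppConfig {
      val master: String = "local[*]"
    }

    // Q4: a companion object shares its name and source file with its
    // class, can access the class's private members, and commonly
    // hosts factory methods such as apply.
    class Employee private (val name: String, val position: String)

    object Employee {
      def apply(name: String, position: String): Employee =
        new Employee(name, position)
    }

    object Demo extends App {
      val e = Employee("Riya", "Analyst") // sugar for Employee.apply(...)
      println(s"${e.name} - ${e.position}, master = ${AppConfig.master}")
    }

Because the constructor is private, the companion's apply method is the only way to create an Employee, which is also how the singleton and factory patterns are usually expressed in Scala.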

Data Skewness & SKEW JOIN:


The distribution of keys is uneven, which causes a few reducers to run slowly during execution.

🔸Let's say we have the dataset below:


<Position> - <Number of rows in table>
SystemEngineer - 2000
Analyst - 200
Manager - 20
Admin - 27

In the above data, you will notice that the data is not evenly distributed on "position". Hence
we call it a table skewed on the key "position".
If you create partitions of the data on this column, then one partition will have 2000 records
while the other three partitions have comparatively few records.

🔸The 3 tasks working on the smaller partitions complete quickly, but the task working on the
large partition will still be running.
This impacts overall performance.
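
A quick way to see this in Spark (a minimal sketch, assuming a DataFrame named employees with a "position" column mirroring the table above) is to count rows per key before partitioning or joining, e.g. in spark-shell:

    import org.apache.spark.sql.functions.desc

    // A lopsided result like the table above signals a skewed key.
    employees.groupBy("position")
      .count()
      .orderBy(desc("count"))
      .show()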

🔹If either of the two tables is skewed, then we should use a skew join.

▪️Suppose we want to join two tables, Sales & Product.

🔹The SALES TABLE is SKEWED on COLUMN ID=30.

🔹The Product TABLE also has COLUMN ID=30, but it's not skewed.
🔹So, the Product table rows having id=30 are loaded into an in-memory hash table.
🔹A set of mappers is created which reads the records having COLUMN ID=30 from the SALES
table, and a MAP JOIN is performed with the Product table. No data needs to go to the reducers.
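
The Spark analogue of this map-side join is a broadcast hash join, where the small, non-skewed side is shipped whole to every executor (a minimal sketch, assuming DataFrames named sales and product joined on id):

    import org.apache.spark.sql.functions.broadcast

    // Ship the small Product table to every executor; the skewed Sales
    // rows are joined map-side and never shuffled on the skewed key.
    val joined = sales.join(broadcast(product), Seq("id"))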

▪️Hive Properties:
🔸hive.optimize.skewjoin=true;
🔸hive.skewjoin.key=500000; -- threshold row count for a key to be treated as skewed
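
A sketch of wiring these up from a Spark session with Hive support (table and column names follow the example above; the adaptive settings are Spark 3.x's own alternative for skew handling, not part of the Hive properties):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SkewJoinDemo")
      .enableHiveSupport()
      .getOrCreate()

    // Hive-side skew join, per the properties above
    spark.sql("SET hive.optimize.skewjoin=true")
    spark.sql("SET hive.skewjoin.key=500000")

    // Spark 3.x alternative: adaptive execution splits skewed partitions
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    val joined = spark.sql(
      "SELECT s.*, p.* FROM sales s JOIN product p ON s.id = p.id")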
