SQL
2. Can you explain the steps involved in creating an ETL process for moving data
between databases? I'm curious about handling data validation, errors, and
incremental loading.
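The three concerns named in the question (validation, error handling, incremental loading) can be sketched in a few lines. This is only an illustration under stated assumptions, not a production design: it uses Python's built-in `sqlite3` module with in-memory databases, and the `updated_at` column, the `etl_rejects` and `etl_state` tables, and the function names are all hypothetical. It also assumes ISO-formatted timestamps that never repeat across runs, since the load picks up only rows strictly newer than the stored watermark.

```python
import sqlite3

ORDERS_DDL = """
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    total_amount REAL,
    updated_at TEXT
)
"""

def make_target():
    """Create the target database plus the reject and watermark tables."""
    target = sqlite3.connect(":memory:")
    target.execute(ORDERS_DDL)
    target.execute("CREATE TABLE etl_rejects (order_id INTEGER, reason TEXT)")
    target.execute("CREATE TABLE etl_state (watermark TEXT)")
    target.execute("INSERT INTO etl_state VALUES ('')")
    target.commit()
    return target

def run_incremental_load(source, target):
    """Load rows newer than the stored watermark; return (loaded, rejected)."""
    # Incremental loading: extract only rows changed since the last run.
    watermark = target.execute("SELECT watermark FROM etl_state").fetchone()[0]
    rows = source.execute(
        "SELECT order_id, customer_id, total_amount, updated_at "
        "FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    loaded = rejected = 0
    try:
        for order_id, customer_id, total_amount, updated_at in rows:
            # Validation: reject rows with a missing key or an invalid amount.
            if customer_id is None or total_amount is None or total_amount < 0:
                target.execute(
                    "INSERT INTO etl_rejects VALUES (?, ?)",
                    (order_id, "failed validation"),
                )
                rejected += 1
                continue
            # Upsert, so reloading the same row does not create duplicates.
            target.execute(
                "INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)",
                (order_id, customer_id, total_amount, updated_at),
            )
            loaded += 1
        if rows:
            # Advance the watermark to the newest extracted timestamp.
            target.execute("UPDATE etl_state SET watermark = ?", (rows[-1][3],))
        target.commit()
    except sqlite3.Error:
        # Error handling: undo the partial batch so the run can be retried.
        target.rollback()
        raise
    return loaded, rejected

source = sqlite3.connect(":memory:")
source.execute(ORDERS_DDL)
source.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, 10, 25.0, "2024-01-01"),
    (2, None, 40.0, "2024-01-02"),  # fails validation: no customer_id
    (3, 11, 15.5, "2024-01-03"),
])
target = make_target()
first_run = run_incremental_load(source, target)
source.execute("INSERT INTO orders VALUES (4, 12, 9.0, '2024-01-04')")
second_run = run_incremental_load(source, target)
```

The second run extracts only the newly added row, because the watermark stored in `etl_state` already covers the first three.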
b. `orders`:
i. `order_id` (Primary Key)
ii. `customer_id` (Foreign Key referencing `customers.customer_id`)
iii. `order_date`
iv. `total_amount`
c. `order_items`:
i. `order_item_id` (Primary Key)
ii. `order_id` (Foreign Key referencing `orders.order_id`)
iii. `product_id`
iv. `quantity`
v. `unit_price`
Write an SQL query to identify the top 5 customers who have spent the most
money on the platform. The query should calculate the total amount spent by
each customer across all their orders and include their names and email addresses
in the result. Consider scenarios where customers may have multiple orders and
optimize the query for performance.
Solution :-
SELECT c.name, c.email, SUM(o.total_amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.email
ORDER BY total_spent DESC
LIMIT 5;
This query joins the customers and orders tables on the customer_id field, calculates
the total amount spent by each customer using the SUM() function, groups the results by
customer (customer_id, with name and email listed as well so the query also runs on
engines that require every selected column in the GROUP BY), orders them in descending
order of total spent, and limits the output to the top 5 customers. For performance,
an index on orders.customer_id helps the join.
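The query can be tried end to end with a small script. The sketch below uses Python's built-in `sqlite3` module; the sample customers, the order amounts, and the index name `idx_orders_customer_id` are made up for illustration, and the GROUP BY lists `name` and `email` explicitly for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    order_date TEXT,
    total_amount REAL
);
-- Index on the join column (hypothetical name) to support the join.
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Ana", "ana@example.com"),
    (2, "Ben", "ben@example.com"),
    (3, "Cy", "cy@example.com"),
    (4, "Dev", "dev@example.com"),
    (5, "Eli", "eli@example.com"),
    (6, "Fay", "fay@example.com"),
])
conn.executemany(
    "INSERT INTO orders (customer_id, order_date, total_amount) VALUES (?, ?, ?)",
    [
        (1, "2024-01-05", 100.0), (1, "2024-02-01", 50.0),  # Ana: two orders
        (2, "2024-01-07", 300.0),
        (3, "2024-01-09", 20.0), (3, "2024-03-11", 30.0),   # Cy: two orders
        (4, "2024-01-12", 75.0),
        (5, "2024-01-15", 10.0),
        (6, "2024-01-20", 200.0),
    ],
)

top5 = conn.execute("""
SELECT c.name, c.email, SUM(o.total_amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.email
ORDER BY total_spent DESC
LIMIT 5
""").fetchall()

for name, email, total_spent in top5:
    print(name, email, total_spent)
```

With this sample data, six customers have orders, so the lowest spender is dropped by `LIMIT 5`.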
Solution :-
1. Top 5 busiest airports :
-- Airline with the longest average flight duration
SELECT al.airline_name,
AVG(f.duration) AS avg_flight_duration
FROM flights f
JOIN airlines al ON f.airline_code = al.airline_code
GROUP BY al.airline_code, al.airline_name
ORDER BY avg_flight_duration DESC
LIMIT 1;
You're tasked with writing a SQL query to find the top 5 customers who have spent the
most money (total_amount) on orders, along with their total spending and the
number of orders they've placed. Additionally, include the city of each top-spending
customer.
● `customer_id`
● `name`
● `total_spending`
● `order_count`
● `city`
Solution :-
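SELECT
c.customer_id,
c.name,
SUM(o.total_amount) AS total_spending,
COUNT(o.order_id) AS order_count,
c.city
FROM
Customers c
JOIN
Orders o ON c.customer_id = o.customer_id
GROUP BY
c.customer_id, c.name, c.city
ORDER BY
total_spending DESC, order_count DESC
LIMIT 5;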
Explanation :-
We first connect the Customers table with the Orders table using the customer_id to
gather details about orders placed by each customer. Next, we group the data by
customer_id, name, and city, then calculate the total spending for each customer by
adding up the total_amount from all their orders. Additionally, we count the number of
orders placed by each customer. Finally, we sort the results by total_spending and then
order_count, both in descending order, and limit the output to the top 5 customers.
Sorting on order_count as a second key breaks ties in spending in favor of the
customer with more orders.
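The steps in the explanation can be checked with a short script. This is a sketch using Python's built-in `sqlite3` module; the sample rows are invented, and two pairs of customers are deliberately given equal spending so the tie-breaking on `order_count` is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE Orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES Customers(customer_id),
    total_amount REAL
);
""")
conn.executemany("INSERT INTO Customers VALUES (?, ?, ?)", [
    (1, "Ana", "Lisbon"), (2, "Ben", "Oslo"), (3, "Cy", "Pune"),
    (4, "Dev", "Lima"), (5, "Eli", "Kyiv"), (6, "Fay", "Rome"),
])
conn.executemany("INSERT INTO Orders (customer_id, total_amount) VALUES (?, ?)", [
    (1, 100.0), (1, 100.0),  # Ana: 200 over two orders
    (2, 200.0),              # Ben: 200 in one order (ties with Ana)
    (3, 500.0),
    (4, 50.0),               # Dev: 50 in one order (ties with Eli)
    (5, 30.0), (5, 20.0),    # Eli: 50 over two orders
    (6, 10.0),
])

top_spenders = conn.execute("""
SELECT
    c.customer_id,
    c.name,
    SUM(o.total_amount) AS total_spending,
    COUNT(o.order_id) AS order_count,
    c.city
FROM Customers c
JOIN Orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.city
ORDER BY total_spending DESC, order_count DESC
LIMIT 5
""").fetchall()

for row in top_spenders:
    print(row)
```

Within each tied pair, the customer with more orders is listed first.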
6. How would you optimize a query that is running slowly due to a large
dataset? Discuss techniques such as indexing, query restructuring, and
materialized views.
Solution :-
a. Indexing:
i. Identify frequently used columns in queries and create indexes on those
columns to speed up data retrieval.
ii. Be cautious not to over-index, as it can impact data modification
operations.
b. Query Restructuring:
i. Simplify complex queries by breaking them into smaller, more
manageable parts.
ii. Use JOIN conditions effectively to filter data early in the query execution
process.
iii. Specify only necessary columns to reduce unnecessary data retrieval.
c. Materialized Views:
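The indexing advice in (a) can be verified by comparing query plans before and after an index is created. The sketch below uses Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`; the table contents and the index name `idx_orders_customer_id` are assumptions for illustration, and the exact plan wording varies between SQLite versions. The query also selects only the columns it needs, as suggested in (b).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total_amount) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10000)],
)

QUERY = "SELECT order_id, total_amount FROM orders WHERE customer_id = ?"

def query_plan(connection):
    """Return SQLite's plan for QUERY as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    # The human-readable plan text is the fourth column of each row.
    return " | ".join(row[3] for row in rows)

plan_before = query_plan(conn)  # expected to be a full table scan
conn.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)")
plan_after = query_plan(conn)   # expected to search via the new index

print("before:", plan_before)
print("after: ", plan_after)
```

Checking the plan before and after is also a cheap way to spot indexes that the optimizer never uses, which is one sign of the over-indexing the answer warns about.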
7. Describe the risks of SQL injection attacks and techniques for preventing
them in SQL queries. Provide examples of vulnerable code and how it could be
secured.
Solution :-
1. Data Leakage: Attackers can extract sensitive data from the database, such as
usernames and passwords.
2. Data Manipulation: Attackers can modify or delete data in the database, leading
to data corruption or loss.
3. Unauthorized Access: Attackers can gain unauthorized access to restricted areas
of the application or escalate their privileges.
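A common way to prevent these attacks is to use parameterized queries instead of building SQL text from user input. The sketch below contrasts the two approaches using Python's built-in `sqlite3` module; the `users` table, the sample rows, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    ("alice", "alice@example.com"),
    ("bob", "bob@example.com"),
])

def find_user_vulnerable(connection, username):
    # VULNERABLE: user input is concatenated into the SQL text, so quotes
    # in the input can change the structure of the statement.
    sql = "SELECT username, email FROM users WHERE username = '" + username + "'"
    return connection.execute(sql).fetchall()

def find_user_safe(connection, username):
    # SAFER: the ? placeholder passes the input as a bound value, so it is
    # compared as data and never parsed as SQL.
    sql = "SELECT username, email FROM users WHERE username = ?"
    return connection.execute(sql, (username,)).fetchall()

payload = "' OR '1'='1"

# The concatenated statement becomes:
#   ... WHERE username = '' OR '1'='1'
# which is true for every row, so the whole table leaks.
leaked = find_user_vulnerable(conn, payload)

# The parameterized version looks for a user literally named "' OR '1'='1".
blocked = find_user_safe(conn, payload)

print(len(leaked), len(blocked))
```

Parameter binding addresses the injection itself; restricting the database account's privileges additionally limits the damage if some other query is still vulnerable.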
Solution :-
Subqueries:
● Definition: Nested queries within another query.
● Usage: Filter or manipulate data based on results of another query.
● Performance: Can hurt performance, especially if the inner query is complex or
returns a large result set.
● Readability: Makes queries more concise and readable, useful for expressing
complex logic.
JOINs:
Scenarios:
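As one illustrative scenario, the same question ("which customers have placed at least one order over 100?") can be written either way. This is a sketch using Python's built-in `sqlite3` module with invented sample data; which form runs faster depends on the database engine and its optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    total_amount REAL
);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
INSERT INTO orders (customer_id, total_amount) VALUES
    (1, 150.0), (1, 200.0),  -- Ana has two qualifying orders
    (2, 50.0),
    (3, 120.0);
""")

# Subquery form: filter customers by the result of an inner query.
subquery_sql = """
SELECT customer_id, name
FROM customers
WHERE customer_id IN (
    SELECT customer_id FROM orders WHERE total_amount > 100
)
ORDER BY customer_id
"""

# JOIN form: combine the tables, then filter. DISTINCT is needed because
# a customer with several qualifying orders matches more than once.
join_sql = """
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.total_amount > 100
ORDER BY c.customer_id
"""

via_subquery = conn.execute(subquery_sql).fetchall()
via_join = conn.execute(join_sql).fetchall()
print(via_subquery, via_join)
```

The subquery states the intent as a membership test, while the JOIN form makes it easy to also return columns from `orders` if they are needed later.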
9. What are star and snowflake schemas? When would you use each schema
design in a data warehousing environment? Provide examples to illustrate your
answer.
Solution :-
Star Schema:
● Definition: Central fact table surrounded by denormalized dimension tables.
● Usage: Simple, efficient for querying, suitable for clear relationships between fact
and dimensions.
● Example: In retail, sales fact table surrounded by product, time, and store
dimensions.
Snowflake Schema:
Comparison:
a. Star Schema: Simple, denormalized, suitable for better query performance and
clearer relationships.
b. Snowflake Schema: More normalized, reduces redundancy, suitable for complex
hierarchical relationships.
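The comparison can be made concrete with the retail example. The sketch below, using Python's built-in `sqlite3` module, builds a small star schema and then a snowflake variant of the product dimension in which the category is moved into its own table; all table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star schema: one fact table, denormalized dimension tables.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, sale_date TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    amount REAL
);

-- Snowflake variant of the product dimension: the category is normalized
-- into its own table, so each category name is stored only once.
CREATE TABLE sf_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE sf_product (
    product_id INTEGER PRIMARY KEY, product_name TEXT,
    category_id INTEGER REFERENCES sf_category(category_id));

INSERT INTO dim_product VALUES
    (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Lamp', 'Home');
INSERT INTO dim_store VALUES (1, 'Main Street', 'Springfield');
INSERT INTO dim_time VALUES (1, '2024-01-01');
INSERT INTO fact_sales VALUES (1, 1, 1, 5.0), (2, 1, 1, 10.0), (3, 1, 1, 40.0);

INSERT INTO sf_category VALUES (1, 'Stationery'), (2, 'Home');
INSERT INTO sf_product VALUES (1, 'Pen', 1), (2, 'Notebook', 1), (3, 'Lamp', 2);
""")

# Star: sales per category needs a single join.
star_result = conn.execute("""
SELECT p.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
GROUP BY p.category_name
ORDER BY p.category_name
""").fetchall()

# Snowflake: the same report needs one more join to reach the category.
snowflake_result = conn.execute("""
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN sf_product p ON p.product_id = f.product_id
JOIN sf_category c ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY c.category_name
""").fetchall()

print(star_result, snowflake_result)
```

Both queries return the same totals; the difference is the trade-off described above, with fewer joins in the star schema and less repeated data in the snowflake schema.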
10. How would you integrate SQL Server with big data technologies such as
Hadoop or Apache Spark? Discuss challenges and best practices for data
integration in a hybrid environment.
Solution :-
Challenges:
1. Data Formats and Protocols:
● Different formats and protocols between SQL Server and big data
technologies can hinder seamless data exchange.
2. Data Consistency and Integrity:
● Maintaining consistency across heterogeneous systems is crucial for
ensuring data accuracy and reliability.
3. Performance and Scalability:
● Achieving good performance and scalability is harder when large
datasets are distributed across multiple systems.
4. Security and Access Control:
● Ensuring data security and access control across hybrid
environments with diverse authentication mechanisms.
5. Metadata Management:
● Managing metadata and data lineage across systems for data
governance and compliance.
Best Practices: