
Q1) Define big data. Why is big data required? How does the traditional BI environment differ from the big data environment?
Big Data: Big data refers to extremely large and complex datasets that exceed the capabilities of traditional data processing methods. It is characterized by its volume, velocity, variety, and sometimes veracity, encompassing vast amounts of information that traditional systems struggle to handle.
Why Big Data is Required: Big data is necessary for in-depth analysis, informed decision-making, predictive analytics, and gaining a competitive advantage. It enables organizations to process and derive insights from massive datasets, providing a comprehensive view of business operations and customer behavior.
Traditional BI vs. Big Data Environment: Traditional BI environments primarily deal with structured data in relational databases, operate in batch mode, and may face scalability challenges. In contrast, big data environments handle diverse data types, support real-time processing, scale horizontally, and offer cost-effective solutions for storing and processing large datasets. The shift to big data reflects the need for more flexible, scalable, and real-time data processing capabilities in today's digital landscape.

Q2) What are the challenges with big data?
• Difficulty in incorporating diverse, large datasets into analytical platforms, leading to gaps and inaccurate insights.
• Acute shortage of skilled professionals in Big Data analysis, creating high demand for data scientists.
• Ensuring meaningful insights from Big Data analytics and managing access for the relevant departments pose challenges.
• Handling growing volumes of data daily is a constant challenge for businesses.
• Uncertainty in choosing suitable technologies for Big Data analytics without introducing new problems.
• Managing storage of massive data presents challenges in data lakes/warehouses, leading to quality issues.
• Security and privacy concerns arise with the use of Big Data tools, increasing the risk of data exposure.

Q3) Define big data. Why is big data required? Write a note on the data warehouse environment.
Big Data: Big data refers to extremely large and complex datasets that exceed the capabilities of traditional data processing methods. It is characterized by its volume, velocity, variety, and sometimes veracity, encompassing vast amounts of information that traditional systems struggle to handle.
Why Big Data is Required: Big data is necessary for in-depth analysis, informed decision-making, predictive analytics, and gaining a competitive advantage. It enables organizations to process and derive insights from massive datasets, providing a comprehensive view of business operations and customer behavior.
• Big Data handles large, diverse datasets for insights and decision-making.
• Big Data analytics uncovers patterns, relationships, and market trends for informed decisions.
• A Data Warehouse (DW) consolidates and analyzes corporate information for business decisions.
• The DW is sourced from operational systems and external data and is populated via ETL tools.
• Business intelligence software accesses and analyzes data within the Data Warehouse.
• Both Big Data and the Data Warehouse serve organizations in making informed business decisions.

Q4) What are the three characteristics of big data? Explain the differences between BI and Data Science.
Big data is characterized by high-volume, high-velocity, and high-variety information assets.
1. Objective: BI: The primary objective of BI is to provide insights into past and current business performance. It focuses on reporting, querying, and data visualization to support strategic and operational decision-making. Data Science: Data Science is forward-looking, emphasizing the discovery of patterns, correlations, and trends in data to make predictions and inform future strategic decisions.
2. Scope: BI: BI typically deals with structured data, often sourced from internal databases and transactional systems. It is oriented towards predefined reports and dashboards. Data Science: Data Science has a broader scope, encompassing both structured and unstructured data. It involves exploring diverse datasets, including text, images, and social media, and uses advanced analytical techniques.
3. Methodology: BI: BI follows a structured and predefined approach. It focuses on key performance indicators (KPIs), standardized reporting, and delivering actionable insights based on historical and current data. Data Science: Data Science adopts an exploratory and iterative approach. It involves hypothesis testing, experimentation, and the development of predictive models to uncover insights and patterns in data.
4. Tools and Techniques: BI: BI tools are designed for data visualization, reporting, and simple analytics. SQL queries and predefined metrics are commonly used for analysis. Data Science: Data Science employs a more extensive range of tools and techniques, including statistical modeling, machine learning algorithms, and programming languages such as Python and R. It involves creating and deploying complex models to handle predictive analytics.
5. Decision-Making Horizon: BI: BI supports short- to medium-term decision-making by providing insights into current and historical performance. Data Science: Data Science supports both short-term and long-term decision-making by forecasting future trends and outcomes.

Q5) Describe the current analytical architecture for data scientists.
The current analytical architecture for data scientists typically revolves around a cloud-based ecosystem. It involves scalable storage solutions such as data lakes, allowing the storage of vast and diverse datasets. Data processing and transformation are facilitated by distributed computing frameworks like Apache Spark. Advanced analytics and machine learning models are developed using programming languages such as Python and R, with frameworks like TensorFlow and PyTorch gaining prominence. Containerization technologies like Docker facilitate seamless deployment of models. Data visualization tools like Tableau or Power BI are integrated for communicating insights effectively. This architecture ensures flexibility, scalability, and accessibility, empowering data scientists to analyze, model, and visualize data efficiently. A small Spark sketch of this flow is given below.

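To make the Spark layer of this architecture concrete, here is a minimal PySpark sketch that reads raw files from a data-lake path, aggregates them, and writes a curated result back for BI tools. The storage paths, dataset and column names are illustrative assumptions, not part of the original notes.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session; on a cluster this would run on YARN or Kubernetes.
spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Hypothetical data-lake location: CSV sales events with a header row.
sales = spark.read.csv("s3a://datalake/raw/sales/*.csv", header=True, inferSchema=True)

# Transform: total revenue per product category.
summary = (sales
           .groupBy("category")
           .agg(F.sum("amount").alias("total_revenue")))

# Persist the curated result back to the lake for visualization tools to pick up.
summary.write.mode("overwrite").parquet("s3a://datalake/curated/sales_summary")

spark.stop()
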
Q6) What are the key roles for the new Big Data ecosystem?
1. Data Engineer: Responsible for designing, constructing, and maintaining the architecture that enables the processing of large volumes of data. Data engineers develop data pipelines, ensuring efficient data ingestion, storage, and retrieval.
2. Data Scientist: Focuses on analyzing and interpreting complex datasets to extract valuable insights. Data scientists utilize statistical methods, machine learning, and predictive modeling to derive actionable information from big data.
3. Data Analyst: Examines and interprets data trends, patterns, and correlations to provide insights that support business decision-making. Data analysts play a crucial role in transforming raw data into understandable and actionable information.
4. Machine Learning Engineer: Specializes in designing, implementing, and deploying machine learning models within the big data ecosystem. They work closely with data scientists to integrate and operationalize machine learning solutions.
5. Cloud Architect: Designs and oversees the cloud infrastructure to support the storage, processing, and analysis of big data. Cloud architects ensure the scalability, reliability, and security of the entire big data ecosystem.
These roles collaboratively contribute to the effective functioning of the new big data ecosystem, facilitating the extraction of meaningful insights and driving data-driven decision-making within organizations.

Q7) What are the key skill sets and behavioural characteristics of a data scientist?
1. Analytical Skills: Proficiency in analyzing complex datasets, identifying patterns, and extracting valuable insights using statistical methods and machine learning algorithms.
2. Programming Proficiency: Strong programming skills, particularly in languages like Python or R, for data manipulation, statistical analysis, and the development of machine learning models.
3. Domain Knowledge: A solid understanding of the specific industry or domain in which the data scientist operates, enabling them to contextualize analyses and provide more relevant insights.
4. Communication Skills: Effective communication skills to convey complex findings and insights to non-technical stakeholders. Data scientists should be able to articulate their results in a clear and understandable manner.
5. Curiosity and Problem-Solving Attitude: A natural curiosity and a problem-solving mindset, encouraging data scientists to explore data, ask relevant questions, and derive actionable solutions from the information at hand.

Q8) What is big data analytics? Explain in detail with an example.
Big Data Analytics: Big data analytics refers to the process of examining and uncovering meaningful patterns, trends, and insights from large and complex datasets, commonly known as big data. It involves using advanced analytical techniques, including statistical analysis, machine learning, and data mining, to extract valuable information that can guide business decision-making and strategic planning.
Example: Consider a retail company that collects vast amounts of data from various sources, including sales transactions, customer interactions, and social media. Through big data analytics, the company can analyze this data to gain insights into customer preferences, buying patterns, and market trends. For instance, the analytics may reveal that certain products are often purchased together, allowing the company to optimize product placement in stores (a co-occurrence sketch is shown below). Additionally, sentiment analysis on social media data may provide insights into customer opinions and preferences. By leveraging big data analytics, the retail company can make informed decisions on inventory management, marketing strategies, and customer engagement, ultimately improving overall business performance.

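As a toy illustration of the "products purchased together" insight mentioned above, the following Python sketch counts how often pairs of products co-occur in the same basket. The sample baskets are invented for the example.

from collections import Counter
from itertools import combinations

# Invented sample transactions: each basket lists the products bought together.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

pair_counts = Counter()
for basket in baskets:
    # Count every unordered pair of products that appears in the same basket.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Most frequently co-purchased pairs, e.g. ('bread', 'butter') appears twice.
print(pair_counts.most_common(3))
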
Q9) Write a short note on the classification of analytics.
Analytics can be classified into three main categories:
1. Descriptive Analytics: Descriptive analytics focuses on summarizing historical data to provide insights into what has happened in the past. It involves the examination of patterns, trends, and key performance indicators (KPIs) to understand the current state of affairs.
2. Predictive Analytics: Predictive analytics uses statistical algorithms and machine learning techniques to analyze historical data and predict future outcomes. It enables organizations to make informed decisions by forecasting trends, identifying potential risks, and understanding probable scenarios.
3. Prescriptive Analytics: Prescriptive analytics goes beyond predicting future outcomes; it recommends actions to optimize results. By considering various possible actions and their potential impacts, prescriptive analytics helps organizations make strategic decisions that lead to desired outcomes.
This classification provides a framework for organizations to leverage analytics at different levels, from understanding historical data (descriptive) to predicting future trends (predictive) and prescribing optimal actions (prescriptive) for improved decision-making. A compact numerical illustration of the three levels follows below.

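The tiny Python sketch below contrasts the three levels on an invented monthly-sales series: a descriptive summary, a naive linear trend as the predictive step, and a simple rule-based recommendation as the prescriptive step. The numbers and the decision rule are illustrative assumptions only.

# Invented monthly sales figures for one product.
sales = [100, 110, 125, 140, 160]

# Descriptive: summarize what has already happened.
average = sum(sales) / len(sales)

# Predictive: naive linear trend - extend the average month-over-month change.
avg_change = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast_next = sales[-1] + avg_change

# Prescriptive: recommend an action based on the forecast (toy decision rule).
action = "increase stock" if forecast_next > average else "hold stock level"

print(f"average={average:.1f}, forecast={forecast_next:.1f}, action={action}")
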
Q10) What is big data analytics? Also write and explain the importance of big data.
Big Data Analytics: Big data analytics refers to the process of examining and uncovering meaningful patterns, trends, and insights from large and complex datasets, commonly known as big data. It involves using advanced analytical techniques, including statistical analysis, machine learning, and data mining, to extract valuable information that can guide business decision-making and strategic planning.
The importance of Big Data:
• Big Data's importance lies in efficient utilization, not just the volume of data.
• Enables cost savings through tools like Hadoop and cloud-based analytics.
• Time reductions are achieved with high-speed tools for quick data analysis and decision-making.
• Enhances understanding of market conditions through customer behavior analysis.
• Controls online reputation via sentiment analysis tools.
• Utilizes Big Data analytics for customer acquisition, retention, and loyalty.
• Solves advertising problems and provides marketing insights through data observation.

Q11) Write a short note on data science and the data science process.
Data Science: Data science is a multidisciplinary field that utilizes scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines expertise from various domains, including statistics, mathematics, computer science, and domain-specific knowledge, to uncover patterns, trends, and valuable information from large datasets.
Data Science Process: The data science process involves several iterative steps:
• Problem Definition: Clearly define the problem or question at hand and establish the objectives that data science aims to address.
• Data Collection: Gather relevant data from various sources, ensuring it is comprehensive and representative of the problem domain.
• Data Cleaning and Preprocessing: Cleanse and preprocess the data to handle missing values and outliers, and ensure it is in a suitable format for analysis.
• Exploratory Data Analysis (EDA): Conduct exploratory data analysis to understand the characteristics of the data, identify patterns, and gain initial insights.
• Feature Engineering: Create new features or transform existing ones to enhance the predictive power of models.
• Model Development: Utilize statistical and machine learning techniques to build models that address the defined problem or question.
• Model Evaluation: Evaluate the performance of models using appropriate metrics to ensure their effectiveness.
• Model Deployment: Deploy the model into a production environment, making it accessible for real-world applications.
• Monitoring and Maintenance: Continuously monitor the model's performance in a live environment and make necessary adjustments to ensure its accuracy and relevance over time.
The data science process is iterative, with feedback loops allowing for refinement and improvement at each stage. This structured approach enables data scientists to derive actionable insights and solutions from data, contributing to informed decision-making in various domains. A compressed code view of the modelling steps is sketched below.

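To connect the process steps to code, here is a minimal scikit-learn sketch covering data splitting, model development, and model evaluation on a toy dataset bundled with the library; it is an illustration of the workflow, not a prescribed implementation.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: a small built-in dataset stands in for gathered data.
X, y = load_iris(return_X_y=True)

# Split into training and test sets (part of model development and evaluation).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model development: fit a simple classifier.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation: measure accuracy on unseen data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
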
Q12) Write a short note on soft state and eventual consistency.
Soft state eventual consistency is a concept in distributed computing that aims to strike a balance between system availability and data consistency. In systems employing eventual consistency, the goal is to ensure that, given enough time and no new updates, all replicas of a piece of data will converge to the same state.
Key points:
1. Eventual Consistency: Soft state eventual consistency acknowledges that, in distributed systems, achieving immediate consistency across all nodes may not always be practical or efficient.
2. Trade-off between Consistency and Availability: It represents a trade-off between ensuring data consistency at all times (strong consistency) and allowing systems to remain available for read and write operations even during network partitions or failures.
3. Acceptance of Temporary Inconsistency: Soft state eventual consistency accepts the existence of temporary inconsistencies among replicas but expects that, over time and without further updates, all replicas will converge to a consistent state.
4. Use Cases: This approach is often suitable for systems where real-time consistency is not critical and temporary divergences in data state can be tolerated for the sake of system availability and fault tolerance.
5. Implementation: Soft state eventual consistency is implemented by allowing replicas to continue serving read and write requests independently, with eventual reconciliation to synchronize data across nodes over time.
In summary, soft state eventual consistency is a pragmatic approach in distributed systems that acknowledges the challenges of immediate consistency and prioritizes system availability while allowing for eventual convergence of data state across all replicas. A toy reconciliation sketch is given below.

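The following Python sketch simulates point 5 above with three replicas that accept writes independently and later reconcile using a last-writer-wins rule keyed on timestamps. It is a deliberately simplified model of anti-entropy reconciliation, not a real replication protocol.

# Each replica stores key -> (timestamp, value); writes land on one replica only.
replicas = [dict(), dict(), dict()]

def write(replica_id, key, value, ts):
    replicas[replica_id][key] = (ts, value)

def reconcile():
    """Anti-entropy pass: every replica adopts the newest version of each key."""
    merged = {}
    for replica in replicas:
        for key, (ts, value) in replica.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)   # last writer wins
    for replica in replicas:
        replica.update(merged)

# Two conflicting writes to the same key arrive on different replicas.
write(0, "cart:42", ["book"], ts=1)
write(2, "cart:42", ["book", "pen"], ts=2)

print("before:", [r.get("cart:42") for r in replicas])  # replicas disagree (soft state)
reconcile()
print("after: ", [r.get("cart:42") for r in replicas])  # all converge to the ts=2 value
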
Q13) What are the different phases of the Data Analytics Lifecycle? Explain each in detail.
1. Business Understanding: In this initial phase, the focus is on defining the business problem or question that data analytics aims to address. Understanding the business context, goals, and stakeholders' requirements is crucial to guide the entire analytics process.
2. Data Collection and Preparation: This phase involves gathering relevant data from various sources. Data is then cleaned, preprocessed, and transformed to ensure quality and compatibility with analytics tools. Handling missing values, outliers, and formatting issues is essential during this stage.
3. Exploratory Data Analysis (EDA): EDA involves analyzing and visualizing the data to gain insights into its characteristics, patterns, and potential relationships. Techniques such as statistical summaries, visualizations, and correlation analyses help in understanding the dataset's structure and inform subsequent analytical decisions.
4. Model Development: In this phase, statistical and machine learning models are built to address the defined business problem. This involves selecting appropriate algorithms, splitting the data into training and testing sets, and tuning model parameters to achieve optimal performance.
5. Model Evaluation and Validation: The developed models are rigorously evaluated using various metrics to assess their performance and generalization to new, unseen data. Validation ensures that the models are reliable and capable of providing accurate predictions or classifications.
6. Deployment: Once a satisfactory model is obtained, it is deployed into a production environment. This involves integrating the model into the operational workflow, making predictions on new data, and ensuring that the model's output aligns with the intended business objectives.
7. Monitoring and Maintenance: Continuous monitoring of the deployed model is essential to ensure its ongoing accuracy and relevance. If the model's performance degrades over time, adjustments and updates may be necessary. Monitoring helps maintain the model's effectiveness in a changing business environment.
8. Communication of Results: Throughout the entire lifecycle, effective communication of results is crucial. Findings and insights are communicated to stakeholders through reports, visualizations, and presentations, ensuring that the analytics outcomes are comprehensible and actionable.
This lifecycle is iterative, allowing for refinements, adjustments, and additional analyses based on feedback and changing business requirements. It forms a structured framework for conducting data analytics projects effectively.

Q14) Explain in detail the Hadoop architecture and how the various components achieve reliable, distributed processing of big data. (10 marks)
The core Hadoop architecture consists of:
• HDFS (Hadoop Distributed File System): HDFS has a master-slave architecture with a Namenode (master) and Datanodes (slaves). The Namenode manages the filesystem metadata, and the Datanodes store the actual data in blocks. The data is replicated across Datanodes for reliability.
• YARN (Yet Another Resource Negotiator): YARN has a Resource Manager (master) which manages resources and schedules jobs. The NodeManagers (slaves) run on all nodes in the cluster to launch and monitor containers running tasks.
• MapReduce: MapReduce uses YARN to execute data processing tasks in parallel. The developer writes map and reduce functions to process data. Map tasks are distributed to nodes and run in parallel on the local data. The framework sorts the map outputs, which are then fed to the reduce tasks (a toy map/shuffle/reduce sketch follows below).
This architecture provides reliability through replication, and parallel processing achieves tremendous scalability. Data locality optimizes task scheduling to move computation closer to the data. Hadoop can scale linearly by simply adding commodity hardware. Failures are handled automatically by rerunning failed tasks. This enables Hadoop to handle very large datasets across hundreds or thousands of nodes.

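To show the map/sort/reduce flow described above without a cluster, here is a single-process Python sketch of the classic word count. It only mimics the programming model; the real framework distributes map tasks to the nodes holding the data blocks.

from collections import defaultdict

lines = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs from each input line.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle/sort phase: group values by key, as the framework does between map and reduce.
grouped = defaultdict(list)
for word, count in sorted(mapped):
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)   # {'big': 3, 'clusters': 1, 'data': 2, ...}
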
Q15) Discuss data locality in detail and how it helps improve performance in Hadoop. Explain scenarios where data locality may or may not be optimal. (10 marks)
Data locality in Hadoop refers to scheduling map tasks on the same server where the input data resides rather than transferring data across nodes. This provides several benefits:
• Improves network bandwidth utilization, as data transfer is minimized.
• Lowers latency for task execution, as data movement is not required.
• Better cluster utilization, as tasks are distributed based on data location.
• Enables scaling out linearly by adding nodes without network bottlenecks.
However, strict data locality is not always optimal. Scenarios where data locality may not help:
• Unbalanced or skewed data distribution can lead to uneven cluster utilization.
• Nodes storing critical intermediate data become hotspots.
• Fetching data from a remote node may be faster than waiting for a busy node.
• Some data processing requires cross-rack communication.
Hadoop therefore allows customizable relaxation of data locality: the scheduler relaxes the locality constraint if a task has been waiting for a long time. This provides a balance between locality and overall cluster utilization (see the sketch below).

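The Python sketch below illustrates the locality-relaxation idea in a highly simplified form: a task prefers a node holding its data block, but after waiting longer than a threshold it accepts any free node. This mimics the behaviour of delay scheduling; it is not the actual Hadoop scheduler code, and the node names and threshold are invented.

def pick_node(preferred_nodes, free_nodes, wait_time, max_wait=3):
    """Choose a node for a map task, relaxing locality after max_wait scheduling ticks."""
    # First choice: a free node that already stores the task's input block.
    local = [n for n in free_nodes if n in preferred_nodes]
    if local:
        return local[0], "node-local"
    # Locality relaxed: the task has waited too long, so run it anywhere free.
    if wait_time >= max_wait and free_nodes:
        return free_nodes[0], "remote (locality relaxed)"
    # Otherwise keep the task queued and try again on the next scheduling tick.
    return None, "waiting for a local slot"

# The block for this task is stored on node2 and node5; only node1 and node3 are free.
print(pick_node({"node2", "node5"}, ["node1", "node3"], wait_time=1))  # still waiting
print(pick_node({"node2", "node5"}, ["node1", "node3"], wait_time=3))  # runs remotely
print(pick_node({"node2", "node5"}, ["node2", "node3"], wait_time=0))  # node-local
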
Q16) Analyse the small file problem in HDFS and discuss various solutions with their pros and cons. (10 marks)
The small file problem in HDFS arises when a large number of small files is stored. Challenges with small files:
• Namenode memory overhead for storing per-file metadata.
• Low storage utilization due to many under-filled blocks.
• Excessive seeks and network chatter during file access.
Solutions to the small file problem include:
• Sequence files that pack many small files into larger files; con: code changes are needed.
• HAR files that archive small files; con: accessing the archive requires unpacking.
• Compression to reduce storage usage; con: compressed files cost CPU.
• In-memory filesystems such as Alluxio to cache files; con: memory overhead.
• Hive/HBase to store small records in their own storage layer; con: integration complexity.
• A larger block size to pack more data per block; con: reduced parallelism.
There is no single optimal solution. Based on the access patterns, the storage overhead, and the need for native HDFS access, an appropriate technique should be chosen to optimize the handling of small files (a packing sketch is shown below).

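To illustrate the packing idea behind sequence files and HAR archives, the sketch below bundles many small local files into one container file plus an in-memory index of byte offsets. It is a toy stand-in for the real Hadoop container formats and uses invented file names.

import os

def pack(small_files, container_path):
    """Concatenate small files into one container; return {name: (offset, length)}."""
    index = {}
    with open(container_path, "wb") as out:
        for name in small_files:
            data = open(name, "rb").read()
            index[name] = (out.tell(), len(data))   # remember where this file starts
            out.write(data)
    return index

def read_packed(container_path, index, name):
    """Random-access read of one original file out of the container."""
    offset, length = index[name]
    with open(container_path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Invented small files for the demo.
for i in range(3):
    with open(f"part-{i}.txt", "w") as f:
        f.write(f"record {i}\n")

idx = pack([f"part-{i}.txt" for i in range(3)], "container.bin")
print(read_packed("container.bin", idx, "part-1.txt"))   # b'record 1\n'
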
Q17) Explain the key characteristics of NoSQL databases and how they differ from traditional RDBMS. (10 marks)
NoSQL databases are designed for storing and managing large volumes of structured, semi-structured and unstructured data. Key characteristics:
• Flexible schema - no fixed schema, which allows easy evolution of data models.
• Scalability - built for horizontal scaling to distribute data across servers.
• Eventual consistency - favors availability over ACID guarantees for high performance.
• Non-relational - uses document, key-value, graph or wide-column storage models rather than relational tables.
• High performance - in-memory caching and distributed architecture allow high throughput and low latency.
In contrast, RDBMS have fixed schemas, vertical scaling, strong consistency, and table-based relational models suitable for structured transactional data. NoSQL is better suited for web, mobile and Big Data applications dealing with variably structured data at scale. The flexible-schema characteristic is illustrated below.

Q18) Compare and contrast graph databases and document databases - their data models, query languages, and use cases. (10 marks)
Graph databases like Neo4j use nodes and edges as the data model to represent networked data. Relationships are first-class citizens, enabling fast traversal across networks. Query languages are based on graph patterns and traversal. Use cases include social networks, knowledge graphs, and IoT networks.
Document databases like MongoDB use documents (JSON/BSON) as the primary data model. Documents allow nested structures without a fixed schema. Query languages provide declarative ways to filter and aggregate documents. Use cases include content management, e-commerce, and web applications.
While graphs focus on relationships, documents focus on structure within records. Graphs allow efficient traversal, whereas documents make it easy to query nested data. Graphs are suited for interconnected data, while documents suit applications with semi-structured records. The contrast between the two data models is sketched below.

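As a neutral, driver-free illustration of the two data models, the Python sketch below represents the same tiny social dataset once as a graph (nodes plus edges) and once as nested documents, then answers "who does Alice follow?" in each style. The data is invented.

# Graph model: explicit nodes and typed edges; queries traverse relationships.
nodes = {"alice", "bob", "carol"}
edges = [("alice", "FOLLOWS", "bob"), ("alice", "FOLLOWS", "carol"), ("bob", "FOLLOWS", "carol")]
alice_follows_graph = [dst for src, rel, dst in edges if src == "alice" and rel == "FOLLOWS"]

# Document model: each user is one nested record; queries read inside the document.
users = [
    {"name": "alice", "follows": ["bob", "carol"]},
    {"name": "bob", "follows": ["carol"]},
    {"name": "carol", "follows": []},
]
alice_follows_docs = next(u["follows"] for u in users if u["name"] == "alice")

print(alice_follows_graph)   # ['bob', 'carol'] - found by traversing edges
print(alice_follows_docs)    # ['bob', 'carol'] - found inside one document
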
Q19) Discuss the trade-offs between consistency, availability and partition tolerance in distributed NoSQL systems. Explain with examples. (10 marks)
According to the CAP theorem, a distributed system can only guarantee two of the following three properties at the same time:
• Consistency - all nodes see the latest updated data.
• Availability - the system guarantees a response despite failures.
• Partition tolerance - the system keeps working despite arbitrary network partitioning.
Most NoSQL systems favor availability and partition tolerance over strong consistency. For example, Cassandra offers tunable consistency levels like ANY and QUORUM to allow flexible consistency guarantees. DynamoDB and MongoDB also relax consistency for high availability and low latency. Applications have to account for the resulting inconsistency, for example by handling stale reads or merging conflicting writes. However, for large-scale web and mobile applications, availability and performance often trump strong consistency requirements. These trade-offs allow NoSQL systems to scale, perform, and remain highly available at internet scale (a quorum sketch follows below).

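The sketch below models the tunable-consistency idea with three replicas and configurable write/read quorums W and R: when R + W > N the read quorum always overlaps the latest write, while smaller quorums trade that guarantee for availability. This is a simplified simulation, not any vendor's implementation.

N = 3
replicas = [{"value": "v0", "version": 0} for _ in range(N)]

def write(value, version, w):
    """Acknowledge the write after it reaches only w of the N replicas."""
    for replica in replicas[:w]:
        replica.update(value=value, version=version)

def read(r):
    """Read from r replicas and return the newest version seen."""
    seen = replicas[-r:]                       # deliberately the 'other side' of the cluster
    return max(seen, key=lambda rep: rep["version"])["value"]

write("v1", version=1, w=2)
print(read(r=2))   # R+W = 4 > N: the quorums overlap, so the read always sees 'v1'
print(read(r=1))   # R+W = 3 = N: may return the stale 'v0' from the untouched replica
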
Q20) What are the characteristics of big data and what are the challenges in analyzing them using traditional statistical techniques? How do machine learning techniques help in big data analytics?
Big data refers to large, complex datasets characterized by the 3 Vs - Volume, Velocity and Variety. Traditional statistical methods are inadequate for big data due to:
• Scalability issues for large volumes of data.
• Difficulty in integrating heterogeneous and unstructured data types.
• The requirement for complex modelling and sampling techniques.
Machine learning provides solutions through:
• Distributed learning algorithms, run on frameworks like MapReduce, that scale across clusters.
• The ability to uncover patterns from all kinds of unstructured data.
• Techniques like neural networks that learn complex models from raw data.
Thus, machine learning, and especially deep learning, provides the capabilities to effectively process big data and drive predictive analytics.

Q21) Explain the role of dimensionality reduction in analyzing high-dimensional big data using machine learning. Discuss various techniques, how they work, and their comparative advantages.
Real-world big data is often characterized by very high dimensionality with numerous features. Analysing such data directly leads to computational complexity and overfitting. Dimensionality reduction transforms the data to lower dimensions while preserving useful information. Key techniques:
• Feature selection - identify and retain the most salient features, for example based on correlations. Works well for sparse data.
• Feature extraction - combine features using methods like PCA. Useful for dense data.
• Autoencoders - neural-network-based models that compress the input into a lower-dimensional code. They capture non-linear relationships.
Dimensionality reduction removes noise, collapses redundant information, and results in more robust models. The choice of technique depends on the data characteristics and the modelling objectives (an example follows below).

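Here is a short scikit-learn sketch contrasting feature extraction (PCA) with a simple variance-based feature selection on a built-in 64-dimensional dataset; the chosen component count and threshold are arbitrary illustrative values.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# 8x8 digit images flattened into 64 features per sample.
X, _ = load_digits(return_X_y=True)
print("original shape:", X.shape)            # (1797, 64)

# Feature extraction: project onto the top 10 principal components.
X_pca = PCA(n_components=10).fit_transform(X)
print("after PCA:", X_pca.shape)             # (1797, 10)

# Feature selection: drop features whose variance is below an arbitrary threshold.
X_sel = VarianceThreshold(threshold=5.0).fit_transform(X)
print("after variance-based selection:", X_sel.shape)
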
Q22) What are the additional steps required in the machine learning workflow for analyzing streaming big data? Explain with examples.
The lifecycle for analysing streaming data involves additional considerations:
• Preprocess and extract features from continuously incoming data streams.
• Train models on smaller data batches in real time rather than on full datasets.
• Use adaptive learning algorithms, such as online learning, to update models incrementally.
• Continuously evaluate predictions and retrain the model to handle concept drift.
• Store only summary statistics rather than the full data history.
For instance, consider real-time anomaly detection on user activity streams: the model trains on the latest batches, makes predictions, measures the error, and retrains to address drift. Summary statistics such as rolling averages are tracked to detect anomalies. The workflow must operate continuously at low latency on streaming data (an incremental-learning sketch follows below).

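The sketch below uses scikit-learn's partial_fit interface to update a linear classifier one mini-batch at a time, the way a streaming pipeline would, on synthetically generated batches; the batch source, sizes and labelling rule are invented for the example.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

def next_batch(size=200):
    """Stand-in for a stream consumer: emit a small labelled mini-batch."""
    X = rng.normal(size=(size, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # simple synthetic labelling rule
    return X, y

for step in range(10):
    X, y = next_batch()
    if step > 0:
        # Continuous evaluation on the incoming batch before learning from it.
        print(f"batch {step}: accuracy so far = {model.score(X, y):.2f}")
    # Incremental update instead of retraining on the full history.
    model.partial_fit(X, y, classes=classes)
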
Q23) A. Explain the evolution of big data processing frameworks from MapReduce to current platforms. B. Discuss how YARN improved resource management over classic MapReduce. C. Explain how Hive and Impala enabled SQL-like data warehousing capabilities over Hadoop. D. Discuss the role of Cloudera and Snowflake as data platform vendors.
A - MapReduce provided the initial breakthrough in distributed processing of big data on commodity hardware. However, it had limitations such as the lack of general resource management and of resource reuse across jobs.
B - YARN (Yet Another Resource Negotiator) opened up Hadoop for diverse workloads beyond MapReduce. It introduced a Resource Manager to allocate resources across the cluster and an Application Master to negotiate resources for each job. This enabled running multiple applications on Hadoop without contention.
C - Hive provides a SQL interface and data warehouse capabilities on top of Hadoop. It converts SQL queries into MapReduce jobs. Impala made SQL query processing on Hadoop faster through a distributed query engine. This enabled analysts to leverage Hadoop using familiar SQL syntax.
D - Cloudera offers a robust open-source-based Hadoop distribution along with proprietary management and governance tools. It aims to make Hadoop usable in enterprise environments with better security, automation and optimization. Snowflake provides a cloud data platform with separation of storage and compute. It delivers data warehouse capabilities built specifically for the cloud and allows enterprises to leverage cloud infrastructure for analytics at scale.
The evolution from MapReduce to YARN to SQL interfaces shows how big data processing frameworks have matured over time by becoming more versatile, scalable and easier to use for enterprises. Vendors like Cloudera and Snowflake have further enabled Hadoop and cloud adoption in mainstream business analytics.
