DS


1) What is EDA? Explain any two types of visualization.
> EDA stands for Exploratory Data Analysis. It is the process of investigating and analyzing a dataset to discover patterns, trends, and relationships. EDA is an essential step in any data science project, as it helps to ensure that the data is well understood and that any subsequent analysis is appropriate and meaningful. Data visualization is a key component of EDA. It involves creating visual representations of the data, such as charts, graphs, and maps, which makes it easier to identify patterns and trends and to spot anomalies or outliers. Two common types of data visualization used in EDA are:
Histograms: Histograms show the distribution of a quantitative variable. They are useful for identifying outliers, skewness, and other patterns in the data.
Scatter plots: Scatter plots show the relationship between two quantitative variables. They can be used to identify correlations, trends, and patterns.
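
A minimal sketch of both plot types in Python, assuming matplotlib and NumPy are installed; the "age" and "income" values are synthetic and only for illustration:

import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 200 ages and a loosely related income variable
rng = np.random.default_rng(42)
age = rng.normal(loc=35, scale=10, size=200)
income = 1000 * age + rng.normal(scale=5000, size=200)

# Histogram: distribution of a single quantitative variable
plt.hist(age, bins=20)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# Scatter plot: relationship between two quantitative variables
plt.scatter(age, income)
plt.xlabel("Age")
plt.ylabel("Income")
plt.show()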

2) What is data normalization? Illustrate any one type of data normalization technique with an example.
> Data normalization is a process of organizing data in a database to minimize redundancy and improve data integrity. It involves dividing the data into smaller, more manageable tables that are linked together using relationships. There are several different types of data normalization, each with its own set of rules. The most common types of data normalization are:
First normal form (1NF): Each attribute in a table must be atomic, meaning that it cannot be divided into smaller parts. Additionally, each record in a table must be unique.
Second normal form (2NF): All non-key attributes in a table must be fully dependent on the primary key. This means that they cannot be dependent on only a part of the primary key.
Third normal form (3NF): All non-key attributes in a table must be directly dependent on the primary key. This means that they cannot be transitively dependent on the primary key through another non-key attribute.
Here is an example of a data normalization technique, using the first normal form (1NF):

Original table:
StudentID | Name     | Address         | Courses
1         | John Doe | 123 Main Street | Math, English
2         | John Doe | 456 Elm Street  | Science, History

Normalized tables:
StudentID | Name     | Address
1         | John Doe | 123 Main Street
2         | John Doe | 456 Elm Street

CourseID | StudentID | Course Name
1        | 1         | Math
2        | 1         | English
3        | 2         | Science
4        | 2         | History

The original table is not in 1NF because the Courses attribute contains multiple values for each record. This is called a repeating group. To normalize the table, we split the Courses attribute into a separate table, Courses, and link the two tables using the StudentID attribute. Now, each record in both tables is unique and each attribute is atomic. The database is also easier to maintain and update, as we can now add or remove courses without having to edit the student records. Data normalization is a complex topic, but it is an important skill for any data scientist or database administrator to have. By normalizing your data, you can improve the quality and integrity of your database and make it easier to use and maintain.
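
The same 1NF split can be sketched in pandas (assuming the pandas library; the column names simply mirror the tables above):

import pandas as pd

# Unnormalized table: the Courses column holds a repeating group
students = pd.DataFrame({
    "StudentID": [1, 2],
    "Name": ["John Doe", "John Doe"],
    "Address": ["123 Main Street", "456 Elm Street"],
    "Courses": ["Math, English", "Science, History"],
})

# Student table without the repeating group
student_table = students[["StudentID", "Name", "Address"]]

# Course table: one atomic row per (student, course) pair
courses = (
    students.assign(CourseName=students["Courses"].str.split(", "))
    .explode("CourseName")[["StudentID", "CourseName"]]
    .reset_index(drop=True)
)
courses.insert(0, "CourseID", range(1, len(courses) + 1))

print(student_table)
print(courses)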

3) What is data? Explain types of data.
> Data is any collection of information that has been organized so that it can be processed and analyzed. Data can be quantitative or qualitative, structured or unstructured, and big or small.
Quantitative data is numerical data that can be measured and analyzed using mathematical methods. Examples of quantitative data include customer ages, product sales, and employee salaries.
Qualitative data is non-numerical data that represents descriptions, characteristics, or opinions. Examples of qualitative data include customer reviews, social media posts, and product descriptions.
Structured data is data that is organized in a predefined format, such as a database or spreadsheet. This makes it easy to store, query, and analyze the data. Examples of structured data include customer contact information, product catalogs, and financial transactions.
Unstructured data is data that does not have a predefined format. This type of data can be more difficult to store, query, and analyze, but it can also contain valuable insights. Examples of unstructured data include images, videos, and text documents.
Big data is a term used to describe datasets that are too large and complex to be processed using traditional data processing methods. Big data datasets can contain billions or even trillions of data points. Examples of big data include data from social media, sensors, and financial markets.

4) What is kurtosis? Explain its types.
> Kurtosis is a statistical measure of the shape of a probability distribution. It describes how peaked or flat the distribution is, and how thick or thin the tails are. There are three main types of kurtosis:
Mesokurtic: A mesokurtic distribution has a normal kurtosis, meaning that it has a moderate peak and tails that are neither too thick nor too thin.
Leptokurtic: A leptokurtic distribution has a positive kurtosis, meaning that it has a sharp peak and heavy tails. This type of distribution is often associated with outliers.
Platykurtic: A platykurtic distribution has a negative kurtosis, meaning that it has a flat peak and light tails. This type of distribution is often associated with data that is tightly clustered around the mean.
Kurtosis can be used to identify anomalies in data, to assess the risk of an investment, and to understand the behaviour of complex systems. For example, a leptokurtic distribution may indicate that an investment is more likely to experience extreme returns, either positive or negative.
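
A small illustration of the three types, assuming NumPy and SciPy are installed; scipy.stats.kurtosis reports excess kurtosis by default (about 0 for mesokurtic, positive for leptokurtic, negative for platykurtic):

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
normal_data = rng.normal(size=10_000)            # roughly mesokurtic
heavy_tails = rng.standard_t(df=3, size=10_000)  # leptokurtic (heavy tails)
flat_data = rng.uniform(-1, 1, size=10_000)      # platykurtic (light tails)

# Fisher's definition: excess kurtosis, i.e. kurtosis minus 3
print("normal :", kurtosis(normal_data))   # close to 0
print("t(3)   :", kurtosis(heavy_tails))   # clearly positive
print("uniform:", kurtosis(flat_data))     # close to -1.2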

5) What is Box plot? Describe the process to identify an outlier with Box plot.
> A box plot, also known as a box-and-whisker plot, is a graphical method for displaying the distribution of data. It is a useful tool for identifying outliers and understanding the spread of the data. A box plot is constructed by first calculating the following statistics:
Median: The middle value in the sorted dataset.
First quartile (Q1): The median of the lower half of the dataset.
Third quartile (Q3): The median of the upper half of the dataset.
Interquartile range (IQR): The difference between the third and first quartiles (IQR = Q3 - Q1).
The box plot is then constructed as follows: A box is drawn from Q1 to Q3. A line is drawn through the box at the median. Whiskers are drawn from the box to the minimum and maximum values in the dataset, or to 1.5 IQRs from the box, whichever is smaller. Any data points that fall outside of the whiskers are considered to be outliers. Outliers are plotted as circles outside of the whiskers. Box plots are a simple but effective way to identify outliers and understand the distribution of data. They are often used in data science and other fields to explore and analyze data.
Here are the steps to identify an outlier with a box plot:
1. Calculate the median, first quartile, third quartile, and interquartile range of the data.
2. Construct the box plot by drawing a box from the first quartile to the third quartile, a line through the box at the median, and whiskers from the box to the minimum and maximum values in the dataset, or to 1.5 IQRs from the box, whichever is smaller.
3. Identify any data points that fall outside of the whiskers. These data points are considered to be outliers.
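
A minimal sketch of the same IQR rule in Python, assuming NumPy (matplotlib's plt.boxplot would draw the actual plot from the same array):

import numpy as np

data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 21, 45])  # 45 is a likely outlier

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_whisker_limit = q1 - 1.5 * iqr
upper_whisker_limit = q3 + 1.5 * iqr

# Points beyond the whisker limits are flagged as outliers
outliers = data[(data < lower_whisker_limit) | (data > upper_whisker_limit)]
print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers)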

6) Explain the terms data, information and knowledge.
> Data is raw, unprocessed facts. It can be anything from numbers and letters to images and videos. Data is often collected and stored in databases or spreadsheets. Information is data that has been processed and organized so that it has meaning. It is the result of giving context and interpretation to data. For example, the number "100" is data, but the statement "The average temperature today was 100 degrees Fahrenheit" is information. Knowledge is the understanding of how to use information to solve problems or make decisions. It is the ability to apply information to real-world situations. For example, knowing that the average temperature today was 100 degrees Fahrenheit might help you decide whether or not to go for a walk. Here is a simple analogy to help understand the difference between data, information, and knowledge: Data is like a pile of bricks. Information is like a blueprint for a house. Knowledge is like the ability to build a house using the blueprint. Data is the foundation of information and knowledge. Without data, there would be nothing to process or understand. Information is the bridge between data and knowledge. It provides the context and interpretation that is needed to turn data into knowledge. Knowledge is the ultimate goal of data science. It is the ability to use data to solve problems and make better decisions.

7) What is Data wrangling? Explain with any one package.
> Data wrangling is the process of cleaning, transforming, and manipulating data to prepare it for analysis. It is a crucial step in any data science project, as it ensures that the data is of high quality and in a format that can be easily analyzed. Data wrangling can be a complex and time-consuming process, but it is essential for producing accurate and meaningful results. The specific steps involved in data wrangling will vary depending on the nature of the data and the desired outcome. However, some common tasks include:
Identifying and removing errors: This may involve correcting typos, removing duplicate records, and handling missing values.
Transforming the data: This may involve changing the format of the data, converting units of measurement, and creating new variables.
Cleaning the data: This may involve removing outliers, identifying and correcting inconsistencies, and normalizing the data.
There are a number of software packages that can be used for data wrangling. One popular package is Pandas, a Python library that provides high-performance, easy-to-use data structures and data analysis tools.

import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol'], 'age': [25, 30, 35]})
# Identify and remove duplicate records
df = df.drop_duplicates()
# Transform the data by changing the format of the 'age' column
df['age'] = df['age'].astype('int32')
# Clean the data by removing outliers
df = df.loc[df['age'] < 40]
# Normalize the data by scaling the 'age' column
df['age'] = df['age'] / df['age'].max()
# Print the DataFrame
print(df)

Output:
    name       age
0  Alice  0.714286
1    Bob  0.857143
2  Carol  1.000000

8) Explain data, information and knowledge triangle.
> The data, information, and knowledge (DIKW) triangle is a model that illustrates the relationship between data, information, and knowledge in the context of data science. The triangle is often depicted as a pyramid, with data at the bottom and knowledge at the top. Data is the raw, unprocessed facts that are collected. It can be in any format, such as numbers, letters, images, or videos. Information is data that has been processed and organized so that it has meaning. It is the result of giving context and interpretation to data. Knowledge is the understanding of how to use information to solve problems or make decisions. It is the ability to apply information to real-world situations. The DIKW triangle shows that data is the foundation for information and knowledge. Without data, there would be nothing to process or understand. Information is the bridge between data and knowledge. It provides the context and interpretation that is needed to turn data into knowledge. Knowledge is the ultimate goal of data science. It is the ability to use data to solve problems and make better decisions. The DIKW triangle is a useful model for understanding the different stages of data science projects. Data scientists typically start by collecting and cleaning data. Once the data is clean, they can begin to process and analyze it to extract information. The information can then be used to generate knowledge, which can be used to solve problems or make decisions.

9) Explain the process of Web crawling.
> Web crawling, also known as web spidering or web indexing, is the process of automatically exploring the World Wide Web and downloading its pages to create a copy of the web. Web crawlers, or spiders, are programs that systematically browse the web, starting from a list of known URLs, and following hyperlinks to discover new pages. Web crawling is an essential process in data science, as it allows researchers to collect large datasets from the web. These datasets can be used for a variety of purposes, such as:
Search engine optimization (SEO): Web crawlers are used by search engines to index the web and provide relevant results to users.
Market research: Web crawling can be used to collect data on consumer behavior, product sentiment, and market trends.
Natural language processing (NLP): Web crawling can be used to collect large corpora of text data, which can be used to train NLP models.
Machine learning: Web crawling can be used to collect data to train machine learning models for a variety of tasks, such as image classification, spam filtering, and fraud detection.
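
A minimal crawler sketch in Python, assuming the requests and beautifulsoup4 packages are installed; the seed URL is a placeholder, and a real crawler would also respect robots.txt and rate limits:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "https://example.com"  # placeholder seed URL
frontier = [start_url]             # known URLs still to visit
visited = set()
max_pages = 10                     # keep the sketch small

while frontier and len(visited) < max_pages:
    url = frontier.pop(0)
    if url in visited:
        continue
    visited.add(url)
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    print(url, "-", soup.title.string if soup.title else "no title")
    # Follow hyperlinks to discover new pages
    for link in soup.find_all("a", href=True):
        frontier.append(urljoin(url, link["href"]))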

10) Explain MapReduce Architecture.
> MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large datasets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, which lowers the overhead on the cluster network and reduces the required processing power. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
Components of MapReduce architecture:
Client: The MapReduce client is the one who brings the job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Manager.
Job: The MapReduce job is the actual work that the client wants to do, which is made up of many smaller tasks that the client wants to process or execute.
Hadoop MapReduce Master: It divides the particular job into subsequent job parts.
Job parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job parts are combined to produce the final output.
Input data: The dataset that is fed to MapReduce for processing.
Output data: The final result obtained after the processing.
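
The programming model itself (not Hadoop) can be sketched in a few lines of plain Python as a word count, with explicit map, shuffle, and reduce steps; the two input strings stand in for input splits:

from collections import defaultdict

documents = ["big data is big", "map reduce splits big jobs"]

# Map phase: emit (key, value) pairs from each input split
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group the emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: combine the values for each key into the final output
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'big': 3, 'data': 1, 'is': 1, 'map': 1, 'reduce': 1, 'splits': 1, 'jobs': 1}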

10) What is NoSQL? Briefly explain its types.
> NoSQL is an umbrella term for non-relational database management systems (DBMS). NoSQL databases are designed to handle large volumes of data that are difficult to store and manage in relational databases. NoSQL databases are typically classified into four types:
Document stores: Document stores store data in JSON, XML, or other document-oriented formats. They are flexible and scalable, and they are often used to store web application data.
Key-value stores: Key-value stores store data as key-value pairs. They are simple and efficient, and they are often used for caching and session management.
Column-family stores: Column-family stores store data in columns, which makes them well-suited for analytical workloads. They are scalable and fault-tolerant, and they are often used for big data applications.
Graph databases: Graph databases store data as nodes and edges, which makes them well-suited for representing relationships between data points. They are scalable and efficient, and they are often used for social network analysis and fraud detection.
MongoDB is a document store database that is popular among web developers and data scientists. MongoDB is easy to use and scale, and it offers a variety of features that make it well-suited for storing and managing complex data.

11) What is collection in MongoDB? Give an example to create a collection in MongoDB.
> A collection in MongoDB is a database object that stores documents. A document is a JSON-like object that can contain any type of data, such as strings, numbers, arrays, and embedded documents. Collections are similar to tables in relational databases, but they are more flexible and scalable. Collections can be dynamically created and expanded, and they can store any type of data, regardless of its structure. To create a collection in MongoDB, you can use the db.createCollection() command. For example, to create a collection called users, you would use the following command:
db.createCollection("users")
You can also create a collection by inserting a document into a non-existent collection. MongoDB will automatically create the collection if it does not already exist. For example, to create a collection called products and insert a document into it, you would use the following command:
db.products.insertOne({
  name: "Product 1",
  price: 100,
  quantity: 10
})
Once you have created a collection, you can start adding more documents to it. You can also query and update the documents in a collection using the MongoDB query language.

12) Write note on XPath.
> XPath (XML Path Language) is a language for navigating XML documents. It uses a path-like syntax to identify nodes in an XML document. XPath can be used to select nodes, extract data from nodes, and modify nodes. XPath is used in a variety of applications, including:
XSLT (Extensible Stylesheet Language Transformations): XSLT is a language for transforming XML documents into other formats, such as HTML or PDF. XPath is used in XSLT to select the nodes in the input XML document that are to be transformed.
XQuery (XML Query Language): XQuery is a language for querying XML documents. XPath is used in XQuery to select the nodes in the XML document that are to be returned by the query.
XML Schema: XML Schema is a language for defining the structure and content of XML documents. XPath is used in XML Schema to specify the constraints on the nodes in an XML document.
XPath syntax: XPath expressions use a path-like syntax to identify nodes in an XML document. The path starts at the root element of the document and then follows a sequence of steps to the desired node. Each step in the path consists of a node type, a predicate, and a list of axis specifiers. The node type specifies the type of node to select, such as element, attribute, or text node. The predicate is an optional expression that can be used to filter the nodes that are selected. The axis specifiers specify the relationship between the current node and the node to be selected.
XPath example: The following XPath expression selects all of the element nodes with the name product that are descendants of the element node with the name category: /category//product
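
A short sketch of evaluating that expression from Python, assuming the lxml package and a small made-up XML snippet:

from lxml import etree

xml = """<category>
  <section>
    <product>Laptop</product>
    <product>Phone</product>
  </section>
  <product>Tablet</product>
</category>"""

root = etree.fromstring(xml)
# /category//product selects every product element that is a descendant of category
for product in root.xpath("/category//product"):
    print(product.text)  # Laptop, Phone, Tablet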

13) Write a note on HBase.
> HBase is a distributed, column-oriented NoSQL database that runs on top of Hadoop. It is designed for storing and querying large amounts of data, such as log files, web pages, and social media data. HBase is modeled after Google's Bigtable database, and it provides similar features, such as scalability, fault tolerance, and real-time access to data. HBase stores data in tables, which are similar to tables in relational databases. However, HBase tables are column-oriented, meaning that data is stored in columns rather than rows. This makes HBase well-suited for querying data by column, which is a common operation in big data applications. HBase also provides a number of features that make it well-suited for real-time applications. For example, HBase can be configured to flush data to disk asynchronously, which means that data can be written to HBase and made available for querying immediately. Additionally, HBase supports atomic read and write operations, which means that multiple applications can read and write to the same data without causing conflicts. HBase is used by a variety of companies to store and query large amounts of data. For example, Facebook uses HBase to store and query user data, Twitter uses HBase to store and query tweets, and Yahoo uses HBase to store and query search logs.

14) Discuss the 5 V's of Data.
> The 5 V's of Data are a framework for understanding the characteristics of big data. They are:
Volume: The amount of data that is being generated and stored.
Velocity: The speed at which data is being generated and processed.
Variety: The different types of data that are being generated and stored.
Veracity: The accuracy and reliability of the data.
Value: The usefulness of the data for analysis and decision-making.
Volume: The volume of data is growing rapidly. In 2023, it is estimated that the world will generate 97 zettabytes of data. This is an enormous amount of data, and it is difficult to store and process using traditional methods.
Velocity: The velocity of data is also increasing. Data is now being generated in real time from a variety of sources, such as social media, sensors, and financial transactions. This real-time data can be used to power applications such as fraud detection and stock trading.
Variety: The variety of data is also increasing. Traditional data sources, such as relational databases, typically store structured data. However, new data sources, such as social media and sensors, are generating unstructured data. Unstructured data is more difficult to store and process than structured data, but it can contain valuable insights.
Veracity: The veracity of data is important for ensuring that the results of analysis are accurate and reliable. Data can be inaccurate or incomplete for a variety of reasons, such as human error, sensor errors, and data corruption. It is important to clean and validate data before using it for analysis.
Value: The value of data is the most important V of all. Data is only valuable if it can be used to generate insights and make better decisions. Data scientists use a variety of techniques to extract value from data, such as machine learning, statistical analysis, and data visualization.
The 5 V's of Data are a useful framework for understanding the challenges and opportunities of big data. By understanding the different characteristics of big data, organizations can develop the right tools and processes to manage and analyze their data effectively.

15) How to create indexes in MongoDB? Give example.
> To create an index in MongoDB, you can use the createIndex() method. This method takes a document as its argument, which specifies the fields to be indexed and the sort order for each field. For example, to create an ascending index on the name field of the users collection, you would use the following command:
db.users.createIndex({ name: 1 });
To create a compound index on the name and age fields, you would use the following command:
db.users.createIndex({ name: 1, age: 1 });
You can also create a unique index, which prevents duplicate values from being stored in the indexed field. To create a unique index, you would set the unique option to true. For example, to create a unique index on the email field of the users collection, you would use the following command:
db.users.createIndex({ email: 1 }, { unique: true });
The same kind of index can also be created from application code, for example with the Node.js driver:

const MongoClient = require('mongodb').MongoClient;
const client = new MongoClient('mongodb://localhost:27017');

client.connect(function (err) {
  if (err) throw err;
  // 'mydb' stands in for whichever database holds the users collection
  const usersCollection = client.db('mydb').collection('users');
  usersCollection.createIndex({ name: 1 }, { unique: true }, function (err, indexName) {
    if (err) throw err;
    console.log('Index created successfully: ' + indexName);
    client.close();
  });
});

16) What is MongoDB? State its features.
> MongoDB is a NoSQL document database that stores data in flexible, JSON-like documents. It is a popular choice for modern applications because it is scalable, performant, and easy to use.
Features of MongoDB:
Document model: MongoDB stores data in documents, which are self-contained units of data that can be easily queried and updated. This makes MongoDB a good choice for applications that need to store complex data structures.
Schema-less: MongoDB does not require a predefined schema, which gives developers more flexibility in how they store and manage their data.
Horizontal scaling: MongoDB can be scaled horizontally by adding more servers to a cluster. This makes it well-suited for applications that need to handle large amounts of data and traffic.
Replication: MongoDB supports replication, which allows you to create multiple copies of your data for high availability and disaster recovery.
Performance: MongoDB is a very performant database, especially for read-heavy applications.

17) How to create, use, show and delete databases in MongoDB? Give example.
> Create a database in MongoDB: To create a database in MongoDB, you can use the use command. This command will create the database if it does not already exist, or switch to the database if it does exist. For example, to create a database named my_database, you would use the following command:
use my_database
MongoDB will respond with a message that it has switched to the new database.
Use a database in MongoDB: Once you have created a database, you can use the use command to switch to it. For example, to switch to the my_database database, you would use the following command:
use my_database
You can also use the db object to reference the current database. For example, to get a list of all the collections in the current database, you would use the following code:
db.listCollections().toArray(function(err, collections) {
  console.log(collections);
});
Show databases in MongoDB: To show a list of all the databases in MongoDB, you can use the show dbs command. This command will print a list of all the database names to the console. For example, to show a list of all the databases, you would use the following command:
show dbs
Delete a database in MongoDB: To delete the current database, you can use the db.dropDatabase() method. This method will delete the database and all of its collections. For example, to delete the my_database database, you would switch to it and then run the following command:
db.dropDatabase()
MongoDB drops the database and returns a confirmation of the operation.

18) Explain in detail homogeneous distributed database and heterogeneous distributed database.
> A homogeneous distributed database is a distributed database system in which all nodes are identical. This means that all nodes run the same database management system (DBMS) software and store the same database schema. Homogeneous distributed databases are relatively easy to manage and maintain, as there is no need to worry about compatibility issues between different DBMS software or database schemas. Examples of homogeneous distributed database systems include MySQL Cluster, PostgreSQL Cluster, and Oracle Real Application Clusters (RAC).
A heterogeneous distributed database is a distributed database system in which different nodes can run different DBMS software and store different database schemas. This makes heterogeneous distributed databases more flexible and versatile than homogeneous distributed databases, as they can be used to store and manage a wider variety of data. However, heterogeneous distributed databases can be more complex to manage and maintain, as there is a need to ensure compatibility between different DBMS software and database schemas. Examples of heterogeneous distributed database systems include Apache Hadoop HBase, Apache Hive, and Apache Spark SQL.

19) Write a short note on Hadoop Architecture. State its advantages.
> Hadoop is a distributed computing framework that provides a way to process large datasets in a scalable and fault-tolerant manner. Hadoop is based on the MapReduce programming model, which breaks down a large processing job into smaller tasks that can be executed in parallel on multiple nodes. The Hadoop architecture consists of two main components:
Hadoop Distributed File System (HDFS): HDFS is a distributed file system that stores data across multiple nodes in the cluster. HDFS is highly scalable and fault-tolerant, as it replicates data across multiple nodes to ensure that data is not lost if a node fails.
YARN: YARN is a resource management framework that allocates resources to Hadoop applications. YARN is responsible for scheduling and monitoring Hadoop jobs, and it ensures that jobs are completed efficiently.
Advantages of Hadoop Architecture:
Scalability: Hadoop is highly scalable, as it can scale horizontally by adding more nodes to the cluster. This makes Hadoop well-suited for processing large datasets.
Fault tolerance: Hadoop is fault-tolerant, as it replicates data across multiple nodes in the cluster. This means that Hadoop applications can continue to process data even if a node fails.
Cost-effectiveness: Hadoop is a cost-effective solution for processing large datasets, as it can run on commodity hardware.
Flexibility: Hadoop is a flexible platform that can be used to process a variety of data types, including structured, unstructured, and semi-structured data.

20) What is JSON? How to read JSON file in R with an example?
> JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write. It is based on a subset of JavaScript syntax, but it is language-independent and can be used with any programming language. JSON is used to store and exchange data between different systems and applications. It is a popular format for web APIs, and it is also used in many other applications, such as databases, configuration files, and log files. To read a JSON file in R, you can use the jsonlite package. The jsonlite package provides a simple and efficient way to read and write JSON data in R. Here is an example of how to read a JSON file in R:

# Load the `jsonlite` package.
library(jsonlite)
# Read the JSON file.
json_data <- fromJSON("example.json")
# Print the JSON data.
print(json_data)

The fromJSON() function takes the path to the JSON file as input and returns the JSON data as a list in R.

21) Write a short note on AWS.
> Amazon Web Services (AWS) is a cloud computing platform that offers a broad set of global compute, storage, database, analytics, application, and deployment services that help organizations move faster, lower IT costs, and scale applications. AWS offers over 200 fully featured services from data centers globally. Millions of customers, including the fastest-growing startups, largest enterprises, and leading government agencies, rely on AWS to power their most critical applications, all with high availability, scalability, and reliability. AWS is the world's leading cloud platform, and it is used by a wide variety of companies, from small businesses to large enterprises. AWS offers a wide range of services, including computing, storage, networking, databases, analytics, machine learning, and artificial intelligence.
Benefits of using AWS:
Scalability: AWS is highly scalable, and it can be used to scale applications up or down as needed. This makes AWS well-suited for applications with variable workloads.
Reliability: AWS is a reliable platform, and it offers high availability for applications. AWS also offers a variety of features to help improve the reliability of applications, such as disaster recovery and load balancing.
Security: AWS is a secure platform, and it offers a variety of security features to help protect applications and data. AWS also regularly audits its security practices and infrastructure.
Cost-effectiveness: AWS is a cost-effective platform, and it offers a variety of pricing options to help customers save money. AWS also offers a variety of discounts for customers who commit to using AWS services for a long period of time.

22) Explain any 3 ways to do web scraping.
> There are many different ways to do web scraping, but three common methods are:
1. Using a web browser extension: There are a number of web browser extensions available that can be used to scrape data from websites. These extensions typically allow you to select the elements on the page that you want to scrape, and then export the data to a file or database. Some popular web browser extensions for web scraping include Scraper (Chrome), Octoparse (Chrome, Firefox), and Web Scraper (Chrome, Firefox).
2. Using a web scraping library: There are a number of web scraping libraries available for different programming languages. These libraries provide a variety of features for scraping data from websites, such as parsing HTML and CSS, handling dynamic content, and exporting data to different formats. Some popular web scraping libraries include Beautiful Soup (Python), Scrapy (Python), PyQuery (Python), Cheerio (JavaScript), and Puppeteer (JavaScript). A minimal example of this approach is sketched after this list.
3. Using a web scraping API: There are a number of web scraping APIs available that can be used to scrape data from websites. These APIs typically provide a simple and easy-to-use interface for scraping data, and they can be used from any programming language. Some popular web scraping APIs include ScraperAPI, ProxyCrawl, BrightData, and Apify.
previous model. The predictions of the models are then the tree terminates. Decision trees have a number including computing, storage, networking, databases,
weighted and averaged to produce the final prediction. of advantages, including: Interpretability: Decision analytics, machine learning, and artificial intelligence.
Stacking: Stacking works by training a meta-model on trees are very interpretable, which means that it is Benefits of using AWS: Scalability: AWS is highly
the predictions of multiple base models. The meta- easy to understand how the tree makes predictions. scalable, and it can be used to scale applications up or
model is then used to make the final prediction. This is because the tree can be visualized as a down as needed. This makes AWS well-suited for
Ensemble methods can be used for both classification flowchart. Robustness: Decision trees are robust to applications with variable workloads. Reliability: AWS
and regression tasks. They are often used in noise and outliers in the data. This is because the tree is a reliable platform, and it offers high availability for
applications where high accuracy is required, such as is constructed by splitting the data into smaller applications. AWS also offers a variety of features to
fraud detection and medical diagnosis. Advantages of subsets, which helps to reduce the impact of noise and help improve the reliability of applications, such as
ensemble methods: Ensemble methods are more outliers. Efficiency: Decision trees are efficient to disaster recovery and load balancing. Security: AWS
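
A small sketch using a rolling (moving) mean in pandas, assuming the pandas library; the 3-point window is an arbitrary choice for illustration:

import pandas as pd

# A short, noisy series
values = pd.Series([10, 12, 45, 13, 11, 14, 12, 60, 13, 12])

# Replace each point with the mean of a 3-point neighbourhood centred on it
smoothed = values.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({"original": values, "smoothed": smoothed}))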

24) What is K-NN? Explain with the help of an example.
> K-nearest neighbors (K-NN) is a supervised machine learning algorithm that can be used for both classification and regression tasks. It works by finding the K most similar instances in the training data to a new instance and then using the labels of those K instances to predict the label of the new instance.
Example: Suppose we have a dataset of images of cats and dogs, and we want to train a K-NN classifier to classify new images as either cat or dog. We would first train the classifier on a set of labeled images (i.e., images that have already been labeled as either cat or dog). To classify a new image, the classifier would first find the K most similar images in the training data to the new image. The similarity between two images can be measured using a variety of metrics, such as the distance between the two images in pixel space. Once the classifier has found the K most similar images in the training data, it would then predict the label of the new image based on the labels of those K images. For example, if the majority of the K most similar images are cats, then the classifier would predict that the new image is also a cat.
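
A minimal sketch with scikit-learn, assuming it is installed; the tiny two-feature dataset stands in for real image features:

from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two numeric features per instance, labels 'cat' or 'dog'
X_train = [[1.0, 1.2], [1.1, 0.9], [0.9, 1.0], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

# K = 3: classify by majority vote among the 3 nearest training instances
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[1.05, 1.1]]))  # expected: ['cat']
print(knn.predict([[3.0, 3.1]]))   # expected: ['dog']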

25) Explain Bayesian Information Criterion.
> The Bayesian Information Criterion (BIC), also known as the Schwarz criterion, is a statistical model selection criterion that is used to compare different models of data. The BIC is based on the likelihood function of the model, and it penalizes models with more parameters. This is because models with more parameters are more likely to overfit the training data. The BIC is calculated as follows:
BIC = -2 * ln(L) + k * ln(n)
where:
L is the likelihood of the model
k is the number of parameters in the model
n is the number of data points in the training data
The BIC is a relative measure, which means that it can only be used to compare different models of the same data. The model with the lowest BIC is the model that is most likely to generalize well to new data. The BIC is often used in conjunction with other model selection criteria, such as the Akaike Information Criterion (AIC). The AIC is similar to the BIC, but it does not penalize models with more parameters as strongly.
Advantages of the BIC: The BIC is a consistent model selection criterion, which means that it will select the true model with probability 1 in the limit as the sample size increases.
Disadvantages of the BIC: The BIC can be sensitive to the choice of the prior distribution. The BIC can be computationally expensive to calculate for large datasets.

26) What is Decision tree? What are its advantages?
> A decision tree is a machine learning algorithm that uses a tree-like structure to make predictions. Each node in the tree represents a feature of the data, and each branch represents a possible decision. The tree is constructed by starting at the root node and then recursively splitting the data into smaller subsets based on the values of the features. The process stops when all of the data points in a subset have the same label. Once the tree is constructed, it can be used to make predictions on new data points by starting at the root node and following the branches down the tree based on the values of the features in the new data point. The prediction is made at the leaf node where the tree terminates.
Decision trees have a number of advantages, including:
Interpretability: Decision trees are very interpretable, which means that it is easy to understand how the tree makes predictions. This is because the tree can be visualized as a flowchart.
Robustness: Decision trees are robust to noise and outliers in the data. This is because the tree is constructed by splitting the data into smaller subsets, which helps to reduce the impact of noise and outliers.
Efficiency: Decision trees are efficient to train and predict with. This is because the tree can be constructed recursively, and the predictions can be made by following a simple path down the tree.
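
A short decision tree sketch with scikit-learn (assumed installed) on its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small labelled dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the tree by recursively splitting on feature values
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
# The fitted tree can be printed as a flowchart-like set of if/else rules
print(export_text(tree))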

27) Explain Multiple Linear Regression.
> Multiple linear regression (MLR) is a statistical technique that uses two or more independent variables to predict the outcome of a dependent variable. It is an extension of simple linear regression, which uses only one independent variable. MLR is used in a wide variety of fields, including economics, finance, marketing, and healthcare. It can be used to answer a wide range of questions, such as: How does income affect spending? How does advertising affect sales? How do patient characteristics affect health outcomes? MLR works by fitting a linear equation to the data. The equation has the following form:
y = b0 + b1 * x1 + b2 * x2 + ... + bk * xk
where:
y is the dependent variable
x1, x2, ..., xk are the independent variables
b0, b1, b2, ..., bk are the coefficients of the regression model
The coefficients of the regression model are estimated using a variety of methods, such as ordinary least squares (OLS) or maximum likelihood estimation (MLE). Once the coefficients of the regression model have been estimated, the model can be used to predict the value of the dependent variable for a new set of independent variables.
Advantages of multiple linear regression: MLR is a powerful tool for understanding the relationships between variables. MLR can be used to make predictions about the dependent variable. MLR is relatively easy to interpret.
Disadvantages of multiple linear regression: MLR can be sensitive to outliers and non-linear relationships in the data. MLR can be computationally expensive to fit for large datasets.
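
A minimal MLR sketch with scikit-learn (assumed installed); the two predictors are synthetic, generated so that y is roughly 5 + 2*x1 + 3*x2 plus noise:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))  # independent variables x1, x2
y = 5 + 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print("intercept (b0):", model.intercept_)
print("coefficients (b1, b2):", model.coef_)

# Predict the dependent variable for a new observation
print("prediction for x1=4, x2=7:", model.predict([[4, 7]]))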

28) Write note on Bias/Variance Tradeoff.
> The bias-variance tradeoff is a fundamental concept in machine learning. It describes the relationship between two sources of error in a predictive model: bias and variance. Bias is the error due to the simplifying assumptions made by the model to make the target function more straightforward to approximate. A model with high bias will underfit the training data and will not be able to generalize well to new data. Variance is the error due to the model's sensitivity to small fluctuations in the training data. A model with high variance will overfit the training data and will not be able to generalize well to new data. The bias-variance tradeoff states that it is impossible to build a model with both zero bias and zero variance. Any model will have some degree of bias and variance, and the goal is to find a balance between the two that minimizes the overall error of the model.
How to reduce bias and variance? There are a number of techniques that can be used to reduce bias and variance in machine learning models.
To reduce bias: Use a more complex model (although this can lead to overfitting). Use boosting techniques.
To reduce variance: Use a simpler model (although this can lead to underfitting). Use regularization techniques to penalize complex models. Use bagging techniques. Use more training data.

29) Explain hierarchical clustering.
> Hierarchical clustering is a type of unsupervised machine learning algorithm that groups similar data points together. It does this by creating a hierarchy of clusters, with each cluster containing more similar data points than the clusters above it. Hierarchical clustering can be performed using two different approaches:
Agglomerative clustering: This approach starts by treating each data point as its own cluster. It then iteratively merges the most similar clusters until a desired number of clusters is reached.
Divisive clustering: This approach starts with all of the data points in a single cluster. It then iteratively splits the cluster into smaller clusters until a desired number of clusters is reached.
Hierarchical clustering is often visualized using a dendrogram, which is a tree-like structure that shows the relationships between the clusters.
Advantages of hierarchical clustering: Hierarchical clustering is a versatile algorithm that can be used to cluster data of any type. Hierarchical clustering is able to identify clusters of different shapes and sizes. Hierarchical clustering can be used to explore the underlying structure of a dataset.
Disadvantages of hierarchical clustering: Hierarchical clustering can be computationally expensive for large datasets. Hierarchical clustering can be sensitive to the distance metric that is used. Hierarchical clustering does not produce a unique set of clusters.
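
A short agglomerative clustering sketch with SciPy (assumed installed), cutting the hierarchy into two flat clusters; scipy.cluster.hierarchy.dendrogram could plot the tree from the same linkage matrix:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Agglomerative clustering: repeatedly merge the closest pair of clusters
Z = linkage(X, method="average")

# Cut the hierarchy so that 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]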

30) Write a short note on Ensemble Methods.
> Ensemble methods are a type of machine learning algorithm that combines multiple models to produce a more accurate and robust prediction. Ensemble methods work by training multiple models on the same data and then averaging their predictions. This averaging process helps to reduce the variance of the model and improve its overall performance. There are a number of different ensemble methods, including:
Bagging: Bagging works by creating multiple subsets of the training data and training a model on each subset. The predictions of the models are then averaged to produce the final prediction.
Boosting: Boosting works by training a sequence of models, where each model is trained on the residuals of the previous model. The predictions of the models are then weighted and averaged to produce the final prediction.
Stacking: Stacking works by training a meta-model on the predictions of multiple base models. The meta-model is then used to make the final prediction.
Ensemble methods can be used for both classification and regression tasks. They are often used in applications where high accuracy is required, such as fraud detection and medical diagnosis.
Advantages of ensemble methods: Ensemble methods are more accurate than individual models. Ensemble methods are more robust to noise and outliers in the data. Ensemble methods can be used to improve the performance of any type of machine learning model.
Disadvantages of ensemble methods: Ensemble methods can be computationally expensive to train. Ensemble methods can be difficult to interpret.
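
A small scikit-learn sketch (assumed installed) comparing a single decision tree with a bagged ensemble of trees on a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Bagging: train 50 models on bootstrap subsets of the data and combine their votes
# (BaggingClassifier uses a decision tree as its default base model)
bagged_trees = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

print("single tree accuracy:", single_tree.score(X_test, y_test))
print("bagged trees accuracy:", bagged_trees.score(X_test, y_test))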

31) Explain Forecasting. List the steps in forecasting.
> Forecasting is the process of predicting future events or trends based on historical data and current conditions. It is used in a wide variety of fields, including business, economics, finance, and meteorology.
Steps in forecasting:
1. Identify the goal of the forecast. What are you trying to predict? What is the time horizon of the forecast?
2. Gather historical data. This data should be relevant to the goal of the forecast and should cover a sufficiently long period of time.
3. Clean and prepare the data. This may involve removing outliers, filling in missing values, and transforming the data into a format that is suitable for forecasting.
4. Choose a forecasting method. There are a variety of forecasting methods available, each with its own strengths and weaknesses. The best method to use will depend on the nature of the data and the goal of the forecast.
5. Build the forecasting model. This involves fitting the forecasting method to the historical data.
6. Evaluate the forecasting model. This involves testing the model on a held-out dataset to assess its accuracy.
7. Make the forecast. Once the forecasting model has been evaluated and found to be accurate, it can be used to make predictions for the future.
8. Monitor the forecast and update it as needed. Forecasts should be monitored regularly and updated as new data becomes available or as conditions change.
Popular forecasting methods:
Time series analysis: This method uses historical data to identify patterns and trends.
Causal forecasting: This method uses relationships between variables to make predictions.
Judgmental forecasting: This method uses human judgment and expertise to make predictions.

32) What is AIC, BIC? State their mathematical formula.
> Akaike information criterion (AIC) and Bayesian information criterion (BIC) are two statistical model selection criteria that are used to compare different models of the same data. They both penalize models with more parameters, but the BIC penalizes more strongly.
AIC formula: AIC = 2k - 2 * ln(L)
where: k is the number of parameters in the model, and L is the likelihood of the model.
BIC formula: BIC = -2 * ln(L) + k * ln(n)
where: k is the number of parameters in the model, L is the likelihood of the model, and n is the number of data points in the training data.
Both AIC and BIC can be used for both classification and regression tasks. They are often used in conjunction with other model selection criteria, such as cross-validation.
Advantages of AIC and BIC: AIC and BIC are relatively easy to calculate and interpret. BIC is a consistent model selection criterion, which means that it will select the true model with probability 1 in the limit as the sample size increases (AIC does not share this consistency property, but it tends to choose models that predict well).
Disadvantages of AIC and BIC: The BIC can be sensitive to the choice of the prior distribution, and both criteria can be computationally expensive to calculate for large datasets.
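
A tiny numeric illustration of the two formulas in plain Python; the log-likelihood, k, and n values are made up for a hypothetical fitted model:

import math

def aic(log_likelihood, k):
    """AIC = 2k - 2 * ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """BIC = -2 * ln(L) + k * ln(n)."""
    return -2 * log_likelihood + k * math.log(n)

# Hypothetical fitted model: log-likelihood -120.5, 4 parameters, 100 data points
log_L, k, n = -120.5, 4, 100
print("AIC:", aic(log_L, k))     # 249.0
print("BIC:", bic(log_L, k, n))  # about 259.4 (the larger penalty: 4 * ln(100))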

33) Give the formula for information gain & entropy.
> Information gain is a measure of how much information about the target variable is gained by knowing the value of a predictor variable. It is calculated as the difference between the entropy of the target variable before and after partitioning the data based on the predictor variable. The formula for information gain is as follows:
IG(T | X) = H(T) - H(T | X)
where:
IG(T | X) is the information gain of the target variable T given the predictor variable X
H(T) is the entropy of the target variable T before partitioning the data based on the predictor variable X
H(T | X) is the entropy of the target variable T after partitioning the data based on the predictor variable X
Entropy: Entropy is a measure of the uncertainty in a distribution. It is calculated as the negative of the sum of the probabilities of each possible outcome, each weighted by the logarithm of that probability. The formula for entropy is as follows:
H(T) = -sum(p(t) * log2(p(t)))
where:
H(T) is the entropy of the target variable T
p(t) is the probability of the target variable T taking on the value t
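
A small plain-Python sketch of both formulas on a made-up labelled dataset split by one predictor:

import math
from collections import Counter

def entropy(labels):
    """H(T) = -sum(p(t) * log2(p(t)))."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(T | X) = H(T) - H(T | X), where H(T | X) averages the entropy of each split."""
    total = len(labels)
    conditional_entropy = 0.0
    for value in set(feature_values):
        subset = [lab for lab, f in zip(labels, feature_values) if f == value]
        conditional_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional_entropy

# Toy data: target labels and one categorical predictor
target = ["yes", "yes", "no", "no", "yes", "no"]
feature = ["sunny", "sunny", "rain", "rain", "sunny", "rain"]

print("H(T) =", entropy(target))                        # 1.0 (3 yes, 3 no)
print("IG(T | X) =", information_gain(target, feature))  # 1.0: the split separates the labels perfectly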
