Data Science Qb

Data Science

ASSIGNMENT NO 1
1. What are the three characteristics of Big Data?

1. Volume: This refers to the sheer size of the data generated or collected. Big Data involves
datasets that are extremely large and often too massive to be handled by traditional database
systems. The volume of data is measured in terabytes, petabytes, exabytes, and beyond.
2. Velocity: Velocity represents the speed at which data is generated, processed, and analyzed.
With the advent of real-time data sources, social media, sensors, and other technologies, data is
often generated at a rapid pace. Big Data systems need to be capable of handling and
processing data in real-time or near-real-time to derive meaningful insights.
3. Variety: Variety refers to the diverse types of data that are encountered in Big Data scenarios.
This can include structured data (e.g., traditional databases), unstructured data (e.g., text,
images, videos), and semi-structured data (e.g., XML or JSON files). Big Data platforms must be
capable of handling this diverse range of data types.

In addition to the Three Vs, some variations also include:

4. Veracity: This refers to the quality and reliability of the data. Big Data sources may contain
inconsistencies, errors, and outliers, and dealing with the veracity of the data is crucial for
obtaining accurate insights.
5. Value: While not always considered a core characteristic, the ultimate goal of working with Big
Data is to extract value and insights that can inform decision-making processes. The value of Big
Data is derived from the ability to analyze and gain actionable insights from the vast amount of
information available.

2. What is an analytic sandbox, and why is it important?

An analytic sandbox refers to a controlled and isolated environment where data scientists and
analysts can explore, analyze, and manipulate data without affecting the production systems or
data. It's essentially a virtual or physical space that provides a secure and flexible environment
for experimenting with data and analytical models.

Here are some key aspects of an analytic sandbox and why it is important in data science:

1. Exploratory Analysis: Analytic sandboxes allow data scientists to conduct exploratory analysis
on large and diverse datasets. They can test hypotheses, try different algorithms, and explore
patterns without impacting the operational systems.
2. Model Development and Testing: Data scientists can use analytic sandboxes to develop and
test machine learning models. It provides a safe space for experimenting with different model
architectures, parameters, and features before deploying them in a production environment.
3. Data Cleaning and Preprocessing: Analyzing and cleaning raw data is a crucial step in any data
science project. An analytic sandbox provides a space for data scientists to clean, preprocess,
and transform data without affecting the original data sources.
4. Collaboration: Analytic sandboxes facilitate collaboration among data science teams. Multiple
team members can work on different aspects of a project simultaneously, sharing insights, code,
and results in a collaborative and isolated environment.
5. Security and Compliance: By separating the analytic environment from the production
systems, analytic sandboxes help maintain security and compliance standards. Sensitive or
confidential data can be used and analyzed within the sandbox without risking exposure or
unauthorized access.
6. Resource Flexibility: Analytic sandboxes often provide resources on-demand, allowing data
scientists to scale up or down based on the complexity of the analysis. This flexibility is crucial
for handling large datasets or computationally intensive tasks.
7. Experimentation: Data science is an iterative process that involves experimentation. Analytic
sandboxes provide a playground for trying out different approaches, refining models, and
continuously improving the analytical process.

In summary, an analytic sandbox is important in data science because it provides a controlled and secure environment for data exploration, model development, and collaboration. It enables
data scientists to experiment with data without risking the integrity of production systems and
helps ensure that insights gained from analysis are accurate and reliable.

3. In which phase would the team expect to invest most of the project time, and why?

In the data analytics life cycle, a substantial amount of time is often invested in the "Data
Preparation" phase. This phase involves collecting, cleaning, and transforming raw data into a
format suitable for analysis. While it's closely related to the overall data science life cycle, the
emphasis on data preparation can vary depending on the specific project and its requirements.

Here are some reasons why the Data Preparation phase tends to be time-consuming and critical:

1. Data Collection and Integration: Gathering data from various sources is often a complex task.
Data may come from different databases, systems, or external APIs, and integrating these
diverse datasets requires careful consideration. Data engineers and analysts need to ensure data
consistency, resolve schema mismatches, and handle issues related to data quality.
2. Cleaning and Handling Missing Data: Raw data is seldom perfect. It may contain missing
values, outliers, or errors. Cleaning and imputing missing data is a time-consuming but crucial
step to ensure the accuracy and reliability of subsequent analyses.
3. Data Transformation and Feature Engineering: Transforming raw data into a format suitable
for analysis involves tasks like scaling, normalizing, or encoding categorical variables. Feature
engineering, which involves creating new features or modifying existing ones to improve model
performance, is often performed during this phase.
4. Dealing with Data Quality Issues: Identifying and addressing data quality issues, such as
duplicates or inconsistencies, is essential for obtaining meaningful insights. The Data Preparation
phase is where these issues are addressed to enhance the overall quality of the dataset.
5. Ensuring Data Security and Privacy: During data preparation, teams must also consider data
security and privacy concerns. This involves handling sensitive information appropriately,
applying encryption where necessary, and ensuring compliance with relevant regulations and
policies.
6. Creating a Clean and Usable Dataset: The goal of the Data Preparation phase is to create a
clean, structured, and usable dataset. This is the foundation for subsequent analysis and
modeling. The quality of the dataset significantly influences the accuracy and reliability of the
insights derived from analytics.

While Data Preparation is a significant phase, it's important to note that the data analytics life
cycle is iterative, and feedback from later stages (such as analysis and interpretation) may
prompt revisiting the Data Preparation phase. Investing time and effort upfront in preparing
high-quality data pays off in terms of the efficiency and effectiveness of subsequent stages in
the data analytics life cycle.
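
As an illustration of the cleaning and transformation work described above, here is a minimal sketch in R (the language used later in this document); the data frame raw_df and its columns are hypothetical, not taken from any specific project:

# Hypothetical raw dataset with a missing value and a categorical column
raw_df <- data.frame(
  age = c(25, 30, NA, 40, 35),
  income = c(50000, 64000, 58000, 120000, 72000),
  city = c("Pune", "Mumbai", "Pune", "Delhi", "Mumbai")
)

# Handle missing data: impute age with the column median
raw_df$age[is.na(raw_df$age)] <- median(raw_df$age, na.rm = TRUE)

# Remove duplicate rows, if any
clean_df <- unique(raw_df)

# Scale the numeric income column (standardization)
clean_df$income_scaled <- as.numeric(scale(clean_df$income))

# Encode the categorical column as a factor
clean_df$city <- factor(clean_df$city)

str(clean_df)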

5. What kinds of tools are being used in the following phases? a. Data Preparation b. Model Building

a. Data Preparation:

 Pandas: A Python library for data manipulation and analysis. It provides data structures for
efficiently storing large datasets and tools for working with them.
 NumPy: A fundamental package for scientific computing in Python. It provides support for
large, multi-dimensional arrays and matrices, along with mathematical functions to operate on
these arrays.
 OpenRefine: An open-source tool for cleaning and transforming messy data. It facilitates
operations like filtering, sorting, and clustering to clean and preprocess data efficiently.
 Trifacta Wrangler: A data wrangling tool that enables users to explore, clean, and prepare
diverse datasets through a user-friendly interface.

b. Model Building:

 Scikit-Learn: A machine learning library for Python that provides simple and efficient tools for
data analysis and modeling, including various algorithms for classification, regression, clustering,
and more.
 TensorFlow: An open-source machine learning framework developed by Google. It is widely
used for building and training deep learning models.
 PyTorch: Another open-source machine learning framework, known for its dynamic
computational graph, making it particularly suitable for research and experimentation.
 R (Programming Language): R is a programming language and software environment
designed for statistical computing and graphics. It is commonly used for statistical analysis and
modeling.
 KNIME: An open-source platform for data analytics, reporting, and integration. KNIME allows
users to visually create data flows, execute selected analysis steps, and view the results in real-
time.
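
To complement the tool list above, here is a minimal model-building sketch in R (the language used elsewhere in this document), using only the built-in lm() function and the bundled mtcars dataset; the 80/20 split is an arbitrary illustrative choice:

# Simple train/test split on the built-in mtcars dataset
set.seed(42)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Fit a linear regression model: fuel efficiency as a function of weight and horsepower
model <- lm(mpg ~ wt + hp, data = train)
print(summary(model))

# Predict on the held-out rows and compute a simple error measure (RMSE)
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)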

6. List and briefly describe the applications of Data Science/Analytics.

1. Healthcare Analytics:
 Application: Predictive analytics for disease diagnosis, patient outcomes, and
personalized medicine.
 Benefits: Early disease detection, treatment optimization, and improved patient care.
2. Finance and Banking:
 Application: Fraud detection, risk assessment, credit scoring, and algorithmic trading.
 Benefits: Improved security, better risk management, and more accurate financial
decisions.
3. E-commerce and Retail:
 Application: Customer segmentation, recommendation systems, demand forecasting,
and pricing optimization.
 Benefits: Enhanced customer experience, increased sales, and optimized inventory
management.
4. Manufacturing and Supply Chain:
 Application: Predictive maintenance, quality control, supply chain optimization, and
demand forecasting.
 Benefits: Reduced downtime, improved efficiency, and cost savings.
5. Marketing and Advertising:
 Application: Customer behavior analysis, targeted advertising, social media analytics, and
campaign optimization.
 Benefits: Increased marketing ROI, personalized marketing strategies, and improved
customer engagement.
6. Telecommunications:
 Application: Network optimization, predictive maintenance, and customer churn
prediction.
 Benefits: Improved network performance, reduced downtime, and enhanced customer
retention.
7. Government and Public Policy:
 Application: Crime prediction, fraud detection, public health monitoring, and policy
planning.
 Benefits: Enhanced public safety, improved resource allocation, and evidence-based
policymaking.
8. Energy and Utilities:
 Application: Predictive maintenance for equipment, energy consumption forecasting, and
grid optimization.
 Benefits: Increased operational efficiency, reduced downtime, and optimized energy
usage.
9. Education:
 Application: Student performance prediction, personalized learning paths, and
educational resource optimization.
 Benefits: Improved student outcomes, personalized learning experiences, and efficient
resource allocation.
10. Human Resources:
 Application: Talent acquisition, employee performance analytics, and workforce planning.
 Benefits: Improved recruitment processes, talent retention, and strategic workforce
management.
11. Sports Analytics:
 Application: Player performance analysis, injury prediction, and game strategy
optimization.
 Benefits: Enhanced team performance, injury prevention, and strategic decision-making.
12. Environmental Monitoring:
 Application: Climate modeling, pollution tracking, and natural resource management.
 Benefits: Informed environmental policies, conservation efforts, and disaster
preparedness.

These applications showcase the diverse and impactful ways in which data science and analytics
contribute to solving complex problems and making informed decisions across various sectors.

What are primary data and secondary data?


In data science, primary data and secondary data are two types of data used for analysis, and
they differ in their sources and collection methods.

1. Primary Data:
 Definition: Primary data refers to original data collected directly from the source for a
specific research or analysis purpose.
 Collection Methods: It is collected firsthand through methods such as surveys,
interviews, observations, experiments, or direct measurements.
 Characteristics:
 Specific to Purpose: Collected with a particular research question or objective in
mind.
 Firsthand Information: Directly obtained from individuals, entities, or sources
related to the research.
 Current and Relevant: Reflects the most up-to-date information specific to the
research context.
Example: Conducting a survey to gather information on customer preferences for a new
product.
2. Secondary Data:
 Definition: Secondary data refers to existing data that was collected by someone else
for a purpose other than the current research or analysis.
 Sources: It can be sourced from various repositories, databases, publications, or any pre-
existing dataset.
 Characteristics:
 Already Collected: Data that already exists and was collected for a different
purpose.
 Time and Cost Savings: Utilizing data that others have collected can save time
and resources.
 Historical Context: May provide historical perspectives or trends.
Example: Analyzing sales data from a publicly available dataset published by a government
agency.

Comparison:

 Purpose: Primary data is collected for a specific research objective, while secondary data is
already available and collected for a different purpose.
 Collection Process: Primary data involves direct collection efforts, such as surveys or
experiments. Secondary data is collected by someone else, and the analyst uses it for their own
purposes.
 Control: Researchers have more control over the quality and relevance of primary data.
Secondary data may have limitations related to the original purpose and collection methods.

Que: List methods to collect Primary Data.

In data science, collecting primary data involves gathering information directly from sources for
a specific research or analysis purpose. Here are some common methods used to collect primary
data:

1. Surveys:
 Description: Administering questionnaires or interviews to a sample of individuals to
gather responses on specific topics.
 Advantages: Efficient for collecting a large amount of data, standardized responses.
 Considerations: Designing unbiased questions, ensuring a representative sample.
2. Interviews:
 Description: Conducting one-on-one or group discussions with individuals to obtain
detailed information.
 Advantages: Allows for in-depth exploration of topics, clarifying responses.
 Considerations: Time-consuming, potential for interviewer bias.
3. Observations:
 Description: Systematically observing and recording behaviors, events, or phenomena in
a natural setting.
 Advantages: Provides real-time, firsthand information.
 Considerations: Observer bias, may alter behavior if subjects are aware of observation.
4. Experiments:
 Description: Manipulating variables in a controlled environment to observe their effect
on outcomes.
 Advantages: Establishes causation, controlled conditions.
 Considerations: Ethical considerations, practical constraints.
5. Focus Groups:
 Description: Facilitated discussions with a small group of participants to gather
qualitative insights.
 Advantages: Encourages interaction, diverse perspectives.
 Considerations: Limited to the specific group and may not be representative.
6. Surveillance and Sensor Data:
 Description: Using sensors or surveillance technology to collect data on behaviors,
movements, or events.
 Advantages: Non-intrusive, continuous data collection.
 Considerations: Privacy concerns, ethical considerations.
7. Diaries and Journals:
 Description: Participants record their experiences, thoughts, or behaviors over a
specified period.
 Advantages: Captures real-time information, participant perspective.
 Considerations: Relies on participant diligence, potential for selective reporting.
8. Social Media and Online Interactions:
 Description: Analyzing data from social media platforms, online forums, or interactions.
 Advantages: Large-scale data, natural expression of opinions.
 Considerations: Ethical considerations, data accuracy.
9. Field Trials:
 Description: Implementing and testing interventions or innovations in real-world
settings.
 Advantages: Assesses practical feasibility and effectiveness.
 Considerations: Potential for external factors to influence outcomes.
10. Biometric Data Collection:
 Description: Measuring physiological or biological characteristics (e.g., heart rate, EEG)
for research purposes.
 Advantages: Objective data, potential for real-time monitoring.
 Considerations: Privacy and ethical considerations.

Explain the Survey and Questionnaire methods of data collection.

Survey and Questionnaire Methods in Data Collection:

Survey Method:

 Description: Surveys involve the systematic collection of information from individuals, organizations, or groups through the use of standardized questionnaires or interviews.
 Process:
1. Design: Develop a set of questions to elicit specific information.
2. Administration: Distribute the survey to a sample of participants.
3. Data Collection: Collect responses through various means (online, phone, in-person).
4. Analysis: Analyze survey responses to draw conclusions and insights.
 Advantages:
 Efficient for collecting data from a large number of respondents.
 Standardized questions allow for comparability.
 Considerations:
 Ensuring the representativeness of the sample.
 Minimizing response bias through careful design.

Questionnaire Method:

 Description: A questionnaire is a written or printed set of questions designed to gather information from individuals. Questionnaires can be administered in various ways, including self-administered, online, or in-person interviews.
 Process:
1. Question Design: Develop clear and unbiased questions.
2. Distribution: Administer questionnaires to respondents.
3. Collection: Collect completed questionnaires.
4. Data Entry: Enter and organize the data for analysis.
5. Analysis: Analyze the collected data to derive insights.
 Advantages:
 Standardized format ensures consistency.
 Cost-effective, especially for large-scale surveys.
 Considerations:
 Clarity in question wording to avoid misunderstandings.
 Adequate response options for each question.

Key Differences:

 Format:
 Survey: Can encompass various methods, including questionnaires, interviews, and
observations.
 Questionnaire: Specifically refers to a written set of questions.
 Administration:
 Survey: Encompasses various data collection methods.
 Questionnaire: Typically involves self-administration by respondents.
 Flexibility:
 Survey: Can be more flexible in terms of methods used.
 Questionnaire: Often follows a structured format.
 Scale:
 Survey: Can be a broader term covering a range of research methods.
 Questionnaire: A specific tool within the broader survey methodology.

12. Compare the Interview and Schedule methods of data collection.

Interview Method:

 Description:
 Nature: Involves direct interaction between the researcher and the participant.
 Flexibility: More flexible and allows for in-depth exploration of responses.
 Personalization: Enables rapport-building and the adaptation of questions based on
participant responses.
 Examples: Face-to-face interviews, telephone interviews, video interviews.
 Advantages:
 Depth of Information: Allows for in-depth exploration and clarification of responses.
 Adaptability: Questions can be tailored based on participant responses.
 Non-verbal Cues: Enables the observation of non-verbal cues and context.
 Considerations:
 Resource-Intensive: Can be time-consuming and may require skilled interviewers.
 Subjectivity: Interviewer bias and interpretation may influence data.

Schedule Method:

 Description:
 Nature: Structured data collection method with a predefined set of questions.
 Standardization: Follows a standardized schedule or protocol for asking questions.
 Consistency: Ensures consistency in data collection across participants.
 Examples: Surveys, questionnaires, structured observations.
 Advantages:
 Efficiency: More efficient for large-scale data collection.
 Standardization: Standardized questions ensure consistency.
 Reduced Bias: Minimizes interviewer bias as questions are predetermined.
 Considerations:
 Limited Depth: May not allow for the same depth of information as interviews.
 Rigidity: Less adaptable to individual participant responses.
 Limited Context: May miss contextual nuances observed in interviews.

Comparison:

 Flexibility:
 Interview: More flexible, allowing for personalized and dynamic interactions.
 Schedule: Less flexible, as it follows a predetermined set of questions.
 Depth of Information:
 Interview: Provides greater depth and context due to the open-ended nature of
questions.
 Schedule: Provides standardized information, often with limited depth.
 Resource Requirements:
 Interview: Can be resource-intensive, requiring skilled interviewers and more time.
 Schedule: More efficient for large-scale data collection with standardized tools.
 Subjectivity:
 Interview: Prone to interviewer bias and subjectivity.
 Schedule: Minimizes interviewer bias as questions are predetermined and standardized.
 Adaptability:
 Interview: Adaptable to participant responses, allowing for dynamic questioning.
 Schedule: Less adaptable, as questions are predetermined and standardized.

ASSIGNMENT NO 2

Discuss nominal, ordinal and categorical data with suitable examples

1. Nominal Data:
 Definition: Nominal data represents categories or labels without any inherent order or
ranking.
 Examples:
 Colors: Red, Blue, Green.
 Gender: Male, Female, Non-binary.
 Types of Fruit: Apple, Orange, Banana.
 Characteristics:
 No inherent order among categories.
 Mathematical operations like addition or subtraction are not meaningful.
 Mode is a suitable measure of central tendency.
2. Ordinal Data:
 Definition: Ordinal data represents categories with a meaningful order or ranking, but
the intervals between the categories are not uniform.
 Examples:
 Educational Levels: High School, Bachelor's, Master's, Ph.D.
 Customer Satisfaction Ratings: Very Dissatisfied, Dissatisfied, Neutral, Satisfied,
Very Satisfied.
 Rankings: 1st, 2nd, 3rd, etc.
 Characteristics:
 Categories have a meaningful order.
 Intervals between categories are not necessarily equal.
 Median and mode are appropriate measures of central tendency.
3. Categorical Data:
 Definition: Categorical data is a broad term encompassing both nominal and ordinal
data. It represents variables that can take on a limited and fixed number of categories.
 Examples:
 Nominal: Colors, Gender.
 Ordinal: Education Levels, Customer Satisfaction Ratings.
 Characteristics:
 Encompasses both nominal and ordinal data.
 Describes variables with distinct categories.

Comparison:

 Nominal vs. Ordinal:
 Nominal data has categories without inherent order, while ordinal data has a meaningful
order.
 Nominal data uses mode as a measure of central tendency, while ordinal data can use
median and mode.
 Ordinal vs. Categorical:
 Ordinal data is a subset of categorical data, specifically representing categories with a
meaningful order.
 Categorical data is a broader term encompassing both nominal and ordinal data.

Importance in Data Science:

 Understanding data types is crucial for selecting appropriate statistical methods and
visualizations.
 Nominal and ordinal data may require different types of analyses and visualization techniques.
 Categorical data plays a key role in feature engineering and model building in machine learning.
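
In R (the language used in later assignments), nominal and ordinal variables are typically represented as factors; the sketch below is purely illustrative, with made-up category values:

# Nominal data: an unordered factor (no ranking among levels)
colors <- factor(c("Red", "Blue", "Green", "Blue"))
print(levels(colors))
print(table(colors))   # the mode can be read off the frequency table

# Ordinal data: an ordered factor with a meaningful ranking
satisfaction <- factor(
  c("Neutral", "Satisfied", "Very Satisfied", "Dissatisfied"),
  levels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
  ordered = TRUE
)
print(satisfaction > "Neutral")          # order comparisons are meaningful
print(median(as.integer(satisfaction)))  # median of the underlying ranks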

Explain structured, semi-structured, and unstructured data with examples.

Structured Data:

 Definition: Structured data refers to data that is highly organized and formatted according to a
predefined schema or model. It is typically tabular in nature, with rows and columns.
 Example:
 Relational Database Table: A table in a relational database where each row represents
a record, and each column represents a specific attribute or field.
 CSV (Comma-Separated Values) File: A spreadsheet or database export in which data
is organized into rows and columns.

Semi-Structured Data:

 Definition: Semi-structured data is less rigidly organized than structured data but still has some
level of structure. It may not conform to a fixed schema, allowing for flexibility in representing
relationships.
 Example:
 JSON (JavaScript Object Notation): A data interchange format that is both human-
readable and machine-readable. It allows for nested structures and key-value pairs,
providing a level of hierarchy.
 XML (eXtensible Markup Language): Another markup language similar to HTML, often
used to represent hierarchical and nested data.

Unstructured Data:

 Definition: Unstructured data lacks a predefined data model or schema and doesn't fit neatly
into tables or databases. It is often in a raw, natural format, making it challenging to organize
and analyze without advanced techniques.
 Example:
 Text Documents: Articles, emails, or social media posts that don't follow a specific
structure.
 Images and Videos: Multimedia content without inherent organization or structure.
 Audio Files: Recordings, podcasts, or music tracks.

Comparison:

 Structured vs. Semi-Structured vs. Unstructured:
 Structured Data: Highly organized with a fixed schema.
 Semi-Structured Data: Less rigidly organized, may not conform to a fixed schema but
has some structure.
 Unstructured Data: Lacks a predefined structure, often in raw, natural formats.

Importance in Data Science:

 Structured Data: Well-suited for traditional relational databases and straightforward querying.
Common in business applications and transactional systems.
 Semi-Structured Data: Provides flexibility for representing complex relationships. Widely used
in web development, APIs, and data interchange formats.
 Unstructured Data: Represents a significant portion of the data landscape. Advanced
techniques like natural language processing (NLP) and computer vision are used to extract
insights from unstructured data.
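
As a small illustration in R: a data frame mirrors structured (tabular) data, while a nested list resembles semi-structured, JSON-like data (parsing real JSON would need an add-on package such as jsonlite, not shown here); the values are made up:

# Structured: fixed schema, tabular rows and columns
orders <- data.frame(
  order_id = c(101, 102),
  amount = c(250.0, 99.5)
)
print(orders)

# Semi-structured: nested key-value structure with no fixed schema
customer <- list(
  name = "Asha",
  orders = list(
    list(order_id = 101, items = c("pen", "notebook")),
    list(order_id = 102, items = c("stapler"))
  )
)
str(customer)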

Explain the sources of data: Time Series, Transactional Data, Biological Data, Spatial Data, Social Network Data. Give examples.
Source of Data in Data Science:

1. Time Series Data:
 Definition: Time series data consists of observations collected over successive, evenly
spaced intervals of time. It is used to analyze trends, patterns, and behavior over time.
 Example in Data Science:
 Application: Stock Price Movements
 Example: Daily closing prices of a company's stock over a year, enabling the
analysis of stock market trends and patterns.
2. Transactional Data:
 Definition: Transactional data records individual transactions or interactions within a
system, capturing details about events, purchases, or operations.
 Example in Data Science:
 Application: E-commerce Sales
 Example: A dataset containing information about each online purchase,
including items bought, purchase time, and payment details.
3. Biological Data:
 Definition: Biological data encompasses information related to living organisms,
including genetic, molecular, and physiological data.
 Example in Data Science:
 Application: Genomic Sequencing
 Example: DNA sequences of individuals for genomic research, helping identify
genetic variations and potential links to diseases.
4. Spatial Data:
 Definition: Spatial data refers to information tied to geographic locations, enabling the
analysis and visualization of data in a spatial context.
 Example in Data Science:
 Application: Geographic Information Systems (GIS)
 Example: Mapping and analyzing crime incidents in a city, using spatial data to
identify high-crime areas.
5. Social Network Data:
 Definition: Social network data represents relationships and interactions among
individuals or entities within a network.
 Example in Data Science:
 Application: Social Media Analytics
 Example: Analyzing interactions and connections on a social media platform to
understand user behavior, identify influencers, or detect trends.

Importance in Data Science:

 Time Series Data: Critical for forecasting, anomaly detection, and trend analysis in various
domains such as finance, weather, and economics.
 Transactional Data: Essential for understanding customer behavior, optimizing business
processes, and detecting fraudulent activities.
 Biological Data: Fundamental for genomics research, personalized medicine, and
understanding the genetic basis of diseases.
 Spatial Data: Enables location-based analysis, including urban planning, environmental
monitoring, and logistics optimization.
 Social Network Data: Used for social network analysis, sentiment analysis, and targeted
marketing, providing insights into human interactions and behavior.

5) Mention the sources of data of the following as transactional, time series, social network ,
Spatial or Biological with 1 sentence reasoning.
1. Library Book issue/return data
2. Structure of proteins and nucleic acids data
3. Twitter data
4. Genome sequence of Viruses
5. Sales data of a grocery store on day today basis.
6. Terrain info of a planet with longitude and latitude
7. Stock information of a Grocery store
8. Sun position observed over a period of time.
9. Friends and friends of friends information over facebook
10. Information of a place over a map in data science

1. Library Book issue/return data: Transactional - This data captures individual transactions
(book issues and returns) within a library system, representing discrete events.
2. Structure of proteins and nucleic acids data: Biological - This data provides information on
the biological macromolecules' structures, offering insights into their composition and spatial
arrangements at a molecular level.
3. Twitter data: Social Network - Twitter data involves interactions between users, forming a social
network where tweets, retweets, and mentions create a dynamic web of connections.
4. Genome sequence of Viruses: Biological - Genome sequences represent the genetic
information of viruses, offering insights into their biological characteristics and evolutionary
relationships.
5. Sales data of a grocery store on a day-to-day basis: Time Series - This data captures sales
figures over time, allowing for trend analysis and forecasting based on daily patterns.
6. Terrain info of a planet with longitude and latitude: Spatial - Terrain information with
longitude and latitude constitutes spatial data, enabling the representation and analysis of
geographic features on a planet's surface.
7. Stock information of a Grocery store: Transactional - Stock information in a grocery store
involves tracking individual transactions related to inventory, such as stock additions, sales, and
restocking.
8. Sun position observed over a period of time: Time Series - Sun position data collected over
time allows for the analysis of patterns in the sun's movement, such as daily and seasonal
variations.
9. Friends and friends of friends information over Facebook: Social Network - Facebook data
includes connections between users, representing a social network where friendships and
relationships are established and maintained.
10. Information of a place over a map in data science: Spatial - Data representing information
about a place on a map is spatial data, providing details about geographic features and
locations in a structured format.
6) Write a short note on Data Evolution.

Data evolution in data science refers to the dynamic and continuous changes that occur in
datasets over time. This phenomenon is a crucial aspect of the data science lifecycle and has
several key dimensions:

1. Temporal Changes: Data evolves with time due to various factors. For instance, in financial
datasets, stock prices change continuously, while in social media data, user interactions and
content generate a constantly evolving stream of information.
2. Volume Growth: As technology advances, the volume of available data increases exponentially.
This growth in data volume poses both opportunities and challenges for data scientists,
requiring scalable solutions for storage, processing, and analysis.
3. Structural Changes: Data sources may undergo structural changes, such as modifications in
data schema, format, or data collection methods. Data scientists must adapt their tools and
techniques to accommodate these alterations to ensure accurate analysis.
4. Quality and Accuracy: Over time, data quality and accuracy may vary. Missing values, outliers,
or errors can emerge, necessitating data cleaning and preprocessing efforts. Maintaining data
integrity is crucial for drawing reliable insights.
5. Concept Drift: In machine learning and predictive modeling, the concept drift phenomenon
refers to the change in the relationships between input features and the target variable over
time. Models need to adapt to these shifts for sustained accuracy.
6. Ethical and Legal Considerations: As data science evolves, there is an increasing awareness of
ethical and legal considerations related to data collection, usage, and privacy. Ongoing changes
in regulations and societal expectations impact how data scientists handle and analyze
information.
7. Integration of New Data Sources: New data sources continually emerge, providing
opportunities for richer analyses. Integrating diverse datasets can offer a more comprehensive
understanding of phenomena but also requires addressing challenges related to data
heterogeneity.
8. Data Versioning: Similar to software versioning, managing different versions of datasets
becomes essential. This ensures reproducibility and transparency in data-driven research,
allowing others to understand and validate the results based on specific dataset versions.
9. Evolution in Tools and Technologies: The field of data science itself undergoes evolution with
the introduction of new tools, algorithms, and technologies. Data scientists need to stay current
with these advancements to leverage the latest capabilities for analysis.
10. Continuous Learning and Adaptation: Data scientists must engage in continuous learning to
stay abreast of evolving data trends, methodologies, and best practices. This adaptability is
critical for maintaining relevance and effectiveness in the rapidly changing landscape of data
science.

Understanding and managing the evolution of data is fundamental to successful data science
endeavors, enabling practitioners to extract meaningful insights from complex and dynamic
information landscapes.


 Theory Assignment 3
 Even Roll Numbered students solve 1, 2, 4, 5, 6, 8, 9
 Odd Roll Numbered students solve 2, 3, 4, 5, 7, 8, 10
Q1. What are the features of R programming?

1. Compatibility with Big Data Tools: R can be integrated with big data tools and
platforms such as Apache Hadoop, Spark, and databases like MySQL, allowing for
scalable data analysis on large datasets.
2. Cross-Platform Compatibility: R is compatible with various operating systems,
including Windows, macOS, and Linux, making it accessible to a broad audience.
3. Machine Learning Capabilities: R provides a range of machine learning algorithms
through packages like caret, randomForest, and others, enabling users to perform
advanced analytics and predictive modeling.
4. Open Source: R is an open-source language, making it freely available for anyone to
use, modify, and distribute. This fosters a collaborative community and allows for
continuous improvement.
5. Data Manipulation and Cleaning: R excels in data manipulation and cleaning tasks.
Libraries like dplyr and tidyr make it easy to filter, transform, and clean datasets,
facilitating efficient data preprocessing.
6. Data Visualization: R offers powerful data visualization capabilities through libraries
like ggplot2. Users can create highly customizable and publication-quality plots, charts,
and graphs to explore and communicate data effectively (see the sketch after this list).
7. Reproducibility: R promotes reproducibility in data analysis by supporting the
creation of scripts and notebooks. This ensures that others can replicate the analysis and
results, contributing to transparency and collaboration.
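
As a short illustration of points 5 and 6 above, the following sketch uses dplyr and ggplot2 on the built-in mtcars dataset (it assumes both packages are installed):

library(dplyr)
library(ggplot2)

# Data manipulation with dplyr: filter rows and summarise by group
mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  print()

# Data visualization with ggplot2: weight vs. fuel efficiency
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Weight vs. MPG", x = "Weight (1000 lbs)", y = "Miles per gallon")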

Q2. Write functions of R used with vectors: a) rep() b) seq() c) any() d) all() e) is.vector() f) as.vector() g) lapply() h) sapply(). Brief about each function. Give syntax and an example along with output.

a) rep()
Description: Replicates values in a vector.

Syntax:

rep(x, times)
Example:

# Create a vector with the numbers 1 to 3, repeated 2 times
my_vector <- rep(1:3, times = 2)
print(my_vector)
Output:

[1] 1 2 3 1 2 3

b) seq()
Description: Generates a sequence of numbers.

Syntax:

seq(from, to, by)
Example:

# Create a sequence from 1 to 5 with a step of 2
my_sequence <- seq(1, 5, by = 2)
print(my_sequence)
Output:

[1] 1 3 5

c) any()
Description: Checks if at least one element in a logical vector is TRUE.

Syntax:

any(x)
Example:

# Check if any element in the vector is greater than 5
my_vector <- c(3, 7, 2, 9)
result <- any(my_vector > 5)
print(result)
Output:

[1] TRUE

d) all()
Description: Checks if all elements in a logical vector are TRUE.

Syntax:

all(x)
Example:

# Check if all elements in the vector are greater than 5
my_vector <- c(6, 7, 8, 9)
result <- all(my_vector > 5)
print(result)
Output:

[1] TRUE

e) is.vector()
Description: Checks if an object is a vector.

Syntax:

is.vector(x)
Example:

# Check if the object is a vector
my_vector <- c(1, 2, 3)
result <- is.vector(my_vector)
print(result)
Output:

[1] TRUE

f) as.vector()
Description: Converts an object to a vector.

Syntax:

as.vector(x)
Example:

# Convert a matrix to a vector
my_matrix <- matrix(1:6, ncol = 2)
my_vector <- as.vector(my_matrix)
print(my_vector)
Output:

[1] 1 2 3 4 5 6

g) lapply()
Description: Applies a function to each element of a list and returns a list.

Syntax:

lapply(X, FUN)
Example:

# Apply the square root function to each element of a list
my_list <- list(a = 4, b = 9, c = 16)
result <- lapply(my_list, FUN = sqrt)
print(result)
Output:

$a
[1] 2

$b
[1] 3

$c
[1] 4

h) sapply()
Description: Applies a function to each element of a list and simplifies the result.

Syntax:

sapply(X, FUN)
Example:

# Apply the square root function to each element of a list and simplify the result
my_list <- list(a = 4, b = 9, c = 16)
result <- sapply(my_list, FUN = sqrt)
print(result)
Output:

a b c 
2 3 4
Q3. Explain the different data types in R with examples and output.

Numeric:

Description: Represents real numbers or decimal values.


Example:
x <- 3.14
print(x)
Output:
[1] 3.14

Integer:
Description: Represents whole numbers.
Example:
y <- 42L
print(y)
Output:
[1] 42

Character:
Description: Represents text or strings.
Example:
name <- "John"
print(name)
Output:
[1] "John"

Logical:
Description: Represents binary values, either TRUE or FALSE.
Example:
is_valid <- TRUE
print(is_valid)
Output:
[1] TRUE

Factor:
Description: Represents categorical data with levels.
Example:
gender <- factor(c("Male", "Female", "Male"))
print(gender)
Output:
[1] Male Female Male
Levels: Female Male

Complex:
Description: Represents complex numbers with real and imaginary parts.
Example:
z <- 3 + 2i
print(z)
Output:
[1] 3+2i

List:
Description: Represents an ordered collection of objects.
Example:
my_list <- list(1, "apple", TRUE)
print(my_list)
Output:
[[1]]
[1] 1

[[2]]
[1] "apple"

[[3]]
[1] TRUE

DataFrame:
Description: Represents a two-dimensional table of data with rows and columns.
Example:
df <- data.frame(
Name = c("John", "Jane", "Bob"),
Age = c(25, 30, 22),
Married = c(TRUE, FALSE, TRUE)
)
print(df)
Output:
Name Age Married
1 John 25 TRUE
2 Jane 30 FALSE
3 Bob 22 TRUE
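
A quick way to confirm the type of any of the objects above is with base R inspection functions such as class(), typeof(), and is.logical(); a brief sketch:

x <- 3.14
y <- 42L
name <- "John"
is_valid <- TRUE

print(class(x))              # [1] "numeric"
print(class(y))              # [1] "integer"
print(typeof(name))          # [1] "character"
print(is.logical(is_valid))  # [1] TRUE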

Q4. Write about different types of operators used in R with example and output
• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators

Arithmetic Operators:

These operators are used for mathematical operations.

Examples:


# Addition

result_add <- 5 + 3
print(result_add)

# Subtraction

result_sub <- 7 - 2

print(result_sub)

# Multiplication

result_mul <- 4 * 6

print(result_mul)

# Division

result_div <- 8 / 2

print(result_div)

Outputs:

[1] 8 # Addition

[1] 5 # Subtraction

[1] 24 # Multiplication

[1] 4 # Division

Assignment Operators:

Used to assign values to variables.

Examples:

# Assignment

a <- 10

print(a)
# Compound assignment

b <- 5

b <- b + 3

print(b)

Outputs:

[1] 10 # Assignment

[1] 8 # Compound assignment

Comparison Operators:

Used to compare values and return logical results.

Examples:


# Equality

result_equal <- (3 == 3)

print(result_equal)

# Inequality

result_not_equal <- (4 != 3)

print(result_not_equal)

# Greater than

result_greater_than <- (5 > 3)

print(result_greater_than)

# Less than or equal to

result_less_than_equal <- (4 <= 4)

print(result_less_than_equal)

Outputs:
[1] TRUE # Equality

[1] TRUE # Inequality

[1] TRUE # Greater than

[1] TRUE # Less than or equal to

Logical Operators:

Used to perform logical operations on values.

Examples:

# AND

result_and <- TRUE & FALSE

print(result_and)

# OR

result_or <- TRUE | FALSE

print(result_or)

# NOT

result_not <- !TRUE

print(result_not)

Outputs:


[1] FALSE # AND

[1] TRUE # OR

[1] FALSE # NOT
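
In addition to the operators above, R provides the modulo (%%) and integer division (%/%) operators; the modulo operator is used later in this document (Assignment question 5a) to test for even numbers. A brief sketch:

# Modulo: remainder after division
print(10 %% 3)      # [1] 1

# Integer division: quotient without the remainder
print(10 %/% 3)     # [1] 3

# Typical use: test whether a number is even
print(8 %% 2 == 0)  # [1] TRUE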

5. a) To print even numbers between 50 and 100 using a for loop.

b) Create vector x containing 10,20,30……100. And print only values greater than 50.

c) Display
d) Create a city dataframe of having 7 cities and check whether “Pune” city is present in it or not.

e) Create a vector x with repeating values of 2,4,6 . These number will repeat 4 times.

f) Create 2 vectors v1 and v2 with 5 even numbers and 5 odd numbers respectively. Join these 2 vectors
and sort it and display it.

a) Print even numbers between 50 to 100 using a for loop:

for (i in 50:100) {
  if (i %% 2 == 0) {
    print(i)
  }
}

b) Create a vector x containing 10, 20, 30...100 and print values greater than 50:

x <- seq(10, 100, by = 10)

print(x[x > 50])

c) Display a city dataframe and check if "Pune" is present:

# Creating a city dataframe

city_data <- data.frame(

City = c("Mumbai", "Delhi", "Bangalore", "Chennai", "Kolkata", "Hyderabad", "Pune"),

Population = c(12442373, 11007835, 8436675, 7088000, 5547785, 6809970, 3115431)
)

# Displaying the dataframe

print(city_data)

# Checking if "Pune" is present

is_pune_present <- "Pune" %in% city_data$City


print(is_pune_present)

d) Create a vector x with repeating values of 2, 4, 6 repeated 4 times:

x <- rep(c(2, 4, 6), each = 4)

print(x)

e) Create vectors v1 and v2, join them, sort, and display:

# Create vectors

v1 <- c(2, 4, 6, 8, 10)

v2 <- c(1, 3, 5, 7, 9)

# Join vectors

joined_vector <- c(v1, v2)

# Sort the joined vector

sorted_vector <- sort(joined_vector)

# Display the sorted vector

print(sorted_vector)

Read values from a CSV (LoanSanction.CSV) file into a dataframe. The fields are Loan_ID Gender Married
Dependents Education Self_employed ApplicantIncome CoapplicantIncome LoanAmount
Loan_Amount_Term Credit_History Property_Area Loan_Status and print

i) Only Loan_ID values

ii) All column values except Married and Self_Employed

iii) Only Loan_ID and LoanAmount column values


# Read CSV file into a dataframe

loan_data <- read.csv("LoanSanction.csv")

# i) Print only Loan ID values

print(loan_data$Loan_ID)

# ii) Print all column values except Married and Self-employed

selected_columns <- loan_data[, !(names(loan_data) %in% c("Married", "Self_Employed"))]

print(selected_columns)

# iii) Print only Loan ID and Loan Amount column values

selected_columns <- loan_data[, c("Loan_ID", "LoanAmount")]

print(selected_columns)

Q6. What are the different methods of subsetting? Write with examples and output.

Subsetting in R refers to extracting specific elements or subsets from a data structure like a vector, matrix, list, or dataframe. There are various methods of subsetting in R; examples of some common methods are given below:

Indexing:

Description: Using numeric indices to extract specific elements.

Example:

# Create a vector

my_vector <- c("apple", "banana", "orange", "grape")

# Subset using numeric indices

subset_vector <- my_vector[c(2, 4)]

print(subset_vector)
Output:

[1] "banana" "grape"

Description: Indexing allows you to select specific elements from a vector or a data structure based on
their position.

Logical Indexing:

Description: Using logical conditions to extract elements that satisfy a condition.

Example:


# Create a vector

my_vector <- c(10, 20, 30, 40, 50)

# Subset using logical indexing

subset_vector <- my_vector[my_vector > 30]

print(subset_vector)

Output:

[1] 40 50

Description: Logical indexing is useful for filtering elements based on conditions, making it easy to
extract specific subsets.

Named Indexing:

Description: Using named indices to extract elements by name.

Example

# Create a vector with named elements

my_vector <- c(apple = 10, banana = 20, orange = 30, grape = 40)

# Subset using named indexing

subset_vector <- my_vector[c("apple", "orange")]


print(subset_vector)

Output:


apple orange

10 30

Description: Named indexing allows you to access elements using their assigned names, providing clarity
in code.

Negative Indexing:

Description: Using negative indices to exclude specific elements.

Example:

# Create a vector

my_vector <- c("apple", "banana", "orange", "grape")

# Exclude elements using negative indexing

subset_vector <- my_vector[-c(2, 4)]

print(subset_vector)

Output:

[1] "apple" "orange"

Description: Negative indexing is handy for excluding specific elements from a vector or data structure.

Column Subsetting in Dataframes:

Description: Extracting specific columns from a dataframe.

Example:

# Create a dataframe

my_df <- data.frame(

Name = c("John", "Jane", "Bob"),

Age = c(25, 30, 22),

Married = c(TRUE, FALSE, TRUE)


)

# Subset specific columns

subset_columns <- my_df[, c("Name", "Age")]

print(subset_columns)

Output:

Name Age

1 John 25

2 Jane 30

3 Bob 22

Description: Column subsetting is crucial for extracting specific variables from a dataframe for analysis.

Q7. What is a filtering function? Which package needs to be installed to use this function? Give one example of the filter function with output.

In R, a filtering function selects the subset of rows (or elements) of a data structure that satisfy a given condition. For data frames, the most commonly used filtering function is filter() from the dplyr package.
The filter() function is part of the dplyr package, which is widely used for data manipulation tasks in R.
To use the filter() function, you need to have the dplyr package installed. If you haven't installed it yet,
you can do so using the following command:

install.packages("dplyr")

Once the package is installed, you can load it into your R session using:

library(dplyr)

Now, here's an example of using the filter() function with a dataframe:

# Assuming you have a dataframe named 'my_data'

# Create a sample dataframe

my_data <- data.frame(

ID = c(1, 2, 3, 4, 5),

Name = c("Alice", "Bob", "Charlie", "David", "Eva"),

Age = c(25, 30, 22, 35, 28),

Gender = c("Female", "Male", "Male", "Male", "Female")

# Use filter() to select rows where Age is greater than 25

filtered_data <- my_data %>% filter(Age > 25)

# Print the filtered dataframe

print(filtered_data)

Output:


ID Name Age Gender

1 2 Bob 30 Male

2 4 David 35 Male

3 5 Eva 28 Female

In this example, we used the filter() function to retain only those rows where the 'Age' column is greater than 25.
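
As a brief extension (not part of the original question), filter() also accepts multiple comma-separated conditions, which are combined as a logical AND; a sketch using the same my_data frame defined above:

# Rows where Age is greater than 25 AND Gender is "Male"
filtered_multi <- my_data %>% filter(Age > 25, Gender == "Male")
print(filtered_multi)   # expected: Bob (30) and David (35)
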
Q8. Define vector, list, matrix, and dataframe.

Vector:

Definition: A vector is a one-dimensional array that can hold elements of the same data type.

Example:

# Create a numeric vector

numeric_vector <- c(1, 2, 3, 4, 5)

print(numeric_vector)

Output:

[1] 1 2 3 4 5

Definition: Vectors can be numeric, character, logical, etc., and are fundamental for storing and
manipulating sequences of data.

List:

Definition: A list is a collection of elements, which can be of different data types.

Example:

# Create a list with different data types

my_list <- list("John", 25, TRUE, c(1, 2, 3))

print(my_list)

Output:

[[1]]

[1] "John"

[[2]]

[1] 25

[[3]]

[1] TRUE

[[4]]

[1] 1 2 3
Definition: Lists are versatile and can store heterogeneous data, making them suitable for complex data
structures.

Matrix:

Definition: A matrix is a two-dimensional array with rows and columns, containing elements of the same
data type.

Example:

# Create a matrix

my_matrix <- matrix(1:6, nrow = 2, ncol = 3)

print(my_matrix)

Output:

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

Definition: Matrices are useful for mathematical operations and storing data in a tabular format.

Dataframe:

Definition: A dataframe is a two-dimensional, tabular data structure with rows and columns, where each
column can be of a different data type.

Example:

# Create a dataframe

my_dataframe <- data.frame(

Name = c("John", "Jane", "Bob"),

Age = c(25, 30, 22),

Married = c(TRUE, FALSE, TRUE)
)

print(my_dataframe)

Output:

Name Age Married

1 John 25 TRUE
2 Jane 30 FALSE

3 Bob 22 TRUE

Definition: Dataframes are widely used for storing and manipulating datasets, and they provide a
convenient structure for data analysis.

Q9. What are NA and NaN in R? How are they handled in R?

NA (Not Available):

Definition: NA is a placeholder for missing or undefined data.

Handling in R:

NA values can propagate through operations.

Functions like is.na() are used to check for and handle NA values.

The na.rm argument in functions like mean() allows removal of NA values during calculations.

Example:

my_vector <- c(1, 2, NA, 4, 5)

print(mean(my_vector, na.rm = TRUE)) # Calculates mean, ignoring NA

Output:

[1] 3
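
A small follow-up sketch (base R only) showing is.na() used to detect and then drop missing values:

my_vector <- c(1, 2, NA, 4, 5)

# Detect missing values
print(is.na(my_vector))    # [1] FALSE FALSE  TRUE FALSE FALSE

# Remove missing values by logical subsetting
clean_vector <- my_vector[!is.na(my_vector)]
print(clean_vector)        # [1] 1 2 4 5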

NaN (Not a Number):

Definition: NaN represents undefined or unrepresentable numerical results, such as 0/0.

Handling in R:

NaN values can propagate through mathematical operations.

Functions like is.nan() are used to check for NaN.

NA and NaN are distinct; NaN is specifically for undefined numerical results.

Example:

result <- 0/0

print(is.nan(result)) # Checks if the result is NaN

Output:

[1] TRUE
In data science, handling NA and NaN is crucial for accurate and meaningful analysis. Common
techniques include checking for these values, filtering or removing them, and replacing them with
appropriate values based on the context of the analysis. The examples above demonstrate basic
handling techniques for NA and NaN in R.

Q10. Write about the following functions: min(), max(), mean(), median(), sum(), range(), abs(), with examples and output.

min():

Description: Computes the minimum value of a vector or a set of values.

Example:

my_vector <- c(5, 2, 8, 3, 10)

min_value <- min(my_vector)

print(min_value)

Output:

[1] 2

max():

Description: Computes the maximum value of a vector or a set of values.

Example:

my_vector <- c(5, 2, 8, 3, 10)

max_value <- max(my_vector)

print(max_value)

Output:

[1] 10

mean():

Description: Computes the arithmetic mean (average) of a vector or a set of values.

Example:

my_vector <- c(5, 2, 8, 3, 10)

mean_value <- mean(my_vector)


print(mean_value)

Output:

[1] 5.6

median():

Description: Computes the median of a vector or a set of values.

Example:

my_vector <- c(5, 2, 8, 3, 10)

median_value <- median(my_vector)

print(median_value)

Output:

[1] 5

sum():

Description: Computes the sum of a vector or a set of values.

Example:

my_vector <- c(5, 2, 8, 3, 10)

sum_value <- sum(my_vector)

print(sum_value)

Output:

[1] 28

range():

Description: Returns the minimum and maximum values of a vector or a set of values (as a vector of length two), rather than their difference.

Example:

my_vector <- c(5, 2, 8, 3, 10)


range_values <- range(my_vector)

print(range_values)

Output:

[1] 2 10

abs():

Description: Computes the absolute values of elements in a vector.

Example:

my_vector <- c(-3, 7, -5, 2, -8)

abs_values <- abs(my_vector)

print(abs_values)

Output:

[1] 3 7 5 2 8

Theory Assignment 4
