
A project report on

“Exploratory Data Analysis (EDA)
of the
Titanic dataset”

Submitted in partial fulfilment of the requirements for the

MASTER IN COMPUTER APPLICATION


in
APPLIED SCIENCE
by
NITISH KUMAR MOHANTY
Under the guidance of

AMEEN MANNA

CENTURION UNIVERSITY OF TECHNOLOGY


AND MANAGEMENT
Declaration

I, NITISH KUMAR MOHANTY, a bonafide student of CENTURION UNIVERSITY OF
TECHNOLOGY AND MANAGEMENT, hereby declare that the dissertation entitled
“Exploratory Data Analysis (EDA) of the Titanic dataset” has been carried out by me under the
guidance of ARPITA SENAPATI, in partial fulfilment of the requirements for the MASTER OF
COMPUTER APPLICATION of the CENTURION UNIVERSITY OF TECHNOLOGY AND
MANAGEMENT, BHUBANESWAR, during the academic year 2022-2023.
ABSTRACT
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic
sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone
onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were
more likely to survive than others.
This notebook performs an exploratory data analysis (EDA) of the Titanic dataset and answers the
question: “What sorts of people were more likely to survive?” using passenger data (i.e., name, age,
gender, socio-economic class, etc.).
ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of any task would
be incomplete without mention of the people who made it possible. Success is the epitome of
hard work and perseverance, but steadfast above all is encouraging guidance.

I would like to thank all the faculty members, who have always been very cooperative
and generous. In conclusion, I also thank the non-teaching staff and all others who
helped, directly or indirectly, during this project.

Nitish Kumar Mohanty


TABLE OF CONTENTS

Introduction
Plan of Implementation
Approach and Design
Implementation (Logistic Regression with Python)
Conclusion
Future Scope
Introduction
Machine Learning is a branch of Artificial Intelligence that aims at solving real-life
engineering problems. Rather than relying on explicit programming, the technique depends on data:
the machine learns from pre-existing data and predicts results accordingly.
Machine Learning methods have the benefit of using decision trees, heuristic learning, knowledge
acquisition, and mathematical models, and thus provide controllability, observability, stability and
effectiveness.

Exploratory Data Analysis - EDA


• EDA is applied to investigate the data and summarize the key insights.
• It gives you a basic understanding of your data: its distribution, null values and
much more.
• You can explore the data either through graphs or through some Python functions.
• There are two types of analysis: univariate and bivariate. In univariate analysis,
you analyze a single attribute; in bivariate analysis, you analyze an attribute against
the target attribute.
• In the non-graphical approach, you use functions such as shape, summary,
describe, isnull, info, dtypes and more.
• In the graphical approach, you use plots such as scatter, box, bar, density and
correlation plots.
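As a quick sketch of the non-graphical approach, the following uses a tiny invented sample in place of the Titanic data (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Tiny invented sample standing in for the Titanic data
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})

print(df.shape)           # (rows, columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # count of null values per column
print(df.describe())      # summary statistics for numeric columns
```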

Understand the ML best practice and project roadmap

A customer typically wants to implement ML (Machine Learning) for the identified business problem(s)
after multiple discussions with stakeholders from both sides: Business, Architecture,
Infrastructure, Operations, and others. This is quite normal for any new product/application
development, but in the ML world it is quite different: for new application development
we have a set of requirements in the form of sprint plans or a traditional SDLC, and the next
release plan depends on the customer, whereas an ML implementation needs to begin with the
activities below.

Identify the Data Source(s) and Data Collection

The data may come from the organization’s key application(s), whether internal or external
applications or websites, or it may be streaming data from the web (any social media, such as
Twitter or Facebook). Once you’re comfortable with the available data, you
can start work on the rest of the Machine Learning process model.

Machine Learning process

Let’s jump into the EDA process (Step 3 of the Machine Learning process). In data preparation, EDA
takes most of the effort and is an unavoidable step.

Plan of Implementation
The project can be broken down into 7 main steps which are as follows:

1. Understand the dataset.


2. Clean the data.
3. Analyse the candidate columns to be Features.
4. Process the features as required by the model/algorithm.
5. Train the model/algorithm on training data.
6. Test the model/algorithm on testing data.
7. Tune the model/algorithm for higher accuracy.

APPROACH AND DESIGN

The approach we have taken to building the predictive model using machine learning
algorithms is described below.

Data Collection
Data collection is the process of gathering and measuring information from countless
different sources, in order to use the data we collect to develop practical machine learning
solutions.

Collecting data allows you to capture a record of past events, so that data analysis can
find recurring patterns. From those patterns, you build predictive models using machine learning
algorithms that look for trends and predict future changes.

Download dataset from Kaggle via https://www.kaggle.com/c/titanic/data?select=train.csv

Data Pre-processing

Data cleaning
There are some null values in the dataset, in columns such as Age, Cabin and Embarked.
Due to the presence of these null values, the classification cannot be done
accurately. So, we replace the null values in the different columns with suitable dummy
values.
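A minimal sketch of this replacement step, on an invented sample with nulls (the column names and fill values are illustrative only):

```python
import numpy as np
import pandas as pd

# Invented sample with nulls, standing in for the raw dataset
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0],
                   "Embarked": ["S", None, "C"]})

# Numeric columns get the median; categorical columns get a placeholder value
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna("Unknown")

print(df.isnull().sum().sum())  # no nulls remain
```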

Choosing Required Attributes

In this step we eliminate the columns of the dataset that are not useful for
predicting passenger survival. This is decided using feature importance,
computed over the considered attributes.
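One common way to compute such feature importances is with a tree-based model; this sketch uses scikit-learn's RandomForestClassifier on invented data (the columns and values below are illustrative, not the report's actual figures):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented stand-in: two candidate feature columns and a binary target
X = pd.DataFrame({"Pclass": [1, 3, 2, 3, 1, 2, 3, 1],
                  "Fare":   [80, 8, 20, 7, 90, 25, 9, 70]})
y = [1, 0, 1, 0, 1, 0, 0, 1]

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# One importance score per column; low-importance columns are candidates to drop
for name, score in zip(X.columns, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```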

Data Visualization
• The collected data is visualized for a better understanding of the information.
• The Matplotlib library is used here for drawing the graphs.
• Data visualization is necessary to understand the solution in a better way. The
graphs below were drawn from the Titanic dataset.
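A minimal Matplotlib sketch of one such graph, with an invented sample in place of the real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample standing in for the Titanic data
df = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 1],
                   "Sex": ["male", "female", "female", "male", "male", "female"]})

# Count survivors and non-survivors per sex, then draw a grouped bar chart
counts = df.groupby(["Sex", "Survived"]).size().unstack(fill_value=0)
counts.plot(kind="bar")
plt.title("Survival count by sex")
plt.savefig("survival_by_sex.png")
```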

Model Development and Evaluation

Here, we have developed a generic model and applied all the classification methods. The data is
split into training data and test data; we train the model using certain features, use it to
predict on the testing data, and then calculate the performance of the system. The various
classification models used are: Logistic Regression, Gaussian Naïve Bayes Classifier, KNN (K-
Nearest Neighbors), Support Vector Machines, Gradient Boosting, Decision
Trees and Random Forest Classifier. Among these methods, the Random Forest and Decision
Tree have given good results.
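The comparison described above can be sketched as follows; a synthetic dataset from scikit-learn stands in for the processed Titanic features, so the accuracies printed are not the report's results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed Titanic features and Survived target
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training split and score it on the held-out split
scores = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```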

SYSTEM REQUIREMENT SPECIFICATION

A System Requirement Specification (SRS) is basically an organization’s understanding


of a customer or potential client’s system requirements and dependencies at a particular point,
prior to any actual design or development work. The information gathered during the analysis is
translated into a document that defines a set of requirements. It gives a brief description of the
services that the system should provide and of the constraints under which the system should
operate. Generally, SRS is a document that completely describes what the proposed software
should do without describing how the software will do it. It’s a two-way insurance policy that
assures that both the client and the organization understand the other’s requirements from that
perspective at a given point in time.

SRS document itself states in precise and explicit language those functions and capabilities a
software system (i.e., a software application, an ecommerce website and so on) must provide, as
well as states any required constraints by which the system must abide. SRS also functions as a
blueprint for completing a project with as little cost growth as possible. SRS is often referred to
as the “parent” document because all subsequent project management documents, such as design
specifications, statements of work, software architecture specifications, testing and validation
plans, and documentation plans, are related to it.

A requirement is a condition or capability to which the system must conform. Requirement
Management is a systematic approach to eliciting, organizing and documenting the
requirements of the system clearly, along with the applicable attributes. The elusive difficulties
of requirements are not always obvious and can come from any number of sources.

Functional Requirements
Functional Requirement defines a function of a software system and how the system must
behave when presented with specific inputs or conditions. These may include calculations, data
manipulation and processing and other specific functionality.

Following are the functional requirements on the system:

1. The system shall load the Titanic dataset and report its shape, data types and null values.
2. The system shall clean the data by imputing or dropping missing values.
3. The system shall visualize the data using the plotting libraries.
4. The system shall train classification models on the training data and report their performance on the test data.
Non-Functional Requirement
Non-functional requirements are the requirements which are not directly concerned with
the specific function delivered by the system. They specify the criteria that can be used to judge
the operation of a system rather than specific behaviours. They may relate to emergent system
properties such as reliability, response time and store occupancy. Non-functional requirements
arise through the user needs, because of budget constraints, organizational policies, the need for
interoperability with other software and hardware systems or because of external factors such
as: -
• Performance Requirements
• Design Requirements
• Security Constraints
• Basic Operational Requirements


Product Requirements

• Platform independency: The analysis is developed as a Jupyter notebook, so
users with any operating system that runs Python can reproduce it.
• Ease of use: The notebook provides a step-by-step interface that is easy to
follow and rerun.
• Modularity: The complete product is broken up into modules (data loading,
cleaning, visualization, modelling) with well-defined interfaces, to gain the
benefit of a flexible product.
• Robustness: The software is developed in such a way that overall
performance is optimized and the user can expect results within a limited time,
with the utmost relevancy and correctness.

System Configuration

H/W System Configuration:

• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 2 GB (min)
• Hard Disk - 256 GB

S/W System Configuration:


• Operating System - Windows XP/7/8/8.1/10/11
• Coding Language - Python

Hardware Requirements
• Processors - Pentium IV Processor
• Speed - 3.00 GHz
• RAM - 2 GB
• Storage - 20 GB

Software Requirements
• Operating system - Windows 11 Professional
• IDE used - Jupyter Notebook

SYSTEM DESIGN
Design is a meaningful engineering representation of something that is to be built. It is the most
crucial phase in the development of a system. Software design is a process through which the
requirements are translated into a representation of the software. Based on the user requirements
and the detailed analysis of the existing system, the new system is designed; this is the system
design phase. Design is the way to accurately translate a customer’s requirements into the
finished software product. Design creates a representation or model and provides details about
the software data structures, architecture, interfaces and components that are necessary to
implement the system. The logical system design arrived at as a result of systems analysis is
converted into a physical system design.

System development methodology


A system development methodology is the process through which a product is completed. The
software development process is described as a number of phases, procedures and steps that
together give the complete software, following a series of steps used for product progress. The
development method followed in this project is the waterfall model.

Model phases

The waterfall model is a sequential software development process, in which progress is
seen as flowing steadily downwards (like a waterfall) through the phases of
Requirement Analysis, Design, Implementation, Testing and Maintenance.

Requirement Analysis: This phase is concerned with gathering the requirements of the system.
The process involves producing documents and reviewing the requirements.

System Design: Keeping the requirements in mind, the system specifications are translated
into a software representation. In this phase the designer emphasizes algorithms, data
structures, software architecture and so on.

Coding: In this phase the developer starts coding in order to give a full sketch of the
product; in other words, the system specifications are converted into machine-readable code.

Implementation: The implementation phase involves the actual coding or programming of the
product. The output of this phase is typically libraries, executables, user manuals and
additional software documentation.

Testing: In this phase all modules are integrated and tested to ensure that the complete
system meets the software requirements. Testing is concerned with verification and
validation.

Maintenance: The maintenance phase is the longest phase, in which the software is updated
to satisfy changing customer needs, adapted to suit changes in the external environment,
corrected for errors and omissions previously undetected in the testing phase, and improved
in efficiency.

IMPLEMENTATION
Logistic Regression with Python
For this lecture we will be working with the Titanic Data Set from Kaggle. This is a very famous data set and
very often is a student's first step in machine learning!

We'll be trying to predict a classification: survival or deceased. Let's begin our understanding of implementing
Logistic Regression in Python for classification.

We'll use a "semi-cleaned" version of the Titanic data set; if you use the data set hosted directly on Kaggle,
you may need to do some additional cleaning not shown here.

Import Libraries
Let's import some libraries to get started!
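The imports typically used in such a notebook (assumed here, since the original code cells are not reproduced):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# In a Jupyter notebook, also run: %matplotlib inline
```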

The Data
Let's start by reading in the titanic_train.csv file into a pandas dataframe.
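In the notebook this is a single `pd.read_csv('titanic_train.csv')` call; since the file is not bundled here, a tiny inline sample stands in for it:

```python
import pandas as pd
from io import StringIO

# In the notebook: train = pd.read_csv('titanic_train.csv')
# Here, an invented inline sample stands in for the file
sample_csv = StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,26\n"
)
train = pd.read_csv(sample_csv)
print(train.head())
```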

Exploratory Data Analysis
Let's begin some exploratory data analysis! We'll start by checking out missing data!

Missing Data
We can use seaborn to create a simple heatmap to see where we are missing data!
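A sketch of that heatmap; the small frame below is invented, but the idiom is the usual one, with light cells marking the missing entries:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Invented sample with missing Age and Cabin values
train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan],
                      "Cabin": [np.nan, "C85", np.nan, np.nan],
                      "Fare": [7.25, 71.28, 7.92, 8.05]})

# isnull() gives a boolean frame; the heatmap colors the True (missing) cells
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap="viridis")
plt.savefig("missing_data.png")
```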

Roughly 20 percent of the Age data is missing. The proportion of Age missing is likely small enough for
reasonable replacement with some form of imputation. Looking at the Cabin column, it looks like we are just
missing too much of that data to do anything useful with it at a basic level. We'll probably drop this later, or
change it to another feature like "Cabin Known: 1 or 0".

Let's continue by visualizing some more of the data.
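A couple of the count plots referenced above, sketched on an invented sample:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented sample standing in for the Titanic data
train = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 1],
                      "Pclass": [3, 1, 3, 2, 3, 1],
                      "Sex": ["male", "female", "female", "male", "male", "female"]})

# Survival counts split by sex, then by passenger class
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(x="Survived", hue="Sex", data=train, ax=axes[0])
sns.countplot(x="Survived", hue="Pclass", data=train, ax=axes[1])
plt.savefig("survival_plots.png")
```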

Cufflinks for plots

Let's take a quick moment to show an example of cufflinks!

Data Cleaning
We want to fill in the missing age data instead of just dropping the rows with missing ages. One way to do this is
to fill in the mean age of all the passengers (imputation). However, we can be smarter about this and check
the average age by passenger class. For example:
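The class-wise averages can be checked with a simple groupby (the ages below are invented sample values):

```python
import numpy as np
import pandas as pd

# Invented sample: two passengers per class, one age missing
train = pd.DataFrame({"Pclass": [1, 1, 2, 2, 3, 3],
                      "Age": [38.0, 45.0, 30.0, np.nan, 22.0, 19.0]})

# Mean age per passenger class; NaN values are skipped automatically
means = train.groupby("Pclass")["Age"].mean()
print(means)
```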

We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use
these average age values to impute based on Pclass for Age.

Now apply that function!
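A sketch of that imputation function and its application; the per-class ages 37, 29 and 24 are assumed stand-ins for the class averages read off the plot:

```python
import numpy as np
import pandas as pd

# Invented sample with missing ages in different classes
train = pd.DataFrame({"Pclass": [1, 2, 3, 1, 3],
                      "Age": [38.0, np.nan, 22.0, np.nan, np.nan]})

def impute_age(cols):
    """Fill a missing age with the (assumed) average age of its class."""
    age, pclass = cols
    if pd.isnull(age):
        return {1: 37, 2: 29, 3: 24}[pclass]
    return age

# Apply row-wise over the Age and Pclass columns
train["Age"] = train[["Age", "Pclass"]].apply(impute_age, axis=1)
print(train["Age"].isnull().sum())  # no missing ages remain
```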

Now let's check that heat map again!

Great! Let's go ahead and drop the Cabin column and the row in Embarked that is NaN.
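Those two cleanup steps, sketched on an invented frame:

```python
import numpy as np
import pandas as pd

# Invented sample: sparse Cabin column, one missing Embarked value
train = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan],
                      "Embarked": ["S", None, "C"],
                      "Age": [22.0, 38.0, 26.0]})

train.drop("Cabin", axis=1, inplace=True)  # too sparse to be useful
train.dropna(inplace=True)                 # drops the row with the Embarked NaN
print(train.shape)
```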

Converting Categorical Features


We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning
algorithm won't be able to directly take in those features as inputs.
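A sketch of the conversion with `pd.get_dummies` (invented sample):

```python
import pandas as pd

# Invented sample with two categorical columns
train = pd.DataFrame({"Sex": ["male", "female", "female"],
                      "Embarked": ["S", "C", "Q"],
                      "Age": [22.0, 38.0, 26.0]})

# drop_first=True avoids perfectly collinear dummy columns
sex = pd.get_dummies(train["Sex"], drop_first=True)
embark = pd.get_dummies(train["Embarked"], drop_first=True)

# Replace the categorical columns with their dummy encodings
train = pd.concat([train.drop(["Sex", "Embarked"], axis=1), sex, embark], axis=1)
print(train.columns.tolist())
```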

Building a Logistic Regression model
Let's start by splitting our data into a training set and test set (there is another test.csv file that you can play
around with in case you want to use all this data for training).

Train Test Split

Training and Predicting
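A sketch of the split-train-predict sequence; a synthetic dataset stands in for the cleaned feature matrix, and `random_state=101` is an assumed choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and the Survived column
X, y = make_classification(n_samples=200, n_features=6, random_state=101)

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)

# Fit a logistic regression model and predict on the held-out rows
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
print(predictions[:10])
```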

Let's move on to evaluate our model!

Evaluation
We can check precision, recall and f1-score using the classification report.
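A sketch of the evaluation step, with invented labels and predictions for illustration:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented true labels and predictions, standing in for y_test and the model output
y_test = [0, 0, 1, 1, 1, 0, 1, 0]
predictions = [0, 1, 1, 1, 0, 0, 1, 0]

print(confusion_matrix(y_test, predictions))      # counts of TN, FP, FN, TP
print(classification_report(y_test, predictions)) # precision, recall, f1 per class
```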

Conclusion
This is how we do Exploratory Data Analysis. Exploratory Data Analysis (EDA) helps us to look
beyond the data. The more we explore the data, the more insights we draw from it. As a data analyst,
almost 80% of our time will be spent understanding data and solving various business problems through
EDA.

