This work focuses more on modeling and applying the model on a test dataset.
Author: Michael Olusola Adegbenro

Many works on the Titanic event focus on visual presentation and analysis of the data, but this work
focuses more on model training and application of the trained model.
The train dataset is assumed to be obtained from the client. It contains both dependent and
independent variables.
The train dataset is used to train a model, and the accuracy is checked.
The test dataset is then used to test the model, and a new dataset is created for the client to make
decisions on.
I further checked the accuracy of the new dataset obtained from the model.

Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn import metrics
import warnings


There are two datasets for the project: a Train dataset and a Test dataset.

I will use the Train dataset (trainset) to train the model and confirm its accuracy.
I will then deploy the model on the Test dataset (testset), a fresh set of data, to confirm the model.

In [2]:
# load the trainset for modeling

trainset = pd.read_csv("train.csv")

In [3]:
# confirm the columns
trainset.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [4]:
# drop irrelevant columns and check what is left
trainset = trainset.drop(["PassengerId","Name","Ticket","Cabin"], axis=1)
trainset.head()

Out[4]:
   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S

In [5]:
# check for null/empty cells
trainset.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

In [6]:
# remove the rows that contain empty cells, then check again
trainset = trainset.dropna()
trainset.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
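
The effect of dropping incomplete rows can be sketched on a small frame. The values below are made up for illustration and do not come from train.csv; the point is only to show what isnull().sum() counts and which rows dropna() keeps.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one gap in Age and one in Embarked
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

missing_per_column = toy.isnull().sum()  # number of empty cells in each column
cleaned = toy.dropna()                   # keep only rows with no empty cell

print(missing_per_column)
print(cleaned.shape)        # two of the four rows survive
print(list(cleaned.index))  # surviving rows keep their original labels
```

Note that dropna() does not renumber the rows. This matters whenever the cleaned frame is later combined with another frame by index.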

In [7]:
# confirm the data types before encoding the categorical variables
trainset.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

In [8]:
# encode the categorical data (Sex and Embarked are the only ones to convert)

from sklearn.preprocessing import LabelEncoder,StandardScaler

enco = LabelEncoder()

trainset['Sex'] = enco.fit_transform(trainset['Sex'])

trainset['Embarked'] = enco.fit_transform(trainset['Embarked'])
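
As a rough illustration of what the encoding step does, the sketch below uses made-up values standing in for the Sex and Embarked columns. LabelEncoder assigns integer codes in sorted order of the labels, and a fitted encoder can be reused on new data with transform() so both datasets share one mapping. Keeping one encoder per column in a dictionary (the names encoders, train_toy and new_toy are hypothetical) is an assumption of this sketch, not something the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values standing in for the Sex and Embarked columns
train_toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                          "Embarked": ["S", "C", "Q"]})
new_toy = pd.DataFrame({"Sex": ["female", "male"],
                        "Embarked": ["Q", "S"]})

encoders = {}
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    train_toy[col] = enc.fit_transform(train_toy[col])  # learn the mapping here
    new_toy[col] = enc.transform(new_toy[col])           # reuse it on new data
    encoders[col] = enc

sex_classes = list(encoders["Sex"].classes_)            # position = integer code
embarked_classes = list(encoders["Embarked"].classes_)
print(sex_classes, embarked_classes)
print(train_toy)
print(new_toy)
```

Calling fit_transform() separately on each dataset gives the same codes only when both contain the same set of labels, which is why reusing the fitted encoder is the safer habit.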

In [9]:


In [10]:
# check columns
trainset.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')

In [11]:
# Split trainset into X (features/independent variables) and y (label/dependent variable)
X = trainset[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare','Embarked']]
y = trainset.Survived

In [12]:
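# Split to train and test
from sklearn.model_selection import train_test_split
# test_size=0.25 matches the shapes printed in the next cell (534 train rows, 178 test rows);
# the random_state of the original run is not shown in this export
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)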

In [13]:
print (x_train.shape, y_train.shape)

print (x_test.shape, y_test.shape)

(534, 7) (534,)

(178, 7) (178,)
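
The printed shapes are consistent with a 75/25 split of 712 rows (534 + 178). A minimal check of that arithmetic on placeholder arrays is sketched below; test_size=0.25 and random_state=0 are assumptions of this sketch, and the zero-filled arrays only stand in for X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 712                     # 534 + 178, the totals printed above
X_dummy = np.zeros((n_rows, 7))  # placeholder features, 7 columns as in X
y_dummy = np.zeros(n_rows)       # placeholder labels

# test_size and random_state are assumed values for this sketch
xa, xb, ya, yb = train_test_split(X_dummy, y_dummy,
                                  test_size=0.25, random_state=0)
print(xa.shape, ya.shape)
print(xb.shape, yb.shape)
```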

In [14]:
# apply Logistic Regression Model
classifier = LogisticRegression(random_state = 0).fit(x_train, y_train)


In [15]:
# Predict the dependent variable

y_pred = classifier.predict(x_test)

In [16]:
# confusion matrix (rows: actual class, columns: predicted class)
confusion_matrix(y_test, y_pred)

array([[85, 17],
       [19, 57]], dtype=int64)

In [17]:
# Check model accuracy
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc*100))

Accuracy: 79.78%
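
The accuracy can be re-derived by hand from the confusion matrix above, together with precision, recall and F1 for the "survived" class (these scorers are imported at the top but not used). The sketch assumes scikit-learn's convention that rows are actual classes and columns are predicted classes, with class 0 first.

```python
import numpy as np

# Confusion matrix reported above
cm = np.array([[85, 17],
               [19, 57]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted survivors, share that survived
recall = tp / (tp + fn)           # of actual survivors, share that was found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:  {:.2f}%".format(accuracy * 100))   # 79.78%
print("Precision: {:.2f}%".format(precision * 100))
print("Recall:    {:.2f}%".format(recall * 100))
print("F1:        {:.2f}%".format(f1 * 100))
```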

Apply the model to a new dataset - here we use the Test dataset (testset)

In [18]:
# Load the data

testset = pd.read_csv("test.csv")

In [19]:
# as with trainset, drop irrelevant columns and check what is left
testset = testset.drop(["PassengerId","Name","Ticket","Cabin"], axis=1)
testset.head()

Out[19]:
   Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0       3    male  34.5      0      0   7.8292        Q
1       3  female  47.0      1      0   7.0000        S
2       2    male  62.0      0      0   9.6875        Q
3       3    male  27.0      0      0   8.6625        S
4       3  female  22.0      1      1  12.2875        S

In [20]:
# check for null/empty cells
testset.isnull().sum()

Pclass       0
Sex          0
Age         86
SibSp        0
Parch        0
Fare         1
Embarked     0
dtype: int64

In [21]:
# remove the rows that contain empty cells, then check again
testset = testset.dropna()
testset.isnull().sum()

Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [22]:
# confirm the data types
testset.dtypes

Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

In [23]:
# encode the categorical data (Sex and Embarked are the only ones to convert)

from sklearn.preprocessing import LabelEncoder,StandardScaler

enco = LabelEncoder()

testset['Sex'] = enco.fit_transform(testset['Sex'])

testset['Embarked'] = enco.fit_transform(testset['Embarked'])

In [24]:
# Predict the dependent variable for the new data

ynew_pred = classifier.predict(testset)

In [25]:
#convert to DataFrame

ynew_pred = pd.DataFrame(ynew_pred)

#Add column name to the dataframe

ynew_pred.columns = ['Survived']

In [26]:
# combine testset with the predicted values to make it look like data used for trainset
result = pd.concat([testset,ynew_pred], axis=1, join='inner')
result

Out[26]:
     Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  Survived
0         3    1  34.5      0      0   7.8292         1         0
1         3    0  47.0      1      0   7.0000         2         0
2         2    1  62.0      0      0   9.6875         1         0
3         3    1  27.0      0      0   8.6625         2         0
4         3    0  22.0      1      1  12.2875         2         1
...     ...  ...   ...    ...    ...      ...       ...       ...
326       2    0  12.0      2      1  39.0000         2         1
327       1    1  46.0      0      0  79.2000         0         1
328       2    1  29.0      1      0  26.0000         2         1
329       2    1  21.0      0      0  13.0000         2         1
330       2    0  48.0      0      2  36.7500         2         0

260 rows × 8 columns
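
The combined frame has 260 rows although its row labels run up to 330. One plausible explanation, assuming the empty cells were removed without renumbering the rows, is that pd.concat with join='inner' matches rows by index label: the features keep their original (gapped) labels, while the predictions get fresh labels 0, 1, 2, and so on. The sketch below reproduces the effect on made-up values; preds is a stand-in for the output of classifier.predict.

```python
import numpy as np
import pandas as pd

# Made-up features; rows 1 and 3 have a missing Age
features = pd.DataFrame({"Age": [34.5, np.nan, 62.0, np.nan, 22.0],
                         "Fare": [7.83, 7.00, 9.69, 8.66, 12.29]})
features = features.dropna()   # index labels are now 0, 2, 4
preds = np.array([0, 1, 0])    # stand-in predictions, one per remaining row

# Predictions labelled 0, 1, 2: only labels 0 and 2 overlap with the features
misaligned = pd.concat([features, pd.DataFrame({"Survived": preds})],
                       axis=1, join="inner")

# Giving the predictions the features' own index keeps every pair together
aligned = pd.concat(
    [features, pd.DataFrame({"Survived": preds}, index=features.index)],
    axis=1, join="inner")

print(misaligned)   # 2 rows; label 2 is paired with the third prediction
print(aligned)      # 3 rows; label 2 is paired with its own (second) prediction
```

An equivalent option under the same assumption is to call reset_index(drop=True) on the features before predicting, so both frames share labels 0 to n-1.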

In [41]:
data = pd.read_csv("test.csv")

# combine the original test data with the predicted values and save the result
data = pd.concat([data,ynew_pred], axis=1, join='inner')
data.to_csv("New Data.csv")

In [27]:
# save the combined data, then load it back as a new dataset
result.to_csv("New Result.csv")
NewSet = pd.read_csv("New Result.csv")
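
Two details of the save-and-reload step are easy to miss: to_csv returns None when it is given a target to write to, and by default it also writes the row labels, which come back as an extra "Unnamed: 0" column on reload. The sketch below shows both on a made-up frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

frame = pd.DataFrame({"Pclass": [3, 2], "Survived": [0, 1]})  # made-up values

# Default: row labels are written as an unnamed first column
buf_with_index = io.StringIO()
returned = frame.to_csv(buf_with_index)   # writes to the buffer, returns None
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# index=False: only the data columns are written
buf_plain = io.StringIO()
frame.to_csv(buf_plain, index=False)
buf_plain.seek(0)
reloaded_plain = pd.read_csv(buf_plain)

print(returned)
print(list(reloaded_with_index.columns))
print(list(reloaded_plain.columns))
```

Selecting columns by name, as the next cell does, sidesteps the extra column.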

In [38]:


Although not important, I want to confirm the accuracy of the new dataset containing our predicted
dependent variable.
I will simply repeat the same process as above.

In [29]:
# Split NewSet into Xnew (features/independent variables) and ynew (label/dependent variable)

Xnew = NewSet[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare','Embarked']]

ynew = NewSet.Survived

In [30]:
# encode the categorical data (Sex and Embarked are the only ones to convert)
# testset was already encoded in In [23], so re-encoding leaves its values unchanged

from sklearn.preprocessing import LabelEncoder,StandardScaler

enco = LabelEncoder()

testset['Sex'] = enco.fit_transform(testset['Sex'])

testset['Embarked'] = enco.fit_transform(testset['Embarked'])

In [31]:
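# Split to train and test
from sklearn.model_selection import train_test_split
# test_size=0.3 matches the shapes printed in the next cell (182 train rows, 78 test rows);
# the random_state of the original run is not shown in this export
x1_train, x1_test, y1_train, y1_test = train_test_split(Xnew, ynew, test_size=0.3)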


In [32]:
print (x1_train.shape, y1_train.shape)

print (x1_test.shape, y1_test.shape)

(182, 7) (182,)

(78, 7) (78,)

In [33]:
# apply Logistic Regression Model
classifier = LogisticRegression(random_state = 0).fit(x1_train, y1_train)


In [34]:
# Predict the dependent variable
y1_pred = classifier.predict(x1_test)

# confusion matrix (rows: actual class, columns: predicted class)
confusion_matrix(y1_test, y1_pred)

array([[45, 10],
       [19,  4]], dtype=int64)

In [35]:
# Check model accuracy
acc = accuracy_score(y1_test, y1_pred)
print("Accuracy: {:.2f}%".format(acc*100))

Accuracy: 62.82%

