Artificial Intelligence and Machine Learning (18CS71) : "Personality Prediction System"

VISVESVARAYA TECHNOLOGICAL UNIVERSITY,
BELAGAVI – 590 018
Artificial Intelligence and Machine Learning (18CS71)

A MINI PROJECT REPORT ON
“PERSONALITY PREDICTION SYSTEM”

Submitted as subject assignment work,
BY
TUSHIT SHUKLA 4AL18CS093
SUDARSHAN SHETTY 4AL18CS098
Under the Guidance of
Ms. Shilpa
Assistant Professor
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

ALVA’S INSTITUTE OF ENGINEERING AND TECHNOLOGY
MOODBIDRI-574225, KARNATAKA
2021 – 2022
ALVA’S INSTITUTE OF ENGINEERING AND TECHNOLOGY MIJAR,
MOODBIDRI D.K. -574225 KARNATAKA
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CERTIFICATE
This is to certify that the assignment work for the subject “Artificial Intelligence and
Machine Learning (18CS71)” has been successfully completed and the report
submitted by TUSHIT SHUKLA (4AL18CS093) and SUDARSHAN
SHETTY (4AL18CS098) during the academic year 2021–2022. It is certified
that all corrections/suggestions indicated during the presentation session have been
incorporated in the report, which has scored ____ marks out of 10
and has been deposited in the departmental library.
Ms. Shilpa
Assistant Professor
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task would be
incomplete without the mention of the people who made it possible. Success is the epitome of hard work
and perseverance, but most steadfast of all is encouraging guidance.
So, with gratitude, we acknowledge all those whose guidance and encouragement served as a
beacon of light and crowned the effort with success.
We thank our subject faculty Ms. Shilpa, Assistant Professor, Department of Computer
Science & Engineering, who has been our source of inspiration. She has been especially enthusiastic
in giving her valuable guidance and critical reviews.
We sincerely thank Dr. Manjunath Kotari, Professor and Head, Department of Computer
Science & Engineering, who has been the constant driving force behind the completion of the group
task.
We thank our beloved Principal Dr. Peter Fernandes for his constant help and support
throughout.
We are indebted to the Management of Alva’s Institute of Engineering and Technology,
Mijar, Moodbidri for providing an environment which helped us in completing our group task in
Artificial Intelligence and Machine Learning.
Also, we thank all the teaching and non-teaching staff of the Department of Computer Science &
Engineering for the help rendered.
TUSHIT SHUKLA 4AL18CS093
SUDARSHAN SHETTY 4AL18CS098
ABSTRACT
Machine learning (ML) is one of the intelligent methodologies that have shown promising results in the
domains of classification and prediction. One of the expanding areas necessitating good predictive accuracy is
sport prediction, due to the large monetary amounts involved in betting. In addition, club managers and owners
are striving for classification models so that they can understand and formulate strategies needed to win matches.
These models are based on numerous factors involved in the games, such as the results of historical matches,
player performance indicators, and opposition information. This paper provides a critical analysis of
the literature in ML, focusing on the application of Naïve Bayes to sport results prediction. In doing so, we
identify the learning methodologies utilised, data sources, appropriate means of model evaluation, and specific
challenges of predicting sport results. This then leads us to propose a novel sport prediction framework through
which ML can be used as a learning strategy. Our research will hopefully be informative and of use to those
performing future research in this application area.
TABLE OF CONTENTS
CHAPTER NO.  DESCRIPTION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1 INTRODUCTION ABOUT THE TOPIC
1.2 PROBLEM STATEMENT
1.3 OBJECTIVE
2. SYSTEM REQUIREMENT SPECIFICATION
2.1 HARDWARE SPECIFICATION
2.2 SOFTWARE SPECIFICATIONS
3. SYSTEM DESIGN
3.1 DATA-FLOW DIAGRAM
3.2 USE CASE DIAGRAM
4. IMPLEMENTATION
4.1 PSEUDO-CODE
5. TESTING
5.1 UNIT TESTING
5.2 TESTING OF OUR MODEL
6. RESULTS
7. CONCLUSION
LIST OF FIGURES
Figure no.   Description
Fig 3.1      Data Flow Diagram
Fig 3.2      Use Case Diagram
Fig 3.3      Working Diagram
Fig 6.1      Start personality test
Fig 6.2      Personality Range
Fig 6.3      Graph after the personality test
Fig 6.3(a)   Graph of personality test
SPORTS WINNING PREDICTION
CHAPTER 01
INTRODUCTION
Cricket is a well-known sport. The popularity of cricket and its viewership have increased tremendously
in the past two decades. To cater to potential future growth, global market research was commissioned
by the International Cricket Council (ICC), which revealed that cricket has more than one billion fans
worldwide, with the potential for significant growth. Among all formats of cricket, the popularity of
T20 Internationals (T20Is) is the highest. All of these fans are eager about upcoming cricket
events and tournaments, and they want to learn about the prospects of their favorite team.
1.1 INTRODUCTION ABOUT THE TOPIC

Our study aims to find the winner of the 7th edition of the T20 Cricket World Cup. For this purpose, the
dataset was extracted from ESPN Cricinfo. After collecting the dataset, various techniques were
used to check its integrity and cleanliness. Afterwards, different machine learning algorithms belonging to
the decision tree family, such as Decision Trees (ID3, C4.5, and Extra Trees) and the Random
Forest Classifier, were used to build predictive models. After testing these classifiers on the extracted
dataset(s), we found that Random Forest showed better results, with a custom accuracy of 80.86%,
compared to the other classifiers. Additionally, Australia is predicted to be the winner of the T20 World Cup
2020 as a result of this work. The data comprised four datasets: fixtures of the ICC World Cup T20,
results of previous matches, current ICC T20 rankings, and previous appearances in T20 World Cups.
This model can also be applied to predict the winner of other cricket events or even other sports of
a similar nature.
Department of Computer Science and Engineering Page 1

1.2 PROBLEM STATEMENT
The problem addressed is a cricket outcome predictor for ODI cricket matches. Modern classification
techniques such as Naïve Bayes, Support Vector Machine, and Random Forest were used to predict results,
and a comparative study was conducted based on their outcomes. Several factors are involved in the outcome
of an ODI cricket match, including home ground, toss, first and second innings, condition of the pitch, and team
strategies, and all these factors and strategies vary as the game proceeds. In terms of the
accuracy of the models used to predict the outcome of ODI cricket matches, the Naïve Bayes
classifier was taken to be the best one when the predictors were independent, and it also performed well
when the dataset was imbalanced.
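For reference, the independence assumption behind the Naïve Bayes classifier can be written as follows. Here y stands for the match outcome and x_1, ..., x_n for the predictors; these symbols are introduced only for illustration and are not used elsewhere in this report.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The predicted class is the value of y that maximizes the right-hand side.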
1.3 OBJECTIVE
The main objective of sports prediction is to improve team performance and enhance the chances of
winning the game. The value of a win takes different forms: it trickles down to the fans filling the
stadium seats, television contracts, fan store merchandise, parking, concessions, sponsorships,
enrollment and retention.
We followed the general machine learning workflow step by step:
1. Data cleaning and formatting.
2. Exploratory data analysis.
3. Feature engineering and selection.
4. Compare several machine learning models on a performance metric.
5. Perform hyper-parameter tuning on the best model.
6. Evaluate the best model on the testing set.
7. Interpret the model results.
8. Draw conclusions and document work.
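Steps 5 and 6 of the workflow above can be sketched as follows. This is only an illustration, not the code used in this project: the data is synthetic, the parameter grid is hypothetical, and Random Forest is chosen simply because it is one of the models named in this report.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real match data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Step 5: hyper-parameter tuning with cross-validation on the training set.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}  # hypothetical grid
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X_train, y_train)

# Step 6: evaluate the tuned model on the held-out testing set.
print("best parameters:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```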

CHAPTER 02
SYSTEM REQUIREMENT SPECIFICATION

A System Requirements Specification is a document or set of documentation that describes
the features and behavior of a system or software application. It includes a variety of elements
that attempt to define the intended functionality required by the customer to satisfy their
different users.
2.1 HARDWARE REQUIREMENT
• 400 MB hard disk space.
• 4 GB/8 GB RAM.
• Intel i3 or any higher processor with a 4-core CPU.
• Windows 10 operating system is sufficient.
• Active internet connection and a scanner (optional).
2.2 SOFTWARE SPECIFICATIONS

• Windows (x64) Operating System
• OpenCV, TensorFlow, Keras, NumPy
• Required Datasets
2.2.1 TENSOR FLOW AND OPEN CV
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for machine
learning applications such as neural networks. It is used for both research and production at
Google. Among the applications for which TensorFlow is the foundation are automated
image-captioning software, such as DeepDream.
OpenCV (Open Source Computer Vision Library) is a library of programming functions mainly
aimed at real-time computer vision. Originally developed by Intel, it was later supported by
Willow Garage, then Itseez (which was later acquired by Intel). The library is cross-platform and
free for use under the open-source Apache 2 License. Since 2011, OpenCV has featured
GPU acceleration for real-time operations. The first alpha version of OpenCV was released to
the public at the IEEE Conference on Computer Vision and Pattern Recognition in 2000, and
five betas were released between 2001 and 2005. The second major release of OpenCV was
in October 2009. OpenCV 2 includes major changes to the C++ interface, aiming at easier,
more type-safe patterns, new functions, and better implementations of existing ones in terms of
performance (especially on multi-core systems).

CHAPTER 03
SYSTEM DESIGN
This chapter discusses the system design. It describes the data flow diagram and the
use case diagram, both of which explain the overall structure of the system.
3.1 DATA FLOW DIAGRAM
Fig 3.1 Data Flow Diagram
The flowchart of the learning and prediction system is illustrated in Fig. 3.1. Data with a known
target value is taken from the training set, its features are represented, and it is sent to the
learning algorithm, which produces the learned model. Unseen data from the
prediction set is then taken, its features are represented, and the predicted target value is given by
the learned model. The specific output is given and the personality is predicted.

3.2 USE CASE DIAGRAM
Fig 3.2 Use Case Diagram (labels: User, Response to Questions, Normalized Scores, Personality Types, Displays Graph as Result)
Fig 3.3 Working Diagram
The use case diagram explains the overall structure of the system. A use case
diagram summarizes the details of the system's users and their interactions with the system,
and is built from a set of specialized symbols and connectors. In this use case diagram,
personality traits form the model which captures the relationship between personality and
academic behavior. This model was defined by several independent sets of researchers who used
factor analysis of verbal descriptors of human behavior.

CHAPTER 04
IMPLEMENTATION
This chapter discusses the implementation of the code and describes the main functions of the
system.
4.1 Implementation
We started by importing all the libraries and dependencies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.ticker as plticker
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
Loaded the csv file
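The original code for this step is not reproduced in this report. A minimal sketch of loading match results with pandas is given below, assuming a CSV with columns such as date, Team_1, Team_2, Winner, Margin and Ground. The column names and sample rows are hypothetical, and an in-memory string stands in for the actual file, whose name is not given here.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real results file.
CSV_TEXT = """date,Team_1,Team_2,Winner,Margin,Ground
2009-06-05,India,Pakistan,India,3 wickets,Nottingham
2010-05-01,India,Australia,Australia,49 runs,Bridgetown
2010-05-11,Sri Lanka,India,Sri Lanka,5 wickets,Gros Islet
2012-09-28,England,West Indies,West Indies,15 runs,Pallekele
"""

# pd.read_csv accepts a file path or any file-like object.
results = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["date"])
print(results.head())
```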

Display the head of the data file
Filter the matches played by India
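A possible sketch of the filtering step is shown below, assuming the hypothetical column names Team_1, Team_2 and Winner and a small made-up table in place of the real data.

```python
import pandas as pd

# Made-up rows for illustration only.
results = pd.DataFrame({
    "Team_1": ["India", "India", "Sri Lanka", "England"],
    "Team_2": ["Pakistan", "Australia", "India", "West Indies"],
    "Winner": ["India", "Australia", "Sri Lanka", "West Indies"],
})

# Keep the rows where India appears on either side of the fixture.
india = results[(results["Team_1"] == "India") | (results["Team_2"] == "India")]
print(india.head())
```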

Create a column for the matches played in 2010
Display the results of the newly created dataframe
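One way this step could look is sketched below. It assumes a date column (hypothetical name) and assumes that the intent is to keep matches played from 2010 onward; the original code is not shown in this report, so both points are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
results = pd.DataFrame({
    "date": pd.to_datetime(["2009-06-05", "2010-05-01", "2010-05-11", "2012-09-28"]),
    "Team_1": ["India", "India", "Sri Lanka", "England"],
    "Team_2": ["Pakistan", "Australia", "India", "West Indies"],
})

# Derive the year of each match as a separate column.
results["match_year"] = results["date"].dt.year

# Assumption: only matches from 2010 onward are kept.
recent = results[results["match_year"] >= 2010]
print(recent)
```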

Delete the columns that won't affect match results
Convert team-1 and team-2 from categorical variables to continuous inputs
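These two steps can be sketched as follows. The dropped columns (Margin, Ground) and the sample rows are hypothetical; the one-hot encoding uses pandas get_dummies with the same prefix and columns arguments that appear in the function code later in this chapter.

```python
import pandas as pd

# Made-up rows for illustration only.
matches = pd.DataFrame({
    "Team_1": ["India", "India", "Sri Lanka"],
    "Team_2": ["Pakistan", "Australia", "India"],
    "Winner": ["India", "Australia", "Sri Lanka"],
    "Margin": ["3 wickets", "49 runs", "5 wickets"],
    "Ground": ["Nottingham", "Bridgetown", "Gros Islet"],
})

# Assumption: margin and ground are the columns treated as irrelevant.
matches = matches.drop(columns=["Margin", "Ground"])

# One-hot encode the team names so that a classifier receives numeric inputs.
final = pd.get_dummies(
    matches, prefix=["Team_1", "Team_2"], columns=["Team_1", "Team_2"]
)
print(final.columns.tolist())
```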

Adding the ICC rankings
Loop to add teams to new prediction dataset based on the ranking position of each team
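A compact sketch of this loop is given below. The ranking positions and fixtures are made up for illustration and are not actual ICC rankings; the Team and Position column names follow the function code later in this chapter.

```python
import pandas as pd

# Hypothetical ranking table and fixtures (not real ICC data).
ranking = pd.DataFrame({
    "Team": ["Australia", "India", "Pakistan"],
    "Position": [1, 2, 3],
})
fixtures = [("India", "Pakistan"), ("India", "Australia")]

# Look up positions by team name.
position = ranking.set_index("Team")["Position"]

rows = []
for team_a, team_b in fixtures:
    # The better-ranked side (lower position number) is listed as Team_1.
    if position[team_a] < position[team_b]:
        rows.append({"Team_1": team_a, "Team_2": team_b})
    else:
        rows.append({"Team_1": team_b, "Team_2": team_a})

pred_set = pd.DataFrame(rows)
print(pred_set)
```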

Dummy variables and drop winning team column
Get the results of league matches

Function Code
def clean_and_predict(matches, ranking, final, logreg):
    # Auxiliary list holding the ICC ranking position of each team
    positions = []

    # Loop to retrieve each team's position according to ICC ranking
    for match in matches:
        positions.append(ranking.loc[ranking['Team'] == match[0], 'Position'].iloc[0])
        positions.append(ranking.loc[ranking['Team'] == match[1], 'Position'].iloc[0])

    # Rows of the prediction set, later converted into a DataFrame
    pred_set = []

    # 'i' iterates over the 'positions' list, 'j' over the list of matches (list of tuples)
    i = 0
    j = 0
    while i < len(positions):
        dict1 = {}
        # The better-ranked team becomes 'Team_1' and the other 'Team_2'
        if positions[i] < positions[i + 1]:
            dict1.update({'Team_1': matches[j][0], 'Team_2': matches[j][1]})
        else:
            dict1.update({'Team_1': matches[j][1], 'Team_2': matches[j][0]})
        # Append the dictionary to the list that will become a DataFrame
        pred_set.append(dict1)
        i += 2
        j += 1

    # Convert list into DataFrame and keep a readable copy for printing
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set.copy()

    # Get dummy variables for the team columns
    pred_set = pd.get_dummies(pred_set, prefix=['Team_1', 'Team_2'], columns=['Team_1', 'Team_2'])

    # Add columns present in the model's training dataset but missing here
    missing_cols2 = set(final.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]
    # Drop the winning team column, which is the prediction target
    pred_set = pred_set.drop(['Winner'], axis=1)

    # Predict and print the winner of each match.
    # Note: label 1 is printed as the team in column 1 (Team_2); this must
    # match the label encoding used when the model was trained.
    predictions = logreg.predict(pred_set)
    for i in range(len(pred_set)):
        print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
        if predictions[i] == 1:
            print("Winner: " + backup_pred_set.iloc[i, 1])
        else:
            print("Winner: " + backup_pred_set.iloc[i, 0])
        print("")

CHAPTER 05
TESTING
This chapter discusses how the system was tested. Testing is a quality assurance
mechanism for catching residual errors. Test techniques include, but are not limited to, the
process of executing a program or application with the intent of finding software
bugs. Here, testing involves predicting the personality of a person from their responses to a few questions.
5.1 UNIT TESTING

MODULE DIVISION
Module division is the process of dividing the collection of source files required in the project
into discrete units of functionality. Each module can be independently built, tested and
debugged. Our project is divided into the following modules:
1. Data Collection
2. Attribute Selection
3. Pre-processing of data
4. Prediction of personality
Data Collection
The first step for the prediction system is data collection and deciding on the training and testing
datasets. In this project we imported a dataset from the Kaggle website and split it into 70%
training data and 30% testing data. Data collection is defined as the procedure of
collecting, measuring and analyzing accurate insights for research using standard validated
techniques. A researcher can evaluate their hypothesis on the basis of the collected data. In most
cases, data collection is the primary and most important step for research, irrespective of the
field of research. The approach to data collection is different for different fields of study,
depending on the required information.

Training Dataset:
In a dataset, a training set is used to build a model, while a test (or validation) set is used
to validate the model built. With the complete training dataset, features can be extracted
and a model can be trained to fit them.
Testing Dataset:
Once the model has been obtained from the training set, it can be used to predict on the testing set.
Some data may be used in a confirmatory way, typically to verify that a given set of inputs to a
given function produces some expected result. Other data may be used to challenge
the ability of the program to respond to unusual, extreme, exceptional, or unexpected input.
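The 70%/30% split and the use of the two sets can be sketched with scikit-learn as below. The data is synthetic and stands in for the Kaggle dataset, so the printed accuracies are not results of this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 70% of the rows go to training, 30% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Fit on the training set only, then score on both sets.
model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```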
Attribute Selection
Attributes of a dataset are the properties used by the system. For personality prediction, the
attributes include the gender of the person, the age of the person, and the Big Five traits: Openness,
Neuroticism, Extraversion, Agreeableness, and Conscientiousness (values 1 to 10). The importance of
feature selection is best recognized when dealing with a dataset that contains a
vast number of features. This type of dataset is often referred to as a high-dimensional dataset.
High dimensionality brings several problems: it
significantly increases the training time of the machine learning model, and it can make the
model very complicated, which in turn may lead to overfitting.
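As an illustration of reducing a high-dimensional dataset, the sketch below keeps the two most informative columns using scikit-learn's SelectKBest with an ANOVA F-test. This is a swapped-in example technique, not necessarily the selection method used in this project, and the data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 2 informative columns followed by 8 pure-noise columns.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
informative = y[:, None] + rng.normal(scale=0.5, size=(300, 2))
noise = rng.normal(size=(300, 8))
X = np.hstack([informative, noise])

# Score each column against the target and keep the best two.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
kept = selector.get_support(indices=True)
X_reduced = selector.transform(X)

print("kept column indices:", kept)
print("reduced shape:", X_reduced.shape)
```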
Pre-Processing of Data
Pre-processing is needed to achieve the best results from machine learning algorithms. In this module,
the dataset we gathered was pre-processed before it was sent to the training stage. Sampling is a
very common method for selecting a subset of the dataset being analysed. In most cases,
working with the complete dataset can turn out to be too expensive in terms of
memory. Using a sampling algorithm can help reduce the size of the dataset to a point where
a better, but more expensive, machine learning algorithm can be used. When we talk about
data, we usually think of large datasets with a huge number of rows and columns. While
that is a likely scenario, it is not always the case: data can come in many different forms, such as
structured tables, images, audio files, and videos. Machines do not understand free text, image
or video data as it is; they understand 1s and 0s. So we pre-process the data.
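A small sketch of these ideas is given below: drawing a random sample, turning a text column into integer codes, and scaling a numeric column to the range 0 to 1. The table and its column names (team, runs) are made up for illustration.

```python
import pandas as pd

# Made-up table standing in for the real dataset.
data = pd.DataFrame({
    "team": ["India", "Australia"] * 50,
    "runs": range(100),
})

# Sampling: keep a random 20% of the rows to reduce memory use.
sample = data.sample(frac=0.2, random_state=0)

# Encoding: map each team name to an integer code.
sample = sample.assign(team_code=sample["team"].astype("category").cat.codes)

# Min-max normalization of the numeric column to the range [0, 1].
runs = sample["runs"]
sample = sample.assign(runs_scaled=(runs - runs.min()) / (runs.max() - runs.min()))

print(sample.head())
```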
Prediction of Personality Classification
In this module, several machine learning algorithms are applied, and the algorithm that
gives the best accuracy is used for personality prediction. By applying all these modules,
the personality is finally predicted, and the final result is the personality of the user. The
personality of the user is classified using the training and testing datasets.
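Choosing the algorithm with the best accuracy can be sketched as below. The three classifiers are those named in this report (logistic regression, Naïve Bayes, SVM); the data is synthetic, so the scores printed are not the results reported in the conclusion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

models = {
    "Logistic regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}

# Train every model on the same split and record its test accuracy.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best model:", best)
```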
5.2 TESTING OF OUR MODEL
TABLE 5.2.1 PREDICTING THE BEHAVIORAL TRAIT IS SUCCESSFUL
SI NO.                 1
Feature being tested   Behavioral Traits
Description            Runs the prediction on already collected data of different matches and predicts the final output.
Input                  Data set of teams
Expected Output        Display the result as accuracy and the team which wins.
Actual Output          Display the result as accuracy and the team which wins.
Remark                 SUCCESSFUL
In Table 5.2.1, all the behavioral traits are calculated from the answers to the questionnaire,
and the expected and actual outputs are compared. If the actual output has the same
normalized score as the expected output, the remark is given as successful, along with the
graph.

TABLE 5.2.2 PREDICTING THE BEHAVIORAL TRAIT IS UNSUCCESSFUL
SI NO.                 2
Feature being tested   Behavioral Traits
Description            Runs the prediction on already collected data of different matches and predicts the final output.
Input                  Data set of teams
Expected Output        Display the result as accuracy and the team which wins.
Actual Output          Display the result as accuracy and the team which wins.
Remark                 SUCCESSFUL
In Table 5.2.2, all the behavioral traits are calculated from the answers to the questionnaire,
and the graph and score are calculated from the expected inputs. If the actual output is
equal to the expected output, the remark shows successful. In this test the actual output
does not match the expected output, therefore the remark shows unsuccessful.

CHAPTER 06
RESULT
Fig 6.1 Final Predicted output
Fig 6.1 shows the final predicted output, which helps the user
predict the winner using the Naïve Bayes theorem.

Fig 6.2 Accuracy of the training and test data

Fig 6.2 shows the training and test data accuracy obtained with the logistic regression model (logreg).

CHAPTER 07
CONCLUSION
In this project, we discuss how personality is identified using different classification
algorithms. We study the relationship between a user and his/her personality. We used
logistic regression because it gives the best accuracy, around 86.53%, compared to the other
algorithms used previously, such as Naïve Bayes and SVM. Logistic regression is fast and
gives accurate results compared to the other algorithms. Thus the personality is automatically
classified by the system after the user attempts the survey, using the data set provided in the
back end. Sports prediction has grown in recent times, so more accurate traits
can be added in future. Further improvements to the data set and algorithms can
raise the accuracy and can be helpful for a career guidance module. This project also discusses
sports winning prediction, mainly for World Cup prediction.

