
AN INTERNSHIP REPORT ON

STOCK PREDICTION ANALYSIS

Submitted by

C KEERTHANA (113219031071)
M SAI RAMYA (113219031128)
V MAHALAKSHMI (113219031083)

In partial fulfillment for the award of the degree

Of

BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

VELAMMAL ENGINEERING COLLEGE, CHENNAI-66.


(An Autonomous Institution, Affiliated to Anna University, Chennai)

2021-2022
VELAMMAL ENGINEERING COLLEGE,
CHENNAI-66

BONAFIDE CERTIFICATE

Certified that this internship report “STOCK PREDICTION ANALYSIS” is the
bonafide work of C.KEERTHANA (113219031071), M.SAI RAMYA
(113219031128), V.MAHALAKSHMI (113219031083) carried out at MIT
SQUARE, LONDON from 01.12.2021 to 31.01.2022.

Dr. B MURUGESHWARI                            MS. VAISHNAVI PILLALAMARRI
PROFESSOR & HEAD                              ASSISTANT PROFESSOR - I
Dept. of Computer Science and Engineering     Dept. of Computer Science and Engineering
Velammal Engineering College                  Velammal Engineering College
Chennai – 600 066                             Chennai – 600 066
CERTIFICATE OF EVALUATION

COLLEGE NAME : VELAMMAL ENGINEERING COLLEGE
BRANCH       : COMPUTER SCIENCE AND ENGINEERING
SEMESTER     : V

Sl. No   Name of the student who    Title of the Internship      Name of faculty coordinator
         has done the Internship                                 with designation
1        C.KEERTHANA                STOCK PREDICTION ANALYSIS    MS. VAISHNAVI PILLALAMARRI
2        M.SAI RAMYA                STOCK PREDICTION ANALYSIS    MS. VAISHNAVI PILLALAMARRI
3        V.MAHALAKSHMI              STOCK PREDICTION ANALYSIS    MS. VAISHNAVI PILLALAMARRI

This report of internship work, submitted by the above students in partial fulfillment
for the award of the Bachelor of Engineering degree in Computer Science and Engineering
at Velammal Engineering College, was evaluated and confirmed to be a report of the work
done by the above students, and then assessed.

Submitted for Internal Evaluation held on........................

Examiner 1 Examiner 2 Examiner 3

TABLE OF CONTENTS

CHAPTER NO.   TITLE

              ABSTRACT
              ACKNOWLEDGEMENT
              LIST OF TABLES
              LIST OF FIGURES
              LIST OF SYMBOLS AND ABBREVIATION

1.            INTRODUCTION AND COMPANY PROFILE
              1.1 BIZ TECH
              1.2 COMPANY PROFILE
2.            EXECUTIVE SUMMARY
              2.1 PROBLEM STATEMENT
              2.2 OVERVIEW OF THE PROJECT
                  2.2.1 DATA DESCRIPTION
                  2.2.2 METHODOLOGY
                  2.2.3 AIM AND OBJECTIVE
                  2.2.4 VISUALIZATION OF DATASET
                  2.2.5 LIBRARIES USED IN MACHINE LEARNING
3.            STOCK PREDICTION ANALYSIS
              3.1 VISUALIZATION
              3.2 DATA PREPROCESSING
              3.3 DETECTING AND REMOVING OUTLIERS
              3.4 TRAIN TEST SPLIT
              3.5 LINEAR REGRESSION
              3.6 SUPPORT VECTOR MACHINE ALGORITHM
              3.7 RESULT
              3.8 DISCUSSION AND CONCLUSION

ABSTRACT

In the finance world, stock trading is one of the most important activities. Stock
market prediction is the act of trying to determine the future value of a stock or other financial
instrument traded on a financial exchange. This paper explains the prediction of a stock using
Machine Learning. Most stockbrokers use technical, fundamental or time series analysis
when making stock predictions. The programming language used to predict the stock market
with machine learning is Python. In this paper we propose a Machine Learning (ML) approach
that is trained on the available stock data, gains intelligence from it, and then uses the acquired
knowledge for an accurate prediction. In this context, this study uses a machine learning
technique called Linear Regression to predict stock prices for large and small capitalizations
and in three different markets, employing prices with both daily and up-to-the-minute
frequencies.
The data is the price history and trading volumes of the fifty stocks in the index NIFTY
50 from NSE (National Stock Exchange) India. All datasets are at a day-level with pricing and
trading values split across .csv files for each stock, along with a metadata file with some
macro-information about the stocks themselves. The data spans from 1st January, 2000 to
30th April, 2021. Since new stock market data is generated and made available every day, the
dataset will be updated once a month in order to have the latest and most useful information.

ACKNOWLEDGEMENT

I wish to acknowledge with thanks the significant contribution given by
the management of our college, Chairman Dr. M.V. Muthuramalingam and
our Chief Executive Officer Thiru. M.V.M. Velmurugan, for their extensive
support.

I would like to thank Dr. S. SATHISHKUMAR, Principal of Velammal
Engineering College, for giving me this opportunity to do this project.

I wish to express my gratitude to our effective Head of the Department,
Dr. B. Murugeshwari, for her moral support and for her valuable innovative
suggestions, constructive interaction, constant encouragement and unending help
that have enabled me to complete the project.

I wish to express my indebted humble thanks to the Company MIT Square
and the External Guide Mr. G V Madhurangan, Senior Developer, for their
invaluable guidance in shaping this project.

I wish to express my sincere gratitude to my Internal Guide, Ms.
Vaishnavi Pillalamarri, Assistant Professor I, Department of Computer
Science and Engineering, for her guidance, without which this project would not
have been possible.

I am grateful to all the staff members of the Department of Computer
Science and Engineering for providing the necessary facilities to carry out the
project. I would especially like to thank my parents for providing me with the
unique opportunity to work, and for their encouragement and support at all levels.
Finally, my heartfelt thanks to The Almighty for guiding me throughout my life.

LIST OF TABLES

TABLE NO   TABLE NAME

1          Difference between linear regression and Support Vector Machine

LIST OF FIGURES

FIGURE NO   TITLE

1.1         Plot all columns against Index
1.2         Scatter plot of dataset
1.3         Estimation of density function
1.4         Histogram plot of dataset
2.1         Boxplot of Prev Close before removing outliers
2.2         Boxplot of Prev Close after removing outliers
2.3         Boxplot of Open before removing outliers
2.4         Boxplot of High before removing outliers

LIST OF SYMBOLS AND ABBREVIATION

S.NO   SYMBOLS               ABBREVIATION

1      np                    numpy
2      pd                    pandas
3      plt                   matplotlib.pyplot
4      df                    dataframe
5      Prev Close            Previous day’s close price
6      Open                  Open price of day
7      High                  Highest price in day
8      Low                   Lowest price in day
9      Last                  Last traded price in day
10     Close                 Close price of day
11     VWAP                  Volume Weighted Average Price
12     Volume                Volume of shares traded on the current day
13     Turnover              Turnover in a day
14     Trades                No of trades
15     Deliverable Volume    Amount of deliverable volume
16     %Deliverable          Deliverable volume in percentage
17     sns                   Seaborn
18     mse                   Mean Squared Error
19     rmse                  Root Mean Squared Error
20     lin_reg               Linear Regression
21     r2                    R2 score (performance of regression)
22     sc                    Standard scaler
23     svr                   Support Vector Regression

CHAPTER 1

INTRODUCTION AND COMPANY PROFILE

1.1 BIZ TECH

Machine learning has moved from the stuff of science fiction to a staple of modern
business, as organizations across nearly every industry vertical implement ML technologies.
Experts said machine learning enables businesses to perform tasks on a scale and scope
previously impossible to achieve. As a result, it speeds up the pace of work, reduces errors and
improves accuracy, thereby aiding employees and customers alike. Moreover, innovation-
oriented organizations are finding ways to harness machine learning to not just drive
efficiencies and improvements but to fuel new business opportunities that can differentiate their
companies in the marketplace.
Machine learning applications don't just help companies set prices; they also help
companies deliver the right products and services to the right areas at the right time through
predictive inventory planning and customer segmentation. Retailers, for example, use machine
learning to predict what inventory will sell best in which of their stores based on the seasonal
factors impacting a particular store, the demographics of that region and other data points.

1.2 COMPANY PROFILE


MIT Square is a premier product development company headquartered in Bangalore,
India and has a presence in Southampton, UK. MIT stands for "Management and Innovation
for Transformation" with a tagline "We transform your life". MIT Square is an International
Organization for Standardization (ISO) 9001:2015 certified company. We at MIT Square are
experts in designing and developing innovative products, building start-ups, and understanding
the needs of enterprises for their business growth. We offer product design, product
development, product manufacturing and patent filing services.
MIT Square excels in the design, development, manufacturing and supply of
consumer products, industrial and IoT devices, education platforms, hospitality products, and
healthcare technology. We offer turnkey, tooling and OEM/ODM services. From individuals,
start-ups, small and medium-sized companies to international corporations, MIT Square is
here to support you in all your product design and product development needs and pave the way
to transform your life by turning your ideas into reality. MIT Square offers you an unparalleled
equation of value, cost and on-time delivery through our highly qualified team of product
design-development, supply chain and product manufacturing specialists in the UK, USA, Asia
and the Middle East. We have strong links with manufacturing units in India. From discovery to
delivery, MIT Square is a one-stop solution for getting your ideas executed. Our product
designers, engineering developers and innovative management teams ensure your product
meets world-class standards. IP protection is at the heart of our management.
We follow a rigorous method to strictly protect your intellectual property rights in
Asia and across the globe and offer you complete ownership of the design. We do not just
stop at playing an advisory role. It is just our starting point. We walk along with you in
executing these strategies end-to-end, to ensure business success.

Website: https://www.mitsquare.com
Industry: Information Technology & Services
Headquarters: London, UK

CHAPTER 2

EXECUTIVE SUMMARY

DOMAIN NAME: BIZ TECH

Business technology refers to applications of science, data, engineering, and
information for business purposes, such as the achievement of economic and organizational
goals. The main element of technology is the idea of change, and how it can affect business
and society. For many, the issue of future shock develops when technology change happens so
fast that it causes individuals to be unable to tolerate changes or handle the consequences. For
example, your grandparents might have future shock when dealing with smartphones, tablets,
and the Internet. Technology’s effects on society and business are far-reaching.

2.1 PROBLEM STATEMENT


We must detect and remove missing values and outliers in the dataset, improve the
accuracy using various algorithms, and find the best-fit algorithm, that is, the one that gives
the highest accuracy score.

2.2 OVERVIEW OF THE PROJECT

2.2.1 DATA DESCRIPTION


The data is the price history and trading volumes of the fifty stocks in the index NIFTY
50 from NSE (National Stock Exchange) India. All datasets are at a day-level with pricing and
trading values split across .csv files for each stock, along with a metadata file with some
macro-information about the stocks themselves. The data spans from 1st January, 2000 to
30th April, 2021. Since new stock market data is generated and made available every day, the
dataset will be updated once a month in order to have the latest and most useful information.

2.2.2 METHODOLOGY
The methodologies we used in this dataset are:
• Data visualization
• Data pre-processing – Removing missing values and Outliers
• Train Test split
• Linear regression algorithm
• Support vector machines

2.2.3 AIM AND OBJECTIVE

• Understand the dataset and clean it up (if required).
• Build regression models to predict the stock price.
• Evaluate the models and compare their respective scores, such as R2 and RMSE.

2.2.4 VISUALIZATION OF DATASET

• plot()
The plot() function draws the data points in a diagram; by default it connects them with
a line.
• scatter
A scatter plot is a type of plot or mathematical diagram that uses Cartesian coordinates
to display values for, typically, two variables of a set of data.
• density
A density plot is used to visualize the distribution of a continuous numerical
variable in a dataset. It is also known as a kernel density plot.
• histogram
A histogram is a graphical representation that organizes a group of data points into
user-specified ranges.
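
The plot types listed above could be produced with the pandas plotting helpers roughly as follows. This is only a minimal sketch on synthetic data, not the code used in the internship: the random prices are an assumption for illustration, and the column names are borrowed from the abbreviation list. The density plot is mentioned in a comment only, because pandas needs SciPy to draw it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for one stock's price history (assumption, not the report's data).
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 1, 200))
df = pd.DataFrame({
    "Close": close,
    "Prev Close": np.concatenate(([100.0], close[:-1])),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
df["Close"].plot(ax=axes[0], title="plot(): Close against index")         # line plot
df.plot.scatter(x="Prev Close", y="Close", ax=axes[1], title="scatter")   # two variables
df["Close"].plot.hist(bins=20, ax=axes[2], title="histogram")             # 20 value ranges
# A density plot would be df["Close"].plot.density(); it requires SciPy.
fig.tight_layout()
```

In an interactive session, plt.show() (or fig.savefig with a file name of your choice) would display or store the figure.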

2.2.5 LIBRARIES USED IN MACHINE LEARNING

NumPy – which stands for 'Numerical Python', is a Python package consisting of
multidimensional array objects and a collection of routines for processing those arrays.
Using NumPy, mathematical and logical operations on arrays can be performed.
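
As a small illustration of the mathematical and logical array operations mentioned for NumPy, the sketch below works on a tiny made-up price array. The values and variable names are hypothetical and chosen only for the example.

```python
import numpy as np

# Each row is one day: Open, High, Low (made-up values for illustration).
prices = np.array([[101.0, 103.5, 99.8],
                   [102.2, 104.0, 101.1]])

day_range = prices[:, 1] - prices[:, 2]   # element-wise arithmetic: High - Low
wide_days = day_range > 3.0               # logical operation giving a boolean mask
mean_open = prices[:, 0].mean()           # aggregate over the Open column
```

Both operations act on whole columns at once, without an explicit Python loop.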

Pandas – It is a Python package providing fast, flexible, and expressive data structures
designed to make working with “relational” or “labeled” data both easy and intuitive. ... pandas
is well suited for many different kinds of data: Tabular data with heterogeneously-typed
columns, as in an SQL table or Excel spreadsheet.

matplotlib.pyplot – is a collection of functions that make matplotlib work like MATLAB.
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting
area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

CHAPTER 3

STOCK PREDICTION ANALYSIS

3.1 VISUALIZATION

Fig. 1.1 Plot all columns against index

Fig 1.2 Scatter plot of dataset

Fig 1.3 Estimation of density

Fig 1.4 Histogram plot of dataset

df.head() – Returns the first 5 rows of the dataframe.

3.2 DATA PREPROCESSING

df.isnull() – Returns a boolean dataframe of the same shape, with True for missing cells and
False for non-missing cells.
df.dropna() – Allows the user to drop rows or columns that contain null values, in
different ways.
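
A minimal sketch of how these two calls are typically combined is shown below. The small dataframe is invented for the example; the report's actual cleaning steps are not listed, so the choice of dropping fully empty columns first and then incomplete rows is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Open":   [100.0, np.nan, 102.5, 101.0],
    "Close":  [101.0, 103.0, np.nan, 100.5],
    "Trades": [np.nan, np.nan, np.nan, np.nan],   # a column with no usable data
})

missing_per_column = df.isnull().sum()   # number of True (missing) cells per column
df = df.dropna(axis=1, how="all")        # drop columns that are entirely missing
clean = df.dropna(axis=0, how="any")     # drop rows that still contain a missing value
```

Dropping empty columns before dropping rows avoids losing every row just because one column was never recorded.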

3.3 DETECTING AND REMOVING OUTLIERS

Fig 2.1 Boxplot of Prev Close before removing outliers

Seaborn- is a library in Python predominantly used for making statistical graphics. Seaborn is
a data visualization library built on top of matplotlib and closely integrated with pandas data
structures in Python.

Fig 2.2 Boxplot of Prev Close after removing outliers

Fig 2.3 Boxplot of Open before removing outliers

Fig 2.4 Boxplot of High before removing outliers

df.describe() - Provides descriptive statistics that summarize the central tendency, dispersion,
and shape of a dataset's distribution.

df.shape - Returns a tuple representing the dimensions. For example, an output of (48, 14)
represents 48 rows and 14 columns.
df.info() - Provides a summary of the data including the index data type, column data types,
non-null values and memory usage.
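
The report does not state which rule was used to remove the outliers seen in the boxplots. A common choice, and the one a boxplot's whiskers visualize, is the 1.5 × IQR rule. The sketch below assumes that rule; the helper name remove_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# 250 is an obvious outlier among values close to 100 (made-up data).
df = pd.DataFrame({"Prev Close": [100, 102, 101, 99, 103, 98, 250]})
filtered = remove_outliers_iqr(df, "Prev Close")
print(df.shape, "->", filtered.shape)   # compare dimensions before and after
```

Comparing df.shape before and after the filter shows how many rows were dropped for that column.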

3.4 TRAIN TEST SPLIT

df.iloc[] – Selects rows and columns by integer position rather than by index label. It is
useful when the index labels of a data frame are something other than the numeric series
0, 1, 2, 3, …, n, or when the user doesn't know the index labels. Rows are extracted by their
position, which isn't visible in the data frame.
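
For example, position-based selection is a convenient way to separate feature columns from the target column before modelling. The sketch below uses a three-row dataframe with date labels; the data and the choice of the last column as target are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Open": [100.0, 101.5, 102.0],
     "High": [102.0, 103.0, 104.5],
     "Close": [101.0, 102.5, 104.0]},
    index=pd.to_datetime(["2021-04-28", "2021-04-29", "2021-04-30"]),  # non-numeric labels
)

X = df.iloc[:, :-1]      # every row, all columns except the last -> features
y = df.iloc[:, -1]       # every row, last column only -> target
first_row = df.iloc[0]   # first row by position, whatever its label is
```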

Train_test_split

In machine learning, it is a common practice to split your data into two different sets.
These two sets are the training set and the testing set. As the name suggests, the training set is
used for training the model and the testing set is used for testing the accuracy of the model.
While training a machine learning model we are trying to find a pattern that best
represents all the data points with minimum error. While doing so, two common errors come
up. These are overfitting and underfitting.
Ideally, you should not test on training data. Your model might be overfitting the training set
and hence will fail on new data. Good accuracy on the training dataset can’t guarantee the
success of your model on unseen data.
This is why it is recommended to keep training data separate from the testing data.
The basic idea is to use the testing set as unseen data. After training your model on the training
set, you should test it on the testing set. If your model performs well on the testing set,
you can be more confident about your model.
The most common split ratio is 80:20. That is, 80% of the dataset goes into the training set and
20% of the dataset goes into the testing set. Before splitting the data, make sure that the dataset
is large enough. Train/test split works well with large datasets.
To split the data we will be using train_test_split from sklearn. train_test_split randomly
distributes your data into training and testing sets according to the ratio provided.
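
An 80:20 split with train_test_split might look like the sketch below. The arrays are placeholder data, and random_state=42 is an arbitrary seed chosen here only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features (placeholder data)
y = np.arange(50)                   # 50 target values

# test_size=0.2 gives the 80:20 ratio; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

By default the rows are shuffled before splitting. For time-ordered price data, passing shuffle=False keeps the chronological order, so the model is tested on later dates than it was trained on.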

3.5 LINEAR REGRESSION

Simple linear regression – It is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
Accuracy – It is the measurement used to determine which model is best at identifying
relationships and patterns between variables in a dataset based on the input, or training, data.
For a regression model, the score() method in scikit-learn reports the R2 score rather than a
classification accuracy.
MSE – It is the average of the squared error and is used as the loss function for least squares
regression. It is the sum, over all the data points, of the square of the difference between the
predicted and actual target values, divided by the number of data points.
RMSE – It is the square root of the MSE and is measured in the same units as the target
variable. Because the errors are squared before averaging, RMSE, like MSE, penalizes larger
errors more severely.
R2 score – It is a very important metric that is used to evaluate the performance of a
regression-based machine learning model. It is pronounced as R squared and is also known as
the coefficient of determination. It measures the proportion of the variance in the target
variable that is explained by the model's predictions.
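
The definitions in this section can be made concrete with a short NumPy sketch that fits a simple (single-feature) linear regression and computes MSE, RMSE and R2 by hand. The five data points are made up for illustration, and np.polyfit stands in for whatever fitting routine was used in the internship.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # single feature (made-up values)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # response

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares fit of y = slope*x + intercept
y_pred = slope * x + intercept

mse = np.mean((y - y_pred) ** 2)             # average of the squared errors
rmse = np.sqrt(mse)                          # square root of MSE, same units as y
ss_res = np.sum((y - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)         # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot                     # coefficient of determination
```

An R2 close to 1 means the fitted line explains almost all of the variation in y; a value near 0 means it does no better than always predicting the mean.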

3.6 SUPPORT VECTOR MACHINE ALGORITHM

Support Vector Machine (SVM) – It is a supervised machine learning algorithm capable
of performing classification, regression and even outlier detection. The linear SVM classifier
works by drawing a straight line between two classes. ... This is where the LSVM algorithm
comes into play.
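
For regression, scikit-learn provides the SVR estimator. The abbreviation list mentions sc (standard scaler) and svr, which suggests that the features were scaled before fitting, but the exact setup is not given in the report. The sketch below is therefore only an assumption of how such a model could be trained, using synthetic data and default hyperparameters.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))               # synthetic feature
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)    # noisy linear target

# Scale the feature, then fit an RBF-kernel SVR with default hyperparameters.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(X[:160], y[:160])                          # first 160 samples for training
predictions = model.predict(X[160:])                 # remaining 40 samples for testing
r2 = model.score(X[160:], y[160:])                   # R2 on the held-out samples
```

SVR is sensitive to the scale of its inputs, which is why the scaler is placed in the same pipeline as the model.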
Comparing Two Algorithms
• Linear Regression
• Support Vector Machine

Linear Regression                            | Support Vector Machine
---------------------------------------------+---------------------------------------------
1. Linear regression is a linear approach    | 1. The Support Vector Regression (SVR) uses
   for modeling the relationship between a   |    the same principles as the SVM for
   scalar dependent variable y and one or    |    classification, with only a few minor
   more explanatory variables denoted X.     |    differences.
2. Accuracy achieved in Linear Regression    | 2. Accuracy achieved in Support Vector
   is 84%.                                   |    Machine is 56%.
3. Accuracy rate is high when compared to    | 3. Accuracy rate is low when compared to
   SVM.                                      |    Linear Regression.

Table 1. Difference between linear regression and Support Vector Machine

3.7 RESULT

• The Linear Regression algorithm is best for achieving maximum accuracy for our dataset.
• Accuracy score achieved = 0.8422
• Mean Score = 0.831
• MSE = 0.00
• RMSE = 0.07
• R2 = 0.84

3.8 DISCUSSION AND CONCLUSION

➢ The dataset was quite small, with just 5307 samples, and 2851 data samples were dropped
during preprocessing.
➢ Visualizing the distribution of the data and the relationships between features helped us
to get some insights into the feature set.
➢ Testing multiple algorithms with default hyperparameters gave us some understanding
of how various models perform on this specific dataset.
➢ Linear regression performed better than the Support Vector Machine on this dataset, so it
is the safer choice; a simple linear model is also more generalizable.

REFERENCE

[1] https://www.kaggle.com/rohanrao/nifty50-stock-market-data?select=TATAMOTORS.csv
[2] https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
[3] https://www.analyticsvidhya.com/blog/2021/05/all-you-need-to-know-about-your-first-machine-learning-model-linear-regression/
[4] https://www.simplilearn.com/top-python-libraries-for-data-science-article
[5] https://realpython.com/linear-regression-in-python/
[6] https://towardsdatascience.com/linear-regression-from-scratch-with-numpy-implementation-finally-8e617d8e274c
