Project Report
Submitted by
C KEERTHANA (113219031071)
M SAI RAMYA (113219031128)
V MAHALAKSHMI (113219031083)
Of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
2021-2022
VELAMMAL ENGINEERING COLLEGE,
CHENNAI-66
BONAFIDE CERTIFICATE
Sl. No   Name of the student who     Title of the Internship      Name of faculty coordinator
         has done the Internship                                  with designation
1        C. KEERTHANA                STOCK PREDICTION ANALYSIS    MS. VAISHNAVI PILLALAMARRI
2        M. SAI RAMYA                STOCK PREDICTION ANALYSIS    MS. VAISHNAVI PILLALAMARRI
3        V. MAHALAKSHMI              STOCK PREDICTION ANALYSIS    MS. VAISHNAVI PILLALAMARRI
This report of internship work, submitted by the above students in partial fulfillment
of the requirements for the award of the Bachelor of Engineering degree in Computer
Science and Engineering at Velammal Engineering College, was evaluated and confirmed
to be a report of the work done by the above students, and was then assessed.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATION
ABSTRACT
In the finance world, stock trading is one of the most important activities. Stock
market prediction is the act of trying to determine the future value of a stock or other financial
instrument traded on a financial exchange. This paper explains the prediction of a stock using
Machine Learning. Most stockbrokers use technical and fundamental analysis, or time series
analysis, when making stock predictions. The programming language used to predict the stock
market with machine learning is Python. In this paper we propose a Machine Learning (ML)
approach that is trained on the available stock data and then uses the acquired knowledge to
make accurate predictions. In this context, this study uses a machine learning technique called
Linear Regression to predict stock prices for large and small capitalizations and in three
different markets, employing prices with both daily and up-to-the-minute frequencies.
The data is the price history and trading volumes of the fifty stocks in the NIFTY 50
index from the NSE (National Stock Exchange) of India. All datasets are at day level, with
pricing and trading values split across .csv files for each stock, along with a metadata file
containing some macro-information about the stocks themselves. The data spans from
1st January, 2000 to 30th April, 2021. Since new stock market data is generated and made
available every day, the dataset will be updated once a month in order to have the latest and
most useful information.
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATION
CHAPTER 1
Machine learning has moved from the stuff of science fiction to a staple of modern
business, as organizations across nearly every industry vertical implement ML technologies.
Experts say machine learning enables businesses to perform tasks on a scale and scope
previously impossible to achieve. As a result, it speeds up the pace of work, reduces errors and
improves accuracy, thereby aiding employees and customers alike. Moreover, innovation-
oriented organizations are finding ways to harness machine learning to not just drive
efficiencies and improvements but to fuel new business opportunities that can differentiate their
companies in the marketplace.
Machine learning applications don't just help companies set prices; they also help
companies deliver the right products and services to the right areas at the right time through
predictive inventory planning and customer segmentation. Retailers, for example, use machine
learning to predict what inventory will sell best in which of their stores based on the seasonal
factors impacting a particular store, the demographics of that region and other data points.
stop by playing an advisory role. It is just our starting point. We walk along with you in
executing these strategies end-to-end, to ensure business success.
Website: https://www.mitsquare.com
Industry: Information Technology & Services
Headquarters: London, UK
CHAPTER 2
EXECUTIVE SUMMARY
2.2.2 METHODOLOGY
The methods we applied to this dataset are:
• Data visualization
• Data pre-processing – removing missing values and outliers
• Train/test split
• Linear regression algorithm
• Support vector machines
2.2.3 AIM AND OBJECTIVE
• plot()
Pandas – It is a Python package providing fast, flexible, and expressive data structures
designed to make working with “relational” or “labeled” data both easy and intuitive. It is
well suited to many different kinds of data, including tabular data with heterogeneously typed
columns, as in an SQL table or Excel spreadsheet.
matplotlib.pyplot – It is a collection of functions that make matplotlib work like MATLAB.
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting
area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
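As a minimal sketch of how the two libraries fit together, the example below builds a small DataFrame and plots one column with pyplot. The column names and values are illustrative assumptions, not rows from the project dataset, and the non-interactive "Agg" backend is chosen only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only; the real data would be read from one of the stock .csv files.
prices = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-04-27", "2021-04-28", "2021-04-29", "2021-04-30"]),
        "Close": [301.5, 303.2, 299.8, 305.1],
        "Volume": [120000, 98000, 143000, 110500],
    }
)

plt.figure()                                # creates a figure
plt.plot(prices["Date"], prices["Close"])   # plots a line in the plotting area
plt.xlabel("Date")                          # decorates the plot with labels
plt.ylabel("Close")
plt.title("Closing price (illustrative data)")
ax = plt.gca()                              # handle to the current plotting area
```

Each pyplot call above acts on the current figure, which is the MATLAB-like behaviour described in the text.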
CHAPTER 3
3.1
Fig 1.4 Histogram plot of dataset
3.2
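df.isnull() - Returns a Boolean DataFrame of the same shape as df, with True for missing
cells and False for non-missing cells.
df.dropna() - Drops rows or columns that contain null values; its parameters control whether
rows or columns are dropped and how many nulls are needed to trigger the drop.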
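The two calls can be sketched on a toy DataFrame as follows. The column names and the particular dropna() options are assumptions chosen for illustration; the report does not state which options were used on the stock data.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing cells (np.nan).
df = pd.DataFrame(
    {
        "Prev Close": [100.0, 101.5, np.nan, 103.0],
        "High": [102.0, np.nan, 104.5, 105.0],
        "Trades": [np.nan, np.nan, np.nan, np.nan],
    }
)

missing_mask = df.isnull()                # True where a cell is missing
missing_per_column = missing_mask.sum()   # number of missing cells in each column

# Drop columns that are entirely empty, then rows that still contain any null.
no_empty_cols = df.dropna(axis=1, how="all")
complete_rows = no_empty_cols.dropna(axis=0, how="any")
```

Counting the missing cells per column before dropping anything makes it easier to judge how much data a given dropna() call will discard.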
3.3
Seaborn – It is a Python library predominantly used for making statistical graphics. It is
a data visualization library built on top of matplotlib and closely integrated with pandas data
structures.
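A boxplot comparison before and after outlier removal might be sketched as below. The 1.5 × IQR rule used to filter the outliers is an assumption, since the report does not state which criterion was applied, and the values are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative values with one obvious outlier (500).
df = pd.DataFrame({"High": [100, 102, 101, 103, 99, 104, 98, 500]})

# Assumed criterion: keep values within 1.5 * IQR of the quartiles.
q1 = df["High"].quantile(0.25)
q3 = df["High"].quantile(0.75)
iqr = q3 - q1
in_range = df["High"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[in_range]

# Side-by-side boxplots of the column before and after filtering.
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(8, 3))
sns.boxplot(x=df["High"], ax=ax_before)
ax_before.set_title("High before removing outliers")
sns.boxplot(x=df_clean["High"], ax=ax_after)
ax_after.set_title("High after removing outliers")
```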
Fig 2.2 Boxplot of Prev Close after removing outliers
Fig 2.4 Boxplot of High before removing outliers
df.describe() - Provides descriptive statistics that summarize the central tendency, dispersion,
and shape of a dataset's distribution, excluding missing values.
df.shape - Returns a tuple representing the dimensions. For example, an output of (48, 14)
represents 48 rows and 14 columns.
df.info() - Provides a summary of the data including the index data type, column data types,
non-null values and memory usage.
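The three inspection calls can be tried on a small DataFrame as in the sketch below; the columns and values are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 104.0, 106.0],
        "High": [101.0, 103.5, 105.0, 108.0],
        "Symbol": ["ABC", "ABC", "ABC", "ABC"],  # hypothetical ticker symbol
    }
)

summary = df.describe()    # count, mean, std, min, quartiles and max of numeric columns
n_rows, n_cols = df.shape  # (rows, columns) tuple
df.info()                  # prints index dtype, column dtypes, non-null counts, memory usage
```

By default describe() summarizes only the numeric columns, so the text column "Symbol" does not appear in `summary`.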
3.4
df.iloc[] - Selects rows and columns by integer position (0, 1, 2, 3, …, n-1) rather than by
label. It is useful when the index labels of a data frame are something other than a numeric
series, or when the user doesn't know the index labels: rows can still be extracted by their
position, even though that position isn't displayed in the data frame.
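The sketch below shows position-based selection on a DataFrame whose index is made of dates rather than integers. Using iloc to separate feature columns from a target column is one common pattern and is shown here as an assumption, not as the exact slicing used in the project.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 104.0],
        "High": [101.0, 103.0, 105.0],
        "Close": [100.5, 102.5, 104.5],
    },
    index=pd.to_datetime(["2021-04-28", "2021-04-29", "2021-04-30"]),
)

first_row = df.iloc[0]   # first row by position, whatever its date label is
last_two = df.iloc[-2:]  # last two rows
X = df.iloc[:, :-1]      # all rows, every column except the last (features)
y = df.iloc[:, -1]       # all rows, last column only (target)
```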
Train_test_split
In machine learning, it is a common practice to split your data into two different sets.
These two sets are the training set and the testing set. As the name suggests, the training set is
used for training the model and the testing set is used for testing the accuracy of the model.
While training a machine learning model we are trying to find a pattern that best
represents all the data points with minimum error. While doing so, two common errors come
up. These are overfitting and underfitting.
Ideally, you should not test on training data. Your model might be overfitting the training set
and hence will fail on new data. Good accuracy in the training dataset can’t guarantee the
success of your model on unseen data.
This is why it is recommended to keep training data separate from the testing data.
The basic idea is to use the testing set as unseen data. After training your model on the
training set, you should test it on the testing set. If your model performs well on the testing
set, you can be more confident about your model.
The most common split ratio is 80:20. That is, 80% of the dataset goes into the training set and
20% of the dataset goes into the testing set. Before splitting the data, make sure that the dataset
is large enough. Train/Test split works well with large datasets.
To split the data we will be using train_test_split from sklearn.model_selection. It randomly
distributes your data into training and testing sets according to the ratio provided.
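An 80:20 split can be sketched as follows. The arrays are synthetic and the `random_state` value is an arbitrary assumption used only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples with 2 features each, and one target value per sample.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.2 sends 20% of the samples to the testing set and 80% to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Because the rows are shuffled before splitting, chronological order is lost; train_test_split also accepts `shuffle=False` if the time order of the rows needs to be preserved.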
3.5
Simple linear regression – It is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
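Simple linear regression with a single feature can be sketched as below. The data is synthetic and lies exactly on the line y = 2x + 1, so the fitted slope and intercept can be checked by eye; scikit-learn's LinearRegression is used here as an assumed implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Single feature x (as a column) and a response y that follows y = 2x + 1 exactly.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

model = LinearRegression().fit(x, y)

slope = model.coef_[0]        # estimated coefficient of x
intercept = model.intercept_  # estimated constant term
prediction = model.predict(np.array([[6.0]]))[0]  # predicted response at x = 6
```

On real price data the points will not lie exactly on a line, and the fitted line is the one that minimizes the squared error over the training set.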
Accuracy – It is the measurement used to determine which model is best at identifying
relationships and patterns between variables in a dataset based on the input, or training, data.
MSE - It is the average of the squared error that is used as the loss function for least squares
regression. It is the sum, over all the data points, of the square of the difference between the
predicted and actual target variables, divided by the number of data points.
RMSE – It is the square root of the MSE, so it is measured in the same units as the target
variable. Due to its formulation, RMSE, just like the squared loss function that it derives
from, penalizes larger errors more severely.
R2 score - It is a very important metric that is used to evaluate the performance of a
regression-based machine learning model. It is pronounced as R squared and is also known as
the coefficient of determination. It works by measuring the proportion of the variance in the
target variable that is explained by the model's predictions.
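The three error metrics can be computed directly from their definitions, as in the sketch below. The actual and predicted values are made-up numbers chosen for easy arithmetic, not results from the project.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # actual target values (illustrative)
y_pred = np.array([2.5, 5.0, 7.5, 9.0])  # model predictions (illustrative)

errors = y_true - y_pred

# MSE: sum of squared errors divided by the number of data points.
mse = np.mean(errors ** 2)

# RMSE: square root of the MSE, in the same units as the target variable.
rmse = np.sqrt(mse)

# R2: one minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

For these numbers the squared errors sum to 0.5 over four points, giving an MSE of 0.125, and the total variation around the mean of 6 is 20, giving an R2 of 0.975.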
3.6 SUPPORT VECTOR MACHINE ALGORITHM
3.7 RESULT
• The Linear Regression algorithm achieved the highest accuracy on our dataset.
• Accuracy score achieved = 0.8422
• Mean score = 0.831
• MSE = 0.00
• RMSE = 0.07
• R2 = 0.84
➢ The dataset was quite small, with just 5307 samples, and 2851 of those samples were
dropped during preprocessing.
➢ Visualizing the distribution of the data and the relationships between features helped us
gain some insights into the feature set.
➢ Testing multiple algorithms with default hyperparameters gave us some understanding
of how various models perform on this specific dataset.
➢ It is safe to use the linear regression algorithm: it performed better than the other
algorithms, although their scores were quite comparable, and it is also more generalizable.
REFERENCE
[1] https://www.kaggle.com/rohanrao/nifty50-stock-market-data?select=TATAMOTORS.csv
[2] https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
[3] https://www.analyticsvidhya.com/blog/2021/05/all-you-need-to-know-about-your-first-machine-learning-model-linear-regression/
[4] https://www.simplilearn.com/top-python-libraries-for-data-science-article
[5] https://realpython.com/linear-regression-in-python/
[6] https://towardsdatascience.com/linear-regression-from-scratch-with-numpy-implementation-finally-8e617d8e274c