PBL Project

MAULANA AZAD NATIONAL INSTITUTE OF TECHNOLOGY BHOPAL
DEPARTMENT OF MATHEMATICS, BIOINFORMATICS AND COMPUTER APPLICATIONS
REPORT ON:
“STOCK MARKET PREDICTION USING
SENTIMENT ANALYSIS AND MACHINE
LEARNING”
REPORT BY:
POORTI JAIN (214104018)
SNEH YADAV (214104019)
UNDER THE GUIDANCE OF
DR. MANOJ JHA
ABSTRACT
This project aims to develop a stock market
prediction model using sentiment analysis and
Machine Learning (ML). The stock market is
highly volatile and unpredictable, with various
factors affecting its performance. One such
factor is the sentiment of investors, which can be
analysed through news articles, social media, and
other sources.
The proposed model uses ML algorithms to
analyse the sentiment of market-related news and
social media posts in order to predict the movement
of the stock market. It uses natural language
processing (NLP) techniques to extract
sentiment-related features from the data.
- The datasets used in the project are collected from
various sources, including financial news outlets,
Twitter, and other social media platforms.
- The collected data is then preprocessed and
transformed into a format suitable for ML
algorithms.
- ML algorithms are trained and evaluated on the
preprocessed dataset to find the best-performing
model. In this project we have used the random
forest algorithm, which combines the output of
many decision trees to reach a single result. Other
algorithms, such as multinomial naïve Bayes, can
also be tried and may give better accuracy.
The performance of the model is measured using a
confusion matrix, which visualises and summarises
the predictions of the random forest algorithm.
From this matrix you can obtain the precision,
recall and F1 score, as well as the overall
accuracy.
INTRODUCTION
The stock market is full of uncertainty and is
affected by many factors, such as company news
and performance, industry performance, investor
sentiment and economic factors, that can cause
the price of a stock to rise or fall. Common
problems faced by investors include high market
volatility, loss of money, stock market crashes,
poor investment skills and lack of market
knowledge. These may lead to wrong decisions:
if an investor makes a wrong decision while
buying or selling shares, they may face a loss.
Hence, before investing money, it is very
important for investors to predict the stock
market, and stock market prediction is one of
the important tasks in business and finance.
About Natural Language Processing
Natural language processing (NLP)
enables computers to communicate with
humans. With the help of NLP, computers
can read text, hear speech, interpret it,
measure sentiment, and determine which
parts of a text are important. NLP allows
machines to break down and interpret
human language. NLP techniques are
widely used for projects like chatbot
creation, spam filters, social media
monitoring, etc.
About Sentiment Analysis Project:
This is a machine learning project in
which, with the help of machine learning
algorithms and techniques, we classify
whether the sentiment of a text or news
headline is positive, negative, or neutral.
PROBLEM STATEMENT
Market sentiment analysis involves analysing
the sentiment of investors and the public
towards the market to predict its movement.
This sentiment analysis can be performed
using natural language processing (NLP)
techniques and machine learning (ML)
algorithms on news articles, social media
posts, and other sources.
However, the effectiveness of sentiment
analysis techniques in predicting stock market
movements is still under debate. While some
studies have reported promising results,
others have found limited or inconsistent
predictability. Thus, the problem statement of
this project is to develop a robust and
accurate stock market prediction model using
sentiment analysis in ML.
SYSTEM ARCHITECTURE
DATASET DESCRIPTION
There are two channels of data provided in this dataset:
1. News data: historical news headlines from Twitter and the Reddit
WorldNews channel (/r/worldnews). They are ranked by Reddit
users' votes, and only the top 25 headlines are considered for a
single date.
(Range: 2008-06-08 to 2016-07-01)
2. Stock data: the Dow Jones Industrial Average (DJIA) is used to
"prove the concept". (DJIA is an index that tracks 30 large, publicly-
owned companies trading on U.S. stock exchanges.)
The DJIA data is extracted directly from Yahoo Finance.
To make things easier, this combined dataset contains 27 columns.
The first column is "Date", the second is "Label", and the following
ones are news headlines ranging from "Top1" to "Top25".
This is a binary classification task. Hence, there are only two labels:
"1" when DJIA Adj Close value rose.
"0" when DJIA Adj Close value decreased or stayed the same.
PROCESS
1. Sentiment Analysis Dataset
The dataset we will be using to develop this machine
learning sentiment analysis project is a combination of
the world news and stock price shifts available
on Kaggle.
The dataset is a CSV file that consists of 27 columns
(Date, Label, and 25 headline columns). With the help of
this data, we will train our ML model to predict the
sentiment of the news.
2. Required Libraries:
You need to install certain libraries on your system to
implement the Python sentiment analysis project. The
required libraries are:
- NumPy (pip install numpy)
- Pandas (pip install pandas)
- Matplotlib (pip install matplotlib)
- Natural Language Toolkit (NLTK) (pip install nltk)
- scikit-learn, imported as sklearn (pip install scikit-learn)
To install them on your system you can use the pip installer:
open your command prompt and type pip install numpy,
pip install pandas, etc.
Now let’s start implementing and understanding
Sentiment Analysis:
1.) IMPORT LIBRARIES:
We will import each library at the point where we
need it. So in the first step we import only two
libraries, pandas and nltk.
2.) READ THE DATASET:
Using the pandas method read_csv(), we read
the dataset that we have downloaded above.
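The reading step can be sketched as follows. This is only an illustration: the CSV text is a made-up stand-in with two of the 25 headline columns, and the file name in the comment is hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded CSV (made-up rows, 2 of the 25 headline columns).
csv_text = """Date,Label,Top1,Top2
2008-08-08,0,"Headline one, with a comma",Headline two
2008-08-11,1,Headline three,Headline four
"""

# With the real file you would pass its path instead,
# e.g. pd.read_csv("news_djia.csv")  (hypothetical file name).
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names
```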
3.) ANALYSING DATASET:
Now we will analyse our data.
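A few typical first checks are sketched below on made-up rows. Which checks were actually run in this project is not recorded here, so treat the selection as an assumption.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-11", "2008-08-12", "2008-08-13"],
    "Label": [0, 1, 0, 1],
    "Top1": ["Headline A", "Headline B", None, "Headline D"],
})

print(df.head())                           # first few rows
print(df.shape)                            # (rows, columns)

label_counts = df["Label"].value_counts()  # how many days fall in each class
missing = df.isna().sum()                  # missing values per column
print(label_counts)
print(missing)
```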
4.) SPLIT DATASET INTO TRAINING DATA AND TEST DATA:
Here the dataset is split into training and test data on the basis of
dates. You can also split your data on any other basis.
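A date-based split might look like the sketch below. The rows are made up and the cutoff date is a hypothetical choice for illustration, not necessarily the one used in this project.

```python
import pandas as pd

# Made-up rows around a hypothetical cutoff date.
df = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [1, 0, 1, 0],
    "Top1": ["a", "b", "c", "d"],
})
df["Date"] = pd.to_datetime(df["Date"])

cutoff = pd.Timestamp("2015-01-01")   # hypothetical cutoff
train = df[df["Date"] < cutoff]       # everything before the cutoff
test = df[df["Date"] >= cutoff]       # everything from the cutoff onwards

print(len(train), len(test))
```

Splitting by date keeps every test day later than every training day, so the model is never evaluated on days that precede its training data.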
5.) CLEANING OUR DATA:
In this step, we will clean our text data using NLP.
We initialise the stopwords and punctuation so that we can remove them from
our text data, and rename the columns for ease of access.
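A minimal sketch of the punctuation removal and column renaming, on one made-up row. Stopword removal, which the text also mentions, is left out here because the NLTK stopword list has to be downloaded separately; the regular expression and the new column names are assumptions.

```python
import pandas as pd

# One made-up row with two headline columns.
train = pd.DataFrame({
    "Date": ["2008-08-08"],
    "Label": [0],
    "Top1": ["Oil prices fall; traders 'worried'!"],
    "Top2": ["Markets rally: stocks up 3%."],
})

# Keep only the headline columns (everything after Date and Label).
data = train.iloc[:, 2:].copy()

# Replace every character that is not a letter with a space.
data = data.replace(r"[^a-zA-Z]", " ", regex=True)

# Rename the columns to "0", "1", ... for easier access.
data.columns = [str(i) for i in range(data.shape[1])]

print(data)
```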
Next, we take only the text data and convert it to lower case
using the lower() method.
We then join all the headlines of a day into a single string for the
vectorisation.
In the first headline you can now see that all the punctuation is gone and
all the letters are in lowercase.
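The lowercasing and joining can be sketched like this, assuming the headline columns have already been cleaned and renamed. The two rows are made up.

```python
import pandas as pd

# Made-up, already cleaned headline columns (one row per day).
data = pd.DataFrame({
    "0": ["Oil Prices Fall", "Tech Shares Jump"],
    "1": ["Central Bank Holds Rates", "Exports Rise Again"],
})

# Convert every headline column to lower case.
for col in data.columns:
    data[col] = data[col].str.lower()

# Join all headlines of one day into a single string.
headlines = [" ".join(str(x) for x in row) for row in data.values.tolist()]

print(headlines[0])
```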
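6.) VECTORISING THE DATA:
In this step, we will vectorise our text data using the CountVectorizer
class provided by sklearn. Vectorising a text means representing it as a row
of numbers with one column per word in the vocabulary: a column holds the
number of times that word occurs in the text, and 0 if the word is absent.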
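The following sketch shows what CountVectorizer produces for two made-up sentences. The default settings are used here as an assumption; the project may have used different options.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents.
docs = ["stocks rise as oil prices rise", "oil prices fall"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

vocab = vectorizer.vocabulary_       # maps each word to its column index
dense = X.toarray()

print(sorted(vocab))                 # the learned vocabulary
print(dense)                         # word counts per document
```

Passing binary=True to CountVectorizer would store 1 for every word that is present instead of its count.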
Next, train a random forest classifier on the vectorised headlines. It
combines the output of many decision trees to reach a single prediction.
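Training and applying the classifier might look like the sketch below. The headlines and labels are made up, and the hyperparameters (n_estimators, random_state) are illustrative assumptions rather than the values used in this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and test data (one joined headline string per day).
train_headlines = [
    "markets rally on strong earnings",
    "shares fall after weak outlook",
    "strong growth lifts markets",
    "weak demand drags shares lower",
]
train_labels = [1, 0, 1, 0]
test_headlines = ["earnings lift shares", "outlook drags markets lower"]

# Learn the vocabulary on the training text only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_headlines)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, train_labels)

# Reuse the fitted vocabulary for the test text, then predict.
X_test = vectorizer.transform(test_headlines)
predictions = clf.predict(X_test)
print(predictions)
```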
7.) PREDICTION:
Now the last step is to evaluate the model that we have created on the test
data. As the results show, our model is able to classify the sentiment of the
text accurately.
Let’s also plot the confusion matrix:
OUTPUT:
You can also see the predictions array on the basis of which the matrix is
built and the result is concluded.
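The evaluation can be sketched with sklearn's metrics functions. The labels below are made up for illustration and are not the project's results; the matrix is printed rather than plotted to keep the example short.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true labels and predictions.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(accuracy)
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
```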
RESULT:
Basic terms:-
F1 score is a metric used in machine learning to evaluate a binary
classification model, taking both precision and recall into account.
Precision measures how often the model is correct when it predicts a
positive instance, while recall measures how well the model is able to find all the
positive instances in a dataset.
The F1 score combines these two metrics, as their harmonic mean, into a single
score that summarises the performance of the model on the positive class. F1
scores range from 0 to 1 and are often used to compare the performance of
different models or to optimize the hyperparameters of a single model.
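As a worked example, the three metrics can be computed directly from the counts of true positives, false positives and false negatives. The counts used here are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); assumes tp + fp > 0 and tp + fn > 0."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(precision, recall, f1)
```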
Macro average is the usual average we’re used to seeing: add the per-class
scores up and divide by the number of classes. Weighted average takes into
account how many instances of each class there were, so a class with fewer
instances has less impact on the weighted average of precision, recall
and F1 score.
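The difference between the two averages is easy to see on a small made-up example with two classes of unequal size.

```python
# Made-up per-class F1 scores and class sizes (support).
f1_per_class = {0: 0.5, 1: 0.9}
support = {0: 10, 1: 30}

# Macro average: every class counts equally.
macro_avg = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: every class counts in proportion to its size.
total = sum(support.values())
weighted_avg = sum(f1_per_class[c] * support[c] for c in f1_per_class) / total

print(macro_avg, weighted_avg)
```

Here the weighted average is pulled towards the score of class 1 because that class has three times as many instances.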
Result breakdown:-
As you can see in the confusion matrix, there are 139 true negatives (label 0)
and 182 true positives (label 1), which means the model correctly identified
more days on which the stock price rose than days on which it did not. How far
these predictions can be trusted is indicated by the F1 score and the other
metrics, such as precision.
So we have successfully created a machine learning model which is able
to predict the sentiment of news headlines and tell us whether
the stock will go up or go down, with a precision of 84.9%.
In this Python sentiment analysis project, we have learned how to perform
operations and data pre-processing on text data using natural language
processing and sentiment analysis.
You can make the same prediction for any company by collecting a dataset of
its news headlines and applying the same process to that dataset.
References:
1. "Stock price prediction using machine learning" by Yash Patel and Yogesh
Patel (2018)
2. Machine learning approach for sentiment analysis
iq.opengenus.org/ml-for-sentiment-analysis/
3. Comprehensive Study on Sentiment Analysis: Types
https://ieeexplore.ieee.org/document/9213209
4. Using Machine Learning for Sentiment Analysis: a Deep Dive
https://www.datarobot.com/blog/using-machine-learning-for-sentimentanalysis
5. Dataset from:
Daily News for Stock Market Prediction | Kaggle
Yahoo Finance - Stock Market Live, Quotes, Business & Finance News
News (reddit.com)
Dow Jones INDEX TODAY | DJIA LIVE TICKER | Dow Jones QUOTE & CHART | Markets Insider
(businessinsider.com)
