CCP Final Stock Market
INTRODUCTION
The environment and public health greatly depend on the quality of the air we breathe.
Knowing how pollutants affect our health depends in large part on the collection,
monitoring, and analysis of data on air quality. Using a dataset with a plethora of data on
air pollutants and meteorological variables, we explore the field of air quality data
analysis and prediction in this report. Our goal is to forecast air quality indices (AQI) and
obtain important insights into trends in air quality by utilizing data science and machine
learning. A multitude of factors, such as particulate matter (PM2.5 and PM10), different
gasses (NO, NO2, NOx, CO, SO2, O3), and volatile organic compounds (benzene,
toluene, xylene), are involved in the complex and dynamic domain of air quality data.
Furthermore, weather factors like humidity, temperature, and wind speed can have a big
impact on air quality. 'city_day.csv,' the dataset we have access to, contains a significant
amount of information from several cities and presents a complete picture of air quality in
different scenarios. This report provides a thorough analysis of the data, including
visualization, clustering analysis, predictive modeling, and data preprocessing. First, we
must complete the crucial work of cleaning and preparing the data. We deal with missing
values, eliminate outliers, standardize the data, and reduce dimensionality using Principal
Component Analysis (PCA). These preprocessing procedures are essential to guarantee
the reliability of the analyses that follow. One of this report's highlights is the use of
K-Means clustering to reveal innate patterns and clusters within the air quality
data. This helps us comprehend the relationships and
groupings between the different pollutants and meteorological variables on a deeper
level. Additionally, we examine the dataset's correlation structure in more detail. We can
learn a great deal about the interactions between variables by examining the correlation
matrix. These insights can be very helpful to environmental scientists and policymakers.
Predictive modeling is where machine learning comes into play. With the help of a
Random Forest Regressor, we hope to predict air quality indices (AQI) from the
information at hand. With major ramifications for environmental management and public
health, this predictive model offers a potent tool for predicting air quality conditions. To
sum up, this report offers a thorough investigation of the analysis and prediction of air
quality data, demonstrating the potential of data science and machine learning to improve
our comprehension of air quality trends and enable better informed decision-making. The
analysis's conclusions could help create preventative strategies to deal with air quality
issues and safeguard both the environment and community health.
Abstract
Stock market prediction is the task of estimating the future value of a company's stock.
For this analysis we take the historical stock price and develop a model using a neural
network. Specifically, we use a recurrent neural network, which carries information
forward through time. LSTM networks maintain contextual information about their inputs
by means of a loop that permits information to travel from one step to the next; these
loops are what make recurrent neural networks appear so powerful. Each x_train sample
holds the previous 100 days of values for the corresponding day's y value. This yields
more accurate results when measured against existing stock price prediction algorithms.
The network is trained and evaluated for accuracy with various sizes of data, and the
results are tabulated.
A. Existing System
The existing system, based on SVM and the Backpropagation algorithm, performs no
dropout process. Because of this, unwanted data are processed, which wastes time and
memory space. Predicting future stock prices with SVM and Backpropagation is therefore
less efficient, and these algorithms are also not particularly effective at handling
non-linear data. In our proposal, future stock price prediction is instead done using
LSTM (Long Short-Term Memory), which yields a more accurate value for the next day
than SVM and the Backpropagation algorithm.
B. Proposed System
In the proposed system we try to find an accurate value for the next day's closing price,
which helps investors decide whether to buy or sell their shares. Long Short-Term
Memory (LSTM) is an artificial neural network from the field of deep learning. LSTM is
an advanced neural network with a memory cell that stores a small amount of data for
further reference. LSTM has feedback links that make it, in principle, a "general-purpose
computer". LSTM can also process an entire series of data, not only a single value such
as an image. Because of the dropout process that takes place in the LSTM algorithm, it is
comparatively faster than SVM and Backpropagation. The LSTM algorithm is more
suitable for predicting future stock prices than SVM and Backpropagation because it
removes undesired data, and its time and memory consumption are also reduced compared
to the existing system thanks to the dropout process. LSTM is likewise better suited to
handling non-linear data. We predict the stock prices of 7 companies, store them in
tabular format, and visualize them.
Stock market indexes around the world are powerful indicators for global and country-
specific economies. In the United States the S&P 500, Dow Jones Industrial Average, and
Nasdaq Composite are the three most broadly followed indexes by both the media and
investors. In addition to these three indexes, there are approximately 5,000 others that
make up the U.S. equity market.
KEY TAKEAWAYS –
There are approximately 5,000 U.S. indexes.
The three most widely followed indexes in the U.S. are the S&P 500, Dow Jones
Industrial Average, and Nasdaq Composite.
The Wilshire 5000 includes all the stocks from the U.S. stock market.
Indexes can be constructed in a wide variety of ways, but they are commonly
identified by capitalization and sector segregation.
INDEXES ABOUND –
With so many indexes, the U.S. market has a wide range of methodologies and
categorizations that can serve a broad range of purposes. The media most often reports on
the direction of the top three indexes regularly throughout the day with key news items
serving as contributors and detractors. Investment managers use indexes as benchmarks
for performance reporting.
Meanwhile, investors of all types use indexes as performance proxies and allocation
guides. Indexes also form the basis for passive index investing often done primarily
through exchange-traded funds that track indexes specifically.
Overall, an understanding of how market indexes are constructed and utilized can help to
add meaning and clarity for a wide variety of investing avenues. Below, we elaborate on
the three most followed U.S. indexes, the Wilshire 5000 which includes all the stocks
across the entire U.S. stock market, and a roundup of some of the other most notable
indexes.
PROJECT SCHEDULE
System Engineering: Sep
Requirement Analysis: Oct
Design: Nov–Dec
Coding: Nov–Feb
Testing: Mar–May
REQUIREMENTS ANALYSIS
Software Requirements:
I. Python: Python is a high-level, interpreted programming language known for its
simplicity and readability. It is versatile and widely used in various domains, including
web development, data analysis, machine learning, scientific computing, and more.
Python's easy-to-read syntax and extensive standard library make it a popular choice for
both beginners and experienced developers.
II. Python Libraries: The code relies on several Python libraries and modules. We install
these libraries using Python's package manager, pip. The required libraries include:
● pandas: Pandas is an open-source Python library that provides high performance data
structures and data analysis tools. It is particularly valuable for data manipulation,
cleaning, and analysis. Pandas introduces two main data structures, the DataFrame and
Series, which allow users to work with tabular data effectively. It's an essential tool for
anyone dealing with data in Python.
● numpy: NumPy, short for Numerical Python, is a fundamental library for numerical
and scientific computing in Python. It provides support for multidimensional arrays and
matrices, along with a wide range of mathematical functions to operate on these arrays.
NumPy is the foundation for many other scientific and data manipulation libraries in
Python, and it's crucial for tasks like linear algebra, statistics, and more.
● scikit-learn (sklearn): Scikit-learn is a machine learning library used for clustering and
predictive modeling in the code. It offers tools for data preprocessing, modeling, and
evaluation.
Hardware Requirements
The hardware requirements depend on the size of your dataset and the complexity of
the analysis. However, for a smooth execution of the code provided in this report, a
standard modern computer or laptop with the following specifications is typically
sufficient:
Processor: A multi-core processor, such as Intel Core i5 or AMD Ryzen, is
recommended for faster data processing.
Memory (RAM): At least 8 GB of RAM is advisable, especially when dealing with large
datasets.
Storage: Adequate storage space for the dataset and any additional data files or models.
An SSD is preferred for faster read/write operations.
Operating System: The code is compatible with Windows, macOS, and Linux operating
systems.
DESIGN OF LSTM
The Long Short-Term Memory network consists of four different gates for
different purposes, as described below:
Input Gate (i): It determines the extent to which information is written onto the Internal
Cell State.
Forget Gate (f): It determines the extent to which the previous Internal Cell State is
retained or forgotten at the current step.
Input Modulation Gate (g): It is often considered a sub-part of the input gate, and
much of the literature on LSTMs does not even mention it, assuming it is inside the Input
Gate. It is used to modulate the information that the Input Gate will write onto the
Internal Cell State by adding non-linearity to the information and making the
information zero-mean. This is done to reduce the learning time, as zero-mean input
has faster convergence. Although this gate's actions are less important than the others'
and it is often treated as a finesse-providing concept, it is good practice to include this
gate in the structure of the LSTM unit.
Output Gate (o): It determines what output (the next Hidden State) to generate from the
current Internal Cell State.
The basic workflow of a Long Short Term Memory Network is similar to the
workflow of a Recurrent Neural Network with the only difference being that the
Internal Cell State is also passed forward along with the Hidden State.
Take as input the current input, the previous hidden state, and the previous internal cell
state.
Calculate the values of the four different gates by following the steps below:
For each gate, calculate the parameterized vector by multiplying the current input and
the previous hidden state by the respective weight matrices for that gate and adding
the bias.
Apply the respective activation function for each gate element-wise to the
parameterized vector (sigmoid for the input, forget, and output gates; tanh for the
input modulation gate).
Calculate the current internal cell state by first calculating the element-wise
multiplication vector of the input gate and the input modulation gate, then calculate
the element-wise multiplication vector of the forget gate and the previous internal cell
state and then adding the two vectors.
Calculate the current hidden state by first taking the element-wise hyperbolic tangent
of the current internal cell state vector and then performing element-wise
multiplication with the output gate.
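The steps above can be sketched as a single forward step of an LSTM cell in NumPy. The function and variable names here are our own illustration (the report does not provide code), and a real network would learn the weight matrices rather than take them as arguments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W (4H x D), U (4H x H), and b (4H,) stack the
    parameters for the four gates in the order i, f, g, o."""
    H = h_prev.shape[0]
    # Parameterized vectors for all four gates, computed from the current
    # input and the previous hidden state.
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # input modulation gate (zero-mean)
    o = sigmoid(z[3 * H:4 * H])  # output gate
    # Current internal cell state: forget part of the old state, write the new.
    c = f * c_prev + i * g
    # Current hidden state: output gate applied to tanh of the cell state.
    h = o * np.tanh(c)
    return h, c
```

Unrolling `lstm_step` over a sequence, passing `h` and `c` forward at each step, reproduces the workflow described above, with the Internal Cell State passed forward alongside the Hidden State.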
Recurrent Unit
In the diagram of the recurrent unit, the blue circles denote element-wise multiplication.
The weight matrix W contains different weights for the current input vector and the
previous hidden state for each gate.
Just like Recurrent Neural Networks, an LSTM network also generates an output at each
time step and this output is used to train the network using gradient descent.
When performing normal text modelling, most of the pre-processing and modelling
work focuses on arranging data sequentially. Examples of such tasks are POS tagging,
stop-word elimination, and sequencing of the text. These methods try to make the data
understandable to a model, with less effort, according to known patterns, and they can
produce reasonable results.
Applying LSTM networks here has its own special advantages. We have discussed that
an LSTM can memorize the sequence of the data. It has one more feature: it works by
eliminating unused information, and as we know, text data always contains a lot of
unused information, which the LSTM can eliminate so that calculation time and cost
are reduced.
So, the combination of eliminating unused information and memorizing the sequence
of the information makes the LSTM a powerful tool for performing text classification
and other text-based tasks.
IMPLEMENTATION
SYSTEM ARCHITECTURE
There are many sources for obtaining share price data, and Python has libraries to fetch
the data from the internet. pandas_datareader takes four parameters that tell the API how
and where to get the data: the first specifies the stock ticker of a company, the second is
the data source from which the API should collect the data, and the remaining two are
the start date and the end date. There are many APIs for collecting data from the
internet, and every API has some limits on requesting data. Using pandas_datareader, we
collected data for 5 companies. The data files are in CSV format, which can be used
easily in a Jupyter notebook. The data set has the columns Date, High, Low, Open,
Close, Volume, and Adj Close.
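A minimal sketch of this step. The ticker "AAPL", the "tiingo" source string, and the sample rows below are illustrative assumptions, not values from the report; to keep the example runnable offline, we parse an in-memory CSV with the same columns instead of hitting the network:

```python
import io
import pandas as pd

# With network access and an API key, the same frame could be fetched with:
#   from pandas_datareader import data as pdr
#   df = pdr.DataReader("AAPL", "tiingo", "2020-01-01", "2020-12-31")
# (ticker, data source, start date, end date -- the four parameters above).

# Illustrative rows in the column layout the report describes.
csv_text = """Date,High,Low,Open,Close,Volume,Adj Close
2020-01-02,75.15,73.80,74.06,75.09,135480400,73.45
2020-01-03,75.14,74.13,74.29,74.36,146322800,72.74
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")
close = df["Close"]  # the series the prediction model works on
```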
C. Data Preprocessing
A data set is a collection of features and N rows; it contains many values, and the
values come in different formats. A dataset may contain duplicate or null values that
can cause a loss of accuracy, and some features may depend on others. Because data are
collected from different sources, different formats may be used to denote a single
value; for gender, for example, one source may use M/F and another Male/Female. A
machine can understand only numbers, so, for instance, 3-dimensional image data should
be reduced to a 2-dimensional format, and the data should be made free of noise, null
values, and incorrect formats. Data cleaning can be performed with pandas for tabular
data and OpenCV for images.
D. Feature Scaling
The dataset consists of numerical values that increase steadily over time; in our data
set, the closing value gradually increases from 1 to 2000. In a neural network, a target
variable with a large spread of values may, in turn, produce large error gradients,
causing the weights to change dramatically and making the learning process unstable.
So, we scale the dataset into a small range such as (0, 1), which keeps things simple and
reduces the loss. Scaling also makes it easier for the model to store similar data for
future reference. We scale all the data to the range 0 to 1, train the model on the
scaled values, and then convert the predicted values back to the original scale.
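The scaling described above can be sketched with scikit-learn's MinMaxScaler; the price values below are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative closing prices; the real input would be the Close column.
close = np.array([[100.0], [500.0], [1200.0], [2000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(close)         # every value now lies in [0, 1]
# After predicting on the scaled data, convert back to the original scale:
restored = scaler.inverse_transform(scaled)
```

fit_transform remembers the observed minimum and maximum, which is what lets inverse_transform map model outputs back to actual prices.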
PROPOSED WORK
Using LSTM, we implement the predictor model for the top 10 companies. For this, we
applied some data transformations that make the algorithm more efficient at finding
accurate values. First, we get the data from the Yahoo server using the pandas web
reader, which takes four parameters: the name of the company, the server name, the
start date, and the end date. The data are then scaled: because the values range widely,
we scale the input data to the range 0 to 1 to make the model more efficient. For this,
the sklearn package provides a scaling method that converts the input data to the range
0 to 1. The main step is to create x_train and y_train: after scaling, we build the
training data by taking the previous 100 days of data for each day's 'y' value. So, for
every 'y' day, the previous 100 days' values are responsible for that day's prediction.
In this way we create the training data set. We build a network model with two LSTM
layers and two Dense layers, which lets the model find the closing value for the next
day. A Sequential model is used because of its advantages in finding patterns of
similarity; it helps in finding the next event for a particular X value. Model
compilation defines the loss function, the optimizer, and the metrics. That's all: it has
nothing to do with the weights, and you can compile a model as many times as you want
without causing any problem for pretrained weights. The compilation uses the Adam
optimizer, which drives the network's learning, and mean squared error as the loss,
which training then tries to reduce.
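The windowing step described above (100 previous days per target day) can be sketched as follows; the function name is our own illustration:

```python
import numpy as np

def make_windows(series, lookback=100):
    """Build (x, y) pairs: each x holds the previous `lookback` values and
    y is the value on the following day."""
    x, y = [], []
    for i in range(lookback, len(series)):
        x.append(series[i - lookback:i])
        y.append(series[i])
    x = np.array(x)
    # Keras-style LSTM layers expect (samples, time steps, features).
    return x.reshape(x.shape[0], lookback, 1), np.array(y)
```

The resulting arrays would then be fed to the Sequential model with two LSTM layers and two Dense layers, compiled with the Adam optimizer and mean-squared-error loss as described.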
LSTM Model
Then we fit the x_train and y_train data to the model so that it learns from the
historic data. Epochs make the model learn from the same data repeatedly; with 2
epochs we obtained values close to the next day's closing value. The batch size is 1
because each value is individual and independent, which improves prediction accuracy.
Then we create test data to measure the accuracy of the model, predict the values for
the test inputs, and obtain the predicted y values. We compare the true and predicted y
values. The RMSE is 5.77, which gives us assurance that we can find a value very close
to the next day's closing value.
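The RMSE quoted above (5.77) is the root mean squared error between the true and predicted closing values; a minimal sketch of the computation:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```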
Flow of project
TESTING REPORTS
1. Unit Testing
Unit testing focuses on each module individually, ensuring that it functions properly as a
unit. In this testing we have tested each part of our software to ensure maximum error
detection. It helps to remove bugs from sub-modules and prevents large bugs from
arriving later.
Functional Partitioning:
1. Fetching data from www.Tiingo.com Website
3. Pre-processing of data
Functional Description:
Specification:
This module provides the raw data which we fetched from the stock market website as a
.csv file; it is successfully converted into an understandable format.
4) Name: Train, Test of data
Input:
To recognize patterns and validation of data
Output:
Pattern recognized successfully and data validated
Specification:
Training data is the initial dataset you use to teach a machine learning application to
recognize patterns or perform to your criteria, while testing or validation data is used to
evaluate your model's accuracy.
2. Integration Testing
Integration testing is a systematic technique for constructing the program structure while
at the same time conducting tests to uncover errors associated with interfacing. The
objective is to take unit tested components and build a program structure that has been
dictated by the design.
We prefer the Top-down integration testing as a testing approach for our project. The
Top-down integration testing is an incremental approach to construction of program
structure. Modules are integrated by moving downward through the control hierarchy,
beginning with the main control module.
Modules subordinate to the main module are incorporated into the structure in depth-first
order.
Depth-first integration would integrate all components on a major control path of a
structure. Selection of a major path is somewhat arbitrary and depends on application-
specific characteristics.
3. Validation Testing
The purpose of validation testing is to ensure that all expectations of the customer have
been satisfied. The testing is done by the project group members by inspecting the
requirements and validating each requirement.
We prefer the alpha testing technique for validation, in which a group of students is
called on to test linking, fetching, pre-processing, and plotting. The group members are
present at the site, observing the system at work and collecting the bugs that occur.
Changes are made according to the requirements, and testing is done again to gather
more errors if any remain.
4. System Testing
In the system testing system undergoes various exercises to fulfill the system
requirements.
These tests include:
a. Linking Tests: These are designed to ensure that the project is successfully linked to
the stock website.
b. Performance Tests: The tests are conducted to check the performance of the system.
PERFORMANCE ANALYSIS
INSTALLATION GUIDE
RESULT
APPLICATIONS
Random Forest Regressor: The code employs a Random Forest Regressor, which is a
machine learning model, to perform feature importance analysis.
Model Training: The Random Forest Regressor is trained on the dataset, with
'X_encoded' as the input and 'AQI' as the target variable.
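A hedged sketch of this training and feature-importance step: the synthetic features and target below stand in for the report's 'X_encoded' and 'AQI' columns and are not real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins: three pollutant-like features and a target that
# depends mostly on the first one.
rng = np.random.default_rng(42)
X_encoded = rng.random((200, 3))
aqi = 5.0 * X_encoded[:, 0] + 2.0 * X_encoded[:, 1] + rng.normal(0.0, 0.1, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_encoded, aqi)
importances = model.feature_importances_  # one score per input feature
```

feature_importances_ sums to 1 across features, so each entry can be read as a relative contribution to the model's splits.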
Health Alerts and Advisories: Predictive models can help in issuing health alerts and
advisories to the public when air quality is expected to deteriorate. By forecasting high
pollution days in advance, individuals with respiratory conditions or other health
concerns can take precautionary measures such as staying indoors or using masks to
reduce exposure to harmful pollutants.
Urban Planning and Policy Making: Governments and urban planners can use air
quality prediction models to inform policy decisions related to transportation, industrial
zoning, and urban development. By understanding the factors contributing to poor air
quality, policymakers can implement measures to mitigate pollution and improve overall
air quality in cities.
Traffic Management: Traffic congestion is a significant contributor to air pollution in
urban areas. By incorporating traffic data into Ridge Regression models, transportation
authorities can predict how changes in traffic flow patterns, such as implementing
congestion pricing or promoting public transportation, may impact air quality.
Industrial Emissions Control: Industries can use predictive models to optimize their
operations and reduce emissions. By analyzing the relationship between production
activities, weather conditions, and air quality, companies can adjust their processes to
minimize pollution and comply with environmental regulations.
Community Engagement and Education: Air quality prediction models can be used to
raise awareness among the general public about the sources and health effects of air
pollution. By providing easily accessible forecasts through mobile apps or websites,
individuals can make informed decisions to protect their health and the environment.