Equity Research Analysis Using Web Scraping

A Project Synopsis
Submitted by

Group ID:

Team Members:

Project Member 1 : Nakate Sakshi Sanjay Roll No.2243011

Project Member 2 : Gawade Sonali Ajay Roll No.2243042

Project Member 3 : Narkhede Shruti Subhash Roll No.2243039

In partial fulfilment for the requirement of the degree


BE (Information Technology)

Under the guidance of


Mrs. P.N.Kokare
Assistant Professor, IT Department

Department of Information Technology


Vidya Pratishthan's Kamalnayan Bajaj Institute of Engineering
and Technology, Bhigawan Road, Baramati
August 2022
Project Title : Equity Research Analysis

Area of Project : Web Scraping

Abstract
Web scraping is the process of collecting or extracting information from a website. It is a
technique for converting unstructured data into structured data, which is then analyzed and
stored in the required file format. Web scraping is becoming well known because of the large
amount of data available on the internet and the need to collect it without wasting time. It
can be applied to obtain large amounts of data for better decision making. We can achieve this
using the Selenium tool and other algorithms. The data obtained after web scraping will be
processed for text recognition and text classification using the SVM and Naive Bayes
classifiers.
Problem Statement
To convert data extracted from images, text and files, whether in unstructured or structured
format, into the required format, and to use the resulting file for text recognition and text
classification to provide output to a chatbot.
Introduction

1.Web Scraping:
Web scraping is becoming increasingly well known because of the mass of data available on the
internet and because new startups do not want to spend time collecting data manually when it
can be gathered quickly from the internet. Web scraping is the process of extracting
information from a website. It turns unstructured data on the web (including HTML) into
structured data that can be stored on a local computer or in a database. The data is mostly
stored in a CSV file or an Excel spreadsheet, but it can be saved in other formats.
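
As a rough illustration of turning HTML into structured data, the sketch below reads an HTML table and writes it out as CSV. It is a minimal example, not the project's implementation: it uses Python's standard-library html.parser as a stand-in for BeautifulSoup so that it runs without extra packages, and the names TableParser, html_table_to_csv and SAMPLE_HTML, as well as the sample company data, are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample page; the company names and prices are made up.
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Price</th></tr>
  <tr><td>Alpha Ltd</td><td>101.5</td></tr>
  <tr><td>Beta Corp</td><td>87.0</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML))
```

The CSV text can be written to a file and opened in a spreadsheet, or loaded into a Pandas DataFrame for analysis.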

2.Text Recognition:
Text recognition is the process of converting an image of text into a machine-readable text
format.

3.Text Classification:
Text classification is a machine learning (ML) technique that assigns a set of predefined
categories to open-ended text. It can be used to organise, structure and categorize any kind
of text, from documents and medical studies to files from all over the web.

Objective:
1. To use technology to make data extraction easy.
2. To help portfolio managers be sure of their investment in a company of interest.
3. To provide better access to company data.
4. To create an important source of information for asset managers about market trends and
investment opportunities.
5. To provide data enrichment on demand.
6. To support machine learning on large data sets.

Scope of the project and Constraints :


The proposed system can help with the extraction of data. It will also be able to extract
hidden data to give insight into an ever-changing world. The main aim of the project is to
evaluate a large-scale program for scraping data that is available in huge amounts. The system
is used to decrease the time needed to collect data. Development of an entire system for the
overall progression of the web scraper is beyond the scope of this project.
System Requirements

Software Requirements:
1. Python 2.x or Python 3.x with the Selenium, BeautifulSoup and Pandas libraries installed.
2. Google Chrome browser.
3. Ubuntu operating system.

Hardware Requirements:
Device name: DESKTOP-NL8H82D
Processor: Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz 1.19 GHz
Installed RAM: 8.00 GB (7.77 GB usable)
Device ID: FFC24962-5303-491B-9F2A-99CC8AB9531A
Product ID: 00327-35907-20393-AAOEM
System type: 64-bit operating system, x64-based processor
Pen and touch: No pen or touch input is available for this display
Project Design
The following figure shows the proposed system of web scraping program:

Figure 1: System Architecture of Proposed System


The figure above shows the following process. First, we visit the target website and start
web scraping. Web scraping begins with fetching: the page is fetched using the provided URL.
The content is then extracted, and the user is given the option to pick the area from which
the text is to be extracted. This step converts unstructured web content into organized
information, which is then processed and analysed in spreadsheets. After these steps we obtain
a database, which is converted into the required format using Selenium tools. Selenium is an
open-source tool that automates web browsers. It uses the WebDriver protocol to automate
processes on various browsers, and for the purpose of web scraping this automation is carried
out locally. The obtained file then goes through text recognition in the following way:
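
The extraction step, where text is taken only from the area the user picks, might look roughly like the sketch below. It is a minimal example under stated assumptions: the chosen area is identified by an element id, the HTML is given as a string (in the proposed system it could come from Selenium's driver.page_source after driver.get(url)), and the names RegionTextExtractor, extract_region_text and SAMPLE_PAGE are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RegionTextExtractor(HTMLParser):
    """Collect the visible text inside the element whose id equals `region_id`."""

    def __init__(self, region_id):
        super().__init__()
        self.region_id = region_id
        self.depth = 0      # 0 = outside the region, >0 = nesting depth inside it
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.region_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.pieces.append(data.strip())


def extract_region_text(html, region_id):
    """Return the text of the chosen region, with pieces joined by single spaces."""
    parser = RegionTextExtractor(region_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.pieces)


# Hypothetical page; only the "results" area is of interest to the user.
SAMPLE_PAGE = """
<html><body>
  <div id="menu">Home | About</div>
  <div id="results">
    <h2>Quarterly results</h2>
    <p>Revenue rose<br>this quarter.</p>
  </div>
  <div id="footer">Contact us</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_region_text(SAMPLE_PAGE, "results"))
```

Tracking the nesting depth is what keeps the menu and footer out of the result: text is collected only while the parser is inside the chosen element.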

Pre-processing: Pre-processing techniques are needed on coloured, grey-level or binary
document images containing text or graphics. Several steps are needed:
Step 1: Apply image enhancement techniques to remove noise or correct the contrast in the image.
Step 2: Remove the background containing any scenes, watermarks and noise.
Step 3: Perform page segmentation to separate graphics from text.
Step 4: Separate the characters from each other.
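
As one possible illustration of the noise removal in Step 1, the sketch below applies a 3x3 median filter to a grey-level image. The synopsis does not name a specific enhancement technique, so the choice of a median filter is an assumption. The image is represented as a plain list of rows of grey levels, and the function name median_filter_3x3 is hypothetical.

```python
from statistics import median


def median_filter_3x3(image):
    """Return a copy of `image` where each pixel is the median of its 3x3 neighbourhood.

    `image` is a list of rows of grey levels (0-255). Border pixels use only
    the neighbours that exist.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        result.append(row)
    return result


if __name__ == "__main__":
    # A bright page with one dark speck of noise in the middle.
    noisy = [[200] * 5 for _ in range(5)]
    noisy[2][2] = 0
    for line in median_filter_3x3(noisy):
        print(line)
```

A median filter suits isolated specks because a single outlier never becomes the median of its neighbourhood, while larger structures such as character strokes are mostly preserved.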

Text Recognition: The extraction of text from images and scanned
documents is done by converting the images using Otsu’s algorithm.
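
Otsu's algorithm chooses a grey-level threshold that best separates dark pixels from bright ones by maximizing the between-class variance, so that the image can be converted to black and white. The sketch below is a minimal pure-Python version under stated assumptions: the image is a list of rows of integer grey levels from 0 to 255, and the names otsu_threshold and binarize are hypothetical. It covers only the binarization step. Recognizing the individual characters afterwards would need a separate OCR step, which the synopsis does not specify.

```python
def otsu_threshold(image):
    """Return the grey level t (0-255) that maximizes the between-class variance.

    Pixels with value <= t form one class and the remaining pixels the other.
    """
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(256):
        weight_low += hist[t]
        sum_low += t * hist[t]
        if weight_low == 0:
            continue                 # no pixels in the lower class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break                    # every pixel is already in the lower class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, threshold):
    """Map pixels above the threshold to 1 (bright) and all others to 0 (dark)."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Dark text pixels (around 10) on a bright background (around 200).
    page = [[10, 200, 200], [12, 10, 198], [200, 11, 200]]
    t = otsu_threshold(page)
    print(t, binarize(page, t))
```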

Storing: Post-processing then involves storing the recognized text in a format suitable
for further processing.

The content of the stored text files is classified using SVM and Naïve Bayes classifiers.
SVM (Support Vector Machine) is a modern classification technique with a simple structure
and strong classification ability. It is grounded in statistical learning theory, on the
concept of structural risk minimization and Vapnik-Chervonenkis (VC) theory. SVM finds the
best decision boundary; its main goal is to maximize the margin. The main goal of the Naïve
Bayes classifier is to find the best mapping, within the specific problem domain, between a
piece of new data and a collection of classes. We can use other classification algorithms
such as Random Forest, K-nearest neighbour and Decision Tree to increase the accuracy of
classification. We can use reinforcement learning to learn the behaviour of the user in
identifying the regions in the images and PDFs, in order to automate the file for a chatbot.
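
To make the Naïve Bayes half of the classification step concrete, the sketch below implements a small multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing in plain Python. It is an illustration only: the class name NaiveBayesTextClassifier, the whitespace tokenization, the labels and the training sentences are all hypothetical, and a real system would more likely use a library implementation for both Naïve Bayes and SVM.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace-separated, lower-cased words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # label -> number of documents
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log prior + sum of log likelihoods with add-one (Laplace) smoothing
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: short statements labelled by sentiment.
TRAIN_TEXTS = [
    "profit rose and revenue grew",
    "strong growth in quarterly profit",
    "loss widened as sales fell",
    "revenue fell and debt increased",
]
TRAIN_LABELS = ["positive", "positive", "negative", "negative"]

if __name__ == "__main__":
    model = NaiveBayesTextClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)
    print(model.predict("profit and revenue grew"))
    print(model.predict("sales fell and loss widened"))
```

Working in log probabilities avoids numerical underflow on long texts, and the smoothing keeps a word that was never seen in a class from forcing that class's probability to zero.
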
Project Plan 1.0

Figure 2: Gantt Chart


References

[1] A. Maududie, W. E. Y. Retnani and M. A. Rohim, "An Approach of Web Scraping
on News Website based on Regular Expression," 2018 2nd East Indonesia Conference on
Computer and Information Technology (EIConCIT), Makassar, Indonesia, 2018

[2] S. Upadhyay, V. Pant, S. Bhasin and M. K. Pattanshetti, "Articulating the construction
of a web scraper for massive data extraction," 2017 Second International Conference on
Electrical, Computer and Communication Technologies (ICECCT), Coimbatore

[3] R. S. Chaulagain, S. Pandey, S. R. Basnet and S. Shakya, "Cloud Based Web Scraping
for Big Data Applications," 2017 IEEE International Conference on Smart Cloud
(SmartCloud), New York, NY, 2017, pp. 138-143, doi: 10.1109/SmartCloud

[4] D. M. Thomas and S. Mathur, "Data Analysis by Web Scraping using Python," 2019
3rd International conference on Electronics, Communication and Aerospace Technology
(ICECA), Coimbatore, India, 2019,

[5] M. S. Parvez, K. S. A. Tasneem, S. S. Rajendra, and K. R. Bodke, "Analysis Of Different
Web Data Extraction Techniques," 2018 International Conference on Smart City and
Emerging Technology (ICSCET), Mumbai, 2018

Roll No Name Sign

Roll No.2243011 Project Member 1 : Nakate Sakshi Sanjay

Roll No.2243042 Project Member 2 : Gawade Sonali Ajay

Roll No.2243039 Project Member 3 : Narkhede Shruti Subhash

Project Guide Project Coordinator

(Mrs.P.N.Kokare) ( Mr. P.M.Patil )

Head of Department Principal

( Dr. S. A. Takale) ( Dr. R. S. Bichkar)
