BE IT Project Synopsis Format 2022 23 V1
A Project Synopsis
Submitted by
Group ID:
Team Members:
Abstract
Web scraping is the process of collecting or extracting information from a particular website. It is
a technique for converting unstructured data into structured data, which is then analyzed and
stored in the required file format. Web scraping is becoming well known because of the large
amount of data available on the internet and the need to collect that data without wasting time.
Web scraping can be applied to obtain large amounts of data for better decision making. We
can achieve this using the Selenium tool and other algorithms. The data obtained from web scraping
will be processed for Text Recognition and Text Classification using an SVM classifier and Naive
Bayes.
Problem Statement
To convert data extracted from images, text and files, whether in unstructured or structured
format, into the required format, and to use the resulting file for text recognition and text
classification to provide output to a chatbot.
Introduction
1. Web Scraping:
Web scraping is becoming increasingly well known because of the mass of data available on the
internet and because new startups do not want to spend time collecting data by hand when it can
be gathered quickly from the internet. Web scraping is the process of extracting information
from a website. It is a technique for extracting data from the web and turning unstructured data
on the web (including HTML) into structured data that you can store on your local computer
or in a database. The data is mostly stored in a CSV file or an Excel spreadsheet, but it can be
saved in other formats.
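As a minimal sketch of the unstructured-to-structured step described above, the following Python
example parses a small HTML table and returns it as CSV text. It uses only the standard library so
that it runs without extra installs; the project itself names Selenium and BeautifulSoup for this
job. The sample HTML, the TableParser class and the html_table_to_csv function are hypothetical
names invented for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Price</th></tr>
  <tr><td>Acme</td><td>12.50</td></tr>
  <tr><td>Globex</td><td>7.25</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


print(html_table_to_csv(SAMPLE_HTML))
```

In a real run the HTML would come from a page loaded in the browser rather than from an inline
string, and the CSV text would be written to a file.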
2. Text Recognition:
Text recognition is the process that converts an image of text into a machine-readable text
format.
3. Text Classification:
It is an ML technique that assigns a set of predefined categories to open-ended text. It can be used
to organise, structure and categorize any kind of text, from documents and medical studies to files
all over the web.
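To make the idea of assigning predefined categories to text concrete, here is a minimal sketch of a
multinomial Naive Bayes text classifier with add-one smoothing, written in plain Python. The
training sentences, the two category names and the function names are all invented for
illustration; they are not taken from the project data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(samples):
    """samples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()            # documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocabulary = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set with two predefined categories.
TRAINING = [
    ("shares rose after strong quarterly earnings", "finance"),
    ("investors bought bank stock", "finance"),
    ("patients received the new vaccine", "medical"),
    ("the clinical study tested a drug", "medical"),
]

model = train_naive_bayes(TRAINING)
print(predict(model, "bank earnings and stock"))
print(predict(model, "clinical drug study"))
```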
Objective:
1. To use technology to make data extraction easy.
2. To help portfolio managers be confident about investing in a company of their interest.
3. To provide better access to company data.
4. To create an important source of information for asset managers about market trends and
investment opportunities.
5. To provide data enrichment on demand.
6. To support machine learning with large data sets.
Software Requirements:
1. Python 2.x or Python 3.x with the Selenium, BeautifulSoup and Pandas libraries installed.
2. Google Chrome browser.
3. Ubuntu operating system.
Hardware Requirements:
Device name: DESKTOP-NL8H82D
Processor: Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz 1.19 GHz
Installed RAM: 8.00 GB (7.77 GB usable)
Device ID: FFC24962-5303-491B-9F2A-99CC8AB9531A
Product ID: 00327-35907-20393-AAOEM
System type: 64-bit operating system, x64-based processor
Pen and touch: No pen or touch input is available for this display
Project Design
The following figure shows the proposed system for the web scraping program:
Text Recognition: Text is extracted from images and scanned documents by first converting
the images to binary (black and white) form using Otsu's algorithm.
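Otsu's algorithm picks the gray-level threshold that best separates dark and light pixels by
maximizing the variance between the two classes. The sketch below is a plain-Python illustration
of that idea on a flat list of 8-bit gray values; the function name and the sample pixel values
are assumptions made for the example, and a real pipeline would more likely call an image
library on a full image.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class (background), the rest the other.
    """
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate threshold
    sum_bg = 0     # sum of their gray levels
    for t in range(levels):
        weight_bg += histogram[t]
        sum_bg += t * histogram[t]
        if weight_bg == 0:
            continue  # no pixels in the background class yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


# Hypothetical gray values: four dark pixels and four bright pixels.
sample_pixels = [10, 12, 11, 200, 205, 198, 10, 202]
threshold = otsu_threshold(sample_pixels)
binary = [1 if p > threshold else 0 for p in sample_pixels]
print(threshold, binary)
```

Every threshold between the two clusters gives the same separation, so this sketch keeps the
lowest one.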
Storing: Post-processing then stores the recognized text in a format suitable for further
processing.
The content of the stored text files is classified using SVM and Naïve Bayes classifiers.
SVM (Support Vector Machine) is a modern classification technique with a simple structure
and strong classification ability. It is grounded in statistical learning theory, based on the
principle of structural risk minimization and Vapnik-Chervonenkis (VC) theory. SVM finds
the best decision boundary; its main goal is to maximize the margin between classes. The main
goal of the Naïve Bayes classifier is to find the best mapping, within the specific problem
domain, between a new piece of data and a set of classes. We can use other classification
algorithms such as Random Forest, K-nearest neighbours and Decision Tree to increase the
accuracy of classification. We can use reinforcement learning to learn the behaviour
of the user in identifying regions in images and PDFs, in order to automate preparation of the
file for a chatbot.
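The margin-maximization goal of SVM can be illustrated with a small linear SVM trained by
subgradient descent on the regularized hinge loss. This is a sketch under stated assumptions, not
the project's implementation: the two-dimensional toy points, the hyperparameter values and the
function names are invented for the example, and a real text classifier would work on feature
vectors built from the stored text.

```python
def train_linear_svm(points, labels, epochs=1000, learning_rate=0.01, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    labels must be +1 or -1. The regularization term keeps ||w|| small,
    which corresponds to a wide margin (the margin width is 2 / ||w||).
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink w slightly on every update.
            w = [wi - learning_rate * reg * wi for wi in w]
            if margin < 1:
                # Point is inside the margin or misclassified: push it out.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def classify(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
POINTS = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
LABELS = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(POINTS, LABELS)
print([classify(w, b, p) for p in POINTS])
```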
Project Plan 1.0
Roll No Name Sign