
FAKE PROFILE DETECTION IN SOCIAL MEDIA USING NLP

ABSTRACT:
In present times, social media plays a key role in every individual's life, and a majority of people spend time on social media platforms every day. The number of accounts on these social networking sites is increasing dramatically day by day, and many users interact with others irrespective of time and location. The existing system uses machine learning algorithms that may be inefficient. To avoid this difficulty, we implement content-oriented fake profile prediction using NLP (Natural Language Processing) together with a machine learning classifier.

ABOUT THE PROJECT:


In recent years, online social networks such as Twitter have become a global mass phenomenon and one of the fastest-emerging e-services, according to Gross and Acquisti (2005) and Boyd and Ellison (2007). A study published by Twitter (2012) indicated roughly 901 million monthly active users on the platform, making Twitter one of the largest online social networks. Not only ordinary users but also celebrities, politicians and other people of public interest use social media to spread content to others. Furthermore, companies and organizations consider social media sites the medium of choice for large-scale marketing and target-oriented advertising campaigns. The sustainability of the business model relies on several different factors and is usually not publicly disclosed. Nonetheless, we assume that two major aspects are significant for Twitter.

First and foremost, Twitter relies on people using their real-life identity and therefore
discourages the use of pseudonyms. Verified accounts allow (prominent) users to verify their identity
and to continue using pseudonyms, e.g., stage names such as ‘Gaga’. This is considered to be a security
mechanism against fake accounts (TechCrunch 2012); moreover, users are asked to identify friends
who do not use their real names. Second, the revenue generated by advertising is substantial and thus
protection of the revenue streams is important. Media reports (TechCrunch 2012) have indicated that a
large portion of clicks are not genuine clicks by real users.
In the short term, Twitter does not suffer from bot-click attacks but instead benefits from
increased revenue, so there may be no incentive to prevent such fraud. In the long run, however,
advertisers will move away from the platform if the promised targeted advertisements are not delivered
correctly. Amongst users, social media are widely regarded as an opportunity for self-presentation and
interaction with other participants around the globe. Due to the wide circulation and growing popularity
of social media sites, even for-profit organizations, such as companies, and non-profit organizations
have gained interest in presenting themselves and reaching potential customers. A presence on Twitter and similar platforms is nearly taken for granted. The social media site operators have a huge amount of personal data and other shared content, such as links, photos and videos, stored on their servers by many individuals and organizations.

In the present generation, everyone's social life has become associated with online social networks, and there has been a tremendous increase in these technologies. Online social networks play an important role in modern society, and social networking sites engage millions of users around the world. Users' interactions with sites such as Twitter and Facebook have a tremendous impact, and occasionally undesirable repercussions, on daily life. Adding new friends and keeping in contact with them has become easier. Social networking sites make our social lives better, but nevertheless there are many issues with using them: privacy, online bullying, potential for misuse, trolling, creation of fake accounts, and so on. We will implement machine learning algorithms to predict whether an account is controlled by a fake user. There are two common types of ML methods, known as supervised learning and unsupervised learning. In supervised learning, a labeled set of training data is used to estimate or map the input data to the desired output. In contrast, under unsupervised learning no labeled examples are provided and there is no notion of the output during the learning process. In supervised learning, the input data is called training data and has a known label or result, such as spam/not-spam. A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong; training continues until the model achieves the desired level of accuracy on the training data. The goal of machine learning in fake profile detection is a trained algorithm that, given the data of a particular profile (such as age, gender and number of friends), can predict whether the profile is fake or genuine, thereby helping to ensure the security of data on social networking sites.
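As a rough illustration of the supervised setting described above, the sketch below trains a classifier on labeled profile features. The file name and column names (age, friends_count, followers_count, is_fake) are assumptions for illustration, not the project's actual schema.

```python
# Minimal supervised-learning sketch on labeled profile features.
# File and column names are illustrative assumptions only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

profiles = pd.read_csv("profiles.csv")                       # labeled training data
X = profiles[["age", "friends_count", "followers_count"]]    # numeric input features
y = profiles["is_fake"]                                      # known label: 1 = fake, 0 = genuine

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)                                    # learn the feature-to-label mapping
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```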

EXISTING SYSTEM:

The main theme of our paper is identifying whether an Instagram profile is genuine or fake.
Algorithms are trained on previously collected fake and genuine account data; whenever new test data is supplied, the trained model is applied to it to identify whether the given account details belong to a genuine or a fake user. Machine learning-based methods were used to detect false accounts that could give the wrong impression about people. The dataset is pre-processed using a variety of Python libraries, and the results are compared to find a realistic algorithm appropriate for the specified dataset. The effort to detect forged accounts on social media platforms is driven by a variety of machine learning algorithms; in particular, the classification algorithms Random Forest and Support Vector Machine (SVM) are used for the detection of fake accounts.
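A sketch of how the Random Forest and SVM classifiers mentioned above could be compared is shown below; scikit-learn's synthetic data generator stands in for the pre-processed account dataset.

```python
# Hedged comparison of Random Forest and SVM via 5-fold cross-validation.
# make_classification() generates stand-in data in place of the real account features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF kernel)": SVC(kernel="rbf", C=1.0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)       # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```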

DISADVANTAGES OF EXISTING SYSTEM:


 The prediction accuracy will be low.
 SVM algorithm is not suitable for large data sets.
 SVM does not perform very well when the data set has more noise, i.e., when the target classes overlap.
 In cases where the number of features for each data point exceeds the number of training samples, the SVM will underperform.
PROPOSED SYSTEM:
Fake profiles on Twitter play a damaging role that lets users down. In order to overcome this difficulty in the existing system, we propose the LSTM algorithm for prediction. The dataset is collected from kaggle.com as a downloadable .CSV file, pre-processing is performed, features are extracted using NLP, and the extracted data is then processed by the LSTM algorithm, which predicts fake profiles using the trained model.

ADVANTAGES OF PROPOSED SYSTEM:


 LSTMs are much better at handling long-term dependencies.

 Since we use an LSTM for prediction, the prediction accuracy can be increased.

 LSTMs are much less susceptible to the vanishing gradient problem.

 Finally, LSTMs are very efficient at modeling complex sequential data.

MODULES IN THE PROPOSED SYSTEM:


 Data collection
 Pre-processing
 Model creation
 Test data
 Prediction
DATA COLLECTION:
Data collection or data gathering is the process of gathering and measuring information on targeted
variables in an established system, which then enables one to answer relevant questions and evaluate
outcomes. Data collection is a research component in all study fields, including physical and social
sciences, humanities, and business. While methods vary by discipline, the emphasis on ensuring
accurate and honest collection remains the same. The goal for all data collection is to capture quality
evidence that allows analysis to lead to the formulation of convincing and credible answers to the
questions that have been posed. Data collection and validation consists of four steps when it involves
taking a census and seven steps when it involves sampling. We collect the data from the open-source platform kaggle.com.
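A minimal sketch of loading the downloaded .CSV file is shown below; the file name fake_profiles.csv is an assumption for illustration.

```python
# Load the downloaded Kaggle .CSV file (file name assumed for illustration).
import pandas as pd

df = pd.read_csv("fake_profiles.csv")
print(df.shape)        # number of accounts (rows) and attributes (columns)
print(df.head())       # quick look at the first few records
```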

PRE-PROCESSING:

Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or
enhance performance, and is an important step in the data mining process. The phrase "garbage in,
garbage out" is particularly applicable to data mining and machine learning projects. Data-
gathering methods are often loosely controlled, resulting in out-of-range values; analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation
and quality of data is first and foremost before running any analysis. Often, data preprocessing is the
most important phase of a machine learning project, especially in computational biology. If there is
much irrelevant and redundant information present or noisy and unreliable data, then knowledge
discovery during the training phase is more difficult. Data preparation and filtering steps can take
considerable amount of processing time. Examples of data preprocessing include cleaning, instance
selection, normalization, one hot encoding, transformation, feature extraction and selection, etc. The
product of data preprocessing is the final training set. In our project, we use NLP for the pre-processing step.
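The sketch below illustrates the kind of NLP pre-processing described above: basic cleaning followed by tokenisation and padding. The column name "description" and the vocabulary and length limits are illustrative assumptions, not the dataset's actual schema.

```python
# NLP pre-processing sketch: cleaning, tokenisation and padding.
# The "description" column and the size limits are assumptions for illustration.
import re
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

df = pd.read_csv("fake_profiles.csv")
texts = (df["description"].fillna("")
         .str.lower()
         .apply(lambda t: re.sub(r"[^a-z\s]", " ", t)))   # keep letters and spaces only

tokenizer = Tokenizer(num_words=5000)        # keep the 5,000 most frequent words
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=100)     # equal-length sequences for the LSTM
```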

MODEL CREATION:
In this module we use the LSTM algorithm. An LSTM unit has a cell state and three gates, which provide it with the power to selectively learn, unlearn or retain information at each step. The cell state helps information flow through the units without being altered, by allowing only a few linear interactions.
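A hedged sketch of such an LSTM model in Keras follows; the layer sizes, vocabulary size and sequence length are illustrative choices matching the pre-processing sketch above, not fixed project parameters.

```python
# LSTM model sketch (layer sizes and lengths are illustrative choices).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential([
    Embedding(input_dim=5000, output_dim=64, input_length=100),  # word embeddings
    LSTM(64),                                # gated recurrent unit with a cell state
    Dropout(0.5),
    Dense(1, activation="sigmoid"),          # output: probability that the profile is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```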

TEST DATA AND PREDICTION:


The test data is passed to the created model, which predicts the profile status and shows whether the profile is fake or genuine.
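Continuing the pre-processing and model sketches above (and assuming model.fit has been run on the padded training data), the test text is vectorised with the same tokenizer and the sigmoid output is thresholded at 0.5; the example profile descriptions are made up.

```python
# Prediction sketch: reuses `tokenizer` and `model` from the sketches above.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

test_texts = ["win free followers now", "photographer and coffee lover"]   # made-up examples
X_test = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=100)

probs = model.predict(X_test)                               # probabilities from the trained model
labels = np.where(probs[:, 0] > 0.5, "fake", "genuine")     # threshold the sigmoid output
print(list(zip(test_texts, labels)))
```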
CHAPTER II

HARDWARE AND SOFTWARE

HARDWARE REQUIREMENTS

 PROCESSOR : i3 or i5 processor
 RAM : 4 GB DDR RAM
 MONITOR : 15” color
 HARD DISK : 40 GB

SOFTWARE REQUIREMENTS:

 Front End : React JS
 Back End : SQL Server
 Operating System : Windows 10
 Language : Python 3.7
 Framework : Flask
MACHINE LEARNING:
Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy.

DEEP LEARNING

Deep learning is a subset of machine learning, which is essentially a neural network with three or more layers. These neural networks attempt to simulate the behavior of the human brain, allowing it to “learn” from large amounts of data. While a neural network with a single layer can still make approximate predictions, additional hidden layers can help to optimize and refine for accuracy. The deep learning architecture is shown in Fig 1.1.

Figure 1.1 Deep learning architecture

In deep learning, a computer model learns to perform classification tasks directly from images,
text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding
human-level performance. Models are trained by using a large set of labeled data and neural network
architectures that contain many layers. It is a field that is based on learning and improving on its own
by examining computer algorithms. However, advancements in Big Data analytics have permitted
larger, sophisticated neural networks, allowing computers to observe, learn, and react to complex
situations faster than humans. Deep learning has aided image classification, language translation and speech recognition, and it can be used to solve any pattern recognition problem without human intervention.
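As a small, hedged illustration of a network with "three or more layers" trained on labeled data, the sketch below fits a toy Keras model on synthetic inputs; nothing here is specific to the project's real data.

```python
# Toy multi-layer network on synthetic data (illustration only).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.random.rand(200, 8)                    # 200 samples, 8 features
y = (X.sum(axis=1) > 4).astype(int)           # toy binary labels

model = Sequential([
    Dense(16, activation="relu", input_shape=(8,)),   # hidden layer 1
    Dense(8, activation="relu"),                      # hidden layer 2
    Dense(1, activation="sigmoid"),                   # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))        # [loss, accuracy] on the toy data
```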
1.1.1 DEEP LEARNING VS MACHINE LEARNING

Deep learning is a specialized form of machine learning. A machine learning workflow


starts with relevant features being manually extracted from images. The features are then used to
create a model that categorizes the objects in the image. With a deep learning workflow, relevant
features are automatically extracted from images. In addition, deep learning performs “end-to-
end learning” – where a network is given raw data and a task to perform, such as classification,
and it learns how to do this automatically. Another key difference is deep learning algorithms
scale with data, whereas shallow learning converges. A key advantage of deep learning networks
is that they often continue to improve as the size of your data increases. The difference in the working of machine learning and deep learning is shown in Fig 1.2.

Fig 1.2 Machine learning vs Deep learning

Fig 1.4 Unrolled RNN

1.2 SENTIMENT ANALYSIS

Sentiment Analysis is a sub-field of NLP that, with the help of machine learning
techniques, tries to identify and extract insights. Sentiment analysis studies the subjective
information in an expression, that is, the opinions, appraisals, emotions, or attitudes towards a
topic, person or entity. Expressions can be classified as positive, negative, or neutral. For
example: “I really like the new design of your website!” → Positive.
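A deliberately tiny, lexicon-style sketch of this kind of polarity classification follows; the word lists are illustrative only and far smaller than any real sentiment lexicon.

```python
# Tiny lexicon-based polarity sketch (word lists are illustrative only).
POSITIVE = {"like", "love", "great", "good", "excellent"}
NEGATIVE = {"hate", "bad", "poor", "terrible", "awful"}

def classify(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

print(classify("I really like the new design of your website!"))   # -> Positive
```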

Sentiment analysis is contextual mining of text that identifies and extracts subjective
information in source material, helping a business to understand the social sentiment of its
brand, product or service while monitoring online conversations. With the recent advances in
deep learning, the ability of algorithms to analyse text has improved considerably. Creative use
of advanced artificial intelligence techniques can be an effective tool for doing in-depth research.

Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to
determine whether data is positive, negative or neutral. Sentiment analysis is often performed on
textual data to help businesses monitor brand and product sentiment in customer feedback, and
understand customer needs. The process of sentiment analysis is shown in Fig 1.6.

Fig 1.6 Sentiment Analysis


1.3 TYPES OF SENTIMENT ANALYSIS
People have a wide range of emotions – sad or happy, interested or uninterested, and positive or
negative. Different sentiment analysis models are available to capture this variety of emotions.

Let’s look at the most important types of sentiment analysis.

1. Fine-Grained

This sentiment analysis model helps you derive polarity precision. You can conduct a sentiment
analysis across the following polarity categories: very positive, positive, neutral, negative, or very
negative. Fine-grained sentiment analysis is helpful for the study of reviews and ratings.

For a rating scale from 1 to 5, you can consider 1 as very negative and 5 as very positive. For a scale from 1 to 10, you can consider 1-2 as very negative and 9-10 as very positive.

2. Aspect-Based

While fine-grained analysis helps you determine the overall polarity of your customer reviews,
aspect-based analysis delves deeper. It helps you determine the particular aspects people are talking
about.

Let’s say you’re a mobile phone manufacturer, and you get a customer review stating, “the camera
struggles in artificial lighting conditions.”

With aspect-based analysis, you can determine that the reviewer has commented on something
“negative” about the “camera.”

3. Emotion Detection

As the name suggests, emotion detection helps you detect emotions. This can include anger,
sadness, happiness, frustration, fear, worry, panic, etc. Emotion detection systems typically use
lexicons – a collection of words that convey certain emotions. Some advanced classifiers also
utilize robust machine learning (ML) algorithms.

It’s recommended to use ML over lexicons because people express emotions in a myriad of ways.
4. Intent Analysis

Accurately determining consumer intent can save companies time, money, and effort.
So many times, businesses end up chasing consumers that don’t plan to buy anytime
soon. Accurate intent analysis can resolve this hurdle.

The intent analysis helps you identify the intent of the consumer – whether the
customer intends to purchase or is just browsing around.

If the customer is willing to purchase, you can track them and target them with
advertisements. If a consumer isn’t ready to buy, you can save your time and
resources by not advertising to them.

Modern-day sentiment analysis approaches are classified into three categories: knowledge-based, statistical, and hybrid.

Knowledge-Based: This approach classifies text based on words that emanate emotion.

Statistical: This approach utilizes machine learning algorithms like latent semantic analysis and deep learning for accurate sentiment detection.

Hybrid: This approach leverages both knowledge-based and statistical techniques for on-point sentiment analysis.

Back in the day, performing sentiment analysis required expertise in technologies like Python, R, and machine learning. Nowadays, however, several software tools enable you to conduct sentiment analysis with no or minimal technical knowledge.

FLASK
Flask is a micro web framework written in Python. It is classified as a microframework because it does not require particular tools or libraries. It has no database abstraction layer, form validation, or other components where pre-existing third-party libraries provide common functions.
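A minimal Flask sketch is shown below; the /predict route and form field name are assumptions for illustration, not the project's actual endpoints.

```python
# Minimal Flask application sketch (route and field names are assumptions).
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    bio = request.form.get("bio", "")      # profile text submitted from the front end
    # A trained model would be called here; a fixed placeholder response stands in for it.
    return jsonify({"profile_status": "unknown", "received_text": bio})

if __name__ == "__main__":
    app.run(debug=True)
```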
PYTHON 3.7
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics, developed by Guido van Rossum and originally released in 1991. Designed to be easy to use as well as fun, the name “Python” is a nod to the British comedy group Monty Python. Python has a reputation as a beginner-friendly language, replacing Java as the most widely used introductory language because it handles much of the complexity for the user, allowing beginners to focus on fully grasping programming concepts rather than minute details.
Python is used for server-side web development, software development, mathematics, and system scripting. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python also supports modules and packages, which encourages program modularity and code reuse, and as an open-source community language it benefits from numerous independent programmers continually building libraries and functionality for it. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

Anaconda:
Anaconda is a distribution of the Python and R programming languages for scientific
computing (data science, machine learning applications, large-scale data processing, predictive
analytics, etc.), that aims to simplify package management and deployment. The distribution
includes data-science packages suitable for Windows, Linux, and macOS. It is developed and
maintained by Anaconda, Inc., which was founded by Peter Wang and Travis Oliphant in 2012. As
an Anaconda, Inc. product, it is also known as Anaconda Distribution or Anaconda Individual
Edition, while other products from the company are Anaconda Team Edition and Anaconda
Enterprise Edition, both of which are not free.
The conda package manager was spun out as a separate open-source project because it ended up being useful on its own and for things other than Python. There is also a small, bootstrap version of
Anaconda called Miniconda, which includes only conda, Python, the packages they depend on, and
a small number of other packages.
Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda
distribution that allows users to launch applications and manage conda packages, environments and
channels without using command-line commands. Navigator can search for packages on Anaconda
Cloud or in a local Anaconda Repository, install them in an environment, run the packages and
update them. It is available for Windows, macOS and Linux. Navigator includes the following applications by default:
 JupyterLab
 Jupyter Notebook
 Spyder
 Glue
 Orange
 RStudio
 Visual Studio Code

JUPYTER NOTEBOOK:
Jupyter Notebook is a web-based interactive computational environment for creating
notebook documents. Jupyter Notebook is built using several open-source libraries, including
IPython, ZeroMQ, Tornado, jQuery, Bootstrap, and MathJax. A Jupyter Notebook document is a
browser-based REPL containing an ordered list of input/output cells which can contain code, text
(using Markdown), mathematics, plots and rich media. Underneath the interface, a notebook is a
JSON document, following a versioned schema, usually ending with the ".ipynb" extension.
Jupyter Notebook is similar to the notebook interface of other programs such as Maple,
Mathematica, and SageMath, a computational interface style that originated with Mathematica in
the 1980s. Jupyter interest overtook the popularity of the Mathematica notebook interface in early 2018. The main parts of a Jupyter Notebook are the metadata, the notebook format and the list of cells. Metadata is a data dictionary of definitions used to set up and display the notebook; the notebook format is the version number of the software; and the list of cells contains the different types of cells for Markdown (display), code (to execute), and the output of the code cells. JupyterHub is a multi-user server for Jupyter Notebooks. It is designed to support many users by spawning, managing, and proxying many individual Jupyter Notebook servers.
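The on-disk structure described above can be inspected directly; the sketch below assumes an example.ipynb file exists in the working directory.

```python
# Inspect the JSON structure of a notebook file (example.ipynb is assumed to exist).
import json

with open("example.ipynb", encoding="utf-8") as f:
    nb = json.load(f)                                   # a notebook is versioned JSON

print(nb["nbformat"], nb.get("nbformat_minor"))         # notebook format version
print(nb["metadata"].get("kernelspec", {}))             # metadata used to set up the notebook
print([cell["cell_type"] for cell in nb["cells"]])      # markdown / code cells
```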
Tensorflow & Keras:
TensorFlow is a free and open-source software library for machine learning and artificial
intelligence. It can be used across a range of tasks but has a particular focus on training and
inference of deep neural networks.
TensorFlow was developed by the Google Brain team for internal Google use in research and production. It can be used in a wide variety of programming languages, including Python, JavaScript, C++, and Java, and it serves as the core platform and library for machine learning.
TensorFlow's APIs use Keras to allow users to make their own machine learning models. In
addition to building and training their model, TensorFlow can also help load the data to train the
model, and deploy it using TensorFlow Serving.
Auto-differentiation is the process of automatically calculating the gradient vector of a model with respect to each of its parameters. With this feature, TensorFlow can automatically compute the gradients for the parameters in a model, which is useful for algorithms such as backpropagation that require gradients to optimize performance.
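A small example of this automatic differentiation using tf.GradientTape is shown below; a toy scalar function stands in for a real model.

```python
# Auto-differentiation sketch with tf.GradientTape (toy function, not a real model).
import tensorflow as tf

x = tf.Variable(3.0)                     # a "parameter" to differentiate with respect to
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x                 # y = x^2 + 2x
grad = tape.gradient(y, x)               # dy/dx = 2x + 2 = 8.0 at x = 3
print(grad.numpy())
```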
Keras is an open-source software library that provides a Python interface for artificial neural
networks. Keras acts as an interface for the TensorFlow library.
Keras contains numerous implementations of commonly used neural-network building blocks
such as layers, objectives, activation functions, optimizers, and a host of tools to make working with
image and text data easier to simplify the coding necessary for writing deep neural network code.
The code is hosted on GitHub, and community support forums include the GitHub issues page, and
a Slack channel. Keras has support for convolutional and recurrent neural networks. It supports
other common utility layers like dropout, batch normalization, and pooling. Keras allows users to
productize deep models on smartphones (iOS and Android), on the web, or on the Java Virtual
Machine.

Frontend

In the front end of the project, we use HTML, CSS and JavaScript with the Visual Studio framework.

HTML
HTML stands for “Hyper Text Markup Language”. It was invented in 1990 by a scientist called Tim Berners-Lee. It is a computer language that allows website creation; websites built with it can be viewed by anyone else connected to the internet, and it is relatively easy to learn. It is constantly undergoing revision and evaluation to meet the demands and requirements of the growing internet audience, under the direction of the W3C (World Wide Web Consortium), the organization charged with designing and maintaining the language. HTML is the mother tongue of the browser. For example, it specifies text, images, and other objects and can also specify the appearance of text, such as bold or italic text.
The World Wide Web Consortium (W3C) defines the specification for HTML. The current
versions of HTML are HTML5 and XHTML5.

Note: DHTML stands for Dynamic HTML. DHTML combines cascading style sheets
(CSS) and scripting to create animated Web pages and page elements that respond to user
interaction.

FEATURES
 It is easy to learn and easy to use.
 It is platform-Independent.
 Images, video, and audio can be added to a web page.
 Hypertext can be added to the text.
 It is a markup language.

JAVASCRIPT
JavaScript (often shortened to JS) is a lightweight, interpreted, object-oriented language with first-class functions and is best known as the scripting language for web pages, but it is used in many non-browser environments as well. It is a prototype-based, multi-paradigm scripting language that is dynamic and supports object-oriented, imperative, and functional programming styles. JavaScript runs on the client side of the web and can be used to program how web pages behave on the occurrence of an event. It is an easy-to-learn yet powerful scripting language, widely used for controlling web page behavior.

FEATURES
 Object-centered scripting language.
 Client-side technology.
 Validation of user input.
 Interpreter-based.
 Ability to perform built-in functions.
 Case-sensitive format.
 Lightweight and delicate.
 Event handling.

CSS
CSS stands for Cascading Style Sheets. CSS describes how HTML elements are to be displayed on screen, paper, or in other media. CSS saves a lot of work: it can control the layout of multiple web pages all at once. External style sheets are stored in CSS files. CSS helps web developers create a uniform look across several pages of a website. Instead of defining the style of each table and each block of text within a page's HTML, commonly used styles need to be defined only once in a CSS document. Once a style is defined in a cascading style sheet, it can be used by any page that references the CSS file. Plus, CSS makes it easy to change styles across several pages at once.

FEATURES

 CSS saves time.


 Pages load faster.
 Easy maintenance.
 Superior Style to HTML.
 Multiple Device Compatibility.
 Global Web Standards.
2.7.2 Backend

In the back end of the project, we use SQLite with Python 3.7 and the Flask framework.

SQLite3:
SQLite3 is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. It is zero-configured, which means that, unlike other databases, you do not need to configure it on your system. The SQLite3 engine is not a standalone process like other database engines; you can link it statically or dynamically with your application as required, and it accesses its storage files directly.
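The zero-configuration, in-process behaviour described above can be seen in a few lines; the file and table names below are illustrative only.

```python
# SQLite3 runs in-process: connecting creates the database file, no server needed.
import sqlite3

conn = sqlite3.connect("fakeprofile.db")           # illustrative file name
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO demo (name) VALUES (?)", ("sample",))
conn.commit()
print(cur.execute("SELECT * FROM demo").fetchall())
conn.close()
```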

FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

 Economic Feasibility
 Technical Feasibility
 Social Feasibility

Economic Feasibility

Economic analysis is the most frequently used method for evaluating the effectiveness of a candidate system. More commonly known as cost/benefit analysis, the procedure is to determine the benefits and savings that are expected from a candidate system and compare them with the costs. If the benefits outweigh the costs, the decision is made to design and implement the system.

Technical Feasibility

This involves questions such as whether the technology needed for the system exists, how difficult it will be to build, and whether the firm has enough experience
using that technology. The assessment is based on an outline design of system
requirements in terms of Input, processes, output, fields, programs, and procedures.
This can be quantified in terms of volumes of data, trends, frequency of updating, etc.
in order to estimate if the new system will perform adequately or not.

Social Feasibility
This determines whether the proposed system conflicts with legal requirements (e.g., a data processing system must comply with the local data protection acts). When an organization has either internal or external legal counsel, such reviews are typically standard. However, a project may face legal issues after completion if this factor is not considered at this stage. It is also a matter of authorization.
CHAPTER III
SYSTEM DESIGN
TABLE DESIGN:
USER SIGNUP:

Field Name Data Type Description

user_id Varchar(30) This field is used to enter the username

User e-mail Varchar(50) This field is used to enter the user e-mail

User password Varchar(100) This field is used to enter the user password

Date date This field is used to enter the expire date

Contact Varchar (30) This field is used to enter the contact number

USER SIGNS IN:

Field Name Data Type Description

user_id Varchar(30) This field is used to enter the username

User password Varchar(100) This field is used to enter the user password
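A hedged sketch of creating these two tables in the project's SQLite back end follows; the table and column names are adapted from the table design above, and SQLite treats the VARCHAR lengths only as declared types.

```python
# Create the signup and sign-in tables sketched above (names adapted from the table design).
import sqlite3

conn = sqlite3.connect("fakeprofile.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS user_signup (
    user_id       VARCHAR(30) PRIMARY KEY,   -- username
    user_email    VARCHAR(50),               -- user e-mail
    user_password VARCHAR(100),              -- user password
    date          DATE,                      -- date field from the table design
    contact       VARCHAR(30)                -- contact number
);
CREATE TABLE IF NOT EXISTS user_signin (
    user_id       VARCHAR(30),               -- username
    user_password VARCHAR(100)               -- user password
);
""")
conn.commit()
conn.close()
```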
ARCHITECTURE DIAGRAM:

An architectural diagram is a visual representation that maps out the physical implementation of the components of a software system. It shows the general structure of the software system and the associations, limitations, and boundaries between each element.
FLOW DIAGRAM:
Flow diagram is a graphic representation of the physical route or flow of people, materials,
paper works, vehicles, or communication associated with a process, procedure plan, or
investigation. In the second definition the meaning is limited to the representation of the physical
route or flow.
DATA FLOW DIAGRAM:
A data flow diagram (DFD) maps out the flow of information for any process or system. It uses
defined symbols like rectangles, circles and arrows, plus short text labels, to show data inputs,
outputs, storage points and the routes between each destination. Data flowcharts can range from
simple, even hand-drawn process overviews, to in-depth, multi-level DFDs that dig progressively
deeper into how the data is handled. They can be used to analyze an existing system or model a new
one. Like all the best diagrams and charts, a DFD can often visually “say” things that would be hard
to explain in words, and they work for both technical and nontechnical audiences, from developer to
CEO. That’s why DFDs remain so popular after all these years. While they work well for data flow
software and systems, they are less applicable nowadays to visualizing interactive, real-time or

database-oriented software or systems.

DATA COLLECTION:

FEATURE EXTRACTION:
MODEL CREATION:

PREDICTION:
CHAPTER IV

IMPLEMENTATION AND TESTING

SOFTWARE TESTING:
Testing documentation is the documentation of artifacts that are created during or before the testing
of a software application. Documentation reflects the importance of processes for the customer,
individual and organization. Projects which contain all documents have a high level of maturity.
Careful documentation can save the time, efforts and wealth of the organization.

If the testing or development team receives software that is not working correctly and was developed by someone else, then to find the error the team will first need a document. If the documents are available, the team can quickly find the cause of the error by examining the documentation. But if the documents are not available, the tester needs to do black box and white box testing again, which wastes the time and money of the organization. More than that, lack of documentation becomes a problem for acceptance.

As per the IEEE, testing documentation describes plans for, or results of, the testing of a system or component; its types include test case specifications, test incident reports, test logs, test plans, test procedures, and test reports. The testing of all the above-mentioned documents is known as documentation testing. This is one of the most cost-effective approaches to testing: if the documentation is not right, there will be major and costly problems. The documentation can be tested in a number of different ways and to many different degrees of complexity, ranging from running the documents through a spelling and grammar checker to manually reviewing the documentation to remove any ambiguity or inconsistency.

Documentation testing can start at the very beginning of the software process and hence save large
amounts of money, since the earlier a defect is found the less it will cost to be fixed.
The most popular testing documentation files are test reports, plans, and checklists. These documents are used to outline the team's workload and keep track of the process. Let's take a look at the key requirements for these files and see how they contribute to the process.

TEST STRATEGY

An outline of the full approach to product testing. As the project moves along, developers,
designers, product owners can come back to the document and see if the actual performance
corresponds to the planned activities.

TEST DATA

The data that testers enter into the software to verify certain features and their outputs. Examples of
such data can be fake user profiles, statistics, media content, similar to files that would be uploaded
by an end-user in a ready solution.

TEST PLANS

A file that describes the strategy, resources, environment, limitations, and schedule of the testing
process. It is the most complete testing document, essential for informed planning. Such a document is
distributed between team members and shared with all stakeholders.

TEST SCENARIOS

In scenarios, testers break down the product’s functionality and interface by modules and
provide real-time status updates at all testing stages. A module can be described by a single
statement, or require hundreds of statuses, depending on its size and scope.

TEST CASES

If a test scenario describes the object of testing (what), a test case describes the procedure (how). These files cover step-by-step guidance, detailed conditions, and the current inputs of a testing task. Test cases have their own kinds that depend on the type of testing: functional, UI, physical, logical cases, etc. Test cases compare available resources and current conditions with desired outcomes and determine whether the functionality can be released or not.

TRACEABILITY MATRIX:

This software testing documentation maps test cases to their requirements. All entries have custom IDs, so team members and stakeholders can track the progress of any task by simply entering its ID into the search.

The combination of internal and external documentation is the key to a deep understanding of all
testing processes. Although stakeholders typically have access to the majority of documentation,
they mostly work with external files, since they are more concise and tackle tangible issues and
results. Internal files, on the other hand, are used by team members to optimize the testing process.

Unit testing is not a new concept; it has been around since the early days of programming. Usually, developers and sometimes white-box testers write unit tests to improve code quality by verifying each and every unit of the code used to implement functional requirements (also known as test-driven development, TDD, or test-first development).

TEST DRIVEN DEVELOPMENT:

Test Driven Development, or TDD, is a code design technique where the programmer writes a test
before any production code, and then writes the code that will make that test pass. The idea is that
with a tiny bit of assurance from that initial test, the programmer can feel free to refactor and
refactor some more to get the cleanest code they know how to write. The idea is simple, but like
most simple things, the execution is hard. TDD requires a completely different mindset from what most people are used to, and the tenacity to deal with a learning curve that may slow you down at first.
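A small test-first sketch using Python's unittest module is shown below; is_fake_score() is a hypothetical helper invented purely to illustrate writing the test before (or alongside) the code that makes it pass.

```python
# TDD-style sketch: tests written against a hypothetical helper, is_fake_score().
import unittest

def is_fake_score(followers: int, following: int) -> float:
    """Toy heuristic used only to make the illustrative tests pass."""
    if followers == 0:
        return 1.0
    return min(1.0, following / (followers * 10))

class TestFakeScore(unittest.TestCase):
    def test_zero_followers_is_maximally_suspicious(self):
        self.assertEqual(is_fake_score(0, 50), 1.0)

    def test_balanced_account_scores_low(self):
        self.assertLess(is_fake_score(1000, 100), 0.5)

if __name__ == "__main__":
    unittest.main()
```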

BLACKBOX TESTING:
During functional testing, testers verify the app features against the user specifications. This is
completely different from testing done by developers which is unit testing. It checks whether the
code works as expected. Because unit testing focuses on the internal structure of the code, it is
called the white box testing. On the other hand, functional testing checks functionalities without
looking at the internal structure of the code, hence it is called black box testing. Despite how
flawless the various individual code components may be, it is essential to check that the app functions as expected when all components are combined.

INTEGRATION TESTING:

Integration testing is a level of software testing where individual units are combined and tested as a
group. The purpose of this level of testing is to expose faults in the interaction between
integrated units. Test drivers and test stubs are used to assist in Integration Testing.

Integration testing: Testing performed to expose defects in the interfaces and in the interactions
between integrated components or systems. See also component integration testing, system
integration testing.

Component integration testing: testing performed to expose defects in the interfaces and interaction between integrated components. System integration testing: testing the integration of systems and packages, and testing interfaces to external organizations (e.g., Electronic Data Interchange, the Internet).

Integration tests determine if independently developed units of software work correctly when
they are connected to each other. The term has become blurred even by the diffuse standards of
the software industry, so I've been wary of using it in my writing. In particular, many people
assume integration tests are necessarily broad in scope, while they can be more effectively done with a narrower scope.

As often with these things, it's best to start with a bit of history. When I first learned about
integration testing, it was in the 1980's and the waterfall was the dominant influence of
software development thinking. In a larger project, we would have a design phase that would
specify the interface and behavior of the various modules in the system. Modules would then
be assigned to developers to program. It was not unusual for one programmer to be responsible
for a single module, but this would be big enough that it could take months to build it. All this
work was done in isolation, and when the programmer believed it was finished they would
hand it over to QA for testing.

Integration testing tests integration or interfaces between components, interactions to different


parts of the system such as an operating system, file system and hardware or interfaces between
systems. Integration testing is a key aspect of software testing.

SYSTEM TESTING:

System testing is a level of software testing where a complete and integrated software is tested.
The purpose of this test is to evaluate the system's compliance with the specified requirements.
System Testing means testing the system as a whole. All the modules/components are
integrated in order to verify if the system works as expected or not.

System Testing is done after Integration Testing. This plays an important role in delivering a
high-quality product. System testing is a method of monitoring and assessing the behaviour of
the complete and fully-integrated software product or system, on the basis of pre-decided
specifications and functional requirements. It is a solution to the question "whether the
complete system functions in accordance to its pre-defined requirements?"

It comes under black box testing, i.e., only the external working features of the software are evaluated during this testing. It does not require any internal knowledge of the coding, programming, design, etc., and is completely based on the user's perspective.

A black box testing type, system testing is the first testing technique that carries out the task of
testing a software product as a whole. This System testing tests the integrated system and
validates whether it meets the specified requirements of the client.

System testing is a process of testing the entire system that is fully functional, in order to
ensure the system is bound to all the requirements provided by the client in the form of the
functional specification or system specification documentation. In most cases, it is done next to
the Integration testing, as this testing should be covering the end-to-end systems actual routine.
This type of testing requires a dedicated Test Plan and other test documentation derived from
the system specification document that should cover both software and hardware requirements.
By this test, we uncover errors and ensure that the whole system works as expected; system performance and functionality are checked to obtain a quality product. System testing is nothing but testing the system as a whole, and it checks the complete end-to-end scenario from the customer's point of view. Functional and non-functional tests are also carried out in system testing, all with the aim of maintaining confidence within the development team that the system is defect-free and bug-free. System testing is also intended to test the hardware/software requirements specifications, although it remains a more limited type of testing.
CHAPTER V

CONCLUSION AND FUTURE WORK


Fraudsters use fake profiles to spread false information, damage the reputation of the victim, and may also send friend requests to the victim's friends for financial gain. In order to overcome this difficulty, we propose a hybrid model for the prediction of fake profiles on online social networks. Based on the analysis in this research work, it was concluded that no single existing model is being used for the detection of both fake and genuine profiles; therefore, a combination of NLP and LSTM can be used for the detection of fake as well as genuine profiles on social media. The main problem is that a person can have multiple accounts, which gives them the opportunity to create fake profiles and accounts on online social networks. The idea is to attach an Aadhaar card number when signing up for an account so that the creation of multiple accounts can be restricted.

REFERENCES:

[1] Yeh-Cheng Chen and S. Felix Wu, "FakeBuster: A Robust Fake Account Detection by Activity Analysis," 2018.

[2] Sk. Shama, K. Siva Nandini, P. Bhavya Anjali and K. Devi Manaswi, "Fake Profile Identification in Online Social Network," 2019.

[3] Faiza Masood, Ghana Ammad, Ahmad Almogren, Assad Abbas, Hasan Ali Khattak, Ikram Uddin, Mohsen Guizani and Mansour Zuair, "Spammer Detection and Fake Profile Identification on Social Network," 2019.

[4] "Malicious Account Detection on Twitter Based on Tweet Account Features using Machine Learning."
