INTRODUCTION

What is Data Mining?

Structure of Data Mining

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

How Does Data Mining Work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

• Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

• Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

• Associations: Data can be mined to identify associations between items that tend to occur together. The well-known beer-and-diapers rule is an example of association mining (a small support-and-confidence sketch follows this list).

• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
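
As a concrete illustration of association mining, the following minimal sketch (hypothetical transaction data and class name, written in Java to match the project's stated coding language) computes the support and confidence of the beer-and-diapers rule:

import java.util.List;
import java.util.Set;

// Illustrative support/confidence computation for the rule "beer -> diapers"
// on a tiny, made-up transaction list.
public class AssociationSketch {
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("beer", "diapers", "chips"),
                Set.of("beer", "diapers"),
                Set.of("beer", "bread"),
                Set.of("diapers", "milk"),
                Set.of("beer", "diapers", "milk"));

        long withBeer = transactions.stream()
                .filter(t -> t.contains("beer")).count();
        long withBoth = transactions.stream()
                .filter(t -> t.contains("beer") && t.contains("diapers")).count();

        // support    = fraction of all transactions containing both items
        // confidence = fraction of beer transactions that also contain diapers
        double support = (double) withBoth / transactions.size();
        double confidence = (double) withBoth / withBeer;
        System.out.printf("support = %.2f, confidence = %.2f%n", support, confidence);
    }
}

On these five made-up transactions, the rule holds with support 0.60 (3 of 5 transactions) and confidence 0.75 (3 of the 4 transactions containing beer).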

Data mining consists of five major elements:

1) Extract, transform, and load transaction data onto the data warehouse
system.
2) Store and manage the data in a multidimensional database system.
3) Provide data access to business analysts and information technology
professionals.
4) Analyze the data with application software.
5) Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:

• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID. (A minimal split-selection sketch follows this list.)

• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique. (A bare-bones example also follows this list.)

• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
• Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
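
To make the 2-way splitting idea behind CART concrete, the following minimal sketch (made-up data and a simplified Gini-based split search, not the full CART algorithm) picks the threshold on a single numeric feature that best separates two classes:

// Illustrative 2-way split selection with the Gini impurity, in the spirit of
// CART; real CART also handles pruning, categorical features, and regression.
public class GiniSplitSketch {
    // Gini impurity of a set of binary labels (0/1).
    static double gini(int positives, int total) {
        if (total == 0) return 0.0;
        double p = (double) positives / total;
        return 1.0 - p * p - (1.0 - p) * (1.0 - p);
    }

    public static void main(String[] args) {
        double[] feature = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};   // e.g., visits per month
        int[] label      = {0,   0,   0,   1,   1,   1};     // e.g., responded to offer

        double bestThreshold = Double.NaN, bestImpurity = Double.MAX_VALUE;
        for (double threshold : feature) {
            int leftTotal = 0, leftPos = 0, rightTotal = 0, rightPos = 0;
            for (int i = 0; i < feature.length; i++) {
                if (feature[i] <= threshold) { leftTotal++; leftPos += label[i]; }
                else                         { rightTotal++; rightPos += label[i]; }
            }
            // Weighted impurity of the two children produced by this split.
            double impurity = (leftTotal * gini(leftPos, leftTotal)
                    + rightTotal * gini(rightPos, rightTotal)) / feature.length;
            if (impurity < bestImpurity) { bestImpurity = impurity; bestThreshold = threshold; }
        }
        System.out.println("best split: feature <= " + bestThreshold);
    }
}

Here the split at feature <= 3.0 separates the two classes perfectly, giving a weighted Gini impurity of 0.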
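
Likewise, the nearest neighbor method can be sketched as follows (Euclidean distance over two hypothetical numeric features; a real implementation would normalize the features and index the historical data):

import java.util.Arrays;
import java.util.Comparator;

// Illustrative k-nearest-neighbor classification of one new record against a
// small, made-up historical dataset with two numeric features and a 0/1 class.
public class NearestNeighborSketch {
    static int classify(double[][] history, int[] classes, double[] query, int k) {
        Integer[] order = new Integer[history.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort historical records by Euclidean distance to the query record.
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer i) -> Math.hypot(history[i][0] - query[0], history[i][1] - query[1])));
        int votes = 0;
        for (int i = 0; i < k; i++) votes += classes[order[i]];
        return votes * 2 > k ? 1 : 0;   // majority vote among the k neighbors
    }

    public static void main(String[] args) {
        double[][] history = {{1, 1}, {1, 2}, {5, 5}, {6, 5}, {6, 6}};
        int[] classes      = {0,      0,      1,      1,      1};
        System.out.println(classify(history, classes, new double[]{5.5, 5.0}, 3)); // prints 1
    }
}

With k = 3, the three records closest to the query all belong to class 1, so the sketch prints 1.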

Characteristics of Data Mining:

• Large quantities of data: The volume of data is so great that it has to be analyzed by automated techniques, e.g. satellite information, credit card transactions, etc.
• Noisy, incomplete data: Imprecise data is characteristic of all data collection.
• Complex data structures: conventional statistical analysis is not possible.
• Heterogeneous data stored in legacy systems.

Benefits of Data Mining:

1) It is one of the most effective services available today. With the help of data mining, one can discover valuable information about customers and their behavior for a specific set of products, and evaluate, analyze, store, mine and load data related to them.
2) An analytical CRM model and strategic business-related decisions can be made with the help of data mining, as it helps in providing a complete synopsis of customers.
3) A great number of organizations have implemented data mining projects, and it has helped them make unprecedented improvements in their marketing strategies (campaigns).
4) Data mining is generally used by organizations with a solid customer focus. Because of its flexible applicability, it is widely used in applications to forecast crucial data, including industry analysis and consumer buying behavior.
5) Fast-paced and prompt access to data, along with economical processing techniques, has made data mining one of the most suitable services that a company can seek.

Advantages of Data Mining:

1. Marketing / Retail:

Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing campaigns, etc. Through the results, marketers will have an appropriate approach to selling profitable products to targeted customers.

Data mining brings a lot of benefits to retail companies in the same way as marketing. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased products together. In addition, it also helps the retail companies offer certain discounts for particular products that will attract more customers.

2. Finance / Banking

Data mining gives financial institutions information about loans and credit reporting. By building a model from historical customer data, banks and financial institutions can determine good and bad loans. In addition, data mining helps banks detect fraudulent credit card transactions to protect credit card owners.
3. Manufacturing

By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers face the challenge that, even when the conditions of manufacturing environments at different wafer production plants are similar, the quality of the wafers is not the same, and some, for unknown reasons, even have defects. Data mining has been applied to determine the ranges of control parameters that lead to the production of golden wafers. Those optimal control parameters are then used to manufacture wafers of the desired quality.

4. Governments

Data mining helps government agencies by digging into and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activities.

5. Law enforcement:

Data mining can aid law enforcers in identifying criminal suspects as well as apprehending these criminals by examining trends in location, crime type, habits, and other patterns of behavior.

6. Researchers:

Data mining can assist researchers by speeding up their data analysis process, thus allowing them more time to work on other projects.
LITERATURE SURVEY

1) Monkey-Spider: Detecting malicious websites with low-interaction honeyclients

AUTHORS: A. Ikinci, T. Holz, and F. Freiling.

Client-side attacks are on the rise: malicious websites that exploit vulnerabilities in
the visitor’s browser are posing a serious threat to client security, compromising
innocent users who visit these sites without having a patched web browser.
Currently, there is neither a freely available comprehensive database of threats on
the Web nor sufficient freely available tools to build such a database. In this work,
we introduce the Monkey-Spider project [Mon]. Utilizing it as a client honeypot,
we portray the challenge in such an approach and evaluate our system as a high-speed, Internet-scale analysis tool to build a database of threats found in the wild.
Furthermore, we evaluate the system by analyzing different crawls performed
during a period of three months and present the lessons learned.

2) EvilSeed: A guided approach to finding malicious web pages

AUTHORS: L. Invernizzi, S. Benvenuti, M. Cova, P. M. Comparetti, C. Kruegel, and G. Vigna.

Malicious web pages that use drive-by download attacks or social engineering
techniques to install unwanted software on a user's computer have become the
main avenue for the propagation of malicious code. To search for malicious web
pages, the first step is typically to use a crawler to collect URLs that are live on the
Internet. Then, fast prefiltering techniques are employed to reduce the amount of
pages that need to be examined by more precise, but slower, analysis tools (such as
honey clients). While effective, these techniques require a substantial amount of
resources. A key reason is that the crawler encounters many pages on the web that
are benign, that is, the "toxicity" of the stream of URLs being analyzed is low. In
this paper, we present EVILSEED, an approach to search the web more efficiently
for pages that are likely malicious. EVILSEED starts from an initial seed of
known, malicious web pages. Using this seed, our system automatically generates
search engines queries to identify other malicious pages that are similar or related
to the ones in the initial seed. By doing so, EVILSEED leverages the crawling
infrastructure of search engines to retrieve URLs that are much more likely to be
malicious than a random page on the web. In other words EVILSEED increases the
"toxicity" of the input URL stream. Also, we envision that the features that
EVILSEED presents could be directly applied by search engines in their prefilters.
We have implemented our approach, and we evaluated it on a large-scale dataset.
The results show that EVILSEED is able to identify malicious web pages more
efficiently when compared to crawler-based approaches.

3) SVMs for the blogosphere: Blog identification and splog detection

AUTHORS: P. Kolari, T. Finin, and A. Joshi.

Weblogs, or blogs, have become an important new way to publish information,
engage in discussions and form communities. The increasing popularity of blogs
has given rise to search and analysis engines focusing on the 'blogosphere'. A key
requirement of such systems is to identify blogs as they crawl the Web. While this
ensures that only blogs are indexed, blog search engines are also often
overwhelmed by spam blogs (splogs). Splogs not only incur computational
overheads but also reduce user satisfaction. In this paper we first describe our
experiments on blog identification using Support Vector Machines (SVM). We
compare results of using different feature sets and introduce new features for blog
identification. We then report preliminary results on splog detection and identify
future work.
4) PhishDef: URL names say it all

AUTHORS: A. Le, A. Markopoulou, and M. Faloutsos.

Phishing is an increasingly sophisticated method to steal personal user information
using sites that pretend to be legitimate. In this paper, we take the following steps
to identify phishing URLs. First, we carefully select lexical features of the URLs
that are resistant to obfuscation techniques used by attackers. Second, we evaluate
the classification accuracy when using only lexical features, both automatically and
hand-selected, vs. when using additional features. We show that lexical features
are sufficient for all practical purposes. Third, we thoroughly compare several
classification algorithms, and we propose to use an online method (AROW) that is
able to overcome noisy training data. Based on the insights gained from our
analysis, we propose PhishDef, a phishing detection system that uses only URL
names and combines the above three elements. PhishDef is a highly accurate
method (when compared to state-of-the-art approaches over real datasets),
lightweight (thus appropriate for online and client-side deployment), proactive
(based on online classification rather than blacklists), and resilient to training data
inaccuracies (thus enabling the use of large noisy training data).
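
To illustrate what lexical URL features look like in practice, the sketch below extracts a few commonly used ones (URL length, dots in the host, path depth, IP-address host, '@' symbol, digit count). These particular features, names, and the example URL are illustrative assumptions, not necessarily the exact feature set used by PhishDef:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative lexical feature extraction from a URL string; a classifier
// (e.g., an online linear model) could be trained on vectors like this one.
public class UrlLexicalFeatures {
    static Map<String, Double> extract(String url) throws URISyntaxException {
        URI uri = new URI(url);
        String host = uri.getHost() == null ? "" : uri.getHost();
        String path = uri.getPath() == null ? "" : uri.getPath();

        Map<String, Double> features = new LinkedHashMap<>();
        features.put("url_length", (double) url.length());
        features.put("host_dots", (double) host.chars().filter(c -> c == '.').count());
        features.put("path_depth", (double) path.chars().filter(c -> c == '/').count());
        features.put("has_ip_host", host.matches("\\d{1,3}(\\.\\d{1,3}){3}") ? 1.0 : 0.0);
        features.put("has_at_symbol", url.contains("@") ? 1.0 : 0.0);
        features.put("digit_count", (double) url.chars().filter(Character::isDigit).count());
        return features;
    }

    public static void main(String[] args) throws URISyntaxException {
        // Hypothetical suspicious-looking URL used purely for demonstration.
        System.out.println(extract("http://192.168.10.5/secure/login/update.php?id=123"));
    }
}

Feature vectors of this kind can then be fed to a classifier trained on labeled phishing and legitimate URLs.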

5) The core of the matter: Analyzing malicious traffic in cellular carriers

AUTHORS: C. Lever, M. Antonakakis, B. Reaves, P. Traynor, and W. Lee.

Much of the attention surrounding mobile malware has focused on the in-depth
analysis of malicious applications. While bringing the community valuable
information about the methods used and data targeted by malware writers, such
work has not yet been able to quantify the prevalence with which mobile devices
are actually infected. In this paper, we present the first such attempt through a
study of the hosting infrastructure used by mobile applications. Using DNS traffic
collected over the course of three months from a major US cellular provider as
well as a major US noncellular Internet service provider, we identify the DNS
domains looked up by mobile applications, and analyze information related to the
Internet hosts pointed to by these domains. We make several important
observations. The mobile malware found by the research community thus far
appears in a minuscule number of devices in the network: 3,492 out of over 380
million (less than 0.0009%) observed during the course of our analysis. This result
lends credence to the argument that, while not perfect, mobile application markets
are currently providing adequate security for the majority of mobile device users.
Second, we find that users of iOS devices are virtually as likely to communicate with known low reputation domains as the owners of other mobile
platforms, calling into question the conventional wisdom of one platform
demonstrably providing greater security than another. Finally, we observe two
malware campaigns from the upper levels of the DNS hierarchy and analyze the
lifetimes and network properties of these threats. We also note that one of these
campaigns ceases to operate long before the malware associated with it is
discovered, suggesting that network-based countermeasures may be useful in the
identification and mitigation of future threats.
SYSTEM ANALYSIS
3.2 Existing System:

Functionality and layout have regularly been used to perform static analysis
to determine maliciousness in the desktop space [20], [37], [51]. Features such
as the frequency of iframes and the number of redirections have traditionally
served as strong indicators of malicious intent. Due to the significant changes made
to accommodate mobile devices, such assertions may no longer be true. Previous
techniques also fail to consider mobile specific webpage elements such as calls to
mobile APIs. For instance, links that spawn the phone’s dialer (and the reputation
of the number itself) can provide strong evidence of the intent of the page. New
tools are therefore necessary to identify malicious pages in the mobile web. We
first experimentally demonstrate that the distributions of identical static features
when extracted from desktop and mobile webpages vary dramatically.
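
As an illustration of this kind of static analysis, the sketch below counts a few such features (iframes, scripts, and links that spawn the phone's dialer) in raw page HTML. The feature set, regular expressions, class name, and example page are illustrative assumptions only, not kAYO's actual feature extractor:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative extraction of a few static features from raw mobile page HTML:
// iframe count, script count, and links that invoke the phone dialer (tel:).
public class MobilePageFeatures {
    static int count(String html, String regex) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical page content used purely for demonstration.
        String html = "<html><body>"
                + "<iframe src='http://ads.example/x'></iframe>"
                + "<a href='tel:+15550100'>Call support now</a>"
                + "<script src='track.js'></script>"
                + "</body></html>";

        System.out.println("iframes      : " + count(html, "<iframe\\b"));
        System.out.println("scripts      : " + count(html, "<script\\b"));
        System.out.println("dialer links : " + count(html, "href=['\"]tel:"));
        // Redirections would be counted while fetching the page (by following
        // HTTP 3xx responses), and the reputation of the dialed number could
        // be checked against a blacklist.
    }
}

Counts like these, together with the number of redirections observed while fetching the page, form the kind of static feature vector on which a classifier can be trained.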

3.2.1 Disadvantages:

• Mobile webpages require multiple redirections before users gain access to content.
• Existing techniques such as VirusTotal and Google Safe Browsing do not precisely detect a number of malicious mobile webpages (which kAYO detects).
• We experimentally demonstrate that the distributions of static features used in existing techniques (e.g., the number of redirections) are different when measured on mobile and desktop webpages.
3.3 Proposed System:

We illustrate that certain features are inversely correlated with, unrelated to, or non-indicative of a webpage being malicious when extracted from each space. The results of our experiments demonstrate the need for mobile-specific techniques for detecting malicious webpages. Accordingly, we build a browser extension using kAYO that informs users about the maliciousness of the webpages they intend to visit in real time. We plan to make the extension publicly available post publication.

3.3.1 Advantages:

• An improvement of two orders of magnitude in the speed of feature extraction.

• To the best of our knowledge, kAYO is the first technique that detects mobile-specific malicious webpages by static analysis.

• kAYO enables detection of malicious mobile webpages missed by existing techniques. Finally, we survey existing extensions on the Firefox desktop browser.
3.4 SOFTWARE REQUIREMENT SPECIFICATION
3.4.1 SYSTEM REQUIREMENTS:

3.4.2 SOFTWARE REQUIREMENTS:

• Operating system : Windows XP/7
• Coding Language : JAVA/J2EE
• IDE : Netbeans 7.4
• Database : MYSQL

3.4.3 HARDWARE REQUIREMENTS:

• System : Pentium IV 2.4 GHz
• Hard Disk : 40 GB
• Floppy Drive : 1.44 MB
• Monitor : 15-inch VGA Colour
• Mouse : Logitech
• RAM : 512 MB
SYSTEM DESIGN

4.1 SYSTEM ARCHITECTURE:

4.2 CLASS DIAGRAM:

In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among the classes. It explains which class contains information.
4.3 USE CASE DIAGRAM:

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases. The main purpose of a use case diagram is to show what system functions are performed for which actor. Roles of the actors in the system can be depicted.
4.4 SEQUENCE DIAGRAM:

A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of a Message Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and timing diagrams.
4.8 ACTIVITY DIAGRAM:

Activity diagrams are graphical representations of workflows of stepwise activities and actions with support for choice, iteration and concurrency. In the Unified Modeling Language, activity diagrams can be used to describe the business and operational step-by-step workflows of components in a system. An activity diagram shows the overall flow of control.
