Analyzing cyber incident data sets is an important method for
deepening our understanding of the evolution of the threat situation. This
is a relatively new research topic, and many studies remain to be done.
In this paper, we report a statistical analysis of a breach incident data set
corresponding to 12 years (2005–2017) of cyber hacking activities that
include malware attacks. We show that, in contrast to the findings
reported in the literature, both hacking breach incident inter-arrival times
and breach sizes should be modeled by stochastic processes, rather than
by distributions, because they exhibit autocorrelations. Then, we propose
particular stochastic process models to, respectively, fit the inter-arrival
times and the breach sizes. We also show that these models can predict
the inter-arrival times and the breach sizes. In order to get deeper
insights into the evolution of hacking breach incidents, we conduct both
qualitative and quantitative trend analyses on the data set. We draw a set
of cybersecurity insights, including that the threat of cyber hacks is
indeed getting worse in terms of their frequency, but not in terms of the
magnitude of their damage.
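The autocorrelation argument above can be checked on any list of incident dates. The following is a minimal sketch using only the Python standard library; the function names and the sample dates are illustrative assumptions and are not taken from the analyzed data set.

```python
from datetime import date


def inter_arrival_days(dates):
    """Return the gaps (in days) between consecutive incident dates."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    numer = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return numer / denom


# Made-up incident dates, for illustration only (not from the breach data set).
incident_dates = [
    date(2017, 1, 1), date(2017, 1, 3), date(2017, 1, 6),
    date(2017, 1, 10), date(2017, 1, 15), date(2017, 1, 21),
]
gaps = inter_arrival_days(incident_dates)
lag1 = autocorrelation(gaps, 1)
```

A lag-1 value clearly different from zero on real data is the kind of evidence that suggests a time-dependent model rather than a single fitted distribution; a formal test would still be needed to confirm it.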

The present study is motivated by several questions that have not been
investigated until now, such as: Are data breaches caused by cyber-
attacks increasing, decreasing, or stabilizing? A principled answer to this
question will give us a clear insight into the overall situation of cyber
threats. This question was not answered by previous studies.
Specifically, the dataset analyzed in [7] only covered the time span from
2000 to 2008 and does not necessarily contain the breach incidents that
are caused by cyber-attacks; the dataset analyzed in [9] is more recent,
but contains two kinds of incidents: negligent breaches (i.e., incidents
caused by lost, discarded, stolen devices and other reasons) and
malicious breaches. Since negligent breaches reflect human error more
than cyber-attacks, we do not consider them in the present study.
Because the malicious breaches studied in [9] fall into four
sub-categories (hacking, including malware; insider; payment card fraud;
and unknown), this study focuses on the hacking sub-category (called the
hacking breach dataset hereafter), while noting that the other three
sub-categories are interesting in their own right and should be analyzed
separately. Recently, researchers have started modeling data breach incidents.
Maillart and Sornette studied the statistical properties of personal
identity losses in the United States between 2000 and 2008. They
found that the number of breach incidents increased dramatically from
2000 to July 2006 but remained stable thereafter. Edwards et al. analyzed
a dataset containing 2,253 breach incidents spanning a decade
(2005 to 2015). They found that neither the size nor the frequency of
data breaches has increased over the years. Wheatley et al. analyzed a
combined dataset of organizational breach incidents between 2000 and
2015. They found that the frequency of
large breach incidents (i.e., the ones that breach more than 50,000
records) occurring to US firms is independent of time, but the frequency
of large breach incidents occurring to non-US firms exhibits an
increasing trend.

In this paper, we make the following three contributions. First, we
show that both the hacking breach incident inter-arrival times (reflecting
incident frequency) and breach sizes should be modeled by stochastic
processes, rather than by distributions. We find that a particular point
process can adequately describe the evolution of the hacking breach
incident inter-arrival times and that a particular ARMA-GARCH model
can adequately describe the evolution of the hacking breach sizes, where
ARMA is the acronym for “AutoRegressive and Moving Average” and
GARCH is the acronym for “Generalized AutoRegressive Conditional
Heteroskedasticity.” We show that these stochastic process models can
predict the inter-arrival times and the breach sizes. To the best of our
knowledge, this is the first paper showing that stochastic processes,
rather than distributions, should be used to model these cyber threat
factors. Second, we discover a positive dependence between the
incident inter-arrival times and the breach sizes, and show that this
dependence can be adequately described by a particular copula. We also
show that when predicting inter-arrival times and breach sizes, it is
necessary to consider the dependence; otherwise, the prediction results
are not accurate. To the best of our knowledge, this is the first work
showing the existence of this dependence and the consequence of
ignoring it. Third, we conduct both qualitative and quantitative trend
analyses of the cyber hacking breach incidents. We find that the
situation is indeed getting worse in terms of the incident inter-arrival
time because hacking breach incidents become more and more frequent,
but the situation is stabilizing in terms of the incident breach size,
indicating that the damage of individual hacking breach incidents will
not get much worse. We hope the present study will inspire more
investigations, which can offer deep insights into alternate risk
mitigation approaches. Such insights are useful to insurance companies,
government agencies, and regulators because they need to deeply
understand the nature of data breach risks.

A Support Vector Machine (SVM) is a supervised machine learning
algorithm that can be used for both classification and regression
problems, although it is mostly used for classification. In this
algorithm, each data item is plotted as a point in n-dimensional space
(where n is the number of features), with the value of each feature
being the value of a particular coordinate. Classification is then
performed by finding the hyperplane that best separates the two
classes. Support vectors are the training observations that lie closest
to the separating hyperplane, and the SVM is the frontier (a hyperplane
or line) that best segregates the two classes. More formally, a support
vector machine constructs a hyperplane or set of hyperplanes in a high-
or infinite-dimensional space, which can be used for classification,
regression, or other tasks such as outlier detection. Intuitively, a good
separation is achieved by the hyperplane that has the largest distance to
the nearest training-data point of any class (the so-called functional
margin), since in general the larger the margin, the lower the
generalization error of the classifier. Although the original problem
may be stated in a finite-dimensional space, it often happens that the
sets to discriminate are not linearly separable in that space. For this
reason, it was proposed that the original finite-dimensional space be
mapped into a much higher-dimensional space, presumably making the
separation easier in that space.
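The two ideas in this paragraph, choosing the hyperplane with the largest margin and mapping the data into a higher-dimensional space, can be illustrated with a small standard-library sketch. It is not an SVM solver: the toy data, the mapping x -> (x, x^2) and the three candidate hyperplanes are assumptions chosen for illustration, and the code only compares the margins of those candidates.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points, labels):
    """Smallest label-signed distance; positive only if every point is on its correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))


# 1-D toy data: class +1 lies on both sides of class -1,
# so no single threshold on x separates the two classes.
xs = [-3.0, -2.0, 2.0, 3.0, -0.5, 0.0, 0.5]
labels = [1, 1, 1, 1, -1, -1, -1]

# A threshold at x = 1 in the original space misclassifies some points.
margin_1d = margin((1.0,), -1.0, [(x,) for x in xs], labels)

# Map each point into two dimensions: x -> (x, x^2).
mapped = [(x, x * x) for x in xs]

# Candidate hyperplanes x2 = c in the mapped space, written as w = (0, 1), b = -c.
candidates = {c: margin((0.0, 1.0), -c, mapped, labels) for c in (1.0, 2.125, 3.0)}
best_c = max(candidates, key=candidates.get)
```

All three candidates separate the mapped data, but the one halfway between the two classes has the largest distance to its nearest point, which is the criterion described above. A real SVM finds that hyperplane by optimization rather than by comparing a hand-picked list.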

Data resources can be uploaded to the database by both the
administrator and authorized users. Data can be uploaded with a
key to keep it secret, so that it is not released without the
knowledge of the user. Users are authorized based on the details
they share with the admin, and the admin authorizes each user.
Only authorized users are allowed to access the system and to upload
or request files.
Access to data in the database is granted by the
administrator. Uploaded data are managed by the admin, and the admin is
the only person who can grant access rights
and approve or reject users based on their details.

Data from any resource can be accessed only with
permission from the administrator. Before users can access data, the
admin verifies the details they provide and allows them to share their
data. If a user makes wrong attempts to access the data, the user is
blocked accordingly. If a blocked user requests to be unblocked, the
admin unblocks the user based on the request and the user's previous
activities.
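The approve, block and unblock flow described above could be sketched as follows. This is an illustrative in-memory model, not the project's actual Django code: the class name, the method names and the limit of three wrong attempts are all assumptions, since the text does not state a threshold.

```python
class AccessRegistry:
    """In-memory sketch of admin approval, blocking on wrong attempts, and unblocking."""

    MAX_FAILED_ATTEMPTS = 3  # assumed threshold; the report does not specify one

    def __init__(self):
        self._keys = {}        # user -> access key, set when the admin approves the user
        self._failed = {}      # user -> number of consecutive wrong attempts
        self._blocked = set()  # users currently blocked

    def approve(self, user, key):
        """Admin action: authorize a user and record the key for their data."""
        self._keys[user] = key
        self._failed[user] = 0

    def request_access(self, user, key):
        """Return True if access is granted; count wrong attempts otherwise."""
        if user not in self._keys or user in self._blocked:
            return False
        if key == self._keys[user]:
            self._failed[user] = 0
            return True
        self._failed[user] += 1
        if self._failed[user] >= self.MAX_FAILED_ATTEMPTS:
            self._blocked.add(user)
        return False

    def is_blocked(self, user):
        return user in self._blocked

    def unblock(self, user):
        """Admin action: lift the block and reset the attempt counter."""
        self._blocked.discard(user)
        if user in self._failed:
            self._failed[user] = 0
```

A production system would persist this state in the database and store keys in hashed form; the sketch keeps everything in plain dictionaries only to make the control flow visible.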

Data analysis is done with the help of graphs. The collected
data are plotted in order to support the analysis and
prediction of the dataset and the given data policies. This pictorial
representation makes the details of the data easier to
understand.

There are many open problems that are left for future research. For
example, it is both interesting and challenging to investigate how to
predict the extremely large values and how to deal with missing data
(i.e., breach incidents that are not reported). It is also worthwhile to
estimate the exact occurring times of breach incidents. Finally, more
research needs to be conducted towards understanding the predictability
of breach incidents (i.e., the upper bound of prediction accuracy).


The project involved analyzing the design of a few applications so as
to make the application more user-friendly. To do so, it was
important to keep the navigation from one screen to the other well
ordered while reducing the amount of typing the user
needs to do. In order to make the application more accessible, the
browser version had to be chosen so that it is compatible with most
browsers.


Functional Requirements

• Graphical user interface with the user.

Software Requirements

The following software is required for developing the application:

1. Python
2. Django
3. MySQL
4. WampServer

Operating Systems supported

1. Windows 7
2. Windows XP
3. Windows 8

Technologies and Languages used to Develop

1. Python

Debugger and Emulator

• Any browser (particularly Chrome)

Hardware Requirements

The following hardware is required for developing the application:

• Processor: Pentium IV or higher
• RAM: 2 GB
• Space on hard disk: minimum 512 MB
A use case diagram in the Unified Modeling Language (UML) is a type of
behavioral diagram defined by and created from a use-case analysis. Its purpose is
to present a graphical overview of the functionality provided by a system in terms
of actors, their goals (represented as use cases), and any dependencies between
those use cases. The main purpose of a use case diagram is to show what system
functions are performed for which actor. Roles of the actors in the system can be
depicted.



In software engineering, a class diagram in the Unified Modeling Language
(UML) is a type of static structure diagram that describes the structure of a system
by showing the system's classes, their attributes, operations (or methods), and the
relationships among the classes. It explains which class contains information.


A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is
a construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event diagrams, event scenarios, and timing diagrams.

Activity diagrams are graphical representations of workflows of stepwise activities
and actions with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. An activity
diagram shows the overall flow of control.


The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality of
components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising
software with the intent of ensuring that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. There are various types of tests. Each
test type addresses a specific testing requirement.

Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid outputs. All
decision branches and internal code flow should be validated. It is the testing of individual
software units of the application. It is done after the completion of an individual unit, before
integration. This is structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at the component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly defined inputs
and expected results.
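As an illustration of validating every decision branch of a unit, the sketch below uses Python's standard `unittest` module. The function `access_decision` and its three outcomes are hypothetical and were invented for this example; they are not taken from the project's code base.

```python
import unittest


def access_decision(is_authorized, is_blocked):
    """Hypothetical unit under test with three decision branches."""
    if is_blocked:
        return "blocked"
    if is_authorized:
        return "granted"
    return "denied"


class AccessDecisionTest(unittest.TestCase):
    """One test per branch, each with a clearly defined input and expected result."""

    def test_blocked_user_is_refused_even_if_authorized(self):
        self.assertEqual(access_decision(True, True), "blocked")

    def test_authorized_user_is_granted(self):
        self.assertEqual(access_decision(True, False), "granted")

    def test_unauthorized_user_is_denied(self):
        self.assertEqual(access_decision(False, False), "denied")


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccessDecisionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` executes the three cases and returns a result object whose `wasSuccessful()` method reports whether every branch behaved as specified.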

Integration testing
Integration tests are designed to test integrated software components to
determine if they actually run as one program. Testing is event driven and is more concerned
with the basic outcome of screens or fields. Integration tests demonstrate that although the
components were individually satisfactory, as shown by successful unit testing, the
combination of components is correct and consistent. Integration testing is specifically aimed at
exposing the problems that arise from the combination of components.

Functional test
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system documentation, and
user manuals.
Functional testing is centered on the following items:

Valid Input: identified classes of valid input must be accepted.

Invalid Input: identified classes of invalid input must be rejected.

Functions: identified functions must be exercised.

Output: identified classes of application outputs must be exercised.

Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, systematic coverage pertaining to identified business
process flows, data fields, predefined processes, and successive processes must be considered for
testing. Before functional testing is complete, additional tests are identified and the effective
value of current tests is determined.
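The valid-input and invalid-input items above can be exercised by feeding a function one representative of each input class. The sketch below is a hedged example: the `EntryForm` class, the username format rule and the sample inputs are assumptions made for illustration and do not describe the project's real forms.

```python
import re

# Assumed format rule: a letter followed by 2 to 19 letters, digits or underscores.
USERNAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,19}")


class EntryForm:
    """Hypothetical entry form that checks format and refuses duplicate entries."""

    def __init__(self):
        self._entries = set()

    def submit(self, username):
        if not isinstance(username, str) or not USERNAME_PATTERN.fullmatch(username):
            return "rejected: invalid format"
        key = username.lower()  # duplicates are compared case-insensitively
        if key in self._entries:
            return "rejected: duplicate"
        self._entries.add(key)
        return "accepted"


VALID_INPUTS = ["alice", "Bob_42"]
INVALID_INPUTS = ["", "ab", "9lives", "has space"]


def run_functional_checks():
    """Submit one representative of each input class and collect the outcomes."""
    form = EntryForm()
    results = {value: form.submit(value) for value in VALID_INPUTS + INVALID_INPUTS}
    results["duplicate of alice"] = form.submit("ALICE")
    return results
```

Each key in the returned dictionary corresponds to one identified input class, so a failed expectation points directly at the class that was handled incorrectly.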

System Test
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An example of
system testing is the configuration oriented system integration test. System testing is based on
process descriptions and flows, emphasizing pre-driven process links and integration points.

White Box Testing

White Box Testing is testing in which the software tester has
knowledge of the inner workings, structure and language of the software, or at least its purpose.
It is used to test areas that cannot be reached from a black box level.

Black Box Testing

Black Box Testing is testing the software without any knowledge of the inner
workings, structure or language of the module being tested. Black box tests, like most other kinds
of tests, must be written from a definitive source document, such as a specification or requirements
document. It is testing in which the software
under test is treated as a black box: you cannot “see” into it. The test provides inputs and
responds to outputs without considering how the software works.

Unit Testing

Unit testing is usually conducted as part of a combined code and unit test phase
of the software lifecycle, although it is not uncommon for coding and unit testing to be
conducted as two distinct phases.

Test strategy and approach

Field testing will be performed manually and functional tests will be written in
detail.

Test objectives
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.

Features to be tested
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.

Integration Testing
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects.

The task of the integration test is to check that components or software applications, e.g.
components in a software system or – one step up – software applications at the company level –
interact without error.

Test Results: All the test cases mentioned above passed successfully. No defects encountered.

Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.

Test Results: All the test cases mentioned above passed successfully. No defects encountered.

A literature survey is the most important step in the software development
process. Before developing the tool, it is necessary to determine the time factor,
economy and company strength. Once these things are satisfied, the next step
is to determine which operating system and language can be used for
developing the tool. Once the programmers start building the tool, they
need a lot of external support. This support can be obtained from
senior programmers, from books or from websites. Before building the system, the
above considerations are taken into account for developing the proposed system.


Widespread technical failures and interruptions of recent online testing in a number of states
have shaken the confidence of educators and policymakers in high-tech assessment methods and
raised serious concerns about schools' technological readiness for the coming common-core
online tests. The glitches arose as many districts in the 46 states that have signed on to the
Common Core State Standards are trying to ramp up their technological infrastructure to prepare
for the requirement that students take online assessments starting in 2014-15. Disruptions of
testing were reported across Indiana, Kentucky, Minnesota, and Oklahoma and were linked to
the states' assessment providers: CTB/McGraw-Hill, in Indiana and Oklahoma; ACT Inc., in
Kentucky; and the American Institutes for Research, in Minnesota. Thousands of students
experienced slow loading times of test questions, students were closed out of testing in mid-
answer, and some were unable to log in to the tests. Hundreds, if not thousands, of tests may be
invalidated. The difficulties prompted all four states' education departments to extend testing
windows, made some state lawmakers and policymakers reconsider the idea of online testing,
and sent district officials into a tailspin. The testing problems were "absolutely horrible, in terms
of kids being anxious," said Eric F. Hileman, the executive director of information technology
services for the 43,000-student Oklahoma City schools. Some high school students were taking
Oklahoma's high-stakes tests, which require that students pass four out of seven end-of-
instruction tests to graduate. "It was heartbreaking to watch them," Mr. Hileman said. "Some of
them were almost in tears."


1. Prajakta Pathe
2. Dr. Sachin Choudhari
3. Ms. Monali Gulhane

This paper features a software system called ∞Exams (Infinity Exams), which supports paper-based
examination (primarily in higher education), making the whole process easier, more comfortable and
faster while keeping every positive attribute of it and reducing the number of negative aspects.
The approach differs significantly from those used in the previous 10+ years, which were implemented
in such a way that they could not reproduce and replace the traditional paper-based examination
model. The heart of the article is the most important element of the software, the image processing
flow.

Testing a person's knowledge using Multiple Choice Questions (MCQ) has gradually become more
common. In educational institutions (such as schools and colleges), tests using multiple choice
questions are now common, and they are even used in interviews. At present, tests are corrected
either with OMR technology or manually. In practice, it is difficult to have OMR available at all
times, and manual correction is time-consuming and error-prone. To address this issue, our proposed
system uses a digital image processing technique in Python to correct multiple choice answers. We
use the Open Source Computer Vision Library (OpenCV) to process and correct the answers. Python is
well suited to implementing this concept because the OpenCV library is available for it. The
system is also implemented in the Django environment.

There is a growing need to store paper-based information in digital form. This problem concerns
education as well, but it does not always get enough attention; with our technology, many aspects
of the educational process could be made simpler, easier, faster, more comfortable and (partially)
automatable. Most educational institutions still use traditional teaching and examination
methods in most of their subjects, although the digitalization of teaching has received some
attention in previous years and has been growing since then. Alongside it there are also
computer-based examination methods, but they are not the main functionality of e-learning
systems. So traditional examination models are mostly used for those subjects that
require such a way of examination. From now on, the paper-based examination
method will be discussed, since it is the main concern of this paper. The keyword “e-assessment”
refers to electronic assessment, as software is used to mark the exam papers filled in by the students
after the exam is completed.

Multiple Choice Questions (MCQ) are a form of objective
assessment in which respondents are asked to select only the correct answers from a list of
choices.[1] The multiple choice format is most frequently used in educational testing, in market
research, and in elections, when a person chooses between multiple candidates, parties, or
policies. In this paper, we use image processing to accomplish MCQ correction in a very
easy manner, which goes a long way toward removing the barriers to multiple choice
assessment correction. We use an array format to correct the answer paper, which is
photocopied and uploaded by the user. The main concept is to take the image and extract
the answer shaded by the user. In Python, the OpenCV library is available for image processing. In
order to get the most effective output, we use the Django framework along with Python.
OpenCV is a library of programming functions mainly aimed at real-time computer vision. The
following topics are organized to explain how this technique works.
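The array-based idea described above, finding the bubble the user shaded and comparing it with the answer key, can be sketched without any image library. The example below is an assumption-laden illustration: it takes each bubble as an already thresholded grid of 0 (light) and 1 (dark) pixels, which in a real system would come from the scanned image (for instance via OpenCV), and the 0.5 fill threshold, the function names and the sample sheet are all invented for the example.

```python
def fill_ratio(region):
    """Fraction of dark pixels (1 = dark, 0 = light) in a rectangular bubble region."""
    total = sum(len(row) for row in region)
    dark = sum(sum(row) for row in region)
    return dark / total


def detect_answer(bubbles, min_ratio=0.5):
    """Index of the darkest bubble, or None when no bubble is dark enough."""
    ratios = [fill_ratio(bubble) for bubble in bubbles]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    return best if ratios[best] >= min_ratio else None


def grade(sheet, answer_key):
    """Return the detected choice per question and the number of correct answers."""
    detected = [detect_answer(question) for question in sheet]
    score = sum(1 for found, expected in zip(detected, answer_key) if found == expected)
    return detected, score


# Tiny 2x2 pixel regions standing in for cropped bubbles (illustrative data only).
EMPTY = [[0, 0], [0, 0]]
FILLED = [[1, 1], [1, 1]]
SMUDGE = [[1, 0], [0, 0]]

sheet = [
    [EMPTY, FILLED, EMPTY, EMPTY],   # question 1: second choice shaded
    [FILLED, EMPTY, SMUDGE, EMPTY],  # question 2: first choice shaded, stray mark on third
    [EMPTY, SMUDGE, EMPTY, EMPTY],   # question 3: only a stray mark, treated as unanswered
]
answer_key = [1, 2, 1]

detected, score = grade(sheet, answer_key)
```

Treating a faint mark as unanswered is a design choice of this sketch; a deployed grader would also need to handle two equally dark bubbles and misaligned scans.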
1. Dezso Sima
2. Balazs Schmuck
3. Sandor Szollosi
4. Arpad Miklos

Rapidly increasing student numbers and spreading distance learning systems strengthen the urgent
need for effective Knowledge Assessment Systems (KAS’s). Recent KAS’s, however, have the deficiency of
not providing intelligent assessment modules for, e.g., the evaluation of freely formulated short answers
including a few sentences or partially solved mathematical problems. The eMax KAS, developed at the
Intelligent Knowledge Management Innovation Center of IBM Hungary and the John von Neumann
Faculty of Informatics at Budapest Tech, aims to provide these capabilities. Our paper gives an
introduction to the intelligent assessment of short texts component of eMax by presenting the approach
used for the formal description of the “answer space” defined as well as the methods chosen for the
syntactic analysis, semantic analysis and scoring. The α version of eMax is now completed and is in
testing. In the last decade, student enrollment in our faculty has grown steadily, nearly five-fold in ten
years. The annual intake to our Engineer Informatics program is already approaching 400 direct and 100
part-time students. With such a large number of students, the assessment of midterm and final exams
requires a huge effort from our faculty members. In order to substantially ease the burden of assessing
large numbers of exam questions while avoiding the constraints of passive answer types, such as
multiple-choice questions, an internal project was launched five years ago with the goal of developing
an intelligent knowledge assessment system (KAS). In the first phase of the project, a wide range
literature research was conducted [1], [2] and a test system called Evita [3] was developed to check the
feasibility of our ideas concerning the assessment of freely formulated short texts. Based on positive
results obtained, an Intelligent Knowledge Management Innovation Center was established in 2005
jointly by IBM Hungary and the John von Neumann Faculty of Informatics at Budapest Tech to develop
an appropriate intelligent KAS. The research and development work performed has resulted in the α
version of the eMax system, now under testing. Beyond the usual passive question types, eMax is able
to semi-automatically assess freely formulated short student answers and partially solved mathematical
problems as well. Unlike Automated Essay Grading (AEG), which has been published widely [4],[5], the short text module
of eMax is limited to only a few (say 2-3) sentences and is based on a formal semantic
description rather than on methods used in AEG, such as statistical or Natural Language Processing (NLP)
techniques. Our paper and a companion paper [6] give an account of the two intelligent assessment
modules developed as parts of eMax.


Educational institutions apply traditional examination methods, using pencil and paper. Correctors
have to read through the answers manually, calculate a grade from the work, and optionally provide written
feedback. Students receive only a grade from the corrector; usually they are urged to collect
feedback individually from the corrector. This traditional approach to exam correction, grading and
feedback provision has pitfalls with regard to correction efficiency and feedback distribution. The
Universiteit van Amsterdam employs various methods for teachers and instructors to examine a
student’s knowledge of a particular subject. The most popular is the written exam, where the instructor
provides a question sheet and a separate answer sheet. The question sheet is generated in the manner
that suits the instructor best, enumerating the individual open and multiple choice questions linearly or in
sections. Other examinations are used as well; popular courses with high attendance often employ
multiple-choice-only exams, which use a standardized answer sheet that can be scanned,
recognized and corrected by a software package. Alternative methods such as verbal or practical
examinations are left out of the scope of this paper. Both the automated and manual examination
methods have benefits and disadvantages. Most importantly, manual examination lets
the instructor write feedback on the student's sheets, whereas automatic examination
greatly reduces the time required to calculate a grade. This paper discusses an (internet) software package
that combines providing feedback to the student with automatically
correcting multiple choice questions. The key abilities of the package are to move the correction
environment from pencil and paper to an online interface that displays the answer regions of the scanned
paper sheets; to provide automatic recognition of ticked multiple choice question boxes, and thus
automatically grade multiple choice questions; and to give students access to the online platform so they may
receive (general) personalized feedback. The Automatic Exam Correction project is inspired by and based on
several other works. The most prominent related work is Auto Multiple Choice (AMC)[4], whose
functionality set overlaps with that of Automatic Exam Correction. Identical exam generation and
configuration file generation is derived from the AMC papers. Notably, the AMC package does not offer
open question correction and feedback provisioning, and operates stand-alone as an installed desktop
application. Číp and Babinec[5] have made a multiple-choice concept where document calibration and
identification is described in detail to recognize forms. Many elements of Automatic Exam Correction
relating to the document generation concept are derived from this paper. Lira, Bronfman et al. wrote
Multitest[6], a fully digital environment dedicated to generating, correcting and analyzing (digital) multiple
choice answer sheets. Although the materials and methods used are relatively outdated, it offers
insights into a functional model for a digital correction environment. Olšák[7] made a TeX library to
generate EAN-13 barcodes, which transforms an input string into a pure TeX barcode; this is used for
digital detection and identification in the Automatic Exam Correction project. The book Learning
OpenCV [8], written by Bradski and Kaehler, is a great reference with many examples, on which the
computer vision part of the Automatic Exam Correction project is based.
We analyzed a hacking breach dataset from the points of view of
the incident inter-arrival times and the breach sizes, and showed that they
both should be modeled by stochastic processes rather than distributions.
The statistical models developed in this paper show satisfactory fitting
and prediction accuracies. In particular, we propose using a copula-
based approach to predict the joint probability that an incident with a
certain magnitude of breach size will occur during a future period of
time. Statistical tests show that the methodologies proposed in this paper
are better than those which are presented in the literature, because the
latter ignored both the temporal correlations and the dependence
between the incidents inter-arrival times and the breach sizes. We
conducted qualitative and quantitative analyses to draw further insights.
We drew a set of cybersecurity insights, including that the threat of
cyber hacking breach incidents is indeed getting worse in terms of their
frequency, but not the magnitude of their damage. The methodology
presented in this paper can be adopted or adapted to analyze datasets of a
similar nature.

