
Local Community Detection Algorithm Based

on Minimal Cluster

Table of Contents

ABSTRACT
LIST OF FIGURES

CHAPTER-1

INTRODUCTION

1.1 WHAT IS SOFTWARE.....................................................................................................1

1.2 WHAT IS SOFTWARE DEVELOPMENT LIFE CYCLE……………………………...1

1.3 BUG PREDICTION………………………………………………...................................2

CHAPTER-2

LITERATURE REVIEW

2.1 RELATED WORK…………………..................................................................................5

CHAPTER 3

PROBLEM IDENTIFICATION AND OBJECTIVE

3.1 PROBLEM STATEMENT..................................................................................................8

3.2 PROJECT OBJECTIVE.......................................................................................................8

CHAPTER 4

METHODOLOGY

4.1 METHODOLOGY...............................................................................................................9

4.1.1 USING PYTHON TOOL ON STANDALONE MACHINE LEARNING ENVIRONMENT…………………………………………………………………………...9

4.2 DATA DESCRIPTION………………………………………………………………….10

4.3 EVALUATION CRITERIA USED FOR CLASSIFICATION………………………….11

4.3.1 CONFUSION MATRIX.................................................................................................12

4.3.2 ACCURACY AND PRECISION...................................................................................12

4.3.3 RECALL AND F-MEASURE..................................................................................13

4.3.4 SENSITIVITY, SPECIFICITY AND ROC....................................................................13

4.3.5 SIGNIFICANCE AND ANALYSIS OF ENSEMBLE METHOD IN MACHINE

LEARNING.............................................................................................................................13

4.4 UML DIAGRAMS……………………………………………………………………….14

4.4.1 USE CASE DIAGRAM………………………………………………………………..14

4.4.2 STATE DIAGRAM……………………………………………………………………15

CHAPTER 5

OVERVIEW OF TECHNOLOGIES

5.1 ALGORITHMS USED......................................................................................................17

5.1.1 DECISION TREE INDUCTION………………………………………………………17

5.1.2 NAÏVE BAYES..............................................................................................................21

5.1.3 ARTIFICIAL NEURAL NETWORK..............................................................22

5.1.4 SUPPORT VECTOR MACHINE MODEL...................................................................25

5.1.5 KERNEL FUNCTIONS………………………………………………………………….28

5.2 TENSORFLOW………………………………………………………………………….28

CHAPTER 6

IMPLEMENTATION AND RESULTS

6.1 FRAMEWORK DESIGN………………………………………………………………..31

6.2 CODING AND TESTING……………………………………………………………….34

CHAPTER 7

CONCLUSION………………………………………………………….40

REFERENCES...............................................................................................42

LIST OF FIGURES

FIG 4.1.1 Block Diagram of Proposed Work………………………………………..11

FIG 4.2.5 Classification Model……………………………………………………...16

FIG 4.4.1 Use Case Diagram………………………………………………………...14

FIG 4.4.2 State Diagram……………………………………………………………..15

FIG 5.1.1 Decision Tree Induction…………………………………………………..17

FIG 5.1.2 Naive Bayes Algorithm…………………………………………………...21

FIG 5.1.3 Artificial Neural Networks Algorithm……………………………………23

FIG 5.1.4(i) Support Vector Machine Hyperplane…………………………………..25

FIG 5.1.4(ii) Support Vector Machine Hyperplane (Linearly Inseparable)………….26

FIG 5.2 TensorFlow in ML…………………………………………………………..29

FIG 6.1 Framework Design………………………………………………………….31

FIG 6.2 Bar graph of KNN, SVM, and Naive Bayes………………………………...39

ABSTRACT

To discover local community structure more effectively, this paper puts forward a new local community
detection algorithm based on the minimal cluster. Most local community detection algorithms begin from a
single node. Since the agglomeration ability of a single node is necessarily weaker than that of multiple
nodes, community expansion in our algorithm no longer starts from the initial node alone but from a node
cluster that contains the initial node and whose members are relatively densely connected to one another.
The algorithm consists of two phases: it first detects the minimal cluster and then finds the local
community extended from that minimal cluster. Experimental results show that the local communities detected
by our algorithm are of considerably higher quality than those found by other algorithms, on both real and
simulated networks.

CHAPTER 1 INTRODUCTION

Recently, many researchers have noticed that the complex network is a proper tool for describing
a variety of complex systems in the real world, and the complex network has therefore attracted great
attention in fields such as physics, biology, and social network analysis. In the complex network field,
one of the important topological properties is community structure, which comprises groups of densely
connected nodes, and researchers have found that detecting community structure can reveal valuable
insights into the functional features of a complex system. For example, communities in a multimedia
social network may correspond to people with the same hobbies and trust relationships. Zhiyong Zhang et al.
proposed an approach for analysing and detecting credible potential paths based on communities in
multimedia social networks; the approach can effectively and accurately mine potential paths of
copyrighted digital content sharing. Zhiyong Zhang et al. also proposed a trust model based on small-world
theory, which demonstrates the wide applicability of community structure. In biology, community structure
may cluster proteins with the same function. Many methods have therefore been proposed to reveal this
topological property, and community detection on complex networks has become a hot research field.

Recently, a large number of algorithms for studying the global structure of the network have been
proposed, such as modularity optimization algorithms, spectral clustering algorithms, hierarchical
clustering algorithms, and label propagation algorithms. However, with the continuous expansion of
complex networks, it is easy to collect network datasets with millions of nodes. How to store such a
large-scale dataset in computer memory for analysis is a huge challenge, and the computation required
to study the overall structure of such large-scale networks is prohibitive. Local community detection
has therefore become an appealing problem and has drawn more and more attention. The main task of
local community detection is to find a community using only local information about the network.
Local community detection also has good extensibility: if a local community detection algorithm is
executed iteratively, more local communities can be found and the whole community structure of the
network can be obtained. The time complexity of this kind of global community detection depends on
the efficiency and accuracy of the local community detection algorithm, so research on local community
detection still has a long way to go.

Several problems need to be solved in local community detection. First, we should determine the
initial state and find the initial node, so as to determine the local information needed. Then, we
need to select an objective function and, through continuous iterative optimization of this function,
find a community structure of high quality. After that, we need a suitable node expansion method, so
that the algorithm can grow the local community from the initial state step by step. Finally, in order
to terminate the algorithm, a suitable termination condition is needed to determine the boundary of
the community.

Most local community detection algorithms follow the process described above. Local community
detection aims to find the local community structure starting from one or more nodes, but most
existing algorithms, including Clauset, LWP, and LS, start from only one initial node. They greedily
select the optimal node from the candidate nodes and add it to the local community. The LMD algorithm
expands not from the initial node itself but from its closest and next-closest local degree central
nodes; it discovers a local community from each of these nodes, respectively. It still starts from a
single node and discovers many local communities for the initial node. In general, the aggregation
ability of a single node is lower than that of multiple nodes, so we do not rely on the initial node
alone as the starting point of local community expansion. Our primary goal is to find a minimal
cluster closely connected to the initial node and then detect the local community based on that
minimal cluster. This avoids the instability caused by excessive dependence on the initial node. In
this paper, we introduce a local community detection algorithm based on the minimal cluster, called
NewLCD. In this new algorithm, community expansion no longer starts from the initial node only, but
from a cluster of nodes relatively closely connected to the initial node. The algorithm consists of
two parts: the detection of the minimal cluster, and the detection of the local community based on
that minimal cluster. The algorithm can also be applied to global community detection: after finding
one local community, we can repeat the process to obtain the global community structure of the whole
network.

Community Detection

The concept of community detection has emerged in network science as a method for finding
groups within complex systems that are represented as graphs. In contrast to more traditional
decomposition methods, which seek a strict block-diagonal or block-triangular structure,
community detection methods find subnetworks with statistically significantly more links between
nodes in the same group than between nodes in different groups (Girvan and Newman, 2002). Central to
community detection is the notion of modularity, a metric that captures this difference:

Q = (1/2m) ∑_{i,j} [A_ij − k_i k_j / (2m)] δ(g_i, g_j)        (1)

Here, Q is the modularity, Aij is the edge weight between nodes i and j, ki is the total weight of all
edges connecting node i with all other nodes, and m is the total weight of all edges in the graph.
The Kronecker delta function δ(gi,gj) will evaluate to one if nodes i and j belong to the same group,
and zero otherwise. Modularity is a property of how one decides to partition a network: networks that
are not partitioned and those that place every node in its own community will both have modularity
equal to zero. The goal of community detection, then, is to find communities that maximize
modularity. Although modularity maximization is an NP-hard integer program, many efficient
algorithms exist to solve it approximately, including spectral clustering (Newman, 2006) and fast
unfolding (Blondel et al., 2008). Figure 1 shows a few networks with increasing maximum
modularity; note how the community structure becomes increasingly apparent as this value increases.
Previous efforts in our group have applied community detection to chemical plant networks by
creating an equation graph of the corresponding dynamic model (Moharir et al., 2017). By doing so,
communities of state variables, inputs, and outputs can be obtained which are tightly interacting
amongst themselves but weakly interacting with other communities. As such, these communities can
form the basis of distributed control architectures (Jogwar and Daoutidis, 2017) which typically
perform better than other distributed control architectures that one may obtain from “intuition”
(Pourkargar et al., 2017). For a more comprehensive review of the use of community detection in
distributed control, we refer the reader to Daoutidis et al., 2017. An alternative method for finding
communities for distributed model predictive control is to apply a decomposition on the optimization
problem as a whole (Tang et al., 2017).
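Equation (1) translates almost directly into code. The sketch below assumes an unweighted graph stored as an adjacency dictionary, so A_ij ∈ {0, 1} and k_i is just the degree of node i:

```python
def modularity(adj, group):
    # adj: node -> set of neighbours (unweighted); group: node -> community id
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # total edge weight
    q = 0.0
    for i in adj:
        for j in adj:
            if group[i] == group[j]:                  # Kronecker delta term
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - len(adj[i]) * len(adj[j]) / (2 * m)
    return q / (2 * m)

# two triangles joined by a single edge
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_split = modularity(adj, {v: v // 3 for v in adj})   # natural split: Q = 5/14
q_one = modularity(adj, {v: 0 for v in adj})          # unpartitioned: Q = 0
```

As the text notes, the unpartitioned network scores zero, while the natural two-triangle split scores a clearly positive modularity.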

Figure 1. Networks of different maximum modularity.

However, community structure can exist in any optimization problem, so it makes sense to extend
this method to generic optimization problems, which is what this work proposes. The advantage of
using community detection to find decompositions is that subproblems generated will have
statistically minimal interactions, through complicating variables or constraints, and thus require
minimal coordination through the decomposition solution method. The proposed method is generic,
applicable to any optimization problem or decomposition solution approach, and scalable, using
computationally efficient graph theory algorithms.
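As a toy illustration of how an optimization problem induces such a network (the variable names and constraints below are hypothetical, not taken from the cited works), one can link two variables whenever they appear in the same constraint and then hand the resulting graph to any of the community detection algorithms discussed here:

```python
# hypothetical constraints: each tuple lists the variables one constraint couples
constraints = [("x1", "x2"), ("x2", "x3"), ("x3", "x4"),
               ("x5", "x6"), ("x6", "x7"), ("x4", "x5")]

adj = {}
for cons in constraints:
    for u in cons:
        adj.setdefault(u, set())
    for u in cons:
        for v in cons:
            if u != v:
                adj[u].add(v)
                adj[v].add(u)

# adj is now the variable-interaction graph; running community detection on it
# groups tightly coupled variables into candidate subproblems
```

Complicating constraints then show up as the few edges crossing community boundaries, which is exactly what the decomposition seeks to minimize.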

Automatically identifying groups based on network clustering algorithms

NodeXL can automatically identify groups within a network based solely on network
structure. In contrast to the approach of Section 7.2.3, which used existing data about vertex
attributes, this approach is based solely on who is connected to whom. A number of different network
“clustering” (also known as “community detection”) algorithms exist, which help find subgroups of
highly inter-connected vertices within a network. NodeXL includes three such algorithms: Clauset-
Newman-Moore, Wakita-Tsurumi [3], and Girvan-Newman (which can take a long time to run on
large graphs). In all of these algorithms, the number of clusters is not predetermined; instead the
algorithm dynamically determines the number it thinks is best. Each vertex is assigned to exactly one
cluster, meaning that clusters do not overlap. The number of vertices in each cluster can vary
significantly. In some cases, a single cluster can encompass all vertices, whereas in other cases, a
cluster can consist of a single vertex. See Newman [4] for background on some of these and other
community identifying algorithms.

There is no “right” or “wrong” algorithm to use; instead, it is often useful to try out different ones and
see which ones you believe provide the best results given your network. For example, in this network,
the Clauset-Newman-Moore algorithm results in fewer, larger groups than the other algorithms,
which provide more groups of a smaller size. Try applying the Wakita-Tsurumi clustering
algorithm by clicking on the Groups dropdown menu in the NodeXL ribbon, choosing Group by
Cluster, and then checking the appropriate selector as shown in Figure 7.14. Notice that the data on the
Groups worksheet is now updated to reflect the new groups.
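NodeXL itself runs inside Excel, but the same Clauset-Newman-Moore grouping can be reproduced in Python with the networkx library (assuming it is installed); as in NodeXL, the number of groups is chosen by the algorithm and every vertex lands in exactly one group:

```python
import networkx as nx
from networkx.algorithms import community

# Zachary's karate club, a standard small social network
G = nx.karate_club_graph()

# Clauset-Newman-Moore greedy modularity maximisation
groups = community.greedy_modularity_communities(G)

# the groups are non-overlapping and together cover every vertex;
# their sizes can vary significantly, as the text describes
sizes = sorted(len(g) for g in groups)
```

Swapping in a different algorithm (e.g. girvan_newman from the same module) plays the role of choosing a different clustering option in the NodeXL dropdown.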

A social network for an individual is created with his/her interactions and personal relationships with
other members in the society. Social networks represent and model the social ties among individuals.
With the rapid expansion of the web, there is a tremendous growth in online interaction of the users.
Many social networking sites, e.g., Facebook, Twitter etc. have also come up to facilitate user
interaction. As the number of interactions has increased manifold, it is becoming difficult to keep
track of these communications. Human beings tend to get associated with people of similar likings
and tastes. The easy-to-use social media allows people to extend their social life in unprecedented
ways since it is difficult to meet friends in the physical world, but much easier to find friends online
with similar interests. These real-world social networks have interesting patterns and properties
which may be analysed for numerous useful purposes. Social networks have a characteristic property
to exhibit a community structure. If the vertices of the network can be partitioned into either disjoint
or overlapping sets of vertices such that the number of edges within a set exceeds the number of
edges between any two sets by some reasonable amount, we say that the network displays a
community structure. Networks displaying a community structure may often exhibit a hierarchical
community structure as well1. The process of discovering the cohesive groups or clusters in the
network is known as community detection. It forms one of the key tasks of social network analysis2.
The detection of communities in social networks can be useful in many applications where group
decisions are taken, e.g., multicasting a message of interest to a community instead of sending it to
each one in the group or recommending a set of products to a community. The applications of
community detection have been highlighted towards the end of the article. State of the art in
community detection research for social networks is presented in this work. The paper begins with
the basic concepts of social networks and communities. Various methods for community detection
are categorised and discussed in the next section followed by list of standard datasets used for
analysis in community detection research along with the links for download if available online. Some
potential applications of community detection in social networks are briefly described in the next
section. The discussion section argues the advantages of one method with respect to another, the kind
of community structure each obtains, and so on, and the conclusion section concludes the paper.

BASIC CONCEPTS

SOCIAL NETWORK

A social network is depicted by a social network graph 𝐺 consisting of 𝑛 nodes denoting the 𝑛
individuals or participants in the network. The connection between node 𝑖 and node 𝑗 is represented
by the edge 𝑒𝑖𝑗 of the graph. A directed or an undirected graph may illustrate these connections
between the participants of the network. The graph can be represented by an adjacency matrix 𝐴 in
which 𝐴𝑖𝑗 = 1 if there is an edge between 𝑖 and 𝑗 and 𝐴𝑖𝑗 = 0 otherwise. Social networks follow the
properties of complex networks3,4. Some real life
examples1 of social networks include friendship, telephone, email, and collaboration networks.
These networks can be represented as graphs, and it is feasible to study and analyse them to find
interesting patterns amongst the entities. These appealing prototypes can be utilized in various
useful applications.

Community

A community can be defined as a group of entities that are closer to each other in comparison to the
other entities of the dataset. A community is formed by individuals such that those within a group
interact with each other more frequently than with those outside the group. The closeness between
entities of a group can be measured via similarity or distance measures. McPherson et al5 stated that
"similarity breeds connection" and discussed various social factors which lead to similar behaviour,
or homophily, in networks. Communities in social networks are analogous to clusters in networks. An
individual, represented by a node in the graph, may not be part of just one community; it may be an
element of many closely associated or distinct groups existing in the network. For example, a person
may concurrently belong to college, school, friends, and family groups. Communities which share
common nodes are called overlapping communities. Identification and analysis of community structure
has been carried out by many researchers applying methodologies from numerous branches of science.
The quality of clustering in networks is normally judged by the clustering coefficient, a measure of
how much the vertices of a network tend to cluster together. The global clustering coefficient6 and
the local clustering coefficient7 are the two types discussed in the literature.

Methods for grouping similar items

Communities are those parts of the graph which have denser connections inside and fewer connections
with the rest of the graph8. The aim of unsupervised learning is to group similar objects together
without any prior knowledge about them. In the case of networks, the clustering problem refers to the
grouping of nodes according to their similarity, computed from topological features and/or other
characteristics of the graph. Network partitioning and clustering are the two methods commonly used
in the literature to find groups in a social network graph. They are briefly described in the next
subsections.

Graph partitioning

Graph partitioning is the process of dividing a graph into a predefined number of smaller components
with specific properties. A common property to be minimized is the cut size. A cut is a partition of
the vertex set of a graph into two disjoint subsets, and the size of the cut is the number of edges
between the components. A multicut is a set of edges whose removal divides the graph into two or more
components. In graph partitioning it is necessary to specify the number of components one wishes to
obtain. The size of the components must also be specified, as otherwise a likely but meaningless
solution would be to put the minimum-degree vertex into one component and the rest of the vertices
into another. Since the number of communities is usually not known in advance, graph partitioning
methods are not suitable for detecting communities in such cases.

Clustering

Clustering is the process of grouping a set of similar items together in structures known as
clusters. Clustering the social network graph may give a lot of information about the underlying
hidden attributes, relationships, and properties of the participants, as well as the interactions
among them. Hierarchical clustering and partitioning methods are the clustering techniques commonly
used in the literature. In hierarchical clustering, a hierarchy of clusters is formed. The process of
hierarchy creation or levelling can be agglomerative or divisive. Agglomerative clustering methods
follow a bottom-up approach: a particular node is clubbed, or agglomerated, with similar nodes to
form a cluster or community, based on similarity. In divisive clustering approaches, a large cluster
is repeatedly divided into smaller clusters. Partitioning methods begin with an initial partition
with a pre-set number of clusters and relocate instances by moving them across clusters, e.g.,
K-means clustering. An exhaustive evaluation of all possible partitions is required to achieve global
optimality in partition-based clustering. This is time consuming and sometimes infeasible, hence
researchers use greedy heuristics for iterative optimization in partitioning methods. The next
section categorizes and discusses the major algorithms for community detection.

ALGORITHMS FOR COMMUNITY DETECTION

A number of community detection algorithms and methods have been proposed and deployed for the
identification of communities in the literature, along with modifications and revisions to many
methods already proposed. A comprehensive survey of community detection in graphs was carried out by
Fortunato8 in 2010. Other reviews available in the literature are by Coscia et al9 in 2011, Fortunato
et al10 in 2012, Porter et al11 in 2009, Danon et al12 in 2005, and Plantié et al13 in 2013. The
presented work reviews the algorithms available till 2015 to the best of our knowledge, including the
algorithms given in the earlier surveys. Papers based on new approaches and techniques, like big
data, not discussed by previous authors have been incorporated in our article. The algorithms for
community detection are categorized into approaches based on graph partitioning, clustering, genetic
algorithms, and label propagation, along with methods for overlapping community detection
(clique-based and non-clique-based methods) and community detection for dynamic networks. Algorithms
under each of these categories are described below.

Graph partitioning based community detection

Graph partitioning based methods have been used in the literature to divide the graph into components
such that there are few connections between components. The Kernighan-Lin14 algorithm was amongst the
earliest techniques to partition a graph. It partitions the nodes of a graph with costs on its edges
into subsets of given sizes so as to minimize the sum of the costs of all cut edges. A major
disadvantage of this algorithm is that the number of groups has to be predefined. The algorithm however is
quite fast, with a worst-case running time of O(n²). Newman15 reduces the widely studied maximum
likelihood method for community detection to a search through a set of candidate solutions, each of
which is itself a solution to a minimum-cut graph partitioning problem. The paper shows that the two
most essential community inference methods, based on the stochastic block model or its
degree-corrected variant16, can be mapped onto versions of the familiar minimum-cut graph
partitioning problem. This is illustrated by adapting the Laplacian spectral partitioning method17,18
to perform community inference.

Clustering based community detection

The main concern of community detection is to detect clusters, groups, or cohesive subgroups, and
clustering forms the basis of a large number of community detection algorithms. Amongst the
innovators of community detection methods, Girvan and Newman19 had a main role. They proposed a
divisive algorithm based on edge betweenness for a graph with undirected and unweighted edges. The
algorithm focuses on the edges that are most "between" communities; communities are constructed
progressively by removing these edges from the original graph. Three different measures for the
calculation of edge betweenness in the vertices of a graph were proposed in Newman and Girvan20. The
worst-case time complexity of the edge betweenness algorithm is O(m²n), or O(n³) for sparse graphs,
where m denotes the number of edges and n the number of vertices. The Girvan-Newman (GN) algorithm
has been enhanced by many authors and applied to various networks21-28. Chen et al22 extended the GN
algorithm to partition weighted graphs and used it to identify functional modules in the yeast
proteome network. Rattigan et al21 proposed indexing methods that reduce the computational complexity
of the GN algorithm significantly. Pinney et al24 also built an algorithm which uses the GN algorithm
for the decomposition of networks based on the graph-theoretic concept of betweenness centrality;
their paper inspected the utility of betweenness centrality for decomposing such networks in diverse
ways. Radicchi et al29 also proposed an algorithm based on the GN algorithm, introducing a new
definition of community that distinguishes 'strong' and 'weak' communities. The algorithm uses an
edge clustering coefficient to perform the divisive edge-removal step of GN and has a running time of
O(m⁴/n²), or O(n²) for sparse graphs. Moon et al30 have proposed and implemented a parallel version
of the GN algorithm to handle large-scale data, using the MapReduce model (Apache Hadoop) and
GraphChi.

Newman and Girvan20 first defined a measure known as 'modularity' to judge the quality of the
partitions or communities formed. The modularity measure proposed by them has been widely accepted
and used by researchers to gauge the goodness of the modules obtained from community detection
algorithms, with high modularity corresponding to better community structure. Modularity was defined
as Q = ∑_i (e_ii − a_i²), where e_ii denotes the fraction of edges that connect vertices within
community i, e_ij denotes the fraction of edges connecting vertices in two different communities i
and j, and a_i = ∑_j e_ij is the fraction of edge ends attached to vertices in community i. A value
of Q close to 1 indicates a network with strong community structure. The optimization of the
modularity function has received great attention in the literature. Table 1 lists clustering based
community detection methods, including algorithms which use modularity and modularity optimization.
Newman31 maximizes modularity so that the process of aggregating nodes to form communities leads to
the maximum modularity gain. The change in modularity upon joining two communities, defined as
ΔQ = e_ij + e_ji − 2a_i a_j = 2(e_ij − a_i a_j), can be calculated in constant time and hence is
faster to compute than in the GN algorithm. The run time of the algorithm is O(n²) for sparse graphs
and O((m + n)n) otherwise. In a recent work, a scalable version of this algorithm has been
implemented using MapReduce by Chen et al32. Newman33 generalized the betweenness algorithm for
weighted networks. The modularity was now represented as
Q = (1/2m) ∑_ij [A_ij − k_i k_j/(2m)] δ(c_i, c_j), where m = (1/2) ∑_ij A_ij is the number of edges
in the graph, c_i and c_j are the communities of vertices i and j, k_i and k_j are the degrees of
vertices i and j, and δ(u, v) is 1 if u = v and 0 otherwise. Newman34, in yet another approach,
characterised modularity in terms of the eigenvectors of a modularity matrix. The equation for
modularity was changed to Q = (1/4m) sᵀBs, where the modularity matrix is given by
B_ij = A_ij − k_i k_j/(2m) and modularity is defined using the eigenvectors of this matrix. The
algorithm runs in O(n² log n) time, where log n represents the average depth of the dendrogram.
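Newman's agglomerative scheme, repeatedly merging the pair of communities with the largest modularity gain and stopping when no merge improves Q, can be sketched in a deliberately naive form. For clarity this toy version recomputes Q from scratch for every candidate merge instead of using the constant-time ΔQ = 2(e_ij − a_i a_j) update that makes the real algorithm fast:

```python
def modularity(adj, label):
    m = sum(len(n) for n in adj.values()) / 2
    q = 0.0
    for i in adj:
        for j in adj:
            if label[i] == label[j]:
                q += (1.0 if j in adj[i] else 0.0) - len(adj[i]) * len(adj[j]) / (2 * m)
    return q / (2 * m)

def greedy_merge(adj):
    # start from singleton communities; repeatedly apply the merge with the
    # largest positive gain in Q; stop when no merge helps
    label = {v: v for v in adj}
    while True:
        base = modularity(adj, label)
        best_gain, best_pair = 0.0, None
        comms = sorted(set(label.values()))
        for a in comms:
            for b in comms:
                if a < b:
                    trial = {v: (a if c == b else c) for v, c in label.items()}
                    gain = modularity(adj, trial) - base
                    if gain > best_gain:
                        best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return label
        a, b = best_pair
        label = {v: (a if c == b else c) for v, c in label.items()}

# two triangles joined by a single edge; merging stops at the two triangles
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
```

Merging the two triangles into one community would drop Q back to zero, so the loop correctly refuses that final merge.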

Clauset et al35 used greedy optimization of modularity to detect communities for large networks. For
a network structure with m edges and n vertices, the algorithm has a running time of (𝑚𝑑𝑙𝑜𝑔𝑛) ,
where ‘d’ denotes the depth of the dendrogram. For sparse real world networks the running time
is(𝑛𝑙𝑜𝑔2n) . Blondel et al36 designed an iterative two phase algorithm known as Louvain method. In
first phase, all nodes are placed into different communities and then the modularity gain of moving a
node 𝑖 from one community to another is found. In case this modularity gain is positive, the node is
shifted to a new community. In second phase all the communities found in earlier phase are treated as
nodes and the weight of links is found. The algorithm improves the time complexity of the GN
algorithm. It has a linear run time of 𝑂(𝑚). Guimera et al37 used simulated annealing for modularity

17
optimization and showed that computing the modularity of a network is similar to determining the
ground-state energy of a spin system. Additionally, the authors showed that the stochastic network
models give rise to modular networks due to fluctuations. Zhou et al38 attempted to improve
modularity using simulated annealing introducing the idea of ‘inter edges’ and ‘intra edges’. The
authors modified the modularity equation to include inter and intra edges as 𝑄 = 1 2𝑚 ∑ [(𝐴𝑖𝑗 𝑛 𝑖𝑗 −
𝑘𝑖𝑘𝑗 2𝑚 )𝛿(𝐶𝑖 , 𝐶𝑗) − 𝛽 (𝐴𝑖𝑗 − 𝑘𝑖𝑘𝑗 2𝑚 ) 𝛼 (1 − 𝛿(𝐶𝑖 , 𝐶𝑗))] Intra factor Inter factor Here α and β are
undetermined parameters and affect the value of the inter-factor. The value of β is increased and α is
reduced when large communities are expected. Duch et al39 proposed a heuristic search based
approach for the optimization of modularity function using extremal optimization technique, which
has a complexity of 𝑂(𝑛 2 𝑙𝑜𝑔2𝑛). AdClust method40 can extract modules from complex networks
with significant precision and strength. Each node in the network is assumed to act as a self-directed
agent representing flocking behaviour. The vertices of the network travel towards the desirable
adjoining groups. Wahl and Sheppard41 proposed hierarchical fuzzy spectral clustering based
approach. They argued that determining the sub-communities and their hierarchies are as important as
determining communities within a network. DENGRAPH42 algorithm uses the idea of density-based
incremental clustering of spatial data and is intended to work for large dynamic datasets with noise.
The Markov Clustering Algorithm(MCL)43 is a graph flow simulation algorithm which can be used
to detect clusters in a graph and is analogous to detection of communities in the networks. This
algorithm consists of two alternate processes of ‘expansion’ and ‘inflation’. Markov chains are
employed to perform random walk through a graph. The method has a worst case run time of O(nk²),
where n represents the number of nodes and k is the number of resources. Nikolaev et al44 used
‘entropy centrality measure’ based on Markovian process to iteratively detect communities. A
random walk through the nodes is performed to find the communities existing in the network
structure. For a graph, the transition probability matrix for a Markov chain is created. A locality t is
selected and those edges for which the average entropy centrality for the nodes over the graph is
reduced are selected and removed. The algorithm proposed by Steinhaeuser et al45 performs many
short random walks and interprets visited nodes during the same walk as similar nodes which gives
an indication that they belong to the same community. The similar nodes are aggregated and
community structure is created using consensus clustering. It has a runtime of O(n² log n).
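The alternating expansion/inflation loop at the heart of MCL can be sketched in a few lines of numpy. This is a minimal illustrative implementation, not the reference MCL code: the self-loop trick, the pruning threshold and the attractor-based cluster extraction are simplifications.

```python
import numpy as np

def mcl(adj, inflation=2.0, iterations=50):
    """Sketch of MCL's alternating expansion/inflation loop.

    Expansion (matrix squaring) spreads flow along random walks;
    inflation (element-wise power plus column renormalisation)
    strengthens strong flows and weakens weak ones."""
    M = np.asarray(adj, dtype=float) + np.eye(len(adj))  # self-loops for stability
    M = M / M.sum(axis=0)                                # column-stochastic matrix
    for _ in range(iterations):
        M = np.linalg.matrix_power(M, 2)                 # expansion
        M = M ** inflation                               # inflation
        M = M / M.sum(axis=0)
    # In the converged matrix, the non-zero rows (attractors) span the clusters.
    clusters = {frozenset(np.flatnonzero(row > 1e-6)) for row in M}
    clusters.discard(frozenset())
    return clusters

# Two triangles joined by a single edge separate into two clusters.
A = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]])
print(mcl(A))
```

Raising the inflation parameter produces smaller, tighter clusters; lowering it merges them, which mirrors the granularity control described in the original method.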
Genetic algorithms (GA) based community detection
Genetic algorithms (GA) are adaptive heuristic search algorithms whose aim is to find the best
solution under the given circumstances. A genetic algorithm starts with a set of solutions known as
chromosomes and fitness function is calculated for these chromosomes. If a solution with a maximum

18
fitness is obtained, the search stops; otherwise, crossover and mutation operators are applied with
some probability to the current set of solutions to obtain a new set of solutions. Community detection can be viewed
as an optimization problem in which an objective function that captures the intuition of a community
with better internal connectivity than external connectivity is chosen to be optimized. GA have been
applied to the process of community discovery and analysis in a few recent research works. These are
described briefly in this section. Table 2 enlists the algorithms available in literature for community
detection based on GA.
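The generic GA loop described above (fitness, selection, crossover, mutation) can be sketched for modularity maximisation. This illustration uses a direct node-to-community label encoding with truncation selection; it is not the locus-based encoding of GA-Net, and all parameter values are illustrative.

```python
import random
import numpy as np

def modularity(adj, labels):
    """Newman's modularity Q = (1/2m) sum_ij (A_ij - k_i*k_j/2m) * delta(c_i, c_j)."""
    m2 = adj.sum()                 # equals 2m for an undirected adjacency matrix
    k = adj.sum(axis=1)
    n = len(adj)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - k[i] * k[j] / m2
    return q / m2

def ga_communities(adj, pop_size=64, generations=100, p_mut=0.15, seed=0):
    """Sketch of the generic GA loop applied to modularity maximisation."""
    rng = random.Random(seed)
    n = len(adj)
    neighbours = [[j for j in range(n) if adj[i][j]] for i in range(n)]
    # a chromosome assigns a community label to every node
    pop = [[rng.randrange(n) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: modularity(adj, c), reverse=True)
        elite = pop[: pop_size // 2]            # truncation selection (elitist)
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [a[i] if rng.random() < 0.5 else b[i] for i in range(n)]  # uniform crossover
            for i in range(n):                  # mutation: copy a neighbour's label
                if neighbours[i] and rng.random() < p_mut:
                    child[i] = child[rng.choice(neighbours[i])]
            children.append(child)
        pop = elite + children
    best = max(pop, key=lambda c: modularity(adj, c))
    return best, modularity(adj, best)

# Two triangles joined by a single edge; the optimal split has Q = 5/14 ≈ 0.357.
A = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]])
best, q = ga_communities(A)
print(best, round(q, 3))
```

Note the neighbour-copy mutation: restricting new labels to those of adjacent nodes biases the search towards internally connected communities, which is the intuition the fitness functions in this section try to capture.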

Pizzuti46 proposed the GA-Net algorithm which uses a locus based graph representation of the
network. The nodes of the social network are depicted by genes and alleles. The algorithm introduces
and optimizes the community score to measure the quality of partitioning. All the dense communities
present in the network structure are obtained at the end of the algorithm by selectively exploring the
search space, without the need to know in advance the exact number of groups. Another GA based
approach, MOGA-Net47, proposed by the same author optimizes two objective functions, i.e. the
community score and the community fitness. The higher the community score, the denser the clustering
obtained. The community fitness is the sum of the fitness of the nodes belonging to a module. When this sum
reaches its maximum, the number of external links is minimized. MOGA-Net generates a set of
communities at different hierarchical levels in which solutions at deeper levels, consisting of a higher
number of modules, are contained in solutions having a lower number of communities. Hafez et al48

19
have performed both Single-Objective and Multi-Objective optimization for community detection
problem. The former optimization was done using roulette selection based GA while NSGA-II
algorithm was used for the latter process. Mazur et al49 have used modularity as the fitness function
in addition to the community score. The authors worked on undirected graphs and their algorithm can
also discover single node communities. Liu et al50 used GA in addition to clustering to find the
community structures in a network. The authors have used a strategy of repeated divisions. The graph
is initially divided into two parts, then the subgraphs are further divided and a nested GA is applied to
them. Tasgin et al51 have also optimized the network modularity using GA. A multi-cultural
algorithm52 for community detection employs the fitness function defined by Pizzuti46 in GA-Net.
The belief space which is a state space for the network and contains a set of individuals that have a
better fitness value has been used in this work to guide the search direction by determining a range of
possible states for individuals. A genetic algorithm for the optimization of modularity, proposed by
Nicosia et al53, is explained later in the section on overlapping communities.

Label propagation based community detection


Label propagation in a network is the propagation of a label to various nodes existing in the network.
Each node attains the label possessed by a maximum number of the neighbouring nodes. This section
discusses some label propagation based algorithms for discovering communities. Table 3 contains a
listing of these algorithms, discussed in detail later in the section.
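The basic label propagation update common to the algorithms below can be sketched as follows. This is a simplified version of LPA; as in the original proposal, node order and tie-breaking are randomised, and the implementation details here are illustrative.

```python
import random
from collections import Counter

def label_propagation(adj, max_iter=100, seed=0):
    """Sketch of LPA: every node starts with a unique label and then
    repeatedly adopts the label held by the majority of its neighbours
    (ties broken at random) until no label changes."""
    rng = random.Random(seed)
    n = len(adj)
    labels = list(range(n))          # unique initial labels
    order = list(range(n))
    for _ in range(max_iter):
        rng.shuffle(order)           # asynchronous updates in random order
        changed = False
        for i in order:
            nbrs = [j for j in range(n) if adj[i][j]]
            if not nbrs:
                continue
            counts = Counter(labels[j] for j in nbrs)
            top = max(counts.values())
            new = rng.choice([l for l, c in counts.items() if c == top])
            if new != labels[i]:
                labels[i] = new
                changed = True
        if not changed:              # every node already holds a majority label
            break
    return labels

# Two disjoint triangles: LPA settles on one label per triangle.
A = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
print(label_propagation(A))
```

Each sweep costs O(m) over the edges, matching the per-iteration complexity quoted for LPA below; the extensions in this section (SLPA, WLPA, COPRA, LabelRank) vary how labels are stored, weighted or thresholded, not this basic propagation step.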

Label Propagation Algorithm (LPA) was proposed by Raghavan et al54. Initially each node holds a
unique label, and every node then repeatedly adopts the label possessed by the maximum number of
its neighbours; the process stops when each node has a label which a maximum number of its
neighbouring nodes have. Each iteration of the algorithm takes O(m) time
where m is the number of edges. SLPA (speaker listener label propagation algorithm)55 is an
extension to LPA which could analyse different kinds of communities such as disjoint communities,
overlapping communities and hierarchical communities in both unipartite and bipartite networks. The
algorithm has a linear run time of 𝑂(𝑇𝑚), where T is the user defined maximum number of iterations
and m is the number of edges. Based on the SLPA algorithm, Hu56 proposed a Weighted Label
Propagation Algorithm (WLPA). It uses the similarity between any two of the vertices in a network
based on the labels of the vertices achieved in label propagation. The similarity of these vertices is
then used as a weight of the edge in label propagation. LPA was further improved by Gregory57 in
his algorithm COPRA (Community Overlap Propagation Algorithm). It was the first label
propagation based procedure which could also detect overlapping communities. The run time per
iteration is O(vm log(vm/n)), where n is the number of nodes, m is the number of edges and v is the maximum
number of communities per vertex. LabelRank Algorithm58 uses the LPA and MCL (Markov
Clustering Algorithm). The node identifiers are used as labels. Each node receives a number of labels
from the neighbouring nodes. A community is formed for nodes having the same highest probability
label. Four operators are applied namely propagation which propagates the label to neighbours,
inflation i.e. the inflation operator of the MCL algorithm, cut-off operator that removes the labels
below a threshold and an explicit conditional update operator responsible for a conditional update.
The algorithm runs in 𝑂(𝑚) time where m is the number of edges. The LabelRank algorithm was
modified to LabelRankT algorithm by Xie et al60. This algorithm included both the edge weights and
the edge directions in the detection of communities. This algorithm works for dynamic networks as
well and is able to detect evolving communities also. Wu et al59 proposed a Balanced Multi Label
Propagation Algorithm (BMPLA) for detection of overlapping communities. Using this algorithm,
vertices can belong to any number of communities, without the global limit on the maximum number
of community memberships required by COPRA57. Each iteration of the algorithm takes
O(n log n) time to execute, where n is the number of nodes.

Semantics based community detection

Semantic content and edge relationships in a semantic network may be additionally used to partition
the nodes into communities. The context, as well as the relationship of the nodes, both are taken into
consideration in the process of semantic community detection. LDA(Latent Dirichlet Allocation)61 is
used in several semantic community based community detection approaches. A clustering algorithm

21
based on the link-field-topic (LFT) model is put forward by Xin et al62 to overcome the limitation of
defining the number of communities beforehand. The study forms the semantic link weight (SLW)
based on the investigation of LFT, to evaluate the semantic weight of links for each sampling field.
The proposed clustering algorithm is based on the SLW which could separate the semantic social
network into clustering units. In another work63 the authors have used ARTs model and divided the
process into two phases namely LDA sampling and community detection. In the former process
multiple sampling ARTs have been designed. A community clustering algorithm has also been
proposed. The procedure could detect the overlapping communities. Xia et al64 constructed a
semantic network using information from the comment content extracted from the initial HTML
source files. An average score is obtained for two users for each link assuming comments to be
implicit links between people. An analytic method for extracting comment content is proposed to
build the semantic network; for example, the terms and phrases in comments are counted as
supportive or opposing. Each phrase is given an associated numerical trust value. On this semantic
network, the classical community detection algorithm is applied henceforth. Ding65 has considered
the impact of topological as well as topical elements in community detection. Topology based
approaches are based on the idea that the real world networks can be modelled as graphs where the
nodes depict the entities whereas the interactions between them are shown by the edges of the graph.
On the other hand, topic-based community detection rests on the idea that the more words two objects
share, the more similar they are. The author performs a systematic analysis with topology-based and
topic-based community detection methodologies on the co-authorship networks. The paper puts
forward the argument that, to detect communities, one should take into account together the topical
and topological features of networks. A community detection algorithm, SemTagP (Semantic Tag
propagation) has been proposed by Ereteo et al66 that takes advantage of the semantic data captured while
organizing the RDF graphs of social networks. It basically is an extension of the LPA54 algorithm to
perform the semantic propagation of tags. The algorithm detects and moreover labels communities
using the tags used by groups during the social labelling process and the semantic associations derived
between tags. In a study by Zhao et al67, a topic oriented approach consisting of an amalgam of
social objects clustering and link analysis has been used. Firstly, a modified form of k-means
clustering, the ‘Entropy Weighting K-Means (EWKM) algorithm’, has been used to cluster the
social objects. A subspace clustering algorithm is applied to cluster all the social objects into topics.
On the clusters obtained in this process, topical community detection or link analysis is performed
using a modularity optimization algorithm. The members of the objects are separated into topical
clusters having unique topic. A link analysis is performed on each topical cluster to discover the

22
topical communities. The end result of the entire method is topical communities. A community
extraction approach is given by Abdelbary et al68, which integrates the content published within the
social network with its semantic features. Community discovery is performed using two layer
generative Restricted Boltzmann Machines model. The model presumes that members of a
community communicate over matters of common concern. The model permits associate members to
belong to multiple communities. Latent semantic analysis (LSA)69 and Latent Dirichlet Allocation
(LDA)61 are the two techniques extensively employed in the process to detect topical communities.
Nyugen et al70 have used LDA to find hyper groups in the blog content and then sentiment analysis
is done to further find the meta-groups in these units. A Link-Content model is proposed by Natarajan
et al71 for discovering topic based communities in social networks. Community has been modelled as
a distribution employing Gibbs sampling. This paper uses links and content to extract communities in
a content sharing network, Twitter.

Methods to detect overlapping communities

A recent survey by Amelio et al gives a comprehensive review of major overlapping community
detection algorithms and includes the methods on dynamic networks. Another review of methods for
discovering overlapping communities was done by Xie et al72. The following section discusses some
of the methods to detect overlapping communities. Tables 4 and 5 enlist the methods discussed in this
section.

Clique based methods for overlapping community detection

A community can be interpreted as a union of
smaller complete (fully connected) subgraphs that share nodes. A k-clique is a fully connected
subgraph consisting of k nodes. A k-clique community can be defined as union of all k-cliques that
can be reached from each other through a series of adjacent k-cliques. Many researchers have used
cliques to detect overlapping communities. Important contributions using cliques for overlapping
community detection are summarized in table 4.
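The k-clique community definition above can be illustrated with a small brute-force sketch: enumerate the k-cliques, connect cliques that share k−1 nodes, and take connected components. This is suitable only for tiny graphs; real CPM implementations such as CFinder use far more efficient clique enumeration.

```python
from itertools import combinations

def k_clique_communities(edges, k=3):
    """Brute-force sketch of the Clique Percolation Method: enumerate all
    k-cliques, link two k-cliques when they share k-1 nodes, and return
    the node sets of the connected components of that overlap graph."""
    nodes = sorted({v for e in edges for v in e})
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(y in adj[x] for x, y in combinations(c, 2))]
    parent = list(range(len(cliques)))      # union-find over cliques
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, c in enumerate(cliques):
        communities.setdefault(find(i), set()).update(c)
    return list(communities.values())

# Four triangles: {0,1,2}+{1,2,3} percolate, {3,4,5}+{4,5,6} percolate;
# node 3 ends up in both communities (overlap).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (3, 5), (4, 5), (4, 6), (5, 6)]
print(sorted(sorted(c) for c in k_clique_communities(edges)))  # -> [[0, 1, 2, 3], [3, 4, 5, 6]]
```

Note how node 3 belongs to both resulting communities: overlap arises naturally because cliques, not nodes, are the unit of clustering.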

The Clique Percolation Method (CPM) was proposed by Palla et al73 to detect overlapping
communities. The method first finds all cliques of the network and uses the algorithm of Everett et
al83 to identify communities by component analysis of clique-clique overlap matrix. CPM has a
runtime of O(exp(n)). CPM could not discover the hierarchical structure along with the overlapping
attribute. This limitation was
overcome through the method proposed by Lancichinetti et al74, which performs a local exploration
in order to find the community for each node. In this process the nodes may be revisited any number of
times. The main objective was to find local maxima based on a fitness function. CFinder84 software
was developed using CPM for overlapping community detection. Du et al75 proposed ComTector
(Community DeTector) for detection of overlapping communities using maximal cliques. Initially all
maximal cliques in the network are found which form the kernels of potential community. Then
agglomerative technique is iteratively used to add the vertices left to their closest kernels. The
obtained clusters are adjusted by merging pair of fractional communities in order to optimize the
modularity of the network. The running time of the algorithm is O(C·T²), where C denotes the number
of communities detected and T is the number of triangles in the network. EAGLE, an agglomerative
hierarchical clustering based algorithm has been proposed by Shen et al 76. In the first step maximal
cliques are discovered and those smaller than a threshold are discarded. Subordinate maximal cliques
are neglected; the remaining ones give the initial communities (and the subordinate vertices). The similarity is
found between these communities, and communities are repeatedly merged together on the basis of
this similarity. This is repeated till one community remains at the end. Evans et al77 proposed that by

24
partitioning the links of a network, the overlapping communities may be discovered. In an extension
to this work, Evans et al78 used weighted line graphs. In another work Evans79 used clique graphs to
detect the overlapping communities in real world social networks. GCE (Greedy Clique
Expansion)80 first identifies cliques in a network. These cliques act as seeds for expansion along with
the greedy optimization of a fitness function. A community is created by expanding the selected seed
and performing its greedy optimization via the fitness function proposed by Lancichinetti et al74 .
CONGA (Cluster-Overlap Newman Girvan Algorithm) was proposed by Gregory25. This method
was based on the split-betweenness algorithm of Girvan-Newman. The runtime of the method is O(m³).
In another work CONGO81 (CONGA Optimized) algorithm was proposed which used local
betweenness measure, leading to an improved complexity of O(n log n). A two phase Peacock algorithm
for detection of overlapping communities is proposed in Gregory82 using disjoint community
detection approaches. In the first phase, the network transformation was performed using the split
betweenness concept proposed earlier by the author. In the second phase, the transformed network is
processed by a disjoint community detection algorithm and the detected communities were converted
back to overlapping communities of the original network.

Non-clique methods for overlapping community detection

Some other non-clique methods to discover overlapping communities are given in table 5. These
methods are briefly explained in this section. An extension of Newman’s
modularity for directed graphs and overlapping communities was done by Nicosia et al53 and
modularity was given by

$$Q_{ov} = \frac{1}{m} \sum_{c \in C} \sum_{i,j \in V} \Big[ \beta_{l(i,j),c}\, A_{ij} - \frac{\beta^{out}_{l(i,j),c}\, k_i^{out}\, \beta^{in}_{l(i,j),c}\, k_j^{in}}{m} \Big]$$

The authors defined a belongingness coefficient $\beta_{l,c}$ of an edge $l$ connecting nodes $i$ and $j$ for a
particular community $c$, given by $\beta_{l,c} = \mathcal{F}(\alpha_{i,c}, \alpha_{j,c})$, where the definition of
$\mathcal{F}(\alpha_{i,c}, \alpha_{j,c})$ is taken as arbitrary, e.g., it can be taken as the product of the belonging
coefficients of the nodes involved, or as $\max(\alpha_{i,c}, \alpha_{j,c})$, with

$$\beta^{out}_{l(i,j),c} = \frac{\sum_{j \in V} \mathcal{F}(\alpha_{i,c}, \alpha_{j,c})}{|V|}, \qquad \beta^{in}_{l(i,j),c} = \frac{\sum_{i \in V} \mathcal{F}(\alpha_{i,c}, \alpha_{j,c})}{|V|}.$$

A genetic approach has been used in this work for the optimization of the modularity
function. Another work which uses a genetic approach for overlapping community detection is
GA-Net+ by Pizzuti85, which detects overlapping communities using edge clustering. Order
Statistics Local Optimization Method (OSLOM)86 detects clusters in networks, and can handle
various kinds of graph properties like edge direction, edge weights, overlapping communities,
hierarchy and network dynamics. It is based on local optimization of a fitness function expressing the
statistical significance of clusters with respect to random fluctuations, which is estimated with tools
of Extreme and Order Statistics. Baumes et al87 considered a community as a subset of nodes which
induces a locally optimal subgraph with respect to a density function. Two different subsets with
significant overlap can be locally optimal which forms the basis to find overlapping communities.

25
Chen et al88 used a game-theoretic approach to address the issue of overlapping communities. Each
node is assumed to be an agent trying to improve the utility by joining or leaving the community. The
community of the nodes in Nash equilibrium are assumed to form the output of the algorithm. Utility
of an agent is formulated as combination of a gain and a loss function. To capture the idea of
overlapping communities, each agent is permitted to select multiple communities. In another game-
theoretic approach, Alvari et al89 proposed an algorithm consisting of two methods PSGAME based
on Pearson correlation, and NGGAME centred on neighbourhood similarity measure. Alvari et al90
proposed the Dynamic Game Theory method (D-GT) which treated nodes as rational agents. These
agents perform actions in iterative and game theoretic manner so as to maximize the total utility.

Community detection for Dynamic networks


Dynamic networks are the networks in which the membership of the nodes of communities
evolve or change over time. The task of community identification for dynamic networks has received
relatively less attention than static networks. The methods have been categorized into two classes
by Bansal et al94, one designed for data which is evolving in real time known as incremental or
online community detection; and the other for data where all the changes of the network evolution are
known a priori, known as offline community detection. Wolf et al95 proposed mathematical and
computational formulations for the analysis of dynamic communities on the basis of social
interactions occurring in the network. Tantipathananandh et al96 made assumptions about the
individual behaviour and group membership. They then framed the objective as an optimization
problem by formulating three cost functions, namely i-cost, g-cost and c-cost. Graph colouring and
heuristics based approaches were deployed. FacetNet, proposed by Lin et al97, is a unified framework
to study the dynamic evolutions of communities. The community structure at any time includes the
network data as well as the previous history of the evolution. They have used a cost function and
proposed an iterative algorithm which converges to an optimal solution. Palla et al98 conducted
experiments on two diverse datasets of phone call network and collaboration network to find time
dependence. After building joint graphs for two time steps, the CPM algorithm73 was applied. They
have used an auto-correlation function to find overlap among two states of a community, and a
stationarity parameter which denotes the average correlation of various states. Greene et al99
proposed a heuristic technique for identification of dynamic communities in the network data. They
represented the dynamic network graph as an aggregation of time step graphs. Step communities
represent the dynamic communities at a particular time. The algorithm begins with the application of

26
a static community detection algorithm on the graph. In the subsequent steps, dynamic communities
are created for each step and Jaccard similarity is calculated. They have also generated benchmark
dataset for experimental work. The algorithm by Bansal et al94 involves the addition or deletion of
edges in the network. The algorithm is built on the greedy agglomerative technique of the modularity
based method earlier proposed in the work of Clauset et al35. He et al100 improved the Louvain
method36 to include the concept of dynamicity in the formation of communities. A key point in their
algorithm is to make use of previously detected communities at time 𝑡 − 1 to identify the
communities at time 𝑡. Dinh et al101 proposed A3CS, an adaptive framework which uses the power-law
distribution and achieves approximation guarantees for the NP-hard modularity maximization
problem, particularly on dynamic networks. Nguyen et al102 have attempted to identify disjoint
community structure in dynamic social networks. An adaptive modularity-based framework Quick
Community Adaptation (QCA) is proposed. The method finds and traces the progress of network
communities in dynamic online social networks. Takaffoli et al103 have proposed a two-step
approach to community detection. In the first step the communities extracted at different time
instances are compared using weighted bipartite matching. Next, a ‘meta’ community is constructed
which is defined as a series of similar communities at various time instances. Five events to capture
the changes to community are split, survive, dissolve, merge, and form. A similarity function is used
to calculate the similarity between two communities and a community matching algorithm has been
employed thereafter. The authors, Kim et al104 proposed a particle-and-density based evolutionary
clustering method for discovery of communities in dynamic networks. Their approach is grounded on
the assumption that a network is built of a number of particles termed as nano-communities, where
each community further is made up of particles termed as quasi-clique-by-clique (l-KK). The density
based clustering method uses cost embedding technique and optimal modularity method to ensure
temporal smoothness even when the number of clusters varies. They have used an information theory
based mapping technique to recognize the stages of the community i.e. evolving, forming or
dissolving. Their method improves accuracy and is time efficient as compared to the FacetNet
method proposed earlier. In another approach proposed by Chi et al105, two frameworks for
evolutionary spectral clustering have been proposed namely PCQ (Preserving cluster quality) and
PCM (Preserving cluster membership). In this work the temporal smoothness is ensured by some
terms in the clustering cost functions. These two frameworks combine the processes of community
extraction and the community evolution process. They use a cost function which consists of the
snapshot and temporal cost. The clustering quality of any partition determines the snapshot cost while
the temporal cost definition varies for each of the frameworks. For PCQ framework, the temporal

27
cost is decided by the cluster quality when the current partition is applied to the historic data. In
PCM, the difference between the current and the historic partition gives the temporal cost. Both the
frameworks proposed, can tackle the change in number of clusters. In their work DYNMOGA
(Dynamic MultiObjective Genetic Algorithm), the authors Folino et al106 have used a genetic
algorithm based approach to dynamic community detection. They attempt to achieve temporal
smoothness by multiobjective optimisation, i.e. maximisation of snapshot quality (community score
is used) and minimization of temporal cost (here NMI is used). Kim et al107, in their method
CHRONICLE, have performed two-stage clustering; the method can detect clusters of the path-group
type in addition to single-path type clusters. In the first stage of the algorithm, called
CHRONICLE1st, the cosine similarity measure is used. In the second stage the proposed measure,
general similarity (GS), is used; it is a combination of the two measures structural affinity and
weight affinity.
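Several of the dynamic methods above, for example Greene et al99 and Takaffoli et al103, rely on matching the communities found at consecutive time steps by set similarity. A minimal sketch of Jaccard-based matching (the threshold value and the greedy matching strategy are illustrative, not taken from any one paper):

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two node sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def match_communities(previous, current, threshold=0.3):
    """Match every community of the current time step to its most similar
    community of the previous step; below the threshold it is treated as
    newly formed (None). Returns (ancestor, community, similarity) triples."""
    matches = []
    for c in current:
        ancestor, score = None, 0.0
        for p in previous:
            s = jaccard(p, c)
            if s > score:
                ancestor, score = p, s
        matches.append((ancestor if score >= threshold else None, c, score))
    return matches

prev = [{1, 2, 3, 4}, {5, 6, 7}]
curr = [{1, 2, 3}, {8, 9}]
for ancestor, comm, s in match_communities(prev, curr):
    print(ancestor, "->", comm, round(s, 2))  # {1,2,3,4} survives; {8,9} is new
```

Events such as split, merge or dissolve can then be read off the matching, e.g. two current communities matching the same ancestor indicate a split.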

SOME POTENTIAL APPLICATIONS OF COMMUNITY DETECTION


With the enormous growth in the number of social networking site users, the graphs representing these sites are
becoming very complex, hence difficult to visualize and understand. Communities can be considered
as a summary of the whole network thus making the network easy to comprehend. The discovery of
these communities in social networks can be useful in various applications. Some of the applications
where community detection is useful are briefly described below.

Improving recommender systems with community detection
Recommender Systems use data of similar users or similar items to generate
recommendations. This is analogous to the identification of groups, or similar nodes in a graph.
Hence community detection holds an immense potential for recommendation algorithms. Cao et
al114 have used a community detection based approach to improve the traditional collaborative
filtering process of Recommender Systems. The process starts with the mapping of user-item matrix
to user similarity structure. On this matrix, a discrete PSO (particle swarm optimization) algorithm is
applied to detect communities. The items are then recommended to the user based on the discovered
communities.
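The idea of community-based recommendation can be sketched very simply: once communities have been detected, recommend to a user the items most popular among the other members of their community. The data and names below are invented for illustration; this is not the PSO-based method of Cao et al114.

```python
from collections import Counter

def recommend(user, communities, user_items, top_n=3):
    """Recommend the items most popular among the other members of the
    user's community that the user has not interacted with yet."""
    community = next((c for c in communities if user in c), {user})
    counts = Counter()
    for member in community:
        if member != user:
            counts.update(user_items.get(member, ()))
    seen = set(user_items.get(user, ()))
    return [item for item, _ in counts.most_common() if item not in seen][:top_n]

# Hypothetical detected communities and per-user interaction lists.
communities = [{"ann", "bob", "cat"}, {"dan", "eve"}]
user_items = {"ann": ["i1", "i2"], "bob": ["i2", "i3"], "cat": ["i3"], "dan": ["i9"]}
print(recommend("ann", communities, user_items))  # -> ['i3']
```

Restricting the neighbourhood to the user's community replaces the expensive all-pairs similarity search of plain collaborative filtering with a lookup inside a much smaller group.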
Evolution of communities in social media
With the increase in the number of social networking sites, the focus and scope of sites are getting
expanded. The sites are getting diversified in terms of focus. In addition to common sites like
Facebook, Twitter, MySpace and Bebo, other sites like Flickr for photo-sharing have also come up.

28
The analysis of the tweet-retweet and the follower-followee networks in Twitter provides an insight into
the community structure existing in the Twitter network. Sentiment analysis of the tweets may be
performed as an intermediary step to find the general nature of the tweets and then community
detection algorithms may be applied to help deduce the structure of communities. Zalmout et al115,
applied a community detection algorithm to a UK political tweets dataset. CQA (Community Question
Answering) data has been used by Zhang et al116 to discover overlapping communities in dynamic
networks based on user interactions.

1.1 WHAT IS SOFTWARE?

Software, in a general sense, is understood as a group of instructions or programs that instruct a
computer to perform specific tasks. Software is a general term used to describe computer
programs; scripts, applications, programs and sets of instructions are all terms used to
describe software.
The theory of software was first proposed by Alan Mathison Turing in 1936 in his paper
"On Computable Numbers, with an Application to the Entscheidungsproblem". The word
software itself was coined by the statistician and mathematician John Tukey in a 1958 issue
of the American Mathematical Monthly, in which he discussed the programs of electronic calculators.
Software is typically divided into three categories:
• System software serves as a base for application software. System software generally includes
operating systems, device drivers, text editors, compilers, disk formatters and utilities that help
the computer operate more efficiently. It is responsible for providing basic non-task-specific
functionalities and for the management of hardware components. System software is typically
written in the C programming language.
• Programming software is a set of tools that help developers write programs. The various tools
available include linkers, compilers, interpreters, debuggers and text editors.
• Application software is used to perform specific tasks; examples of application software include
educational software, database management systems, office suites and gaming applications.
Application software can be either a single program or a collection of smaller programs.

Python is an easy to learn, powerful programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s
elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal
language for scripting and rapid application development in many areas on most platforms.

The Python interpreter and the extensive standard library are freely available in source or
binary form for all major platforms from the Python Web site, https://www.python.org/, and
may be freely distributed. The same site also contains distributions of and pointers to many
free third party Python modules, programs and tools, and additional documentation.

The Python interpreter is easily extended with new functions and data types implemented in
C or C++ (or other languages callable from C). Python is also suitable as an extension
language for customizable applications.

This tutorial introduces the reader informally to the basic concepts and features of the Python
language and system. It helps to have a Python interpreter handy for hands-on experience, but
all examples are self-contained, so the tutorial can be read off-line as well.

For a description of standard objects and modules, see The Python Standard Library. The
Python Language Reference gives a more formal definition of the language. To write
extensions in C or C++, read Extending and Embedding the Python Interpreter and Python/C
API Reference Manual. There are also several books covering Python in depth.

This tutorial does not attempt to be comprehensive and cover every single feature, or even
every commonly used feature. Instead, it introduces many of Python’s most noteworthy
features, and will give you a good idea of the language’s flavor and style. After reading it,
you will be able to read and write Python modules and programs, and you will be ready to
learn more about the various Python library modules described in The Python Standard
Library.

The Python Standard Library

While The Python Language Reference describes the exact syntax and semantics of the
Python language, this library reference manual describes the standard library that is
distributed with Python. It also describes some of the optional components that are commonly
included in Python distributions.

Python’s standard library is very extensive, offering a wide range of facilities as indicated by
the long table of contents listed below. The library contains built-in modules (written in C)
that provide access to system functionality such as file I/O that would otherwise be
inaccessible to Python programmers, as well as modules written in Python that provide
standardized solutions for many problems that occur in everyday programming. Some of
these modules are explicitly designed to encourage and enhance the portability of Python
programs by abstracting away platform-specifics into platform-neutral APIs.
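As a small illustration of the platform-neutral APIs described above, the standard os.path module hides the host platform's path separator; the path components below are made-up examples:

```python
import os
import os.path

# os.path builds paths with the separator of the host platform
# ('\\' on Windows, '/' on Unix), so the same code runs unchanged
# on both. The component names here are illustrative only.
config_path = os.path.join("app", "settings", "config.ini")
print(config_path)
```

The same portability principle applies to modules such as tempfile, shutil, and pathlib.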

The Python installers for the Windows platform usually include the entire standard library
and often also include many additional components. For Unix-like operating systems Python
is normally provided as a collection of packages, so it may be necessary to use the packaging
tools provided with the operating system to obtain some or all of the optional components

Dealing with Bugs

Python is a mature programming language which has established a reputation for stability. In
order to maintain this reputation, the developers would like to know of any deficiencies you
find in Python.

It can sometimes be faster to fix bugs yourself and contribute patches to Python, as this streamlines the process and involves fewer people. Learn how to contribute.

Documentation bugs

If you find a bug in this documentation or would like to propose an improvement, please
submit a bug report on the tracker. If you have a suggestion how to fix it, include that as well.

If you’re short on time, you can also email documentation bug reports to docs@python.org
(behavioral bugs can be sent to python-list@python.org). ‘docs@’ is a mailing list run by
volunteers; your request will be noticed, though it may take a while to be processed.

See also

Documentation bugs on the Python issue tracker

Using the Python issue tracker

Bug reports for Python itself should be submitted via the Python Bug Tracker
(https://bugs.python.org/). The bug tracker offers a Web form which allows pertinent
information to be entered and submitted to the developers.

The first step in filing a report is to determine whether the problem has already been reported.
The advantage in doing so, aside from saving the developers time, is that you learn what has
been done to fix it; it may be that the problem has already been fixed for the next release, or
additional information is needed (in which case you are welcome to provide it if you can!).
To do this, search the bug database using the search box on the top of the page.

If the problem you’re reporting is not already in the bug tracker, go back to the Python Bug
Tracker and log in. If you don’t already have a tracker account, select the “Register” link or,
if you use OpenID, one of the OpenID provider logos in the sidebar. It is not possible to
submit a bug report anonymously.

Being now logged in, you can submit a bug. Select the “Create New” link in the sidebar to
open the bug reporting form.

The submission form has a number of fields. For the “Title” field, enter a very short
description of the problem; less than ten words is good. In the “Type” field, select the type of
your problem; also select the “Component” and “Versions” to which the bug relates.

In the “Comment” field, describe the problem in detail, including what you expected to
happen and what did happen. Be sure to include whether any extension modules were
involved, and what hardware and software platform you were using (including version
information as appropriate).

Each bug report will be assigned to a developer who will determine what needs to be done to
correct the problem. You will receive an update each time action is taken on the bug.

1.2 WHAT IS SOFTWARE DEVELOPMENT LIFE CYCLE(SDLC)

The software development life cycle (SDLC) is a framework that defines the tasks performed at each step of the software development process. SDLC is a structure followed by a development team within a software organization. It consists of a detailed plan describing how to develop, maintain, and replace specific software. The life cycle defines a methodology for improving the quality of the software and the overall development process. The software development life cycle is also called the software development process. SDLC consists of the following activities:

1.Planning: Requirement gathering and analysis, among the most important parts of software development, is usually done by the most skilled and experienced software engineers in the organization. After the requirements are gathered from the client, a scope document is created in which the scope of the project is determined and documented.

2.Implementation: The software engineers start writing the code according to the client's requirements.

3.Testing: This is the process of finding defects or bugs in the created software.

4.Documentation: Every step of the project is documented for future reference and for improving the software during the development process. The design documentation may include the application programming interface (API).

5.Deployment and maintenance: The software is deployed after it has been approved for release.

6.Maintenance: Software maintenance is done for future reference. Software improvement and new requirements (change requests) can take longer than the initial development of the software.

In short, SDLC is a standard used by the software industry to develop good software.

SDLC (Spiral Model):

Fig 1: Spiral Model


Stages of SDLC:

 Requirement Gathering and Analysis

 Designing

 Coding

 Testing

 Deployment

Requirements Definition Stage and Analysis:


The requirements gathering process takes as its input the goals identified in the high-level
requirements section of the project plan. Each goal will be refined into a set of one or more
requirements. These requirements define the major functions of the intended application, define

operational data areas and reference data areas, and define the initial data entities. Major functions
include critical processes to be managed, as well as mission critical inputs, outputs and reports. A
user class hierarchy is developed and associated with these major functions, data areas, and data
entities. Each of these definitions is termed a Requirement. Requirements are identified by unique
requirement identifiers and, at minimum, contain a requirement title and textual description.

Fig 2: Requirement Stage


These requirements are fully described in the primary deliverables for this stage: the
Requirements Document and the Requirements Traceability Matrix (RTM). The requirements
document contains complete descriptions of each requirement, including diagrams and references to
external documents as necessary. Note that detailed listings of database tables and fields are not
included in the requirements document. The title of each requirement is also placed into the first
version of the RTM, along with the title of each goal from the project plan. The purpose of the RTM
is to show that the product components developed during each stage of the software development
lifecycle are formally connected to the components developed in prior stages.
In the requirements stage, the RTM consists of a list of high-level requirements, or goals, by
title, with a listing of associated requirements for each goal, listed by requirement title. In this
hierarchical listing, the RTM shows that each requirement developed during this stage is formally
linked to a specific product goal. In this format, each requirement can be traced to a specific product
goal, hence the term requirements traceability. The outputs of the requirements definition stage
include the requirements document, the RTM, and an updated project plan.
Design Stage:
The design stage takes as its initial input the requirements identified in the approved
requirements document. For each requirement, a set of one or more design elements will be produced

as a result of interviews, workshops, and/or prototype efforts. Design elements describe the desired
software features in detail, and generally include functional hierarchy diagrams, screen layout
diagrams, tables of business rules, business process diagrams, pseudo code, and a complete entity-
relationship diagram with a full data dictionary. These design elements are intended to describe the
software in sufficient detail that skilled programmers may develop the software with minimal
additional input.

Fig 3: Design Stage


When the design document is finalized and accepted, the RTM is updated to show that each
design element is formally associated with a specific requirement. The outputs of the design stage are
the design document, an updated RTM, and an updated project plan.
Development Stage:
The development stage takes as its primary input the design elements described in the
approved design document. For each design element, a set of one or more software artifacts will be
produced. Software artifacts include but are not limited to menus, dialogs, data management forms,
data reporting formats, and specialized procedures and functions. Appropriate test cases will be

developed for each set of functionally related software artifacts, and an online help system will be
developed to guide users in their interactions with the software.

Fig 4: Development Stage


The RTM will be updated to show that each developed artefact is linked to a specific design element,
and that each developed artefact has one or more corresponding test case items. At this point, the
RTM is in its final configuration. The outputs of the development stage include a fully functional set
of software that satisfies the requirements and design elements previously documented, an online help
system that describes the operation of the software, an implementation map that identifies the primary
code entry points for all major system functions, a test plan that describes the test cases to be used to
validate the correctness and completeness of the software, an updated RTM, and an updated project
plan.
Integration & Test Stage:
During the integration and test stage, the software artefacts, online help, and test data are
migrated from the development environment to a separate test environment. At this point, all test
cases are run to verify the correctness and completeness of the software. Successful execution of the
test suite confirms a robust and complete migration capability.

During this stage, reference data is finalized for production use and production users are
identified and linked to their appropriate roles. The final reference data (or links to reference data
source files) and production user list are compiled into the Production Initiation Plan.

Fig 5: Integration and Test stage


The outputs of the integration and test stage include an integrated set of software, an online
help system, an implementation map, a production initiation plan that describes reference data and
production users, an acceptance plan which contains the final suite of test cases, and an updated
project plan.
Installation & Acceptance Stage
During the installation and acceptance stage, the software artifacts, online help, and initial
production data are loaded onto the production server. At this point, all test cases are run to verify the
correctness and completeness of the software. Successful execution of the test suite is a prerequisite
to acceptance of the software by the customer.
After customer personnel have verified that the initial production data load is correct and the
test suite has been executed with satisfactory results, the customer formally accepts the delivery of
the software.

Fig 6: Installation and Acceptance Stage
The primary outputs of the installation and acceptance stage include a production application,
a completed acceptance test suite, and a memorandum of customer acceptance of the software.
Finally, the PDR enters the last of the actual labour data into the project schedule and locks the
project as a permanent project record. At this point the PDR "locks" the project by archiving all
software items, the implementation map, the source code, and the documentation for future reference.
2.4 SYSTEM ARCHITECTURE
Architecture Flow:
The architecture diagram below represents the flow of requests from users to the database through the servers. The overall system is designed in three separate tiers using three layers: the presentation layer, the business layer, and the data access layer. This project was developed using a 3-tier
architecture.
3-Tier Architecture:
The three-tier software architecture (three layer architecture) emerged in the 1990s to
overcome the limitations of the two-tier architecture. The third tier (middle tier server) is between the
user interface (client) and the data management (server) components. This middle tier provides
process management where business logic and rules are executed and can accommodate hundreds of
users (as compared to only 100 users with the two tier architecture) by providing functions such as
queuing, application execution, and database staging.
The three-tier architecture is used when an effective distributed client/server design is needed that provides (compared to the two-tier architecture) increased performance, flexibility, maintainability, reusability, and scalability, while hiding the complexity of distributed processing from the user. These characteristics have made three-layer architectures a popular choice for Internet applications and net-centric information systems.
Advantages of Three-Tier:
 Separates functionality from presentation.

 Clear separation - better understanding.

 Changes limited to well-defined components.

CHAPTER-2
LITERATURE REVIEW

A wide range of research in recent years has focused on community detection in complex systems [4]; most of it concentrates on undirected networks to improve the efficiency of identifying communities and understanding complex networks. For instance, Fortunato et al. [3] based their approach on a statistical inference perspective; Schaeffer et al. [5] treated the clustering problem as an unsupervised learning task based on a similarity measure over the network data; Girvan and Newman based their community detection proposal on betweenness calculation to find community boundaries, with the modularity measure assessing the overall quality of the graph partitioning [6, 7]. The weight used by Newman and Girvan [7] is the betweenness measure of an edge, representing the number of shortest paths between any pair of nodes that pass through it. However, the community detection problem has been studied mainly for undirected networks, and various solutions were proposed in this context, motivating many disciplines to address the issue.
Interestingly, Fortunato et al. [3] noted the few possibilities for extending techniques from the undirected to the directed case, where edge directedness is not the only complication facing the clustering problem. Nevertheless, diverse graph data in many real-world applications are directed by nature, so it is worthwhile to preserve the information carried by edge directionality. Malliaros et al. [8] revealed in their survey that the most common way for the research community to deal with the clustering problem is to ignore the directionality of the graph and then proceed to clustering with a wide range of proposed tools. Therefore, most community detection proposals cannot be used directly on weighted directed graphs, where the number of communities is not always known in advance and the communities exhibit different granularity scales. Since the problem of community detection in complex network analysis attracts increasing attention, many researchers have been interested in structural information and topological network metrics [1, 3, 4, 6, 7, 8, 9]. In [10], S. Ahajjam et al. based their community detection algorithm on a new scalable approach using leader-node characteristics in two steps: (i) identification of potential leaders in the network, and (ii) exploration of node similarities around the leaders to build communities. Therefore, recent works have started focusing on both topological and topical aspects [9, 11, 12] to overcome the limited performance of topology-based community detection approaches. Topic-based community detection has gained attention through different works on community detection in complex networks [9, 13, 14]. The essence of the approach is to detect nodes with similar properties, which are not necessarily real connections between nodes of the network, in which actors communicate on topics of mutual interest [14], so as to determine the communities that are topically similar.

2.1 RELATED WORK

Definition of Local Community

The problem of local community detection was proposed by Clauset [15]. Usually we define the local community problem in the following way: there is an undirected graph G = (V, E), where V represents the set of nodes and E represents the edges in the graph. The connecting information of partial nodes in the graph is known or can be obtained. The local community is defined as D. The set of nodes connected with D is defined as N, and the set of nodes in D connected with nodes in N is defined as the boundary node set B. That is to say, any node in B is connected to at least one node in N, and the rest of D is the core node set C, as shown in Figure 1.

Figure 1 
Definition of local community.

The local community detection problem starts from a preselected source node; the algorithm gradually adds nodes in N that meet the conditions into D and removes nodes that do not meet the conditions from D.

2.2. Related Algorithms

At present, many local community detection algorithms have been proposed. We introduce two
representative local community detection algorithms.

(1) Clauset Algorithm. In order to solve the problem of local community detection, Clauset [15] put forward the local community modularity R and gave a fast-converging greedy algorithm to find the local community with the greatest modularity.

The definition of the local community modularity R is as follows:

    R = (Σij B_ij · δ(i, j)) / (Σij B_ij),

where i and j represent two nodes in the graph and the sums run over node pairs involving the boundary set B. If nodes i and j are connected, the value of B_ij is 1; otherwise, it is 0; if nodes i and j are both in D, the value of δ(i, j) is 1; otherwise, it is 0.

The local community detection process of the Clauset algorithm is similar to that of a web crawler. First, the Clauset algorithm starts from an initial node v. Node v is added to the subgraph D, and all its neighbor nodes are added to N. Then the algorithm iteratively adds the node in N that brings the maximum increment of R into the local community, until the scale of the local community reaches a preset size. That is to say, the algorithm needs a parameter that decides the size of the community, and the result is greatly influenced by the choice of initial node.
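The greedy expansion described above can be sketched in Python. This is a simplified stand-in, not Clauset's exact algorithm: instead of the boundary modularity R it greedily adds the neighbor contributing the most internal edges, which conveys the same expand-to-a-preset-size idea. The toy graph and node labels are ours:

```python
def local_expand(adj, seed, k):
    """Grow a community around `seed` until it holds k nodes,
    greedily picking the frontier node with the most edges into
    the current community (a stand-in for maximising R)."""
    community = {seed}
    frontier = set(adj[seed])                # candidate set, like N
    while len(community) < k and frontier:
        # internal edges gained by adding u
        best = max(frontier,
                   key=lambda u: sum(1 for w in adj[u] if w in community))
        community.add(best)
        frontier |= adj[best]
        frontier -= community
    return community

# Two triangles {1,2,3} and {4,5,6} joined by the bridge edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(local_expand(adj, 1, 3))   # recovers the triangle {1, 2, 3}
```

Note the preset size k: as the text says, the result depends on this parameter and on the seed node.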

(2) LWP Algorithm. The LWP algorithm [16] is an improved algorithm with a clear termination condition compared with the Clauset algorithm. The algorithm defines another local community modularity M, expressed as

    M = (Σij λ_ij · δ(i, j)) / (Σij λ_ij · β(i, j)),

where i and j represent two nodes in the graph. If nodes i and j are connected to each other, the value of λ_ij is 1; otherwise, it is 0; if nodes i and j are both in D, the value of δ(i, j) is 1; otherwise, it is 0; if only one of the nodes i and j is in D, the value of β(i, j) is 1; otherwise, it is 0.

Given an undirected and unweighted graph G, the LWP algorithm starts from an initial node to find a subgraph with the maximum value of M. If the subgraph is a community (i.e., M > 1), then it returns the subgraph as a community. Otherwise, it is considered that no community can be found starting from this initial node. For an initial node v, the LWP algorithm finds a subgraph with the maximum value of local modularity M in two steps. First, the algorithm is initialized by constructing a subgraph D containing only the initial node v, and all the neighbor nodes of node v are added to the set N. Then the algorithm performs an incremental step and a pruning step.

In the incremental step, the node selected from N that increases the local modularity of D the most is added to D iteratively. The greedy algorithm keeps adding nodes from N to D until no node in N can be added. In the pruning step, if the local modularity of D becomes larger when a node is removed from D, that node is actually removed. In the process of pruning, the algorithm must ensure that the connectivity of D is not destroyed, and pruning continues until no node can be removed. Then the set N is updated and the two steps are repeated until nothing changes. The algorithm has a high recall, but its accuracy is low.

The complexity of both algorithms depends on K, the number of nodes to be explored in the local community, and d, the average degree of the nodes to be explored in the local community.
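The verbal definition of M above (internal edges over boundary-crossing edges, with M > 1 indicating a community) can be sketched as follows; the toy graph is ours, not taken from [16]:

```python
def local_modularity_M(adj, D):
    """Ratio of edges internal to the node set D to edges crossing
    its boundary; M > 1 suggests D forms a community."""
    internal = boundary = 0
    for u in D:
        for w in adj[u]:
            if w in D:
                internal += 1    # each internal edge counted twice
            else:
                boundary += 1    # each crossing edge counted once
    internal //= 2
    return internal / boundary if boundary else float("inf")

# Two triangles joined by the single bridge edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(local_modularity_M(adj, {1, 2, 3}))   # 3 internal / 1 crossing = 3.0
```

Here M = 3.0 > 1, so the triangle {1, 2, 3} qualifies as a community under the LWP criterion.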

3. Description of the Proposed Algorithm

3.1. Discovery of Minimal Cluster

Generally, a network can be described by a graph G = (V, E), where V is the set of nodes and E is the set of edges; it contains n nodes and m edges. C represents the node set of a local community in the network, and |C| is the number of nodes in C. We introduce two definitions related to the algorithm proposed in this paper.

Definition 1 (neighbor node set). It is the set of nodes connected directly to a single node or a community.
For node v, its neighbor node set can be expressed as N(v) = {u | (u, v) ∈ E}.
For a community C containing k nodes, its neighbor node set can be expressed as N(C) = (∪v∈C N(v)) \ C.
Definition 2 (number of shared neighbors). The number of shared neighbors of nodes u and v can be calculated as NSN(u, v) = |N(u) ∩ N(v)|.
The minimal cluster detection is the key to the algorithm. The minimal cluster is the set of nodes most closely connected to the initial node. We use a method proposed in [22] to find the nodes that are closely connected with the initial node. It uses the widely used density function [23], which can be calculated as

    ρ(C) = 2 · E_C / (n_C · (n_C − 1)),

where E_C represents the number of edges in community C and n_C represents the number of nodes in community C. The larger ρ is, the more densely the nodes in C are connected. It is necessary to set a threshold for ρ to decide which nodes are selected to form the initial minimal cluster; [22] gives the definition of this threshold function.

The thresholds select the nodes that constitute the minimal cluster: when the density of a candidate node set reaches either threshold, these nodes are considered to form a minimal cluster. Compared with other methods, the threshold value does not depend on a manual setting but depends entirely on the nodes themselves, so the uncertainty of the algorithm is reduced. Through this process, all nodes in the network can be assigned to several densely connected clusters. In the process, the constraint conditions of the minimal clusters are

relatively strict. Then the global community structure of the network is found by combining these
minimal clusters. This is a process from local to global by finding all minimal clusters to obtain the
global structure of the network. Our local community detection algorithm only needs to find one
community in the global network. Inspired by this idea, we improve this algorithm as shown in
Algorithm 1.

Input: graph G = (V, E), initial node v
Output: minimal cluster C
(1) C = ∅;
(2) for each node u ∈ N(v) do
(3)   if NSN(v, u) is the largest
(4)     C = {v, u} ∪ (N(v) ∩ N(u));
(5)   end if
(6) end for
(7) return C
Algorithm 1 
Locating minimal cluster.

In the network G, we want to find the minimal cluster containing node v. First we traverse all the neighbors of node v to find the node u that shares the most neighbors with v (step 3). Then we take nodes v and u together with their shared neighbor nodes as the initial minimal cluster (step 4). Generally speaking, node v and its neighbor nodes are most likely to belong to the same community. We find the node u most closely connected with v according to the number of their shared neighbors: the more neighbors they share, the more closely the two nodes are connected. That is to say, the nodes connected with both v and u are more likely to belong to the same community. We put them together as the initial minimal cluster for local community expansion, which experiments have verified to be effective and reliable.

The process of finding the minimal cluster is illustrated by the example in Figure 2. Suppose that we want to find the minimal cluster containing node 1. We traverse its neighbor nodes 2, 3, 4, and 6 and count the neighbors each shares with node 1. Node 3 turns out to be the most closely connected to node 1, so the minimal cluster consists of nodes 1 and 3 together with their shared neighbors; this set C is the starting node set of the local community extension.

Figure 2 
The discovery of minimal cluster.
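Algorithm 1 can be sketched in a few lines of Python. The graph below is an illustrative toy example of ours, not the graph of Figure 2:

```python
def minimal_cluster(adj, v):
    """Pick the neighbour u of v sharing the most neighbours with v,
    then return {v, u} plus their shared neighbours (Algorithm 1)."""
    best_u = max(adj[v], key=lambda u: len(adj[v] & adj[u]))
    return {v, best_u} | (adj[v] & adj[best_u])

# Toy undirected graph stored as an adjacency dict of sets.
adj = {1: {2, 3, 4, 6}, 2: {1, 3}, 3: {1, 2, 4},
       4: {1, 3, 5}, 5: {4, 6}, 6: {1, 5}}

# Node 3 shares neighbours {2, 4} with node 1, more than any other
# neighbour, so the minimal cluster is {1, 3} ∪ {2, 4}.
print(minimal_cluster(adj, 1))
```

The cluster returned here, {1, 2, 3, 4}, is the starting node set C that Algorithm 2 then expands.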

3.2. Detection of Local Community

First of all, we use Algorithm 1 to find the node most closely connected to the initial node. We take the initial node v and this node u, as well as their shared neighbor nodes, as the initial minimal cluster C. The second part of the algorithm expands nodes based on the minimal cluster and finally finds the local community. The specific process is shown in Algorithm 2.

Input: graph G = (V, E), minimal cluster C
Output: local community LC
(01) LC = C;
(02) calculate N(LC) and M;
(03) repeat
(04)   for each node u ∈ N(LC)
(05)     if ΔM is the largest
(06)       LC = LC ∪ {u};
(07)     end if
(08)   end for
(09)   update N(LC) and M;
(10) until no node can be added into LC
(11) return LC
Algorithm 2 
Local community detection.

In the algorithm, we still use the M function from the LWP algorithm as the criterion for local community expansion. Algorithm 1 finds the initial minimal cluster C. After that, Algorithm 2 finds the neighbor node set N(LC) of LC and calculates the initial value of M (step 02). Then it traverses all the nodes in N(LC) (steps 03-04) to find the node that maximizes M and adds it into the local community LC (steps 05–08); it updates N(LC) and M (step 09) until no new node is added to LC (step 10).

The complexity of the NewLCD algorithm is almost the same as that of the Clauset algorithm. The NewLCD algorithm spends extra time finding the minimal cluster, which is linear in the degree of the initial node v.
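Under the simplifications already noted (M as the internal-to-boundary edge ratio), the NewLCD pipeline of Algorithms 1 and 2 can be sketched end to end. Helper names are ours, and the stopping rule shown (stop when no candidate improves M) is one reasonable reading of step (10):

```python
def M(adj, D):
    # LWP measure: internal edges over boundary-crossing edges.
    internal = sum(1 for u in D for w in adj[u] if w in D) // 2
    boundary = sum(1 for u in D for w in adj[u] if w not in D)
    return internal / boundary if boundary else float("inf")

def new_lcd(adj, v):
    # Algorithm 1: minimal cluster around seed v.
    u = max(adj[v], key=lambda x: len(adj[v] & adj[x]))
    LC = {v, u} | (adj[v] & adj[u])
    # Algorithm 2: greedily add the neighbour that most improves M.
    while True:
        frontier = set().union(*(adj[n] for n in LC)) - LC
        if not frontier:
            break
        best = max(frontier, key=lambda x: M(adj, LC | {x}))
        if M(adj, LC | {best}) <= M(adj, LC):
            break                 # no candidate improves M: stop
        LC.add(best)
    return LC

# Two triangles joined by the bridge edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(new_lcd(adj, 1))   # recovers the triangle {1, 2, 3}
```

Starting from node 1, the minimal cluster already covers the left triangle, and adding the bridge node 4 would lower M, so expansion stops there.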

CHAPTER-3

PROBLEM IDENTIFICATION AND OBJECTIVES

This chapter states the problem in hand and lists out various objectives that are required to be met for
solving the problem.

3.1 Problem Statement

CHAPTER-4
METHODOLOGY

4.1 Methodology

4.1.1 Using Python Tool on Standalone machine Environment

Python is an essential tool for progress in the numerical analysis and machine learning spaces. It is an excellent way to produce reproducible, high-quality analysis. Python is extensible and offers rich facilities for developers to build their own tools and procedures for examining data. With machines becoming increasingly important as data generators, the prominence of such languages can be expected to grow. When it first came out, its greatest advantage was that it was free software.

The vastness of the package ecosystem is undeniably one of Python's strongest qualities: if a standard technique exists, chances are there is already a Python package for it.

Here, the accuracy of different machine learning algorithms has been explored using Python on a standalone machine. Initial analysis was done using Microsoft Excel. A CSV file is provided as input for Python, and the analysis is performed in Python in a Jupyter Notebook.
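The CSV input step can be sketched with the standard csv module; the column names and values below are placeholders, not the actual bug dataset:

```python
import csv
import io

# Stand-in for the CSV file exported from Excel; in the notebook this
# would be open("dataset.csv") with the real file name.
raw = "loc,complexity,bug\n120,4,0\n310,9,1\n"

rows = list(csv.DictReader(io.StringIO(raw)))
print(len(rows))        # number of records read
print(rows[0]["bug"])   # values arrive as strings and need conversion
```

After loading, the string fields would be converted to numeric types during the pre-processing step described below.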

FIG 4.1.1 Block Diagram of Proposed work

The data is gathered from web sources, after which pre-processing of the data is done, which includes data cleaning, data integration, and data transformation.

1. FEASIBILITY STUDY

A feasibility study is conducted once the problem is clearly understood. The feasibility study is a high-level capsule version of the entire system analysis and design process. Its objective is to determine whether the proposed system is feasible, to establish at minimum expense how to solve the problem, and to determine whether the problem is worth solving. The following are the three important tests that have been carried out for the feasibility study.

This study tells how this package is useful to the users, lists its advantages and disadvantages, and indicates whether the package is cost effective. There are three types of feasibility study:
 Economic Feasibility.

 Technical Feasibility.

 Operational Feasibility.

3.1 TECHNICAL FEASIBILITY

Evaluating technical feasibility is the trickiest part of a feasibility study, because at this point not many detailed designs of the system exist, making it difficult to assess issues like performance and costs (on account of the kind of technology to be deployed). A number of issues have to be considered while doing a technical analysis. We must understand the different technologies involved in the proposed system: before commencing the project, we have to be very clear about which technologies are required for the development of the new system, and find out whether the organization currently possesses them.
3.2 OPERATIONAL FEASIBILITY

A proposed project is beneficial only if it can be turned into an information system that meets the organization's operating requirements. Simply stated, this test of feasibility asks whether the system will work when it is developed and installed, and whether there are major barriers to implementation. Here are questions that help test the operational feasibility of a project:
 Is there sufficient support for the project from management and from users? If the current system is well liked and used to the extent that people will not see reasons for change, there may be resistance.

 Are the current business methods acceptable to the users? If they are not, users may welcome a change that brings about a more operational and useful system.

 Have the users been involved in the planning and development of the project?

 Since the proposed system is meant to help reduce the hardships encountered in the existing manual system, the new system was considered operationally feasible.

3.3 ECONOMIC FEASIBILITY

Economic feasibility attempts to weigh the costs of developing and implementing a new system against the benefits that would accrue from having the new system in place. This feasibility study gives top management the economic justification for the new system. A simple economic analysis that gives an actual comparison of costs and benefits is much more meaningful in this case. In addition, it proves to be a useful point of reference for comparing actual costs as the project progresses. There could be various types of intangible benefits on account of automation, including increased customer satisfaction, improvement in product quality, better decision making, timeliness of information, expedited activities, improved accuracy of operations, better documentation and record keeping, faster retrieval of information, and better employee morale.

System Design

UML Diagrams

UML (Unified Modeling Language) is a standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems. UML was created by the Object Management Group (OMG), and the UML 1.0 specification draft was proposed to the OMG in January 1997. It was initially intended to capture the behavior of complex software and non-software systems, and it has since become an OMG standard.

OMG is continuously making efforts to create a truly industry standard.

 UML stands for Unified Modeling Language.
 UML is different from common programming languages such as C++, Java, and
COBOL.
 UML is a pictorial language used to make software blueprints.
 UML can be described as a general-purpose visual modeling language used to
visualize, specify, construct, and document a software system.
 Although UML is generally used to model software systems, it is not limited
to this boundary. It is also used to model non-software systems, for
example the process flow in a manufacturing unit.
UML is not a programming language, but tools can be used to generate code in
various languages from UML diagrams. UML has a direct relation with object-oriented
analysis and design. After some standardization, UML became an OMG standard.

Components of the UML

UML diagrams are the ultimate output of the entire modeling activity. All the elements
and relationships are used to make a complete UML diagram, and the diagram represents a
system. The visual effect of the UML diagram is the most important part of the entire
process; all the other elements are used to make it complete.
UML includes the following nine diagrams:
 Class diagram
 Object diagram
 Use case diagram
 Sequence diagram
 Collaboration diagram
 Activity diagram
 State chart diagram
 Deployment diagram
 Component diagram
The following are the main UML diagrams used in this project:

1. Use-case Diagram
2. Class Diagram
3. Sequence Diagram
4. Activity Diagram
5. Collaboration Diagram

4.4 UML DIAGRAMS

4.4.1 USE CASE DIAGRAM

A use case diagram in the Unified Modeling Language (UML) is a type
of behavioral diagram defined by and created from a use-case analysis. Its
purpose is to present a graphical overview of the functionality provided by a
system in terms of actors, their goals (represented as use cases), and any
dependencies between those use cases. The main purpose of a use case diagram
is to show which system functions are performed for which actor. The roles of the
actors in the system can also be depicted.

4.4.2 STATE DIAGRAM

1. TESTING

SYSTEM TESTING

The purpose of testing is to discover errors. Testing is the process of
trying to discover every conceivable fault or weakness in a work product. It
provides a way to check the functionality of components, subassemblies,
assemblies, and/or a finished product. It is the process of exercising software
with the intent of ensuring that the software system meets its requirements and
user expectations and does not fail in an unacceptable manner. There are
various types of tests, and each test type addresses a specific testing requirement.

TYPES OF TESTS

Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is
the testing of individual software units of the application, and it is done after the
completion of an individual unit and before integration. This is structural testing
that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application,
and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and
contains clearly defined inputs and expected results.
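To make this concrete, here is a minimal sketch using Python's unittest framework; the function is_valid_entry is a hypothetical unit under test, not part of the project code:

```python
import unittest

def is_valid_entry(value):
    """Hypothetical unit under test: accept non-empty alphanumeric strings."""
    return isinstance(value, str) and value != "" and value.isalnum()

class TestIsValidEntry(unittest.TestCase):
    def test_valid_input_is_accepted(self):
        self.assertTrue(is_valid_entry("node42"))

    def test_empty_string_is_rejected(self):
        self.assertFalse(is_valid_entry(""))

    def test_non_string_is_rejected(self):
        self.assertFalse(is_valid_entry(42))

# Run the suite programmatically so the example is self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIsValidEntry)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method exercises one path through the unit with clearly defined inputs and an expected result, as described above.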

Integration testing

Integration tests are designed to test integrated software components to
determine whether they actually run as one program. Testing is event driven and is
more concerned with the basic outcome of screens or fields. Integration tests
demonstrate that although the components were individually satisfactory, as
shown by successful unit testing, the combination of components is also correct
and consistent. Integration testing is specifically aimed at exposing the
problems that arise from the combination of components.

Functional test

Functional tests provide systematic demonstrations that the functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements,
key functions, or special test cases. In addition, systematic coverage of
identified business process flows, data fields, predefined processes, and
successive processes must be considered for testing. Before functional testing is
complete, additional tests are identified and the effective value of current tests is
determined.
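As a small illustration of the valid/invalid input classes above (the node-id field and its accepted range are hypothetical assumptions for the sketch, not the project's actual specification):

```python
# Hypothetical sketch: equivalence classes for a node-id input field.
# The field spec (a string holding an integer id in [1, max_id]) is assumed.
def accepts_node_id(value, max_id=115):
    if not isinstance(value, str) or not value.isdigit():
        return False
    return 1 <= int(value) <= max_id

VALID_INPUTS = ["1", "34", "115"]            # representatives of the valid class
INVALID_INPUTS = ["", "-5", "abc", "999"]    # empty, negative, non-numeric, out of range

for v in VALID_INPUTS:
    assert accepts_node_id(v), f"valid input rejected: {v!r}"
for v in INVALID_INPUTS:
    assert not accepts_node_id(v), f"invalid input accepted: {v!r}"
print("all equivalence classes behave as specified")
```

Testing one representative from each class, rather than every possible value, is what gives functional testing its systematic coverage.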

System Test
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results.
An example of system testing is the configuration-oriented system integration
test. System testing is based on process descriptions and flows, emphasizing
pre-driven process links and integration points.

White Box Testing


White box testing is testing in which the software tester has
knowledge of the inner workings, structure, and language of the software, or at
least its purpose. It is used to test areas that cannot be reached
from a black box level.

Black Box Testing


Black box testing is testing the software without any knowledge of the
inner workings, structure, or language of the module being tested. Black box
tests, like most other kinds of tests, must be written from a definitive source
document, such as a specification or requirements document. It is testing in which
the software under test is treated as a black box: you cannot "see" into it. The test
provides inputs and responds to outputs without considering how the software works.

6.1 Unit Testing:

Unit testing is usually conducted as part of a combined code and unit test
phase of the software lifecycle, although it is not uncommon for coding and unit
testing to be conducted as two distinct phases.

Test strategy and approach


Field testing will be performed manually and functional tests will be
written in detail.

Test objectives
 All field entries must work properly.
 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.

Features to be tested
 Verify that the entries are of the correct format.
 No duplicate entries should be allowed.
 All links should take the user to the correct page.

6.2 Integration Testing

Software integration testing is the incremental integration testing of two
or more integrated software components on a single platform to produce failures
caused by interface defects.
The task of the integration test is to check that components or software
applications (e.g., components in a software system or, one step up, software
applications at the company level) interact without error.

Test Results: All the test cases mentioned above passed successfully. No
defects were encountered.

6.3 Acceptance Testing

User Acceptance Testing is a critical phase of any project and requires
significant participation by the end user. It also ensures that the system meets
the functional requirements.

CHAPTER-6
IMPLEMENTATION & RESULTS

7.2 CODING & TESTING

CONCLUSION

This paper proposes a new local community detection algorithm based on minimal clusters,
called NewLCD. The algorithm consists of two parts: the first finds the initial
minimal cluster for local community expansion, and the second adds nodes from the
neighbor node set that meet the local community condition into the local community. We
compare the proposed algorithm with three other local community detection algorithms on
real and artificial networks. The experimental results show that the proposed algorithm
finds the local community structure more effectively than the other algorithms.
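The two-part procedure summarized above can be sketched in Python. This is an illustrative reconstruction under stated assumptions, not the authors' reference implementation: the minimal-cluster step is simplified to the start node plus its most-overlapping neighbor, and the local community condition is approximated by an internal-edge-ratio gain test.

```python
def minimal_cluster(graph, start):
    """Part 1 (assumed form): the start node plus the neighbor sharing the
    most common neighbors with it."""
    neighbors = graph[start]
    best = max(neighbors, key=lambda v: len(graph[v] & neighbors), default=None)
    return {start} if best is None else {start, best}

def internal_edge_ratio(graph, community):
    """Fraction of the community's edge endpoints that stay inside it."""
    internal = sum(1 for u in community for v in graph[u] if v in community)
    total = sum(len(graph[u]) for u in community)
    return internal / total if total else 0.0

def newlcd(graph, start):
    """Part 2: greedily absorb the neighboring node that most improves the
    internal-edge ratio; stop when no neighbor improves it."""
    community = minimal_cluster(graph, start)
    while True:
        frontier = {v for u in community for v in graph[u]} - community
        if not frontier:
            break
        current = internal_edge_ratio(graph, community)
        best_ratio, best_node = max(
            (internal_edge_ratio(graph, community | {v}), v) for v in frontier
        )
        if best_ratio <= current:
            break
        community.add(best_node)
    return community

# Toy graph: two triangles joined by the bridge edge (3, 4); adjacency sets.
G = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
print(newlcd(G, 1))  # → {1, 2, 3} (the triangle containing the start node)
```

On the toy graph the expansion absorbs exactly the start node's triangle and stops at the bridge edge, mirroring the intended behavior of halting when no neighbor improves the community quality.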

REFERENCES

[1] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, pp. 292–313, 2004.
[2] J. Lee, S. P. Gross, and J. Lee, "Modularity optimization by conformational space annealing," Physical Review E, vol. 85, no. 5, Article ID 056702, pp. 499–508, 2012.
[3] H.-W. Shen and X.-Q. Cheng, "Spectral methods for the detection of network community structure: a comparative analysis," Journal of Statistical Mechanics: Theory and Experiment, vol. 2010, no. 10, Article ID P10020, 2010.
[4] J. Wu, Z.-M. Cui, Y.-J. Shi, S.-L. Sheng, and S.-R. Gong, "Local density-based similarity matrix construction for spectral clustering," Journal on Communications, vol. 34, no. 3, pp. 14–22, 2013.
[5] S. Mehrkanoon, C. Alzate, R. Mall, R. Langone, and J. A. K. Suykens, "Multiclass semisupervised learning based upon kernel spectral clustering," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 4, pp. 720–733, 2015.
[6] K. Taşdemir, B. Yalçın, and I. Yildirim, "Approximate spectral clustering with utilized similarity information using geodesic based hybrid distance measures," Pattern Recognition, vol. 48, no. 4, pp. 1461–1473, 2015.
[7] V. D. Blondel, J. Guillaume, R. Lambiotte et al., "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 30, no. 2, pp. 155–168, 2008.
[8] K. M. Tan, D. Witten, and A. Shojaie, "The cluster graphical lasso for improved estimation of Gaussian graphical models," Computational Statistics and Data Analysis, vol. 85, pp. 23–36, 2015.
[9] F. De Morsier, D. Tuia, M. Borgeaud, V. Gass, and J.-P. Thiran, "Cluster validity measure and merging system for hierarchical clustering considering outliers," Pattern Recognition, vol. 48, no. 4, pp. 1478–1489, 2015.
[10] A. Bouguettaya, Q. Yu, X. Liu, X. Zhou, and A. Song, "Efficient agglomerative hierarchical clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2785–2797, 2015.
[11] L. Subelj and M. Bajec, "Unfolding communities in large complex networks: combining defensive and offensive label propagation for core extraction," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 83, no. 3, pp. 885–896, 2011.
[12] S. Li, H. Lou, W. Jiang, and J. Tang, "Detecting community structure via synchronous label propagation," Neurocomputing, vol. 151, no. 3, pp. 1063–1075, 2015.
[13] Y. Yi, Y. Shi, H. Zhang, J. Wang, and J. Kong, "Label propagation based semi-supervised non-negative matrix factorization for feature extraction," Neurocomputing, vol. 149, pp. 1021–1037, 2015.
[14] D. Zikic, B. Glocker, and A. Criminisi, "Encoding atlases by randomized classification forests for efficient multi-atlas label propagation," Medical Image Analysis, vol. 18, no. 8, pp. 1262–1273, 2014.
[15] A. Clauset, "Finding local community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 72, no. 2, pp. 254–271, 2005.
[16] F. Luo, J. Z. Wang, and E. Promislow, "Exploring local community structures in large networks," Web Intelligence and Agent Systems, vol. 6, no. 4, pp. 387–400, 2008.
[17] Y. J. Wu, H. Huang, Z. F. Hao, and F. Chen, "Local community detection using link similarity," Journal of Computer Science and Technology, vol. 27, no. 6, pp. 1261–1268, 2012.
[18] Q. Chen, T.-T. Wu, and M. Fang, "Detecting local community structures in complex networks based on local degree central nodes," Physica A: Statistical Mechanics and Its Applications, vol. 392, no. 3, pp. 529–537, 2013.
[19] http://www-personal.umich.edu/~mejn/netdata/.
[20] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[21] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.
[22] N. P. Nguyen, T. N. Dinh, S. Tokala, and M. T. Thai, "Overlapping communities in dynamic networks: their detection and mobile applications," in Proceedings of the 17th Annual International Conference on Mobile Computing and Networking (MobiCom '11), pp. 85–95, Las Vegas, Nev, USA, September 2011.
[23] S. Fortunato and C. Castellano, "Community structure in graphs," in Computational Complexity, pp. 490–512, Springer, 2012.
[24] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E, vol. 78, no. 4, Article ID 046110, pp. 561–570, 2008.
