Professional Documents
Culture Documents
Kolaczyk E. Tutorial. Statistical Analysis of Network Data 2024
Kolaczyk E. Tutorial. Statistical Analysis of Network Data 2024
Kolaczyk E. Tutorial. Statistical Analysis of Network Data 2024
Eric D. Kolaczyk
kolaczyk@bu.edu
Why Networks?
dCLK Cyc
Pdp Tim
Per Vri
Often is not even clear what is meant when an author refers to ‘the’
network!
1
I’ll use the slightly redundant term ‘network graph’.
SAMSI Program on Complex Networks: Opening Workshop
Introduction
Our Focus . . .
Challenges:
relational aspect to the data;
complex statistical dependencies (often the focus!);
high-dimensional and often massive in quantity.
Examples of Networks
To set some context, let’s look quickly at examples from four general
areas:
Technological
Biological
Social
Informational
Technological Nets
Biological Nets
Questions
Are certain patterns of
interaction among genes
more common than
expected?
Which regions of the
brain ‘communicate’
during a given task?
Social Nets
a17
Questions include
a12
a7
a11
a13
Who is friends with
a6
a5 a8
a4
whom?
a1
a18
a22
a2
a20
a9 Which ‘actors’ are
a10
a14
a3
a31
a16
the ‘power
brokers’ ?
a32
a33 a19
a34
a29
a28 a15
a21
a25
a24 a23
a30
a27
are present?
Information Nets
Questions
What does this
network look like?
How does
information ‘flow’
on this network?
But from a statistical perspective, there are certain canonical tasks and
problems faced in the questions addressed across the different areas of
specialty.
Better vertical depth can achieved in this area by viewing problems – and
pursuing solutions – from this perspective.
A more leisurely trip through this material will be taken during the first
three weeks of the accompanying SAMSI course2 .
2
See also the book,
Kolaczyk, E.D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer, New York.
which the SAMSI course will follow for the first few weeks.
SAMSI Program on Complex Networks: Opening Workshop
Network Mapping
Note: It’s sufficiently different from standard descriptive statistics that it’s
something unto itself.
Network Mapping
Tim Dbt
dCLK Cyc
Frequently ad hoc . . .
. . . sometimes formal (e.g., network inference).
Even with direct and error free observation of edges, decisions may be
made to thin edges, adjust topology to match additional variables, etc.
Stage 3: Visualization
network mapping
network characterization
network sampling
network inference
network processes
Characterization of Vertices/Edges
Examples include
Degree distribution
Vertex/edge centrality
Role/positional analysis
Centrality: Motivation
Centrality: An Illustration
density
clustering
connectivity
flow
partitioning
. . . and more . . .
Components
Tendrils
Strongly
In−Component Out−Component
Connected
Component
Tubes
network mapping
network characterization
network sampling
network inference
network processes
Examples include
Induced Subgraph Sampling
Incident Subgraph Sampling
Snowball Sampling
Link Tracing
s1
s2
t1
t2
Caveat emptor . . .
Lee et al (2006): Entries indicate direction of bias for vertex (red), edge
(green), and snowball (blue) sampling.
network mapping
network characterization
network sampling
network inference
network processes
Note: Casting the task this way also allows us to formalize the question of
validation.
Classification methods
SAMSI Program on Complex Networks: Opening Workshop
Network Characterization
Examples include
citation networks
movie networks
gene regulatory networks
neuro functional
connectivity networks
Genes
(Unknown) edges in G =
(V , E ) defined based on some
notion of sim(i, j).
This is arguably one the major and currently most active areas of
contribution to ‘network science’ from statistics3 .
3
Discussion topic: Is this good or bad? ^
¨
SAMSI Program on Complex Networks: Opening Workshop
Network Characterization
Graph assumed to be
a tree T = (VT , ET ).
Leaf vertices (and perhaps the
root) known to us.
TX
Two major classes of methods
IND
are those based on
hierarchical clustering
Berkeley I.S.T.
Portugal
I.T. M.S.U. Illinois U. Wisc . Rice Owlnet
likelihoods.
network mapping
network characterization
network sampling
network inference
network processes
Examples:
Behaviors and beliefs influenced by social interactions.
Functional role of proteins influenced by their sequence similarity.
Computer infections by viruses may be affected by ‘proximity’ to
infected computers.
23 20
32 18 Question: Is knowledge of
15 28
4
22
35 3 that lawyer’s collaborators and
34
12
26
16
19 14
25 their practice predictive of that
9
29
17
7 lawyer’s practice?
2
27
21 1 11
In fact . . . yes!
Let
2
0
(
0.0 0.2 0.4 0.6 0.8 1.0
1, if corporate
Fraction of Corporate Neighbors, Among Litigation Xi =
0, if litigation
0 1 2 3 4
Frequency
Compare
P
0.0 0.2 0.4 0.6 0.8 1.0
j∈Ni xj
Fraction of Corporate Neighbors, Among Corporate
|N i |
to a threshold.
SAMSI Program on Complex Networks: Opening Workshop
Network Characterization
Two classes of processes that have received a great deal of attention are
epidemics/diffusions on networks, and
flows
Network Flows
Examples include
transportation networks
i.e., flows of commodities and/or people
communication networks
i.e., flows of data
link volumes
Flow Volume (Kilobytes per second)
Estimation)
0
Time
SAMSI Program on Complex Networks: Opening Workshop
Wrap-Up
Wrapping Up . . .
This SAMSI program has in mind (at least) 4 sub-themes for focus:
Network sampling and inference
Dynamic networks
Kramer et al. (2010). Coalescence and fragmentation of cortical networks during focal seizures. J. of Neuroscience