Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 12

Data Field

A field exists commonly in physical space. The concept of a data field, which
simulates the methodology of a physical field, is introduced in this chapter for
depicting the interaction between the objects associated with each data point of
the whole space. The field function mathematically models how the data contribu-
tions for a given task are diffused from the universe of a sample to the universe of
a population when interacting between objects. In the universe of discourse, all
objects with sampled data not only radiate their data contributions but also receive
data contributions from other objects. Depending on the given task and the physi-
cal nature of the objects in the data distribution, the field function may be derived
from the physical field. All of the equipotential lines depict the interesting topo-
logical relationships among the objects, which visually indicate the interacted
characteristics of objects. Thus, a data field bridges the gap between a mining
model and a data model.

6.1 From a Physical Field to a Data Field

Anaxagoras, a Greek philosopher, thought physical nature was “a portion of


everything in everything.” Visible or invisible particles mutually interact; some-
times, the interaction is untouched between physical particles. Michael Faraday
first put forward the field concept to describe this interaction. He believed that
the untouched interactions between physical particles had to be performed via a
transforming media that he suggested was the field. To visually portray an elec-
tronic field configuration, he further introduced the concept of an electronic field
line—that is, drawing a vector of attribute length and direction at every point of
space. Each field line has a starting point and an ending point, which originates
at the source and terminates at the sink. In a charge-free region, every field line

© Springer-Verlag Berlin Heidelberg 2015


175
D. Li et al., Spatial Data Mining, DOI 10.1007/978-3-662-48538-5_6
176 6 Data Field

is continuous that originates on a positive charge and terminates on a negative


charge, and they do not intersect. In space, a particle makes a field, and a field acts
on another particle (Giachetta et al. 2009). Because the interaction of particles in
physics can be described with the help of a field of force, we wonder whether it is
possible to formalize the self-cognition of humans in terms of the idea of field. If
so, we perhaps can establish a virtual cognitive field to model the interactions of
data objects.

6.1.1 Field in Physical Space

In physical space, interacted particles lead to various fields. Due to the interacting
particles, the field can be either a classical field or a quantum field. Depending on
the interacting range, the field may be long-range or short-range. As an interact-
ing range increases, the strength of the long-range field slowly diminishes to the
point of being undetectable (e.g., gravitational fields and electromagnetic fields),
while the strength of the short-range field rapidly attenuates to zero (e.g., nuclear
field). Based on the interacting transformation, the field is often classified as a sca-
lar field, vector field, tensor field, or spinor field. At each point, a scalar field’s
values are given by a single variable on the quantitative difference, a vector field
is specified by attaching a vector with magnitude and direction, a tensor field is
specified by a tensor, and a spinor field is for quantum field theory. A vector field
may be with or without sources. A vector field with constant zero divergence is a
field without sources (e.g., a magnetic field); otherwise, it is a field with sources
(e.g., an electrostatic field). In basic physics, the vector field is often studied as a
field with sources, such as Newton’s law of universal gravitation and Coulomb’s
law. When it is put in the field with sources, an arbitrary particle feels a force with
strength (Giachetta et al. 2009). Moreover, according to the values of a field vari-
able, whether or not it changes over time at each point, the field with sources can
be further grouped into time-varying fields and stable active fields. The stable
field with sources is also known as a potential field.
Field strength is often represented by using the potential. In a field, the poten-
tial difference between an arbitrary point and another referenced point is a deter-
minate value on a unit particle. The physical potential is the work performed by a
field force when a unit particle is moved from an arbitrary point to another refer-
enced point in the field. It is a scalar variable of the amount of energy transferred
by the force acting at a distance. The energy is the capacity to do the work, and
the potential energy refers to the ability of a system to do work by virtue of its
position or internal structure. The distribution of the potential field corresponds to
the distribution of the potential energy determined by the relative position between
interacted particles. Specifying the potential value at a point in space requires
parameters such as object mass or electrical charge, and distance between a point
6.1 From a Physical Field to a Data Field 177

and its field source, while it is independent of the direction of the distance vec-
tor. The strength of the field is visualized via the equipotential, which refers to a
region in space where every point in it is at the same potential.
In physics, it is a fact that a vacuum is free of matter but not free of field.
Modern physics even believes that field is one of the basic forms of particle exist-
ence, such as a mechanical field, nuclear field, gravitational field, thermal field,
electromagnetic field, and crystal field. As field theory developed, field was
abstracted as a mathematical concept to depict the distribution rule of a physical
variable or mathematical function in space (Landau and Lifshitz 2013). The poten-
tial function is often seen simply as a monotonic function of spatial location, hav-
ing nothing to do with the existence of the particle. That is, a field of the physical
variable or mathematical function in a space exists if every point in the space has
an exact value from the physical variable or mathematical function.

6.1.2 Field in Data Space

Data space is an exact branch of physical space. In a data space, all objects are
mutually interacting via data. From a generic physical space to an exact data
space, the physical nature of the relationship between the objects is identical to
that of the interaction between the particles. If the object described with data is
taken as a particle with mass, then a data field can be derived from a physical
field. Inspired by the field in physical space, the field was introduced to
illuminate the interaction between objects in data space. In the data space, all
objects are mutually interacting via data. Given a logic database with N records
and M attrib- utes, the process of knowledge discovery can begin with the
underlying distribu- tion of the N data points in the M-dimensional universal
space. If each data object is viewed as a point charge or mass point, it will exert a
force on any other objects in its vicinity, and the interactions of all the data objects
will form a field. Thus, an object described with data is treated as a particle with
mass. The data refer to the attributes of the object determined from observation,
while the mass is the physi- cal variable of matter determined from its weight or
from Newton’s second law of motion. Surrounding the object, there is a virtual
field simulating the physical field around particles. Each object has its own field.
Given a task, the field gives a vir- tual media to show the mutual interaction
between objects without touching each other. An object transforms its field to
other objects, and it also receives all their fields. Depending on their
contributions to a given task, the field strength of each object is different, which
then uncovers a different interaction. By using this inter- action, objects may be
self-organized under the given data and task. Some patterns that are previously
unknown, potentially useful, and ultimately understood may bes
further extracted from datasets.
178 6 Data Field

6.2 Fundamental Definitions of Data Fields

Definition 6.1 In data space Ω ⊆RP, let dataset D {x1, x2, …, xn} denote a P-
dimensional independent random sample, where xi = (xi1, xi2, …, xiP)T with i
1,=2, …, n. Each data object xi is taken as a particle with mass mi. xi radiates its
data energy and is also influenced by others simultaneously. Surrounding xi, a
virtual field is derived from the corresponding physical field. Thus, the virtual
field is called a data field.

6.2.1 Necessary Conditions

In the context of the definition, a data field exists only if the following neces-
sary conditions are characterized (i.e., short-range with source and temporal
behavior):
1. The data field is a short-range field. For an arbitrary point in the data space
Ω, all the fields from different objects are overlapped to achieve a superposed
field. If there are two points—x1 with enough data samples and x2 with less or
no data samples—the summarized strength of the data field at point x1 must be
stronger than that of x2. That is, the range of the data field is so short that its
magnitude decreases rapidly to 0 as the distance increases. The rapid attenua-
tion may reduce the effect of noisy data or outlier data. The effect of the super-
posed field may further highlight the clustering characteristics of the objects in
close proximity within data-intensive areas. The potential value of an arbitrary
point in space does not rely on the direction from the object, so the data field
is isotropic and spherical in symmetry. The interaction between two faraway
objects can be ignored.
2. The data field is a field with sources. The divergence is to measure the magni-
tude of a vector field at a given point in terms of a signed scalar. If it is
positive, it is called a source. If it is negative, it is called a sink. If it is zero in a
domain, there is no source or sink, and it is called solenoidal (Landau and
Lifshitz 2013). The data field is smooth everywhere away from the original
source. The outward flux of a vector field through a closed surface is equal to
the volume integral of the divergence on the region inside the surface.
3. The data field has temporal behavior. The data on objects are static independ-
ent of time. For a stable field with sources independent of time in space, there
exists a scalar potential function ϕ(x) corresponding to the vector field function
F(x) that describes the intensity function. Both functions can be interconnected
with a differential operator (Giachetta et al. 2009), F(x) = ∇ϕ(x).
6.2 Fundamental Definitions of Data Fields 179

6.2.2 Mathematical Model

The distribution law of a data field can be mathematically described with a poten-
tial function. In a data space Ω ⊆ RP, a data object xi brings about a virtual field
with the mass mi. If an arbitrary point x= (x1, x2, …, xP)T exists in Ω, the scalar
potential function of the data field on xi is defined as Eq. (6.1):
ǁx−xiǁ
ϕ(xi) = mi × K
σ (6.1)

where mi is the mass of xi. K(x) is the unit potential function. ||x− xi|| is the dis-
tance between xi and x in the field. σ is an impact factor.
Each data object xi in D has its own data field in Ω. All of the data fields are
superposed on point x. In other words, any data object is affected by all the other
objects in the data space. Therefore, the potential value at x in the data fields on D
in Ω is defined as follows.
Definition 6.2 Given a data field created by a set of data objects D = (x1, x2, …, xn)
in space Ω ⊆ RP, the potential at any point x ∈ Ω can be calculated as

ϕ(x) Σ
ϕ = ǁx − xiǁ
m ×K
(x) = Σ ϕ (x) =
D i (6.2)
i σ
i=1 i=1

For the fact that the vector intensity always can be constructed out of scalar poten-
tial by a gradient operator, the vector intensity F(x) at x can be given as
Σ ǁx − xiǁ
2
F(x) = ∇ϕ(x) = (x − x) · m · K
i i
σ2 σ
i=1

Because the term σ2 is a constant, the above formula can be simply written as
2
Σ ǁx − xiǁ
F(x) = (x − x) · m · K
i i (6.3)
σ
i=1
6.2.3 Mass

Mass mi is the mass of object xi (i = 1, 2, …, n). It represents the strength of


Σ
the data field from xi; and it meets mi ≥ 0 and normalization condition mi = 1.
i=1
If each data object is supposed to be equal in mass, indicating the same influence
over space, then a simplified potentialΣ
function can be derived from Eq. (6.4):
1 ǁx − xiǁ
ϕ(x) = K
n σ (6.4)
i=1
180 6 Data Field

6.2.4 Unit Potential Function

The unit potential function K(x) ( K(x)dx = 1, xK(x)dx = 0) expresses the law
that xi always radiates its data energy in the same way in its data field. In fact,
the potential function reflects the density of the data distribution. According to
the properties of probability density function, it can be proven that the difference
between the potential function and the probability density function is only a nor-
malization constant if K(x) has finite integral in Ω; that is, K(x)dx = M < +∞
(Giachetta et al. 2009; Wang et al. 2011). �

K(x) is a basis potential. As the data field must be short-range with a source and
temporal behavior, the selection of K(x) may need to consider such conditions as
the nature inside the data characteristics, the feature of data radiation, the standard
to determine the field-strength, the application domain of the data field, etc.
(Li et al. 2013). Because the scalar operation is simpler and more intuitive than the
vector operation, the following criteria are recommended for selecting the poten-
tial function of the data field so that the potential ϕ(x) of an arbitrary data object xi
at an arbitrary position x satisfies three conditions:
1. K(x) is a continuous, smooth, and finite function that is defined in Ω,
2. K(x) is isotropic, and
3. K(x) is a monotonically decreasing function on the distance between xi and x.
K(x) is maximum when the distance is 0, while K(x) approaches 0 when the
distance tends to infinity.
Generally speaking, a function that matches the above three criteria can define the
potential function of a data field.
In this book, nuclear-like potential function is taken as K(x) in the form of
Gaussian potential. It is known that a nucleon generates a centralized force field in
an atomic nucleus. In space, the nuclear force acts radially toward or away from
the source of the force field, and the potential of the nuclear field decreases
rapidly to 0. Thus, a potential function of a data field is studied to simulate a
nuclear field; that is, a nuclear-like field. There are three methods to compute the
potential value of a point in a nuclear field: Gaussian potential, square-well
potential, and exponential poten- tial (Li et al. 2013). Because Gaussian
distribution is ubiquitous and Gaussian func- tion matches the physical nature of a
data field, adopting the Gaussian function—that is, a nuclear-like potential function
—is preferred for modeling the distribution of data fields. In terms ∈of nuclear-like
potentials, at the arbitrary point x Ω, the poten- tial of the data field from data
object xi is shown in Eq. (6.5); the potential superposi- tion of n basis potentials
from all the data objects in dataset D is shown in Eq. (6.6).
2
ǁx−xi ǁ

ϕ(xi) = mi × e (6.5)

Σn ǁx−xi ǁ 2

ϕ(x) = mi × e σ

i=1
(6.6)
6.2 Fundamental Definitions of Data Fields 181

6.2.5 Impact Factor

Once the form of potential function K(x) is fixed, the distribution of the associated
data field is primarily determined by the impact factor σ under the given data set
D in space Ω.
The impact factor σ ∈ (0, +∞) controls the interacted distance between data
objects. Its value has a great impact on the distribution of the data field. The
effective- ness of the impact factor is from various sources, such as radiation
brightness, radia- tion gene, data amount, distance between the neighbor
equipotential lines, and grid density of Descartes coordinate. They all make their
contributions to the data field. As a result, the distribution of a potential value is
determined by the impact factor, and the different potential function has a much
smaller influence on the estimation.
In nature, the optimal choice of σ is a minimization problem of a univariate,
nonlinear function, which can be solved by standard optimization algorithms, such
as a simple one-dimensional searching method, stochastic searching method, or
simulated annealing method. Li and Du (2007) introduced the optimization algo-
rithm of impact factor σ.
Take an example in two-dimensional space. Figure 6.1 shows the equipotential
distributions of 5 data field from the same five data objects with a variance of σ.
When σ is very small (Fig. 6.1a), the interaction distance between two data objects
is very short. ϕ(x) is equivalent to the superposition of n peak functions, each
center of which is the data object. The potential value around every data object is
very small. The extreme case is that there is no interaction between two objects,
and the potential value is 1/n at the position of data object. Otherwise, when
σ is very large (Fig. 6.1c), the data objects are strongly interacting. ϕ(x) is equiva-
lent to the superposition of n basic functions, which change slowly when the width
is large. The potential value around every data object is very large. The extreme
case is that the potential value is approximately 1 at the position of the data object.
When the integral of unit potential function is finite, there is a normalized con-
stant between the potential function and the probability intensity function at most
(Giachetta et al. 2009). As can be seen from the abovementioned two extreme

Fig. 6.1 Equipotential distributions of data field on 5 objects with different σ a σ = 0.05
b σ = 0.2 c σ = 0.8
182 6 Data Field

conditions, the potential distribution of a data field may not uncover the required
rule on inherent data characteristics. Thus, the selection of impact fact σ does por-
tray the inherent distribution of data objects (Fig. 6.1b).

6.3 Depiction of Data Field

In physics, a field can be depicted with the help of field force lines for a vector
field and isolines (or iso-surfaces) for a scalar field. The field lines, equipotential
lines (or surfaces), are also adopted to represent the overall distribution of the data
fields.

6.3.1 Field Lines

A field line is a line with an arrowhead and length. The line arrowhead indicates
the direction, and the line length indicates the field strength. It is a locus that is
defined by a vector field and a starting location within the field. A series of field
lines visually portray the field of a data object. The density of the field lines fur-
ther shows the field strength. All the data field lines centralize the data object and
the field of the source is the strongest. When approaching the source, the length of
the field lines becomes longer and the density thicker. Otherwise, given the dis-
tance ||x − xi|| leaving the field source, the larger the distance grows, the weaker
the field becomes. The same is true with the field line—that is, leaving the data
object, the length of the field lines becomes shorter and the density scarcer. Their
distribution shows that the data field is a distribution of spherical symmetry.
Figure 6.2 is a plot of force lines for a data force field produced by a single object,
which indicates that the radial field lines always are in the direction of the spheri-
cal coordinate towards the source. The force lines are uniformly distributed and
spherically symmetric, decreasing as the radial distance ||x − xi|| further increases,
which indicates a strong attractive force on the sphere centered on the object.

6.3.2 Equipotential Line (Surface)

Equipotential lines (surfaces) come into being if the points with the same potential
value are lined up together.
Given potential ψ, an equipotential can be extracted as one where all the points
on it satisfy the implicit equation ϕ(x) = ψ. By specifying a set of discrete poten-
tials {ψ1, ψ2, . . . , ψn}, a series of nested equipotential lines or surfaces can
be
drawn for better understanding of the associated potential field as a whole form.
Let ψ1, ψ2, . . . , ψn respectively be the potentials at the positions of the
objects 1, 2, , n. The potential entropy can be defined as (Li and Du 2007)
6.3 Depiction of Data Field 183

Fig. 6.2 The plot of force lines for a data force field created by a single object (m = 1, σ = 1)
a 2-dimensional plot, b 3-dimensional plot

Σ ψi ψi
H =− log (6.7)
i=1
Z

Z
Σ
where Z = ψi is a normalization factor. For any σ∈ [0,+ ∞], potential entropy
i=1
H satisfies 0 ≤ H ≤ log (n), and H =log(n) if and only if ψ1 = ψ2 = · · · = ψn.
The equipotential lines are mathematical entities that describe the spherical cir-
cles at a distance ||x −xi|| on which the field has a constant value, such as con-
tour lines on map tracing lines of equal altitude. They always cross the field lines
at right angles to each other, in no specific direction at all. The potential differ-
ence compares the potential from one equipotential line to the next. In the context
of equal data masses, the area distributed with intensive data will have a greater
potential value when the potential functions are superposed. At the position with
the largest potential value, the data objects are the most intensive. Figure 6.3a
is a map of the equipotential lines for a potential field produced by a single data
object in a two-dimensional space, which consists of a family of nested, concen-
tric circles or spheres centered on the data object. As can be seen from Fig. 6.3a,
the equipotentials closer to the object have higher potentials and are farther apart,
which indicates strong field strength near and around the object.
In a multi-dimensional space, the equipotential lines are extended to the
equipo- tential surfaces because the field becomes a set of nested spheres with the
center of the coordinate of the data object, and the radius
− is ||x xi||. Figure 6.3b is
another map of the equipotential surfaces for another 3D space. Obviously, the
equipoten- tial surfaces corresponding to different potentials are quite different in
topology.
184 6 Data Field

Fig. 6.3 The equipotential map for a potential field produced by a single data object (σ = 1)
a equipotential lines, b equipotential surfaces

6.3.3 Topological Cluster

Every topology of equipotentials containing data objects actually corresponds to


a cluster with relatively dense data distribution. The nested structure composed of
different equipotentials may be treated as the clustering partitions at different den-
sity levels.
Figure 6.4 shows the distribution of field lines and equipotential lines on data
objects in two-dimensional space. As can be seen from Fig. 6.4, the equipotential
lines are always perpendicular to the field lines associated with the data field, and
all the data objects in the field are centralized, surrounding five positions (i.e., A, B,
C, D, and E) under the forces from the data objects. The field lines always point to
the five positions with local maximum potential in the data field (Fig. 6.4a), and
the self-organized process discovers the hierarchical clustering-characteristics of
data objects with the field level by level (Fig. 6.4b). Under a set of suitable
potential val- ues, the corresponding set of equipotential lines show the clustering
feature of data objects by considering different data-intensive areas as the centers.
When the poten- tial is 0.0678 at level 1, all the data objects are self-grouped into
the five clusters A, B, C, D, and E. As the potential decreases, the clusters that are
in close proxim- ity are merged. Cluster A and cluster B form the same lineage AB
at level 2. Cluster D and cluster E are amalgamated as cluster DE at level 3. When
the potential is 0.0085 at level 5, all the clusters join the generalized cluster
ABCDE (Fig. 6.4a). During the merging process from the five clusters A, B, C, D,
and E at the bottom level 1 to the generalized cluster ABCDE at the top level 5, a
clustering spectrum automatically comes into being (Fig. 6.4b). In other words, the
center in the data- intensive area has the potential maximum. From the local
maximums to the final global maximum, five positions are further self-organized
cluster by cluster.
References 185

Fig. 6.4 Self-organized characteristics of 390 data objects in the field


= (σ 0.091) a self- organized level by level, b
characterized level

References

Giachetta G, Mangiarotti L, Sardanashvily G (2009) Advanced classical field theory. World Scientific, Singapore
Landau LD, Lifshitz EM (2013) Statistical physics, 3rd edn, part 1: volume 5 (course of theoretical physics, volume 5). Pergamon
Press, Oxford
Li DY, Du Y (2007) Artificial intelligence with uncertainty. Chapman & Hall/CRC, London
Li DR, Wang SL, Li DY (2013) Spatial data mining theories and applications, 2nd edn. Science Press, Beijing, China
Wang SL, Gan W, Li DY, Li DR (2011) Data field for hierarchical clustering. Int J Data Warehouse Min 7(3):235–
246

You might also like