Professional Documents
Culture Documents
Data Field: 6.1 From A Physical Field To A Data Field
Data Field: 6.1 From A Physical Field To A Data Field
A field exists commonly in physical space. The concept of a data field, which
simulates the methodology of a physical field, is introduced in this chapter for
depicting the interaction between the objects associated with each data point of
the whole space. The field function mathematically models how the data contribu-
tions for a given task are diffused from the universe of a sample to the universe of
a population when interacting between objects. In the universe of discourse, all
objects with sampled data not only radiate their data contributions but also receive
data contributions from other objects. Depending on the given task and the physi-
cal nature of the objects in the data distribution, the field function may be derived
from the physical field. All of the equipotential lines depict the interesting topo-
logical relationships among the objects, which visually indicate the interacted
characteristics of objects. Thus, a data field bridges the gap between a mining
model and a data model.
In physical space, interacted particles lead to various fields. Due to the interacting
particles, the field can be either a classical field or a quantum field. Depending on
the interacting range, the field may be long-range or short-range. As an interact-
ing range increases, the strength of the long-range field slowly diminishes to the
point of being undetectable (e.g., gravitational fields and electromagnetic fields),
while the strength of the short-range field rapidly attenuates to zero (e.g., nuclear
field). Based on the interacting transformation, the field is often classified as a sca-
lar field, vector field, tensor field, or spinor field. At each point, a scalar field’s
values are given by a single variable on the quantitative difference, a vector field
is specified by attaching a vector with magnitude and direction, a tensor field is
specified by a tensor, and a spinor field is for quantum field theory. A vector field
may be with or without sources. A vector field with constant zero divergence is a
field without sources (e.g., a magnetic field); otherwise, it is a field with sources
(e.g., an electrostatic field). In basic physics, the vector field is often studied as a
field with sources, such as Newton’s law of universal gravitation and Coulomb’s
law. When it is put in the field with sources, an arbitrary particle feels a force with
strength (Giachetta et al. 2009). Moreover, according to the values of a field vari-
able, whether or not it changes over time at each point, the field with sources can
be further grouped into time-varying fields and stable active fields. The stable
field with sources is also known as a potential field.
Field strength is often represented by using the potential. In a field, the poten-
tial difference between an arbitrary point and another referenced point is a deter-
minate value on a unit particle. The physical potential is the work performed by a
field force when a unit particle is moved from an arbitrary point to another refer-
enced point in the field. It is a scalar variable of the amount of energy transferred
by the force acting at a distance. The energy is the capacity to do the work, and
the potential energy refers to the ability of a system to do work by virtue of its
position or internal structure. The distribution of the potential field corresponds to
the distribution of the potential energy determined by the relative position between
interacted particles. Specifying the potential value at a point in space requires
parameters such as object mass or electrical charge, and distance between a point
6.1 From a Physical Field to a Data Field 177
and its field source, while it is independent of the direction of the distance vec-
tor. The strength of the field is visualized via the equipotential, which refers to a
region in space where every point in it is at the same potential.
In physics, it is a fact that a vacuum is free of matter but not free of field.
Modern physics even believes that field is one of the basic forms of particle exist-
ence, such as a mechanical field, nuclear field, gravitational field, thermal field,
electromagnetic field, and crystal field. As field theory developed, field was
abstracted as a mathematical concept to depict the distribution rule of a physical
variable or mathematical function in space (Landau and Lifshitz 2013). The poten-
tial function is often seen simply as a monotonic function of spatial location, hav-
ing nothing to do with the existence of the particle. That is, a field of the physical
variable or mathematical function in a space exists if every point in the space has
an exact value from the physical variable or mathematical function.
Data space is an exact branch of physical space. In a data space, all objects are
mutually interacting via data. From a generic physical space to an exact data
space, the physical nature of the relationship between the objects is identical to
that of the interaction between the particles. If the object described with data is
taken as a particle with mass, then a data field can be derived from a physical
field. Inspired by the field in physical space, the field was introduced to
illuminate the interaction between objects in data space. In the data space, all
objects are mutually interacting via data. Given a logic database with N records
and M attrib- utes, the process of knowledge discovery can begin with the
underlying distribu- tion of the N data points in the M-dimensional universal
space. If each data object is viewed as a point charge or mass point, it will exert a
force on any other objects in its vicinity, and the interactions of all the data objects
will form a field. Thus, an object described with data is treated as a particle with
mass. The data refer to the attributes of the object determined from observation,
while the mass is the physi- cal variable of matter determined from its weight or
from Newton’s second law of motion. Surrounding the object, there is a virtual
field simulating the physical field around particles. Each object has its own field.
Given a task, the field gives a vir- tual media to show the mutual interaction
between objects without touching each other. An object transforms its field to
other objects, and it also receives all their fields. Depending on their
contributions to a given task, the field strength of each object is different, which
then uncovers a different interaction. By using this inter- action, objects may be
self-organized under the given data and task. Some patterns that are previously
unknown, potentially useful, and ultimately understood may bes
further extracted from datasets.
178 6 Data Field
Definition 6.1 In data space Ω ⊆RP, let dataset D {x1, x2, …, xn} denote a P-
dimensional independent random sample, where xi = (xi1, xi2, …, xiP)T with i
1,=2, …, n. Each data object xi is taken as a particle with mass mi. xi radiates its
data energy and is also influenced by others simultaneously. Surrounding xi, a
virtual field is derived from the corresponding physical field. Thus, the virtual
field is called a data field.
In the context of the definition, a data field exists only if the following neces-
sary conditions are characterized (i.e., short-range with source and temporal
behavior):
1. The data field is a short-range field. For an arbitrary point in the data space
Ω, all the fields from different objects are overlapped to achieve a superposed
field. If there are two points—x1 with enough data samples and x2 with less or
no data samples—the summarized strength of the data field at point x1 must be
stronger than that of x2. That is, the range of the data field is so short that its
magnitude decreases rapidly to 0 as the distance increases. The rapid attenua-
tion may reduce the effect of noisy data or outlier data. The effect of the super-
posed field may further highlight the clustering characteristics of the objects in
close proximity within data-intensive areas. The potential value of an arbitrary
point in space does not rely on the direction from the object, so the data field
is isotropic and spherical in symmetry. The interaction between two faraway
objects can be ignored.
2. The data field is a field with sources. The divergence is to measure the magni-
tude of a vector field at a given point in terms of a signed scalar. If it is
positive, it is called a source. If it is negative, it is called a sink. If it is zero in a
domain, there is no source or sink, and it is called solenoidal (Landau and
Lifshitz 2013). The data field is smooth everywhere away from the original
source. The outward flux of a vector field through a closed surface is equal to
the volume integral of the divergence on the region inside the surface.
3. The data field has temporal behavior. The data on objects are static independ-
ent of time. For a stable field with sources independent of time in space, there
exists a scalar potential function ϕ(x) corresponding to the vector field function
F(x) that describes the intensity function. Both functions can be interconnected
with a differential operator (Giachetta et al. 2009), F(x) = ∇ϕ(x).
6.2 Fundamental Definitions of Data Fields 179
The distribution law of a data field can be mathematically described with a poten-
tial function. In a data space Ω ⊆ RP, a data object xi brings about a virtual field
with the mass mi. If an arbitrary point x= (x1, x2, …, xP)T exists in Ω, the scalar
potential function of the data field on xi is defined as Eq. (6.1):
ǁx−xiǁ
ϕ(xi) = mi × K
σ (6.1)
where mi is the mass of xi. K(x) is the unit potential function. ||x− xi|| is the dis-
tance between xi and x in the field. σ is an impact factor.
Each data object xi in D has its own data field in Ω. All of the data fields are
superposed on point x. In other words, any data object is affected by all the other
objects in the data space. Therefore, the potential value at x in the data fields on D
in Ω is defined as follows.
Definition 6.2 Given a data field created by a set of data objects D = (x1, x2, …, xn)
in space Ω ⊆ RP, the potential at any point x ∈ Ω can be calculated as
ϕ(x) Σ
ϕ = ǁx − xiǁ
m ×K
(x) = Σ ϕ (x) =
D i (6.2)
i σ
i=1 i=1
For the fact that the vector intensity always can be constructed out of scalar poten-
tial by a gradient operator, the vector intensity F(x) at x can be given as
Σ ǁx − xiǁ
2
F(x) = ∇ϕ(x) = (x − x) · m · K
i i
σ2 σ
i=1
Because the term σ2 is a constant, the above formula can be simply written as
2
Σ ǁx − xiǁ
F(x) = (x − x) · m · K
i i (6.3)
σ
i=1
6.2.3 Mass
The unit potential function K(x) ( K(x)dx = 1, xK(x)dx = 0) expresses the law
that xi always radiates its data energy in the same way in its data field. In fact,
the potential function reflects the density of the data distribution. According to
the properties of probability density function, it can be proven that the difference
between the potential function and the probability density function is only a nor-
malization constant if K(x) has finite integral in Ω; that is, K(x)dx = M < +∞
(Giachetta et al. 2009; Wang et al. 2011). �
K(x) is a basis potential. As the data field must be short-range with a source and
temporal behavior, the selection of K(x) may need to consider such conditions as
the nature inside the data characteristics, the feature of data radiation, the standard
to determine the field-strength, the application domain of the data field, etc.
(Li et al. 2013). Because the scalar operation is simpler and more intuitive than the
vector operation, the following criteria are recommended for selecting the poten-
tial function of the data field so that the potential ϕ(x) of an arbitrary data object xi
at an arbitrary position x satisfies three conditions:
1. K(x) is a continuous, smooth, and finite function that is defined in Ω,
2. K(x) is isotropic, and
3. K(x) is a monotonically decreasing function on the distance between xi and x.
K(x) is maximum when the distance is 0, while K(x) approaches 0 when the
distance tends to infinity.
Generally speaking, a function that matches the above three criteria can define the
potential function of a data field.
In this book, nuclear-like potential function is taken as K(x) in the form of
Gaussian potential. It is known that a nucleon generates a centralized force field in
an atomic nucleus. In space, the nuclear force acts radially toward or away from
the source of the force field, and the potential of the nuclear field decreases
rapidly to 0. Thus, a potential function of a data field is studied to simulate a
nuclear field; that is, a nuclear-like field. There are three methods to compute the
potential value of a point in a nuclear field: Gaussian potential, square-well
potential, and exponential poten- tial (Li et al. 2013). Because Gaussian
distribution is ubiquitous and Gaussian func- tion matches the physical nature of a
data field, adopting the Gaussian function—that is, a nuclear-like potential function
—is preferred for modeling the distribution of data fields. In terms ∈of nuclear-like
potentials, at the arbitrary point x Ω, the poten- tial of the data field from data
object xi is shown in Eq. (6.5); the potential superposi- tion of n basis potentials
from all the data objects in dataset D is shown in Eq. (6.6).
2
ǁx−xi ǁ
−
ϕ(xi) = mi × e (6.5)
Σn ǁx−xi ǁ 2
−
ϕ(x) = mi × e σ
i=1
(6.6)
6.2 Fundamental Definitions of Data Fields 181
Once the form of potential function K(x) is fixed, the distribution of the associated
data field is primarily determined by the impact factor σ under the given data set
D in space Ω.
The impact factor σ ∈ (0, +∞) controls the interacted distance between data
objects. Its value has a great impact on the distribution of the data field. The
effective- ness of the impact factor is from various sources, such as radiation
brightness, radia- tion gene, data amount, distance between the neighbor
equipotential lines, and grid density of Descartes coordinate. They all make their
contributions to the data field. As a result, the distribution of a potential value is
determined by the impact factor, and the different potential function has a much
smaller influence on the estimation.
In nature, the optimal choice of σ is a minimization problem of a univariate,
nonlinear function, which can be solved by standard optimization algorithms, such
as a simple one-dimensional searching method, stochastic searching method, or
simulated annealing method. Li and Du (2007) introduced the optimization algo-
rithm of impact factor σ.
Take an example in two-dimensional space. Figure 6.1 shows the equipotential
distributions of 5 data field from the same five data objects with a variance of σ.
When σ is very small (Fig. 6.1a), the interaction distance between two data objects
is very short. ϕ(x) is equivalent to the superposition of n peak functions, each
center of which is the data object. The potential value around every data object is
very small. The extreme case is that there is no interaction between two objects,
and the potential value is 1/n at the position of data object. Otherwise, when
σ is very large (Fig. 6.1c), the data objects are strongly interacting. ϕ(x) is equiva-
lent to the superposition of n basic functions, which change slowly when the width
is large. The potential value around every data object is very large. The extreme
case is that the potential value is approximately 1 at the position of the data object.
When the integral of unit potential function is finite, there is a normalized con-
stant between the potential function and the probability intensity function at most
(Giachetta et al. 2009). As can be seen from the abovementioned two extreme
Fig. 6.1 Equipotential distributions of data field on 5 objects with different σ a σ = 0.05
b σ = 0.2 c σ = 0.8
182 6 Data Field
conditions, the potential distribution of a data field may not uncover the required
rule on inherent data characteristics. Thus, the selection of impact fact σ does por-
tray the inherent distribution of data objects (Fig. 6.1b).
In physics, a field can be depicted with the help of field force lines for a vector
field and isolines (or iso-surfaces) for a scalar field. The field lines, equipotential
lines (or surfaces), are also adopted to represent the overall distribution of the data
fields.
A field line is a line with an arrowhead and length. The line arrowhead indicates
the direction, and the line length indicates the field strength. It is a locus that is
defined by a vector field and a starting location within the field. A series of field
lines visually portray the field of a data object. The density of the field lines fur-
ther shows the field strength. All the data field lines centralize the data object and
the field of the source is the strongest. When approaching the source, the length of
the field lines becomes longer and the density thicker. Otherwise, given the dis-
tance ||x − xi|| leaving the field source, the larger the distance grows, the weaker
the field becomes. The same is true with the field line—that is, leaving the data
object, the length of the field lines becomes shorter and the density scarcer. Their
distribution shows that the data field is a distribution of spherical symmetry.
Figure 6.2 is a plot of force lines for a data force field produced by a single object,
which indicates that the radial field lines always are in the direction of the spheri-
cal coordinate towards the source. The force lines are uniformly distributed and
spherically symmetric, decreasing as the radial distance ||x − xi|| further increases,
which indicates a strong attractive force on the sphere centered on the object.
Equipotential lines (surfaces) come into being if the points with the same potential
value are lined up together.
Given potential ψ, an equipotential can be extracted as one where all the points
on it satisfy the implicit equation ϕ(x) = ψ. By specifying a set of discrete poten-
tials {ψ1, ψ2, . . . , ψn}, a series of nested equipotential lines or surfaces can
be
drawn for better understanding of the associated potential field as a whole form.
Let ψ1, ψ2, . . . , ψn respectively be the potentials at the positions of the
objects 1, 2, , n. The potential entropy can be defined as (Li and Du 2007)
6.3 Depiction of Data Field 183
Fig. 6.2 The plot of force lines for a data force field created by a single object (m = 1, σ = 1)
a 2-dimensional plot, b 3-dimensional plot
Σ ψi ψi
H =− log (6.7)
i=1
Z
Z
Σ
where Z = ψi is a normalization factor. For any σ∈ [0,+ ∞], potential entropy
i=1
H satisfies 0 ≤ H ≤ log (n), and H =log(n) if and only if ψ1 = ψ2 = · · · = ψn.
The equipotential lines are mathematical entities that describe the spherical cir-
cles at a distance ||x −xi|| on which the field has a constant value, such as con-
tour lines on map tracing lines of equal altitude. They always cross the field lines
at right angles to each other, in no specific direction at all. The potential differ-
ence compares the potential from one equipotential line to the next. In the context
of equal data masses, the area distributed with intensive data will have a greater
potential value when the potential functions are superposed. At the position with
the largest potential value, the data objects are the most intensive. Figure 6.3a
is a map of the equipotential lines for a potential field produced by a single data
object in a two-dimensional space, which consists of a family of nested, concen-
tric circles or spheres centered on the data object. As can be seen from Fig. 6.3a,
the equipotentials closer to the object have higher potentials and are farther apart,
which indicates strong field strength near and around the object.
In a multi-dimensional space, the equipotential lines are extended to the
equipo- tential surfaces because the field becomes a set of nested spheres with the
center of the coordinate of the data object, and the radius
− is ||x xi||. Figure 6.3b is
another map of the equipotential surfaces for another 3D space. Obviously, the
equipoten- tial surfaces corresponding to different potentials are quite different in
topology.
184 6 Data Field
Fig. 6.3 The equipotential map for a potential field produced by a single data object (σ = 1)
a equipotential lines, b equipotential surfaces
References
Giachetta G, Mangiarotti L, Sardanashvily G (2009) Advanced classical field theory. World Scientific, Singapore
Landau LD, Lifshitz EM (2013) Statistical physics, 3rd edn, part 1: volume 5 (course of theoretical physics, volume 5). Pergamon
Press, Oxford
Li DY, Du Y (2007) Artificial intelligence with uncertainty. Chapman & Hall/CRC, London
Li DR, Wang SL, Li DY (2013) Spatial data mining theories and applications, 2nd edn. Science Press, Beijing, China
Wang SL, Gan W, Li DY, Li DR (2011) Data field for hierarchical clustering. Int J Data Warehouse Min 7(3):235–
246