Data Assimilation Vs Data Mining

Data Mining vs. Data Assimilation
S. Lakshmivarahan
School of Computer Science
University of Oklahoma
Norman, Oklahoma

Data Mining (DM) - early beginnings

Much of what we know in the physical sciences has its origins
in Astronomy - with observations of celestial objects
Thanks to the Herculean efforts of:
Copernicus (1473-1543)
Galileo (1564-1642)
Kepler (1571-1630)
Newton (1643-1727)

This is only a small sampling from a long list of pioneers

Discovery of simple laws from observations

Observations collected over decades were meticulously
analyzed by hand to formulate new laws of nature
Heliocentric system
The three laws of Kepler
Law of gravitation by Newton
Newton's three laws of motion

Within the context of physical sciences these are some of the
earliest examples of data mining
Note: Chemical, Biological and other Sciences are replete with
similar historical instances that illustrate the use of data
mining in each of these disciplines

What is Data Mining

DM is the process of extracting the structure or patterns that are
inherent in the data/observations
These patterns provide clues about the data generating process
The ultimate goal of DM is to understand and quantify the data
generating process
Since the motion of celestial objects inherently followed
certain laws, early pioneers with their hard work and ingenuity
could discover the laws that laid the foundation of the
physical sciences and engineering as we know them today

Abundance of data - revival of DM

Volume of data collected doubles every three years - thanks
to technology
Large scale storage device technology
Communication and sensor technologies

Today, interest in DM spans:

Physical sciences, Biological sciences, Medical Sciences
Space exploration, All branches of Engineering,
Environmental Sciences, Ecology
Economics, Social Sciences, Finance, Banking and Commerce,
Sports and recreation
Governments, private companies

More about DM a bit later. Back to early Astronomy

Development of Calculus and discovery of dynamic models

Introduction of mathematical models by combining concurrent
developments in
Physical laws - Newton's laws
Calculus by Newton (1643-1727) and Leibniz (1646-1716)
among others

This naturally led to the development of dynamic models to
describe the motion of planets around the sun
With the availability of models, the potential for forecast or
prediction became very clear

Discovery of Least squares - Beginnings of Data Assimilation (DA)

Gauss (1777-1855), when he was only 24 years old, used the
known models of his time to undertake the challenging problem
of predicting when the celestial object called Ceres would
reappear in the telescope
The model had unknown parameters that needed to be estimated
By combining the model with observations in the least squares
sense, Gauss estimated the unknown parameters and created the
first assimilated model
He then used this assimilated model to accurately predict
the time and location of reappearance of the lost astronomical
object

Gauss laid the foundation for DA

This work led to the development of the method of least
squares as we know it today
The method of least squares still continues to dominate the theory
and practice of estimation of unknown parameters
By this time Gauss had also introduced the statistical
analysis of observational errors, whose distribution follows
the bell shaped curve that we now call the
normal or Gaussian distribution

What is Data Assimilation?

Fusion of model with data

Models are general descriptions of the underlying physical
processes in question
A model represents a class - suitably parametrized
Static regression models have unknown coefficients
Dynamic models have unknown initial/boundary conditions +
physical parameters such as Reynolds number, coefficient of
thermal expansion of water, specific heat of water, etc.

Data/observations reveal the secrets of, or the truth about,
the process that the model tries to capture

DA - fusion of models with data

By combining models and data - estimating the unknown
parameters of the models using the data - we can get a
specialized instantiation of the model called the assimilated
model
This assimilated model is a good tool for creating forecast or
prediction
One of the standard tools for the fusion of model and data is
based on the method of least squares
The discipline of DA primarily deals with development of
methods for assimilating models with data
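
A minimal sketch of the model + data fusion described above, under stated assumptions: the model class x(t) = x0 * exp(-k t), the synthetic observations, and the names `assimilate` and `forecast` are all hypothetical and chosen only for illustration. The unknown initial condition x0 and physical parameter k are estimated by least squares on the logarithm of the observations, and the resulting assimilated model is then used for a forecast.

```python
import numpy as np

def forecast(x0, k, t):
    """Direct problem: evaluate the model x(t) = x0 * exp(-k t)."""
    return x0 * np.exp(-k * t)

def assimilate(t_obs, y_obs):
    """Estimate (x0, k) by linear least squares on log(y) = log(x0) - k t."""
    H = np.column_stack([np.ones_like(t_obs), -t_obs])
    theta, *_ = np.linalg.lstsq(H, np.log(y_obs), rcond=None)
    return np.exp(theta[0]), theta[1]

# Synthetic observations (assumed values, not from the slides).
rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 4.0, 20)
y_obs = forecast(5.0, 0.7, t_obs) * np.exp(0.01 * rng.standard_normal(t_obs.size))

x0_hat, k_hat = assimilate(t_obs, y_obs)       # assimilated model parameters
prediction = forecast(x0_hat, k_hat, 6.0)      # forecast beyond the data window
print(x0_hat, k_hat, prediction)
```

The log transform keeps the fit linear in the unknowns. A model that is not linearizable would need an iterative minimizer instead.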

Goal of DA - generate good forecast/prediction

Predict the path of a hurricane or tornado - using one of several
models + data collected using satellites, radars, special
planes that fly into the hurricanes twice a day
From the crime scene data, reconstruct the case - CSI, Miami
NTSB estimate the causes of failure using the data from the
Predict the potential tax revenues so that a Government can
develop its budget for the next year
Medical diagnosis - from symptoms to the cure

Direct vs. Inverse problems - A classification

To further explore the relation between DM and DA, we introduce a useful classification
Scientific and Engineering problems can be classified into one
of two types
Direct problems - Examples
Given a polynomial p(x), evaluate it at x = 1.0
Given a differential equation and the initial condition, find the
solution
Given a matrix A and a vector x, compute the vector b = Ax

Inverse problems - Examples

Given a polynomial p(x), solve for the roots of p(x) = 0
Given a differential equation and a particular solution, find the
initial condition that corresponds to the solution
Given a matrix A and a vector b, find the solution x such that
Ax = b
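
The polynomial and matrix examples above can be sketched in a few lines of NumPy. The particular polynomial p(x) = x^2 - 3x + 2 and the 2 x 2 matrix are assumed values picked for illustration.

```python
import numpy as np

p = np.array([1.0, -3.0, 2.0])            # p(x) = x^2 - 3x + 2, highest degree first
A = np.array([[2.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])

# Direct problems: evaluate.
p_at_1 = np.polyval(p, 1.0)               # value of p at x = 1.0
b = A @ x                                 # b = Ax

# Inverse problems: recover the cause from the effect.
roots = np.sort(np.roots(p))              # solve p(x) = 0
x_recovered = np.linalg.solve(A, b)       # solve Ax = b for x

print(p_at_1, b, roots, x_recovered)
```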

It turns out that DM and DA naturally correspond to two
types of inverse problems and prediction is a direct problem

First level of inverse problems - The Core of Data Mining

At the highest level, Data Mining relates to solving the
important class of inverse problems leading to the discovery
of basic laws/models that are implied by the data
Examples of discovery of laws/models from data include:
Basic laws in early Astronomy - Kepler, Newton
Atom models in early 1900s
Higgs Boson, the so-called God particle, in 2012
Theory of evolution by C. Darwin
Building models to identify credit card fraud
Based on the observed structure of the autocorrelation of a
time series, decide on the class and the type of model that
might capture the observed autocorrelation

Data Mining has been and still continues to be the basis for
the advancement of knowledge in all of Sciences and Engineering

Second level of inverse problems - The Core of Data Assimilation

Assume now that the newly discovered mathematical laws are
expressed in the form of a class of models
The problem then becomes one of data assimilation that
relates to solving a second level of inverse problem that deals
with the estimation of the unknown parameters of the model
using the same or similar data
Determination of the weights for links connecting the neurons
in an Artificial Neural Network - minimize classification error
Estimate the sea surface temperature using satellite
observations - based on the Planck/Stefan law of radiation
Estimate the amount of rain in a cloud system using radar
observation - based on an empirical law
Estimate the structure of the earth - based on the anomaly of
the local gravitational field - basis for geophysical exploration

Third level involves Prediction - a direct problem

Once an assimilated model is made available, interest then

shifts to the direct problem of generation of short term
Predict lunar/solar eclipses
Prediction of total revenue by a state treasury
Prediction of how much snow will fall in Boston due to a coastal low
pressure system
Prediction of the amount of greenhouse gases in the atmosphere by

Is it DM or DA?

First phase: At its core DM relates to the discovery of basic
knowledge - Remember Kepler and Newton
This knowledge is often expressed as a law which is
encapsulated in a (mathematical) model
Emphasis then shifts to testing the goodness of a model
Second phase: At its core DA deals with the problem of
estimating the unknowns by fitting the model to data - Remember Gauss
Third phase: Using the assimilated model, generate forecast
products for public consumption
DM and DA are two parts of a continuum

A classification of models

Models: Based on causality (motion of a hurricane) vs.
correlation (ARMA model in time series)
Models: Explicit (ARMA model) vs. implicit (Neural Networks)
Models: Static (Regression) vs. Dynamic (ODE/PDE)
Models: Deterministic (motion of a planet) vs. stochastic
(evolution of stock prices)
Models: Linear vs. nonlinear
Model Time: Discrete (unemployment) vs. continuous
Model Space: Discrete (Markov chain) vs. continuous (rain

Forms of Data

Data arise in various forms:
Time series data - annual rainfall, total monthly sales
Data matrix (m × n) - n objects (columns) and m attributes (rows)
Cross-sectional data - tabular forms
Practical problems: missing data, outliers, data quality
Note: In Science and Engineering, data are often of the
quantitative type (permitting full-blown arithmetic
operations). In Economics, Social Sciences etc., data could be
a mixture of both quantitative and qualitative types.
Algorithms for mining/assimilating qualitative data differ from
those for quantitative data sets

Estimation - Over vs. under determined problems

Two scenarios arise depending on the cost of collecting
data
Over determined (OD) case - the number of observations is much
larger than the number of unknowns to be estimated - Once
deployed, satellites and radars will deliver large amounts of data
for quite a long time
Under determined (UD) case - fewer observations
than unknowns - Exploration for
minerals, natural gas, oil, etc.

In the OD case there is, in general, no exact solution and in the UD case there
are infinitely many solutions
These cases are the motivation for the definition of solution in
the least squares sense
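
As a hedged illustration of both cases, the sketch below uses `numpy.linalg.lstsq` on two small systems with assumed numbers. In the OD case it returns the estimate that minimizes the residual norm. In the UD case it returns, among the infinitely many exact solutions, the one of minimum norm.

```python
import numpy as np

# Over determined: 4 observations, 2 unknowns (fit a line z = a + b t).
t = np.array([0.0, 1.0, 2.0, 3.0])
z = np.array([1.1, 2.9, 5.2, 6.8])
H_od = np.column_stack([np.ones_like(t), t])
x_od, *_ = np.linalg.lstsq(H_od, z, rcond=None)
residual = z - H_od @ x_od            # nonzero: no exact solution exists

# Under determined: 1 observation, 2 unknowns (x1 + x2 = 2).
H_ud = np.array([[1.0, 1.0]])
z_ud = np.array([2.0])
x_ud, *_ = np.linalg.lstsq(H_ud, z_ud, rcond=None)   # minimum-norm solution

print(x_od, np.linalg.norm(residual), x_ud)
```

In the OD case the residual is orthogonal to the columns of the observation matrix, which is the normal-equation condition for a least squares solution.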

Framework for DA

Estimation problem is recast as a constrained minimization problem

Constraints arise naturally:
Positivity of certain physical parameters - inequality constraint
Model itself acts as a constraint - equality constraint

Strict enforcement of constraints - Strong constraint
formulation - Lagrange multiplier technique
Weak enforcement of constraints - Weak constraint
formulation - Penalty function technique
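
To make the two formulations concrete, here is a small sketch on an assumed toy problem: minimize ||x - c||^2 subject to the linear equality constraint a^T x = d. The problem, the values of c, a, d, and the function names are illustrative assumptions. The strong form solves the stationarity conditions of the Lagrangian exactly. The weak form adds the penalty mu * (a^T x - d)^2 to the cost and only approaches the constraint as mu grows.

```python
import numpy as np

c = np.array([3.0, 1.0])
a = np.array([1.0, 1.0])
d = 2.0

def strong_constraint(c, a, d):
    """Solve the stationarity system of L(x, lam) = ||x - c||^2 + lam (a^T x - d)."""
    n = c.size
    K = np.zeros((n + 1, n + 1))
    K[:n, :n] = 2.0 * np.eye(n)       # gradient of ||x - c||^2 with respect to x
    K[:n, n] = a                      # multiplier term
    K[n, :n] = a                      # the constraint row a^T x = d
    rhs = np.concatenate([2.0 * c, [d]])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n]            # minimizer and Lagrange multiplier

def weak_constraint(c, a, d, mu):
    """Minimize ||x - c||^2 + mu (a^T x - d)^2 via its normal equations."""
    n = c.size
    return np.linalg.solve(np.eye(n) + mu * np.outer(a, a), c + mu * d * a)

x_strong, lam = strong_constraint(c, a, d)
for mu in (1.0, 100.0, 1e6):
    x_weak = weak_constraint(c, a, d, mu)
    print(mu, x_weak, abs(a @ x_weak - d))   # violation shrinks as mu grows
print(x_strong, lam)
```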

Well-posed vs. ill-posed problems

In a well-posed problem a solution exists and is unique
In an ill-posed problem a solution may not exist or there may be
infinitely many solutions
Many of the inverse problems are ill-posed
These are solved by using some form of regularization
techniques - Tikhonov regularization
Using regularization we solve the nearest well-posed version of
a given ill-posed problem
Example: Solving (A + αI) X(α) = b instead of AX = b for
some small positive α for which (A + αI) is positive definite
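
A minimal sketch of this regularized solve, assuming a small singular, symmetric positive semidefinite matrix chosen for illustration. The name `tikhonov_solve` and the symbol `alpha` for the small positive shift are assumptions of the sketch.

```python
import numpy as np

def tikhonov_solve(A, b, alpha):
    """Solve the shifted system (A + alpha I) x = b for a small alpha > 0."""
    return np.linalg.solve(A + alpha * np.eye(A.shape[0]), b)

A = np.array([[1.0, 1.0], [1.0, 1.0]])    # singular: AX = b is ill-posed
b = np.array([2.0, 2.0])

for alpha in (1e-1, 1e-2, 1e-4):
    np.linalg.cholesky(A + alpha * np.eye(2))   # succeeds only if positive definite
    print(alpha, tikhonov_solve(A, b, alpha))

x_reg = tikhonov_solve(A, b, 1e-4)
```

For this example the regularized solutions approach [1, 1], the minimum-norm solution of the original singular system, as alpha decreases.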

Methods for estimation

Parametric vs. non-parametric methods

Least squares - two versions
Unweighted least squares - orthogonal projection
Weighted least squares - oblique projections
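
The projection interpretation can be checked numerically. In the sketch below, the observation matrix H, the weights W and the data z are assumed values. The unweighted projector H (H^T H)^-1 H^T is symmetric and idempotent (orthogonal projection), while the weighted projector H (H^T W H)^-1 H^T W is idempotent but in general not symmetric (oblique projection).

```python
import numpy as np

H = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
W = np.diag([1.0, 4.0, 9.0])              # assumed observation weights
z = np.array([1.0, 2.0, 2.5])

# Unweighted least squares and its projector onto the column space of H.
x_ls = np.linalg.solve(H.T @ H, H.T @ z)
P = H @ np.linalg.solve(H.T @ H, H.T)

# Weighted least squares and its projector.
x_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ z)
P_w = H @ np.linalg.solve(H.T @ W @ H, H.T @ W)

print(np.allclose(P, P.T), np.allclose(P_w, P_w.T))   # symmetric or not
print(x_ls, x_wls)
```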

Generalized method of moments

Maximum likelihood methods
Bayesian methods where we combine a known prior with
conditional distributions
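
As a hedged illustration of the Bayesian idea, the sketch below combines a Gaussian prior on a scalar unknown with one noisy observation of it. The Gaussian assumption, the numbers, and the name `gaussian_update` are all choices made for this example only.

```python
def gaussian_update(prior_mean, prior_var, obs, obs_var):
    """Posterior of x given prior N(prior_mean, prior_var) and obs = x + N(0, obs_var)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

# Prior belief: 10 with variance 4. Observation: 12 with variance 1.
post_mean, post_var = gaussian_update(10.0, 4.0, 12.0, 1.0)
print(post_mean, post_var)
```

The posterior mean is a precision-weighted average of the prior mean and the observation, so the more accurate source pulls the estimate harder.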

Optimization problems

Unimodal vs. multimodal problems

Continuous vs. discrete optimization problems
Continuous, unimodal problems solved using:
Gradient method
Conjugate gradient method
Quasi-Newton method
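
A minimal sketch of the gradient method on a continuous, unimodal problem, assuming the quadratic cost f(x) = 0.5 x^T A x - b^T x with a symmetric positive definite A. The matrix, the fixed step size and the iteration count are illustrative assumptions, not tuned values.

```python
import numpy as np

def gradient_descent(A, b, x0, step, n_iter=200):
    """Minimize 0.5 x^T A x - b^T x with fixed-step gradient descent."""
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        grad = A @ x - b              # gradient of the quadratic cost
        x = x - step * grad
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = gradient_descent(A, b, np.zeros(2), step=0.25)
print(x_min, np.linalg.solve(A, b))   # iterate vs. exact minimizer
```

The fixed step must stay below 2 divided by the largest eigenvalue of A for the iteration to converge.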

Continuous multimodal and discrete optimization problems
solved using randomized techniques:
Simulated annealing
Genetic algorithms
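
A small simulated annealing sketch for a multimodal cost, using only the standard library. The test function, the linear cooling schedule, the proposal width and the function names are assumptions chosen for illustration. The result depends on the seed and is not guaranteed to be the global minimum.

```python
import math
import random

def f(x):
    """A multimodal test function with several local minima."""
    return x * x + 10.0 * math.sin(3.0 * x)

def simulated_annealing(f, x0, n_steps=5000, step=0.5, t0=5.0, seed=0):
    """Random-walk search that accepts uphill moves with a temperature-dependent probability."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for i in range(n_steps):
        temp = t0 * (1.0 - i / n_steps) + 1e-9      # linear cooling schedule
        cand = x + rng.uniform(-step, step)
        fc = f(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / temp):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f

x0 = 4.0
x_best, f_best = simulated_annealing(f, x0)
print(x_best, f_best)
```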

Methods for DM/DA - I

Time series analysis

Signal processing in EE
Econometrics, Finance

The goal is to build stochastic dynamic models in discrete
time by exploiting the underlying correlation and seasonality
properties of the data set
In Finance, model both level and volatility
Autoregressive integrated moving average (ARIMA) models
This is one of the well developed areas in empirical modeling
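
As a hedged example of building a stochastic dynamic model in discrete time, the sketch below simulates a first-order autoregressive series, AR(1), the simplest member of the ARIMA family, and estimates its coefficient by least squares. The true coefficient, series length and seed are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
phi_true, n = 0.8, 2000

# Simulate x[t] = phi * x[t-1] + white noise.
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

# Least squares estimate of phi from consecutive pairs (x[t-1], x[t]).
phi_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])

# One-step-ahead forecast from the fitted model.
one_step_forecast = phi_hat * x[-1]
print(phi_hat, one_step_forecast)
```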

Methods for DM/DA - II

Multivariate regression analysis (1800s) - Statistics
Data reduction using PCA (1940), ICA (1990) - Statistics
Classification using
Clustering (1950s)
Neural networks (1950s)
Pattern recognition (1950s)
Support Vector Machines (SVM) (1980s)

Association rules
Image processing, voice recognition
Decision trees (1960)
Probabilistic reasoning in networks (1990s) - J. Pearl, Turing
Award in 2012
Random field - Spatial data analysis
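
Of the methods listed, data reduction using PCA is easy to sketch with a singular value decomposition. The synthetic data matrix below, with attributes as rows and objects as columns, is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_objects = 500
direction = np.array([2.0, 1.0, 0.5])     # assumed dominant direction of variation
latent = rng.standard_normal(n_objects)
X = np.outer(direction, latent) + 0.05 * rng.standard_normal((3, n_objects))

# Center each attribute (row), then take the SVD of the centered matrix.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)           # fraction of variance per component
scores = U[:, :1].T @ Xc                  # data reduced from 3 attributes to 1
print(explained, scores.shape)
```

The columns of `U` are the principal directions. Keeping only the leading ones gives the reduced representation.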

Commonality of approaches in DM, DA, AI, Machine Learning

Supervised learning
Learning with a teacher - Learning in Neural Networks
Learning with a probabilistic teacher - using imprecise

Unsupervised learning/Learning without a teacher - Clustering, Adaptive Control

At the first level Data Mining seeks to uncover the basic laws
that are hidden in the data. These laws are represented by
models of some kind with unknown parameters
At the second level Data Assimilation deals with the task of
fusing data with models to produce an assimilated model - by
estimating the unknown parameters
At the third level, the assimilated model is used to produce
various forecast products for public consumption
DM, DA and Forecasting are the three parts of a continuum
in knowledge discovery


