Data Mining and Neural Networks
ASSIGNMENT
Data evaluation and elementary preprocessing.
Data completeness is traditionally assessed in the data warehouse through ETL testing, which employs aggregate functions (sum, max, min, count) to assess the average completeness of a column or record. Additionally, manual commands are used to validate data profiles, such as comparing distinct values and counting the number of rows for each distinct value. However, before running any commands, the user must first decide the type of incompleteness they are dealing with and the type [...]
If all phone numbers are lacking city codes, for example, there could be a data quality issue at the entry level. It could be an MNAR (missing not at random) problem if more than half of the audience [...]
That is all well and good. However, there is one crucial step that users often overlook:
You can't clean, remove, or restore missing values until you know what's missing and how much of it you [...]
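As a rough illustration of finding out what is missing and how much, the sketch below counts missing entries per column. The records, the column names and the helper `missing_profile` are hypothetical and are not taken from the assignment's data; `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"name": "Ann", "phone": "555-0101", "city_code": None},
    {"name": "Bob", "phone": None, "city_code": None},
    {"name": "Cid", "phone": "555-0199", "city_code": "020"},
    {"name": None, "phone": "555-0142", "city_code": None},
]


def missing_profile(rows):
    """Return {column: (missing_count, missing_fraction)} for a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    missing = Counter()
    for row in rows:
        for col in columns:
            # An absent key counts as missing, the same as an explicit None.
            if row.get(col) is None:
                missing[col] += 1
    total = len(rows)
    return {col: (missing[col], missing[col] / total) for col in columns}


profile = missing_profile(records)
for col, (count, frac) in profile.items():
    print(f"{col}: {count} missing ({frac:.0%})")
```

A column that is missing in most rows (here the made-up `city_code`) is a candidate for an entry-level data quality issue rather than random gaps, which is the kind of distinction the text asks the user to make before cleaning.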
[Figure: plots of time series 2, 3 and 4]
There are no missing values in these time series.
Segmentation
A time series can frequently be interpreted as a sequence of discrete segments of finite length. For instance, the stock market's trajectory could be divided into regions that fall between major world events, the input to a handwriting recognition application could be segmented into the different words or letters that it is thought to contain, and the audio recording of a conference could be divided according to who spoke when. In the latter two cases, one may take advantage of the fact that individual segment label assignments can repeat themselves by clustering segments based on their distinguishing properties.
This problem can be approached in two ways. The first involves searching for change points in the time series: for example, if there is a significant jump in the average value of the signal, a segment boundary can be assigned. The second method assumes that each time-series segment is generated by a system with distinct parameters, and then infers the most likely segment locations as well as the system parameters that define them. When determining which label to apply to a given point, the first approach tends to look only for changes in a small window of time, while the second [...]
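The first approach (a jump in the average value marks a segment boundary) can be sketched as follows. This is a minimal illustration under assumptions of my own: the function name `mean_shift_boundaries`, the window length, the threshold and the sample signal are all hypothetical, and the method used in the assignment may differ.

```python
from statistics import mean


def mean_shift_boundaries(series, window=3, threshold=2.0):
    """Return indices where the local mean jumps by more than `threshold`.

    For each index i, the mean of the `window` points before i is compared
    with the mean of the `window` points starting at i.
    """
    scores = {}
    for i in range(window, len(series) - window + 1):
        left = mean(series[i - window:i])
        right = mean(series[i:i + window])
        scores[i] = abs(right - left)

    boundaries = []
    run = []  # consecutive indices whose score exceeds the threshold
    for i in sorted(scores):
        if scores[i] > threshold:
            run.append(i)
        elif run:
            # Report one boundary per run: the index with the largest jump.
            boundaries.append(max(run, key=scores.get))
            run = []
    if run:
        boundaries.append(max(run, key=scores.get))
    return boundaries


# Hypothetical signal whose level changes from about 1 to about 5 at index 5.
signal = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 5.1, 4.9, 5.2, 5.0]
print(mean_shift_boundaries(signal))
```

Because each score only uses the points within one window on either side of an index, this sketch shares the limitation noted in the text: it looks at a small window of time rather than the whole series.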
Mean square error:
In time series analysis, the upper representation of each segment demonstrates that all attributes are [...]
Prediction
We choose the data of 2020 as the sample for calculating the predictors for g(t+1) [...] next-day values. The other three serve as supporting time series. Now let's [...]
We try to fit linear models in so many difficult problem settings that there is no reason to believe the true data-generating model is linear, let alone that the errors are Gaussian or homoscedastic. As a consequence, a modern perspective is that, since the linear model is just a rough approximation, it is [...]
Prediction is a much broader concept than linear models. We'll return to this topic later this week, but for now, here is a short rundown: models are only approximations; some methods don't even need them; so let's evaluate prediction accuracy and use that to judge how useful a model or method is. Test error was defined in the context of a linear model, but the concept is the same in all situations.
We often need an accurate estimate of the test error of our method (e.g., linear regression). Why? There are two main objectives:
Predictive assessment: understand the magnitude of errors we might expect when making future predictions.
Model/method selection: choose from a range of models/methods so as to minimise test error.
Suppose we estimate our method's test error using the observed training error $\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$. What exactly is the problem here? As an estimate of test error, it is generally too optimistic: after all, the parameters $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ were chosen in the first place to bring $\hat{Y}_i$ close to $Y_i$, $i = 1, \ldots, n$! Furthermore, the more complex/adaptive the method is, the more optimistic the training error is as an estimate of test error.
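The point about optimism can be illustrated with a small simulation. This is only a sketch: the data-generating line (`TRUE_B0`, `TRUE_B1`), the noise level, the sample sizes and the helper names are assumptions made for illustration and do not come from the assignment.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x (closed form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def mse(xs, ys, b0, b1):
    """Mean of squared differences between observed and predicted values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_B0, TRUE_B1 = 1.0, 2.0  # assumed data-generating line, illustration only


def sample(n):
    """Draw n points from the assumed line plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [TRUE_B0 + TRUE_B1 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


train_x, train_y = sample(20)
test_x, test_y = sample(1000)

b0, b1 = fit_line(train_x, train_y)
train_error = mse(train_x, train_y, b0, b1)
test_error = mse(test_x, test_y, b0, b1)
true_line_train_error = mse(train_x, train_y, TRUE_B0, TRUE_B1)

print(f"training error of fitted line: {train_error:.3f}")
print(f"test error of fitted line:     {test_error:.3f}")
print(f"training error of true line:   {true_line_train_error:.3f}")
```

By construction, the least squares line has a training error no larger than that of any other line, including the line that actually generated the data. That is the source of the optimism: the fit has adapted to the noise in the training sample, so the error on fresh data is usually larger.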
Adaptive predictors.
In linear regression, ordinary least squares residuals are often used to estimate the unknown true errors. These estimates may give a false impression of the true error distribution due to shrinkage and superimposed normality effects. RMOLS is a novel method for improving moment estimation by appropriately rescaling the moment estimators derived from least squares residuals. These RMOLS moments give more precise estimates of the skewness and kurtosis coefficients, as well as more power for one type of normality test. These properties are demonstrated using a Monte Carlo analysis with a variety of random error distributions.
Before the supervised learning process can begin, the train, test, and sometimes tune sets must all be identified. In the standard time series prediction task, the three sets are specified as train: $g_1, \ldots, g_f$; tune: $g_{f+1}, \ldots, g_{f+r}$; test: $g_{f+r+1}, \ldots, g_{f+r+p_{max}}$ (with $f, r, p_{max} \in \mathbb{N}$). It's [...]
Based on the assumption that data collection can be transferred to new locations as quickly as possible, [...]
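The three contiguous sets described above can be sketched as a chronological split. The function name `chronological_split` and the placeholder series are hypothetical; the indices follow the train/tune/test definition in the text, translated to Python's 0-based slicing.

```python
def chronological_split(series, f, r, p_max):
    """Split a series into contiguous train, tune and test blocks.

    With 1-based indexing as in the text: train is g_1..g_f,
    tune is g_{f+1}..g_{f+r}, test is g_{f+r+1}..g_{f+r+p_max}.
    """
    if min(f, r, p_max) < 1:
        raise ValueError("f, r and p_max must be positive integers")
    if f + r + p_max > len(series):
        raise ValueError("series is too short for the requested split")
    train = series[:f]
    tune = series[f:f + r]
    test = series[f + r:f + r + p_max]
    return train, tune, test


g = list(range(1, 13))  # placeholder values standing in for g_1..g_12
train, tune, test = chronological_split(g, f=6, r=3, p_max=3)
print(train, tune, test)
```

The order of the observations is preserved and the blocks do not overlap, so the tune and test sets always lie in the future relative to the training data.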
A prediction horizon of $p_m$ yields, for example, the following sets: train: $(g_1, g_{1+p_m}), \ldots, (x_{n-3p_m}, x_{n-2p_m})$; tuning: $(x_{n-3p_m+1}, x_{n-2p_m+1}), \ldots, (x_{n-2p_m}, x_{n-p_m})$; test: $(x_{n-3p_m+1}, x_{n-2p_m+1}), \ldots, (g_{n-2p_m},$ [...]
The mean squared error (MSE) of a regression line shows how close the line is to a set of points. It does this by taking the distances between the points and the regression line (these distances are the "errors") and squaring them. Squaring is needed to remove any negative signs, and it also gives more weight to larger differences. It is called the mean squared error because you are finding the average of a set of squared errors.
Follow these steps to calculate the mean squared error from a collection of X and Y values:
Plug the X values into the linear regression equation to find the new Y values (Y').
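The calculation can be sketched in a few lines, following the standard definition of the MSE: predict Y' from X, subtract Y' from Y to get the errors, square them, and average. The data points and the line y = 2x below are made-up values for illustration and are not the data used in the assignment.

```python
def mean_squared_error(xs, ys, intercept, slope):
    """MSE of the line y = intercept + slope * x on the points (xs, ys)."""
    # Plug the X values into the regression equation to get Y'.
    predicted = [intercept + slope * x for x in xs]
    # The errors are the differences between observed and predicted values.
    errors = [y - y_hat for y, y_hat in zip(ys, predicted)]
    # Square each error, then take the mean.
    squared = [e ** 2 for e in errors]
    return sum(squared) / len(squared)


xs = [1, 2, 3, 4]  # illustrative X values
ys = [2, 4, 5, 8]  # illustrative Y values
result = mean_squared_error(xs, ys, intercept=0.0, slope=2.0)
print(result)  # 0.25
```

Only the third point lies off the line (by 1), so the squared errors are 0, 0, 1 and 0, and their mean is 0.25.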
Comparison results/comments:
The smaller the mean squared error, the closer you are to finding the line of best fit. Depending on your data, it may be impossible to obtain a very small mean squared error. For example, the data above is scattered widely around the regression line, so 6.08 is the best we can do (and the line is, in fact, the line of best fit). Note that I obtained the regression line using an online calculator; the mean squared error would be useful if you were trying to find an equation for the regression line by hand: you could try several equations and choose the one with the smallest mean squared error as the line of best fit. A mathematical model or estimator may sometimes need to be "tweaked" in order to achieve the best results. The MSE criterion is a tradeoff between (squared) bias and variance, and it is defined as follows: "T is a minimum [MSE] estimator of $\theta$ if $\mathrm{MSE}(T, \theta) \le \mathrm{MSE}(T', \theta)$, where T' is any alternative estimator."
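The tradeoff between squared bias and variance can be made explicit with the standard decomposition of the MSE of an estimator. The symbol $\theta$ for the parameter being estimated is an assumption here, since the symbol is not legible in the source text.

```latex
\[
\mathrm{MSE}(T, \theta)
  = \mathbb{E}\big[(T - \theta)^2\big]
  = \operatorname{Var}(T) + \big(\mathbb{E}[T] - \theta\big)^2
\]
```

The second term is the squared bias of T. An estimator with a small bias can therefore still have a lower MSE than an unbiased one if its variance is sufficiently smaller.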