
Application of Outlier Detection using Re-Weighted Least Squares and R-squared for IoT Extracted Data

Moath Awawdeh, Department of Electrical Engineering, Higher College of Technology, Abu Dhabi, UAE (*mawawdeh@hct.ac.ae)
Tarig Faisal, Department of Electrical Engineering, Higher College of Technology, Abu Dhabi, UAE (tfaisal@hct.ac.ae)
Anees Bashir, Department of Electrical Engineering, Higher College of Technology, Abu Dhabi, UAE (abashir1@hct.ac.ae)
Amjad Sheikh, Department of Chemical Engineering, Higher College of Technology, Abu Dhabi, UAE (asheikh@hct.ac.ae)

Abstract—The problem of outlier detection for measurements extracted from an IoT platform is addressed. An approach to validate the detected values via coefficient-of-determination analysis is presented by applying a combined procedure of weighted least squares, the bisquare algorithm, and robust fitting. We first fit the model by weighted least squares and then apply the bisquare weighting method, where the weight of each measure is assigned based on the distance of that value from the generated best fit. The map of weights over the dataset determines whether a value is a possible outlier by searching for measures with zero weight. In the second part of the paper we apply some of the most common outlier-processing methods, logarithmic and square-root transformation, where re-weighted least squares is used to detect the outliers. We analyze three ways of dealing with outliers, including transformation and outlier removal, and measure the coefficient of determination for each case together with other statistics. The results are shown by means of simulation with estimation results and weight labeling.

Keywords—Weighted least squares, outlier detection, coefficient of determination, outlier transformation.

I. INTRODUCTION

Data mining has been widely investigated in the literature in terms of data preprocessing and transformation, where useful patterns are generated using analysis tools and algorithms [1]. In most data-analysis applications a huge amount of data is recorded, including outlying variables that can considerably affect the robustness of the regression model. Those observations are usually treated as noise, which may lead to a bad estimator and incorrect models, although in many situations they hide useful information. It is therefore a crucial task to identify outliers before applying data analysis [2]–[4]. Hawkins [5] defined the concept of an outlier based on the deviations among the measures, considering the mechanism by which those observations are generated. Outlier detection methods have been studied and proposed for many applications such as intrusion detection systems [6], [7] and wireless sensor networks [6], [8]; further applications of outlier detection in satellite image analysis, motion segmentation, and severe weather prediction can be found in [2]–[4], [6], [9]–[11], where surveys and analytic comparisons of different methods and applications are presented.

The effect of outliers on regression analysis can be reduced by applying one of two options: delete or transform. If the measures contain a limited number of outliers, deleting those values is a common choice, although the deleted observations may still carry useful knowledge; this option is widely used with Tukey's box-and-whisker plot. Deleting the variables instead is the easier choice when many outliers are present in the variables. The other option is transformation: either transform the values by changing them to a neighboring value, or transform the variables instead of changing each outlying observation individually [12]–[14].

In any IoT platform where sensors send measures from the field remotely, the data can be affected by noise or by other factors such as sensor malfunction, where outlying events occur. Analyzing those readings with no preprocessing generates a bad estimator and hence misleading decisions. An IoT platform equipped with Matlab analysis and visualization tools can be used to accomplish dataset preparation and processing. In this paper we introduce a method for validating outlier-detection processing via a coefficient-of-determination check. The data are considered to be extracted from an IoT platform, where the dataset is created from the measures stored in the cloud and then processed and analyzed. Taking a decision using data in an IoT platform needs a primary step of data preparation, together with robust interpretation of the information; it is at this stage that our paper contributes.

This paper is organized as follows: in Section II, a brief overview of IoT data processing is given; in Section III, we introduce polynomial regression by least squares; in Section IV, re-weighted least squares modeling is presented; and finally a simulation example is given in Section V.

This work is supported by the Higher College of Technology Research Program, 2018.

Authorized licensed use limited to: Higher College of Technology. Downloaded on September 26,2021 at 14:29:13 UTC from IEEE Xplore. Restrictions apply.
II. IOT-BASED DATA EXTRACTION AND MODELING

Using the concept of the Internet of Things (IoT), sensors measure data of different physical quantities and communicate those data in various forms. An IoT platform provides the sensors with the ability to send data to the cloud, where they can be analyzed and stored. Such a platform is equipped with a Matlab visualization and analysis tool, with which stored data can be easily explored and interpreted. This feature gives the user access to Matlab to explore more features and extract different knowledge using the built-in tools. The data coming from those sensors are real-time data that have not been processed or transformed. Thus, interpreting those data and extracting useful patterns without applying the primary steps of data preprocessing and transformation may lead to misinterpreted patterns that greatly affect decision making. A crucial step in the chain of data processing is outlier detection, where the data must be made clean and correct. Such a step can be integrated in an IoT platform, through its access to Matlab, to improve and validate the analysis result.

III. POLYNOMIAL REGRESSION, LEAST SQUARES (LS)

Polynomial regression fits the nonlinear relation between x and y with an nth-degree polynomial, and can be considered a form of statistical estimation. The basic regression model of a dependent variable Y on a set of k independent variables X1, X2, ..., Xk can be expressed as [15]:

  yi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik + ei,  i = 1, 2, ..., n    (1)

where yi is the value of Y for the ith case, xij is the value of the jth independent variable Xj for the ith case, and β0 is the Y-intercept of the regression surface; see [15], [16] for the model assumptions and components.

For polynomial curve fitting, we consider the general form of a polynomial of order j:

  f(x) = β0 + Σ_{k=1}^{j} βk x^k    (2)

and the general least-squares error (residuals):

  r² = Σ_{i=1}^{n} [ yi − ( β0 + Σ_{k=1}^{j} βk xi^k ) ]²    (3)

In regression analysis, polynomial regression is a primary step with a crucial impact on the design. For our scope, a Matlab-based tool [17] has been used, which provides decision-support-system users with an interactive plot of the result in a graphical interface, where the user can change the parameters of the fit and export the fit results to the Matlab workspace for further analysis.

IV. RE-WEIGHTED LEAST SQUARES (RE-WLS) MODELING

For estimating the unknown parameters in regression analysis, ordinary least squares is a popular technique that fits the model by minimizing the sum of squared differences between observed and predicted values. The validity of the least-squares estimate depends on the validity of the assumptions underlying the regression model. Least squares (LS) generates a bad estimate in the presence of outliers, or generally whenever some measures deviate greatly from the others in the dataset; this follows from the LS criterion, under which non-normal points (those that differ from the rest) strongly affect the analysis of the data (for example, the slope of the regression line). The regression parameters can also be estimated using weighted least squares, but this estimator can likewise be affected by outliers, as ordinary least squares is. Thus, we propose the re-weighted least squares algorithm using robust fitting; see [18] for the detection of outliers in weighted least squares regression and a description of the modeling.

A. Weighted Least Squares (WLS) Regression

The weighted least squares model is given by

  Y = Xβ + ε    (4)

where Y is an n×1 vector of dependent observations, X is an n×p0 full-column-rank matrix of known explanatory variables, β is a p0×1 vector of unknown parameters to be estimated, and the error term ε is an n×1 vector of independent random errors with zero mean and unknown variance σ². The WLS estimator β̂w is obtained by minimizing Σ wi (yi − xi β)², that is, β̂w = (XᵀDwX)⁻¹XᵀDwY, where Dw is the diagonal matrix of the weights wi. The WLS algorithm provides the variance of the estimator, the vector of weighted fitted values, the vector of residuals, and the unbiased estimator.

B. Outlier Detection using Re-WLS

As discussed earlier, least-squares fitting is sensitive to outlying observations. To minimize this effect, the data are fit using the bisquare-weights method, in which a weighted sum of squares is minimized. The distance between a point and the estimated line determines the weight that the point carries: the nearest points get the highest weight, while points that are very far from the fitted line get zero weight and are considered outliers. Matlab provides a combined procedure of WLS and the bisquare-weight algorithm for detecting outliers, which can be described as follows:

1) Fit the model by WLS.
2) Compute the adjusted residuals

  radj,i = ri / √(1 − hi)    (5)

where ri are the usual least-squares residuals and hi are leverages that adjust the residuals.
3) Standardize the adjusted residuals

  ui = radj,i / (K s)    (6)

where K = 4.685 and s is the robust standard deviation given by

  s = MAD / 0.6745    (7)

where MAD is the median absolute deviation of the residuals.
4) Compute the bisquare weights

  wi = (1 − ui²)²  if |ui| < 1
  wi = 0           if |ui| ≥ 1    (8)

5) Find the zeros in the weight matrix and label each zero-weight index with the value's position in the data; the labeled values are considered outliers. If the fit converges and the outliers are detected, the process is done; otherwise, repeat from step 1.

We consider an example of 20 measures, shown in Fig. 1, with a single synthetic outlier at time instance t = 16. In Fig. 2, WLS is applied for outlier detection: the model has been fitted using WLS, following the steps above. Fig. 3 shows the weight labeling, where each measure has a bisquare weight according to Equation (8).

Fig. 1. Least squares fitting with a single outlier.
Fig. 2. Outlier detection via WLS.
Fig. 3. Weights labeling for the measurements in Fig. 2.

C. Coefficient of determination R²

R-squared (R²), known as the coefficient of determination, measures the goodness of fit of the regression line to the measures. It can be very useful in assessing the quality of the estimator in linear regression. Unfortunately, as with LS itself, the coefficient of determination is also sensitive to outliers, and the apparent goodness of the model can be greatly affected. It takes values from zero to one: a poor fitting line is represented by a value close to 0, while values near 1 represent the best fit. We use this tool as a measure of fitting accuracy in the presence of outlying observations, and of how outliers affect the regression analysis. R² can be written as the ratio [19]:

  R² = ESS/TSS = 1 − RSS/TSS    (9)
     = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − ȳ)²    (10)

where ESS, TSS and RSS are the explained, total, and residual sums of squares, respectively. When there is an intercept term in the linear model, this coefficient of determination is equal to the square of the correlation coefficient between yi and ŷi:

  R² = ( Σ_{i=1}^{n} (yi − ȳ)(ŷi − ŷ̄) )² / ( Σ_{i=1}^{n} (yi − ȳ)² · Σ_{i=1}^{n} (ŷi − ŷ̄)² )    (11)

with ŷ̄ the mean of the predicted responses. Following the line of [19], [20], the adjusted R² is defined as

  R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1)    (12)
        = R² − (1 − R²) p/(n − p − 1)    (13)

where p is the number of regressors in the linear model and n is the sample size.

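As a concrete illustration, Equations (9)–(13) translate directly into code. The following Python sketch is ours, not the paper's; the trend line, noise level, and outlier magnitude are illustrative assumptions:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination, Eq. (10): 1 - RSS/TSS."""
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - rss / tss

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2, Eq. (12): p regressors, n samples."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Illustrative data: a linear trend with one outlying measure at t = 16
rng = np.random.default_rng(0)
t = np.arange(1.0, 21.0)
y = 1.0 * t + rng.normal(0.0, 0.5, 20)
y[15] += 12.0                               # synthetic outlier

beta = np.polyfit(t, y, 1)                  # ordinary least-squares line
y_hat = np.polyval(beta, t)
r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, n=20, p=1))
```

For a fit with an intercept, the value from Eq. (10) coincides with the squared correlation of Eq. (11), which makes the two expressions a convenient cross-check.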
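The five-step Re-WLS procedure of Section IV-B can be sketched in Python. This is a simplified reimplementation, not the Matlab robust-fit routine itself; the synthetic dataset, noise level, and iteration cap are illustrative assumptions:

```python
import numpy as np

def rewls_outliers(x, y, degree=1, n_iter=30):
    """Bisquare re-weighted least squares (Section IV-B, steps 1-5).
    Returns the fitted coefficients and the zero-weight (outlier) indices."""
    X = np.vander(x, degree + 1)                  # polynomial design matrix
    w = np.ones_like(y)
    K = 4.685                                     # bisquare tuning constant
    for _ in range(n_iter):
        # Step 1: WLS fit, beta_w = (X' Dw X)^-1 X' Dw Y
        Dw = np.diag(w)
        beta = np.linalg.solve(X.T @ Dw @ X, X.T @ Dw @ y)
        r = y - X @ beta
        # Step 2: adjusted residuals r / sqrt(1 - h), h = leverages (Eq. 5)
        h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
        r_adj = r / np.sqrt(1.0 - h)
        # Step 3: standardize with robust scale s = MAD / 0.6745 (Eqs. 6-7)
        mad = np.median(np.abs(r_adj - np.median(r_adj)))
        s = max(mad / 0.6745, np.finfo(float).eps)
        u = r_adj / (K * s)
        # Step 4: bisquare weights (Eq. 8)
        w_new = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        if np.allclose(w_new, w, atol=1e-10):     # Step 5: fit has converged
            break
        w = w_new
    return beta, np.flatnonzero(w == 0.0)         # zero weight -> outlier

# 20 measures on a noisy line with one synthetic outlier at t = 16
rng = np.random.default_rng(1)
t = np.arange(1.0, 21.0)
y = 2.0 + 1.5 * t + rng.normal(0.0, 0.3, 20)
y[15] += 25.0
beta, outliers = rewls_outliers(t, y)
print(beta, outliers)
```

The weight vector returned implicitly carries the "weight labeling" of Figs. 3 and 6: any index mapped to zero weight is flagged as an outlier, while the remaining points keep weights between zero and one according to their distance from the fitted line.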
In the following pages we show the effect of outliers on R² by means of simulation with statistical regression analysis, using the same observations, with synthetic outliers, for all comparisons. Data transformation is used to reduce the effect of the outliers when they are not removed; we therefore also show the effect of transformation on R² and compare it with the case of removing the outliers. The measurements used and their transformations are shown in Fig. 4. Outliers at time instances t = 4, 10, 20 have been introduced into the original measures; we account for three outliers instead of one, as in Fig. 2, to increase the model misfit and to cover the case of more than a single outlier. Fig. 5 and Fig. 6 show the detection of the outliers using WLS and the corresponding weight labeling, respectively, where three measures have been assigned weight zero and can be considered outliers. With no outlier processing applied, the resulting R² is around 51%; further regression and statistical analysis is reported in Tables I and II.

Fig. 4. Original data with three synthetic outliers and data transformation.

V. SIMULATION EXAMPLE

A set of 20 measures, as shown in Fig. 4, has been considered for our analysis. Three values, 18, 26 and 32, have been assigned as outliers. The approach of the last section has been implemented with different outlier processing, considering three ways of dealing with the outliers: outlier removal (Figs. 7-8, with statistics in Tables III and IV), square-root transformation (Figs. 9-10), and logarithmic transformation (Figs. 11-12). For the same measures, the data can then be analyzed in reverse, with an indicator of the best processing option according to its fitted model. The approach can be summarized as follows: step 1, detect the outliers using Re-WLS; step 2, process the outliers as per the model assumptions; step 3, find R² for each model; and finally, process the outliers based on the model that generates the maximum R-squared, the removal case excepted.

CONCLUSION

A method for visualizing and dealing with detected outliers has been proposed. The use of R² to measure the optimal action for dealing with an outlier is presented, and the simulation results show the different statistics when removing and when transforming those values. Considering transformation, the logarithmic transformation shows a better fitting line for the given dataset than the other transformation, although removing all outlying values of course gives the maximum R². The reverse-measure tool can be used by finding the coefficient of determination for all of the advised outlier-processing methods, taking the maximum value as the indicator of the best fit, and then applying the chosen method with the same confidence.

REFERENCES

[1] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery in Databases," AI Magazine, American Association for Artificial Intelligence, pp. 37-54, Fall 1996.
[2] I. Ben-Gal, "Outlier Detection," in O. Maimon and L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers, 2005.
[3] G. J. Williams, R. A. Baxter, H. X. He, S. Hawkins, and L. Gu, "A Comparative Study of RNN for Outlier Detection in Data Mining," IEEE International Conference on Data Mining (ICDM'02), Maebashi City, Japan, CSIRO Technical Report CMIS-02/102, 2002.
[4] H. Liu, S. Shah, and W. Jiang, "On-line Outlier Detection and Data Cleaning," Computers and Chemical Engineering, vol. 28, pp. 1635-1647, 2004.
[5] D. Hawkins, Identification of Outliers, Chapman and Hall, 1980.
[6] C. C. Aggarwal, Outlier Analysis, Kluwer Academic Publishers.
[7] N. Devarakonda, S. Pamidi, V. V. Kumari, and A. Govardhan, "Outliers Detection as Network Intrusion Detection System Using Multi Layered Framework," Advances in Computer Science and Information Technology, Communications in Computer and Information Science, Springer, vol. 131, pp. 101-111, 2011.
[8] O. Ghorbel, M. W. Jmal, W. Ayedi, H. Snoussi, and M. Abid, "An overview of outlier detection techniques developed for wireless sensor networks," IEEE 10th International Multi-Conference on Systems, Signals and Devices (SSD), pp. 1-6, 2013.
[9] S. Cateni, V. Colla, and M. Vannucci, "Outlier Detection Methods for Industrial Applications," in J. Aramburo and A. Ramirez Trevino (Eds.), Advances in Robotics, Automation and Control, I-Tech, Vienna, Austria, 2008.
[10] V. J. Hodge and J. Austin, "A Survey of Outlier Detection Methodologies," Kluwer Academic Publishers, 2004.
[11] V. Chandola, A. Banerjee, and V. Kumar, "Outlier Detection: A Survey," University of Minnesota.
[12] J. Osborne, "Notes on the use of data transformations," Practical Assessment, Research & Evaluation, vol. 8(6), 2002.
[13] J. W. Osborne and A. Overbay, "The power of outliers and why researchers should always check for them," Practical Assessment, Research & Evaluation, vol. 9(6), 2004.
[14] J. W. Osborne, "Improving your data transformations: Applying the Box-Cox transformation," Practical Assessment, Research & Evaluation, vol. 15(12), 2010.
[15] E. Ostertagova, "Modelling using polynomial regression," Procedia Engineering, Elsevier, vol. 48, pp. 500-506, 2012.
[16] A. D. Aczel, Complete Business Statistics, Irwin, ISBN 0-256-05710-8, 1989.
[17] Matlab and Statistics Toolbox Release 2012b, The MathWorks Inc., Natick, Massachusetts, United States. Available online at http://www.mathwork.com
[18] B. Y. Sohn and G. B. Kim, "Detection of outliers in weighted least squares regression," Korean J. Comp. Appl. Math., vol. 4(2), pp. 441-452, 1997.
[19] O. Renaud and M.-P. Victoria-Feser, "A robust coefficient of determination for regression," Journal of Statistical Planning and Inference, vol. 140(7), pp. 1852-1862, 2010.
[20] W. Greene, Econometric Analysis (3rd ed.), Prentice Hall, 1997.
Fig. 5. Outlier detection using Re-WLS with no outlier processing.
Fig. 6. Weight labeling with outliers labeled 0.00.
Fig. 7. Outlier removal with Re-WLS detection.
Fig. 8. Weight labeling with three measures removed.

TABLE I
REGRESSION STATISTICS (NO OUTLIER PROCESSING)

  Statistic        Value
  Multiple R       0.71939
  R²               51.752 %
  Adjusted R²      0.49072
  Standard Error   5.88154
  Outliers         3
  Outlier process  No Action

TABLE II
STATISTICAL ANALYSIS (NO ACTION TO OUTLIERS)

             Coefficient   St. Error   t stat
  Intercept  1.432         2.732       0.524
  Variable   1.002         0.228       4.394

TABLE III
REGRESSION STATISTICS (OUTLIERS REMOVED)

  Statistic         Value
  Multiple R        0.94209
  R²                88.753 %
  Adjusted R²       0.88003
  Standard Error    2.04944
  Outliers Removed  3
  Outlier process   Removed

TABLE IV
STATISTICAL ANALYSIS (OUTLIERS REMOVED)

             Coefficient   St. Error   t stat
  Intercept  -0.341        1.039       -0.328
  Variable   1.103         0.101       10.88

Fig. 9. Outlier detection and square root transformation.
Fig. 10. Transformed data weight labeling (square root).
Fig. 11. Outlier detection and logarithmic transformation.
Fig. 12. Transformed data weight labeling (logarithmic).

TABLE V
REGRESSION STATISTICS (SQUARE ROOT TRANSFORMATION)

  Statistic        Value
  Multiple R       0.77557
  R²               60.151 %
  Adjusted R²      0.57937
  Standard Error   0.80536
  Outliers         2
  Transformation   Square root

TABLE VI
STATISTICAL ANALYSIS (SQUARE ROOT TRANSFORMATION)

             Coefficient   St. Error   t stat
  Intercept  1.529         0.374       4.088
  Variable   0.162         0.031       5.212

TABLE VII
REGRESSION STATISTICS (LOGARITHMIC TRANSFORMATION)

  Statistic        Value
  Multiple R       0.80056
  R²               64.090 %
  Adjusted R²      0.62095
  Standard Error   0.23256
  Outliers         2
  Transformation   Logarithmic

TABLE VIII
STATISTICAL ANALYSIS (LOGARITHMIC TRANSFORMATION)

             Coefficient   St. Error   t stat
  Intercept  0.413         0.108       3.822
  Variable   0.051         0.009       5.667

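The comparison summarized in Tables I-VIII (no action, removal, square-root and logarithmic transformation) can be reproduced in outline. The sketch below uses an illustrative synthetic dataset rather than the paper's measures, so the numbers will differ from the tables; only the ranking logic of Section V is reproduced:

```python
import numpy as np

def fit_r2(t, y):
    """OLS straight-line fit and its R^2."""
    beta = np.polyfit(t, y, 1)
    y_hat = np.polyval(beta, t)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss

# Synthetic line with the paper's outlier values 18, 26, 32 at t = 4, 10, 20
rng = np.random.default_rng(0)
t = np.arange(1.0, 21.0)
y = 2.0 + t + rng.normal(0.0, 0.5, 20)
y[[3, 9, 19]] = [18.0, 26.0, 32.0]

keep = np.ones(20, dtype=bool)
keep[[3, 9, 19]] = False                     # indices flagged by Re-WLS

results = {
    "no action":   fit_r2(t, y),
    "removal":     fit_r2(t[keep], y[keep]),
    "square root": fit_r2(t, np.sqrt(y)),    # transform, then refit
    "logarithmic": fit_r2(t, np.log(y)),
}
for name, r2 in results.items():
    print(f"{name:12s} R^2 = {r2:.3f}")
```

As in the paper's tables, removal yields the largest R², and the reverse-measure idea amounts to picking the processing option whose refitted model maximizes this statistic.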
