
Forecasting Weekly Rainfall Using Data Mining Technologies
T. Dananjali
Faculty of Graduate Studies
Sabaragamuwa University of Sri Lanka
Belihuloya, Sri Lanka
dananjali@ccs.sab.ac.lk

S. Wijesinghe
Department of Economics and Statistics
Sabaragamuwa University of Sri Lanka
Belihuloya, Sri Lanka
wijesinghewadsk@gmail.com

J. Ekanayake
Department of Computer Science and Informatics
Uva Wellassa University
Badulla, Sri Lanka
jayalath@uwu.ac.lk

2020 From Innovation to Impact (FITI) | 978-1-6654-1471-5/20/$31.00 ©2020 IEEE | DOI: 10.1109/FITI52050.2020.9424877

Abstract— Rainfall forecasting is a technologically and scientifically challenging task around the world. Rainfall is one of the most important weather conditions in a given area. Forecasting possible rainfall can help to solve several problems related to the tourism industry, natural disaster management, the agricultural industry, etc. As the Sri Lankan rural economy is mostly based on agriculture, it is important to forecast rainfall as well as other weather conditions accurately. Weather patterns are localized and hence generalization of weather prediction models is very difficult. Therefore, this project proposes three data mining models to forecast rainfall and compares the prediction performance of those models. To that end, the data mining models linear regression, SMO regression, and the M5P model tree were trained on rainfall data collected from the Badulla district, Sri Lanka, during the period 2002 to 2017, to forecast weekly rainfall for the following five months lead-time. Each model was evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Root Relative Squared Error (RRSE), Relative Absolute Error (RAE), Direction Accuracy (DA) and residual analysis. According to the findings, the M5P model tree provided the lowest error values, the highest direction accuracy, the highest correlation between actual and predicted rainfall values, and better randomness of the error values compared to the linear and SMO regression models.

Keywords— forecasting, linear regression, M5P model tree, rainfall, SMO regression

I. INTRODUCTION
Scientists focus heavily on weather forecasting using various technologies, though it is a challenging task. Deterministic and probabilistic are the two different techniques followed by meteorologists and climatologists. Meteorologists use data from satellite technologies, ships, airplanes, weather stations and buoys, and devices dropped from airplanes or weather balloons. Climatologists use the probabilistic technique and suggest prospective weather conditions [1]. Apart from meteorologists and climatologists, data scientists and statisticians are involved in weather forecasting based on historical data using data mining and machine learning technologies. Data mining technologies are applicable to rainfall forecasting in various ways, for example by predicting rainfall from many other variables or by predicting it from date and time. The objective of this study is to predict weekly rainfall using different data mining technologies and to identify the most effective rainfall forecasting model among them. To that end, linear regression, Sequential Minimal Optimization (SMO) regression, an extended version of the Support Vector Machine (SVM) algorithm, and the M5P model tree algorithm are used.

II. RELATED WORKS
There are many studies on rainfall prediction in the literature. Most of these studies have used modern data mining technologies, statistical models, and hybrid models, which are combinations of data mining models and statistics.

The multiplicative seasonal autoregressive integrated moving average (SARIMA) model has been used to simulate monthly rainfall at Nyala station, Sudan. Monthly rainfall data for the years 1971–2010 were used. The SARIMA (0,0,0) × (0,1,1)_12 model was identified as the most suitable model for monthly rainfall prediction over Nyala station [2]. Historical data for the period 2003–2012 were used to predict rainfall in Warri Town, Nigeria [3]. The SARIMA (1,1,1) (0,1,1) model was the model chosen after that analysis.

Rainfall prediction is necessary for flood management, rainwater harvesting, urban planning, water resource management, and the planning and optimal operation of irrigation systems [4]. Autoregressive integrated moving average (ARIMA) models and SARIMA models were used to forecast rainfall in Bangladesh. According to the findings, the forecasting quality of the SARIMA models is reasonably precise. Hence, it was suggested that these SARIMA models can be used as a convenient tool for nationwide rainfall forecasting in Bangladesh [4].

The ARMA model was used to forecast short-term rainfall in central Italy for flood forecasting. Two estimation and fitting rules were investigated: the first considers all rainfall occurrences throughout the period of record as the basis for parameter estimation, and the second is an event-based adaptive procedure. The results show that the event-based estimation approach yields better forecasts [5].

The literature shows that the SARIMA model is identified as the best traditional statistical model for time series forecasting of rainfall in most research. However, researchers have recently been moving from traditional statistical models to modern data mining technologies. Modern data mining technologies such as artificial neural network (ANN) models achieved better prediction quality than the traditional models. ANNs provide a methodology for solving many types of non-linear problems that are difficult to solve with traditional techniques. Moreover, artificial neural network models are used not only for analyzing data but also for making future predictions. Furthermore, neural networks are capable of extracting the relationship between the input and output of a process [6].

Clustering, artificial neural networks, and linear regression are the widely used data mining technologies for rainfall forecasting [7]. Furthermore, a multiple linear regression model was used to forecast rainfall in Bangladesh.

Onyari et al. [8] present a rainfall-runoff modelling approach using data mining techniques, namely a multilayer perceptron neural network and an M5P model tree. The M5P model tree developed with a 66% training set was realized


Authorized licensed use limited to: Makerere University Library. Downloaded on November 01,2022 at 14:41:25 UTC from IEEE Xplore. Restrictions apply.
to be the best model. An MLP-ANN with 4 hidden nodes performed satisfactorily. According to their results, the M5 model tree predicts better than the ANN-MLP.

According to the survey paper [5], the widely used techniques for prediction are regression analysis, clustering, and Artificial Neural Networks (ANN). Most studies have applied a statistical method and an ANN model to the same problem, and a comparison between those two types shows that ANN models were better than the traditional statistical models [9], [10], [11].

Mishra et al. [12] developed and analyzed ANN models to forecast rainfall based on time series data. They developed two models, for one-month-ahead and two-months-ahead predictions. A Feed Forward Neural Network (FFNN) using the back propagation algorithm and the Levenberg-Marquardt training function was used. MSE and Magnitude of Relative Error (MRE) were used to evaluate the models. According to their findings, the one-month-ahead prediction models performed better than the two-months-ahead prediction models.

Even though many rainfall forecasting models exist, models based on Sri Lankan weather conditions are comparatively rare. Therefore, in our study we aim to forecast weekly rainfall in the Badulla district using three different data mining technologies for the next six months. The prediction quality of each model was evaluated using different evaluation metrics, after which the prediction quality was analyzed using residual analysis to further confirm the quality of the prediction models. Thereby, the best-fitting data mining model was identified for weekly rainfall prediction.

III. METHODOLOGY
The SMO regression, linear regression and M5P model tree were trained and tested to predict weekly rainfall in the Badulla district, and the best performing model among them was then identified to forecast weekly rainfall. The rainfall data were collected from the Meteorological Department of Sri Lanka for the period from 1st January 2002 to 31st December 2017. The first week is defined as 1st January 2002 to 7th January 2002, and thereafter each seven-day period was considered a week. The dataset was preprocessed as it contained missing values and outliers: outliers were removed from the data set and each missing value was filled with the mean value. There are 835 instances in the final dataset. The algorithms were run in the Weka (weka-dev-3.9.3) data mining tool. Minitab (Minitab 18.0) was used to analyze the results. The training dataset is 66% of the total and the rest was used for testing the models. The performance of the linear regression, SMO regression and M5P model tree models was evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Root Relative Squared Error (RRSE), Relative Absolute Error (RAE), and Direction Accuracy (DA). Further, the performance of the models was evaluated using residual analysis.

IV. RESULTS
Rainfall was predicted on a weekly basis for a period of six (06) months ahead; hence, each model predicts twenty-four (24) rainfall values, one per week.

Table I shows the prediction quality of the linear regression, SMO regression and M5P model tree models in the training process. Table II shows the prediction quality of each model in the testing process.

TABLE I. AVERAGE ERROR VALUES OF THE MODELS ON THE TRAINING DATA SET

Model                     MAE    RMSE   RRSE   RAE    DA
Linear regression model   14.2   18.9   63.8   66.1   51.8
M5P model tree            14.5   19.1   64.6   67.9   49.5
SMO regression model      13.1   19.9   67.3   60.9   55.4

TABLE II. AVERAGE ERROR VALUES OF THE MODELS ON THE TESTING DATA SET

Model                     MAE    RMSE   RRSE   RAE    DA
Linear regression model   19.2   25.9   78.3   85.5   44.4
M5P model tree            17.0   23.4   70.9   75.7   46.7
SMO regression model      19.9   28.0   84.5   88.3   44.3

According to the training results (Table I), the SMO regression model provides the lowest MAE and RAE and the highest direction accuracy, whereas the linear regression model provides the lowest RMSE and RRSE. The M5P model performs comparably to the linear regression model. In the testing process (Table II), the M5P model provides the lowest error value for each evaluation metric while also providing the highest direction accuracy. Hence, the M5P model is identified as the best performing prediction model among these three models.

Fig. 1. Actual values and predicted values of the linear regression model
[Plot: rainfall value (mm) against week, 7/25/2012 to 12/25/2012; series: actual value, predicted value.]

Fig. 2. Actual values and predicted values of the M5P model tree
[Plot: rainfall value (mm) against week, 7/25/2012 to 12/25/2012; series: actual value, predicted value.]
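The error measures reported in Tables I and II can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Weka configuration: the formulas for RAE and RRSE follow the usual definitions relative to a predict-the-mean baseline, and the direction accuracy formula (the share of week-to-week changes whose sign is predicted correctly) is an assumption, since the paper does not state how DA was computed. The function names and the sample values are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rae(actual, predicted):
    """Relative absolute error (%), relative to always predicting the mean of `actual`."""
    mean_a = sum(actual) / len(actual)
    num = sum(abs(p - a) for a, p in zip(actual, predicted))
    den = sum(abs(a - mean_a) for a in actual)
    return 100.0 * num / den


def rrse(actual, predicted):
    """Root relative squared error (%), relative to always predicting the mean of `actual`."""
    mean_a = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - mean_a) ** 2 for a in actual)
    return 100.0 * math.sqrt(num / den)


def direction_accuracy(actual, predicted):
    """Share (%) of consecutive steps where actual and predicted move in the same direction.

    Assumed definition: compare the sign of (value[t] - value[t-1]) in both series.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    hits = sum(
        1
        for t in range(1, len(actual))
        if sign(actual[t] - actual[t - 1]) == sign(predicted[t] - predicted[t - 1])
    )
    return 100.0 * hits / (len(actual) - 1)


# Hypothetical weekly rainfall values in mm, for illustration only.
actual = [10.0, 20.0, 15.0, 40.0]
predicted = [12.0, 18.0, 19.0, 37.0]
print(mae(actual, predicted), rmse(actual, predicted))
print(rae(actual, predicted), rrse(actual, predicted))
print(direction_accuracy(actual, predicted))
```

RAE and RRSE values below 100 indicate that a model does better than simply predicting the mean rainfall, which is one way to read the percentages in the two tables.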

Fig. 3. Actual values and predicted values of the SMOreg model
[Plot: rainfall value (mm) against week, 7/25/2012 to 12/25/2012.]

Figs. 1, 2, and 3 show the variation of the actual rainfall and the predicted rainfall at one test point. The X-axis of the figures shows the prediction period in weeks ahead and the Y-axis shows the rainfall in millimeters (mm). According to the figures, the prediction quality decreases as the prediction period is extended. This property is common to the other testing points and hence the models cannot be used for predicting rainfall far ahead.

TABLE III. RELATIONSHIP BETWEEN ACTUAL VALUE AND PREDICTED VALUE

Model                     Pearson Correlation Coefficient Value   P-Value
Linear regression model   0.16                                    0.451
M5P model tree            0.41                                    0.046
SMO regression model      0.18                                    0.396

We define the following two hypotheses:

H0 = There is no correlation between actual and predicted rainfall values
H1 = There is correlation between actual and predicted rainfall values

Table III shows the Pearson correlation coefficient of each model. H0 is rejected when the P-Value is less than 0.05. According to Table III, the P-Values of the linear regression and SMO regression models are greater than 0.05. Hence, H0 is not rejected for these models, that is, there is no significant evidence of correlation between their actual and predicted values. The P-Value of the M5P model is less than 0.05 and the correlation coefficient is 0.41. According to the correlation analysis, the M5P outperforms the other two models in predicting rainfall six weeks ahead. Next, residual analysis is conducted to check the goodness of the model.

Fig. 5. Run chart of M5P model

Fig. 6. Run chart of SMO regression model

A run chart is used to check the randomness of the residuals. Figs. 4, 5, and 6 show the run charts generated for the residual values of each model. The X-axis of the charts shows the observation and the Y-axis shows the residuals of each model.

To that end we defined two hypotheses:

H0 = Error values produced by the model are random
H1 = Error values produced by the model are not random and there is a pattern.

H0, the null hypothesis, is rejected when the P-Value is less than 0.05. The hypothesis is tested using the approximate P-Values for clustering and trends.

According to the run charts, the linear, M5P and SMO regression models obtain approximate P-Values of 0.338, 0.202 and 0.338 respectively for clustering and 0.567, 0.749 and 0.201 respectively for trends. According to the P-Values, H0 is not rejected for any of the models, which indicates that the error values are random. The M5P model obtains the highest P-Value for trends (0.749), although its P-Value for clustering (0.202) is the lowest of the three. With respect to trends, the randomness of the M5P residuals is therefore taken to be greater than that of the other two models. This supports the finding that the prediction quality of M5P is better than that of the other two models.
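The run-chart checks described above can be approximated with two classical runs tests. The sketch below is an illustration under stated assumptions, not the procedure Minitab applies internally: it assumes that the approximate P-Value for clustering is the lower-tail normal probability of the number of runs about the median, and that the approximate P-Value for trends is the lower-tail normal probability of the number of runs up or down. Values equal to the median and repeated consecutive values are simply dropped, which is also an assumption. The function names are hypothetical.

```python
import math
from statistics import median


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def runs_about_median(residuals):
    """Runs about the median: returns (observed runs, expected runs, approx. p for clustering).

    Few runs (small p) suggest clustering. Assumes values lie on both sides of the median.
    """
    med = median(residuals)
    above = [r > med for r in residuals if r != med]  # ties with the median are dropped
    n_above = sum(above)
    n_below = len(above) - n_above
    n = n_above + n_below
    runs = 1 + sum(1 for a, b in zip(above, above[1:]) if a != b)
    expected = 2.0 * n_above * n_below / n + 1.0
    variance = (2.0 * n_above * n_below * (2.0 * n_above * n_below - n)) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    return runs, expected, normal_cdf(z)


def runs_up_down(residuals):
    """Runs up or down: returns (observed runs, expected runs, approx. p for trends).

    Few runs (small p) suggest a sustained upward or downward drift.
    """
    steps = [b > a for a, b in zip(residuals, residuals[1:]) if b != a]  # flat steps dropped
    runs = 1 + sum(1 for a, b in zip(steps, steps[1:]) if a != b)
    n = len(residuals)
    expected = (2.0 * n - 1.0) / 3.0
    variance = (16.0 * n - 29.0) / 90.0
    z = (runs - expected) / math.sqrt(variance)
    return runs, expected, normal_cdf(z)


# Hypothetical residuals, for illustration only.
residuals = [1, -1, 2, -2, 3, -3, 4, -4]
print(runs_about_median(residuals))
print(runs_up_down(residuals))
```

Under these assumptions, a P-Value above 0.05 from either function means H0 (random residuals) is not rejected, which mirrors the decision rule stated in the text.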
Fig. 4. Run chart of linear regression model

V. CONCLUSION
This project proposes an approach to forecast rainfall in the Badulla district, Sri Lanka. Towards that end, three data mining models were trained on the rainfall data collected from the Badulla district. From the evidence, it was concluded that the M5P model tree performed better than the linear regression and SMO regression models. The M5P model recorded comparatively lower MAE, RMSE, RRSE and RAE values and a higher DA value in the testing process (Table II). Only the M5P model shows a statistically significant positive correlation of 0.41, whereas the other two models do not show a significant correlation between actual and predicted rainfall values. Furthermore, the M5P provides greater randomness in the error distribution. Hence, the M5P model tree is proposed as the best model for weekly rainfall prediction in the Badulla district.

In summary, rainfall can be forecast for six months ahead in the Badulla district using the M5P model tree with decent accuracy. This finding is useful for many industries, particularly for the agriculture sector in the Badulla area.

ACKNOWLEDGMENT
The work/publication is supported by Research grant 2016, Sabaragamuwa University of Sri Lanka.

REFERENCES
[1] “Four Types of Forecasting,” Sciencing. https://sciencing.com/four-types-forecasting-8155139.html (accessed Oct. 15, 2020).
[2] T. Mohamed and A. Ibrahim, “Time Series Analysis of Nyala Rainfall Using ARIMA Method,” vol. 17, 2016.
[3] D. Eni and F. Adeyeye, “Seasonal ARIMA Modeling and Forecasting of Rainfall in Warri Town, Nigeria,” Journal of Geoscience and Environment Protection, vol. 03, pp. 91–98, 2015, doi: 10.4236/gep.2015.36015.
[4] I. Mahmud, S. H. Bari, and M. T. U. Rahman, “Monthly rainfall forecast of Bangladesh using autoregressive integrated moving average method,” Environmental Engineering Research, vol. 0, p. 0, 2016, doi: 10.4491/eer.2016.075.
[5] P. Burlando, R. Rosso, L. G. Cadavid, and J. D. Salas, “Forecasting of short-term rainfall using ARMA models,” Journal of Hydrology, vol. 144, no. 1, pp. 193–211, 1993, doi: 10.1016/0022-1694(93)90172-6.
[6] J. Joseph and T. K. Rathees, “Rainfall Prediction using Data Mining Techniques,” International Journal of Computer Applications, vol. 83, pp. 11–15, 2013, doi: 10.5120/14467-2750.
[7] M. A. I. Navid and N. H. Niloy, “Multiple Linear Regressions for Predicting Rainfall for Bangladesh,” Communications, vol. 6, no. 1, Art. no. 1, Feb. 2018, doi: 10.11648/j.com.20180601.11.
[8] E. Onyari and F. Ilunga, “Application of MLP Neural Network and M5P Model Tree in Predicting Streamflow: A Case Study of Luvuvhu Catchment, South Africa,” International Journal of Innovation, Management and Technology, vol. 4, pp. 11–15, 2013, doi: 10.7763/IJIMT.2013.V4.347.
[9] A. El-Shafie, H. Mazoghi, A. AbouKheira, and M. Taha, “Artificial neural network technique for rainfall forecasting applied to Alexandria, Egypt,” International Journal of the Physical Sciences, vol. 6, pp. 1306–1316, 2011.
[10] I. Khandelwal, R. Adhikari, and G. Verma, “Time Series Forecasting Using Hybrid ARIMA and ANN Models Based on DWT Decomposition,” Procedia Computer Science, vol. 48, pp. 173–179, 2015, doi: 10.1016/j.procs.2015.04.167.
[11] “Comparative Study of Rainfall Prediction Modeling Techniques (A Case Study on Srinagar, J&K, India),” The Research Publication. https://www.trp.org.in/issues/comparative-study-of-rainfall-prediction-modeling-techniques-a-case-study-on-srinagar-jk-india (accessed Oct. 15, 2020).
[12] N. Mishra, H. K. Soni, S. Sharma, and A. K. Upadhyay, “Development and Analysis of Artificial Neural Network Models for Rainfall Prediction by Using Time-Series Data,” IJISA, vol. 10, no. 1, pp. 16–23, Jan. 2018, doi: 10.5815/ijisa.2018.01.03.
