2009 - Feature-Based Clustering For Electricity Use - Räsänen, Kolehmainen

Feature-Based Clustering for Electricity Use Time Series Data

Teemu Räsänen and Mikko Kolehmainen

Research Group of Environmental Informatics
Department of Environmental Sciences
University of Kuopio
P.O. Box 1627
FIN-70211 Kuopio, Finland
{Teemu.Rasanen,Mikko.Kolehmainen}@uku.fi

Abstract. Time series clustering has been shown to be effective in providing useful
information in various applications. This paper presents an efficient
computational method for time series clustering and its application to the
creation of more accurate electricity use load curves for small customers.
The approach is based on extracting statistical features and using them
in feature-based clustering of customer-specific, hourly measured electricity use
data. The feature-based clustering was able to cluster time series using just a set
of derived statistical features. The main advantages of this method are that it
reduces the dimensionality of the original time series, is less sensitive to
missing values and can handle time series of different lengths. The
performance of the approach was evaluated using real hourly measured data for
1035 customers over an 84-day test period. The clustering resulted
in more accurate load curves for this set of customers than the load curves
currently in use. This kind of approach helps energy companies to take advantage of
new hourly information, for example in electricity distribution network
planning, load management, customer service and billing.

Keywords: time series clustering, feature-based clustering, feature extraction,
electricity use data, load curves, electricity distribution.

1 Introduction
Data mining of multivariate time series is a well-known research area in which feature
extraction, as a data reduction technique, plays an important role in pattern recognition
and data analysis [1]. In many real-world situations a large amount of
data has to be reduced, the variables of a multivariate time series are not
synchronized in time, or many values are missing from the data. These are the main
reasons for using a feature-based method for clustering time series [2]. In addition,
clustering the original time series data is computationally more
demanding than feature-based approaches. Moreover, recent studies have shown that
pattern recognition methodologies, such as k-means, self-organizing maps, fuzzy

M. Kolehmainen et al. (Eds.): ICANNGA 2009, LNCS 5495, pp. 401–412, 2009.
© Springer-Verlag Berlin Heidelberg 2009
k-means and hierarchical methods, can be applied to the study of customer
electricity behavior, where the problems mentioned above typically occur [3][4].
The analysis of customer loads and load estimation is a traditional area of
electricity distribution technology, because electricity distribution utilities need
accurate load data for pricing and tariff planning, distribution network planning and
operation, power production planning, load management, customer service and billing,
and also for providing information to customers and public authorities [5]. Recently,
there has been major technological progress in small-customer energy use metering,
and consequently hourly measured information will be available in the near future for
the great majority of customers. Furthermore, current European Union legislation has
brought new requirements for energy distributors and retail energy sales companies,
obliging them to provide more detailed information about consumers' energy
consumption [6].
The electricity load curve describes the amount of electrical energy a customer uses
over the course of time, and it is used to plan how much electricity a retailer or
distribution company will need to make available at any given time.
Furthermore, end-use load curves (i.e. load profiles) show how the load of a particular
customer varies throughout the day and week and give an understanding of peak
demand [7]. The most important load information is how a customer or a group of
customers uses electricity at different hours of the day, on different days of the week
and in different seasons of the year, what their share of the utility's total load is, and
how the loads of different customers aggregate in different locations of a distribution
network [5]. The factors affecting the load of a customer or customer group are
1) customer consumption behavior and residence characteristics, 2) time of day, week
or year and 3) local climate factors such as temperature, humidity or solar radiation [8][5].
Typically, energy companies have classified customers into groups according to
their characteristics and annual demand for electricity. Based on this classification,
each customer has a load curve estimate, which is used for billing and distribution
management. However, changes in a customer's life and electricity use are typically
not communicated to the energy company, so the load curve cannot be updated
when needed. Another problem is that the assigned load curve can be wrong in the first
place, because a customer may share the characteristics of a typical customer group
while having different electricity consumption behavior. As a result of these problems,
demand-side management and distribution planning rely on incorrect information,
which causes extra costs.
The purpose of this study was to develop an efficient computational approach for
handling complex and large time series datasets in the context of electricity load
research. In the presented application, the main aim was to utilize large
amounts of hourly measured electricity use data in order to validate and improve
customer-specific load curves. In this paper, we compare the assigned load curves with
measured electricity use and investigate how well they agree. Furthermore,
we present a computationally efficient, data-based approach for creating more accurate,
up-to-date customer-specific load curves from real measurement data.
The proposed methods were tested using hourly measured electricity use data from 1035
customers located in Northern Savo, Finland. The results showed that the original load
curves were not very accurate and that they can be improved using data-based clustering.
2 Materials and Methods

2.1 Data Used

In this study, we used data describing the hourly measured electricity use (kWh) of
1035 small customers during winter 2007. The data covered 84 days (2016
hours) starting on 1 January 2007. The customers were located in the Pohjois-Savo
region in eastern Finland. The region has two major cities,
Kuopio and Iisalmi, but the majority of the customers were located outside the cities
in sparsely populated areas.
The energy company classifies customers according to their characteristics when
a customer joins the company's distribution network. Each customer is assigned a
specific load curve, which is used as the basis for billing and distribution planning.
These 1035 customers were divided among 18 different load curves, each intended to
best describe the customer's electricity use and behavior. For example, house type
(detached house, terraced house, etc.), heating type (use of electric heating) and type
of activity at the residence (leisure cottage, agricultural residence, etc.) had been
used as classifying characteristics.
The data set contained hourly energy use time series for the 1035 customers, and we
also had the original load curve for each customer. With these data we determined how
well each customer's real electricity use corresponds to the original load curve.
Furthermore, the data were used to create new load curves based on each customer's
electricity use behavior and characteristic-based clustering.

2.2 Feature Extraction

Transforming raw time series data into a set of features is called feature
extraction. Regardless of the length of the time series and of missing values, a finite
set of statistical measures can be used to capture the global nature of the time series [2].
Furthermore, feature extraction is used to compress large data sets by means of
dimensionality reduction. In this way, computational efficiency can be increased and
the use of more sophisticated algorithms becomes possible. Although the majority of
feature extraction methods are generic in nature, the extracted features are usually
application dependent. Thus a set of features that works well in one application might
not be relevant in another [9].
In this study, features were extracted from the raw hourly measured electricity use
data using a window of one week, i.e. 168 data rows (hours). We extracted 7 features
from each window of each customer's data: mean, standard deviation,
skewness, kurtosis, chaos, energy and periodicity.
Mean and standard deviation (Eq. 1) are simple but useful features. Skewness (Eq.
2) measures the degree of asymmetry in the distribution of the energy consumption
data, and kurtosis (Eq. 3) measures how strongly the distribution is peaked at its
center [10].
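
The four distribution features can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it uses the plain sample forms of the moments (each hourly value counted once) rather than the binned form with c_i and n_i, and the function name and the example week are hypothetical.

```python
import math

def statistical_features(window):
    """Return (mean, std, skewness, kurtosis) for one window of hourly kWh values."""
    n = len(window)
    mean = sum(window) / n
    # Population standard deviation: divide by n, as with the 1/N factor in Eq. 1.
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        # A constant window has no shape; return zeros instead of dividing by zero.
        return mean, 0.0, 0.0, 0.0
    skew = sum((x - mean) ** 3 for x in window) / (n * std ** 3)
    kurt = sum((x - mean) ** 4 for x in window) / (n * std ** 4)
    return mean, std, skew, kurt

# Hypothetical one-week window (168 hours): base load of 1 kWh with an evening peak.
week = [1.0 + (2.0 if hour % 24 == 18 else 0.0) for hour in range(168)]
print(statistical_features(week))
```

The short daily peak makes the distribution right-skewed, so the skewness of this example week is positive.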
Many real-world systems exhibit chaotic behavior. Nonlinear
dynamical systems in particular often exhibit chaos, which is characterized by sensitive
dependence on initial values or, more precisely, by a positive Lyapunov exponent
(LE). The LE, as a measure of the divergence of nearby trajectories, has been used to
quantify chaos by giving it a numerical value. It is common to refer only to the
largest one, i.e. the maximal Lyapunov exponent (MLE), because it determines the
predictability of a dynamical system. A positive MLE is usually taken as an indication
that the system is chaotic. The maximal Lyapunov exponent (λ) can be defined using
Eq. 4 [11].

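$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{k} (c_i - m)^2 \cdot n_i} \tag{1}$$

$$\mathrm{Skew} = \frac{1}{N\sigma^3} \sum_{i=1}^{k} (c_i - m)^3 \cdot n_i \tag{2}$$

$$\mathrm{Kurt} = \frac{1}{N\sigma^4} \sum_{i=1}^{k} (c_i - m)^4 \cdot n_i \tag{3}$$

$$\lambda = \lim_{t \to \infty} \frac{1}{t} \ln \frac{|\delta Z(t)|}{|\delta Z_0|} \tag{4}$$

$$\mathrm{Energy} = \frac{\sum_{i=1}^{w} x_i^2}{w} \tag{5}$$
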
Periodicity is important for determining seasonality and examining the cyclic
pattern of a time series [2]. Here, the length of the dominant period was determined
using the discrete power spectrum (periodogram), which describes how the
signal strength is distributed over frequency. The spectrum generally reveals the
nature of the data [12]. The frequency with the highest power was converted into a
period in hours and used as the periodicity feature.
In addition, to capture data periodicity, an energy feature was calculated as
the sum of the squared discrete FFT component magnitudes of the signal. This sum
was divided by the window length for normalization. The energy feature was calculated
using Eq. 5, where x1, x2, ... are the FFT components of the window and w is the
window length [13].
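
A minimal sketch of the periodicity and energy features follows. It rests on several assumptions: a naive discrete Fourier transform stands in for an FFT library, the zero-frequency term is skipped when searching for the dominant frequency, and the function names are hypothetical. The paper does not say how the authors handled these details.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform, O(n^2); adequate for a 168-sample window."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def periodicity_and_energy(window):
    """Return (dominant period in hours, energy as in Eq. 5) for one window."""
    n = len(window)
    power = [abs(c) ** 2 for c in dft(window)]  # periodogram (unnormalized)
    # For a real signal only bins 1..n//2 are distinct; bin 0 is the mean level
    # and is skipped here (an assumption of this sketch).
    k_max = max(range(1, n // 2 + 1), key=lambda k: power[k])
    period_hours = n / k_max
    energy = sum(power) / n  # squared magnitudes divided by the window length
    return period_hours, energy

# Hypothetical week with a pure daily cycle.
week = [math.sin(2 * math.pi * hour / 24) for hour in range(168)]
period, energy = periodicity_and_energy(week)
print(period, energy)
```

For the synthetic daily cycle the strongest bin is the seventh one (seven cycles in 168 hours), which converts to a period of 24 hours.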
After feature extraction, the data set contained 84 variables (7 features for
each of the 12 weeks) and 1035 rows (one per customer). This data set was used to
create new load curves with the K-means clustering method.
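
The arithmetic behind the 84 columns (2016 hours / 168 hours per window = 12 windows, times 7 features) can be sketched as below. The helper names are hypothetical, only two stand-in features are used, and the column order (window by window) is an assumption; the paper does not state it.

```python
def weekly_windows(series, window=168):
    """Split an hourly series into consecutive, non-overlapping windows."""
    return [series[i:i + window] for i in range(0, len(series) - window + 1, window)]

def feature_row(series, feature_functions, window=168):
    """Concatenate every feature of every window into one row for one customer."""
    row = []
    for chunk in weekly_windows(series, window):
        row.extend(f(chunk) for f in feature_functions)
    return row

# Two stand-in features only; the study used seven per window.
features = [lambda w: sum(w) / len(w), max]
series = [float(hour % 24) for hour in range(2016)]  # hypothetical 84-day series
row = feature_row(series, features)
print(len(weekly_windows(series)), len(row))  # 12 24
```

With seven feature functions instead of two, the same code would give rows of 12 * 7 = 84 values.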

2.3 K-Means Clustering

The number of clusters in a specific application may not be known a priori.
However, in the K-means algorithm the number of clusters has to be predefined.
Therefore, the algorithm is commonly applied with different numbers of clusters,
and the best solution among them is then selected using a validity index such as the
Davies-Bouldin (DB) index [14]. It is calculated as follows:
$$DB = \frac{1}{N} \sum_{i=1}^{N} \max_{j,\, j \neq i} \frac{S_i + S_j}{d_{ij}} \tag{6}$$

where N is the number of clusters. The within-cluster (S_i) and between-cluster (d_ij)
distances are calculated using the cluster centroids as follows:

$$S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - m_i \rVert \tag{7}$$

$$d_{ij} = \lVert m_i - m_j \rVert \tag{8}$$

where m_i is the centre of cluster C_i and |C_i| is the number of points belonging to
cluster C_i. The objective is to find the set of clusters that minimizes the DB index
of Eq. 6.
The Davies-Bouldin index was used to determine the optimal number of clusters. The DB
index varies slightly between runs because the initial cluster centers are chosen
randomly. Therefore, the index was calculated 20 times for each number of clusters,
and the mean value was used when the optimal number of clusters was
selected. After that, the K-means algorithm was used to cluster the feature data set in
order to create a reasonable number of groups for comparison.
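
The index of Eqs. 6 to 8 can be sketched directly from its definition. The function below is an illustrative assumption, not the authors' code: it takes an already computed partition (a list of clusters, each a list of feature vectors), needs at least two clusters, and assumes no two centroids coincide.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of non-empty clusters (lower is better)."""
    centers = [centroid(c) for c in clusters]
    # S_i: mean Euclidean distance of the members to their own centre (Eq. 7).
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centers)
    ]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # Worst-case ratio of summed scatter to centre distance d_ij (Eqs. 6 and 8).
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(n)
            if j != i
        )
    return total / n

# Two hypothetical partitions: compact, well separated clusters versus loose ones.
tight = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]]]
loose = [[[0.0, 0.0], [0.0, 4.0]], [[3.0, 0.0], [3.0, 4.0]]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

To follow the procedure described above, one would run the clustering several times for each candidate number of clusters and average the resulting index values before comparing candidates.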
The K-means algorithm was applied to cluster the feature vectors, which
were created from the raw time series data. K-means is a well-known
non-hierarchical clustering algorithm [15]. The basic version begins by choosing the
number of clusters K and randomly picking K cluster centers. Each point is then
assigned to the cluster whose mean is closest in the Euclidean distance sense. Finally,
the mean vectors of the points assigned to each cluster are computed, and these are
used as the new centers. The assignment and update steps are iterated until a
convergence criterion is met.
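
The basic algorithm described above can be sketched as follows. This is a minimal illustration under stated assumptions: initial centers are drawn from the data points, convergence means that no assignment changes, and an empty cluster keeps its previous center. The function name, the seed parameter and the example points are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means; returns (labels, centers)."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest center (Euclidean).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical two-dimensional feature vectors forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Because the result depends on the random starting centers, repeated runs with different seeds can give different partitions, which is why the validity index is averaged over several runs.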

2.4 Estimating Goodness of Clustering

The difference between a customer's electricity use and the clustered load curve was
calculated using the index of agreement (IA). It is a dimensionless measure, limited to
the range 0...1, that gives the relative size of the difference [16]. It is easy to
interpret and well suited to cross-comparisons between time series or models. A value
of 1 indicates a perfect fit between the observed and
predicted data. IA is calculated as follows:
$$IA = 1 - \frac{\sum_{i=1}^{n} (P_i - O_i)^2}{\sum_{i=1}^{n} \left( |P_i'| + |O_i'| \right)^2} \tag{9}$$

$$P_i' = P_i - \bar{O} \tag{10}$$

$$O_i' = O_i - \bar{O} \tag{11}$$

In these equations, n is the number of observations, O_i is the observed value at time i,
P_i is the predicted value at time i, and Ō is the mean of the observed variable over
the n observations.
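
Eqs. 9 to 11 translate into a short function. The sketch below is illustrative only; the function name and the example series are hypothetical, and the load curve is treated as the prediction and the customer's measured use as the observation.

```python
def index_of_agreement(predicted, observed):
    """Index of agreement between a predicted and an observed series (Eqs. 9-11)."""
    if len(predicted) != len(observed):
        raise ValueError("series must have the same length")
    o_mean = sum(observed) / len(observed)
    # Numerator: squared differences between prediction and observation.
    numerator = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    # Denominator: both series are measured as deviations from the OBSERVED mean.
    denominator = sum(
        (abs(p - o_mean) + abs(o - o_mean)) ** 2 for p, o in zip(predicted, observed)
    )
    if denominator == 0:
        raise ValueError("IA is undefined when both series equal the observed mean")
    return 1.0 - numerator / denominator

# Hypothetical example: a load curve that is one unit too high at every hour.
observed = [1.0, 2.0, 3.0, 4.0]
load_curve = [2.0, 3.0, 4.0, 5.0]
print(index_of_agreement(load_curve, observed))
```

A load curve identical to the measurements gives an IA of exactly 1, and larger deviations push the value towards 0.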

3 Results
The aim of the study was to (1) evaluate the correspondence between customers'
measured electricity use and the load curves assigned by the energy company and
(2) use feature-based clustering to create new, more accurate load curves. First, the
mean index of agreement was calculated between the present load curves and the
electricity use of the customers assigned to each curve. The mean index of agreement,
its standard deviation and the number of customers using each load curve are listed in
Table 1.

Table 1. The results of comparison between measured electricity use and original load curves

Load Curve   Mean IA   Std of IA   Number of customers
LC1          0.31      0.08        426
LC2          0.33      0.10        189
LC3          0.35      0.08        76
LC4          0.51      0.04        6
LC5          0.16      0.00        1
LC6          0.30      0.06        48
LC7          0.35      0.00        1
LC8          0.26      0.08        3
LC9          0.35      0.00        1
LC10         0.22      0.08        2
LC11         0.28      0.07        6
LC12         0.30      0.11        15
LC13         0.27      0.15        6
LC14         0.35      0.17        8
LC15         0.38      0.05        7
LC16         0.07      0.00        1
LC17         0.39      0.07        15
LC18         0.16      0.17        4
Unknown                            220
Mean         0.30      0.07        1035

Next, the feature data set was created, and the raw time series data (a matrix of 1035
rows by 2016 columns) were transformed into a more compact format (a matrix of 1035
rows by 84 columns). Before clustering the feature data set, the Davies-Bouldin index
was used to determine the optimal number of clusters. There were two clear candidates
according to the DB index, 16 and 32 clusters, and 16 clusters were selected because
the final comparison results were better than with 32 clusters. The values of the DB
index for different numbers of clusters are shown in Figure 1.
[Figure 1 plot: Davies-Bouldin Index (DBI) versus Number of Clusters]

Fig. 1. Values of the Davies-Bouldin index for different numbers of clusters. The optimal
number of clusters is where the index is lowest (in this case the options were 32 clusters
or 16 clusters).

Table 2. The results of comparison between measured electricity use and new data-based load
curves

New Load Curve   Mean IA   Std of IA   Number of customers
NLC1             0.50      0.12        21
NLC2             0.68      0.10        166
NLC3             0.68      0.13        25
NLC4             0.46      0.12        20
NLC5             0.77      0.09        209
NLC6             0.83      0.09        181
NLC7             0.63      0.21        11
NLC8             0.57      0.14        25
NLC9             0.62      0.17        17
NLC10            0.69      0.14        24
NLC11            0.70      0.16        16
NLC12            0.66      0.10        91
NLC13            0.73      0.11        76
NLC14            0.55      0.20        100
NLC15            0.68      0.13        26
NLC16            0.65      0.14        27
Mean             0.65      0.13        1035
The K-means algorithm was used to cluster the feature data set, and as a result
16 new customer groups were created. The actual new load curves were calculated
from the original data according to each customer's cluster id. In other words, the
mean electricity use of the customers in each cluster was calculated and used as the
new load curve. The comparison between measured electricity use and the new
data-based load curves was then carried out by calculating the mean index of
agreement for each load curve. The results of the comparison are given in Table 2.
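
Computing a load curve as the hour-by-hour mean of a cluster's members can be sketched as below. The function name and the tiny three-hour series are hypothetical, and every series is assumed to have the same length.

```python
def cluster_load_curves(series_by_customer, labels):
    """Return {cluster id: hour-by-hour mean of the member customers' series}."""
    grouped = {}
    for series, label in zip(series_by_customer, labels):
        grouped.setdefault(label, []).append(series)
    return {
        label: [sum(hour_values) / len(members) for hour_values in zip(*members)]
        for label, members in grouped.items()
    }

# Hypothetical hourly use (kWh) of three customers and their cluster ids.
usage = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [10.0, 10.0, 10.0]]
cluster_ids = [0, 0, 1]
curves = cluster_load_curves(usage, cluster_ids)
print(curves)
```

Each customer's measured series can then be compared with the curve of its own cluster using the index of agreement.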

[Figure 2 plot: Measured electricity use and Load Curve (cluster center) versus Time (hours); panel title: Mean IA: 0.82847, Customers: 181]

Fig. 2. An example of one of the created data-based load curves (black) and the variation
in customer electricity use (grey), containing the electricity use of all 181 customers
during the 84-day test period. Values were variance-scaled.

An example of a new load curve and the variation in customer electricity use is shown
in Figure 2. In this example, the mean index of agreement was 0.83 and 181 customers
were assigned to this load curve. Additionally, the IA values were calculated for
all customers in both comparison cases: using the original load curves and using the
new data-based load curves. The results are shown in Figure 3, where the comparison
between each customer's electricity use and the original load curve (dash-dot line) or
the new data-based load curve (solid line) is presented as a histogram.
[Figure 3 plot: Number of Customers versus Index-Of-Agreement (IA); series: New Load Curve, Original Load Curve]

Fig. 3. Histogram of index-of-agreement values for all customers, comparing
each customer's electricity use with the original load curve (dash-dot line) or the new
data-based load curve (solid line)

4 Discussion

The objective of this study was to apply feature-based clustering to the creation of
more accurate load curves for energy companies and to present an efficient way to take
advantage of new hourly measured electricity use data. Moreover, the aim was to
evaluate the agreement between customers' real electricity use and the original load
curves set by energy companies.
The results of this comparison showed clearly that the present load curves do not
correspond well to customers' real electricity use. The mean index of agreement for all
load curves was only 0.30 (the highest was 0.51), meaning that random values
would give almost the same accuracy. This was an expected result, because the
energy company had already noted that there might be problems with the
present load curves caused by changes in customer characteristics and behavior.
New hourly measured data open up a whole new approach to creating more accurate
load curves. The amount of available data is very large, which is why efficient
computational methods are needed. Customers can be grouped on the basis of their time
series data using feature-based clustering. In this case, a set of statistical features was
calculated and used in clustering with the k-means algorithm. The features were
calculated using a time window of one week (168 hours). The optimal number of clusters
was determined using the Davies-Bouldin index, resulting in 16 clusters. The new
data-based load curves
were calculated from the raw electricity use data according to each customer's cluster id.
Finally, the goodness of clustering was evaluated using the index of agreement
between the customers and each load curve. The results in this phase were satisfactory,
but further development is still needed. Nevertheless, the new data-based
load curves were more accurate than the present load curves set by the energy company.
The mean IA for all new load curves was 0.65 (the highest was 0.83), clearly showing the
improvement achieved by feature-based clustering.
Overall, feature-based clustering worked well for time series clustering, at least in
this kind of application, but the features and the time window for
feature calculation have to be selected carefully. Moreover, in applications concerning
electricity use, the data should cover a period of one year or several years. In this case,
the data covered only the winter season, which was sufficient for testing the
performance of the computational methods used. A longer period of data is needed for a
deeper understanding of customer electricity use, seasonality and consumption behavior.
Furthermore, only K-means was used for clustering; comparisons with
other clustering algorithms might yield some improvement in clustering accuracy.
The electric load in electricity distribution varies with time and place, and the
power production and distribution system must respond to the customers' load demand
at any time. This is the main reason why energy companies need accurate load
information for pricing and tariff planning, distribution network planning and
operation, power production planning, load management, customer service and billing,
and also for providing information to customers and public authorities [5]. The methods
presented in this paper, feature extraction and feature-based clustering, are suitable for
creating more accurate load curves from new hourly measured electricity use data
for small customers. The large number of customers and the amount
of raw data clearly raise new challenges, but on the other hand they offer a great
opportunity to use thousands of customers as the basis for load curve creation.
In this study we presented an approach capable of clustering large and complex time
series data using feature-based clustering. The features were selected so that the main
characteristics of the electricity use data, such as periodicity, average, standard
deviation and chaotic behavior, were captured. Furthermore, the performance of the
approach was tested using real-world data for over one thousand electricity customers.

5 Conclusions

This paper presents an efficient computational method for time series clustering and
its application to the creation of electricity use load curves for small customers.
The approach was based on extracting statistical features from time series
and using them in feature-based clustering of hourly measured electricity use data. The
performance of the approach was evaluated using data from 1035 real customers.
The feature-based clustering was able to cluster time series using just a set of
derived statistical features. The approach has three advantages: (1) it
reduces the dimensionality of the original time series, (2) it is less sensitive to missing
values and (3) it can handle time series of different lengths. In addition, the presented
approach resulted in more accurate load curves for this set of customers than the present
load curves set by the energy company.

Acknowledgements

This study was part of the ENETE project, a scientific collaboration between the Research
Group of Environmental Informatics (University of Kuopio), Savon Voima Oy and
Enfo Ltd. aimed at developing electricity distribution information systems and
intelligent services for customers. We would like to thank Mr. Eero Sinkko, Mr. Ari
Salovaara, Mr. Matti Huovinen, Mr. Ilkka Holmavuo and Mr. Sami Viiliäinen from
Savon Voima Oy, and also Mr. Harri Smolander and Mr. Jouko Kaihua from Enfo
Ltd., for providing experimental data, important technical information and guidance
during the research project.

References
1. Olier, I., Vellido, A.: Advances in Clustering and Visualization of Time Series Using GTM
Through Time. Neural Networks 21, 904–913 (2008)
2. Wang, X., Smith, K., Hyndman, R.: Characteristic-Based Clustering for Time Series Data.
Data Mining and Knowledge Discovery 13, 335–364 (2006)
3. Tsekouras, G.J., Kotoulas, P.B., Tsirekis, C.D., Dialynas, E.N., Hatziargyriou, N.D.: A
Pattern Recognition Methodology for Evaluation of Load Profiles and Typical Days of
Large Electricity Customers. Elect. Power Syst. Res. (2008),
doi:10.1016/j.epsr.2008.01.010
4. Chicco, G., Napoli, R., Piglione, F.: Comparisons Among Clustering Techniques for
Electricity Customer Classification. IEEE Transactions on Power Systems 21, 933–940
(2006)
5. Seppälä, A.: Load Research and Load Estimation in Electricity Distribution. VTT
Publications 289. Technical Research Centre of Finland
6. The European Parliament and The Council of the European Union. Directive 2006/32/EC
of the European Parliament and of the Council on Energy End-Use Efficiency and Energy
Service and Repealing Council Directive 93/76/EEC (2006)
7. Bartels, R., Fiebig, D.G.: Metering and Modeling Residential End-Use Electricity Load
Curves. Journal of Forecasting 15, 415–426 (1996)
8. Elkarmi, F.: Load Research as a Tool in Electric Power System Planning, Operation, and
Control - The Case of Jordan. Energy Policy 36, 1757–1763 (2008)
9. Liao, T.W.: Clustering of Time Series Data - A Survey. Pattern Recognition 38, 1857–1874
(2005)
10. Baek, J., Geehyuk, L., Wonbae, P., Byoung-Ju, Y.: Accelerometer Signal Processing for
User Activity Detection. In: Negoita, M.G., Howlett, R.J., Jain, L.C. (eds.) KES 2004.
LNCS (LNAI), vol. 3215, pp. 610–617. Springer, Heidelberg (2004)
11. Sprott, J.C.: Chaos and Time-Series Analysis. Oxford University Press, Oxford (2003)
12. Masters, T.: Neural, Novel & Hybrid Algorithms for Time Series Prediction. John Wiley
& Sons Inc., New York (1995)
13. Ravi, N., Dandekar, N., Mysore, P., Littman, M.L.: Activity Recognition from
Accelerometer Data. In: The Twentieth National Conference on Artificial Intelligence AAAI
2005. American Association for Artificial Intelligence, Stanford (2005)
14. Davies, D., Bouldin, D.: A Cluster Separation Measure. IEEE Transactions on Pattern
Analysis and Machine Intelligence 2, 224–227 (1979)
15. MacQueen, J.: Some Methods for Classification and Analysis of Multivariate
Observations. In: The Fifth Berkeley Symposium on Mathematical Statistics and
Probability, vol. 1, pp. 281–297. University of California Press, Berkeley (1967)
16. Willmott, C.: Some Comments on the Evaluation of Model Performance. Bulletin of
the American Meteorological Society 63, 1309–1313 (1982)
