Electricity Fraud Detection Using Committee Semi-Supervised Learning

Conference Paper · July 2018

DOI: 10.1109/IJCNN.2018.8489389

Joaquim L. Viegas∗, Nuno M. Cepeda†, Susana M. Vieira∗
∗ IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
† PowerData, Portugal
joaquim.viegas@tecnico.ulisboa.pt
Abstract—Electricity fraud results in significant losses to utilities. This paper proposes the use of a semi-supervised learning framework to derive an electricity fraud detector from data lacking information on the presence of fraud for the majority of samples. Utilities are only able to make a limited number of inspections, resulting in a lack of data representing cases of fraud. Using a co-training by committee semi-supervised learning framework, the detection performance is improved in comparison to the use of supervised models trained only with labeled data. The framework starts by training a random forest classifier on the labeled data. Next, the unlabeled data that the model can classify with the most confidence is added iteratively to the set of labeled samples, augmenting the data available for model training. The electricity fraud detector achieves a classification performance of 84% true positive rate, 11% false positive rate and 0.89 area under the receiver operating characteristic curve under a positive class balance of 5% and 90% unlabeled samples in the training data.

I. INTRODUCTION

Electricity fraud is the premeditated manipulation of grid equipment or utility systems with the objective of not paying the total amount owed to the utility for energy consumed. This type of theft results in significant losses to utilities across the world [1].

In developing economies, in which electricity fraud is most prevalent, it results in unsustainable situations in already fragile environments. For example, in Jamaica in 2013, non-technical losses (NTLs) amounted to US$46 million, accounting for 18% of the total fuel bill for the whole country [1]. Contrary to popular belief, electricity fraud also has a strong impact in developed economies. In the UK, losses due to electricity theft have been estimated at £173 million every year [2], [3]. In the US they may amount to $6 thousand million [4].

In recent years, with the adoption of the smart grid concept as the basis for electricity grid improvements, the widespread installation of smart meters (SMs) is being carried out in numerous countries. SMs, in contrast to conventional mechanical meters, have advanced communication capabilities, enabling utilities to remotely collect meter readings and to control tariffs and connections. Grid assets with advanced communications, such as SMs, extend the attack surface for electricity fraud attacks [5]. In the past, fraud was usually done through physical manipulation of mechanical meters or direct connections to transformers; nowadays fraudsters can hack meter firmware and communications to send false consumption data [6].

Data-based detection of electricity fraud and other types of non-technical losses has been a topic of growing interest in scientific literature [7]. Computational intelligence and machine learning techniques such as decision trees, neural nets, support vector machines and fuzzy models have been used in supervised classification schemes to identify suspect consumer data [8], [9], [10], [11], [12]. Other studies apply unsupervised learning approaches, identifying consumers with irregular behavior as potential fraudsters [6], [13].

One of the most commonly identified challenges in data-based detection of electricity fraud is the lack of ground truth in utilities' datasets [6], [14]. Electricity fraud is normally verified through the direct inspection of consumption endpoints and grid assets by utility employees, meaning only a fraction of consumers have associated information indicating if they should be labeled as fraudulent or legitimate for the training of classification models. To mitigate this challenge, this paper proposes to use a semi-supervised learning framework to develop classification models from a combination of labeled and unlabeled data, which is novel in the field of electricity fraud detection.

This paper proposes the use of the co-training by committee (CoBC) semi-supervised learning framework for electricity fraud detection. First, the proposed detection approach transforms consumers' smart metering data and information into features suitable for classification. Second, the approach iteratively trains a random forest (RF) classifier with data containing a majority of unlabeled samples, improving performance in comparison to the case of supervised learning.

The paper is structured as follows: Section II explains the electricity fraud problem and describes the ways it can affect electricity consumption data. Section III describes the feature engineering process used to derive indicators suitable for classification in this application. Section IV explains the semi-supervised learning approach used. Section V presents the dataset used for evaluation and the results achieved. Finally, Section VI presents the main conclusions.

II. ELECTRICITY FRAUD THREAT MODEL

To describe the threat model and features, the following notation is adopted: the proposed electricity fraud detection scheme works on a smart metering dataset with $N$ consumers. $m_i \in \mathbb{R}^n$ are the energy consumption readings from consumer $i$ and $n = r \times n_d$, with $r$ equal to the number of readings per
day and $n_d$ the total number of days in the dataset. Assuming an hourly resolution, the notation is simplified in the following way: $m_i^{d,h}$ is the consumption in hour $h$ of day $d$ for consumer $i$. $m_i^d = (m_i^{d,1}, m_i^{d,2}, \ldots, m_i^{d,24})$ is the 24 hour vector of consumption data in day $d$ for consumer $i$.

In the context of electricity fraud detection and mitigation, the threat model describes the possible ways consumers can act to avoid paying the full amount the utility charges for the electricity they use. The attack vectors are the different mechanisms consumers may use to this end. The threat model considered in this work is adapted from the one proposed in [6], which presents similarities to the attacks proposed in [15].

Most commonly, the reported sources of NTLs are fraud through meter manipulation, tapping of distribution lines and non-payment. Problems in meters and utility systems, which result in incorrect measurements, and cases of collusion with utility employees have also been reported [1].

The detection scheme proposed considers a smart grid environment, including the presence of an advanced metering infrastructure (AMI), composed of SMs at the consumption endpoints, enabling automated meter readings (AMR). SMs, in contrast to conventional mechanical meters, have advanced communication capabilities, automatically send energy consumption readings to the utility and enable remote control. The attack surface related to electricity fraud is considered to be growing with this equipment: new data attacks and vulnerabilities, such as the possibility of sending false readings, appear with the use of SMs [16], [6]. Current literature on detection of electricity fraud is giving increased importance to this issue [17], [18].

It is assumed that most types of electricity fraud can be detected through the analysis of consumption data. The theft of electricity should result in a reduced, null or irregular reported consumption (e.g. if a consumer bypasses the meter to connect a piece of equipment, reported consumption will be lower than expected). When dealing with conventional mechanical meters, electricity theft results in reduced or zero consumption and can be detected through straightforward methods such as slope analysis and rule-based systems [14]. If advanced attackers are considered, fraud may be done through sending false consumption data (e.g. through meter hacking) that seems legitimate [19], [6].

As there are no publicly available examples of these advanced types of electricity theft, a number of different attacks are considered to test the detection scheme. These are proposed in [6]:
• Random constant reduction of consumption ($h_1$);
• Drop of consumption to zero during a random period of the day ($h_2$);
• Random hourly reduction of consumption ($h_3$);
• Random hourly consumption pattern with reduced average consumption ($h_4$);
• Constant hourly consumption equal to the average ($h_5$);
• Reversed hourly consumption: switch consumption of hour 1 with hour 24, etc. ($h_6$).

The following equations describe the way an attack starting on day $d$ by consumer $i$ affects his consumption data. These are used to generate the synthetic attacks used to test the proposed scheme:
• $h_1(m_i^{d,h}) = \alpha m_i^{d,h}$, $\alpha = \mathrm{random}(0.1, 0.8)$
• $h_2(m_i^{d,h}) = \beta_h m_i^{d,h}$, with $\beta_h = \begin{cases} 0, & h_{start} < h < h_{end} \\ 1, & \text{else} \end{cases}$, $h_{start} = \mathrm{random}(0, 19)$, $\delta = \mathrm{random}(4, 24)$, $h_{end} = h_{start} + \delta$
• $h_3(m_i^{d,h}) = \gamma_h m_i^{d,h}$, $\gamma_h = \mathrm{random}(0.1, 0.8)$
• $h_4(m_i^{d,h}) = \gamma_h \mu(m_i^d)$, $\gamma_h = \mathrm{random}(0.1, 0.8)$
• $h_5(m_i^{d,h}) = \mu(m_i^d)$
• $h_6(m_i^{d,h}) = m_i^{d,24-h}$

III. ELECTRICITY CONSUMPTION FEATURES

Techniques for theft detection found in the literature usually make use of raw consumption data [19], [6]. In order to mitigate difficulties dealing with variable length time-series data and reduce dimensionality, this paper makes use of four different features, proposed in [13], that enable detection of the possible types of attacks identified.

The consumption data characteristics translated by each one of the features are the following:
• $I_1$: Indicator of consumption variation. Ratio between current and past consumption;
• $I_2^e$, $I_2^c$: Indicators of hourly consumption pattern change. Relate the hourly pattern of a day with the mean hourly pattern of the past. $I_2^e$ uses the Euclidean distance, so changes in absolute consumption are the most relevant for the indicator. $I_2^c$ uses the Pearson correlation, so changes of dynamic can be detected;
• $I_3$: Indicator of consumption difference in comparison to consumers with similar characteristics. Compares the average consumption;
• $I_4^e$, $I_4^c$: Indicators of hourly consumption pattern difference in comparison to consumers with similar characteristics. Relate the mean hourly consumption between consumers. $I_4^e$ uses the Euclidean distance and $I_4^c$ uses the Pearson correlation.

The indicator of consumption variation $I_1$ is a ratio between the consumption of the last $\alpha$ days and the mean consumption of the last $\beta$ periods of $\alpha$ days.

$$I_1(i, d) = \frac{\sum_{j=1}^{\alpha} \sum_{k=1}^{24} m_i^{d-j,k}}{\frac{1}{\beta} \sum_{l=1}^{\beta} \sum_{j=1}^{\alpha} \sum_{k=1}^{24} m_i^{d-(l+1)j,k}} \tag{1}$$

The indicators of hourly consumption pattern change $I_2^\nu$ relate the hourly pattern of a day with the mean hourly pattern of the $\alpha$ days before. If $\nu$ is the Euclidean distance ($\nu = e$), changes in absolute consumption will be the most relevant for the indicator. If $\nu$ is the Pearson correlation ($\nu = c$), changes of dynamic can be detected.

$$I_2^\nu(i, d) = \nu\left(m_i^d, \mu(m_i^{d-1-\alpha}, \ldots, m_i^{d-1})\right) \tag{2}$$
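As an illustration of the attack functions $h_1$ to $h_6$ from Section II and of the indicators $I_1$ and $I_2$ defined above, the following is a minimal NumPy sketch, not the authors' implementation. Several details are assumptions made for the example: `random(a, b)` is read as a uniform draw (an integer draw with both ends included for $h_{start}$ and $\delta$), $h_6$ is implemented as a full reversal of the day, consumption is stored as an array of shape `(n_days, 24)`, and the toy data are invented. The index in the denominator of `i1` follows Eq. (1) as printed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# --- Synthetic attacks on one day of 24 hourly readings (Section II) ---
def h1(day):
    return rng.uniform(0.1, 0.8) * day              # one factor for the whole day

def h2(day):
    h_start = rng.integers(0, 20)                   # assumed: integer in 0..19
    h_end = h_start + rng.integers(4, 25)           # assumed: delta integer in 4..24
    hours = np.arange(1, 25)
    beta = np.where((hours > h_start) & (hours < h_end), 0.0, 1.0)
    return beta * day                               # zero strictly between start and end

def h3(day):
    return rng.uniform(0.1, 0.8, size=24) * day     # one factor per hour

def h4(day):
    return rng.uniform(0.1, 0.8, size=24) * day.mean()

def h5(day):
    return np.full(24, day.mean())                  # flat profile at the daily mean

def h6(day):
    return day[::-1].copy()                         # assumed: hour 1 <-> 24, 2 <-> 23, ...

# --- Indicators I1 and I2 for an array m of shape (n_days, 24) ---
def i1(m, d, alpha=1, beta=5):
    num = sum(m[d - j].sum() for j in range(1, alpha + 1))
    den = sum(m[d - (l + 1) * j].sum()
              for l in range(1, beta + 1) for j in range(1, alpha + 1)) / beta
    return num / den

def i2(m, d, alpha=5, kind="e"):
    past_mean = m[d - 1 - alpha:d].mean(axis=0)     # days d-1-alpha .. d-1
    if kind == "e":
        return float(np.linalg.norm(m[d] - past_mean))
    return float(np.corrcoef(m[d], past_mean)[0, 1])

# Toy data (illustrative only): 12 identical days with a smooth daily profile.
base = 1.0 + 0.5 * np.sin(np.linspace(0.0, np.pi, 24, endpoint=False))
legit = np.tile(base, (12, 1))
attacked = legit.copy()
attacked[10] = h1(legit[10])                        # attack on day index 10

print(i1(legit, 11), i1(attacked, 11))              # ratio drops under h1
print(i2(legit, 10, kind="e"), i2(attacked, 10, kind="e"))  # distance grows under h1
```

With identical legitimate days the ratio `i1` is 1 and the Euclidean `i2` is 0, so any departure introduced by an attack is easy to see in this toy setting; real consumption data would of course be noisier.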
$I_3$ is the indicator of consumption difference in comparison to the consumers $r \in R$ with the greatest similarity. It compares the mean consumption of the last $\beta$ days to the mean consumption for the same days for the consumers with the most similar characteristics. $R$ are the $\tau$ consumers in $\{1, 2, \ldots, N\}$ whose characteristics $s_r$ are closest to $s_i$. The similarity between consumers $r$ and $i$ is measured by $\nu(s_r, s_i)$, with $\nu$ being the Euclidean distance, so the lowest values indicate the most similar consumers.

$$I_3(i, d) = \frac{\frac{1}{\beta} \sum_{l=1}^{\beta} \sum_{k=1}^{24} m_i^{d-l-\beta-1,k}}{\mu\left(\left\{\frac{1}{\beta} \sum_{l=1}^{\beta} \sum_{k=1}^{24} m_r^{d-l-\beta-1,k} \;\forall r \in R\right\}\right)} \tag{3}$$

$I_4^e$ and $I_4^c$ are the indicators of hourly consumption pattern difference in comparison to consumers with the greatest similarity. $I_4^\nu$ relates the mean hourly consumption of the last $\alpha$ days between consumers. If $\nu$ is the Euclidean distance ($\nu = e$), changes in absolute consumption will be the most relevant for the indicator. If $\nu$ is the Pearson correlation ($\nu = c$), changes of dynamic can be detected.

$$I_4^\nu(i, d) = \nu\left(\mu(m_i^{d-\alpha}, \ldots, m_i^d), \mu\left(\left\{(m_r^{d-\alpha}, \ldots, m_r^d) \;\forall r \in R\right\}\right)\right) \tag{4}$$

IV. COMMITTEE SEMI-SUPERVISED LEARNING MODEL

Similarly to many other potential real-world applications of modeling for classification, obtaining a labeled dataset for electricity fraud detection is costly and time-consuming. Utilities can only inspect a limited number of electricity consumers every year, resulting in a lack of trustworthy information on the presence of electricity fraud for most consumers. Semi-supervised learning attempts to leverage unlabeled data to improve performance in comparison to supervised learning, either modifying or reprioritizing the hypothesis obtained from labeled samples alone [20].

A semi-supervised learning framework for electricity fraud detection works on a dataset of features $X$, which represents the consumption characteristics of consumers, and the labels $Y$ indicating if they are fraudulent, legitimate or information on fraud is missing. The approach used in this paper is designed to deal with data where a majority of samples have no labels.

Let $x_{id}$ be the feature vector of consumer $i$ in day $d$. $x_{id} = (I_1, I_2^e, I_2^c, I_3, I_4^e, I_4^c)$ is composed of the indicators presented in Section III. $X$, with each $x_{id} \in \mathbb{R}^6$, is the feature dataset for $N$ consumers composed of the indicators of $n_d$ days:

$$X = (x_{11}, x_{12}, \ldots, x_{1 n_d}, x_{21}, \ldots, x_{2 n_d}, \ldots, x_{N1}, \ldots, x_{N n_d}) \tag{5}$$

$Y$ are the labels related to the feature dataset $X$. For each feature vector $x_{id}$, $y_{id}$ is equal to 1, 0 or $\emptyset$ if, respectively, the consumer is fraudulent (positive label), legitimate (negative label) or his fraud status is unknown. The samples with $y = 0$ or $y = 1$ are referred to as labeled and the ones with $y = \emptyset$ as unlabeled.

The positive class balance refers to the percentage of samples with fraud in a dataset with no missing labels, which is assumed in this paper to be equal to the prior probability of the positive label class.

A. Co-Training by Committee

The CoBC framework is based on the popular semi-supervised learning paradigm of co-training [21], in which multiple classifiers are trained on different sets of features, and on the recent success of ensemble learning models. This separation of the feature space is also referred to in the literature as multi-view, and is believed to result in reduced generalization errors due to non-independent features [20].

We propose to use CoBC with a bagging ensemble learning algorithm for detection of electricity fraud. The learning algorithm used is known as Random Forests (RF), which creates a bagging ensemble of decision tree classifiers that achieves improved accuracy and generalization in comparison to the use of a single tree [22]. Details on the RF algorithm used are described in Section IV-B.

The pseudo-code of Algorithm 1 describes the framework used to obtain the committee model for electricity fraud detection; it is based on the one proposed in [21]. It starts by training an RF model $M$ with the set of labeled samples, which is applied to a randomly chosen set of unlabeled samples $U'$ of dimension poolsize. In the next step, the samples with predictions of highest confidence are selected and added to the labeled set $L$ with labels predicted by the model $M$. The selection of samples to label is proportional to the prior probability of each class $P_j$ (e.g. if the problem is binary, with classes positive and negative, assuming a positive class balance and $P_+$ of 5%, 95% of selected samples are the ones with highest confidence of being negative). These steps are repeated, iteratively adding new unlabeled samples to the labeled set until all training samples are labeled or the maximum-iterations stopping criterion is met. In the end, the final committee model is trained with the initial and new labeled training samples.

B. Random forests

The base committee models used in the semi-supervised learning framework are RFs [23]. These are bagging ensemble models that combine decision trees (DTs), each generated from random data samples so as to maximize the generalization capabilities. The decision trees that make up the ensembles are classification and regression trees (CART). The RF, used as the committee model in the CoBC framework, follows the specifics presented in [22] and its implementation is part of Scikit-learn [24].

C. Electricity Fraud Detection Model

Following the framework described, the electricity fraud detection model is obtained and used as pictured in Figure 1. Data samples of electricity consumers, including ones that are unlabeled, are transformed through feature engineering to obtain relevant variables for model training. The RF CoBC framework leverages this data, generating the final committee model and labels for unlabeled training data. The final committee model can then be used to infer the presence of fraud on the data samples of consumers under analysis (testing).
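The self-labeling loop of the RF CoBC framework (Algorithm 1) can be sketched with Scikit-learn's `RandomForestClassifier`. This is a hedged illustration, not the authors' code. The function name `cobc_rf`, its argument names, the default values and the toy data are all assumptions. The sketch assumes binary labels 0/1 with both classes present in the labeled set, and it assigns pseudo-labels by class quota (the highest predicted fraud probabilities become positives, the lowest become negatives), which is a simplification of "labels predicted by the model".

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cobc_rf(X_lab, y_lab, X_unl, prior_pos=0.05, poolsize=600, f=0.3,
            max_iter=10, n_trees=50, max_depth=10, seed=0):
    """Iteratively move confidently classified unlabeled samples into the labeled set."""
    rng = np.random.default_rng(seed)
    L_X, L_y, U = X_lab.copy(), y_lab.copy(), X_unl.copy()

    def fit(X, y):
        return RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth,
                                      random_state=seed).fit(X, y)

    for _ in range(max_iter):
        if len(U) == 0:                                  # every sample is labeled
            break
        model = fit(L_X, L_y)
        pool = rng.choice(len(U), size=min(poolsize, len(U)), replace=False)
        pos_col = list(model.classes_).index(1)          # column of P(y = 1)
        p_pos = model.predict_proba(U[pool])[:, pos_col]

        n_sel = max(1, int(round(f * len(pool))))        # samples to label this round
        n_pos = int(round(prior_pos * n_sel))            # quota proportional to the prior
        n_neg = n_sel - n_pos
        order = np.argsort(p_pos)                        # ascending P(fraud)
        neg_idx = order[:n_neg]                          # most confident negatives
        pos_idx = order[len(order) - n_pos:]             # most confident positives
        chosen = pool[np.concatenate([neg_idx, pos_idx])]
        new_y = np.concatenate([np.zeros(n_neg, dtype=L_y.dtype),
                                np.ones(n_pos, dtype=L_y.dtype)])

        L_X = np.vstack([L_X, U[chosen]])                # L = L ∪ S
        L_y = np.concatenate([L_y, new_y])
        U = np.delete(U, chosen, axis=0)                 # U = U − S

    return fit(L_X, L_y), L_X, L_y                       # final committee model

# Toy usage (illustrative only): 950 "legitimate" and 50 "fraud" points in 6 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (950, 6)), rng.normal(2.5, 1.0, (50, 6))])
y = np.array([0] * 950 + [1] * 50)
lab = np.r_[0:90, 950:960]                               # 100 labeled samples, both classes
unl = np.setdiff1d(np.arange(len(y)), lab)               # the other 900 are treated as unlabeled
model, L_X, L_y = cobc_rf(X[lab], y[lab], X[unl], poolsize=200, f=0.3, max_iter=3)
print(L_X.shape, int(L_y.sum()))
```

Selecting by quota keeps the class balance of the growing labeled set close to the assumed prior, which is the property the text above describes; a stricter variant would keep only pool samples whose predicted label agrees with the quota slot they fill.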
Algorithm 1 RF CoBC Framework
Require: $L$ - set of labeled training samples
    $U$ - set of unlabeled training samples
    $RF$ - RF learning algorithm
    $r_n$ - number of trees to train in RF algorithm
    $r_d$ - max depth of trees
    $r_p$ - dimension of the feature subset of trees
    poolsize - number of unlabeled samples in pool
    $f$ - fraction of unlabeled samples in pool to label
    $P_j$ - prior probability of class $j$
    $T$ - maximum number of iterations
$i = 1$
while $U \neq \emptyset$ and $i \leq T$ do
    $M \leftarrow RF(L, r_n, r_d, r_p)$
    Create set $U'$ of poolsize samples chosen randomly from $U$ without replacement
    Predict the labels for $U'$ with model $M$
    Select the subset $S$ of poolsize $\times f$ samples, containing the most confident predictions, with a class balance proportional to $P_j$
    Set $U = U - S$ and $L = L \cup S$
    $i = i + 1$
end while
$M \leftarrow RF(L, r_n, r_d, r_p)$
return model $M$

[Fig. 1 diagram: data samples for training (consumption data and consumer characteristics, plus label information, including missing data for unlabeled samples) pass through feature engineering to yield electricity consumption features; Random Forest Co-Training by Committee produces the committee model and predicted labels for unlabeled training samples; the committee model applied to data samples for testing yields predicted labels for testing samples.]
Fig. 1: Diagram for RF CoBC.

V. EXPERIMENTAL RESULTS

This section presents the dataset used for evaluation and the results achieved by the proposed detection model. The use case is similar to the one presented in [13], with the major difference being the simulation of different percentages of labeled samples available for training and of different positive class balances. A supervised RF model is trained exclusively on the labeled data and evaluated for comparison.

A. Dataset

The proposed detection model is tested using a dataset based on real consumption readings from 4232 Irish households. The dataset consists of electricity consumption data logged at 30 minute intervals, for one and a half years, and surveys answered before the start of monitoring. This dataset is a result of the electricity customer behavior trial by the Irish Commission for Energy Regulation (CER) [25].

The dataset, to include synthetic electricity fraud attack samples, is generated through the following steps:
1) Select five random working days from each one of the four seasons;
2) For every consumer and every selected day:
    a) Generate six synthetic consumption curves for the attacks presented in Section II;
    b) Compute the features presented in Section III for the legitimate and attack consumption curves.

The survey questions used to select similar consumers for calculation of indicators $I_3$, $I_4^e$ and $I_4^c$ are: age, employment status, social class, number of adults in household, number of children and type of home. Only consumers without missing data are used (2515). Following the aforementioned steps, the resulting evaluation dataset has a total of 352100 samples (2515 consumers × 20 days × 7 consumption curves, one legitimate and six attack curves).

To achieve the percentages of labeled samples in training and positive class balances selected for simulation, data is re-sampled without replacement and labels are removed.

B. Parameters

Three different types of parameters are described. The first is for the features presented in Section III. The second is for the dataset characteristics, including positive class balance and percentage of labeled training samples. The third is for the RF CoBC framework presented in Section IV.

For calculation of the features:
• $I_1$: $\alpha = 1$ and $\beta = 5$ (indicator relates the consumption of the past day with the 5 days before);
• $I_2^e$ and $I_2^c$: $\alpha = 5$ (indicators relate the pattern of the past 24 hours with the mean pattern of the 5 days before);
• $I_3$: $\tau = 10$ and $\beta = 5$ (indicator relates the consumption of the past 5 days with the 10 most similar consumers);
• $I_4^e$ and $I_4^c$: $\tau = 10$ and $\alpha = 5$ (indicators relate the mean hourly pattern of consumption of the last 5 days with the 10 most similar consumers).

The dataset and training samples were re-sampled to test multiple scenarios, resulting from the variation of positive
class balance between 5%, 10% and 20% and percentage of labeled training samples between 10%, 20% and 30%.

Regarding the RF CoBC parameters, the RF parameters were selected through grid-search, maximizing AUC in three-fold cross-validation, with the following search sets for each parameter: $r_n \in \{50, 100, 250\}$, $r_d \in \{5, 10, 15\}$, $r_p \in \{2, 4, 6\}$. The CoBC learning framework parameters are also set by grid-search: poolsize $\in \{100, 200, 400, 600, 1000\}$, $f \in \{0.1, 0.2, 0.3\}$, $T \in \{10, 20, 40, 80, \infty\}$.

To evaluate the proposed model under different configurations, class balances and percentages of unlabeled samples, 5-fold cross-validation (CV) was used for each combination of parameters, dividing the dataset in 5 and, for each one of the 5 folds, evaluating the performance of the model using 1 part when trained with the remaining 4. The performance can be summarized by the mean over the 5 folds. This method is considered the most conservative but may result in high variance if the dataset is not large enough [26]. The performance measures used are the following:
• True Positive Rate (TPR): Ratio between correctly classified cases of theft and all theft samples;
• False Positive Rate (FPR): Ratio between incorrectly classified benign samples and all benign samples;
• TPR-FPR: Difference between the last two measures;
• Area Under the Receiver Operating Characteristic Curve (AUC) [27]: translates the performance of a classification technique independently of the threshold used (suitable when dealing with unbalanced data).

As the model output is continuous between 0 and 1, a threshold has to be selected above which samples are labeled positive. The selected threshold $t_r$ is the one that maximizes $TPR - FPR$.

C. Results

The results from evaluation of the RF CoBC framework and supervised RF are listed in Table I. L (%) and B (%) are, respectively, the percentage of labeled training samples and positive class balance for the different scenarios. The values of TPR, FPR, TPR-FPR and AUC are the average of these measures on the testing data of five folds for the best found CoBC configuration (poolsize = 600, $f = 0.3$, $T = 10$).

Under low positive class balance (5%) and low percentage of labeled training samples (10%) the semi-supervised learning framework outperforms the supervised approach ($TPR - FPR = 0.73$ vs $0.68$, $AUC = 0.89$ vs $0.88$). This slight performance increase is also verified in scenarios with smaller class imbalance.

In scenarios with a higher percentage of labeled samples (20% and 30%), the semi-supervised approach is not able to outperform the supervised model. Either the CoBC framework is unable to leverage the unlabeled data to improve the RF classification performance in these cases, or the number of labeled samples is sufficient and a ceiling on performance is reached.

The performance of the semi-supervised approach on training data is referred to as transductive performance, and on testing data as inductive performance [20]. Figure 2 pictures the average model performance for varying percentages of labeled samples. The better performance of the CoBC (inductive, in blue) compared to the supervised (orange) approach is visible for the lower percentage of labeled samples.

TABLE I: Performance of RF CoBC and supervised RF on testing data

Model        L (%)   B (%)   TPR    FPR    TPR - FPR   AUC
CoBC         10      5       0.84   0.11   0.73        0.89
CoBC         10      10      0.88   0.11   0.78        0.93
CoBC         10      20      0.87   0.12   0.75        0.92
CoBC         20      5       0.85   0.09   0.75        0.91
CoBC         20      10      0.86   0.12   0.74        0.90
CoBC         20      20      0.90   0.13   0.77        0.93
CoBC         30      5       0.87   0.09   0.78        0.91
CoBC         30      10      0.89   0.11   0.78        0.93
CoBC         30      20      0.89   0.11   0.77        0.94
Supervised   10      5       0.84   0.16   0.68        0.88
Supervised   10      10      0.89   0.12   0.77        0.92
Supervised   10      20      0.88   0.12   0.76        0.92
Supervised   20      5       0.87   0.11   0.76        0.92
Supervised   20      10      0.87   0.11   0.76        0.91
Supervised   20      20      0.91   0.14   0.77        0.93
Supervised   30      5       0.90   0.12   0.78        0.93
Supervised   30      10      0.88   0.11   0.78        0.92
Supervised   30      20      0.90   0.12   0.78        0.94

[Fig. 2 plot: TPR - FPR versus labeled samples (%) at 10, 20 and 30, with curves for inductive, inductive (supervised) and transductive performance.]
Fig. 2: Evolution of performance with percentage of labeled training samples (assuming a 5% positive class balance).

VI. CONCLUSIONS

This paper proposes the use of a semi-supervised learning framework for the detection of electricity consumers committing fraud. Due to the difficulties involved in obtaining fully labeled datasets in this field, semi-supervised learning can improve the detection performance of typically used supervised learning approaches by leveraging unlabeled data.

Based on the co-training by committee framework, random forest detection models are developed and evaluated on a dataset based on real data. Scenarios are created for 10%, 20% and 30% of labeled data in the training set. The performance evaluation results show that the proposed approach can achieve a good classification performance of 84% true positive rate, 11% false positive rate and 0.89
area under the receiver operating characteristic curve under a positive class balance of 5% and 90% unlabeled samples in the training data. With supervised learning and using the same base model, a performance of 84% true positive rate, 16% false positive rate and 0.88 area under the curve is obtained.

Performance of the proposed approach matches that of supervised learning for higher percentages of labeled samples, strongly suggesting improvements can be achieved through the optimization of the framework used or application of more sophisticated approaches. Possible improvements could be obtained through a more thorough testing of parameters and use of generative models such as deep neural nets [28].

ACKNOWLEDGMENT

This work was supported by FCT, through IDMEC, under LAETA, project UID/EMS/50022/2013 and SusCity (MITP-TB/CS/0026/2013). The work of J. L. Viegas was supported by the PhD in Industry Scholarship SFRH/BDE/95414/2013 from FCT and Novabase. S. M. Vieira acknowledges support by Program Investigador FCT (IF/00833/2014) from FCT, co-funded by the European Social Fund (ESF) through the Operational Program Human Potential (POPH).

REFERENCES

[1] F. B. Lewis, “Costly ‘throw-ups’: electricity theft and power disruptions,” The Electricity Journal, vol. 28, no. 7, pp. 118–135, 2015.
[2] United Kingdom Revenue Protection Association (UKRPA), “Frequently Asked Questions: How much energy is stolen?”
[3] IBM, “Energy theft: incentives to change,” Tech. Rep., 2012.
[4] Energy Association of Pennsylvania, “Energy Theft Kills, Costs Innocent Pennsylvanians Millions,” 2007. [Online]. Available: http://www.paenvironmentdigest.com/newsletter/docs/3/11-02-2007 505734.pdf
[5] R. Jiang, R. Lu, Y. Wang, J. Luo, C. Shen, and X. S. Shen, “Energy-theft detection issues for advanced metering infrastructure in smart grid,” Tsinghua Science and Technology, vol. 19, no. 2, pp. 105–120, 2014.
[6] P. Jokar, N. Arianpoo, and V. C. M. Leung, “Electricity Theft Detection in AMI Using Customers’ Consumption Patterns,” IEEE Trans. on Smart Grid, vol. 7, no. 1, pp. 216–226, 2016.
[7] J. L. Viegas, P. R. Esteves, R. Melício, V. M. F. Mendes, and S. M. Vieira, “Solutions for detection of non-technical losses in the electricity grid: A review,” Renewable and Sustainable Energy Reviews, vol. 80, no. December, pp. 1256–1268, 2017.
[8] J. R. Filho, E. M. Gontijo, A. C. Delaíba, E. Mazina, J. E. Cabral, and J. O. P. Pinto, “Fraud identification in electricity company costumers using decision tree,” in IEEE Int. Conf. on Systems, Man and Cybernetics, vol. 4, 2004, pp. 3730–3734.
[9] G. M. Soares, A. G. B. Almeida, R. M. Mendes, E. C. Teixeira, H. A. C. Braga, and J. G. P. Filho, “Performance evaluation of a sensor-based system devised to minimize commercial losses in street lighting networks,” in IEEE Int. Instrumentation and Measurement Technology Conference (I2MTC), 2014.
[10] A. Aravkin and M. Wolf, “Analytics for understanding customer behavior in the energy and utility industry,” IBM Journal of Research and Development, vol. 60, no. 1, pp. 1–13, 2016.
[11] C. W. T. Y. Guo and P. Jirutitijaroen, “Online Data Validation for Distribution Operations Against Cybertampering,” IEEE Transactions on Power Systems, vol. 29, no. 2, pp. 550–560, 2014.
[12] P. Glauner, J. A. Meira, P. Valtchev, R. State, and F. Bettinger, “The Challenge of Non-Technical Loss Detection using Artificial Intelligence: A Survey,” International Journal of Computational Intelligence Systems, vol. 10, pp. 760–775, 2017.
[13] J. L. Viegas and S. M. Vieira, “Clustering-based Novelty Detection to Uncover Electricity Theft,” in Proc. of the 2017 IEEE International Conference on Fuzzy Systems (FuzzIEEE), 2017.
[14] I. Monedero, F. Biscarri, C. León, J. I. Guerrero, J. Biscarri, and R. Millán, “Detection of frauds and other non-technical losses in a power utility using Pearson coefficient, Bayesian networks and decision trees,” International Journal of Electrical Power & Energy Systems, vol. 34, no. 1, pp. 90–98, jan 2012.
[15] V. B. Krishna, G. A. Weaver, and W. H. Sanders, “PCA-based method for detecting integrity attacks on advanced metering infrastructure,” in International Conference on Quantitative Evaluation of Systems. Springer, 2015, pp. 70–85.
[16] S. McLaughlin, D. Podkuiko, and P. McDaniel, “Energy theft in the advanced metering infrastructure,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6027 LNCS, pp. 176–187, 2010.
[17] S. McLaughlin, B. Holbert, A. Fawaz, R. Berthier, and S. Zonouz, “A multi-sensor energy theft detection framework for advanced metering infrastructures,” in 2012 IEEE Third International Conference on Smart Grid Communications (SmartGridComm), vol. 31, no. 7, 2013, pp. 1319–1330.
[18] J. B. Leite and J. R. S. Mantovani, “Detecting and Locating Non-technical Losses in Modern Distribution Networks,” vol. 3053, pp. 1–11, 2016.
[19] D. Mashima and A. A. Cárdenas, “Evaluating electricity theft detectors in smart grid networks,” Lecture Notes in Computer Science, vol. 7462, pp. 210–229, 2012.
[20] “Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study,” Knowledge and Information Systems, pp. 1–40, 2013.
[21] M. F. A. Hady and F. Schwenker, “Co-Training by Committee: A new semi-supervised learning framework,” Proceedings - IEEE International Conference on Data Mining Workshops, ICDM Workshops 2008, pp. 563–572, 2008.
[22] G. Louppe, “Understanding Random Forests: From theory to practice,” PhD dissertation, University of Liège - Faculty of Applied Sciences, 2014.
[23] L. Breiman, “Random forests,” Machine learning, pp. 5–32, 2001.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[25] ISSDA, “Data from the Commission for Energy Regulation - www.ucd.ie/issda.”
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition ed. Springer Series in Statistics, 2008.
[27] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve.” Radiology, vol. 143, no. 4, pp. 29–36, 1982.
[28] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised Learning with Deep Generative Models,” Proceedings of Neural Information Processing Systems (NIPS) 2014, 2014.