Group Assignment Group 2

MASTER OF DATA SCIENCE (SEMESTER 1 – 2022/2023)
FACULTY OF COMPUTER SCIENCE & INFORMATION TECHNOLOGY
WQD7005 DATA MINING
GROUP ASSIGNMENT
EVALUATING TELCO CAMPAIGN PERFORMANCE AND PREDICTING
CAMPAIGN OFFER TAKERS
GROUP MEMBERS
S2145827 KHAR SHIN YIN
S2132380 LEE SHIN EE
17088226 OOI SHI YUAN
S2102170 SIM LIN ZHENG
S2121801 YU YUEN HERN
INSTRUCTOR: PROF DR TEH YING WAH

DATE: 15th DECEMBER 2022
WQD7005 Group Assignment
Content Table
1 Introduction
2 Our Dataset
2.1 Dataset Description
3 Business Understanding
3.1 Analysis Goal
3.2 Analysis Data
4 Methodology
4.1 SEMMA Description
4.2 SEMMA Process
5 Results
5.1 Sample
5.1.1 Metadata
5.1.2 Reclassification of the Role and Level of the Variables
5.2 Explore
5.2.1 Summary Statistics
5.2.2 Univariate Analysis
5.2.3 Bivariate Analysis – Variable Association
5.2.4 Multivariate Analysis
5.2.5 Interesting Visualization
6 Conclusion
7 Appendix
1 Introduction
Marketing analytics refers to the study of customer data to evaluate and devise marketing
activities, and it has been widely adopted by businesses across the globe (SAS, 2022).
In-depth analysis of customer data can help businesses understand the factors that drive
consumer action, enhance their marketing strategies, and maximise the return on their
marketing investments.
In this project, an actual corporate marketing dataset from a pilot campaign launched by a
telecommunication company was used. The company is referred to herein as Kation (a made-up
name) to protect highly sensitive corporate information. Kation is one of the largest
telecommunication companies in Malaysia and employs customer-centric solutions to deliver
seamless, consistent, and excellent customer experiences through value-for-money offerings
and rate plans.
Kation recently proposed migrating its existing telco platform to a more advanced platform
to improve user and customer experience. To optimise the platform migration, Kation
introduced an initiative called “Right Planning”, which aimed to migrate every customer from
an old rate plan to a newer rate plan at a lower cost with better benefits. With “Right
Planning”, Kation could improve customer experience through new rate plans with better
benefits and offers while removing outdated rate plans and standardising the rate plan
information stored in the new platform.
Before launching “Right Planning” officially to Kation’s seven million customers, the
Base Management Team in Kation launched a pilot campaign with a small group of customers
to assess the effectiveness of the campaign proposed.
2 Our Dataset
2.1 Dataset Description
The dataset is a set of campaign data from a campaign launched between 5th October 2022 and
11th October 2022. It is secondary data provided by the base management team of Kation, one
of the largest telecommunication companies in Malaysia. There are 7272 records in total, and
the target base is a group of prepaid Malaysian customers with silent status. The dataset
contains the target base’s demographic information, usage, and revenue activity before and
after the campaign. Table 2.1 shows the variable descriptions.
Table 2.1: Variables’ description
Variable                Description
ID                      Customer ID
TENURE                  Customer duration with Kation since registration date
AGE                     Customer age
GENDER                  Customer gender
NATIONALITY             Customer nationality
STATE                   Customer hometown (state)
STATUS_BEFORE           Customer status before the campaign launched
STATUS_AFTER            Customer status after the campaign ended
OFFER_TAKER             Indicator for customers who opted in to the migration plan
OFFER_TAKE_UP_DT        Date on which customers opted in to the migration plan
DATA_PURC_BEFORE        Indicator for customers who purchased data before the campaign launched
DATA_PURC_AFTER         Indicator for customers who purchased data after the campaign ended
DATA_CHRG_BEFORE        Total amount of data charged before the campaign launched
DATA_CHRG_AFTER         Total amount of data charged after the campaign ended
DATA_USG_BEFORE         Data usage before the campaign launched
DATA_USG_AFTER          Data usage after the campaign ended
VOICE_USG_BEFORE        Voice usage before the campaign launched
VOICE_USG_AFTER         Voice usage after the campaign ended
RLD_IND_BEFORE          Indicator for customers who reloaded before the campaign launched
RLD_IND_AFTER           Indicator for customers who reloaded after the campaign ended
RLD_AMT_BEFORE          Total reload amount before the campaign launched
RLD_AMT_AFTER           Total reload amount after the campaign ended
ARPU_BEFORE             ARPU before the campaign launched
CPA_RVN_BEFORE          Total added value service before the campaign launched
CPA_RVN_AFTER           Total added value service after the campaign ended
ARPU_AFTER              ARPU after the campaign ended
ACTIVITY_DAYS_AFTER     Silent days after the campaign ended
ACTIVITY_STATUS_AFTER   Customer activity status after the campaign ended
3 Business Understanding
3.1 Analysis Goal
The analysis goal of the study is to assess the performance of the pilot before implementing
the migration for the entire customer base and to identify the customers’ preference for old
or new rate plans.
3.2 Analysis Data
The data was extracted from a recently launched campaign and provided by one of the largest
telecommunications companies in Malaysia. The dataset is private and confidential (not
publicly available). The target variable is derived from the customers who opted in to the
new rate plans via migration. The analysis goal is achieved by carrying out data analysis
with the following objectives:
i. To evaluate the effectiveness of the “Right Planning” pilot campaign,
ii. To identify the profile of campaign takers,
iii. To predict whether customers will opt in to the new rate plans based on their usage
and revenue behaviour.
4 Methodology
4.1 SEMMA Description
SEMMA, an acronym for its five major steps (Sample, Explore, Modify, Model, and Assess), is
adopted as the data mining methodology for the analysis. All five steps are necessary in a
data mining project, and all are available in SAS Enterprise Miner, the tool used to carry
out SEMMA in this project.
4.2 SEMMA Process
In this project, the first two steps, Sample and Explore, were the focus to kick-start the
analysis. Figure 4.1 shows the SEMMA process.
• Sample: Data is imported as a source file and then converted to a SAS file. Data
information such as data types and values is identified. The dataset chosen is large
enough to contain significant information and is good to process.
• Explore: Data is explored to identify relationships, trends, and anomalies and to
examine data quality via univariate, bivariate, and multivariate analysis with graphs, to
gain a better understanding and ensure the dataset is good for analysis.
• Modify: Data is modified by creating, selecting, cleaning, and transforming variables to
focus the model selection process. This step is mandatory to ensure model performance.
• Model: Data is split into train and test sets and then fed into several models for
training. Multiple models are chosen to predict whether customers will opt in to the new
rate plan after the campaign is launched, based on their usage and revenue behaviour.
• Assess: Model performance is evaluated based on metrics from the confusion matrix, the
area under the curve, and the lift score. The chosen predictive model is then used to
predict the desired outcome.
Figure 4.1: SEMMA process
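The Assess step is not carried out in this report, but the confusion-matrix metrics and lift it mentions can be illustrated with a minimal Python sketch. This is not SAS Enterprise Miner output; the function names and the toy labels are assumptions made for illustration, and only the Y/N coding of OFFER_TAKER comes from the dataset. Lift is taken here as precision divided by the overall rate of offer takers.

```python
def confusion_counts(actual, predicted, positive="Y"):
    """Count true/false positives and negatives for a binary target."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn


def assess(actual, predicted, positive="Y"):
    """Derive accuracy, precision, recall and lift from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    base_rate = (tp + fn) / total  # share of actual offer takers
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "lift": precision / base_rate if base_rate else 0.0,
    }


# Hypothetical labels, not taken from the Kation dataset.
actual = ["Y", "Y", "N", "N", "Y", "N"]
predicted = ["Y", "N", "N", "Y", "Y", "N"]
metrics = assess(actual, predicted)
```

A lift above 1 would suggest that customers flagged by a model take the offer more often than customers picked at random.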
5 Results
5.1 Sample
The collected data was saved in Excel format. The Excel file was imported and saved as a SAS
file for data exploration. The procedures for creating a new project, diagram, library, and
data source and for converting the Excel file to a SAS file are attached in the Appendix
(5.1 Sample, steps i – v). The dataset is a set of campaign data with customer-level
information, including customers’ demographic information, usage, and revenue activity
before and after the campaign, with a total of 7272 records and 27 variables.
Collected variables are:
i. customer’s tenure with Kation (TENURE)
ii. age (AGE)
iii. gender (GENDER)
iv. nationality (NATIONALITY)
v. hometown (STATE)
vi. line status before and after campaign (STATUS_BEFORE, STATUS_AFTER)
vii. campaign takers (OFFER_TAKER)
viii. campaign take-up date (OFFER_TAKE_UP_DT)
ix. customers who purchased a data plan before and after campaign (DATA_PURC_BEFORE, DATA_PURC_AFTER)
x. amount of data charged before and after campaign (DATA_CHRG_BEFORE, DATA_CHRG_AFTER)
xi. data usage before and after campaign (DATA_USG_BEFORE, DATA_USG_AFTER)
xii. voice usage before and after campaign (VOICE_USG_BEFORE, VOICE_USG_AFTER)
xiii. customers who reloaded before and after campaign (RLD_IND_BEFORE, RLD_IND_AFTER)
xiv. reload amount before and after campaign (RLD_AMT_BEFORE, RLD_AMT_AFTER)
xv. average revenue per user (ARPU) before and after campaign (ARPU_BEFORE, ARPU_AFTER)
xvi. amount of added value service before and after campaign (CPA_RVN_BEFORE, CPA_RVN_AFTER)
xvii. activity days after campaign (ACTIVITY_DAYS_AFTER)
xviii. activity status after campaign (ACTIVITY_STATUS_AFTER)
5.1.1 Metadata
SAS Enterprise Miner offers basic and advanced settings for defining variable data types.
Figure 5.1 shows the column metadata and Figure 5.2 shows the data types under the default
basic settings. Figure 5.3 shows the column metadata and Figure 5.4 shows the data types
under the default advanced settings.
Figure 5.1: Column metadata (Basic setting)
Figure 5.2: Data type summary (Basic setting)
Figure 5.3: Column metadata (Advanced setting)
Figure 5.4: Data type summary (Advanced setting)
Based on Figure 5.1 and Figure 5.2, we observed that under the default basic settings all
variables are given the input role and classified as either nominal or interval. The system
automatically detects the measurement level according to the possible values within each
variable. By default, numeric data was classified as interval and character data as
nominal. However, the basic settings were not suitable because most of the resulting data
types were still inaccurate.
After selecting the advanced settings (Figure 5.3 and Figure 5.4), the data types were
redefined as binary, interval, nominal, and unary, and the system automatically assigned
each variable either the input or the rejected role. However, there were still errors in
which the detected data types did not reflect the correct data types for the dataset. Hence,
manual adjustment was needed before proceeding to the next stage, Explore.
5.1.2 Reclassification of the Role and Level of the Variables
Reclassification was carried out to enable the creation of suitable charts for these
variables. Figure 5.5 shows the data type modification after manual adjustment and Figure
5.6 shows the final data type summary.
Figure 5.5: Comparison of role and measurement level between advanced settings and
manual reclassification
Figure 5.6: Data type summary
Based on Figure 5.5, the rejected variables were converted to input variables because all
the data should be included until analysis justifies deleting a variable. The target
variable was added manually. The unary and binary variables that should have been nominal
were converted, and the remaining variables were left unchanged.
In addition, we observed that some variables classified as nominal, namely ARPU_BEFORE,
ARPU_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, DATA_CHRG_BEFORE, DATA_CHRG_AFTER,
RLD_AMT_BEFORE, and RLD_AMT_AFTER, are in fact interval variables, because their possible
values range from 0 to positive infinity. They showed up as nominal because they contain
missing or null values (denoted by “?”). Once data cleaning is performed at the Modify
stage, the data types of these variables will be more accurate.
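The effect of the “?” placeholder on the detected measurement level can be illustrated with a minimal Python sketch. The rule below (one distinct value is unary, two is binary, all-numeric is interval, anything else is nominal) is a simplified assumption for illustration and is not SAS Enterprise Miner's actual detection logic; the function name and sample values are hypothetical.

```python
def infer_level(values, missing=("", "?")):
    """Guess a measurement level from raw string values.

    Values listed in `missing` are ignored before the level is inferred.
    """
    distinct = {v for v in values if v not in missing}
    if len(distinct) <= 1:
        return "unary"
    if len(distinct) == 2:
        return "binary"
    try:
        for value in distinct:
            float(value)  # raises ValueError for non-numeric text
    except ValueError:
        return "nominal"
    return "interval"


# Hypothetical revenue-like column containing one "?" placeholder.
arpu_like = ["10.5", "?", "3", "0"]

# If "?" is not recognised as missing, one non-numeric value makes the
# whole column look like text.
level_raw = infer_level(arpu_like, missing=("",))

# Once "?" is treated as missing, the column is recognised as numeric.
level_clean = infer_level(arpu_like)
```

Under these assumptions the same column is tagged nominal before cleaning and interval afterwards, which matches the behaviour described above.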
5.2 Explore
Data was explored to identify the relationships and anomalies via univariate, bivariate and
multivariate analysis with several graphs.
5.2.1 Summary Statistics
After accessing and assaying the dataset, summary statistics were generated. Their goal is
to give an overview of the data, such as the minimum and maximum values, mean, missing
values, and standard deviation. Figure 5.7 shows the summary statistics for interval
variables and Figure 5.8 shows the summary statistics for class variables.
Figure 5.7: Interval variable summary statistics
Based on Figure 5.7, the 7 interval variables do not have any missing values, and their
summary statistics appear normal except for AGE. AGE has a minimum of -9999, which is
abnormal because human age should fall between 0 and 100 and cannot be negative. This
occurred because customers’ missing age information, whether due to system or human error,
was replaced with -9999. These abnormal data points explain the high skewness and large
standard deviation.
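The distortion caused by the -9999 placeholder can be illustrated with a minimal Python sketch. The ages below are made up and the helper name is hypothetical; skewness is computed as the simple population (moment) skewness, which may differ from the estimator SAS Enterprise Miner reports.

```python
import statistics

SENTINEL = -9999  # placeholder used for missing AGE values in the dataset


def summarise(values):
    """Return min, max, mean, standard deviation and moment skewness."""
    mean = statistics.fmean(values)
    pstdev = statistics.pstdev(values)
    skewness = sum((v - mean) ** 3 for v in values) / (len(values) * pstdev ** 3)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "stdev": statistics.stdev(values),
        "skewness": skewness,
    }


# Made-up ages with a single sentinel value.
ages = [25, 31, 34, 40, 52, SENTINEL]

with_sentinel = summarise(ages)
without_sentinel = summarise([a for a in ages if a != SENTINEL])
```

Even one placeholder is enough to drag the mean far below zero, inflate the standard deviation, and produce strongly negative skewness, so the placeholder should be handled before the AGE statistics are interpreted.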
Figure 5.8: Class variable summary statistics
Based on Figure 5.8, there were no missing values among the class variables. However, some
of the nominal variables, namely ARPU_BEFORE, ARPU_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER,
DATA_CHRG_BEFORE, DATA_CHRG_AFTER, RLD_AMT_BEFORE, and RLD_AMT_AFTER, contained an
extremely high number of levels, which was abnormal.
Figure 5.9: Overview of data (first 47 records)
Referring to Figure 5.9, these variables contained numeric values instead of class values,
which meant they were numeric data wrongly classified as character data. Moreover, blank
values were found within these variables. These blank values were represented by ‘?’
instead of ‘0’, which caused the variables to be wrongly tagged as class variables.
On top of that, Malaysia comprises 13 states and 3 federal territories. However, referring
to Figure 5.8, STATE contains 20 distinct values, which exceeds that total. Furthermore,
GENDER has 4 levels where only 2, male and female, are expected. This means that STATE and
GENDER contain inconsistent data.
Hence, the variables AGE, GENDER, STATE, ARPU_BEFORE, ARPU_AFTER, CPA_RVN_BEFORE,
CPA_RVN_AFTER, DATA_CHRG_BEFORE, DATA_CHRG_AFTER, RLD_AMT_BEFORE, and RLD_AMT_AFTER need
to undergo data cleaning and pre-processing to ensure the data is clean before exploration
and modelling.
5.2.2 Univariate Analysis
Univariate analysis is the simplest form of statistical data analysis and explores each
variable in the dataset individually. It does not deal with causes or relationships but
solely finds patterns within each variable. Graphs such as histograms, box plots, and pie
charts are well suited to univariate analysis for checking distributions, outliers, noisy
data, and missing values.
Firstly, histograms were used to visualise the distribution and missing values of the 7
interval variables. Figure 5.10 shows an overview of all graphs and Table 5.1 summarises
the patterns, analysis, and abnormalities.
Figure 5.10: Histogram of interval variables
No. Variable Findings
1 ACTIVITY_DAYS_AFTER
§ Bimodal distribution, as two peaks were found in the variable.
  - Peak 1: Between 0 and 3.9 days
  - Peak 2: Between 35.1 and 39 days
§ Most of Kation’s customers stayed active (Peak 1) or stayed entirely silent (Peak 2)
after the campaign.
§ No outliers were detected.
2 AGE
§ Unusual pattern, as values were all on the right side.
§ Noisy data (value of -9999) were detected. Human age ranges between 1 and 100 only. The
extreme values made the histogram exceedingly unreliable.
3 TENURE
§ Right-skewed distribution.
§ Normal behaviour for prepaid customers, who tend not to stay long and usually leave for
other service providers after a quarter of a year.
§ No outliers detected.
4 DATA_USG_BEFORE
§ Right-skewed distribution.
§ A significant spike of 0 usage values due to the silent usage behaviour of the target
group, as mentioned.
§ Outliers detected.
5 DATA_USG_AFTER
§ Right-skewed distribution.
§ A significant spike of 0 usage values.
6 VOICE_USG_BEFORE
§ Right-skewed distribution.
§ A significant spike of 0 usage values.
7 VOICE_USG_AFTER
§ Right-skewed distribution.
§ A significant spike of 0 usage values.

Table 5.1: Histogram’s findings

Besides that, the box plot is one of the best tools for visualising outliers. Box plots were
created for the 7 interval variables to validate the presence of outliers, as shown in
Figure 5.11. Table 5.2 summarises the findings on outliers.
Figure 5.11: Box plots of interval variables
No. Findings
1 ACTIVITY_DAYS_AFTER
§ Data is clean and no outliers are detected. [Justified as in Table 5.1]
2 AGE, DATA_USG_AFTER, DATA_USG_BEFORE, VOICE_USG_AFTER, VOICE_USG_BEFORE, TENURE
§ Outliers were detected for all the variables. [Justified as in Table 5.1]
§ TENURE outliers were acceptable, as some prepaid customers stay loyal.
§ AGE contained the noisy data point -9999, which needs to be removed.
§ The nature of the remaining variables’ outliers needs to be examined further to conclude
whether they were caused by error, omission, or natural behaviour before removing them
from the project.
Table 5.2: Boxplot’s findings
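The way a box plot flags outliers can be sketched in Python. The sketch assumes the common 1.5 × IQR whisker rule; the report does not state which rule SAS Enterprise Miner applies, and the function name and sample values are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the box-plot whiskers.

    Whiskers are assumed to sit k * IQR below Q1 and above Q3.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up usage values with one extreme reading.
usage = [10, 12, 13, 15, 16, 18, 20, 250]
usage_outliers = iqr_outliers(usage)

# Made-up ages including the -9999 placeholder found in AGE.
ages = [25, 31, 34, 40, 52, -9999]
age_outliers = iqr_outliers(ages)
```

The rule only flags candidates. Whether a flagged value is an error or natural behaviour, such as a loyal customer with long tenure, still has to be judged case by case.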
For categorical variables (nominal, ordinal, and binary), pie charts and bar plots are used.
These graphs show how categorical data is distributed across categories as proportions. A
pie chart presents the proportions of the categories within a nominal variable in a very
straightforward manner, especially when there are only a few categories. A bar plot,
meanwhile, is better for visualising a variable with a high number of possible values.
14 nominal variables were plotted using pie charts and 6 nominal variables were plotted
using bar plots. Figure 5.11 shows an overview of the variables in pie charts, with findings
summarised in Table 5.3. Figure 5.12 shows an overview of the variables in bar plots, with
findings summarised in Table 5.4.
Figure 5.11: Pie charts of nominal variables
No. Findings
1 Correct data type: nominal variable
§ Data is clean and looks fine.
2 Incorrect data type: interval variable, not nominal variable
§ A significant number of blank values exist in these variables: ARPU_AFTER, ARPU_BEFORE,
CPA_RVN_AFTER, CPA_RVN_BEFORE, DATA_CHRG_AFTER, DATA_CHRG_BEFORE, OFFER_TAKE_UP_DT,
RLD_AMT_AFTER, and RLD_AMT_BEFORE.
§ These variables should have been classified as interval variables but were wrongly
classified by SAS Enterprise Miner as nominal variables with null values (“?”, “0”).
Table 5.3: Pie charts’ findings
Figure 5.12: Bar plots of nominal variables
No. Variable Findings
1 GENDER
§ There should not be more than 2 gender types.
§ Inconsistent data values were detected for the unspecified group.
2 STATE
§ There should not be more than 16 values, as Malaysia has 13 states and 3 federal
territories.
§ Inconsistent data values were detected for the state names, with both Malay and English
spellings used for the same state.
§ Noisy data should be classified under the ‘Other’ category. For instance, Klang Valley is
not a state.
3 ACTIVITY_STATUS_AFTER, NATIONALITY, STATUS_BEFORE, STATUS_AFTER
§ Data is clean and looks fine.
§ The statuses “DURING & AFTER CAMP” and “BEFORE & AFTER CAMP” represent customers who
opted in to the new rate plan and remained active after the campaign.
§ The pre-view count was more than half of the total count, indicating that the campaign
was quite successful.
Table 5.4: Bar plots’ findings
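The clean-up implied by the GENDER and STATE findings can be sketched in Python. The exact labels stored in the dataset are not shown in this report, so the alias table and the gender labels below are assumptions for illustration only; the mapping of English to Malay state names and of Klang Valley to an ‘Other’ category follows the findings in Table 5.4.

```python
# Hypothetical alias table: English spellings mapped to the Malay state name,
# and non-state locations mapped to an "OTHER" category.
STATE_ALIASES = {
    "PENANG": "PULAU PINANG",
    "MALACCA": "MELAKA",
    "KLANG VALLEY": "OTHER",
}

# Assumed labels for the two expected gender levels.
VALID_GENDERS = {"MALE", "FEMALE"}


def normalise_state(raw):
    """Trim, upper-case and map known aliases to one canonical label."""
    key = raw.strip().upper()
    return STATE_ALIASES.get(key, key)


def normalise_gender(raw):
    """Keep the two expected levels and group everything else together."""
    key = raw.strip().upper()
    return key if key in VALID_GENDERS else "UNSPECIFIED"


states = [normalise_state(s) for s in [" Penang ", "Pulau Pinang", "Klang Valley", "SABAH"]]
genders = [normalise_gender(g) for g in ["Male", "FEMALE", "?", "Unspecified"]]
```

Collapsing the variants this way would bring GENDER down to three levels (male, female, unspecified) and reduce STATE toward its valid set of values.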

5.2.3 Bivariate Analysis – Variable Association
Bivariate analysis identifies the relationship between two variables. Charts such as scatter
plots, bar plots, and box plots can be extremely helpful in finding simple insights. Table
5.5 displays the findings between pairs of variables in the dataset.

No. Variables Findings
1 TENURE vs AGE
§ The younger the customer, the shorter the tenure.
§ Tenure spikes suddenly at age 32, then drops for ages 33 to 50.
§ There is a disparity in tenure among older customers. Some have extremely long tenure
(> 240 months), while others have tenure as short as that of younger customers
(< 50 months).
2 TENURE vs GENDER
§ The distributions for male and female are quite similar.
§ Some customers were listed with “?” as gender. Perhaps these customers should be grouped
together with “Unspecified”.
4 TENURE vs STATE
§ Median tenure did not differ much across the states, ranging from 33 to 56 months.
§ However, the states to the right of “PERLIS” were redundant locations with low frequency,
which have to be merged with those to the left of “PERLIS”.
5 AGE vs STATE
§ Median age did not differ much across the states, ranging from 30 to 40 years old, except
that Kelantan has a smaller median age.
§ However, the states to the right of “PERLIS” were redundant locations with low frequency,
which have to be merged with those to the left of “PERLIS”.
6 AGE vs GENDER
§ The age distribution pattern is quite similar for male and female.
§ There were more males than females in this dataset.
7 OFFER_TAKER vs AGE
§ Customers who joined the campaign are between ages 21 and 90.
§ The median age for both campaign joiners and non-joiners is 34.
8 OFFER_TAKER vs GENDER
§ Opted-in male customers have a slightly higher distribution compared to female customers.
9 OFFER_TAKER vs STATE
§ Majority of the opted-in customers from all states.
10 OFFER_TAKER vs TENURE
§ The tenure distribution is similar for offer takers and non-offer takers.
11 STATE vs GENDER
§ Surprisingly, Sabah has the most customers who refuse to disclose their gender (16).
Table 5.5: Findings from bivariate analysis
5.2.4 Multivariate Analysis
Multivariate analysis is a statistical process that examines multiple variables together in
producing an outcome. In other words, more than two variables are analysed simultaneously.
Figure 5.14 illustrates STATUS_BEFORE and STATUS_AFTER by OFFER_TAKER in three separate bar
charts. Customers who opted in to the campaign were more likely to remain active
(STATUS_AFTER = Active) or convert to postpaid (STATUS_AFTER = Conver) after the campaign.
Customers who did not participate were more likely to churn (STATUS_AFTER = Close).
Figure 5.14: Findings of status before and after for offer takers
Figure 5.15 illustrates VOICE_USG_BEFORE and VOICE_USG_AFTER by OFFER_TAKER in a scatter
chart. There was no relationship between voice usage before and after the campaign,
indicating that the campaign did not change overall customer behaviour on voice service
usage.
Figure 5.15: Findings of voice usage before and after for offer takers
Figure 5.16 illustrates DATA_USG_BEFORE and DATA_USG_AFTER by OFFER_TAKER in a scatter
chart. There was no relationship between data usage before and after the campaign,
indicating that the campaign did not change overall customer behaviour on data service
usage, except for a minority of customers who had higher data usage after the campaign.
Figure 5.16: Findings of data usage before and after for offer takers
Other than bar and scatter plots, multivariate analysis can also be done using a correlation
matrix. Correlation strength is measured from -1 to 1 and shown in tones of blue and red,
where the deeper the colour, the stronger the correlation. For any pair of variables with a
correlation value greater than 0.9, one of the variables is removed to avoid collinearity.
Figure 5.17 and Figure 5.18 show the correlation matrix and correlation table.
Figure 5.17: Correlation Matrix
Figure 5.18: Correlation Table
Based on the correlation matrix and correlation table visualized in Figure 5.17 and
Figure 5.18, the variables do not have correlation values greater than 0.9. Therefore, no
variables were removed.
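The 0.9 screening rule can be sketched in Python with a plain Pearson correlation. The column values below are made up, the function names are hypothetical, and applying the threshold to the absolute value of the correlation is an assumption; the report only states "greater than 0.9".

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """List variable pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Made-up values; only the variable names come from the dataset.
columns = {
    "DATA_USG_BEFORE": [0, 1, 2, 3, 4],
    "DATA_USG_AFTER": [0, 2, 4, 6, 8.5],
    "TENURE": [5, 1, 4, 2, 3],
}
flagged_pairs = highly_correlated_pairs(columns)
```

For each flagged pair, one variable would be a candidate for removal. An empty result corresponds to the outcome reported above, where no variables were removed.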
For categorical variables, Cramer’s V statistic was used to measure the association between
categorical variables. Figure 5.19 below shows the association between the target variable
(OFFER_TAKER) and the nominal variables.
Figure 5.19: Cramer’s V statistics with respect to target variable
In Cramer’s V measure, the coefficient ranges from 0 to 1, where 0 indicates no association
and 1 indicates a perfect association between the variables. A coefficient of 0.1 was used
as a threshold to indicate that there is a relationship between two variables.
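For reference, Cramer’s V is derived from the chi-square statistic of a contingency table as V = sqrt(chi2 / (n × (min(r, c) − 1))), where n is the number of records and r and c are the numbers of levels of the two variables. A minimal Python sketch follows; the function name and toy data are assumptions, and no bias correction is applied, so values may differ slightly from those reported by the StatExplore node.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Cramer's V between two categorical sequences of equal length."""
    n = len(x)
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))  # missing cells count as 0
    chi2 = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            chi2 += (observed[(row, col)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy data: the first feature determines the target, the second does not.
offer_taker = ["Y", "Y", "N", "N"]
perfect_feature = ["a", "a", "b", "b"]
unrelated_feature = ["a", "b", "a", "b"]

v_perfect = cramers_v(offer_taker, perfect_feature)
v_unrelated = cramers_v(offer_taker, unrelated_feature)
```

With the 0.1 threshold described above, the first toy feature would be kept as associated with the target and the second would not.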
Based on Figure 5.19, OFFER_TAKE_UP_DT showed a perfect association whereas GENDER showed
almost no association. Using a threshold of 0.1, the features OFFER_TAKE_UP_DT,
DATA_CHRG_BEFORE, DATA_PURC_BEFORE, ARPU_BEFORE, ARPU_AFTER, STATUS_AFTER, RLD_AMT_BEFORE,
CPA_RVN_BEFORE, CPA_RVN_AFTER, RLD_AMT_AFTER, and DATA_CHRG_AFTER were the only variables
that showed an association with the target variable (OFFER_TAKER).
5.2.5 Interesting Visualization
The revenue impact of campaigns is always the top concern for the base management team.
Campaign efforts should stay within planning and budgeting. Hence, pre- and post-campaign
revenue analysis is a mandatory step in evaluating the effectiveness and success of
campaigns. In this study, Figure 5.20 displays ARPU before and after the campaign for the
campaign offer takers.
Figure 5.20: Findings of ARPU before and after for offer takers
Based on Figure 5.20, two arrow shapes were formed, one for offer takers (Y) and one for
non-offer takers (N) under OFFER_TAKER. However, no findings could be concluded because
both variables, ARPU_BEFORE and ARPU_AFTER, are incomplete. Their values are not in
ascending order because they are tagged as nominal variables. Findings will be concluded
after the data undergoes pre-processing.
6 Conclusion
Initially, the dataset contained 7 interval variables and 20 nominal variables. After the
metadata was manually revised, the output is as shown in Table 6.1.
Role Type of Variable Count
Input Binary 4
Input Interval 7
Input Nominal 15
Target Binary 1
Table 6.1: Revised metadata
During data exploration, 4 types of data error were found in the dataset: incomplete, noisy,
inconsistent, and intentional. The findings are summarised in Table 6.2.
Variable Data Error Type
AGE Intentional
ARPU_BEFORE Noisy, Incomplete
ARPU_AFTER Noisy, Incomplete
CPA_RVN_BEFORE Noisy, Incomplete
CPA_RVN_AFTER Noisy, Incomplete
DATA_CHRG_BEFORE Noisy, Incomplete
DATA_CHRG_AFTER Noisy, Incomplete
DATA_USG_BEFORE Noisy
DATA_USG_AFTER Noisy
RLD_AMT_BEFORE Noisy, Incomplete
RLD_AMT_AFTER Noisy, Incomplete
STATE Inconsistent
GENDER Inconsistent
OFFER_TAKE_UP_DT Incomplete
TENURE Noisy
VOICE_USG_BEFORE Noisy
VOICE_USG_AFTER Noisy
Table 6.2: Data error type
Based on the visualisations displayed in section 5.2 Explore, both objective 1 and objective
2 are achieved. The key findings supporting each objective are listed in Table 6.3.
Objective Key Findings
1 (To evaluate the effectiveness of the “Right Planning” pilot campaign)
• A total of 4,573 customers opted in to the new rate plans during the pilot campaign, an
opt-in rate of 63%.
• Of the 4,573 opted-in customers, 2,921 remained active after the campaign, an active rate
of 64%.
• These statistics indicate that the campaign was moderately successful.
2 (To identify the profile of campaign takers)
• The opt-in rate among campaign takers was higher for males than for females.
• Campaign takers were mostly between 22 and 36 years old, with tenure of more than 1 year.
• Most campaign takers were from Klang Valley, followed by Sabah and Sarawak.
• The majority of customers remained active after the campaign, but 11 customers terminated
their lines or converted to a Postpaid rate plan.
Table 6.3: Key Findings for Objective 1 and Objective 2
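The two rates reported for objective 1 follow directly from the counts given in this report, as the short Python check below shows. The constant and function names are ours; the counts are the ones stated above.

```python
TOTAL_RECORDS = 7272   # records in the pilot dataset
OFFER_TAKERS = 4573    # customers who opted in to the new rate plans
ACTIVE_TAKERS = 2921   # opted-in customers who remained active


def rate(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest whole number."""
    return round(100 * part / whole)


opt_in_rate = rate(OFFER_TAKERS, TOTAL_RECORDS)   # opt-in rate in percent
active_rate = rate(ACTIVE_TAKERS, OFFER_TAKERS)   # active rate in percent
```

Note that the active rate is calculated over opted-in customers only, not over the full target base.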
7 Appendix
The procedures for conducting the first two steps of SEMMA in SAS Enterprise Miner, Sample
and Explore, are attached below.
5.1 Sample
i. Create a new Enterprise Miner Project
ii. Create a diagram
iii. Import data as a file and save as SAS file
iv. Create a library for the imported data file / SAS file
v. Create data source and extract the imported SAS file
5.2 Explore
i. Generate a Statistics Table
• Set the Summarize property at the property side bar to Yes before running the “File
Import” node.
• Right click “File Import” node and select Results... to display the Statistics Table.
ii. Set a sample size to be used for creating graphs
• Click the “Graph Explore” node.
• Under the “Train” property at the property side bar, select “Max” for “Size” under
the Sample Properties before running the node.
iii. Create graphs
• Click Explore and drag Graph Explore into the diagram.
• Connect the “File Import” node to the “Graph Explore” node. Then, right click the
“Graph Explore” node and select Run.
• After the “Graph Explore” node has run successfully, select the Plot… icon and choose
your desired graphs.
• Specify the variable that you wish to visualise and click Finish.
iv. To add a “missing” bin to the histogram
• Right click on the created histogram and select Graph Properties.
• Tick “Show Missing Bin” and select OK.
v. To obtain Cramer’s V statistic, add a StatExplore node
• Select Add Node > Explore > StatExplore
• Select Run > Results... > View > Plots > Chi-Square Plot
• The Chi-Square plot result window will be shown. Select Cramer’s V from the
drop-down box in the top left corner of the window.
vi. To understand the correlation among the variables, add a Variable Clustering node
• Select Add Node > Explore > Variable Clustering
• Select Run > Results... > View > Model > Variable Correlation
• When viewing the results on variable correlation, a correlation matrix will be
shown.
• To view the values used in deriving the correlation matrix, click on the Table icon
on the top left of the result screen. A table should be displayed showing the
correlation values.
