JustinLiew Assignment1
1. Explain what pre-processing techniques are needed for this dataset. Using those techniques, clean the data if necessary.
'3186-AJIEK']
['Female' 'Male']
Unique SeniorCitizen's count: 2
[0 1]
['Yes' 'No']
['No' 'Yes']
[ 1 34 2 45 8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
5 46 11 70 63 43 15 60 18 66 9 3 31 50 64 56 7 42 35 48 29 65 38 68
32 55 37 36 41 6 4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 0
39]
['No' 'Yes']
['Yes' 'No']
Additionally, an assessment of the data shows that there are three main quality issues present in the data:
1. The data type of "TotalCharges" should be float64 instead of object
2. The values of PaymentMethod need more readable names
3. In many rows, TotalCharges does not equal tenure multiplied by MonthlyCharges
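Issues 1 and 3 can be illustrated with a minimal pure-Python sketch. The helper names (`parse_total_charges`, `charges_mismatch`), the sample rows, and the 0.01 tolerance are assumptions made for illustration; they are not taken from the original notebook.

```python
from typing import Optional


def parse_total_charges(raw: str) -> Optional[float]:
    """Convert a TotalCharges string to float; blank strings become None."""
    text = raw.strip()
    return float(text) if text else None


def charges_mismatch(tenure: int, monthly: float, total: Optional[float],
                     tol: float = 0.01) -> bool:
    """Return True when total differs from tenure * monthly by more than tol.

    Rows with a missing total are not flagged here; handle them separately.
    """
    if total is None:
        return False
    return abs(tenure * monthly - total) > tol


# Hypothetical rows, invented for this example only.
rows = [
    {"tenure": 1, "MonthlyCharges": 29.85, "TotalCharges": "29.85"},
    {"tenure": 2, "MonthlyCharges": 50.0, "TotalCharges": "108.15"},
    {"tenure": 0, "MonthlyCharges": 52.55, "TotalCharges": " "},
]

for row in rows:
    # Issue 1: turn the string column into a numeric value.
    row["TotalCharges"] = parse_total_charges(row["TotalCharges"])
    # Issue 3: flag rows where the total is not tenure * monthly charge.
    row["mismatch"] = charges_mismatch(
        row["tenure"], row["MonthlyCharges"], row["TotalCharges"]
    )
```

Treating blank strings as missing values rather than zero is a design choice in this sketch; whether to drop or impute those rows depends on the analysis.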
2. Discretize the attributes tenure and MonthlyCharges into three discrete levels using R
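The question asks for R; as a language-neutral illustration of the idea, here is a small Python sketch of equal-width binning into three levels. Equal-width binning, the `discretize` helper, the level labels, and the sample values are all assumptions; equal-frequency binning would be an equally valid reading of the task.

```python
from typing import List, Sequence


def discretize(values: Sequence[float],
               labels: Sequence[str] = ("low", "medium", "high")) -> List[str]:
    """Assign each value to one of len(labels) equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything falls in the first bin
        else:
            # The maximum value would land one past the last bin, so clamp it.
            idx = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[idx])
    return result


# Sample tenure values spanning the 0 to 72 range seen in the data.
tenure_levels = discretize([0, 1, 24, 34, 48, 72])
```

With a range of 0 to 72 the bin width is 24, so the cut points fall at 24 and 48.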
3. Manually calculate all itemsets (from one item up to the maximum number of items you can find) using the Apriori method. Prune the itemsets with a
minimum support of 35% and a minimum confidence of 75%. You can add more records from the modified dataset if required
i. Scan the table to count each candidate in the item list = {I1, I2, I3, I4, I5}
Item Set Support Count
{I1} 2
{I2} 3
{I3} 4
{I4} 1
{I5} 4
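The pruning step that follows this count can be sketched in Python. The support counts are copied from the table; the number of transactions (5) is an assumption inferred from the minimum support count of 1.75 quoted in this answer (35% of 5 = 1.75).

```python
# Support counts for the 1-itemsets, copied from the table above.
support_count = {"I1": 2, "I2": 3, "I3": 4, "I4": 1, "I5": 4}

# Assumption: 5 transactions, inferred from 35% of N = 1.75.
NUM_TRANSACTIONS = 5
MIN_SUPPORT = 0.35

# Minimum number of transactions an itemset must appear in.
min_count = MIN_SUPPORT * NUM_TRANSACTIONS

# Keep only the items whose count reaches the threshold.
frequent_1 = {item: c for item, c in support_count.items() if c >= min_count}
pruned = sorted(set(support_count) - set(frequent_1))
```

Under these assumptions only {I4} falls below the threshold, so the remaining four items go on to form the 2-item candidates.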
The association rules that can be generated from L3 are as follows, with a minimum support of 35% (a support count of 1.75) and a minimum confidence of 75%
Support: the frequency of the rule within the transactions. A high value means that the rule covers a large part of the database.
Confidence: the percentage of transactions containing A that also contain B. It is an estimate of the conditional probability P(B | A).
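Both measures can be computed directly from a list of transactions, as in the following minimal Python sketch. The five transactions are hypothetical: they were chosen only so that the single-item counts match the table in step i, and they are not the assignment's actual data. The function names are likewise illustrative.

```python
from typing import FrozenSet, List

# Hypothetical transactions, consistent with the 1-itemset counts in step i.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"I1", "I2", "I3", "I5"}),
    frozenset({"I2", "I3", "I5"}),
    frozenset({"I1", "I3", "I5"}),
    frozenset({"I2", "I3", "I4"}),
    frozenset({"I5"}),
]


def count(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of all transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(a: FrozenSet[str], b: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Estimate of P(B | A): count(A and B) / count(A); 0.0 if A never occurs."""
    a_count = count(a, transactions)
    if a_count == 0:
        return 0.0
    return count(a | b, transactions) / a_count


def rule_passes(a: FrozenSet[str], b: FrozenSet[str],
                transactions: List[FrozenSet[str]],
                min_support: float = 0.35, min_confidence: float = 0.75) -> bool:
    """Check a rule A -> B against both thresholds."""
    return (support(a | b, transactions) >= min_support
            and confidence(a, b, transactions) >= min_confidence)
```

On this hypothetical data, the rule {I3} -> {I5} appears in 3 of 5 transactions and holds in 3 of the 4 transactions containing I3, so it meets both thresholds; {I1} -> {I2} appears in only 1 of 5 and is rejected on support. Confidence is computed from raw counts rather than from two support ratios to avoid floating-point rounding at the threshold.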