JustinLiew Assignment1

Questions
1. Explain what pre-processing techniques are needed for this dataset. Using those techniques, clean the data if necessary.
Data Wrangling using Python
a. Gathering the Data

b. Assessing the data
Unique customerID's count: 7043
['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
'3186-AJIEK']
Unique gender's count: 2
['Female' 'Male']
Unique SeniorCitizen's count: 2
[0 1]
Unique Partner's count: 2
['Yes' 'No']
Unique Dependents's count: 2
['No' 'Yes']
Unique tenure's count: 73
[ 1 34 2 45 8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
5 46 11 70 63 43 15 60 18 66 9 3 31 50 64 56 7 42 35 48 29 65 38 68
32 55 37 36 41 6 4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 0
39]
Unique PhoneService's count: 2
['No' 'Yes']
Unique MultipleLines's count: 3
['No phone service' 'No' 'Yes']
Unique InternetService's count: 3
['DSL' 'Fiber optic' 'No']
Unique OnlineSecurity's count: 3
['No' 'Yes' 'No internet service']
Unique OnlineBackup's count: 3
['Yes' 'No' 'No internet service']
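Unique DeviceProtection's count: 3
['No' 'Yes' 'No internet service']
Unique TechSupport's count: 3
['No' 'Yes' 'No internet service']
Unique StreamingTV's count: 3
['No' 'Yes' 'No internet service']
Unique StreamingMovies's count: 3
['No' 'Yes' 'No internet service']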
Unique Contract's count: 3
['Month-to-month' 'One year' 'Two year']
Unique PaperlessBilling's count: 2
['Yes' 'No']
Unique PaymentMethod's count: 4
['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
'Credit card (automatic)']
Unique MonthlyCharges's count: 1585
[29.85 56.95 53.85 ... 63.1 44.2 78.7 ]
Unique TotalCharges's count: 6531
['29.85' '1889.5' '108.15' ... '346.45' '306.6' '6844.5']
Unique Churn's count: 2
['No' 'Yes']
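A listing like the one above could be produced with a short loop over the columns. The following is a minimal sketch, assuming pandas is used; the three-row DataFrame and the helper name `summarize_unique` are hypothetical stand-ins, not the author's code or the real dataset.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real customer table.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "SeniorCitizen": [0, 0, 1],
    "TotalCharges": ["29.85", "1889.5", "108.15"],
})

def summarize_unique(frame):
    """Return {column: (number of unique values, array of unique values)}."""
    return {col: (frame[col].nunique(), frame[col].unique()) for col in frame.columns}

summary = summarize_unique(df)
for col, (count, values) in summary.items():
    print(f"Unique {col}'s count: {count}")
    print(values)
```

Quoted values in the printed array, as with TotalCharges, indicate that a column is stored as strings.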
The two peaks show that the distribution is bimodal rather than normal. This suggests that there are likely two
different groups of customers, each preferring particular services.
Additionally, an assessment of the data shows that there are three main quality issues present in the data:
1. The data type of "TotalCharges" should be float64 instead of object
2. The names in the data values for PaymentMethod need to be more readable
3. Many rows of TotalCharges do not equal tenure times MonthlyCharges
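The three issues above could be handled roughly as follows. This is a sketch under stated assumptions, not the author's cleaning code: the sample rows are hypothetical, the blank TotalCharges entry is an assumed example of an unparseable value, stripping the " (automatic)" suffix is only one possible reading of "more readable", and the `charge_mismatch` column name and 0.01 tolerance are illustrative.

```python
import pandas as pd

# Hypothetical sample rows; the blank TotalCharges entry is an assumed
# example of a value that cannot be parsed as a number.
df = pd.DataFrame({
    "tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 52.55],
    "TotalCharges": ["29.85", "1889.5", " "],
    "PaymentMethod": ["Electronic check", "Bank transfer (automatic)",
                      "Credit card (automatic)"],
})

# Issue 1: convert TotalCharges from object to float64; unparseable entries become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Issue 2 (assumed fix): drop the " (automatic)" suffix to shorten the labels.
df["PaymentMethod"] = df["PaymentMethod"].str.replace(" (automatic)", "", regex=False)

# Issue 3: flag rows where TotalCharges differs from tenure * MonthlyCharges.
expected = df["tenure"] * df["MonthlyCharges"]
df["charge_mismatch"] = (df["TotalCharges"] - expected).abs() > 0.01

print(df.dtypes["TotalCharges"])
print(df[["PaymentMethod", "charge_mismatch"]])
```

Rows with a missing TotalCharges are not flagged as mismatches in this sketch; they would need a separate decision (drop, or fill in a value).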
2. Discretize attributes tenure and MonthlyCharges into three discrete levels using R
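The question asks for R; the sketch below uses Python only to illustrate the idea of three-level discretization. It assumes equal-width bins between the minimum and maximum, and the function name `discretize` and the labels "Low", "Medium", "High" are hypothetical choices, not taken from the assignment.

```python
def discretize(values, labels=("Low", "Medium", "High")):
    """Map each number to one of three equal-width levels between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    if width == 0:
        # All values are identical, so everything falls in the first level.
        return [labels[0]] * len(values)
    levels = []
    for v in values:
        # Index of the bin containing v; the maximum is clamped into the last bin.
        idx = min(int((v - lo) / width), len(labels) - 1)
        levels.append(labels[idx])
    return levels

# A few tenure values spanning the observed range 0..72.
tenure = [0, 1, 24, 34, 48, 72]
tenure_levels = discretize(tenure)
print(list(zip(tenure, tenure_levels)))
```

Equal-frequency bins (each level holding about a third of the customers) are a common alternative when the values are skewed.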
3. Manually calculate all item sets (from one item to the maximum number of items you can find) using the Apriori method. Prune the item sets with a
minimum support of 35% and a minimum confidence of 75%. You can add more records from the modified dataset if required
i. Scan the table to count each candidate in the item list = {I1, I2, I3, I4, I5}
Item Set    Support Count
{I1}        2
{I2}        3
{I3}        4
{I4}        1
{I5}        4
ii. Compare each candidate's support count to the minimum support count (35% of 5 transactions = 1.75). {I4} is below it and is dropped, giving L1
Item Set    Support Count
{I1}        2
{I2}        3
{I3}        4
{I5}        4
iii. Generate Candidate List (C2) from L1
Item Set        Support Count
{I1, I2}        1
{I1, I3}        2
{I1, I5}        1
{I2, I3}        2
{I2, I5}        3
{I3, I5}        3
iv. Compare C2 with minimum support (35%), giving L2
Item Set        Support Count
{I1, I3}        2
{I2, I3}        2
{I2, I5}        3
{I3, I5}        3
v. Generate C3 from L2
Item Set            Support Count
{I1, I3, I5}        1
{I2, I3, I5}        2
{I1, I2, I3}        1
vi. Compare C3 with minimum support (35%), giving L3
Item Set            Support Count
{I2, I3, I5}        2
Therefore, the largest frequent item set is {I2, I3, I5}.

Therefore, customers frequently use the services I2, I3 and I5 together.
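The level-wise search in steps i to vi can be sketched in code. The source does not show the transaction table, so the five transactions below are hypothetical, chosen only because they reproduce every support count listed in the steps; the function names are likewise illustrative.

```python
from itertools import combinations

# Hypothetical transactions, chosen only so that their counts match the tables.
transactions = [
    {"I1", "I3", "I4"},
    {"I2", "I3", "I5"},
    {"I1", "I2", "I3", "I5"},
    {"I2", "I5"},
    {"I3", "I5"},
]
min_support = 0.35
min_count = min_support * len(transactions)  # 1.75

def support_count(itemset):
    """Number of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)

def apriori():
    """Return a list of levels; level k maps each frequent k-item set to its count."""
    items = sorted(set().union(*transactions))
    current = {frozenset([i]): support_count({i}) for i in items}
    current = {s: c for s, c in current.items() if c >= min_count}
    levels = []
    while current:
        levels.append(current)
        k = len(levels) + 1
        # Join step: unions of frequent sets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: support_count(c) for c in candidates}
        current = {s: c for s, c in current.items() if c >= min_count}
    return levels

levels = apriori()
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", {tuple(sorted(s)): c for s, c in level.items()})
```

The prune step discards {I1, I3, I5} and {I1, I2, I3} before they are counted, because {I1, I5} and {I1, I2} are not frequent. The manual table counts them first and removes them afterwards; the resulting L3 is the same.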
The association rules that can be generated from L3 are as follows, with minimum support count = 35% of 5 transactions = 1.75 and minimum confidence = 75%. The confidence of a rule A => B is the support count of {I2, I3, I5} (which is 2) divided by the support count of the antecedent A.
Association Rule    Support Count of Antecedent    Confidence    Confidence%
I2 ^ I3 => I5       2                              2/2 = 1.00    100%
I3 ^ I5 => I2       3                              2/3 = 0.67    67%
I2 ^ I5 => I3       3                              2/3 = 0.67    67%
I2 => I3 ^ I5       3                              2/3 = 0.67    67%
I3 => I2 ^ I5       4                              2/4 = 0.50    50%
I5 => I2 ^ I3       4                              2/4 = 0.50    50%
Minimum confidence threshold = 75%

Therefore, only the first rule satisfies the given condition.
Final output = (I2 ^ I3) => I5
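The confidence check can be reproduced in a few lines. This is a sketch only: the five transactions are hypothetical (the source does not list them) and were chosen to match the support counts in the tables, and `rules_from` is an illustrative helper name.

```python
from itertools import combinations

# Hypothetical transactions consistent with the support counts in the tables.
transactions = [
    {"I1", "I3", "I4"},
    {"I2", "I3", "I5"},
    {"I1", "I2", "I3", "I5"},
    {"I2", "I5"},
    {"I3", "I5"},
]

def support_count(itemset):
    """Number of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(itemset):
    """Yield (antecedent, consequent, confidence) for every non-trivial split."""
    itemset = frozenset(itemset)
    whole = support_count(itemset)
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            # confidence(A => B) = count(A and B together) / count(A)
            yield antecedent, itemset - antecedent, whole / support_count(antecedent)

min_confidence = 0.75
strong = [(sorted(a), sorted(c), conf)
          for a, c, conf in rules_from({"I2", "I3", "I5"})
          if conf >= min_confidence]
print(strong)
```

With these transactions, only the rule with antecedent I2 and I3 and consequent I5 passes the 75% threshold.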
4. Describe what will happen if you change minimum support and minimum confidence.
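Support: denotes the frequency of the rule within the transactions. A high value means that the rule involves a large part of the database. Raising the minimum support leaves fewer frequent item sets, while lowering it keeps more item sets, including rare ones.
Confidence: denotes the percentage of transactions containing A that also contain B. It is an estimate of the conditional probability of B given A. Raising the minimum confidence leaves fewer but more reliable rules, while lowering it accepts more rules, including weaker ones.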
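The effect of moving either threshold can be checked on a small example. The sketch below uses five hypothetical transactions (not from the source, chosen to match the support counts in question 3) and a brute-force search over all item sets, which is only practical for a handful of items.

```python
from itertools import combinations

# Hypothetical transactions consistent with the counts used in question 3.
transactions = [
    {"I1", "I3", "I4"},
    {"I2", "I3", "I5"},
    {"I1", "I2", "I3", "I5"},
    {"I2", "I5"},
    {"I3", "I5"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def frequent_itemsets(min_support):
    """Brute-force list of all item sets whose support meets the threshold."""
    return [c for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(c) >= min_support]

def strong_rules(itemset, min_confidence):
    """Antecedents A of rules A => (itemset minus A) that meet the confidence threshold."""
    return [a for k in range(1, len(itemset))
            for a in combinations(itemset, k)
            if support(itemset) / support(a) >= min_confidence]

for s in (0.10, 0.35, 0.50):
    print(f"min support {s:.0%}: {len(frequent_itemsets(s))} frequent item sets")
for c in (0.40, 0.60, 0.75):
    print(f"min confidence {c:.0%}: {len(strong_rules(('I2', 'I3', 'I5'), c))} rules")
```

On this sample, both counts shrink as the corresponding threshold rises.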
