
Questions

1. Explain what pre-processing techniques are needed for this dataset. Using those techniques, clean the data if necessary.

Data Wrangling using Python

a. Gathering the Data


b. Assessing the data
Unique customerID's count: 7043
['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']

Unique gender's count: 2
['Female' 'Male']

Unique SeniorCitizen's count: 2
[0 1]

Unique Partner's count: 2
['Yes' 'No']

Unique Dependents's count: 2
['No' 'Yes']

Unique tenure's count: 73
[ 1 34 2 45 8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
 5 46 11 70 63 43 15 60 18 66 9 3 31 50 64 56 7 42 35 48 29 65 38 68
 32 55 37 36 41 6 4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 0
 39]

Unique PhoneService's count: 2
['No' 'Yes']

Unique MultipleLines's count: 3
['No phone service' 'No' 'Yes']

Unique InternetService's count: 3
['DSL' 'Fiber optic' 'No']

Unique OnlineSecurity's count: 3
['No' 'Yes' 'No internet service']

Unique OnlineBackup's count: 3
['Yes' 'No' 'No internet service']

Unique DeviceProtection's count: 3
['No' 'Yes' 'No internet service']

Unique TechSupport's count: 3
['No' 'Yes' 'No internet service']

Unique StreamingTV's count: 3
['No' 'Yes' 'No internet service']

Unique StreamingMovies's count: 3
['No' 'Yes' 'No internet service']

Unique Contract's count: 3
['Month-to-month' 'One year' 'Two year']

Unique PaperlessBilling's count: 2
['Yes' 'No']

Unique PaymentMethod's count: 4
['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']

Unique MonthlyCharges's count: 1585
[29.85 56.95 53.85 ... 63.1 44.2 78.7 ]

Unique TotalCharges's count: 6531
['29.85' '1889.5' '108.15' ... '346.45' '306.6' '6844.5']

Unique Churn's count: 2
['No' 'Yes']

The distribution has two peaks, so it is not normal. This suggests that there are likely two different groups of customers, each of which prefers particular services.

Additionally, an assessment of the data shows three main quality issues:
1. The data type of "TotalCharges" should be float64 instead of object.
2. The values of PaymentMethod need to be more readable.
3. In many rows, TotalCharges does not equal tenure times MonthlyCharges.
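
The three issues could be checked and partly fixed along the following lines. This is only a minimal sketch: the five-row DataFrame is a hypothetical sample (not rows from the real file), the blank string in TotalCharges is an assumed reason for the object dtype, and dropping the "(automatic)" suffix is just one possible reading of "more readable".

```python
import pandas as pd

# Hypothetical sample; the values are illustrative, not real rows of the dataset.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 0, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 52.55, 42.30],
    "TotalCharges": ["29.85", "1889.5", "108.15", " ", "1840.75"],
    "PaymentMethod": [
        "Electronic check", "Mailed check", "Bank transfer (automatic)",
        "Credit card (automatic)", "Mailed check",
    ],
})

# Issue 1: convert TotalCharges from object (strings) to float64.
# errors="coerce" turns entries that cannot be parsed (assumed: " ") into NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Issue 2 (one possible reading): shorten the payment method labels.
df["PaymentMethod"] = df["PaymentMethod"].str.replace(" (automatic)", "", regex=False)

# Issue 3: flag rows where TotalCharges differs from tenure * MonthlyCharges.
expected = df["tenure"] * df["MonthlyCharges"]
mismatch = (df["TotalCharges"] - expected).abs() > 0.01

print(df.dtypes)
print("Rows with missing TotalCharges:", int(df["TotalCharges"].isna().sum()))
print("Rows where TotalCharges != tenure * MonthlyCharges:", int(mismatch.sum()))
```

Rows that become NaN after the conversion still need a decision (drop them or fill them in), which this sketch leaves open.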

2. Discretize the attributes tenure and MonthlyCharges into three discrete levels using R.
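
The question asks for R, where base R's cut() is the usual tool for this. To stay with the Python used in the data wrangling part, the same idea is sketched below in Python. Equal-width bins and the labels "low", "medium" and "high" are assumptions; the helper name discretize is hypothetical, and the sample values are a few of the unique values listed earlier.

```python
def discretize(values, labels=("low", "medium", "high")):
    """Assign each value to one of len(labels) equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    if width == 0:
        # All values are identical, so everything falls into the first level.
        return [labels[0] for _ in values]
    result = []
    for v in values:
        # The maximum value would land in an extra bin, so clamp the index.
        idx = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[idx])
    return result

# A few of the unique values shown in the listing (not full columns).
tenure = [1, 34, 2, 45, 8, 72, 0]
monthly_charges = [29.85, 56.95, 53.85, 63.1, 44.2, 78.7]

print(discretize(tenure))
print(discretize(monthly_charges))
```

Equal-frequency bins (splitting at the terciles) would be an equally reasonable choice and can give quite different levels when the distribution is skewed.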
3. Manually calculate all item sets (from one item to the maximum number of items you can find) using the Apriori method. Prune the item sets with a minimum support of 35% and a minimum confidence of 75%. You can add more records from the modified dataset if required.

i. Scan the table to count each candidate in the item list = {I1, I2, I3, I4, I5} (C1)

Item Set        Support Count
{I1}            2
{I2}            3
{I3}            4
{I4}            1
{I5}            4

ii. Compare each candidate's support count with the minimum support (35%, i.e. a support count of 1.75, so at least 2) to obtain L1

Item Set        Support Count
{I1}            2
{I2}            3
{I3}            4
{I5}            4

iii. Generate the candidate list (C2) from L1

Item Set        Support Count
{I1, I2}        1
{I1, I3}        2
{I1, I5}        1
{I2, I3}        2
{I2, I5}        3
{I3, I5}        3

iv. Compare C2 with the minimum support (35%) to obtain L2

Item Set        Support Count
{I1, I3}        2
{I2, I3}        2
{I2, I5}        3
{I3, I5}        3

v. Generate C3 from L2

Item Set        Support Count
{I1, I3, I5}    1
{I2, I3, I5}    2
{I1, I2, I3}    1

vi. Compare C3 with the minimum support (35%) to obtain L3

Item Set        Support Count
{I2, I3, I5}    2

Therefore, the largest frequent item set is {I2, I3, I5}.

This means that the services represented by I2, I3 and I5 are frequently used together by customers.
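
The manual steps above can be cross-checked with a short script. This is a minimal sketch, not a reference implementation: the transaction table is not shown in this section, so the five transactions below are hypothetical, constructed only so that their counts match every support count in the tables above. The function names are also illustrative.

```python
from itertools import combinations

# Hypothetical transactions, chosen only to reproduce the support counts above.
transactions = [
    {"I1", "I3", "I4"},
    {"I2", "I3", "I5"},
    {"I1", "I2", "I3", "I5"},
    {"I2", "I5"},
    {"I3", "I5"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    """Return {frozenset: support count} for all frequent item sets."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    # L1: frequent single items.
    current = {}
    for item in items:
        s = frozenset([item])
        c = support_count(s, transactions)
        if c / n >= min_support:
            current[s] = c
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-sets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            cnt = support_count(c, transactions)
            if cnt / n >= min_support:
                current[c] = cnt
        frequent.update(current)
        k += 1
    return frequent

frequent = apriori(transactions, min_support=0.35)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step discards {I1, I3, I5} and {I1, I2, I3} before they are counted, because {I1, I5} and {I1, I2} are not frequent. The manual table counts them anyway; the resulting frequent item sets are the same.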

The association rules that can be generated from L3 are listed below, with minimum support 35% (a support count of 1.75) and minimum confidence 75%. The confidence of a rule A => B is the support count of the whole item set (here 2 for {I2, I3, I5}) divided by the support count of A.

Association Rule    Support Count    Confidence    Confidence%
I2 ^ I3 => I5       2                2/2 = 1.00    100%
I3 ^ I5 => I2       2                2/3 = 0.67    67%
I2 ^ I5 => I3       2                2/3 = 0.67    67%
I2 => I3 ^ I5       2                2/3 = 0.67    67%
I3 => I2 ^ I5       2                2/4 = 0.50    50%
I5 => I2 ^ I3       2                2/4 = 0.50    50%

Minimum confidence threshold = 75%


Therefore, only the first rule satisfies the given condition.
Final output = (I2 ^ I3) => I5
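
The rule confidences can be recomputed from the support counts in the item set tables, using the standard definition confidence(A => B) = support count of (A and B together) / support count of A. The sketch below hard-codes those counts; the function name rules_from is hypothetical.

```python
from itertools import combinations

# Support counts copied from the item set tables above.
support = {
    frozenset({"I2"}): 3,
    frozenset({"I3"}): 4,
    frozenset({"I5"}): 4,
    frozenset({"I2", "I3"}): 2,
    frozenset({"I2", "I5"}): 3,
    frozenset({"I3", "I5"}): 3,
    frozenset({"I2", "I3", "I5"}): 2,
}

def rules_from(itemset, support, min_confidence):
    """List (antecedent, consequent, confidence, passes) for every rule of `itemset`."""
    itemset = frozenset(itemset)
    results = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            a = frozenset(antecedent)
            consequent = itemset - a
            # confidence(A => B) = support(A and B) / support(A)
            conf = support[itemset] / support[a]
            results.append((sorted(a), sorted(consequent), conf, conf >= min_confidence))
    return results

rules = rules_from({"I2", "I3", "I5"}, support, min_confidence=0.75)
for antecedent, consequent, conf, passes in rules:
    print(f"{antecedent} => {consequent}: confidence {conf:.2f}, kept: {passes}")
```

With a 75% threshold, the only rule kept is the one with antecedent I2 ^ I3 and consequent I5, which agrees with the final output stated above.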
4. Describe what will happen if you change minimum support and minimum confidence.

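Support: denotes the frequency of the rule within the transactions. A high value means that the rule involves a large part of the database. Raising the minimum support prunes more item sets and leaves fewer rules; lowering it keeps rarer item sets but increases the number of candidates to evaluate.
Confidence: denotes the percentage of transactions containing A that also contain B. It is an estimate of the conditional probability P(B | A). Raising the minimum confidence keeps only the stronger rules; lowering it admits more, but weaker, rules.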
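
The effect of the two thresholds can be seen by re-running the mining with different values. The sketch below is only an illustration: the five transactions are hypothetical (chosen to match the support counts in question 3), the function name mine is made up, and item sets are enumerated by brute force instead of the level-wise Apriori search, which is acceptable only because the example is tiny.

```python
from itertools import combinations

# Hypothetical transactions, consistent with the support counts in question 3.
transactions = [
    {"I1", "I3", "I4"},
    {"I2", "I3", "I5"},
    {"I1", "I2", "I3", "I5"},
    {"I2", "I5"},
    {"I3", "I5"},
]

def mine(transactions, min_support, min_confidence):
    """Return (frequent item sets with counts, rules meeting min_confidence)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    counts = {}
    # Brute-force enumeration of every possible item set.
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = frozenset(combo)
            c = sum(1 for t in transactions if s <= t)
            if c / n >= min_support:
                counts[s] = c
    rules = []
    for itemset, c in counts.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                # Every subset of a frequent item set is frequent, so counts[a] exists.
                conf = c / counts[a]
                if conf >= min_confidence:
                    rules.append((sorted(a), sorted(itemset - a), conf))
    return counts, rules

summary = {}
for min_sup, min_conf in [(0.35, 0.75), (0.50, 0.75), (0.70, 0.75), (0.35, 0.90)]:
    counts, rules = mine(transactions, min_sup, min_conf)
    summary[(min_sup, min_conf)] = (len(counts), len(rules))
    print(f"min_support={min_sup:.2f} min_confidence={min_conf:.2f}: "
          f"{len(counts)} frequent item sets, {len(rules)} rules")
```

On this toy data, raising either threshold reduces the output: a higher minimum support removes item sets (and with them every rule built from those sets), while a higher minimum confidence leaves the item sets unchanged and removes only rules.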
