
0330446

Question 1
i.
ii.
iii.
R-Code
library(tidyverse)

#1
summary <- dplyr::starwars %>%
drop_na(species) %>%
group_by(species) %>%
summarise(across(
.cols = c(height, mass),
.fns = list(
mean = ~ mean(.x, na.rm = TRUE),
median = ~ median(.x, na.rm = TRUE)
),
.names="{fn}_{col}"
)
)%>%
print(n = nrow(.))

#2
summary %>% select(species, mean_height) %>%
slice(1:10) %>%
arrange(desc(mean_height))%>%
print(n = nrow(.))

#3
dplyr::starwars %>%
drop_na(species) %>%
group_by(gender, species) %>%
summarise(across(
.cols = c(height),
.fns = ~ mean(.x, na.rm = TRUE),
.names="mean_{col}"
)
) %>%
arrange(desc(mean_height), .by_group = TRUE) %>%
top_n(3, mean_height) %>%
arrange(gender) %>%
print(n = nrow(.))
Question 2

i.

a.

b.

c.
d.
R-Code
library(dslabs)
library(tidyverse) # provides %>%, select(), filter(), arrange(), slice() and top_n()

murder_df <- murders


rate <- (murder_df$total / murder_df$population) * 100000
rank <- rank(rate)

#a
df_a <- cbind(rate, rank, murder_df) %>%
select(state, region, rate, rank) %>%
filter(rate < 0.71) %>%
print()

#b
df_b <- cbind(rate, rank, murder_df) %>%
filter(region == "Northeast" | region == "West") %>%
select(state, rate, rank) %>%
filter(rate < 1) %>%
print()

#c
df_c <- cbind(rate, rank, murder_df) %>%
arrange(desc(rate)) %>%
select(region) %>%
slice(1:5) %>%
print()

#d
df_d1 <- cbind(rate, rank, murder_df) %>%
top_n(5, rate) %>%
print()
df_d2 <- cbind(rate, rank, murder_df) %>%
top_n(-5, rate) %>%
print()
Question 3
i. a.
Interpretation:
Using the boxplot, we can compare the range and distribution of income across the segments.

For the Moving up segment (70 values):

1. The largest value within 1.5 times the interquartile range above the 75th percentile = 73797.5. That
is, the maximum income in this segment.
2. Within the interquartile range (IQR):
a. 75th percentile = 58889.1
b. Median (50th percentile) = 52564.6
c. 25th percentile = 46649.4
d. Mean is around 54000
3. The smallest value within 1.5 times the interquartile range below the 25th percentile = 29771.9

Moving up’s income varies more than Urban hip’s, but less than that of Suburb mix and Travelers.

Its median is closest to that of Suburb mix.

The maximum income for this segment ranks 3rd, just above the lowest, Urban hip.

This segment is neither right- nor left-skewed, so its distribution is roughly symmetric.

There are no outliers here.
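
The "no outliers" reading for Moving up can be checked against the usual 1.5 × IQR rule. The sketch below is a minimal illustration in base R; the variable names are introduced here, and the numbers are the ones read off the boxplot above.

```r
# Quartiles and extreme values reported above for the Moving up segment
q1 <- 46649.4
q3 <- 58889.1
lowest <- 29771.9
highest <- 73797.5

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # values below this would be drawn as outliers
upper_fence <- q3 + 1.5 * iqr  # values above this would be drawn as outliers

# TRUE when both extremes fall inside the fences, i.e. no outliers are drawn
no_outliers <- lowest >= lower_fence && highest <= upper_fence

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
print(no_outliers)
```

Both extremes lie inside the fences, which agrees with the boxplot showing no outlier points for this segment.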

For the Suburb mix segment (100 values):

1. The largest value within 1.5 times the interquartile range above the 75th percentile = 85290.5. That
is, the maximum income in this segment.
2. Within the interquartile range (IQR):
a. 75th percentile = 61315.5
b. Median (50th percentile) = 54819.0
c. 25th percentile = 48127.7
d. Mean is around 56000
3. The smallest value within 1.5 times the interquartile range below the 25th percentile = 19282.2

Suburb mix’s income varies more than that of both Urban hip and Moving up, but less than that of
Travelers.

Its median is closest to that of Moving up.

The maximum income for this segment ranks 2nd, behind Travelers.

This segment is slightly right-skewed, which means it has a tail of customers with comparatively
high incomes.

There are some outliers, which may indicate errors or abnormal values in the data.

For the Travelers segment (80 values):

1. The largest value within 1.5 times the interquartile range above the 75th percentile = 114278.3. That
is, the maximum income in this segment.
2. Within the interquartile range (IQR):
a. 75th percentile = 77327.2
b. Median (50th percentile) = 61014.3
c. 25th percentile = 48623.4
d. Mean is around 65000
3. The smallest value within 1.5 times the interquartile range below the 25th percentile = -5183.4

This segment has the greatest income variability of all the segments.

Its median is higher than those of Moving up and Suburb mix.

The maximum income for this segment ranks 1st.

This segment is moderately right-skewed, which means it has a tail of customers with comparatively
high incomes.

There are some outliers, which may indicate errors or abnormal values in the data. The minimum
income in the segment is negative, which may mean that this person is making a loss or is in
bankruptcy. The 2 outlier points suggest that 2 people in this segment are in a similar position.

For the Urban hip segment (150 values):

1. The largest value within 1.5 times the interquartile range above the 75th percentile = 33909.5. That
is, the maximum income in this segment.
2. Within the interquartile range (IQR):
a. 75th percentile = 24399.6
b. Median (50th percentile) = 22141.0
c. 25th percentile = 17865.1
d. Mean is around 23000
3. The smallest value within 1.5 times the interquartile range below the 25th percentile = 11985.2

This segment has the lowest income variability of all the segments.

Its median is the furthest from those of the other segments.

The maximum income for this segment ranks 4th.

The box for this segment is left-skewed: the median lies closer to the 75th percentile than to the
25th percentile, so incomes between the 25th percentile and the median are more spread out.

There are no outliers and no abnormal values.
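
The variability ranking described for the four segments can be checked from the quartiles listed above. The sketch below is a minimal base R illustration; the data frame `box_stats` is a name introduced here, and it uses the IQR as the measure of spread, which is an assumption rather than the only possible choice.

```r
# Quartiles copied from the boxplot labels reported above
box_stats <- data.frame(
  Segment = c("Moving up", "Suburb mix", "Travelers", "Urban hip"),
  q1      = c(46649.4, 48127.7, 48623.4, 17865.1),
  median  = c(52564.6, 54819.0, 61014.3, 22141.0),
  q3      = c(58889.1, 61315.5, 77327.2, 24399.6),
  stringsAsFactors = FALSE
)

# Interquartile range per segment, sorted from least to most spread out
box_stats$iqr <- box_stats$q3 - box_stats$q1
box_stats <- box_stats[order(box_stats$iqr), ]

print(box_stats)
```

Sorted this way, Urban hip comes first and Travelers last, matching the ordering given in the interpretation.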


i. b.

Interpretation:
The segment with the highest subscription count is Suburb mix at 94. This is consistent with the
boxplot above, where Suburb mix has a higher income than Moving up and Urban hip. The lowest
subscription count belongs to Urban hip at 40. Customers in the Urban hip segment have the lowest
income in the boxplot and may therefore be more reluctant to subscribe.

The group that the company should target for subscription conversion is the Moving up segment.
It has the highest unsubscribed count, and its income level suggests that these customers can
afford to subscribe. Its 3rd place in subscription count also suggests that the segment is willing to
subscribe to the service. Suburb mix has the lowest unsubscribed count, which is expected since
most of its customers have already subscribed.
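
Because the segments differ in size, raw counts can be complemented with the share of subscribers per segment. The sketch below is a minimal base R illustration on hypothetical toy data: the column names `Segment` and `subscribe` mirror the report's code, while the segment labels "A" and "B" and the values "subYes" and "subNo" are assumptions made for the example.

```r
# Hypothetical toy data; values are illustrative only, not the report's data
toy <- data.frame(
  Segment   = c("A", "A", "A", "A", "B", "B"),
  subscribe = c("subYes", "subNo", "subNo", "subNo", "subYes", "subNo"),
  stringsAsFactors = FALSE
)

# Flag subscribers, then compute count and percentage per segment
toy$is_subscribed <- toy$subscribe == "subYes"
n_subscribed   <- tapply(toy$is_subscribed, toy$Segment, sum)
pct_subscribed <- 100 * tapply(toy$is_subscribed, toy$Segment, mean)

print(n_subscribed)
print(pct_subscribed)
```

In the toy data both segments have one subscriber, yet the shares differ, which is why a percentage can be a fairer basis for comparing segments of unequal size.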
i. c.
Interpretation:
The histogram is roughly symmetric in the middle. Taking the whole range into account, however, it
is right-skewed: very few people have an income of 80000 or above. The bell shape fits the middle
of the data well but fits the left side poorly.

From the colour and count of the bars, more males than females have a low income, but more males
also have a high income. In the middle of the range, and overall, the genders are distributed quite
equally.

The peak of the distribution lies between 40000 and 60000, around 50000; most people’s income is
at this level. There are outliers in the negatives as well as one in the higher incomes, close to 120000.

In the 15000 to 30000 range, we see additional peaks. These extra modes suggest that another
variable, possibly the segment, separates the customers into subgroups.
ii. a.

Interpretation:
The table above lists the mean, that is, the arithmetic average, of each variable per segment. Moving
up and Suburb mix have similar mean ages, both close to middle age, while Urban hip is the youngest
group and Travelers is by far the oldest. Income follows a similar pattern: Moving up and Suburb mix
have similar mean incomes, Urban hip has the lowest, and Travelers has the highest. For kids,
Travelers appear to have no children, Urban hip has a low number of children, and Moving up and
Suburb mix have almost equal numbers, the highest among the segments.
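
The difference between the mean and the most common value (the mode) can be seen on a small vector. The sketch below is a minimal base R illustration; the vector `x` is hypothetical and is not taken from the segment data.

```r
# Hypothetical values, chosen so that mean and mode clearly differ
x <- c(1, 2, 2, 3, 10)

# Arithmetic average of all observations
mean_x <- mean(x)

# Most frequent value: tabulate, pick the largest count, convert the label back
mode_x <- as.numeric(names(which.max(table(x))))

# Averaging only the distinct values drops the repeated 2 and changes the result
mean_unique_x <- mean(unique(x))

print(c(mean = mean_x, mode = mode_x, mean_of_unique = mean_unique_x))
```

The three numbers differ, so a mean computed over distinct values should not be read as the mean of the observations.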
ii. b.

Interpretation:
In the statistical results, the Pearson chi-square statistic is 12.734 and the p-value is about 0.005.
Therefore, at a significance level of 0.05, we can reject the null hypothesis that there is no
relationship between gender and segment. Gender and segment are significantly associated.
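
The p-value can be recovered from the reported statistic. The sketch below is a minimal base R illustration, assuming two gender levels and four segments, which gives (2 - 1) × (4 - 1) = 3 degrees of freedom; the variable names are introduced here.

```r
# Reported Pearson chi-square statistic
chi_sq <- 12.734

# Degrees of freedom, assuming 2 gender levels and 4 segments
df <- (2 - 1) * (4 - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Decision at the 0.05 significance level
reject_null <- p_value < 0.05

print(p_value)
print(reject_null)
```

Under these assumptions the p-value is well below 0.05, so the null hypothesis of independence is rejected.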
ii. c.

Interpretation:
In the table, all data has been accounted for and is separated by gender. We can now look at the
relationships in our data. The highest percentage of males, 36.4%, is in the Suburb mix segment,
while the highest percentage of females, 31.2%, is in the Moving up segment. The lowest percentage
of females, 12.7%, is in Urban hip, while the lowest percentage of males, 14.7%, is in Moving up.
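
The percentages above are row percentages, so each gender's row sums to 100%. The sketch below is a minimal base R illustration of row versus column percentages; the counts and the segment labels "A" and "B" are hypothetical.

```r
# Hypothetical 2 x 2 table of counts (not the report's data)
counts <- matrix(
  c(10, 30,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"), Segment = c("A", "B"))
)

# Row percentages: how each gender is spread across the segments
row_pct <- 100 * prop.table(counts, margin = 1)

# Column percentages: the gender mix inside each segment
col_pct <- 100 * prop.table(counts, margin = 2)

print(round(row_pct, 1))
print(round(col_pct, 1))
```

Which margin is used changes the question being answered, so the table title or caption should state it.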
ii. d.

Interpretation:
The p-value only tells us whether the association is statistically significant; it does not show its
strength or direction. The Pearson chi-square test determines whether a crosstab's results are
statistically significant, that is, whether the study's variables are independent or associated.
Independence of the two variables is the null hypothesis; if it cannot be rejected, the association
in the crosstab is not statistically significant. The crosstabulation table complements the test
because it shows the strength and direction of the association in the cell counts and percentages.
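
Strength and direction can also be read from the test object itself. The sketch below is a minimal base R illustration on hypothetical counts: it compares observed with expected counts, inspects the standardized residuals, and computes Cramér's V, a commonly used effect size that the report itself does not use.

```r
# Hypothetical 2 x 2 table of counts (not the report's data)
counts <- matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"), Segment = c("A", "B"))
)

test <- chisq.test(counts, correct = FALSE)

# Counts expected under independence
expected <- test$expected

# Standardized residuals: the sign gives the direction of each cell's deviation
std_res <- test$stdres

# Cramér's V: chi-square scaled by sample size and table dimensions
cramers_v <- sqrt(unname(test$statistic) / (sum(counts) * (min(dim(counts)) - 1)))

print(expected)
print(round(std_res, 2))
print(cramers_v)
```

A positive residual marks a cell with more observations than independence would predict, and a negative one marks fewer.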
ii. e.
Interpretation:
In the table, all data has been accounted for and is separated by children category. We can now
look at the relationships in our data. Among customers with no children, the highest percentage,
66.1%, is in the Travelers segment, while the lowest, 9.1%, is in the Suburb mix segment. The
segment with the highest count of customers with one child is Suburb mix, while Travelers has none.
The same holds for two and more children: Suburb mix has the highest count, while Travelers has 0.
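
The three children categories used in this table could also be built with `cut()`, which returns an ordered factor in one step. The sketch below is a minimal base R illustration; the vector `kids` is hypothetical, and the break points assume that the number of children is a non-negative whole number.

```r
# Hypothetical numbers of children per customer
kids <- c(0, 1, 2, 5, 0)

# Intervals are (-Inf, 0], (0, 1] and (1, Inf), so 0, 1 and 2+ fall in separate bins
kids_category <- cut(
  kids,
  breaks = c(-Inf, 0, 1, Inf),
  labels = c("No children", "One child", "Two and more children"),
  ordered_result = TRUE
)

print(table(kids_category))
```

Because the result is an ordered factor, the categories keep their intended order in tables without numeric prefixes in the labels.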
R-Code
library(tidyverse)
library(janitor)
library(gt)
library(gmodels)

segment_df <- read.csv("C:\\Users\\corse\\OneDrive - Taylor's Education\\Desktop\\MAFinal\\segment.csv")

#1a
ggplot(segment_df, aes(x=Segment, y=income, fill=Segment)) +
geom_boxplot() +
stat_summary(geom="text", fun=quantile,
aes(label=sprintf("%1.1f", ..y..), color=Segment),
position=position_nudge(x=0.505), size=3.5) +
theme(legend.position="none")

#1b
ggplot(segment_df, aes(x = Segment, fill=subscribe)) +
geom_bar(position = position_dodge(width = 0.8), alpha=0.7) +
scale_fill_manual(name = "Subscription status", labels = c("Subscribed", "Unsubscribed"), values =
c("blue", "yellow")) +
stat_count(geom='text', color='black', aes(label=..count..), position=position_stack(vjust = 0.5))

#1c
segment_df <- segment_df %>%
drop_na(income) %>%
distinct(income, .keep_all = TRUE)

ggplot(segment_df, aes(x = income, fill = gender)) +
geom_histogram(col="black") +
stat_bin(geom='text', color='white', aes(label=..count..), position=position_stack(vjust = 0.5))

#2a
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}

segment_df$kids_norm<-normalize(segment_df$kids)

segment_summary <- segment_df %>%
group_by(Segment) %>%
# mean over all observations in each segment
summarise(across(c(age, income, kids_norm), ~ mean(.x, na.rm = TRUE),
.names="mean_{col}")) %>%
print()

#2b
chisq.test(segment_df$gender,segment_df$Segment,correct = FALSE)

#2c
segment_df %>%
tabyl(gender, Segment) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns("front")%>%
rename(Gender = gender) %>%
gt() %>%
tab_header(
title = "Relationship between gender and segment",
)

#2d
CrossTable(segment_df$gender, segment_df$Segment)

#2e
segment_df$kids_category <-
ifelse(segment_df$kids == 0, "1. No children",
ifelse(segment_df$kids == 1 , "2. One child",
ifelse(segment_df$kids >= 2, "3. Two and more children", NA)
)
)

# store the children category as an ordered factor
segment_df$kids_category <- as.ordered(segment_df$kids_category)

segment_df %>%
tabyl(kids_category, Segment) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns("front")%>%
rename("Children Category" = kids_category) %>%
gt() %>%
tab_header(
title = "Relationship between children and segment",
)

CrossTable(segment_df$kids_category, segment_df$Segment)
